[BUG] After full fine-tuning, the loss looks normal but inference produces garbled output #472

Z-MU-Z · 2024-09-23T02:41:19Z

Is there an existing issue / discussion for this?

I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

I have searched FAQ

Current Behavior

self.model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map='cuda', trust_remote_code=True).eval()

self.tokenizer = AutoTokenizer.from_pretrained(model_path,
                                               trust_remote_code=True)
self.tokenizer.padding_side = 'left'
self.tokenizer.pad_token_id = self.tokenizer.eod_id

self.prompt = '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nPicture 1: <img>{}</img>\nPlease find the object. The object description is as follows:<ref>{}</ref><|im_end|>\n<|im_start|>assistant\n'

token_result = self.tokenizer([prompt],
                              return_tensors='pt',
                              padding='longest')
input_ids = token_result.input_ids  # print(self.tokenizer.decode(input_ids[0]))

attention_mask = token_result.attention_mask
pred = self.model.generate(
    input_ids=input_ids.cuda(),
    attention_mask=attention_mask.cuda(),
    do_sample=False,
    num_beams=1,
    max_new_tokens=28,
    min_new_tokens=10,
    length_penalty=1,
    num_return_sequences=1,
    use_cache=True,
    pad_token_id=self.tokenizer.eod_id,
    eos_token_id=self.tokenizer.eod_id,
    # masks_ids = mask_token
)
answers = [
    self.tokenizer.decode(_[input_ids.size(1):].cpu(),
                          skip_special_tokens=True) for _ in pred
]

All of the model's predictions are garbled.

Expected Behavior

The model should produce normal predictions.

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

Z-MU-Z · 2024-09-23T02:51:36Z

I trained with finetune_ds.sh and did not change any other parameters.

Z-MU-Z · 2024-09-23T02:55:19Z

After training, the checkpoint directory contains the following files:

config.json
configuration_qwen.py
generation_config.json
model.safetensors
modeling_qwen.py
qwen_generation_utils.py
qwen.tiktoken
special_tokens_map.json
tokenization_qwen.py
tokenizer_config.json
trainer_state.json
training_args.bin
visual.py

whycantfindaname · 2024-09-27T03:39:55Z

How much GPU memory does full fine-tuning need? On an H20 we can only use a batch size of 2.

Z-MU-Z · 2024-09-27T03:55:53Z

@whycantfindaname I set per_device_train_batch_size to 1.

whycantfindaname · 2024-09-27T03:58:32Z

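We are using ZeRO stage 3, which should be the most memory-efficient option. So it looks like full fine-tuning is still quite expensive in terms of resources.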
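For a rough sense of why ZeRO stage 3 is the most memory-efficient stage, here is a back-of-the-envelope sketch based on the per-GPU model-state formulas from the ZeRO paper (mixed-precision Adam, about 16 bytes per parameter in total). The parameter count and GPU count in the example are placeholders, not figures from this thread, and activations and temporary buffers are ignored, so real usage will be higher.

```python
def zero_model_state_gib(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GiB) for model states under ZeRO.

    Assumes mixed-precision Adam: 2 bytes of fp16 weights, 2 bytes of fp16
    gradients and 12 bytes of fp32 optimizer states per parameter.
    Activations, buffers and fragmentation are NOT included.
    """
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:  # stage 1 partitions optimizer states
        optim /= num_gpus
    if stage >= 2:  # stage 2 also partitions gradients
        grads /= num_gpus
    if stage >= 3:  # stage 3 also partitions the weights
        weights /= num_gpus
    return num_params * (weights + grads + optim) / 2**30


# Hypothetical example: 10e9 parameters on 8 GPUs (placeholder values).
for stage in (0, 1, 2, 3):
    print(stage, round(zero_model_state_gib(10e9, 8, stage), 1))
```

The estimate only covers model states, which is one reason a small per-device batch size can still be the limit even with stage 3.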

Z-MU-Z · 2024-09-29T12:03:37Z

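This looks like a model-saving problem. In theory four safetensors files should be saved, but only one was actually saved at the end. The intermediate checkpoints, however, were saved correctly. I have not found the cause yet. Does anyone know how to fix this?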
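One way to compare the final output directory with an intermediate checkpoint is to list the weight files in each and look at their sizes. The sketch below uses only the standard library. It assumes the usual Hugging Face layout, where a sharded save produces files named like `model-00001-of-00004.safetensors` together with a `model.safetensors.index.json`; the function name and the paths in the usage note are hypothetical.

```python
import os


def summarize_weights(ckpt_dir: str) -> dict:
    """List the *.safetensors files in ckpt_dir with their sizes in bytes."""
    names = sorted(
        name for name in os.listdir(ckpt_dir) if name.endswith(".safetensors")
    )
    shards = {
        name: os.path.getsize(os.path.join(ckpt_dir, name)) for name in names
    }
    return {
        "shards": shards,
        # A sharded save is normally accompanied by this index file.
        "has_index": os.path.exists(
            os.path.join(ckpt_dir, "model.safetensors.index.json")
        ),
        "total_bytes": sum(shards.values()),
    }
```

Calling it on both directories, for example `summarize_weights("output/checkpoint-1000")` and `summarize_weights("output")`, should make a missing shard or an unexpectedly small total size obvious.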

Z-MU-Z · 2024-10-12T09:08:07Z

I found that this problem only occurs with ZeRO-2; with ZeRO-3 everything works. Specifically, when the model is saved at the end of training, the following is printed: Removed shared tensor {'transformer.h.27.mlp.w2.weight', 'transformer.h.3.mlp.w1.weight', 'transformer.h.13.mlp.w1.weight', 'transformer.h.18.attn.c_attn.bias', 'transformer.visual.attn_pool.pos_embed', 'transformer.h.5.ln_1.weight', ...
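To see which tensors actually ended up in the final `model.safetensors`, the file header can be read without loading any weights. This is a minimal sketch that relies on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header keyed by tensor name). The function names, and the idea of diffing against the tensor names of a correctly saved intermediate checkpoint, are illustrative and were not used in this thread.

```python
import json
import struct


def safetensors_tensor_names(path: str) -> set:
    """Return the tensor names stored in a .safetensors file."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional non-tensor entry in the header.
    header.pop("__metadata__", None)
    return set(header)


def missing_tensors(reference_names, saved_path: str) -> list:
    """Names present in reference_names but absent from the saved file."""
    return sorted(set(reference_names) - safetensors_tensor_names(saved_path))
```

If names such as those listed in the "Removed shared tensor" message show up in the result of `missing_tensors`, the final file is incomplete, which would be consistent with garbled inference output.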

It appears to be related to the transformers version. After I changed transformers==4.37.2 to transformers==4.32.0, everything worked normally.
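A small guard at the top of the training script can make this version dependency explicit. This is only a sketch: the thread reports 4.32.0 as working and 4.37.2 as failing with ZeRO-2, and says nothing about other versions, so the check below only distinguishes those two reported versions. The function names and status strings are made up for illustration.

```python
from importlib import metadata

KNOWN_GOOD = "4.32.0"  # reported as working in this thread
KNOWN_BAD = "4.37.2"   # reported as giving a broken final save with ZeRO-2


def classify_transformers_version(version: str) -> str:
    """Map a transformers version string to the status reported in the thread."""
    if version == KNOWN_GOOD:
        return "reported-good"
    if version == KNOWN_BAD:
        return "reported-bad"
    return "untested"


def installed_transformers_status() -> str:
    """Classify the transformers version installed in this environment."""
    try:
        return classify_transformers_version(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return "not-installed"
```

A script could log a warning when the status is anything other than "reported-good" before starting a long ZeRO-2 run.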

Z-MU-Z closed this as completed Sep 25, 2024

Z-MU-Z reopened this Sep 29, 2024
