
issue of mixed-precision training. #34

Open
Bilibilee opened this issue Oct 1, 2024 · 0 comments
In script/MLLMSD_7b.sh and script/SmartEdit_7b.sh, you specify --bf16 True, yet the corresponding DeepSpeed configuration in scripts/zero_mixed.json seems to be missing the "bf16": {"enabled": "auto"} entry. As a result, the --bf16 True flag does not appear to take effect. I would like to confirm whether this is a mistake or intentional.
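
For reference, this is the entry I would expect in scripts/zero_mixed.json; with "enabled": "auto", the HuggingFace Trainer fills in the value from the --bf16 flag. The rest of the file is omitted here, so this is only a sketch of the missing section, not the full config:

```json
{
  "bf16": {
    "enabled": "auto"
  }
}
```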

Additionally, when training the MLLMSD 7b model, the logs report the following dtypes, which shows that some parts of the model are set to float32 and others to bfloat16 (a short inspection sketch follows the list):

1. model.vision_tower.dtype: torch.float32
2. model.mm_projector.dtype: torch.float32
3.1. model.model.model(LLaMA).embed_tokens.dtype: torch.float32
3.2. model.model.model(LLaMA).dtype: torch.bfloat16
3.3. model.lm_head.dtype: torch.float32
4.1. model.sd_query_tokens.dtype: torch.float32
4.2. model.sd_qformer.dtype: torch.float32
5.1. model.vae.dtype: torch.bfloat16
5.2. model.unet.dtype: torch.float32
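
The report above came from a loop equivalent to the generic inspection helper below; this is not code from this repo, and the attribute paths in the log may not match the model's direct children exactly:

```python
import torch
from torch import nn

def report_dtypes(model: nn.Module) -> None:
    # Print the set of parameter dtypes for each direct submodule.
    # Submodule names (vision_tower, mm_projector, sd_qformer, vae,
    # unet, ...) are taken from the log above and are assumptions
    # about the actual model definition.
    for name, module in model.named_children():
        dtypes = sorted({str(p.dtype) for p in module.parameters()})
        print(f"{name}: {dtypes}")
```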

It is acceptable to me that the LLM uses torch.bfloat16. However, I am curious why the VAE, which has relatively few parameters, is also set to torch.bfloat16, while the much larger UNet stays in torch.float32. Is there a specific reason for this choice?
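
My current guess, sketched below with hypothetical vae/unet arguments and not confirmed by the repo, is that the VAE is frozen and only used to encode/decode latents, so reduced precision costs nothing in gradient quality, while the trainable UNet is kept in float32 for stable weight updates:

```python
import torch
from torch import nn

def cast_for_training(vae: nn.Module, unet: nn.Module) -> None:
    # Assumption about intent, not SmartEdit's confirmed behavior:
    # the frozen VAE carries no gradients, so bfloat16 saves memory
    # for free, while the trainable UNet stays in full precision.
    vae.requires_grad_(False)
    vae.to(dtype=torch.bfloat16)
    unet.to(dtype=torch.float32)
```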
