
Request for Missing Configuration File "zero3_offload.json" in llara_train.sh Script #5

Open
wlxing1901 opened this issue Aug 19, 2024 · 6 comments

Comments

@wlxing1901

Your work is excellent, and the open-source effort is truly commendable. I have recently been trying out your code, but I ran into an issue during training. The llara_train.sh script references a configuration file via --deepspeed ./scripts/zero3_offload.json, but I couldn't find this file in the specified directory. I tried using the corresponding file from LLaVA as a replacement, but it didn't work as expected. Could you please provide the correct configuration file?

Thank you very much for your assistance!

@LostXine
Owner

Hi @FORREST1901,

Thanks for your interest in our work. It appears these files were unintentionally omitted when the repository was first created. They are back now at this link, and they are the same as in the original LLaVA repo. Sorry for the confusion.
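For reference, a typical DeepSpeed ZeRO-3 offload config looks roughly like the sketch below (based on the standard DeepSpeed/Hugging Face template; the exact file linked above may differ in details):

```json
{
  "fp16": { "enabled": "auto" },
  "bf16": { "enabled": "auto" },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

The "auto" values are filled in from the Trainer arguments when DeepSpeed is launched through the Hugging Face integration.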

Best regards,

@wlxing1901
Author

Thank you very much for your prompt response. However, I noticed that using zero3_offload.json causes the training process to use the CPU, which leads to CPU-related errors. When I switched to using zero3.json instead, I encountered a torch.cuda.OutOfMemoryError: CUDA out of memory error. I'm using 6 RTX 4090 GPUs with 24GB each, and my system memory is around 300GB. I wanted to ask if you might know the potential reason for this issue.

By the way, I was initially able to run the D-inBC-text-multi-train-0d8k-front dataset without any issues, but after encountering the OutOfMemoryError, subsequent attempts have consistently resulted in the same error.

@LostXine reopened this Aug 20, 2024
@LostXine
Owner

Hi @FORREST1901,

As you mentioned, zero3_offload.json offloads the optimizer parameters to the CPU. When you use DeepSpeed, it compiles some necessary binaries the first time you launch it. Please take a closer look at the error log and you should be able to find some clues.
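If the failure happens during that first-time compilation, DeepSpeed's environment report (standard DeepSpeed tooling, not specific to this repo) can confirm whether the ops can be built on your machine:

```bash
# Summarize which DeepSpeed ops are installed/compatible and whether they can be JIT-compiled.
ds_report
```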

6x24GB of VRAM does not meet the requirements for training without offloading; according to our experiments, at least 8x24GB is required, and even then you need to reduce the batch size (see the sketch below).
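A hedged sketch of how that could look in llara_train.sh, assuming LLaVA-style Hugging Face Trainer arguments (the entry-point path and the specific values here are illustrative, not copied from the repo):

```bash
# Illustrative excerpt: lower the per-GPU batch size and compensate with
# gradient accumulation so the effective global batch size stays similar.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3_offload.json \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    ...
```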

Please also check whether any zombie processes are still holding the VRAM; the commands below can help.
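These are standard CUDA/Linux tools, nothing specific to this repo:

```bash
# Look for leftover training processes in the GPU process table.
nvidia-smi

# List every PID that still has an open handle on the GPUs.
fuser -v /dev/nvidia*

# Terminate a stale process (replace <PID> with the offending process ID).
kill -9 <PID>
```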

Thanks.

@wlxing1901
Author

Thank you again for your detailed response. I will try your suggestions and see how it goes. If I'm able to fix the problem, I'll be sure to share the solution here in this thread.

@wlxing1901
Author

And one more question: Have you tried using LoRA instead of full fine-tuning? If so, how does the performance compare between LoRA and full fine-tuning?

@LostXine
Owner

> And one more question: Have you tried using LoRA instead of full fine-tuning? If so, how does the performance compare between LoRA and full fine-tuning?

Not yet, but I agree it will be an interesting topic to try.
