
Request for Missing Configuration File "zero3_offload.json" in llara_train.sh Script #5

Open
wlxing1901 opened this issue Aug 19, 2024 · 6 comments

Comments

@wlxing1901

Your work is excellent, and the open-source effort is truly commendable. I have recently been trying out your code, but I ran into an issue during training. The llara_train.sh script references a configuration file via --deepspeed ./scripts/zero3_offload.json, but I couldn't find this file in the specified directory. I tried using the corresponding file from LLaVA as a replacement, but it didn't work as expected. Could you please provide the correct configuration file?

Thank you very much for your assistance!

@LostXine
Owner

Hi @FORREST1901,

Thanks for your interest in our work. It appears these files were unintentionally omitted when the repository was first created. They are back now at this link, and they are the same as in the original LLaVA repo. Sorry for the confusion.
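For reference, a typical DeepSpeed ZeRO-3 offload config looks roughly like the sketch below (based on the standard DeepSpeed/Hugging Face template; the exact file linked above may differ in details):

```json
{
  "fp16": { "enabled": "auto" },
  "bf16": { "enabled": "auto" },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

The "auto" values are filled in from the Trainer arguments when DeepSpeed is launched through the Hugging Face integration.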

Best regards,

@wlxing1901
Author

Thank you very much for your prompt response. However, I noticed that using zero3_offload.json causes the training process to use the CPU, which leads to CPU-related errors. When I switched to using zero3.json instead, I encountered a torch.cuda.OutOfMemoryError: CUDA out of memory error. I'm using 6 RTX 4090 GPUs with 24GB each, and my system memory is around 300GB. I wanted to ask if you might know the potential reason for this issue.

By the way, I was initially able to run the D-inBC-text-multi-train-0d8k-front dataset without any issues, but after encountering the OutOfMemoryError, subsequent attempts have consistently resulted in the same error.

@LostXine reopened this Aug 20, 2024
@LostXine
Owner

Hi @FORREST1901,

As you mentioned, zero3_offload.json offloads the optimizer parameters to the CPU. When you use DeepSpeed, it compiles some necessary binaries the first time you launch it. Please take a closer look at the error log and you should be able to find some clues.
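If the failure happens during that first-time compilation, DeepSpeed's environment report (standard DeepSpeed tooling, not specific to this repo) can confirm whether the ops can be built on your machine:

```bash
# Summarize which DeepSpeed ops are installed/compatible and whether they can be JIT-compiled.
ds_report
```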

6x24GB of VRAM does not meet the requirements for training without offloading; according to our experiments, at least 8x24GB is required, and even then you need to reduce the batch size (see the sketch below).
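A hedged sketch of how that could look in llara_train.sh, assuming LLaVA-style Hugging Face Trainer arguments (the entry-point path and the specific values here are illustrative, not copied from the repo):

```bash
# Illustrative excerpt: lower the per-GPU batch size and compensate with
# gradient accumulation so the effective global batch size stays similar.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3_offload.json \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    ...
```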

Please also check whether any zombie processes are still holding the VRAM; the commands below can help.
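These are standard CUDA/Linux tools, nothing specific to this repo:

```bash
# Look for leftover training processes in the GPU process table.
nvidia-smi

# List every PID that still has an open handle on the GPUs.
fuser -v /dev/nvidia*

# Terminate a stale process (replace <PID> with the offending process ID).
kill -9 <PID>
```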

Thanks.

@wlxing1901
Author

Thank you again for your detailed response. I will try your suggestions and see how it goes. If I'm able to fix the problem, I'll be sure to share the solution here in this thread.

@wlxing1901
Author

And one more question: Have you tried using LoRA instead of full fine-tuning? If so, how does the performance compare between LoRA and full fine-tuning?

@LostXine
Owner

> And one more question: Have you tried using LoRA instead of full fine-tuning? If so, how does the performance compare between LoRA and full fine-tuning?

Not yet, but I agree it will be an interesting topic to try.
