Deadlock and CUDA OOM Issues with finetune_docowl.sh Using DeepSpeed Zero 2 and Zero 3 #114

Open
cryingjin opened this issue Oct 8, 2024 · 0 comments

Hello. Thank you very much for sharing such great results. I really want to fine-tune and use this model.

As far as I understand, running finetune_docowl.sh with DeepSpeed ZeRO stage 3 or stage 3 with offload (ZeRO-3, ZeRO-Offload) deadlocks, and finetune_docowl_lora.sh appears to deadlock in the same way with both ZeRO-2 and ZeRO-3.

I have not been able to run finetune_docowl.sh with ZeRO-2 either, because it fails with a CUDA out-of-memory (OOM) error.

Am I understanding this correctly? If you have resolved any of these deadlock issues, could you share how?
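
For the ZeRO-2 OOM case, one commonly tried mitigation is to offload optimizer state to CPU and shrink the per-GPU micro-batch. The sketch below only illustrates what such a config could look like. It is not the config shipped with this repository, and the function name, the output file name `zero2_offload.json`, and the batch-size values are assumptions chosen for illustration. Whether it avoids the OOM for this model is untested.

```python
import json


def build_zero2_offload_config(micro_batch_size=1, grad_accum_steps=8):
    """Build a DeepSpeed config dict: ZeRO stage 2 with CPU optimizer offload.

    The batch-size defaults are placeholders (assumptions), not values
    taken from finetune_docowl.sh.
    """
    return {
        # Smaller micro-batches lower activation memory on each GPU.
        "train_micro_batch_size_per_gpu": micro_batch_size,
        # Accumulate gradients to keep the effective batch size up.
        "gradient_accumulation_steps": grad_accum_steps,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 2,
            # Keep optimizer states in host memory instead of GPU memory.
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
    }


if __name__ == "__main__":
    # Hypothetical output path; point the training script's DeepSpeed
    # config argument at whatever file is actually written.
    with open("zero2_offload.json", "w") as f:
        json.dump(build_zero2_offload_config(), f, indent=2)
```

The batch-size and precision values in the JSON need to agree with the arguments the training script passes, otherwise DeepSpeed may reject the configuration at startup.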
