Please check that this issue hasn't been reported before.
I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
When launching training with zero3, zero3 init should load the model straight into RAM instead of VRAM. There is likely a regression in the codebase, as I can run the same training config on a previous version of the repository (commit: 6e354682e3c1735d3f7fb9e362280c38e922260f). That older commit loads the model into RAM right from the start and trains as expected with the zero3 settings.
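A minimal way to observe the difference (an illustrative snippet, not axolotl code) is to print per-GPU allocations while the checkpoint shards are being loaded:

```python
# Illustrative only: report per-GPU memory while the checkpoint shards load.
# On the old commit VRAM stays near zero during loading; on current main it
# climbs immediately and runs out on the 4x3090s.
import torch

def report_vram() -> None:
    for i in range(torch.cuda.device_count()):
        allocated_gib = torch.cuda.memory_allocated(i) / 1024**3
        print(f"cuda:{i}: {allocated_gib:.2f} GiB allocated")
```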
Current behaviour
With current main, the training process dies while loading the checkpoint shards because it runs out of VRAM with the zero3 config.
Steps to reproduce
Using this YAML on a machine with 4x RTX 3090 and 500 GB of RAM on RunPod.
Config yaml
Possible solution
I see some recent commits that change the Hugging Face version and how Accelerate gets initialized. I think the flag that detects whether zero3 is in use isn't working during model loading, so the model isn't loaded into RAM first before being handed off to the Hugging Face trainer.
Which Operating Systems are you using?
Linux
macOS
Windows
Python Version
3.10
axolotl branch-commit
main/67f744dc8c9564ef7a42d5df780ae53e319dca61
Acknowledgements
My issue title is concise, descriptive, and in title casing.
I have searched the existing issues to make sure this bug has not been reported yet.
I am using the latest version of axolotl.
I have provided enough information for the maintainers to reproduce and diagnose the issue.
As for what I suspect is going wrong: I think the skip_move_to_device variable isn't being set, since the logs and the VRAM usage show the latest commit loading the checkpoint shards straight onto the GPUs. I think the zero3 flag isn't being set appropriately for that helper function to operate. That's my hunch, but I haven't had time to check whether it's true.
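For illustration only, this is the kind of guard I would expect around device placement; it's a hypothetical sketch rather than the actual axolotl code path, and only skip_move_to_device is a name taken from the codebase:

```python
# Hypothetical sketch, not the real axolotl loading code.
import torch
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

def maybe_move_to_device(model: torch.nn.Module) -> torch.nn.Module:
    # Under a zero3 deepspeed config this should come back True, so the model stays
    # in RAM and DeepSpeed handles partitioning later instead of an eager move.
    skip_move_to_device = is_deepspeed_zero3_enabled()
    if not skip_move_to_device:
        model = model.to("cuda")  # with zero3, this eager move is what would exhaust VRAM
    return model
```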
This newer commit removes the initialization of the Accelerator() object, which I think used to help with setting up zero init. I tried adding it back, but it caused errors, as pointed out in the same MR.
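For reference, the idea I tried to re-add looks roughly like this (a hypothetical sketch based on my assumptions about how it worked, not the exact code from the old commit):

```python
# Hypothetical sketch of the early Accelerator() init that the newer commit removed.
# My (possibly wrong) understanding: constructing the Accelerator while running under
# `accelerate launch` with DeepSpeed registers the deepspeed/zero3 state early enough
# that the later from_pretrained call doesn't materialise full shards in VRAM.
import os
from accelerate import Accelerator

if os.environ.get("ACCELERATE_USE_DEEPSPEED", "false").lower() == "true":
    _accelerator = Accelerator()  # re-adding this caused the errors mentioned in the MR
```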