Deepspeed zero3 training is loading models to GPUs (on init) instead of RAM #1983

Open
RameshArvind opened this issue Oct 19, 2024 · 5 comments · May be fixed by #1994
Labels
bug Something isn't working


RameshArvind commented Oct 19, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

When launching training with ZeRO-3 and zero3 init, the model should be loaded straight into RAM instead of VRAM. There is probably a regression in the codebase, since I can run the same training config on a previous version of the repository (commit 6e354682e3c1735d3f7fb9e362280c38e922260f). That older commit loads the model into RAM right from the start and trains as expected with the ZeRO-3 settings.
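For context, a minimal sketch of the transformers-side mechanism this relies on; this is not axolotl's actual code path, and the inline dict is a stand-in for the deepspeed_configs/zero3_bf16_cpuoffload_all.json used below. When an HfDeepSpeedConfig for a stage-3 config is alive before from_pretrained, transformers constructs the model under deepspeed.zero.Init, so full weights are never materialized on each GPU at load time.

```python
# Sketch of the transformers-level ZeRO-3 loading behaviour, not axolotl's code.
# Requires deepspeed installed and is meant to run under a distributed launcher.
from transformers import AutoModelForCausalLM
from transformers.integrations.deepspeed import HfDeepSpeedConfig, is_deepspeed_zero3_enabled

# Minimal illustrative stand-in for deepspeed_configs/zero3_bf16_cpuoffload_all.json.
ds_config = {
    "zero_optimization": {"stage": 3, "offload_param": {"device": "cpu"}},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
}

dschf = HfDeepSpeedConfig(ds_config)  # must stay referenced while the model loads
assert is_deepspeed_zero3_enabled()   # the flag the loading code keys off

# With the flag set, from_pretrained() partitions/defers weights instead of
# pulling the full 14B checkpoint onto every GPU.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
```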

Current behaviour

On current main, the training process dies while loading checkpoint shards because it runs out of VRAM, even with the ZeRO-3 config.

Steps to reproduce

Using the YAML below on a RunPod machine with 4x RTX 3090 GPUs and 500 GB of RAM.

Config yaml

```yaml
base_model: Qwen/Qwen2.5-14B-Instruct

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/out

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_all.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>
```
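For reference, the deepspeed_configs/zero3_bf16_cpuoffload_all.json referenced above ships with axolotl; the sketch below shows the kind of settings such a file typically contains (stage-3 partitioning with parameters and optimizer state offloaded to CPU RAM), not a verbatim copy of the file.

```python
# Typical shape of a ZeRO-3 + bf16 + full-CPU-offload DeepSpeed config; the
# exact values in axolotl's JSON may differ.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",  # "auto" values are resolved by the HF trainer
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}
```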

Possible solution

I see some commits around changes to the Hugging Face library versions and how Accelerate gets initialized. I think the flag that detects whether ZeRO-3 is in use is somehow not working while the model is loaded, so the model is not loaded into RAM first before the Hugging Face trainer takes over offloading.
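To illustrate the kind of guard being described, here is a hypothetical helper; it is not axolotl's actual code, and only is_deepspeed_zero3_enabled() is a real transformers API. The point is that under ZeRO-3 the early .to(device) must be skipped, otherwise every rank pulls the full checkpoint onto its GPU.

```python
# Hypothetical illustration of the suspected guard; not axolotl's implementation.
import torch
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled


def maybe_move_to_device(model: torch.nn.Module, device: torch.device) -> torch.nn.Module:
    # Under ZeRO-3, DeepSpeed owns parameter placement, so the early .to(device)
    # should be skipped; if this flag reads False at load time, every rank moves
    # the full model onto its GPU, which matches the OOM described above.
    if not is_deepspeed_zero3_enabled():
        model = model.to(device)
    return model
```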

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main/67f744dc8c9564ef7a42d5df780ae53e319dca61

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
RameshArvind added the bug label on Oct 19, 2024
@NanoCode012
Collaborator

Hey @RameshArvind, could you point to the commits that you suspect?

To clarify, commit 6e354682e3c1735d3f7fb9e362280c38e922260f is okay?

@RameshArvind
Author

Yes, 6e354682e3c1735d3f7fb9e362280c38e922260f works fine.

As for what I suspect is going wrong: I think the skip_move_to_device variable isn't being set. From the logs and from watching VRAM, the latest commit starts loading the checkpoint shards straight onto the GPUs. My hunch is that the ZeRO-3 flag isn't being set appropriately for that helper function to operate, but I haven't had time to check whether this is true.

This newer commit removes the initialization of the Accelerator() object, which I think used to help with setting up zero init. I tried adding it back, but it caused errors, as pointed out in the same MR.
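A minimal diagnostic sketch of why constructing Accelerator() before model load can matter: with a ZeRO-3 DeepSpeed plugin picked up from the launch environment, accelerate typically registers an HfDeepSpeedConfig with transformers, which is what is_deepspeed_zero3_enabled() and from_pretrained() consult. The exact wiring varies across accelerate/transformers versions, so treat this as a check rather than a fix.

```python
# Run under `accelerate launch` with a ZeRO-3 DeepSpeed config; prints whether
# transformers can see the ZeRO-3 setup at this point in the process.
from accelerate import Accelerator
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

accelerator = Accelerator()  # picks up the DeepSpeed plugin from the launch environment
print("ZeRO-3 visible to transformers:", is_deepspeed_zero3_enabled())
```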

@chiwanpark
Contributor

I have the same problem. The latest commit that worked fine is ec4272c.

@muellerzr

What version of transformers are we running off of here?

@chiwanpark
Contributor

I'm using transformers 4.45.2.

winglian linked a pull request (#1994) on Oct 24, 2024 that will close this issue