Deepspeed zero3 training is loading models to GPUs (on init) instead of RAM #1983

Open
RameshArvind opened this issue Oct 19, 2024 · 5 comments · May be fixed by #1994
Labels
bug Something isn't working


RameshArvind commented Oct 19, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

When launching training with ZeRO-3 and zero3 init, the model should be loaded straight into RAM instead of VRAM. There is probably a regression in the codebase, since I can run the same training config on a previous version of the repository (commit 6e354682e3c1735d3f7fb9e362280c38e922260f). That older commit loads the model into RAM right from the start and trains as expected with the ZeRO-3 settings.
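For context, a minimal sketch of the transformers-side mechanism this relies on; this is not axolotl's actual code path, and the inline dict is a stand-in for the deepspeed_configs/zero3_bf16_cpuoffload_all.json used below. When an HfDeepSpeedConfig for a stage-3 config is alive before from_pretrained, transformers constructs the model under deepspeed.zero.Init, so full weights are never materialized on each GPU at load time.

```python
# Sketch of the transformers-level ZeRO-3 loading behaviour, not axolotl's code.
# Requires deepspeed installed and is meant to run under a distributed launcher.
from transformers import AutoModelForCausalLM
from transformers.integrations.deepspeed import HfDeepSpeedConfig, is_deepspeed_zero3_enabled

# Minimal illustrative stand-in for deepspeed_configs/zero3_bf16_cpuoffload_all.json.
ds_config = {
    "zero_optimization": {"stage": 3, "offload_param": {"device": "cpu"}},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
}

dschf = HfDeepSpeedConfig(ds_config)  # must stay referenced while the model loads
assert is_deepspeed_zero3_enabled()   # the flag the loading code keys off

# With the flag set, from_pretrained() partitions/defers weights instead of
# pulling the full 14B checkpoint onto every GPU.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
```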

Current behaviour

On current main, the training process dies while loading checkpoint shards because it runs out of VRAM, even with the ZeRO-3 config.

Steps to reproduce

Using the YAML below on a RunPod machine with 4x RTX 3090 GPUs and 500 GB of RAM.

Config yaml

```yaml
base_model: Qwen/Qwen2.5-14B-Instruct

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/out

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_all.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>
```
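For reference, the deepspeed_configs/zero3_bf16_cpuoffload_all.json referenced above ships with axolotl; the sketch below shows the kind of settings such a file typically contains (stage-3 partitioning with parameters and optimizer state offloaded to CPU RAM), not a verbatim copy of the file.

```python
# Typical shape of a ZeRO-3 + bf16 + full-CPU-offload DeepSpeed config; the
# exact values in axolotl's JSON may differ.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",  # "auto" values are resolved by the HF trainer
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}
```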

Possible solution

I see some commits around changes to the Hugging Face library versions and how Accelerate gets initialized. I think the flag that detects whether ZeRO-3 is in use is somehow not working while the model is loaded, so the model is not loaded into RAM first before the Hugging Face trainer takes over offloading.
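To illustrate the kind of guard being described, here is a hypothetical helper; it is not axolotl's actual code, and only is_deepspeed_zero3_enabled() is a real transformers API. The point is that under ZeRO-3 the early .to(device) must be skipped, otherwise every rank pulls the full checkpoint onto its GPU.

```python
# Hypothetical illustration of the suspected guard; not axolotl's implementation.
import torch
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled


def maybe_move_to_device(model: torch.nn.Module, device: torch.device) -> torch.nn.Module:
    # Under ZeRO-3, DeepSpeed owns parameter placement, so the early .to(device)
    # should be skipped; if this flag reads False at load time, every rank moves
    # the full model onto its GPU, which matches the OOM described above.
    if not is_deepspeed_zero3_enabled():
        model = model.to(device)
    return model
```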

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main/67f744dc8c9564ef7a42d5df780ae53e319dca61

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
RameshArvind added the bug label on Oct 19, 2024
@NanoCode012
Collaborator

Hey @RameshArvind, could you point to the commits that you suspect?

To clarify, commit 6e354682e3c1735d3f7fb9e362280c38e922260f is okay?

@RameshArvind
Author

Yes, 6e354682e3c1735d3f7fb9e362280c38e922260f works fine.

As for what I suspect is going wrong: I think the skip_move_to_device variable isn't being set. From the logs and from watching VRAM, the latest commit starts loading the checkpoint shards straight onto the GPUs. My hunch is that the ZeRO-3 flag isn't being set appropriately for that helper function to operate, but I haven't had time to check whether this is true.

This newer commit removes the initialization of the Accelerator() object, which I think used to help with setting up zero init. I tried adding it back, but it caused errors, as pointed out in the same MR.
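A minimal diagnostic sketch of why constructing Accelerator() before model load can matter: with a ZeRO-3 DeepSpeed plugin picked up from the launch environment, accelerate typically registers an HfDeepSpeedConfig with transformers, which is what is_deepspeed_zero3_enabled() and from_pretrained() consult. The exact wiring varies across accelerate/transformers versions, so treat this as a check rather than a fix.

```python
# Run under `accelerate launch` with a ZeRO-3 DeepSpeed config; prints whether
# transformers can see the ZeRO-3 setup at this point in the process.
from accelerate import Accelerator
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

accelerator = Accelerator()  # picks up the DeepSpeed plugin from the launch environment
print("ZeRO-3 visible to transformers:", is_deepspeed_zero3_enabled())
```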

@chiwanpark
Contributor

I have the same problem. The latest commit that worked fine is ec4272c.

@muellerzr

What version of transformers are we running off of here?

@chiwanpark
Contributor

I'm using transformers 4.45.2.

winglian linked a pull request (#1994) on Oct 24, 2024 that will close this issue