
Out of Memory Error during execution of bash scripts/TrainStage1_7b.sh #40

Open
dohee01 opened this issue Oct 24, 2024 · 0 comments

1. While executing the script bash scripts/TrainStage1_7b.sh, I encountered an Out of Memory (OOM) error. The error message is as follows:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacity of 23.59 GiB of which 104.75 MiB is free. Including non-PyTorch memory, this process has 23.42 GiB memory in use. Of the allocated memory, 23.17 GiB is allocated by PyTorch, and 2.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large, try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

The error seems to occur at the following point in the code:

trainer = AlignLLMwithSDCLIPTrainer(model=model, tokenizer=llm_tokenizer, args=training_args, **data_module)

System Info:

PyTorch version: 2.1.0
Transformers version: 4.28.1
GPUs: 2x NVIDIA RTX 3090
RAM: 128GB
CUDA version: 11.8

Given these specs, I’m wondering if it’s feasible to train the model without encountering OOM errors, and if there are any suggestions for resolving the memory issues.
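One low-effort knob comes from the quoted error message itself: the PYTORCH_CUDA_ALLOC_CONF allocator hint. This is only a sketch for reducing fragmentation (the value 128 is a guess to be tuned), not a fix for a model that genuinely exceeds 24 GiB per GPU:

```shell
# Cap the allocator's split block size to reduce fragmentation, as the
# OOM message suggests. 128 MiB is an assumed starting point, not a
# recommended value from the repo.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# then re-run the training script:
# bash scripts/TrainStage1_7b.sh
```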

2. Attempt with DeepSpeed:

To mitigate the OOM issue, I tried using DeepSpeed, but ran into compatibility issues. I am using DeepSpeed 0.15.3 (version 0.7.3 did not work because it still imports the removed torch._six module). The following is the config file I used for DeepSpeed:

{
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 2e-4
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        }
    },
    "fp16": {
        "enabled": true
    }
}
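One caveat on the config above: in DeepSpeed, the offload_param block is only honored under ZeRO stage 3; with stage 2 only optimizer state can be offloaded and parameters stay on GPU. A hedged stage-3 variant of the same config (same values, only the stage changed) might look like:

```json
{
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {
        "type": "AdamW",
        "params": { "lr": 2e-4 }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": { "device": "cpu", "pin_memory": true },
        "offload_param": { "device": "cpu", "pin_memory": true }
    },
    "fp16": { "enabled": true }
}
```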

I also modified the TrainStage1.py file as follows to include DeepSpeed:

def train():
    global local_rank
    # Add DeepSpeed config file path
    deepspeed_config_path = "../deepspeed_config.json"
    
    parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    # Add DeepSpeed config path to TrainingArguments
    training_args.deepspeed = deepspeed_config_path
    
    local_rank = training_args.local_rank
    ...

However, I encountered the following error message when trying to run the modified code:

AttributeError: 'TrainingArguments' object has no attribute 'hf_deepspeed_config'

Even though I explicitly added the path to the DeepSpeed config file, this error persists. Could you provide any guidance on how to resolve the OOM issue using DeepSpeed, or suggest which version of DeepSpeed is compatible with Transformers 4.28.1? Any advice would be greatly appreciated.
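For what it's worth, the AttributeError is consistent with how dataclass __post_init__ hooks behave: transformers builds the internal hf_deepspeed_config inside TrainingArguments.__post_init__, which only runs at construction time (i.e. during parse_args_into_dataclasses). Assigning training_args.deepspeed afterwards never re-runs that hook, so the derived attribute is missing. The minimal stand-in below (FakeTrainingArguments is hypothetical, not the real class) demonstrates the timing issue; the practical workaround is to pass --deepspeed deepspeed_config.json on the command line, or inject it into sys.argv before parsing, so the path is set at construction time.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for transformers.TrainingArguments: an attribute
# derived in __post_init__ exists only if its trigger field was set at
# construction time.
@dataclass
class FakeTrainingArguments:
    deepspeed: Optional[str] = None

    def __post_init__(self):
        if self.deepspeed:
            # the real class builds an HfTrainerDeepSpeedConfig here
            self.hf_deepspeed_config = {"config_path": self.deepspeed}

# Setting the attribute after construction skips __post_init__ entirely:
late = FakeTrainingArguments()
late.deepspeed = "deepspeed_config.json"
print(hasattr(late, "hf_deepspeed_config"))   # False -> the AttributeError

# Setting it at construction time (what --deepspeed on the CLI does):
early = FakeTrainingArguments(deepspeed="deepspeed_config.json")
print(hasattr(early, "hf_deepspeed_config"))  # True
```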
