OOM for training llama #1900

Open · dkapur17 opened this issue Jan 7, 2025 · 5 comments
Labels: question (Further information is requested)

Comments


dkapur17 commented Jan 7, 2025

I'm trying to use the llama-3.2-1B model with the Python API on a machine with 4 Tesla V100s (4 × 16 GB), but the process keeps failing due to OOM. Watching nvidia-smi, I see memory usage shoot up to 16 GB on each GPU before the process dies. From my understanding, the 1B model should fit in much less VRAM, so maybe I'm doing something incorrectly. Here is my code:

import os

import lightning as L
import torch
from litgpt.api import LLM
from litgpt.data import Alpaca2k


class LitLLM(L.LightningModule):
    def __init__(self, tokenizer_dir=None, trainer_ckpt_path=None):
        super().__init__()
 
        self.llm = LLM.load("meta-llama/Llama-3.2-1B", distribute=None, access_token=os.getenv("HF_TOKEN"))
        self.trainer_ckpt_path = trainer_ckpt_path

    def setup(self, stage):
        self.llm.trainer_setup(trainer_ckpt=self.trainer_ckpt_path)
        
    def training_step(self, batch):
        logits, loss = self.llm(input_ids=batch["input_ids"], target_ids=batch["labels"])
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def validation_step(self, batch):
        logits, loss = self.llm(input_ids=batch["input_ids"], target_ids=batch["labels"])
        self.log("validation_loss", loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        warmup_steps = 10
        optimizer = torch.optim.AdamW(self.llm.model.parameters(), lr=0.0002, weight_decay=0.0, betas=(0.9, 0.95))
        scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: step / warmup_steps)
        return [optimizer], [scheduler]


batch_size = 2
accumulate_grad_batches = 1

lit_model = LitLLM()
data = Alpaca2k()

data.connect(lit_model.llm.tokenizer, batch_size=batch_size, max_seq_length=512)

trainer = L.Trainer(
    devices=4,
    accelerator="cuda",
    max_epochs=1,
    accumulate_grad_batches=accumulate_grad_batches,
    precision="bf16-true",
)
trainer.fit(lit_model, data)

The process dies before even the first training step. I also tried a few approaches with quantization, e.g. passing quantize (and other params) to self.llm.distribute in the setup method, but none of them seem to work. Any ideas on what I might be doing wrong? Thanks.
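Roughly, the quantization attempt looked like the sketch below; the exact llm.distribute() arguments shown here are an assumption about litgpt's Python API and may not match the call actually used.

# A variant of the setup() method above with quantization (sketch only;
# the llm.distribute() keyword arguments are assumed, not verified):
def setup(self, stage):
    self.llm.distribute(
        accelerator="cuda",
        precision="bf16-true",
        quantize="bnb.nf4",  # bitsandbytes 4-bit (NF4) quantization
    )
    self.llm.trainer_setup(trainer_ckpt=self.trainer_ckpt_path)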

dkapur17 added the question (Further information is requested) label on Jan 7, 2025
rasbt (Collaborator) commented Jan 7, 2025

Thanks for the feedback. It does work on 4 x L4s, which have 24 GB each. I can see that the usage is around 22-24 GB. Other than trying a smaller batch size or block size, or perhaps a different multi-GPU strategy, I am not sure how this can be improved.
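A minimal sketch of the smaller batch size / shorter sequence length idea, reusing the code from above (the specific values are only illustrative):

# Smaller per-device batches and shorter sequences reduce activation memory;
# gradient accumulation keeps the effective batch size comparable.
batch_size = 1
accumulate_grad_batches = 8  # illustrative value

data = Alpaca2k()
data.connect(lit_model.llm.tokenizer, batch_size=batch_size, max_seq_length=256)

trainer = L.Trainer(
    devices=4,
    accelerator="cuda",
    max_epochs=1,
    accumulate_grad_batches=accumulate_grad_batches,
    precision="bf16-true",
)
trainer.fit(lit_model, data)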

dkapur17 (Author) commented Jan 8, 2025

@rasbt thanks for the quick reply. So is it taking 22 GB in total across the GPUs or on each GPU? I would think a sequential load strategy could help split the model across the GPUs, and 64 GB total should be enough for it, but when I use distribute it looks like it conflicts with the trainer. What would be the right way to distribute the model across the GPUs and then train it with the trainer? Also, any inputs on quantizing the model?

rasbt (Collaborator) commented Jan 8, 2025

It was on each GPU. I think it uses substantially less RAM than 22 x 4 GB in total, though; it might work just fine on a single GPU with 40 GB, but I haven't tried. You could also consider an FSDP strategy with cpu_offload=True to reduce GPU RAM usage, but training will then take a bit longer. Alternatively, the first thing I'd try in your case is to set the batch_size to 1 and then increase the gradient accumulation steps.
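A minimal sketch of the FSDP suggestion, assuming Lightning's FSDPStrategy; the offload flag and accumulation value are illustrative and not tested with the litgpt wrapper in this setup:

from lightning.pytorch.strategies import FSDPStrategy

trainer = L.Trainer(
    devices=4,
    accelerator="cuda",
    # Offloads parameters and gradients to CPU between uses: lower VRAM, slower steps.
    strategy=FSDPStrategy(cpu_offload=True),
    max_epochs=1,
    accumulate_grad_batches=8,  # pair with batch_size=1 in data.connect()
    precision="bf16-true",
)
trainer.fit(lit_model, data)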

dkapur17 (Author) commented Jan 8, 2025

Interestingly, using the CLI tool I'm even able to finetune Llama 3.1 8B with no quantization across the 4 GPUs, although I suspect that's thanks to LoRA. I'll need to check whether it works with the Python API as well.

rasbt (Collaborator) commented Jan 8, 2025

Ah yes, litgpt finetune ... uses LoRA by default. For full finetuning, it's litgpt finetune_full ...
