
OOM error when inferencing 128k context on 4 GPUs #25

Open
HoBeedzc opened this issue Aug 7, 2024 · 4 comments

HoBeedzc commented Aug 7, 2024

Hello,

Thank you for providing DCA to scale model context up to 100K+ tokens. However, I encountered an issue when trying to run inference with a 128k context on 4 GPUs. The program hits an out-of-memory (OOM) error because the entire 128k context ends up on a single GPU, overflowing its memory (even with 96GB of VRAM).

My code is based on run_chunkllama_100k.py, with the only modification being that I replaced the original prompt with my own 128k prompt. Do you have any suggestions for resolving this issue?

Environment:

  • Number of GPUs: 4
  • VRAM per GPU: 96GB
  • Base script: run_chunkllama_100k.py
  • Base model: Qwen2-72b-Instruct

Steps to reproduce:

  1. Use the run_chunkllama_100k.py script as a base
  2. Replace the original prompt with a 128k context prompt (see the sketch after this list)
  3. Run the script on 4 GPUs
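
For step 2, a hypothetical sketch of how the prompt length can be confirmed before running the full script (the file name and question below are placeholders, not taken from run_chunkllama_100k.py):

```python
# Hypothetical check of the prompt length (placeholder file name / question).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct", trust_remote_code=True)

with open("my_128k_context.txt") as f:       # placeholder for the long document
    context = f.read()
message = context + "\n\nQuestion: ..."      # placeholder question

n_tokens = len(tokenizer(message)["input_ids"])
print(f"prompt length: {n_tokens} tokens")   # expect roughly 128K
```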

Expected behavior:
The 128k context should be distributed across the 4 GPUs for successful inference.

Actual behavior:
The entire 128k context is loaded onto a single GPU, causing an OOM error.

Any help or guidance would be greatly appreciated.

Thank you,
Hobee


ChenxinAn-fdu commented Aug 8, 2024

Hi!
Please make sure you are using device_map="auto" to distribute the model weights across the 4 GPUs. Qwen2-72B takes about 144/4 = 36GB of memory on each GPU.
Full code:
model = LlamaForCausalLM.from_pretrained(model_path, attn_implementation="flash_attention_2", device_map="auto", trust_remote_code=True, torch_dtype=torch.bfloat16)
You can add an input() call to pause the script and check (e.g. with nvidia-smi) whether the model has been loaded onto all 4 GPUs.
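
For illustration, a minimal sketch of that loading-and-checking step. AutoModelForCausalLM is substituted here so the Qwen2 checkpoint loads cleanly, the DCA patch from the repo is omitted, and the model path is a placeholder:

```python
# Minimal sketch: load with device_map="auto" and verify the weights are sharded.
# (The ChunkLlama/DCA patch is omitted; this only illustrates device placement.)
import torch
from transformers import AutoModelForCausalLM

model_path = "Qwen/Qwen2-72B-Instruct"   # placeholder
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2",
    device_map="auto",                   # shard layers across all visible GPUs
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
print(model.hf_device_map)               # mapping of modules -> GPU index
input("Check nvidia-smi: each of the 4 GPUs should hold roughly 36GB. Press Enter to continue...")
```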


HoBeedzc commented Aug 8, 2024

I've discovered that while the model weights are correctly distributed across the 4 GPUs, the issue occurs when using:

inputs = tokenizer(message, truncation=True, return_tensors="pt").to(model.device)

In this case, both input_ids and attention_mask are placed on a single GPU (cuda:0), and running the forward pass with the 128K context then requires approximately 200GB of VRAM, far exceeding the capacity of a single GPU and causing the OOM error.
Is there a way to distribute these input tensors (and the resulting memory load) across multiple GPUs?
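
For concreteness, a rough sketch of the placement (model and tokenizer as loaded above; the prints are only illustrative):

```python
# The tokenized inputs themselves are tiny (128K int64 ids is on the order of 1 MB);
# the large allocation comes from the forward-pass activations / KV cache.
inputs = tokenizer(message, truncation=True, return_tensors="pt")
print(inputs["input_ids"].shape, inputs["input_ids"].dtype)

# With device_map="auto", model.device reports the device of the first shard
# (usually cuda:0); the inputs only need to sit where the embedding layer is.
inputs = inputs.to(model.device)
print({k: v.device for k, v in inputs.items()})
```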

Thank you,
Hobee

ChenxinAn-fdu commented Aug 8, 2024

The model is loaded with pipeline parallelism, which means the layers are placed on different GPUs. If your model is loaded correctly, it means that 4 GPUs cannot support inference on a 128K context.
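
A quick, generic way to confirm where the layers (and the allocation that OOMs) actually sit; this is plain PyTorch, not something from the repo:

```python
# Generic check: report per-GPU memory after loading, and again just before
# the forward pass that OOMs, to see how the load is distributed.
import torch

for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 2**30
    reserved = torch.cuda.memory_reserved(i) / 2**30
    print(f"cuda:{i}: {allocated:.1f} GiB allocated, {reserved:.1f} GiB reserved")
```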


ChenxinAn-fdu commented Aug 8, 2024

Have you tried the flash decoding code?
