
OOM error when inferencing 128k context on 4 GPUs #25

Open
HoBeedzc opened this issue Aug 7, 2024 · 4 comments

HoBeedzc commented Aug 7, 2024

Hello,

Thank you for providing DCA to scale model context up to 100K+ tokens. However, I encountered an issue when trying to run inference with a 128k context on 4 GPUs. The program hits an out-of-memory (OOM) error because the entire 128k context ends up on a single GPU, overflowing its memory (even with 96GB of VRAM).

My code is based on run_chunkllama_100k.py, with the only modification being that I replaced the original prompt with my own 128k prompt. Do you have any suggestions for resolving this issue?

Environment:

  • Number of GPUs: 4
  • VRAM per GPU: 96GB
  • Base script: run_chunkllama_100k.py
  • Base model: Qwen2-72b-Instruct

Steps to reproduce:

  1. Use the run_chunkllama_100k.py script as a base
  2. Replace the original prompt with a 128k context prompt (see the sketch after this list)
  3. Run the script on 4 GPUs
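
For step 2, a hypothetical sketch of how the prompt length can be confirmed before running the full script (the file name and question below are placeholders, not taken from run_chunkllama_100k.py):

```python
# Hypothetical check of the prompt length (placeholder file name / question).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct", trust_remote_code=True)

with open("my_128k_context.txt") as f:       # placeholder for the long document
    context = f.read()
message = context + "\n\nQuestion: ..."      # placeholder question

n_tokens = len(tokenizer(message)["input_ids"])
print(f"prompt length: {n_tokens} tokens")   # expect roughly 128K
```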

Expected behavior:
The 128k context should be distributed across the 4 GPUs for successful inference.

Actual behavior:
The entire 128k context is loaded onto a single GPU, causing an OOM error.

Any help or guidance would be greatly appreciated.

Thank you,
Hobee


ChenxinAn-fdu commented Aug 8, 2024

Hi!
Please make sure you are using device_map="auto" to distribute the model weights across the 4 GPUs. Qwen2-72B takes about 144/4 = 36GB of memory on each GPU.
Full code:
model = LlamaForCausalLM.from_pretrained(model_path, attn_implementation="flash_attention_2", device_map="auto", trust_remote_code=True, torch_dtype=torch.bfloat16)
You can add an input() call to pause the script and check (e.g. with nvidia-smi) whether the model has been loaded onto all 4 GPUs.
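
For illustration, a minimal sketch of that loading-and-checking step. AutoModelForCausalLM is substituted here so the Qwen2 checkpoint loads cleanly, the DCA patch from the repo is omitted, and the model path is a placeholder:

```python
# Minimal sketch: load with device_map="auto" and verify the weights are sharded.
# (The ChunkLlama/DCA patch is omitted; this only illustrates device placement.)
import torch
from transformers import AutoModelForCausalLM

model_path = "Qwen/Qwen2-72B-Instruct"   # placeholder
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2",
    device_map="auto",                   # shard layers across all visible GPUs
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
print(model.hf_device_map)               # mapping of modules -> GPU index
input("Check nvidia-smi: each of the 4 GPUs should hold roughly 36GB. Press Enter to continue...")
```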


HoBeedzc commented Aug 8, 2024

I've discovered that while the model weights are correctly distributed across the 4 GPUs, the issue occurs when using:

inputs = tokenizer(message, truncation=True, return_tensors="pt").to(model.device)

In this case, both input_ids and attention_mask are placed on a single GPU (cuda:0), and running the forward pass with the 128K context then requires approximately 200GB of VRAM, far exceeding the capacity of a single GPU and causing the OOM error.
Is there a way to distribute these input tensors (and the resulting memory load) across multiple GPUs?
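
For concreteness, a rough sketch of the placement (model and tokenizer as loaded above; the prints are only illustrative):

```python
# The tokenized inputs themselves are tiny (128K int64 ids is on the order of 1 MB);
# the large allocation comes from the forward-pass activations / KV cache.
inputs = tokenizer(message, truncation=True, return_tensors="pt")
print(inputs["input_ids"].shape, inputs["input_ids"].dtype)

# With device_map="auto", model.device reports the device of the first shard
# (usually cuda:0); the inputs only need to sit where the embedding layer is.
inputs = inputs.to(model.device)
print({k: v.device for k, v in inputs.items()})
```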

Thank you,
Hobee

ChenxinAn-fdu commented Aug 8, 2024

The model is loaded with pipeline parallelism, which means the layers are placed on different GPUs. If your model is loaded correctly, it means that 4 GPUs cannot support inference on a 128K context.
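
A quick, generic way to confirm where the layers (and the allocation that OOMs) actually sit; this is plain PyTorch, not something from the repo:

```python
# Generic check: report per-GPU memory after loading, and again just before
# the forward pass that OOMs, to see how the load is distributed.
import torch

for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 2**30
    reserved = torch.cuda.memory_reserved(i) / 2**30
    print(f"cuda:{i}: {allocated:.1f} GiB allocated, {reserved:.1f} GiB reserved")
```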


ChenxinAn-fdu commented Aug 8, 2024

Have you tried the flash decoding code?
