OOM error when inferencing 128k context on 4 GPUs #25
Comments
Hi!
I've discovered that while the model weights are correctly distributed across the 4 GPUs, the issue occurs when using: inputs = tokenizer(message, truncation=True, return_tensors="pt").to(model.device). In this case, both tensors returned by the tokenizer end up on a single GPU.
Thank you,
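For context, a minimal sketch of what that call does when the model is sharded across GPUs, assuming the script loads the checkpoint with device_map="auto" (the checkpoint name below is a placeholder; the issue does not say which Llama variant is used). With a sharded model, model.device reports only the device of the first shard, and hf_device_map shows where each layer group actually lives:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the issue does not state which Llama variant is used.
model_name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard the layers across the visible GPUs
)

# With a sharded model, model.device is the device holding the first parameters,
# so .to(model.device) places both input_ids and attention_mask on that one GPU.
print(model.device)
print(model.hf_device_map)  # per-module placement, e.g. {'model.embed_tokens': 0, ...}

inputs = tokenizer("short probe prompt", return_tensors="pt").to(model.device)
for name, tensor in inputs.items():
    print(name, tensor.device)  # both tensors land on the same single GPU
```

Note that the token-ID tensors themselves are tiny even at 128k tokens (roughly 1 MB of int64 IDs), so the overflow most likely comes from the activations and KV cache built during the forward pass rather than from the .to(model.device) call itself.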
The model is loaded with pipeline parallelism, which means its layers are placed on different GPUs. If your model is loaded correctly, it means 4 GPUs cannot support inference on a 128K context.
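To put rough numbers on that, here is an illustrative back-of-envelope estimate for a 128k-token sequence, assuming Llama-2-7B-like dimensions (32 layers, 32 heads, head dimension 128) in fp16. The exact model used here is not stated, so treat these figures as order-of-magnitude only:

```python
# Illustrative memory estimate at 128k tokens, assuming Llama-2-7B-like
# dimensions (32 layers, 32 heads, head_dim 128) and fp16 (2 bytes per value).
layers, heads, head_dim, bytes_per = 32, 32, 128, 2
seq_len = 128_000

# KV cache: K and V for every layer; with pipeline parallelism this is split
# across whichever GPUs hold the layers.
kv_cache = 2 * layers * heads * head_dim * bytes_per * seq_len
print(f"KV cache       ~ {kv_cache / 1024**3:.1f} GiB")          # ~62.5 GiB total

# Eager (non-flash) attention materializes a seq_len x seq_len score matrix
# per layer, and that tensor sits entirely on the one GPU computing that layer.
scores_per_layer = heads * seq_len * seq_len * bytes_per
print(f"scores / layer ~ {scores_per_layer / 1024**3:.0f} GiB")  # ~977 GiB
```

Even if the KV cache is spread over four cards, the per-layer score matrix alone dwarfs a single 96GB GPU unless attention is computed blockwise (flash attention / flash decoding), which is why the next suggestion matters.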
Have you tried the flash decoding code?
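For anyone landing here, a hedged sketch of one related option: recent transformers releases can load the model with FlashAttention-2, which computes attention blockwise instead of materializing the full score matrix. This is not the repository's flash-decoding code, and whether it composes with the DCA patch depends on that implementation, so treat it as a starting point only (the checkpoint name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

# Requires the flash-attn package and a recent transformers release.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",        # placeholder checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
```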
Hello,
Thank you for providing DCA to scale model context up to 100K+. However, I encountered an issue when trying to run inference with a 128k context on 4 GPUs. The program hits an Out of Memory (OOM) error because the entire 128k context is loaded onto a single GPU, causing memory overflow (even with 96GB of VRAM).
My code is based on run_chunkllama_100k.py, with the only modification being that I replaced the original prompt with my own 128k prompt. Do you have any suggestions for resolving this issue?

Environment:

Steps to reproduce:
Use the run_chunkllama_100k.py script as a base.

Expected behavior:
The 128k context should be distributed across the 4 GPUs for successful inference.
Actual behavior:
The entire 128k context is loaded onto a single GPU, causing an OOM error (see the per-GPU memory check below).
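To confirm which GPU actually overflows, here is a small diagnostic that can be dropped around the generation step in the script. It uses only standard torch.cuda memory counters; the commented call sites are assumptions about where the script runs generation:

```python
import torch

def report_gpu_memory(tag: str) -> None:
    # Print current and peak allocated memory for every visible GPU.
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 1024**3
        peak = torch.cuda.max_memory_allocated(i) / 1024**3
        print(f"[{tag}] cuda:{i} allocated={alloc:.1f} GiB, peak={peak:.1f} GiB")

# Assumed usage around the generation step:
# report_gpu_memory("before generate")
# output = model.generate(**inputs, max_new_tokens=128)
# report_gpu_memory("after generate")
```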
Any help or guidance would be greatly appreciated.
Thank you,
Hobee