Generalizable multi gpu to run e.g. Llama 65b #238
base: main
Conversation
model_devices=device_config,
verbose=is_verbose,
)
)
The device map can differ across workers if, e.g., GPUs 1-7 are all empty but GPU 0 happens to have someone else's 5 GB model on it.
Suppose you have 4 workers (2 GPUs each).
The first worker will end up with a different device map from the rest, since it takes that 5 GB of already-used memory into account.
The device map can sometimes heavily affect performance; in my experience an unlucky split can be something like 1.5x slower, because the model may get cut at a suboptimal spot, e.g. exactly where much larger tensors are passed between layers.
So ideally all the workers would use the same device map, which also makes it easier to estimate the time it will take to process the dataset; one way to keep them in sync is sketched below.
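A minimal sketch of one way to do that, assuming the workers can all-gather their per-GPU free memory (the helper names and the collective call are hypothetical; only `torch.cuda.mem_get_info` is a real API here):

```python
import torch


def free_bytes(devices: list[int]) -> list[int]:
    # torch.cuda.mem_get_info returns (free, total) bytes for a device
    return [torch.cuda.mem_get_info(d)[0] for d in devices]


def shared_max_memory(per_worker_free: list[list[int]], my_devices: list[int]) -> dict[int, int]:
    """Build a max_memory dict that every worker agrees on.

    per_worker_free[w][s] is the free memory on worker w's s-th GPU.
    Taking the minimum over workers per slot means the worker whose
    GPU 0 hosts someone else's 5 GB model constrains everyone, so all
    workers end up splitting the model at the same layer.
    """
    n_slots = len(my_devices)
    per_slot_min = [min(w[s] for w in per_worker_free) for s in range(n_slots)]
    return {dev: per_slot_min[s] for s, dev in enumerate(my_devices)}


# e.g. 4 workers with 2 GPUs each; on worker 0, my_devices = [0, 1]
# all_free = all_gather(free_bytes(my_devices))   # hypothetical collective
# max_memory = shared_max_memory(all_free, my_devices)
```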
}
if use_8bit
else max_memory_used_devices
)
TL;DR: we need to tell `infer_auto_device_map`
that we want to use {"cuda:0": 30 GB, "cuda:1": 40 GB} and NO OTHER GPUs, CPU, or disk; see the sketch below.
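Roughly, that means building the model with empty weights and handing `infer_auto_device_map` a `max_memory` dict that lists only the devices we want it to consider (the memory figures and checkpoint name below are just illustrative):

```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("huggyllama/llama-65b")  # example checkpoint
with init_empty_weights():
    meta_model = AutoModelForCausalLM.from_config(config)  # no weights materialized

# Only the devices listed in max_memory are considered when placing layers;
# keys are GPU indices, values are per-device memory budgets.
device_map = infer_auto_device_map(
    meta_model,
    max_memory={0: "30GiB", 1: "40GiB"},
    no_split_module_classes=["LlamaDecoderLayer"],  # keep each decoder block whole
)
```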
**kwargs,
) -> PreTrainedModel:
    """Instantiate a model string with the appropriate `Auto` class."""
    device = torch.device(device)
    kwargs["device_map"] = {"": device}
kwargs["device_map"] = {"": device} will be passed by the caller instead (because for e.g. when instantiating an empty model, we can't pass a device map. otherwise it'll really load the weights and won't be an empty model anymore
# If a torch_dtype was not specified, try to infer it.
kwargs["torch_dtype"] = torch_dtype or determine_dtypes(
    model_str=model_str, is_cpu=is_cpu, load_in_8bit=load_in_8bit
)
I made this change because previously it was setting `kwargs["torch_dtype"]` even when the caller of `instantiate_model` had already passed one, which confused me.
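For intuition, here's roughly what a helper like `determine_dtypes` might do; this is only a sketch of the idea, not the PR's actual logic:

```python
import torch
from transformers import AutoConfig


def determine_dtypes_sketch(model_str: str, is_cpu: bool, load_in_8bit: bool) -> torch.dtype:
    """Pick a sensible torch_dtype when the caller didn't specify one."""
    if load_in_8bit:
        # bitsandbytes keeps the non-quantized parts in fp16
        return torch.float16
    config_dtype = AutoConfig.from_pretrained(model_str).torch_dtype
    if config_dtype is not None and not is_cpu:
        # trust the dtype the checkpoint was saved in
        return config_dtype
    # fp16 is poorly supported on CPU, so fall back to fp32 there
    return torch.float32 if is_cpu else torch.float16
```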
Try it out with, e.g., 2 GPUs.
If you want to say "only use GPUs that have 30 GB available", you can pass `min_gpu_mem`; a rough sketch of that filter is below.
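A minimal sketch of what such a filter could look like, assuming `min_gpu_mem` is compared against the currently free memory (the real check in this PR may differ):

```python
import torch


def usable_gpus(min_gpu_mem: int) -> list[int]:
    """Indices of GPUs with at least `min_gpu_mem` free bytes."""
    usable = []
    for idx in range(torch.cuda.device_count()):
        free, _total = torch.cuda.mem_get_info(idx)
        if free >= min_gpu_mem:
            usable.append(idx)
    return usable


# e.g. only consider GPUs with at least 30 GB free
devices = usable_gpus(30 * 1024**3)
```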
As per normal on the cluster, you may get this message (I don't have perms to delete the lock file).
You can still try it out by passing different max examples params to bypass the cache.