Serving an LLM with multiple GPUs in GKE doesn't work and fails with "The node was low on resource: ephemeral-storage." #1581

Open
raushan2016 opened this issue Jan 3, 2025 · 2 comments · May be fixed by #1584

Comments

@raushan2016

2025-01-03 11:06:50.367 PST
2025-01-03T19:06:50.366843Z  INFO text_generation_launcher: Download: [13/30] -- ETA: 0:06:42.769236
  • Instead of /data, the filesystem reported for / and /etc/hosts is filling up during the download and eventually runs out of disk space.
root@llm-689555d8bf-62gjd:/etc# df
Filesystem     1K-blocks     Used Available Use% Mounted on
overlay         98831908 75370476  23445048  77% /
tmpfs              65536        0     65536   0% /dev
/dev/nvme0n2   153707984       28 153691572   1% /data
tmpfs           62914560       12  62914548   1% /dev/shm
/dev/nvme0n1p1  98831908 75370476  23445048  77% /etc/hosts
tmpfs           62914560       12  62914548   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           49441048        0  49441048   0% /proc/acpi
tmpfs           49441048        0  49441048   0% /proc/scsi
tmpfs           49441048        0  49441048   0% /sys/firmware
root@llm-fb5d99cb-569b7:/usr/src# df  
Filesystem     1K-blocks      Used Available Use% Mounted on
overlay         98831908  25463724  73351800  26% /
tmpfs              65536         0     65536   0% /dev
/dev/nvme0n2   153707984 137809580  15882020  90% /data
tmpfs           62914560     48980  62865580   1% /dev/shm
/dev/nvme0n1p1  98831908  25463724  73351800  26% /etc/hosts
tmpfs           62914560        12  62914548   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           49439752         0  49439752   0% /proc/acpi
tmpfs           49439752         0  49439752   0% /proc/scsi
tmpfs           49439752         0  49439752   0% /sys/firmware
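
To compare listings like the ones above without reading the columns by eye, here is a minimal sketch that parses pasted `df` output and reports which mount holds the most data. The names `Mount`, `parse_df`, and `busiest` are hypothetical helpers made up for this illustration, and `SAMPLE` is a few rows copied from the first listing. Note that `/` and `/etc/hosts` show identical numbers: in a Kubernetes pod `/etc/hosts` is a file bind-mounted from the node's disk, so both rows most likely describe the same node filesystem, which is the one that counts against ephemeral storage.

```python
from typing import NamedTuple


class Mount(NamedTuple):
    filesystem: str
    blocks_1k: int
    used_1k: int
    available_1k: int
    mounted_on: str


def parse_df(text: str) -> list[Mount]:
    """Parse the default `df` table (1K-blocks) into Mount records."""
    mounts = []
    for line in text.strip().splitlines():
        parts = line.split()
        # Data rows have exactly six columns with a numeric second column;
        # the header row ("... Mounted on") splits into seven and is skipped.
        if len(parts) != 6 or not parts[1].isdigit():
            continue
        fs, blocks, used, avail, _pct, target = parts
        mounts.append(Mount(fs, int(blocks), int(used), int(avail), target))
    return mounts


def busiest(mounts: list[Mount]) -> Mount:
    """Return the mount with the most used space (first one wins on ties)."""
    return max(mounts, key=lambda m: m.used_1k)


# Rows copied from the first `df` listing in this issue.
SAMPLE = """\
Filesystem     1K-blocks     Used Available Use% Mounted on
overlay         98831908 75370476  23445048  77% /
/dev/nvme0n2   153707984       28 153691572   1% /data
/dev/nvme0n1p1  98831908 75370476  23445048  77% /etc/hosts
"""

if __name__ == "__main__":
    for m in parse_df(SAMPLE):
        print(f"{m.mounted_on:12s} {m.used_1k / 1024**2:8.2f} GiB used")
    print("busiest mount:", busiest(parse_df(SAMPLE)).mounted_on)
```

In the sample rows, `/data` holds only 28 KiB while the root overlay holds tens of GiB, which matches the observation that the download is not landing on the dedicated volume.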

NOTE: Other samples might also be impacted by the above change, since we don't have any automated gates for validation.

@raushan2016
Author

cc @alvarobartt @moficodes @alizaidis Can you please help with the above PR and plan for addressing the issue impacting the public sample?

@alvarobartt
Contributor

Hi @raushan2016, thanks for flagging! That's because the path to be mounted should be /tmp. When I updated the samples I only updated the container URI and forgot to update the mount path. The TGI DLCs use /tmp as HF_HOME, i.e. where the model weights are downloaded, so we should be mounting /tmp instead of /data. I'll create a PR to update the mounts. Sorry for any inconvenience!
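
For anyone who wants to confirm this from inside a running pod, a minimal sketch along these lines could show where the Hugging Face cache resolves and which mount backs it. It uses only the Python standard library. `cache_dir_report` and `mount_point` are hypothetical helper names, and the fallback of `~/.cache/huggingface` when `HF_HOME` is unset is an assumption based on the usual Hugging Face default, not something verified against the DLC image.

```python
import os
import shutil


def mount_point(path: str) -> str:
    """Walk up from path until a mount point is reached ('/' at worst)."""
    path = os.path.realpath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path


def cache_dir_report(env=os.environ, default="~/.cache/huggingface") -> dict:
    """Report where HF_HOME points and how full its backing mount is."""
    # Assumption: the container sets HF_HOME; otherwise use the usual default.
    hf_home = os.path.expanduser(env.get("HF_HOME", default))
    mount = mount_point(hf_home)
    usage = shutil.disk_usage(mount)
    return {
        "hf_home": hf_home,
        "mount_point": mount,
        "total_bytes": usage.total,
        "free_bytes": usage.free,
    }


if __name__ == "__main__":
    report = cache_dir_report()
    print(f"HF_HOME     : {report['hf_home']}")
    print(f"backed by   : {report['mount_point']}")
    print(f"free / total: {report['free_bytes'] / 1024**3:.1f} / "
          f"{report['total_bytes'] / 1024**3:.1f} GiB")
```

If the reported mount point is `/` rather than the dedicated volume, the weights are being written to the node's ephemeral storage, which would be consistent with the eviction described in this issue.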
