volcano.sh/vgpu-memory shows 0 #18

Closed
bx0216 opened this issue Aug 26, 2024 · 9 comments · Fixed by #19

Comments

bx0216 commented Aug 26, 2024

My environment has eight A100 GPUs. With either volcano-vgpu-device-plugin-with-monitor.yml or volcano-vgpu-device-plugin.yml,
kubectl get node/gpu-node -o yaml shows the GPU memory as 0, as shown below:
[Screenshot: 微信图片_20240826151900]

For reference, see this issue:
volcano-sh/devices#19
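
To compare nodes without reading the full YAML, the check described above can be scripted. This is a minimal sketch, not part of the plugin: it assumes the node object has the usual `status.capacity` and `status.allocatable` maps and uses the resource names mentioned in this thread. The helper names `vgpu_resources` and `fetch_node` are hypothetical, and the sample values are made up for illustration.

```python
import json
import subprocess

# Resource names as they appear in this thread; other plugin versions may differ.
VGPU_RESOURCES = ("volcano.sh/vgpu-number", "volcano.sh/vgpu-memory")


def vgpu_resources(node: dict) -> dict:
    """Pick the vGPU entries out of a node's status.capacity and status.allocatable.

    A resource that is absent from a section is reported as None.
    """
    status = node.get("status", {})
    return {
        section: {name: status.get(section, {}).get(name) for name in VGPU_RESOURCES}
        for section in ("capacity", "allocatable")
    }


def fetch_node(name: str) -> dict:
    """Fetch a node as parsed JSON through kubectl (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "node", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Illustrative node fragment with made-up values, shaped like the symptom reported here.
sample = {
    "status": {
        "capacity": {"volcano.sh/vgpu-number": "80", "volcano.sh/vgpu-memory": "0"},
        "allocatable": {"volcano.sh/vgpu-number": "80", "volcano.sh/vgpu-memory": "0"},
    }
}
print(json.dumps(vgpu_resources(sample), indent=2))
```

On a live cluster, `vgpu_resources(fetch_node("gpu-node"))` would produce the same report for the node named in the report above.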

MiterV1 commented Oct 21, 2024

I have the same problem. Has it been resolved?

Hugh-yw commented Nov 6, 2024

@bx0216 Hi, have you tested the case where, after uninstalling volcano-vgpu-device-plugin, the volcano.sh/vgpu-number and volcano.sh/gpu-memory resource labels still remain in the node information in the cluster?

MiterV1 commented Nov 6, 2024

Yes, it is an uninstall problem.
The labels are not updated.

Hugh-yw commented Nov 7, 2024

> Yes, it is an uninstall problem. The labels are not updated.

So it still has not been resolved?

MiterV1 commented Nov 13, 2024

Resolved. You need to use the latest version;
with the plugin's gpu-memory-factor option set to 10, the memory is detected correctly on a V100.
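
The thread does not explain what gpu-memory-factor does. The sketch below shows the arithmetic under one loudly stated assumption: that the factor makes each advertised volcano.sh/vgpu-memory unit stand for `factor` MiB instead of 1 MiB, so both the advertised total and any requested amount shrink by that factor. The function names and the 16384 MiB card size are hypothetical; check the plugin's own documentation before relying on this.

```python
def reported_units(total_mib: int, factor: int = 1) -> int:
    """Units advertised for a card, ASSUMING one unit stands for `factor` MiB."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return total_mib // factor


def request_units(wanted_mib: int, factor: int = 1) -> int:
    """Units to request to get at least `wanted_mib`, rounding up (same assumption)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return -(-wanted_mib // factor)  # ceiling division on integers


# Hypothetical card with 16384 MiB of memory and factor 10.
print(reported_units(16384, 10))
print(request_units(4096, 10))
```

If the assumption holds, workloads written for a factor of 1 would need their vgpu-memory requests divided by the factor after the setting is changed.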

yangjie727 commented

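> Resolved. You need to use the latest version; with the plugin's gpu-memory-factor option set to 10, the memory is detected correctly on a V100.

Which latest version is needed?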

MiterV1 commented Nov 19, 2024

The latest build:
volcano-vgpu-device-plugin:latest@sha256:fa3cf5a26a2a4f9dffae519183a51d97024b4d5dacbc02b5dc2b9e466e9d6788

yangjie727 commented Nov 20, 2024 via email

MiterV1 commented Nov 20, 2024

Have you run into a problem where cores cannot be isolated? I am now finding that cores are not isolated.
