local_docker scheduler unable to set gpu correctly #825

Open
9 tasks
ryxli opened this issue Feb 15, 2024 · 0 comments

ryxli commented Feb 15, 2024

🐛 Bug

The local_docker scheduler should request the "gpu" device capability in its DeviceRequest, not "compute":
https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L308

                    c.kwargs["device_requests"] = [
                        DeviceRequest(
                            count=resource.gpu,
                            capabilities=[["compute"]],
                        )
                    ]

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • torchx.cli
  • [x] torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

To Reproduce

Steps to reproduce the behavior:

  1. Start any container with the local_docker scheduler on a machine with an NVIDIA GPU.
  2. Run nvidia-smi inside the container and observe that the container does not detect the GPU.

pretrain/0 
pretrain/0 =============
pretrain/0 == PyTorch ==
pretrain/0 =============
pretrain/0 
pretrain/0 NVIDIA Release 23.12 (build 76438008)
pretrain/0 PyTorch Version 2.2.0a0+81ea7a4
pretrain/0 
pretrain/0 Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
pretrain/0 
pretrain/0 Copyright (c) 2014-2023 Facebook Inc.
pretrain/0 Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
pretrain/0 Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
pretrain/0 Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
pretrain/0 Copyright (c) 2011-2013 NYU                      (Clement Farabet)
pretrain/0 Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
pretrain/0 Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
pretrain/0 Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
pretrain/0 Copyright (c) 2015      Google Inc.
pretrain/0 Copyright (c) 2015      Yangqing Jia
pretrain/0 Copyright (c) 2013-2016 The Caffe contributors
pretrain/0 All rights reserved.
pretrain/0 
pretrain/0 Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
pretrain/0 
pretrain/0 This container image and its contents are governed by the NVIDIA Deep Learning Container License.
pretrain/0 By pulling and using the container, you accept the terms and conditions of this license:
pretrain/0 https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
pretrain/0 
pretrain/0 Failed to detect NVIDIA driver version.
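The failure can also be checked mechanically by scanning the container log for the message shown in the last line above. This is only a sketch: the function name `driver_detection_failed` is hypothetical, and it assumes the log has already been captured as text (for example from the scheduler's log output).

```python
# Message printed by the NVIDIA container entrypoint in the log above
# when the driver is not visible inside the container.
FAILURE_MARKER = "Failed to detect NVIDIA driver version."


def driver_detection_failed(log_text: str) -> bool:
    """Return True if any log line contains the driver-detection failure.

    Lines may carry a role/replica prefix such as "pretrain/0 ", so the
    check is a substring match rather than an exact comparison.
    """
    return any(FAILURE_MARKER in line for line in log_text.splitlines())
```

A log that ends with the CUDA compatibility note instead of the failure message would return False.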

Expected behavior

If the device capability is set to "gpu", the GPU devices should be visible inside the container and the NVIDIA driver should be detected.

After changing "compute" to "gpu", it works as expected:

pretrain/0 
pretrain/0 =============
pretrain/0 == PyTorch ==
pretrain/0 =============
pretrain/0 
pretrain/0 NVIDIA Release 23.12 (build 76438008)
pretrain/0 PyTorch Version 2.2.0a0+81ea7a4
pretrain/0 
pretrain/0 Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
pretrain/0 
pretrain/0 Copyright (c) 2014-2023 Facebook Inc.
pretrain/0 Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
pretrain/0 Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
pretrain/0 Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
pretrain/0 Copyright (c) 2011-2013 NYU                      (Clement Farabet)
pretrain/0 Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
pretrain/0 Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
pretrain/0 Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
pretrain/0 Copyright (c) 2015      Google Inc.
pretrain/0 Copyright (c) 2015      Yangqing Jia
pretrain/0 Copyright (c) 2013-2016 The Caffe contributors
pretrain/0 All rights reserved.
pretrain/0 
pretrain/0 Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
pretrain/0 
pretrain/0 This container image and its contents are governed by the NVIDIA Deep Learning Container License.
pretrain/0 By pulling and using the container, you accept the terms and conditions of this license:
pretrain/0 https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
pretrain/0 
pretrain/0 NOTE: CUDA Forward Compatibility mode ENABLED.
pretrain/0   Using CUDA 12.3 driver version 545.23.08 with kernel driver version 535.129.03.
pretrain/0   See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
pretrain/0 

Environment

  • torchx version (e.g. 0.1.0rc1): 0.6.0
  • Python version: 3.10
  • OS (e.g., Linux): AL2 (Amazon Linux 2)
  • How you installed torchx (conda, pip, source, docker): pip
  • Docker image and tag (if using docker):
  • Git commit (if installed from source):
  • Execution environment (on-prem, AWS, GCP, Azure etc):
  • Any other relevant information:

Additional context
