add LSF scheduler #588
Conversation
Thanks for contributing! This is very exciting to see :)
Have you looked at https://github.com/IBMSpectrumComputing/lsf-python-api at all? Just wondering if that might be a slightly cleaner approach than shell commands
For the shell commands it'd be good to write some unit tests where we wrap subprocess.run to provide expected output for each method
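A minimal sketch of that mocking approach, assuming a hypothetical `run_bjobs` helper that shells out to `bjobs` (the real lsf_scheduler method names may differ):

```python
# Sketch: unit-testing a CLI wrapper by patching subprocess.run.
# run_bjobs is a hypothetical helper, not the actual lsf_scheduler API.
import subprocess
import unittest
from unittest.mock import patch

def run_bjobs(job_id: str) -> str:
    # Thin wrapper around the CLI; this is what the tests stub out.
    proc = subprocess.run(
        ["bjobs", "-noheader", "-o", "stat", job_id],
        capture_output=True, text=True, check=True,
    )
    return proc.stdout.strip()

class BjobsTest(unittest.TestCase):
    @patch("subprocess.run")
    def test_run_bjobs(self, mock_run) -> None:
        # Provide the output we expect LSF to print, without needing LSF.
        mock_run.return_value = subprocess.CompletedProcess(
            args=[], returncode=0, stdout="RUN\n", stderr=""
        )
        self.assertEqual(run_bjobs("123"), "RUN")

if __name__ == "__main__":
    unittest.main()
```

The same pattern would extend to `bsub`/`bkill` wrappers, one canned output per method.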
It'd also be nice to use workspaces for docker/native to enable automatic patching though supporting both might be tricky without some other refactors. Maybe just start with Docker since DirWorkspace seems to not be used here?
```diff
@@ -51,7 +51,7 @@ def main() -> None:
     torchx_image = "dummy_image"
     dryrun = False

-    if scheduler in ("kubernetes", "local_docker", "aws_batch"):
+    if scheduler in ("kubernetes", "local_docker", "aws_batch", "lsf"):
```
is there any easy way for us to set up a local LSF scheduler in the CI test environment (e.g. via Docker)? Would be nice to have an E2E integration test like we do for the other schedulers
torchx/schedulers/lsf_scheduler.py

```python
class LsfOpts(TypedDict, total=False):
    lsf_queue: Optional[str]
    jobdir: Optional[str]  # NOTE: *job_dir* cannot be used -- somehow it overwrites the --image flag (bug?), so use *jobdir* (no underscore)
```
this is because you're using the DirWorkspace
you can remove that from the inheritance if you need to
torchx/schedulers/lsf_scheduler.py

```python
#------------
{self.materialize()}"""

class LsfScheduler(Scheduler[LsfOpts], DirWorkspace):
```
I don't think this is actually using the DirWorkspace? Ideally we could use both DockerWorkspace and DirWorkspace though that would require #590
```python
    host_network: Optional[bool]
    shm_size: Optional[str]

def get_docker_command(job_name: str, role: Role, cfg: LsfOpts) -> str:
```
wonder if we can share the same command logic in the DockerScheduler
torchx/schedulers/lsf_scheduler.py

```python
    if resource.gpu > 0:
        cmds += ["--gpus", "all"]
    cmds += ["--entrypoint", role.entrypoint, "--rm", role.image] + [shlex.quote(arg) for arg in role.args]
    return " ".join(cmds).replace("$", "\\$")
```
need to use something like shlex.join -- just using " ".join isn't safe from both a correctness and security standpoint
applies below as well
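For illustration, a quick comparison of the two approaches (a minimal sketch; `shlex.join` requires Python 3.8+):

```python
import shlex

args = ["echo", "hello world; rm -rf /"]

# Naive joining loses the argument boundary, so a shell would
# interpret the semicolon and run the second command:
print(" ".join(args))    # echo hello world; rm -rf /

# shlex.join quotes each argument so the command round-trips safely:
print(shlex.join(args))  # echo 'hello world; rm -rf /'
```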
@takeshi-yoshimura there's a balance here -- not sure how hard it is to install the lsf library. If it's painful, maybe that's not the best option.

@takeshi-yoshimura what are your plans for updating this PR? We'd like to include it, but it still needs some polish.

@d4l3k Regarding an LSF Docker image for local tests, I have found no official ones so far... Let me search more and share something here again soon. Sorry for my late response, but this week I think I can concentrate on revising this PR.
I updated lsf_scheduler.py according to your comments. Can you please take a look? @d4l3k

Honestly speaking, I don't recommend using lsf-python-api. Its critical weakness is that it has no support for job submissions with GPUs (IBMSpectrumComputing/lsf-python-api#36). It also only exposes low-level, complex interfaces for Python, and I couldn't find good documentation on how to use it.

I share your concern about tests. As far as I can tell from my investigation, no public container images are available currently. The issue was also discussed in dask-jobqueue (dask/dask-jobqueue#115). As they discussed, we can download LSF Suite Community Edition to build a Docker image. You need an IBM account to download it, but it's distributed under a free license that provides enough capability for testing (a single GPU and a limited number of resources). Here are my test Dockerfile and other scripts (we probably cannot publish the image anywhere public). Maybe we can add this kind of code for testing. Dockerfile (lsfsce10.2.0.12-x86_64.tar.gz is downloaded from here):
This Dockerfile installs lsf-python-api as well. I found a pip package for it, but it has not been updated in years: https://pypi.org/project/platform-python-lsf-api/. myinstall.config:
startserver.sh:
Public images for LSF were deleted for security reasons in the past. The official instructions for building an LSF image are at https://github.com/IBMSpectrumComputing/lsf-operator/blob/main/README-Building-the-images.md.
Codecov Report
```
@@            Coverage Diff             @@
##             main     #588      +/-   ##
==========================================
- Coverage   94.94%   94.42%    -0.53%
==========================================
  Files          67       64        -3
  Lines        4134     4429      +295
==========================================
+ Hits         3925     4182      +257
- Misses        209      247       +38
```
If lsf-python-api isn't in good shape, there's no strong need to use it then. That was my concern when I looked at it before, so glad you have the same opinion. Lack of GPU support is a big blocker, wow.

For Slurm we just use the CLI and it's stable enough -- we can just mock the inputs/outputs by using patch on subprocess, which works fairly well.

Re: testing -- having an LSF integration test isn't a blocker for landing this diff. We can mark this as prototype and add integ testing in a follow-up diff. If there's a way to programmatically fetch the LSF scheduler and install it using credentials, we can add some creds to GitHub secrets, which should keep them safe.

Do we have any contacts at LSF? Wondering if this policy around Docker images is something we can get changed/exempted from. Is it an option to get a small managed LSF test cluster provided by IBM? We can chat more on Slack.
@takeshi-yoshimura Think the main thing blocking this particular diff is just adding some comprehensive unit tests (and fixing lint/pyre)
@d4l3k I fixed lint and pyre and added unit tests. Please check them. To unit-test without subprocess calls, I separated the parser logic from the LsfScheduler methods.
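The refactor pattern looks roughly like this (a sketch only -- the function and field names here are hypothetical, not the actual lsf_scheduler code): the parser is a pure function, so tests need no mocks at all.

```python
# Hypothetical sketch of separating parsing from subprocess calls so the
# parser can be unit-tested on canned output; names are illustrative.
from typing import Dict, List

def parse_bjobs(output: str) -> List[Dict[str, str]]:
    # Pure function: trivial to test without an LSF installation.
    jobs = []
    for line in output.strip().splitlines():
        job_id, stat, name = line.split(maxsplit=2)
        jobs.append({"job_id": job_id, "stat": stat, "name": name})
    return jobs

# Unit test on canned bjobs output:
assert parse_bjobs("100 RUN echo-0\n101 PEND echo-1\n") == [
    {"job_id": "100", "stat": "RUN", "name": "echo-0"},
    {"job_id": "101", "stat": "PEND", "name": "echo-1"},
]
```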
I'm afraid I cannot get approval to set up IBM hosts just for this test. I also asked the LSF developers about the Docker images but got no good answers. To be honest, I have no idea how to solve this right now. There is a forum page for LSF: https://community.ibm.com/community/user/businessanalytics/communities/community-home/digestviewer?communitykey=74d589b7-7276-4d70-acf5-0fc26430c6c0. I will keep asking within IBM, but we could also raise our issue on that open channel. By the way, I requested a Slack invitation from https://pytorch.org/resources last week but got no reply yet. Is the system working?

This idea is also difficult: the download page for the LSF community edition requires authentication in a web browser.
Thanks for contributing!
The one 3.7 failure seems to be unrelated -- think it's already fixed in trunk
It would be nice to test some of the subprocess calls w/ mocks but I can quickly add a couple of those when I land it
Sounds like for the integration tests it's not going to be very feasible which is unfortunate -- not much I can do from our side. With the auth we might be able to do something by storing the credentials in github secrets
@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary: I prototyped the LSF scheduler for torchx. It supports native, Docker, and Singularity runtimes with a shared filesystem at this moment. I confirmed it worked with Gloo and NCCL on small VPC V100 clusters.

Note: the `torchx log` command is available only when the torchx host shares the filesystem with the cluster nodes (e.g., NFS).

In a nutshell, the LSF scheduler translates a torchx request into LSF job submissions (i.e., `bsub`). For distributed apps, it creates multiple `bsub` calls. I also added lsf to scripts/component_integration_tests.py. Here is the log output with my three-node LSF cluster; you can find dryrun results there: [component_integration_tests.lsf.txt](https://github.com/pytorch/torchx/files/9424891/component_integration_tests.lsf.txt)

Regarding Singularity image compatibility, Singularity already automates converting Docker images into the Singularity image format, so all we have to do is generate singularity-exec arguments from torchx requests. Note that users still need to prefix image names with docker:// if they want to use Docker images.

The following are example commands.

**Example: native hello_world and CLI utils**
```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 3
lsf://torchx/echo-pxc3gn5ct061k
$ torchx list -s lsf
$ torchx status lsf://torchx/echo-pxc3gn5ct061k
$ torchx cancel lsf://torchx/echo-pxc3gn5ct061k
$ torchx log --stream stdout lsf://torchx/echo-pxc3gn5ct061k/echo/0
```

**Example: Docker hello_world**
```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=docker utils.echo --image alpine:latest --msg hello_world --num_replicas 3
```

**Example: Singularity hello_world**
```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=singularity utils.echo --image docker://alpine:latest --msg hello_world --num_replicas 3
```

**Example: Docker Distributed**
```
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True" dist.ddp -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
```

**Example: Singularity Distributed**
```
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=singularity,host_network=True" dist.ddp --image docker://ghcr.io/pytorch/torchx:0.3.0dev0 -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
```

Pull Request resolved: pytorch#588
Reviewed By: msaroufim
Differential Revision: D40184939
Pulled By: msaroufim
fbshipit-source-id: 5a13d2ee88b3b5cf1b8e5a3f6786b955d47f21f8
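To make the translation concrete, here is a hedged sketch of how a single replica might map onto a `bsub` invocation. This is illustrative, not the actual lsf_scheduler code; the flag set (-J, -q, -o, -gpu) is assumed from standard LSF options.

```python
# Hedged sketch of the request -> bsub translation idea; not the actual
# torchx code. -J (job name), -q (queue), -o (output file), and -gpu are
# standard bsub options.
import shlex
from typing import List, Optional

def build_bsub(job_name: str, cmd: List[str], *, queue: Optional[str] = None,
               gpus: int = 0, jobdir: str = "/mnt/data/torchx") -> str:
    args = ["bsub", "-J", job_name, "-o", f"{jobdir}/{job_name}.out"]
    if queue:
        args += ["-q", queue]
    if gpus > 0:
        args += ["-gpu", f"num={gpus}"]  # GPU request syntax in modern LSF
    return shlex.join(args + cmd)

# One submission per replica; a 2x2 dist.ddp job would generate four of these.
print(build_bsub("echo-0", ["python", "-c", "print('hello_world')"], gpus=2))
```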
Summary: Pull Request resolved: pytorch#610 (same summary as #588 above).

Pull Request resolved: pytorch#588
Reviewed By: anirbanr-fb-r2p, msaroufim, kurman
Differential Revision: D40184939
fbshipit-source-id: d4a4f68d74a2ca12f95f683080c6a00137966ca6
@takeshi-yoshimura The biggest gap right now is workspaces. Having support for DockerWorkspace to allow patching the images before launching the job would be very nice. If you want to add Singularity support, that'd be a big help -- it'd be pretty nice to add some abstraction for the container execution side here (i.e. the same interface for Docker vs Singularity) that we could use across a couple of schedulers.

Testing would also be a big help, as would circulating it around to get some user feedback.