Commit
Summary: I prototyped an LSF scheduler for torchx. For now it supports native, Docker, and Singularity runtimes and relies on a shared filesystem. I confirmed that it works with Gloo and NCCL on small VPC V100 clusters. Note: the `torchx log` command is available only when the torchx host shares a filesystem with the cluster nodes (e.g., NFS).

In a nutshell, the LSF scheduler translates a torchx request into LSF job submissions (i.e., `bsub`). For distributed apps, it issues multiple `bsub` submissions.

I also added lsf to scripts/component_integration_tests.py. The log output from my three-node LSF cluster, which includes the dryrun results, is here: [component_integration_tests.lsf.txt](https://github.com/pytorch/torchx/files/9424891/component_integration_tests.lsf.txt)

Regarding Singularity image compatibility: Singularity already automates converting Docker images into the Singularity image format, so all we have to do is generate the `singularity exec` arguments from torchx requests. Note that users still need to add the `docker://` prefix to image names if they want to use Docker images. The following are example commands.
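As a rough illustration of the translation described above (separate from the example commands below), here is a minimal Python sketch of how one role with several replicas could be expanded into one `bsub` command line per replica. It is not the code from this PR. The function names, the job-naming scheme, and the output file layout are all hypothetical, and the Docker and Singularity wrapping is deliberately reduced to bind mounts only.

```python
import shlex
from typing import Dict, List, Optional


def wrap_for_runtime(
    runtime: str, image: str, cmd: List[str], mounts: Dict[str, str]
) -> List[str]:
    """Wrap a command for the chosen runtime (hypothetical helper)."""
    if runtime == "native":
        # Native jobs run the command directly on the cluster node.
        return list(cmd)
    if runtime == "docker":
        args = ["docker", "run", "--rm"]
        for src, dst in mounts.items():
            args += ["-v", f"{src}:{dst}"]
        return args + [image] + list(cmd)
    if runtime == "singularity":
        # Singularity accepts docker:// URIs and converts the image itself.
        args = ["singularity", "exec"]
        for src, dst in mounts.items():
            args += ["--bind", f"{src}:{dst}"]
        return args + [image] + list(cmd)
    raise ValueError(f"unknown runtime: {runtime}")


def build_bsub_commands(
    app_id: str,
    role: str,
    num_replicas: int,
    jobdir: str,
    runtime: str,
    image: str,
    cmd: List[str],
    mounts: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return one bsub command line per replica (assumed naming scheme)."""
    commands = []
    for replica in range(num_replicas):
        job_name = f"{app_id}-{role}-{replica}"
        wrapped = wrap_for_runtime(runtime, image, cmd, mounts or {})
        bsub = [
            "bsub",
            "-J", job_name,                     # LSF job name
            "-cwd", jobdir,                     # working directory on the shared filesystem
            "-o", f"{jobdir}/{job_name}.out",   # stdout file
            "-e", f"{jobdir}/{job_name}.err",   # stderr file
        ] + wrapped
        commands.append(shlex.join(bsub))
    return commands


if __name__ == "__main__":
    for line in build_bsub_commands(
        app_id="echo-example",
        role="echo",
        num_replicas=3,
        jobdir="/mnt/data/torchx",
        runtime="singularity",
        image="docker://alpine:latest",
        cmd=["echo", "hello_world"],
    ):
        print(line)
```

Writing stdout and stderr under `jobdir` in this sketch mirrors why `torchx log` needs a shared filesystem: the torchx host can read a replica's output only if it can see the files the job wrote.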
**Example: native hello_world and CLI utils**

```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 3
lsf://torchx/echo-pxc3gn5ct061k
$ torchx list -s lsf
$ torchx status lsf://torchx/echo-pxc3gn5ct061k
$ torchx cancel lsf://torchx/echo-pxc3gn5ct061k
$ torchx log --stream stdout lsf://torchx/echo-pxc3gn5ct061k/echo/0
```

**Example: Docker hello_world**

```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=docker utils.echo --image alpine:latest --msg hello_world --num_replicas 3
```

**Example: Singularity hello_world**

```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=singularity utils.echo --image docker://alpine:latest --msg hello_world --num_replicas 3
```

**Example: Docker Distributed**

```
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True" dist.ddp -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
```

**Example: Singularity Distributed**

```
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=singularity,host_network=True" dist.ddp --image docker://ghcr.io/pytorch/torchx:0.3.0dev0 -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
```

Pull Request resolved: pytorch#588

Reviewed By: msaroufim

Differential Revision: D40184939

Pulled By: msaroufim

fbshipit-source-id: 5a13d2ee88b3b5cf1b8e5a3f6786b955d47f21f8