add LSF scheduler (pytorch#588)
Summary:
I prototyped an LSF scheduler for torchx. It currently supports native, Docker, and Singularity runtimes and requires a shared filesystem. I confirmed that it works with Gloo and NCCL on small VPC V100 clusters.

Note: the `torchx log` command works only when the torchx host shares a filesystem with the cluster nodes (e.g., over NFS).

In a nutshell, the LSF scheduler translates a torchx request into LSF job submissions (i.e., `bsub` commands). For distributed apps, it issues multiple `bsub` commands. I also added lsf to scripts/component_integration_tests.py. The log output below is from my three-node LSF cluster and includes the dryrun results.

[component_integration_tests.lsf.txt](https://github.com/pytorch/torchx/files/9424891/component_integration_tests.lsf.txt)
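
As a rough illustration of the request-to-`bsub` translation described above, here is a minimal sketch. It is not the code in `lsf_scheduler.py`: the helper name `build_bsub_commands`, the job naming scheme, and the one-`bsub`-per-replica layout are assumptions made for illustration. Only the `bsub` options `-J` (job name), `-o` (stdout file), and `-e` (stderr file) are taken from LSF itself.

```python
import shlex
from typing import List


def build_bsub_commands(
    app_id: str,
    role: str,
    entrypoint: List[str],
    num_replicas: int,
    jobdir: str,
) -> List[List[str]]:
    """Hypothetical helper: build one bsub argv per replica of a role."""
    commands = []
    for replica_id in range(num_replicas):
        # Assumed naming scheme; the real scheduler may name jobs differently.
        job_name = f"{app_id}-{role}-{replica_id}"
        commands.append(
            [
                "bsub",
                "-J", job_name,                      # LSF job name
                "-o", f"{jobdir}/{job_name}.out",    # stdout on the shared filesystem
                "-e", f"{jobdir}/{job_name}.err",    # stderr on the shared filesystem
                *entrypoint,
            ]
        )
    return commands


if __name__ == "__main__":
    for argv in build_bsub_commands(
        "echo-pxc3gn5ct061k", "echo", ["echo", "hello_world"], 3, "/mnt/data/torchx"
    ):
        print(shlex.join(argv))
```

Writing the log files under `jobdir` is what would make them readable from the torchx host when that directory is on a shared filesystem.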

Regarding Singularity image compatibility, Singularity already converts Docker images into the Singularity image format automatically, so the scheduler only has to generate `singularity exec` arguments from torchx requests. Note that users still need to prefix image names with `docker://` if they want to use Docker images.
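
The argument generation could look roughly like the sketch below. The helper names `parse_bind_mount` and `singularity_exec_args` are hypothetical and only bind mounts are handled; the `singularity exec`, `--bind src:dst`, and `--nv` pieces are standard Singularity CLI usage. The image string is passed through unchanged, which is why a Docker image needs the `docker://` prefix.

```python
from typing import Dict, List


def parse_bind_mount(spec: str) -> Dict[str, str]:
    """Parse a mount string such as 'type=bind,src=/a,dst=/b' into a dict."""
    return dict(item.split("=", 1) for item in spec.split(","))


def singularity_exec_args(
    image: str,
    mounts: List[str],
    cmd: List[str],
    gpu: bool = False,
) -> List[str]:
    """Hypothetical helper: build a `singularity exec` argv."""
    args = ["singularity", "exec"]
    if gpu:
        args.append("--nv")  # expose NVIDIA GPUs inside the container
    for spec in mounts:
        mount = parse_bind_mount(spec)
        if mount.get("type") != "bind":
            raise ValueError(f"only bind mounts are handled in this sketch: {spec}")
        args += ["--bind", f"{mount['src']}:{mount['dst']}"]
    args.append(image)  # e.g. docker://alpine:latest, passed through as-is
    args += cmd
    return args


if __name__ == "__main__":
    print(
        singularity_exec_args(
            "docker://alpine:latest",
            ["type=bind,src=/mnt/data/dist,dst=/data"],
            ["echo", "hello_world"],
        )
    )
```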

The following are example commands.

**Example: native hello_world and CLI utils**

```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 3
lsf://torchx/echo-pxc3gn5ct061k
$ torchx list -s lsf
$ torchx status lsf://torchx/echo-pxc3gn5ct061k
$ torchx cancel lsf://torchx/echo-pxc3gn5ct061k
$ torchx log --stream stdout lsf://torchx/echo-pxc3gn5ct061k/echo/0
```

**Example: Docker hello_world**
```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=docker utils.echo --image alpine:latest --msg hello_world --num_replicas 3
```

**Example: Singularity hello_world**
```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=singularity utils.echo --image docker://alpine:latest --msg hello_world --num_replicas 3
```

**Example: Docker Distributed**
```
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True" dist.ddp -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
```

**Example: Singularity Distributed**
```
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=singularity,host_network=True" dist.ddp --image docker://ghcr.io/pytorch/torchx:0.3.0dev0 -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
```

Pull Request resolved: pytorch#588

Reviewed By: msaroufim

Differential Revision: D40184939

Pulled By: msaroufim

fbshipit-source-id: 5a13d2ee88b3b5cf1b8e5a3f6786b955d47f21f8
takeshi-yoshimura authored and facebook-github-bot committed Oct 7, 2022
1 parent 73b6f09 commit 43ad659
Showing 6 changed files with 1,147 additions and 1 deletion.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -74,6 +74,7 @@ Works With
    schedulers/slurm
    schedulers/ray
    schedulers/aws_batch
+   schedulers/lsf

 .. fbcode::

18 changes: 18 additions & 0 deletions docs/source/schedulers/lsf.rst
@@ -0,0 +1,18 @@
IBM Spectrum LSF
=================

.. automodule:: torchx.schedulers.lsf_scheduler

.. currentmodule:: torchx.schedulers.lsf_scheduler

.. autoclass:: LsfScheduler
   :members:
   :show-inheritance:

.. autoclass:: LsfBsub
   :members:

Reference
~~~~~~~~~~~~

.. autofunction:: create_scheduler
13 changes: 12 additions & 1 deletion scripts/component_integration_tests.py
@@ -51,7 +51,7 @@ def main() -> None:
     torchx_image = "dummy_image"
     dryrun = False

-    if scheduler in ("kubernetes", "local_docker", "aws_batch"):
+    if scheduler in ("kubernetes", "local_docker", "aws_batch", "lsf"):
         try:
             build = build_and_push_image()
             torchx_image = build.torchx_image
@@ -105,6 +105,17 @@ def main() -> None:
             },
             "workspace": f"file://{os.getcwd()}",
         },
+        "lsf": {
+            "providers": [
+                component_provider,
+            ],
+            "image": torchx_image,
+            "cfg": {
+                "runtime": "docker",
+                "jobdir": "/mnt/data/torchx",
+                "host_network": True,
+            },
+        },
     }

     params = run_parameters[scheduler]
1 change: 1 addition & 0 deletions torchx/schedulers/__init__.py
@@ -19,6 +19,7 @@
     "kubernetes": "torchx.schedulers.kubernetes_scheduler",
     "aws_batch": "torchx.schedulers.aws_batch_scheduler",
     "ray": "torchx.schedulers.ray_scheduler",
+    "lsf": "torchx.schedulers.lsf_scheduler",
 }
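
For readers unfamiliar with this kind of name-to-module table, the sketch below shows one way such a mapping can be resolved lazily, so that a scheduler module is imported only when it is requested. The helper `load_factory` and the demo registry are hypothetical and are not torchx's actual loader; the default attribute name `create_scheduler` is the factory documented in `lsf.rst` above.

```python
import importlib
from typing import Any, Callable, Dict


def load_factory(
    name: str,
    registry: Dict[str, str],
    attr: str = "create_scheduler",
) -> Callable[..., Any]:
    """Hypothetical helper: import the module registered under `name`
    and return its factory attribute."""
    if name not in registry:
        raise KeyError(f"unknown scheduler: {name}")
    # The import happens here, not when the registry dict is defined.
    module = importlib.import_module(registry[name])
    return getattr(module, attr)


if __name__ == "__main__":
    # Stand-in registry that uses a stdlib module so the sketch runs without torchx.
    demo_registry = {"demo": "json"}
    factory = load_factory("demo", demo_registry, attr="dumps")
    print(factory({"scheduler": "lsf"}))
```

With torchx installed, the same helper could be pointed at an entry such as `{"lsf": "torchx.schedulers.lsf_scheduler"}`.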

