Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add LSF scheduler #588

Closed
wants to merge 5 commits into from
Closed

Conversation

takeshi-yoshimura
Copy link
Contributor

I prototyped the LSF scheduler for torchx. It supports native, Docker, and Singularity as runtime with a shared filesystem at this moment. I confirmed it worked with Gloo and NCCL on small VPC V100 clusters.

Note: torchx log command is available only when the torchx host shares the filesystem with cluster nodes (e.g., NFS).

In a nutshell, the LSF scheduler translates a torchx request to be LSF job submissions (i.e., bsub). For distributed apps, it creates multiple bsub. I also added lsf to scripts/component_integration_tests.py. Here is the log output with my three-node LSF cluster and you can find dryrun results there.

component_integration_tests.lsf.txt

Regarding Singularity image compatibility, it already automates to convert docker images into singularity image format, and so, only we have to do is to generate singularity-exec arguments from torchx requests. Note that users still need to set prefix docker:// for image names if they want to use docker images.

The following are example commands.

Example: native hello_world and CLI utils

$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 3
lsf://torchx/echo-pxc3gn5ct061k
$ torchx list -s lsf
$ torchx status lsf://torchx/echo-pxc3gn5ct061k
$ torchx cancel lsf://torchx/echo-pxc3gn5ct061k
$ torchx log --stream stdout lsf://torchx/echo-pxc3gn5ct061k/echo/0

Example: Docker hello_world

$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=docker utils.echo --image alpine:latest --msg hello_world --num_replicas 3

Example: Singularity hello_world

$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=singularity utils.echo --image docker://alpine:latest --msg hello_world --num_replicas 3

Example: Docker Distributed

$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True" dist.ddp -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"

Example: Singularity Distributed

$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=singularity,host_network=True" dist.ddp --image docker://ghcr.io/pytorch/torchx:0.3.0dev0 -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 25, 2022
@d4l3k d4l3k self-requested a review August 25, 2022 17:44
Copy link
Member

@d4l3k d4l3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing! This is very exciting to see :)

Have you looked at https://github.com/IBMSpectrumComputing/lsf-python-api at all? Just wondering if that might be a slightly cleaner approach than shell commands

For the shell commands it'd be good to write some unit tests where we wrap subprocess.run to provide expected output for each method

It'd also be nice to use workspaces for docker/native to enable automatic patching though supporting both might be tricky without some other refactors. Maybe just start with Docker since DirWorkspace seems to not be used here?

@@ -51,7 +51,7 @@ def main() -> None:
torchx_image = "dummy_image"
dryrun = False

if scheduler in ("kubernetes", "local_docker", "aws_batch"):
if scheduler in ("kubernetes", "local_docker", "aws_batch", "lsf"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any easy way for us to setup a local LSF scheduler in the CI test environment (i.e via docker)? Would be nice to have an E2E integration test like we do for the other schedulers


class LsfOpts(TypedDict, total=False):
lsf_queue: Optional[str]
jobdir: Optional[str] # NOTE: *job_dir* cannot be used. somehow it overwrites --image flag (bug?). so use *jobdir* (without underscore, _)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is because you're using the DirWorkspace you can remove that from the inheritance if you need to

#------------
{self.materialize()}"""

class LsfScheduler(Scheduler[LsfOpts], DirWorkspace):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is actually using the DirWorkspace? Ideally we could use both DockerWorkspace and DirWorkspace though that would require #590

host_network: Optional[bool]
shm_size: Optional[str]

def get_docker_command(job_name: str, role: Role, cfg: LsfOpts) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wonder if we can share the same command logic in the DockerScheduler

if resource.gpu > 0:
cmds += [f"--gpus", "all"]
cmds += ["--entrypoint", role.entrypoint, "--rm", role.image] + [shlex.quote(arg) for arg in role.args]
return " ".join(cmds).replace("$", "\\$")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to use something like shlex.join -- just using " ".join isn't safe from both a correctness and security standpoint

applies below as well

@takeshi-yoshimura
Copy link
Contributor Author

@d4l3k
Thank you for commenting my code! As you point out, it seems much better to use the python library and focus on Docker. I wil try to fix the code and push it again here next week.

@d4l3k
Copy link
Member

d4l3k commented Aug 26, 2022

@takeshi-yoshimura there's a balance here -- not sure how hard it is to install the lsf library. If it's painful maybe that's not the best option

@d4l3k
Copy link
Member

d4l3k commented Sep 16, 2022

@takeshi-yoshimura what's your plans for updating this pr? We'd like to include it but still needs some polish

@takeshi-yoshimura
Copy link
Contributor Author

@d4l3k
The python library for LSF seems not to provide pip prebuilt binaries. We need to download and build their code at running LSF nodes as far as my test. It may be difficult to work with test cases and build process for torchx.

Regarding LSF Docker image for local tests, I found no official ones so far...

Let me search more and share something again soon here. Sorry for my late responce, but this week I think I can concentrate on revising this PR.

@takeshi-yoshimura
Copy link
Contributor Author

takeshi-yoshimura commented Sep 20, 2022

I updated lsf_scheduler.py according to your comment. can you please take a look? @d4l3k

Honestly speaking, I don't recommend using lsf-python-api. Its critical weak point is no support for job submissions with GPUs (IBMSpectrumComputing/lsf-python-api#36). Also, it only has low-level, complex interfaces for python and I couldn't find good documentation on how to use it.

I am also concerned about tests as you pointed out. As far as my investigation, no public container images are available currently. The issue was also discussed in dask-jobqueue (dask/dask-jobqueue#115). As they discussed, we can download LSF Suite Community Edition to build a Docker image. You need an IBM account to download it, but it's distributed under a free license that enables enough capability for testing (a single GPU and limited number of resources). Here is my test Dockerfile and other scripts (probably we cannot put the image on public places). Maybe we can add this kind of code for testing.

Dockerfile (lsfsce10.2.0.12-x86_64.tar.gz is downloaded from here):

from nvidia/cuda:11.7.1-devel-ubuntu20.04
ARG LSFSCE10_2_0_12=lsfsce10.2.0.12-x86_64.tar.gz
COPY $LSFSCE10_2_0_12 /
COPY startserver.sh /
COPY myinstall.config /

ENV HOSTNAME lsf

RUN apt-get update && apt-get install -y python3 python3-pip swig git ed vim && rm -rf /var/cache/apt/* && \
    useradd -m lsfadmin && \
    cd / && tar xzf lsfsce10.2.0.12-x86_64.tar.gz && cd lsfsce10.2.0.12-x86_64/lsf && tar xzf lsf10.1_lsfinstall_linux_x86_64.tar.Z && cd lsf10.1_lsfinstall && \
    ./lsfinstall -f /myinstall.config && rm -rf /myinstall.config /lsfsce10.2.0.12-x86_64* && echo "LSF_ROOT_USER=Y" >> /usr/share/lsf/conf/lsf.conf && \
    echo "LSB_GPU_NEW_SYNTAX=extend" >> /usr/share/lsf/conf/lsf.conf && \
    echo 'source /usr/share/lsf/conf/profile.lsf' >> /home/lsfadmin/.bashrc && echo 'source /usr/share/lsf/conf/profile.lsf' >> /root/.bashrc && \
    cd / && git clone https://github.com/IBMSpectrumComputing/lsf-python-api.git && cd lsf-python-api && \
    . /usr/share/lsf/conf/profile.lsf && python3 setup.py build && python3 setup.py install && cd / && rm -rf /lsf-python-api

USER root

This Dockerfile contains lsf-python-api installation as well. I found pip package for this, but it was not updated for years https://pypi.org/project/platform-python-lsf-api/.

myinstall.config:

LSF_TOP=/usr/share/lsf
LSF_ADMINS=lsfadmin
LSF_CLUSTER_NAME=lsf
LSF_MASTER_LIST=lsf
SILENT_INSTALL=Y
LSF_SILENT_INSTALL_TARLIST=ALL
ACCEPT_LICENSE=Y

startserver.sh:

#!/bin/bash
source /usr/share/lsf/conf/profile.lsf
lsf_daemons start

@takeshi-yoshimura
Copy link
Contributor Author

Public images for LSF were deleted for security reasons in the past. An official instruction to build an LSF image is https://github.com/IBMSpectrumComputing/lsf-operator/blob/main/README-Building-the-images.md.

@codecov
Copy link

codecov bot commented Sep 23, 2022

Codecov Report

Merging #588 (f8e9a0b) into main (b70811e) will decrease coverage by 0.52%.
The diff coverage is 89.23%.

@@            Coverage Diff             @@
##             main     #588      +/-   ##
==========================================
- Coverage   94.94%   94.42%   -0.53%     
==========================================
  Files          67       64       -3     
  Lines        4134     4429     +295     
==========================================
+ Hits         3925     4182     +257     
- Misses        209      247      +38     
Impacted Files Coverage Δ
torchx/schedulers/__init__.py 95.23% <ø> (ø)
torchx/schedulers/lsf_scheduler.py 89.23% <89.23%> (ø)
torchx/util/entrypoints.py 89.28% <0.00%> (-10.72%) ⬇️
torchx/specs/named_resources_aws.py 93.33% <0.00%> (-6.67%) ⬇️
torchx/runner/api.py 94.87% <0.00%> (-2.03%) ⬇️
torchx/specs/__init__.py 94.28% <0.00%> (-2.02%) ⬇️
torchx/specs/api.py 98.40% <0.00%> (ø)
torchx/util/types.py 100.00% <0.00%> (ø)
torchx/cli/cmd_log.py 95.74% <0.00%> (ø)
torchx/specs/finder.py 96.98% <0.00%> (ø)
... and 9 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@d4l3k
Copy link
Member

d4l3k commented Sep 23, 2022

If lsf-python-api isn't in a good shape there's no strong need to use it then. That was my concern when I looked at it before so glad you have the same opinion. Lack of GPU support is a big blocker wow

For Slurm we just use the CLI and it's stable enough -- we can just mock the inputs/outputs by using patch on subprocess which works fairly well.

Re: testing -- having an LSF integration test isn't a blocker for landing this diff. We can mark this as prototype and add integ testing in a follow up diff.

If there's ways to programmatically fetch the LSF scheduler and install it using the credentials we can add some creds to GitHub secrets which should keep them safe

Do we have any contacts at LSF? Wondering if this policy around docker images is something that we can get changed/exempted from. Is it an option to get a small managed LSF test cluster provided by IBM? We can chat more on Slack

@d4l3k
Copy link
Member

d4l3k commented Sep 23, 2022

@takeshi-yoshimura Think the main thing blocking this particular diff is just adding some comprehensive unit tests (and fixing lint/pyre)

@takeshi-yoshimura
Copy link
Contributor Author

@d4l3k I fixed lint and pyre with unit tests. please check them. To unit tests without subprocess calls, I separated parser logic from the LsfScheduler methods.

Do we have any contacts at LSF? Wondering if this policy around docker images is something that we can get changed/exempted from. Is it an option to get a small managed LSF test cluster provided by IBM? We can chat more on Slack

I'm afraid I cannot get approval to have IBM hosts only for this test. I also asked LSF developers for the docker images but got no good answers. To be honest, I have no idea to solve this right now.

There is a forum page for LSF https://community.ibm.com/community/user/businessanalytics/communities/community-home/digestviewer?communitykey=74d589b7-7276-4d70-acf5-0fc26430c6c0. I keep asking in IBM, but we can also tell our issue at the open channel.

By the way, I requested a slack invitation from https://pytorch.org/resources last week but got no replies yet. Is the system working?

If there's ways to programmatically fetch the LSF scheduler and install it using the credentials we can add some creds to GitHub secrets which should keep them safe

this idea is also difficult. The download page for the LSF community edition needs authentication at Web Browser.

Copy link
Member

@d4l3k d4l3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing!

The one 3.7 failure seems to be unrelated -- think it's already fixed in trunk

It would be nice to test some of the subprocess calls w/ mocks but I can quickly add a couple of those when I land it

Sounds like for the integration tests it's not going to be very feasible which is unfortunate -- not much I can do from our side. With the auth we might be able to do something by storing the credentials in github secrets

@facebook-github-bot
Copy link
Contributor

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

d4l3k pushed a commit to d4l3k/torchx-1 that referenced this pull request Oct 7, 2022
Summary:
I prototyped the LSF scheduler for torchx. It supports native, Docker, and Singularity as runtime with a shared filesystem at this moment. I confirmed it worked with Gloo and NCCL on small VPC V100 clusters.

Note: `torchx log` command is available only when the torchx host shares the filesystem with cluster nodes (e.g., NFS).

In a nutshell, the LSF scheduler translates a torchx request to be LSF job submissions (i.e., `bsub`). For distributed apps, it creates multiple `bsub`. I also added lsf to scripts/component_integration_tests.py. Here is the log output with my three-node LSF cluster and you can find dryrun results there.

[component_integration_tests.lsf.txt](https://github.com/pytorch/torchx/files/9424891/component_integration_tests.lsf.txt)

Regarding Singularity image compatibility, it already automates to convert docker images into singularity image format, and so, only we have to do is to generate singularity-exec arguments from torchx requests. Note that users still need to set prefix docker:// for image names if they want to use docker images.

The following are example commands.

**Example: native hello_world and CLI utils**

```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 3
lsf://torchx/echo-pxc3gn5ct061k
$ torchx list -s lsf
$ torchx status lsf://torchx/echo-pxc3gn5ct061k
$ torchx cancel lsf://torchx/echo-pxc3gn5ct061k
$ torchx log --stream stdout lsf://torchx/echo-pxc3gn5ct061k/echo/0
```

**Example: Docker hello_world**
```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=docker utils.echo --image alpine:latest --msg hello_world --num_replicas 3
```

**Example: Singularity hello_world**
```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=singularity utils.echo --image docker://alpine:latest --msg hello_world --num_replicas 3
```

**Example: Docker Distributed**
```
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True" dist.ddp -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
```

**Example: Singularity Distributed**
```
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=singularity,host_network=True" dist.ddp --image docker://ghcr.io/pytorch/torchx:0.3.0dev0 -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
```

Pull Request resolved: pytorch#588

Reviewed By: msaroufim

Differential Revision: D40184939

Pulled By: msaroufim

fbshipit-source-id: 5a13d2ee88b3b5cf1b8e5a3f6786b955d47f21f8
@d4l3k d4l3k mentioned this pull request Oct 7, 2022
@takeshi-yoshimura
Copy link
Contributor Author

@d4l3k
Thank you! I know this is only the first step for the LSF scheduler. I think I need to keep working on integration tests and documentation (, plus Singularity after dependent changes). Do we have any other TODO items for the LSF scheduler?

d4l3k pushed a commit to d4l3k/torchx-1 that referenced this pull request Oct 10, 2022
Summary:
Pull Request resolved: pytorch#610

I prototyped the LSF scheduler for torchx. It supports native, Docker, and Singularity as runtime with a shared filesystem at this moment. I confirmed it worked with Gloo and NCCL on small VPC V100 clusters.

Note: `torchx log` command is available only when the torchx host shares the filesystem with cluster nodes (e.g., NFS).

In a nutshell, the LSF scheduler translates a torchx request to be LSF job submissions (i.e., `bsub`). For distributed apps, it creates multiple `bsub`. I also added lsf to scripts/component_integration_tests.py. Here is the log output with my three-node LSF cluster and you can find dryrun results there.

[component_integration_tests.lsf.txt](https://github.com/pytorch/torchx/files/9424891/component_integration_tests.lsf.txt)

Regarding Singularity image compatibility, it already automates to convert docker images into singularity image format, and so, only we have to do is to generate singularity-exec arguments from torchx requests. Note that users still need to set prefix docker:// for image names if they want to use docker images.

The following are example commands.

**Example: native hello_world and CLI utils**

```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 3
lsf://torchx/echo-pxc3gn5ct061k
$ torchx list -s lsf
$ torchx status lsf://torchx/echo-pxc3gn5ct061k
$ torchx cancel lsf://torchx/echo-pxc3gn5ct061k
$ torchx log --stream stdout lsf://torchx/echo-pxc3gn5ct061k/echo/0
```

**Example: Docker hello_world**
```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=docker utils.echo --image alpine:latest --msg hello_world --num_replicas 3
```

**Example: Singularity hello_world**
```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=singularity utils.echo --image docker://alpine:latest --msg hello_world --num_replicas 3
```

**Example: Docker Distributed**
```
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True" dist.ddp -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
```

**Example: Singularity Distributed**
```
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=singularity,host_network=True" dist.ddp --image docker://ghcr.io/pytorch/torchx:0.3.0dev0 -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
```

Pull Request resolved: pytorch#588

Reviewed By: anirbanr-fb-r2p, msaroufim, kurman

Differential Revision: D40184939

fbshipit-source-id: d4a4f68d74a2ca12f95f683080c6a00137966ca6
@d4l3k
Copy link
Member

d4l3k commented Oct 10, 2022

@takeshi-yoshimura The biggest current gap right now is workspaces. Having support for the DockerWorkspace to allow patching the images before launching the job would be very nice

if you want to add singularity support that'd be a big help -- it'd be pretty nice to add some abstraction for the container execution side here (i.e. same interface for docker vs singularity) that we could use across a couple of schedulers

Testing would also be a big help and circulating it around to get some user feedback

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants