Set up the new LSFClusterManager.jl package, and remove all non-LSF code, tests, and docs (#3)
DilumAluthge authored Jan 6, 2025
1 parent 2a2d8c6 commit a12ecb0
Showing 21 changed files with 27 additions and 1,071 deletions.
52 changes: 1 addition & 51 deletions .github/workflows/ci.yml
@@ -16,7 +16,6 @@ jobs:
timeout-minutes: 10
needs:
- unit-tests
- test-slurm
# Important: the next line MUST be `if: always()`.
# Do not change that line.
# That line is necessary to make sure that this job runs even if tests fail.
@@ -25,13 +24,11 @@
steps:
- run: |
echo unit-tests: ${{ needs.unit-tests.result }}
echo test-slurm: ${{ needs.test-slurm.result }}
- run: exit 1
# The last line must NOT end with ||
# All other lines MUST end with ||
if: |
(needs.unit-tests.result != 'success') ||
(needs.test-slurm.result != 'success')
(needs.unit-tests.result != 'success')
unit-tests:
runs-on: ubuntu-latest
timeout-minutes: 20
@@ -51,50 +48,3 @@ jobs:
with:
version: ${{ matrix.version }}
- uses: julia-actions/julia-runtest@v1
test-slurm:
runs-on: ubuntu-latest
timeout-minutes: 20
strategy:
fail-fast: false
matrix:
version:
# Please note: You must specify the full Julia version number (major.minor.patch).
# This is because the value here will be directly interpolated into a download URL.
# - '1.2.0' # minimum Julia version supported in Project.toml
- '1.6.7' # previous LTS
- '1.10.7' # current LTS
- '1.11.2' # currently the latest stable release
steps:
- uses: actions/checkout@v4
with:
persist-credentials: false
- name: Print Docker version
run: |
docker --version
docker version
# This next bit of code is taken from:
# https://github.com/kleinhenz/SlurmClusterManager.jl
# Original author: Joseph Kleinhenz
# License: MIT
- name: Setup Slurm inside Docker
run: |
docker version
docker compose version
docker build --build-arg "JULIA_VERSION=${MATRIX_JULIA_VERSION:?}" -t slurm-cluster-julia -f ci/Dockerfile .
docker compose -f ci/docker-compose.yml up -d
docker ps
env:
MATRIX_JULIA_VERSION: ${{matrix.version}}
- name: Print some information for debugging purposes
run: |
docker exec -t slurmctld pwd
docker exec -t slurmctld ls -la
docker exec -t slurmctld ls -la ClusterManagers
- name: Instantiate package
run: docker exec -t slurmctld julia --project=ClusterManagers -e 'import Pkg; @show Base.active_project(); Pkg.instantiate(); Pkg.status()'
- name: Run tests without a Slurm allocation
run: docker exec -t slurmctld julia --project=ClusterManagers -e 'import Pkg; Pkg.test(; test_args=["slurm"])'
- name: Run tests inside salloc
run: docker exec -t slurmctld salloc -t 00:10:00 -n 2 julia --project=ClusterManagers -e 'import Pkg; Pkg.test(test_args=["slurm"])'
- name: Run tests inside sbatch
run: docker exec -t slurmctld ClusterManagers/ci/run_my_sbatch.sh
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
# macOS-specific:
.DS_Store
6 changes: 3 additions & 3 deletions Project.toml
@@ -1,6 +1,6 @@
name = "ClusterManagers"
uuid = "34f1f09b-3a8b-5176-ab39-66d58a4d544e"
version = "0.4.7"
name = "LSFClusterManager"
uuid = "af02cf76-cbe3-4eeb-96a8-af9391005858"
version = "1.0.0-DEV"

[deps]
Distributed = "8ba89e20-285c-5b6f-9357-94700520ee1b"
139 changes: 10 additions & 129 deletions README.md
@@ -1,79 +1,22 @@
# ClusterManagers.jl
# LSFClusterManager.jl

The `ClusterManagers.jl` package implements code for different job queue systems commonly used on compute clusters.
The `LSFClusterManager.jl` package implements code for the LSF (Load Sharing Facility) compute cluster job queue system.

> [!WARNING]
> This package is not currently being actively maintained or tested.
>
> We are in the process of splitting this package up into multiple smaller packages, with a separate package for each job queue system.
>
> We are seeking maintainers for these new packages. If you are an active user of any of the job queue systems listed below and are interested in being a maintainer, please open a GitHub issue - say that you are interested in being a maintainer, and specify which job queue system you use.
## Available job queue systems
`LSFManager` supports IBM's scheduler. See the `addprocs_lsf` docstring
for more information.
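
For example, a minimal session might look like the sketch below; the worker count and the `-q normal` queue name are placeholders, and the full set of supported keyword arguments is documented in the `addprocs_lsf` docstring:

```julia
using Distributed, LSFClusterManager

# Ask LSF (via `bsub`) for 4 workers; `-q normal` is a placeholder queue name.
addprocs_lsf(4; bsub_flags=`-q normal`)

# Check which hosts the workers landed on.
pmap(_ -> gethostname(), workers())

# Release the LSF jobs when finished.
rmprocs(workers())
```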

Implemented in this package (the `ClusterManagers.jl` package):
Implemented in this package (the `LSFClusterManager.jl` package):

| Job queue system | Command to add processors |
| ---------------- | ------------------------- |
| Load Sharing Facility (LSF) | `addprocs_lsf(np::Integer; bsub_flags=``, ssh_cmd=``)` or `addprocs(LSFManager(np, bsub_flags, ssh_cmd, retry_delays, throttle))` |
| Sun Grid Engine (SGE) via `qsub` | `addprocs_sge(np::Integer; qsub_flags=``)` or `addprocs(SGEManager(np, qsub_flags))` |
| Sun Grid Engine (SGE) via `qrsh` | `addprocs_qrsh(np::Integer; qsub_flags=``)` or `addprocs(QRSHManager(np, qsub_flags))` |
| PBS (Portable Batch System) | `addprocs_pbs(np::Integer; qsub_flags=``)` or `addprocs(PBSManager(np, qsub_flags))` |
| Scyld | `addprocs_scyld(np::Integer)` or `addprocs(ScyldManager(np))` |
| HTCondor[^1] | `addprocs_htc(np::Integer)` or `addprocs(HTCManager(np))` |
| Slurm | `addprocs_slurm(np::Integer; kwargs...)` or `addprocs(SlurmManager(np); kwargs...)` |
| Local manager with CPU affinity setting | `addprocs(LocalAffinityManager(;np=CPU_CORES, mode::AffinityMode=BALANCED, affinities=[]); kwargs...)` |

[^1]: HTCondor was previously named Condor.

Implemented in external packages:

| Job queue system | Command to add processors |
| ---------------- | ------------------------- |
| Kubernetes (K8s) via [K8sClusterManagers.jl](https://github.com/beacon-biosignals/K8sClusterManagers.jl) | `addprocs(K8sClusterManagers(np; kwargs...))` |
| Azure scale-sets via [AzManagers.jl](https://github.com/ChevronETC/AzManagers.jl) | `addprocs(vmtemplate, n; kwargs...)` |

You can also write your own custom cluster manager; see the instructions in the [Julia manual](https://docs.julialang.org/en/v1/manual/distributed-computing/#ClusterManagers).
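
As a rough sketch of what that involves (illustrative only, not tied to any scheduler): a custom manager subtypes `Distributed.ClusterManager` and implements `launch` and `manage`. The local-process spawning below, including the internal `Distributed.write_cookie` helper, is just one possible way to start workers:

```julia
using Distributed

# Toy manager that starts `np` plain local worker processes.
struct ToyLocalManager <: ClusterManager
    np::Int
end

function Distributed.launch(manager::ToyLocalManager, params::Dict, launched::Array, c::Condition)
    exename  = params[:exename]   # path to the julia executable
    exeflags = params[:exeflags]  # extra flags, e.g. --project=...
    for _ in 1:manager.np
        # Start a worker; it prints its host:port on stdout, which
        # Distributed reads from `wconfig.io`.
        io = open(detach(`$exename $exeflags --worker`), "r+")
        Distributed.write_cookie(io)  # hand the cluster cookie to the worker
        wconfig = WorkerConfig()
        wconfig.io = io
        push!(launched, wconfig)
        notify(c)
    end
end

# Called on cluster events (:register, :deregister, :interrupt, :finalize);
# a no-op is enough for this toy example.
Distributed.manage(::ToyLocalManager, id::Integer, config::WorkerConfig, op::Symbol) = nothing

# Usage: addprocs(ToyLocalManager(2))
```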

### Slurm: a simple example
### LSF: a simple interactive example

```julia
using Distributed, ClusterManagers

# Arguments to the Slurm srun(1) command can be given as keyword
# arguments to addprocs. The argument name and value is translated to
# a srun(1) command line argument as follows:
# 1) If the length of the argument is 1 => "-arg value",
# e.g. t="0:1:0" => "-t 0:1:0"
# 2) If the length of the argument is > 1 => "--arg=value"
# e.g. time="0:1:0" => "--time=0:1:0"
# 3) If the value is the empty string, it becomes a flag value,
# e.g. exclusive="" => "--exclusive"
# 4) If the argument contains "_", they are replaced with "-",
# e.g. mem_per_cpu=100 => "--mem-per-cpu=100"
addprocs(SlurmManager(2), partition="debug", t="00:5:00")
julia> using LSFClusterManager

hosts = []
pids = []
for i in workers()
host, pid = fetch(@spawnat i (gethostname(), getpid()))
push!(hosts, host)
push!(pids, pid)
end

# The Slurm resource allocation is released when all the workers have
# exited
for i in workers()
rmprocs(i)
end
```

### SGE - a simple interactive example

```julia
julia> using ClusterManagers

julia> ClusterManagers.addprocs_sge(5; qsub_flags=`-q queue_name`)
julia> LSFClusterManager.addprocs_sge(5; qsub_flags=`-q queue_name`)
job id is 961, waiting for job to start .
5-element Array{Any,1}:
2
@@ -93,13 +36,13 @@ julia> From worker 2: compute-6
From worker 3: compute-6
```

Some clusters require the user to specify a list of required resources.
Some clusters require the user to specify a list of required resources.
For example, it may be necessary to specify how much memory will be needed by the job - see this [issue](https://github.com/JuliaLang/julia/issues/10390).
The keyword `qsub_flags` can be used to specify these and other options.
Additionally the keyword `wd` can be used to specify the working directory (which defaults to `ENV["HOME"]`).

```julia
julia> using Distributed, ClusterManagers
julia> using Distributed, LSFClusterManager

julia> addprocs_sge(5; qsub_flags=`-q queue_name -l h_vmem=4G,tmem=4G`, wd=mktempdir())
Job 5672349 in queue.
@@ -116,70 +59,8 @@ julia> pmap(x->run(`hostname`),workers());
julia> From worker 26: lum-7-2.local
From worker 23: pace-6-10.local
From worker 22: chong-207-10.local
From worker 24: pace-6-11.local
From worker 25: cheech-207-16.local

julia> rmprocs(workers())
Task (done)
```

### SGE via qrsh

`SGEManager` uses SGE's `qsub` command to launch workers, which communicate the
TCP/IP host:port info back to the master via the filesystem. On filesystems
that are tuned to make heavy use of caching to increase throughput, launching
Julia workers can frequently time out waiting for the standard output files to appear.
In this case, it's better to use the `QRSHManager`, which uses SGE's `qrsh`
command to bypass the filesystem and captures STDOUT directly.
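
A minimal sketch, assuming the two-argument `QRSHManager(np, qsub_flags)` constructor shown in the table above:

```julia
using Distributed, ClusterManagers

# Launch 5 workers through SGE's `qrsh`; the queue name is a placeholder.
addprocs(QRSHManager(5, `-q queue_name`))
```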

### Load Sharing Facility (LSF)

`LSFManager` supports IBM's scheduler. See the `addprocs_lsf` docstring
for more information.

### Using `LocalAffinityManager` (for pinning local workers to specific cores)

- Linux only feature.
- Requires the Linux `taskset` command to be installed.
- Usage : `addprocs(LocalAffinityManager(;np=CPU_CORES, mode::AffinityMode=BALANCED, affinities=[]); kwargs...)`.

where

- `np` is the number of workers to be started.
- `affinities`, if specified, is a list of CPU IDs. One worker is launched per entry in `affinities`, and each worker is pinned to its corresponding CPU ID.
- `mode` (used only when `affinities` is not specified; either `COMPACT` or `BALANCED`) - `COMPACT` pins the requested number of workers to cores in increasing order, for example worker1 => CPU0, worker2 => CPU1, and so on. `BALANCED` tries to spread the workers across CPU sockets, which is useful when the machine has multiple CPU sockets, each with multiple cores. The default is `BALANCED`.
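
For example (a sketch that assumes `COMPACT` and `BALANCED` are exported alongside `LocalAffinityManager`):

```julia
using Distributed, ClusterManagers

# Pin one worker to each of the listed CPU IDs (Linux only; requires `taskset`).
addprocs(LocalAffinityManager(; affinities=[0, 2, 4, 6]))

# Or let the manager choose the cores: 8 workers packed onto
# consecutive cores in increasing order.
# addprocs(LocalAffinityManager(; np=8, mode=COMPACT))
```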

### Using `ElasticManager` (dynamically adding workers to a cluster)

The `ElasticManager` is useful in scenarios where we want to dynamically add workers to a cluster.
It achieves this by listening on a known port on the master. The launched workers connect to this
port and publish their own host/port information for other workers to connect to.

On the master, you need to create an instance of `ElasticManager`. The constructors defined are:

```julia
ElasticManager(;addr=IPv4("127.0.0.1"), port=9009, cookie=nothing, topology=:all_to_all, printing_kwargs=())
ElasticManager(port) = ElasticManager(;port=port)
ElasticManager(addr, port) = ElasticManager(;addr=addr, port=port)
ElasticManager(addr, port, cookie) = ElasticManager(;addr=addr, port=port, cookie=cookie)
```

You can set `addr=:auto` to automatically use the host's private IP address on the local network, which will allow other workers on this network to connect. You can also use `port=0` to let the OS choose a random free port for you (some systems may not support this). Once created, printing the `ElasticManager` object prints the command which you can run on workers to connect them to the master, e.g.:

```julia
julia> em = ElasticManager(addr=:auto, port=0)
ElasticManager:
Active workers : []
Number of workers to be added : 0
Terminated workers : []
Worker connect command :
/home/user/bin/julia --project=/home/user/myproject/Project.toml -e 'using ClusterManagers; ClusterManagers.elastic_worker("4cOSyaYpgSl6BC0C","127.0.1.1",36275)'
```

By default, the printed command uses the absolute path to the current Julia executable and activates the same project as the current session. You can change either of these defaults by passing `printing_kwargs=(absolute_exename=false, same_project=false)` to the first form of the `ElasticManager` constructor.

Once workers are connected, you can print the `em` object again to see them added to the list of active workers.
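
On the worker side, the printed connect command boils down to a direct call to `ClusterManagers.elastic_worker`; written out by hand it looks like the following, where the cookie, address, and port are the placeholder values from the printout above and must be replaced with the real ones:

```julia
# Run on the machine that should join the cluster as a worker.
using ClusterManagers

# Copy the cookie, address, and port from the "Worker connect command"
# printed by the ElasticManager on the master.
ClusterManagers.elastic_worker("4cOSyaYpgSl6BC0C", "127.0.1.1", 36275)
```
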
21 changes: 0 additions & 21 deletions ci/Dockerfile

This file was deleted.

48 changes: 0 additions & 48 deletions ci/docker-compose.yml

This file was deleted.

14 changes: 0 additions & 14 deletions ci/my_sbatch.sh

This file was deleted.

14 changes: 0 additions & 14 deletions ci/run_my_sbatch.sh

This file was deleted.

21 changes: 0 additions & 21 deletions slurm_test.jl

This file was deleted.
