.. _hpc:

Parallelization
===============================================

Embarrassingly Parallel Problem
------------------------------------

``Sorcha``’s design lends itself perfectly to parallelization – when it simulates a large number of solar system objects, each one is considered in turn independently of all other objects. If you have access to a large number of computing cores, you can run ``Sorcha`` much more quickly by dividing up the labor: giving a small part of your model population to each core.

This involves two subtasks: breaking up your model population into an appropriate number of input files with unique names and organizing a large number of cores to simultaneously run ``Sorcha`` on their own individually-named input files. Both of these tasks are easy in theory, but tricky enough in practice that we provide some guidance below.
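
For the first subtask, below is a minimal sketch of one way to split a model population file into uniquely named chunks. It assumes a plain CSV input with a single header row and GNU ``split``; the file name ``my_orbits.csv`` and the chunk size are illustrative only.

.. code-block:: bash

   #!/bin/bash
   # Split my_orbits.csv into uniquely named 1000-object chunks,
   # copying the header row into each chunk so that every file is
   # a valid input on its own.
   input=my_orbits.csv
   chunk_size=1000
   header=$(head -n 1 "$input")
   tail -n +2 "$input" | split -l "$chunk_size" -d - chunk_
   for f in chunk_??; do
       { echo "$header"; cat "$f"; } > "${f}.csv" && rm "$f"
   done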


Slurm
---------

Slurm Workload Manager is a resource management utility commonly used by computing clusters. We provide starter code for running large parallel batches using Slurm, though the general guidance we provide is applicable to any system. Documentation for Slurm is available `here <https://slurm.schedmd.com/>`_. Please note that your HPC (High Performance Computing) facility’s Slurm setup may differ from those on which ``Sorcha`` was tested, and it is always a good idea to read any facility-specific documentation or speak to the HPC maintainers before you begin to run jobs.
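
If you are new to Slurm, the handful of commands below cover most day-to-day use. These are standard Slurm utilities; partition names, accounts, and resource limits are site-specific.

.. code-block:: bash

   sinfo               # list the partitions (queues) available on your cluster
   sbatch my_job.sh    # submit a batch script; prints the assigned job ID
   squeue -u $USER     # show your queued and running jobs
   scancel 12345       # cancel a job by its ID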

Quickstart
--------------

As a starting point, we provide our example scripts for running on HPC facilities using Slurm. Some modifications will be required to make them work for your facility.

Below is a very simple Slurm script example designed to run the demo files three times on three cores in parallel. Here, one core has been assigned to each ``Sorcha`` run, with each core assigned 1 GB of memory.

.. literalinclude:: ./example_files/multi_sorcha.sh
   :language: text
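
In case the included file is not visible here, below is a minimal sketch of what such a script might look like. The ``#SBATCH`` directives are standard Slurm; the input file names follow the ``Sorcha`` demo and are illustrative, so substitute your own and check the ``sorcha run`` flags against the version you have installed.

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=sorcha_demo
   #SBATCH --nodes=1           # keep all three runs on one node
   #SBATCH --ntasks=3          # three cores: one per Sorcha run
   #SBATCH --mem-per-cpu=1G    # 1 GB of memory per core

   # Launch three independent Sorcha runs on the demo files in the
   # background, with distinct output stems, then wait for all three.
   for i in 0 1 2; do
       sorcha run -c sorcha_config_demo.ini -p sspp_testset_colours.txt \
           -ob sspp_testset_orbits.des -pd baseline_v2.0_1yr.db \
           -o ./ -t demo_run_${i} &
   done
   wait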

Please note that the time taken to run and the memory required will vary enormously based on the size of your input files, your input population, and the chunk size assigned in the ``Sorcha`` configuration file: we therefore recommend test runs before you commit to very large runs. The chunk size is an especially important parameter: too small, and ``Sorcha`` will take a very long time to run; too large, and the memory footprint may become prohibitive. We have found that chunk sizes of 1,000 to 10,000 work best.

Below is a more complex example of a Slurm script. Here, multi_sorcha.sh calls multi_sorcha.py, which splits an input file into a number of ‘chunks’ and runs ``Sorcha`` in parallel on a user-specified number of cores.

multi_sorcha.sh:

.. literalinclude:: ./example_files/multi_sorcha.sh
   :language: text

multi_sorcha.py:

.. literalinclude:: ./example_files/multi_sorcha.py
   :language: python

.. note::
   We provide these here for you to copy, paste, and edit as needed. You might have to make some slight modifications to both the Slurm script and multi_sorcha.py, depending on whether you're using ``Sorcha`` without calling the stats file.

multi_sorcha.sh requests many parallel Slurm jobs of multi_sorcha.py, feeding each a different ``--instance`` parameter. After changing ‘my_orbits.csv’, ‘my_colors.csv’, and ‘my_pointings.db’ to match the above, it could be run as ``sbatch --array=0-9 multi_sorcha.sh 25 4`` to generate ten jobs, each with 4 cores running 25 orbits each.
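
Concretely, that submission looks like the following; the ten array indices (presumably passed through ``$SLURM_ARRAY_TASK_ID``) are what multi_sorcha.sh feeds to multi_sorcha.py as ``--instance``.

.. code-block:: bash

   # Ten array jobs, instances 0 through 9, each given the two positional
   # arguments (25 orbits, 4 cores) that multi_sorcha.sh consumes.
   sbatch --array=0-9 multi_sorcha.sh 25 4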
