Commit 60eeb10: edit links to code and add image

natalya-patrikeeva committed Aug 29, 2024
1 parent 8b73ca8 commit 60eeb10
Showing 6 changed files with 18 additions and 114 deletions.
12 changes: 7 additions & 5 deletions docs/4_parallel_jobs.md
@@ -10,7 +10,8 @@ updateDate: 2024-08-29
## Multiprocessing Script
We can modify the `1_investment-serial.py` script to use `multiprocessing` python package to make the serial script parallel since all of the trials are independent.

- {% include warning.html content="Because this Python code uses multiprocessing and the yens are a shared computing environment, we need to be careful about how Python sees and utilizes the shared cores on the yens."%}
+ {: .warning}
+ Because this Python code uses multiprocessing and the yens are a shared computing environment, we need to be careful about how Python sees and utilizes the shared cores on the yens.

Again, we will hard code the number of cores for
the script to use in this line in the python script:
@@ -19,20 +20,21 @@ the script to use in this line in the python script:
```python
ncore = 12
```

- Consider a slightly modified program, [2_investment-parallel.py](https://github.com/gsbdarc/rf_bootcamp_2024/blob/main/examples/python_examples/2_investment-parallel.py).
+ Consider a slightly modified program, [2_investment-parallel.py](https://github.com/gsbdarc/intermediate_yens_2024/blob/main/examples/2_investment-parallel.py).

- **Important**: when using the yens, you must specify the number of cores in `Pool()` call. Otherwise, your python program would see all cores on the node and try to use them. But if you only request 10 cores in slurm and `Pool()` tries to use 256, bad things happen and your program will likely to get killed. Match the number of cores in the `Pool()` call to the number of cores you request in the submit script.
+ {: .important}
+ When using the yens, you must specify the number of cores in the `Pool()` call. Otherwise, your Python program will see all cores on the node and try to use them. But if you only request 10 cores in Slurm and `Pool()` tries to use 256, bad things happen and your program will likely be killed. It's a good idea to match the number of cores in the `Pool()` call to the number of cores you request in the submit script.

```python
# create a multiprocessing pool to run trials in parallel
pool = mp.Pool(processes = ncore)
```
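The full program lives in the linked repo; as a minimal sketch of the same pattern (with a hypothetical `simulate_trial` standing in for the real NPV trial), the `Pool` pieces fit together like this:

```python
import multiprocessing as mp
import random

def simulate_trial(seed):
    # Hypothetical stand-in for one independent investment trial:
    # each call depends only on its own seed, so trials can run in parallel.
    rng = random.Random(seed)
    return sum(rng.gauss(0, 1) for _ in range(1000))

def run_trials(ncore, ntrial):
    # Match ncore to the cores requested in the Slurm submit script (-c 12).
    with mp.Pool(processes=ncore) as pool:
        return pool.map(simulate_trial, range(ntrial))

if __name__ == "__main__":
    results = run_trials(4, 100)
    print(len(results))  # 100
```

Because each trial is independent, `map` simply fans the work out across the pool's worker processes and collects the results in order.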

- We will have to adjust the [submit script](https://github.com/gsbdarc/rf_bootcamp_2024/blob/main/examples/python_examples/2_investment-parallel.slurm) as well to request more cores. We will request `cpus-per-task=12` (or using a shorthand `-c 12`) to request 12 cores to run in parallel.
+ We will also have to adjust the [submit script](https://github.com/gsbdarc/intermediate_yens_2024/blob/main/examples/2_investment-parallel.slurm) to request more cores. We will set `cpus-per-task=12` (or use the shorthand `-c 12`) to request 12 cores to run in parallel.

Change the `2_investment-parallel.slurm` to include your email address.

- To submit this script, we run:
+ To submit this script, run:

```bash
$ sbatch 2_investment-parallel.slurm
```
4 changes: 2 additions & 2 deletions docs/5_command_line_args.md
@@ -12,9 +12,9 @@ We can modify the parallel script to accept the number of cores as a command lin

We will pass the value of `cpus-per-task` that we request from Slurm as an argument to the Python script and use that value as the number of cores to run in parallel. That way we only need to set `cpus-per-task` in the `#SBATCH` line and reuse the value stored in the Slurm environment variable.

- The modified [python script](https://github.com/gsbdarc/rf_bootcamp_2024/blob/main/examples/python_examples/3_investment-parallel-args.py), `3_investment-parallel-args.py`, accepts one argument, the number of cores to use for parallel `map` call.
+ The modified [Python script](https://github.com/gsbdarc/intermediate_yens_2024/blob/main/examples/3_investment-parallel-args.py), `3_investment-parallel-args.py`, accepts one argument: the number of cores to use for the parallel `map` call.

- Look at the [slurm file](https://github.com/gsbdarc/rf_bootcamp_2024/blob/main/examples/python_examples/3_investment-parallel-args.slurm) called `3_investment-parallel-args.slurm` and edit it to include your email address.
+ Look at the [slurm file](https://github.com/gsbdarc/intermediate_yens_2024/blob/main/examples/3_investment-parallel-args.slurm) called `3_investment-parallel-args.slurm` and edit it to include your email address.
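The exact variable names in the real script may differ; a minimal sketch of the argument-handling side, where the slurm file passes `$SLURM_CPUS_PER_TASK` as the first argument, looks like this:

```python
import os
import sys

def get_ncore(argv):
    # Prefer an explicit command-line argument (the slurm file passes
    # $SLURM_CPUS_PER_TASK); fall back to the environment variable, then to 1.
    if len(argv) > 1:
        return int(argv[1])
    return int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

if __name__ == "__main__":
    print(get_ncore(sys.argv))
```

This way the core count lives in exactly one place, the `#SBATCH` line, and the Python side never hard-codes it.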

Submit and monitor:
```bash
$ sbatch 3_investment-parallel-args.slurm
```
6 changes: 3 additions & 3 deletions docs/6_job_arrays.md
@@ -13,7 +13,7 @@ A Slurm job array is a convenient and efficient way to submit and manage a group

In this section, we will explore the concept of Slurm job arrays and demonstrate how to leverage this feature for batch job processing, simplifying the management of repetitive tasks and improving overall productivity on the Yen environment.

- Let's take a look at a [python script](https://github.com/gsbdarc/rf_bootcamp_2024/blob/main/examples/python_examples/array/4_investment-job-task.py), `4_investment-job-task.py`, that will be run as an array of tasks.
+ Let's take a look at a [Python script](https://github.com/gsbdarc/intermediate_yens_2024/blob/main/examples/array/4_investment-job-task.py), `4_investment-job-task.py`, that will be run as an array of tasks.

The script expects two command-line arguments, cashflows and a discount rate, and outputs the NPV for those inputs. As an alternative to using `multiprocessing` and the `map()` function as in the `2_investment-parallel.py` script, we can compute NPV values over many different inputs by repeatedly running a script that computes the NPV for just two given inputs (cashflows and a discount rate).
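The linked script defines the exact input conventions; as a sketch, the standard NPV computation for a cashflow list and a discount rate is:

```python
def npv(rate, cashflows):
    # Discount each cash flow back to time zero; by this convention,
    # cashflows[0] occurs at t=0 (e.g. a negative initial outlay).
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

print(round(npv(0.05, [-100, 50, 60, 70]), 2))  # 62.51
```

Each job array task evaluates one such (cashflows, rate) pair, so no task depends on any other.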

@@ -40,11 +40,11 @@ You should see the following output:
```
100 lines of data have been written to inputs_to_job_array.csv.
```

- Next, we'll submit a [job array script](https://github.com/gsbdarc/rf_bootcamp_2024/blob/main/examples/python_examples/array/4_investment-job-array.slurm), called `4_investment-job-array.slurm`, that runs 100 tasks in parallel using one line from input file to pass the value of arguments to the script.
+ Next, we'll submit a [job array script](https://github.com/gsbdarc/intermediate_yens_2024/blob/main/examples/array/4_investment-job-array.slurm), called `4_investment-job-array.slurm`, that runs 100 tasks in parallel, using one line from the input file to pass the argument values to each task.

This script extracts the input line that corresponds to the value of the `$SLURM_ARRAY_TASK_ID` environment variable -- in this case, 1 through 100. When we submit this one slurm script to the scheduler, it becomes 100 jobs running all at once, with each task executing the `4_investment-job-task.py` script with different inputs.
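The selection logic amounts to taking the `$SLURM_ARRAY_TASK_ID`-th line of the input file; a sketch in Python (with hypothetical row contents -- the real `inputs_to_job_array.csv` defines its own columns):

```python
import os

def nth_line(text, n):
    # Return the n-th line, 1-based, mirroring how each array task
    # (--array=1-100) selects its own row of the input file.
    return text.splitlines()[n - 1]

# Hypothetical file contents: one "cashflows,rate" row per task.
inputs = "50;60;70,0.05\n40;45;55,0.07\n30;35;40,0.06\n"
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "2"))
print(nth_line(inputs, task_id))
```

In the actual slurm script the same row selection is typically done in shell, but the 1-based indexing by task ID is identical.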

- Another advantage of job arrays instead of running one big scipt is that if some but not all job tasks have failed, you can resubmit only those by using the failed array indices. For example, if the inputs for job task 50 produced NaN and job failed, we can fix the inputs, then resumbit the slurm script with `--array=50-50` to rerun only that task.
+ Another advantage of job arrays over running one big script is that if some but not all job tasks have failed, you can resubmit only those by using the failed array indices. For example, if the inputs for job task 50 produced NaN and the job failed, we can fix the inputs, then resubmit the slurm script with `--array=50-50` to rerun only that task.

To submit the script that executes 100 jobs, run:

```bash
$ sbatch 4_investment-job-array.slurm
```
10 changes: 6 additions & 4 deletions docs/7_yen_gpu.md
@@ -26,7 +26,8 @@
```
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu up 1-00:00:00 3 idle yen-gpu[1-3]
```

- {% include warning.html content="There is a limit of 1 day runtime and 4 GPUs per user."%}
+ {: .warning}
+ There is a limit of 1 day runtime and 4 GPUs per user.

See partition limits with:
```bash
```

@@ -238,7 +239,7 @@ is not user writable.
The PyTorch example script uses the MNIST dataset for image classification, and consists of a simple fully connected neural network
with one hidden layer.

- We will run the [`mnist.py`](https://github.com/gsbdarc/rf_bootcamp_2024/blob/main/examples/python_examples/mnist.py) script on the GPU node.
+ We will run the [`mnist.py`](https://github.com/gsbdarc/intermediate_yens_2024/blob/main/examples/mnist.py) script on the GPU node.

### Submit Slurm script

@@ -317,7 +318,7 @@ Wed Jun 26 12:16:41 2024
```
+-----------------------------------------------------------------------------------------+
```

- `nvidia-smi` also tells you how much GPU RAM is used by the process. When training LLM or other models, it's important to fully utilize the GPU RAM so that the training is optimized. So if the GPU has 24 G of RAM, we can adjust the batch size to use as much data as fits into the GPU RAM and monitor `nvidia-smi` output so see how much RAM is used while the job is running. If the batch size is too large, your job will crash with OOM error. Try reducing the batch size then try again (while monitoring GPU memory usage).
+ `nvidia-smi` also tells you how much GPU RAM is used by the process. When training an LLM or other model, it's important to fully utilize the GPU RAM so that training is optimized. So if the GPU has 24 GB of RAM, we can adjust the batch size to use as much data as fits into the GPU RAM, and monitor the `nvidia-smi` output to see how much RAM is used while the job is running. If the batch size is too large, your job will crash with an OOM error. Reduce the batch size, then try again (while monitoring GPU memory usage).
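In PyTorch the crash surfaces as a CUDA out-of-memory error rather than Python's built-in `MemoryError`; the reduce-and-retry loop itself can be sketched as:

```python
def largest_safe_batch(run_step, start=1024):
    # Halve the batch size until one training step fits in memory,
    # mirroring the manual reduce-and-retry process after an OOM crash.
    bs = start
    while bs >= 1:
        try:
            run_step(bs)
            return bs
        except MemoryError:  # stand-in for a GPU OOM error
            bs //= 2
    return 0

# Toy step: pretend anything above 256 samples overflows GPU RAM.
def fake_step(bs):
    if bs > 256:
        raise MemoryError

print(largest_safe_batch(fake_step))  # 256
```

In practice you would run one real training step per attempt and watch `nvidia-smi` while doing so; the sketch only illustrates the halving strategy.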

In the output example above, we are way under-utilizing the GPU RAM (using only 1 G out of 24 G).

@@ -372,4 +373,5 @@ not have GPUs).

![](../assets/images/pytorch-kernel.png)

- **Note:** The Yens also have prebuilt `tensorflow` module that can be used in a similar way to `pytorch`.
+ {: .note }
+ The Yens also have a prebuilt `tensorflow` module that can be used in a similar way to `pytorch`.
Binary file added docs/assets/images/pytorch-kernel.png
