Update building docs #291

Merged: 21 commits (Jan 16, 2025)
44 changes: 44 additions & 0 deletions .github/workflows/build-docs.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
name: Build Docs with Sphinx

on:
  pull_request:

jobs:
  build-docs:
    name: Build docs with Sphinx
    runs-on: ubuntu-latest

    steps:
      # Step 1: Checkout the repository
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          submodules: recursive # Ensure submodules are cloned

      # Step 2: Install system dependencies
      - name: Install system dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y libmysqlclient-dev pandoc python3-sphinx

      # Step 3: Set up Python and virtual environment
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version-file: .python-version

      - name: Set up virtual environment
        run: |
          python -m venv .venv-docs
          source .venv-docs/bin/activate
          pip install --upgrade pip
          pip install ".[torch,docs]"

      # Step 4: Build the Sphinx documentation
      - name: Build docs
        run: |
          source .venv-docs/bin/activate
          cd docs
          make clean
          make html
1 change: 1 addition & 0 deletions .github/workflows/pytest.yml
@@ -36,3 +36,4 @@ jobs:
shell: bash -l {0}
run: .venv-pytorch/bin/pytest -v ./tests/ -m "not hpc"


2 changes: 1 addition & 1 deletion docs/Makefile
@@ -3,7 +3,7 @@

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXOPTS ?= -W -v
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
44 changes: 37 additions & 7 deletions docs/README.md
@@ -5,26 +5,56 @@ The docs can be built either locally on your system or remotely on JSC.
## Build docs locally

To build the docs locally and visualize them in your browser without relying on external
services (e.g., Read The Docs cloud), use the following commands:
services (e.g., Read The Docs cloud), follow these steps:

### Step 0 - Clone the Repository

If you haven't already cloned the repository, you can do it like this:

```bash
# Clone the repo, if not done yet
git clone --recurse-submodules https://github.com/interTwin-eu/itwinai.git itwinai-docs
cd itwinai-docs
```

Note the `--recurse-submodules` flag here, which ensures that any git submodules
are also cloned.

# The first time, you may need to install some Linux packages (assuming Ubuntu system here)
### Step 1 - Install Linux Packages

You might need to install some Linux packages. With an Ubuntu system, you can use the
following commands:

```bash
sudo apt update && sudo apt install libmysqlclient-dev
sudo apt install pandoc
sudo apt install python3-sphinx
```

# Create a python virtual environment and install itwinai and its dependencies
### Step 2 - Create a Virtual Environment and Install itwinai

We first build a virtual environment and then install `itwinai` with the `docs` and
`torch` extras. If you didn't clone the repository recursively, then you also have to
update submodules. All of this is done with the following commands:

```bash
# Update submodules
git submodule update --init --recursive

# Create venv
python3 -m venv .venv-docs
source .venv-docs/bin/activate

# Choose the appropriate command for your OS here
# pip install ".[torch,docs,macos]"
pip install ".[torch,docs,linux]"
# Install itwinai
pip install ".[torch,docs]"
```

### Step 3 - Build the docs and start a server

Now you can move into the docs directory, clean any old build files, and build the
docs. Finally, start a local server to view them. This can all be done as follows:

```bash
# Move to the docs folder and build them using Sphinx
cd docs
make clean
19 changes: 11 additions & 8 deletions docs/conf.py
@@ -21,8 +21,6 @@
import subprocess
import sys

exclude_patterns = "requirements.txt"

sys.path.insert(0, os.path.abspath("../tutorials/ml-workflows/"))
sys.path.insert(0, os.path.abspath("../src"))
sys.path.insert(0, os.path.abspath("../images"))
@@ -54,10 +52,19 @@
}

templates_path = ["_templates"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
exclude_patterns = [
"_build",
"Thumbs.db",
".DS_Store",
"requirements.txt",
"README.md",
"uv-tutorial.md",
]
suppress_warnings = ["myst.xref_missing", "myst.header"]

autodoc_mock_imports = ["mlflow"]


# Enable numref
numfig = True

@@ -94,8 +101,4 @@ def get_git_tag():
</div>
"""

html_sidebars = {
"**": [
html_footer # Adds the custom footer with version information
]
}
html_sidebars = {"**": [html_footer]} # Adds the custom footer with version information
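The `exclude_patterns` change above adds a few files that Sphinx should skip. Sphinx interprets these entries as glob-style patterns relative to the source directory; a rough sketch of how such matching behaves, using Python's `fnmatch` as a stand-in for Sphinx's actual matcher:

```python
from fnmatch import fnmatch

# Glob-style patterns, mirroring the exclude_patterns added in this PR
exclude_patterns = [
    "_build",
    "Thumbs.db",
    ".DS_Store",
    "requirements.txt",
    "README.md",
    "uv-tutorial.md",
]

def is_excluded(path: str) -> bool:
    """Return True if the path matches any exclude pattern."""
    return any(fnmatch(path, pattern) for pattern in exclude_patterns)

print(is_excluded("README.md"))  # True
print(is_excluded("index.rst"))  # False
```

Note that Sphinx also excludes matched directories recursively, which this simplified sketch does not reproduce.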
File renamed without changes.
25 changes: 14 additions & 11 deletions docs/how-it-works/training/explain_ddp.rst
@@ -1,20 +1,23 @@
Distributed Data Parallelism
----------------------------------
Explanation of Distributed Data Parallelism
-------------------------------------------

**Author(s)**: Killian Verder (CERN), Matteo Bunino (CERN)

Deep neural networks (DNN) are often extremely large and are trained on massive amounts of data, more than most computers have memory for.
Even smaller DNNs can take days to train.
Distributed Data Parallel (DDP) addresses these two issues, long training times and limited memory, by using multiple machines to host and train both model and data.
Deep neural networks (DNN) are often extremely large and are trained on massive amounts
of data, more than most computers have memory for. Even smaller DNNs can take days to
train. Distributed Data Parallel (DDP) addresses these two issues, long training times
and limited memory, by using multiple machines to host and train both model and data.

Data parallelism is an easy way for a developer to vastly reduce training times.
Rather than using single-node parallelism, Distributed Data Parallelism (DDP) scales to multiple machines.
This scaling maximises parallelisation of your model and drastically reduces training times.
Data parallelism is an easy way for a developer to vastly reduce training times. Rather
than using single-node parallelism, DDP scales to multiple machines. This scaling
maximises parallelisation of your model and drastically reduces training times.

Another benefit of DDP is removal of single-machine memory constraints. Since a dataset or model can be stored across several machines,
it becomes possible to analyse much larger datasets or models.
Another benefit of DDP is removal of single-machine memory constraints. Since a dataset
or model can be stored across several machines, it becomes possible to analyse much
larger datasets or models.
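The core idea DDP implements can be sketched without any framework: each worker computes a gradient on its own data shard, and an all-reduce averages the gradients so every replica applies the identical update. A toy illustration in plain Python (not the actual PyTorch DDP API):

```python
# Toy sketch of data-parallel gradient averaging, the core idea behind DDP.
# Each "worker" holds a shard of the data and computes a local gradient;
# an all-reduce then averages the gradients so all replicas stay in sync.

def local_gradient(shard: list, weight: float) -> float:
    # Gradient of mean squared error for the model y = weight * x, target y = 2x
    return sum(2 * (weight * x - 2 * x) * x for x in shard) / len(shard)

def all_reduce_mean(grads: list) -> float:
    # Stand-in for the collective communication DDP performs across machines
    return sum(grads) / len(grads)

data = [1.0, 2.0, 3.0, 4.0]
shards = [data[:2], data[2:]]  # dataset split across two workers
weight = 0.0

for _ in range(50):  # each step: local gradients, then a synchronized update
    grads = [local_gradient(shard, weight) for shard in shards]
    weight -= 0.1 * all_reduce_mean(grads)

print(round(weight, 3))  # converges toward the true weight of 2.0
```

Real DDP does the same averaging with efficient collectives (e.g. ring all-reduce) overlapped with the backward pass.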

Below is a list of resources expanding on theoretical aspects and practical implementations of DDP:
Below is a list of resources expanding on theoretical aspects and practical
implementations of DDP:

* Introduction to DP: https://siboehm.com/articles/22/data-parallel-training

1 change: 1 addition & 0 deletions docs/index.rst
@@ -56,6 +56,7 @@ contains thoroughly tested features aligned with the toolkit's most recent release
getting-started/getting_started_with_itwinai
getting-started/slurm
getting-started/plugins
getting-started/uv-tutorial.md

.. toctree::
:maxdepth: 2
20 changes: 18 additions & 2 deletions docs/tutorials/distrib-ml/torch_scaling_test.rst
@@ -3,9 +3,25 @@ PyTorch scaling test

.. include:: ../../../tutorials/distributed-ml/torch-scaling-test/README.md
:parser: myst_parser.sphinx_
:end-before: Below follows an example of


Below follows an example of scalability plot generated by ``itwinai scalability-report``:
Plots of the scalability metrics
--------------------------------

We have the following scalability metrics available:

- Absolute wall-clock time comparison
- Relative wall-clock time speedup
- Communication vs. Computation time
- GPU Utilization (%)
- Power Consumption (Watt)
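The relative speedup metric above is computed against the smallest node count as the baseline; a small sketch of that arithmetic, with hypothetical timings rather than measured values:

```python
# Relative wall-clock speedup, using the smallest node count as the baseline.
# Timings below are hypothetical values for illustration only.
epoch_times = {1: 120.0, 2: 65.0, 4: 35.0, 8: 20.0}  # nodes -> seconds/epoch

baseline_nodes = min(epoch_times)
baseline_time = epoch_times[baseline_nodes]
speedup = {nodes: baseline_time / t for nodes, t in epoch_times.items()}

for nodes, s in sorted(speedup.items()):
    # Ideal (linear) speedup would equal nodes / baseline_nodes
    print(f"{nodes} node(s): {s:.2f}x (ideal: {nodes / baseline_nodes:.1f}x)")
```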

You can see example plots of these in the
:doc:`Virgo documentation <../../use-cases/virgo_doc>` or the
:doc:`EURAC documentation <../../use-cases/eurac_doc>`.

Additionally, we ran a larger scalability test for this tutorial on the full ImageNet
dataset using the older script. It only shows the relative speedup and can be seen here:

.. image:: ../../../tutorials/distributed-ml/torch-scaling-test/img/report.png

6 changes: 3 additions & 3 deletions docs/tutorials/distrib-ml/torch_tutorial_kubeflow_1.rst
@@ -1,12 +1,12 @@
Tutorial on Kubeflow and TorchTrainer class
=========================================
===========================================

.. include:: ../../../tutorials/distributed-ml/torch-kubeflow-1/README.md
:parser: myst_parser.sphinx_


train-cpu.py
++++++++
++++++++++++

.. literalinclude:: ../../../tutorials/distributed-ml/torch-kubeflow-1/train-cpu.py
:language: python
@@ -19,7 +19,7 @@ cpu.yaml
:language: yaml

Dockerfile
++++++++
++++++++++

.. literalinclude:: ../../../tutorials/distributed-ml/torch-kubeflow-1/Dockerfile
:language: dockerfile
2 changes: 2 additions & 0 deletions docs/use-cases/cyclones_doc.rst
@@ -1,3 +1,5 @@
:orphan:

Tropical Cyclones Detection
===========================

49 changes: 46 additions & 3 deletions docs/use-cases/eurac_doc.rst
@@ -1,11 +1,54 @@
:orphan:

EURAC
=====
EURAC Use Case
==============
You can find the relevant code for the EURAC use case in the
`use case's folder on Github <https://github.com/interTwin-eu/itwinai/tree/main/use-cases/eurac>`_,
or by consulting the use case's README:


.. include:: ../../use-cases/eurac/README.md
:parser: myst_parser.sphinx_
:start-line: 2

Scalability Metrics
-------------------
Here are some examples of the scalability metrics for this use case:

Average Epoch Time Comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This plot shows a comparison of the average time per epoch for each strategy
and number of nodes.

.. image:: ../../use-cases/eurac/scalability-plots/absolute_scalability_plot.png

Relative Epoch Time Speedup
~~~~~~~~~~~~~~~~~~~~~~~~~~~
This plot shows the speedup achieved with different numbers of nodes for each
strategy. The speedup is calculated using the lowest number of nodes as the
baseline.

.. image:: ../../use-cases/eurac/scalability-plots/relative_scalability_plot.png

Communication vs Computation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This plot shows how much of the GPU time is spent doing computation compared to
communication between GPUs and nodes, for each strategy and number of nodes. The shaded
area is communication and the colored area is computation. They have all been
normalized so that the values are between 0 and 1.0.

.. image:: ../../use-cases/eurac/scalability-plots/communication_plot.png
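The normalization described above amounts to dividing each component by the total time; a minimal sketch of that arithmetic, with hypothetical timings rather than values from this use case:

```python
# Normalizing computation vs. communication time to the [0, 1] range,
# as in the plot above. Timings are hypothetical values for illustration.
computation_s = 80.0    # time spent in GPU computation
communication_s = 20.0  # time spent in inter-GPU/inter-node communication

total = computation_s + communication_s
comp_fraction = computation_s / total
comm_fraction = communication_s / total

print(comp_fraction, comm_fraction)  # fractions sum to 1.0
```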

GPU Utilization
~~~~~~~~~~~~~~~
This plot shows the GPU utilization for each strategy and number of nodes, as a
percentage from 0 to 100. Utilization is defined as the fraction of time the GPU
spends in compute mode, and does not directly correlate with FLOPs.

.. image:: ../../use-cases/eurac/scalability-plots/utilization_plot.png

Power Consumption
~~~~~~~~~~~~~~~~~
This plot shows the total energy consumption in watt-hours for the different strategies
and number of nodes.

.. image:: ../../use-cases/eurac/scalability-plots/gpu_energy_plot.png
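Energy in watt-hours is obtained by integrating power over time; a minimal sketch of that conversion, with hypothetical power samples rather than values from this use case:

```python
# Estimating energy (watt-hours) from periodic GPU power samples, the kind of
# aggregation behind the energy plot above. Sample values are hypothetical.
power_samples_w = [250.0, 300.0, 280.0, 310.0]  # watts, sampled periodically
sample_interval_s = 60.0                        # seconds between samples

energy_joules = sum(p * sample_interval_s for p in power_samples_w)
energy_wh = energy_joules / 3600.0  # 1 Wh = 3600 J

print(round(energy_wh, 2))
```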
2 changes: 1 addition & 1 deletion docs/use-cases/use_cases.rst
@@ -3,7 +3,7 @@ How to run a use case

Each use case comes with its own tutorial on how to run it. Before running them,
however, you should set up a Python virtual environment. Refer to the
:doc:`getting started section <../getting-started/getting_started_with_itwinai.rst`
:doc:`getting started section <../getting-started/getting_started_with_itwinai>`
for more information on how to do this.

After installing and activating the virtual environment, you will want to install the
44 changes: 44 additions & 0 deletions docs/use-cases/virgo_doc.rst
@@ -1,3 +1,5 @@
:orphan:

Virgo
=====

@@ -18,3 +20,45 @@ or by consulting the use case's README:
:parser: myst_parser.sphinx_
:start-line: 6

Scalability Metrics
-------------------
Here are some examples of the scalability metrics for this use case:

Average Epoch Time Comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This plot shows a comparison of the average time per epoch for each strategy
and number of nodes.

.. image:: ../../use-cases/virgo/scalability-plots/absolute_scalability_plot.png

Relative Epoch Time Speedup
~~~~~~~~~~~~~~~~~~~~~~~~~~~
This plot shows the speedup achieved with different numbers of nodes for each
strategy. The speedup is calculated using the lowest number of nodes as the
baseline.

.. image:: ../../use-cases/virgo/scalability-plots/relative_scalability_plot.png

Communication vs Computation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This plot shows how much of the GPU time is spent doing computation compared to
communication between GPUs and nodes, for each strategy and number of nodes. The shaded
area is communication and the colored area is computation. They have all been
normalized so that the values are between 0 and 1.0.

.. image:: ../../use-cases/virgo/scalability-plots/communication_plot.png

GPU Utilization
~~~~~~~~~~~~~~~
This plot shows the GPU utilization for each strategy and number of nodes, as a
percentage from 0 to 100. Utilization is defined as the fraction of time the GPU
spends in compute mode, and does not directly correlate with FLOPs.

.. image:: ../../use-cases/virgo/scalability-plots/utilization_plot.png

Power Consumption
~~~~~~~~~~~~~~~~~
This plot shows the total energy consumption in watt-hours for the different strategies
and number of nodes.

.. image:: ../../use-cases/virgo/scalability-plots/gpu_energy_plot.png
8 changes: 8 additions & 0 deletions env-files/torch/generic_torch.sh
@@ -29,3 +29,11 @@ source $ENV_NAME/bin/activate
pip install -e ".[torch,tf,dev]" \
--no-cache-dir \
--extra-index-url https://download.pytorch.org/whl/cu121

# Install Prov4ML
if [[ "$(uname)" == "Darwin" ]]; then
pip install --no-cache-dir "prov4ml[apple]@git+https://github.com/matbun/ProvML@new-main"
else
# Assuming Nvidia GPUs are available
pip install --no-cache-dir "prov4ml[nvidia]@git+https://github.com/matbun/ProvML@new-main"
fi