Update building docs #291

Merged: 21 commits (Jan 16, 2025)
44 changes: 44 additions & 0 deletions .github/workflows/build-docs.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
name: Build Docs with Sphinx

on:
  pull_request:

jobs:
  build-docs:
    name: Build docs with Sphinx
    runs-on: ubuntu-latest

    steps:
      # Step 1: Checkout the repository
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          submodules: recursive # Ensure submodules are cloned

      # Step 2: Install system dependencies
      - name: Install system dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y libmysqlclient-dev pandoc python3-sphinx

      # Step 3: Set up Python and virtual environment
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version-file: .python-version

      - name: Set up virtual environment
        run: |
          python -m venv .venv-docs
          source .venv-docs/bin/activate
          pip install --upgrade pip
          pip install ".[torch,docs]"

      # Step 4: Build the Sphinx documentation
      - name: Build docs
        run: |
          source .venv-docs/bin/activate
          cd docs
          make clean
          make html
1 change: 1 addition & 0 deletions .github/workflows/pytest.yml
@@ -36,3 +36,4 @@ jobs:
shell: bash -l {0}
run: .venv-pytorch/bin/pytest -v ./tests/ -m "not hpc"


2 changes: 1 addition & 1 deletion docs/Makefile
@@ -3,7 +3,7 @@

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXOPTS ?= -W -v
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
44 changes: 37 additions & 7 deletions docs/README.md
@@ -5,26 +5,56 @@ The docs can be built either locally on your system or remotely on JSC.
## Build docs locally

To build the docs locally and visualize them in your browser without relying on external
services (e.g., Read The Docs cloud), use the following commands:
services (e.g., Read The Docs cloud), follow these steps:

### Step 0 - Clone the Repository

If you haven't already cloned the repository, you can do it like this:

```bash
# Clone the repo, if not done yet
git clone --recurse-submodules https://github.com/interTwin-eu/itwinai.git itwinai-docs
cd itwinai-docs
```

Note the `--recurse-submodules` flag here, which ensures that any git submodules
are also cloned.

# The first time, you may need to install some Linux packages (assuming Ubuntu system here)
### Step 1 - Install Linux Packages

You might need to install some Linux packages. With an Ubuntu system, you can use the
following commands:

```bash
sudo apt update && sudo apt install libmysqlclient-dev
sudo apt install pandoc
sudo apt install python3-sphinx
```

# Create a python virtual environment and install itwinai and its dependencies
### Step 2 - Create a Virtual Environment and Install itwinai

We first build a virtual environment and then install `itwinai` with the `docs` and
`torch` extras. If you didn't clone the repository recursively, then you also have to
update submodules. All of this is done with the following commands:

```bash
# Update submodules
git submodule update --init --recursive

# Create venv
python3 -m venv .venv-docs
source .venv-docs/bin/activate

# Choose the appropriate command for your OS here
# pip install ".[torch,docs,macos]"
pip install ".[torch,docs,linux]"
# Install itwinai
pip install ".[torch,docs]"
```

### Step 3 - Build the docs and start a server

Now you can move into the docs directory, clean any old build files, and build the
docs. Finally, start a local server to view them. This can all be done as follows:

```bash
# Move to the docs folder and build them using Sphinx
cd docs
make clean
19 changes: 11 additions & 8 deletions docs/conf.py
@@ -21,8 +21,6 @@
import subprocess
import sys

exclude_patterns = "requirements.txt"

sys.path.insert(0, os.path.abspath("../tutorials/ml-workflows/"))
sys.path.insert(0, os.path.abspath("../src"))
sys.path.insert(0, os.path.abspath("../images"))
@@ -54,10 +52,19 @@
}

templates_path = ["_templates"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
exclude_patterns = [
"_build",
"Thumbs.db",
".DS_Store",
"requirements.txt",
"README.md",
"uv-tutorial.md",
]
suppress_warnings = ["myst.xref_missing", "myst.header"]

autodoc_mock_imports = ["mlflow"]


# Enable numref
numfig = True

@@ -94,8 +101,4 @@ def get_git_tag():
</div>
"""

html_sidebars = {
"**": [
html_footer # Adds the custom footer with version information
]
}
html_sidebars = {"**": [html_footer]} # Adds the custom footer with version information
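The `exclude_patterns` change above adds a few files that Sphinx should skip. Sphinx interprets these entries as glob-style patterns relative to the source directory; a rough sketch of how such matching behaves, using Python's `fnmatch` as a stand-in for Sphinx's actual matcher:

```python
from fnmatch import fnmatch

# Glob-style patterns, mirroring the exclude_patterns added in this PR
exclude_patterns = [
    "_build",
    "Thumbs.db",
    ".DS_Store",
    "requirements.txt",
    "README.md",
    "uv-tutorial.md",
]

def is_excluded(path: str) -> bool:
    """Return True if the path matches any exclude pattern."""
    return any(fnmatch(path, pattern) for pattern in exclude_patterns)

print(is_excluded("README.md"))  # True
print(is_excluded("index.rst"))  # False
```

Note that Sphinx also excludes matched directories recursively, which this simplified sketch does not reproduce.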
File renamed without changes.
25 changes: 14 additions & 11 deletions docs/how-it-works/training/explain_ddp.rst
@@ -1,20 +1,23 @@
Distributed Data Parallelism
----------------------------------
Explanation of Distributed Data Parallelism
-------------------------------------------

**Author(s)**: Killian Verder (CERN), Matteo Bunino (CERN)

Deep neural networks (DNN) are often extremely large and are trained on massive amounts of data, more than most computers have memory for.
Even smaller DNNs can take days to train.
Distributed Data Parallel (DDP) addresses these two issues, long training times and limited memory, by using multiple machines to host and train both model and data.
Deep neural networks (DNN) are often extremely large and are trained on massive amounts
of data, more than most computers have memory for. Even smaller DNNs can take days to
train. Distributed Data Parallel (DDP) addresses these two issues, long training times
and limited memory, by using multiple machines to host and train both model and data.

Data parallelism is an easy way for a developer to vastly reduce training times.
Rather than using single-node parallelism, Distributed Data Parallelism (DDP) scales to multiple machines.
This scaling maximises parallelisation of your model and drastically reduces training times.
Data parallelism is an easy way for a developer to vastly reduce training times. Rather
than using single-node parallelism, DDP scales to multiple machines. This scaling
maximises parallelisation of your model and drastically reduces training times.

Another benefit of DDP is removal of single-machine memory constraints. Since a dataset or model can be stored across several machines,
it becomes possible to analyse much larger datasets or models.
Another benefit of DDP is removal of single-machine memory constraints. Since a dataset
or model can be stored across several machines, it becomes possible to analyse much
larger datasets or models.
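The core idea DDP implements can be sketched without any framework: each worker computes a gradient on its own data shard, and an all-reduce averages the gradients so every replica applies the identical update. A toy illustration in plain Python (not the actual PyTorch DDP API):

```python
# Toy sketch of data-parallel gradient averaging, the core idea behind DDP.
# Each "worker" holds a shard of the data and computes a local gradient;
# an all-reduce then averages the gradients so all replicas stay in sync.

def local_gradient(shard: list, weight: float) -> float:
    # Gradient of mean squared error for the model y = weight * x, target y = 2x
    return sum(2 * (weight * x - 2 * x) * x for x in shard) / len(shard)

def all_reduce_mean(grads: list) -> float:
    # Stand-in for the collective communication DDP performs across machines
    return sum(grads) / len(grads)

data = [1.0, 2.0, 3.0, 4.0]
shards = [data[:2], data[2:]]  # dataset split across two workers
weight = 0.0

for _ in range(50):  # each step: local gradients, then a synchronized update
    grads = [local_gradient(shard, weight) for shard in shards]
    weight -= 0.1 * all_reduce_mean(grads)

print(round(weight, 3))  # converges toward the true weight of 2.0
```

Real DDP does the same averaging with efficient collectives (e.g. ring all-reduce) overlapped with the backward pass.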

Below is a list of resources expanding on theoretical aspects and practical implementations of DDP:
Below is a list of resources expanding on theoretical aspects and practical
implementations of DDP:

* Introduction to DP: https://siboehm.com/articles/22/data-parallel-training

1 change: 1 addition & 0 deletions docs/index.rst
@@ -56,6 +56,7 @@ contains thoroughly tested features aligned with the toolkit's most recent release
getting-started/getting_started_with_itwinai
getting-started/slurm
getting-started/plugins
getting-started/uv-tutorial.md

.. toctree::
:maxdepth: 2
20 changes: 18 additions & 2 deletions docs/tutorials/distrib-ml/torch_scaling_test.rst
@@ -3,9 +3,25 @@ PyTorch scaling test

.. include:: ../../../tutorials/distributed-ml/torch-scaling-test/README.md
:parser: myst_parser.sphinx_
:end-before: Below follows an example of


Below follows an example of scalability plot generated by ``itwinai scalability-report``:
Plots of the scalability metrics
--------------------------------

We have the following scalability metrics available:

- Absolute wall-clock time comparison
- Relative wall-clock time speedup
- Communication vs. Computation time
- GPU Utilization (%)
- Power Consumption (Watt)
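The relative speedup metric above is computed against the smallest node count as the baseline; a small sketch of that arithmetic, with hypothetical timings rather than measured values:

```python
# Relative wall-clock speedup, using the smallest node count as the baseline.
# Timings below are hypothetical values for illustration only.
epoch_times = {1: 120.0, 2: 65.0, 4: 35.0, 8: 20.0}  # nodes -> seconds/epoch

baseline_nodes = min(epoch_times)
baseline_time = epoch_times[baseline_nodes]
speedup = {nodes: baseline_time / t for nodes, t in epoch_times.items()}

for nodes, s in sorted(speedup.items()):
    # Ideal (linear) speedup would equal nodes / baseline_nodes
    print(f"{nodes} node(s): {s:.2f}x (ideal: {nodes / baseline_nodes:.1f}x)")
```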

You can see example plots of these in the
:doc:`Virgo documentation <../../use-cases/virgo_doc>` or the
:doc:`EURAC documentation <../../use-cases/eurac_doc>`.

Additionally, we ran a larger scalability test for this tutorial on the full ImageNet
dataset using the older script. It only shows the relative speedup and can be seen here:

.. image:: ../../../tutorials/distributed-ml/torch-scaling-test/img/report.png

6 changes: 3 additions & 3 deletions docs/tutorials/distrib-ml/torch_tutorial_kubeflow_1.rst
@@ -1,12 +1,12 @@
Tutorial on Kubeflow and TorchTrainer class
=========================================
===========================================

.. include:: ../../../tutorials/distributed-ml/torch-kubeflow-1/README.md
:parser: myst_parser.sphinx_


train-cpu.py
++++++++
++++++++++++

.. literalinclude:: ../../../tutorials/distributed-ml/torch-kubeflow-1/train-cpu.py
:language: python
@@ -19,7 +19,7 @@ cpu.yaml
:language: yaml

Dockerfile
++++++++
++++++++++

.. literalinclude:: ../../../tutorials/distributed-ml/torch-kubeflow-1/Dockerfile
:language: dockerfile
2 changes: 2 additions & 0 deletions docs/use-cases/cyclones_doc.rst
@@ -1,3 +1,5 @@
:orphan:

Tropical Cyclones Detection
===========================

49 changes: 46 additions & 3 deletions docs/use-cases/eurac_doc.rst
@@ -1,11 +1,54 @@
:orphan:

EURAC
=====
EURAC Use Case
==============
You can find the relevant code for the EURAC use case in the
`use case's folder on Github <https://github.com/interTwin-eu/itwinai/tree/main/use-cases/eurac>`_,
or by consulting the use case's README:


.. include:: ../../use-cases/eurac/README.md
:parser: myst_parser.sphinx_
:start-line: 2

Scalability Metrics
-------------------
Here are some examples of the scalability metrics for this use case:

Average Epoch Time Comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This plot shows a comparison of the average time per epoch for each strategy
and number of nodes.

.. image:: ../../use-cases/eurac/scalability-plots/absolute_scalability_plot.png

Relative Epoch Time Speedup
~~~~~~~~~~~~~~~~~~~~~~~~~~~
This plot shows the speedup achieved with different numbers of nodes for each
strategy. The speedup is calculated using the lowest number of nodes as the
baseline.

.. image:: ../../use-cases/eurac/scalability-plots/relative_scalability_plot.png

Communication vs Computation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This plot shows how much of the GPU time is spent doing computation compared to
communication between GPUs and nodes, for each strategy and number of nodes. The shaded
area is communication and the colored area is computation. They have all been
normalized so that the values are between 0 and 1.0.

.. image:: ../../use-cases/eurac/scalability-plots/communication_plot.png
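The normalization described above amounts to dividing each component by the total time; a minimal sketch of that arithmetic, with hypothetical timings rather than values from this use case:

```python
# Normalizing computation vs. communication time to the [0, 1] range,
# as in the plot above. Timings are hypothetical values for illustration.
computation_s = 80.0    # time spent in GPU computation
communication_s = 20.0  # time spent in inter-GPU/inter-node communication

total = computation_s + communication_s
comp_fraction = computation_s / total
comm_fraction = communication_s / total

print(comp_fraction, comm_fraction)  # fractions sum to 1.0
```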

GPU Utilization
~~~~~~~~~~~~~~~
This plot shows the GPU utilization for each strategy and number of nodes, as a
percentage from 0 to 100. Utilization is defined as the fraction of time the GPU
spends in compute mode, and does not directly correlate with FLOPs.

.. image:: ../../use-cases/eurac/scalability-plots/utilization_plot.png

Power Consumption
~~~~~~~~~~~~~~~~~
This plot shows the total energy consumption in watt-hours for the different strategies
and number of nodes.

.. image:: ../../use-cases/eurac/scalability-plots/gpu_energy_plot.png
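Energy in watt-hours is obtained by integrating power over time; a minimal sketch of that conversion, with hypothetical power samples rather than values from this use case:

```python
# Estimating energy (watt-hours) from periodic GPU power samples, the kind of
# aggregation behind the energy plot above. Sample values are hypothetical.
power_samples_w = [250.0, 300.0, 280.0, 310.0]  # watts, sampled periodically
sample_interval_s = 60.0                        # seconds between samples

energy_joules = sum(p * sample_interval_s for p in power_samples_w)
energy_wh = energy_joules / 3600.0  # 1 Wh = 3600 J

print(round(energy_wh, 2))
```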
2 changes: 1 addition & 1 deletion docs/use-cases/use_cases.rst
@@ -3,7 +3,7 @@ How to run a use case

Each use case comes with its own tutorial on how to run it. Before running them,
however, you should set up a Python virtual environment. Refer to the
:doc:`getting started section <../getting-started/getting_started_with_itwinai.rst`
:doc:`getting started section <../getting-started/getting_started_with_itwinai>`
for more information on how to do this.

After installing and activating the virtual environment, you will want to install the
44 changes: 44 additions & 0 deletions docs/use-cases/virgo_doc.rst
@@ -1,3 +1,5 @@
:orphan:

Virgo
=====

@@ -18,3 +20,45 @@ or by consulting the use case's README:
:parser: myst_parser.sphinx_
:start-line: 6

Scalability Metrics
-------------------
Here are some examples of the scalability metrics for this use case:

Average Epoch Time Comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This plot shows a comparison of the average time per epoch for each strategy
and number of nodes.

.. image:: ../../use-cases/virgo/scalability-plots/absolute_scalability_plot.png

Relative Epoch Time Speedup
~~~~~~~~~~~~~~~~~~~~~~~~~~~
This plot shows the speedup achieved with different numbers of nodes for each
strategy. The speedup is calculated using the lowest number of nodes as the
baseline.

.. image:: ../../use-cases/virgo/scalability-plots/relative_scalability_plot.png

Communication vs Computation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This plot shows how much of the GPU time is spent doing computation compared to
communication between GPUs and nodes, for each strategy and number of nodes. The shaded
area is communication and the colored area is computation. They have all been
normalized so that the values are between 0 and 1.0.

.. image:: ../../use-cases/virgo/scalability-plots/communication_plot.png

GPU Utilization
~~~~~~~~~~~~~~~
This plot shows the GPU utilization for each strategy and number of nodes, as a
percentage from 0 to 100. Utilization is defined as the fraction of time the GPU
spends in compute mode, and does not directly correlate with FLOPs.

.. image:: ../../use-cases/virgo/scalability-plots/utilization_plot.png

Power Consumption
~~~~~~~~~~~~~~~~~
This plot shows the total energy consumption in watt-hours for the different strategies
and number of nodes.

.. image:: ../../use-cases/virgo/scalability-plots/gpu_energy_plot.png
8 changes: 8 additions & 0 deletions env-files/torch/generic_torch.sh
@@ -29,3 +29,11 @@ source $ENV_NAME/bin/activate
pip install -e ".[torch,tf,dev]" \
--no-cache-dir \
--extra-index-url https://download.pytorch.org/whl/cu121

# Install Prov4ML
if [[ "$(uname)" == "Darwin" ]]; then
pip install --no-cache-dir "prov4ml[apple]@git+https://github.com/matbun/ProvML@new-main"
else
# Assuming Nvidia GPUs are available
pip install --no-cache-dir "prov4ml[nvidia]@git+https://github.com/matbun/ProvML@new-main"
fi