Venv creation and uv support (#245)

* add empty requirements file for cuda * add requirements files and update pyproject toml * update pyproject * update installation in pyproject.toml * update readme and horovod installation script * update readme with horovod explanation * update horovod installation script * update readme with -e flag * fix linter readme errors * add more info to readme * trailing whitespace 🙃 * trailing whitespace 🙃 (again) * add draft of table of contents to readme * update readme toc * update readme toc again * add section about uv lock to readme * update toc of readme * fix errors in readme * add version numbers to packages in pyproject.toml * remove uv.lock (for now) * remove link from readme * put toc in html comment * remove toc, remove ds and horovod from reqs, add docs comment to pyproj * Itwinai jlab Docker image (#236) * Refactor Dockerfiles * Refactor container gen script * ADD jlab dockerfile * First working version of jlab container * ADD CMCC requirements * update dockerfiles * ADD nvconda and refactor * Update containers * ADD containers * ADD simple plus dockerfile * Update NV deps * Update CUDA * Add comment * Cleanup * Cleanup * UPDATE README * Refactor * Fix linter * Refactor dockerfiles and improve tests * Refactor * Refactor * Fix * Add first tests for HPC * First broken tests for HPC * Update tests and strategy * UPDATE tests * Update horovod tests * Update tests and jlab deps * Add MLFLow tracking URI * ADD distributed trainer tests * mpirun container deepspeed * Fix distributed strategy tests on multi-node * ADD srun launcher * Refactor jobscript * Cleanup * isort tests * Refactor scripts * Minor fixes * Add logging to file for all workers * Add jupyter base files * Add jupyter base files * spelling * Update provenance deps * Update DS version * Update prov docs * Cleanup * add nvidia dep * Remove incomplete work * update pyproject * ADD hadolit config file * FIX flag * Fix linters * Refactor * Update prov4ml * Update pytest CI * Minor fix * Incorporate 
feedback * Update Dockerfiles * Incorporate feedback * Update comments * Refactor tests * Virgo HDF5 file format (#240) * update virgo generated dataset to use hdf5 format * add functionality for selecting output location * set new data format as standard * make virgo work with new data loader and add progress bar * remove old generation files and add script for concatenating hdf5 files * remove old generation files and add script for concatenating hdf5 files * rename folder using hyphens * remove multiprocessing * add multiprocessing at correct place * update handling of seed and num processes * Gpu monitoring (#237) * add gpu utilization decorator and begin work on plots * add decorator for gpu energy utilization * Added config option to hpo script, styling (#235) * Update README.md * Update README.md * Update createEnvVega.sh * remove unused dist file * run black and isort to fix linting errors * remove redundant variable * remove trailing whitespace * fix issues from PR * fix import in eurac trainer * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * add configurable and dynamic wait and warmup times for the profiler * remove old plot * move horovod import * fix linting errors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * Scalability test wall clock (#239) * add gpu utilization decorator and begin work on plots * add decorator for gpu energy utilization * Added config option to hpo script, styling (#235) * Update README.md * Update README.md * Update createEnvVega.sh * remove unused dist file * run black and isort to fix linting errors * temporary changes * remove redundant variable * add absolute time plot * remove trailing whitespace * remove redundant variable * remove trailing whitespace * begin implementation of backup * fix 
issues from PR * fix issues from PR * add backup to gpu monitoring * fix import in eurac trainer * cleanup backup mechanism slightly * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * fix import in eurac trainer * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * begin implementation of backup * add backup to gpu monitoring * add backup functionality to communication plot * rewrite epochtimetracker and refactor scalability plot code * cleanup scalability plot code * updating some epochtimetracker dependencies * add configurable and dynamic wait and warmup times for the profiler * temporary changes * add absolute time plot * begin implementation of backup * add backup to gpu monitoring * cleanup backup mechanism slightly * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * begin implementation of backup * add backup functionality to communication plot * rewrite epochtimetracker and refactor scalability plot code * cleanup scalability plot code * updating some epochtimetracker dependencies * fix linting errors * fix more linting errors * add utilization percentage plot * run isort for linting * update default save path for metrics * add decorators to virgo and some cleanup * add contributions and cleanup * fix linting errors * change 'credits' to 'credit' * update communication plot style * update function names * update scalability function for a more streamlined approach * run isort * move horovod import * fix linting errors * add contributors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * make virgo work with 
new data loader and add progress bar * add contributors * update ruff settings in pyproject * update virgo dataset concatenation * add isort option to ruff * break imports on purpose * break more imports to test * remove ruff config file * 😀 * test linter 😁 * remove comment in github workflows * add validation python to linter and make more mistakes * add linting errors to trainer * remove isort and flake8 and replace with ruff * update linters * run formatter on virgo folder * fix linting errors and stuff from PR * update config * change config for timing code * update profiler to use 'with' for context managing * fix profiler.py --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * add requirements files and update pyproject toml * update installation in pyproject.toml * add pytorch extra to horovod and remove redundant script * update readme tutorial with pip installation * add uv tutorial in separate file * fix linting errors * update horovod install script * fix dead link * update readme * add uv installation command to readme * add requirements files and update pyproject toml * update pyproject * update installation in pyproject.toml * add version numbers to packages in pyproject.toml * update horovod install script and add pip as dependency * formatting * fix linting * trailing whitespace * remove comment from readme * remove comments and small formatting difference * move uv tutorial under docs/ * update readme with nvidia and amd instead of linux * remove duplicate entries in pyproject and reformat distributed file * update readmes * separate horovod ds installation script into two files * fix linting errors and update dependencies * fix tests and update lockfile * fix linting errors * update installation scripts for testing * add local test command * add tf to installation in readme * add torch cuda to project dependencies * remove index from tutorial * remove unused comments and update tutorial 
--------- Co-authored-by: Matteo Bunino <[email protected]> Co-authored-by: Anna Lappe <[email protected]>
interTwin-eu · Dec 2, 2024 · a964e47
1 parent d05fb75
commit a964e47
Showing 20 changed files with 9,614 additions and 318 deletions.
diff --git a/Makefile b/Makefile
@@ -11,7 +11,7 @@ torch-env-cpu: env-files/torch/generic_torch.sh
 	env ENV_NAME=.venv-pytorch \
 		NO_CUDA=1 \
 		bash -c 'bash env-files/torch/generic_torch.sh'
-	.venv-pytorch/bin/horovodrun --check-build 
+	# .venv-pytorch/bin/horovodrun --check-build 
 
 # Install TensorFlow env (GPU support)
 tensorflow-env: env-files/tensorflow/generic_tf.sh
@@ -44,7 +44,10 @@ tf-env-vega: env-files/tensorflow/createEnvVegaTF.sh env-files/tensorflow/generi
 
 
 test:
-	PYTORCH_ENABLE_MPS_FALLBACK=1 .venv-pytorch/bin/pytest -v tests/ -m "not slurm"
+	.venv/bin/pytest -v tests/
+
+test-local:
+	PYTORCH_ENABLE_MPS_FALLBACK=1 .venv/bin/pytest -v tests/ -m "not hpc"
 
 test-jsc: tests/run_on_jsc.sh
 	bash tests/run_on_jsc.sh

diff --git a/README.md b/README.md
@@ -128,10 +128,84 @@ git clone [--recurse-submodules] git@github.com:interTwin-eu/itwinai.git
 
 ### Install itwinai environment
 
-You can create the
-Python virtual environments using our predefined Makefile targets.
+This project uses `uv` as its project-wide package manager. If you are a developer,
+read the [uv tutorial](/docs/uv-tutorial.md) once you have gone through the `pip`
+tutorial below.
 
-#### PyTorch (+ Lightning) virtual environment
+#### Installation using pip
+
+##### Creating a venv
+
+You can install the `itwinai` environment for development using `pip`. First, create
+a Python venv if you don't already have one. Make sure Python is installed (on HPC
+you have to load it with `module load Python`), and then create a venv with the
+following command:
+
+```bash
+python -m venv <name-of-venv>
+```
+
+For example, to create a venv in the directory `.venv` (useful if you also use `uv`,
+which manages the venv in that directory), run:
+
+```bash
+python -m venv .venv
+```
+
+After this you can activate your venv using the following command:
+
+```bash
+source .venv/bin/activate
+```
+
+Now anything you install with `pip` goes into your venv, and any `python` command you
+run uses the interpreter from your venv.
+
+##### Installation of packages
+
+We provide some _extras_ that you can enable depending on your platform and use case:
+
+- `macos`, `amd` or `nvidia`, depending on your platform. This changes the version of
+  `prov4ML`.
+- `dev` for development. Includes libraries for testing, TensorBoard, etc.
+- `torch` for installation with PyTorch.
+- `tf` for installation with TensorFlow.
+
+To install PyTorch with CUDA support, you also have to pass an `--extra-index-url`
+pointing to the wheel index for the CUDA version you want. Since you are developing
+the library, you also want the editable flag, `-e`, so that you don't have to
+reinstall everything every time you make a change. On HPC, you will usually want to
+add the `--no-cache-dir` flag as well, since the pip cache in `~/.cache` can quickly
+fill up your disk quota. A complete command for installing as a developer on HPC
+with CUDA thus becomes:
+
+```bash
+pip install -e ".[torch,dev,nvidia,tf]" \
+    --no-cache-dir \
+    --extra-index-url https://download.pytorch.org/whl/cu121
+```
+
+To install locally on macOS (i.e. without CUDA) with PyTorch, use the following
+command instead:
+
+```bash
+pip install -e ".[torch,dev,macos,tf]"
+```
+
+<!-- You can create the Python virtual environments using our predefined Makefile targets. -->
+
+#### Horovod and DeepSpeed
+
+The commands above do not install `Horovod` and `DeepSpeed`, as they require a
+specialized [script](env-files/torch/install-horovod-deepspeed-cuda.sh). If you do not
+require CUDA, you can install them using `pip` as follows:
+
+```bash
+pip install --no-cache-dir --no-build-isolation git+https://github.com/horovod/horovod.git
+pip install --no-cache-dir --no-build-isolation deepspeed
+```
+
+#### PyTorch (+ Lightning) virtual environment with makefiles
 
 Makefile targets for environment installation:
 

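The two install commands in the README section above differ only in the platform extra and the CUDA-specific flags. A minimal sketch of that choice as a shell function, assuming a hypothetical helper named `itwinai_install_cmd` (not part of itwinai) that only prints the command rather than running it:

```bash
#!/bin/bash
# Sketch only: prints the pip command for a platform instead of running it.
# `itwinai_install_cmd` is a hypothetical helper, not part of itwinai.
itwinai_install_cmd() {
  platform="$1"
  case "$platform" in
    nvidia)
      # HPC with CUDA: skip the pip cache and add the PyTorch CUDA wheel index.
      echo "pip install -e \".[torch,dev,nvidia,tf]\" --no-cache-dir --extra-index-url https://download.pytorch.org/whl/cu121"
      ;;
    macos)
      # Local macOS install: no CUDA index needed.
      echo "pip install -e \".[torch,dev,macos,tf]\""
      ;;
    *)
      # The README gives no example command for other platforms.
      echo "no example command for platform: $platform" >&2
      return 1
      ;;
  esac
}

itwinai_install_cmd nvidia
itwinai_install_cmd macos
```

Printing the command first makes it easy to review the extras and flags before pasting the line into an activated venv.
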
diff --git a/docs/uv-tutorial.md b/docs/uv-tutorial.md
@@ -0,0 +1,98 @@
+# Tutorial for using the uv package manager
+
+[uv](https://docs.astral.sh/uv/) is a Python package manager meant to act as a drop-in
+replacement for `pip` (and many more tools). In this project, we use it to manage our
+packages, similar to how `poetry` works. This is done using a lockfile called
+`uv.lock`.
+
+## uv as a drop-in replacement for pip
+
+`uv` is a lot faster than `pip`, so we recommend installing packages from `PyPI`
+with `uv pip install <package>` instead of `pip install <package>`. You don't need to
+change anything in your project to use this feature, as it works as a drop-in
+replacement for `pip`.
+
+## uv as a project-wide package manager
+
+If you wish to use the `uv sync` and/or `uv lock` commands, which is how you use `uv`
+to manage all your project packages, note that by default these commands only work
+with the venv in the directory called `.venv` in the project directory. This can be a
+bit annoying, especially if you already have a venv elsewhere, so we recommend using
+a [symlink](https://en.wikipedia.org/wiki/Symbolic_link). If you switch between
+multiple venvs, you can point the symlink at whichever one you want to use at the
+moment. In SLURM scripts, you can hardcode the path to the venv if need be.
+
+### Symlinking .venv
+
+To create a symlink called `.venv` that points to your venv, you can use the
+following command:
+
+```bash
+ln -s <path/to/your_venv> <path/to/.venv>
+```
+
+For example, if you are in the `itwinai/` folder and your venv is called
+`envAI_juwels`, then the following will create the desired symlink:
+
+```bash
+ln -s envAI_juwels .venv
+```
+
+### Installing from uv.lock
+
+> [!Warning]
+> If `uv` creates your venv for you, the venv will not contain `pip`. However, you need
+> to have `pip` installed to be able to run the installation scripts for `Horovod` and
+> `DeepSpeed`, so we have included `pip` in the dependencies in `pyproject.toml`.
+
+To install from the `uv.lock` file into the `.venv` venv, you can do the following:
+
+```bash
+uv sync
+```
+
+If the `uv.lock` file has optional dependencies (e.g. `macos` or `torch`), then these
+can be added with the `--extra` flag as follows:
+
+```bash
+uv sync --extra torch --extra macos
+```
+
+These will usually correspond to the optional dependencies in the `pyproject.toml`. In
+particular, if you are a developer you would use one of the following two commands. If
+you are on HPC with CUDA, you would use:
+
+```bash
+uv sync --no-cache --extra dev --extra nvidia --extra torch --extra tf
+```
+
+If you are developing on your local computer with macOS, then you would use:
+
+```bash
+uv sync --extra torch --extra tf --extra dev --extra macos
+```
+
+### Updating the uv.lock file
+
+To update the project's `uv.lock` file to match the dependencies in `pyproject.toml`,
+you can use the command:
+
+```bash
+uv lock
+```
+
+This will create a `uv.lock` file if it doesn't already exist, using the dependencies
+from the `pyproject.toml`.
+
+## Adding new packages to the project
+
+To add a new package to the project (i.e. to the `pyproject.toml` file) with `uv`, you
+can use the following command:
+
+```bash
+uv add <package>
+```
+
+> [!Warning]
+> This will install the package into your `.venv` venv, so make sure that `.venv` is
+> symlinked to the venv you want to use if you haven't set that up already.
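
The uv tutorial above suggests updating the `.venv` symlink when switching between venvs. When `.venv` already exists as a symlink, a plain `ln -s` will not replace it. One possible way to repoint it, sketched in a temporary directory with `envAI_other` as a hypothetical second venv (empty directories stand in for real venvs here):

```bash
#!/bin/bash
# Sketch in a throwaway directory; the two directories stand in for real venvs.
# `envAI_other` is a hypothetical second venv name.
project_dir="$(mktemp -d)"
mkdir "$project_dir/envAI_juwels" "$project_dir/envAI_other"
cd "$project_dir" || exit 1

# Initial link: .venv -> envAI_juwels
ln -s envAI_juwels .venv

# Switch: -f replaces the existing link, -n stops ln from following it
# (without -n the new link would be created inside envAI_juwels/).
ln -sfn envAI_other .venv

# Show where .venv points now.
readlink .venv
```

After the switch, `uv sync` and `uv add` operate on whichever venv `.venv` currently points to.
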
diff --git a/env-files/tensorflow/generic_tf.sh b/env-files/tensorflow/generic_tf.sh
@@ -1,97 +1,18 @@
 #!/bin/bash
 
-# ENV VARIABLES:
-#   - ENV_NAME: set custom name for virtual env. Default: ".venv-tf"
-#   - NO_CUDA: if set, install without cuda support
-
-# Detect custom env name from env
 if [ -z "$ENV_NAME" ]; then
   ENV_NAME=".venv-tf"
 fi
 
-if [ -z "$NO_CUDA" ]; then
-  echo "Installing itwinai and its dependencies in '$ENV_NAME' virtual env (CUDA enabled)"
-else
-  echo "Installing itwinai and its dependencies in '$ENV_NAME' virtual env (CUDA disabled)"
-fi
-
-# get python version
-pver="$(python --version 2>&1 | awk '{print $2}' | cut -f1-2 -d.)"
-
-# use pyenv if exist
-if [ -d "$HOME/.pyenv" ];then
-  export PYENV_ROOT="$HOME/.pyenv"
-  export PATH="$PYENV_ROOT/bin:$PATH"
-fi
+work_dir=$PWD
 
-# set dir
-cDir=$PWD
-
-# create environment
-if [ -d "${cDir}/$ENV_NAME" ];then
+# Create the python venv if it doesn't already exist
+if [ -d "${work_dir}/$ENV_NAME" ];then
   echo "env $ENV_NAME already exists"
-
-  source $ENV_NAME/bin/activate
 else
   python3 -m venv $ENV_NAME
-
-  # activate env
-  source $ENV_NAME/bin/activate
-
-  echo "$ENV_NAME environment is created in ${cDir}"
-fi
-
-pip3 install --no-cache-dir  --upgrade pip
-
-# get wheel -- setuptools extension
-pip3 install --no-cache-dir wheel
-
-# install TF 
-if [ -f "${cDir}/$ENV_NAME/bin/tensorboard" ]; then
-  echo 'TF already installed'
-  echo
-else
-  if [ -z "$NO_CUDA" ]; then
-    pip3 install tensorflow[and-cuda]==2.16.* --no-cache-dir
-  else
-    # CPU only installation
-    pip3 install tensorflow==2.16.* --no-cache-dir
-  fi
-fi
-
-# CURRENTLY, horovod is not used with TF. Skipped.
-# # install horovod
-# if [ -f "${cDir}/$ENV_NAME/bin/horovodrun" ]; then
-#   echo 'Horovod already installed'
-#   echo
-# else
-#   if [ -z "$NO_CUDA" ]; then
-#     export HOROVOD_GPU=CUDA
-#     export HOROVOD_GPU_OPERATIONS=NCCL
-#     export HOROVOD_WITH_TENSORFLOW=1
-#     # export TMPDIR=${cDir}
-#   else
-#     # CPU only installation
-#     export HOROVOD_WITH_TENSORFLOW=1
-#     # export TMPDIR=${cDir}
-#   fi
-
-#   pip3 install --no-cache-dir horovod[tensorflow,keras] # --ignore-installed
-# fi
-
-# WHEN USING TF >= 2.16:
-# install legacy version of keras (2.16)
-# Since TF 2.16, keras updated to 3.3,
-# which leads to an error when more than 1 node is used
-# https://keras.io/getting_started/
-pip3 install --no-cache-dir  tf_keras==2.16.*
-
-# Install Pov4ML
-if [[ "$OSTYPE" =~ ^darwin ]] ; then
-  pip install "prov4ml[apple,nvidia]@git+https://github.com/matbun/ProvML@new-main" || exit 1
-else
-  pip install "prov4ml[nvidia]@git+https://github.com/matbun/ProvML@new-main" || exit 1
+  echo "$ENV_NAME environment is created in ${work_dir}"
 fi
 
-# Install itwinai: MUST be last line of the script for the user installation script to work!
-pip3 install --no-cache-dir  -e .[dev]
+source $ENV_NAME/bin/activate
+pip install --no-cache-dir -e ".[dev,nvidia,tf]"
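
The `ENV_NAME` check at the top of `generic_tf.sh` is a default-value pattern. As a side note, the same behaviour can be sketched with shell parameter expansion. `resolve_env_name` is a hypothetical function used only for illustration and is not part of the script:

```bash
#!/bin/bash
# ${VAR:-default} expands to the default when VAR is unset or empty,
# which matches the `[ -z "$ENV_NAME" ]` check in the script.
resolve_env_name() {
  echo "${ENV_NAME:-.venv-tf}"
}

unset ENV_NAME
resolve_env_name   # prints .venv-tf

ENV_NAME="my-env"
resolve_env_name   # prints my-env
```
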