Venv creation and uv support (#245)
* add empty requirements file for cuda

* add requirements files and update pyproject toml

* update pyproject

* update installation in pyproject.toml

* update readme and horovod installation script

* update readme with horovod explanation

* update horovod installation script

* update readme with -e flag

* fix linter readme errors

* add more info to readme

* trailing whitespace 🙃

* trailing whitespace 🙃 (again)

* add draft of table of contents to readme

* update readme toc

* update readme toc again

* add section about uv lock to readme

* update toc of readme

* fix errors in readme

* add version numbers to packages in pyproject.toml

* remove uv.lock (for now)

* remove link from readme

* put toc in html comment

* remove toc, remove ds and horovod from reqs, add docs comment to pyproj

* Itwinai jlab Docker image (#236)

* Refactor Dockerfiles

* Refactor container gen script

* ADD jlab dockerfile

* First working version of jlab container

* ADD CMCC requirements

* update dockerfiles

* ADD nvconda and refactor

* Update containers

* ADD containers

* ADD simple plus dockerfile

* Update NV deps

* Update CUDA

* Add comment

* Cleanup

* Cleanup

* UPDATE README

* Refactor

* Fix linter

* Refactor dockerfiles and improve tests

* Refactor

* Refactor

* Fix

* Add first tests for HPC

* First broken tests for HPC

* Update tests and strategy

* UPDATE tests

* Update horovod tests

* Update tests and jlab deps

* Add MLFLow tracking URI

* ADD distributed trainer tests

* mpirun container deepspeed

* Fix distributed strategy tests on multi-node

* ADD srun launcher

* Refactor jobscript

* Cleanup

* isort tests

* Refactor scripts

* Minor fixes

* Add logging to file for all workers

* Add jupyter base files

* Add jupyter base files

* spelling

* Update provenance deps

* Update DS version

* Update prov docs

* Cleanup

* add nvidia dep

* Remove incomplete work

* update pyproject

* ADD hadolit config file

* FIX flag

* Fix linters

* Refactor

* Update prov4ml

* Update pytest CI

* Minor fix

* Incorporate feedback

* Update Dockerfiles

* Incorporate feedback

* Update comments

* Refactor tests

* Virgo HDF5 file format (#240)

* update virgo generated dataset to use hdf5 format

* add functionality for selecting output location

* set new data format as standard

* make virgo work with new data loader and add progress bar

* remove old generation files and add script for concatenating hdf5 files

* remove old generation files and add script for concatenating hdf5 files

* rename folder using hyphens

* remove multiprocessing

* add multiprocessing at correct place

* update handling of seed and num processes

* Gpu monitoring (#237)

* add gpu utilization decorator and begin work on plots

* add decorator for gpu energy utilization

* Added config option to hpo script, styling (#235)

* Update README.md

* Update README.md

* Update createEnvVega.sh

* remove unused dist file

* run black and isort to fix linting errors

* remove redundant variable

* remove trailing whitespace

* fix issues from PR

* fix import in eurac trainer

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* add configurable and dynamic wait and warmup times for the profiler

* remove old plot

* move horovod import

* fix linting errors

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* Scalability test wall clock (#239)

* add gpu utilization decorator and begin work on plots

* add decorator for gpu energy utilization

* Added config option to hpo script, styling (#235)

* Update README.md

* Update README.md

* Update createEnvVega.sh

* remove unused dist file

* run black and isort to fix linting errors

* temporary changes

* remove redundant variable

* add absolute time plot

* remove trailing whitespace

* remove redundant variable

* remove trailing whitespace

* begin implementation of backup

* fix issues from PR

* fix issues from PR

* add backup to gpu monitoring

* fix import in eurac trainer

* cleanup backup mechanism slightly

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* fix import in eurac trainer

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* begin implementation of backup

* add backup to gpu monitoring

* add backup functionality to communication plot

* rewrite epochtimetracker and refactor scalability plot code

* cleanup scalability plot code

* updating some epochtimetracker dependencies

* add configurable and dynamic wait and warmup times for the profiler

* temporary changes

* add absolute time plot

* begin implementation of backup

* add backup to gpu monitoring

* cleanup backup mechanism slightly

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* begin implementation of backup

* add backup functionality to communication plot

* rewrite epochtimetracker and refactor scalability plot code

* cleanup scalability plot code

* updating some epochtimetracker dependencies

* fix linting errors

* fix more linting errors

* add utilization percentage plot

* run isort for linting

* update default save path for metrics

* add decorators to virgo and some cleanup

* add contributions and cleanup

* fix linting errors

* change 'credits' to 'credit'

* update communication plot style

* update function names

* update scalability function for a more streamlined approach

* run isort

* move horovod import

* fix linting errors

* add contributors

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* make virgo work with new data loader and add progress bar

* add contributors

* update ruff settings in pyproject

* update virgo dataset concatenation

* add isort option to ruff

* break imports on purpose

* break more imports to test

* remove ruff config file

* 😀

* test linter 😁

* remove comment in github workflows

* add validation python to linter and make more mistakes

* add linting errors to trainer

* remove isort and flake8 and replace with ruff

* update linters

* run formatter on virgo folder

* fix linting errors and stuff from PR

* update config

* change config for timing code

* update profiler to use 'with' for context managing

* fix profiler.py

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* add requirements files and update pyproject toml

* update installation in pyproject.toml

* add pytorch extra to horovod and remove redundant script

* update readme tutorial with pip installation

* add uv tutorial in separate file

* fix linting errors

* update horovod install script

* fix dead link

* update readme

* add uv installation command to readme

* add requirements files and update pyproject toml

* update pyproject

* update installation in pyproject.toml

* add version numbers to packages in pyproject.toml

* update horovod install script and add pip as dependency

* formatting

* fix linting

* trailing whitespace

* remove comment from readme

* remove comments and small formatting difference

* move uv tutorial under docs/

* update readme with nvidia and amd instead of linux

* remove duplicate entries in pyproject and reformat distributed file

* update readmes

* separate horovod ds installation script into two files

* fix linting errors and update dependencies

* fix tests and update lockfile

* fix linting errors

* update installation scripts for testing

* add local test command

* add tf to installation in readme

* add torch cuda to project dependencies

* remove index from tutorial

* remove unused comments and update tutorial

---------

Co-authored-by: Matteo Bunino <[email protected]>
Co-authored-by: Anna Lappe <[email protected]>
3 people authored Dec 2, 2024
1 parent d05fb75 commit a964e47
Showing 20 changed files with 9,614 additions and 318 deletions.
7 changes: 5 additions & 2 deletions Makefile
@@ -11,7 +11,7 @@ torch-env-cpu: env-files/torch/generic_torch.sh
env ENV_NAME=.venv-pytorch \
NO_CUDA=1 \
bash -c 'bash env-files/torch/generic_torch.sh'
.venv-pytorch/bin/horovodrun --check-build
# .venv-pytorch/bin/horovodrun --check-build

# Install TensorFlow env (GPU support)
tensorflow-env: env-files/tensorflow/generic_tf.sh
@@ -44,7 +44,10 @@ tf-env-vega: env-files/tensorflow/createEnvVegaTF.sh env-files/tensorflow/generi


test:
PYTORCH_ENABLE_MPS_FALLBACK=1 .venv-pytorch/bin/pytest -v tests/ -m "not slurm"
.venv/bin/pytest -v tests/

test-local:
PYTORCH_ENABLE_MPS_FALLBACK=1 .venv/bin/pytest -v tests/ -m "not hpc"

test-jsc: tests/run_on_jsc.sh
bash tests/run_on_jsc.sh
80 changes: 77 additions & 3 deletions README.md
@@ -128,10 +128,84 @@ git clone [--recurse-submodules] git@github.com:interTwin-eu/itwinai.git

### Install itwinai environment

You can create the Python virtual environments using our predefined Makefile targets.
In this project, we use `uv` as a project-wide package manager. Therefore, if
you are a developer, you should read the [uv tutorial](/docs/uv-tutorial.md) after
the following `pip` tutorial.

#### PyTorch (+ Lightning) virtual environment
#### Installation using pip

##### Creating a venv

You can install the `itwinai` environment for development using `pip`. First, however,
you will want to create a Python venv if you haven't already. Make sure you have
Python installed (on HPC you have to load it with `module load Python`), and then you
can create a venv with the following command:

```bash
python -m venv <name-of-venv>
```
For example, if I wanted to create a venv in the directory `.venv` (which is useful if
you use e.g. `uv`), then I would do:

```bash
python -m venv .venv
```
After this, you can activate your venv using the following command:

```bash
source .venv/bin/activate
```

Now anything you `pip install` will be installed in your venv, and any `python`
commands you run will use your venv's interpreter.
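As a quick sanity check (using a hypothetical throwaway venv under `/tmp`), you can
confirm that the active interpreter really comes from the venv:

```bash
# Create and activate a throwaway venv (hypothetical path), then check
# that `python` now resolves to the venv's own interpreter
python3 -m venv /tmp/demo-venv
source /tmp/demo-venv/bin/activate
which python    # should print /tmp/demo-venv/bin/python
deactivate
```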
##### Installation of packages

We provide some _extras_ that can be activated depending on which platform you are
using:

- `macos`, `amd` or `nvidia`, depending on which platform you use. Changes the version
  of `prov4ML`.
- `dev` for development purposes. Includes libraries for testing and tensorboard etc.
- `torch` for installation with PyTorch.

If you want to install PyTorch using CUDA, then you also have to add an
`--extra-index-url` pointing to the CUDA version that you want. Since you are
developing the library, you also want to enable the editable flag, `-e`, so that you
don't have to reinstall everything every time you make a change. If you are on HPC,
then you will usually want to add the `--no-cache-dir` flag to avoid filling up your
`~/.cache` directory, as you can very easily reach your disk quota otherwise. An
example of a complete command for installing as a developer on HPC with CUDA thus
becomes:

```bash
pip install -e ".[torch,dev,nvidia,tf]" \
--no-cache-dir \
--extra-index-url https://download.pytorch.org/whl/cu121
```

If you wanted to install this locally on macOS (i.e. without CUDA) with PyTorch, you
would do the following instead:

```bash
pip install -e ".[torch,dev,macos,tf]"
```

<!-- You can create the Python virtual environments using our predefined Makefile targets. -->

#### Horovod and DeepSpeed

The above does not install `Horovod` and `DeepSpeed`, however, as they require a
specialized [script](env-files/torch/install-horovod-deepspeed-cuda.sh). If you do not
require CUDA, then you can install them using `pip` as follows:

```bash
pip install --no-cache-dir --no-build-isolation git+https://github.com/horovod/horovod.git
pip install --no-cache-dir --no-build-isolation deepspeed
```
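Once installed, you can sanity-check both builds; `horovodrun --check-build` (the same
diagnostic referenced in the project's Makefile) reports which frameworks and
controllers Horovod was compiled with, and `ds_report` is DeepSpeed's bundled
environment report. Both require the packages to already be installed in the active
venv:

```bash
# Show which frameworks (PyTorch, TensorFlow) and controllers (MPI, Gloo)
# this Horovod build supports
horovodrun --check-build

# Show DeepSpeed's view of the installed environment and compatible ops
ds_report
```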

#### PyTorch (+ Lightning) virtual environment with makefiles

Makefile targets for environment installation:

98 changes: 98 additions & 0 deletions docs/uv-tutorial.md
@@ -0,0 +1,98 @@
# Tutorial for using the uv package manager

[uv](https://docs.astral.sh/uv/) is a Python package manager meant to act as a drop-in
replacement for `pip` (and many more tools). In this project, we use it to manage our
packages, similar to how `poetry` works. This is done using a lockfile called
`uv.lock`.

## uv as a drop-in replacement for pip

`uv` is a lot faster than `pip`, so we recommend installing packages from `PyPI`
with `uv pip install <package>` instead of `pip install <package>`. You don't need to
change anything in your project to use this feature, as it works as a drop-in
replacement to `pip`.

## uv as a project-wide package manager

If you wish to use the `uv sync` and/or `uv lock` commands, which are how you use `uv`
to manage all your project packages, then note that these commands will only work
with the directory called `.venv` in the project directory. Sometimes, this can be a
bit annoying, especially with an existing venv, so we recommend using a
[symlink](https://en.wikipedia.org/wiki/Symbolic_link). If you need to have multiple
venvs that you want to switch between, you can update the symlink to whichever of them
you want to use at the moment. For SLURM scripts, you can hardcode the venv path if
need be.
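As a sketch of the SLURM case (the SBATCH options, paths, and `train.py` entry point
below are hypothetical), a jobscript can hardcode a specific venv instead of relying
on the `.venv` symlink:

```bash
#!/bin/bash
#SBATCH --job-name=itwinai-train
#SBATCH --nodes=1
#SBATCH --time=01:00:00

# Hardcode the venv path rather than depending on the .venv symlink
VENV_PATH="$HOME/itwinai/envAI_juwels"   # hypothetical location
source "$VENV_PATH/bin/activate"

python train.py   # hypothetical entry point
```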

### Symlinking .venv

To create a symlink between your venv and the `.venv` directory, you can use the
following command:

```bash
ln -s <path/to/your_venv> <path/to/.venv>
```

As an example, if I am in the `itwinai/` folder and my venv is called `envAI_juwels`,
then the following will create the wanted symlink:

```bash
ln -s envAI_juwels .venv
```

### Installing from uv.lock

> [!Warning]
> If `uv` creates your venv for you, the venv will not contain `pip`. However, you need
> to have `pip` installed to be able to run the installation scripts for `Horovod` and
> `DeepSpeed`, so we have included `pip` in the dependencies in `pyproject.toml`.

To install from the `uv.lock` file into the `.venv` venv, you can do the following:

```bash
uv sync
```

If the `uv.lock` file has optional dependencies (e.g. `macos` or `torch`), then these
can be added with the `--extra` flag as follows:

```bash
uv sync --extra torch --extra macos
```

These will usually correspond to the optional dependencies in the `pyproject.toml`. In
particular, if you are a developer you would use one of the following two commands. If
you are on HPC with CUDA, you would use:

```bash
uv sync --no-cache --extra dev --extra nvidia --extra torch --extra tf
```

If you are developing on your local computer with macOS, then you would use:

```bash
uv sync --extra torch --extra tf --extra dev --extra macos
```

### Updating the uv.lock file

To update the project's `uv.lock` file with the dependencies of the project, you can
use the command:

```bash
uv lock
```

This will create a `uv.lock` file if it doesn't already exist, using the dependencies
from the `pyproject.toml`.

## Adding new packages to the project

To add a new package to the project (i.e. to the `pyproject.toml` file) with `uv`, you
can use the following command:

```bash
uv add <package>
```

> [!Warning]
> This will add the package to your `.venv` venv, so make sure to have symlinked to
> this directory if you haven't already.
91 changes: 6 additions & 85 deletions env-files/tensorflow/generic_tf.sh
@@ -1,97 +1,18 @@
#!/bin/bash

# ENV VARIABLES:
# - ENV_NAME: set custom name for virtual env. Default: ".venv-tf"
# - NO_CUDA: if set, install without cuda support

# Detect custom env name from env
if [ -z "$ENV_NAME" ]; then
ENV_NAME=".venv-tf"
fi

if [ -z "$NO_CUDA" ]; then
echo "Installing itwinai and its dependencies in '$ENV_NAME' virtual env (CUDA enabled)"
else
echo "Installing itwinai and its dependencies in '$ENV_NAME' virtual env (CUDA disabled)"
fi

# get python version
pver="$(python --version 2>&1 | awk '{print $2}' | cut -f1-2 -d.)"

# use pyenv if exist
if [ -d "$HOME/.pyenv" ];then
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
fi
work_dir=$PWD

# set dir
cDir=$PWD

# create environment
if [ -d "${cDir}/$ENV_NAME" ];then
# Create the python venv if it doesn't already exist
if [ -d "${work_dir}/$ENV_NAME" ];then
echo "env $ENV_NAME already exists"

source $ENV_NAME/bin/activate
else
python3 -m venv $ENV_NAME

# activate env
source $ENV_NAME/bin/activate

echo "$ENV_NAME environment is created in ${cDir}"
fi

pip3 install --no-cache-dir --upgrade pip

# get wheel -- setuptools extension
pip3 install --no-cache-dir wheel

# install TF
if [ -f "${cDir}/$ENV_NAME/bin/tensorboard" ]; then
echo 'TF already installed'
echo
else
if [ -z "$NO_CUDA" ]; then
pip3 install tensorflow[and-cuda]==2.16.* --no-cache-dir
else
# CPU only installation
pip3 install tensorflow==2.16.* --no-cache-dir
fi
fi

# CURRENTLY, horovod is not used with TF. Skipped.
# # install horovod
# if [ -f "${cDir}/$ENV_NAME/bin/horovodrun" ]; then
# echo 'Horovod already installed'
# echo
# else
# if [ -z "$NO_CUDA" ]; then
# export HOROVOD_GPU=CUDA
# export HOROVOD_GPU_OPERATIONS=NCCL
# export HOROVOD_WITH_TENSORFLOW=1
# # export TMPDIR=${cDir}
# else
# # CPU only installation
# export HOROVOD_WITH_TENSORFLOW=1
# # export TMPDIR=${cDir}
# fi

# pip3 install --no-cache-dir horovod[tensorflow,keras] # --ignore-installed
# fi

# WHEN USING TF >= 2.16:
# install legacy version of keras (2.16)
# Since TF 2.16, keras updated to 3.3,
# which leads to an error when more than 1 node is used
# https://keras.io/getting_started/
pip3 install --no-cache-dir tf_keras==2.16.*

# Install Pov4ML
if [[ "$OSTYPE" =~ ^darwin ]] ; then
pip install "prov4ml[apple,nvidia]@git+https://github.com/matbun/ProvML@new-main" || exit 1
else
pip install "prov4ml[nvidia]@git+https://github.com/matbun/ProvML@new-main" || exit 1
echo "$ENV_NAME environment is created in ${work_dir}"
fi

# Install itwinai: MUST be last line of the script for the user installation script to work!
pip3 install --no-cache-dir -e .[dev]
source $ENV_NAME/bin/activate
pip install --no-cache-dir -e ".[dev,nvidia,tf]"
