Integration author(s): Kalliopi Tsolaki (CERN), Matteo Bunino (CERN)
First, create a Python virtual environment, then install the plugin.
Note
On HPC, remember to activate the correct software modules before creating the virtual environment.
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
# Installation in editable mode
pip install --no-cache-dir -e .
Note
The Python commands below are assumed to be executed from within the virtual environment.
In Python, you can import the plugin as follows:
import itwinai.plugins.tdgan
from itwinai.plugins.tdgan.model import Generator
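To verify the installation, you can run a quick import check from the shell (a minimal sanity check; the module path is the one shown above):
# Confirm the plugin is importable from the active environment
python -c "import itwinai.plugins.tdgan; print('tdgan plugin imported successfully')"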
Before you can start training, you have to download the data by running the dataloading step of the pipeline:
itwinai exec-pipeline --config config.yaml --pipe-key training_pipeline --steps dataloading_step
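If the download succeeded, the H5 files should appear under the dataset root (shown here assuming the exp_data layout described in the inference section below):
# List a few of the downloaded H5 files
ls exp_data/data/*.h5 | head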
Now you can launch training using itwinai and the provided training configuration config.yaml:
itwinai exec-pipeline --config config.yaml --pipe-key training_pipeline
The command above runs the training with a single worker. If you want to run distributed ML training, you have two options: interactive (launch from a terminal) or batch (launch from a SLURM job script).
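If you want to inspect the fully resolved configuration before launching, you can add the --print-config flag (the same flag is used in the inference examples further below):
itwinai exec-pipeline --print-config --config config.yaml --pipe-key training_pipeline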
Warning
Before running distributed ML, make sure that the distributed strategy used by PyTorch Lightning is set to ddp_find_unused_parameters_true. You can set this manually by setting distributed_strategy: ddp_find_unused_parameters_true in config.yaml.
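If you prefer not to edit config.yaml, the same setting can be applied at launch time through the -o override flag described later in this page:
# Set the distributed strategy via a runtime override instead of editing config.yaml
itwinai exec-pipeline --config config.yaml --pipe-key training_pipeline \
    -o distributed_strategy=ddp_find_unused_parameters_true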
To learn more about SLURM, see our SLURM cheatsheet.
If you want to use SLURM in interactive mode, do the following:
# Allocate resources (on JSC)
$ salloc --partition=batch --nodes=1 --account=intertwin --gres=gpu:4 --time=1:59:00
# salloc prints the allocated job ID, e.g., XXXX
# Get a shell in the compute node (if using SLURM)
$ srun --jobid XXXX --overlap --pty /bin/bash
# Now you are inside the compute node
# On JSC, you may need to load some modules
ml --force purge
ml Stages/2024 GCC OpenMPI CUDA/12 MPI-settings/CUDA Python HDF5 PnetCDF libaio mpi4py
# ...before activating the Python environment (adapt this to your env name/path)
source ../../envAI_hdfml/bin/activate
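Before launching the training, it can help to confirm that the allocated GPUs are actually visible from the compute node (a quick sanity check):
# Check GPU visibility on the allocated node
nvidia-smi
# Optionally, verify that PyTorch sees the GPUs as well
python -c "import torch; print(torch.cuda.device_count())"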
To launch the training with PyTorch DDP, use:
torchrun --standalone --nnodes=1 --nproc-per-node=gpu \
$(which itwinai) exec-pipeline --config config.yaml --pipe-key training_pipeline
# Alternatively, from a SLURM login node:
srun --jobid XXXX --ntasks-per-node=1 torchrun --standalone --nnodes=1 --nproc-per-node=gpu \
$(which itwinai) exec-pipeline --config config.yaml --pipe-key training_pipeline
Unlike the interactive approach, the batch approach allows you to use more than one compute node, making it possible to scale distributed ML training to larger resources.
Remember that on JSC there is no internet connection on the compute nodes, so if your script tries to access the internet it will fail. If needed, make sure to download the datasets from the SLURM login node before launching the job.
# Launch a SLURM batch job (on JSC)
sbatch slurm.jsc.sh
# Launch a SLURM batch job (on Vega)
sbatch slurm.vega.sh
# Check the job in the SLURM queue
squeue -u YOUR_USERNAME
# Check the job status
sacct -j JOBID
The job's stdout is usually saved to job.out and its stderr to job.err.
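While the job is running, you can follow its output live from the login node (assuming the default job.out and job.err file names above):
# Follow stdout and stderr of a running job
tail -f job.out job.err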
Depending on the logging service that you are using, there are different ways to inspect the logs generated during ML training.
To visualize the logs generated with MLflow, if you set a local path as the tracking URI, run the following in the terminal:
mlflow ui --backend-store-uri LOCAL_TRACKING_URI
Then select the "3DGAN" experiment.
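For example, if the logs were written under ml_logs/mlflow_logs (the logs_dir used in the override examples below), you could run (the port is arbitrary):
mlflow ui --backend-store-uri ml_logs/mlflow_logs --port 5000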
To run inference, proceed as follows:
- As inference dataset, we can reuse the training/validation dataset, for instance the one downloaded from the Google Drive folder: if the dataset root folder is not present, the dataset will be downloaded. The inference dataset is a set of H5 files stored inside exp_data sub-folders:
├── exp_data
│ ├── data
| │ ├── file_0.h5
| │ ├── file_1.h5
...
| │ ├── file_N.h5
- As model, if a pre-trained checkpoint is not available, we can create a dummy version of it with:
python create_inference_sample.py
- Run the inference command. This will generate a 3dgan-generated-data folder containing the generated particle traces in the form of torch tensors (.pth files) and 3D scatter plots (.jpg images):
itwinai exec-pipeline --config config.yaml --pipe-key inference_pipeline
The inference execution will produce a folder called 3dgan-generated-data containing the generated 3D particle trajectories (overwritten if already there). Each generated 3D image is stored both as a torch tensor (.pth) and as a 3D scatter plot (.jpg):
├── 3dgan-generated-data
| ├── energy=1.296749234199524&angle=1.272539496421814.pth
| ├── energy=1.296749234199524&angle=1.272539496421814.jpg
...
| ├── energy=1.664689540863037&angle=1.4906378984451294.pth
| ├── energy=1.664689540863037&angle=1.4906378984451294.jpg
However, if aggregate_predictions in the ParticleImagesSaver step is set to True, only one pickled file will be generated inside the 3dgan-generated-data folder. Notice that multiple inference calls will create new files under the 3dgan-generated-data folder.
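To quickly inspect one of the generated tensors, you can load it from the shell (a minimal sketch, assuming each .pth file stores a single torch tensor; the file name below is just an example taken from the listing above):
# Load one generated particle trace and print its shape (example file name)
python -c "import torch; t = torch.load('3dgan-generated-data/energy=1.296749234199524&angle=1.272539496421814.pth'); print(t.shape)"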
You can also override configuration fields at runtime:
# Override variables
export CERN_DATA_ROOT="../.." # data root
export TMP_DATA_ROOT=$CERN_DATA_ROOT
export CERN_CODE_ROOT="." # where code and configuration are stored
export MAX_DATA_SAMPLES=20000 # max dataset size
export BATCH_SIZE=1024 # increase to fill up GPU memory
export NUM_WORKERS_DL=4 # num worker processes used by the dataloader to pre-fetch data
export AGGREGATE_PREDS="true" # write predictions in a single file
export ACCELERATOR="gpu" # choose "cpu" or "gpu"
export STRATEGY="auto" # distributed strategy
export DEVICES="0," # GPU devices list
itwinai exec-pipeline --print-config --config $CERN_CODE_ROOT/config.yaml \
--pipe-key inference_pipeline \
-o dataset_location=$CERN_DATA_ROOT/exp_data \
-o logs_dir=$TMP_DATA_ROOT/ml_logs/mlflow_logs \
-o distributed_strategy=$STRATEGY \
-o devices=$DEVICES \
-o hw_accelerators=$ACCELERATOR \
-o checkpoints_path=$TMP_DATA_ROOT/checkpoints \
-o inference_model_uri=$CERN_CODE_ROOT/3dgan-inference.pth \
-o max_dataset_size=$MAX_DATA_SAMPLES \
-o batch_size=$BATCH_SIZE \
-o num_workers_dataloader=$NUM_WORKERS_DL \
-o inference_results_location=$TMP_DATA_ROOT/3dgan-generated-data \
-o aggregate_predictions=$AGGREGATE_PREDS
Warning
This section regarding container images is not up to date. For instance, Docker image names still refer to the itwinai library, but should reference this repository.
Build from the project root with:
# Local
docker buildx build -t itwinai:0.0.1-3dgan-0.1 .
# ghcr.io
docker buildx build -t ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1 .
docker push ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1
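To sanity-check the image locally before pushing it, you can invoke the itwinai CLI inside the container (assuming itwinai is on the image's PATH, as in the inference examples below):
docker run --rm itwinai:0.0.1-3dgan-0.1 itwinai --help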
You can run inference from any location where a sample of H5 files is available (in a folder called exp_data/):
├── $PWD
| ├── exp_data
| │ ├── data
| | │ ├── file_0.h5
| | │ ├── file_1.h5
...
| | │ ├── file_N.h5
docker run -it --rm --name running-inference -v "$PWD":/tmp/data ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1
This command will store the results in a folder called 3dgan-generated-data:
├── $PWD
| ├── 3dgan-generated-data
| │ ├── energy=1.296749234199524&angle=1.272539496421814.pth
| │ ├── energy=1.296749234199524&angle=1.272539496421814.jpg
...
| │ ├── energy=1.664689540863037&angle=1.4906378984451294.pth
| │ ├── energy=1.664689540863037&angle=1.4906378984451294.jpg
To override fields in the configuration file at runtime, you can use the -o flag. Example: -o path.to.config.element=NEW_VALUE.
Please find a complete example below, showing how to override default configurations by setting some environment variables:
# Override variables
export CERN_DATA_ROOT="/usr/data"
export TMP_DATA_ROOT=$CERN_DATA_ROOT # where results and logs are stored
export CERN_CODE_ROOT="/usr/src/app"
export MAX_DATA_SAMPLES=10 # max dataset size
export BATCH_SIZE=64 # increase to fill up GPU memory
export NUM_WORKERS_DL=4 # num worker processes used by the dataloader to pre-fetch data
export AGGREGATE_PREDS="true" # write predictions in a single file
export ACCELERATOR="gpu" # choose "cpu" or "gpu"
export STRATEGY="auto" # distributed strategy
export DEVICES="0," # GPU devices list
docker run -it --rm --name running-inference \
-v "$PWD":/usr/data ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1 \
/bin/bash -c "itwinai exec-pipeline \
--print-config --config $CERN_CODE_ROOT/config.yaml \
--pipe-key inference_pipeline \
-o dataset_location=$CERN_DATA_ROOT/exp_data \
-o logs_dir=$TMP_DATA_ROOT/ml_logs/mlflow_logs \
-o distributed_strategy=$STRATEGY \
-o devices=$DEVICES \
-o hw_accelerators=$ACCELERATOR \
-o checkpoints_path=$TMP_DATA_ROOT/checkpoints \
-o inference_model_uri=$CERN_CODE_ROOT/3dgan-inference.pth \
-o max_dataset_size=$MAX_DATA_SAMPLES \
-o batch_size=$BATCH_SIZE \
-o num_workers_dataloader=$NUM_WORKERS_DL \
-o inference_results_location=$TMP_DATA_ROOT/3dgan-generated-data \
-o aggregate_predictions=$AGGREGATE_PREDS "
Keeping the example above as a reference, increase the value of BATCH_SIZE as much as possible (just below "out of memory" errors) and make sure that ACCELERATOR="gpu". Also make sure to use a large enough dataset, by increasing the value of MAX_DATA_SAMPLES, to collect meaningful performance data. Consider that each H5 file contains roughly 5k items, so setting MAX_DATA_SAMPLES=10000 should be enough to use all the items in each input H5 file.
You can try:
export MAX_DATA_SAMPLES=10000 # max dataset size
export BATCH_SIZE=1024 # increase to fill up GPU memory
export ACCELERATOR="gpu
Run the Docker container with Singularity:
singularity run --nv -B "$PWD":/usr/data docker://ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1 /bin/bash -c \
"cd /usr/src/app && itwinai exec-pipeline --config config.yaml --pipe-key inference_pipeline"
Example with overrides (as above for Docker):
# Override variables
export CERN_DATA_ROOT="/usr/data"
export TMP_DATA_ROOT=$CERN_DATA_ROOT # where results and logs are stored
export CERN_CODE_ROOT="/usr/src/app"
export MAX_DATA_SAMPLES=10 # max dataset size
export BATCH_SIZE=64 # increase to fill up GPU memory
export NUM_WORKERS_DL=4 # num worker processes used by the dataloader to pre-fetch data
export AGGREGATE_PREDS="true" # write predictions in a single file
export ACCELERATOR="gpu" # choose "cpu" or "gpu"
export STRATEGY="auto" # distributed strategy
export DEVICES="0," # GPU devices list
singularity run --nv -B "$PWD":/usr/data docker://ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1 /bin/bash -c \
"cd /usr/src/app && itwinai exec-pipeline \
--print-config --config $CERN_CODE_ROOT/config.yaml \
--pipe-key inference_pipeline \
-o dataset_location=$CERN_DATA_ROOT/exp_data \
-o logs_dir=$TMP_DATA_ROOT/ml_logs/mlflow_logs \
-o distributed_strategy=$STRATEGY \
-o devices=$DEVICES \
-o hw_accelerators=$ACCELERATOR \
-o checkpoints_path=$TMP_DATA_ROOT/checkpoints \
-o inference_model_uri=$CERN_CODE_ROOT/3dgan-inference.pth \
-o max_dataset_size=$MAX_DATA_SAMPLES \
-o batch_size=$BATCH_SIZE \
-o num_workers_dataloader=$NUM_WORKERS_DL \
-o inference_results_location=$TMP_DATA_ROOT/3dgan-generated-data \
-o aggregate_predictions=$AGGREGATE_PREDS "
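Once the run completes, the generated outputs should appear on the host under the folder bound to /usr/data (here $PWD, as set by the -B flag above):
# List a few generated files on the host side of the bind mount
ls "$PWD/3dgan-generated-data" | head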