PY4CAST

This project, built using PyTorch and PyTorch-lightning, is designed to train a variety of Neural Network architectures (GNNs, CNNs, Vision Transformers, ...) on various weather forecasting datasets. This is a Work in Progress, intended to share ideas and design concepts with partners.

Developed at Météo-France by DSM/AI Lab and CNRM/GMAP/PREV.

Contributions are welcome (Issues, Pull Requests, ...).

This project is licensed under the APACHE 2.0 license.

Acknowledgements

This project started as a fork of neural-lam, a project by Joel Oskarsson, see here. Many thanks to Joel for his work!

Use any neural network architectures available in mfai
1 dataset with samples available on Huggingface : Titan
2 training strategies : Scaled Auto-regressive steps, Differential Auto-regressive steps
4 losses: Scaled RMSE, Scaled L1, Weighted MSE, Weighted L1
neural networks as simple torch.nn.Module
training with PyTorch-lightning
simple interfaces to easily add a new dataset, neural network, training strategy or loss
simple command line to launch a training
config files to change the parameters of your dataset or neural network during training
experiment tracking with tensorboard and plots of forecasts with matplotlib
implementation of NamedTensors to track features and dimensions of tensors at each step of the training

See here for details on the available datasets, neural networks, training strategies, losses, and explanation of our NamedTensor.

Installation

Start by cloning the repository:

git clone https://github.com/meteofrance/py4cast.git
cd py4cast

Setting environment variables

In order to be able to run the code on different machines, some environment variables can be set. You may add them in your .bashrc or modify them just before launching an experiment.

PY4CAST_ROOTDIR : Specify the ROOT DIR for your experiment. It also modifies the CACHE_DIR. This is where the files created during the experiment will be stored.
PY4CAST_SMEAGOL_PATH: Specify where the smeagol dataset is stored. Only needed if you want to use the smeagol dataset.
PY4CAST_TITAN_PATH: Specify where the titan dataset is stored. Only needed if you want to use the titan dataset.

Set them with export, for example:

export PY4CAST_ROOTDIR="/my/dir/"

You MUST export PY4CAST_ROOTDIR for py4cast to work. For instance, you can build it from the existing SCRATCH environment variable:

export PY4CAST_ROOTDIR=$SCRATCH/py4cast

If PY4CAST_ROOTDIR is not exported, py4cast defaults to /scratch/shared/py4cast as its root directory, which raises exceptions if this directory does not exist or is not writable.
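
The fallback behaviour described above can be pictured with a short Python sketch. This is only an illustration, not py4cast's actual code: resolve_root_dir and is_usable are hypothetical helper names, and only the variable name PY4CAST_ROOTDIR and the default path come from this README.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default path quoted in this README; everything else is illustrative.
DEFAULT_ROOTDIR = Path("/scratch/shared/py4cast")


def resolve_root_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return PY4CAST_ROOTDIR when it is set and non-empty, else the default."""
    env = os.environ if env is None else env
    value = env.get("PY4CAST_ROOTDIR")
    return Path(value) if value else DEFAULT_ROOTDIR


def is_usable(root: Path) -> bool:
    """True when the directory exists and the current user can write to it."""
    return root.is_dir() and os.access(root, os.W_OK)
```

Running a check such as is_usable(resolve_root_dir()) before launching an experiment is one way to catch a missing or read-only root directory early.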

At Météo-France

When working at Météo-France, you can use either runai + Docker or Conda/Micromamba to set up a working environment. We recommend runai on the AI Lab cluster and Conda on our HPC.

See the runai repository for installation instructions.

For HPC, see the related doc (doc/install/install_MF.md) to get the right installation settings.

Install with conda

You can install a conda environment, including py4cast in editable mode, using

conda env create --file env.yaml

From an existing conda environment, you can install py4cast manually in development mode using

conda install conda-build -n py4cast
conda develop .

or

pip install --editable .

If the install fails because some dependencies are not found or conflict, please look at the installation known issues.

Install with micromamba

Please install the environment using :

micromamba create -f env.yaml

From an existing micromamba environment, you can install py4cast manually in editable mode using

pip install --editable .

Build docker image

To build the docker image, please use the oci-image-build.sh script. Météo-France users should export the INJECT_MF_CERT variable to use the Météo-France certificate:

export INJECT_MF_CERT=1

Then, build with the following command

bash ./oci-image-build.sh --runtime docker

By default, the CUDA and PyTorch versions are extracted from the env.yaml reference file. For testing purposes, you can set PY4CAST_CUDA_VERSION and PY4CAST_TORCH_VERSION to override the default versions.

Build podman image

As an alternative to docker, you can use podman to build the image.


To build the podman image please use the oci-image-build.sh script.

bash ./oci-image-build.sh --runtime podman

By default, the CUDA and PyTorch versions are extracted from the env.yaml reference file. For testing purposes, you can set PY4CAST_CUDA_VERSION and PY4CAST_TORCH_VERSION to override the default versions.

Convert to Singularity image

You can convert a previously built docker or podman image to the Singularity format.


To convert the previously built image to a Singularity container, you have to first save the image as a tar file:

docker save py4cast:your_tag -o py4cast-your_tag.tar

or with podman:

podman save --format oci-archive py4cast:your_tag -o py4cast-your_tag.tar

Then, build the singularity image with:

singularity build py4cast-your_tag.sif docker-archive://py4cast-your_tag.tar

Make sure you have enough free disk space to store the .tar and .sif files.

Usage

Docker

From your py4cast source directory, to run an experiment using the docker image you need to mount in the container :

The dataset path
The py4cast sources
The PY4CAST_ROOTDIR path

Here is an example command that trains the HiLam model on the TITAN dataset, using all the GPUs:

docker run \
    --name hilam-titan \
    --rm \
    --gpus all \
    -v ./${HOME} \
    -v <path-to-datasets>/TITAN:/dataset/TITAN \
    -v <your_py4cast_root_dir>:<your_py4cast_root_dir> \
    -e PY4CAST_ROOTDIR=<your_py4cast_root_dir> \
    -e PY4CAST_TITAN_PATH=/dataset/TITAN \
    py4cast:<your_tag> \
    bash -c " \
        pip install -e . &&  \
        python bin/main.py fit\
            --config config/CLI/trainer.yaml \
            --config config/CLI/model/hilam.yaml \
            --config config/CLI/dataset/titan.yaml \
    "

Podman


From your py4cast source directory, to run an experiment using the podman image you need to mount in the container :

The dataset path
The py4cast sources
The PY4CAST_ROOTDIR path

Here is an example command that trains the HiLam model on the TITAN dataset, using all the GPUs:

podman run \
    --name hilam-titan \
    --rm \
    --device nvidia.com/gpu=all \
    --ipc=host \
    --network=host \
    -v ./${HOME} \
    -v <path-to-datasets>/TITAN:/dataset/TITAN \
    -v <your_py4cast_root_dir>:<your_py4cast_root_dir> \
    -e PY4CAST_ROOTDIR=<your_py4cast_root_dir> \
    -e PY4CAST_TITAN_PATH=/dataset/TITAN \
    py4cast:<your_tag> \
    bash -c " \
        pip install -e . &&  \
        python bin/main.py fit\
            --config config/CLI/trainer.yaml \
            --config config/CLI/model/hilam.yaml \
            --config config/CLI/dataset/titan.yaml \
    "

Singularity


From your py4cast source directory, to run an experiment using a singularity container you need to mount in the container :

The dataset path
The PY4CAST_ROOTDIR path

Here is an example command that trains the HiLam model on the TITAN dataset:

PY4CAST_TITAN_PATH=/dataset/TITAN \
PY4CAST_ROOTDIR=<your_py4cast_root_dir> \
singularity exec \
    --nv \
    --bind <path-to-datasets>/TITAN:/dataset/TITAN \
    --bind <your_py4cast_root_dir>:<your_py4cast_root_dir> \
    py4cast-<your_tag>.sif \
    bash -c " \
        pip install -e . &&  \
        python bin/main.py fit\
            --config config/CLI/trainer.yaml \
            --config config/CLI/model/hilam.yaml \
            --config config/CLI/dataset/titan.yaml \
    "

runai

For now this works only for internal Météo-France users.


runai commands must be issued at the root directory of the py4cast project:

Run an interactive training session

runai gpu_play 4
runai build
runai exec_gpu python bin/main.py fit --config config/CLI/trainer.yaml --config config/CLI/dataset/titan.yaml --config config/CLI/model/hilam.yaml

Train using sbatch single node multi-GPUs

Modify the trainer.yaml configuration file.

trainer:
  num_nodes: 1

export RUNAI_GRES="gpu:v100:4"
runai sbatch python bin/main.py fit --config config/CLI/trainer.yaml --config config/CLI/dataset/titan.yaml --config config/CLI/model/hilam.yaml

Train using sbatch multi nodes multi GPUs

Here we use 2 nodes with 4 GPUs each.

Modify the trainer.yaml configuration file.

trainer:
  num_nodes: 2

export RUNAI_SLURM_NNODES=2
export RUNAI_GRES="gpu:v100:4"
runai sbatch_multi_node python bin/main.py fit --config config/CLI/trainer.yaml --config config/CLI/dataset/titan.yaml --config config/CLI/model/hilam.yaml

For the rest of the documentation, you must prepend each python command with runai exec_gpu.

Conda or Micromamba

Once your conda or micromamba environment is set up, you should :

activate your environment: conda activate py4cast or micromamba activate nlam
launch a training

A very simple training can be launched (on your current node):

python bin/main.py fit --config config/CLI/trainer.yaml --config config/CLI/dataset/dummy.yaml --config config/CLI/model/hilam.yaml

Example of script to launch on gpu

To do so, you will need to create a small sh script.

#!/usr/bin/bash
#SBATCH --partition=ndl
#SBATCH --nodes=1 # Number of GPU nodes required
#SBATCH --gres=gpu:1 # Number of GPUs required per node
#SBATCH --time=05:00:00 # Time limit of your experiment
#SBATCH --ntasks-per-node=1 # Number of tasks per node. This should match the number of GPUs required per node

# Note that other variables could be set (depending on your machine). For example, you may need to set the number of CPUs or the memory used by your experiment.
# On MF HPC, this is proportional to the number of GPUs required per node. This is not the case on other machines (e.g. the Météo-France AI Lab machine).

source ~/.bashrc  # Be sure that all your environment variables are set
conda activate py4cast # Activate your environment (installed by micromamba or conda)
cd $PY4CAST_PATH # Go to Py4CAST (you can either add an environment variable or hard code it here).
# Launch your favorite command.
srun python bin/main.py fit --config config/CLI/trainer.yaml --config config/CLI/dataset/dummy.yaml --config config/CLI/model/hilam.yaml

Then just launch this script using

sbatch my_tiny_script.sh

NB: You may have some trouble with SSL certificates (for cartopy). You may need to explicitly export the certificate as follows:

 export SSL_CERT_FILE="/opt/softs/certificats/proxy1.pem"

with the proxy path depending on your machine.

Delving into the design

main.py uses a LightningCLI to train (bin/main.py fit), test (bin/main.py test) or predict (bin/main.py predict).

This LightningCLI calls the LightningModule (where the model is initialized and methods are written) and the DataModule (where the dataset is initialized).

The native args of the LightningCLI (trainer), the args of the LightningModule (model) and the args of the DataModule (data) are accessible through the trainer.yaml, model.yaml and dataset.yaml. Here is a standard command line :

usage : python bin/main.py <mode> --config config/CLI/trainer.yaml --config config/CLI/dataset/<dataset>.yaml --config config/CLI/model/<model>.yaml

When you want to change an argument, you can either modify the YAML config file where it is defined or override it by passing it directly on the command line. For instance, to change the loss_name argument defined in unetrpp.yaml, you can use the following command line :

usage : python bin/main.py <mode> --config config/CLI/trainer.yaml --config config/CLI/dataset/<dataset>.yaml --config config/CLI/model/<model>.yaml --model.loss_name mae
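
The layering of several --config files followed by a dotted override can be pictured with the sketch below. It is a simplified model of the behaviour, not the LightningCLI implementation: deep_merge, layer_configs and apply_override are hypothetical helpers, and the dictionary contents are made-up stand-ins for the YAML files. The assumption is that values given later on the command line take precedence over earlier ones.

```python
from copy import deepcopy
from typing import Any, Dict, List


def deep_merge(base: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` in which values from `update` take precedence."""
    merged = deepcopy(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def layer_configs(configs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge configs in order, so later files override earlier ones."""
    result: Dict[str, Any] = {}
    for config in configs:
        result = deep_merge(result, config)
    return result


def apply_override(config: Dict[str, Any], dotted_key: str, value: Any) -> Dict[str, Any]:
    """Apply one `--a.b.c value` style override to a nested config."""
    nested: Any = value
    for part in reversed(dotted_key.split(".")):
        nested = {part: nested}
    return deep_merge(config, nested)


# Made-up contents standing in for trainer.yaml and a model YAML file.
trainer_cfg = {"trainer": {"max_epochs": 50}}
model_cfg = {"model": {"loss_name": "mse", "training_strategy": "diff_ar"}}

config = layer_configs([trainer_cfg, model_cfg])
config = apply_override(config, "model.loss_name", "mae")
```

Only the overridden key changes; sibling keys from the YAML files are left untouched.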

Quick note: trainer.fast_dev_run is a useful option to check that the model can be fitted with minimal computation. It sets max_epochs: 1 and limit_train_batches: 1.

Dataset initialization

As in neural-lam, before training you must first compute the mean and std of each feature.

To compute the stats of the Titan dataset:

python py4cast/datasets/titan/titan_cli.py prepare
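
As a rough illustration of what such a preparation step computes, the sketch below accumulates a per-feature mean and standard deviation over a stream of samples. It is not the code of titan_cli.py: feature_stats is a hypothetical function, each sample is assumed to be a flat sequence of feature values, and the population standard deviation is used.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def feature_stats(samples: Iterable[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return (means, stds) per feature, reading the samples only once."""
    count = 0
    sums: List[float] = []
    square_sums: List[float] = []
    for sample in samples:
        if count == 0:
            sums = [0.0] * len(sample)
            square_sums = [0.0] * len(sample)
        for index, value in enumerate(sample):
            sums[index] += value
            square_sums[index] += value * value
        count += 1
    if count == 0:
        raise ValueError("at least one sample is required")
    means = [total / count for total in sums]
    # E[x^2] - E[x]^2, clamped at zero to absorb rounding errors.
    stds = [
        math.sqrt(max(square_total / count - mean * mean, 0.0))
        for square_total, mean in zip(square_sums, means)
    ]
    return means, stds


def normalize(sample: Sequence[float], means: Sequence[float], stds: Sequence[float]) -> List[float]:
    """Standardize one sample; features with zero std are only centered."""
    return [(v - m) / s if s > 0 else v - m for v, m, s in zip(sample, means, stds)]
```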

To train on a dataset with its default settings, just pass the config file named after the dataset (all lowercase) :

python bin/main.py fit --config config/CLI/trainer.yaml --config config/CLI/dataset/titan.yaml --config config/CLI/model/hilam.yaml

You can override the dataset default configuration file :

either by modifying dataset.yaml :

data:
  dataset_conf: config/datasets/titan_refacto2.json

or by passing the argument :

python bin/main.py fit --config config/CLI/trainer.yaml --config config/CLI/dataset/titan.yaml --config config/CLI/model/hilam.yaml --data.dataset_conf config/datasets/titan_refacto2.json

Details on available datasets.

Training options

Configuring the neural network

To train on a dataset using a network with its default settings, just pass the config file named after the architecture (all lowercase) as shown below:

python bin/main.py fit --config config/CLI/trainer.yaml --config config/CLI/dataset/smeagol.yaml --config config/CLI/model/hilam.yaml

python bin/main.py fit --config config/CLI/trainer.yaml --config config/CLI/dataset/smeagol.yaml --config config/CLI/model/halfunet.yaml

You can override some settings of the model (here we set hidden_size to 256 and num_heads_encoder to 4) :

either by modifying model.yaml :

model:
  settings_init_args:
    hidden_size: 256
    num_heads_encoder: 4
    # etc.

or by passing the argument :

python bin/main.py fit --config config/CLI/trainer.yaml --config config/CLI/dataset/titan.yaml --config config/CLI/model/unetrpp.yaml --model.settings_init_args.hidden_size 256

Details on available neural networks.

Changing the training strategy

You can choose a training strategy :

either by modifying model.yaml :

model:
  training_strategy: diff_ar

or by passing the argument :

python bin/main.py fit --config config/CLI/trainer.yaml --config config/CLI/dataset/titan.yaml --config config/CLI/model/unetrpp.yaml --model.training_strategy diff_ar

Details on available training strategies.
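
To give an intuition of the difference between the two auto-regressive strategies, here is a toy rollout in plain Python. It rests on an assumption that should be checked against the linked details: in the differential strategy the network output is treated as an increment added to the previous state, whereas otherwise it is treated as the next state itself. The rollout function, its differential flag and toy_model are hypothetical names, and the scaling/normalization applied by py4cast is left out.

```python
from typing import Callable, List

State = List[float]
StepFn = Callable[[State], State]


def rollout(step_fn: StepFn, initial: State, num_steps: int, differential: bool) -> List[State]:
    """Unroll a one-step model auto-regressively for `num_steps` steps."""
    states: List[State] = []
    current = list(initial)
    for _ in range(num_steps):
        output = step_fn(current)
        if differential:
            # The output is interpreted as a change added to the previous state.
            current = [prev + delta for prev, delta in zip(current, output)]
        else:
            # The output is interpreted as the next state itself.
            current = list(output)
        states.append(current)
    return states


def toy_model(state: State) -> State:
    """Stand-in for a neural network: halves every value."""
    return [0.5 * value for value in state]


direct = rollout(toy_model, [8.0], num_steps=3, differential=False)
incremental = rollout(toy_model, [8.0], num_steps=3, differential=True)
```

In both cases each prediction is fed back as the input of the next step, which is what makes the rollout auto-regressive.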

Other training options:

For more options, please refer to the various trainer.yaml, model.yaml and dataset.yaml files.

You can find more details about all the num_X_steps options here.

Tracking experiment

Tensorboard

We use Tensorboard to track the experiments. You can launch a Tensorboard server using the following command:

At Météo-France:

runai will handle port forwarding for you.

runai tensorboard --logdir PATH_TO_YOUR_ROOT_PATH

Elsewhere

tensorboard --logdir PATH_TO_YOUR_ROOT_PATH

Then you can access the tensorboard server at the following address: http://YOUR_SERVER_IP:YOUR_PORT/

MLFlow

Optionally, you can use MLFlow, in addition to Tensorboard, to track your experiment and log your model. To activate the MLFlow logger simply add the --mlflow_log option on the bin/train.py command line.

Local usage

Without a MLFlow server, the logs are stored in your root path, at PY4CAST_ROOTDIR/logs/mlflow.

With a MLFlow server

If you have an MLFlow server, you can configure your training environment to push the logs to the remote server. A set of environment variables is available to do that.

For example, you can export the following variables in your training environment:

export MLFLOW_TRACKING_URI=https://my.mlflow.server.com/
export MLFLOW_TRACKING_USERNAME=<your-mlflow-user>
export MLFLOW_TRACKING_PASSWORD=<your-mlflow-pwd>
export MLFLOW_EXPERIMENT_NAME=py4cast/unetrpp

Inference

Inference is done by running the bin/main.py predict script. This script will load a model and run it on a dataset using the training parameters (dataset config, timestep options, ...).

A simple example of inference is shown below:

 runai exec_gpu python bin/main.py predict --config config/CLI/trainer.yaml --config config/CLI/dataset/titan.yaml --config config/CLI/model/unetrpp.yaml

Making animated plots comparing multiple models

You can compare multiple trained models on specific case studies and visualize the forecasts on animated plots with bin/gif_comparison.py. See the example GIF at the beginning of the README.

Warnings:

For now this script only works with models trained with Titan dataset.
If you want to use AROME as a model, you have to manually download the forecast beforehand.

Usage: gif_comparison.py [-h] --ckpt CKPT --date DATE [--num_pred_steps NUM_PRED_STEPS]

options:
  -h, --help            show this help message and exit
  --ckpt CKPT           Paths to the model checkpoint or AROME
  --date DATE           Date for inference. Format YYYYMMDDHH.
  --num_pred_steps NUM_PRED_STEPS
                        Number of auto-regressive steps/prediction steps.

example: python bin/gif_comparison.py --ckpt AROME --ckpt /.../logs/my_run/epoch=247.ckpt
                                      --date 2023061812 --num_pred_steps 10

Scoring and comparing models

The bin/main.py test script will compute and save metrics on the validation set, on as many auto-regressive prediction steps as you want.

python bin/main.py test --config config/CLI/trainer.yaml --config config/CLI/dataset/titan.yaml --config config/CLI/model/unetrpp.yaml
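
Scoring over several auto-regressive steps can be pictured as computing one score per prediction step. The sketch below does this for a plain RMSE; it is an illustration only, not the metric code used by py4cast, and rmse_per_step is a hypothetical function working on flat lists of values.

```python
import math
from typing import List, Sequence


def rmse_per_step(
    predictions: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
) -> List[float]:
    """Return one RMSE per prediction step (first axis = step)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same number of steps")
    scores: List[float] = []
    for predicted, target in zip(predictions, targets):
        if not predicted or len(predicted) != len(target):
            raise ValueError("each step needs matching, non-empty values")
        mse = sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)
        scores.append(math.sqrt(mse))
    return scores
```

Plotting such per-step scores for several checkpoints side by side shows how quickly each model's error grows with lead time.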

Once you have run bin/main.py test on all the models you want, you can compare them with bin/scores_comparison.py:

python bin/scores_comparison.py --ckpt PATH_TO_CKPT_0  --ckpt PATH_TO_CKPT_1

Warning: For now bin/scores_comparison.py only works with models trained with Titan dataset

Adding features and contributing

This page explains how to:

add a new neural network
add a new dataset
contribute to this project following our guidelines

Design choices

The figure below illustrates the principal components of the Py4cast architecture.

We define interface contracts between the components of the system using Python ABCs. As long as the Python classes respect the interface contract, they can be used interchangeably in the system and the underlying implementation can be very different. For instance datasets with any underlying storage (grib2, netcdf, mmap+numpy, ...) and real-time or ahead of time concat and pre-processing could be used with the same neural network architectures and training strategies.
Adding a model, a dataset, a loss, a plot, a training strategy, ... should be as simple as creating a new Python class that complies with the interface contract.
Datasets produce Item objects, collated into ItemBatch objects, both having NamedTensor attributes.
Datasets produce tensors with the following dimensions: (batch, timestep, lat, lon, features). Models can flatten or reshape the spatial dimensions in prepare_batch, but the rest of the system expects features to always be the last dimension of the tensors.
Neural network architectures are Python classes that inherit from both ModelABC and PyTorch's nn.Module. The latter means it is quick to insert a third-party pure PyTorch model in the system (see for instance the code for Lucidrains' Segformer or a U-Net).
We use dataclasses and dataclass_json to define the settings whenever possible. This allows us to easily serialize and deserialize the settings to/from json files with Schema validation.
The NamedTensor allows us to keep track of the physical/weather parameters along the features dimension and to pass a single consistent object in the system. It is also a way to factorize common operations on tensors (concat along features dimension, flatten in place, ...) while keeping the dimension and feature names metadata in sync.
We use PyTorch-lightning to train the models. This allows us to easily scale the training to multiple GPUs and to use the same training loop for all the models. We also use the PyTorch-lightning logging system to log the training metrics and the hyperparameters.
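
The interface-contract idea from the list above can be sketched with the standard library alone. Everything in this example is hypothetical and simplified: ToyModelABC is not py4cast's ModelABC, ToySettings is not a real settings class, and the JSON round trip uses plain dataclasses and json rather than dataclass_json, so no schema validation is shown.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ToySettings:
    """Hypothetical settings object; real py4cast settings differ."""

    hidden_size: int = 64
    scale: float = 1.0


class ToyModelABC(ABC):
    """Hypothetical contract: every model must expose these two members."""

    @property
    @abstractmethod
    def settings(self) -> ToySettings: ...

    @abstractmethod
    def predict(self, features: List[float]) -> List[float]: ...


class ScalingModel(ToyModelABC):
    """Complies with the contract, so the rest of the system can use it."""

    def __init__(self, settings: ToySettings) -> None:
        self._settings = settings

    @property
    def settings(self) -> ToySettings:
        return self._settings

    def predict(self, features: List[float]) -> List[float]:
        return [self._settings.scale * value for value in features]


class IncompleteModel(ToyModelABC):
    """Forgets to implement `predict`, so it cannot be instantiated."""

    @property
    def settings(self) -> ToySettings:
        return ToySettings()


def settings_round_trip(settings: ToySettings) -> ToySettings:
    """Serialize the settings to JSON text and rebuild them from it."""
    return ToySettings(**json.loads(json.dumps(asdict(settings))))
```

Instantiating IncompleteModel raises a TypeError, which is how an ABC flags a class that does not respect the contract before any training starts.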

Ideas for future improvements

Ideally, we could end up with a simple base class system for the training strategies to allow for easy addition of new strategies.
The ItemBatch class attributes could be generalized to have multiple inputs, outputs and forcing tensors referenced by name. This would allow for more flexibility in the models and make it possible to plug in metnet-3 and Pangu.
The distinction between prognostic and diagnostic variables should be made explicit in the system.
We should probably reshape back the GNN outputs to (lat, lon) gridded shape as early as possible to have this as a common/standard output format for all the models. This would simplify the post-processing, plotting, ... We still have if statements in the code to handle the different output shapes of the models.
