
Challenge #24 - CliMetLab - Machine Learning on weather and climate data #13

EsperanzaCuartero opened this issue Jan 28, 2021 · 10 comments

@EsperanzaCuartero (Contributor) commented Jan 28, 2021

Challenge #24 - CliMetLab - Machine Learning on weather and climate data

Stream 2 - Machine Learning for weather, climate and atmosphere applications

Goal

Extend a new Python ML package and help it mature.

Mentors and skills


Challenge description

CliMetLab is a Python package aiming at simplifying access to climate and meteorological datasets, allowing users to focus on science instead of technical issues such as data access and data formats. It is mostly intended to be used in Jupyter notebooks and to be interoperable with all popular data-analytics packages, such as NumPy, Pandas, Xarray, SciPy, Matplotlib, etc., as well as Machine Learning frameworks, such as TensorFlow, Keras or PyTorch (a minimal usage sketch is shown after the task list below). Several tasks are proposed:

  • Task 1: extend CliMetLab so that it offers users high-level Matplotlib-based plotting functions to produce graphs and plots relevant to weather and climate applications (e.g. plume plots, ROC curves, …).

  • Task 2: the Python package Intake is a lightweight set of tools for loading and sharing data in data-science projects. Extend CliMetLab so that it seamlessly interfaces with Intake and allows users to access all Intake-compatible datasets.

  • Task 3: Xarray uses the Zarr data format to allow parallel reads and writes. Convert large, already-available datasets to the Xarray-readable Zarr format, define appropriate configurations (chunking/compression/other) according to domain use cases, develop tools to benchmark them when used on a cloud platform, and compare Zarr to other formats (N5, GRIB, NetCDF, GeoTIFF, etc.).
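
As an illustration of the intended high-level workflow, here is a minimal sketch adapted from the CliMetLab documentation; the dataset name and options are illustrative:

```python
# Minimal sketch of the high-level workflow CliMetLab targets
# (adapted from the CliMetLab documentation; dataset name is illustrative).
import climetlab as cml

# Download (and cache) a demo dataset, hiding format and access details.
data = cml.load_dataset("hurricane-database", bureau="atlantic")

# Hand the data over to familiar data-science tools.
df = data.to_pandas()

# High-level plotting: CliMetLab picks sensible defaults for the map.
cml.plot_map(df)
```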


@EsperanzaCuartero added the stream-2 (Stream 2 - Machine Learning for weather, climate and atmosphere applications) label on Jan 28, 2021
@veds12 commented Mar 1, 2021

Hey! I am Vedant, a pre-final-year undergraduate student. I have mainly been working in the fields of Machine Learning and AI in general and have some experience developing Python libraries in that area. I am interested in working on this challenge. It would be great if you could provide some more details on the work, how to get started, etc.

@EsperanzaCuartero (Contributor, Author) commented:

Hi Vedant, thanks for your interest. The mentors will provide more details about the challenge as soon as possible. Best, Esperanza

@floriankrb (Contributor) commented:

Hello Vedant, depending on your background/interest/time, you may want to focus more on one of the three tasks offered here or address all of them.
Regarding task 3, a first step would be to take a small NetCDF file (a few MBytes) found online (search for "netcdf sample dataset") or a GRIB file, write it as Zarr or other formats, and then compare these alternatives (see the sketch below).
Regarding task 1, using matplotlib to reproduce some of the plots at https://confluence.ecmwf.int/display/MAGP/Magics+Tutorial with artificial data may be a good start.
We will soon stabilize the CliMetLab plugin API that is needed to address task 2 (regarding Intake); in the meantime, understanding the logic of the Intake project (by reading the documentation) and how to use its datasets in Python would be a good start.
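
For instance, with xarray ("sample.nc" is a placeholder for any small sample NetCDF file):

```python
# Sketch of the first step suggested for task 3: read a small NetCDF sample,
# rewrite it as Zarr and as NetCDF, and compare the on-disk sizes.
import os
import xarray as xr

ds = xr.open_dataset("sample.nc")   # placeholder file name

ds.to_zarr("sample.zarr", mode="w")  # requires the zarr package
ds.to_netcdf("sample_copy.nc")

def dir_size(path):
    """Total size in bytes of a file or directory (Zarr stores are directories)."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    return sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(path)
        for f in files
    )

for path in ["sample_copy.nc", "sample.zarr"]:
    print(path, dir_size(path) / 1e6, "MB")
```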

@jwagemann (Contributor) commented:

Hi,
join us for the ECMWF Summer of Weather Code Ask Me Anything session and learn all things ESoWC.

When: Wednesday, 24 March 2021 at 4 pm GMT

What: learn everything about ESoWC (how it works, this year's challenges, some tips for your proposal) and hear about ESoWC experiences from previous participants

How: register here.

@vidurmithal commented:

Hi! I'm interested in this challenge and have prior experience working with meteorological datasets using the Pangeo stack (Zarr, Xarray, etc.).

If I understand correctly, plotting in the CliMetLab library is currently done using Magics, and you want that to be extended to allow for the creation of other kinds of plots using Matplotlib? Would this require the creation of an additional Matplotlib driver within plotting? It would be great if you could provide some information on what you foresee in terms of plotting functionality.

@b8raoult commented Apr 5, 2021

@vidurmithal this is a good idea. The most important thing is that CliMetLab is seen as a framework with a plug-in architecture. So yes, support for different plotting software is a good idea as long as you can ensure that the specifics of that software are somewhat hidden from the end user. The aim of CliMetLab is also to provide high-level functions so that users can focus on science. Of course, users could also be given access to lower-level functionality, as long as it is optional.

@vidurmithal commented:

Thank you for your response @b8raoult.

So, if I understand correctly, for Task 1 you are looking at something like the plotting functionality built into libraries like pandas, geopandas and even xarray, where calling .plot() on a dataframe or dataset automatically creates a plot of a suitable type (inferred from the data's dimensions and types) using the matplotlib back-end. In CliMetLab, this would be accessed via the .plot_map() method on a dataset.
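
For example, xarray already infers a plot type from the data's dimensionality (a small illustration with synthetic data):

```python
# Illustration of the "infer a suitable plot from the data" behaviour that
# xarray already provides on top of matplotlib.
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt

# 1D data -> xarray draws a line plot.
xr.DataArray(np.sin(np.linspace(0, 6, 50)), dims="time").plot()
plt.show()

# 2D data -> xarray draws a pcolormesh (filled 2D plot) with a colour bar.
xr.DataArray(np.random.rand(20, 30), dims=("lat", "lon")).plot()
plt.show()
```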

@b8raoult commented Apr 5, 2021

Yes, that is correct. One of the challenges is to route a call to cml.plot_map(some_dataset, some_options) to the right plotting backend and generate a reasonable plot based on the type of dataset. CliMetLab already plots 2D fields (e.g. xarray) as isoline maps, using a colour palette based on the plotted variable (temperature, pressure, etc.), and plots observations (e.g. pandas) as red dots on the map, but this is still very preliminary. I think the word "infer" that you used is the correct one. Currently, this is done by consulting a series of objects: the dataset, the source, the reader if the source is file-based, and the helper if the datatype is not a CliMetLab object (e.g. a NumPy array).
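
Very roughly, the routing idea looks like this (only a sketch of the concept, not the actual CliMetLab code):

```python
# Rough sketch of type-based routing for a plot_map-style call; this is NOT
# the actual CliMetLab implementation, just an illustration of the idea.
import pandas as pd
import xarray as xr

def plot_map(data, **options):
    """Route the call to a plotting helper inferred from the data type."""
    if isinstance(data, (xr.DataArray, xr.Dataset)):
        return _plot_field(data, **options)          # 2D fields -> isoline/filled map
    if isinstance(data, pd.DataFrame):
        return _plot_observations(data, **options)   # point observations -> dots on a map
    raise NotImplementedError(f"No plotting helper for {type(data)}")

def _plot_field(data, **options):
    # A real driver (Magics or matplotlib) would pick a colour palette
    # based on the plotted variable (temperature, pressure, ...).
    ...

def _plot_observations(data, **options):
    # A real driver would scatter lat/lon points on a map background.
    ...
```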

@floriankrb (Contributor) commented:

I got more questions by email:

For Task 1, I think we want matplotlib as an alternative/replacement for Magics in CliMetLab, but some clarification would be helpful!

Magics has so many features that a full replacement is out of scope; this task is to explore that path, though. For CliMetLab users, it would offer a way to plot the data nicely (as nicely as with Magics) with the tools they are used to (i.e. matplotlib). For plugin developers, providing visualization code (alongside the plugin code that accesses the data) may be easier with a matplotlib driver.
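
For example, a plume-style plot with artificial ensemble data and plain matplotlib could look like this (just a sketch):

```python
# Sketch of a plume-style plot (ensemble forecast spread over lead time) with
# plain matplotlib and artificial data, in the spirit of the Magics tutorial plots.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
lead_times = np.arange(0, 120, 6)  # forecast hours
# 50 artificial ensemble members drifting around 15 degrees.
members = 15 + np.cumsum(rng.normal(0, 0.5, size=(50, lead_times.size)), axis=1)

fig, ax = plt.subplots()
ax.plot(lead_times, members.T, color="steelblue", alpha=0.2, lw=0.8)
ax.plot(lead_times, members.mean(axis=0), color="firebrick", lw=2, label="ensemble mean")
ax.set_xlabel("Forecast lead time (h)")
ax.set_ylabel("2m temperature (°C)")
ax.set_title("Plume plot (artificial data)")
ax.legend()
plt.show()
```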

For Task 2, I understand that the idea would be to get 'Intake' as a plugin package (...)

Yes, the plugin we expect for Intake would be a 'source' plugin (not a dataset plugin); see the documentation for an example: https://climetlab.readthedocs.io/en/latest/contributing/sources.html
Please note that the plugin API may still change (but the logic will remain the same).
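
As a first step, the Intake usage such a 'source' plugin would wrap looks roughly like this (the catalog URL and entry name are placeholders; the CliMetLab-facing part is left out because the plugin API is still being stabilised):

```python
# Sketch of the Intake usage a 'source' plugin would wrap.
import intake

cat = intake.open_catalog("https://example.org/catalog.yaml")  # placeholder URL
print(list(cat))                 # list the datasets the catalog exposes

entry = cat["some_dataset"]      # placeholder entry name
ds = entry.to_dask()             # lazy load (for drivers that support dask)
# ds = entry.read()              # or load eagerly
```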

For Task 3, (...) but not really sure regarding the different format comparison, so would like to ask you for some clarification.

To elaborate on the short sentence "define appropriate configurations (chunking/compression/other) according to domain use cases, develop tools to benchmark them when used on a cloud platform, and compare to other formats (N5, GRIB, NetCDF, GeoTIFF, etc.)", here are a few questions I have in mind. I believe that answering some of them would fulfill task 3.
Is it more efficient to use the Zarr format vs NetCDF vs GRIB vs others? Obviously, using one unique chunk in a Zarr dataset is not the best. What would be the best chunking strategy? Does it depend on the dataset size, or on the user access pattern (i.e. reading time series vs reading maps)? Why? Is it OK to use chunks of 1 kB? 1 MB? 1 GB? 10 GB? If compression can be done with different options, what is the impact of each? Why should I use Zarr instead of N5 or Parquet? Which format offers better compression, download speed, compression/decompression speed, and why?
And very importantly, how can we benchmark all of this? It could be interesting to use some tiny/small/medium/large datasets (10 MB, 1 GB, 1 TB, more) to test some of these ideas, and any other ideas this could generate.
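
As an illustration of what such a benchmark could look like (synthetic data; chunk shapes, codecs and access patterns are only examples):

```python
# Tiny benchmark comparing chunking/compression choices for Zarr on a
# synthetic dataset; all sizes and codecs below are just examples.
import time
import numpy as np
import xarray as xr
from numcodecs import Blosc

ds = xr.Dataset(
    {"t2m": (("time", "lat", "lon"), np.random.rand(365, 180, 360).astype("float32"))}
)

configs = {
    "big_chunks_zstd": {"chunks": (365, 180, 360), "compressor": Blosc(cname="zstd", clevel=3)},
    "small_chunks_lz4": {"chunks": (30, 45, 90), "compressor": Blosc(cname="lz4", clevel=5)},
}

for name, cfg in configs.items():
    path = f"bench_{name}.zarr"
    encoding = {"t2m": {"chunks": cfg["chunks"], "compressor": cfg["compressor"]}}

    start = time.perf_counter()
    ds.to_zarr(path, mode="w", encoding=encoding)
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    # Simulate a "time series at one grid point" access pattern.
    xr.open_zarr(path)["t2m"][:, 90, 180].load()
    read_s = time.perf_counter() - start

    print(f"{name}: write {write_s:.2f}s, read(point series) {read_s:.2f}s")
```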

@aaronspring commented:

Regarding task 1, https://xskillscore.readthedocs.io/en/stable/api/xskillscore.roc.html#xskillscore.roc provides a calculation for ROC.
