
Challenge #24 - CliMetLab - Machine Learning on weather and climate data #13

EsperanzaCuartero opened this issue Jan 28, 2021 · 10 comments

@EsperanzaCuartero (Contributor) commented Jan 28, 2021

Challenge #24 - CliMetLab - Machine Learning on weather and climate data

Stream 2 - Machine Learning for weather, climate and atmosphere applications

Goal

Extend a new Python ML package and help it mature.

Mentors and skills


Challenge description

CliMetLab is a Python package aiming at simplifying access to climate and meteorological datasets, allowing users to focus on science instead of technical issues such as data access and data formats. It is mostly intended to be used in Jupyter notebooks and to be interoperable with all popular data-analytics packages, such as NumPy, Pandas, Xarray, SciPy, Matplotlib, etc., as well as Machine Learning frameworks, such as TensorFlow, Keras or PyTorch (a minimal usage sketch is shown after the task list below). Several tasks are proposed:

  • Task 1: extend CliMetLab so that it offers users high-level Matplotlib-based plotting functions to produce graphs and plots relevant to weather and climate applications (e.g. plume plots, ROC curves, …).

  • Task 2: the Python package Intake is a lightweight set of tools for loading and sharing data in data-science projects. Extend CliMetLab so that it seamlessly interfaces with Intake and allows users to access all Intake-compatible datasets.

  • Task 3: Xarray uses the Zarr data format to allow parallel reads and writes. Convert large, already-available datasets to the Xarray-readable Zarr format, define appropriate configurations (chunking/compression/other) according to domain use cases, develop tools to benchmark them when used on a cloud platform, and compare Zarr to other formats (N5, GRIB, NetCDF, GeoTIFF, etc.).
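
As an illustration of the intended high-level workflow, here is a minimal sketch adapted from the CliMetLab documentation; the dataset name and options are illustrative:

```python
# Minimal sketch of the high-level workflow CliMetLab targets
# (adapted from the CliMetLab documentation; dataset name is illustrative).
import climetlab as cml

# Download (and cache) a demo dataset, hiding format and access details.
data = cml.load_dataset("hurricane-database", bureau="atlantic")

# Hand the data over to familiar data-science tools.
df = data.to_pandas()

# High-level plotting: CliMetLab picks sensible defaults for the map.
cml.plot_map(df)
```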


@EsperanzaCuartero added the stream-2 (Stream 2 - Machine Learning for weather, climate and atmosphere applications) label on Jan 28, 2021
@veds12 commented Mar 1, 2021

Hey! I am Vedant, a pre-final-year undergraduate student. I have mainly been working in the fields of Machine Learning and AI in general and have some experience developing Python libraries in that area. I am interested in working on this challenge. It would be great if you could provide some more details on the work, how to get started, etc.

@EsperanzaCuartero (Contributor, Author) commented:

Hi Vedant, thanks for your interest. The mentors will provide more details about the challenge as soon as possible. Best, Esperanza

@floriankrb (Contributor) commented:

Hello Vedant, depending on your background/interest/time, you may want to focus more on one of the three tasks offered here or address all of them.
Regarding task 3, a first step would be to take a small NetCDF file (a few MBytes) found online (search for "netcdf sample dataset") or a GRIB file, write it as Zarr or other formats, and then compare these alternatives (see the sketch below).
Regarding task 1, using matplotlib to reproduce some of the plots at https://confluence.ecmwf.int/display/MAGP/Magics+Tutorial with artificial data may be a good start.
We will soon stabilize the CliMetLab plugin API that is needed to address task 2 (regarding Intake); in the meantime, understanding the logic of the Intake project (by reading the documentation) and how to use its datasets in Python would be a good start.
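
For instance, with xarray ("sample.nc" is a placeholder for any small sample NetCDF file):

```python
# Sketch of the first step suggested for task 3: read a small NetCDF sample,
# rewrite it as Zarr and as NetCDF, and compare the on-disk sizes.
import os
import xarray as xr

ds = xr.open_dataset("sample.nc")   # placeholder file name

ds.to_zarr("sample.zarr", mode="w")  # requires the zarr package
ds.to_netcdf("sample_copy.nc")

def dir_size(path):
    """Total size in bytes of a file or directory (Zarr stores are directories)."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    return sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(path)
        for f in files
    )

for path in ["sample_copy.nc", "sample.zarr"]:
    print(path, dir_size(path) / 1e6, "MB")
```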

@jwagemann (Contributor) commented:

Hi,
join us for the ECMWF Summer of Weather Code Ask Me Anything session and learn all things ESoWC.

When: Wednesday, 24 March 2021 at 4 pm GMT

What: learn everything about ESoWC (how it works, this year's challenges, some tips for your proposal) and hear about ESoWC experiences from previous participants

How: register here.

@vidurmithal commented:

Hi! I'm interested in this challenge and have prior experience working with meteorological datasets using the Pangeo stack (Zarr, Xarray, etc.).

If I understand correctly, plotting in the CliMetLab library is currently done using Magics, and you want that to be extended to allow for the creation of other kinds of plots using Matplotlib? Would this require the creation of an additional Matplotlib driver within plotting? It would be great if you could provide some information on what you foresee in terms of plotting functionality.

@b8raoult commented Apr 5, 2021

@vidurmithal this is a good idea. The most important thing is that CliMetLab is seen as a framework with a plug-in architecture. So yes, support for different plotting software is a good idea as long as you can ensure that the specifics of that software are somewhat hidden from the end user. The aim of CliMetLab is also to provide high-level functions so that users can focus on science. Of course, users could also be given access to lower-level functionality, as long as it is optional.

@vidurmithal commented:

Thank you for your response @b8raoult.

So, if I understand correctly, for Task 1 you are looking at something like the plotting functionality built into libraries like pandas, geopandas and even xarray, where calling .plot() on a dataframe or dataset automatically creates a plot of a suitable type (inferred from the data's dimensions and types) using the matplotlib back-end. In CliMetLab, this would be accessed via the .plot_map() method on a dataset.
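
For example, xarray already infers a plot type from the data's dimensionality (a small illustration with synthetic data):

```python
# Illustration of the "infer a suitable plot from the data" behaviour that
# xarray already provides on top of matplotlib.
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt

# 1D data -> xarray draws a line plot.
xr.DataArray(np.sin(np.linspace(0, 6, 50)), dims="time").plot()
plt.show()

# 2D data -> xarray draws a pcolormesh (filled 2D plot) with a colour bar.
xr.DataArray(np.random.rand(20, 30), dims=("lat", "lon")).plot()
plt.show()
```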

@b8raoult commented Apr 5, 2021

Yes, that is correct. One of the challenges is to route a call to cml.plot_map(some_dataset, some_options) to the right plotting backend and generate a reasonable plot based on the type of dataset. CliMetLab already plots 2D fields (e.g. xarray) as isoline maps, using a colour palette based on the plotted variable (temperature, pressure, etc.), and plots observations (e.g. pandas) as red dots on the map, but this is still very preliminary. I think the word "infer" that you used is the correct one. Currently, this is done by consulting a series of objects: the dataset, the source, the reader if the source is file-based, and the helper if the datatype is not a CliMetLab object (e.g. a NumPy array).
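
Very roughly, the routing idea looks like this (only a sketch of the concept, not the actual CliMetLab code):

```python
# Rough sketch of type-based routing for a plot_map-style call; this is NOT
# the actual CliMetLab implementation, just an illustration of the idea.
import pandas as pd
import xarray as xr

def plot_map(data, **options):
    """Route the call to a plotting helper inferred from the data type."""
    if isinstance(data, (xr.DataArray, xr.Dataset)):
        return _plot_field(data, **options)          # 2D fields -> isoline/filled map
    if isinstance(data, pd.DataFrame):
        return _plot_observations(data, **options)   # point observations -> dots on a map
    raise NotImplementedError(f"No plotting helper for {type(data)}")

def _plot_field(data, **options):
    # A real driver (Magics or matplotlib) would pick a colour palette
    # based on the plotted variable (temperature, pressure, ...).
    ...

def _plot_observations(data, **options):
    # A real driver would scatter lat/lon points on a map background.
    ...
```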

@floriankrb (Contributor) commented:

I got more questions by email:

For Task 1, I think we want matplotlib as an alternative/replacement for Magics in CliMetLab, but some clarification would be helpful!

Magics has so many features that a full replacement is out of scope; this task is to explore that path, though. For CliMetLab users, it would offer a way to plot the data nicely (as nicely as with Magics) with the tools they are used to (i.e. matplotlib). For plugin developers, providing visualization code (alongside the plugin code that accesses the data) may be easier with a matplotlib driver.
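
For example, a plume-style plot with artificial ensemble data and plain matplotlib could look like this (just a sketch):

```python
# Sketch of a plume-style plot (ensemble forecast spread over lead time) with
# plain matplotlib and artificial data, in the spirit of the Magics tutorial plots.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
lead_times = np.arange(0, 120, 6)  # forecast hours
# 50 artificial ensemble members drifting around 15 degrees.
members = 15 + np.cumsum(rng.normal(0, 0.5, size=(50, lead_times.size)), axis=1)

fig, ax = plt.subplots()
ax.plot(lead_times, members.T, color="steelblue", alpha=0.2, lw=0.8)
ax.plot(lead_times, members.mean(axis=0), color="firebrick", lw=2, label="ensemble mean")
ax.set_xlabel("Forecast lead time (h)")
ax.set_ylabel("2m temperature (°C)")
ax.set_title("Plume plot (artificial data)")
ax.legend()
plt.show()
```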

For Task 2, I understand that the idea would be to get 'Intake' as a plugin package (...)

Yes, the plugin we expect for Intake would be a 'source' plugin (not a dataset plugin); see the documentation for an example: https://climetlab.readthedocs.io/en/latest/contributing/sources.html
Please note that the plugin API may still change (but the logic will remain the same).
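
As a first step, the Intake usage such a 'source' plugin would wrap looks roughly like this (the catalog URL and entry name are placeholders; the CliMetLab-facing part is left out because the plugin API is still being stabilised):

```python
# Sketch of the Intake usage a 'source' plugin would wrap.
import intake

cat = intake.open_catalog("https://example.org/catalog.yaml")  # placeholder URL
print(list(cat))                 # list the datasets the catalog exposes

entry = cat["some_dataset"]      # placeholder entry name
ds = entry.to_dask()             # lazy load (for drivers that support dask)
# ds = entry.read()              # or load eagerly
```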

For Task 3, (...) but not really sure regarding the different format comparison, so would like to ask you for some clarification.

To elaborate on the short sentence "define appropriate configurations (chunking/compression/other) according to domain use cases, develop tools to benchmark them when used on a cloud platform, and compare to other formats (N5, GRIB, NetCDF, GeoTIFF, etc.)", here are a few questions I have in mind. I believe that answering some of them would fulfill task 3.
Is it more efficient to use the Zarr format vs NetCDF vs GRIB vs others? Obviously, using one unique chunk in a Zarr dataset is not the best. What would be the best chunking strategy? Does it depend on the dataset size, or on the user access pattern (i.e. reading time series vs reading maps)? Why? Is it OK to use chunks of 1 kB? 1 MB? 1 GB? 10 GB? If compression can be done with different options, what is the impact of each? Why should I use Zarr instead of N5 or Parquet? Which format offers better compression, download speed, compression/decompression speed, and why?
And very importantly, how can we benchmark all of this? It could be interesting to use some tiny/small/medium/large datasets (10 MB, 1 GB, 1 TB, more) to test some of these ideas, and any other ideas this could generate.
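
As an illustration of what such a benchmark could look like (synthetic data; chunk shapes, codecs and access patterns are only examples):

```python
# Tiny benchmark comparing chunking/compression choices for Zarr on a
# synthetic dataset; all sizes and codecs below are just examples.
import time
import numpy as np
import xarray as xr
from numcodecs import Blosc

ds = xr.Dataset(
    {"t2m": (("time", "lat", "lon"), np.random.rand(365, 180, 360).astype("float32"))}
)

configs = {
    "big_chunks_zstd": {"chunks": (365, 180, 360), "compressor": Blosc(cname="zstd", clevel=3)},
    "small_chunks_lz4": {"chunks": (30, 45, 90), "compressor": Blosc(cname="lz4", clevel=5)},
}

for name, cfg in configs.items():
    path = f"bench_{name}.zarr"
    encoding = {"t2m": {"chunks": cfg["chunks"], "compressor": cfg["compressor"]}}

    start = time.perf_counter()
    ds.to_zarr(path, mode="w", encoding=encoding)
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    # Simulate a "time series at one grid point" access pattern.
    xr.open_zarr(path)["t2m"][:, 90, 180].load()
    read_s = time.perf_counter() - start

    print(f"{name}: write {write_s:.2f}s, read(point series) {read_s:.2f}s")
```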

@aaronspring commented:

Regarding task 1, https://xskillscore.readthedocs.io/en/stable/api/xskillscore.roc.html#xskillscore.roc provides a calculation for ROC.
