
DataLoaders and DataModules for Reproducibility of different Data Challenges #67

Open · nilsleh opened this issue Jan 10, 2024 · 3 comments

**@nilsleh** (Contributor) commented Jan 10, 2024

Thanks for the work on this benchmark dataset. I have spent some time on the documentation page and the repo; however, I am still a bit lost on how to use the data for the different tasks. Ideally, I would like to train my own interpolation model on any of the tasks, evaluate it, and compare the results on the leaderboard. The README mentions an "End-to-End" example, but I could not find a corresponding notebook.

The closest example I have found is "Evaluating an SSH Mapping algorithm using OceanBench and 4dvarnet-starter". Is the expectation that I copy the code from "Get ocean bench task data", "From task data to Patcher", and "Torch Dataset and DataLoader" and then replace the 4DVarNet configuration with my own model? Given that this is a benchmark dataset, shouldn't all these steps be configured and pre-defined to ensure reproducibility? I suppose I was hoping for something like:

```python
from oceanbench import OceanBenchDataModule

model = MyLightningModule()
# or maybe a separate DataModule for each data challenge
datamodule = OceanBenchDataModule(root="path_to_data")
trainer = ...
trainer.fit(model, datamodule)
# this generates a result file that lets the user compare results on the leaderboard
trainer.predict(model, datamodule)
```

Meaning that the XrTorchDataset logic and the loader logic are already integrated, so that everyone is using the same data setup.

Apologies if I misunderstood something or have been looking in the wrong places, but I am not sure how to get started with trying my own model on any of the four data challenges with a reproducible setup.

Thanks in advance for any suggestions :)

**@nilsleh** (Contributor, Author) commented Jan 16, 2024

I have written up a Lightning DataModule that I would use for training in a gist, showing how I would use the parts from the 4dvarnet tutorial. I am not sure if this is correct or how you intended it to be used, so I would be grateful for any feedback :)

**@jejjohnson** (Owner) commented Jan 17, 2024

Hi! Thank you very much for your interest! And thanks for raising your concerns as we are still trying to improve how we explain what we are doing. Let me try to explain the different tasks (from my perspective). I would appreciate any feedback on my explanation because I would like to include this write-up within the documentation. You can find the write-up here.


## Estimation Problem

For the OceanBench: SSH edition, our objective is to estimate the best state given some partial observations and some parameters or auxiliary information.

$$ f:\text{Observations}\times\text{Params} \rightarrow \text{State} $$

In particular, we have formulated this as a state estimation problem whereby we want to estimate the state, i.e., the full SSH field, $\eta$, given some observations, e.g., $\eta_{obs}$, $\eta_{atrack}$, or $T$. In OceanBench, all we do is provide datasets for the specific challenges as well as a framework (with examples) showing how you can create your own datasets with our custom preprocessing routines, dataloaders, and metrics. For a user to get started with inference right away, you're correct: I think the end-to-end 4DVarNet example is the best place to start. It demonstrates:

- How to create an inference dataloader (see the sketch after this list)
- How to visualize results and compute metrics
- How to use hydra to help with configurations

So a user can load the validation data (only) for the different challenges and get started right away with making predictions (basically zero-shot predictions).
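For concreteness, here is a minimal sketch of that dataloader step, assuming a patcher object that supports `len()` and integer indexing and yields xarray patches. The `XrTorchDataset` name follows the tutorial, but this exact interface is my assumption here, not the canonical OceanBench API:

```python
# Minimal sketch: wrapping a patcher in a torch Dataset for inference.
# Assumption: `patcher` supports len() and patcher[idx] -> xarray patch.
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class XrTorchDataset(Dataset):
    """Turns each xarray patch into a float32 torch tensor."""

    def __init__(self, patcher, item_postpro=None):
        self.patcher = patcher
        self.item_postpro = item_postpro

    def __len__(self):
        return len(self.patcher)

    def __getitem__(self, idx):
        item = self.patcher[idx]  # one xarray patch
        if self.item_postpro is not None:
            item = self.item_postpro(item)
        return torch.from_numpy(item.values.astype(np.float32))


# usage: `patcher` comes from the task data (see the 4DVarNet tutorial)
# ds = XrTorchDataset(patcher)
# loader = DataLoader(ds, batch_size=4, shuffle=False)  # no shuffling at inference
```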

However, most people cannot start here right away: we do not provide any training loops, and the inference data does not give the user access to the ground truth, so this entry point only works if you already have a pretrained model. The 4DVarNet has a "training" loop because it solves a minimization problem, but most ML people probably will not use this (yet). To try something more typical (like a UNet), they would need to start with training, which I outline below.


## Training Problem

As I mentioned above, most people need to train something before getting started with inference. So we also try to provide some helpful data and tools so that users can learn the parameters for their own models. We use a general definition of learning whereby we try to learn the best parameters, $\theta$, for some model, $\boldsymbol{f}$, given some dataset $\mathcal{D}_{\text{tr}}$. This covers essentially every loss and objective function in the ML world: we provide a training dataset, $\mathcal{D}_{\text{tr}}$, and the user provides their own training objective, model, and trainer.
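As a concrete illustration, here is a minimal sketch of the user-supplied side, assuming a supervised setup where each batch pairs partial observations with the full SSH state. The batch layout and all names here are assumptions for illustration, not OceanBench API:

```python
# Minimal sketch: user-supplied model f_theta plus training objective,
# here plain MSE between the predicted and true SSH patches.
import pytorch_lightning as pl
import torch


class SSHEstimator(pl.LightningModule):
    def __init__(self, net: torch.nn.Module, lr: float = 1e-3):
        super().__init__()
        self.net = net
        self.lr = lr

    def training_step(self, batch, batch_idx):
        obs, state = batch  # partial observations, full SSH field (assumed layout)
        pred = self.net(obs)  # f_theta: Observations -> State
        loss = torch.nn.functional.mse_loss(pred, state)
        self.log("train/mse", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)
```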

We try to demonstrate tools within the OceanBench framework to help users start training. So for people interested in training (almost everyone), the easiest place to start is the "from tasks to datasets" example. Of course, they are welcome to use any of the tools from the estimation problem tutorials. We only ask that users not use any of the data that we use for validation.

The main difference between the training, inference, and validation datasets is which period and region are used, and we take special care to make sure the inference and validation data have no overlap with the training data. See the table below.

| Challenge       | Training Data Period     | Inference Data Period    | Validation Period        |
|-----------------|--------------------------|--------------------------|--------------------------|
| OSSE I, II, III | [2013-01-01, 2013-09-30] | [2012-10-01, 2012-12-02] | [2012-10-22, 2012-12-02] |
| OSE             | [2016-12-01, 2018-01-31] | [2016-12-01, 2018-01-31] | [2017-01-01, 2017-12-31] |

The only challenge without a distinct period for training and inference/validation is the OSE challenge. However, these are real observations where we don't have the full ground truth field. So we remove the observations from one satellite entirely and use them as our split.
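To make the split concrete, here is a minimal sketch of selecting the OSSE training and validation periods from the table above with xarray (the file name and the `time` coordinate name are assumptions):

```python
# Minimal sketch: time-based train/validation split for the OSSE challenges.
import xarray as xr

ds = xr.open_dataset("osse_ssh.nc")  # hypothetical OSSE dataset with a "time" coordinate

# periods taken from the table above
train = ds.sel(time=slice("2013-01-01", "2013-09-30"))
val = ds.sel(time=slice("2012-10-22", "2012-12-02"))

# sanity check: the validation period ends before the training period begins
assert val.time.max() < train.time.min()
```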


## Lightning Data Module

> I have written up a Lightning DataModule that I would use for training in a gist, showing how I would use the parts from the 4dvarnet tutorial. I am not sure if this is correct or how you intended it to be used, so I would be grateful for any feedback :)

This is the kind of contribution that would make the life of new users even easier! This is great!

**@nilsleh** (Contributor, Author) commented Jan 17, 2024

Hi, thanks so much for the elaborate response! My questions were mainly about the Training Problem, and your explanation is quite clear. I must admit that after some more digging, the "from tasks to datasets" example became clearer to me. Nevertheless, I personally like to work with Lightning because it offers such nice functionality without having to change the code. Thus, if you were interested, I'd be happy to make a PR for an OceanBenchDataModule of some sort. From such a DataModule, all the separate dataloaders would still be available for people who do not want to use Lightning but pure PyTorch instead. Additionally, it might offer the possibility to integrate the patcher_from_..._task() functions so they do not have to be redefined in different notebooks, giving a DataModule that people could subclass if they want to change certain parts. A rough sketch of what I have in mind is below.
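A minimal sketch of what such an OceanBenchDataModule could look like, assuming the train/validation torch Datasets are built elsewhere (e.g., from the patcher helpers); all names here are hypothetical, not existing OceanBench API:

```python
# Minimal sketch: a DataModule that bundles per-challenge dataloaders while
# still exposing them to plain-PyTorch users (no Trainer required).
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset


class OceanBenchDataModule(pl.LightningDataModule):
    def __init__(self, train_ds: Dataset, val_ds: Dataset,
                 batch_size: int = 4, num_workers: int = 0):
        super().__init__()
        self.train_ds = train_ds
        self.val_ds = val_ds
        self.batch_size = batch_size
        self.num_workers = num_workers

    def train_dataloader(self) -> DataLoader:
        return DataLoader(self.train_ds, batch_size=self.batch_size,
                          shuffle=True, num_workers=self.num_workers)

    def val_dataloader(self) -> DataLoader:
        return DataLoader(self.val_ds, batch_size=self.batch_size,
                          shuffle=False, num_workers=self.num_workers)
```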

Apart from that, I just started running OceanBench experiments today with a focus on Uncertainty Quantification, so if you have the interest/time, I'd also appreciate an opportunity to talk to you about that.
