Utility to extract, reshape, and store data for a subset of the data. e.g. for extracting timeseries for single PV sites from gridded NWPs #141

Open
JackKelly opened this issue Jun 21, 2024 · 0 comments
Labels
enhancement New feature or request performance Improvements to runtime performance usability Make things more user-friendly

JackKelly commented Jun 21, 2024

If I put on my hat as an energy forecasting ML researcher, then one of the "dreams" would be to use a single on-disk dataset (e.g. 500 TBytes of NWPs) for multiple ML experiments:

  1. a neural net, which takes in dense imagery from NWPs and satellite imagery, covering the same regions in space and time
  2. an XGBoost model to forecast solar PV power for a handful of specific sites. For each site, the input might be a single "pixel" (single lat lon location), across time.

If the data is chunked on disk to support use-case 1 (the neural net) then we might use chunks something like y=128, x=128, t=1, c=10. But that sucks for use-case 2 (which only wants a single pixel).
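To put a rough number on how badly the dense layout fits the single-pixel use-case, here is a back-of-envelope sketch. It assumes each request starts on a chunk boundary and that a whole chunk must be read to get any element in it. The function names are just illustrative, and the request length of 4096 timesteps is picked to match the timeseries chunk size suggested further down.

```python
from math import ceil, prod

# Chunk shapes from the two layouts discussed in this issue.
DENSE = {"y": 128, "x": 128, "t": 1, "c": 10}
SERIES = {"y": 1, "x": 1, "t": 4096, "c": 10}


def chunks_touched(chunks, request):
    """Number of chunks overlapping a request that starts on a chunk boundary."""
    return prod(ceil(request[dim] / chunks[dim]) for dim in chunks)


def read_amplification(chunks, request):
    """Elements read from disk divided by elements actually wanted."""
    elements_read = chunks_touched(chunks, request) * prod(chunks.values())
    elements_wanted = prod(request.values())
    return elements_read / elements_wanted


# Use-case 2: one pixel, 4096 timesteps, all 10 channels.
request = {"y": 1, "x": 1, "t": 4096, "c": 10}

print(chunks_touched(DENSE, request), read_amplification(DENSE, request))
print(chunks_touched(SERIES, request), read_amplification(SERIES, request))
```

Under those assumptions the dense layout touches 4096 chunks and reads 16384 times more elements than the request needs, whereas the timeseries layout touches one chunk and reads nothing extra.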

So it'd be nice to have a tool to:

  • easily extract long timeseries for a handful of sparse locations, and maybe save those in chunk sizes of something like y=1, x=1, t=4096, c=10
  • append to these timeseries
  • automatically append to the timeseries datasets when new timesteps are added to the dense dataset?

Maybe the ideal would be for the user to be able to express these conversions in a few lines of Python, perhaps using xarray, whilst still saturating the IO (e.g. a cloud instance with a 200 Gbps NIC, reading and writing from object storage). The user shouldn't have to worry about parallelising stuff.
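As a rough idea of what "a few lines of Python" might look like from the user's side, here is a sketch using plain xarray on a tiny in-memory array. This is not an existing light-speed-io API: `extract_sites`, the site names, the coordinates and the array shapes are all made up for illustration, and it says nothing about how the IO would be parallelised underneath.

```python
import numpy as np
import xarray as xr

# Tiny stand-in for the dense NWP dataset (hypothetical shapes and coordinates).
rng = np.random.default_rng(0)
dense = xr.DataArray(
    rng.random((6, 8, 8, 2)),
    dims=("time", "y", "x", "channel"),
    coords={
        "time": np.arange("2024-06-21T00", "2024-06-21T06", dtype="datetime64[h]"),
        "y": np.linspace(50.0, 57.0, 8),
        "x": np.linspace(-5.0, 2.0, 8),
    },
    name="nwp",
)

# A handful of sparse sites. Sharing the "site" dimension makes xarray select
# pointwise (one pixel per site) rather than the outer product of y and x.
sites = ["a", "b", "c"]
site_y = xr.DataArray([51.0, 55.0, 53.0], dims="site", coords={"site": sites})
site_x = xr.DataArray([-1.0, 0.0, 2.0], dims="site", coords={"site": sites})


def extract_sites(data, y, x):
    """Pull the nearest pixel for each site, laid out as (site, time, channel)."""
    picked = data.sel(y=y, x=x, method="nearest")
    return picked.transpose("site", "time", "channel")


series = extract_sites(dense, site_y, site_x)

# When a new timestep lands in the dense dataset, extract it and append.
new_step = dense.isel(time=[-1]).assign_coords(
    time=[dense.time.values[-1] + np.timedelta64(1, "h")]
)
series = xr.concat([series, extract_sites(new_step, site_y, site_x)], dim="time")
print(series.sizes)
```

For on-disk data, the same selection could presumably be written out with `to_zarr`, setting the chunk shape through the variable's `encoding` and appending with `append_dim="time"`, but the sketch above stays in memory.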

Perhaps you'd have multiple on-disk datasets (each optimised for a different read pattern). But the user wouldn't have to manually manage these multiple datasets. Instead the user would interact with a "multi-dataset" layer that would manage the underlying datasets (see #142).
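One very simplified way such a layer might decide which underlying dataset to read from is to pick the layout that reads the fewest elements for a given request. The sketch below is purely hypothetical: `Layout`, `MultiDataset` and `best_layout` are invented names, not part of any existing design, and the cost model assumes chunk-aligned requests and whole-chunk reads.

```python
from dataclasses import dataclass
from math import ceil, prod


@dataclass
class Layout:
    """One on-disk copy of the data, described only by its chunk shape."""

    name: str
    chunks: dict  # dimension name -> chunk length

    def elements_read(self, request):
        """Elements fetched for a chunk-aligned request (whole chunks only)."""
        n_chunks = prod(ceil(request[dim] / self.chunks[dim]) for dim in self.chunks)
        return n_chunks * prod(self.chunks.values())


class MultiDataset:
    """Hypothetical layer that routes each request to the cheapest layout."""

    def __init__(self, layouts):
        self.layouts = list(layouts)

    def best_layout(self, request):
        return min(self.layouts, key=lambda layout: layout.elements_read(request))


multi = MultiDataset(
    [
        Layout("dense", {"y": 128, "x": 128, "t": 1, "c": 10}),
        Layout("timeseries", {"y": 1, "x": 1, "t": 4096, "c": 10}),
    ]
)

image_request = {"y": 128, "x": 128, "t": 1, "c": 10}   # use-case 1
pixel_request = {"y": 1, "x": 1, "t": 4096, "c": 10}    # use-case 2

print(multi.best_layout(image_request).name)
print(multi.best_layout(pixel_request).name)
```

With these two layouts the image request is routed to the dense copy and the single-pixel request to the timeseries copy. A real layer would also need to keep the copies in sync, which this sketch ignores.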

@JackKelly JackKelly added enhancement New feature or request performance Improvements to runtime performance usability Make things more user-friendly labels Jun 21, 2024
@JackKelly JackKelly moved this to Todo in light-speed-io Jun 21, 2024