WIP: Add rechunking example #197

Closed


thodson-usgs
Contributor

This PR adds an example script demonstrating how to rechunk a VirtualiZarr dataset with Cubed.

However, this is still a WIP. I'm creating the PR to elicit feedback about what changes might be necessary in order for the script to run as intended. @TomNicholas and @norlandrhagen might have some thoughts.

After creating the combined virtual dataset, I specify the source chunking before passing it off to Cubed for rechunking:

source_chunks = {'Time':1, 'south_north':250, 'west_east':320}

combined_chunked = combined_ds.chunk(
    chunks = source_chunks,
)

combined_chunked.chunks

returns

Frozen({'Time': (1, 1, 1, 1), 'south_north': (250,), 'west_east': (320,), 'interp_levels': (9,), 'soil_layers_stag': (4,)})

The virtual dataset contains four files, indicated by 'Time': (1, 1, 1, 1).
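To make the chunk notation concrete: in the xarray/dask convention, each dimension maps to a tuple of block lengths, so the tuple's length is the number of blocks and its sum is the dimension's length. The following is a minimal pure-Python sketch of that reading, using the mapping reported above; the helper name `describe` is hypothetical.

```python
# Chunk layout copied from the output above: one tuple of block lengths
# per dimension (xarray/dask convention).
chunks = {
    "Time": (1, 1, 1, 1),
    "south_north": (250,),
    "west_east": (320,),
    "interp_levels": (9,),
    "soil_layers_stag": (4,),
}


def describe(chunks):
    """Return {dim: (dimension length, number of blocks)}."""
    return {dim: (sum(blocks), len(blocks)) for dim, blocks in chunks.items()}


layout = describe(chunks)
print(layout["Time"])         # (4, 4): length 4, split into 4 blocks (one per file)
print(layout["south_north"])  # (250, 1): length 250, a single block
```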

Then I attempt to rechunk:

from cubed.primitive.rechunk import rechunk

target_chunks = {'Time':5, 'south_north':25, 'west_east':32}

rechunk(
    combined_chunked['TMAX'], # requires shape attr, so can't pass full Dataset
    target_chunks=target_chunks,
    source_array_name='virtual',
    int_array_name='temp',
    allowed_mem=2000,
    reserved_mem=1000,
    target_store="test.zarr",
    #temp_store="s3://cubed-thodson-temp",
)

which errors with

TypeError: can't multiply sequence by non-int of type 'tuple'
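For context, this is the message plain Python raises when one tuple is multiplied by another. That is consistent with a per-dimension tuple of block lengths reaching code that expects a single integer per dimension, though that reading is an assumption and is not confirmed by a traceback in this thread. A minimal reproduction of the message alone, without Cubed:

```python
# Multiplying a tuple by a tuple is not defined in Python and raises the
# same TypeError message as the one reported above.
try:
    (1, 1, 1, 1) * (250,)
except TypeError as err:
    message = str(err)

print(message)  # can't multiply sequence by non-int of type 'tuple'
```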

Apparently, Cubed won't tolerate the Time chunk tuple 'Time': (1, 1, 1, 1). Is there a simple way to convert it to 'Time': (1,)? Alternatively, I could prepare a PR to Cubed that sets the memory constraint based on the largest chunk size when chunks are variable.
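Both ideas can be sketched in pure Python. This is only an illustration under stated assumptions: the helper names `to_uniform_chunks` and `largest_chunk_nbytes` are hypothetical (they are not part of Cubed or xarray), TMAX is assumed to span the three dimensions named in `source_chunks`, and the 4-byte item size assumes float32 data.

```python
from math import prod

chunks = {
    "Time": (1, 1, 1, 1),
    "south_north": (250,),
    "west_east": (320,),
    "interp_levels": (9,),
    "soil_layers_stag": (4,),
}


def to_uniform_chunks(chunks):
    """Collapse per-dimension block tuples to one int per dimension.

    Raises ValueError unless the blocks are regular: all equal, except
    that the final block may be smaller.
    """
    uniform = {}
    for dim, blocks in chunks.items():
        first, *rest = blocks
        if any(b != first for b in rest[:-1]) or (rest and rest[-1] > first):
            raise ValueError(f"irregular chunks along {dim!r}: {blocks}")
        uniform[dim] = first
    return uniform


def largest_chunk_nbytes(chunks, dims, itemsize):
    """Size in bytes of the largest block over the given dimensions."""
    return prod(max(chunks[d]) for d in dims) * itemsize


uniform = to_uniform_chunks(chunks)
print(uniform["Time"])  # 1

# Assumed: TMAX has dims (Time, south_north, west_east) and float32 values.
nbytes = largest_chunk_nbytes(chunks, ("Time", "south_north", "west_east"), 4)
print(nbytes)  # 320000
```

The second helper takes the per-dimension maximum, so it also gives an upper bound when the blocks along a dimension vary in size.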

kerchunk

I also tested this workflow with kerchunk, but I ran into a bug while following the Pythia cookbook example:

/home/runner/miniconda3/envs/kerchunk-cookbook/lib/python3.10/site-packages/kerchunk/combine.py:370: UserWarning: Concatenated coordinate 'Time' contains less than expected number of values across the datasets: [0]
  warnings.warn(

@TomNicholas
Member

TomNicholas commented Jul 20, 2024 via email

TomNicholas added the "usage example" label (Real world use case examples) on Jul 21, 2024
@thodson-usgs
Contributor Author

> I don't think you should import the rechunk primitive from Cubed. I think instead you should open the kerchunked dataset as an Xarray dataset using cubed-xarray, then call Xarray's chunk method with the desired chunks.

Yes, that seems to work, but I'm still working through several errors when I write out to Zarr. I'll report more in a day or two.
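A minimal sketch of the suggested approach, assuming `xarray`, `cubed`, `cubed-xarray`, and `kerchunk` are installed. The function name, the default `work_dir`, and the `allowed_mem` value are hypothetical and chosen only for illustration. The write step is the part reported above as still producing errors, so treat the `to_zarr` call as the intended operation rather than a verified one.

```python
# Target chunking taken from the original post.
TARGET_CHUNKS = {"Time": 5, "south_north": 25, "west_east": 32}


def rechunk_with_cubed_xarray(refs_path, out_store, target_chunks=TARGET_CHUNKS,
                              work_dir="tmp", allowed_mem="2GB"):
    """Open kerchunk references as a Cubed-backed Dataset, rechunk, write Zarr."""
    # Imported here so the sketch can be loaded without the optional dependencies.
    import cubed
    import xarray as xr

    # Assumed memory budget and scratch directory; tune for the real workload.
    spec = cubed.Spec(work_dir=work_dir, allowed_mem=allowed_mem)

    ds = xr.open_dataset(
        refs_path,
        engine="kerchunk",           # xarray backend registered by kerchunk
        chunks={},                   # keep the source (on-disk) chunking
        chunked_array_type="cubed",  # needs cubed-xarray installed
        from_array_kwargs={"spec": spec},
    )

    # Xarray's own chunk method, as suggested, instead of the Cubed primitive.
    rechunked = ds.chunk(target_chunks)
    rechunked.to_zarr(out_store, mode="w")
    return rechunked
```

A call might look like `rechunk_with_cubed_xarray("combined.json", "test.zarr")`, where `combined.json` is a placeholder name for the kerchunk reference file.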

@TomNicholas
Member

> Yes, that seems to work, but I'm still working through several errors when I write out to Zarr. I'll report more in a day or two.

Great - very curious to see the details.

I think what you're doing here should live in the cubed repo though: once you have the kerchunk reference files on disk, virtualizarr is out of the picture and all of the rechunking is about using cubed. I do think that this use case would make an important example to have in the cubed docs though, as it's basically showing how the original rechunker package is just a special case of cubed (cc @tomwhite).

@tomwhite
Collaborator

Sounds great. Happy for this to be added as a Cubed example.

@thodson-usgs
Contributor Author

Closing this and opening a PR on Cubed instead: cubed-dev/cubed#520.
