Ignore concat_dim if only one file is passed #275

Open
jbusecke opened this issue Feb 4, 2022 · 4 comments

Comments

@jbusecke
Contributor

jbusecke commented Feb 4, 2022

We ran into an issue with our prototype script to create zarr stores from CMIP6 netCDF files today. I believe the issue is that we generally want to concatenate files along the 'time' dimension, except when the file does not have a time dimension.

I am wondering if this could be implemented as check-and-ignore logic in the recipe itself. Within https://github.com/pangeo-forge/pangeo-forge-recipes/blob/master/pangeo_forge_recipes/recipes/xarray_zarr.py, could we check the number of files passed and, if only one is passed, check whether the concat_dim is present? If it is not, the recipe should just default to a single chunk.

This would allow smooth processing of many datasets without the need to introspect the datasets beforehand.
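A minimal sketch of the check proposed above, not code from pangeo-forge-recipes. `effective_concat_dim` is a hypothetical helper, and each file is represented only by a mapping of dimension name to length (with xarray, something like `dict(ds.sizes)` after opening the file). It returns `None` when a single file lacks the concat dimension, which the caller could treat as "write a single chunk":

```python
from typing import Mapping, Optional, Sequence


def effective_concat_dim(
    file_dims: Sequence[Mapping[str, int]], concat_dim: str
) -> Optional[str]:
    """Return concat_dim, or None when it should be ignored.

    Hypothetical helper: file_dims holds one {dim_name: length} mapping
    per input file. The concat dimension is ignored only in the case
    described in this issue: exactly one file, and that file does not
    contain the dimension.
    """
    if len(file_dims) == 1 and concat_dim not in file_dims[0]:
        return None
    return concat_dim


# A single file without a time dimension (e.g. a fixed field).
single_static = [{"lat": 180, "lon": 360}]
# A single file that does have a time dimension.
single_timeseries = [{"time": 12, "lat": 180, "lon": 360}]

print(effective_concat_dim(single_static, "time"))
print(effective_concat_dim(single_timeseries, "time"))
```

With more than one file the helper leaves `concat_dim` untouched, so any inconsistency between files would still surface as an error rather than being silently ignored.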

@rabernat
Contributor

rabernat commented Feb 4, 2022

I think the problem may have been that I explicitly specified target_chunks={'time': 3} in the recipe. Without that, things may have just worked. Could you check this?

https://github.com/pangeo-data/pangeo-cmip6-cloud/blob/master/zarr_from_esgf.py#L92
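If the explicit target_chunks turns out to be the culprit, one possible workaround is to drop entries for dimensions the dataset does not have before building the recipe. This is only an illustration: `prune_target_chunks` is a hypothetical helper, not part of pangeo-forge-recipes, and the dataset is again represented by a plain mapping of dimension name to length.

```python
from typing import Dict, Mapping


def prune_target_chunks(
    target_chunks: Mapping[str, int], dims: Mapping[str, int]
) -> Dict[str, int]:
    """Keep only the chunk settings whose dimension exists in the dataset.

    Hypothetical helper: dims is a {dim_name: length} mapping describing
    the dataset that is about to be written.
    """
    return {dim: size for dim, size in target_chunks.items() if dim in dims}


# The 'time' entry is dropped for a dataset without a time dimension.
print(prune_target_chunks({"time": 3}, {"lat": 180, "lon": 360}))
```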

@jbusecke
Contributor Author

jbusecke commented Feb 4, 2022

You did, and this is actually a parameter that needs to change from dataset to dataset: depending on the dimensionality and the size of the lateral dimensions, we need different time chunks to keep the chunk size in the optimal range.

What do you think about some additional logic that loads the first URL and either:

  • determines that the file has no time dimension, or
  • calculates an appropriate chunk size depending on the lateral dimensions?

I think that is not a bad approach, but in that case we probably want to move this issue back to https://github.com/pangeo-data/pangeo-cmip6-cloud?
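The two steps listed above could be sketched roughly as follows. Everything here is an assumption for illustration: `time_chunk_length` is a hypothetical helper, the 100 MB target and the 4-byte item size (float32) are placeholder values rather than recommendations from this thread, and the first file is described by a mapping of dimension name to length (with xarray, something like `dict(ds.sizes)`).

```python
import math
from typing import Mapping, Optional


def time_chunk_length(
    sizes: Mapping[str, int],
    itemsize: int = 4,  # assumed bytes per value (float32)
    target_bytes: int = 100_000_000,  # assumed target chunk size
    time_dim: str = "time",
) -> Optional[int]:
    """Pick a chunk length along time_dim, or None if the dim is absent.

    Hypothetical helper: sizes is a {dim_name: length} mapping taken
    from the first file of a dataset.
    """
    if time_dim not in sizes:
        # Step 1: the file has no time dimension, nothing to chunk.
        return None
    # Step 2: bytes occupied by one time step across all other dims.
    other = [n for dim, n in sizes.items() if dim != time_dim]
    bytes_per_step = itemsize * math.prod(other)
    steps = target_bytes // bytes_per_step
    # At least one step per chunk, at most the full time axis.
    return max(1, min(sizes[time_dim], steps))


print(time_chunk_length({"time": 1980, "lat": 180, "lon": 360}))
print(time_chunk_length({"lat": 180, "lon": 360}))
```

A result of `None` signals that no time entry should be put into target_chunks at all, which covers the single-file case from the start of this issue.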

@jbusecke
Contributor Author

jbusecke commented Nov 6, 2024

Pinging this again, since I just ran into it once more. I realize that my use case (many 100k datasets in CMIP) is fairly on the edge of the use cases here, so I am trying to work with minimal changes to pgf-recipes. I wonder whether enabling 'concat_dim=None' in the pattern (and downstream) would be a viable option? It seems that some of the internals would already accept None as input. I might try to submit a WIP PR, but wanted to ask folks here about it first.

@jbusecke
Contributor Author

Trying to test/implement this in #783
