The future of pangeo CMIP6 in the cloud #31
Julius, thanks so much for raising this important discussion. You are asking all the right questions.

**Political Stuff**

For me, the elephant in the room is the future of ESGF itself. I think our project has demonstrated the value of an ARCO copy of CMIP6 in the cloud. That demonstration has influenced how ESGF is thinking about their infrastructure, as evidenced by the engagement of ESGF core members in our ongoing working group. At the same time, it has become clear that, although we have plenty of ideas, we (as in you, me, and @cisaacstern) do not have the bandwidth ourselves to expand the scope of what we are already doing without additional resources (person power and/or funding).

In my ideal world, the job of providing ARCO CMIP6 data access in a cloud-native way would simply be taken over by ESGF. Their mandate is to provide access to the CMIP6 data, and they have a whole boatload of funding to do so. I would love for the ESGF infrastructure to be so fast, performant, and easy to use that the Zarr data copy becomes unnecessary. On the other hand, the existing Zarr data has many users now dependent on it, so we can't just delete it. To decide how to proceed, we need answers from ESGF. Specifically:
As explained by Scott, ESGF 2.0 won't kick off until Congress passes a budget, so we are in a holding pattern on all of this.

**Technical Stuff**
This is exactly what is demonstrated in this Pangeo Forge tutorial - https://pangeo-forge.readthedocs.io/en/latest/tutorials/xarray_zarr/cmip6-recipe.html - which demonstrates, for a single dataset, that it is indeed feasible to use Pangeo Forge to produce CMIP6 Zarrs in an automated way.

Another really transformative development since we started this project has been the creation of kerchunk. With kerchunk, we can have the best of both worlds: Zarr-style access to the existing netCDF data in the cloud. Kerchunk also allows us to virtually concatenate multiple files. That would let us basically reproduce the existing Zarr convenience (fast access, a single store for all timesteps) without duplicating a petabyte of data. So going forward, I feel strongly that we should be exploring this option. The main downside is that it currently only works with Python, and we don't know what languages people are using to access the Zarr data; Julia folks certainly are.

Bottom line: I would love to hand off the problem of simply providing the CMIP6 data on the cloud to ESGF in some way. Then we could focus on adding value on top of that data, namely by generating the kerchunk indexes and then moving on to the exciting problem of derived datasets. But in the meantime, the most important priority is to remove retracted data. So thanks for working on that! 👏
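As a minimal sketch of that kerchunk workflow (the file URLs below are hypothetical placeholders; the kerchunk/fsspec calls follow the documented reference-filesystem pattern):

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# Hypothetical URLs for two consecutive netCDF files of one dataset.
urls = [
    "https://example-esgf-node.org/cmip6/tas_Amon_MODEL_historical_r1i1p1f1_gn_185001-189912.nc",
    "https://example-esgf-node.org/cmip6/tas_Amon_MODEL_historical_r1i1p1f1_gn_190001-194912.nc",
]

# Scan each HDF5/netCDF file once, recording the byte ranges of its chunks.
refs = []
for url in urls:
    with fsspec.open(url) as f:
        refs.append(SingleHdf5ToZarr(f, url, inline_threshold=300).translate())

# Virtually concatenate the per-file references along the time dimension.
combined = MultiZarrToZarr(refs, concat_dims=["time"]).translate()

# Open the combined reference set as if it were a single Zarr store.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": combined, "remote_protocol": "https"},
    },
)
```

The point is that only the tiny reference JSON is stored; every byte of data is still read from the original netCDF files.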
Hey @rabernat, thanks for the mention. Youthful exuberance will be my undoing, as ESGF 2.0 is yet to be awarded, linked to the issue you mention. We have to be cautious about being too public with our planning (something that does, to no surprise to those who know me, irk me). Happy to talk offline. Let's just say the proposal not only seeks to link heavily to Pangeo but to contribute to it where appropriate.
Just adding the right handle to this thread: @aradhakrishnanGFDL, welcome aboard!
@rabernat and I have worked on a prototype script that would take a dataset id and build the corresponding Zarr store. I am extremely excited to see the first test run working. During the CMIP/ESGF zoom today we did, however, see some things that need to be improved:
Please feel free to post any feedback/comments here.
Great to see you here @durack1! You're welcome at our bi-weekly working group meetings any time! (Though I'm sure you already have enough meetings to keep you busy. 😉)
Thanks @rabernat! We have WIP representation with @matthew-mizielinski, so I think we have our bases covered, but I'm also happy to be pinged (@durack1) if you want a secondary perspective chimed in. I am normally locked up with kiddy dropoff until ~8:15 PST, so I would have a hard time attending the early meetings anyway.
Hey folks, it has been a while. I just wanted to give an update on our efforts. @cisaacstern and I have been working hard on a single recipe (the cmip6-feedstock), where we were quite successful in automating @naomi-henderson's efforts, so that the user only has to provide the dataset id. We have used several requests (pangeo-forge/cmip6-feedstock#10, pangeo-forge/cmip6-feedstock#5, pangeo-forge/cmip6-feedstock#3) to expand the recipe to several tens of datasets.

After some initial successes we realized that encapsulating the entirety of CMIP6 into a single recipe will be extraordinarily difficult to maintain. Together with @rabernat, we agree that we need to split CMIP6 into several feedstocks and centralize the 'generator code' (which queries the ESGF API to get both the URLs and the dynamically generated keyword arguments for the recipe) in a single repository: https://github.com/jbusecke/pangeo-forge-esgf.

But the question I would bring up for discussion is how to split up the feedstocks. I believe that splitting the feedstocks along the model (`source_id`) facet … I propose a two-stage organization:
I would love to get some feedback from folks here on possible other aspects that I might have overlooked.
I have just migrated some logic to pangeo-forge-esgf that could form the basis for such a bot. I'd imagine the user would make some request like: **I want all historical data of variable …**
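A sketch of how such a meta-request might be expanded into concrete dataset ids, going straight against the public ESGF search REST API rather than pangeo-forge-esgf itself (the index node URL and the facet values are just examples):

```python
import requests

# One public ESGF index node; any search endpoint would do.
SEARCH_URL = "https://esgf-node.llnl.gov/esg-search/search"

def expand_request(facets, limit=10000):
    """Turn a coarse facet query into a list of dataset instance_ids."""
    params = {
        "type": "Dataset",
        "format": "application/solr+json",
        "fields": "instance_id",
        "latest": "true",
        "retracted": "false",
        "replica": "false",
        "limit": limit,  # 10000 is the API maximum; pagination omitted for brevity
        **facets,
    }
    resp = requests.get(SEARCH_URL, params=params)
    resp.raise_for_status()
    docs = resp.json()["response"]["docs"]
    return [d["instance_id"] for d in docs]

# Example: "all historical data of variable tas" (variable chosen purely for illustration).
ids = expand_request(
    {"mip_era": "CMIP6", "experiment_id": "historical", "variable_id": "tas", "table_id": "Amon"}
)
```

The bot would then open one PR (or issue) per resulting batch of ids against the appropriate feedstock.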
Where do you envision this meta-request would be made? That is, from where is the bot's activity triggered?
Excellent question. I was naively thinking that we could convert the … But this is something I am fairly inexperienced with, so maybe you have a better approach?
I would like to start a high-level discussion about the priorities and organization of the pangeo CMIP6 archive in the cloud.
I have officially(?) taken over this effort, and would first and foremost like to thank @naomi-henderson for her tireless work in getting this effort off the ground! It is amazing what is already possible with all this data.
Overall I would like to discuss:
Data generation/organization
I think the central organizational model works well: we store ‘datasets’ (time-concatenated) as Zarr stores in a cloud bucket, and index them in a CSV catalog keyed on the unique combination of “facets”. What we should discuss here is how to populate the cloud bucket. This is closely related to the question of how users can request additional datasets.
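For readers landing here, access to the current archive looks roughly like this. A minimal sketch, assuming the public pangeo-cmip6 CSV catalog with a `zstore` column of Zarr paths; the URL and facet values are illustrative and may differ from the canonical ones:

```python
import pandas as pd
import xarray as xr

# The CSV catalog; this URL is illustrative of the pangeo-cmip6 catalog layout.
cat = pd.read_csv("https://storage.googleapis.com/cmip6/pangeo-cmip6.csv")

# Select one dataset by its unique facet combination.
row = cat.query(
    "source_id == 'CESM2' and experiment_id == 'historical' "
    "and table_id == 'Amon' and variable_id == 'tas' and member_id == 'r1i1p1f1'"
).iloc[0]

# Each catalog row points at one time-concatenated Zarr store
# (opening gs:// paths requires gcsfs to be installed).
ds = xr.open_zarr(row.zstore, consolidated=True)
```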
Previously this was managed using a request form and manual creation/upload of the datasets.
I hope we can eventually fully automate this, but I am aware that there might be a transition phase.
Ultimately I would like to be able to build a pangeo-forge recipe from a dataset_id string (exact facets used pending an answer here), and have it build and upload the final Zarr store entirely in the cloud (cc @cisaacstern). But since this is likely a more involved undertaking, what do folks think about intermediate solutions? A sketch of the kind of entry point I have in mind follows.
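This assumes the usual dot-delimited CMIP6 dataset_id template; since the exact facet set is precisely the open question above, treat the `FACETS` list as a placeholder:

```python
# Assumed CMIP6 dataset_id template (the exact facet set is still an open question):
# mip_era.activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.version
FACETS = [
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
]

def parse_dataset_id(dataset_id: str) -> dict:
    """Split a dot-delimited dataset_id into a facet dictionary."""
    values = dataset_id.split(".")
    if len(values) != len(FACETS):
        raise ValueError(f"Expected {len(FACETS)} facets, got {len(values)}")
    return dict(zip(FACETS, values))

facets = parse_dataset_id(
    "CMIP6.CMIP.NCAR.CESM2.historical.r1i1p1f1.Amon.tas.gn.v20190308"
)
# The facets can then drive both the ESGF query and the recipe's target Zarr path.
```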
Related: As suggested here, we could consider augmenting the existing global attributes with other attributes that might make things easier in the long run.
Filtering retracted datasets
The basic idea (established by @naomi-henderson) is that we maintain a “raw” catalog with every store ever created, and then filter it down to a user-facing “main” catalog, which includes only valid datasets and only the newest versions.
For a more in-depth discussion, see #30.
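As a sketch of that raw-to-main filtering step (the catalog filenames and the `instance_id`/`version` column layout are assumptions; the ESGF search API does expose a `retracted` facet):

```python
import pandas as pd
import requests

# Hypothetical "raw" catalog with every store ever built.
raw = pd.read_csv("pangeo-cmip6-raw.csv")

# Ask ESGF which CMIP6 datasets have been retracted
# (pagination over the 10000-result limit omitted for brevity).
resp = requests.get(
    "https://esgf-node.llnl.gov/esg-search/search",
    params={
        "type": "Dataset",
        "format": "application/solr+json",
        "fields": "instance_id",
        "retracted": "true",
        "mip_era": "CMIP6",
        "limit": 10000,
    },
)
retracted = {d["instance_id"] for d in resp.json()["response"]["docs"]}

# Drop retracted stores, then keep only the newest version of each dataset.
main = raw[~raw.instance_id.isin(retracted)].copy()
main["dataset_id"] = main.instance_id.str.rsplit(".", n=1).str[0]  # strip version facet
main = main.sort_values("version").groupby("dataset_id").tail(1)
main.to_csv("pangeo-cmip6.csv", index=False)
```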
User communication and derived datasets
I think it is very important that we establish better visibility of “what is going on” and some sort of way to inform users of new developments. I really want to minimize the interactions via email!
Some ideas I had:
Any other ideas from people here?
Another issue that I (and seemingly other users) care about a lot is derived datasets. I have been prototyping some ways to generate derived datasets but am not quite sure what the best way is for a broader audience to contribute them. You can find more discussion here.
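To make “derived datasets” concrete, a toy sketch: compute an annual-mean product from one existing store and write it back as its own Zarr. The store paths and the provenance attribute are illustrative, not an agreed convention:

```python
import xarray as xr

# Source: one of the existing time-concatenated stores (illustrative path).
ds = xr.open_zarr(
    "gs://cmip6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Amon/tas/gn/",
    consolidated=True,
)

# Derived product: annual means, computed lazily with dask.
annual = ds.resample(time="AS").mean()
annual.attrs["derived_from"] = "tas_Amon_CESM2_historical_r1i1p1f1_gn"  # provenance note

# Write the derived dataset to its own (hypothetical) bucket prefix.
annual.to_zarr("gs://cmip6-derived/tas-annual-mean/CESM2/historical/r1i1p1f1/", mode="w")
```

The hard part is less the computation than deciding how contributions like this get reviewed, cataloged, and kept in sync with retractions of their source data.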
cc @rabernat