Proposed Recipe for historical NorESM2-LM vmo #128
Thanks for opening this issue, @jdldeauna! You should be able to create a `FilePattern` for this dataset along the following lines:

```python
import datetime
import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
BASE_URL = "https://data.org/dataset"
dates = pd.date_range("1850-01", "2010-01", freq="10YS")
time_concat_dim = ConcatDim("date", keys=dates)
def make_url(date):
    """With a start date as input, return a url terminating in
    ``{start}-{end}.nc``, where end is 10 years after the start
    date for years other than 2010. If the start date is 2010,
    the end date will be 5 years after the start date.

    :param date: The start date.
    """
    # assign 10 year interval for all years aside from 2010
    freq = "10YS" if date.year != 2010 else "5YS"
    # make a time range based on the assigned interval
    start, end = pd.date_range(date, periods=2, freq=freq)
    # subtract one day from the end of the range
    end = end - datetime.timedelta(days=1)
    # return the url with the timestamp in '%Y%m' format
    return f"{BASE_URL}/{start.strftime('%Y%m')}-{end.strftime('%Y%m')}.nc"
pattern = FilePattern(make_url, time_concat_dim)
for _, url in pattern.items():
    print(url)
```

which prints the constructed URLs.
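Working through `make_url` by hand with the placeholder `BASE_URL` above, the printed output should look something like the following (abbreviated here; worth confirming by actually running the snippet):

```
https://data.org/dataset/185001-185912.nc
https://data.org/dataset/186001-186912.nc
...
https://data.org/dataset/200001-200912.nc
https://data.org/dataset/201001-201412.nc
```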
If you instead assign `BASE_URL = 'http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gn/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gn_'`, I believe you will get the URLs for this dataset. Note also that we just pushed a big update to the recipe contribution docs this morning, which may be of help to you. Please let me know if this solves your problem. If not, I'm standing by to help. Really looking forward to bringing this data to the cloud together!
Thanks for your help, @cisaacstern! Right now I'm trying to decide how to chunk the dataset. When I try to download the URLs through xarray I get this error:

```python
urls = []
for _, url in pattern.items():
    urls.append(url)

ds = xr.open_mfdataset(urls)
```
```
OSError: [Errno -90] NetCDF: file not found: b'http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc'
```

I've tried using alternate THREDDS servers (e.g., `http://esgf-data3.ceda.ac.uk/` and `http://esgf3.dkrz.de/`), which also return the same error. I'm confused because when I try accessing an individual URL, I'm able to download the file associated with it. I can also download files from other data sources using `xr.open_mfdataset`. Would appreciate any advice, thanks!
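One way to pull down a single file for local inspection (a sketch, not something proposed in the thread; it assumes a plain HTTP GET against the THREDDS `fileServer` endpoint works, which the individual-URL test above suggests it does):

```python
import urllib.request

# Download one decadal file; adjust the URL and grid label ("gr" vs "gn") to
# match the file you actually want to inspect.
url = (
    "http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/"
    "CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/"
    "vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc"
)
local_path = "vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc"
urllib.request.urlretrieve(url, local_path)
```

The next reply does exactly this kind of single-file download before inspecting the data locally.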
Awesome progress, @jdldeauna! You are correct that it appears these urls are hard to open over HTTP with xarray (if there is a way, I also could not find it). As shown in the tutorial section you link on chunking, to determine the target chunks we only need a single representative file, so I downloaded one input locally and then opened it directly from the downloaded copy

```python
import xarray as xr

local_path = "vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc"
ds = xr.open_dataset(local_path)
```

which worked. Then, I applied the calculation from the tutorial you linked, except I bumped `chunksize_optimal`:

```python
ntime = len(ds.time) # the number of time slices
chunksize_optimal = 125e6 # desired chunk size in bytes
ncfile_size = ds.nbytes # the netcdf file size
chunksize = max(int(ntime * chunksize_optimal / ncfile_size), 1)
target_chunks = ds.dims.mapping
target_chunks['time'] = chunksize
target_chunks # a dictionary giving the chunk sizes in each dimension
```

which gives us the chunk size to use for each dimension, with `time` set to the value just computed.
We can then check how this chunking scheme affects the chunk size for the `vmo` variable:

```python
ds_chunked = ds.chunk(target_chunks)
ds_chunked.vmo
```
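To make the arithmetic above concrete, here is the same calculation with assumed numbers (120 monthly time steps per decadal file and roughly 1.8 GB in memory; illustrative values, not figures reported in this thread):

```python
ntime = 120                # assumed: 10 years of monthly time steps
ncfile_size = 1.8e9        # assumed in-memory size of one file, in bytes
chunksize_optimal = 125e6  # desired chunk size in bytes

chunksize = max(int(ntime * chunksize_optimal / ncfile_size), 1)
print(chunksize)  # -> 8, i.e. about 8 time steps (~120 MB) per chunk
```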
This is great! I was able to create the recipe. Unfortunately, when I try to execute it, this error appears:

```python
for input_key in recipe.iter_inputs():
    recipe.cache_input(input_key)
```

```
FileNotFoundError: http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc
```

I think it may have to do with the URL again, so I'm going to try and track down how best to download from the ESGF servers. Thanks again for all your help, @cisaacstern!
Awesome, @jdldeauna! Could you share the complete Python code you are using to create the recipe? I suspect there may be a way to resolve this, but I won't know for sure without the actual recipe, so I can try it out. Thanks!
Sure!

```python
import datetime
import pandas as pd
import xarray as xr
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes.xarray_zarr import XarrayZarrRecipe
BASE_URL = 'http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_'
dates = pd.date_range("1850-01", "2010-01", freq="10YS")
time_concat_dim = ConcatDim("time", keys=dates)
def make_url(time):
    """With a start date as input, return a url terminating in
    ``{start}-{end}.nc``, where end is 10 years after the start
    date for years other than 2010. If the start date is 2010,
    the end date will be 5 years after the start date.

    :param time: The start date.
    """
    # assign 10 year interval for all years aside from 2010
    freq = "10YS" if time.year != 2010 else "5YS"
    # make a time range based on the assigned interval
    start, end = pd.date_range(time, periods=2, freq=freq)
    # subtract one day from the end of the range
    end = end - datetime.timedelta(days=1)
    # return the url with the timestamp in '%Y%m' format
    return f"{BASE_URL}{start.strftime('%Y%m')}-{end.strftime('%Y%m')}.nc"
pattern = FilePattern(make_url, time_concat_dim)
# Decide on chunk size by downloading local copy of 1 file
local_path = "vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc"
ds = xr.open_dataset(local_path)
ntime = len(ds.time) # the number of time slices
chunksize_optimal = 125e6 # desired chunk size in bytes
ncfile_size = ds.nbytes # the netcdf file size
chunksize = max(int(ntime * chunksize_optimal / ncfile_size), 1)
target_chunks = ds.dims.mapping
target_chunks['time'] = chunksize
# the netcdf lists some of the coordinate variables as data variables. This is a fix which we want to apply to each chunk.
def set_bnds_as_coords(ds):
    new_coords_vars = [var for var in ds.data_vars if 'bnds' in var or 'bounds' in var]
    ds = ds.set_coords(new_coords_vars)
    return ds
recipe = XarrayZarrRecipe(
    pattern,
    target_chunks=target_chunks,
    process_chunk=set_bnds_as_coords,
    xarray_concat_kwargs={'join': 'exact'},
)
```

Executing the recipe:

```python
import zarr
for input_key in recipe.iter_inputs():
    recipe.cache_input(input_key)
```

```
FileNotFoundError: http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc
```
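Since the error suggests the input URL cannot be reached at caching time, a quick reachability check from Python can help separate a server problem from a recipe problem (a debugging sketch, not something suggested in the thread; it assumes `fsspec` with HTTP support, i.e. `aiohttp`, is installed):

```python
import fsspec

url = (
    "http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/"
    "CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/"
    "vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc"
)

# If this raises, the server (or the URL itself) is the problem rather than
# the recipe; if it succeeds, the first bytes of the file are returned.
with fsspec.open(url, mode="rb") as f:
    print(f.read(8))
```

As it turns out below, the server was temporarily down, which is exactly the kind of failure a check like this surfaces.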
Thanks for sharing! I don't seem to be able to test my hypothesis right now, because not even a direct request to the server is going through for me at the moment; it looks like it may be temporarily down.
Hi! I can use the URLs again now; it looks like the server is back up.
@jdldeauna, glad the server is back up! Executing the recipe actually does work for me now, without changing any kwargs. Note that it's to be expected that the input caching will take a long time, because each input is pretty large. To figure out the exact input size without downloading, we can use `wget` in spider mode:

```
$ wget --spider http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc
...
Length: 1795188500 (1.7G) [application/x-netcdf]
```

So that will take a long time to cache! Without logging turned on, it would be totally understandable to assume that such a long-running process has just stalled. To turn on debug logs, if you have not already:

```python
from pangeo_forge_recipes.recipes import setup_logging

setup_logging("DEBUG")
```

Then when running a subsetted test of the recipe with

```python
recipe_pruned = recipe.copy_pruned()
runner_func = recipe_pruned.to_function()
runner_func()
```

we see progress logs (including the download rate) during the caching stage.
So this input is 1.7 GB, which means that at 3.31 MB/sec we'd expect caching it to take roughly nine minutes. Let me know if that makes sense and what else I can do to help! We'll definitely get this to work 😄
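If `wget` is not handy, the same size check can be made from Python with an HTTP HEAD request (a sketch, assuming the server reports `Content-Length`, which the `wget --spider` output above indicates it does):

```python
import requests

url = (
    "http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/"
    "CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gr/v20190815/"
    "vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gr_185001-185912.nc"
)
size_bytes = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
print(f"{size_bytes / 1e9:.2f} GB")  # about 1.80 GB for this input
```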
Just because these inputs are so large, I might recommend assigning the cache to be a named local directory as follows:

```python
from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.storage import CacheFSSpecTarget

fs_local = LocalFileSystem()
recipe.storage_config.cache = CacheFSSpecTarget(fs_local, "cache")
```

Making this assignment before executing the test will write the files to a local directory named `cache`, so you won't have to re-download them on subsequent runs. Just to confirm, are you working with a local installation? For working with such large inputs I would definitely recommend a local installation as opposed to the Sandbox. This is for the same reason: to avoid having to cache these inputs more than once.
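After a caching run, the contents of that directory can be listed to confirm what has actually been downloaded (a small sketch reusing the `fs_local` object defined above):

```python
# List whatever inputs have been cached into the local "cache" directory so far.
print(fs_local.ls("cache"))
```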
I was able to run a pruned copy of the recipe on my local installation, thanks @cisaacstern! I did have a question about executing the recipe as outlined in the CMIP6 tutorial:

```python
import zarr
for input_key in recipe.iter_inputs():
    recipe.cache_input(input_key)
# use recipe to create the zarr store:
recipe.prepare_target()
# is it there?
zgroup = zarr.open(recipe.target_mapper)
print(zgroup.tree())
for chunk in recipe.iter_chunks():
    recipe.store_chunk(chunk)
recipe.finalize_target()
```

I was wondering if it's preferred to do this type of execution for CMIP6 files over this version:

```python
recipe_pruned = recipe.copy_pruned()
runner_func = recipe_pruned.to_function()
runner_func()
```
Awesome 🎉 If you are satisfied with the outcome of your local test, I invite you to make a PR for your recipe so we can move towards including it in the Pangeo Forge Cloud catalog. Regarding execution style, good question. The execution style is intentionally completely independent of the source data. In this case, the inconsistency you're seeing is a product of the fact that we have not actually deprecated the manual stage functions yet. We plan to do this, as noted in:
After this happens, we'll remove all of the remaining docs references to the older style (including the one you just noted) and the manual stage methods themselves.
Yep, I'm working on putting together the PR.
I see, thanks for the clarification!
Some questions on the `meta.yaml` for the staged recipe:

Thanks!
Yeah, we need better documentation for this. You can put
You can use the same one as in the example you linked. Right now, the default storage target for that bakery points to our Open Storage Network (OSN) bucket for Pangeo Forge. I believe your goal is to get this data into the existing CMIP6 catalog, so we'll have to see the best way to do that, which might involve adding an alternate storage target to this bakery, or perhaps just pointing the CMIP6 catalog listing for this dataset at our OSN bucket. Either is an option; I imagine @jbusecke can help us decide which is best.
Alright! Re: bakeries, I was confused by this post and thought the links were pointing to different bakeries, but they were actually just pointing to templates. 😅
Yes, please let me know if I can help with making it accessible. Thanks for all your help!
Heya folks, this looks like a really cool addition. I think this is a great way to get some CMIP6 output on the cloud while I am still working on a more automated way to upload new CMIP6 stores (pangeo-data/pangeo-cmip6-cloud#31). Just wanted to flag this here as possibly 'temporary'? I think it would be nice to ultimately keep all the CMIP6 stuff in one spot/catalog?
Sure! Just to follow my train of thought, I looked up the old Google Form which led me to Pangeo Forge, and I figured I would start with a staged recipe.
Sounds good! What would you recommend in terms of next steps for this dataset?
I think if this works, you should totally just work with it. I would be happy to iterate with you and @cisaacstern in the future to consolidate/automate the efforts here!
Source Dataset
Meridional ocean mass transport from the CMIP6 NorESM2-LM historical model
'http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gn/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gn_185001-185912.nc'
Transformation / Alignment / Merging
I tried using this template to make a recipe, but this dataset is not available in the s3://esgf-world bucket.
I tried exploring the `FilePattern` method but I'm confused how to set it up because the CMIP6 ESGF URLs are written per decade (i.e., 'http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/Omon/vmo/gn/v20190815/vmo_Omon_NorESM2-LM_historical_r1i1p1f1_gn_185001-185912.nc'), with the last file containing only the last five years (...201001-201412.nc). Does `FilePattern` only accept a `time_concat_dim` object?

Output Dataset
Stored with the rest of the CMIP6 outputs in Pangeo Cloud -
https://storage.googleapis.com/cmip6/pangeo-cmip6-noQC.json
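For context, that catalog is an intake-esm datastore; a sketch of how the dataset could be looked up there once added (assuming intake-esm is installed; the facet names are the standard CMIP6 ones):

```python
import intake

# Open the Pangeo CMIP6 catalog referenced above and search for this dataset.
col = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6-noQC.json"
)
cat = col.search(source_id="NorESM2-LM", experiment_id="historical", variable_id="vmo")
print(cat.df)  # one row per matching zarr store, empty until the data is added
```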