-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
2.SpeedingUpTheSlowOption #78
base: main
Are you sure you want to change the base?
Conversation
The dictionary's keys should be
|
|
Currently testing this with a reduced dataset containing 3 realm types but many different var_ids. Script is working fine for this test but a more strenuous test case with proper standard names is needed. Dictionary logic is working nicely and it doesn't seem that files are being unnecessarily opened anymore. I verified this in a few ways but mostly by forcibly stopping the script when a match was found in the unique_datasets dictionary. Side note: It would be nice if we could find a creative way to identify whether or not a file will even have a real standard name before it is opened. This would stop us from opening a file just to get "NA" as standard name. May not be a possibility though... |
Agreed, good point! Here is a possibility that MDTF will also appreciate so that xarray open call is still useful. standard_name: (str) if a standard_name is not defined for a variable in the target file, use the equivalent CMIP6 standard_name or the long_name with underscores in place of spaces (e.g., air temperature -> air_temperature) This means we will also grab the long_name (non-empty) when we open the file. We use long_name when standard name is empty, except replace the space to underscores. E.g. in getinfo.py https://github.com/NOAA-GFDL/CatalogBuilder/blob/main/catalogbuilder/intakebuilder/getinfo.py#L213
Note that we will be making up these standard names and they are not the official CF ones, but it serves the purpose here and for the broader analysis needs. |
When timing the slow option on main and on this branch the difference is.... negligible. ~5 seconds on this branch and ~7 seconds on main. It would be interesting to see if the difference will be greater when testing with a larger dataset. Also, some code auditing is needed to ensure we're being efficient. |
I have a new test case for you. /archive/a1r/fre/FMS2024.02_OM5_20240724/testcase_1/gfdl.ncrc5-intel23-prod-openmp/pp/ pp/river mostly does not have standard name, so expect long name to be used. 1- dmget all the data beforehand so we don't account for this in our timing tests. |
Test Instructions:
|
@Ciheim should the PR be marked ready for review? |
can you paste the exact command you used to test including your template.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dictionary logic looks clear and correct to me, but I didn't actually run it. Nice work, Ciheim.
There are a series of warnings raised. Do you see it in your test too @Ciheim. Can we find out why and document it in an issue. Also, add a line to not print those warnings unless --verbose is on. example-
|
@Ciheim I've made some other minor changes to the PR branch. So pull new changes as needed. Once this reviewed and warnings are moved to logged.debug, this can be merged in. I did a simple test on a smaller test case since spear has some permission issues. |
Config used for testing: #what kind of directory structure to expect? the output_path_template is set as follows.#We have NA in those values that do not match up with any of the expected headerlist (CSV columns), otherwise we #catalog headers headerlist: ["activity_id", "institution_id", "source_id", "experiment_id", #what kind of directory structure to expect? the output_path_template is set as follows.#We have NA in those values that do not match up with any of the expected headerlist (CSV columns), otherwise we output_path_template: ['NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA','NA','NA','NA','NA','NA','NA','NA','NA','NA','NA','NA'] output_file_template: ['realm','time_range','variable_id'] #OUTPUT FILE INFO is currently passed as command-line argument. ####################################################### input_path: "/archive/a1r/fre/FMS2024.02_OM5_20240724/testcase_1/gfdl.ncrc5-intel23-prod-openmp/pp" |
Not sure why it posted in a weird way.. |
Improving the speed of the slow option by keep track of what we open