Address diffs v2.12.1 to v3 #907

Draft · wants to merge 24 commits into main

Conversation

@chengzhuzhang (Contributor) commented Dec 11, 2024

Description

Bug Fixes:

  • Fixed faulty logic in the cf_units.Unit code that replaced genutil.udunits, which previously produced incorrect results.

Performance Improvements:

  • Improved subsetting logic by performing a time slice before loading time series datasets into memory, which reduces dataset size and improves performance (see the sketch after this list)
  • Optimized variable derivation by subsetting and loading datasets into memory with .load(scheduler="sync") first, since the convert_units() function requires in-memory access for cf_units.Unit operations -- affected MERRA2 datasets
  • Addressed an issue with single-point time dimensions in climatology datasets (e.g., OMI-MLS). xCDAT now squeezes the time dimension to remove time coordinates and bounds before loading the dataset, avoiding errors like "Non-integer years and months are ambiguous and not currently supported." -- affected OMI-MLS datasets
  • Updated _get_slice_from_time_bounds() to load existing time bounds into memory using .load(scheduler="sync") to avoid hanging -- affected streamflow datasets
  • Added the missing unit conversion for H2OLNZ (Convert H2OLNZ units to ppm by volume #874) to address large differences
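For illustration, here is a minimal sketch of the slice-before-load pattern described above (the file glob and time range are hypothetical):

import xcdat as xc

# Open the time series files lazily (Dask-backed).
ds = xc.open_mfdataset("/path/to/time-series/VAR_*.nc")

# Slice the time dimension first so only the needed subset is read,
# then load synchronously to avoid Dask scheduler issues.
ds_sub = ds.sel(time=slice("0051-01-01", "0060-12-31"))
ds_sub = ds_sub.load(scheduler="sync")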

Todo

  • Complete run with v3 data
    • (note: must run without arm_diags due to the chunking issue with the dataset described above)
    • Regression testing
      • Regression testing notebook
      • 553/590 match
      • 4/590 nan mismatch (okay, known diff with ccb subsetting)
      • 33/590 not equal (okay, very small number of elements are different with reasonable abs/rel diffs)
  • Complete run with v2 data -- IN PROGRESS
    • (note: must run without arm_diags due to the chunking issue with the dataset described above)
    • Confirm full run works with convert_units() change
    • Regression testing
      • Figure out why there are more variables with diffs now -- the new diffs are actually not an issue; they affect a minimal number of elements and are not large. I think this is related to the convert_units() change.
      • Address Q variable diffs -- the unit conversion changed from g/kg to g/kg by volume via 4cda0a8, and I didn't re-run the zonal_mean_2d_stratosphere complete run to update the /main results. This is good to go.
  • Test arm_diags with v2 and v3 data with correct chunking
    • Alternative solution: Call open_dataset() when only a single file is needed.
  • Missing streamflow plots
      '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/25-01-10-branch-907-v2-data/streamflow/RIVER_DISCHARGE_OVER_LAND_LIQ_GSIM/annual_map.png',
      '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/25-01-10-branch-907-v2-data/streamflow/RIVER_DISCHARGE_OVER_LAND_LIQ_GSIM/annual_scatter.png',
      '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/25-01-10-branch-907-v2-data/streamflow/RIVER_DISCHARGE_OVER_LAND_LIQ_GSIM/seasonality_map.png',
    • Issue with opening time series files: ['/global/cfs/cdirs/e3sm/e3sm_diags/postprocessed_e3sm_v2_data_for_e3sm_diags/20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis/time-series/rgr/RIVER_DISCHARGE_OVER_LAND_LIQ_005101_006012.nc']
    • Root cause: the time slice was calculated from the existing time_bounds variable in the dataset without calling .load(scheduler="sync") first.
    • Fix: call .load(scheduler="sync") if the time bounds' underlying array is a Dask Array object (see the sketch after this list).
  • Fix integration test failures on Python 3.10+
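A minimal sketch of the streamflow fix described in the list above (the helper name is hypothetical; the real logic lives in _get_slice_from_time_bounds()):

import dask.array
import xarray as xr

def _load_time_bounds(ds: xr.Dataset) -> xr.DataArray:
    # Return the time bounds as an in-memory array, loading synchronously
    # if they are Dask-backed, to avoid the hang described above.
    time_bnds = ds["time_bnds"]

    if isinstance(time_bnds.data, dask.array.Array):
        time_bnds = time_bnds.load(scheduler="sync")

    return time_bnds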

Regression testing with v2 data and arm_diags

Link here: #907 (comment)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have added tests that prove my fix is effective or that my feature works
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@tomvothecoder (Collaborator) commented Dec 11, 2024

Regression testing performed in #903; see the related comment.

Copying here:

Here is the regression notebook that compares latest main vs. v2.12.1.

  • Matching number of .nc files produced (590)
  • 501/590 are matching
  • 4/590 are mismatching (nan positions)
  • 85/590 are not equal

The 85/590 files that are not equal were compared at rtol=1e-4. These variables will need further investigation.

['/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/COREv2_Flux/COREv2_Flux-PminusE-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/CRU_IPCC/CRU-TREFHT-ANN-land_60S90N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/CRU_IPCC/CRU-TREFHT-ANN-land_60S90N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/Cloud MISR/MISRCOSP-CLDTOT_TAU1.3_9.4_MISR-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/Cloud MISR/MISRCOSP-CLDTOT_TAU1.3_MISR-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-NET_FLUX_SRF-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-OMEGA-200-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-OMEGA-500-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-OMEGA-850-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-TREFHT-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-TREFHT-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-TREFHT-ANN-land_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-U-850-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/GPCP_OAFLux/GPCP_OAFLux-PminusE-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-NET_FLUX_SRF-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-OMEGA-200-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-OMEGA-500-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-OMEGA-850-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFHT-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFHT-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFHT-ANN-land_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFMNAV-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFMNAV-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFMXAV-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFMXAV-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/SST_CL_HadISST/HadISST_CL-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/SST_HadISST/HadISST-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/SST_PD_HadISST/HadISST_PD-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/SST_PI_HadISST/HadISST_PI-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/meridional_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/CRU_IPCC/CRU-TREFHT-ANN-polar_N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/CRU_IPCC/CRU-TREFHT-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-TREFHT-ANN-polar_N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-TREFHT-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-TREFHT-ANN-polar_S_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-TREFHT-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-U-850-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-U-850-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFHT-ANN-polar_N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFHT-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFHT-ANN-polar_S_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFHT-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_S_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_S_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_CL_HadISST/HadISST_CL-SST-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_CL_HadISST/HadISST_CL-SST-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_PD_HadISST/HadISST_PD-SST-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_PD_HadISST/HadISST_PD-SST-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_PI_HadISST/HadISST_PI-SST-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_PI_HadISST/HadISST_PI-SST-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d/ERA5/ERA5-OMEGA-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d_stratosphere/ERA5/ERA5-H2OLNZ-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d_stratosphere/ERA5/ERA5-H2OLNZ-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d_stratosphere/MERRA2/MERRA2-H2OLNZ-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d_stratosphere/MERRA2/MERRA2-H2OLNZ-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/ERA5/ERA5-TREFHT-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/ERA5/ERA5-TREFHT-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFHT-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFHT-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFMNAV-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFMNAV-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFMXAV-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFMXAV-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/SST_CL_HadISST/HadISST_CL-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/SST_PD_HadISST/HadISST_PD-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/SST_PI_HadISST/HadISST_PI-SST-ANN-global_test.nc']

@tomvothecoder (Collaborator)

I will pull my regression testing notebooks over from PR #903 to this PR and work on debugging in this branch with you.

That way #903 can be dedicated to adapting the regression tests.

@chengzhuzhang (Contributor Author)

Thank you @tomvothecoder for summarizing the cases to investigate. It looks like I'm running into an environment issue while debugging with VS Code (#906 (comment)). Let me know if the env works for you when debugging.

@tomvothecoder (Collaborator)

Thank you @tomvothecoder for summarizing the cases to investigate. It looks like I'm running into an environment issue while debugging with VS Code (#906 (comment)). Let me know if the env works for you when debugging.

I just replied here. The dev env works fine for me.

@tomvothecoder (Collaborator)

I think I found the root cause of the large diffs here and in your comment here, specifically with TREFHT and ANN.

v2.12.1 code uses genutil.udunits

In v2.12.1, the convert_units() function for derived vars includes a portion of logic that calls genutil.udunits:

temp = udunits(1.0, var.units)
coeff, offset = temp.how(target_units)  # slope and intercept for the linear conversion
var = coeff * var + offset
var.units = target_units

Latest main uses cf_units.Unit with new implementation logic

I replaced genutil.udunits with cf_units.Unit. It looks like the conversion is incorrect for TREFHT from K to C.

temp = cf_units.Unit(var.attrs["units"])
target = cf_units.Unit(target_units)
coeff, offset = temp.convert(1, target), temp.convert(0, target)
# Keep all of the attributes except the units.
with xr.set_options(keep_attrs=True):
    var = coeff * var + offset
var.attrs["units"] = target_units

The implementation might not be able to handle all variables and/or the logic is incorrect, which leads to the massive diff found in your comment here.
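For what it's worth, a small sketch illustrating the suspected flaw: for an affine conversion such as K to C, Unit.convert(1, target) returns coeff + offset rather than the coefficient alone, so the coeff computed above is off by the offset.

import cf_units

source = cf_units.Unit("K")
target = cf_units.Unit("degC")

# For an affine conversion y = coeff * x + offset:
#   convert(0) = offset and convert(1) = coeff + offset.
offset = source.convert(0, target)                  # -273.15
coeff_buggy = source.convert(1, target)             # -272.15 (what the code above computes)
coeff_correct = source.convert(1, target) - offset  # 1.0

print(coeff_buggy, coeff_correct, offset)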

Next steps

Investigate the cf_units.Unit code block in convert_units(). Fix as needed.

@tomvothecoder (Collaborator) commented Dec 12, 2024

Investigate the cf_units.Unit code block in convert_units(). Fix as needed.

I just pushed ee009c1 (#907) to address this issue.

Next steps

  • Add this fix on PR EAMxx variables #880 and re-run plot for TREFHT-ANN-land to compare results
  • Perform another regression test with the complete run to compare against v2.12.1
    • 01/06/25 - I keep running into an issue where it hangs during arm_diags after the PS variable, resulting in missing files (I think). I'm going to try running the script with dask=2024.11.2 to see if that helps.

    • Figure out why the complete run script hangs on arm_diags after the PS variable -- due to sub-optimal chunking for specific datasets (more info here)

      Files only in /global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/v2.12.1v2/arm_diags:
      armdiags-CLOUD-ANNUALCYCLE-nsac1-ref.png
      armdiags-CLOUD-ANNUALCYCLE-nsac1-test.png
      armdiags-CLOUD-ANNUALCYCLE-sgpc1-ref.png
      armdiags-CLOUD-ANNUALCYCLE-sgpc1-test.png
      armdiags-CLOUD-ANNUALCYCLE-twpc1-ref.png
      armdiags-CLOUD-ANNUALCYCLE-twpc1-test.png
      armdiags-CLOUD-ANNUALCYCLE-twpc2-ref.png
      armdiags-CLOUD-ANNUALCYCLE-twpc2-test.png
      armdiags-CLOUD-ANNUALCYCLE-twpc3-ref.png
      armdiags-CLOUD-ANNUALCYCLE-twpc3-test.png
      armdiags-FLUS-ANNUALCYCLE-sgpc1.png
      armdiags-PRECT-DJF-sgpc1-diurnal-cycle.png
      armdiags-PRECT-JJA-sgpc1-diurnal-cycle.png
      armdiags-PRECT-MAM-sgpc1-diurnal-cycle.png
      armdiags-PRECT-SON-sgpc1-diurnal-cycle.png
      armdiags-aerosol-activation-enac1-ccn02-ref.png
      armdiags-aerosol-activation-enac1-ccn02-test.png
      armdiags-aerosol-activation-enac1-ccn05-ref.png
      armdiags-aerosol-activation-enac1-ccn05-test.png
      armdiags-aerosol-activation-sgpc1-ccn02-ref.png
      armdiags-aerosol-activation-sgpc1-ccn02-test.png
      armdiags-aerosol-activation-sgpc1-ccn05-ref.png
      armdiags-aerosol-activation-sgpc1-ccn05-test.png
      armdiags-convection-onset-twpc1.png
      armdiags-convection-onset-twpc2.png
      armdiags-convection-onset-twpc3.png
    • Figure out why these MERRA2 datasets aren't being generated. The log says "resource temporarily unavailable"; maybe this is related.

      ['/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-OMEGA-200-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-OMEGA-500-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-OMEGA-850-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-PSL-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TAUXY-ANN-ocean_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TMQ-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TREFHT-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TREFHT-ANN-land_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TREFMNAV-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TREFMXAV-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-Z3-500-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/meridional_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/meridional_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/meridional_mean_2d/MERRA2/MERRA2-RELHUM-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/meridional_mean_2d/MERRA2/MERRA2-RELHUM-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-PSL-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-PSL-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-PSL-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-PSL-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TAUXY-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TAUXY-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TAUXY-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TAUXY-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TMQ-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TMQ-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TMQ-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TMQ-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFHT-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFHT-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFHT-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFHT-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-Z3-500-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-Z3-500-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-Z3-500-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-Z3-500-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-H2OLNZ-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-H2OLNZ-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-Q-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-Q-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-RELHUM-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-RELHUM-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-H2OLNZ-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-H2OLNZ-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-OMEGA-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-OMEGA-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-Q-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-Q-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-RELHUM-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-RELHUM-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TMQ-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TMQ-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFHT-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFHT-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFMNAV-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFMNAV-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFMXAV-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFMXAV-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-Z3-500-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-Z3-500-ANN-global_test.nc']

      For the missing MERRA2 datasets in the comment above, I have identified a timeout issue when attempting to derive a climatology variable that is a Dask Array via convert_units().

      ds = self._get_dataset_with_derived_climo_var(ds)

      In convert_units(), loading the variable into memory via .values causes a deadlock for some reason.

      var.values = original_udunit.convert(var.values, target_udunit)

      This does not happen when isolating the code outside of E3SM Diags. Further investigation is needed.

      import cf_units
      import xcdat as xc
      
      filepath = "/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology/MERRA2/MERRA2_ANN_198001_201612_climo.nc"
      
      args = {
          "paths": filepath,
          "decode_times": True,
          "add_bounds": ["X", "Y", "Z"],
          "coords": "minimal",
          "compat": "override",
      }
      
      ds = xc.open_mfdataset(**args)
      
      # Load data into memory and view values
      ds.wap.values
      
      <xarray.DataArray 'wap' (time: 1, plev: 42, lat: 361, lon: 576)> Size: 35MB
      dask.array<open_dataset-wap, shape=(1, 42, 361, 576), dtype=float32, chunksize=(1, 42, 361, 576), chunktype=numpy.ndarray>
      Coordinates:
          height   float64 8B ...
        * lat      (lat) float64 3kB -90.0 -89.5 -89.0 -88.5 ... 88.5 89.0 89.5 90.0
        * lon      (lon) float64 5kB 0.0 0.625 1.25 1.875 ... 357.5 358.1 358.8 359.4
        * plev     (plev) float64 336B 1e+05 9.75e+04 9.5e+04 ... 40.0 30.0 10.0
        * time     (time) object 8B 1980-01-15 00:00:00
      Attributes:
          standard_name:     lagrangian_tendency_of_air_pressure
          long_name:         omega (=dp/dt)
          units:             Pa s-1
          comment:           commonly referred to as ""omega"", this represents the...
          original_name:     omega
          original_units:    Pa s-1
          history:           2015-10-11T21:02:10Z altered by CMOR: Converted units ...
          cell_methods:      time: mean
          cell_measures:     area: areacella
          associated_files:  baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation...
      • The root cause is that the Dask Array is being loaded into memory without using the correct scheduler (e.g., .load(scheduler="sync")):
        original_udunit = cf_units.Unit(var.attrs["units"])
        target_udunit = cf_units.Unit(target_units)
        var.values = original_udunit.convert(var.values, target_udunit)
        var.attrs["units"] = target_units
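A minimal sketch of the described fix (the function name is hypothetical; the real change applies to convert_units()):

import cf_units
import dask.array
import xarray as xr

def _convert_units(var: xr.DataArray, target_units: str) -> xr.DataArray:
    # Load Dask-backed data with the synchronous scheduler first so that
    # cf_units operates on in-memory values without deadlocking.
    if isinstance(var.data, dask.array.Array):
        var = var.load(scheduler="sync")

    source_unit = cf_units.Unit(var.attrs["units"])
    target_unit = cf_units.Unit(target_units)
    var.values = source_unit.convert(var.values, target_unit)
    var.attrs["units"] = target_units

    return var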

@tomvothecoder mentioned this pull request Dec 12, 2024
- Addresses performance bottleneck associated with attempting to load large datasets into memory. Time slicing reduces the size before loading into memory
- Constrain `dask <2024.12.0`
- Subset and load the climo dataset with the source variables before deriving the target variable to use the appropriate Dask scheduler
@tomvothecoder (Collaborator) left a comment

@chengzhuzhang FYI if we are still experiencing slowness or hanging with a complete run in #880, I will carry this fix over to that PR as well and re-run.

For cases where deriving a climo variable occurs, this fix will first subset the dataset with the source variables and load it into memory before performing the actual derivation.

This fix is required because the convert_units() function will try to load the variables into memory using the wrong scheduler by calling .values instead of .load(scheduler="sync"), which causes hanging/crashing (code below).

original_udunit = cf_units.Unit(var.attrs["units"])
target_udunit = cf_units.Unit(target_units)
var.values = original_udunit.convert(var.values, target_udunit)
var.attrs["units"] = target_units

@chengzhuzhang (Contributor Author)

@tomvothecoder hey, Tom. Just a heads-up: I think #880 is in good shape and ready to merge.
I'm happy to test the scripts from #880 again when this PR is finalized. Let me know if there's anything I can help with on this PR...

- Avoids an issue where time bounds are generated for problematic singleton coords (e.g., OMI files)
- Update `squeeze_time_dim()` to catch cases where time_dim does not exist in the dataset
- Add logger info messages to print out climatology and time series filepaths and derived variable filepaths
@tomvothecoder (Collaborator)

@tomvothecoder hey, Tom. Just a heads-up: I think #880 is in good shape and ready to merge.

Great, I'll perform a final review before merging. Thanks!

I'm happy to test the scripts from #880 again when this PR is finalized. Let me know if there's anything I can help with on this PR...

In this PR, I’m debugging the final variables that fail to generate for arm_diags. The issue stems from sub-optimal/incorrect chunk sizes in the native v3 datasets, causing the Xarray .load(scheduler="sync") call to hang due to excessive chunks.

Here’s a script from af046a2 (#907) (pasted below for convenience) detailing the affected datasets and their chunk size issues. Any thoughts on why these chunk sizes were set this way?

# %%
"""
Issue 1 - Sub-optimal `CLOUD` and `time_bnds` chunking
  * Related dataset: "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/CLOUD_twpc3_200001_201412.nc"
  * Dataset shape is (time: 131400, bound: 2, lev: 80)
  * `CLOUD` variable has chunks of (1, 80), resulting in 131400 chunks in 2 graph layers. (very bad, slow loading)
  * `time_bnds` has chunks of (1, 2), resulting in 131400 chunks in 3 graph layers. (very bad, slow loading)
"""

import xcdat as xc

ds = xc.open_mfdataset(
    [
        "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/CLOUD_twpc3_200001_201412.nc"
    ]
)

print(ds.CLOUD.data)
#               Array	        Chunk
# Bytes	        40.10 MiB	    320 B
# Shape	        (131400, 80)	(1, 80)
# Dask graph	131400 chunks in 2 graph layers
# Data type	    float32 numpy.ndarray

print(ds.time_bnds.data)
# Array	Chunk
# Bytes	2.01 MiB	16 B
# Shape	(131400, 2)	(1, 2)
# Dask graph	131400 chunks in 3 graph layers
# Data type	object numpy.ndarray


# %%
"""
Issue 2 - Sub-optimal `time_bnds` chunking
  * Related dataset: "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/FLDS_sgpc1_200001_201412.nc"
  * Dataset shape is (time: 131400, bound: 2, lev: 80)
  * `FLDS` variable has chunks of (1019,), resulting in 129 chunks in 2 graph layers (okay)
  * `time_bnds` has chunks of (1, 2), resulting in 131400 chunks in 3 graph layers (very bad, slow loading)
"""

ds2 = xc.open_mfdataset(
    [
        "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/FLDS_sgpc1_200001_201412.nc"
    ]
)

print(ds2.FLDS.data)
#               Array	Chunk
# Bytes	        513.28 kiB	3.98 kiB
# Shape	        (131400,)	(1019,)
# Dask graph	129 chunks in 2 graph layers
# Data type	    float32 numpy.ndarray

print(ds2.time_bnds.data)
#               Array	Chunk
# Bytes	        2.01 MiB	16 B
# Shape	        (131400, 2)	(1, 2)
# Dask graph	131400 chunks in 3 graph layers
# Data type	    object numpy.ndarray

# %%
"""
Issue 3 - Sub-optimal `time_bnds` chunking
  * Related dataset: "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/PRECT_sgpc1_200001_201412.nc"
  * Dataset shape is (time: 131400, bound: 2, lev: 80)
  * `PRECT` variable has chunks of (1019,), resulting in 129 chunks in 2 graph layers (okay)
  * `time_bnds` has chunks of (1, 2), resulting in 131400 chunks in 3 graph layers (very bad, slow loading)
"""

ds3 = xc.open_mfdataset(
    [
        "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/PRECT_sgpc1_200001_201412.nc"
    ]
)

print(ds3.PRECT.data)
#               Array	Chunk
# Bytes	        513.28 kiB	3.98 kiB
# Shape	        (131400,)	(1019,)
# Dask graph	129 chunks in 2 graph layers
# Data type	    float32 numpy.ndarray

print(ds3.time_bnds.data)
#               Array	Chunk
# Bytes	        2.01 MiB	16 B
# Shape	        (131400, 2)	(1, 2)
# Dask graph	131400 chunks in 3 graph layers
# Data type	    object numpy.ndarray

# %%
"""
Issue 4 - Sub-optimal `num_a1` and `time_bnds` chunking
  * Related dataset: "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/num_a1_enac1_200001_201412.nc"
  * Dataset shape is (time: 131400, bound: 2, lev: 80)
  * `num_a1` variable has chunks of (1, 80), resulting in 131400 chunks in 2 graph layers. (very bad, slow loading)
  * `time_bnds` has chunks of (1, 2), resulting in 131400 chunks in 3 graph layers (very bad, slow loading)
"""

ds4 = xc.open_mfdataset(
    [
        "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/num_a1_enac1_200001_201412.nc"
    ]
)

print(ds4.num_a1.data)
#               Array	Chunk
# Bytes	        40.10 MiB	320 B
# Shape	        (131400, 80)	(1, 80)
# Dask graph	131400 chunks in 2 graph layers
# Data type	    float32 numpy.ndarray

print(ds4.time_bnds.data)
#               Array	Chunk
# Bytes	        2.01 MiB	16 B
# Shape	        (131400, 2)	(1, 2)
# Dask graph	131400 chunks in 3 graph layers
# Data type	    object numpy.ndarray


# %%
"""
Issue 5 - Sub-optimal `time_bnds` chunking
  * Related dataset: "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/PRECT_twpc3_200001_201412.nc"
  * Dataset shape is (time: 131400, bound: 2, lev: 80)
  * `PRECT` variable has chunks of (1019,), resulting in 129 chunks in 2 graph layers (okay)
  * `time_bnds` has chunks of (1, 2), resulting in 131400 chunks in 3 graph layers (very bad, slow loading)
"""

ds5 = xc.open_mfdataset(
    [
        "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/PRECT_twpc3_200001_201412.nc"
    ]
)

print(ds5.time_bnds.data)
#               Array	Chunk
# Bytes	        2.01 MiB	16 B
# Shape	        (131400, 2)	(1, 2)
# Dask graph	131400 chunks in 3 graph layers
# Data type	    object numpy.ndarray

@chengzhuzhang (Contributor Author) commented Jan 9, 2025

@tomvothecoder I'm trying to understand the newly emerging issues. I don't think we saw these when we tested the refactor branch. Another difference is that we are now testing a new dataset (v3), compared to the v2 data used for testing during the refactor.

It looks like, first, genutil.udunits was changed to cf_units.Unit; this change requires using .load(scheduler="sync") instead of .values, and the latter ran into weird chunking? Did I get the story right? I'm wondering if we will get the same behavior if we re-test arm_diags with the v2 data we have been working with?

I'm now testing with the v2 dataset here: /global/cfs/cdirs/e3sm/e3sm_diags/postprocessed_e3sm_v2_data_for_e3sm_diags/20221103.v2.LR.amip.NGD_v3atm.chrysalis/arm-diags-data/. The time record in this v2 dataset is longer, spanning 1985-2014.

@tomvothecoder (Collaborator) commented Jan 9, 2025

@tomvothecoder I'm trying to understand the newly emerging issues. I don't think we saw these when we tested the refactor branch. Another difference is that we are now testing a new dataset (v3), compared to the v2 data used for testing during the refactor.

The root cause of the issue for arm_diags above is specific to the v3 data. The v3 data seems to have incorrect chunking at the file level (defined via _ChunkSizes) for some variables and time_bnds, which causes Xarray to hang when trying to load into memory because there are far too many chunks. This is not an issue with the codebase itself, as far as I'm aware.

It looks like, first, genutil.udunits was changed to cf_units.Unit; this change requires using .load(scheduler="sync") instead of .values, and the latter ran into weird chunking? Did I get the story right? I'm wondering if we will get the same behavior if we re-test arm_diags with the v2 data we have been working with?

I'm now testing with the v2 dataset here: /global/cfs/cdirs/e3sm/e3sm_diags/postprocessed_e3sm_v2_data_for_e3sm_diags/20221103.v2.LR.amip.NGD_v3atm.chrysalis/arm-diags-data/. The time record in this v2 dataset is longer, spanning 1985-2014.

The units conversion topic is separate from the arm_diags issue with chunking.

I'll do a complete run with v2 data and perform a regression test with the convert_units() update too. I don't remember if I did it for #885.

@chengzhuzhang (Contributor Author)

I'm digging up some discussions regarding the performance of the refactored arm_diags set: #842 (comment). There are some nice performance benchmarks there.

@chengzhuzhang (Contributor Author)

The root cause of the issue for arm_diags above is specific to the v3 data. The v3 data seems to have incorrect chunking at the file level (defined via _ChunkSizes) for some variables and time_bnds, which causes Xarray to hang when trying to load into memory because there are far too many chunks. This is not an issue with the codebase itself, as far as I'm aware.

I checked the metadata for v2 vs. v3. The difference is that the v2 data has 2x the time steps of v3; v2 has 72 vertical levels and v3 has 80. Is it possible that in the original refactored code, data was loaded in without chunking, while in the new code base, data are loaded as Dask arrays and chunked in a weird way?

@tomvothecoder (Collaborator)

I checked the metadata for v2 vs. v3. The difference is that the v2 data has 2x the time steps of v3; v2 has 72 vertical levels and v3 has 80. Is it possible that in the original refactored code, data was loaded in without chunking, while in the new code base, data are loaded as Dask arrays and chunked in a weird way?

There are a few possibilities here (and any combination of them together):

  1. We did regression testing on v2 data with open_dataset() which avoids Dask issues
  2. We incorporated open_mfdataset() and .load() later on (e.g., [Feature]: CDAT Migration: Add support for opening multiple time series datasets #861)
  3. The v2 data does not have weird chunking like v3 data

As a workaround, I can add code to detect if there is only one file to open. If there is, we use open_dataset() instead of open_mfdataset() (see the sketch below). It shouldn't affect anything downstream, but a regression test should still be performed with this change.
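A minimal sketch of that workaround (the helper name is hypothetical; the real decision would live in the dataset-opening code):

import xcdat as xc

def _open_time_series(paths: list[str]):
    # A single file can be opened eagerly with open_dataset(), sidestepping
    # the Dask chunking issues; multiple files still need open_mfdataset().
    if len(paths) == 1:
        return xc.open_dataset(paths[0])

    return xc.open_mfdataset(paths)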

@chengzhuzhang (Contributor Author) commented Jan 10, 2025

3. The v2 data does not have weird chunking like v3 data

This is mysterious to me, because both the v2 and v3 files were pre-processed with the same CDMS-based script (here and here). I don't know why the chunking would be different at the file level. Another possibility is to rewrite this pre-processing script with xarray and create a new set of files to test with.

@chengzhuzhang (Contributor Author) commented Jan 10, 2025

This is mysterious to me, because both the v2 and v3 files were pre-processed with the same CDMS-based script (here and here). I don't know why the chunking would be different at the file level. Another possibility is to rewrite this pre-processing script with xarray and create a new set of files to test with.

I did a little research on chunking. ncks -m --hdn is the NCO command to view chunk sizes. It does look like both versions of the data have similar chunk sizes.

(e3sm_unified_1.10.0_login) chengzhu@perlmutter:login30:~/arm> ncks -m --hdn /global/cfs/cdirs/e3sm/e3sm_diags/postprocessed_e3sm_v2_data_for_e3sm_diags/20221103.v2.LR.amip.NGD_v3atm.chrysalis/arm-diags-data/CLOUD_twpc3_198501_201412.nc 
netcdf CLOUD_twpc3_198501_201412 {
  dimensions:
    bound = 2 ;
    lev = 72 ;
    time = UNLIMITED ; // (262800 currently)

  variables:
    float CLOUD(time,lev) ;
      CLOUD:mdims = 1 ;
      CLOUD:units = "1" ;
      CLOUD:long_name = "Cloud fraction" ;
      CLOUD:cell_methods = "time: point time: mean" ;
      CLOUD:basename = "CLOUD" ;
      CLOUD:missing_value = 1.e+20f ;
      CLOUD:_FillValue = 1.e+20f ;
      CLOUD:_Storage = "chunked" ;
      CLOUD:_ChunkSizes = 1, 72 ;
      CLOUD:_Endianness = "little" ;

    double P0 ;
      P0:name = "P0" ;
      P0:missing_value = 1.e+20 ;
      P0:_FillValue = 1.e+20 ;
      P0:_Endianness = "little" ;

    float PS(time) ;
      PS:units = "Pa" ;
      PS:long_name = "Surface pressure" ;
      PS:standard_name = "surface_air_pressure" ;
      PS:cell_methods = "time: point time: mean" ;
      PS:basename = "PS" ;
      PS:missing_value = 1.e+20f ;
      PS:_FillValue = 1.e+20f ;
      PS:_Storage = "chunked" ;
      PS:_ChunkSizes = 1023 ;
      PS:_Endianness = "little" ;
    double hyam(lev) ;
      hyam:long_name = "hybrid A coefficient at layer midpoints" ;
      hyam:missing_value = 1.e+20 ;
      hyam:_FillValue = 1.e+20 ;
      hyam:_Storage = "contiguous" ;
      hyam:_Endianness = "little" ;

    double hybm(lev) ;
      hybm:long_name = "hybrid B coefficient at layer midpoints" ;
      hybm:missing_value = 1.e+20 ;
      hybm:_FillValue = 1.e+20 ;
      hybm:_Storage = "contiguous" ;
      hybm:_Endianness = "little" ;

    double lat ;
      lat:long_name = "latitude" ;
      lat:units = "degrees_north" ;
      lat:missing_value = 1.e+20 ;
      lat:_FillValue = 1.e+20 ;
      lat:_Endianness = "little" ;

    double lev(lev) ;
      lev:axis = "Z" ;
      lev:units = "hPa" ;
      lev:long_name = "hybrid level at midpoints (1000*(A+B))" ;
      lev:positive = "down" ;
      lev:standard_name = "atmosphere_hybrid_sigma_pressure_coordinate" ;
      lev:formula_terms = "a: hyam b: hybm p0: P0 ps: PS" ;
      lev:realtopology = "linear" ;
      lev:_Storage = "contiguous" ;
      lev:_Endianness = "little" ;

    double lon ;
      lon:long_name = "longitude" ;
      lon:units = "degrees_east" ;
      lon:missing_value = 1.e+20 ;
      lon:_FillValue = 1.e+20 ;
      lon:_Endianness = "little" ;

    double time(time) ;
      time:bounds = "time_bnds" ;
      time:axis = "T" ;
      time:calendar = "noleap" ;
      time:units = "days since 1985-01-01 00:00:00" ;
      time:long_name = "time" ;
      time:cell_methods = "time: mean" ;
      time:realtopology = "linear" ;
      time:_Storage = "chunked" ;
      time:_ChunkSizes = 512 ;
      time:_Endianness = "little" ;

    double time_bnds(time,bound) ;
      time_bnds:_Storage = "chunked" ;
      time_bnds:_ChunkSizes = 1, 2 ;
      time_bnds:_Endianness = "little" ;
} // group /

and

(e3sm_unified_1.10.0_login) chengzhu@perlmutter:login30:~/arm> ncks -m --hdn /global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/CLOUD_twpc3_200001_201412.nc
netcdf CLOUD_twpc3_200001_201412 {
  dimensions:
    bound = 2 ;
    lev = 80 ;
    time = UNLIMITED ; // (131400 currently)

  variables:
    float CLOUD(time,lev) ;
      CLOUD:mdims = 1 ;
      CLOUD:units = "fraction" ;
      CLOUD:long_name = "Cloud fraction" ;
      CLOUD:cell_methods = "time: point" ;
      CLOUD:basename = "CLOUD" ;
      CLOUD:missing_value = 1.e+20f ;
      CLOUD:_FillValue = 1.e+20f ;
      CLOUD:_Storage = "chunked" ;
      CLOUD:_ChunkSizes = 1, 80 ;
      CLOUD:_Endianness = "little" ;

    double P0 ;
      P0:name = "P0" ;
      P0:missing_value = 1.e+20 ;
      P0:_FillValue = 1.e+20 ;
      P0:_Endianness = "little" ;

    float PS(time) ;
      PS:units = "Pa" ;
      PS:long_name = "Surface pressure" ;
      PS:standard_name = "surface_air_pressure" ;
      PS:cell_methods = "time: point" ;
      PS:basename = "PS" ;
      PS:missing_value = 1.e+20f ;
      PS:_FillValue = 1.e+20f ;
      PS:_Storage = "chunked" ;
      PS:_ChunkSizes = 1019 ;
      PS:_Endianness = "little" ;

    double hyam(lev) ;
      hyam:long_name = "hybrid A coefficient at layer midpoints" ;
      hyam:missing_value = 1.e+20 ;
      hyam:_FillValue = 1.e+20 ;
      hyam:_Storage = "contiguous" ;
      hyam:_Endianness = "little" ;

    double hybm(lev) ;
      hybm:long_name = "hybrid B coefficient at layer midpoints" ;
      hybm:missing_value = 1.e+20 ;
      hybm:_FillValue = 1.e+20 ;
      hybm:_Storage = "contiguous" ;
      hybm:_Endianness = "little" ;

    double lat ;
      lat:long_name = "latitude" ;
      lat:units = "degrees_north" ;
      lat:missing_value = 1.e+20 ;
      lat:_FillValue = 1.e+20 ;
      lat:_Endianness = "little" ;

    double lev(lev) ;
      lev:axis = "Z" ;
      lev:units = "hPa" ;
      lev:long_name = "hybrid level at midpoints (1000*(A+B))" ;
      lev:positive = "down" ;
      lev:standard_name = "atmosphere_hybrid_sigma_pressure_coordinate" ;
      lev:formula_terms = "a: hyam b: hybm p0: P0 ps: PS" ;
      lev:realtopology = "linear" ;
      lev:_Storage = "contiguous" ;
      lev:_Endianness = "little" ;

    double lon ;
      lon:long_name = "longitude" ;
      lon:units = "degrees_east" ;
      lon:missing_value = 1.e+20 ;
      lon:_FillValue = 1.e+20 ;
      lon:_Endianness = "little" ;

    double time(time) ;
      time:bounds = "time_bnds" ;
      time:axis = "T" ;
      time:calendar = "noleap" ;
      time:units = "days since 2000-01-01 00:00:00" ;
      time:long_name = "time" ;
      time:realtopology = "linear" ;
      time:_Storage = "chunked" ;
      time:_ChunkSizes = 512 ;
      time:_Endianness = "little" ;

    double time_bnds(time,bound) ;
      time_bnds:_Storage = "chunked" ;
      time_bnds:_ChunkSizes = 1, 2 ;
      time_bnds:_Endianness = "little" ;
} // group /

I think the chunking follows NCO's chunking policy, but for these particular small datasets we probably don't need to chunk at all. I'm rewriting the pre-processing script using xarray/xcdat to see if we can get more reasonable .nc files to start with.

@tomvothecoder (Collaborator) commented Jan 10, 2025

I did a little research on chunking. ncks -m --hdn is the NCO command to view chunk sizes. It does look like both versions of the data have similar chunk sizes.

Thanks for confirming @chengzhuzhang. This leads me to believe that we ran the v2 complete run with open_dataset() rather than open_mfdataset() + load(), which explains why arm_diags worked fine.

I think the chunking follows NCO's chunking policy, but for these particular small datasets we probably don't need to chunk at all. I'm rewriting the pre-processing script using xarray/xcdat to see if we can get more reasonable .nc files to start with.

Sounds good, thanks.

@tomvothecoder mentioned this pull request Jan 10, 2025
@chengzhuzhang (Contributor Author)

I updated the pre-processing script for single-site output and created correctly chunked datasets: /global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/

The datasets load fast, but I ran into an issue with the diurnal_cycle_zt subset (error message below). I will investigate further whether a fix is needed for the newly generated data or the code base.

15:05:14,345 [ERROR]: core_parameter.py(_run_diag:343) >> Error in e3sm_diags.driver.arm_diags_driver
Traceback (most recent call last):
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/edv3/lib/python3.12/site-packages/e3sm_diags/parameter/core_parameter.py", line 340, in _run_diag
    single_result = module.run_diag(self)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/edv3/lib/python3.12/site-packages/e3sm_diags/driver/arm_diags_driver.py", line 61, in run_diag
    return _run_diag_diurnal_cycle_zt(parameter)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/edv3/lib/python3.12/site-packages/e3sm_diags/driver/arm_diags_driver.py", line 183, in _run_diag_diurnal_cycle_zt
    test_diurnal, lst = composite_diurnal_cycle(  # type: ignore
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/edv3/lib/python3.12/site-packages/e3sm_diags/driver/utils/diurnal_cycle_xr.py", line 86, in composite_diurnal_cycle
    lst[it, :, ilon] = (itime + start_time + lon[ilon] / 360 * 24) % 24  # type: ignore
    ~~~^^^^^^^^^^^^^
ValueError: could not broadcast input array from shape (131400,) into shape (1,)

@chengzhuzhang (Contributor Author) commented Jan 11, 2025

313a641 should address the data problem reported in #907 (comment). The lat and lon values had time as their dimension, which broke the diurnal cycle routine. I'm generating a new set of datasets and will replace the files under /global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/ for testing.

The old version of the site data is now located at /global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site_cdat/
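A rough sketch (assumed data structure; the actual fix is in the pre-processing script) of dropping the spurious time dimension from the site coordinates:

import xarray as xr

ds = xr.open_dataset("CLOUD_twpc3_200001_201412.nc")  # hypothetical input

for name in ["lat", "lon"]:
    if "time" in ds[name].dims:
        # Every time step carries the same site location, so keep a single
        # scalar value and drop the time dimension.
        ds[name] = ds[name].isel(time=0, drop=True)

ds.to_netcdf("CLOUD_twpc3_200001_201412_fixed.nc")  # hypothetical output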

@tomvothecoder (Collaborator)

313a641 should address the data problem reported in #907 (comment). The lat and lon values had time as their dimension, which broke the diurnal cycle routine. I'm generating a new set of datasets and will replace the files under /global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/ for testing.

The old version of the site data is now located at /global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site_cdat/

Thanks Jill. I will do an arm_diags only run with v3 data on Tues.

This PR looks like it's almost done. I'm wrapping up regression testing (v3 is good to go, v2 almost done just Q var is off). I'd say we're on track to merge next Wed/Thurs.

@chengzhuzhang (Contributor Author)

I can confirm that the performance issue is resolved with the new datasets in place. The full arm set completed in 2 minutes in my test.

@chengzhuzhang (Contributor Author)

I checked the metadata for v2 vs. v3. The difference is that the v2 data has 2x the time steps of v3; v2 has 72 vertical levels and v3 has 80. Is it possible that in the original refactored code, data was loaded in without chunking, while in the new code base, data are loaded as Dask arrays and chunked in a weird way?

There are a few possibilities here (and any combination of them together):

1. We did regression testing on v2 data with `open_dataset()` which avoids Dask issues

2. We incorporated `open_mfdataset()` and `.load()` later on (e.g., [[Feature]: CDAT Migration: Add support for opening multiple time series datasets #861](https://github.com/E3SM-Project/e3sm_diags/issues/861))

3. The v2 data does not have weird chunking like v3 data

To recap this issue: the root cause is that the original set of site-based output was pre-processed with an ncrcat-based script. For these datasets with var(time,lev) or var(time) dimensions, chunking over time created too many small chunks and caused an I/O bottleneck. The problem did not show up during the refactor because the data were read in with open_dataset(), which avoids the chunking issue. When we later introduced open_mfdataset() and .load(), the chunk sizes were incompatible and caused the I/O issue. We have now rewritten the pre-processing script with xarray to generate correctly chunked time-series files (e.g., chunk size = the full variable dimension). The performance is now good. A sketch of the new chunking approach follows below.
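A minimal sketch of the new chunking approach, assuming xarray's netCDF4 backend (file names are hypothetical):

import xarray as xr

ds = xr.open_dataset("CLOUD_twpc3_200001_201412.nc")  # hypothetical input

# Write one chunk spanning each variable's full shape, instead of the
# 131400 tiny (1, lev) chunks produced by the original ncrcat-based script.
encoding = {
    name: {"chunksizes": tuple(var.sizes[d] for d in var.dims)}
    for name, var in ds.data_vars.items()
    if var.dims  # skip scalar variables such as lat, lon, and P0
}

ds.to_netcdf("CLOUD_twpc3_200001_201412_rechunked.nc", encoding=encoding)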

@tomvothecoder (Collaborator) commented Jan 14, 2025

@chengzhuzhang Great, thank you for summarizing the arm_diags I/O issue and confirming that the performance with the new e3sm_diags is now good!

I will complete regression testing with v2 data, fix the integration tests, and do a final review before merging this PR.

@chengzhuzhang (Contributor Author)

@chengzhuzhang Great, thank you for summarizing the arm_diags I/O issue and confirming that the performance with the new e3sm_diags is now good!

I will complete regression testing with v2 data, fix the integration tests, and do a final review before merging this PR.

Yeah, I'm glad we don't need to update the code base to work around this data issue. I haven't re-produced the v2 site data, so the complete v2 run still has to exclude the arm_diags set.

@tomvothecoder (Collaborator) left a comment

@chengzhuzhang We need to replace cwt from scipy with PyWavelets (#912). I'm trying to figure out where to pass the deg = 6 arg here. If you can take a look too, that'd be great.


@chengzhuzhang (Contributor Author) left a comment

I'm looking at it. cmor means something totally different here...

@chengzhuzhang (Contributor Author) left a comment

@whannah1 hey Walter, as we are preparing for the e3sm_diags v3 release, @tomvothecoder found that scipy.signal.cwt and scipy.signal.morlet2 have been removed from the latest scipy, and the suggestion was to use PyWavelets instead. It seems the transition to the PyWavelets API is not straightforward, and we haven't found much documentation to help with it. I'm wondering if you are familiar with this tool and could help rewrite this _get_psd_from_wavelet function?
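In case it helps, here is a rough, unverified sketch of what a PyWavelets version might look like. The wavelet name and its B/C parameters are assumptions, not a verified equivalent of scipy's morlet2 with w = 6 (the deg = 6 question above):

import numpy as np
import pywt

def _get_psd_from_wavelet(data: np.ndarray) -> np.ndarray:
    # PyWavelets' "cmor" is a complex Morlet wavelet named cmorB-C
    # (bandwidth-center frequency), unrelated to the CMOR climate tool.
    # B = 1.5 and C = 1.0 are placeholder values.
    wavelet = "cmor1.5-1.0"

    periods = np.arange(2, 100)  # assumed period range in time steps
    # With a sampling period of 1, frequency = central_frequency / scale,
    # so scale = central_frequency * period.
    scales = pywt.central_frequency(wavelet) * periods

    coeffs, _freqs = pywt.cwt(data, scales, wavelet)

    return np.abs(coeffs) ** 2  # PSD estimate per scale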
