Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update OOPS and use reduce obs space filter action for pre-filters #209

Conversation

SamuelDegelia-NOAA
Copy link
Contributor

@SamuelDegelia-NOAA SamuelDegelia-NOAA commented Oct 23, 2024

This PR addresses #122 and #189 to update the OOPS hash to a recent commit that fixes the reduce obs space filter action for GETKF. This filter action is then added to temporal thinning and time window filters in each of the obs-space yamls under rrfs-test/validated_yamls/templates/obtype_config.

Using this filter action allows us to remove observations from memory that do not pass any pre-filters. These obs will thus not be output to the hofx file and do not get read in during the GETKF solver. This significantly speeds up the GETKF solver by about ~66% (see table below). Memory usage for the GETKF solver is also reduced by ~35%.

The only ctests affected by this change are the GETKF observer and solver. Their reference files are thus updated here as well.

Table for impact of reduce obs space:

table

ctest output on Hera:

(eva) [Samuel.Degelia@hfe07 rrfs-test]$ ctest -j6
Test project /scratch1/BMC/zrtrr/Samuel.Degelia/RDASApp_updateoops/RDASApp/build/rrfs-test
    Start 4: rrfs_mpasjedi_2024052700_getkf_observer
    Start 3: rrfs_mpasjedi_2024052700_Ens3Dvar
    Start 1: rrfs_fv3jedi_hyb_2022052619
    Start 6: rrfs_mpasjedi_2024052700_bumploc
    Start 2: rrfs_fv3jedi_letkf_2022052619
1/6 Test #4: rrfs_mpasjedi_2024052700_getkf_observer ...   Passed  349.67 sec
    Start 5: rrfs_mpasjedi_2024052700_getkf_solver
2/6 Test #2: rrfs_fv3jedi_letkf_2022052619 .............   Passed  370.65 sec
3/6 Test #1: rrfs_fv3jedi_hyb_2022052619 ...............   Passed  431.38 sec
4/6 Test #6: rrfs_mpasjedi_2024052700_bumploc ..........   Passed  947.75 sec
5/6 Test #5: rrfs_mpasjedi_2024052700_getkf_solver .....   Passed  626.84 sec
6/6 Test #3: rrfs_mpasjedi_2024052700_Ens3Dvar .........   Passed  1481.68 sec

100% tests passed, 0 tests failed out of 6

Label Time Summary:
mpi            = 4207.97 sec*proc (6 tests)
rdas-bundle    = 4207.97 sec*proc (6 tests)
script         = 4207.97 sec*proc (6 tests)

Total Test time (real) = 1481.99 sec

@SamuelDegelia-NOAA
Copy link
Contributor Author

Plot of GETKF analysis confirming no impact on the analysis:

Capture

@SamuelDegelia-NOAA
Copy link
Contributor Author

Also, this PR might have some conflicts with #207 since I did some slight reorganization of rrfs-test/CMakeLists.txt here. So we can merge that first and then look at this one.

@guoqing-noaa
Copy link
Collaborator

@SamuelDegelia-NOAA PR #207 was merged. You can go ahead to resolve the conflicts.

Also, as this PR updates one submodule component, it is expected to run the mpas-jedi and fv-jedi ctests as well. Thanks!

@guoqing-noaa
Copy link
Collaborator

The change also affects the EnVar case, do we also see improvements in the computing time and memory usage?

file(GLOB bl_FILES "${src_casedir}/*.*BL")
file(COPY ${bl_FILES} DESTINATION ${casedir})
file(COPY ${src_yaml}/${case}.yaml DESTINATION ${casedir})
file(CREATE_LINK ${src_casedir}/data ${casedir}/data SYMBOLIC)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we get extra indentations from the latest commit. Are these changes expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah not sure what happened there. I used Github to do the merge conflicts and it added some weird spacing. Just fixed it, I think.

@delippi
Copy link
Collaborator

delippi commented Oct 23, 2024

@SamuelDegelia-NOAA, just as an FYI, I already made these updates in my current PR #188. I've actually removed the Temporal Thinning filter and moved the order of this time window filter closer to the top. Will this cause conflicts when trying to merge both PRs?

@SamuelDegelia-NOAA
Copy link
Contributor Author

SamuelDegelia-NOAA commented Oct 23, 2024

@SamuelDegelia-NOAA, just as an FYI, I already made these updates in my current PR #188. I've actually removed the Temporal Thinning filter and moved the order of this time window filter closer to the top. Will this cause conflicts when trying to merge both PRs?

@delippi Probably so, maybe that means we should focus on getting yours merged first since it has been up longer.

Also, I didn't realize you removed the temporal thinning filter - what was the reason for that? The bulk of the speed up here is from applying reduce obs space to the time window check, so removing the temporal thinning isn't a problem in the context of this PR.

@SamuelDegelia-NOAA
Copy link
Contributor Author

SamuelDegelia-NOAA commented Oct 23, 2024

The change also affects the EnVar case, do we also see improvements in the computing time and memory usage?

@guoqing-noaa There are slight improvements but not as much as the GETKF. Runtime for rrfs_mpasjedi_2024052700_Ens3Dvar is reduced from 438 to 409 s, and the total memory used is reduced from 434 to 424 Gb.

Also, I'll start running the mpas-jedi and fv3-jedi tests.

@guoqing-noaa
Copy link
Collaborator

The change also affects the EnVar case, do we also see improvements in the computing time and memory usage?

@guoqing-noaa There are slight improvements but not as much as the GETKF. Runtime for rrfs_mpasjedi_2024052700_Ens3Dvar is reduced from 438 to 409 s, and the total memory used is reduced from 434 to 424 Gb.

Thanks! It may reduce more when we assimilate more observations, such as satellite radiance, or radar data, etc

@SamuelDegelia-NOAA
Copy link
Contributor Author

The change also affects the EnVar case, do we also see improvements in the computing time and memory usage?

@guoqing-noaa There are slight improvements but not as much as the GETKF. Runtime for rrfs_mpasjedi_2024052700_Ens3Dvar is reduced from 438 to 409 s, and the total memory used is reduced from 434 to 424 Gb.

Thanks! It may reduce more when we assimilate more observations, such as satellite radiance, or radar data, etc

Definitely!

@delippi
Copy link
Collaborator

delippi commented Oct 24, 2024

@SamuelDegelia-NOAA, just as an FYI, I already made these updates in my current PR #188. I've actually removed the Temporal Thinning filter and moved the order of this time window filter closer to the top. Will this cause conflicts when trying to merge both PRs?

@delippi Probably so, maybe that means we should focus on getting yours merged first since it has been up longer.

Also, I didn't realize you removed the temporal thinning filter - what was the reason for that? The bulk of the speed up here is from applying reduce obs space to the time window check, so removing the temporal thinning isn't a problem in the context of this PR.

The temporal thinning I found was just not providing the correct observation counts... even when setting it to PT00M which I thought would turn it off. GSI also doesn't appear to do any temporal thinning of the data so I was opting to remove it. The reason we were going to try to use it was to remove duplicate observations such that each station has observations sufficiently spaced in time, but there might be better ways to do that... I also don't understand why setting min spacing to 0 min doesn't produce the same result as just removing the function altogether.

Here are some side by side comparisons just for reference:
Result with Temporal Thinning completely removed:
fv3jedi_vs_gsi_increment_airTemperature_msonet_airTemperature_188_none
Result using min_spacing: PT00M:
fv3jedi_vs_gsi_increment_airTemperature_msonet_airTemperature_188_pt00m
Result using min_spacing: PT30M:
fv3jedi_vs_gsi_increment_airTemperature_msonet_airTemperature_188_pt30m

@SamuelDegelia-NOAA
Copy link
Contributor Author

Thanks @delippi for the explanation, it makes sense why you removed the temporal thinning now since its a much better match with GSI without it. For the PT00M not being the same as turning it off, could that be because there are some duplicate obs with the same time offset? And one of those gets removed with PT00M?

@SamuelDegelia-NOAA
Copy link
Contributor Author

SamuelDegelia-NOAA commented Oct 24, 2024

Seven mpas-jedi tests fail after updating OOPS, all due to oops::TestReferenceFloatMismatchError:

The following tests FAILED:
         37 - test_mpasjedi_3denvar_amsua_allsky (Failed)
         38 - test_mpasjedi_3denvar_amsua_bc (Failed)
         40 - test_mpasjedi_3dfgat (Failed)
         43 - test_mpasjedi_4denvar_VarBC (Failed)
         44 - test_mpasjedi_4denvar_VarBC_nonpar (Failed)
         47 - test_mpasjedi_4dfgat (Failed)
         54 - test_mpasjedi_lgetkf_height_vloc (Failed)

These are the same failures discussed here and are expected.

FV3-JEDI tests are ongoing - there are quite a lot...!

@SamuelDegelia-NOAA SamuelDegelia-NOAA linked an issue Oct 24, 2024 that may be closed by this pull request
@delippi
Copy link
Collaborator

delippi commented Oct 24, 2024

Seven mpas-jedi tests fail after updating OOPS, all due to oops::TestReferenceFloatMismatchError:

The following tests FAILED:
         37 - test_mpasjedi_3denvar_amsua_allsky (Failed)
         38 - test_mpasjedi_3denvar_amsua_bc (Failed)
         40 - test_mpasjedi_3dfgat (Failed)
         43 - test_mpasjedi_4denvar_VarBC (Failed)
         44 - test_mpasjedi_4denvar_VarBC_nonpar (Failed)
         47 - test_mpasjedi_4dfgat (Failed)
         54 - test_mpasjedi_lgetkf_height_vloc (Failed)

These are the same failures discussed here and are expected.

FV3-JEDI tests are ongoing - there are quite a lot...!

@SamuelDegelia-NOAA, can you run ncdump -h ioda_msonet.nc (or on whatever ioda is used) and see if you see if height and/or stationElevation is an integer? I think it is supposed to be a float which was what I was noticing before. I'm just now testing my yamls again in MPAS-JEDI and am noticing that it works when those are floats and NOT ints.

@SamuelDegelia-NOAA
Copy link
Contributor Author

SamuelDegelia-NOAA commented Oct 24, 2024

@SamuelDegelia-NOAA, can you run ncdump -h ioda_msonet.nc (or on whatever ioda is used) and see if you see if height and/or stationElevation is an integer? I think it is supposed to be a float which was what I was noticing before. I'm just now testing my yamls again in MPAS-JEDI and am noticing that it works when those are floats and NOT ints.

@delippi For the data we currently have staged in obs_ctest, both MetaData/height and MetaData/stationElevation are ints. Is this referring to the Temporal Thinning working when they are floats?

@delippi
Copy link
Collaborator

delippi commented Oct 24, 2024

oops::TestReferenceFloatMismatchError

I was wondering if that might be why your ctests are failing with a float mismatch error? I'm finding that it doesn't fail either way for FV3, but it does fail for MPAS when those data types are ints.

@SamuelDegelia-NOAA
Copy link
Contributor Author

oops::TestReferenceFloatMismatchError

I was wondering if that might be why your ctests are failing with a float mismatch error? I'm finding that it doesn't fail either way for FV3, but it does fail for MPAS when those data types are ints.

Oh I gotcha. I think these mpasjedi tests are failing primarily because the test references are built with CRTMv3, while we are using v2.4 in RDASApp.

@hu5970
Copy link
Contributor

hu5970 commented Oct 24, 2024

@spanNOAA Could you review this PR?

@hu5970
Copy link
Contributor

hu5970 commented Oct 24, 2024

@delippi "the temporal thinning" is not the filter that remove duplicated surface obs. right? We want to remove the duplicated surface observation.

@SamuelDegelia-NOAA
Copy link
Contributor Author

For the fv3-jedi tests, the same seven tests fail as documented here, all due to the reference mismatch error. So again, to be expected and not a problem.

94% tests passed, 7 tests failed out of 127

Label Time Summary:
fv3-jedi    = 16980.48 sec*proc (126 tests)
fv3jedi     = 16984.82 sec*proc (127 tests)
mpi         = 16965.34 sec*proc (115 tests)
script      = 16984.82 sec*proc (127 tests)

Total Test time (real) = 17004.99 sec

The following tests FAILED:
         70 - fv3jedi_test_tier1_hofx_nomodel_abi_radii (Failed)
         88 - fv3jedi_test_tier1_hyb-3dvar (Failed)
         91 - fv3jedi_test_tier1_3dvar_lam_cmaq (Failed)
         96 - fv3jedi_test_tier1_hyb-fgat_fv3lm (Failed)
         98 - fv3jedi_test_tier1_4denvar (Failed)
         99 - fv3jedi_test_tier1_4denvar_seq (Failed)
        111 - fv3jedi_test_tier1_diffstates_lam_cmaq (Failed)
Errors while running CTest
Output from these tests are in: /scratch1/BMC/zrtrr/Samuel.Degelia/RDASApp_updateoops/RDASApp/build/fv3-jedi/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

@SamuelDegelia-NOAA
Copy link
Contributor Author

The updates to OOPS in this PR are superseded by #217. I'll go ahead and close this PR and plan to add the reduce obs space filter actions as a separate PR once #217 is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update OOPS and IODA hashes
4 participants