Further explore duplicate basins in Congo & Australia #13

Open
aufdenkampe opened this issue Dec 30, 2024 · 0 comments

aufdenkampe commented Dec 30, 2024

As identified by @rajadain in WikiWatershed/model-my-watershed#3647 (comment), the TDX-Hydro basins datasets have a number of duplicate rows.

I did a bit of sleuthing and shared the following in an email thread.

On Dec. 10:

Those duplicates come from two TDX-HydroRegions:

"1020018110": "13",  # (Congo River Basin, Africa)
"5020049720": "54",  # (Australia, Australia and Oceania)

It looks like most of the large counts are from the Congo. Both of these places are very flat, so there could be issues with the underlying data.

On Dec. 17:

I explored the raw GeoPackage files from NGA along with the processed GeoParquet files we sent to you.

The short answer is that the duplicate basin records in the Congo and Australia were duplicates in the raw data! I’m a bit puzzled how GeoPandas allowed us to use LINKNO as the index, but that’s another question…

The good news is that we can just drop them, as they are complete duplicates! Those same duplicates don’t exist in the streamnet files, which were the basis of all of our MNSI calculations for delineation and hydrologic groupings. So all of our processing is just fine. Yay!

So this is an easy fix, and worth noting to the NGA folks sooner rather than later!

Unfortunately, I hadn't noticed that the duplicate rows were not completely identical, because as @rajadain noted they are identical "except for geom, which is a slightly different square for each case".
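
A minimal sketch of how that distinction could be checked, assuming each basin record is a plain dict with its geometry stored as a WKT string. The `streamID` column, the `geom` column name, and the sample LINKNO values are hypothetical placeholders, not taken from the TDX-Hydro files:

```python
from collections import defaultdict


def classify_duplicates(rows, key="LINKNO", geom_col="geom"):
    """Group rows by `key` and label each duplicated key by how its rows differ."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    report = {}
    for link, members in groups.items():
        if len(members) < 2:
            continue  # key appears once, so it is not a duplicate
        # Compare every column except the geometry.
        attrs = {
            tuple(sorted((k, v) for k, v in m.items() if k != geom_col))
            for m in members
        }
        geoms = {m[geom_col] for m in members}
        if len(attrs) > 1:
            report[link] = "attributes differ"
        elif len(geoms) > 1:
            report[link] = "geometry differs"
        else:
            report[link] = "exact duplicate"
    return report


# Hypothetical sample records (not real TDX-Hydro data).
rows = [
    {"LINKNO": 1, "streamID": 10, "geom": "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))"},
    {"LINKNO": 1, "streamID": 10, "geom": "POLYGON ((1 0, 2 0, 2 1, 1 1, 1 0))"},
    {"LINKNO": 2, "streamID": 20, "geom": "POLYGON ((5 5, 6 5, 6 6, 5 6, 5 5))"},
    {"LINKNO": 2, "streamID": 20, "geom": "POLYGON ((5 5, 6 5, 6 6, 5 6, 5 5))"},
    {"LINKNO": 3, "streamID": 30, "geom": "POLYGON ((9 9, 10 9, 10 10, 9 10, 9 9))"},
]

report = classify_duplicates(rows)
print(report)
```

Only keys labelled "exact duplicate" would be safe to drop by keeping the first row; the "geometry differs" keys are the ones that need further handling.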

So my recommendation to just use the first one may have been wrong. We will likely need to merge the geometries of the duplicates to get the right geometry for each TDX-Hydro Basin.
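
One possible way to do that merge is GeoPandas' `dissolve`, which unions the geometries of all rows sharing a key. This is a sketch under assumptions, not a tested fix for the real data: the `streamOrder` column and the toy squares below are hypothetical, and it assumes the non-geometry attributes really are identical within each duplicate group (so keeping the first is harmless).

```python
import geopandas as gpd
from shapely.geometry import box

# Toy stand-in for a basins layer: LINKNO 101 appears twice with
# different square geometries, LINKNO 102 appears once.
basins = gpd.GeoDataFrame(
    {"LINKNO": [101, 101, 102], "streamOrder": [1, 1, 2]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(5, 5, 6, 6)],
)

# Union the geometries per LINKNO; for the other columns keep the first
# value, which is only safe if they are identical within each group.
merged = basins.dissolve(by="LINKNO", aggfunc="first")

print(merged)
```

After the dissolve, LINKNO becomes a unique index, and each formerly duplicated basin carries the union of its squares rather than just the first one.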

This issue is a placeholder to explore further for the next round of work.
