
Losing attributes when exporting data with: write_dataset_to_netcdf #1585

Open

Nick-Maurer opened this issue May 23, 2024 · 6 comments

@Nick-Maurer

When exporting and re-loading data with datacube's write_dataset_to_netcdf function, some attributes are lost; in particular, dictionary-valued attributes appear to be dropped. I ran into this when exporting some Landsat water-observation data, where the flags_definition dictionary does not survive the export. Ideally the export function should keep the dataset intact for future use.
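
For context, a minimal sketch of the kind of workflow where the loss shows up. The product and measurement names are placeholders, and the import path for write_dataset_to_netcdf may vary between datacube versions:

import datacube
import xarray as xr
from datacube.drivers.netcdf import write_dataset_to_netcdf  # import path may differ by version

dc = datacube.Datacube()

# Hypothetical load of a Landsat water-observation product; names are placeholders.
ds = dc.load(product="ga_ls_wo_3", measurements=["water"],
             x=(153.3, 153.4), y=(-27.6, -27.5), time="2023-01")

print(ds.water.attrs.get("flags_definition"))  # a dict describing the bit flags

write_dataset_to_netcdf(ds, "water.nc")

reloaded = xr.open_dataset("water.nc")
print(reloaded.water.attrs.get("flags_definition"))  # None -- the attribute is gone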

@robbibt
Contributor

robbibt commented May 23, 2024

@SpacemanPaul @Kirill888 This seems like a bug - writing out datacube data should preserve those important datacube attributes.

However, it also suggests we should consider relocating the write_dataset_to_netcdf function to odc.geo in the future, e.g. by adding an .odc.write_netcdf method to sit alongside .odc.write_cog (the existing xarray .to_netcdf method doesn't work with datacube data and fails with serialisation errors).
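
Purely as a hypothetical illustration of how the proposed API could look (no .odc.write_netcdf method exists in odc.geo today; the product name is a placeholder):

import datacube
import odc.geo.xr  # registers the .odc accessor

dc = datacube.Datacube()
ds = dc.load(product="ga_ls_wo_3", measurements=["water"],
             x=(153.3, 153.4), y=(-27.6, -27.5), time="2023-01")

# Existing: per-band Cloud Optimised GeoTIFF export via the odc accessor.
ds.water.odc.write_cog("water.tif")

# Proposed (hypothetical): a matching NetCDF writer that handles
# datacube-specific attributes such as flags_definition.
ds.odc.write_netcdf("water.nc")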

@Nick-Maurer
Author

[Screenshot attached: Screenshot 2024-05-23 at 2 24 21 PM]

@robbibt
Contributor

robbibt commented May 23, 2024

Copying chat from Slack here @SpacemanPaul @Kirill888:
[Screenshot of the Slack conversation]

@Kirill888
Member

write_dataset_to_netcdf was used by the ingestor; its main purpose was to encode spatial information stored in attributes into the NetCDF-CF convention. Datasets created by odc-geo should have all the spatial attributes already populated, so xarray's plain .to_netcdf method should just work without any information loss, provided the attributes were not polluted with non-serializable data, as is the case for mask bands loaded by datacube.
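
For illustration, a small standalone sketch of that failure mode (the exact error message from xarray may differ between versions):

import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((2, 2), dtype="uint8"), dims=("y", "x"), name="water")
# A dict-valued attribute, as produced for mask bands loaded by datacube.
da.attrs["flags_definition"] = {"wet": {"bits": 7, "values": {"0": False, "1": True}}}

ds = da.to_dataset()
ds.to_netcdf("plain.nc")
# Raises TypeError: netCDF attribute values must be str, a number, or a
# list/tuple/ndarray of numbers, so the dict either has to be dropped or
# serialized to a string before writing.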

The real issue here is that the code that populates and consumes the flags_definition attribute stores it as a Python dictionary, which causes issues for the .to_netcdf() function. Instead it should use some form of serialization to string and back when storing/extracting masking info; JSON would be a reasonable candidate here.

Extraction code is here:

def get_flags_def(variable):
    flags = getattr(variable, FLAGS_ATTR_NAME, None)
    if flags is not None:
        return flags

    data_vars = getattr(variable, 'data_vars', None)
    if data_vars is not None:
        # Maybe we have a DataSet, not a DataArray
        for var in data_vars.values():
            flags = getattr(var, FLAGS_ATTR_NAME, None)
            if flags is not None:
                return flags

    raise ValueError('No masking variable found')

so it's just a matter of adding flags = json.loads(flags) somewhere there.
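
A rough sketch of what that could look like, not a tested patch; it assumes FLAGS_ATTR_NAME comes from the surrounding module and that the attribute may come back from NetCDF as a JSON string:

import json

def get_flags_def(variable):
    flags = getattr(variable, FLAGS_ATTR_NAME, None)
    if flags is not None:
        # After a NetCDF round trip the attribute may be a JSON string rather than a dict.
        return json.loads(flags) if isinstance(flags, str) else flags

    data_vars = getattr(variable, 'data_vars', None)
    if data_vars is not None:
        # Maybe we have a DataSet, not a DataArray
        for var in data_vars.values():
            flags = getattr(var, FLAGS_ATTR_NAME, None)
            if flags is not None:
                return json.loads(flags) if isinstance(flags, str) else flags

    raise ValueError('No masking variable found')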

And on the create side, code here will need to encode flags_definition appropriately:

def dataarray_attrs(self) -> Dict[str, Any]:
    """This returns attributes filtered for display in a dataarray."""
    return {key: value for key, value in self.items() if key not in self.ATTR_SKIP}
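
One possible shape for that change, again only a sketch; it assumes the surrounding class context (self.items(), self.ATTR_SKIP) and that FLAGS_ATTR_NAME is the 'flags_definition' attribute name used by datacube's masking code:

import json

def dataarray_attrs(self) -> Dict[str, Any]:
    """This returns attributes filtered for display in a dataarray."""
    attrs = {key: value for key, value in self.items() if key not in self.ATTR_SKIP}
    # Hypothetical change: serialize the flag definitions to a JSON string so
    # that xarray's .to_netcdf() can write the attribute without errors.
    if isinstance(attrs.get(FLAGS_ATTR_NAME), dict):
        attrs[FLAGS_ATTR_NAME] = json.dumps(attrs[FLAGS_ATTR_NAME])
    return attrs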

@Kirill888
Member

But honestly, the flags_definition structure is rather odd and tricky to create properly; we need a better mechanism for describing bit-mask information.

@SpacemanPaul
Contributor

But honestly, the flags_definition structure is rather odd and tricky to create properly; we need a better mechanism for describing bit-mask information.

I think we're starting to get into Enhancement Proposal territory there.
