This repository has been archived by the owner on Nov 23, 2023. It is now read-only.

Bug: New version of every metadata file when a new dataset is added #1818

Closed
11 tasks
billgeo opened this issue Jul 4, 2022 · 2 comments · Fixed by #1915
Assignees
Labels
bug Something isn't working

Comments

@billgeo
Contributor

billgeo commented Jul 4, 2022

Bug Description

Every time a new dataset is added, a new version of every other metadata file is created in the S3 bucket. This may be expected given the way we use pystac, but it could add up to terabytes of data if we have a large number of datasets, each with a large number of metadata files. It could also lead to performance issues.

We first need to investigate and find the best options to resolve this.
It could be related to stac-utils/pystac#90.

See here for example of many file versions:

Screenshot from 2022-07-05 08-24-26.png

How to Reproduce

  1. Add a new 'dataset'
  2. Add a new 'dataset version'
  3. Add a second 'dataset version'
  4. See new versions for metadata files in other sibling datasets

What did you expect to happen?

What actually happened?

Software Context

Operating system:

Environment:

Relevant software versions:

  • AWS CLI:
  • Poetry:

Additional context

Definition of Done

  • This bug is done:
    • Bug resolved to user's satisfaction
    • Automated tests are passing
    • Code is peer reviewed and pushed to master
    • Deployed successfully to test environment
    • Checked against CODING guidelines
    • Relevant new tasks are added to backlog and communicated to the team
    • Important decisions recorded in the issue ticket
    • Readme/Changelog/Diagrams are updated
    • Product Owner has approved as complete
    • No regression to functional or non-functional requirements
@billgeo billgeo added the bug Something isn't working label Jul 4, 2022
@Jimlinz Jimlinz self-assigned this Jul 28, 2022
Jimlinz added a commit that referenced this issue Aug 4, 2022
Pystac normalize_hrefs resolves all links and mutates the entire tree, causing collection.json for all datasets to update. save_object only saves the particular entity within the catalog, preventing unnecessary metadata duplication. Fixes #1818
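The difference described in this commit message can be illustrated with a small, self-contained simulation. This is only a sketch and not pystac or Geostore code: `VersionedStore`, `render`, `save_whole_tree` and `save_single_object` are hypothetical names, and the file layout is a simplified assumption. It counts how many versions a versioned bucket would accumulate when every metadata file is rewritten on each change, compared with writing only the affected objects.

```python
import json
from collections import defaultdict


class VersionedStore:
    """Toy stand-in for a versioned bucket: every put adds a new version."""

    def __init__(self):
        self.versions = defaultdict(list)

    def put(self, key, body):
        self.versions[key].append(body)

    def version_count(self, key):
        return len(self.versions[key])


def render(name, children=()):
    """Serialise a minimal, made-up metadata document."""
    return json.dumps({"id": name, "links": sorted(children)})


def save_whole_tree(store, datasets):
    """Rewrite the root catalog and every dataset file on each change."""
    store.put("catalog.json", render("root", datasets))
    for name in datasets:
        store.put(f"{name}/collection.json", render(name))


def save_single_object(store, datasets, new_dataset):
    """Rewrite only the root catalog and the newly added dataset."""
    store.put("catalog.json", render("root", datasets))
    store.put(f"{new_dataset}/collection.json", render(new_dataset))


def simulate(save_all):
    """Add three datasets one at a time and return the resulting store."""
    store = VersionedStore()
    datasets = []
    for name in ["a", "b", "c"]:
        datasets.append(name)
        if save_all:
            save_whole_tree(store, datasets)
        else:
            save_single_object(store, datasets, name)
    return store


whole = simulate(save_all=True)
single = simulate(save_all=False)
print(whole.version_count("a/collection.json"))   # 3
print(single.version_count("a/collection.json"))  # 1
```

In this toy model the first dataset's file gains a version every time a sibling is added under the whole-tree approach, while the targeted approach leaves it at a single version.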
@Jimlinz Jimlinz linked a pull request Aug 4, 2022 that will close this issue
Jimlinz added a commit that referenced this issue Aug 9, 2022
Pystac offers optional title field in link object (human-readable title to be used in rendered displays of the link). Removing normalized_href (put in place to solve #1818) causes tests to fail due to missing title.
Jimlinz added a commit that referenced this issue Aug 10, 2022
Pystac offers optional title field in link object (human-readable title to be used in rendered displays of the link). Removing normalized_href (put in place to solve #1818) causes tests to fail due to missing title.
Jimlinz added a commit that referenced this issue Aug 11, 2022
Pystac offers optional title field in link object (human-readable title to be used in rendered displays of the link). Removing normalized_href (put in place to solve #1818) causes tests to fail due to missing title.
Jimlinz added a commit that referenced this issue Aug 12, 2022
Pystac offers optional title field in link object (human-readable title to be used in rendered displays of the link). Removing normalized_href (put in place to solve #1818) causes tests to fail due to missing title.
@kodiakhq kodiakhq bot closed this as completed in #1915 Sep 12, 2022
@billgeo billgeo reopened this Sep 13, 2022
@mfwightman mfwightman moved this from Backlog to Reviewing in Data Infrastructure Squad Sep 13, 2022
@billgeo
Contributor Author

billgeo commented Sep 14, 2022

Tested this and it's working as expected: collection.json gets a new version, but the child items don't. Thanks @Jimlinz!

FYI: I also created a follow-up issue, #2022, to undo these changes once pystac handles this better.

@billgeo billgeo closed this as completed Sep 14, 2022
@Jimlinz
Contributor

Jimlinz commented Sep 14, 2022

The S3 ETag solution I am currently working on (#1995) should prevent collection.json from being written as a new version when its content is identical. This should be sufficient until we work on #2022.
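As a rough illustration of the idea (not the actual implementation in #1995, which is not shown in this thread), a write can be skipped when the hash of the new content matches the stored object's ETag. The sketch below uses an in-memory stand-in for the bucket; `VersionedStore` and `put_if_changed` are hypothetical names. It assumes the ETag is the MD5 hex digest of the body, which on real S3 holds only for single-part uploads that are not encrypted with SSE-KMS or SSE-C.

```python
import hashlib


class VersionedStore:
    """Toy stand-in for a versioned bucket: every put adds a new version."""

    def __init__(self):
        self.versions = {}

    def etag(self, key):
        """Return the MD5 hex digest of the latest version, or None if absent."""
        history = self.versions.get(key)
        if not history:
            return None
        return hashlib.md5(history[-1]).hexdigest()

    def put(self, key, body):
        self.versions.setdefault(key, []).append(body)


def put_if_changed(store, key, body):
    """Write body only if it differs from the latest stored version."""
    if store.etag(key) == hashlib.md5(body).hexdigest():
        return False  # identical content, so no new version is created
    store.put(key, body)
    return True


store = VersionedStore()
print(put_if_changed(store, "collection.json", b'{"id": "a"}'))  # True
print(put_if_changed(store, "collection.json", b'{"id": "a"}'))  # False
print(len(store.versions["collection.json"]))                    # 1
```

A comparison like this only helps if the serialised metadata is byte-for-byte stable between runs, for example with a consistent key order and no embedded timestamps.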

@mfwightman mfwightman moved this from Reviewing to Closed in Data Infrastructure Squad Sep 14, 2022
@mfwightman mfwightman moved this from ✅ Closed to 👀 Reviewing in Data Infrastructure Squad Sep 14, 2022
@mfwightman mfwightman moved this from 👀 Reviewing to ✅ Closed in Data Infrastructure Squad Sep 14, 2022
2 participants