
Duplicate Files in Migrated Items #1918

Open
25 of 26 tasks
sec122 opened this issue Sep 6, 2024 · 6 comments
sec122 commented Sep 6, 2024

Duplicate Files in Migrated Items

Expected behavior

Only one copy of each file should be present in migrated objects.

Actual behavior

Some files are duplicated in migrated datasets; the duplicates are easily spotted by filenames with a prefix of either "dataspace" or "globus".

These duplicated files fall into three groups:
a) Only the README files are duplicated - 22 cases
b) All files are duplicated (and there are TAR files) - 6 cases; tracked in a separate ticket, #1920
^ Matt has more info about how we want to handle these cases
c) All files are duplicated (and there are no TAR files) - 1 case
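For triage, the grouping above can be reproduced by stripping the known prefixes and grouping on the remaining base name. This is a minimal sketch, not the project's actual migration code; the prefix list and filenames are illustrative assumptions based on the prefixes noted above.

```ruby
# Illustrative sketch: group filenames by base name after stripping the
# migration prefixes, then keep only base names with more than one copy.
# The prefix list is an assumption, covering the variants quoted in this issue.
PREFIXES = /\A(dataspace_|data_space_|globus_)/

def duplicate_groups(filenames)
  filenames.group_by { |name| name.sub(PREFIXES, "") }
           .select { |_base, copies| copies.size > 1 }
end

files = ["dataspace_README.txt", "globus_README.txt", "results.csv"]
duplicate_groups(files)
# => {"README.txt"=>["dataspace_README.txt", "globus_README.txt"]}
```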

Steps to replicate

View the full list of items (color coded in red) on the NEEDS ATTENTION tab of the "Copy of RDOS Records in DataSpace" google sheet https://docs.google.com/spreadsheets/d/130B7RMhnqSeTIKPFBdDsrSVbC1C_PCdZp0qwTucR0QA/edit?usp=sharing

Issue type = "Duplicate Files beyond Readmes" and "Duplicate Readmes Only" for specific examples and links to the records.

Impact of this bug

We cannot approve these datasets until the issue is fixed. Therefore, these records remain in DataSpace until the issue is resolved.

Honeybadger link and code snippet, if applicable

Implementation notes, if any

I believe @carolyncole may already have a script to take care of these issues, since she fixed a very similar issue earlier in the migration. It's unclear whether this requires writing a new, similar script or rerunning the existing one.

Acceptance criteria

@carolyncole

Hey team! Please add your planning poker estimate with Zenhub @bess @hectorcorrea @JaymeeH @leefaisonr

@carolyncole carolyncole self-assigned this Sep 11, 2024
@carolyncole

Here is the output of the rake task showing that there are some actual checksum mismatches between Globus and DataSpace.

@sec122 the curators will need to look into these further.

442.txt
479.txt
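To verify mismatches like those in the attached output, the comparison boils down to hashing both copies of each paired file. A minimal sketch (the file paths are hypothetical placeholders, and the actual rake task may work differently):

```ruby
require "digest"

# Sketch: compare the SHA-256 digests of a DataSpace copy and a Globus copy
# of the same file. The checksum algorithm is an assumption.
def checksums_match?(path_a, path_b)
  Digest::SHA256.file(path_a).hexdigest == Digest::SHA256.file(path_b).hexdigest
end

# Example usage (paths are placeholders, not real migrated files):
# checksums_match?("data_space_data.nc", "globus_data.nc")
```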

@matthewjchandler

@sec122 In the case of 479 mentioned above, the only difference I see is in the README file, and the one starting with "globus_" shows some character-encoding errors when viewed in a browser (for me, at least). If you don't see any substantial differences between the two in terms of content, then I'd recommend we go with the one starting with "data_space_".

@matthewjchandler

@sec122 As for 442, that's more mysterious. Here are the options I see:

  1. Leave all of the files in PDC for now, flag it for later review, and move on with the migration
  2. Make a judgment call to preserve the file set closest to what was originally uploaded (those starting with "data_space_" I believe)
  3. Go through all of the .nc files, figure out why the checksums don't match, and make a confident decision about what to keep and what to delete

One way or another, RDSS will need clear guidance from PRDS about what to keep and what to delete (if anything).

@carolyncole

@sec122 Those updates are completed now.

bess pushed a commit that referenced this issue Oct 25, 2024