Duplicate Files in Migrated Items #1918
Comments
@sec122 In the case of 479 mentioned above, the only difference I see is in the README file, and the one starting with "globus_" shows some character-encoding errors when viewed in a browser (for me, at least). If you don't see any substantial differences between the two in terms of content, then I'd recommend we go with the one starting with "data_space_".
@sec122 As for 442, that's more mysterious. Here are the options I see:
One way or another, RDSS will need clear guidance from PRDS about what to keep and what to delete (if anything).
@sec122 Those updates are completed now.
Needed for cleanup #1918 Co-authored-by: Hector Correa <[email protected]>
Expected behavior
Only one copy of each file should be present in migrated objects.
Actual behavior
Some files are duplicated in migrated datasets; these are easy to spot because the filenames carry a prefix of either "dataspace" or "globus".
These duplicated file cases fall into three groups:
a) Only the README files are duplicated - 22 cases
b) All files are duplicated (and there are TAR files) - 6 cases. In a separate ticket, #1920, Matt has more info about how we want to handle these cases.
c) All files are duplicated (and there are no TAR files) - 1 case
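The prefix convention above makes these duplicates detectable mechanically. As a rough sketch (the prefixes come from this issue; the helper name, exact prefix spellings, and sample filenames are assumptions, not the actual migration code), a script could group a work's filenames by their de-prefixed base name and flag any base name with more than one copy:

```ruby
# Hypothetical helper: given a list of filenames from a migrated work,
# pair "dataspace_*" / "globus_*" copies by their shared base name.
def duplicate_pairs(filenames)
  filenames
    .select { |name| name.start_with?("dataspace_", "globus_") }
    .group_by { |name| name.sub(/\A(dataspace_|globus_)/, "") }
    .select { |_base, copies| copies.size > 1 }
end

files = [
  "dataspace_README.txt",
  "globus_README.txt",
  "dataspace_results.csv"
]

# Only the README has both copies; results.csv has no "globus_" twin.
puts duplicate_pairs(files).keys.inspect # => ["README.txt"]
```

A report built this way would also distinguish case (a) from cases (b) and (c): if the only flagged base name is the README, the work belongs in group (a).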
Steps to replicate
View the full list of items (color-coded in red) on the NEEDS ATTENTION tab of the "Copy of RDOS Records in DataSpace" Google Sheet: https://docs.google.com/spreadsheets/d/130B7RMhnqSeTIKPFBdDsrSVbC1C_PCdZp0qwTucR0QA/edit?usp=sharing
Filter by Issue type = "Duplicate Files beyond Readmes" or "Duplicate Readmes Only" for specific examples and links to the records.
Impact of this bug
We cannot approve these datasets until the duplicates are removed, so these records remain in DataSpace in the meantime.
Honeybadger link and code snippet, if applicable
Implementation notes, if any
I believe @carolyncole may already have a script to take care of these issues, since she fixed a very similar issue for us earlier in the migration. I'm unsure whether this requires writing a new, similar script or rerunning the existing one.
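Whatever form that script takes, a sensible safety check before deleting anything is to confirm the two copies are byte-identical. A minimal sketch, assuming the guidance in the comments above (keep the "data_space_" copy, drop the "globus_" one) and that deletion is only safe when checksums match; the method name and file paths are hypothetical:

```ruby
require "digest"
require "tempfile"

# Hypothetical check (not the existing migration script): a "globus_" copy
# is only safe to delete if it is byte-identical to its "data_space_" twin.
def safe_to_delete?(keep_path, delete_path)
  Digest::MD5.file(keep_path).hexdigest == Digest::MD5.file(delete_path).hexdigest
end

# Demonstrate with two temp files standing in for the duplicate pair.
keep   = Tempfile.new("data_space_README")
delete = Tempfile.new("globus_README")
[keep, delete].each { |f| f.write("same content"); f.flush }

puts safe_to_delete?(keep.path, delete.path) # => true
```

Pairs that fail this check (like the "globus_" README with encoding errors mentioned above) would be reported for manual curator review rather than deleted automatically.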
Acceptance criteria
https://datacommons.princeton.edu/describe/works/353 - No files matched the globus_
https://datacommons.princeton.edu/describe/works/38 - moved to #1920 (Write Code to pull down 6 tar files, explode the tar files, checksum each file in the tar, and produce a report for the curators)
https://pdc-describe-prod.princeton.edu/describe/works/425 - moved to #1920
https://pdc-describe-prod.princeton.edu/describe/works/429 - moved to #1920
https://pdc-describe-prod.princeton.edu/describe/works/431 - moved to #1920
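For the works moved to #1920, the requested report boils down to checksumming every file inside each tar so curators can compare them against the loose copies. A sketch under those assumptions, using the TarReader/TarWriter classes that ship with RubyGems (the report format and in-memory demo tar are illustrations, not the actual #1920 implementation):

```ruby
require "rubygems/package"
require "digest"
require "stringio"

# Walk a tar stream and map each file entry's name to its MD5 checksum.
def tar_checksums(io)
  report = {}
  reader = Gem::Package::TarReader.new(io)
  reader.each do |entry|
    report[entry.full_name] = Digest::MD5.hexdigest(entry.read) if entry.file?
  end
  reader.close
  report
end

# Build a tiny tar in memory to demonstrate the report.
buffer = StringIO.new
writer = Gem::Package::TarWriter.new(buffer)
writer.add_file_simple("README.txt", 0o644, 5) { |f| f.write("hello") }
writer.close
buffer.rewind

puts tar_checksums(buffer)
```

In practice the same report, keyed by filename, could be produced for the exploded tar contents and for the files already attached to the work, and the two compared to tell curators which copies are redundant.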