
s3 crawler -- doesn't handle "gracefully" situation of completely removed keys in a versioned bucket #73

Open
yarikoptic opened this issue Mar 24, 2020 · 0 comments

Comments

@yarikoptic
Member

The use case is:

(git)smaug:/mnt/datasets/datalad/crawl/adhd200/RawDataBIDS/WashU[master]git
$> datalad ls -aL 's3://fcp-indi/data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json'
Connecting to bucket: fcp-indi
[INFO   ] S3 session: Connecting to the bucket fcp-indi with authentication
Bucket info:
  Versioning: S3ResponseError: 403 Forbidden
     Website: S3ResponseError: 403 Forbidden
         ACL: S3ResponseError: 403 Forbidden
data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json 2020-02-12T21:29:03.000Z DeleteMarker
data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json 2020-02-12T21:26:29.000Z  996 ver:qbldQOJmB_DYp40eh3AeJlrgfZa281N2  acl:S3ResponseError: 404 Not Found
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json</Key><RequestId>AE14DA04D623BAC5</RequestId><HostId>jK7XM5sEAeaZz2bd1OCXVoMMb/0C46iKyzL4VpGsB7rcypyvU+CeaHKH3I5uh4LX84GNm8UZdZQ=</HostId></Error>  http://fcp-indi.s3.amazonaws.com/data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json?versionId=qbldQOJmB_DYp40eh3AeJlrgfZa281N2 [E: 403]
data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json 2020-02-12T15:05:14.000Z 1643 ver:_PImN3HbTRK9vXnFRgUg6Kq4HmF5Z7r.  acl:S3ResponseError: 404 Not Found
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json</Key><RequestId>AA76EA65F16A68E5</RequestId><HostId>JIf20KPlc5LwBbAy9cL40Z4yCQ+jDEGMUXMM1wouKw6t5uKqGvrwyhiIPsT9PUTPjLeQvyBdeZw=</HostId></Error>  http://fcp-indi.s3.amazonaws.com/data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json?versionId=_PImN3HbTRK9vXnFRgUg6Kq4HmF5Z7r. [E: 403]
data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json 2017-02-07T23:11:18.000Z 1407 ver:5110B.k9Fdmo3CErwRlDd4oqKL.Pf5Vp  acl:S3ResponseError: 404 Not Found
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json</Key><RequestId>C8EF28ACF8E83E31</RequestId><HostId>kXAKuNSOZCcCDMONQuW7LKrEuWX2WHm9TM8A9aigOzQpYkSoZass1YWkiR+IF7uk36n+h2WwcoI=</HostId></Error>  http://fcp-indi.s3.amazonaws.com/data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json?versionId=5110B.k9Fdmo3CErwRlDd4oqKL.Pf5Vp [OK]
data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json 2017-02-07T23:11:08.000Z DeleteMarker
data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json 2017-01-11T22:41:38.000Z 1407 ver:1NKbyRk7K0A4DoCrsbmcdDwO_yb7NOvs  acl:S3ResponseError: 404 Not Found
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json</Key><RequestId>59732ACF13CC104D</RequestId><HostId>5UPTjFRK9XfhWz3cECx7NUBVV4RvVC+7tlzNpdFFXEbojncvygFzeE7gmtmbDWOP0nYvqQVs5VA=</HostId></Error>  http://fcp-indi.s3.amazonaws.com/data/Projects/ADHD200/RawDataBIDS/WashU/task-reststudy3_task-rest_bold.json?versionId=1NKbyRk7K0A4DoCrsbmcdDwO_yb7NOvs [OK]

At the moment there is an option to skip all problematic files entirely, but we do not want to apply the same rule both to completely removed keys and to permission-denied ones (the latter might still get fixed). So we need a more specific option that says which kinds of problems to skip and which should still cause a failure.
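
To make the request concrete, here is a minimal sketch of what such a finer-grained rule could look like. It is not the crawler's actual code: `Problem`, `classify`, and `should_skip` are hypothetical names, and letting the S3 error code from the XML body take priority over the HTTP status is an assumption made for illustration.

```python
from enum import Enum
import xml.etree.ElementTree as ET


class Problem(Enum):
    """Hypothetical categories for a failed per-version request."""
    REMOVED = "removed"   # key/version is gone for good
    DENIED = "denied"     # permissions problem, might still be fixed
    OTHER = "other"


# S3 error codes as they appear in the <Code> element of an error body
REMOVED_CODES = {"NoSuchKey", "NoSuchVersion"}
DENIED_CODES = {"AccessDenied"}


def classify(status, body=""):
    """Map an HTTP status and an optional S3 XML error body to a Problem.

    Assumption: when the body carries an error code, it is trusted over
    the bare HTTP status.
    """
    code = None
    if body:
        try:
            code = ET.fromstring(body).findtext("Code")
        except ET.ParseError:
            code = None  # not XML, fall back to the status alone
    if code in REMOVED_CODES:
        return Problem.REMOVED
    if code in DENIED_CODES:
        return Problem.DENIED
    if status == 404:
        return Problem.REMOVED
    if status == 403:
        return Problem.DENIED
    return Problem.OTHER


def should_skip(problem, skip=frozenset({Problem.REMOVED})):
    """Return True if this kind of problem should be skipped, not failed on."""
    return problem in skip
```

With the default `skip` set, a removed key would be skipped while a permission-denied one would still fail; passing a different set would be the user-facing option this issue asks for.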
