Instantiate DataLad dandisets from backup #250
Comments
ds = Dataset(path)
if not ds.is_installed():
    ds.create(...)

I don't think we should bother adding them all into a "super-dataset" ATM, so we should not worry whether that path might be an uninstalled submodule or something like that.
@yarikoptic What is the proper way to set the git-annex backend to SHA256E? According to this page, it should be doable by just changing the annex backend setting. (Incidentally, is there a better way of getting a file's git-annex digest than splitting apart the key?)
backend -- apparently no easy way! kheh kheh -- see datalad/datalad#4978 for the issue and my "workaround".
digest -- you could use ...
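For illustration, a minimal sketch of the key-splitting approach being discussed (not a datalad API): a git-annex key such as SHA256E-s1234--<digest>.nwb carries the digest between the "--" separator and the first extension dot, so plain string operations suffice. The helper name is made up.

# Hypothetical helper: pull the digest out of a git-annex key.
# Key layout: <BACKEND>-s<size>--<digest>[.<extension>]
def digest_from_annex_key(key: str) -> str:
    _backend_and_size, rest = key.split("--", 1)   # e.g. "SHA256E-s1234", "0123abcd.nwb"
    return rest.split(".", 1)[0]                   # drop the extension kept by the *E backends

print(digest_from_annex_key("SHA256E-s1234--0123abcd.nwb"))  # -> 0123abcd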
forgot to add: if the backup has no corresponding key (we are after all backing up only once a day; I will increase the frequency) - the file will need to be downloaded first, e.g. using datalad's ...
@yarikoptic The Dandi asset metadata (as returned by the API) ...
Shouldn't matter. Probably use the ones from the top level.
Re mtime - it might be cool if you used the latest mtime of the files to be committed as the commit date? (One of the two - IIRC there is an author date and a committer date; you could check how datalad does it for its fake-dates mode.)
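A rough sketch of that idea (not necessarily what the script ends up doing): git accepts explicit dates through the GIT_AUTHOR_DATE and GIT_COMMITTER_DATE environment variables, so the newest mtime among the files being committed could be fed in at commit time. Paths and the commit message below are illustrative.

import os
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def commit_with_latest_mtime(repo: str, files: list, message: str) -> None:
    # Use the newest mtime among the files being committed as both git dates.
    latest = max(Path(repo, f).stat().st_mtime for f in files)
    stamp = datetime.fromtimestamp(latest, tz=timezone.utc).isoformat()
    env = dict(os.environ, GIT_AUTHOR_DATE=stamp, GIT_COMMITTER_DATE=stamp)
    subprocess.run(["git", "add", *files], cwd=repo, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo, env=env, check=True)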
@yarikoptic Incidentally, should the query parameters be stripped from URLs before associating them with files? They're rather long, and they seem superfluous.
I am sorry, I am not exactly following what query parameters you mean (an example of the URL would have helped). My guess is you followed my desire to store the URL on S3 as it was redirected to by girder, and all those query parameters are the expiration etc.? Then yeah -- we do not want that, and the URL (since I believe versioning was turned off, right @satra?) should be just a URL pointing to the asset store (so we could actually even just create it ourselves without asking girder, I guess, since we already interrogate girder about that to find the file in a local backup).
edit 1: we should not skip validation of the URL, or we would not have any guarantee that we used a correct, at least currently valid, URL.
re removing -- ...
@yarikoptic Here's a URL that the script is currently trying to associate with a file:
@yarikoptic - versioning has not been turned off yet. if we get full datalad datasets and can maintain that, it will be easier to turn off versioning from that information and remove older versions of things.
that is good (that it is not off yet). Then our "make a datalad dataset" script can become more robust and not puke if someone removes a file while we are "crawling" it, provided we use versioned URLs! So, @jwodder - let's proceed with versioned URLs!
but programmatically -- we have ...
alternatively - you could mint it in your code yourself, reusing the same bucket instance etc. - up to you
@yarikoptic Getting a versioned URL isn't working for me. Using the file in Dandiset 000027 as an example (file ID ...):
crap, "confirmed", needs to be filed/fixed... if you could just get custom boto (or boto3) code to get the bucket, get the relevant key and its most recent version -- that would be great. Or just proceed without versioning and I will provide a fix on the datalad end later on to just add that call.
edit: filed datalad/datalad#4984
@yarikoptic It seems that AWS credentials are required for getting any information about objects via boto3.
there should be some way for anonymous access. Anyways -- here is the fix for datalad: datalad/datalad#4985 (against ...)
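For reference, a sketch of the "mint it yourself with boto3" route mentioned above, using anonymous (unsigned) access; the bucket name, key, and URL shape are taken from examples elsewhere in this thread and are assumptions, not guarantees about the archive.

import boto3
from botocore import UNSIGNED
from botocore.client import Config
from urllib.parse import quote

# Anonymous S3 client -- no AWS credentials needed for a public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

bucket = "dandiarchive"                                                  # assumed bucket name
key = "girder-assetstore/dd/ac/ddac7448a1a847d5843078d1ec772dba"         # example key from this thread

# On a versioned bucket, head_object reports the VersionId of the latest version.
info = s3.head_object(Bucket=bucket, Key=key)
version_id = info.get("VersionId")

url = f"https://{bucket}.s3.amazonaws.com/{quote(key)}"
if version_id:
    url += f"?versionId={version_id}"
print(url)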
@yarikoptic GitHub siblings are being created in https://github.com/dandisets-testing, but for some reason their default branch is set to git-annex instead of master. How do I fix that?
i think it may be because github has moved away from master as default for new repositories: https://www.zdnet.com/article/github-to-replace-master-with-main-starting-next-month/
@yarikoptic After running for almost seven hours, the script failed while trying to add a URL to a file in 000026 with the error message:
The error message seems to be implying that the file in question was somehow added to the repository twice, but the script output does not support that interpretation. Also, the files in 000026 (and also 000025) are not arranged in the normal Dandiset manner; should the script do anything about this?
@yarikoptic It appears that adding a file to git-annex updates its mtime, so using mtimes to determine whether a file has been modified won't work — or should the script try forcing the mtime back to the "expected" value after adding each file?
@satra I don't believe that's it. If I try creating & pushing a repository with two branches, one named "master", one named "test", it's whichever is pushed first that ends up as the default branch.
mtime: oh well, git does not store mtimes anyway, so for that purpose ...
edit: maybe forget about mtime, since you rely on the hash to verify whether a re-download is needed... any recently uploaded file should have a hash, so mtime would only be needed for old ones, to allow downloading them once without a known hash.
nope... the script should not be file-layout specific in any way (besides that there should be no dandiset.yaml file on S3 / in the list of assets)
re 000026 and addurl: where are those (on drogon?)? (I thought to have a look into the git history.) Ideally the log from the script (for each dandiset) should be dumped into a file for introspection. Looking at the code I have not spotted anything obvious, and expect the log to tell more. I thought it might be that the file was "broken" in / copied from the assetstore, but annex issues another message then:

$> datalad create /tmp/testdsss
[INFO ] Creating a new annex repo at /tmp/testdsss
[INFO ] Scanning for unlocked files (this may take some time)
create(ok): /tmp/testdsss (dataset)
(dev3) 1 27849.....................................:Thu 08 Oct 2020 09:48:26 AM EDT:.
lena:/tmp
$> cd testdsss
(dev3) 1 27851.....................................:Thu 08 Oct 2020 09:48:32 AM EDT:.
(git-annex)lena:/tmp/testdsss[master]
$> touch ddac7448a1a847d5843078d1ec772dba\?versionId=8j4PrdzTK9zmi_8yewcuLqOwHR2qyc19
(dev3) 1 27852.....................................:Thu 08 Oct 2020 09:48:40 AM EDT:.
(git-annex)lena:/tmp/testdsss[master]
$> git annex add ddac7448a1a847d5843078d1ec772dba\?versionId=8j4PrdzTK9zmi_8yewcuLqOwHR2qyc19
add ddac7448a1a847d5843078d1ec772dba?versionId=8j4PrdzTK9zmi_8yewcuLqOwHR2qyc19
ok
(recording state in git...)
(dev3) 1 27853.....................................:Thu 08 Oct 2020 09:48:44 AM EDT:.
(git-annex)lena:/tmp/testdsss[master]
$> git annex addurl --file ddac7448a1a847d5843078d1ec772dba\?versionId=8j4PrdzTK9zmi_8yewcuLqOwHR2qyc19 'https://dandiarchive.s3.amazonaws.com/girder-assetstore/dd/ac/ddac7448a1a847d5843078d1ec772dba?versionId=8j4PrdzTK9zmi_8yewcuLqOwHR2qyc19'
addurl https://dandiarchive.s3.amazonaws.com/girder-assetstore/dd/ac/ddac7448a1a847d5843078d1ec772dba?versionId=8j4PrdzTK9zmi_8yewcuLqOwHR2qyc19
while adding a new url to an already annexed file, url does not have expected file size (use --relaxed to bypass this check) https://dandiarchive.s3.amazonaws.com/girder-assetstore/dd/ac/ddac7448a1a847d5843078d1ec772dba?versionId=8j4PrdzTK9zmi_8yewcuLqOwHR2qyc19
failed
git-annex: addurl: 1 failed
you could also enable/dump a detailed log from datalad itself (...)
re default ...
re github: happens I had to do the same, so reproduced, ...
@yarikoptic The datasets are in /mnt/backup/dandi/dandiarchive-replica on drogon.
@yarikoptic To be clear, regarding mtimes and hashes, your current recommendation is to only use hashes, and if a file doesn't have a hash in Dandiarchive, only copy it if it's not already in the dataset?
yes, and fail if it is already in the dataset and the size differs (should not happen; but a check on size at least will provide some safety blanket)... (unless you do want to keep a record of mtimes in the dataset somewhere)
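A rough sketch of that decision logic as described (the helper and the "already in dataset" check are illustrative, not the actual script):

from pathlib import Path

def decide_action(digest, dest: Path, expected_size: int) -> str:
    # Hypothetical per-asset decision for the backup-driven update.
    if digest is not None:
        return "addurl"                  # hash known: rely on it, no unconditional copy
    if not dest.exists():
        return "copy"                    # no hash and not yet in the dataset: copy once
    if dest.stat().st_size != expected_size:
        raise RuntimeError(f"{dest}: size differs from the backup -- refusing to continue")
    return "skip"                        # already present with matching size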
I do not see the problematic 000026 there
@yarikoptic Oh, sorry, I deleted it because I planned on rerunning the script to see if the error would happen again, but then the script started copying everything again because of the mtime issue, so I killed it.
@yarikoptic I reran the script for just 000026, and it failed with the same error. I'm leaving the directory there this time so you can inspect it.
I've finished running the script for all Dandisets. The only problems left should be the default branch issue and whatever's wrong with 000026.
thanks! looking at 26, I do not see ...

makes sense, since we instantiate datasets with text2git:

$> python -c 'from datalad.support.annexrepo import AnnexRepo; r=AnnexRepo(".");print(r.is_under_annex("rawdata/sub-I46/ses-MRI/anat/sub-I46_echo-1_fa-1_VFA.json"))'
False

and if True -- only then use ...
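A minimal sketch of that check, assuming the script talks to the repository through datalad's AnnexRepo: only files that are actually annexed get a URL registered, while text files that went to git (e.g. .json under text2git) are left alone.

from datalad.support.annexrepo import AnnexRepo

def register_url(repo: AnnexRepo, relpath: str, url: str) -> None:
    # add_url_to_file only makes sense for annexed content; files committed
    # straight to git have no annex key to attach a URL to.
    if repo.is_under_annex(relpath):
        repo.add_url_to_file(relpath, url)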
@yarikoptic I believe what you're seeing in the logs is due to a combination of two things: fatal exceptions weren't logged at first, and I later reran the script on the partially-populated dataset, at which point it failed with the error you see. After adjusting the script to only call ...
Script for creating Datalad datasets from Dandiarchive backup
I will consider it done!
Continuation of #243, which was addressed by providing/running https://github.com/dandi/dandi-cli/blob/master/tools/instantiate-dandisets.py .
While the API is still cooking, I think it would already be beneficial to start providing proper DataLad datasets for all our dandisets.
Short term implementation
Build on top of instantiate-dandisets.py so that it would (option names etc. could be not exactly correct, just typing; a rough sketch of the flow follows the list):

- not just mkdir, but use ds = Dataset(path).create(cfg='text2git') (if there is none) to establish a datalad dataset and configure text files, such as dandiset.yaml, to go to git (everything is public anyways ATM, and .nwb's are binary)
- after cp-ing the file from the local assetstore, ds.repo.add it
- ds.repo.add_url_to_file with the URL pointing to the redirected-to location in the bucket, not the girder API one, so it still has a chance to work after girder is brought down while we still have the assetstore in its current shape (actually it would be even easier since the script already works based on the path in the asset store!)
- ds.save all the changes
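A rough sketch of that flow with datalad's Python API (the asset-listing shape, paths, and URL construction are placeholders, not the actual instantiate-dandisets.py code):

import shutil
from pathlib import Path
from datalad.api import Dataset

def populate_dandiset(ds_path: str, assets: list, assetstore: str) -> None:
    # assets: one dict per asset with hypothetical keys "path", "store_path", "url"
    ds = Dataset(ds_path)
    if not ds.is_installed():
        ds.create(cfg_proc="text2git")   # text files such as dandiset.yaml go to git
    for asset in assets:
        dest = Path(ds_path, asset["path"])
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(Path(assetstore, asset["store_path"]), dest)   # copy from local backup
        ds.repo.add(asset["path"])                                  # annex the file
        ds.repo.add_url_to_file(asset["path"], asset["url"])        # register the bucket URL
    ds.save(message="Update from dandiarchive backup")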
But then add logic for proper updates
The crawler (next section) has/uses https://github.com/datalad/datalad-crawler/blob/master/datalad_crawler/dbs/files.py#L89 to store/update information on each file's status, but I guess we might just avoid all of that if we store the full dump of the assets-listing metadata somewhere under .datalad/dandi/, so that the next time we run it we have the full list of files/assets from the previous update and can perform the "update" actions listed above.
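A sketch of what that could look like as a JSON dump; the file name under .datalad/dandi/ is made up here.

import json
from pathlib import Path

def dump_assets(ds_path: str, assets: list) -> None:
    # Keep the full asset listing from this run for comparison on the next update.
    record = Path(ds_path, ".datalad", "dandi", "assets.json")   # hypothetical file name
    record.parent.mkdir(parents=True, exist_ok=True)
    record.write_text(json.dumps(assets, indent=2))

def load_previous_assets(ds_path: str) -> list:
    record = Path(ds_path, ".datalad", "dandi", "assets.json")
    return json.loads(record.read_text()) if record.exists() else []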
Alternative (a bit more involved, not sure if worthwhile ATM)
It can be a "datalad crawler" pipeline (see https://github.com/datalad/datalad-crawler/), which would (internally) ...
But since datalad-crawler, although functioning etc., is yet another thing to figure out, and it still uses older DataLad interfaces, primarily talking directly via the GitRepo and AnnexRepo interfaces without really taking advantage of the higher-level ones, I think it might be a bigger undertaking.
Nevertheless, here are some pointers on that:
- the ... pipeline (branches incoming -> incoming-processed -> master) to support working with data from tarballs; it is "reused" in other pipelines, I believe
- datalad addurls -- some older thinking is at "Generic framework for crawling data providers with versions" (datalad/datalad-crawler#22). So we might get there -- it could be just a combination of two calls: get the assets list from dandi, tune it up just a bit, and pass it to the stock pipeline, which would take care of all the time checks and removal of now-obsolete files, and then pass the rest to addurls to do the rest (possibly splitting into subdatasets etc.) (see the sketch right after this list)
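To make the addurls idea concrete, a sketch under the assumption that the asset list is turned into a simple CSV first; the column names and the asset-dict keys are illustrative. datalad addurls then consumes the table plus a URL format and a filename format.

import csv
import datalad.api as dl

def addurls_from_assets(ds_path: str, assets: list, table: str = "assets.csv") -> None:
    # Write a small table that datalad addurls can consume.
    with open(table, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "url"])
        writer.writeheader()
        for asset in assets:                       # assets: dicts with "path" and "url"
            writer.writerow({"path": asset["path"], "url": asset["url"]})
    # Register every URL against its target file inside the dataset.
    dl.addurls(dataset=ds_path, urlfile=table, urlformat="{url}", filenameformat="{path}")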
Edit 1: publishing

Upon the end of the update, the script should publish all updated dandisets under the https://github.com/dandisets organization (for testing, maybe first create some throwaway organization on github, e.g. dandisets-testing or alike). There are datalad publish and push commands with slight differences; push is supposedly the cleaner interface, so probably use that one. There is also create-sibling-github to initiate a github repository to push to.
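A sketch of that publishing step with datalad's Python API; the sibling name, organization, and the existing= policy are assumptions, and create-sibling-github needs GitHub credentials/token configured.

from datalad.api import Dataset

def publish_dandiset(ds_path: str, reponame: str, organization: str = "dandisets-testing") -> None:
    ds = Dataset(ds_path)
    # Create (or reconfigure) the GitHub repository and register it as a sibling named "github".
    ds.create_sibling_github(
        reponame=reponame,
        github_organization=organization,    # e.g. the throwaway testing org first
        name="github",
        existing="reconfigure",
    )
    # Push branches and annex availability information to the sibling.
    ds.push(to="github")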
Longer term
Also, I hope we would eventually just start updating the datalad datasets straight within the API backend/workers, reacting to API calls, thus making the datalad datasets immediately reflect introduced changes. For that we would not need a dedicated script or a crawler.