Instantiate DataLad dandisets from backup #250
Comments
ds = Dataset(path)
if not ds.is_installed():
    ds.create(...)

I don't think we should bother adding them all into a "super-dataset" ATM, so we should not worry whether that path might be an uninstalled submodule or something like that.
@yarikoptic What is the proper way to set the git-annex backend to SHA256E? According to this page, it should be doable by just changing the annex backend setting. (Incidentally, is there a better way of getting a file's git-annex digest than splitting apart the key?)
backend -- apparently no easy way! kheh kheh -- see datalad/datalad#4978 for the issue and my "workaround".
digest -- you could use ...
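For illustration, a minimal sketch of the key-splitting approach being discussed (not a datalad API): a git-annex key such as SHA256E-s1234--<digest>.nwb carries the digest between the "--" separator and the first extension dot, so plain string operations suffice. The helper name is made up.

# Hypothetical helper: pull the digest out of a git-annex key.
# Key layout: <BACKEND>-s<size>--<digest>[.<extension>]
def digest_from_annex_key(key: str) -> str:
    _backend_and_size, rest = key.split("--", 1)   # e.g. "SHA256E-s1234", "0123abcd.nwb"
    return rest.split(".", 1)[0]                   # drop the extension kept by the *E backends

print(digest_from_annex_key("SHA256E-s1234--0123abcd.nwb"))  # -> 0123abcd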
forgot to add: if the backup has no corresponding key (we are after all backing up only once a day; I will increase the frequency) - the file will need to be downloaded first, e.g. using datalad's ...
@yarikoptic The Dandi asset metadata (as returned by the API) ...
Shouldn't matter. Probably use the ones from the top level.
Re mtime - it might be cool if you used the latest mtime of the files to be committed as the commit date? (One of the two - IIRC there is an author date and a committer date; you could check how datalad does it for its fake-dates mode.)
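A rough sketch of that idea (not necessarily what the script ends up doing): git accepts explicit dates through the GIT_AUTHOR_DATE and GIT_COMMITTER_DATE environment variables, so the newest mtime among the files being committed could be fed in at commit time. Paths and the commit message below are illustrative.

import os
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def commit_with_latest_mtime(repo: str, files: list, message: str) -> None:
    # Use the newest mtime among the files being committed as both git dates.
    latest = max(Path(repo, f).stat().st_mtime for f in files)
    stamp = datetime.fromtimestamp(latest, tz=timezone.utc).isoformat()
    env = dict(os.environ, GIT_AUTHOR_DATE=stamp, GIT_COMMITTER_DATE=stamp)
    subprocess.run(["git", "add", *files], cwd=repo, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo, env=env, check=True)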
@yarikoptic Incidentally, should the query parameters be stripped from URLs before associating them with files? They're rather long, and they seem superfluous.
I am sorry, I am not exactly following what query parameters you mean (an example of the URL would have helped). My guess is you followed my desire to store the URL on S3 as it was redirected to by girder, and all those query parameters are the expiration etc.? Then yeah -- we do not want that, and the URL (since I believe versioning was turned off, right @satra?) should be just a URL pointing to the asset store (so we could actually even just create it ourselves without asking girder, I guess, since we already interrogate girder about that to find the file in a local backup).
edit 1: we should not skip validation of the URL, or we would not have any guarantee that we used a correct, at least currently valid, URL.
re removing -- ...
@yarikoptic Here's a URL that the script is currently trying to associate with a file:
@yarikoptic - versioning has not been turned off yet. if we get full datalad datasets and can maintain that, it will be easier to turn off versioning from that information and remove older versions of things.
that is good (that it is not off yet). Then our "make a datalad dataset" script can become more robust and not puke if someone removes a file while we are "crawling" it, provided we use versioned URLs! So, @jwodder - let's proceed with versioned URLs!
but programmatically -- we have ...
alternatively - you could mint it in your code yourself, reusing the same bucket instance etc. - up to you
@yarikoptic Getting a versioned URL isn't working for me. Using the file in Dandiset 000027 as an example (file ID ...):
crap, "confirmed", needs to be filed/fixed... if you could just get custom boto (or boto3) code to get the bucket, get the relevant key and its most recent version -- that would be great. Or just proceed without versioning and I will provide a fix on the datalad end later on to just add that call.
edit: filed datalad/datalad#4984
@yarikoptic It seems that AWS credentials are required for getting any information about objects via boto3.
there should be some way for anonymous access. Anyways -- here is the fix for datalad: datalad/datalad#4985 (against ...)
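For reference, a sketch of the "mint it yourself with boto3" route mentioned above, using anonymous (unsigned) access; the bucket name, key, and URL shape are taken from examples elsewhere in this thread and are assumptions, not guarantees about the archive.

import boto3
from botocore import UNSIGNED
from botocore.client import Config
from urllib.parse import quote

# Anonymous S3 client -- no AWS credentials needed for a public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

bucket = "dandiarchive"                                                  # assumed bucket name
key = "girder-assetstore/dd/ac/ddac7448a1a847d5843078d1ec772dba"         # example key from this thread

# On a versioned bucket, head_object reports the VersionId of the latest version.
info = s3.head_object(Bucket=bucket, Key=key)
version_id = info.get("VersionId")

url = f"https://{bucket}.s3.amazonaws.com/{quote(key)}"
if version_id:
    url += f"?versionId={version_id}"
print(url)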
@yarikoptic GitHub siblings are being created in https://github.com/dandisets-testing, but for some reason their default branch is set to git-annex instead of master. How do I fix that?
i think it may be because github has moved away from master as default for new repositories: https://www.zdnet.com/article/github-to-replace-master-with-main-starting-next-month/
@yarikoptic After running for almost seven hours, the script failed while trying to add a URL to a file in 000026 with the error message:
The error message seems to be implying that the file in question was somehow added to the repository twice, but the script output does not support that interpretation. Also, the files in 000026 (and also 000025) are not arranged in the normal Dandiset manner; should the script do anything about this?
@yarikoptic It appears that adding a file to git-annex updates its mtime, so using mtimes to determine whether a file has been modified won't work — or should the script try forcing the mtime back to the "expected" value after adding each file?
@satra I don't believe that's it. If I try creating & pushing a repository with two branches, one named "master", one named "test", it's whichever is pushed first that ends up as the default branch.
mtime: oh well, git does not store mtimes anyway, so for that purpose ...
edit: maybe forget about mtime, since you rely on the hash to verify whether a re-download is needed... any recently uploaded file should have a hash, so mtime would only be needed for old ones, to allow downloading them once without a known hash.
nope... the script should not be file-layout specific in any way (besides that there should be no dandiset.yaml file on S3 / in the list of assets)
re 000026 and addurl: where are those (on drogon?)? (I thought to have a look into the git history.) Ideally the log from the script (for each dandiset) should be dumped into a file for introspection. Looking at the code I have not spotted anything obvious, and expect the log to tell more. I thought it might be that the file was "broken" in / copied from the assetstore, but annex issues another message then:

$> datalad create /tmp/testdsss
[INFO ] Creating a new annex repo at /tmp/testdsss
[INFO ] Scanning for unlocked files (this may take some time)
create(ok): /tmp/testdsss (dataset)
(dev3) 1 27849.....................................:Thu 08 Oct 2020 09:48:26 AM EDT:.
lena:/tmp
$> cd testdsss
(dev3) 1 27851.....................................:Thu 08 Oct 2020 09:48:32 AM EDT:.
(git-annex)lena:/tmp/testdsss[master]
$> touch ddac7448a1a847d5843078d1ec772dba\?versionId=8j4PrdzTK9zmi_8yewcuLqOwHR2qyc19
(dev3) 1 27852.....................................:Thu 08 Oct 2020 09:48:40 AM EDT:.
(git-annex)lena:/tmp/testdsss[master]
$> git annex add ddac7448a1a847d5843078d1ec772dba\?versionId=8j4PrdzTK9zmi_8yewcuLqOwHR2qyc19
add ddac7448a1a847d5843078d1ec772dba?versionId=8j4PrdzTK9zmi_8yewcuLqOwHR2qyc19
ok
(recording state in git...)
(dev3) 1 27853.....................................:Thu 08 Oct 2020 09:48:44 AM EDT:.
(git-annex)lena:/tmp/testdsss[master]
$> git annex addurl --file ddac7448a1a847d5843078d1ec772dba\?versionId=8j4PrdzTK9zmi_8yewcuLqOwHR2qyc19 'https://dandiarchive.s3.amazonaws.com/girder-assetstore/dd/ac/ddac7448a1a847d5843078d1ec772dba?versionId=8j4PrdzTK9zmi_8yewcuLqOwHR2qyc19'
addurl https://dandiarchive.s3.amazonaws.com/girder-assetstore/dd/ac/ddac7448a1a847d5843078d1ec772dba?versionId=8j4PrdzTK9zmi_8yewcuLqOwHR2qyc19
while adding a new url to an already annexed file, url does not have expected file size (use --relaxed to bypass this check) https://dandiarchive.s3.amazonaws.com/girder-assetstore/dd/ac/ddac7448a1a847d5843078d1ec772dba?versionId=8j4PrdzTK9zmi_8yewcuLqOwHR2qyc19
failed
git-annex: addurl: 1 failed
you could also enable/dump a detailed log from datalad itself (...)
re default ...
re github: happens I had to do the same, so reproduced, ...
@yarikoptic The datasets are in /mnt/backup/dandi/dandiarchive-replica on drogon.
@yarikoptic To be clear, regarding mtimes and hashes, your current recommendation is to only use hashes, and if a file doesn't have a hash in Dandiarchive, only copy it if it's not already in the dataset?
yes, and fail if it is already in the dataset and the size differs (should not happen; but a check on size at least will provide some safety blanket)... (unless you do want to keep a record of mtimes in the dataset somewhere)
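A rough sketch of that decision logic as described (the helper and the "already in dataset" check are illustrative, not the actual script):

from pathlib import Path

def decide_action(digest, dest: Path, expected_size: int) -> str:
    # Hypothetical per-asset decision for the backup-driven update.
    if digest is not None:
        return "addurl"                  # hash known: rely on it, no unconditional copy
    if not dest.exists():
        return "copy"                    # no hash and not yet in the dataset: copy once
    if dest.stat().st_size != expected_size:
        raise RuntimeError(f"{dest}: size differs from the backup -- refusing to continue")
    return "skip"                        # already present with matching size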
I do not see the problematic 000026 there
@yarikoptic Oh, sorry, I deleted it because I planned on rerunning the script to see if the error would happen again, but then the script started copying everything again because of the mtime issue, so I killed it.
@yarikoptic I reran the script for just 000026, and it failed with the same error. I'm leaving the directory there this time so you can inspect it.
I've finished running the script for all Dandisets. The only problems left should be the default branch issue and whatever's wrong with 000026.
thanks! looking at 26, I do not see ...

makes sense, since we instantiate datasets with text2git:

$> python -c 'from datalad.support.annexrepo import AnnexRepo; r=AnnexRepo(".");print(r.is_under_annex("rawdata/sub-I46/ses-MRI/anat/sub-I46_echo-1_fa-1_VFA.json"))'
False

and if True -- only then use ...
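A minimal sketch of that check, assuming the script talks to the repository through datalad's AnnexRepo: only files that are actually annexed get a URL registered, while text files that went to git (e.g. .json under text2git) are left alone.

from datalad.support.annexrepo import AnnexRepo

def register_url(repo: AnnexRepo, relpath: str, url: str) -> None:
    # add_url_to_file only makes sense for annexed content; files committed
    # straight to git have no annex key to attach a URL to.
    if repo.is_under_annex(relpath):
        repo.add_url_to_file(relpath, url)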
@yarikoptic I believe what you're seeing in the logs is due to a combination of two things: fatal exceptions weren't logged at first, and I later reran the script on the partially-populated dataset, at which point it failed with the error you see. After adjusting the script to only call ...
Script for creating Datalad datasets from Dandiarchive backup
I will consider it done!
Continuation of #243, which was addressed by providing/running https://github.com/dandi/dandi-cli/blob/master/tools/instantiate-dandisets.py .
While the API is still cooking, I think it would already be beneficial to start providing proper DataLad datasets for all our dandisets.
Short term implementation
Build on top of instantiate-dandisets.py so that it would (option names etc. could be not exactly correct, just typing; a rough sketch of the flow follows the list):

- not just mkdir, but use ds = Dataset(path).create(cfg='text2git') (if there is none) to establish a datalad dataset and configure text files, such as dandiset.yaml, to go to git (everything is public anyways ATM, and .nwb's are binary)
- after cp-ing the file from the local assetstore, ds.repo.add it
- ds.repo.add_url_to_file with the URL pointing to the redirected-to location in the bucket, not the girder API one, so it still has a chance to work after girder is brought down while we still have the assetstore in its current shape (actually it would be even easier since the script already works based on the path in the asset store!)
- ds.save all the changes
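A rough sketch of that flow with datalad's Python API (the asset-listing shape, paths, and URL construction are placeholders, not the actual instantiate-dandisets.py code):

import shutil
from pathlib import Path
from datalad.api import Dataset

def populate_dandiset(ds_path: str, assets: list, assetstore: str) -> None:
    # assets: one dict per asset with hypothetical keys "path", "store_path", "url"
    ds = Dataset(ds_path)
    if not ds.is_installed():
        ds.create(cfg_proc="text2git")   # text files such as dandiset.yaml go to git
    for asset in assets:
        dest = Path(ds_path, asset["path"])
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(Path(assetstore, asset["store_path"]), dest)   # copy from local backup
        ds.repo.add(asset["path"])                                  # annex the file
        ds.repo.add_url_to_file(asset["path"], asset["url"])        # register the bucket URL
    ds.save(message="Update from dandiarchive backup")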
But then add logic for proper updates
The crawler (next section) has/uses https://github.com/datalad/datalad-crawler/blob/master/datalad_crawler/dbs/files.py#L89 to store/update information on each file's status, but I guess we might just avoid all of that if we store the full dump of the assets-listing metadata somewhere under .datalad/dandi/, so that the next time we run it we have the full list of files/assets from the previous update and can perform the "update" actions listed above.
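A sketch of what that could look like as a JSON dump; the file name under .datalad/dandi/ is made up here.

import json
from pathlib import Path

def dump_assets(ds_path: str, assets: list) -> None:
    # Keep the full asset listing from this run for comparison on the next update.
    record = Path(ds_path, ".datalad", "dandi", "assets.json")   # hypothetical file name
    record.parent.mkdir(parents=True, exist_ok=True)
    record.write_text(json.dumps(assets, indent=2))

def load_previous_assets(ds_path: str) -> list:
    record = Path(ds_path, ".datalad", "dandi", "assets.json")
    return json.loads(record.read_text()) if record.exists() else []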
Alternative (a bit more involved, not sure if worthwhile ATM)
It can be a "datalad crawler" pipeline (see https://github.com/datalad/datalad-crawler/), which would (internally) ...
But since datalad-crawler, although functioning etc., is yet another thing to figure out, and it still uses older DataLad interfaces, primarily talking directly via the GitRepo and AnnexRepo interfaces without really taking advantage of the higher-level ones, I think it might be a bigger undertaking.
Nevertheless, here are some pointers on that:
- the ... pipeline (branches incoming -> incoming-processed -> master) to support working with data from tarballs; it is "reused" in other pipelines, I believe
- datalad addurls -- some older thinking is at "Generic framework for crawling data providers with versions" (datalad/datalad-crawler#22). So we might get there -- it could be just a combination of two calls: get the assets list from dandi, tune it up just a bit, and pass it to the stock pipeline, which would take care of all the time checks and removal of now-obsolete files, and then pass the rest to addurls to do the rest (possibly splitting into subdatasets etc.) (see the sketch right after this list)
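To make the addurls idea concrete, a sketch under the assumption that the asset list is turned into a simple CSV first; the column names and the asset-dict keys are illustrative. datalad addurls then consumes the table plus a URL format and a filename format.

import csv
import datalad.api as dl

def addurls_from_assets(ds_path: str, assets: list, table: str = "assets.csv") -> None:
    # Write a small table that datalad addurls can consume.
    with open(table, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "url"])
        writer.writeheader()
        for asset in assets:                       # assets: dicts with "path" and "url"
            writer.writerow({"path": asset["path"], "url": asset["url"]})
    # Register every URL against its target file inside the dataset.
    dl.addurls(dataset=ds_path, urlfile=table, urlformat="{url}", filenameformat="{path}")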
Edit 1: publishing

Upon the end of the update, the script should publish all updated dandisets under the https://github.com/dandisets organization (for testing, maybe first create some throwaway organization on github, e.g. dandisets-testing or alike). There are datalad publish and push commands with slight differences; push is supposedly the cleaner interface, so probably use that one. There is also create-sibling-github to initiate a github repository to push to.
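A sketch of that publishing step with datalad's Python API; the sibling name, organization, and the existing= policy are assumptions, and create-sibling-github needs GitHub credentials/token configured.

from datalad.api import Dataset

def publish_dandiset(ds_path: str, reponame: str, organization: str = "dandisets-testing") -> None:
    ds = Dataset(ds_path)
    # Create (or reconfigure) the GitHub repository and register it as a sibling named "github".
    ds.create_sibling_github(
        reponame=reponame,
        github_organization=organization,    # e.g. the throwaway testing org first
        name="github",
        existing="reconfigure",
    )
    # Push branches and annex availability information to the sibling.
    ds.push(to="github")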
Longer term
Also, I hope we would eventually just start updating the datalad datasets straight within the API backend/workers, reacting to API calls, thus making the datalad datasets immediately reflect introduced changes. For that we would not need a dedicated script or a crawler.