Generic framework for crawling data providers with versions #22
I think an additional item to the list is the handling of subdatasets, so dumping some "thinking out loud" in here.

Subdatasets

ATM crawlers such as openfmri, crcns, etc. rely on a dedicated function, specified to be used to return a dedicated pipeline for the top-level superdataset, which would create subdatasets while populating them with per-subdataset crawl configuration (a rough sketch of that role follows).
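A minimal, purely hypothetical sketch of what such a superdataset pipeline function amounts to (the function and callback names below are made up, not the actual crawler API):

```python
# Hypothetical sketch only -- names are invented, not datalad-crawler API.
def superdataset_pipeline(subdataset_ids, create_subdataset, write_crawl_config):
    """Create each subdataset and store its own crawl configuration in it."""
    for ds_id in subdataset_ids:
        create_subdataset(ds_id)  # e.g. something akin to `datalad create -d . <ds_id>`
        # per-subdataset crawl configuration, so the subdataset can be recrawled on its own
        write_crawl_config(ds_id, {"template": "openfmri", "dataset": ds_id})
```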
In general it seems that it would be nice to be able to specify more flexibly which paths should become subdatasets.

E.g. for HCP, if we decide to split into subdatasets at the subject level, and then at all subject subdirectories which do not match release-notes:

```yaml
subdatasets:
- path_regex: "^[0-9]{6}"
  crawler_config: inherit
  procedures:
  - cfg_text2git
- full_path_regex: ".*/[0-9]{6}/(?!release-notes)"
  crawler_config: inherit
  procedures:
  - hcp_subject_data_dataset
  - name: cfg_metadatatypes
    args:
    - dicom
    - nifti1
    - xmp
```

to be specified at the top-level dataset, so it could be inherited and used in subdatasets as is (but maybe that would be undesired).

With such a setup we could also adjust for preprocessed per-task folders in HCP to be subdatasets with something like

```yaml
- full_path_regex: ".*/[0-9]{6}/[^/]+/Results/[^/]+"
  ...
```

(potentially just mixing it into the regex for the parent dataset)

A somewhat alternative specification organization could be to orient it around "paths", with the default action being "addurl" (what it is now), while allowing for others ("subdataset", etc.):

```yaml
paths:
- path_regex: "^[0-9]{6}$"
  action: subdataset
  procedures:
  - inherit_crawler_config
  - cfg_text2git
- full_path_regex: ".*/[0-9]{6}/(?!release-notes)$"
  action: subdataset
  procedures:
  - inherit_crawler_config
  - hcp_subject_data_dataset
  - name: cfg_metadatatypes
    args:
    - dicom
    - nifti1
    - xmp
```

so we could use the same specification to provide alternative actions, such as e.g. "skip" (which some pipelines currently allow for):

```yaml
paths:
- path: .xdlm$
  action: skip
```
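As a toy illustration of how such a "paths" specification with actions could be dispatched (a hypothetical sketch, not an existing crawler API; the spec entries just mirror the YAML above):

```python
import re

# Hypothetical dispatch over the "paths" specification sketched above;
# only the entry keys (path_regex, full_path_regex, path, action) come from it.
PATHS_SPEC = [
    {"path_regex": r"^[0-9]{6}$", "action": "subdataset"},
    {"full_path_regex": r".*/[0-9]{6}/(?!release-notes)$", "action": "subdataset"},
    {"path": r"\.xdlm$", "action": "skip"},
]

def action_for(path, full_path, spec=PATHS_SPEC, default="addurl"):
    """Return the action for a crawled path; the first matching entry wins."""
    for entry in spec:
        if "path_regex" in entry and re.search(entry["path_regex"], path):
            return entry["action"]
        if "full_path_regex" in entry and re.search(entry["full_path_regex"], full_path):
            return entry["action"]
        if "path" in entry and re.search(entry["path"], path):
            return entry["action"]
    return default

# e.g. action_for("100307", "/HCP/100307") -> "subdataset"
# and  action_for("foo.xdlm", "/HCP/100307/foo.xdlm") -> "skip"
```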
attn @mih et al (@kyleam @bpoldrack @TobiasKadelka) who might be interested: while trying to come up with a temporary hack for the current [...]

I am leaning toward 1.

Benefits of 1 is the ability to later on take a collection of subdatasets and recrawl them independently -- something yet to be attempted/used in the wild. With the current [...]

Cons: maybe there is some overall crawler pipelining solution possible (to instantiate crawlers within subdatasets wherever a path goes into a subdataset, and somehow feed them with those records), but it would fall into the same trap as outlined below -- crawling individual subdatasets would potentially be different from crawling from the superdataset.

2. Going forward, I kinda like this way better, since it would [...]

Cons:

2 with config for 1 (mneh). We could still populate crawler configuration within subdatasets, but they would lack the "versioning" information (although maybe there is a workaround via storing versioning information updates in each subdataset along the path upon each new file being added/updated). Even if we populate all the pieces needed for recrawling, since it would not actually be the crawling configuration used originally, it would be fragile etc., and re-crawling subdatasets individually would probably fail and/or result in a full recrawl of the subdataset.

2 would still allow for 1. It should still be possible, where really desired, to not recurse and to stop at the subdirectory/subdataset boundary (with the simple_s3 pipeline I mean).
FTR: a .csv file with two sample urls: http://www.onerussian.com/tmp/hcp-sample-urls.csv for an invocation [...]

@mih: when producing the table, add "last-modified" and then include the versionId into the url. That would later help to produce a "diff", so an "update" could be done for addurls by just providing a table with the new/changed entries.
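A minimal sketch of that diff step (an assumed helper, not part of datalad; the column names "filename" and "last-modified" are assumptions about the table layout):

```python
import csv

def changed_entries(old_csv, new_csv, key="filename", stamp="last-modified"):
    """Yield rows of new_csv that are new or whose 'last-modified' changed.

    Assumes both tables carry at least the `key` and `stamp` columns; the
    resulting rows can be written out and fed to `datalad addurls` as an update.
    """
    with open(old_csv, newline="") as f:
        old = {row[key]: row for row in csv.DictReader(f)}
    with open(new_csv, newline="") as f:
        for row in csv.DictReader(f):
            prev = old.get(row[key])
            if prev is None or prev[stamp] != row[stamp]:
                yield row  # new or modified since the previous crawl
```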
There are two aspects:
Versioning
ATM some crawling pipelines do care about/version datasets, e.g. ///openfmri/ds000001. To make it happen, that crawler pipeline relies on [...] the incoming branch -- a very magical function which even supports "overlays" (e.g. whenever only one of the files for a new .patch version is provided, while the others should be taken from the previous major.minor version(s)).
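A rough sketch of that "overlay" behaviour (assumed logic for illustration, not the actual crawler function):

```python
def overlay_versions(files_per_version):
    """files_per_version: {version: {filename: url}}, with versions sortable.

    For each version, files missing from it are inherited from earlier
    versions, so every version lists its complete content.
    """
    merged, result = {}, {}
    for version in sorted(files_per_version):
        merged.update(files_per_version[version])  # newer files override older ones
        result[version] = dict(merged)
    return result

# e.g. if "1.0.1" ships only one patched file, its entry still lists every
# file, with the rest inherited from "1.0.0"
```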
Pipelines reuse

The simple_with_archives pipeline is already reused one way or another in other pipelines (simple_s3, stanford_lib) to provide a multi-branch setup for implementing archive processing (extraction, cleanup, etc.) via multiple branches (a rough sketch of the flow follows the list):
- incoming -- for plain, as-is crawled materials
- incoming-processed -- only automatically processed/merged data, no manual changes done in that branch (e.g. content extracted from archives). To make that happen it merges the incoming branch with "theirs", and then performs processing (extraction) and commits.
- master -- typically the incoming-processed branch state (merged normally) with whatever else is desired to be done (e.g. metadata aggregation, manual fixes, etc.)
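A sketch of that flow in plain git calls (the merge option and the extraction step are assumptions; only the branch names come from the list above):

```python
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

def process_new_crawl(extract_archives):
    """Assumed flow: merge fresh crawl into incoming-processed, extract, then merge to master."""
    run("git", "checkout", "incoming-processed")
    run("git", "merge", "-X", "theirs", "incoming")   # prefer freshly crawled content on conflicts
    extract_archives()                                # e.g. extract tarballs, cleanup
    run("git", "add", "-A")
    run("git", "commit", "-m", "Process crawled content")
    run("git", "checkout", "master")
    run("git", "merge", "incoming-processed")         # normal merge into master
```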
Datalad "core" now has addurls which provides quite an extended/flexible implementation to populate a dataset (including git annex metadata) from a set of records, e.g. as typically provided in .tsv or .json file. But it doesn't provide crawler's functionality of being able to monitor remote urls (or those entire records) for changes
So in principle, based on those experiences, and having additional desires in mind (being able to run multiple pipelines in the same dataset, maybe on different branches), it would be worth producing some "ultimate" pipeline which would rely on obtaining records with source urls, versions, etc., and perform all the necessary steps to handle versioning. Specialized pipelines would then only provide the provider-specific logic feeding that pipeline with those records (see the toy sketch below).
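A toy sketch of that division of labour (all names hypothetical): a specialized provider only yields records, and a shared pipeline consumes them:

```python
def sample_provider_records():
    """Provider-specific part: yield records describing remote files (toy data)."""
    yield {"url": "http://example.com/ds/file1.nii.gz",
           "filename": "file1.nii.gz",
           "version": "1.0.0",
           "last-modified": "2019-01-01T00:00:00Z"}

def generic_versioned_pipeline(records, add_url):
    """Shared part: order records by version and hand each to an addurl-like callback.

    Real handling (branches, overlays, updates) would live here, once, for all providers.
    """
    for rec in sorted(records, key=lambda r: r["version"]):
        add_url(rec["url"], rec["filename"])

generic_versioned_pipeline(sample_provider_records(),
                           lambda url, fname: print(url, "->", fname))
```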
It might be worth approaching this after (or while working on) a solution for #20, which would provide yet another "versioned" provider, to see how such a pipeline could generalize across the openfmri/s3/figshare cases.