Support ingestion of new metadata source format #483
#482 will provide the new source format specification. Then we need a new set of tools/scripts to allow ingestion of metadata deposited in said format, with output in a format compliant with the `datalad-catalog` schema, i.e. ready to be `datalad catalog-add`ed. The new specification will allow multiple files/formats of metadata per dataset-version, and the tools need to account for this.
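For orientation, here is a rough sketch of what a catalog-ready record could look like and how it might be added. The field names and the `catalog-add` invocation follow my reading of the current `datalad-catalog` schema and CLI and should be double-checked against both:

```python
# A minimal sketch of a catalog-ready dataset record, written as a single JSON
# line so it could be passed to `datalad catalog-add`. Field names are based on
# my reading of the datalad-catalog schema and should be verified against it.
import json

record = {
    "type": "dataset",
    "dataset_id": "deabeb9b-7a37-4062-a1e0-8fcef7909609",  # example UUID
    "dataset_version": "0321dbde969d2f5d6b533e35b5c5c51ac0b15758",  # example commit-ish
    "name": "My example dataset",
    "metadata_sources": {
        "sources": [
            {"source_name": "datacite_gin", "source_version": "0.1.0"}
        ]
    },
}

with open("metadata.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")

# then, roughly (check `datalad catalog-add --help` for the exact options):
#   datalad catalog-add -c path/to/catalog -m metadata.jsonl
```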
`datalad-catalog`, the SFB1451 catalog, and the ABCD-J catalog all have existing functionality that in some way contributes to achieving a similar goal. It is worth investigating these to see which parts can be reused.

Extractors

An extractor understands a particular metadata format (e.g. `datacite.yml`), reads such a metadata file, extracts the information, and outputs this (usually) in JSON format, often via `datalad-metalad`. Existing examples include:

- `datalad-metalad` (via `datalad meta-extract`): https://github.com/datalad/datalad-metalad/tree/master/datalad_metalad/extractors; most often used are `metalad-studyminimeta` and `metalad-core` (dataset and file-level); see the usage sketch after this list
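A rough usage sketch for extraction, assuming `datalad` and `datalad-metalad` are installed and a dataset exists at the given path; the exact CLI options should be verified against `datalad meta-extract --help`:

```python
# Sketch: run a metalad extractor via the CLI and capture its JSON output.
import json
import subprocess

result = subprocess.run(
    ["datalad", "meta-extract", "-d", "path/to/dataset", "metalad_core"],
    capture_output=True, text=True, check=True,
)

# meta-extract emits JSON metadata records on stdout (assumed one per line here)
records = [json.loads(line) for line in result.stdout.splitlines() if line.strip()]
print(f"extracted {len(records)} record(s)")
```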
Translators

Translators take `datalad-metalad` output and translate it into a `datalad-catalog`-schema compatible format. They inherit from a base translator class and, for the purposes of the `datalad catalog-translate` method, use a common procedure for matching a specific translator to a specific metadata record. Some translators use `jq` bindings to do the translation; other translators use pure python. Examples (a translator sketch follows after this list):

- `datalad-catalog`: https://github.com/datalad/datalad-catalog/tree/main/datalad_catalog/translators (including translators for `core`, `studyminimeta`, `bids_dataset`, `datacite_gin`, all or most based on jq)
- … (`datalad-catalog`)
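For illustration, a pure-python translator could look roughly like the following. This is a sketch only: the real base class, method names, and matching procedure live in `datalad-catalog` and should be looked up there; the names below (`match`, `translate`, `my_format`) are assumptions made for the example.

```python
# Illustrative sketch of a translator; not the actual datalad-catalog base-class API.
class MyFormatTranslator:
    """Translate metalad records produced by a hypothetical 'my_format' extractor."""

    def match(self, source_name: str, source_version: str) -> bool:
        # catalog-translate-style matching: claim only records from the
        # extractor this translator understands
        return source_name == "my_format" and source_version.startswith("0.")

    def translate(self, metadata_record: dict) -> dict:
        # map extractor output onto catalog-schema fields
        extracted = metadata_record.get("extracted_metadata", {})
        return {
            "type": metadata_record.get("type", "dataset"),
            "dataset_id": metadata_record["dataset_id"],
            "dataset_version": metadata_record["dataset_version"],
            "name": extracted.get("title", ""),
        }
```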
Standalone extraction+translation scripts

Some standalone scripts have been created to be independent of both `datalad-metalad` extraction functionality and the `datalad-catalog` translation functionality. These are used in the ABCD-J catalog pipeline, specifically: …

`datalad-catalog` helpers

`datalad-catalog` ships with functions that help to construct catalog-ready records. They are located in https://github.com/datalad/datalad-catalog/blob/main/datalad_catalog/schema_utils.py, and are used by both the SFB1451 and ABCD-J catalog pipelines when constructing/translating records to be added to a catalog.
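As a rough illustration of how such helpers get used when building a record; the helper name `get_metadata_item` and its keyword arguments are my recollection of `schema_utils.py` and may differ, so check the module for the actual names and signatures:

```python
# Sketch: constructing a catalog-ready record via datalad-catalog's schema helpers.
# NOTE: the helper name and its arguments are assumptions; verify against
# datalad_catalog/schema_utils.py before use.
from datalad_catalog import schema_utils

record = schema_utils.get_metadata_item(
    item_type="dataset",
    dataset_id="deabeb9b-7a37-4062-a1e0-8fcef7909609",
    dataset_version="0321dbde969d2f5d6b533e35b5c5c51ac0b15758",
    source_name="my_hypothetical_reader",  # hypothetical metadata source name
    source_version="0.1.0",
)
record["name"] = "My example dataset"
```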
From my POV:

We should have a set of tools that make the creation of catalog-ready records from "raw" metadata formats as simple as possible. The concept of a "reader" was conceived in previous discussions, with the idea that there would be a reader for any metadata format deposited per dataset-version in the source specification. The reader would do all that is necessary to get from the ingestion state to the catalog-ready state; it would be supported by helper functionality living inside of `datalad-catalog`, and would not depend on external packages/extensions. Pretty much what the abovementioned standalone scripts do, but perhaps with some wrapper functionality that gives them a common interface (a rough sketch follows below)?

@mih @mslw curious to hear your thoughts here. (@mslw please also add updates if I missed or misrepresented any relevant functionality from the SFB1451 pipeline)
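To make the "common interface" idea a bit more concrete, here is a purely hypothetical sketch of what a reader protocol could look like; none of these names exist in `datalad-catalog` today:

```python
# Hypothetical sketch of a "reader" interface; names are illustrative only.
from pathlib import Path
from typing import Protocol


class Reader(Protocol):
    """One reader per metadata format allowed by the source specification."""

    def matches(self, path: Path) -> bool:
        """Decide whether this reader handles the given metadata file."""
        ...

    def read(self, path: Path) -> dict:
        """Return a catalog-schema-compliant record for the given file."""
        ...


class DataciteYmlReader:
    """Example: a reader for datacite.yml files."""

    def matches(self, path: Path) -> bool:
        return path.name == "datacite.yml"

    def read(self, path: Path) -> dict:
        # parse the YAML and map its fields onto the catalog schema,
        # possibly reusing the schema_utils helpers mentioned above
        raise NotImplementedError
```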