
Support ingestion of new metadata source format #483

Open
jsheunis opened this issue Jul 5, 2024 · 2 comments

@jsheunis
Member

jsheunis commented Jul 5, 2024

#482 will provide the new source format specification. We then need a new set of tools/scripts to allow ingestion of metadata deposited in said format, and output it in a format compliant with the datalad-catalog schema, i.e. ready to be datalad catalog-added. The new specification will allow multiple metadata files/formats per dataset-version, and the tools need to account for this.

datalad-catalog, the SFB1451 catalog, and the ABCD-J catalog all have existing functionality that in some way contributes to achieving a similar goal. It is worth investigating these to see which parts can be reused.

Extractors

An extractor understands a particular metadata format (e.g. datacite.yml), reads such a metadata file, extracts the information, and outputs it (usually) in JSON format, often via datalad-metalad.
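To illustrate the role of an extractor, here is a minimal sketch that maps a parsed datacite.yml structure to a flat JSON record. The function name and the field mapping are assumptions for illustration, not the actual datalad-metalad extractor API, and YAML parsing (e.g. with PyYAML) is assumed to have happened beforehand:

```python
import json


def extract_datacite(datacite: dict) -> str:
    """Hypothetical extractor step: map parsed datacite.yml
    content to a flat JSON metadata record."""
    record = {
        "type": "dataset",
        "name": datacite.get("title"),
        "description": datacite.get("description"),
        "authors": [
            {"name": a.get("name")} for a in datacite.get("authors", [])
        ],
    }
    return json.dumps(record)
```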

Existing examples include:

Translators

Translators take datalad-metalad output and translate it into a datalad-catalog-schema-compatible format. They inherit from a base translator class and, for the purposes of the datalad catalog-translate method, use a common procedure for matching a specific translator to a specific metadata record. Some translators use jq bindings to do the translation; others use pure Python. Examples:
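The base-class/match pattern described above can be sketched roughly as follows. Class names, the `extractor_name` key, and the record layout are assumptions for illustration, not datalad-catalog's actual API:

```python
class Translator:
    """Base class: subclasses report whether they can handle a record."""

    def match(self, record: dict) -> bool:
        raise NotImplementedError

    def translate(self, record: dict) -> dict:
        raise NotImplementedError


class DataciteTranslator(Translator):
    def match(self, record: dict) -> bool:
        # e.g. match on the extractor name stored in the metalad record
        return record.get("extractor_name") == "datacite"

    def translate(self, record: dict) -> dict:
        meta = record["extracted_metadata"]
        return {"type": "dataset", "name": meta.get("title")}


# registered translators; the matching routine runs over all of them
TRANSLATORS = [DataciteTranslator()]


def get_translator(record: dict) -> Translator:
    """Find the single translator whose match() accepts the record."""
    matches = [t for t in TRANSLATORS if t.match(record)]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one matching translator, got {len(matches)}")
    return matches[0]
```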

Standalone extraction+translation scripts

Some standalone scripts have been created that are independent of both datalad-metalad's extraction functionality and datalad-catalog's translation functionality. These are used in the ABCD-J catalog pipeline, specifically:

datalad-catalog helpers

datalad-catalog ships with functions that help to construct catalog-ready records. They are located in https://github.com/datalad/datalad-catalog/blob/main/datalad_catalog/schema_utils.py, and are used by both the SFB1451 and ABCD-J catalog pipelines when constructing/translating records to be added to a catalog.


From my POV:
We should have a set of tools that make the creation of catalog-ready records from "raw" metadata formats as simple as possible. The concept of a "reader" was conceived in previous discussions, with the idea that there would be a reader for each metadata format deposited per dataset-version in the source specification. The reader would do everything necessary to get from the ingested state to the catalog-ready state; it would be supported by helper functionality living inside datalad-catalog, and would not depend on external packages/extensions. Pretty much what the abovementioned standalone scripts do, but perhaps with some wrapper functionality that provides a common interface?

@mih @mslw curious to hear your thoughts here. (@mslw please also add updates if I missed or misrepresented any relevant functionality from the SFB1451 pipeline)

@jsheunis
Member Author

My current idea is to create a new datalad-catalog module focused on collection ingestion. Likely a Collection class with methods for:

  • validation:
    • check if the collection has the expected tree structure (contains config and records directories)
    • check if the collection has a catalog-level config file
    • check if the collection has at least one dataset with at least one version with at least one metadata file
    • (? check if the metadata files are all known to existing readers)
  • ingestion:
    • for each metadata file in the collection, use the correct reader to transform the metadata file into a datalad-catalog-compatible record, and output it to stdout.
  • utilities:
    • some helper functions for reading from standard file formats (txt, json, yml, xlsx, tsv, etc)
    • possibly import and expose the utilities already available in schema_utils.py to help construct datalad-catalog-compatible records
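The validation and ingestion methods listed above could look roughly like the following sketch. Everything here (class name, the `config`/`records` layout, the reader-callable signature) is hypothetical and only mirrors the bullet points; nothing is an existing datalad-catalog API:

```python
from pathlib import Path


class Collection:
    """Hypothetical wrapper around a metadata collection directory."""

    def __init__(self, root: str):
        self.root = Path(root)

    def validate(self) -> list[str]:
        """Return a list of validation problems (empty means valid)."""
        problems = []
        # expected tree structure: config/ and records/ directories
        for d in ("config", "records"):
            if not (self.root / d).is_dir():
                problems.append(f"missing directory: {d}")
        # a catalog-level config file must exist
        if not list((self.root / "config").glob("*")):
            problems.append("no catalog-level config file")
        # at least one dataset / version / metadata file under records/
        if not list((self.root / "records").glob("*/*/*")):
            problems.append("no dataset version with a metadata file")
        return problems

    def ingest(self, readers: dict):
        """Yield a catalog-ready record for every metadata file,
        using a mapping from file name to reader callable."""
        for f in sorted((self.root / "records").glob("*/*/*")):
            reader = readers.get(f.name)
            if reader is not None:
                yield reader(f)
```

For ingestion, the records could be printed to stdout one JSON line at a time, as proposed above.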

For matching specific metadata formats to specific "readers", I am not sure yet. One option is very simple: provide a dictionary mapping <format-id> to reader when ingestion is invoked. Another is more automated but more involved: follow the same approach as the translation functionality currently implemented in datalad-catalog, where a base translation class has a "match" method that is overridden by specific translator implementations inheriting from the base class; translators are registered to the application, and a matching routine then finds the correct translator for a given record.
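The simpler dictionary-based option could be as small as this sketch. The reader functions and the choice of the file name as `<format-id>` are illustrative assumptions:

```python
import os


def datacite_reader(path):
    # hypothetical reader: would parse the file and return a catalog record
    return {"source_format": "datacite", "file": path}


# explicit mapping from format identifier (here: file name) to reader,
# supplied when ingestion is invoked
READERS = {
    "datacite.yml": datacite_reader,
}


def read_metadata(path: str) -> dict:
    """Dispatch a metadata file to its registered reader."""
    name = os.path.basename(path)
    try:
        return READERS[name](path)
    except KeyError:
        raise ValueError(f"no reader registered for {name}") from None
```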

@jsheunis
Member Author

Short comment: a starting point for matching a reader to a metadata source format could be an exact match on the file/directory name, e.g. a dataciteReader would be used if (and only if) the metadata source is a file named datacite.yml. This is simple enough. One thing that still bothers me about this approach is what to do with incremental changes to readers. E.g. if a new dataciteReader_v2 handles the unchanged incoming metadata slightly differently, should a new metadata file name be provided to allow a new and separate match (e.g. datacite_reader-v2.yml)?
