
Support ingestion of new metadata source format #483

Open
jsheunis opened this issue Jul 5, 2024 · 2 comments

@jsheunis
Member

jsheunis commented Jul 5, 2024

#482 will provide the new source format specification. We then need a new set of tools/scripts to allow ingestion of metadata deposited in said format, and output it in a format compliant with the datalad-catalog schema, i.e. ready to be datalad catalog-added. The new specification will allow multiple metadata files/formats per dataset-version, and the tools need to account for this.

datalad-catalog, the SFB1451 catalog, and the ABCD-J catalog all have existing functionality that in some way contributes to achieving a similar goal. It is worth investigating these to see which parts can be reused.

Extractors

An extractor understands a particular metadata format (e.g. datacite.yml), reads such a metadata file, extracts the information, and outputs it (usually) in JSON format, often via datalad-metalad.
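To illustrate the role of an extractor, here is a minimal sketch that maps a parsed datacite.yml structure to a flat JSON record. The function name and the field mapping are assumptions for illustration, not the actual datalad-metalad extractor API, and YAML parsing (e.g. with PyYAML) is assumed to have happened beforehand:

```python
import json


def extract_datacite(datacite: dict) -> str:
    """Hypothetical extractor step: map parsed datacite.yml
    content to a flat JSON metadata record."""
    record = {
        "type": "dataset",
        "name": datacite.get("title"),
        "description": datacite.get("description"),
        "authors": [
            {"name": a.get("name")} for a in datacite.get("authors", [])
        ],
    }
    return json.dumps(record)
```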

Existing examples include:

Translators

Translators take datalad-metalad output and translate it into a datalad-catalog-schema-compatible format. They inherit from a base translator class and, for the purposes of the datalad catalog-translate method, use a common procedure for matching a specific translator to a specific metadata record. Some translators use jq bindings to do the translation; others use pure Python. Examples:
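The base-class/match pattern described above can be sketched roughly as follows. Class names, the `extractor_name` key, and the record layout are assumptions for illustration, not datalad-catalog's actual API:

```python
class Translator:
    """Base class: subclasses report whether they can handle a record."""

    def match(self, record: dict) -> bool:
        raise NotImplementedError

    def translate(self, record: dict) -> dict:
        raise NotImplementedError


class DataciteTranslator(Translator):
    def match(self, record: dict) -> bool:
        # e.g. match on the extractor name stored in the metalad record
        return record.get("extractor_name") == "datacite"

    def translate(self, record: dict) -> dict:
        meta = record["extracted_metadata"]
        return {"type": "dataset", "name": meta.get("title")}


# registered translators; the matching routine runs over all of them
TRANSLATORS = [DataciteTranslator()]


def get_translator(record: dict) -> Translator:
    """Find the single translator whose match() accepts the record."""
    matches = [t for t in TRANSLATORS if t.match(record)]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one matching translator, got {len(matches)}")
    return matches[0]
```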

Standalone extraction+translation scripts

Some standalone scripts have been created that are independent of both datalad-metalad's extraction functionality and datalad-catalog's translation functionality. These are used in the ABCD-J catalog pipeline, specifically:

datalad-catalog helpers

datalad-catalog ships with functions that help to construct catalog-ready records. They are located in https://github.com/datalad/datalad-catalog/blob/main/datalad_catalog/schema_utils.py, and are used by both the SFB1451 and ABCD-J catalog pipelines when constructing/translating records to be added to a catalog.


From my POV:
We should have a set of tools that make the creation of catalog-ready records from "raw" metadata formats as simple as possible. The concept of a "reader" was conceived in previous discussions, with the idea that there would be a reader for each metadata format deposited per dataset-version in the source specification. The reader would do everything necessary to get from the ingested state to the catalog-ready state; it would be supported by helper functionality living inside datalad-catalog, and would not depend on external packages/extensions. Pretty much what the abovementioned standalone scripts do, but perhaps with some wrapper functionality that provides a common interface?

@mih @mslw curious to hear your thoughts here. (@mslw please also add updates if I missed or misrepresented any relevant functionality from the SFB1451 pipeline)

@jsheunis
Member Author

My current idea is to create a new datalad-catalog module focused on collection ingestion. Likely a Collection class with methods for:

  • validation:
    • check if the collection has the expected tree structure (contains config and records directories)
    • check if the collection has a catalog-level config file
    • check if the collection has at least one dataset with at least one version with at least one metadata file
    • (? check if the metadata files are all known to existing readers)
  • ingestion:
    • for each metadata file in the collection, use the correct reader to transform the metadata file into a datalad-catalog-compatible record, and output it to stdout.
  • utilities:
    • some helper functions for reading from standard file formats (txt, json, yml, xlsx, tsv, etc)
    • possibly import and expose the utilities already available in schema_utils.py to help construct datalad-catalog-compatible records
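The validation and ingestion methods listed above could look roughly like the following sketch. Everything here (class name, the `config`/`records` layout, the reader-callable signature) is hypothetical and only mirrors the bullet points; nothing is an existing datalad-catalog API:

```python
from pathlib import Path


class Collection:
    """Hypothetical wrapper around a metadata collection directory."""

    def __init__(self, root: str):
        self.root = Path(root)

    def validate(self) -> list[str]:
        """Return a list of validation problems (empty means valid)."""
        problems = []
        # expected tree structure: config/ and records/ directories
        for d in ("config", "records"):
            if not (self.root / d).is_dir():
                problems.append(f"missing directory: {d}")
        # a catalog-level config file must exist
        if not list((self.root / "config").glob("*")):
            problems.append("no catalog-level config file")
        # at least one dataset / version / metadata file under records/
        if not list((self.root / "records").glob("*/*/*")):
            problems.append("no dataset version with a metadata file")
        return problems

    def ingest(self, readers: dict):
        """Yield a catalog-ready record for every metadata file,
        using a mapping from file name to reader callable."""
        for f in sorted((self.root / "records").glob("*/*/*")):
            reader = readers.get(f.name)
            if reader is not None:
                yield reader(f)
```

For ingestion, the records could be printed to stdout one JSON line at a time, as proposed above.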

For matching specific metadata formats to specific "readers", I am not sure yet. One option is very simple: provide a dictionary mapping <format-id> to reader when ingestion is invoked. Another is more automated but more involved: follow the same approach as the translation functionality currently implemented in datalad-catalog, where a base translation class has a "match" method that is overridden by specific translator implementations inheriting from the base class; translators are registered to the application, and a matching routine then finds the correct translator for a given record.
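The simpler dictionary-based option could be as small as this sketch. The reader functions and the choice of the file name as `<format-id>` are illustrative assumptions:

```python
import os


def datacite_reader(path):
    # hypothetical reader: would parse the file and return a catalog record
    return {"source_format": "datacite", "file": path}


# explicit mapping from format identifier (here: file name) to reader,
# supplied when ingestion is invoked
READERS = {
    "datacite.yml": datacite_reader,
}


def read_metadata(path: str) -> dict:
    """Dispatch a metadata file to its registered reader."""
    name = os.path.basename(path)
    try:
        return READERS[name](path)
    except KeyError:
        raise ValueError(f"no reader registered for {name}") from None
```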

@jsheunis
Member Author

Short comment: a starting point for matching a reader to a metadata source format could be an exact match on the file/directory name, e.g. a dataciteReader would be used if (and only if) the metadata source is a file named datacite.yml. This is simple enough. One thing that still bothers me about this approach is what to do with incremental changes to readers. E.g. if a new dataciteReader_v2 handles the unchanged incoming metadata slightly differently, should a new metadata file name be provided to allow a new and separate match (e.g. datacite_reader-v2.yml)?
