
In RequestReport, use the hashstore-java library to retrieve a system metadata document or metadata eml document for a pid #464

Open · doulikecookiedough opened this issue Jan 13, 2025 · 3 comments

Comments

doulikecookiedough commented Jan 13, 2025

Begin the process of removing metadig-engine API calls to the CN or MN for retrieving system metadata and EML metadata documents for a given pid by refactoring the RequestReport class. For now, retain the existing behavior of the metadig-engine for public datasets:

  • If a NotAuthorized exception is thrown while accessing system metadata or EML documents via CN/MN, fallback logic will attempt to retrieve the data from HashStore.
  • We will work with the HashStore location for dev.nceas.ucsb.edu on the dev k8s cluster.

This first step will address and resolve:

Branch:

  • feature-hashstore-libimport
```java
// submitReportRequest
try {
    if (isCN) {
        sysmeta = cnNode.getSystemMetadata(session, pid);
    } else {
        sysmeta = mnNode.getSystemMetadata(session, pid);
    }
} catch (NotAuthorized na) {
    log.error("Not authorized to read sysmeta for pid: " + pid.getValue()
            + ", continuing with next pid...");
    return;
} catch (Exception e) {
    throw e;
}

try {
    if (isCN) {
        objectIS = cnNode.get(session, pid);
    } else {
        objectIS = mnNode.get(session, pid);
    }
    log.trace("Retrieved metadata object for pid: " + pidStr);
} catch (NotAuthorized na) {
    log.error("Not authorized to read pid: " + pid.getValue()
            + ", unable to retrieve metadata, continuing with next pid...");
    return;
}
```
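
The fallback behavior this issue describes could be sketched roughly as below. This is a minimal illustration only: the `HashStoreClient` interface and its `retrieveMetadata` method are hypothetical stand-ins, not the actual hashstore-java API, and the DataONE call is simulated with an in-memory map.

```java
// Sketch: when the CN/MN call throws NotAuthorized, retry via HashStore
// instead of skipping the pid. HashStoreClient is a hypothetical stand-in
// for hashstore-java; the real library's method names may differ.
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Map;

public class SysmetaFallback {

    /** Thrown by the (simulated) DataONE API for private objects. */
    static class NotAuthorized extends Exception {}

    /** Hypothetical HashStore lookup: pid -> sysmeta stream. */
    interface HashStoreClient {
        InputStream retrieveMetadata(String pid) throws Exception;
    }

    /** Try the node API first; on NotAuthorized, fall back to HashStore. */
    static InputStream getSysmeta(String pid, boolean authorized,
                                  Map<String, byte[]> nodeStore,
                                  HashStoreClient hashStore) throws Exception {
        try {
            if (!authorized) {
                throw new NotAuthorized();
            }
            return new ByteArrayInputStream(nodeStore.get(pid));
        } catch (NotAuthorized na) {
            // Previously RequestReport returned here and skipped the pid;
            // with the fallback we retry through HashStore instead.
            return hashStore.retrieveMetadata(pid);
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, byte[]> nodeStore =
                Map.of("pid.public", "public sysmeta".getBytes());
        HashStoreClient hs =
                pid -> new ByteArrayInputStream("private sysmeta".getBytes());

        // Public pid resolves through the node API.
        System.out.println(new String(
                getSysmeta("pid.public", true, nodeStore, hs).readAllBytes()));
        // Private pid falls back to HashStore instead of being skipped.
        System.out.println(new String(
                getSysmeta("pid.private", false, nodeStore, hs).readAllBytes()));
    }
}
```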
doulikecookiedough (Author) commented:

Update:

  • Pushed changes to branch feature-hashstore-libimport, where hashstore was imported into RequestReport and is executed when a NotAuthorized exception is thrown
  • Next steps: deploy new images to the dev k8s cluster, upload private test datasets to dev.nceas.ucsb.edu, and check whether metadig reports are generated

mbjones (Member) commented Jan 16, 2025

@doulikecookiedough I quickly looked over your changes -- in general it's looking good, but I have two high-level suggestions:

  1. Please reverse the logic of the try/catch block, so that HashStore is tried first and the API is called only if that fails
  2. I see a lot of hardcoded values in the code, including the cluster address, hashstore settings, etc. Those must be pulled out into configuration files before this gets merged
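
The two suggestions above could be sketched as follows. This is an illustration only: the functional-interface shapes and the property key `store.hashstore.rootDirectory` are hypothetical, not metadig's actual configuration keys or the hashstore-java API.

```java
// Sketch of suggestion 1 (HashStore first, API as fallback) and
// suggestion 2 (pull hardcoded values into configuration).
// Interfaces and property names are hypothetical stand-ins.
import java.util.Optional;
import java.util.Properties;
import java.util.function.Function;

public class HashStoreFirst {

    /** Suggestion 1: cheap local HashStore lookup first, remote API on miss. */
    static String getSysmeta(String pid,
                             Function<String, Optional<String>> hashStore,
                             Function<String, String> nodeApi) {
        return hashStore.apply(pid)                    // try HashStore first
                .orElseGet(() -> nodeApi.apply(pid));  // API only if that fails
    }

    /** Suggestion 2: settings come from configuration, not code. */
    static String storePath(Properties config) {
        // Hypothetical key; metadig would load this from its properties file.
        return config.getProperty("store.hashstore.rootDirectory",
                "/var/metadig/hashstore");
    }

    public static void main(String[] args) {
        Function<String, Optional<String>> hs =
                pid -> pid.startsWith("local:") ? Optional.of("from hashstore")
                                                : Optional.empty();
        Function<String, String> api = pid -> "from node API";

        System.out.println(getSysmeta("local:1", hs, api));  // from hashstore
        System.out.println(getSysmeta("remote:1", hs, api)); // from node API
    }
}
```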

In addition, in my quick skim of that branch I noticed some odd changes, like having to set an unusual user agent, which is also hardcoded in sha 6a9ac1d. Why is this needed? It seems like very much the wrong place to set a user agent -- we probably now have different user agents for different requests from metadig. If it really needs to be changed (which would be strange), can we do it in one central location?

Finally, @jeanetteclark, could you please give all of these changes a thorough code review before they get merged into develop?

jeanetteclark (Collaborator) commented:

I reverted the User-Agent commit since I don't think that is causing the 403 error.

Regarding the original issue - this is going to require more thought than what is currently implemented. Metadig runs on a bunch of different member nodes, in addition to the CN, for metadata checks. Only one of those nodes will have a hashstore configured. So the code will need to know what nodes are configured for hashstore and run hashstore calls on those, and dataone API calls for the rest.
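
The per-node dispatch described above could look roughly like this. The node identifiers and the flat set-based config are illustrative assumptions, not metadig's actual node configuration.

```java
// Sketch: only nodes with a configured HashStore use local retrieval;
// all other nodes go through the DataONE API. Node IDs and the config
// shape here are hypothetical.
import java.util.Set;

public class NodeDispatch {

    /** Nodes that have a HashStore configured (hypothetical config source). */
    static final Set<String> HASHSTORE_NODES = Set.of("urn:node:mnDevUCSB");

    /** Pick the retrieval path for a given node. */
    static String retrievalPath(String nodeId) {
        return HASHSTORE_NODES.contains(nodeId) ? "hashstore" : "dataone-api";
    }

    public static void main(String[] args) {
        System.out.println(retrievalPath("urn:node:mnDevUCSB")); // hashstore
        System.out.println(retrievalPath("urn:node:CN"));        // dataone-api
    }
}
```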

I actually wasn't planning on replacing this with hashstore at all, since there is only one API call for each suite that is run, to get the sysmeta and object for the metadata object. IMO this isn't worth refactoring at the moment since it isn't going to affect performance at all.

Eventually we will need to re-work node related configuration anyway (see #374). I think we should keep focusing on the data suite side of things and leave this alone for now, revisiting when we have a better idea of how we can configure multiple nodes with multiple store types.
