
In RequestReport, use the hashstore-java library to retrieve a system metadata document or metadata eml document for a pid #464

Open · doulikecookiedough opened this issue Jan 13, 2025 · 3 comments

Comments

doulikecookiedough commented Jan 13, 2025

Begin the process of removing metadig-engine API calls to the CN or MN for retrieving system metadata and EML metadata documents for a given pid by refactoring the RequestReport class. For now, retain the existing behavior of the metadig-engine for public datasets:

  • If a NotAuthorized exception is thrown while accessing system metadata or EML documents via CN/MN, fallback logic will attempt to retrieve the data from HashStore.
  • We will work with the HashStore location for dev.nceas.ucsb.edu on the dev k8s cluster.

This first step will address and resolve:

Branch:

  • feature-hashstore-libimport
```java
// submitReportRequest
try {
    if (isCN) {
        sysmeta = cnNode.getSystemMetadata(session, pid);
    } else {
        sysmeta = mnNode.getSystemMetadata(session, pid);
    }
} catch (NotAuthorized na) {
    log.error("Not authorized to read sysmeta for pid: " + pid.getValue()
            + ", continuing with next pid...");
    return;
} catch (Exception e) {
    throw e;
}

try {
    if (isCN) {
        objectIS = cnNode.get(session, pid);
    } else {
        objectIS = mnNode.get(session, pid);
    }
    log.trace("Retrieved metadata object for pid: " + pidStr);
} catch (NotAuthorized na) {
    log.error("Not authorized to read pid: " + pid.getValue()
            + ", unable to retrieve metadata, continuing with next pid...");
    return;
}
```
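
The fallback behavior this issue describes could be sketched roughly as below. This is a minimal illustration only: the `HashStoreClient` interface and its `retrieveMetadata` method are hypothetical stand-ins, not the actual hashstore-java API, and the DataONE call is simulated with an in-memory map.

```java
// Sketch: when the CN/MN call throws NotAuthorized, retry via HashStore
// instead of skipping the pid. HashStoreClient is a hypothetical stand-in
// for hashstore-java; the real library's method names may differ.
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Map;

public class SysmetaFallback {

    /** Thrown by the (simulated) DataONE API for private objects. */
    static class NotAuthorized extends Exception {}

    /** Hypothetical HashStore lookup: pid -> sysmeta stream. */
    interface HashStoreClient {
        InputStream retrieveMetadata(String pid) throws Exception;
    }

    /** Try the node API first; on NotAuthorized, fall back to HashStore. */
    static InputStream getSysmeta(String pid, boolean authorized,
                                  Map<String, byte[]> nodeStore,
                                  HashStoreClient hashStore) throws Exception {
        try {
            if (!authorized) {
                throw new NotAuthorized();
            }
            return new ByteArrayInputStream(nodeStore.get(pid));
        } catch (NotAuthorized na) {
            // Previously RequestReport returned here and skipped the pid;
            // with the fallback we retry through HashStore instead.
            return hashStore.retrieveMetadata(pid);
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, byte[]> nodeStore =
                Map.of("pid.public", "public sysmeta".getBytes());
        HashStoreClient hs =
                pid -> new ByteArrayInputStream("private sysmeta".getBytes());

        // Public pid resolves through the node API.
        System.out.println(new String(
                getSysmeta("pid.public", true, nodeStore, hs).readAllBytes()));
        // Private pid falls back to HashStore instead of being skipped.
        System.out.println(new String(
                getSysmeta("pid.private", false, nodeStore, hs).readAllBytes()));
    }
}
```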
doulikecookiedough (Author) commented:

Update:

  • Pushed changes to branch feature-hashstore-libimport, where hashstore was imported into RequestReport and is executed when a NotAuthorized exception is thrown
  • Next steps: deploy new images to the dev k8s cluster, upload private test datasets to dev.nceas.ucsb.edu, and check whether metadig reports are generated

mbjones (Member) commented Jan 16, 2025

@doulikecookiedough I quickly looked over your changes -- in general it's looking good, but I have two high-level suggestions:

  1. Please reverse the logic of the try/catch block, so that HashStore is tried first and the API is called only if that fails
  2. I see a lot of hardcoded values in the code, including the cluster address, hashstore settings, etc. Those must be pulled out into configuration files before this gets merged
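
The two suggestions above could be sketched as follows. This is an illustration only: the functional-interface shapes and the property key `store.hashstore.rootDirectory` are hypothetical, not metadig's actual configuration keys or the hashstore-java API.

```java
// Sketch of suggestion 1 (HashStore first, API as fallback) and
// suggestion 2 (pull hardcoded values into configuration).
// Interfaces and property names are hypothetical stand-ins.
import java.util.Optional;
import java.util.Properties;
import java.util.function.Function;

public class HashStoreFirst {

    /** Suggestion 1: cheap local HashStore lookup first, remote API on miss. */
    static String getSysmeta(String pid,
                             Function<String, Optional<String>> hashStore,
                             Function<String, String> nodeApi) {
        return hashStore.apply(pid)                    // try HashStore first
                .orElseGet(() -> nodeApi.apply(pid));  // API only if that fails
    }

    /** Suggestion 2: settings come from configuration, not code. */
    static String storePath(Properties config) {
        // Hypothetical key; metadig would load this from its properties file.
        return config.getProperty("store.hashstore.rootDirectory",
                "/var/metadig/hashstore");
    }

    public static void main(String[] args) {
        Function<String, Optional<String>> hs =
                pid -> pid.startsWith("local:") ? Optional.of("from hashstore")
                                                : Optional.empty();
        Function<String, String> api = pid -> "from node API";

        System.out.println(getSysmeta("local:1", hs, api));  // from hashstore
        System.out.println(getSysmeta("remote:1", hs, api)); // from node API
    }
}
```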

In addition, in my quick skim of that branch I noticed some odd changes, like having to set an unusual user agent, which is also hardcoded in sha 6a9ac1d. Why is this needed? It seems like very much the wrong place to set a user agent -- we probably now have different user agents for different requests from metadig. If it really needs to be changed (which would be strange), can we do it in one central location?

Finally, @jeanetteclark, could you please give all of these changes a thorough code review before they get merged into develop?

jeanetteclark (Collaborator) commented:

I reverted the User-Agent commit since I don't think that is causing the 403 error.

Regarding the original issue - this is going to require more thought than what is currently implemented. Metadig runs on a bunch of different member nodes, in addition to the CN, for metadata checks. Only one of those nodes will have a hashstore configured. So the code will need to know what nodes are configured for hashstore and run hashstore calls on those, and dataone API calls for the rest.
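
The per-node dispatch described above could look roughly like this. The node identifiers and the flat set-based config are illustrative assumptions, not metadig's actual node configuration.

```java
// Sketch: only nodes with a configured HashStore use local retrieval;
// all other nodes go through the DataONE API. Node IDs and the config
// shape here are hypothetical.
import java.util.Set;

public class NodeDispatch {

    /** Nodes that have a HashStore configured (hypothetical config source). */
    static final Set<String> HASHSTORE_NODES = Set.of("urn:node:mnDevUCSB");

    /** Pick the retrieval path for a given node. */
    static String retrievalPath(String nodeId) {
        return HASHSTORE_NODES.contains(nodeId) ? "hashstore" : "dataone-api";
    }

    public static void main(String[] args) {
        System.out.println(retrievalPath("urn:node:mnDevUCSB")); // hashstore
        System.out.println(retrievalPath("urn:node:CN"));        // dataone-api
    }
}
```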

I actually wasn't planning on replacing this with hashstore at all, since there is only one API call for each suite that is run, to get the sysmeta and object for the metadata object. IMO this isn't worth refactoring at the moment since it isn't going to affect performance at all.

Eventually we will need to re-work node related configuration anyway (see #374). I think we should keep focusing on the data suite side of things and leave this alone for now, revisiting when we have a better idea of how we can configure multiple nodes with multiple store types.
