Openalex fetch example #61

Open. Wants to merge 3 commits into base: main.

@bkampe bkampe commented Aug 28, 2024

Added an example of fetching publication metadata from OpenAlex, based on the JSON fetch.
Three example queries are available in example-scripts/bash-scripts/full-harvest-examples/1.13-1.15-examples/example-openalex/openAlexfetch.config.xml. Choose one for a first test, or modify it to fit your needs.

OpenAlex fetch is already used in the Research Atlas: https://forschungsatlas.fid-bau.de/research

The namespace in example-scripts/bash-scripts/full-harvest-examples/1.13-1.15-examples/example-openalex/openalex-to-vivo.datamap.xsl needs to be adjusted to match your settings in runtime.properties:
<xsl:variable name = "baseURI">https://forschungsatlas.fid-bau.de/individual/</xsl:variable>

JSONFetch.java was extended to handle nested objects. Some filtering of unwanted characters was also added to avoid problems with the XSLTranslator (javax.xml.transform).
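
To illustrate the kind of character filtering described above, here is a minimal sketch that drops code points not permitted by XML 1.0, a common cause of javax.xml.transform failures. It is not the code from this PR: the class name XmlSanitizer and the choice of which characters to drop are assumptions, since the description does not say which characters JSONFetch.java actually filters.

```java
final class XmlSanitizer {

    // True for code points allowed by the XML 1.0 "Char" production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the input with every disallowed code point removed.
    static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        in.codePoints()
          .filter(XmlSanitizer::isValidXmlChar)
          .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical field value containing a form feed (0x0C).
        String raw = "Title" + (char) 0x0C + " with a control character";
        System.out.println(stripInvalidXmlChars(raw));
    }
}
```

Filtering by code point rather than by char keeps valid surrogate pairs intact while still dropping unpaired surrogates.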

Closes #56

@chenejac chenejac self-requested a review December 23, 2024 09:43

@chenejac chenejac left a comment

@bkampe thanks for this. I haven't tested it yet, but I briefly reviewed the code. There is one small comment about the .gitignore file, and one more about the SPARQL API based approach. Ivan Mrsulja got the SPARQL API based approach working for the DSpace ETL (#63). There is a parameter for the main script file (its value can be tdb or sparql). I am wondering whether that approach could be adopted in this PR as well?

Comment on lines +32 to +34
/example-scripts/bash-scripts/full-harvest-examples/1.13-1.15-examples/example-openalex/logs/
/example-scripts/bash-scripts/full-harvest-examples/1.13-1.15-examples/example-openalex/data/
/example-scripts/bash-scripts/full-harvest-examples/1.13-1.15-examples/example-openalex/previous-harvest/
Contributor

I thought lines 3-7 already cover this case?

I think that adding:

**/data
**/logs
**/previous-harvest

to the root-level .gitignore will solve this issue.

Contributor Author

Should have been in there, I thought. Will take a look.

@ivanmrsulja ivanmrsulja left a comment

LGTM! Please check out my comments; I believe they can be helpful.

Comment on lines +408 to +411
//log.trace("Adding record: " + fixedkey + "_" + recID);
//log.trace("data: "+ sb.toString());
//log.info("rhOutput: "+ this.rhOutput);
//log.info("recID: "+recID);

You can probably remove these commented-out lines (and the commented-out code snippets in other places); it will clean up the code slightly.

Comment on lines +596 to +607
sb.append(" <");
sb.append(SpecialEntities.xmlEncode(field));
sb.append(">");

// insert field value
sb.append(SpecialEntities.xmlEncode(val.toString().trim()));

// Field END
sb.append("</");
sb.append(SpecialEntities.xmlEncode(field));
sb.append(">\n");
return sb.toString();

Can these appends be chained? Each append call returns the builder itself, so the calls can be written as one fluent chain.
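
A minimal sketch of what the chained form could look like, assuming the method builds one element from scratch. The class name TagWriter and the local xmlEncode helper are hypothetical stand-ins (the real code calls the harvester's SpecialEntities.xmlEncode and appends to an existing sb); only the chaining pattern is the point here.

```java
final class TagWriter {

    // Hypothetical stand-in for SpecialEntities.xmlEncode; escapes the five XML special characters.
    static String xmlEncode(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Builds " <field>value</field>\n" with a single chain of append calls.
    static String element(String field, Object val) {
        String tag = xmlEncode(field);
        return new StringBuilder()
                .append(" <").append(tag).append(">")       // field start
                .append(xmlEncode(val.toString().trim()))   // field value
                .append("</").append(tag).append(">\n")     // field end
                .toString();
    }

    public static void main(String[] args) {
        System.out.print(element("title", " Fish & Chips "));
    }
}
```

Encoding the field name once and reusing it also avoids the second xmlEncode(field) call from the original snippet. The same chaining works with StringBuffer, since its append methods also return the buffer.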

Comment on lines +389 to +391
.replaceAll(" |/", "_")
.replaceAll("\\(|\\)", "")
.replaceAll("/", "_");

I think you can use replaceAll("[ /]", "_").replaceAll("[()]", "") to make this clearer.
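
As a quick check that the suggested form behaves like the original chain, the sketch below runs both on the same input. The class and method names are hypothetical, and the sample key is made up; whatever precedes the first replaceAll in the real code is not shown in the diff and is left out here.

```java
final class KeyCleanup {

    // The three calls as they appear in the diff.
    static String original(String key) {
        return key
                .replaceAll(" |/", "_")
                .replaceAll("\\(|\\)", "")
                .replaceAll("/", "_");
    }

    // The reviewer's suggestion: character classes instead of alternations.
    static String simplified(String key) {
        return key
                .replaceAll("[ /]", "_")
                .replaceAll("[()]", "");
    }

    public static void main(String[] args) {
        String sample = "primary location/source (id)";
        System.out.println(original(sample));
        System.out.println(simplified(sample));
    }
}
```

The third call in the original chain never matches anything, because the first call has already replaced every slash, so dropping it does not change the result.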

Contributor Author

I'm not good at regex, and this was obviously added incrementally. I'm happy to use your suggestion to make it cleaner.

.replaceAll(" |/", "_")
.replaceAll("\\(|\\)", "")
.replaceAll("/", "_");
if (!Character.isDigit(fixedkey.charAt(0)) && !fixedkey.equals("abstract_inverted_index")) {

Can fixedkey ever be null? If so, I think there should be a null check for that edge case.
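
One possible shape for such a guard, as a sketch only. The class and method names are hypothetical. The sketch also treats an empty key as unusable, which goes beyond the reviewer's question: charAt(0) throws StringIndexOutOfBoundsException on an empty string, so that case needs handling too.

```java
final class KeyGuard {

    // Same condition as in the diff, preceded by null and empty checks.
    static boolean isUsableKey(String fixedkey) {
        if (fixedkey == null || fixedkey.isEmpty()) {
            return false; // nothing to inspect; skip this key
        }
        return !Character.isDigit(fixedkey.charAt(0))
                && !fixedkey.equals("abstract_inverted_index");
    }

    public static void main(String[] args) {
        System.out.println(isUsableKey(null));
        System.out.println(isUsableKey("display_name"));
    }
}
```

Whether skipping is the right reaction to a null or empty key, as opposed to logging or failing, depends on how JSONFetch.java is expected to treat malformed input.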

}

public String getTagName(String field, Object val) {
StringBuffer sb = new StringBuffer();

Why is StringBuffer used here? If this class will not be used in a multithreaded environment, I think we should switch to StringBuilder everywhere: it is unsynchronized and therefore faster.

@chenejac chenejac left a comment

@bkampe I have tested the harvester on Windows 10 and it works. Well done. Please check my comments.

<!-- <param name="file">https://api.openalex.org/works?filter=concepts.id:C10238366|C118416809|C119823426|C120991184|C122156500|C123657996|C124363303|C127416549|C147176958|C148803439|C154226666|C158049464|C158550234|C1631582|C173560066|C178432105|C190831278|C196316656|C203115093|C203299862|C205300905|C2775926657|C2776009117|C2776081408|C2776136241|C2776161637|C2776311590|C2776445639|C2776748203|C2776825979|C2777231864|C2777364373|C2777800518|C2777831296|C2778206487|C2778647717|C2778684775|C2778753569|C2778906150|C2779054714|C2779201158|C2779265402|C2779331490|C2779635184|C2780021121|C2780113678|C2780344732|C2780886216|C2780933643|C2781052401,authorships.institutions.country_code:de&amp;per-page=200&amp;[email protected]&amp;cursor=*</param>-->

<param name="output">raw-records.config.xml</param>
<param name="namespaceBase">http://vivo.example.com/harvest/aims_users/</param>
Contributor

What is the purpose of this parameter, and how does it differ from baseURI in *.datamap.xsl? Moreover, this value is hardcoded in the *.datamap.xsl file, meaning that if someone changes the value in this file, they also have to update the other file for the harvesting process to work properly. I found the file changenamespace-all.config in other examples. Is there any chance of using it for the OpenAlex ETL process?

Contributor Author

I just copied it from another existing config file. I have to find out what its purpose is.

xmlns:vitro = 'http://vitro.mannlib.cornell.edu/ns/vitro/0.7#'
xmlns:vcard = 'http://www.w3.org/2006/vcard/ns#'
xmlns:kdsf-vivo = 'http://lod.tib.eu/onto/kdsf/'
xmlns:node-publication='http://vivo.example.com/harvest/aims_users/fields/publication/'
Contributor

Should we keep this hardcoded?

Contributor Author

There are better ways to handle that, but this is how the harvester was built. I kept most of the existing configuration and functionality and added just the necessary things.

xmlns:node-publication='http://vivo.example.com/harvest/aims_users/fields/publication/'
xmlns:fn='http://www.w3.org/2005/xpath-functions'
xmlns:functx='http://www.functx.com'
xmlns:vivo-oa='http://lod.tib.eu/onto/vivo-oa/'
Contributor

TIB-specific namespace.

Contributor Author

I will remove all the TIB-specific things.

xmlns:c4o='http://purl.org/spar/c4o/' >

<xsl:output method = "xml" indent = "yes"/>
<xsl:variable name = "baseURI">https://forschungsatlas.fid-bau.de/individual/</xsl:variable>
Contributor

Hardcoded to a TIB-specific value.

Contributor Author

I will remove all the TIB-specific things.

<xsl:if test="normalize-space( $cited_by_count )">
<rdf:Description rdf:about="{$baseURI}gcf_{$oaid}">
<rdf:type rdf:resource="http://purl.org/spar/c4o/GlobalCitationCount"/>
<c4o:hasGlobalCountSource rdf:resource="https://forschungsatlas01.develop.service.tib.eu/individual/n4885"/>
Contributor

TIB-specific.

-->
<config>
# harvesting publications from TIB – Leibniz Information Centre for Science and Technology. exchange the ROR ID to test with your institution.
<param name="file">https://api.openalex.org/works?filter=authorships.institutions.ror:https://ror.org/04aj4c181&amp;per-page=200&amp;[email protected]&amp;cursor=*</param>
Contributor

I would suggest adding an example with a publication date range here. I used this one:

Suggested change
<param name="file">https://api.openalex.org/works?filter=authorships.institutions.ror:https://ror.org/04aj4c181&amp;per-page=200&amp;[email protected]&amp;cursor=*</param>
<param name="file">https://api.openalex.org/works?filter=authorships.institutions.ror:https://ror.org/04aj4c181,from_publication_date:2024-12-01,to_publication_date:2024-12-31&amp;per-page=200&amp;[email protected]&amp;cursor=*</param>
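
For anyone who wants to generate such a URL per month instead of editing the dates by hand, here is a small sketch. The class and method names are hypothetical; the filter names, per-page and cursor values are taken from the suggested URL above. The e-mail parameter is left out because its value is obscured on this page.

```java
import java.time.YearMonth;

final class OpenAlexQuery {

    // Builds a works query for one institution (by ROR URL) limited to one calendar month.
    static String monthlyWorksUrl(String rorUrl, YearMonth month) {
        String filter = "authorships.institutions.ror:" + rorUrl
                + ",from_publication_date:" + month.atDay(1)          // ISO date, e.g. 2024-12-01
                + ",to_publication_date:" + month.atEndOfMonth();     // last day of that month
        return "https://api.openalex.org/works?filter=" + filter
                + "&per-page=200&cursor=*";
    }

    // Inside the XML config file the ampersands must be written as &amp;.
    static String forConfigXml(String url) {
        return url.replace("&", "&amp;");
    }

    public static void main(String[] args) {
        String url = monthlyWorksUrl("https://ror.org/04aj4c181", YearMonth.of(2024, 12));
        System.out.println("<param name=\"file\">" + forConfigXml(url) + "</param>");
    }
}
```

Using YearMonth avoids hardcoding the length of each month, including February in leap years.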

Successfully merging this pull request may close these issues.

Add harvesting OpenAlex data for a certain institution Add harvesting script for OpenAlex
3 participants