#112 Parse epub instead of mix of ppub and epub #141

Merged
merged 10 commits into from
Jul 12, 2024

nils-herrmann
Contributor

The XML looks like this:

<pub-date pub-type="ppub">
    <month>9</month>
    <year>2005</year>
</pub-date>

<pub-date pub-type="epub">
    <day>31</day>
    <month>5</month>
    <year>2005</year>
</pub-date>

The previous code mixed values from both elements. The new implementation parses only the epub element.
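
For illustration, a minimal sketch of reading one `pub-date` element at a time, so that fields from ppub and epub never mix. This is not the code from the PR: the `article-meta` wrapper, the `SAMPLE` string, and the helper name `pub_date_fields` are assumptions made for this example, and it uses the standard-library `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element so both <pub-date> nodes from the
# description can be parsed as one document.
SAMPLE = """
<article-meta>
  <pub-date pub-type="ppub"><month>9</month><year>2005</year></pub-date>
  <pub-date pub-type="epub"><day>31</day><month>5</month><year>2005</year></pub-date>
</article-meta>
"""


def pub_date_fields(root, pub_type):
    """Return year/month/day text taken from a single <pub-date> of the given type."""
    fields = {"year": None, "month": None, "day": None}
    node = root.find(f".//pub-date[@pub-type='{pub_type}']")
    if node is None:
        return fields  # no element of this type: all values stay None
    for name in fields:
        # findtext returns None when the child element is absent
        fields[name] = node.findtext(name)
    return fields


root = ET.fromstring(SAMPLE)
epub = pub_date_fields(root, "epub")
ppub = pub_date_fields(root, "ppub")
```

Because each call looks at exactly one element, the ppub result here has no day rather than borrowing the day from epub.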

@Michael-E-Rose
Collaborator

In general, the print publication date is more relevant. Otherwise, authors whose articles were published in the 1970s would suddenly appear to still be publishing.

But it would be great to have a new attribute: epublication_date.

Also thanks for already updating the tests!


@Michael-E-Rose left a comment

Two more changes please.

pubmed_parser/pubmed_oa_parser.py Outdated
pubmed_parser/pubmed_oa_parser.py Outdated

@Michael-E-Rose left a comment

Let's improve the function one more time

pubmed_parser/pubmed_oa_parser.py Outdated
pubmed_parser/pubmed_oa_parser.py
pubmed_parser/pubmed_oa_parser.py Outdated
… different functions. The new date format avoids trailing '-'. Publication year is now returned as int (or None)
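
A possible way to build a date string without a trailing '-' is to stop at the first missing part. The sketch below is only an illustration of that idea: the function name `format_date` and the zero-padding of month and day are assumptions for this example, not taken from the PR.

```python
def format_date(year, month=None, day=None):
    """Join the available date parts with '-', stopping at the first missing one."""
    parts = []
    for value, width in ((year, 4), (month, 2), (day, 2)):
        if not value:
            break  # nothing after a missing part is appended, so no trailing '-'
        parts.append(str(value).zfill(width))  # zero-padding is an assumption
    return "-".join(parts)
```

With the ppub values from the description (`year="2005"`, `month="9"`, no day) this yields a year-month string, and with no year at all it yields an empty string.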

@Michael-E-Rose left a comment

Simplify the year to int conversion

pubmed_parser/pubmed_oa_parser.py
pubmed_parser/pubmed_oa_parser.py Outdated
@nils-herrmann
Contributor Author

The new commit parses the collection date if ppub is missing, and uses try-except to get pub_year.
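
The fallback and the try-except could look roughly like the sketch below. It is a hedged illustration rather than the merged code: the helper names `select_print_date` and `pub_year` are hypothetical, and it uses the standard-library `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def select_print_date(root):
    """Return the ppub <pub-date>, or the collection one if ppub is missing."""
    for pub_type in ("ppub", "collection"):
        node = root.find(f".//pub-date[@pub-type='{pub_type}']")
        if node is not None:
            return node
    return None


def pub_year(node):
    """Return the year as int, or None if it is missing or not a number."""
    try:
        return int(node.findtext("year"))
    except (AttributeError, TypeError, ValueError):
        # AttributeError: node is None; TypeError: no <year> child;
        # ValueError: <year> text is not an integer
        return None
```

A call such as `pub_year(select_print_date(root))` then returns an int when either element provides a usable year, and `None` otherwise.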

…_dict to have empty values if node is None. This commit conducts the needed changes.
…ub. Remark: File 3460867 has collection and epub.
@Michael-E-Rose merged commit 327403f into titipata:master on Jul 12, 2024
0 of 2 checks passed
@nils-herrmann mentioned this pull request on Sep 2, 2024