"last_seen" date on publisher gets updated even when rss contents are missing #68

Open
roncanepa opened this issue Feb 28, 2020 · 5 comments

Comments

@roncanepa
Contributor

@mielliott noticed that the last_seen date for a publisher seemed to get updated even when it shouldn't. I did some digging and was able to confirm it. The code handles any HTTP return code >400 properly, but one particular publisher's RSS URL was returning their custom "error" page with a 200 OK. As a result, the failure isn't handled or bubbled up properly, and nothing inspects the response body more deeply.

This field is mostly just for human use but should still be corrected.

The relevant code is at line 164 of update_db_from_rss in /idigbio_ingestion/update_publisher_recordset.py, which then passes to _do_rss in the same file.

@danstoner
Contributor

danstoner commented Mar 3, 2020

Which is the particular publisher?

@roncanepa
Contributor Author

@danstoner I've sent you a link that has additional info and logging

@danstoner
Contributor

We probably want to add a last_feed_parse_success date or something similar if we need to track the results of the feed parse event.
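A rough sketch of what tracking the two dates separately could look like. Everything here is hypothetical: the `record_feed_check` helper, the dict-shaped publisher row, and the choice to keep updating last_seen whenever the URL responds are assumptions for illustration, not how update_publisher_recordset.py works today.

```python
from datetime import datetime, timezone


def record_feed_check(publisher, parsed_ok, now=None):
    """Update timestamps on a publisher row (hypothetical dict) after a feed check.

    Assumption: last_seen records that the URL responded at all, while
    last_feed_parse_success only moves when the body parsed as a usable feed.
    """
    now = now or datetime.now(timezone.utc)
    publisher["last_seen"] = now
    if parsed_ok:
        # Only a successful parse advances this field.
        publisher["last_feed_parse_success"] = now
    return publisher


# A publisher whose URL answers 200 OK but serves an error page.
row = record_feed_check({"name": "example"}, parsed_ok=False)
print("last_feed_parse_success" in row)
```

With this split, a human looking at the row can tell "the server answered" apart from "we actually got a feed".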

@danstoner
Contributor

We are using the feedparser lib. We can probably leverage bozo detection.

https://pythonhosted.org/feedparser/bozo.html#advanced-bozo

For example, on content that is a web page instead of an xml RSS feed:

>>> feed.bozo
1
>>> feed.bozo_exception
SAXParseException('mismatched tag',)

Note to self, also check if we should be using len(feed.entries) instead of len(feed).
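A minimal sketch of how the bozo flag and the entry count could be combined into one check. The `feed_looks_valid` name is made up for illustration, and the stand-in objects only mimic the `bozo` and `entries` attributes that a feedparser result exposes; in the real code the argument would be whatever feedparser.parse() returned.

```python
from types import SimpleNamespace


def feed_looks_valid(parsed):
    """Return True if a feedparser-style result appears to be a usable feed.

    `parsed` is assumed to expose `bozo` and `entries`, like the object
    returned by feedparser.parse(). This helper is hypothetical.
    """
    # bozo is truthy when the parser hit a problem, e.g. an HTML page
    # served in place of the XML feed.
    if getattr(parsed, "bozo", 0):
        return False
    # Count the entries explicitly rather than taking len() of the result.
    return len(getattr(parsed, "entries", [])) > 0


# Stand-ins for parsed results (not real feedparser objects).
html_error_page = SimpleNamespace(bozo=1, entries=[])
good_feed = SimpleNamespace(bozo=0, entries=[{"title": "a recordset"}])

print(feed_looks_valid(html_error_page))
print(feed_looks_valid(good_feed))
```

Two caveats on this sketch: feedparser can also set bozo for problems that still leave a usable feed (encoding mismatches, for example), so rejecting every bozo feed may be too strict and inspecting bozo_exception could be the better policy. On the len(feed) question, the parsed result is a dict-like object, so len(feed) most likely counts its top-level keys rather than the entries.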

@nrejac
Contributor

nrejac commented May 15, 2020

See: #72 to find where to correct this.
