Problems Parsing Titles #262

grantdelozier · 2016-10-03T19:24:18Z

Title extraction raises an IndexError on certain websites, even though the pages have titles.

  File "/usr/local/lib/python2.7/site-packages/ContentAnalysis-0.1.1-py2.7.egg/ContentAnalysis/document.py", line 53, in parse
    ginfo = g.extract(url=self.link)
  File "/usr/local/lib/python2.7/site-packages/goose/__init__.py", line 56, in extract
    return self.crawl(cc)
  File "/usr/local/lib/python2.7/site-packages/goose/__init__.py", line 66, in crawl
    article = crawler.crawl(crawl_candiate)
  File "/usr/local/lib/python2.7/site-packages/goose/crawler.py", line 154, in crawl
    self.article.title = self.title_extractor.extract()
  File "/usr/local/lib/python2.7/site-packages/goose/extractors/title.py", line 99, in extract
    return self.get_title()
  File "/usr/local/lib/python2.7/site-packages/goose/extractors/title.py", line 78, in get_title
    return self.clean_title(title)
  File "/usr/local/lib/python2.7/site-packages/goose/extractors/title.py", line 56, in clean_title
    if title_words[0] in TITLE_SPLITTERS:
IndexError: list index out of range

You can reproduce this by running a goose extraction on a site such as http://daydreamingfoodie.com/


grantdelozier · 2016-10-03T20:42:08Z

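On this site and plenty of others, the error occurs when the page title is identical to the Open Graph site name. Judging by the traceback, title_words is then empty by the time clean_title indexes title_words[0].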

Fixed the issue in this commit of my fork
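
The commit itself is not reproduced here. As a rough illustration of the failure mode, here is a minimal sketch that assumes clean_title removes the site name from the title and then splits what is left into words. The function signature, the `site_name` parameter, and the contents of `TITLE_SPLITTERS` below are hypothetical stand-ins, not goose's actual code:

```python
# Hypothetical stand-in for goose's splitter list; the real values may differ.
TITLE_SPLITTERS = ["|", "-", ":"]


def clean_title(title, site_name=None):
    """Strip the site name from a title, tolerating an empty remainder."""
    if site_name:
        # Assumption: the site name is removed by plain substring replacement.
        title = title.replace(site_name, "").strip()

    title_words = title.split()

    # Guard: if the title was exactly the site name, nothing is left,
    # and indexing title_words[0] would raise IndexError.
    if not title_words:
        return ""

    # Drop a leading or trailing splitter left behind by the removal.
    if title_words[0] in TITLE_SPLITTERS:
        title_words.pop(0)
    if title_words and title_words[-1] in TITLE_SPLITTERS:
        title_words.pop()

    return " ".join(title_words)


print(repr(clean_title("Daydreaming Foodie", "Daydreaming Foodie")))  # ''
print(repr(clean_title("My Article | TechCrunch", "TechCrunch")))     # 'My Article'
```

Returning an empty string is only one option. Falling back to the original, unstripped title when the remainder is empty would be an equally reasonable choice.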
