Crawler ignoring base tag #757

Closed
p3lim opened this issue Aug 5, 2020 · 2 comments · Fixed by #2763

@p3lim

p3lim commented Aug 5, 2020

When crawling a website and finding RSS links, the <base> element should be taken into account, if present.

Example: https://www.mmo-champion.com/ redirects to the /content/ path, and the <link rel="alternate" ...> is a relative link, but the site also has <base href="https://www.mmo-champion.com/" /> which should be taken into account. Without this, attempting to add the feed will fail.
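To illustrate the expected behaviour, here is a minimal sketch using only Go's standard `net/url` package. The helper name `resolveLink` and the relative feed link `external.php?do=rss` are assumptions made up for this example; they are not taken from the crawler's code or from the site's actual markup.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink is a hypothetical helper: it resolves href against the
// <base href> value when one is present, otherwise against the page URL.
func resolveLink(pageURL, baseHref, href string) (string, error) {
	page, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}

	base := page
	if baseHref != "" {
		ref, err := url.Parse(baseHref)
		if err != nil {
			return "", err
		}
		// The base href may itself be relative, so resolve it against the page URL.
		base = page.ResolveReference(ref)
	}

	link, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(link).String(), nil
}

func main() {
	page := "https://www.mmo-champion.com/content/" // URL after the redirect
	href := "external.php?do=rss"                    // hypothetical relative feed link

	without, _ := resolveLink(page, "", href)
	with, _ := resolveLink(page, "https://www.mmo-champion.com/", href)

	fmt.Println(without) // https://www.mmo-champion.com/content/external.php?do=rss
	fmt.Println(with)    // https://www.mmo-champion.com/external.php?do=rss
}
```

Ignoring the base element yields a URL under /content/, while honouring it yields a URL at the site root, which is the difference described above.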

@p3lim

p3lim commented Aug 5, 2020

They have since added a / prefix to the link, so it no longer depends on the <base> element and the site can't be used for testing. The issue should still be considered for other sites that may present the same structure.

@martinetd

If you need another example, megatokyo.com has the same issue for its comics: https://megatokyo.com/strip/1586 has a relative link to strips/1586.png, but there's a <base href="https://megatokyo.com/"> (feed URL: https://megatokyo.com/rss/megatokyo.xml).

Looking at the code a bit, it's not trivial to fix: sanitizer.Sanitize only takes a URL, so we could pass it the resolved base instead, but the processor doesn't have any straightforward way of getting the base URL back from the scraper. It might be worth sanitizing once in the scraper and re-sanitizing at the end for safety?
(FWIW, https://github.com/PuerkitoBio/gocrawl/pull/46/files#diff-a693fca73f07436af23c207f04d5a5b7L362 gives an example of respecting base url with goquery)
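
A rough sketch of why sanitizing twice should be harmless, using only `net/url`. The function `absolutize` is a hypothetical stand-in for the URL-rewriting part of a sanitizer; it is not the real sanitizer.Sanitize, whose signature is not reproduced here.

```go
package main

import (
	"fmt"
	"net/url"
)

// absolutize is a hypothetical stand-in for the URL-rewriting step of a
// sanitizer: it resolves href against baseURL and returns href unchanged
// if either value fails to parse.
func absolutize(baseURL, href string) string {
	base, err := url.Parse(baseURL)
	if err != nil {
		return href
	}
	ref, err := url.Parse(href)
	if err != nil {
		return href
	}
	return base.ResolveReference(ref).String()
}

func main() {
	entryURL := "https://megatokyo.com/strip/1586"
	baseHref := "https://megatokyo.com/" // value of the <base> element
	src := "strips/1586.png"

	// First pass (in the scraper, where <base> is known).
	first := absolutize(baseHref, src)
	// Second pass (at the end, where only the entry URL is known).
	second := absolutize(entryURL, first)
	// Current behaviour: a single pass that ignores <base>.
	naive := absolutize(entryURL, src)

	fmt.Println(first)  // https://megatokyo.com/strips/1586.png
	fmt.Println(second) // https://megatokyo.com/strips/1586.png
	fmt.Println(naive)  // https://megatokyo.com/strip/strips/1586.png
}
```

Once the first pass has produced an absolute URL, the second pass leaves it untouched, so the later pass would not need to know about the base at all.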
