Crawler ignoring `base` tag #757

p3lim · 2020-08-05T21:07:02Z

When crawling a website and finding RSS links, the <base> element should be taken into account, if present.

Example: https://www.mmo-champion.com/ redirects to the /content/ path, and the <link rel="alternate" ...> is a relative link, but the site also has <base href="https://www.mmo-champion.com/" /> which should be taken into account. Without this, attempting to add the feed will fail.

p3lim · 2020-08-05T21:13:32Z

They have since added a / prefix, so the link no longer depends on the current path and the site can't be used for testing. The issue should still be considered for other sites that may have the same structure.

martinetd · 2020-10-12T08:51:47Z

If you need another example, megatokyo.com has the same issue for its comics: https://megatokyo.com/strip/1586 has a relative link to strips/1586.png, but there's a `<base href="https://megatokyo.com/">` (feed URL: https://megatokyo.com/rss/megatokyo.xml).

Looking at the code a bit, this isn't trivial to fix. sanitizer.Sanitize only takes a URL, so we could pass it the resolved base instead, but the processor doesn't have any straightforward way of getting the base URL back from the scraper. It might be worth sanitizing once in the scraper and re-sanitizing at the end for safety?
(FWIW, https://github.com/PuerkitoBio/gocrawl/pull/46/files#diff-a693fca73f07436af23c207f04d5a5b7L362 gives an example of respecting base url with goquery)

fguillot added the improvements label Aug 16, 2020

fguillot mentioned this issue Feb 18, 2022

Keyboard Builders' Digest doesn't show images #1365

Closed

fguillot added wishlist and removed improvements labels Jan 15, 2024

fguillot linked a pull request Jul 26, 2024 that will close this issue

feat: add support for base element when discovering feeds #2763

Merged

fguillot closed this as completed in #2763 Jul 26, 2024
