When crawling a website and finding RSS links, the <base> element should be taken into account, if present.
Example: https://www.mmo-champion.com/ redirects to the /content/ path, and the <link rel="alternate" ...> is a relative link, but the site also has <base href="https://www.mmo-champion.com/" /> which should be taken into account. Without this, attempting to add the feed will fail.
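To make the expected resolution order concrete, here is a minimal sketch using only Go's standard net/url package. It is not the project's actual code: resolveFeedURL is a hypothetical helper, and the feed.xml href is a made-up stand-in for the site's real link. The idea is that the <base href> (itself resolved against the final page URL after redirects) replaces the page URL as the base for relative links.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveFeedURL is a hypothetical helper (not part of any existing package).
// pageURL is the final URL of the page after redirects, baseHref is the value
// of <base href="..."> (empty if the page has none), and linkHref is the href
// of the <link rel="alternate"> element.
func resolveFeedURL(pageURL, baseHref, linkHref string) (string, error) {
	page, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}

	// By default, relative links resolve against the page URL.
	base := page
	if baseHref != "" {
		ref, err := url.Parse(baseHref)
		if err != nil {
			return "", err
		}
		// The base href may itself be relative, so resolve it against the page.
		base = page.ResolveReference(ref)
	}

	link, err := url.Parse(linkHref)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(link).String(), nil
}

func main() {
	page := "https://www.mmo-champion.com/content/"

	// "feed.xml" is an assumed relative href, for illustration only.
	withoutBase, _ := resolveFeedURL(page, "", "feed.xml")
	withBase, _ := resolveFeedURL(page, "https://www.mmo-champion.com/", "feed.xml")

	fmt.Println(withoutBase) // https://www.mmo-champion.com/content/feed.xml
	fmt.Println(withBase)    // https://www.mmo-champion.com/feed.xml
}
```

Ignoring the base element yields the first URL, which points under /content/ and would not match the location the site intends.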
They have since added a / prefix to the link, so it's no longer relative to the current path and can't be used for testing, but the issue should still be considered for other sites that may have the same structure.
Looking at the code a bit, this is not trivial to fix. sanitizer.Sanitize only takes a URL, so we could pass it the resolved base instead, but the processor doesn't have any straightforward way of getting the base URL back from the scraper. It might be worth sanitizing once in the scraper and re-sanitizing at the end for safety?
(FWIW, https://github.com/PuerkitoBio/gocrawl/pull/46/files#diff-a693fca73f07436af23c207f04d5a5b7L362 gives an example of respecting base url with goquery)