MarleySpoon: add precautionary check for unexpected API URLs. #1069
Conversation
After considering the implications of a comment I wrote at #1064 (comment), I thought it'd be worth checking our existing scrapers for the potential that requests could be made to unexpected domains. cc @jknndy @hhursev @strangetom for review.
One more note: it should be possible to generalize this check so that it could apply to other scrapers too; I've attempted to write it in a way that would allow for that, although it's only written for […]. In addition, this problem can only currently affect legacy scrapers in the v14 branch of the codebase. That's not intended to be an argument to move to v15! We do lose functionality during that transition. But it helps to narrow the scrapers that should be checked for problems.
Am I correct in thinking that for marleyspoon, the API calls are always to […]?
@strangetom that does appear to be the case, yep. However, I'd be slightly reluctant to hard-code it, given that they've intentionally made it a variable in the page data. There are situations where doing that can allow for load-balancing / migrations / temporary maintenance by sending a portion of traffic to a different API endpoint, and it'd be nice to (safely) continue to respect that if we can.
Implementing Cross-Origin Resource Sharing adherence in the scraper could be another way to do this, in a more standards-compliant manner. I wasn't able to find any HTTP-client-side Python CORS libraries from a quick search (plenty of server-side ones), but perhaps there are some out there (or it might not be too onerous to implement basic support). |
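To make the idea above concrete, here's a minimal sketch of what honouring an `Access-Control-Allow-Origin` response header on the client side could look like. All function names here are hypothetical, and this covers only a small subset of CORS (real CORS also involves preflight requests, credential modes, and additional headers):

```python
from typing import Optional
from urllib.parse import urlsplit


def origin_of(url: str) -> str:
    """Return the origin (scheme://host[:port]) of a URL."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}".lower()


def is_origin_allowed(allow_origin_header: Optional[str], request_origin: str) -> bool:
    """Tiny subset of CORS: allow the wildcard "*" or an exact origin match."""
    if allow_origin_header is None:
        return False
    header = allow_origin_header.strip()
    if header == "*":
        return True
    return header.lower() == request_origin.lower()


# Example: decide whether a cross-origin API call from a scraped page
# should proceed, given the API's (hypothetical) response header value.
page_origin = origin_of("https://marleyspoon.com/recipes/123")
assert is_origin_allowed("*", page_origin)
assert is_origin_allowed("https://marleyspoon.com", page_origin)
assert not is_origin_allowed("https://evil.example", page_origin)
```

One wrinkle with this approach: the scraper would need to make the request first (or send a preflight) before it could inspect the header, so it enforces the server's policy rather than preventing the request outright.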
Alternatively, perhaps we could check whether the URL found in the JavaScript configuration corresponds to an entry in the scraper registry.
scraper_name = self.__class__.__name__
try:
    next_url = urljoin(self.url, api_url)
    host_name = get_host_name(next_url)
    next_scraper = type(None)
    # check: api.foo.xx.example, foo.xx.example, xx.example
    while host_name and host_name.count("."):
        next_scraper = SCRAPERS.get(host_name)
        if next_scraper:
            break
        _, host_name = host_name.split(".", 1)
    if not isinstance(self, next_scraper):
        msg = f"Attempted to scrape using {next_scraper} from {scraper_name}"
        raise ValueError(msg)
except Exception as e:
    raise RecipeScrapersExceptions(f"Unexpected API URL: {api_url}") from e
My attempt to translate this code into a natural-language description:
When scraping a website, ensure that any additional page requests are to hosts that belong to the set of domains supported by the scraper and its subclasses.
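The host-suffix walk described above can be sketched as a standalone function. The `SCRAPERS` mapping below is a hypothetical stand-in for the library's real registry of host names to scraper classes:

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the library's registry of host name -> scraper.
SCRAPERS = {
    "marleyspoon.com": "MarleySpoon",
    "dinnerly.com": "Dinnerly",
}


def find_scraper_for(url: str):
    """Walk the host name from most to least specific and return the first
    registry hit, e.g. api.foo.xx.example -> foo.xx.example -> xx.example."""
    host = urlsplit(url).hostname or ""
    while host and "." in host:
        scraper = SCRAPERS.get(host)
        if scraper is not None:
            return scraper
        _, host = host.split(".", 1)  # drop the leftmost label and retry
    return None  # no supported scraper for this host


assert find_scraper_for("https://api.marleyspoon.com/recipes/1") == "MarleySpoon"
assert find_scraper_for("https://evil.example/recipes/1") is None
```

Note that the walk matches any subdomain of a registered host, which is what allows `api.marleyspoon.com` to resolve to the `marleyspoon.com` scraper.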
I'd like to include this in the next […].
Adds a sense-checking step to ensure that the API URL returned in the MarleySpoon script elements refers to a second-level domain containing `marleyspoon`.

As far as I know, there's currently no standardized and machine-readable way to declare inter-related ownership of a group of distinct Internet domain names. I could be mistaken though; and if so, perhaps we could use that as a better alternative than checking for the brand name, as found in the scraper name.
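A rough sketch of the brand-name check, under the assumption that the last two labels of the host approximate the second-level domain (this is naive: multi-part public suffixes like `.co.uk` would need a public-suffix list to handle correctly):

```python
from urllib.parse import urlsplit


def second_level_domain(url: str) -> str:
    """Return the last two labels of the URL's host, e.g. 'marleyspoon.com'.
    Naive: hosts under multi-part suffixes like '.co.uk' are mishandled."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def brand_matches(url: str, brand: str) -> bool:
    """True if the brand name appears in the URL's second-level domain."""
    return brand.lower() in second_level_domain(url).lower()


assert brand_matches("https://api.marleyspoon.com/v1", "marleyspoon")
# Substring checks on the full host would be fooled by a lookalike
# subdomain; restricting to the second-level domain avoids that:
assert not brand_matches("https://marleyspoon.evil.example/v1", "marleyspoon")
```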