
MarleySpoon: add precautionary check for unexpected API URLs. #1069

Merged
merged 15 commits into from
May 6, 2024

Conversation

jayaddison
Collaborator

Adds a sense-checking step to ensure that the API URL returned in the MarleySpoon script elements refers to a second-level domain containing marleyspoon.

As far as I know, there's currently no standardized, machine-readable way to declare inter-related ownership of a group of distinct Internet domain names. I could be mistaken, though; if so, perhaps we could use that as a better alternative to checking for the brand name, as found in the scraper name.
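In the absence of such a standard, the brand-name check described above boils down to something like the following sketch (the helper name is illustrative, and the naive second-level-domain split deliberately ignores multi-label public suffixes such as .co.uk):

```python
from urllib.parse import urlsplit

def is_expected_api_host(api_url: str, brand: str = "marleyspoon") -> bool:
    """Return True if the URL's second-level domain label contains the brand.

    Naive: treats the label left of the final dot as the second-level
    domain, which is wrong for multi-label suffixes like .co.uk.
    """
    host = urlsplit(api_url).hostname or ""
    labels = host.split(".")
    if len(labels) < 2:
        return False
    return brand in labels[-2]

print(is_expected_api_host("https://api.marleyspoon.com/recipes/1"))   # True
print(is_expected_api_host("https://api.attacker.example/recipes/1"))  # False
```

A production version would want to consult the Public Suffix List rather than splitting on dots.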

@jayaddison
Collaborator Author

After considering the implications of a comment I wrote at #1064 (comment), I thought it'd be worth checking our existing scrapers for the possibility that requests could be made to unexpected domains. MarleySpoon appeared as a candidate, so this changeset adds a safety measure against requests to API hosts that seem unrelated.

cc @jknndy @hhursev @strangetom for review

@jayaddison
Collaborator Author

One more note: it should be possible to generalize this check so that it could apply to other scrapers too; I've attempted to write it in a way that would allow for that.

It's only written for MarleySpoon because that's the only place I found where I felt this problem could occur; in the other cases where we make multiple requests, as far as I can tell, we always use fully-qualified literal strings (with some allowance for templating) to define the request URL.

In addition, this problem currently affects only legacy scrapers in the v14 branch of the codebase. That's not intended as an argument to move to v15! We do lose functionality during that transition. But it helps narrow down which scrapers should be checked for problems.

@strangetom
Collaborator

Am I correct in thinking that for marleyspoon, the API calls are always to api.marleyspoon.com, even if the recipe URL is (for example) marleyspoon.de?

@jayaddison
Collaborator Author

> Am I correct in thinking that for marleyspoon, the API calls are always to api.marleyspoon.com, even if the recipe URL is (for example) marleyspoon.de?

@strangetom that does appear to be the case, yep. However, I'd be slightly reluctant to hard-code it, given that they've intentionally made it a variable in the page data. There are situations where doing that can allow for load-balancing / migrations / temporary maintenance by sending a portion of traffic to a different API endpoint, and it'd be nice to (safely) continue to respect that if we can.

@jayaddison
Collaborator Author

Implementing Cross-Origin Resource Sharing adherence in the scraper could be another way to do this, in a more standards-compliant manner. I wasn't able to find any HTTP-client-side Python CORS libraries from a quick search (plenty of server-side ones), but perhaps there are some out there (or it might not be too onerous to implement basic support).

@jayaddison
Collaborator Author

Alternatively, perhaps we could check whether the URL found in the JavaScript configuration corresponds to an entry in the SCRAPERS map for the same scraper instance?

Explained alternatively:

  • Input an html and org_url as usual.
  • Map org_url to a ScraperCls from SCRAPERS as usual.
  • If ScraperCls wants to make an additional HTTP request:
    • Store the request URL in next_url.
    • Map next_url to a NextScraperCls from SCRAPERS (as in step 2).
    • Is NextScraperCls the same scraper as ScraperCls?
      • If so, allow the HTTP request to proceed.
      • If not, reject the HTTP request.
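The steps above could be sketched roughly like this (SCRAPERS here is a hypothetical stand-in for the library's host-to-scraper mapping, using string values rather than classes for brevity):

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical stand-in for the library's SCRAPERS mapping.
SCRAPERS = {
    "marleyspoon.com": "MarleySpoon",
    "marleyspoon.de": "MarleySpoon",
}

def lookup_scraper(url: str):
    """Map a URL to a scraper by host, stripping subdomains until a match."""
    host = urlsplit(url).hostname or ""
    while host and "." in host:
        if host in SCRAPERS:
            return SCRAPERS[host]
        _, host = host.split(".", 1)  # drop the leftmost label and retry
    return None

def may_request(org_url: str, next_url: str) -> bool:
    """Allow the follow-up request only if both URLs map to the same scraper."""
    next_url = urljoin(org_url, next_url)  # resolve relative URLs
    scraper = lookup_scraper(org_url)
    return scraper is not None and scraper == lookup_scraper(next_url)

print(may_request("https://marleyspoon.de/recipe/1", "https://api.marleyspoon.com/v1"))   # True
print(may_request("https://marleyspoon.de/recipe/1", "https://api.attacker.example/v1"))  # False
```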

Comment on lines +71 to +86
scraper_name = self.__class__.__name__
try:
    next_url = urljoin(self.url, api_url)
    host_name = get_host_name(next_url)
    next_scraper = type(None)
    # check: api.foo.xx.example, foo.xx.example, xx.example
    while host_name and host_name.count("."):
        next_scraper = SCRAPERS.get(host_name)
        if next_scraper:
            break
        _, host_name = host_name.split(".", 1)
    if not isinstance(self, next_scraper):
        msg = f"Attempted to scrape using {next_scraper} from {scraper_name}"
        raise ValueError(msg)
except Exception as e:
    raise RecipeScrapersExceptions(f"Unexpected API URL: {api_url}") from e
Collaborator Author


My attempt to translate this code into a natural-language description:

When scraping a website, ensure that any additional page requests are to hosts that belong to the set of domains supported by the scraper and its subclasses.
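Concretely, the isinstance() check in the code above is what makes this subclass-aware. Toy classes (hypothetical names, not the library's actual hierarchy) illustrate why:

```python
# Toy classes showing that isinstance() accepts subclass scrapers too
# (hypothetical names, not the library's real class hierarchy).
class MarleySpoon:
    pass

class MarleySpoonDE(MarleySpoon):
    pass

scraper = MarleySpoonDE()
print(isinstance(scraper, MarleySpoon))  # True: subclass instances pass the check
```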

@jayaddison
Collaborator Author

I'd like to include this in the next v14 release, probably early next week, unless anyone has concerns about it. It would restrict the HTTP requests that the MarleySpoon scrapers could make, but only by limiting those to domains we've configured MarleySpoon scraper mappings for.

@jayaddison jayaddison merged commit 94b2617 into main May 6, 2024
18 checks passed
@jayaddison jayaddison deleted the precaution/validate-marleyspoon-api-domains branch May 6, 2024 10:08