Improve rezeptwelt.de recipe parsing #1295
This change improves the parser of recipes at rezeptwelt.de:
- detect ingredient groups
- support HTML layout for newer recipes, especially for instruction parsing
- add prep time
- add equipment entries
@@ -9,19 +25,69 @@ def host(cls):
         return "rezeptwelt.de"

     def site_name(self):
-        raise StaticValueException(return_value="Rezeptwelt")
+        return "Thermomix Rezeptwelt"
-        return "Thermomix Rezeptwelt"
+        raise StaticValueException(return_value="Thermomix Rezeptwelt")
I admit this is a slightly unusual pattern; we use it so that the library's interface can indicate whether a value was retrieved from the source HTML or is a static/constant value returned by the code.
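To illustrate the idea, here is a minimal sketch of how raising an exception can carry a constant value while flagging it as static. The exception class below is a stand-in defined only for this example, and `RezeptweltSketch` and `call_field` are hypothetical names; the library's real handling may differ.

```python
class StaticValueException(Exception):
    """Stand-in for illustration: signals a constant value, not one parsed from HTML."""

    def __init__(self, return_value):
        super().__init__(f"static value: {return_value!r}")
        self.return_value = return_value


class RezeptweltSketch:
    """Hypothetical scraper with a single static field."""

    def site_name(self):
        raise StaticValueException(return_value="Thermomix Rezeptwelt")


def call_field(scraper, name):
    """Hypothetical wrapper: return (value, is_static) for the named method."""
    try:
        return getattr(scraper, name)(), False
    except StaticValueException as exc:
        # The value still reaches the caller, but it is marked as a constant.
        return exc.return_value, True


value, is_static = call_field(RezeptweltSketch(), "site_name")
```

A plain `return` would deliver the same string but lose the information that it did not come from the page.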
    def prep_time(self):
        tag = self.soup.find(itemprop="performTime", content=nonempty)
        return get_minutes(tag['content']) if tag else None

    def equipment(self):
        return [tag['content'] for tag in self.soup.find_all("meta", itemprop="tool", content=nonempty)]
BeautifulSoup (bs4 / self.soup) allows non-empty content filtering by passing a boolean True value, so I think we can simplify these methods slightly:
-    def prep_time(self):
-        tag = self.soup.find(itemprop="performTime", content=nonempty)
-        return get_minutes(tag['content']) if tag else None
-    def equipment(self):
-        return [tag['content'] for tag in self.soup.find_all("meta", itemprop="tool", content=nonempty)]
+    def prep_time(self):
+        tag = self.soup.find(itemprop="performTime", content=True)
+        return get_minutes(tag['content']) if tag else None
+    def equipment(self):
+        return [tag['content'] for tag in self.soup.find_all("meta", itemprop="tool", content=True)]
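As a quick check of the suggested filter, the sketch below runs `content=True` against a small, made-up HTML fragment (the markup and values are illustrative, not taken from rezeptwelt.de). It only demonstrates that tags lacking a `content` attribute are skipped; how an empty `content=""` value should be treated is not covered here.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration only.
html = """
<meta itemprop="tool" content="Thermomix TM6">
<meta itemprop="tool">
<meta itemprop="performTime" content="PT20M">
"""
soup = BeautifulSoup(html, "html.parser")

# content=True keeps only tags that carry a content attribute.
tools = [
    tag["content"]
    for tag in soup.find_all("meta", itemprop="tool", content=True)
]

# find() returns the first matching tag, or None when nothing matches.
perform_tag = soup.find(itemprop="performTime", content=True)
perform_value = perform_tag["content"] if perform_tag else None
missing_tag = soup.find(itemprop="cookTime", content=True)
```

Because `find()` returns `None` when nothing matches, the `if tag else None` guard in the suggested methods remains necessary.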
        tag = self.soup.find("div", itemprop="author")
        if tag:
            return normalize_string(tag.get_text())
        tag = self.soup.find("span", {"id": "viewRecipeAuthor"})
        return normalize_string(tag.get_text())
Some observations here:
- The retrieval from an itemprop="author" attribute is essentially schema.org metadata retrieval; we have an existing helper method that implements that, so let's re-use it here.
- The information contained in the viewRecipeAuthor element seems more specific than the schema metadata, which is sometimes generic. So let's prefer viewRecipeAuthor when it is present.
What this leads me to when adapting the code locally is:
-        tag = self.soup.find("div", itemprop="author")
-        if tag:
-            return normalize_string(tag.get_text())
-        tag = self.soup.find("span", {"id": "viewRecipeAuthor"})
-        return normalize_string(tag.get_text())
+        name_from_schema = self.schema.author()
+        name_from_hyperlink = None
+        tag = self.soup.find("span", {"id": "viewRecipeAuthor"})
+        if tag:
+            name_from_hyperlink = tag.get_text()
+        return normalize_string(name_from_hyperlink or name_from_schema)
Note: the word von in some of the test data seems redundant, so we can remove it (these changes affect that test data).
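The preference order in the suggestion above can be sketched in isolation. `pick_author` is a hypothetical helper, `str.strip` stands in for the library's `normalize_string`, and returning `None` when both sources are missing is an assumption of this sketch rather than the behaviour of the suggested code.

```python
def pick_author(name_from_hyperlink, name_from_schema):
    """Prefer the page-specific author name; fall back to schema.org metadata."""
    # `or` also falls back when the hyperlink text is an empty string.
    chosen = name_from_hyperlink or name_from_schema
    # str.strip is a simplified stand-in for normalize_string (assumption).
    return chosen.strip() if chosen else None


specific = pick_author(" Anna ", "Rezeptwelt")
fallback = pick_author(None, "Rezeptwelt")
empty_link = pick_author("", "Rezeptwelt")
nothing = pick_author(None, None)
```

Using `or` keeps the fallback to one expression, at the cost of treating an empty hyperlink text the same as a missing element.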