Improve rezeptwelt.de recipe parsing #1295

wummel · 2024-10-16T16:27:47Z

This change improves the parser of recipes at rezeptwelt.de:

- detect ingredient groups
- support HTML layout for newer recipes, especially for instruction parsing
- add prep time
- add equipment entries


jayaddison · 2024-10-17T10:51:23Z

recipe_scrapers/rezeptwelt.py

@@ -9,19 +25,69 @@ def host(cls):
        return "rezeptwelt.de"

    def site_name(self):
-        raise StaticValueException(return_value="Rezeptwelt")
+        return "Thermomix Rezeptwelt"


Suggested change

-        return "Thermomix Rezeptwelt"
+        raise StaticValueException(return_value="Thermomix Rezeptwelt")

I admit this pattern is slightly unusual. We raise the exception so that the library's interface can indicate whether a value was retrieved from the source HTML or is a static/constant value returned by the code.
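To illustrate the idea only, here is a minimal sketch of how raising an exception can carry a constant value while flagging it as static. It is not the library's actual implementation: the `StaticValueException` class below is a simplified stand-in defined locally, and `call_field` and `RezeptweltSketch` are hypothetical names invented for this example.

```python
class StaticValueException(Exception):
    """Simplified stand-in: carries a hard-coded value out of a scraper method."""

    def __init__(self, return_value):
        super().__init__(f"static value: {return_value!r}")
        self.return_value = return_value


def call_field(method):
    """Hypothetical wrapper: call a scraper method and report (value, is_static)."""
    try:
        # A normal return means the value was parsed from the page.
        return method(), False
    except StaticValueException as exc:
        # The method signalled that its value is a constant in the code.
        return exc.return_value, True


class RezeptweltSketch:
    """Hypothetical scraper with one static field and one parsed field."""

    def site_name(self):
        raise StaticValueException(return_value="Thermomix Rezeptwelt")

    def title(self):
        # Pretend this string was extracted from the source HTML.
        return "Apfelkuchen"


scraper = RezeptweltSketch()
site_name, site_name_is_static = call_field(scraper.site_name)
title, title_is_static = call_field(scraper.title)
```

With a plain `return`, the caller would have no way to tell the two cases apart; the exception makes the distinction visible without changing the method's signature.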

jayaddison · 2024-10-17T10:55:38Z

recipe_scrapers/rezeptwelt.py

+    def prep_time(self):
+        tag = self.soup.find(itemprop="performTime", content=nonempty)
+        return get_minutes(tag['content']) if tag else None
+
+    def equipment(self):
+        return [tag['content'] for tag in self.soup.find_all("meta", itemprop="tool", content=nonempty)]


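BeautifulSoup (bs4, available here as self.soup) supports filtering on the presence of an attribute by passing the boolean value True (this matches any tag that has the attribute, whatever its value), so I think we can simplify these methods slightly: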

Suggested change

-    def prep_time(self):
-        tag = self.soup.find(itemprop="performTime", content=nonempty)
-        return get_minutes(tag['content']) if tag else None
-
-    def equipment(self):
-        return [tag['content'] for tag in self.soup.find_all("meta", itemprop="tool", content=nonempty)]
+    def prep_time(self):
+        tag = self.soup.find(itemprop="performTime", content=True)
+        return get_minutes(tag['content']) if tag else None
+
+    def equipment(self):
+        return [tag['content'] for tag in self.soup.find_all("meta", itemprop="tool", content=True)]

jayaddison · 2024-10-17T11:09:16Z

recipe_scrapers/rezeptwelt.py

+        tag = self.soup.find("div", itemprop="author")
+        if tag:
+            return normalize_string(tag.get_text())
+        tag = self.soup.find("span", {"id": "viewRecipeAuthor"})
+        return normalize_string(tag.get_text())


Some observations here:

Retrieving the value from an itemprop="author" attribute is essentially schema.org metadata retrieval; we have an existing helper method for that, so let's reuse it here.

The information in the viewRecipeAuthor element seems more specific than the schema metadata, which is sometimes generic, so let's prefer viewRecipeAuthor when it is present.

Adapting the code locally along those lines, I arrive at:

Suggested change

-        tag = self.soup.find("div", itemprop="author")
-        if tag:
-            return normalize_string(tag.get_text())
-        tag = self.soup.find("span", {"id": "viewRecipeAuthor"})
-        return normalize_string(tag.get_text())
+        name_from_schema = self.schema.author()
+        name_from_hyperlink = None
+        tag = self.soup.find("span", {"id": "viewRecipeAuthor"})
+        if tag:
+            name_from_hyperlink = tag.get_text()
+        return normalize_string(name_from_hyperlink or name_from_schema)

Note: the word "von" in some of the test data seems redundant, so we can remove it (these changes affect that value).
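The preference order described above (page-specific author first, schema metadata as fallback) can be sketched in isolation. This is an assumption-laden illustration rather than the scraper's code: `pick_author` is a hypothetical function, `normalize_whitespace` is a local stand-in for the library's `normalize_string` helper, and the guard for the case where both sources are missing is an addition of this sketch.

```python
from typing import Optional


def normalize_whitespace(text: str) -> str:
    """Local stand-in for normalize_string: trim and collapse whitespace runs."""
    return " ".join(text.split())


def pick_author(
    name_from_hyperlink: Optional[str],
    name_from_schema: Optional[str],
) -> Optional[str]:
    """Prefer the page-specific author; fall back to the schema.org author."""
    # `or` skips both None and the empty string, so a tag that is present
    # but empty still falls through to the schema value.
    chosen = name_from_hyperlink or name_from_schema
    # Guard added in this sketch: return None when neither source has a name.
    return normalize_whitespace(chosen) if chosen else None


specific = pick_author("  thermo_fan \n", "Rezeptwelt")
fallback = pick_author(None, "Rezeptwelt")
```

Using `or` keeps the fallback to a single expression, at the cost of treating an empty string the same as a missing element.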

Improve rezeptwelt.de recipe parsing

8d26cd6
