Merge branch 'hhursev:main' into grouping-recipetineats
jknndy authored Nov 13, 2024
2 parents a596d7c + c30cbba commit ffb20b0
Showing 68 changed files with 41,881 additions and 9,183 deletions.
6 changes: 4 additions & 2 deletions .github/workflows/linters.yaml
@@ -19,5 +19,7 @@ jobs:
         uses: actions/setup-python@v4
         with:
           python-version: "3.x"
-      - run: pip install tox
-      - run: tox -e lint
+          cache: pip
+          cache-dependency-path: .pre-commit-config.yaml
+      - run: pip install pre-commit
+      - run: pre-commit run --all-files
4 changes: 3 additions & 1 deletion .github/workflows/unittests.yaml
@@ -27,7 +27,9 @@ jobs:
           cache: pip
       - name: Install dependencies
         run: python -m pip install .[dev]
+      - name: Install parallel test runner
+        run: python -m pip install unittest-parallel
       - name: Run Tests
         env:
           PYTHONWARNINGS: "always:::recipe_scrapers,ignore:::recipe_scrapers.plugins.static_values"
-        run: python -m unittest
+        run: unittest-parallel --level test
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
 The MIT License (MIT)
 
-Copyright (c) 2015 Hristo Harsev
+Copyright (c) 2015 The recipe-scrapers contributors
 
 Permission is hereby granted, free of charge, to any person obtaining a copy of
 this software and associated documentation files (the "Software"), to deal in
16 changes: 15 additions & 1 deletion README.rst
@@ -88,6 +88,11 @@ Scrapers available for:
- `https://aflavorjournal.com/ <https://aflavorjournal.com/>`_
- `https://ah.nl/ <https://ah.nl/>`_
- `https://akispetretzikis.com/ <https://akispetretzikis.com/>`_
- `https://aldi-nord.de/ <https://aldi-nord.de/>`_
- `.es <https://aldi.es/>`__, `.fr <https://aldi.fr/>`__, `.lu <https://aldi.lu/>`__, `.nl <https://aldi.nl/>`__, `.pl <https://aldi.pl/>`__, `.pt <https://aldi.pt/>`__
- `https://aldi-sued.de/ <https://aldi-sued.de/>`_
- `.hu <https://aldi.hu/>`__, `.it <https://aldi.it/>`__
- `https://aldi-suisse.ch <https://aldi-suisse.ch/>`_
- `https://aldi.com.au/ <https://aldi.com.au/>`_
- `https://alexandracooks.com/ <https://alexandracooks.com/>`_
- `https://alittlebityummy.com/ <https://alittlebityummy.com/>`_
@@ -231,6 +236,8 @@ Scrapers available for:
- `https://hellofresh.com/ <https://hellofresh.com>`_
- `.at <https://www.hellofresh.at/>`__, `.be <https://www.hellofresh.be/>`__, `.ca <https://www.hellofresh.ca/>`__, `.ch <https://www.hellofresh.ch/>`__, `.co.nz <https://www.hellofresh.co.nz/>`__, `.co.uk <https://hellofresh.co.uk>`__, `.com.au <https://www.hellofresh.com.au/>`__, `.de <https://www.hellofresh.de/>`__, `.dk <https://www.hellofresh.dk/>`__, `.es <https://www.hellofresh.es/>`__, `.fr <https://www.hellofresh.fr/>`__, `.ie <https://www.hellofresh.ie/>`__, `.it <https://www.hellofresh.it/>`__, `.lu <https://www.hellofresh.lu/>`__, `.nl <https://www.hellofresh.nl/>`__, `.no <https://www.hellofresh.no/>`__, `.se <https://www.hellofresh.se/>`__
- `https://www.hersheyland.com/ <https://www.hersheyland.com/>`_
- `https://hofer.at/ <https://hofer.at/>`_
- `.si <https://hofer.si/>`__
- `https://www.homechef.com/ <https://www.homechef.com/>`_
- `https://hostthetoast.com/ <https://hostthetoast.com/>`_
- `https://hungryhappens.net/ <https://hungryhappens.net/>`_
@@ -346,6 +353,7 @@ Scrapers available for:
- `https://przepisy.pl/ <https://przepisy.pl>`_
- `https://purelypope.com/ <https://purelypope.com>`_
- `https://purplecarrot.com/ <https://purplecarrot.com>`_
- `https://quitoque.fr/ <https://quitoque.fr>`_
- `https://rachlmansfield.com/ <https://rachlmansfield.com>`_
- `https://rainbowplantlife.com/ <https://rainbowplantlife.com/>`_
- `https://realfood.tesco.com/ <https://realfood.tesco.com>`_
@@ -500,7 +508,7 @@ Assuming you have ``>=python3.9`` installed, navigate to the directory where you
     python -m venv .venv &&
     source .venv/bin/activate &&
     python -m pip install --upgrade pip &&
-    pip install -e .[dev] &&
+    pip install -e ".[dev]" &&
     pip install pre-commit &&
     pre-commit install &&
     python -m unittest
@@ -574,6 +582,12 @@ All the `contributors that helped improving <https://github.com/hhursev/recipe-s
:target: https://github.com/hhursev/recipe-scrapers/graphs/contributors


Test Data Notice
----------------

All content in ``tests/test_data/`` is used for limited, non-commercial testing purposes and belongs to its respective copyright holders. See ``tests/test_data/LICENSE.md`` for details. If you're a copyright holder with concerns, you can open an issue or contact us privately via the email address listed on our PyPI page.


Extra:
------
| You want to gather recipes data?
4 changes: 4 additions & 0 deletions docs/in-depth-guide-html-scraping.md
@@ -10,6 +10,10 @@ The [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup

This guide covers a number of common patterns that are used in this library.

## `_schema_cls` and `_opengraph_cls`

It should rarely be necessary to override the default behaviour of schema.org and OpenGraph metadata retrieval; recipe websites should generally adhere to the respective standard formats when including metadata on their webpages. However, bugs and mistakes do happen: if you need to override the implementations provided by the `SchemaOrg` and `OpenGraph` classes, you can subclass them and add a `_schema_cls` or `_opengraph_cls` attribute to your scraper class to instruct the library to use your subclasses instead.
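The override hook described above can be sketched with minimal stand-ins. Note that the `SchemaOrg` and `AbstractScraper` classes below are simplified illustrations of the mechanism, not the real recipe_scrapers implementations:

```python
# Minimal stand-in for the library's schema.org metadata parser.
class SchemaOrg:
    def __init__(self, page_data):
        self.page_data = page_data

    def title(self):
        return "raw title"  # pretend the site's metadata needs fixing


# Subclass that patches the hypothetical site-specific metadata bug.
class FixedSchemaOrg(SchemaOrg):
    def title(self):
        return super().title().strip().title()


# Minimal stand-in for AbstractScraper: it instantiates whichever
# class the `_schema_cls` attribute points at.
class AbstractScraper:
    _schema_cls = SchemaOrg  # default behaviour

    def __init__(self, html):
        self.schema = self._schema_cls(html)


class MyScraper(AbstractScraper):
    _schema_cls = FixedSchemaOrg  # opt in to the patched parser


scraper = MyScraper("<html>...</html>")
print(scraper.schema.title())  # Raw Title
```

The point is only that setting the class attribute swaps the parser without touching `__init__`; the real classes take different constructor arguments.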

## Finding a single element

The `self.soup.find()` function returns the first element matching the arguments. This is useful if you are trying to extract some information that should only occur once, for example the prep time or total time.
4 changes: 4 additions & 0 deletions pyproject.toml
@@ -7,6 +7,10 @@ name = "recipe_scrapers"
description = "Python package, scraping recipes from all over the internet"
authors = [
{name = "Hristo Harsev", email = "[email protected]"},
{name = "James Addison", email = "[email protected]"},
]
maintainers = [
{name = "James Addison", email = "[email protected]"},
]
urls = {Homepage = "https://github.com/hhursev/recipe-scrapers/"}
keywords = ["python", "recipes", "scraper", "harvest", "recipe-scraper", "recipe-scrapers"]
19 changes: 19 additions & 0 deletions recipe_scrapers/__init__.py
@@ -44,6 +44,9 @@
from .akispetretzikis import AkisPetretzikis
from .albertheijn import AlbertHeijn
from .aldi import Aldi
from .aldinord import AldiNord
from .aldisued import AldiSued
from .aldisuisse import AldiSuisse
from .alexandracooks import AlexandraCooks
from .alittlebityummy import ALittleBitYummy
from .allrecipes import AllRecipes
@@ -181,6 +184,7 @@
from .heb import HEB
from .hellofresh import HelloFresh
from .hersheyland import HersheyLand
from .hofer import Hofer
from .homechef import HomeChef
from .hostthetoast import Hostthetoast
from .hungryhappens import HungryHappens
@@ -300,6 +304,7 @@
from .przepisy import Przepisy
from .purelypope import PurelyPope
from .purplecarrot import PurpleCarrot
from .quitoque import QuiToque
from .rachlmansfield import RachlMansfield
from .rainbowplantlife import RainbowPlantLife
from .realfoodtesco import RealFoodTesco
@@ -422,6 +427,17 @@
AkisPetretzikis.host(): AkisPetretzikis,
AlbertHeijn.host(): AlbertHeijn,
Aldi.host(): Aldi,
AldiNord.host(): AldiNord,
AldiNord.host(domain="aldi.es"): AldiNord,
AldiNord.host(domain="aldi.fr"): AldiNord,
AldiNord.host(domain="aldi.lu"): AldiNord,
AldiNord.host(domain="aldi.nl"): AldiNord,
AldiNord.host(domain="aldi.pl"): AldiNord,
AldiNord.host(domain="aldi.pt"): AldiNord,
AldiSued.host(): AldiSued,
AldiSued.host(domain="aldi.hu"): AldiSued,
AldiSued.host(domain="aldi.it"): AldiSued,
AldiSuisse.host(): AldiSuisse,
AlexandraCooks.host(): AlexandraCooks,
AllRecipes.host(): AllRecipes,
AllTheHealthyThings.host(): AllTheHealthyThings,
@@ -542,6 +558,7 @@
PeelWithZeal.host(): PeelWithZeal,
PinchOfYum.host(): PinchOfYum,
PotatoRolls.host(): PotatoRolls,
QuiToque.host(): QuiToque,
Recept.host(): Recept,
ReceptyPreVas.host(): ReceptyPreVas,
RecipeGirl.host(): RecipeGirl,
@@ -625,6 +642,8 @@
HelloFresh.host(domain="no"): HelloFresh,
HelloFresh.host(domain="se"): HelloFresh,
HersheyLand.host(): HersheyLand,
Hofer.host(): Hofer,
Hofer.host(domain="hofer.si"): Hofer,
HomeChef.host(): HomeChef,
Hostthetoast.host(): Hostthetoast,
Ica.host(): Ica,
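The multi-domain registrations above follow a host-keyed registry pattern: one scraper class serves several regional domains, each registered under its own key. A simplified stand-in (the real `SCRAPERS` mapping and `AldiNord` class live in recipe_scrapers; the recipe URL is made up):

```python
from urllib.parse import urlparse


# Stand-in scraper class: host() returns the default domain unless
# a regional variant is requested.
class AldiNord:
    @classmethod
    def host(cls, domain="aldi-nord.de"):
        return domain


# One class, many hosts: each regional domain gets its own key.
SCRAPERS = {
    AldiNord.host(): AldiNord,
    AldiNord.host(domain="aldi.fr"): AldiNord,
    AldiNord.host(domain="aldi.nl"): AldiNord,
}

url = "https://aldi.fr/recettes/tarte-aux-pommes"  # illustrative URL
scraper_cls = SCRAPERS[urlparse(url).netloc]
print(scraper_cls.__name__)  # AldiNord
```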
7 changes: 5 additions & 2 deletions recipe_scrapers/_abstract.py
@@ -21,12 +21,15 @@
 class AbstractScraper:
     page_data: str
 
+    _opengraph_cls = OpenGraph
+    _schema_cls = SchemaOrg
+
     def __init__(self, html: str, url: str):
         self.page_data = html
         self.url = url
         self.soup = BeautifulSoup(self.page_data, "html.parser")
-        self.opengraph = OpenGraph(self.soup)
-        self.schema = SchemaOrg(self.page_data)
+        self.opengraph = self._opengraph_cls(self.soup)
+        self.schema = self._schema_cls(self.page_data)
 
         # attach the plugins as instructed in settings.PLUGINS
         if not hasattr(self.__class__, "plugins_initialized"):
28 changes: 7 additions & 21 deletions recipe_scrapers/albertheijn.py
@@ -1,32 +1,18 @@
-import re
-
 from ._abstract import AbstractScraper
 from ._exceptions import StaticValueException
-from ._utils import normalize_string
 
 
 class AlbertHeijn(AbstractScraper):
     @classmethod
     def host(cls):
         return "ah.nl"
 
     def site_name(self):
         raise StaticValueException(return_value="Albert Heijn")
 
     def instructions(self):
-        instructions = [
-            normalize_string(step.get_text())
-            # get steps root
-            for root in self.soup.findAll(
-                "div",
-                {"class", re.compile("recipe-preparation-steps_root.*")},
-            )
-            # get steps
-            for step in root.findAll("p")
-        ]
+        instructions = self.schema.instructions()
 
-        if instructions:
-            return "\n".join(instructions)
+        filtered_instructions = [
+            line
+            for line in instructions.split("\n")
+            if not line.lower().startswith("stap")
+        ]
 
-        # try schema.org
-        return self.schema.instructions()
+        return "\n".join(filtered_instructions)
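The new `AlbertHeijn.instructions` reads the schema.org data and then drops the Dutch "Stap N" heading lines it interleaves with the real steps. The filtering step can be run standalone (the sample text is illustrative, not taken from the site):

```python
# schema.org instructions with interleaved "Stap N" headings.
instructions = "Stap 1\nSnijd de ui fijn.\nStap 2\nBak de ui goudbruin."

# Drop any line that starts with "stap", case-insensitively.
filtered_instructions = [
    line
    for line in instructions.split("\n")
    if not line.lower().startswith("stap")
]

print("\n".join(filtered_instructions))
```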
24 changes: 24 additions & 0 deletions recipe_scrapers/aldinord.py
@@ -0,0 +1,24 @@
from ._abstract import AbstractScraper
from ._exceptions import StaticValueException


class AldiNord(AbstractScraper):
    @classmethod
    def host(cls, domain: str = "aldi-nord.de"):
        return domain

    def author(self):
        if author_from_schema := self.schema.author():
            return author_from_schema

        raise StaticValueException(return_value="ALDI")

    def site_name(self):
        raise StaticValueException(return_value="ALDI")

    def instructions(self):
        return (
            self.schema.data.get("recipeInstructions", "")
            .replace("\xa0", " ")
            .replace("\r\n ", "\n")
        )
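The cleanup chain in `AldiNord.instructions` can be seen in isolation: on this site the `recipeInstructions` value arrives as one string, so the scraper strips non-breaking spaces and normalizes Windows-style line endings. The sample input below is illustrative:

```python
# Raw schema.org recipeInstructions string with a non-breaking space
# (\xa0) and a Windows-style line ending, as handled by the scraper above.
raw = "Zwiebeln schneiden.\xa0\r\n Zwiebeln anbraten."

cleaned = raw.replace("\xa0", " ").replace("\r\n ", "\n")
print(cleaned)
```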
17 changes: 17 additions & 0 deletions recipe_scrapers/aldisued.py
@@ -0,0 +1,17 @@
from ._abstract import AbstractScraper


class AldiSued(AbstractScraper):
    @classmethod
    def host(cls, domain="aldi-sued.de"):
        return domain

    def instructions(self):
        instruction_elements = self.schema.data.get("recipeInstructions", [])
        return "\n".join(
            [
                element.get("text").replace("\xad", "")
                for element in instruction_elements
                if element.get("text")
            ]
        )
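`AldiSued.instructions` expects a list of HowToStep-like dicts instead of a single string; it joins their "text" values, skips steps without text, and strips soft hyphens (U+00AD, which sometimes appear inside German words). Exercised here with made-up data:

```python
# Made-up schema.org-style step list; the second entry has no "text"
# key and is skipped, and soft hyphens (\xad) are stripped.
instruction_elements = [
    {"@type": "HowToStep", "text": "Kartoffeln sch\xadälen."},
    {"@type": "HowToStep"},
    {"@type": "HowToStep", "text": "In Salzwasser kochen."},
]

instructions = "\n".join(
    [
        element.get("text").replace("\xad", "")
        for element in instruction_elements
        if element.get("text")
    ]
)
print(instructions)
```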
7 changes: 7 additions & 0 deletions recipe_scrapers/aldisuisse.py
@@ -0,0 +1,7 @@
from .aldisued import AldiSued


class AldiSuisse(AldiSued):
    @classmethod
    def host(cls, domain="aldi-suisse.ch"):
        return domain
7 changes: 7 additions & 0 deletions recipe_scrapers/hofer.py
@@ -0,0 +1,7 @@
from .aldisued import AldiSued


class Hofer(AldiSued):
    @classmethod
    def host(cls, domain="hofer.at"):
        return domain
18 changes: 8 additions & 10 deletions recipe_scrapers/mccormick.py
@@ -1,6 +1,5 @@
 from ._abstract import AbstractScraper
 from ._grouping_utils import group_ingredients
-from ._utils import normalize_string
 
 
 class McCormick(AbstractScraper):
@@ -17,13 +16,12 @@ def ingredient_groups(self):
         )
 
     def instructions(self):
-        instructions_list = self.soup.findAll(
-            "li", {"id": lambda x: x and x.startswith("step")}
-        )
+        instructions = self.schema.instructions()
 
-        return "\n".join(
-            [
-                normalize_string(instruction.find("span", {"class": "para"}).get_text())
-                for instruction in instructions_list
-            ]
-        )
+        filtered_instructions = [
+            line
+            for line in instructions.split("\n")
+            if not line.lower().startswith("step")
+        ]
+
+        return "\n".join(filtered_instructions)
90 changes: 90 additions & 0 deletions recipe_scrapers/quitoque.py
@@ -0,0 +1,90 @@
from ._abstract import AbstractScraper
from ._utils import csv_to_tags, get_minutes, get_yields, normalize_string


class QuiToque(AbstractScraper):
    @classmethod
    def host(cls):
        return "quitoque.fr"

    @staticmethod
    def _get_text(element):
        if element:
            return normalize_string(element.get_text())
        else:
            return None

    def _get_time(self, time_name):
        times = self.soup.select("div.recipe-infos-short .item-info")
        total_time = None
        for time in times:
            if time_name in time.get_text():
                total_time = self._get_text(time).replace(time_name, "")
        return get_minutes(total_time)

    def _get_nutrient(self, nutrient_name):
        nutrient_element = self._nutrients.find("p", string=nutrient_name).parent
        return self._get_text(nutrient_element.find("p", class_="regular"))

    def canonical_url(self):
        return self.soup.find("meta", {"property": "og:url"}).get("content")

    def author(self):
        return "QuiToque"

    def title(self):
        return self._get_text(self.soup.find("h1", class_="title-2"))

    def keywords(self):
        product_tags = self.soup.find(id="product-tags").find_all(class_="badge")
        keywords = ",".join(self._get_text(tag) for tag in product_tags)
        return csv_to_tags(keywords)

    def category(self):
        category = self.soup.find(class_="primary-ghost")
        return self._get_text(category)

    def total_time(self):
        return self._get_time("Total")

    def prep_time(self):
        return self._get_time("En cuisine")

    def yields(self):
        serving = self.soup.find(id="ingredients").find("p", class_="body-2")
        return get_yields(serving)

    def image(self):
        img_element = self.soup.find(class_="image").find("img")
        return img_element["src"]

    def ingredients(self):
        ingredients = []
        ingredients.extend(self.soup.select("#ingredients .ingredient-list li"))
        ingredients.extend(self.soup.select(".kitchen-list li"))
        return [self._get_text(ingredient) for ingredient in ingredients]

    def equipment(self):
        equipments = self.soup.select("#equipment .ingredient-list li")
        return [self._get_text(equipment) for equipment in equipments]

    def instructions(self):
        instructions = self.soup.select("#preparation-steps li")
        return "\n".join([self._get_text(instruction) for instruction in instructions])

    def description(self):
        description = self.soup.find("div", class_="container body-2 regular mt-2 mb-4")
        return self._get_text(description)

    def nutrients(self):
        self._nutrients = self.soup.find(id="portion")
        nutrients = {
            "calories": self._get_nutrient("Énergie (kCal)"),
            "fatContent": self._get_nutrient("Matières grasses"),
            "saturatedFatContent": self._get_nutrient("dont acides gras saturés"),
            "carbohydrateContent": self._get_nutrient("Glucides"),
            "sugarContent": self._get_nutrient("dont sucre"),
            "fiberContent": self._get_nutrient("Fibres"),
            "proteinContent": self._get_nutrient("Protéines"),
        }
        return nutrients
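`QuiToque._get_time` matches a French label ("Total", "En cuisine") inside short info items and parses the remainder into minutes. The pattern can be sketched without BeautifulSoup; note that `get_minutes` below is a simplified stand-in for `recipe_scrapers._utils.get_minutes`, and the item strings are made up:

```python
import re


def get_minutes(text):
    """Simplified stand-in: parse '1 h 10 min' or '35 min' into minutes."""
    hours = re.search(r"(\d+)\s*h", text)
    mins = re.search(r"(\d+)\s*min", text)
    return (int(hours.group(1)) * 60 if hours else 0) + (
        int(mins.group(1)) if mins else 0
    )


items = ["Total 35 min", "En cuisine 15 min", "4 personnes"]  # made-up data


def get_time(label):
    # Find the info item carrying the label, strip the label,
    # then parse what is left as a duration.
    for item in items:
        if label in item:
            return get_minutes(item.replace(label, ""))
    return None


print(get_time("Total"), get_time("En cuisine"))  # 35 15
```

Unlike this sketch, the real helper keeps scanning and uses the last matching item, so the simplification only holds when labels are unique.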