Merge branch 'hhursev:main' into grouping-recipetineats
jknndy authored Nov 13, 2024
2 parents a596d7c + c30cbba commit ffb20b0
Showing 68 changed files with 41,881 additions and 9,183 deletions.
6 changes: 4 additions & 2 deletions .github/workflows/linters.yaml
@@ -19,5 +19,7 @@ jobs:
         uses: actions/setup-python@v4
         with:
           python-version: "3.x"
-      - run: pip install tox
-      - run: tox -e lint
+          cache: pip
+          cache-dependency-path: .pre-commit-config.yaml
+      - run: pip install pre-commit
+      - run: pre-commit run --all-files
4 changes: 3 additions & 1 deletion .github/workflows/unittests.yaml
@@ -27,7 +27,9 @@ jobs:
           cache: pip
       - name: Install dependencies
         run: python -m pip install .[dev]
+      - name: Install parallel test runner
+        run: python -m pip install unittest-parallel
       - name: Run Tests
         env:
           PYTHONWARNINGS: "always:::recipe_scrapers,ignore:::recipe_scrapers.plugins.static_values"
-        run: python -m unittest
+        run: unittest-parallel --level test
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
 The MIT License (MIT)
 
-Copyright (c) 2015 Hristo Harsev
+Copyright (c) 2015 The recipe-scrapers contributors
 
 Permission is hereby granted, free of charge, to any person obtaining a copy of
 this software and associated documentation files (the "Software"), to deal in
16 changes: 15 additions & 1 deletion README.rst
@@ -88,6 +88,11 @@ Scrapers available for:
- `https://aflavorjournal.com/ <https://aflavorjournal.com/>`_
- `https://ah.nl/ <https://ah.nl/>`_
- `https://akispetretzikis.com/ <https://akispetretzikis.com/>`_
- `https://aldi-nord.de/ <https://aldi-nord.de/>`_
- `.es <https://aldi.es/>`__, `.fr <https://aldi.fr/>`__, `.lu <https://aldi.lu/>`__, `.nl <https://aldi.nl/>`__, `.pl <https://aldi.pl/>`__, `.pt <https://aldi.pt/>`__
- `https://aldi-sued.de/ <https://aldi-sued.de/>`_
- `.hu <https://aldi.hu/>`__, `.it <https://aldi.it/>`__
- `https://aldi-suisse.ch <https://aldi-suisse.ch/>`_
- `https://aldi.com.au/ <https://aldi.com.au/>`_
- `https://alexandracooks.com/ <https://alexandracooks.com/>`_
- `https://alittlebityummy.com/ <https://alittlebityummy.com/>`_
@@ -231,6 +236,8 @@ Scrapers available for:
- `https://hellofresh.com/ <https://hellofresh.com>`_
- `.at <https://www.hellofresh.at/>`__, `.be <https://www.hellofresh.be/>`__, `.ca <https://www.hellofresh.ca/>`__, `.ch <https://www.hellofresh.ch/>`__, `.co.nz <https://www.hellofresh.co.nz/>`__, `.co.uk <https://hellofresh.co.uk>`__, `.com.au <https://www.hellofresh.com.au/>`__, `.de <https://www.hellofresh.de/>`__, `.dk <https://www.hellofresh.dk/>`__, `.es <https://www.hellofresh.es/>`__, `.fr <https://www.hellofresh.fr/>`__, `.ie <https://www.hellofresh.ie/>`__, `.it <https://www.hellofresh.it/>`__, `.lu <https://www.hellofresh.lu/>`__, `.nl <https://www.hellofresh.nl/>`__, `.no <https://www.hellofresh.no/>`__, `.se <https://www.hellofresh.se/>`__
- `https://www.hersheyland.com/ <https://www.hersheyland.com/>`_
- `https://hofer.at/ <https://hofer.at/>`_
- `.si <https://hofer.si/>`__
- `https://www.homechef.com/ <https://www.homechef.com/>`_
- `https://hostthetoast.com/ <https://hostthetoast.com/>`_
- `https://hungryhappens.net/ <https://hungryhappens.net/>`_
@@ -346,6 +353,7 @@ Scrapers available for:
- `https://przepisy.pl/ <https://przepisy.pl>`_
- `https://purelypope.com/ <https://purelypope.com>`_
- `https://purplecarrot.com/ <https://purplecarrot.com>`_
- `https://quitoque.fr/ <https://quitoque.fr>`_
- `https://rachlmansfield.com/ <https://rachlmansfield.com>`_
- `https://rainbowplantlife.com/ <https://rainbowplantlife.com/>`_
- `https://realfood.tesco.com/ <https://realfood.tesco.com>`_
@@ -500,7 +508,7 @@ Assuming you have ``>=python3.9`` installed, navigate to the directory where you
     python -m venv .venv &&
     source .venv/bin/activate &&
     python -m pip install --upgrade pip &&
-    pip install -e .[dev] &&
+    pip install -e ".[dev]" &&
     pip install pre-commit &&
     pre-commit install &&
     python -m unittest
@@ -574,6 +582,12 @@ All the `contributors that helped improving <https://github.com/hhursev/recipe-s
:target: https://github.com/hhursev/recipe-scrapers/graphs/contributors


Test Data Notice
----------------

All content in ``tests/test_data/`` is used for limited, non-commercial testing purposes and belongs to its respective copyright holders. See ``tests/test_data/LICENSE.md`` for details. If you're a copyright holder with concerns, you can open an issue or contact us privately via the email address listed on our PyPI page.


Extra:
------
| You want to gather recipes data?
4 changes: 4 additions & 0 deletions docs/in-depth-guide-html-scraping.md
@@ -10,6 +10,10 @@ The [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup

This guide covers a number of common patterns that are used in this library.

## `_schema_cls` and `_opengraph_cls`

It should rarely be necessary to override the default behaviour of schema.org and OpenGraph metadata retrieval; recipe websites should generally adhere to the respective standard formats when including metadata on their webpages. However, bugs and mistakes do happen: if you need to override the implementations provided by the `SchemaOrg` and `OpenGraph` classes, you can subclass them and add a `_schema_cls` or `_opengraph_cls` attribute to your scraper class to instruct the library to use your subclasses instead.
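The override hook described above can be sketched with minimal stand-ins. Note that the `SchemaOrg` and `AbstractScraper` classes below are simplified illustrations of the mechanism, not the real recipe_scrapers implementations:

```python
# Minimal stand-in for the library's schema.org metadata parser.
class SchemaOrg:
    def __init__(self, page_data):
        self.page_data = page_data

    def title(self):
        return "raw title"  # pretend the site's metadata needs fixing


# Subclass that patches the hypothetical site-specific metadata bug.
class FixedSchemaOrg(SchemaOrg):
    def title(self):
        return super().title().strip().title()


# Minimal stand-in for AbstractScraper: it instantiates whichever
# class the `_schema_cls` attribute points at.
class AbstractScraper:
    _schema_cls = SchemaOrg  # default behaviour

    def __init__(self, html):
        self.schema = self._schema_cls(html)


class MyScraper(AbstractScraper):
    _schema_cls = FixedSchemaOrg  # opt in to the patched parser


scraper = MyScraper("<html>...</html>")
print(scraper.schema.title())  # Raw Title
```

The point is only that setting the class attribute swaps the parser without touching `__init__`; the real classes take different constructor arguments.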

## Finding a single element

The `self.soup.find()` function returns the first element matching the arguments. This is useful if you are trying to extract some information that should only occur once, for example the prep time or total time.
4 changes: 4 additions & 0 deletions pyproject.toml
@@ -7,6 +7,10 @@ name = "recipe_scrapers"
description = "Python package, scraping recipes from all over the internet"
authors = [
{name = "Hristo Harsev", email = "[email protected]"},
{name = "James Addison", email = "[email protected]"},
]
maintainers = [
{name = "James Addison", email = "[email protected]"},
]
urls = {Homepage = "https://github.com/hhursev/recipe-scrapers/"}
keywords = ["python", "recipes", "scraper", "harvest", "recipe-scraper", "recipe-scrapers"]
19 changes: 19 additions & 0 deletions recipe_scrapers/__init__.py
@@ -44,6 +44,9 @@
from .akispetretzikis import AkisPetretzikis
from .albertheijn import AlbertHeijn
from .aldi import Aldi
from .aldinord import AldiNord
from .aldisued import AldiSued
from .aldisuisse import AldiSuisse
from .alexandracooks import AlexandraCooks
from .alittlebityummy import ALittleBitYummy
from .allrecipes import AllRecipes
@@ -181,6 +184,7 @@
from .heb import HEB
from .hellofresh import HelloFresh
from .hersheyland import HersheyLand
from .hofer import Hofer
from .homechef import HomeChef
from .hostthetoast import Hostthetoast
from .hungryhappens import HungryHappens
@@ -300,6 +304,7 @@
from .przepisy import Przepisy
from .purelypope import PurelyPope
from .purplecarrot import PurpleCarrot
from .quitoque import QuiToque
from .rachlmansfield import RachlMansfield
from .rainbowplantlife import RainbowPlantLife
from .realfoodtesco import RealFoodTesco
@@ -422,6 +427,17 @@
AkisPetretzikis.host(): AkisPetretzikis,
AlbertHeijn.host(): AlbertHeijn,
Aldi.host(): Aldi,
AldiNord.host(): AldiNord,
AldiNord.host(domain="aldi.es"): AldiNord,
AldiNord.host(domain="aldi.fr"): AldiNord,
AldiNord.host(domain="aldi.lu"): AldiNord,
AldiNord.host(domain="aldi.nl"): AldiNord,
AldiNord.host(domain="aldi.pl"): AldiNord,
AldiNord.host(domain="aldi.pt"): AldiNord,
AldiSued.host(): AldiSued,
AldiSued.host(domain="aldi.hu"): AldiSued,
AldiSued.host(domain="aldi.it"): AldiSued,
AldiSuisse.host(): AldiSuisse,
AlexandraCooks.host(): AlexandraCooks,
AllRecipes.host(): AllRecipes,
AllTheHealthyThings.host(): AllTheHealthyThings,
@@ -542,6 +558,7 @@
PeelWithZeal.host(): PeelWithZeal,
PinchOfYum.host(): PinchOfYum,
PotatoRolls.host(): PotatoRolls,
QuiToque.host(): QuiToque,
Recept.host(): Recept,
ReceptyPreVas.host(): ReceptyPreVas,
RecipeGirl.host(): RecipeGirl,
@@ -625,6 +642,8 @@
HelloFresh.host(domain="no"): HelloFresh,
HelloFresh.host(domain="se"): HelloFresh,
HersheyLand.host(): HersheyLand,
Hofer.host(): Hofer,
Hofer.host(domain="hofer.si"): Hofer,
HomeChef.host(): HomeChef,
Hostthetoast.host(): Hostthetoast,
Ica.host(): Ica,
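The multi-domain registrations above follow a host-keyed registry pattern: one scraper class serves several regional domains, each registered under its own key. A simplified stand-in (the real `SCRAPERS` mapping and `AldiNord` class live in recipe_scrapers; the recipe URL is made up):

```python
from urllib.parse import urlparse


# Stand-in scraper class: host() returns the default domain unless
# a regional variant is requested.
class AldiNord:
    @classmethod
    def host(cls, domain="aldi-nord.de"):
        return domain


# One class, many hosts: each regional domain gets its own key.
SCRAPERS = {
    AldiNord.host(): AldiNord,
    AldiNord.host(domain="aldi.fr"): AldiNord,
    AldiNord.host(domain="aldi.nl"): AldiNord,
}

url = "https://aldi.fr/recettes/tarte-aux-pommes"  # illustrative URL
scraper_cls = SCRAPERS[urlparse(url).netloc]
print(scraper_cls.__name__)  # AldiNord
```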
7 changes: 5 additions & 2 deletions recipe_scrapers/_abstract.py
@@ -21,12 +21,15 @@
 class AbstractScraper:
     page_data: str
 
+    _opengraph_cls = OpenGraph
+    _schema_cls = SchemaOrg
+
     def __init__(self, html: str, url: str):
         self.page_data = html
         self.url = url
         self.soup = BeautifulSoup(self.page_data, "html.parser")
-        self.opengraph = OpenGraph(self.soup)
-        self.schema = SchemaOrg(self.page_data)
+        self.opengraph = self._opengraph_cls(self.soup)
+        self.schema = self._schema_cls(self.page_data)
 
         # attach the plugins as instructed in settings.PLUGINS
         if not hasattr(self.__class__, "plugins_initialized"):
28 changes: 7 additions & 21 deletions recipe_scrapers/albertheijn.py
@@ -1,32 +1,18 @@
-import re
-
 from ._abstract import AbstractScraper
 from ._exceptions import StaticValueException
-from ._utils import normalize_string
 
 
 class AlbertHeijn(AbstractScraper):
     @classmethod
     def host(cls):
         return "ah.nl"
 
     def site_name(self):
         raise StaticValueException(return_value="Albert Heijn")
 
     def instructions(self):
-        instructions = [
-            normalize_string(step.get_text())
-            # get steps root
-            for root in self.soup.findAll(
-                "div",
-                {"class", re.compile("recipe-preparation-steps_root.*")},
-            )
-            # get steps
-            for step in root.findAll("p")
-        ]
+        instructions = self.schema.instructions()
 
-        if instructions:
-            return "\n".join(instructions)
+        filtered_instructions = [
+            line
+            for line in instructions.split("\n")
+            if not line.lower().startswith("stap")
+        ]
 
-        # try schema.org
-        return self.schema.instructions()
+        return "\n".join(filtered_instructions)
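The new `AlbertHeijn.instructions` reads the schema.org data and then drops the Dutch "Stap N" heading lines it interleaves with the real steps. The filtering step can be run standalone (the sample text is illustrative, not taken from the site):

```python
# schema.org instructions with interleaved "Stap N" headings.
instructions = "Stap 1\nSnijd de ui fijn.\nStap 2\nBak de ui goudbruin."

# Drop any line that starts with "stap", case-insensitively.
filtered_instructions = [
    line
    for line in instructions.split("\n")
    if not line.lower().startswith("stap")
]

print("\n".join(filtered_instructions))
```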
24 changes: 24 additions & 0 deletions recipe_scrapers/aldinord.py
@@ -0,0 +1,24 @@
from ._abstract import AbstractScraper
from ._exceptions import StaticValueException


class AldiNord(AbstractScraper):
    @classmethod
    def host(cls, domain: str = "aldi-nord.de"):
        return domain

    def author(self):
        if author_from_schema := self.schema.author():
            return author_from_schema

        raise StaticValueException(return_value="ALDI")

    def site_name(self):
        raise StaticValueException(return_value="ALDI")

    def instructions(self):
        return (
            self.schema.data.get("recipeInstructions", "")
            .replace("\xa0", " ")
            .replace("\r\n ", "\n")
        )
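The cleanup chain in `AldiNord.instructions` can be seen in isolation: on this site the `recipeInstructions` value arrives as one string, so the scraper strips non-breaking spaces and normalizes Windows-style line endings. The sample input below is illustrative:

```python
# Raw schema.org recipeInstructions string with a non-breaking space
# (\xa0) and a Windows-style line ending, as handled by the scraper above.
raw = "Zwiebeln schneiden.\xa0\r\n Zwiebeln anbraten."

cleaned = raw.replace("\xa0", " ").replace("\r\n ", "\n")
print(cleaned)
```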
17 changes: 17 additions & 0 deletions recipe_scrapers/aldisued.py
@@ -0,0 +1,17 @@
from ._abstract import AbstractScraper


class AldiSued(AbstractScraper):
    @classmethod
    def host(cls, domain="aldi-sued.de"):
        return domain

    def instructions(self):
        instruction_elements = self.schema.data.get("recipeInstructions", [])
        return "\n".join(
            [
                element.get("text").replace("\xad", "")
                for element in instruction_elements
                if element.get("text")
            ]
        )
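`AldiSued.instructions` expects a list of HowToStep-like dicts instead of a single string; it joins their "text" values, skips steps without text, and strips soft hyphens (U+00AD, which sometimes appear inside German words). Exercised here with made-up data:

```python
# Made-up schema.org-style step list; the second entry has no "text"
# key and is skipped, and soft hyphens (\xad) are stripped.
instruction_elements = [
    {"@type": "HowToStep", "text": "Kartoffeln sch\xadälen."},
    {"@type": "HowToStep"},
    {"@type": "HowToStep", "text": "In Salzwasser kochen."},
]

instructions = "\n".join(
    [
        element.get("text").replace("\xad", "")
        for element in instruction_elements
        if element.get("text")
    ]
)
print(instructions)
```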
7 changes: 7 additions & 0 deletions recipe_scrapers/aldisuisse.py
@@ -0,0 +1,7 @@
from .aldisued import AldiSued


class AldiSuisse(AldiSued):
    @classmethod
    def host(cls, domain="aldi-suisse.ch"):
        return domain
7 changes: 7 additions & 0 deletions recipe_scrapers/hofer.py
@@ -0,0 +1,7 @@
from .aldisued import AldiSued


class Hofer(AldiSued):
    @classmethod
    def host(cls, domain="hofer.at"):
        return domain
18 changes: 8 additions & 10 deletions recipe_scrapers/mccormick.py
@@ -1,6 +1,5 @@
 from ._abstract import AbstractScraper
 from ._grouping_utils import group_ingredients
-from ._utils import normalize_string
 
 
 class McCormick(AbstractScraper):
@@ -17,13 +16,12 @@ def ingredient_groups(self):
         )
 
     def instructions(self):
-        instructions_list = self.soup.findAll(
-            "li", {"id": lambda x: x and x.startswith("step")}
-        )
+        instructions = self.schema.instructions()
 
-        return "\n".join(
-            [
-                normalize_string(instruction.find("span", {"class": "para"}).get_text())
-                for instruction in instructions_list
-            ]
-        )
+        filtered_instructions = [
+            line
+            for line in instructions.split("\n")
+            if not line.lower().startswith("step")
+        ]
+
+        return "\n".join(filtered_instructions)
90 changes: 90 additions & 0 deletions recipe_scrapers/quitoque.py
@@ -0,0 +1,90 @@
from ._abstract import AbstractScraper
from ._utils import csv_to_tags, get_minutes, get_yields, normalize_string


class QuiToque(AbstractScraper):
    @classmethod
    def host(cls):
        return "quitoque.fr"

    @staticmethod
    def _get_text(element):
        if element:
            return normalize_string(element.get_text())
        else:
            return None

    def _get_time(self, time_name):
        times = self.soup.select("div.recipe-infos-short .item-info")
        total_time = None
        for time in times:
            if time_name in time.get_text():
                total_time = self._get_text(time).replace(time_name, "")
        return get_minutes(total_time)

    def _get_nutrient(self, nutrient_name):
        nutrient_element = self._nutrients.find("p", string=nutrient_name).parent
        return self._get_text(nutrient_element.find("p", class_="regular"))

    def canonical_url(self):
        return self.soup.find("meta", {"property": "og:url"}).get("content")

    def author(self):
        return "QuiToque"

    def title(self):
        return self._get_text(self.soup.find("h1", class_="title-2"))

    def keywords(self):
        product_tags = self.soup.find(id="product-tags").find_all(class_="badge")
        keywords = ",".join(self._get_text(tag) for tag in product_tags)
        return csv_to_tags(keywords)

    def category(self):
        category = self.soup.find(class_="primary-ghost")
        return self._get_text(category)

    def total_time(self):
        return self._get_time("Total")

    def prep_time(self):
        return self._get_time("En cuisine")

    def yields(self):
        serving = self.soup.find(id="ingredients").find("p", class_="body-2")
        return get_yields(serving)

    def image(self):
        img_element = self.soup.find(class_="image").find("img")
        return img_element["src"]

    def ingredients(self):
        ingredients = []
        ingredients.extend(self.soup.select("#ingredients .ingredient-list li"))
        ingredients.extend(self.soup.select(".kitchen-list li"))
        return [self._get_text(ingredient) for ingredient in ingredients]

    def equipment(self):
        equipments = self.soup.select("#equipment .ingredient-list li")
        return [self._get_text(equipment) for equipment in equipments]

    def instructions(self):
        instructions = self.soup.select("#preparation-steps li")
        return "\n".join([self._get_text(instruction) for instruction in instructions])

    def description(self):
        description = self.soup.find("div", class_="container body-2 regular mt-2 mb-4")
        return self._get_text(description)

    def nutrients(self):
        self._nutrients = self.soup.find(id="portion")
        nutrients = {
            "calories": self._get_nutrient("Énergie (kCal)"),
            "fatContent": self._get_nutrient("Matières grasses"),
            "saturatedFatContent": self._get_nutrient("dont acides gras saturés"),
            "carbohydrateContent": self._get_nutrient("Glucides"),
            "sugarContent": self._get_nutrient("dont sucre"),
            "fiberContent": self._get_nutrient("Fibres"),
            "proteinContent": self._get_nutrient("Protéines"),
        }
        return nutrients
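`QuiToque._get_time` matches a French label ("Total", "En cuisine") inside short info items and parses the remainder into minutes. The pattern can be sketched without BeautifulSoup; note that `get_minutes` below is a simplified stand-in for `recipe_scrapers._utils.get_minutes`, and the item strings are made up:

```python
import re


def get_minutes(text):
    """Simplified stand-in: parse '1 h 10 min' or '35 min' into minutes."""
    hours = re.search(r"(\d+)\s*h", text)
    mins = re.search(r"(\d+)\s*min", text)
    return (int(hours.group(1)) * 60 if hours else 0) + (
        int(mins.group(1)) if mins else 0
    )


items = ["Total 35 min", "En cuisine 15 min", "4 personnes"]  # made-up data


def get_time(label):
    # Find the info item carrying the label, strip the label,
    # then parse what is left as a duration.
    for item in items:
        if label in item:
            return get_minutes(item.replace(label, ""))
    return None


print(get_time("Total"), get_time("En cuisine"))  # 35 15
```

Unlike this sketch, the real helper keeps scanning and uses the last matching item, so the simplification only holds when labels are unique.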