
Sally's Baking Addiction - take the last image in the list rather than the first #1034

Open
wants to merge 5 commits into main

Conversation

krisnoble
Contributor

Sally's Baking Addiction provides a list of images in different sizes. The default behaviour is to take the first image from the list, but in this case that's a 225x225 thumbnail, with the original image at the end of the list.

This change makes the scraper take the last element in the list. Every recipe I've checked (from old to new) has the same list structure, but I've preserved the original checks from SchemaOrg just in case; hopefully that's the right approach, but I'm happy to amend if not.
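
To illustrate the structure being described, here is a small sketch; the URLs are invented for the example, not taken from the site:

# Illustrative shape of the schema.org "image" list on this site (invented URLs).
# The first entry is a small thumbnail and the last is the original-size image,
# which is why this change takes the last element instead of the first.
image = [
    "https://example.com/recipe-225x225.jpg",  # thumbnail (current default pick)
    "https://example.com/recipe-500x500.jpg",
    "https://example.com/recipe.jpg",          # original size (what this PR returns)
]
selected = image[-1] if isinstance(image, list) else image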

@jayaddison
Collaborator

Hi @krisnoble - thanks for the pull request!

The fix itself, selecting the last image instead of the first, sounds good. I'm guessing that you explored ways to avoid duplicating some of the logic here? (it doesn't look straightforward to me, given the relationship between the SchemaOrg and AbstractScraper classes)

if isinstance(image, dict):
    image = image.get("url")

if "http://" not in image and "https://" not in image:
Collaborator


It would be nice to keep the comment that appeared below this line in the SchemaOrg original. Note that that's likely to be updated soon though, by pull request #1032.

Contributor Author


Good catch, I'm not sure why I removed that; I think I caught myself in two minds between the initial version and the more defensive version. If we go with this approach I'll add it back. In your opinion, is it better to use the current comment for consistency now and update it after that PR gets merged, or to just use the new version now?

@krisnoble
Contributor Author

krisnoble commented Mar 21, 2024

I'm guessing that you explored ways to avoid duplicating some of the logic here? (it doesn't look straightforward to me, given the relationship between the SchemaOrg and AbstractScraper classes)

In all honesty I'm still learning the ropes with this type of stuff, so I'm totally open to suggestions. I did initially have a version that just assumed a list and returned the last item, which worked fine in all my tests, but I didn't want to risk breaking edge cases, so I went with a more defensive approach. As you say, though, it does duplicate the rest of the logic.

Looking at the website source, it also provides an ImageObject schema, so we could try that first and fall back to the default if it isn't present. That would remove the duplication and preserve the default behaviour in case of an issue. I just tested the following, which passes the tests, although I suspect there may be a more elegant way to do it:

def image(self):
    # Note: requires `import json` at the top of the module.
    try:
        yoast_data = json.loads(
            self.soup.find("script", class_="yoast-schema-graph").get_text()
        )["@graph"]
        for elem in yoast_data:
            if elem["@type"] == "ImageObject":
                # https://developer.yoast.com/features/schema/pieces/image/
                image = elem["contentUrl"]
                break
        else:
            # No ImageObject entry found; fall back to the default.
            image = self.schema.image()
    except Exception:
        # Something went wrong, so fall back to the default.
        image = self.schema.image()

    return image

@jayaddison
Collaborator

Apologies for the slow response here @krisnoble - I won't be able to respond completely until next week, but will provide a more complete review then.

@jayaddison
Collaborator

OK, having taken a bit more time to consider this, I think what I'd recommend is based on these findings:

This change makes the scraper take the last element in the list. Every recipe I've checked (from old to new) has the same list structure, but I've preserved the original checks from SchemaOrg just in case; hopefully that's the right approach, but I'm happy to amend if not.

I don't think that we want to duplicate all of the SchemaOrg logic in each scraper (or any other one) - it is the default, and it is used in many cases, but when we know of more-precise conditions on a per-website basis (like we do here), then we should generally implement (in fact, override) only what we need for that website. In this case, I think we should retrieve self.data.get("image"), expect/assert that the retrieved value is a list type, and then return the last element from that list.

That should reduce the amount of code considerably, and also means that the question about whether to retain the surrounding comments from the original code is not relevant (because it's a distinct implementation, and isn't attempting to retain either the same behaviour or code presentation).

The risk is that we overfit (to borrow machine learning terminology) the implementation, and it breaks for some recipes that we haven't considered yet. Ideally we'd evaluate the scraper against a large number of sample recipes from an archive and confirm somehow that the error rate is low and that useful images are retrieved - but we don't have infrastructure to do that currently.
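
A minimal sketch of the override described above; the attribute path for reaching the schema data from the scraper is an assumption here, not a confirmed part of the project's API:

def image(self):
    # Site-specific override: the schema.org "image" value on this site is
    # expected to be a list ordered from thumbnail to original size, so
    # return the last (largest) entry.
    image = self.schema.data.get("image")
    assert isinstance(image, list), "expected schema.org image to be a list"
    return image[-1]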

@hhursev
Owner

hhursev commented Oct 31, 2024

@jayaddison I agree with how you've handled this PR! Didn't the author implement it the way you suggested? If I understand your last comment correctly, he does exactly as suggested.

@jayaddison
Collaborator

I wasn't as clear as I should have been in the explanation. My main concern is that it nearly duplicates this code:

def image(self):
    image = self.data.get("image")
    if image is None:
        raise SchemaOrgException("Image not found in SchemaOrg")
    if isinstance(image, list):
        # Could contain a dict
        image = image[0]
    if isinstance(image, dict):
        image = image.get("url")
    if "http://" not in image and "https://" not in image:
        # Some sites use relative image paths;
        # prefer generic image retrieval code in those cases.
        image = ""
    return image

A couple of ideas could be to narrow the implementation for this website to more minimal logic that is only relevant to it, or perhaps to adjust the schema.org method to accept an optional argument that selects the last image hyperlink instead of the default first one.
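
As a hedged sketch of the second idea, the SchemaOrg method quoted above could accept an optional keyword argument (the name use_last is invented here) while leaving the default behaviour unchanged:

def image(self, use_last=False):
    image = self.data.get("image")
    if image is None:
        raise SchemaOrgException("Image not found in SchemaOrg")
    if isinstance(image, list):
        # Could contain a dict; optionally prefer the last entry, which some
        # sites use for the original-size image.
        image = image[-1] if use_last else image[0]
    if isinstance(image, dict):
        image = image.get("url")
    if "http://" not in image and "https://" not in image:
        # Some sites use relative image paths;
        # prefer generic image retrieval code in those cases.
        image = ""
    return image

The site scraper's image() override could then simply return self.schema.image(use_last=True).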
