
Sally's Baking Addiction - take the last image in the list rather than the first #1034

Open
wants to merge 5 commits into main

Conversation

krisnoble
Contributor

Sally's Baking Addiction provides a list of images in different sizes. The default behaviour is to take the first image from the list, but in this case that's a 225x225 thumbnail, with the original image at the end of the list.

This change makes the scraper take the last element in the list. Every recipe I've checked (from old to new) has the same list structure, but I've preserved the original checks from SchemaOrg just in case; hopefully that's the right approach, but I'm happy to amend if not.
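
To illustrate the structure being described, here is a small sketch; the URLs are invented for the example, not taken from the site:

# Illustrative shape of the schema.org "image" list on this site (invented URLs).
# The first entry is a small thumbnail and the last is the original-size image,
# which is why this change takes the last element instead of the first.
image = [
    "https://example.com/recipe-225x225.jpg",  # thumbnail (current default pick)
    "https://example.com/recipe-500x500.jpg",
    "https://example.com/recipe.jpg",          # original size (what this PR returns)
]
selected = image[-1] if isinstance(image, list) else image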

@jayaddison
Collaborator

Hi @krisnoble - thanks for the pull request!

The fix itself, selecting the last image instead of the first, sounds good. I'm guessing that you explored ways to avoid duplicating some of the logic here? (it doesn't look straightforward to me, given the relationship between the SchemaOrg and AbstractScraper classes)

if isinstance(image, dict):
    image = image.get("url")

if "http://" not in image and "https://" not in image:
Collaborator


It would be nice to keep the comment that appeared below this line in the SchemaOrg original. Note that that's likely to be updated soon though, by pull request #1032.

Contributor Author


Good catch, I'm not sure why I removed that; I think I caught myself in two minds between the initial version and the more defensive version. If we go with this approach I'll add it back. In your opinion, is it better to use the current comment for consistency now and update it after that PR gets merged, or to just use the new version now?

@krisnoble
Contributor Author

krisnoble commented Mar 21, 2024

I'm guessing that you explored ways to avoid duplicating some of the logic here? (it doesn't look straightforward to me, given the relationship between the SchemaOrg and AbstractScraper classes)

In all honesty I'm still learning the ropes with this type of stuff, so I'm totally open to suggestions. I did initially have a version that just assumed a list and returned the last item, which worked fine in all my tests, but I didn't want to risk breaking edge cases, so I went with a more defensive approach. As you say, though, it does duplicate the rest of the logic.

Looking at the website source, it also provides an ImageObject schema, so we could try that first and fall back to the default if it isn't present. That would remove the duplication and preserve the default behaviour in case of an issue. I just tested the following, which passes the tests, although I suspect there may be a more elegant way to do it:

def image(self):
    # Note: requires `import json` at the top of the module.
    try:
        yoast_data = json.loads(
            self.soup.find("script", class_="yoast-schema-graph").get_text()
        )["@graph"]
        for elem in yoast_data:
            if elem["@type"] == "ImageObject":
                # https://developer.yoast.com/features/schema/pieces/image/
                image = elem["contentUrl"]
                break
        else:
            # No ImageObject entry found; fall back to the default.
            image = self.schema.image()
    except Exception:
        # Something went wrong, so fall back to the default.
        image = self.schema.image()

    return image

@jayaddison
Collaborator

Apologies for the slow response here @krisnoble - I won't be able to respond completely until next week, but will provide a more complete review then.

@jayaddison
Collaborator

OK, having taken a bit more time to consider this, I think what I'd recommend is based on these findings:

This change makes the scraper take the last element in the list. Every recipe I've checked (from old to new) has the same list structure, but I've preserved the original checks from SchemaOrg just in case; hopefully that's the right approach, but I'm happy to amend if not.

I don't think that we want to duplicate all of the SchemaOrg logic in each scraper (or any other one) - it is the default, and it is used in many cases, but when we know of more-precise conditions on a per-website basis (like we do here), then we should generally implement (in fact, override) only what we need for that website. In this case, I think we should retrieve self.data.get("image"), expect/assert that the retrieved value is a list type, and then return the last element from that list.

That should reduce the amount of code considerably, and also means that the question about whether to retain the surrounding comments from the original code is not relevant (because it's a distinct implementation, and isn't attempting to retain either the same behaviour or code presentation).

The risk is that we overfit (to borrow machine learning terminology) the implementation, and it breaks for some recipes that we haven't considered yet. Ideally we'd evaluate the scraper against a large number of sample recipes from an archive and confirm somehow that the error rate is low and that useful images are retrieved - but we don't have infrastructure to do that currently.
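
A minimal sketch of the override described above; the attribute path for reaching the schema data from the scraper is an assumption here, not a confirmed part of the project's API:

def image(self):
    # Site-specific override: the schema.org "image" value on this site is
    # expected to be a list ordered from thumbnail to original size, so
    # return the last (largest) entry.
    image = self.schema.data.get("image")
    assert isinstance(image, list), "expected schema.org image to be a list"
    return image[-1]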

@hhursev
Owner

hhursev commented Oct 31, 2024

@jayaddison I agree with how you've handled this PR! Didn't the author implement it the way you suggested? If I understand your last comment correctly, he does exactly as suggested.

@jayaddison
Collaborator

I wasn't as clear as I should have been in the explanation. My main concern is that it nearly duplicates this code:

def image(self):
    image = self.data.get("image")
    if image is None:
        raise SchemaOrgException("Image not found in SchemaOrg")
    if isinstance(image, list):
        # Could contain a dict
        image = image[0]
    if isinstance(image, dict):
        image = image.get("url")
    if "http://" not in image and "https://" not in image:
        # Some sites use relative image paths;
        # prefer generic image retrieval code in those cases.
        image = ""
    return image

A couple of ideas could be to narrow the implementation for this website to more minimal logic that is only relevant to it, or perhaps to adjust the schema.org method to accept an optional argument that selects the last image hyperlink instead of the default first one.
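
As a hedged sketch of the second idea, the SchemaOrg method quoted above could accept an optional keyword argument (the name use_last is invented here) while leaving the default behaviour unchanged:

def image(self, use_last=False):
    image = self.data.get("image")
    if image is None:
        raise SchemaOrgException("Image not found in SchemaOrg")
    if isinstance(image, list):
        # Could contain a dict; optionally prefer the last entry, which some
        # sites use for the original-size image.
        image = image[-1] if use_last else image[0]
    if isinstance(image, dict):
        image = image.get("url")
    if "http://" not in image and "https://" not in image:
        # Some sites use relative image paths;
        # prefer generic image retrieval code in those cases.
        image = ""
    return image

The site scraper's image() override could then simply return self.schema.image(use_last=True).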
