-
Notifications
You must be signed in to change notification settings - Fork 531
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sally's Baking Addiction - take the last image in the list rather than the first #1034
base: main
Are you sure you want to change the base?
Conversation
Hi @krisnoble - thanks for the pull request! The fix itself, selecting the last image instead of the first, sounds good. I'm guessing that you explored ways to avoid duplicating some of the logic here? (it doesn't look straightforward to me, given the relationship between the |
if isinstance(image, dict): | ||
image = image.get("url") | ||
|
||
if "http://" not in image and "https://" not in image: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to keep the comment that appeared below this line in the SchemaOrg
original. Note that that's likely to be updated soon though, by pull request #1032.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, not sure why I removed that, think I caught myself in two minds between the initial version and more defensive version. If we go with this approach will add it back, better in your opinion to use the current one for consistency now and update after that PR gets merged, or just use the new version now?
In all honesty I'm still learning the ropes with this type of stuff so totally open to suggestions. I did initially have a version that just assumed a list and returned the last item, which worked fine on all my tests but I didn't want to take any risks of breaking edge cases, so went with a more defensive approach - but like you say it does duplicate the rest of the logic. Looking at the website source, it also provides an ImageObject schema, so could try that first and fall back to the default if it isn't present. That would remove the duplication and preserve the default behaviour in case of an issue? Just tested the following which passes the tests although I suspect there may be a more elegant way to do it:
|
Apologies for the slow response here @krisnoble - I won't be able to respond completely until next week, but will provide a more complete review then. |
Ok, from taking a bit more time to consider this: I think what I'd recommend here is based on the findings here:
I don't think that we want to duplicate all of the That should reduce the amount of code considerably, and also means that the question about whether to retain the surrounding comments from the original code is not relevant (because it's a distinct implementation, and isn't attempting to retain either the same behaviour or code presentation). The risk is that we overfit (to borrow machine learning terminology) the implementation, and it breaks for some recipes that we haven't considered yet. Ideally we'd evaluate the scraper against a large number of sample recipes from an archive and confirm somehow that the error rate is low and that useful images are retrieved - but we don't have infrastructure to do that currently. |
@jayaddison I agree with how you've handled this PR! Didn't the guy implement it the way you've suggested? If I understand you last comment correct, he does exactly as suggested |
I wasn't as clear as I should have been in the explanation. My main concern is that it nearly-duplicates this code: recipe-scrapers/recipe_scrapers/_schemaorg.py Lines 190 to 208 in 6a6a914
A couple of ideas could be to narrow the implementation in this website to more minimal logic that is only relevant to it, or perhaps to adjust the schema.org method to accept an optional argument to specify last-image-hyperlink instead of the default first. |
Sally's Baking Addiction provides a list of images in different sizes. The default behaviour is to take the first image from the list, but in this case that's a 225x225 thumbnail, with the original image at the end of the list.
This change makes the scraper take the last element in the list. Every recipe I've checked (from old to new) has the same list structure but I've preserved the original checks from SchemaOrg just in case, hopefully that's the right approach but happy to amend if not.