-
Notifications
You must be signed in to change notification settings - Fork 531
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add bergamot #1064
base: main
Are you sure you want to change the base?
Add bergamot #1064
Conversation
That seems OK to me @mlduff - a web browser would load the |
This generally looks good to me, thank you @mlduff - a few small suggestions/questions and then I'll re-review. |
recipe_scrapers/bergamot.py
Outdated
def author(self): | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try to find some basic authorship information when we can, even if it is not perfect/ideal.
When a source_url
is found, I think we could return the domain name.. not ideal, but better than nothing. I've asked in the bugreport (#986) whether we know of any example recipes that contain more author info.
Co-authored-by: James Addison <[email protected]>
Co-authored-by: James Addison <[email protected]>
I had a look through the JS on the site and I have added
Couldn't find author using the same method @jayaddison |
Ok, thanks @mlduff. I think retaining the authorship info is fairly important, so I'd like to pause until we get some more ideas / suggestions. |
@jayaddison would it be completely off the cards to do an extra request to the origin website? That's the only way I can see getting any extra information as it seems bergamot does not support author at all. If you are satisfied that bergamot doesn't support author whatsoever, would you merge the PR? Or does this mean we have to decline it? |
@mlduff hmm, I'm not sure that I'd be entirely happy with making an additional request either; it would seem kinda unexpected for some code that appears to scrape from HTML at a URL to make requests to other domains -- and fairly arbitrarily, based on whatever the response of Bergamot included. I don't expect Bergamot would intentionally link to non-related sites for any reason, but that kind of thing can happen and can cause security/privacy vulnerabilities in web browsers; so I'd prefer to maintain simplicity and avoid that. |
@jayaddison yeah that makes sense. Is there anything I can do to help get this one in, or do we just have to wait for bergamot to reply and see if they will add author? If they say no, can we proceed with just the source domain? |
@jayaddison Sorry, just bumping on the above - I have implemented the source domain for the author field - is it okay to proceed with this PR? |
Thanks @mlduff - no, I don't want to proceed with this given that it isn't providing attribution for an author name that we know exists on an origin recipe. Source domain could arguably provide some of that, but I'd prefer to wait until we can retrieve the author credit from the page we're scraping. |
Should I closethis PR for now @jayaddison? Or leave it up? |
I've been on the other side of this situation a few times now -- providing a contribution to a project that it is unclear will ever be accepted, sometimes for a reason that they've communicated, or sometimes simply because maintainers are unavailable -- and my usual preference is to keep the pull request open, because it's:
However: it can be frustrating sometimes if your PR doesn't get merged for a long time. Usually I ping the maintainers once every so often (say once per 3 months or so) if I think the PR is still valid -- but sometimes I just close them to reduce my mental overhead if it's been a long time with no progress. |
This scraper adds support for dashboard.bergamot.app. Would of liked a few more test URLs, but used the following in testing.
Had to implement a legacy test - not sure if I used the right method to specify the origin URL (which is different to the actual request URL). I ended out having a yield for the first URL (that the user would pass in), with the associated
testhtml
not actually being used.Resolves #986