[Blocked by #41] Version 2 #38
base: master
Conversation
```ruby
  end

  def load_remote_crawlers(uri = DEFAULT_URL)
    crawlers = Net::HTTP.get(URI(uri)).force_encoding('UTF-8')
```
Do you think this should potentially fail gracefully if, say, the URL returns a non-successful response code or any kind of network error occurs? Potentially just `Kernel.warn "VoightKampff: Unable to fetch crawlers from #{uri}"`.
I can see a scenario where the repo gets pulled and a bunch of sites go down.
Wow, thank you. I added tests and code for these situations. Also, there is a `warn` with error inspection: it can be `SocketError`, or `Errno::ENETUNREACH`, or something else. And the code will "cache" the empty Array in these cases.
Also, I didn't add code for redirects: `Net::HTTP` can't deal with them easily; it's more comfortable to use some external gem, like `httpx`, for this. But I hope there will be no redirects.
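For illustration, a minimal sketch of the handling described above (the constant, the source URL, and the exact set of rescued errors are assumptions, not a quote of the PR's code):

```ruby
require 'net/http'
require 'json'

# Assumed source URL; the gem's actual constant may differ.
DEFAULT_URL =
  'https://raw.githubusercontent.com/monperrus/crawler-user-agents/master/crawler-user-agents.json'

def load_remote_crawlers(uri = DEFAULT_URL)
  crawlers = Net::HTTP.get(URI(uri)).force_encoding('UTF-8')
  JSON.parse(crawlers)
rescue SocketError, Errno::ENETUNREACH, Errno::ECONNREFUSED, Timeout::Error, JSON::ParserError => e
  # Warn with error inspection instead of raising, and hand back an empty
  # Array, which is what ends up being cached for this refresh interval.
  Kernel.warn "VoightKampff: Unable to fetch crawlers from #{uri} (#{e.inspect})"
  []
end
```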
Yes, if we did this at all, I'd want to be really paranoid about it. It should fail very gracefully, and any sort of issue with this whole process should log an error but never take down the whole site. And I'd prefer to avoid a situation where we need to give false positives too. If there are problems getting the file, I suppose you might need to treat everything as not a bot until the file does download. But that seems bad too.
At the very least, I think the gem should start off using its local file that's packaged with the gem. Only if a new version is downloaded successfully, the JSON is parsed successfully, and some sort of sanity check confirms we have a real list of bots (even if that's just checking the number of bots in the file), then and only then should we start to use that new list.
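A rough sketch of that flow (the helper name and the threshold are hypothetical, not the PR's code):

```ruby
require 'json'

# Prefer a freshly downloaded list only when it parses and passes a basic
# sanity check; otherwise keep using the copy bundled with the gem.
def choose_crawler_list(bundled, downloaded_json)
  return bundled if downloaded_json.nil? || downloaded_json.empty?

  fresh = JSON.parse(downloaded_json)
  # "Real list" check: an Array with at least a few hundred entries
  # (300 is an arbitrary threshold, for illustration only).
  return bundled unless fresh.is_a?(Array) && fresh.size >= 300

  fresh
rescue JSON::ParserError => e
  warn "VoightKampff: downloaded crawler list is not valid JSON (#{e.message})"
  bundled
end
```

With this shape, `choose_crawler_list(bundled, nil)` simply keeps the bundled list, which is the fallback behavior described above when the download fails.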
Wow, thanks for all the work on this @AlexWayfer! I'll do my best to get to reviewing it soon and see if we can get this merged. In the meantime, would you mind addressing the merge conflicts?
Done. By the way, this version is already used in production and seems to be working properly.
I don't want to rush, just a friendly reminder.
Added
-----

* Built-in cache with file-path and max-age (a rough sketch follows below)
* Automatically refresh from source (external repository)

Removed
-------

* Rake-tasks
* A very large copy of external UA-bots list
* `rack` as dependency

Changed
-------

* `voight_kampff/rack_request` isn't required by default
* Default path for cache is now in `./tmp` directory instead of `./config`

Development
-----------

* [EditorConfig](https://editorconfig.org/) added
* [RuboCop](https://github.com/rubocop-hq/rubocop) added and offenses fixed
* Drop support for Ruby <= 2.2 (those versions will not even get security patches)
* Ignore built gems for git
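A small sketch of the "built-in cache with file-path and max-age" idea from the list above (the path, names, and interval are illustrative assumptions, not the PR's exact defaults):

```ruby
require 'fileutils'
require 'json'
require 'net/http'

CACHE_PATH = 'tmp/crawler-user-agents.json' # cached copy now lives under ./tmp
MAX_AGE    = 24 * 60 * 60                   # re-fetch from the source repository once a day

# Reuse the cached file while it is fresh enough; otherwise download and rewrite it.
def cached_crawlers(url)
  if File.exist?(CACHE_PATH) && Time.now - File.mtime(CACHE_PATH) < MAX_AGE
    return JSON.parse(File.read(CACHE_PATH))
  end

  body = Net::HTTP.get(URI(url)).force_encoding('UTF-8')
  FileUtils.mkdir_p(File.dirname(CACHE_PATH))
  File.write(CACHE_PATH, body)
  JSON.parse(body)
end
```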
Conflicts resolved, RuboCop updated, speed test improved.
Sorry again for the delay. I've had a bit of a backlog with my open source projects ever since I started a job that gave me some time to work on them. But I'm trying to address at least one issue a week, and this is currently next in line. So I should be able to get to it soon if nothing urgent comes up.
Thanks again @AlexWayfer for all your work on this. There are some little things I'd like to see changed in a code review, which I can add if you want, but before I get into those details I wanted to cover the big issue I'm seeing.
When I wrote this gem I decided to keep a local copy of the `crawler-user-agents.json` file for a reason. Here are some of the pros and cons I see with that approach.
Pros:
- No external dependency, in general
- No downtime for the client app when GitHub or its internet connection is down
- No problems if the external file or repo is deleted, moved, broken, etc.
- No performance issues from an external file having to be downloaded at startup and then on a regular interval
- No issues with firewalls or running the project offline
Cons:
- You don't get the latest list of bots unless you schedule a rake task to download it
There certainly could be other cons I'm not thinking of too. Feel free to list them.
I hope that's enough for you to see why I made the choice to not pull the file from GitHub by default. That being said, there is a lot of good work in this PR. So...
- I like a lot of the cleanup and improvements mixed in here. If you want to separate them out, I'd be happy to merge them.
- I wouldn't mind pulling in the code to automatically download and cache the `crawler-user-agents.json` file, as long as we still keep the local copy included with the gem and use that by default. Then have a config option for "auto refresh" (a rough sketch follows below).
Let me know how you'd like to proceed with this.
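A minimal sketch of what that opt-in might look like (the `auto_refresh` name and the defaults are assumptions, not an agreed API):

```ruby
# Bundled list stays the default; refreshing from GitHub would be strictly opt-in.
VoightKampffOptions = Struct.new(:auto_refresh, :refresh_interval, keyword_init: true)

DEFAULT_OPTIONS = VoightKampffOptions.new(
  auto_refresh:     false,          # use the crawler-user-agents.json copy shipped with the gem
  refresh_interval: 24 * 60 * 60    # only consulted when auto_refresh is enabled
)
```

The point of the default being `false` is that an application never talks to GitHub unless it explicitly asks to.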
```diff
--- .ruby-version
@@ -1 +0,0 @@
-2.6.2
```
I'm not sure why this file was removed, but I'd like to keep it around unless there is a good reason to remove it.
The gem supports many versions of Ruby. Why should we lock to some specific version? It's like `Gemfile.lock` (https://stackoverflow.com/a/4151540/2630849), but at the Ruby level.
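For illustration of the distinction being drawn (the version numbers here are assumptions, not taken from the actual gemspec): `.ruby-version` pins one interpreter for whoever checks out the repo, while the gemspec declares the range of Rubies the gem claims to support.

```ruby
# .ruby-version pins a single interpreter for local development, e.g.:
#   2.6.2
#
# The gemspec, by contrast, declares the supported range:
Gem::Specification.new do |spec|
  spec.name                  = 'voight_kampff'
  spec.required_ruby_version = '>= 2.3' # assumption, matching "drop support for Ruby <= 2.2"
end
```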
> No external dependency, in general

I don't see evil in dependencies. Our (commercial) projects depend on this gem. Is that bad? I don't think so. Especially since I can contribute just new bots to the dedicated repo, not code, and they will immediately be available in our projects, without gem updates. Like automated data updating. Like currency exchange rates.

> No downtime for the client app when GitHub or its internet connection is down

It's a good point. This job (fetching the data update) can be done in a separate thread and with retries. I can try to write code for this here (see the sketch after this list).

> No problems if the external file or repo is deleted, moved, broken, etc.

The same as the first point: our projects can have problems if this gem is deleted, moved, broken, etc. This is a compromise; we're taking these risks, and these potential problems are resolvable.

> No performance issues from an external file having to be downloaded at startup and then on a regular interval

Ehm… OK. I think it's less than 10 seconds even with a bad internet connection and a large file, and for the current interval (24 hours, which can be changed) that's about 0.01% of the time, but OK. Again, with a separate thread it can be better, I think.

> No issues with firewalls or running the project offline

A firewall blocking downloads from GitHub?… And a project that uses a gem for HTTP requests running offline?… I think these cases are very rare, and there would be other problems then anyway. But OK, you're right.

> You don't get the latest list of bots unless you schedule a rake task to download it

Yes, of course.
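The "separate thread with retries" idea mentioned above could look roughly like this (all names are illustrative, not code from this PR):

```ruby
require 'net/http'
require 'json'

# Refresh in the background so a slow or flaky download never blocks requests.
def refresh_crawlers_in_background(uri, attempts: 3, delay: 60)
  Thread.new do
    attempts.times do |n|
      begin
        body = Net::HTTP.get(URI(uri)).force_encoding('UTF-8')
        break JSON.parse(body) # the parsed list becomes the thread's value
      rescue SocketError, Errno::ENETUNREACH, Timeout::Error => e
        warn "VoightKampff: refresh attempt #{n + 1} failed (#{e.class})"
        sleep delay
      end
    end
  end
end

# Fire and forget; the main thread keeps serving requests with the current list.
# thread = refresh_crawlers_in_background('https://example.com/crawler-user-agents.json')
```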
I can (try to) separate those changes, but I want to describe my vision of the suggested approach: as referenced in the PR description, it's like the `money-oxr` gem. The data for this gem (the crawlers list) doesn't update too frequently, but for me the principle is the same: separate code and data, the way you separate code from the database, code from uploaded files, etc. You don't store those in the repo with the code and update them regularly.
References
----------

* `money-oxr` gem