[Blocked by #41] Version 2 #38
base: master
Conversation
```ruby
  end

  def load_remote_crawlers(uri = DEFAULT_URL)
    crawlers = Net::HTTP.get(URI(uri)).force_encoding('UTF-8')
```
Do you think this should potentially fail gracefully if, say, the URL returns a non-successful response code or any kind of network error occurs? Potentially just `Kernel.warn "VoightKampff: Unable to fetch crawlers from #{uri}"`.
I can see a scenario where the repo gets pulled and a bunch of sites go down.
Wow, thank you. I added tests and code for these situations. Also, there is a `warn` with error inspection: it can be `SocketError`, or `Errno::ENETUNREACH`, or something else. And the code will "cache" the empty Array in these cases.
Also, I didn't add code for redirects: `Net::HTTP` can't deal with them easily; it's more comfortable to use some external gem, like `httpx`, for this. But I hope there will be no redirects.
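For illustration, a minimal sketch of the handling described above (the constant, the source URL, and the exact set of rescued errors are assumptions, not a quote of the PR's code):

```ruby
require 'net/http'
require 'json'

# Assumed source URL; the gem's actual constant may differ.
DEFAULT_URL =
  'https://raw.githubusercontent.com/monperrus/crawler-user-agents/master/crawler-user-agents.json'

def load_remote_crawlers(uri = DEFAULT_URL)
  crawlers = Net::HTTP.get(URI(uri)).force_encoding('UTF-8')
  JSON.parse(crawlers)
rescue SocketError, Errno::ENETUNREACH, Errno::ECONNREFUSED, Timeout::Error, JSON::ParserError => e
  # Warn with error inspection instead of raising, and hand back an empty
  # Array, which is what ends up being cached for this refresh interval.
  Kernel.warn "VoightKampff: Unable to fetch crawlers from #{uri} (#{e.inspect})"
  []
end
```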
Yes, if we did this at all, I'd want to be really paranoid about it. It should fail very gracefully, and any sort of issue with this whole process should log an error but never take down the whole site. And I'd prefer to avoid a situation where we need to give false positives too. If there are problems getting the file, I suppose you might need to treat everything as not a bot until the file does download. But that seems bad too.
At the very least, I think the gem should start off using its local file that's packaged with the gem. Only if a new version is downloaded successfully, the JSON is parsed successfully, and some sort of sanity check confirms we have a real list of bots (even if that's just checking the number of bots in the file), then and only then should we start to use that new list.
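A rough sketch of that flow (the helper name and the threshold are hypothetical, not the PR's code):

```ruby
require 'json'

# Prefer a freshly downloaded list only when it parses and passes a basic
# sanity check; otherwise keep using the copy bundled with the gem.
def choose_crawler_list(bundled, downloaded_json)
  return bundled if downloaded_json.nil? || downloaded_json.empty?

  fresh = JSON.parse(downloaded_json)
  # "Real list" check: an Array with at least a few hundred entries
  # (300 is an arbitrary threshold, for illustration only).
  return bundled unless fresh.is_a?(Array) && fresh.size >= 300

  fresh
rescue JSON::ParserError => e
  warn "VoightKampff: downloaded crawler list is not valid JSON (#{e.message})"
  bundled
end
```

With this shape, `choose_crawler_list(bundled, nil)` simply keeps the bundled list, which is the fallback behavior described above when the download fails.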
Wow, thanks for all the work on this @AlexWayfer! I'll do my best to get to reviewing it soon and see if we can get this merged. In the meantime, would you mind addressing the merge conflicts?
Done. By the way, this version is already used in production and seems to be working properly.
I don't want to rush, just a friendly reminder.
Added
-----

* Built-in cache with file-path and max-age (a rough sketch follows below)
* Automatically refresh from source (external repository)

Removed
-------

* Rake-tasks
* A very large copy of external UA-bots list
* `rack` as dependency

Changed
-------

* `voight_kampff/rack_request` isn't required by default
* Default path for cache is now in `./tmp` directory instead of `./config`

Development
-----------

* [EditorConfig](https://editorconfig.org/) added
* [RuboCop](https://github.com/rubocop-hq/rubocop) added and offenses fixed
* Drop support for Ruby <= 2.2 (those versions will not even get security patches)
* Ignore built gems for git
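A small sketch of the "built-in cache with file-path and max-age" idea from the list above (the path, names, and interval are illustrative assumptions, not the PR's exact defaults):

```ruby
require 'fileutils'
require 'json'
require 'net/http'

CACHE_PATH = 'tmp/crawler-user-agents.json' # cached copy now lives under ./tmp
MAX_AGE    = 24 * 60 * 60                   # re-fetch from the source repository once a day

# Reuse the cached file while it is fresh enough; otherwise download and rewrite it.
def cached_crawlers(url)
  if File.exist?(CACHE_PATH) && Time.now - File.mtime(CACHE_PATH) < MAX_AGE
    return JSON.parse(File.read(CACHE_PATH))
  end

  body = Net::HTTP.get(URI(url)).force_encoding('UTF-8')
  FileUtils.mkdir_p(File.dirname(CACHE_PATH))
  File.write(CACHE_PATH, body)
  JSON.parse(body)
end
```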
Conflicts resolved, RuboCop updated, speed test improved.
Sorry again for the delay. I've had a bit of a backlog with my open source projects ever since I started a job that gave me some time to work on them. But I'm trying to address at least one issue a week, and this is currently next in line. So I should be able to get to it soon if nothing urgent comes up.
Thanks again @AlexWayfer for all your work on this. There are some little things I'd like to see changed in a code review, which I can add if you want, but before I get into those details I wanted to cover the big issue I'm seeing.
When I wrote this gem I decided to keep a local copy of the `crawler-user-agents.json` file for a reason. Here are some of the pros and cons I see with that approach.
Pros:
- No external dependency, in general
- No downtime for the client app when GitHub or its internet connection is down
- No problems if the external file or repo is deleted, moved, broken, etc.
- No performance issues from an external file having to be downloaded at startup and then on a regular interval
- No issues with firewalls or running the project offline
Cons:
- You don't get the latest list of bots unless you schedule a rake task to download it
There certainly could be other cons I'm not thinking of too. Feel free to list them.
I hope that's enough for you to see why I made the choice to not pull the file from GitHub by default. That being said, there is a lot of good work in this PR. So...
- I like a lot of the cleanup and improvements mixed in here. If you want to separate them out, I'd be happy to merge them.
- I wouldn't mind pulling in the code to automatically download and cache the `crawler-user-agents.json` file, as long as we still keep the local copy included with the gem and use that by default. Then have a config option for "auto refresh" (a rough sketch follows below).
Let me know how you'd like to proceed with this.
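A minimal sketch of what that opt-in might look like (the `auto_refresh` name and the defaults are assumptions, not an agreed API):

```ruby
# Bundled list stays the default; refreshing from GitHub would be strictly opt-in.
VoightKampffOptions = Struct.new(:auto_refresh, :refresh_interval, keyword_init: true)

DEFAULT_OPTIONS = VoightKampffOptions.new(
  auto_refresh:     false,          # use the crawler-user-agents.json copy shipped with the gem
  refresh_interval: 24 * 60 * 60    # only consulted when auto_refresh is enabled
)
```

The point of the default being `false` is that an application never talks to GitHub unless it explicitly asks to.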
```diff
--- .ruby-version
@@ -1 +0,0 @@
-2.6.2
```
I'm not sure why this file was removed, but I'd like to keep it around unless there is a good reason to remove it.
The gem supports many versions of Ruby. Why should we lock to some specific version? It's like `Gemfile.lock` (https://stackoverflow.com/a/4151540/2630849), but at the Ruby level.
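For illustration of the distinction being drawn (the version numbers here are assumptions, not taken from the actual gemspec): `.ruby-version` pins one interpreter for whoever checks out the repo, while the gemspec declares the range of Rubies the gem claims to support.

```ruby
# .ruby-version pins a single interpreter for local development, e.g.:
#   2.6.2
#
# The gemspec, by contrast, declares the supported range:
Gem::Specification.new do |spec|
  spec.name                  = 'voight_kampff'
  spec.required_ruby_version = '>= 2.3' # assumption, matching "drop support for Ruby <= 2.2"
end
```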
> No external dependency, in general

I don't see evil in dependencies. Our (commercial) projects depend on this gem. Is that bad? I don't think so. Especially since I can contribute just new bots to the dedicated repo, not code, and they will immediately be available in our projects, without gem updates. Like automated data updating. Like currency exchange rates.

> No downtime for the client app when GitHub or its internet connection is down

It's a good point. This job (fetching the data update) can be done in a separate thread and with retries. I can try to write code for this here (see the sketch after this list).

> No problems if the external file or repo is deleted, moved, broken, etc.

The same as the first point: our projects can have problems if this gem is deleted, moved, broken, etc. This is a compromise; we're taking these risks, and these potential problems are resolvable.

> No performance issues from an external file having to be downloaded at startup and then on a regular interval

Ehm… OK. I think it's less than 10 seconds even with a bad internet connection and a large file, and for the current interval (24 hours, which can be changed) that's about 0.01% of the time, but OK. Again, with a separate thread it can be better, I think.

> No issues with firewalls or running the project offline

A firewall blocking downloads from GitHub?… And a project that uses a gem for HTTP requests running offline?… I think these cases are very rare, and there would be other problems then anyway. But OK, you're right.

> You don't get the latest list of bots unless you schedule a rake task to download it

Yes, of course.
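The "separate thread with retries" idea mentioned above could look roughly like this (all names are illustrative, not code from this PR):

```ruby
require 'net/http'
require 'json'

# Refresh in the background so a slow or flaky download never blocks requests.
def refresh_crawlers_in_background(uri, attempts: 3, delay: 60)
  Thread.new do
    attempts.times do |n|
      begin
        body = Net::HTTP.get(URI(uri)).force_encoding('UTF-8')
        break JSON.parse(body) # the parsed list becomes the thread's value
      rescue SocketError, Errno::ENETUNREACH, Timeout::Error => e
        warn "VoightKampff: refresh attempt #{n + 1} failed (#{e.class})"
        sleep delay
      end
    end
  end
end

# Fire and forget; the main thread keeps serving requests with the current list.
# thread = refresh_crawlers_in_background('https://example.com/crawler-user-agents.json')
```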
I can (try to) separate those changes, but I want to describe my vision of the suggested approach: as referenced in the PR description, it's like the `money-oxr` gem. The data for this gem (the crawlers list) doesn't update too frequently, but for me the principle is the same: separate code and data, the way you separate code from the database, code from uploaded files, etc. You don't store those in the repo with the code and update them regularly.
References
----------

* `money-oxr` gem