Selenium basic integration #444
Open
dgoiko wants to merge 10 commits into yasserg:master from dgoiko:selenium
Commits on Nov 16, 2019
- Extracted interfaces from Parser and PageFetcher (761513b)
  Extracted interfaces from Parser and PageFetcher in order to make it easier to create totally custom classes.
- a64580c
- ad5ebc3
Commits on Jan 24, 2020
- Made a trivial change to fix a Javadoc error, in order to create a commit and pass the merge checks again now that the bug in the Java 8 checks is fixed in the repo (4c237eb)
- Silly changes to pass test again (5d0964c)
  There was an HTTP fetch error on the Java 11 test. Commit to pass the test again.
Commits on Jan 26, 2020
- Extracted interface for Frontier (4ef7de9)
  This modification makes it easy to use a database other than Sleepycat.
- Allows any implementation to be used for the DocIDServer, not only Sleepycat (8e890b7)
Commits on May 10, 2020
- Very basic Selenium integration (6548cd6)
  This is not intended to be a full Selenium crawler like Nutch; the main goal is to provide a simple way to crawl full-JS pages without directly calling the REST APIs. If you're trying to navigate simple HTML pages with, let's say, a POST form, I'd recommend using the POST Capabilities MR instead.
  Connections established through Selenium are not counted in the same pool as those opened with HttpClient, so limitations are not taken into consideration. Further commits will attempt to resolve this issue, but it is not straightforward.
  Selenium requests will NOT be intercepted by the credentials interceptors, and cookies obtained via FormLogin (or any other non-Selenium request) will not be visible to the Selenium browser.
  You can define inclusions / exclusions on the new SeleniumCrawlConfig class to determine which URLs will be visited using Selenium and which won't. Starting with a Selenium seed is not possible at the moment (although it would be possible using the new functions created in my POST Capabilities MR, which allow passing WebURLs to the addSeed methods).
  Please note that the Selenium API does NOT provide header information, so headers won't be available in the Page class.
  A full Selenium integration would require modifying the crawler too deeply, but right now the active Selenium headless browser window is available through page#getFetchedResult. This is a bit inconvenient, as it forces you to perform instanceof checks in order to access it.
- Selenium now sees the cookies generated by HttpClientRequest and vice versa (cd957a5)
- 8be3bf3
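To make the two ideas in the Selenium commit message more concrete, here is a minimal, hedged sketch in plain Java. It does not reproduce the PR's code: every class and method name below (SeleniumRoutingSketch, FetchResultSketch, and so on) is hypothetical, and the real SeleniumCrawlConfig and page#getFetchedResult signatures are not shown on this page. The sketch assumes the inclusions and exclusions are regular expressions and that exclusions take precedence, which may not match the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for inclusion/exclusion rules; NOT the PR's SeleniumCrawlConfig. */
class SeleniumRoutingSketch {
    private final List<Pattern> inclusions = new ArrayList<>();
    private final List<Pattern> exclusions = new ArrayList<>();

    SeleniumRoutingSketch include(String regex) {
        inclusions.add(Pattern.compile(regex));
        return this;
    }

    SeleniumRoutingSketch exclude(String regex) {
        exclusions.add(Pattern.compile(regex));
        return this;
    }

    /** Assumption: a URL is fetched with Selenium if it matches an inclusion and no exclusion. */
    boolean useSelenium(String url) {
        for (Pattern p : exclusions) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        for (Pattern p : inclusions) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }
}

/** Hypothetical result types, only to illustrate the instanceof check. */
interface FetchResultSketch {}

class HttpResultSketch implements FetchResultSketch {}

class SeleniumResultSketch implements FetchResultSketch {
    final String windowTitle; // stand-in for a handle on the headless browser window

    SeleniumResultSketch(String windowTitle) {
        this.windowTitle = windowTitle;
    }
}

class SeleniumSketchDemo {
    /** The caller has to check the concrete type before touching browser-specific state. */
    static String describe(FetchResultSketch result) {
        if (result instanceof SeleniumResultSketch) {
            SeleniumResultSketch selenium = (SeleniumResultSketch) result;
            return "selenium:" + selenium.windowTitle;
        }
        return "http";
    }

    public static void main(String[] args) {
        SeleniumRoutingSketch routing = new SeleniumRoutingSketch()
                .include(".*/app/.*")
                .exclude(".*\\.pdf");
        System.out.println(routing.useSelenium("https://example.com/app/home"));
        System.out.println(describe(new SeleniumResultSketch("Home")));
    }
}
```

The instanceof branch in describe is the inconvenience the commit message mentions: code that handles a fetched page cannot assume a browser window exists and has to check the result type first.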