Selenium basic integration #444

Open: wants to merge 10 commits into base: master

Commits on Nov 16, 2019

  1. Extracted interfaces from Parser and PageFetcher

    Extracted interfaces from Parser and PageFetcher in order to make it easier to create totally custom classes
    dgoiko committed Nov 16, 2019
    761513b
  2. Style fixes

    dgoiko committed Nov 16, 2019
    a64580c
  3. Style fix

    dgoiko committed Nov 16, 2019
    ad5ebc3

Commits on Jan 24, 2020

  1. change to pass checks

    Made a trivial change to a Javadoc error in order to create a commit and re-run the merge checks, now that the bug in the Java 8 checks is fixed in the repo.
    dgoiko committed Jan 24, 2020
    4c237eb
  2. Silly changes to pass test again

    There was an HTTP fetch error in the Java 11 test. This commit exists only to re-run the test so that it passes.
    dgoiko committed Jan 24, 2020
    5d0964c

Commits on Jan 26, 2020

  1. Extracted interface for Frontier

    This modification makes it easy to use a database other than Sleepycat.
    dgoiko committed Jan 26, 2020
    4ef7de9
  2. DocIDServer interface created

    Allows any implementation of the DocIDServer to be used, not only Sleepycat.
    dgoiko committed Jan 26, 2020
    8e890b7
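
As an illustration of what the extracted Frontier and DocIDServer interfaces make possible, here is a minimal sketch of an in-memory document ID store that needs no Sleepycat database. The interface `DocIdStore`, its method names, and the class `InMemoryDocIdStore` are hypothetical and chosen for this example only; they are not the interfaces introduced by these commits.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface: names are illustrative, not the PR's actual DocIDServer API.
interface DocIdStore {
    /** Returns the id already assigned to the URL, or -1 if the URL is unknown. */
    int getDocId(String url);

    /** Returns the existing id for the URL, or assigns and returns the next free one. */
    int getOrAssignDocId(String url);

    boolean isSeenBefore(String url);

    int getDocCount();
}

// In-memory implementation: nothing is persisted, so a crawl cannot be resumed.
class InMemoryDocIdStore implements DocIdStore {
    private final Map<String, Integer> ids = new HashMap<>();
    private int lastDocId = 0;

    @Override
    public synchronized int getDocId(String url) {
        return ids.getOrDefault(url, -1);
    }

    @Override
    public synchronized int getOrAssignDocId(String url) {
        Integer existing = ids.get(url);
        if (existing != null) {
            return existing;
        }
        lastDocId++;
        ids.put(url, lastDocId);
        return lastDocId;
    }

    @Override
    public synchronized boolean isSeenBefore(String url) {
        return ids.containsKey(url);
    }

    @Override
    public synchronized int getDocCount() {
        return ids.size();
    }
}

class DocIdStoreDemo {
    public static void main(String[] args) {
        DocIdStore store = new InMemoryDocIdStore();
        int first = store.getOrAssignDocId("https://example.com/");
        int again = store.getOrAssignDocId("https://example.com/");
        System.out.println(first == again);   // same URL keeps the same id
        System.out.println(store.getDocCount());
    }
}
```

Because callers depend only on the interface, a persistent implementation (backed by any database) could be swapped in without touching the code that uses it.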

Commits on May 10, 2020

  1. Selenium basic integration

    Very basic Selenium integration. This is not intended to be a full Selenium crawler like Nutch; the main goal is to provide a simple way to crawl full-JS pages without directly calling the REST APIs. If you're trying to navigate simple HTML pages with, let's say, a POST form, I'd recommend using the POST Capabilities MR instead.

    Connections established through Selenium are not counted in the same pool as those opened with HttpClient, so connection limits are not applied to them. Further commits will attempt to resolve this issue, but it is not straightforward.

    Selenium requests will NOT be intercepted by the credentials interceptors, and cookies obtained via FormLogin (or any other non-Selenium request) will not be visible to the Selenium browser.

    You can define inclusions / exclusions on the new SeleniumCrawlConfig class to determine which URLs will be visited using Selenium and which won't. Starting with a Selenium seed is not possible at the moment (although it would be possible using the new functions created in my POST Capabilities MR, which allow passing WebURLs to the addSeed methods).

    Please note that the Selenium API does NOT provide header information, so headers won't be available in the Page class.

    A full Selenium integration would require modifying the crawler too deeply, but right now the active Selenium headless browser window is available through page#getFetchedResult. This is a bit inconvenient, as it forces you to perform instanceof checks in order to access it.
    dgoiko committed May 10, 2020
    6548cd6
  2. Persisting cookies

    Selenium now sees the cookies generated by HttpClientRequest and vice-versa
    dgoiko committed May 10, 2020
    cd957a5
  3. Package separation

    All Selenium classes are now in a new package
    dgoiko committed May 10, 2020
    8be3bf3
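
The "Selenium basic integration" commit above mentions inclusions and exclusions that decide which URLs are fetched through Selenium. The sketch below shows one way such a decision could work, using plain regular expressions. The class `SeleniumRoutingRules` and its methods are hypothetical and are not the API of SeleniumCrawlConfig; the rule that an exclusion overrides an inclusion is also an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for include/exclude settings; not the PR's SeleniumCrawlConfig API.
class SeleniumRoutingRules {
    private final List<Pattern> inclusions = new ArrayList<>();
    private final List<Pattern> exclusions = new ArrayList<>();

    /** Adds a regex; URLs that fully match it are candidates for Selenium. */
    SeleniumRoutingRules include(String regex) {
        inclusions.add(Pattern.compile(regex));
        return this;
    }

    /** Adds a regex; URLs that fully match it are never fetched with Selenium. */
    SeleniumRoutingRules exclude(String regex) {
        exclusions.add(Pattern.compile(regex));
        return this;
    }

    /**
     * Returns true when the URL matches at least one inclusion and no exclusion.
     * Assumption: exclusions take precedence over inclusions.
     */
    boolean useSelenium(String url) {
        for (Pattern p : exclusions) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        for (Pattern p : inclusions) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false; // default: fall back to the regular HTTP fetcher
    }
}

class SeleniumRoutingDemo {
    public static void main(String[] args) {
        SeleniumRoutingRules rules = new SeleniumRoutingRules()
                .include("https://app\\.example\\.com/.*")
                .exclude(".*\\.(css|png|jpg)");

        System.out.println(rules.useSelenium("https://app.example.com/dashboard"));
        System.out.println(rules.useSelenium("https://app.example.com/logo.png"));
        System.out.println(rules.useSelenium("https://www.example.com/index.html"));
    }
}
```

Defaulting to the regular fetcher when no rule matches keeps the more expensive browser-based fetch opt-in, which fits the stated goal of using Selenium only for full-JS pages.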