Configurable database names #420

Closed
wants to merge 21 commits into from

Commits on Nov 14, 2019

  1. WebURLTupleBinding, Fetcher and WebURL Post

    Added POST to WebURL and PageFetcher. Included support for the new WebURL attributes in WebURLTupleBinding.
    
    I suggest deprecating newHttpUriRequest(String) in PageFetcher because it cannot take POST parameters.
    dgoiko committed Nov 14, 2019
    e9004f8
  2. DocIDServer aware of POST data

    DocIDServer is now aware of POST data and allows visiting the same URL again when the POST parameters differ (for instance, submitting a form with different years).
    
    I suggest deprecating getDocId, getNewDocID, addUrlAndDocId and isSeenBefore, since they do not accept POST parameters.
    
    WebURL can now encode itself into a single unique string.
    
    NOTE: This serialization SHOULD be improved.
    dgoiko committed Nov 14, 2019
    5ac44bb
  3. WebCrawler uses new DocIDServer Post capabilities

    The WebCrawler now passes the WebURL to the DocIDServer instead of passing a String URL.
    
    We assume GET on redirections. I'm not 100% sure this is always true.
    
    The case !curURL.getURL().equals(fetchResult.getFetchedUrl()) still uses the old methods and should be reviewed.
    dgoiko committed Nov 14, 2019
    f0b2219
  4. addSeenUrl and addSeed with WebURL parameter

    Added addSeenUrl(WebURL) and addSeed(WebURL) methods to CrawlController. The original methods are untouched, although I'd suggest making them create WebURLs and pass them to the new methods.
    dgoiko committed Nov 14, 2019
    20e646c
  5. SUGGESTION: addSeed and addSeenUrl call the WebURL methods

    addSeed and addSeenUrl now create a WebURL and pass it to the newly created methods. This makes it easier for users to override those methods.
    dgoiko committed Nov 14, 2019
    2eed8de
  6. PageFetchResult now contains the POST info

    Suggest deprecating the old fetchedUrl attribute; introduced fetchedWebUrl, which is a WebURL.
    
    Some tab style fixes.
    dgoiko committed Nov 14, 2019
    855cbc4
  7. 300c07f
  8. PageFetchResult fix

    There was a loop.
    
    Some style fixes.
    dgoiko committed Nov 14, 2019
    72cd50f
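
The POST-aware deduplication described in the Nov 14 commits can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: the class PostAwareDocIds, its in-memory map, and the key layout (HTTP method, URL, then the POST parameters sorted by name) are all assumptions made for the example. The real DocIDServer is backed by a database, and the actual WebURL serialization may differ.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, in-memory stand-in for a POST-aware doc ID store. */
final class PostAwareDocIds {
    private final Map<String, Integer> idsByKey = new HashMap<>();
    private int lastDocId = 0;

    /** Builds one string key from the URL and its POST parameters (assumed layout). */
    static String encode(String url, Map<String, String> postParams) {
        if (postParams == null || postParams.isEmpty()) {
            return "GET " + url;
        }
        StringBuilder key = new StringBuilder("POST ").append(url).append(' ');
        boolean first = true;
        // Sort by parameter name so that insertion order does not change the key.
        for (Map.Entry<String, String> e : new TreeMap<>(postParams).entrySet()) {
            if (!first) {
                key.append('&');
            }
            key.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
            first = false;
        }
        return key.toString();
    }

    private static String enc(String s) {
        // Requires Java 10+ for the Charset overload.
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    /** Returns the stored ID, or -1 if this URL/parameter combination is unseen. */
    int getDocId(String url, Map<String, String> postParams) {
        Integer id = idsByKey.get(encode(url, postParams));
        return id == null ? -1 : id;
    }

    /** Returns the existing ID for this combination, or assigns the next one. */
    int getNewDocId(String url, Map<String, String> postParams) {
        String key = encode(url, postParams);
        Integer existing = idsByKey.get(key);
        if (existing != null) {
            return existing;
        }
        idsByKey.put(key, ++lastDocId);
        return lastDocId;
    }
}
```

With this layout, the same form URL posted with year=2018 and with year=2019 yields two distinct keys, so both requests are crawled, while a repeated request with identical parameters maps back to the ID it already has.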

Commits on Nov 15, 2019

  1. dc96fd9
  2. Style fixes

    dgoiko committed Nov 15, 2019
    f768616
  3. Style fixes

    dgoiko committed Nov 15, 2019
    1d29aab
  4. Style fixes

    dgoiko committed Nov 15, 2019
    98aff77

Commits on Nov 16, 2019

  1. Configurable database names

    The database names can now be configured through the CrawlController constructor.
    dgoiko committed Nov 16, 2019
    e2da488
  2. Style fix

    dgoiko committed Nov 16, 2019
    0d0cd15
  3. Extracted interfaces from Parser and PageFetcher

    Extracted interfaces from Parser and PageFetcher to make it easier to plug in fully custom implementations.
    dgoiko committed Nov 16, 2019
    761513b
  4. Style fixes

    dgoiko committed Nov 16, 2019
    a64580c
  5. Style fix

    dgoiko committed Nov 16, 2019
    ad5ebc3
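
One way constructor-configurable database names could look is sketched below. Everything here is hypothetical: DatabaseNames, SketchController, and the default name strings are illustrative assumptions and are not taken from this pull request or from the CrawlController API. The idea is simply that a no-argument constructor keeps the existing defaults, while an overload lets callers supply their own names, for example to run several crawls against one storage environment.

```java
import java.util.Objects;

/** Hypothetical holder for the names of the crawler's backing databases. */
final class DatabaseNames {
    // Default names are assumptions chosen for illustration only.
    static final DatabaseNames DEFAULTS =
            new DatabaseNames("DocIDs", "PendingURLsDB", "InProcessPagesDB");

    final String docIds;
    final String pendingUrls;
    final String inProcessPages;

    DatabaseNames(String docIds, String pendingUrls, String inProcessPages) {
        this.docIds = requireName(docIds);
        this.pendingUrls = requireName(pendingUrls);
        this.inProcessPages = requireName(inProcessPages);
        // Two stores sharing one name would overwrite each other's data.
        if (docIds.equals(pendingUrls) || docIds.equals(inProcessPages)
                || pendingUrls.equals(inProcessPages)) {
            throw new IllegalArgumentException("database names must be distinct");
        }
    }

    private static String requireName(String name) {
        if (name == null || name.trim().isEmpty()) {
            throw new IllegalArgumentException("database name must not be blank");
        }
        return name;
    }
}

/** Hypothetical controller showing the constructor overload pattern. */
final class SketchController {
    private final DatabaseNames names;

    /** Keeps the previous behaviour: default database names. */
    SketchController() {
        this(DatabaseNames.DEFAULTS);
    }

    /** Lets the caller choose the database names. */
    SketchController(DatabaseNames names) {
        this.names = Objects.requireNonNull(names, "names");
    }

    String docIdDatabaseName() {
        return names.docIds;
    }
}
```

Delegating the no-argument constructor to the overload keeps existing callers working unchanged, which matches the non-breaking style of the other commits in this series.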

Commits on Dec 14, 2019

  1. Default docId set to -1

    WebURLs can now be used as seeds, so docId is forced to be < 0 unless stated otherwise.
    dgoiko committed Dec 14, 2019
    0497657
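
The negative default can be read as a "no ID assigned yet" sentinel. The sketch below shows one way a controller might act on it; SeedUrl and SeedRegistry are hypothetical names invented for this example, and the assignment policy (next free ID for negative values, keep explicit IDs) is an assumption rather than the behaviour implemented in this commit.

```java
/** Hypothetical seed object whose docId defaults to -1 (unassigned). */
final class SeedUrl {
    private final String url;
    private int docId = -1; // negative means "no ID assigned yet"

    SeedUrl(String url) {
        this.url = url;
    }

    String getUrl() {
        return url;
    }

    int getDocId() {
        return docId;
    }

    void setDocId(int docId) {
        this.docId = docId;
    }
}

/** Hypothetical registry that assigns IDs only to seeds that lack one. */
final class SeedRegistry {
    private int lastDocId = 0;

    /** Returns the seed's final docId, assigning the next free one if needed. */
    int register(SeedUrl seed) {
        if (seed.getDocId() < 0) {
            seed.setDocId(++lastDocId);
        } else {
            // Respect an explicitly chosen ID and keep the counter ahead of it.
            lastDocId = Math.max(lastDocId, seed.getDocId());
        }
        return seed.getDocId();
    }
}
```

A seed created with only a URL therefore receives a fresh ID on registration, while a seed whose ID was set explicitly keeps it.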

Commits on Jan 9, 2020

  1. Merge pull request #1 from dgoiko/POST-capabilities

    Post capabilities
    dgoiko authored Jan 9, 2020
    a5d72ab
  2. Merge pull request #2 from dgoiko/interfaces

    Interfaces
    dgoiko authored Jan 9, 2020
    Configuration menu
    Copy the full SHA
    d76a045 View commit details
    Browse the repository at this point in the history
  3. 30007ca