
Phase 1

Requirements

  • A simple server with a REST API to schedule jobs

  • The server and scheduler should be deployable somewhere (ideally Heroku)

  • The job scheduler should be based on later and moment.

  • The scheduler needs to be queue-based. I would suggest Redis; also look at kue for inspiration (see the sketch after the crawler example below).

  • Don't worry about complex schedules, delays, retries, etc. For the first cut, just be able to schedule crawlers immediately.

  • Each crawler can be scheduled to run as a job. Crawlers are kept as separate files; a typical crawler bundle would look like so:

      crawlers
        -> example_crawler
          -> roach.json
          -> crawler.js
          -> steps
            -> custom-step1.js
            -> custom-step2.js
    
    • roach.json - A manifest file defining the default crawler options.
    • crawler.js - The actual crawler file that defines the order of the steps.
    • steps - A directory containing any custom steps that are required.
  • Crawlers are written like so (look into CasperJS):

var Crawler = require('crawler');

Crawler.extend({
  // Your extra custom functions for parsing, etc.
  custom: function(params, data, next){
    // do some custom stuff with the data
    next(data);
  }
});

// Declare the crawl flow: visit the page, run the custom step,
// then find and click each matched link.
Crawler.visit(this.url)
       .custom()
       .find('a.link')
       .each()
       .click()
       .done();
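
A rough sketch of how the scheduling half could hang together, assuming Express (with body-parser) for the REST API, kue on top of Redis for the queue, and later for optional recurring schedules. The /jobs route, the 'crawl' job type, the payload fields, and the crawler.run() entry point are placeholders for illustration, not a settled design:

var express = require('express');
var bodyParser = require('body-parser');
var kue = require('kue');
var later = require('later');

var app = express();
app.use(bodyParser.json());

// kue talks to Redis on localhost:6379 unless configured otherwise.
var queue = kue.createQueue();

// One kue job per crawler run; the 'crawl' type and payload shape are placeholders.
function enqueue(name) {
  queue.create('crawl', { crawler: name }).save();
}

// POST /jobs  { "crawler": "example_crawler", "schedule": "every 30 mins" }
app.post('/jobs', function (req, res) {
  if (req.body.schedule) {
    // Optional recurring schedule, parsed with later's English text format.
    var sched = later.parse.text(req.body.schedule);
    later.setInterval(function () { enqueue(req.body.crawler); }, sched);
  } else {
    // First cut: schedule the crawler immediately.
    enqueue(req.body.crawler);
  }
  res.status(201).json({ scheduled: req.body.crawler });
});

// Worker side: pull jobs off the Redis-backed queue and run the crawler bundle.
queue.process('crawl', function (job, done) {
  var crawler = require('./crawlers/' + job.data.crawler + '/crawler');
  crawler.run(done); // hypothetical entry point exported by crawler.js
});

app.listen(process.env.PORT || 3000);

In a real deployment the recurring timers and the worker would live outside the web process, but for the first cut a single process keeps the sketch short.
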
  • Crawlers are put in a crawlers directory.
  • When a scheduled crawler (job) has finished running, the data should be sent to RabbitMQ (see the sketch at the end of this list).
  • It should handle the following document formats:
    • XML
    • JSON
    • Text
    • CSV
    • XLS, XLSX
    • ZIP files
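
When a job finishes, something like the following could push the result onto RabbitMQ. This is a minimal sketch assuming amqplib; the crawl_results queue name, the message shape, and the extension-based format tag are illustrative assumptions rather than settled decisions:

var amqp = require('amqplib/callback_api');
var path = require('path');

// Tag a fetched document with one of the supported formats by file extension.
// Actual parsing could lean on libraries such as xml2js, csv-parse, xlsx and a
// zip reader, but that is left out of this sketch.
function formatOf(filename) {
  var ext = path.extname(filename).toLowerCase();
  if (ext === '.xml') return 'xml';
  if (ext === '.json') return 'json';
  if (ext === '.csv') return 'csv';
  if (ext === '.xls' || ext === '.xlsx') return 'excel';
  if (ext === '.zip') return 'zip';
  return 'text';
}

// Publish a finished job's data to RabbitMQ; 'crawl_results' is a placeholder queue.
function publishResult(jobId, filename, data, callback) {
  amqp.connect(process.env.RABBITMQ_URL || 'amqp://localhost', function (err, conn) {
    if (err) return callback(err);
    conn.createChannel(function (err, channel) {
      if (err) return callback(err);
      channel.assertQueue('crawl_results', { durable: true });
      var message = { job: jobId, format: formatOf(filename), data: data };
      channel.sendToQueue('crawl_results', Buffer.from(JSON.stringify(message)),
                          { persistent: true });
      // A long-running worker would keep this connection open and reuse the channel.
      callback();
    });
  });
}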