Phase 1
Requirements

- A simple server with a REST API to schedule jobs.
- The server and scheduler should be deployable somewhere (ideally Heroku).
- The scheduler needs to be queue based. I would suggest Redis. Also use kue for inspiration.
- Don't worry about complex schedules, delays, retries, etc. For the first cut, just be able to schedule crawlers immediately.
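For the first cut, the scheduling API could be little more than one endpoint that pushes a job onto a Redis-backed queue, plus a worker that picks jobs up again. A minimal sketch using Express and kue (kue is only suggested above as inspiration; the 'crawl' job type, the route, and the request body shape are assumptions for illustration):

    var express = require('express');
    var kue = require('kue');

    var app = express();
    app.use(express.json());

    // kue connects to Redis on localhost:6379 by default
    var queue = kue.createQueue();

    // POST /jobs { "crawler": "example_crawler" } -> enqueue a crawl job immediately
    app.post('/jobs', function (req, res) {
      var job = queue.create('crawl', { crawler: req.body.crawler })
        .save(function (err) {
          if (err) return res.status(500).json({ error: err.message });
          res.status(201).json({ id: job.id });
        });
    });

    // Worker side: pull jobs off the queue and run the named crawler bundle
    queue.process('crawl', function (job, done) {
      // load crawlers/<job.data.crawler>/crawler.js, run it, then signal completion
      done();
    });

    app.listen(process.env.PORT || 3000); // PORT is set by Heroku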
- Each crawler can be scheduled to be run as a job. Crawlers are separate files. A typical crawler bundle looks like this:
    crawlers/
      example_crawler/
        roach.json
        crawler.js
        steps/
          custom-step1.js
          custom-step2.js
  roach.json - A manifest file defining the default crawler options.
  crawler.js - The actual crawler file that defines the order of the steps.
  steps      - A directory containing any custom steps that are required.
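The manifest format isn't pinned down here; as an assumption, roach.json might carry the crawler's name, its start URL, and any default options the scheduler should pass through, e.g.:

    {
      "name": "example_crawler",
      "url": "http://example.com",
      "options": {
        "timeout": 30000,
        "userAgent": "roach/0.1"
      }
    }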
- Crawlers are written like so (look into casperjs):
    var Crawler = require('crawler');

    Crawler.extend({
      // Your extra custom functions for parsing, etc.
      custom: function (params, data, next) {
        // do some custom stuff with the data
        next(data);
      }
    });

    Crawler.visit(this.url)
      .custom()
      .find('a.link')
      .each()
      .click()
      .done();
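How the files in steps/ get wired into that chain isn't specified above; one assumption is that each custom step module exports a function with the same (params, data, next) shape used in Crawler.extend, e.g.:

    // steps/custom-step1.js (hypothetical example)
    module.exports = function customStep1(params, data, next) {
      // e.g. normalise or filter the scraped data before handing it on
      data.items = (data.items || []).filter(Boolean);
      next(data);
    };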
- Crawlers are put in a crawlers directory.
- When a scheduled crawler (job) has finished running, the data should be sent to RabbitMQ.
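Sending the finished data to RabbitMQ could be a single publish at the end of the job. A sketch using amqplib (the queue name, message envelope, and RABBITMQ_URL variable are assumptions):

    var amqp = require('amqplib');

    // Publish a finished job's data to RabbitMQ.
    function publishResult(crawlerName, data) {
      return amqp.connect(process.env.RABBITMQ_URL || 'amqp://localhost')
        .then(function (conn) {
          return conn.createChannel()
            .then(function (ch) {
              return ch.assertQueue('crawler_results', { durable: true })
                .then(function () {
                  ch.sendToQueue(
                    'crawler_results',
                    Buffer.from(JSON.stringify({ crawler: crawlerName, data: data })),
                    { persistent: true }
                  );
                  return ch.close();
                });
            })
            .then(function () { return conn.close(); });
        });
    }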
- It should handle the following document formats:
  - XML
  - JSON
  - Text
  - CSV
  - XLS, XLSX
  - ZIP files
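Format handling could start as a dispatch table keyed on file extension. A sketch (the function names and the extension mapping are assumptions; only JSON and plain text are handled inline, the rest are placeholders that would wrap a real XML/CSV/XLS/ZIP parser in practice):

    var path = require('path');

    function notImplemented(format) {
      return function () { throw new Error(format + ' parsing not implemented yet'); };
    }

    var parsers = {
      '.json': function (buf) { return JSON.parse(buf.toString('utf8')); },
      '.txt':  function (buf) { return buf.toString('utf8'); },
      '.xml':  notImplemented('XML'),
      '.csv':  notImplemented('CSV'),
      '.xls':  notImplemented('XLS'),
      '.xlsx': notImplemented('XLSX'),
      '.zip':  notImplemented('ZIP') // unpack, then run each entry back through the table
    };

    function parseDocument(filename, buffer) {
      var handler = parsers[path.extname(filename).toLowerCase()];
      if (!handler) throw new Error('Unsupported format: ' + filename);
      return handler(buffer);
    }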