Crawlers should support custom steps #17

Open
ekryski opened this issue Mar 18, 2014 · 14 comments

@ekryski
Contributor

ekryski commented Mar 18, 2014

A crawler might look something like this.

var Crawler = require('roach').job();

Crawler.extend({
  // Your extra custom functions for parsing, etc.
  custom: function(params, data, next){
    // do some custom stuff with the data
    next(data);
  }
});

Crawler.visit(this.url)
       .custom()
       .find('a.link')
       .each()
       .click()
       .done();
@ekryski ekryski added this to the Phase 1 milestone Mar 18, 2014
@bredele

bredele commented Mar 25, 2014

Hey Eric, I was thinking about something more like this:

var crawler = require('roach').crawler({
  custom: function(){}
});
crawler
  .get('http://something.io/file.zip')
  .pipe(crawler.unzip())
  .find('a.link')
  .custom()
  .done();

I think it would be better to access the crawler methods (get, zip, csv, etc.) outside of the job.

@bredele bredele closed this as completed Mar 25, 2014
@bredele bredele reopened this Mar 25, 2014
@ekryski
Contributor Author

ekryski commented Mar 25, 2014

Ya that should work. Although I suspect it will look more like this:

var crawler = require('roach').crawler({
  custom: function(){}
});
crawler
  .get('http://something.io/file.zip')
  .pipe(crawler.unzip())
  .pipe(crawler.find('a.link'))
  .pipe(crawler.custom())
  .done();
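
For reference, a minimal sketch of how a pipe-based chain like this could be built on Node's built-in stream module. The custom(), filter() and crawl() names are hypothetical and are not roach's API; the sketch assumes each step is an object-mode Transform stream and uses a hard-coded list of links in place of a real HTTP fetch.

```javascript
const { Readable, Transform } = require('stream');

// Hypothetical custom step: an object-mode Transform that rewrites each item.
function custom() {
  return new Transform({
    objectMode: true,
    transform(link, _encoding, callback) {
      callback(null, { href: link, visited: false });
    }
  });
}

// Hypothetical built-in step: keeps only the items that match a predicate.
function filter(predicate) {
  return new Transform({
    objectMode: true,
    transform(item, _encoding, callback) {
      // Calling back without data drops the item from the stream.
      if (predicate(item)) callback(null, item);
      else callback();
    }
  });
}

// Pipes a list of links through the steps and collects the results.
function crawl(links, done) {
  const results = [];
  Readable.from(links)
    .pipe(filter((link) => link.endsWith('.html')))
    .pipe(custom())
    .on('data', (item) => results.push(item))
    .on('error', done)
    .on('end', () => done(null, results));
}

crawl(['/a.html', '/b.zip'], (err, results) => console.log(err, results));
```

Because every step is an ordinary stream, built-in and user-supplied steps compose the same way, which is the consistency argument for pipe.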

@bredele

bredele commented Mar 25, 2014

Yes, having a pipe API would help us to be more consistent.

@bredele

bredele commented Mar 25, 2014

I'm going to work on this right now.

@ekryski
Contributor Author

ekryski commented Mar 25, 2014

Either we use pipe or we use the jQuery-style .then().

@bredele

bredele commented Mar 25, 2014

Do you know a chainable library with pipe?

@bredele

bredele commented Mar 25, 2014

The .then() is more promise-based, right?

@ekryski
Contributor Author

ekryski commented Mar 25, 2014

No, I don't know one that has an explicit 'pipe' method. And ya, jQuery's .then() is built around their deferreds/promises.

@bredele

bredele commented Mar 25, 2014

I'll figure out what the best solution is. I guess using promises would be easier.

@ekryski
Contributor Author

ekryski commented Mar 25, 2014

Ya I thought it would be promise based for sure. Look at how Casper does it. It's basically what we are going to emulate.
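
A rough sketch of what a promise-based, chainable API could look like, in the spirit of queuing steps first and running them later. Everything here is hypothetical and not roach's or Casper's actual API: the crawler() factory, the step() and done() methods, and the example steps are assumptions for illustration. The generic method is called step() rather than then() so that the chainable object is not mistaken for a thenable.

```javascript
// Hypothetical chainable crawler: each call queues a step, done() runs them in order.
function crawler(steps) {
  const queue = [];
  const api = {
    // Queue an arbitrary function; it may return a value or a promise.
    step(fn) {
      queue.push(fn);
      return api;
    },
    // Run the queued steps sequentially, feeding each result into the next one.
    done(initial) {
      return queue.reduce((promise, fn) => promise.then(fn), Promise.resolve(initial));
    }
  };
  // Expose every custom step as a chainable method on the crawler.
  Object.keys(steps || {}).forEach((name) => {
    api[name] = (...args) => api.step((data) => steps[name](data, ...args));
  });
  return api;
}

const example = crawler({
  // Synchronous custom step.
  custom: (links) => links.map((href) => href.toLowerCase()),
  // Asynchronous custom step, standing in for something like a network request.
  delay: (data, ms) => new Promise((resolve) => setTimeout(() => resolve(data), ms))
});

example
  .custom()
  .delay(10)
  .step((links) => links.length)
  .done(['HTTP://A', 'HTTP://B'])
  .then((count) => console.log(count));
```

Each step receives the previous step's result, so synchronous and asynchronous steps can be mixed freely, and a rejected promise skips the remaining steps.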

@bredele

bredele commented Mar 25, 2014

OK, I will.

@ekryski
Contributor Author

ekryski commented Mar 25, 2014

Could also look at how gulpjs is doing their pipe method.

@ekryski
Contributor Author

ekryski commented Apr 4, 2014

We might need our own through wrapper for custom steps and for the '.XLS' parser.
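
One way such a wrapper might look, sketched on top of Node's built-in Transform stream. The through(write, end) signature below is a hypothetical one chosen for this example, and wholeFileParser() is only a stand-in: it assumes a parser that needs the whole file before it can emit rows (as a binary spreadsheet format plausibly would) and uses comma-separated text in place of real XLS data.

```javascript
const { Readable, Transform } = require('stream');

// Hypothetical wrapper: write(chunk, push) runs for every chunk,
// end(push) runs once after the input is exhausted.
function through(write, end) {
  return new Transform({
    objectMode: true,
    transform(chunk, _encoding, callback) {
      try {
        write(chunk, (out) => this.push(out));
        callback();
      } catch (err) {
        callback(err);
      }
    },
    flush(callback) {
      try {
        if (end) end((out) => this.push(out));
        callback();
      } catch (err) {
        callback(err);
      }
    }
  });
}

// Stand-in for a whole-file parser: buffers every chunk, then emits rows at the end.
function wholeFileParser() {
  const chunks = [];
  return through(
    (chunk) => { chunks.push(String(chunk)); },
    (push) => {
      chunks.join('').split('\n').filter(Boolean)
        .forEach((line) => push(line.split(',')));
    }
  );
}

// Collects everything a stream emits and hands it to a callback.
function collect(stream, done) {
  const rows = [];
  stream.on('data', (row) => rows.push(row)).on('end', () => done(rows));
}

collect(Readable.from(['a,b\nc,', 'd\n']).pipe(wholeFileParser()), (rows) => console.log(rows));
```

A custom step that works item by item would pass only the write function, while a parser that needs the full input does its work in end.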

@bredele bredele modified the milestones: Phase 1, Phase 2 Apr 7, 2014
@bredele

bredele commented Apr 7, 2014

Eric, do you want the step handler in the crawler or the job? Personally I think it makes sense to have it inside the crawler (the job doesn't handle streams), as follows:

crawler('http://myfile.txt')
  .step(require('step1'))
  .step(require('step2'));

UPDATE: Having something like the above is not possible because every handler returns a stream.

@ekryski ekryski modified the milestones: Phase 1, Phase 2 Apr 7, 2014
@bredele bredele modified the milestone: Phase 1 Apr 8, 2014
ekryski pushed a commit that referenced this issue May 20, 2014
add scheduler in roach constructor

2 participants