
Fixed scheduler process_spider_output() to yield requests #254

Merged: sibiryakov merged 3 commits into scrapinghub:master from voith:fix-frontera-scheduler on Feb 28, 2017

Conversation

@voith (Contributor) commented on Feb 12, 2017

fixes #253
Here's a screenshot using the same code discussed here.
[screenshot: screen shot 2017-02-12 at 3.13.48 PM]

Nothing seems to break when testing this change manually. The only test that was failing was wrong IMO because it passed a list of requests and items and was only expecting items in return. I have modified that test to make it compatible with this patch.

I've split this PR into three commits:

  • The first commit adds a test to reproduce the bug.
  • The second commit fixes the bug.
  • The third commit fixes the broken test discussed above.

A note about the added tests:

The tests might be a little difficult to understand at first sight. I would recommend reading the following code in order to understand them:

I have simulated the code discussed above in order to write the test.
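To give a rough idea of the shape of the change, here is a minimal sketch. The patched hook lives in frontera/contrib/scrapy/schedulers/frontier.py per the Codecov report below, but the method body and the frontier call are assumptions reconstructed from this thread, not the released source:

```python
from scrapy.http import Request

class SchedulerSpiderMiddleware:  # class name taken from this thread
    def __init__(self, frontier):
        self.frontier = frontier

    def process_spider_output(self, response, result, spider):
        # Collect extracted links for the frontier, but now also yield
        # the requests back to Scrapy instead of swallowing them.
        links = []
        for element in result:
            if isinstance(element, Request):
                links.append(element)
                yield element  # the fix: requests are passed downstream too
            else:
                yield element  # items flow through the pipeline unchanged
        # Report extracted links to the frontier (method name taken from
        # the discussion below; exact signature is an assumption).
        self.frontier.links_extracted(response.request, links)
```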

@codecov-io commented on Feb 12, 2017

Codecov Report

Merging #254 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master     #254   +/-   ##
=======================================
  Coverage   70.15%   70.15%           
=======================================
  Files          68       68           
  Lines        4715     4715           
  Branches      632      632           
=======================================
  Hits         3308     3308           
  Misses       1267     1267           
  Partials      140      140
Impacted Files                                   Coverage Δ
frontera/contrib/scrapy/schedulers/frontier.py   96.69% <100%> (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a442055...3df3244. Read the comment docs.

@sibiryakov (Member)

Hey @voith, sorry for the long absence. It looks absolutely fine. I hope your complex tests work. Ready for merge?

@voith (Contributor, Author) commented on Feb 27, 2017

Hi @sibiryakov, this is ready for merge. You can verify that the test works by looking at the builds.
As always, I added the regression test in the first commit to show the failure and then added the fix in the second commit.

@sibiryakov sibiryakov merged commit 52c7c12 into scrapinghub:master Feb 28, 2017
@sibiryakov (Member)

thank you! 🍺

@voith voith deleted the fix-frontera-scheduler branch February 28, 2017 17:22
@isra17 (Contributor) commented on Apr 18, 2017

This PR broke Frontera's behaviour. Now every yielded request also ends up as a call to add_seeds. The initial behavior swallowed the requests in the middleware, so Scrapy wasn't scheduling them. By yielding the requests from the middleware, it now leaves them to Scrapy, which eventually enqueues them with FronteraScheduler::enqueue_request. Either fix enqueue_request or revert this. For the original issue, shouldn't this have been fixed simply by setting a lower priority on the SchedulerSpiderMiddleware?
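To make the failure mode concrete, here is a hedged sketch of the path described above; the class is heavily simplified, and the helper name and its check are assumptions reconstructed from this thread rather than copied from Frontera:

```python
class FronteraScheduler:  # sketch only; the real class carries much more state
    def __init__(self, frontier):
        self.frontier = frontier

    def enqueue_request(self, request):
        # Once process_spider_output() yields requests, Scrapy's engine hands
        # every one of them to the scheduler, so each extracted link lands here.
        if not self._request_is_redirected(request):
            # Non-redirected requests are treated as seeds, which is why every
            # yielded request now "ends up as a call to add_seeds".
            self.frontier.add_seeds([request])
            return True
        return False  # redirect handling omitted in this sketch

    def _request_is_redirected(self, request):
        # Helper name and check are assumptions for the sketch.
        return request.meta.get('redirect_times', 0) > 0
```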

@isra17 (Contributor) commented on Apr 18, 2017

Ping @sibiryakov

@voith (Contributor, Author) commented on Apr 19, 2017

@isra17 I too had noticed that add_seeds was getting called for every yielded request; I wasn't aware that it was my PR that broke this.
You are right that the problem can be solved by setting a lower priority, but I thought the code added earlier was a mistake. I was expecting @sibiryakov to tell me what could possibly break with this change.
I tested the change I made in this PR and made sure that nothing broke, but I did not anticipate that it would affect performance.

Well, Frontera should have had a test case for this.

@voith (Contributor, Author) commented on Apr 19, 2017

I've opened a PR to revert this change: #273

@isra17 (Contributor) commented on Apr 19, 2017

Don't worry about it; this is not the kind of issue that breaks vanilla Frontera in an obvious manner. I didn't see it until I had a middleware with some logic specific to links_extracted.

@sibiryakov (Member)

#261 probably fixes the problem. The root of the problem is the Scrapy engine's behavior: it calls enqueue_request() every time it gets a request instance from one of the downloader middlewares (but not always).

@sibiryakov (Member)

There is no need to revert this change; it's a step in the right direction: why shouldn't other middlewares be allowed to operate on objects that pass through the Frontera middleware?

@isra17 (Contributor) commented on Apr 24, 2017

Unless I'm missing something, won't #261 end up scheduling the requests twice?
process_spider_output still calls links_extracted for all requests, then passes them to Scrapy, which in turn calls enqueue_request, which itself adds the request to _pending_requests. So now the requests live both in the frontier queue and the spider queue.
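Schematically, the double-scheduling path described above looks like this (a trace reconstructed from the thread, not actual Frontera code):

```python
# 1. SchedulerSpiderMiddleware.process_spider_output(response, result, spider)
#       -> frontier.links_extracted(...)   # request enters the frontier queue
#       -> yield request                   # request also goes back to Scrapy
# 2. Scrapy engine -> FronteraScheduler.enqueue_request(request)
#       -> self._pending_requests.append(request)  # request enters the
#                                                  # scheduler's spider queue
#
# Result: the same request is queued twice, once in the frontier and once in
# the scheduler's in-memory pending queue.
```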

@sibiryakov (Member)

@isra17 This will happen only if you have some middleware in Scrapy yielding all the requests it gets. Normally, this shouldn't happen.

@sibiryakov (Member)

@isra17 I spent more time looking into this, and I think you're right: we will get requests in two places, the temporary queue and the frontier. I'll release the fix soon.

@sibiryakov (Member)

See #276

Successfully merging this pull request may close these issues.

Scrapy's spider middlewares having process_spider_output() don't work with frontera