
Fixed scheduler process_spider_output() to yield requests #254

Merged: sibiryakov merged 3 commits into scrapinghub:master from voith:fix-frontera-scheduler on Feb 28, 2017

Conversation

@voith (Contributor) commented on Feb 12, 2017

fixes #253
Here's a screenshot using the same code discussed here.
[screenshot: screen shot 2017-02-12 at 3.13.48 PM]

Nothing seems to break when testing this change manually. The only test that was failing was wrong IMO because it passed a list of requests and items and was only expecting items in return. I have modified that test to make it compatible with this patch.

I've split this PR into three commits:

  • The first commit adds a test to reproduce the bug.
  • The second commit fixes the bug.
  • The third commit fixes the broken test discussed above.

A note about the added tests:

The tests might be a little difficult to understand at first sight. I would recommend reading the following code in order to understand them:

I have simulated the code discussed above in order to write the test.
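To give a rough idea of the shape of the change, here is a minimal sketch. The patched hook lives in frontera/contrib/scrapy/schedulers/frontier.py per the Codecov report below, but the method body and the frontier call are assumptions reconstructed from this thread, not the released source:

```python
from scrapy.http import Request

class SchedulerSpiderMiddleware:  # class name taken from this thread
    def __init__(self, frontier):
        self.frontier = frontier

    def process_spider_output(self, response, result, spider):
        # Collect extracted links for the frontier, but now also yield
        # the requests back to Scrapy instead of swallowing them.
        links = []
        for element in result:
            if isinstance(element, Request):
                links.append(element)
                yield element  # the fix: requests are passed downstream too
            else:
                yield element  # items flow through the pipeline unchanged
        # Report extracted links to the frontier (method name taken from
        # the discussion below; exact signature is an assumption).
        self.frontier.links_extracted(response.request, links)
```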

@codecov-io commented on Feb 12, 2017

Codecov Report

Merging #254 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master     #254   +/-   ##
=======================================
  Coverage   70.15%   70.15%           
=======================================
  Files          68       68           
  Lines        4715     4715           
  Branches      632      632           
=======================================
  Hits         3308     3308           
  Misses       1267     1267           
  Partials      140      140
Impacted Files                                   Coverage Δ
frontera/contrib/scrapy/schedulers/frontier.py   96.69% <100%> (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a442055...3df3244. Read the comment docs.

@sibiryakov (Member)

Hey @voith, sorry for the long absence. It looks absolutely fine. I hope your complex tests work. Ready for merge?

@voith (Contributor, Author) commented on Feb 27, 2017

Hi @sibiryakov, this is ready for merge. You can verify that the test works by looking at the builds.
As always, I added the regression test in the first commit to show the failure and then added the fix in the second commit.

@sibiryakov sibiryakov merged commit 52c7c12 into scrapinghub:master Feb 28, 2017
@sibiryakov (Member)

thank you! 🍺

@voith voith deleted the fix-frontera-scheduler branch February 28, 2017 17:22
@isra17 (Contributor) commented on Apr 18, 2017

This PR broke Frontera's behaviour. Now every yielded request also ends up as a call to add_seeds. The initial behavior swallowed the requests in the middleware, so Scrapy wasn't scheduling them. By yielding the requests from the middleware, it now leaves them to Scrapy, which eventually enqueues them with FronteraScheduler::enqueue_request. Either fix enqueue_request or revert this. For the original issue, shouldn't this have been fixed simply by setting a lower priority on the SchedulerSpiderMiddleware?
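To make the failure mode concrete, here is a hedged sketch of the path described above; the class is heavily simplified, and the helper name and its check are assumptions reconstructed from this thread rather than copied from Frontera:

```python
class FronteraScheduler:  # sketch only; the real class carries much more state
    def __init__(self, frontier):
        self.frontier = frontier

    def enqueue_request(self, request):
        # Once process_spider_output() yields requests, Scrapy's engine hands
        # every one of them to the scheduler, so each extracted link lands here.
        if not self._request_is_redirected(request):
            # Non-redirected requests are treated as seeds, which is why every
            # yielded request now "ends up as a call to add_seeds".
            self.frontier.add_seeds([request])
            return True
        return False  # redirect handling omitted in this sketch

    def _request_is_redirected(self, request):
        # Helper name and check are assumptions for the sketch.
        return request.meta.get('redirect_times', 0) > 0
```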

@isra17 (Contributor) commented on Apr 18, 2017

Ping @sibiryakov

@voith (Contributor, Author) commented on Apr 19, 2017

@isra17 I too had noticed that add_seeds was getting called for every yielded request; I wasn't aware that it was my PR that broke this.
You are right that the problem can be solved by setting a lower priority, but I thought the code added earlier was a mistake. I was expecting @sibiryakov to tell me what could possibly break with this change.
I tested the change I made in this PR and made sure that nothing broke, but I did not anticipate that it would affect performance.

Well, Frontera should have had a test case for this.

@voith (Contributor, Author) commented on Apr 19, 2017

I've opened a PR to revert this change: #273

@isra17 (Contributor) commented on Apr 19, 2017

Don't worry about it; this is not the kind of issue that breaks vanilla Frontera in an obvious manner. I didn't see it until I had a middleware with some logic specific to links_extracted.

@sibiryakov (Member)

#261 probably fixes the problem. The root of the problem is the Scrapy engine's behavior: it calls enqueue_request() every time it gets a request instance from one of the downloader middlewares (but not always).

@sibiryakov (Member)

There is no need to revert this change; it's a step in the right direction: why shouldn't other middlewares be allowed to operate on objects that pass through the Frontera middleware?

@isra17 (Contributor) commented on Apr 24, 2017

Unless I'm missing something, won't #261 end up scheduling the requests twice?
process_spider_output still calls links_extracted for all requests, then passes them to Scrapy, which in turn calls enqueue_request, which itself adds the request to _pending_requests. So now the requests live both in the frontier queue and the spider queue.
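Schematically, the double-scheduling path described above looks like this (a trace reconstructed from the thread, not actual Frontera code):

```python
# 1. SchedulerSpiderMiddleware.process_spider_output(response, result, spider)
#       -> frontier.links_extracted(...)   # request enters the frontier queue
#       -> yield request                   # request also goes back to Scrapy
# 2. Scrapy engine -> FronteraScheduler.enqueue_request(request)
#       -> self._pending_requests.append(request)  # request enters the
#                                                  # scheduler's spider queue
#
# Result: the same request is queued twice, once in the frontier and once in
# the scheduler's in-memory pending queue.
```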

@sibiryakov (Member)

@isra17 This will happen only if you have some middleware in Scrapy yielding all the requests it gets. Normally, this shouldn't happen.

@sibiryakov (Member)

@isra17 I spent more time looking into this, and I think you're right: we will get requests in two places, the temporary queue and the frontier. I'll release the fix soon.

@sibiryakov (Member)

See #276

Successfully merging this pull request may close these issues.

Scrapy's spider middlewares having process_spider_output() don't work with frontera