
Commit
update docs for curl command support of self.crawl and new css-selector helper

binux committed Mar 6, 2015
1 parent ddfe678 commit 3a01cb2
Showing 5 changed files with 28 additions and 15 deletions.
9 changes: 9 additions & 0 deletions docs/apis/self.crawl.md
@@ -46,6 +46,15 @@ self.crawl(url, **kwargs)
* `taskid` - unique id for each task. _default: md5(url)_ , can be overridden by defining your own `def get_taskid(self, task)` (see the sketch after this list)
* `force_update` - force update task params when task is in `ACTIVE` status.
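
One way to override the default `taskid`, as a hedged sketch (it assumes the `md5string` helper from `pyspider.libs.utils` and a task dict shaped like pyspider's), is to hash the POST data together with the URL so that the same URL submitted with different bodies becomes distinct tasks:

```
import json
from pyspider.libs.utils import md5string

def get_taskid(self, task):
    # Hash url + POST data instead of url alone, so the same URL
    # submitted with different bodies is scheduled as separate tasks.
    return md5string(task['url'] + json.dumps(task['fetch'].get('data', '')))
```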

cURL command
------------

`self.crawl(curl_command)`

cURL is a command-line tool for making HTTP requests. You can get a cURL command from Chrome DevTools > Network panel: right-click a request and choose `Copy as cURL`.

You can use a cURL command as the first argument of `self.crawl`. It will parse the command and make the HTTP request just as curl does.
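
A minimal sketch (the URL, headers, and callback name are hypothetical):

```
# The whole copied command is parsed into url / method / headers / body;
# because --data is present, the request is sent as a POST.
self.crawl("curl 'http://example.com/api/list?page=1'"
           " -H 'User-Agent: Mozilla/5.0'"
           " -H 'Cookie: session=0123abc'"
           " --data 'sort=rating'",
           callback=self.json_page)
```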

@config(**kwargs)
-----------------
Default kwargs of `self.crawl` for the decorated method. Any `self.crawl` that uses this method as its callback will use this config.
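
For instance, in this sketch (the values are illustrative), every task whose callback is `index_page` inherits the configured defaults:

```
@config(age=10 * 24 * 60 * 60, priority=2)
def index_page(self, response):
    # Every self.crawl(..., callback=self.index_page) elsewhere in the
    # handler now defaults to age=10 days and priority=2.
    for each in response.doc('a[href^="http"]').items():
        self.crawl(each.attr.href, callback=self.detail_page)
```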
Binary file modified docs/imgs/css_selector_helper.png
7 changes: 7 additions & 0 deletions docs/tutorial/AJAX-and-more-HTTP.md
@@ -90,6 +90,13 @@ You can get this with [Chrome Developer Tools](https://developer.chrome.com/devtools)

In most cases, the last thing you need to do is copy the right URL + method + headers + body from the Network panel.
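
Translated into `self.crawl` parameters, such a hand-copied request might look like this sketch (URL, headers, and body are made up for illustration):

```
self.crawl('http://example.com/ajax/load',
           method='POST',
           headers={'X-Requested-With': 'XMLHttpRequest'},
           data={'page': 2},
           callback=self.json_page)
```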

cURL command
------------

`self.crawl` supports a `cURL` command as its argument to make the HTTP request. It will parse the arguments of the command and use them as fetch parameters.

With `Copy as cURL` on a request, you get a `cURL` command that you can paste into `self.crawl(command)` to make crawling easy.
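
For example (the request is hypothetical), the pasted command needs no editing at all:

```
# Pasted verbatim from "Copy as cURL"; pyspider replays the same request.
command = ("curl 'http://example.com/ajax/list?page=2'"
           " -H 'X-Requested-With: XMLHttpRequest'"
           " -H 'Referer: http://example.com/'")
self.crawl(command, callback=self.json_page)
```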

HTTP Method
-----------

22 changes: 9 additions & 13 deletions docs/tutorial/HTML-and-CSS-Selector.md
@@ -105,26 +105,22 @@ pyspider provides a tool called `CSS selector helper` to make it easier to genera…

![CSS Selector helper](imgs/css_selector_helper.png)

The element will be highlighted in yellow when you mouse over it. When you click it, all elements with the same CSS Selector are framed in red and the pattern is added at the cursor position in your code. Add the following code and put the cursor between the two quotation marks:
The element will be highlighted in yellow while you hover over it. When you click it, a pre-selected CSS Selector pattern is shown on the bar above. You can edit the features used to locate the element and add the pattern to your source code.

```
self.crawl(response.doc('').attr.href, callback=self.index_page)
```

click "Next »", selector pattern should have been added to your code:
click "Next »" in the page and add selector pattern to your code:

```
def index_page(self, response):
    for each in response.doc('a[href^="http"]').items():
        if re.match("http://www.imdb.com/title/tt\d+/$", each.attr.href):
            self.crawl(each.attr.href, callback=self.detail_page)
    self.crawl(response.doc('HTML>BODY#styleguide-v2>DIV#wrapper>DIV#root>DIV#pagecontent>DIV#content-2-wide>DIV#main>DIV.leftright>DIV#right>SPAN.pagination>A').attr.href, callback=self.index_page)
    self.crawl(response.doc('#right a').attr.href, callback=self.index_page)
```

Click `run` again and move to the next page; we find that "« Prev" has the same selector pattern as "Next »". With the above code, you may find that pyspider selected the link of "« Prev", not "Next »". A solution is to select both of them:

```
self.crawl([x.attr.href for x in response.doc('HTML>BODY#styleguide-v2>DIV#wrapper>DIV#root>DIV#pagecontent>DIV#content-2-wide>DIV#main>DIV.leftright>DIV#right>SPAN.pagination>A').items()], callback=self.index_page)
self.crawl([x.attr.href for x in response.doc('#right a').items()], callback=self.index_page)
```

Extracting Information
@@ -138,17 +134,17 @@ Add keys you need to the result dict and collect values using the `CSS selector helper` r…
```
def detail_page(self, response):
    return {
        "url": response.url,
        "title": response.doc('HTML>BODY#styleguide-v2>DIV#wrapper>DIV#root>DIV#pagecontent>DIV#content-2-wide>DIV#maindetails_center_top>DIV.article.title-overview>DIV#title-overview-widget>TABLE#title-overview-widget-layout>TBODY>TR>TD#overview-top>H1.header>SPAN.itemprop').text(),
        "rating": response.doc('HTML>BODY#styleguide-v2>DIV#wrapper>DIV#root>DIV#pagecontent>DIV#content-2-wide>DIV#maindetails_center_top>DIV.article.title-overview>DIV#title-overview-widget>TABLE#title-overview-widget-layout>TBODY>TR>TD#overview-top>DIV.star-box.giga-star>DIV.star-box-details>STRONG>SPAN').text(),
        "director": [x.text() for x in response.doc('div[itemprop="director"] span[itemprop="name"]').items()],
        "title": response.doc('.header > [itemprop="name"]').text(),
        "rating": response.doc('.star-box-giga-star').text(),
        "director": [x.text() for x in response.doc('[itemprop="director"] span').items()],
    }
```

Note that `CSS Selector helper` may not always work (directors and stars share the same pattern). You can write the selector pattern manually with tools like [Chrome Dev Tools](https://developer.chrome.com/devtools):
Note that `CSS Selector helper` may not always work. You can write the selector pattern manually with tools like [Chrome Dev Tools](https://developer.chrome.com/devtools):

![inspect element](imgs/inspect_element.png)

You don't need to write every ancestral element in the selector pattern; only the elements that differentiate the target from unwanted elements are enough. However, it takes experience in scraping or web development to know which attribute is important and can be used as a locator. You can also test a CSS Selector in the JavaScript Console by using `$$`, e.g. `$$('div[itemprop="director"] span[itemprop="name"]')`
You don't need to write every ancestral element in the selector pattern; only the elements that differentiate the target from unwanted elements are enough. However, it takes experience in scraping or web development to know which attribute is important and can be used as a locator. You can also test a CSS Selector in the JavaScript Console by using `$$`, e.g. `$$('[itemprop="director"] span')`

Running
-------
5 changes: 3 additions & 2 deletions pyspider/run.py
@@ -366,8 +366,9 @@ def phantomjs(ctx, phantomjs_path, port):
        os.path.dirname(pyspider.__file__), 'fetcher/phantomjs_fetcher.js')
    try:
        _phantomjs = subprocess.Popen([phantomjs_path,
                                       phantomjs_fetcher,
                                       str(port)])
                                       # allow any SSL/TLS protocol version
                                       # (PhantomJS defaults to SSLv3)
                                       "--ssl-protocol=any",
                                       phantomjs_fetcher,
                                       str(port)])
    except OSError:
        return None

