Pholcidae, commonly known as cellar spiders, are a spider family in the suborder Araneomorphae.
Pholcidae is a tiny Python module that allows you to write your own crawl spider fast and easily.
See the end of this README for the changes in v2.
- Python 2.7 or higher
```shell
pip install git+https://github.com/bbrodriges/pholcidae.git
```
```python
from pholcidae2 import Pholcidae

class MySpider(Pholcidae):

    def crawl(self, data):
        print(data.url)

settings = {'domain': 'www.test.com', 'start_page': '/sitemap/'}

spider = MySpider()
spider.extend(settings)
spider.start()
```
Settings must be passed as a dictionary to the `extend` method of the crawler.
Params you can use:
Required
- `domain` string - defines the domain whose pages will be parsed. Specify it without a trailing slash.
Additional
- `start_page` string - URL which will be used as the entry point to the parsed site. Default: `/`
- `protocol` string - defines the protocol to be used by the crawler. Default: `http://`
- `valid_links` list - list of regular expression strings (or full URLs) used to filter site URLs to be passed to the `crawl()` method. Default: `['(.*)']`
- `append_to_links` string - text to be appended to each link before fetching it. Default: `''`
- `exclude_links` list - list of regular expression strings (or full URLs) used to filter site URLs which must not be checked at all. Default: `[]`
- `cookies` dict - a dictionary of string key-values representing cookie names and cookie values to be passed with each site URL request. Default: `{}`
- `headers` dict - a dictionary of string key-values representing header names and header values to be passed with each site URL request. Default: `{}`
- `follow_redirects` bool - whether the crawler should follow 30x redirects. Default: `True`
- `precrawl` string - name of a function to be called before crawling starts. Default: `None`
- `postcrawl` string - name of a function to be called after crawling ends. Default: `None`
- `callbacks` dict - a dictionary mapping URL patterns from the `valid_links` list to string names of self-defined methods that receive the parsed data. Default: `{}`
- `proxy` dict - a dictionary mapping protocol names to URLs of proxies, e.g. `{'http': 'http://user:passwd@host:port'}`. Default: `{}`
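Putting the options above together, a fuller settings dict might look like the following sketch. All concrete values (domain, URL patterns, credentials, proxy address) are illustrative placeholders, not values from this README:

```python
# Illustrative settings dict combining required and optional params.
# Every concrete value here is a placeholder.
settings = {
    'domain': 'www.test.com',                          # required, no trailing slash
    'start_page': '/sitemap/',                         # crawl entry point
    'protocol': 'http://',
    'valid_links': ['/articles/(.*)'],                 # only matching URLs reach crawl()
    'exclude_links': ['/admin/(.*)'],                  # never fetched at all
    'cookies': {'session': 'abc123'},
    'headers': {'User-Agent': 'MySpider/1.0'},
    'follow_redirects': True,
    'callbacks': {'/articles/(.*)': 'parse_article'},  # pattern -> method name
    'proxy': {'http': 'http://user:passwd@host:port'},
}
```

Note that `callbacks` maps a pattern string to the *name* of a method, not the method object itself.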
New in v2:
- `silent_links` list - list of regular expression strings (or full URLs) used to filter site URLs whose page data must not be passed to a callback function, while URLs are still collected from those pages. Default: `[]`
- `valid_mimes` list - list of strings representing valid MIME types. Only URLs identified as one of these MIME types will be parsed. Default: `[]`
- `threads` int - number of concurrent page-fetching threads. Default: `1`
- `with_lock` bool - whether or not to use a lock while syncing URLs. It slightly decreases crawling speed but eliminates race conditions. Default: `True`
- `hashed` bool - whether or not to store parsed URLs as shortened SHA1 hashes. The crawler may run a little slower but consumes much less memory. Default: `False`
- `respect_robots_txt` bool - whether or not to read the `robots.txt` file before starting and add its `Disallow` directives to the `exclude_links` list. Default: `True`
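The v2 options can be combined in the same settings dict; a sketch with placeholder values:

```python
# Illustrative v2-only settings (all values are placeholders):
v2_settings = {
    'domain': 'www.test.com',
    'silent_links': ['/tag/(.*)'],   # harvest links from these pages, skip callbacks
    'valid_mimes': ['text/html'],    # ignore non-HTML responses
    'threads': 4,                    # fetch pages concurrently
    'with_lock': True,               # safer when threads > 1
    'hashed': False,                 # trade memory for a bit of speed
    'respect_robots_txt': True,      # honor Disallow directives
}
```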
When inheriting from the Pholcidae class, you can override the built-in `crawl()` method to retrieve data gathered from a page. Any response object will contain certain attributes depending on page parsing success.
Successful parsing
- `body` string - raw HTML/XML/XHTML etc. representation of the page.
- `url` string - URL of the parsed page.
- `headers` AttrDict - dictionary of response headers.
- `cookies` AttrDict - dictionary of response cookies.
- `status` int - HTTP status of the response (e.g. 200).
- `match` list - matched part of the `valid_links` regex.
Unsuccessful parsing
- `body` string - raw representation of the error.
- `status` int - HTTP status of the response (e.g. 400). Default: 500
- `url` string - URL of the parsed page.
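A `crawl()` override can branch on these attributes. The sketch below pulls that logic into a standalone function and uses a plain `SimpleNamespace` as a stand-in for the real response object, so it runs without the crawler; only the attribute names (`url`, `status`, `body`) come from the lists above:

```python
from types import SimpleNamespace

def handle_page(data):
    """Sketch of the branching a crawl() override might do."""
    if data.status == 200:
        # Successful parse: body, url, headers, cookies, match are available.
        return 'parsed {} ({} bytes)'.format(data.url, len(data.body))
    # Unsuccessful parse: body holds the raw error representation.
    return 'failed {} with status {}'.format(data.url, data.status)

# Stand-ins for the response objects the crawler would pass in:
ok_page = SimpleNamespace(url='http://www.test.com/', status=200, body='<html></html>')
bad_page = SimpleNamespace(url='http://www.test.com/missing', status=404, body='Not Found')
```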
See `test.py`
Pholcidae does not contain any built-in XML, XHTML, HTML or other parser. You can manually add any response body parsing methods using whatever Python libraries you want.
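For instance, on Python 3 the standard library's `html.parser` can pull links out of a page body with no third-party dependency. This is a minimal sketch of such a parsing helper, not part of Pholcidae itself:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkExtractor()
# In a crawl() override you would feed data.body instead:
parser.feed('<a href="/page1">one</a><a href="/page2">two</a>')
print(parser.links)  # ['/page1', '/page2']
```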
Major changes have been made in version 2.0:
- All code has been completely rewritten from scratch
- Fewer abstractions = more speed
- Threads support
- Matches in page data are now a list and no longer optional
- The `stay_in_domain` option has been removed. The crawler cannot break out of the initial domain anymore.
There are some minor code changes which break backward compatibility between versions 1.x and 2.0:
- You need to explicitly pass settings to the `extend` method of your crawler
- The `autostart` option has been removed. You must call `spider.start()` explicitly
- The module is now called `pholcidae2`