Second Defense Version

Our web crawler now supports a microservice architecture! The diagram is below:

Presentation

The visual complement to our oral report is here

Usage

To start everything, launch the task_manager and nysh_crawler executables. Optionally, to start the HTTP server for the search engine, also launch nysh_search.

Task Manager

./task_manager <config-file-name>

The task manager listens on http://localhost:18082 by default. An option to configure the address and port is on the to-do list.

An example config file with its required fields:

seed_file = data/seed.txt

allowed_domains = wikipedia.org
allowed_domains = google.com

allowed_langs = en
allowed_langs = uk

db_address = mongodb://localhost:27017/
db_name = nysh_pages
col_name = pages_0_3
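
To make the format concrete, here is a minimal Python sketch of how a file like the one above can be read. It is only an illustration: how task_manager itself parses the file is not shown here, and the function name parse_config and the choice to store every field as a list are assumptions made for this example. The point it shows is that repeated keys such as allowed_domains and allowed_langs accumulate into several values instead of overwriting each other.

```python
from collections import defaultdict


def parse_config(text: str) -> dict[str, list[str]]:
    """Parse `key = value` lines; a key that repeats collects all its values.

    Hypothetical helper for illustration, not the task manager's own parser.
    """
    fields: dict[str, list[str]] = defaultdict(list)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines only separate groups of fields
        # Split on the first "=" only, so values such as URIs stay intact.
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'key = value', got: {line!r}")
        fields[key.strip()].append(value.strip())
    return dict(fields)


# The example config from this README, copied verbatim.
EXAMPLE = """\
seed_file = data/seed.txt

allowed_domains = wikipedia.org
allowed_domains = google.com

allowed_langs = en
allowed_langs = uk

db_address = mongodb://localhost:27017/
db_name = nysh_pages
col_name = pages_0_3
"""

config = parse_config(EXAMPLE)
for key, values in config.items():
    print(key, "->", values)
```

Single-valued fields such as db_name come back as one-element lists here, which keeps the sketch uniform at the cost of an extra index when reading them.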

It is recommended to terminate the task manager with a /terminate/ request rather than with Ctrl+C (SIGINT). On a /terminate/ request, the link queue and visited pages are dumped into seed_file, so crawling can be continued from the same state after a restart.
Example of a /terminate/ request:

Crawler

To start crawling, launch the crawler executables with the task manager address as a CLI argument:

./nysh_crawler <pages-per-single-get-request> <task-manager-address>

A crawler stops automatically when its task manager stops.
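
The two launch commands above can be combined in a small helper script. The following Python sketch is one possible way to do it, under several assumptions: the config file name crawler.cfg, the batch size of 10 pages, and the crawler count are placeholders, and passing the address as http://localhost:18082 (scheme included) is a guess at what <task-manager-address> expects. The launch function is defined but not called, so running the sketch only prints the commands.

```python
import shlex
import subprocess

# Default task manager address stated in this README.
TASK_MANAGER_ADDRESS = "http://localhost:18082"


def task_manager_command(config_path: str) -> list[str]:
    """Build: ./task_manager <config-file-name>"""
    return ["./task_manager", config_path]


def crawler_command(pages_per_request: int,
                    address: str = TASK_MANAGER_ADDRESS) -> list[str]:
    """Build: ./nysh_crawler <pages-per-single-get-request> <task-manager-address>"""
    return ["./nysh_crawler", str(pages_per_request), address]


def launch(config_path: str, crawler_count: int,
           pages_per_request: int) -> list[subprocess.Popen]:
    """Start one task manager and several crawlers as child processes.

    Assumption: crawlers tolerate being started right after the task
    manager; add a delay or a readiness check if they do not.
    """
    processes = [subprocess.Popen(task_manager_command(config_path))]
    for _ in range(crawler_count):
        processes.append(subprocess.Popen(crawler_command(pages_per_request)))
    return processes


# Print the commands instead of running them (placeholder values).
print(shlex.join(task_manager_command("crawler.cfg")))
print(shlex.join(crawler_command(10)))
```

Calling launch("crawler.cfg", 4, 10) from the directory that contains the executables would start one task manager and four crawlers.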

Searcher

After starting

./nysh_search <database-name> <collection-name> <database-uri> <cli-mode ("cli"|null)>

a RESTful server will start on http://localhost:18081. The option to configure the port is coming soon. If the fourth argument is set to "cli", the command line interface for full-text search in the database will be launched.
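
Note that nysh_search takes its database arguments in a different order than the fields appear in the task manager config (name, collection, then URI). The Python sketch below builds the command line from the values in the example config to make that order explicit. The helper name search_command is hypothetical, and it assumes that "null" in the usage line means the fourth argument is simply left off.

```python
import shlex


def search_command(db_name: str, col_name: str, db_uri: str,
                   cli: bool = False) -> list[str]:
    """Build: ./nysh_search <database-name> <collection-name> <database-uri> [cli]"""
    command = ["./nysh_search", db_name, col_name, db_uri]
    if cli:
        # Fourth argument "cli" requests the command line interface.
        command.append("cli")
    return command


# Values taken from the example task manager config in this README.
server = search_command("nysh_pages", "pages_0_3", "mongodb://localhost:27017/")
interactive = search_command("nysh_pages", "pages_0_3",
                             "mongodb://localhost:27017/", cli=True)

print(shlex.join(server))
print(shlex.join(interactive))
```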

An example of a search request:

Releases: rwmutel/nyshporka

v0.2

v0.1