
Releases: rwmutel/nyshporka

v0.2

04 May 10:19
a47c583
Pre-release

Second Defense Version

Our web crawler now supports a microservice architecture! The diagram is below:

(Diagram: microservice architecture of the crawler)

Presentation

The visual complement to our oral report is here.

Usage

To start everything, launch the task_manager and crawler executables. Optionally, to start the HTTP server for the search engine, also launch nysh_search.

Task Manager

./task_manager <config-file-name>

The default address and port are http://localhost:18082. The option to configure them is on the to-do list.

An example of a config file and its necessary fields:

seed_file = data/seed.txt

allowed_domains = wikipedia.org
allowed_domains = google.com

allowed_langs = en
allowed_langs = uk

db_address = mongodb://localhost:27017/
db_name = nysh_pages
col_name = pages_0_3
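
Assuming the config above is saved as config.cfg (the file name here is arbitrary), the task manager is launched with:

./task_manager config.cfg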

It is recommended to terminate the task manager with a /terminate/ request rather than a Ctrl+C SIGINT: in that case, the link queue and the set of visited pages are dumped into seed_file, so crawling can be continued from the same state after a restart.
Example of a /terminate/ request:

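A minimal sketch of such a request with curl, assuming the task manager listens on its default address and the endpoint accepts a plain GET (the exact HTTP method is an assumption):

curl http://localhost:18082/terminate/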

Crawler

To start crawling, launch one or more crawler executables, passing the number of pages per GET request and the task manager's address as CLI arguments:

./nysh_crawler <pages-per-single-get-request> <task-manager-address>
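
For example, assuming the task manager runs on its default address and a batch size of 10 pages per GET request (the batch size is purely illustrative):

./nysh_crawler 10 http://localhost:18082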

The crawler will stop automatically when its task manager stops.

Searcher

After starting

./nysh_search <database-name> <collection-name> <database-uri> <cli-mode ("cli"|null)>

a RESTful server will start on http://localhost:18081. The option to configure the port is coming soon. If the fourth argument is set to "cli", the command line interface for full-text search in the database will be launched.
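
For example, reusing the database settings from the task manager config above and enabling the command line interface:

./nysh_search nysh_pages pages_0_3 mongodb://localhost:27017/ cli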

An example of a search request:

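A hypothetical sketch with curl, assuming a search endpoint that takes the query as a URL parameter (the actual route and parameter name may differ from this guess):

curl "http://localhost:18081/search?q=computer+architecture"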

v0.1

03 May 20:16
Pre-release

This version is the Nyshporka Web Crawler MVP presented at the first project defense of the Architecture of Computer Systems 2023 course. It is more of a sandbox in which we demonstrated the sequential version of our web crawler and experimented with technologies we plan to use in the future (web scraping, database management, etc.).
For more info and prerequisites, don't hesitate to get in touch with the authors :)

Presentation is here!