Releases: rwmutel/nyshporka
v0.2
Second Defense Version
Our web crawler now supports a microservice architecture! The diagram is below:
Presentation
The visual complement to our oral report is here
Usage
To start everything, you have to launch the task_manager and crawler executables. Optionally, to start the HTTP server for the search engine, also launch nysh_search.
Task Manager
./task_manager <config-file-name>
The default address and port are http://localhost:18082. The option to configure them is in the to-do list.
An example of a config file and its necessary fields:
seed_file = data/seed.txt
allowed_domains = wikipedia.org
allowed_domains = google.com
allowed_langs = en
allowed_langs = uk
db_address = mongodb://localhost:27017/
db_name = nysh_pages
col_name = pages_0_3
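To make the format above concrete, here is a minimal Python sketch of how such a file could be read. It is only an illustration and not the parser task_manager itself uses: the function name parse_config, the treatment of '#' as a comment marker, and the REQUIRED tuple are assumptions made for this example. The point it shows is that a repeated key such as allowed_domains accumulates into a list.

```python
from collections import defaultdict


def parse_config(text):
    """Parse `key = value` lines; a repeated key collects all of its values."""
    values = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'key = value', got: {raw!r}")
        values[key.strip()].append(value.strip())
    return dict(values)


# The fields listed as necessary in the example config above.
REQUIRED = ("seed_file", "allowed_domains", "allowed_langs",
            "db_address", "db_name", "col_name")

EXAMPLE = """\
seed_file = data/seed.txt
allowed_domains = wikipedia.org
allowed_domains = google.com
allowed_langs = en
allowed_langs = uk
db_address = mongodb://localhost:27017/
db_name = nysh_pages
col_name = pages_0_3
"""

config = parse_config(EXAMPLE)
missing = [key for key in REQUIRED if key not in config]

print(config["allowed_domains"])  # ['wikipedia.org', 'google.com']
print(missing)                    # []
```

A check like the `missing` list is handy before starting a long crawl, since a forgotten field is cheaper to catch up front than after the task manager is already running.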
It is recommended to terminate the task manager with a /terminate/ request instead of a Ctrl+C (SIGINT) signal: on /terminate/, the link queue and visited pages are dumped into seed_file, so crawling can be continued from the same state after a restart.
Example of a /terminate/ request:
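One way to send this request from a script, as a minimal Python sketch. The HTTP method is not stated in these notes, so GET is an assumption here and may need to be changed. The helper names terminate_url and request_termination are made up for this example. The address is the default http://localhost:18082 mentioned above.

```python
import urllib.request

# Default task manager address from the notes above.
TASK_MANAGER = "http://localhost:18082"


def terminate_url(base=TASK_MANAGER):
    """Build the /terminate/ URL for a given task manager address."""
    return base.rstrip("/") + "/terminate/"


def request_termination(base=TASK_MANAGER, method="GET"):
    """Send the terminate request and return the HTTP status code.

    Assumption: the method is GET; pass method="POST" (or another verb)
    if the task manager expects something else.
    """
    req = urllib.request.Request(terminate_url(base), method=method)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


print(terminate_url())  # http://localhost:18082/terminate/
```

Calling request_termination() only makes sense while a task manager is running; without one, urlopen raises urllib.error.URLError.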
Crawler
To start crawling, launch the crawler executables with the task manager address as a CLI argument:
./nysh_crawler <pages-per-single-get-request> <task-manager-address>
A crawler stops automatically when its task manager stops.
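Since several crawlers can work against one task manager, a small launcher can save some typing. The Python sketch below only assembles the command line shown above and starts it several times. The helper names, the batch size of 10 pages and the exact form of the address (with the http:// prefix) are assumptions for illustration, not values prescribed by the project.

```python
import subprocess


def crawler_command(pages_per_request, task_manager_address,
                    binary="./nysh_crawler"):
    """Build the argv list: <pages-per-single-get-request> <task-manager-address>."""
    return [binary, str(pages_per_request), task_manager_address]


def launch_crawlers(count, pages_per_request, task_manager_address):
    """Start `count` crawler processes and return their Popen handles."""
    cmd = crawler_command(pages_per_request, task_manager_address)
    return [subprocess.Popen(cmd) for _ in range(count)]


def wait_for_crawlers(processes):
    """Block until every crawler exits, e.g. after the task manager stops."""
    return [proc.wait() for proc in processes]


# Assumed values: 10 pages per request, default task manager address.
print(crawler_command(10, "http://localhost:18082"))
# ['./nysh_crawler', '10', 'http://localhost:18082']
```

A typical use would be `wait_for_crawlers(launch_crawlers(4, 10, "http://localhost:18082"))`, started after the task manager is already up.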
Searcher
Start the searcher with
./nysh_search <database-name> <collection-name> <database-uri> <cli-mode ("cli"|null)>
A RESTful server will start on http://localhost:18081. The option to configure the port is coming soon. If the fourth argument is set to "cli", the command-line interface for full-text search in the database will be launched.
An example of a search request:
v0.1
This version is the Nyshporka Web Crawler MVP, presented at the first course project defense for Architecture of Computer Systems 2023. It is more of a sandbox where we demonstrated the sequential version of our web crawler and experimented with technologies we plan to use in the future (web scraping, database management, etc.).
For more info and prerequisites, don't hesitate to get in touch with the authors :)