Initial Release: Dark Web Monitoring Webcrawler v1.0
We are excited to announce the initial release of the Dark Web Monitoring Webcrawler, a robust, scalable, and secure tool designed to monitor activities on the dark web. This repository provides a framework for collecting and analyzing dark web data with a focus on privacy and security.
Core Features
- Docker-Based Deployment: Quick and seamless setup using Docker Compose to orchestrate services.
- Advanced Search Functionality: Comprehensive search capabilities with options to filter and refine results.
- Data Visualization: Generates visual representations of crawled data for easier analysis.
- Customizable Search Parsers: Supports integration of custom parsers to enhance data extraction from specific websites.
- Integrated Machine Learning Models: Uses NLP and machine learning models for content categorization, search relevance, and detection of data patterns.
Prerequisites
Ensure the following tools are installed on your system:
- Python
- Docker
- Docker Compose
Installation
Step 1: Clone the Repository
git clone https://github.com/yourusername/dark-web-monitoring-webcrawler.git
cd dark-web-monitoring-webcrawler
Step 2: Build and Start the Docker Services
docker-compose up --build
This command will build and start the following services:
- API Service (api): The main webcrawler service.
- MongoDB (mongo): Stores crawled data.
- Redis (redis_server): Manages caching and task queuing.
- Tor Containers (tor-extend-*): Ensures robust anonymity by routing traffic through different Tor exit nodes.
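Once the services are up, a quick reachability check can confirm that the data stores accept connections. The following is a minimal sketch, not part of the repository. It assumes MongoDB and Redis are published on their standard default ports (27017 and 6379) on localhost, which the project's docker-compose.yml may or may not do. The names DEFAULT_PORTS, is_port_open, and check_services are illustrative.

```python
import socket

# Assumption: standard default ports. The project's docker-compose.yml may
# publish different ports, or none at all, so adjust to match it.
DEFAULT_PORTS = {
    "mongo": 27017,        # MongoDB default port
    "redis_server": 6379,  # Redis default port
}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def check_services(host="localhost", ports=DEFAULT_PORTS):
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'reachable' if up else 'unreachable'}")
```

An open port only shows that something is listening. It does not prove the service behind it is healthy.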
Usage
Running the Webcrawler
Option 1: Direct Execution
- Copy the app/libs/nltk_data folder to the appropriate directory:
  - Windows: appdata directory.
  - Linux: Home directory.
- Navigate to the Genesis-Crawler/app/ directory.
- Start the crawler:
python main_direct.py
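The nltk_data copy step in Option 1 can be scripted. The sketch below is one possible approach, not a script shipped with the project. It assumes the Windows "appdata directory" means the folder named by the APPDATA environment variable, and that the script is run from the directory containing app/. The function names are illustrative.

```python
import os
import shutil
import sys
from pathlib import Path


def nltk_data_target(platform, env, home):
    """Return the destination path for the nltk_data folder."""
    if platform.startswith("win"):
        # Windows: the appdata directory (assumed here to be %APPDATA%).
        return Path(env["APPDATA"]) / "nltk_data"
    # Linux: the home directory.
    return Path(home) / "nltk_data"


def copy_nltk_data(source="app/libs/nltk_data", target=None):
    """Copy the bundled nltk_data folder to the platform-specific location."""
    if target is None:
        target = nltk_data_target(sys.platform, os.environ, Path.home())
    # dirs_exist_ok requires Python 3.8+; existing files are overwritten.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return Path(target)


if __name__ == "__main__":
    print(f"Copied nltk_data to {copy_nltk_data()}")
```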
Option 2: Using Docker
- Use Docker Compose to build and start the webcrawler:
docker-compose up --build
Configuring Tor Instances
Each Tor container is configured to run as a separate instance, routing traffic through different Tor exit nodes. This increases anonymity and reduces the chances of IP bans.
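To illustrate how a client might spread requests across several Tor instances, here is a minimal round-robin sketch. It does not reflect how the crawler itself selects a proxy. The container names tor-extend-1 through tor-extend-3 are hypothetical, and each container is assumed to expose a SOCKS listener on port 9050 (Tor's default SocksPort).

```python
from itertools import cycle


def build_proxy_pool(hosts, port=9050):
    """Build one requests-style proxies mapping per Tor container."""
    pool = []
    for host in hosts:
        # socks5h (rather than socks5) asks the proxy to resolve hostnames,
        # which .onion addresses require.
        url = f"socks5h://{host}:{port}"
        pool.append({"http": url, "https": url})
    return pool


# Hypothetical container names following the tor-extend-* pattern.
TOR_HOSTS = ["tor-extend-1", "tor-extend-2", "tor-extend-3"]
proxy_cycle = cycle(build_proxy_pool(TOR_HOSTS))


def next_proxies():
    """Return the proxies mapping for the next Tor instance, round-robin."""
    return next(proxy_cycle)
```

With the requests library installed with SOCKS support (pip install requests[socks]), a call such as requests.get(url, proxies=next_proxies(), timeout=30) would send each request through the next container in turn.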
Scaling
You can scale the number of Tor instances by modifying the docker-compose.yml file and adding more tor-extend-* services as needed.
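Adding many near-identical service entries by hand is error-prone, so a small generator can help. The sketch below is a hedged illustration. It assumes the services are numbered tor-extend-1, tor-extend-2, and so on, and that every instance shares the same settings. The body lines are placeholders, to be replaced with the settings copied from an existing tor-extend-* entry in docker-compose.yml.

```python
def tor_service_names(count, prefix="tor-extend-"):
    """Return service names prefix1..prefixN (numbering scheme is assumed)."""
    return [f"{prefix}{i}" for i in range(1, count + 1)]


def render_tor_services(count, body_lines, indent="  "):
    """Render `count` Compose service entries that share one body.

    body_lines holds the lines found under an existing tor-extend-* service,
    with their own relative indentation preserved.
    """
    out = []
    for name in tor_service_names(count):
        out.append(f"{indent}{name}:")
        out.extend(f"{indent * 2}{line}" for line in body_lines)
    return "\n".join(out)


if __name__ == "__main__":
    # Placeholder body: copy the real settings from the existing service.
    body = ["image: PLACEHOLDER_IMAGE", "restart: unless-stopped"]
    print(render_tor_services(3, body))
```

The printed fragment is meant to be pasted under the services: key of docker-compose.yml and reviewed before use.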
Project Structure
- api/: Contains the webcrawler source code.
- data/db/: Directory where MongoDB stores data.
- dockerFiles/: Dockerfiles for building custom images.
Contribution
We welcome contributions to improve the Dark Web Monitoring Webcrawler. To contribute:
- Fork the repository.
- Create a new branch:
git checkout -b feature-branch
- Commit your changes:
git commit -m "Add a new feature"
- Push your changes and open a pull request.
License
This project is licensed under the MIT License, making it free and open for further development.
Disclaimer
The Dark Web Monitoring Webcrawler is intended for research and educational purposes only. Users are responsible for ensuring compliance with local laws and regulations.
GitHub Repository: Dark Web Monitoring Webcrawler