Ozon Scraper

This project scrapes product data, reviews, and review media from the Ozon website. It uses Selenium for scraping and PostgreSQL for data storage.

🛠️ Requirements

Python 3.12+
Docker & Docker Compose
Chrome Browser
Chrome Driver for your Chrome version
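
ChromeDriver generally needs to share its major version with the installed Chrome. The sketch below is not part of this project; it only shows one way to compare the two version strings. The helper names and the sample strings are illustrative assumptions.

```python
import re


def major_version(version_output: str) -> int:
    """Return the major version found in output such as 'Google Chrome 126.0.6478.126'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output: str, driver_output: str) -> bool:
    """True when Chrome and ChromeDriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)
```

Feed it the text printed by the `--version` flag of your Chrome binary and of `chromedriver`. The binary name for Chrome differs between platforms.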

🏗️ Installation

Create a virtual environment, activate it, and install the dependencies.

python3.12 -m venv env
source env/bin/activate
pip install --upgrade pip && pip install -r requirements.txt

Copy .env.dist to .env and fill in the required values.
```
cp .env.dist .env
```

Build the Docker images for the project.

docker build -t scraper-base -f contrib/docker/scraper/Dockerfile .
docker compose -p ozon-scraper --env-file .env -f contrib/docker/docker-compose.yml build

Start the Docker containers. This will initialize the database and run migrations.

docker compose -p ozon-scraper --env-file .env -f contrib/docker/docker-compose.yml up -d

Update DATABASE_URL in the .env file so that it points to the database through the exposed port: change the database host to localhost.
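
To illustrate the host change, here is a minimal sketch that swaps only the host part of a connection URL and leaves credentials, port, and database name untouched. The function name and the example URL are assumptions for illustration; use the values from your own .env file.

```python
from urllib.parse import urlsplit, urlunsplit


def with_host(database_url: str, new_host: str = "localhost") -> str:
    """Return database_url with its host replaced by new_host."""
    parts = urlsplit(database_url)
    # netloc looks like "user:password@host:port"; split off the credentials first.
    userinfo, _, hostport = parts.netloc.rpartition("@")
    _, sep, port = hostport.partition(":")
    netloc = new_host + (sep + port if sep else "")
    if userinfo:
        netloc = f"{userinfo}@{netloc}"
    return urlunsplit(parts._replace(netloc=netloc))


# Hypothetical value; the real URL comes from .env.
print(with_host("postgresql://user:secret@db:5432/ozon"))
# postgresql://user:secret@localhost:5432/ozon
```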
Export the environment variables to your shell session.
```
source contrib/scripts/export_env.sh
```
Use the management commands. Run them with the --help flag for more information about each command.
```
cd src/
python -m scrap.manage --help
```

Scraping Steps

Note: Management commands are run from the ./src directory.

Load categories into the database.

The category data is located in the ./data/categories folder. More about the dataset. To load the categories into the database, use the following management command. This command will populate the category and category meta tables.
```
python -m scrap.manage load_ozon_categories_from_api_results
```
Configure categories.

After loading, modify the ms_scraper_ozon_category_meta table using raw SQL queries or a database tool like DBeaver. Set the is_parsing_enabled column to TRUE and assign a parsing_priority (lower values are processed first) to the categories you want to scrape.
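
The update described above can be sketched as follows. This example uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs anywhere. The table name and the is_parsing_enabled and parsing_priority columns come from this README; the category_id key column and the sample ids are assumptions, so check the real schema before adapting the statement.

```python
import sqlite3


def enable_categories(conn, priorities):
    """Enable parsing for the given categories.

    priorities maps a category id to its parsing priority.
    """
    conn.executemany(
        "UPDATE ms_scraper_ozon_category_meta "
        "SET is_parsing_enabled = ?, parsing_priority = ? "
        "WHERE category_id = ?",  # category_id is an assumed column name
        [(True, priority, cat_id) for cat_id, priority in priorities.items()],
    )
    conn.commit()


def parsing_order(conn):
    """List enabled category ids, lowest priority value first."""
    rows = conn.execute(
        "SELECT category_id FROM ms_scraper_ozon_category_meta "
        "WHERE is_parsing_enabled ORDER BY parsing_priority"
    )
    return [row[0] for row in rows]


# Stand-in table with three hypothetical categories, all disabled at first.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ms_scraper_ozon_category_meta ("
    "category_id INTEGER PRIMARY KEY, "
    "is_parsing_enabled BOOLEAN DEFAULT 0, "
    "parsing_priority INTEGER)"
)
conn.executemany(
    "INSERT INTO ms_scraper_ozon_category_meta (category_id) VALUES (?)",
    [(101,), (102,), (103,)],
)

enable_categories(conn, {103: 1, 101: 2})
print(parsing_order(conn))  # [103, 101]
```

Against PostgreSQL the same UPDATE statement applies, but the placeholder syntax depends on the driver. The ordering query here only illustrates "lower values first" and is not taken from the scraper's code.
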
Scrape category pages.

This command walks through category pages, collecting product data.
```
python -m scrap.manage scrape_ozon_category_pages
```
Scrape product reviews.

This command goes through the products and collects reviews and associated media.
```
python -m scrap.manage scrape_ozon_product_reviews_from_state
```
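
Once the categories are configured, the two scraping commands can be run back to back. The wrapper below is a hypothetical convenience script, not something shipped with the project. It assumes it is started from the repository root with the virtual environment active and the environment variables exported.

```python
import subprocess
import sys

# Command names as given in this README, in the order they should run.
SCRAPE_COMMANDS = [
    "scrape_ozon_category_pages",
    "scrape_ozon_product_reviews_from_state",
]


def build_command(name):
    """Build the argument list for one management command."""
    return [sys.executable, "-m", "scrap.manage", name]


def run_all(commands=SCRAPE_COMMANDS, cwd="src"):
    """Run each command from the src directory, stopping at the first failure."""
    for name in commands:
        subprocess.run(build_command(name), cwd=cwd, check=True)
```

Calling `run_all()` starts both steps in order. Because of `check=True`, a non-zero exit code from the first command raises `subprocess.CalledProcessError`, so the review scraper does not start.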
