Skip to content

Latest commit

 

History

History
87 lines (56 loc) · 2.95 KB

README.md

File metadata and controls

87 lines (56 loc) · 2.95 KB

Ozon Scraper

This project allows scraping data about products, review and review media from the Ozon website. Selenium is used for scraping, and PostgreSQL is used for data storage.

🛠️ Requirements

  • Python 3.12+
  • Docker & Docker Compose
  • Chrome Browser
  • Chrome Driver for your Chrome version

🏗️ Installation

  • Create virtual environment and activate it and install dependencies.

    python3.12 -m venv env
    source env/bin/activate
    pip install --upgrade pip && pip install -r requirements.txt
  • Copy/rename .env.dist to .env and fill in the required data.

    cp .env.dist .env
  • Build the Docker images for the project.

    docker build -t scraper-base -f contrib/docker/scraper/Dockerfile .
    docker compose -p ozon-scraper --env-file .env -f contrib/docker/docker-compose.yml build
  • Start the Docker containers. This will initialize the database and run migrations.

    docker compose -p ozon-scraper --env-file .env -f contrib/docker/docker-compose.yml up -d
  • Update the DATABASE_URL in the .env file to point to the database via the exposed port. Change the db host in to localhost.

  • Export the environment variables to your shell session.

    source contrib/scripts/export_env.sh
  • Use management commands (start with the --help flag to get more information about management commands).

    cd src/
    python -m scrap.manage --help

Scraping Steps

Note: Management commands are run from the ./src directory.

  1. Load categories into the database.

    The category data is located in the ./data/categories folder. More about the dataset. To load the categories into the database, use the following management command. This command will populate the category and category meta tables.

    python -m scrap.manage load_ozon_categories_from_api_results
  2. Configure categories.

    After loading, modify the ms_scraper_ozon_category_meta table using raw SQL queries or a database tool like DBeaver. Set the is_parsing_enabled column to TRUE and assign a parsing_priority (lower values are processed first) to the categories you want to scrape.

  3. Scrape category pages.

    This command walks through category pages, collecting product data.

    python -m scrap.manage scrape_ozon_category_pages
  4. Scrape product reviews.

    This command goes through the products and collects reviews and associated media.

    python -m scrap.manage scrape_ozon_product_reviews_from_state