📚 Web Scraping Project with Scrapy and MongoDB



📋 Table of Contents

  • 📖 About the Project
  • ⚙️ Getting Started
  • 💡 Running the Scraper
  • 📂 Project Structure
  • ✨ Customization
  • 📜 License
  • 📚 References
  • ⭐ Support

📖 About the Project

This project demonstrates how to build a web scraper using Scrapy, a powerful Python framework for web scraping, and store the extracted data in MongoDB, a flexible NoSQL database.

Objective

The scraper is designed to extract product information from Amazon. It:

  • Extracts relevant data like product names, prices, ratings, and images.
  • Handles pagination to scrape data across multiple pages.
  • Stores the extracted data in a MongoDB database for further analysis or processing.

The scraper is not limited to books: you can target any product category by passing the desired search keyword.


⚙️ Getting Started

Follow these steps to get the project running on your local machine. 🚀

📦 Prerequisites

Before running the scraper, ensure you have the following installed:

  • 🐍 Python 3.9 or higher
  • 🕷️ Scrapy
  • 💾 MongoDB
  • 🔗 pymongo

🔧 Setup Instructions

  1. Clone the Repository:

    git clone https://github.com/AathifZahir/Py-Scrap.git
    cd Py-Scrap
  2. Set Up a Virtual Environment:

    python -m venv venv
  3. Activate the Virtual Environment:

    • On Windows:

      venv\Scripts\activate
    • On Unix or macOS:

      source venv/bin/activate
  4. Install the Required Packages:

    pip install scrapy pymongo
  5. Configure MongoDB:

    Ensure MongoDB is installed and running on your local machine. The default connection settings in the project are:

    • Host: localhost
    • Port: 27017
    • Database: books_db
    • Collection: books

    If your MongoDB configuration differs, update the settings in settings.py accordingly.
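
For reference, these defaults might live in settings.py as plain constants. This is a minimal sketch, assuming setting names like MONGO_URI and MONGO_DATABASE and a pipeline class named MongoPipeline (see the pipeline sketch under Project Structure); check the repository's actual settings.py for the real names:

# settings.py -- MongoDB connection details (names are illustrative)
MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "books_db"
MONGO_COLLECTION = "books"

# Register the storage pipeline (path assumes books/pipelines.py defines MongoPipeline)
ITEM_PIPELINES = {
    "books.pipelines.MongoPipeline": 300,
}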


💡 Running the Scraper

To execute the scraper, use the following command:

scrapy crawl book -a keyword="laptops"

The scraper will:

  • Use the passed keyword (default is "books").
  • Start at the specified Amazon search URL.
  • Navigate through the pages and extract data like product names, prices, ratings, and images.
  • Store the extracted data in the MongoDB database.

If you do not pass a keyword, the scraper will default to searching for "books". Example:

scrapy crawl book
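
Under the hood, Scrapy passes -a arguments to the spider's constructor as keyword arguments. A minimal sketch of how the spider might turn the keyword into a start URL follows; the attribute name and the Amazon URL pattern are assumptions, so check book_spider.py for the actual implementation:

import scrapy

class BookSpider(scrapy.Spider):
    name = "book"

    def __init__(self, keyword="books", *args, **kwargs):
        # -a keyword="laptops" arrives here as a constructor argument;
        # the default of "books" matches the behavior described above.
        super().__init__(*args, **kwargs)
        self.keyword = keyword
        self.start_urls = [f"https://www.amazon.com/s?k={self.keyword}"]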

📂 Project Structure

The project follows Scrapy's standard structure:

books_scraper/
├── books/
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders/
│       ├── __init__.py
│       └── book_spider.py
├── scrapy.cfg
└── README.md
  • items.py: Defines the data structure for the scraped items.
  • pipelines.py: Contains the pipeline for processing and storing items in MongoDB.
  • settings.py: Configuration settings for the Scrapy project, including MongoDB connection details.
  • spiders/book_spider.py: The main spider responsible for scraping Amazon.
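
To illustrate how these files typically fit together, here are two hedged sketches. The field and class names below are assumptions based on the data described above, not necessarily what this repository uses. First, items.py might define the scraped fields like this:

import scrapy

class ProductItem(scrapy.Item):
    # One field per data point the spider extracts
    name = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    image = scrapy.Field()

Second, pipelines.py might persist items with pymongo along these lines, reading the connection details from settings.py:

import pymongo

class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db, collection):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection = collection

    @classmethod
    def from_crawler(cls, crawler):
        # Pull connection details from settings.py, falling back to the
        # defaults listed in the setup instructions
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "books_db"),
            collection=crawler.settings.get("MONGO_COLLECTION", "books"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Each scraped item becomes one MongoDB document
        self.db[self.collection].insert_one(dict(item))
        return item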

✨ Customization

To adapt the scraper for different keywords or websites:

  1. Pass a Keyword Dynamically:

    The spider can be run with a dynamic keyword using the -a argument. For example, to search for products matching "laptops":

    scrapy crawl book -a keyword="laptops"

    By default, if no keyword is passed, the scraper will search for "books".

  2. Update the start_urls:

    Modify the start_urls list in book_spider.py to point to a different website or category.

  3. Adjust the Parsing Logic:

    Ensure the CSS selectors in the parse method of book_spider.py accurately target the desired data fields on the new website.

  4. Handle Pagination:

    If the target website uses a different pagination structure, update the pagination handling logic in the parse method accordingly (a combined sketch of steps 3 and 4 follows this list).
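
As referenced in steps 3 and 4, item extraction and pagination usually live together in the parse method. The sketch below is a generic Scrapy pattern with guessed CSS selectors for Amazon's search results, not the exact selectors used in book_spider.py:

def parse(self, response):
    # Step 3: extract one record per search result (selectors are illustrative)
    for product in response.css("div.s-result-item"):
        yield {
            "name": product.css("h2 a span::text").get(),
            "price": product.css("span.a-offscreen::text").get(),
            "rating": product.css("span.a-icon-alt::text").get(),
            "image": product.css("img.s-image::attr(src)").get(),
        }

    # Step 4: follow the "Next" link if one exists
    next_page = response.css("a.s-pagination-next::attr(href)").get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)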


📜 License

This project is licensed under the MIT License. See the LICENSE file for details. 📄


📚 References

For more detailed information on the tools and techniques used in this project, refer to the following resources:

  • Scrapy documentation: https://docs.scrapy.org/
  • pymongo documentation: https://pymongo.readthedocs.io/
  • MongoDB manual: https://www.mongodb.com/docs/manual/


⭐ Support

If you like this project, please give it a ⭐ by clicking the star button at the top of the repository! It helps others discover the project and motivates me to improve it further. ❤️


