📚 Web Scraping Project with Scrapy and MongoDB



📋 Table of Contents

  • 📖 About the Project
  • ⚙️ Getting Started
  • 💡 Running the Scraper
  • 📂 Project Structure
  • ✨ Customization
  • 📜 License
  • 📚 References
  • ⭐ Support

📖 About the Project

This project demonstrates how to build a web scraper using Scrapy, a powerful Python framework for web scraping, and store the extracted data in MongoDB, a flexible NoSQL database.

Objective

The scraper is designed to extract product information from Amazon. It:

  • Extracts relevant data like product names, prices, ratings, and images.
  • Handles pagination to scrape data across multiple pages.
  • Stores the extracted data in a MongoDB database for further analysis or processing.

The scraper is not limited to books: you can target any product category by passing the desired search keyword.


⚙️ Getting Started

Follow these steps to get the project running on your local machine. 🚀

📦 Prerequisites

Before running the scraper, ensure you have the following installed:

  • 🐍 Python 3.9 or higher
  • 🕷️ Scrapy
  • 💾 MongoDB
  • 🔗 pymongo

🔧 Setup Instructions

  1. Clone the Repository:

    git clone https://github.com/AathifZahir/Py-Scrap.git
    cd Py-Scrap
  2. Set Up a Virtual Environment:

    python -m venv venv
  3. Activate the Virtual Environment:

    • On Windows:

      venv\Scripts\activate
    • On Unix or macOS:

      source venv/bin/activate
  4. Install the Required Packages:

    pip install scrapy pymongo
  5. Configure MongoDB:

    Ensure MongoDB is installed and running on your local machine. The default connection settings in the project are:

    • Host: localhost
    • Port: 27017
    • Database: books_db
    • Collection: books

    If your MongoDB configuration differs, update the settings in settings.py accordingly.
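
For reference, these defaults might live in settings.py as plain constants. This is a minimal sketch, assuming setting names like MONGO_URI and MONGO_DATABASE and a pipeline class named MongoPipeline (see the pipeline sketch under Project Structure); check the repository's actual settings.py for the real names:

# settings.py -- MongoDB connection details (names are illustrative)
MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "books_db"
MONGO_COLLECTION = "books"

# Register the storage pipeline (path assumes books/pipelines.py defines MongoPipeline)
ITEM_PIPELINES = {
    "books.pipelines.MongoPipeline": 300,
}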


💡 Running the Scraper

To execute the scraper, use the following command:

scrapy crawl book -a keyword="laptops"

The scraper will:

  • Use the passed keyword (default is "books").
  • Start at the specified Amazon search URL.
  • Navigate through the pages and extract data like product names, prices, ratings, and images.
  • Store the extracted data in the MongoDB database.

If you do not pass a keyword, the scraper will default to searching for "books". Example:

scrapy crawl book
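
Under the hood, Scrapy passes -a arguments to the spider's constructor as keyword arguments. A minimal sketch of how the spider might turn the keyword into a start URL follows; the attribute name and the Amazon URL pattern are assumptions, so check book_spider.py for the actual implementation:

import scrapy

class BookSpider(scrapy.Spider):
    name = "book"

    def __init__(self, keyword="books", *args, **kwargs):
        # -a keyword="laptops" arrives here as a constructor argument;
        # the default of "books" matches the behavior described above.
        super().__init__(*args, **kwargs)
        self.keyword = keyword
        self.start_urls = [f"https://www.amazon.com/s?k={self.keyword}"]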

📂 Project Structure

The project follows Scrapy's standard structure:

books_scraper/
├── books/
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders/
│       ├── __init__.py
│       └── book_spider.py
├── scrapy.cfg
└── README.md
  • items.py: Defines the data structure for the scraped items.
  • pipelines.py: Contains the pipeline for processing and storing items in MongoDB.
  • settings.py: Configuration settings for the Scrapy project, including MongoDB connection details.
  • spiders/book_spider.py: The main spider responsible for scraping Amazon.
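
To illustrate how these files typically fit together, here are two hedged sketches. The field and class names below are assumptions based on the data described above, not necessarily what this repository uses. First, items.py might define the scraped fields like this:

import scrapy

class ProductItem(scrapy.Item):
    # One field per data point the spider extracts
    name = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    image = scrapy.Field()

Second, pipelines.py might persist items with pymongo along these lines, reading the connection details from settings.py:

import pymongo

class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db, collection):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection = collection

    @classmethod
    def from_crawler(cls, crawler):
        # Pull connection details from settings.py, falling back to the
        # defaults listed in the setup instructions
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "books_db"),
            collection=crawler.settings.get("MONGO_COLLECTION", "books"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Each scraped item becomes one MongoDB document
        self.db[self.collection].insert_one(dict(item))
        return item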

✨ Customization

To adapt the scraper for different keywords or websites:

  1. Pass a Keyword Dynamically:

    The spider can be run with a dynamic keyword using the -a argument. For example, to search for products matching "laptops":

    scrapy crawl book -a keyword="laptops"

    By default, if no keyword is passed, the scraper will search for "books".

  2. Update the start_urls:

    Modify the start_urls list in book_spider.py to point to a different website or category.

  3. Adjust the Parsing Logic:

    Ensure the CSS selectors in the parse method of book_spider.py accurately target the desired data fields on the new website.

  4. Handle Pagination:

    If the target website uses a different pagination structure, update the pagination handling logic in the parse method accordingly (a combined sketch of steps 3 and 4 follows this list).
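
As referenced in steps 3 and 4, item extraction and pagination usually live together in the parse method. The sketch below is a generic Scrapy pattern with guessed CSS selectors for Amazon's search results, not the exact selectors used in book_spider.py:

def parse(self, response):
    # Step 3: extract one record per search result (selectors are illustrative)
    for product in response.css("div.s-result-item"):
        yield {
            "name": product.css("h2 a span::text").get(),
            "price": product.css("span.a-offscreen::text").get(),
            "rating": product.css("span.a-icon-alt::text").get(),
            "image": product.css("img.s-image::attr(src)").get(),
        }

    # Step 4: follow the "Next" link if one exists
    next_page = response.css("a.s-pagination-next::attr(href)").get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)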


📜 License

This project is licensed under the MIT License. See the LICENSE file for details. 📄


📚 References

For more detailed information on the tools and techniques used in this project, refer to the following resources:

  • Scrapy documentation: https://docs.scrapy.org/
  • pymongo documentation: https://pymongo.readthedocs.io/
  • MongoDB manual: https://www.mongodb.com/docs/manual/


⭐ Support

If you like this project, please give it a ⭐ by clicking the star button at the top of the repository! It helps others discover the project and motivates me to improve it further. ❤️


