Edmunds Data Scraping and Analytics

Project Description

Scraping Edmunds Website to Redshift

This project scrapes car listings from the car sales aggregation website edmunds.com and uploads them to an AWS Redshift Serverless data warehouse. The data load job is written in Python with the Beautiful Soup, requests, and Boto3 libraries.
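
As a rough illustration of the scrape step, here is a minimal sketch using requests and Beautiful Soup. The URL query format and CSS class names are assumptions for illustration only; the project's actual selectors live in edmunds_scraper.py.

```python
# Minimal sketch of the scrape step. The URL format and CSS class names below are
# illustrative assumptions; the project's actual selectors live in edmunds_scraper.py.
import csv

import requests
from bs4 import BeautifulSoup

url = "https://www.edmunds.com/inventory/srp.html?make=honda&model=civic"  # assumed query format
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
rows = []
for card in soup.find_all("div", class_="vehicle-card"):  # hypothetical class name
    title = card.find("h2")
    price = card.find("span", class_="price")              # hypothetical class name
    if title and price:
        rows.append({"title": title.get_text(strip=True),
                     "price": price.get_text(strip=True)})

with open("used_cars.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)
```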

Analyzing Redshift Data to Advise Purchases

Additional modeling via dbt in the warehouse will answer particular analytics questions (a query sketch follows the list):
  1. What is the distribution of value for a particular make, model, trim, history, and miles driven?
  2. How long does it take for a great car at a great price to be sold by a dealer?
  3. Given a particular make, model, and trim, can a user be alerted when a car enters the market at the price point they want?
  4. More questions pending.
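
As a rough sketch of question 1, the query below computes a price distribution for one make/model/trim using redshift_connector. The connection details and table name are placeholders, and in the project this kind of logic is intended to live in dbt models rather than ad-hoc queries.

```python
# Sketch of question 1 as a direct query. Connection details and the table name are
# placeholders; in the project this logic is meant to live in dbt models instead.
import redshift_connector

conn = redshift_connector.connect(
    host="my-workgroup.012345678901.us-east-1.redshift-serverless.amazonaws.com",  # placeholder
    database="dev",
    user="analytics_user",        # placeholder
    password="example-password",  # placeholder
)
cursor = conn.cursor()
cursor.execute(
    """
    SELECT make, model, trim,
           MIN(price)                                          AS min_price,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price)  AS median_price,
           MAX(price)                                          AS max_price,
           COUNT(*)                                            AS listings
    FROM used_car_listings            -- hypothetical table name
    WHERE make = %s AND model = %s AND trim = %s
    GROUP BY make, model, trim
    """,
    ("Honda", "Civic", "EX"),
)
print(cursor.fetchall())
conn.close()
```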

Web Portal for Generating Analytics

A web portal (to be developed) will let users sign up and view personal analytics for the cars they are interested in (for example, answering question #3 above).

Project Backlog

  • Additional data modeling via dbt select statements
  • Uploading the dataset to Kaggle
  • Finding a solution so the data scrape doesn't have to run locally
  • Replacing this project backlog with individual issues on GitHub Issues
  • Revising the CONTRIBUTING.md doc according to the Open Source Guide

Installation and Run Instructions

You can run the data scraping portion of this project locally on Windows for a particular make and model, as described below.

Windows:

  1. Check that you have Python 3.11 or later (py --version) and the GitHub CLI available on the command line (gh --version).
  2. From the command line, change directory to the location where you want the repo: cd /PATH/TO/REPO/
  3. Clone the repository: gh repo clone vcavanna/bs_linkedin_src
  4. From the root directory of the project, create the virtual environment: py -m venv venv
  5. Activate the virtual environment: .\venv\Scripts\activate
  6. If using Visual Studio Code, open the command palette with Ctrl+Shift+P and type "Python: Select Interpreter".
  7. In Visual Studio Code, select the venv interpreter located in the root directory (".\venv\Scripts\python.exe").
  8. Install the scraper's dependencies: py -m pip install bs4 requests
  9. cd edmunds_etl
  10. py edmunds_scraper.py
  11. You should see a CSV file appear in the root directory of the project with all of the used cars in the Edmunds database for that make and model.
  12. At this point, the workflow would continue with the upload_to_s3.py script, but that requires AWS credentials (a sketch of the upload step follows these steps).
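
For reference, the upload in step 12 boils down to a boto3 call along these lines. The file, bucket, and key names are placeholders; the project's real logic lives in upload_to_s3.py.

```python
# Sketch of the S3 upload step (filename, bucket, and key are placeholders; the
# project's real logic lives in upload_to_s3.py and requires valid AWS credentials).
import boto3

s3 = boto3.client("s3")  # picks up whatever AWS credentials are configured locally
s3.upload_file(
    Filename="used_cars.csv",           # placeholder name for the CSV from edmunds_scraper.py
    Bucket="my-edmunds-scrape-bucket",  # placeholder bucket name
    Key="raw/used_cars.csv",
)
```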

Contributing

To understand how to contribute, please read the CONTRIBUTING page.

Contributing to AWS

  1. You might not need to. One to-do item for this project is to decouple Redshift database calls from the rest of the project (i.e. an interface with both a Redshift implementation and a local database implementation), so keep an eye out for contributing in that way.
  2. If you do need to contribute to AWS, contact me. I'll set up a single sign-on MFA account so you can work on the Redshift and S3 aspects of the project (a quick boto3 credential check is sketched after this list).
  3. See the Tutorials and Guides section below for resources that have been helpful for learning Redshift, S3, Lambda, etc.
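
Once an SSO account is set up and you've run aws configure sso and aws sso login, a quick way to confirm that boto3 can see those credentials is something like the following. The profile name is a placeholder for whatever your configuration created.

```python
# Quick check that boto3 picks up an IAM Identity Center (SSO) profile.
# The profile name is a placeholder for whatever `aws configure sso` created.
import boto3

session = boto3.Session(profile_name="edmunds-contributor")  # placeholder profile name
identity = session.client("sts").get_caller_identity()
print(identity["Arn"])  # confirms which identity your requests will use
```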

Tutorials and Guides

Web Scraping

  • Credit to realpython.com's tutorial Beautiful Soup: Build a Web Scraper With Python for the introduction to web scraping.

AWS

  • Credit to AWS Active Credit application: How to receive $1000 AWS Active Credit for your side project or startup, for getting free credits.

Redshift

  • Credit to the AWS tutorial Loading Data from Amazon S3 for the introduction to using the Redshift database.
  • Credit to the AWS redshift python connect docs for the Python-Redshift interaction.
Lambda

  • Credit to this quick-start for showing how to quickly process S3 uploads into the Redshift database (a handler sketch follows this list).
  • Credit to this knowledge center article for explaining how to fix issues caused by not using the latest boto3 version.
  • Credit to this Stack Overflow answer for pointing to the Lambda docs in AWS.
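
As a rough sketch of what "process S3 uploads into Redshift" can look like, here is one possible Lambda handler shape using the Redshift Data API. The workgroup, database, table, and IAM role are placeholders, and the linked quick-start may use a different mechanism.

```python
# One possible shape for a Lambda that loads a newly uploaded CSV into Redshift
# Serverless via the Redshift Data API. Workgroup, database, table, and IAM role
# are placeholders; the linked quick-start may use a different mechanism.
import boto3

redshift_data = boto3.client("redshift-data")

def handler(event, context):
    record = event["Records"][0]                  # S3 "object created" event record
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    redshift_data.execute_statement(
        WorkgroupName="my-workgroup",             # placeholder Redshift Serverless workgroup
        Database="dev",
        Sql=(
            "COPY used_car_listings "             # hypothetical target table
            f"FROM 's3://{bucket}/{key}' "
            "IAM_ROLE 'arn:aws:iam::012345678901:role/redshift-copy-role' "  # placeholder role
            "FORMAT AS CSV IGNOREHEADER 1;"
        ),
    )
```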
S3

  • Credit to the Boto3 quickstart for introducing me to boto3, essential to the copy-CSV-to-S3 aspect of this project.
  • Credit to this Stack Overflow answer for explaining why the above quickstart needs to be configured differently for SSO sessions, and how to do that.

IAM Identity Center

  • Credit to the AWS IAM Identity Center user guide for how to set up IAM Identity Center and its users.
  • Credit to the AWS CLI configure with a Single Sign-on Session guide for how to programmatically create credentials with the above IAM Identity Center.

Open Source / Git Guides

  • Credit to the Open Source Guide for how to start this project as open source.
  • Credit to the GitHub Issues docs for helping me understand how issues can enable team collaboration and to-do lists.

Open Source Datasets

This includes sources currently in use, as well as sources I want to use to enhance the quality of the end project.

  • Car Features and MSRP by CopperUnion
  • Car Details Dataset by AKSHAY DATTATRAY KHARE, from 'Car Dekho'.
  • EDA Car Data Analysis by MELIKE DILEKCI, from 'Car Dekho'.
  • Vehicle Listings from Craigslist.org, a great resource from Austin Reese; also see AustinReese's GitHub.
  • Ownership of Cars dataset from Kiattisak Rattanaporn.
  • Car Sales database by SURAJ. I don't know the source, but it's data on salespersons and commissions by car for a year.