Edmunds Data Scraping and Analytics

Project Description

Scraping Edmunds Website to Redshift

This project scrapes car listings from the car sales aggregation website edmunds.com and uploads them to an AWS Redshift Serverless data warehouse. The data load job is written in Python with the Beautiful Soup, requests, and Boto3 libraries.
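
As a rough illustration of the scrape step, here is a minimal sketch using requests and Beautiful Soup. The URL query format and CSS class names are assumptions for illustration only; the project's actual selectors live in edmunds_scraper.py.

```python
# Minimal sketch of the scrape step. The URL format and CSS class names below are
# illustrative assumptions; the project's actual selectors live in edmunds_scraper.py.
import csv

import requests
from bs4 import BeautifulSoup

url = "https://www.edmunds.com/inventory/srp.html?make=honda&model=civic"  # assumed query format
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
rows = []
for card in soup.find_all("div", class_="vehicle-card"):  # hypothetical class name
    title = card.find("h2")
    price = card.find("span", class_="price")              # hypothetical class name
    if title and price:
        rows.append({"title": title.get_text(strip=True),
                     "price": price.get_text(strip=True)})

with open("used_cars.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)
```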

Analyzing Redshift Data to Advise Purchases

Additional modeling via dbt in the warehouse will answer particular analytics questions (a query sketch follows the list):
  1. What is the distribution of value for a particular make, model, trim, history, and miles driven?
  2. How long does it take for a great car at a great price to be sold by a dealer?
  3. Given a particular make, model, and trim, can a user be alerted when a car enters the market at the price point they want?
  4. More questions pending.
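
As a rough sketch of question 1, the query below computes a price distribution for one make/model/trim using redshift_connector. The connection details and table name are placeholders, and in the project this kind of logic is intended to live in dbt models rather than ad-hoc queries.

```python
# Sketch of question 1 as a direct query. Connection details and the table name are
# placeholders; in the project this logic is meant to live in dbt models instead.
import redshift_connector

conn = redshift_connector.connect(
    host="my-workgroup.012345678901.us-east-1.redshift-serverless.amazonaws.com",  # placeholder
    database="dev",
    user="analytics_user",        # placeholder
    password="example-password",  # placeholder
)
cursor = conn.cursor()
cursor.execute(
    """
    SELECT make, model, trim,
           MIN(price)                                          AS min_price,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price)  AS median_price,
           MAX(price)                                          AS max_price,
           COUNT(*)                                            AS listings
    FROM used_car_listings            -- hypothetical table name
    WHERE make = %s AND model = %s AND trim = %s
    GROUP BY make, model, trim
    """,
    ("Honda", "Civic", "EX"),
)
print(cursor.fetchall())
conn.close()
```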

Web Portal for Generating Analytics

A web portal (to be developed) will let users sign up and view personal analytics for the cars they are interested in (for example, answering question #3 above).

Project Backlog

  • Additional data modeling via dbt select statements
  • Uploading the dataset to Kaggle
  • Finding a solution so the data scrape doesn't have to run locally
  • Replacing this project backlog with individual issues on GitHub Issues
  • Revising the CONTRIBUTING.md doc according to the Open Source Guide

Installation and Run Instructions

You can run the data scraping portion of this project locally on Windows for a particular make and model, as described below.

Windows:

  1. Check that you have Python 3.11 or later (py --version) and the GitHub CLI available on the command line (gh --version).
  2. From the command line, change directory to the location where you want the repo: cd /PATH/TO/REPO/
  3. Clone the repository: gh repo clone vcavanna/bs_linkedin_src
  4. From the root directory of the project, create the virtual environment: py -m venv venv
  5. Activate the virtual environment: .\venv\Scripts\activate
  6. If using Visual Studio Code, open the command palette with Ctrl+Shift+P and type "Python: Select Interpreter".
  7. In Visual Studio Code, select the venv interpreter located in the root directory (".\venv\Scripts\python.exe").
  8. Install the scraper's dependencies: py -m pip install bs4 requests
  9. cd edmunds_etl
  10. py edmunds_scraper.py
  11. You should see a CSV file appear in the root directory of the project with all of the used cars in the Edmunds database for that make and model.
  12. At this point, the workflow would continue with the upload_to_s3.py script, but that requires AWS credentials (a sketch of the upload step follows these steps).
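
For reference, the upload in step 12 boils down to a boto3 call along these lines. The file, bucket, and key names are placeholders; the project's real logic lives in upload_to_s3.py.

```python
# Sketch of the S3 upload step (filename, bucket, and key are placeholders; the
# project's real logic lives in upload_to_s3.py and requires valid AWS credentials).
import boto3

s3 = boto3.client("s3")  # picks up whatever AWS credentials are configured locally
s3.upload_file(
    Filename="used_cars.csv",           # placeholder name for the CSV from edmunds_scraper.py
    Bucket="my-edmunds-scrape-bucket",  # placeholder bucket name
    Key="raw/used_cars.csv",
)
```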

Contributing

To understand how to contribute, please read the CONTRIBUTING page.

Contributing to AWS

  1. You might not need to. One to-do item for this project is to decouple Redshift database calls from the rest of the project (i.e. an interface with both a Redshift implementation and a local database implementation), so keep an eye out for contributing in that way.
  2. If you do need to contribute to AWS, contact me. I'll set up a single sign-on MFA account so you can work on the Redshift and S3 aspects of the project (a quick boto3 credential check is sketched after this list).
  3. See the Tutorials and Guides section below for resources that have been helpful for learning Redshift, S3, Lambda, etc.
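
Once an SSO account is set up and you've run aws configure sso and aws sso login, a quick way to confirm that boto3 can see those credentials is something like the following. The profile name is a placeholder for whatever your configuration created.

```python
# Quick check that boto3 picks up an IAM Identity Center (SSO) profile.
# The profile name is a placeholder for whatever `aws configure sso` created.
import boto3

session = boto3.Session(profile_name="edmunds-contributor")  # placeholder profile name
identity = session.client("sts").get_caller_identity()
print(identity["Arn"])  # confirms which identity your requests will use
```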

Tutorials and Guides

Web Scraping

  • Credit to realpython.com's tutorial Beautiful Soup: Build a Web Scraper With Python for the introduction to web scraping.

AWS

  • Credit to AWS Active Credit application: How to receive $1000 AWS Active Credit for your side project or startup, for getting free credits.

Redshift

  • Credit to the AWS tutorial Loading Data from Amazon S3 for the introduction to using the Redshift database.
  • Credit to the AWS redshift python connect docs for the Python-Redshift interaction.
Lambda

  • Credit to this quick-start for showing how to quickly process S3 uploads into the Redshift database (a handler sketch follows this list).
  • Credit to this knowledge center article for explaining how to fix issues caused by not using the latest boto3 version.
  • Credit to this Stack Overflow answer for pointing to the Lambda docs in AWS.
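
As a rough sketch of what "process S3 uploads into Redshift" can look like, here is one possible Lambda handler shape using the Redshift Data API. The workgroup, database, table, and IAM role are placeholders, and the linked quick-start may use a different mechanism.

```python
# One possible shape for a Lambda that loads a newly uploaded CSV into Redshift
# Serverless via the Redshift Data API. Workgroup, database, table, and IAM role
# are placeholders; the linked quick-start may use a different mechanism.
import boto3

redshift_data = boto3.client("redshift-data")

def handler(event, context):
    record = event["Records"][0]                  # S3 "object created" event record
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    redshift_data.execute_statement(
        WorkgroupName="my-workgroup",             # placeholder Redshift Serverless workgroup
        Database="dev",
        Sql=(
            "COPY used_car_listings "             # hypothetical target table
            f"FROM 's3://{bucket}/{key}' "
            "IAM_ROLE 'arn:aws:iam::012345678901:role/redshift-copy-role' "  # placeholder role
            "FORMAT AS CSV IGNOREHEADER 1;"
        ),
    )
```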
S3

  • Credit to the Boto3 quickstart for introducing me to boto3, essential to the copy-CSV-to-S3 aspect of this project.
  • Credit to this Stack Overflow answer for explaining why the above quickstart needs to be configured differently for SSO sessions, and how to do that.

IAM Identity Center

  • Credit to the AWS IAM Identity Center user guide for how to set up IAM Identity Center and its users.
  • Credit to the AWS CLI configure with a Single Sign-on Session guide for how to programmatically create credentials with the above IAM Identity Center.

Open Source / Git Guides

  • Credit to the Open Source Guide for how to start this project as open source.
  • Credit to the GitHub Issues docs for helping me understand how issues can enable team collaboration and to-do lists.

Open Source Datasets

This includes sources currently in use, as well as sources I want to use to enhance the quality of the end project.

  • Car Features and MSRP by CopperUnion
  • Car Details Dataset by AKSHAY DATTATRAY KHARE, from 'Car Dekho'.
  • EDA Car Data Analysis by MELIKE DILEKCI, from 'Car Dekho'.
  • Vehicle Listings from Craigslist.org, a great resource from Austin Reese; also see AustinReese's GitHub.
  • Ownership of Cars dataset from Kiattisak Rattanaporn.
  • Car Sales database by SURAJ. I don't know the source, but it's data on salespersons and commissions by car for a year.