Used cars, also called second-hand cars, have a huge market. Many people consider buying a used car instead of a new one, as it is more affordable and often a better investment.
The main reason for this huge market is that when someone buys a new car and sells it the very next day, without any defect, its price drops by around 30%.
There are many fraudsters in the market who not only sell dishonestly but can also mislead buyers about the right price of a vehicle.
To guard against such fraud and against fake or improper prices, I used this algorithm to predict car values from some of the main features that determine a car's value, using the real-world #CarDekho dataset to predict the price of any used car.
Car price prediction is a really interesting machine learning problem for a beginner, as many factors influence the price of a car in the second-hand market. In this project, we will look at a dataset based on the sale/purchase of cars, where our end goal is to predict the price of a car given its features, so as to maximize profit.
I've used a separate ML environment where only the limited set of required libraries is installed.
Use the !pip command to install those libraries into your environment:
NumPy : not part of the Python standard library, although many distributions bundle it; if it is missing, run !pip install numpy
Pandas : !pip install pandas
Matplotlib : !pip install matplotlib
Scikit-learn : !pip install scikit-learn (the PyPI package is named scikit-learn; it is imported as sklearn)
Seaborn : !pip install seaborn
Pickle : part of the Python standard library, so no installation is needed
- Import the first package, pandas, for reading the data and carrying out preprocessing on it
import pandas as pd
- Now load the data from Car_dataset.csv into a DataFrame
df = pd.read_csv('Car_dataset.csv')
- Visualize and validate whether the dataset was successfully assigned to the variable
- Check the size of the dataset
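The loading and inspection steps above can be sketched as follows. This is only an illustration: the three rows are made-up stand-ins for Car_dataset.csv (the real file is not reproduced here), and `csv_text` and `sample_df` are hypothetical names. The column names are the ones used later in this README.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for Car_dataset.csv.
csv_text = """Year,Selling_Price,Km_Driven,Fuel,Seller_Type,Transmission,Owner
2014,3.35,27000,Petrol,Dealer,Manual,0
2017,7.25,6900,Diesel,Dealer,Manual,0
2011,2.85,5200,Petrol,Individual,Automatic,1
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

# Validate that the data was loaded: look at the first rows.
print(sample_df.head())

# Check the size of the dataset as (rows, columns).
print(sample_df.shape)
```

With the real file, the same two calls on `df` (`df.head()` and `df.shape`) give a quick sanity check before any preprocessing.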
Choosing features that describe each car's properties lets the model capture how prices vary from car to car.
The features used here are:
These features and their unique values directly classify and differentiate each car.
- Check for the presence of missing values in the dataset
- Describe the summary statistics, such as count, mean, standard deviation, minimum and maximum
- Fetch the columns present before data preparation
- Drop unnecessary column(s) from the dataset, i.e. any column that can take a great many distinct values and does not help differentiate cars as a feature
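# .copy() gives an independent DataFrame, so adding columns later does not trigger a SettingWithCopyWarning
final_dataset = df[['Year', 'Selling_Price', 'Km_Driven', 'Fuel', 'Seller_Type', 'Transmission', 'Owner']].copy()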
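The checks listed above (missing values, summary statistics, column names) might look like this on a small frame. The data is invented for illustration, with one missing price planted on purpose, and `sample_df`, `missing_per_column`, `summary` and `column_names` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one deliberately missing Selling_Price.
sample_df = pd.DataFrame({
    "Year": [2014, 2017, 2011],
    "Selling_Price": [3.35, 7.25, np.nan],
    "Km_Driven": [27000, 6900, 5200],
})

# Count missing values per column.
missing_per_column = sample_df.isnull().sum()
print(missing_per_column)

# Summary statistics: count, mean, std, min, quartiles, max.
summary = sample_df.describe()
print(summary)

# Columns present before data preparation.
column_names = list(sample_df.columns)
print(column_names)
```

Note that `describe()` reports a count rather than a sum, and that its count skips missing values.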
We can add or modify features as needed for training and testing the model.
Here, we are going to add a new feature, `Car_Age`, which captures how many years a particular car has been in use.
- Add a `Current_Year` column to the dataset with the value 2021 in every row, as 2021 is the current year
final_dataset['Current_Year'] = 2021
- Compute `Car_Age` by simple subtraction and add it as a new column
final_dataset['Car_Age'] = final_dataset['Current_Year'] - final_dataset['Year']
- As we now know how old each car is, we can drop both the `Year` and `Current_Year` columns
final_dataset.drop(['Year', 'Current_Year'], axis = 1, inplace = True)
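As a compact illustration of the age calculation above, the same result can be obtained without an intermediate column. This is a sketch on made-up rows; `cars` and `reference_year` are hypothetical names, and 2021 is the reference year used in this README.

```python
import pandas as pd

# Hypothetical sample rows.
cars = pd.DataFrame({
    "Year": [2014, 2017, 2011],
    "Selling_Price": [3.35, 7.25, 2.85],
})

reference_year = 2021  # the "current year" assumed in this project

# Derive Car_Age directly, then drop the Year column it replaces.
cars = cars.assign(Car_Age=reference_year - cars["Year"]).drop(columns=["Year"])
print(cars)
```

Keeping the reference year in a variable makes it easy to update when the project is rerun in a later year.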
Converting categorical columns with one-hot encoding
In case you don't know one-hot encoding, this small table should clear it up in a minute:
| {Parameter1} | {Parameter2} | {Parameter3} | Description |
|---|---|---|---|
| 1 | 0 | 0 | The value belongs to {Parameter1} |
| 0 | 1 | 0 | The value belongs to {Parameter2} |
| 0 | 0 | 1 | The value belongs to {Parameter3} |

TIP: One of the columns is redundant. When Parameter1 == 0 and Parameter2 == 0, the value must belong to {Parameter3}, so one column can be dropped without losing information.
final_dataset = pd.get_dummies(final_dataset, drop_first = True) # drop_first = True drops the first category's column to avoid the "dummy variable trap"
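To see what `drop_first = True` does in practice, here is a small sketch on invented data; `cars` and `encoded` are hypothetical names, and the three fuel categories are chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample with one categorical and one numeric column.
cars = pd.DataFrame({
    "Fuel": ["Petrol", "Diesel", "CNG", "Petrol"],
    "Km_Driven": [27000, 6900, 5200, 43000],
})

# One-hot encode the categorical column; drop_first removes the column
# for the first category in sorted order (here "CNG").
encoded = pd.get_dummies(cars, drop_first=True)
print(encoded)

# The CNG row is still identifiable: both remaining indicators are 0.
print(encoded.loc[2, ["Fuel_Diesel", "Fuel_Petrol"]])
```

Numeric columns such as `Km_Driven` pass through unchanged; only the object (string) columns are expanded.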
Now it's time to visualize the data prepared so far
- Import seaborn and plot a pairplot very quickly
import seaborn as sbs
sbs.pairplot(final_dataset)
- Import Matplotlib as well and plot a heatmap of the correlations between the columns
- For more on the `%matplotlib inline` magic, refer to this article.
import matplotlib.pyplot as plt
%matplotlib inline
# Heatmapping the data
corrmat = final_dataset.corr()
top_corr_features = corrmat.index
plt.figure(figsize = (20, 20))
# Visualize the heatmap
hmap = sbs.heatmap(final_dataset[top_corr_features].corr(), annot = True, cmap = "RdYlGn") # Color pattern chosen here = "RdYlGn"
- Let's have another look at the dataset prepared so far
- The very first column of the dataset is `Selling_Price`, the value we are going to predict with our ML model, so it must not be part of the input features
- Let's separate it from the features, this time using `iloc`
X = final_dataset.iloc[:,1:]
Y = final_dataset.iloc[:,0]
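The positional split above relies on the target being the first column. A minimal sketch on invented data (`prepared`, `X_demo` and `y_demo` are hypothetical names):

```python
import pandas as pd

# Hypothetical prepared frame with the target in the first column.
prepared = pd.DataFrame({
    "Selling_Price": [3.35, 7.25, 2.85],
    "Km_Driven": [27000, 6900, 5200],
    "Car_Age": [7, 4, 10],
})

# All rows, every column from position 1 onward: the input features.
X_demo = prepared.iloc[:, 1:]

# All rows, column at position 0: the target to predict.
y_demo = prepared.iloc[:, 0]

print(X_demo.columns.tolist())
print(y_demo.name)
```

Because `iloc` is purely positional, it is worth confirming the column order first; selecting by name, for example `prepared["Selling_Price"]`, is a safer alternative if the order might change.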
Feature Importance
- Let's now fit our X and Y values to a model with `ExtraTreesRegressor`
- Import `ExtraTreesRegressor`
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
model.fit(X, Y)
- Now we can see the importance of each feature
print(model.feature_importances_)
Can you interpret these raw importance values? Can you tell which feature is more important than another?
This is exactly where visualization plays an important role: it draws out insights that are hard to read from raw numbers.
Let's plot a graph of the feature importances
feat = pd.Series(model.feature_importances_, index = X.columns)
feat.nlargest(5).plot(kind = 'barh')
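To show how the importances behave, here is a sketch on synthetic data where the answer is known in advance. Everything in it is an assumption made for illustration: the column names `strong`, `weak` and `noise`, the coefficients, and the fixed `random_state`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)

# Synthetic features: the target depends heavily on "strong",
# slightly on "weak", and not at all on "noise".
X_demo = pd.DataFrame(rng.normal(size=(200, 3)),
                      columns=["strong", "weak", "noise"])
y_demo = (5.0 * X_demo["strong"] + 0.5 * X_demo["weak"]
          + rng.normal(scale=0.1, size=200))

model_demo = ExtraTreesRegressor(n_estimators=50, random_state=0)
model_demo.fit(X_demo, y_demo)

# Label each importance with its column name and rank them.
importances = pd.Series(model_demo.feature_importances_, index=X_demo.columns)
ranked = importances.sort_values(ascending=False)
print(ranked)
```

The importances are normalized to sum to 1, so each value can be read as a share of the model's total split quality. Calling `ranked.plot(kind='barh')` would draw the same kind of bar chart as above.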
- Whoosh! After all of the data preparation, we can build our model for real
- But before we do that, we have to split the data for training and testing our model
- Training is what is called building an ML model; testing the trained model outputs the predicted selling price of the car, which is our end goal
- For training and testing we'll split the data in an 8:2 ratio. Using most of the available data for training generally helps the model learn better, while the held-out part gives an honest estimate of its accuracy
- The remaining 20% of the data will be used for testing the accuracy of our ML model
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, y_test = train_test_split(X,Y, test_size = 0.2)
X_train.shape # Checking the size of the dataset used for training our ML model
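The effect of `test_size = 0.2` is easy to verify on a toy array. In this sketch the 100-row dataset and the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te` are hypothetical, and `random_state` is fixed only so that the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 rows, 2 features.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Hold out 20% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # training rows vs. test rows
```

Without a fixed `random_state`, each run shuffles the rows differently, so scores can vary slightly from run to run.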
The goal of ENSEMBLE METHODS is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator
In this project we're using `RandomForestRegressor`
from sklearn.ensemble import RandomForestRegressor
rf_random = RandomForestRegressor()
For hyperparameter tuning, we'll be introducing `RandomizedSearchCV`
Randomized search on hyperparameters: RandomizedSearchCV implements a “fit” and a “score” method. It also implements “score_samples”, “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used. For more on `RandomizedSearchCV`, refer to this article
from sklearn.model_selection import RandomizedSearchCV
# Hyperparameters
# RandomizedSearchCV
import numpy as np
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]
# Number of features to consider at every split
# 'auto' was removed in scikit-learn 1.3; for regressors it meant "all features", which is 1.0
max_features = [1.0, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)] # max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 100]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10]
- Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf}
- Use the random grid to search for the best hyperparameters; first create the base model to tune
rf = RandomForestRegressor()
- Random search of parameters, using 5-fold cross-validation, searching across 10 different combinations (matching cv = 5 and n_iter = 10 below)
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,scoring='neg_mean_squared_error', n_iter = 10, cv = 5, verbose=2, random_state=42, n_jobs = 1)
- Fit the search; each candidate random forest fits a number of decision tree regressors on various sub-samples of the dataset
rf_random.fit(X_train, Y_train)
This process takes a couple of minutes, but in the end you'll be able to see the steps done and probably an output like this one here
- Print the best parameters with `rf_random.best_params_`
Our output gives the best parameters as 1000 for n_estimators and 2 for min_samples_split. It also gives 1 for min_samples_leaf, 'sqrt' for max_features and 25 for max_depth.
- Print the best score with `rf_random.best_score_`: the mean cross-validated score of the best estimator
- Finally, predictions can be made by our ML model
predictions = rf_random.predict(X_test)
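The whole search-then-predict flow can be tried quickly on synthetic data before running it on the real dataset. This is a scaled-down sketch, not the configuration used in this project: the data, the tiny `small_grid`, `n_iter = 4`, `cv = 3` and all variable names are assumptions chosen so that it finishes in seconds.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic regression data: 120 rows, 3 features.
X_demo = rng.normal(size=(120, 3))
y_demo = (4.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1]
          + rng.normal(scale=0.1, size=120))
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# A deliberately small grid so the search is fast.
small_grid = {
    "n_estimators": [10, 20, 30],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=small_grid,
    scoring="neg_mean_squared_error",
    n_iter=4,        # number of sampled parameter combinations
    cv=3,            # folds of cross-validation per combination
    random_state=42,
    n_jobs=1,
)
search.fit(X_tr, y_tr)

best_params = search.best_params_   # dict of the winning combination
best_score = search.best_score_     # negative MSE: closer to 0 is better
demo_predictions = search.predict(X_te)  # uses the refitted best model

print(best_params)
print(best_score)
```

Because the scoring is `neg_mean_squared_error`, the best score is reported as a negative number; negate it to read it as a plain mean squared error.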
- Visualize those predictions against the test data we split off and held back earlier
- Now calculate the Mean Absolute Error, the Mean Squared Error and the Root Mean Squared Error
from sklearn import metrics
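The three error measures can be checked by hand on a tiny example. The values below are invented so that the arithmetic is easy to follow (the errors are 1, 0 and 2); `actual` and `predicted` are hypothetical names. The root mean squared error is taken as the square root of the MSE, which works across scikit-learn versions.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted prices.
actual = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.0, 5.0, 9.0])

# Mean Absolute Error: average of |error| = (1 + 0 + 2) / 3
mae = metrics.mean_absolute_error(actual, predicted)

# Mean Squared Error: average of error^2 = (1 + 0 + 4) / 3
mse = metrics.mean_squared_error(actual, predicted)

# Root Mean Squared Error: back in the same units as the price.
rmse = np.sqrt(mse)

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

With the real model, the same calls would take the held-out test targets and the predictions from `rf_random.predict(X_test)`.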
- To use this model in the future or for deployment, store it in a pickle file
import pickle
# Open a file where you want to store the model; the with block closes it automatically
with open('random_forest_regression_model.pkl', 'wb') as file:
    # Dump the fitted search object (including its best model) to that file
    pickle.dump(rf_random, file)
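Loading the stored model back is the mirror image of dumping it. The sketch below round-trips a small stand-in model through a temporary file and confirms that the reloaded copy predicts identically; the synthetic data, the file name `demo_model.pkl` and the variable names are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=50)

# A small stand-in for the tuned model.
model_demo = RandomForestRegressor(n_estimators=10, random_state=0)
model_demo.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")

    # Save the fitted model ...
    with open(path, "wb") as f:
        pickle.dump(model_demo, f)

    # ... and load it back, as a deployment script would.
    with open(path, "rb") as f:
        loaded_model = pickle.load(f)

# The reloaded model should reproduce the original predictions.
same_predictions = np.allclose(
    model_demo.predict(X_demo), loaded_model.predict(X_demo)
)
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.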
Please feel free to get in touch
- This project is built for predicting used car selling prices, not new car prices or showroom prices
- The dataset I've used in this project, and more such datasets for practice, can also be found here
