Build a regression model that predicts the Sale Price of houses sold in Ames with the lowest possible error.
- Read 'train.csv', then clean and organise the data
- Create a regression model based on the Ames Housing Dataset to predict the sale price of a house in Ames, IA, using a train-test split
- Predict Sale Price for the unseen data in 'test.csv' using the predictor values it provides
- Data Preparation
- One Hot Encoding
- Feature Engineering
- Model Pre-Work
- Train/Test Split
- Instantiation
- Model Selection
- Prediction
- Summary
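The one-hot encoding step in the outline has a common pitfall: a category that appears in 'train.csv' may be missing from 'test.csv', or vice versa, so the two encoded frames end up with different columns. Below is a minimal sketch of one way to keep them aligned, assuming pandas is used. The tiny frames and their values are made up for illustration and are not taken from the project data.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (illustrative values only).
train = pd.DataFrame({
    "Neighborhood": ["NridgHt", "StoneBr", "NAmes"],
    "Total SF": [3200, 2900, 1800],
})
test = pd.DataFrame({
    "Neighborhood": ["NAmes", "NridgHt"],
    "Total SF": [1700, 3100],
})

# One-hot encode the categorical column in each frame separately.
train_enc = pd.get_dummies(train, columns=["Neighborhood"])
test_enc = pd.get_dummies(test, columns=["Neighborhood"])

# Give the test frame exactly the training columns: dummy columns missing
# from test are added as all zeros, and columns unseen in training are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```

Aligning on the training columns keeps the model's expected feature order intact at prediction time.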
Refer to the data description.
- Linear Regression
- Ridge Regression
- Lasso Regression
- Elastic Net
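One way to compare these four candidates is to fit each on the same training split and score it on the held-out split. The sketch below assumes scikit-learn and uses synthetic data from `make_regression` in place of the cleaned Ames features. The `alpha` and `l1_ratio` values are arbitrary placeholders, not the settings used in this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned feature matrix and sale prices.
X, y = make_regression(n_samples=300, n_features=30, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=1.0),
    "Elastic Net": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

scores = {}
for name, model in models.items():
    # Scale inside the pipeline so the scaler is fitted on training data only.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    scores[name] = {"r2": r2_score(y_test, pred), "rmse": rmse}
    print(f"{name}: R2 = {scores[name]['r2']:.3f}, RMSE = {rmse:.1f}")
```

Scoring every model on the same held-out split keeps the comparison fair. In practice the regularisation strength would be tuned rather than fixed.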
Lasso Regression (Lasso) is chosen to model the Ames Housing data and predict Sale Price. The model achieves an R2 score of 0.887, meaning it explains 88.7% of the variance in Sale Price, and an RMSE of 31260 based on the Kaggle submission.
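To make the two metrics concrete, here is a minimal sketch that computes R2 and RMSE by hand. The prices are invented for illustration and are not from the Ames data.

```python
import math

def r2_and_rmse(actual, predicted):
    """Return (R2, RMSE) for two equal-length sequences of numbers."""
    n = len(actual)
    mean_actual = sum(actual) / n
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean price.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse

actual = [100_000, 150_000, 200_000, 250_000]
predicted = [110_000, 140_000, 210_000, 240_000]
r2, rmse = r2_and_rmse(actual, predicted)
print(f"R2 = {r2:.3f}, RMSE = {rmse:.0f}")  # R2 = 0.968, RMSE = 10000
```

In this toy case the predictions account for 96.8% of the variance in the actual prices, while the RMSE of 10,000 is the typical size of an error in the same units as the price.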
30 features were used for the Lasso model:
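Based on the coefficients, Total SF is the most significant variable affecting the house price, followed by Overall Quality and Kitchen Quality, which makes sense. The Total SF coefficient is close to USD 40,000 per unit increase in the feature. If the features were standardized before fitting, as is common for Lasso, that unit is one standard deviation of Total SF rather than one square foot.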
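How a coefficient reads depends on whether the features were scaled before fitting. As a hedged illustration only: if Total SF had been standardized, its coefficient would be the price change per one standard deviation, and dividing by that standard deviation would give the per-square-foot effect. The standard deviation used below is a made-up placeholder, not a value computed from the dataset.

```python
def per_unit_effect(coef_per_std, feature_std):
    """Convert a coefficient on a standardized feature to original units."""
    return coef_per_std / feature_std

# Hypothetical numbers: a coefficient of 40,000 USD per standard deviation
# and an assumed standard deviation of 800 square feet for Total SF.
coef_per_std = 40_000.0
assumed_std_sqft = 800.0

usd_per_sqft = per_unit_effect(coef_per_std, assumed_std_sqft)
print(f"{usd_per_sqft:.0f} USD per square foot")  # 50 USD per square foot
```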
We can also observe that the four neighbourhoods with the largest positive effect on price are Northridge Heights, Stone Brook, Green Hills and Northridge.
Further research supports the model, as seen from the map above. Green Hills is a stone's throw away from Iowa State University and relatively close to Ames city center. Northridge, Northridge Heights and Stone Brook are among the upper-class neighbourhoods in Ames, with nearby malls, a neighbourhood center, parks and even a golf course (a sport for rich people!).
We can also infer which other neighbourhoods in Ames are richer or poorer from whether each neighbourhood has a positive or negative coefficient on Sale Price.
Limitations
- Many variables were dropped because of skewed data or null values
- The model would improve with more comprehensive data collection
- Errors increase when predicting sale prices in the higher range
- External, unknown variables are not included
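The limitation about higher-priced houses can be checked by computing the error separately for each price band. Below is a minimal sketch with invented numbers. The band threshold and the prices are assumptions for illustration only.

```python
import math

def rmse_by_band(actual, predicted, threshold):
    """Return RMSE for houses below the threshold and at or above it."""
    bands = {"lower": [], "higher": []}
    for a, p in zip(actual, predicted):
        key = "higher" if a >= threshold else "lower"
        bands[key].append((a - p) ** 2)
    # Skip empty bands to avoid dividing by zero.
    return {k: math.sqrt(sum(v) / len(v)) for k, v in bands.items() if v}

actual = [120_000, 150_000, 180_000, 400_000, 450_000]
predicted = [125_000, 145_000, 185_000, 370_000, 410_000]
result = rmse_by_band(actual, predicted, threshold=300_000)
print(result)
```

A noticeably larger RMSE in the higher band than in the lower one would confirm that the errors grow with price.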
In summary, the Lasso model for the Ames housing dataset addresses the problem statement of estimating the Sale Price of Ames houses with the lowest possible error.