Clean up and comment out
bcopy committed Oct 11, 2024
1 parent cead59f commit 639aeff
Showing 4 changed files with 94 additions and 3,224 deletions.
@@ -2052,145 +2052,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to Top](#Table-of-Contents)\n",
"### Conclusion\n",
"\n",
"## Step 4: Modeling\n",
"In this case study, we explored the Titanic dataset following the steps of the data mining process:\n",
"\n",
"We now have a relatively clean dataset (except for the **Cabin** column, which has many missing values). We could run a classification on Survived to predict whether a passenger survived the disaster, or a regression on Fare to predict the ticket fare. This dataset is not well suited to regression, but since this workshop does not cover classification, we will construct a linear regression on Fare in this exercise."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Task 16: Construct a regression on Fare\n",
"Construct a regression model with statsmodels.\n",
"1. We started by understanding the business context and the objectives of the analysis.\n",
"2. We then explored and understood the data, identifying important features and their relationships.\n",
"3. Finally, we prepared the data by handling missing values and creating new features.\n",
"\n",
"Pick Pclass, Embarked, FamilySize as independent variables."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"simpletable\">\n",
"<caption>OLS Regression Results</caption>\n",
"<tr>\n",
" <th>Dep. Variable:</th> <td>Fare</td> <th> R-squared: </th> <td> 0.427</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Model:</th> <td>OLS</td> <th> Adj. R-squared: </th> <td> 0.424</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Method:</th> <td>Least Squares</td> <th> F-statistic: </th> <td> 131.9</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Date:</th> <td>Wed, 24 Apr 2019</td> <th> Prob (F-statistic):</th> <td>1.92e-104</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Time:</th> <td>12:07:17</td> <th> Log-Likelihood: </th> <td> -4495.8</td> \n",
"</tr>\n",
"<tr>\n",
" <th>No. Observations:</th> <td> 891</td> <th> AIC: </th> <td> 9004.</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Residuals:</th> <td> 885</td> <th> BIC: </th> <td> 9032.</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Model:</th> <td> 5</td> <th> </th> <td> </td> \n",
"</tr>\n",
"<tr>\n",
" <th>Covariance Type:</th> <td>nonrobust</td> <th> </th> <td> </td> \n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <td></td> <th>coef</th> <th>std err</th> <th>t</th> <th>P>|t|</th> <th>[0.025</th> <th>0.975]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Intercept</th> <td> 79.2989</td> <td> 3.543</td> <td> 22.381</td> <td> 0.000</td> <td> 72.345</td> <td> 86.253</td>\n",
"</tr>\n",
"<tr>\n",
" <th>C(Pclass)[T.2]</th> <td> -59.0955</td> <td> 3.921</td> <td> -15.073</td> <td> 0.000</td> <td> -66.790</td> <td> -51.401</td>\n",
"</tr>\n",
"<tr>\n",
" <th>C(Pclass)[T.3]</th> <td> -68.8790</td> <td> 3.253</td> <td> -21.174</td> <td> 0.000</td> <td> -75.264</td> <td> -62.494</td>\n",
"</tr>\n",
"<tr>\n",
" <th>C(Embarked)[T.Q]</th> <td> -11.8147</td> <td> 5.446</td> <td> -2.169</td> <td> 0.030</td> <td> -22.504</td> <td> -1.126</td>\n",
"</tr>\n",
"<tr>\n",
" <th>C(Embarked)[T.S]</th> <td> -14.9202</td> <td> 3.414</td> <td> -4.371</td> <td> 0.000</td> <td> -21.620</td> <td> -8.220</td>\n",
"</tr>\n",
"<tr>\n",
" <th>FamilySize</th> <td> 7.8256</td> <td> 0.789</td> <td> 9.919</td> <td> 0.000</td> <td> 6.277</td> <td> 9.374</td>\n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <th>Omnibus:</th> <td>1043.506</td> <th> Durbin-Watson: </th> <td> 2.040</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Prob(Omnibus):</th> <td> 0.000</td> <th> Jarque-Bera (JB): </th> <td>118621.734</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Skew:</th> <td> 5.718</td> <th> Prob(JB): </th> <td> 0.00</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Kurtosis:</th> <td>58.357</td> <th> Cond. No. </th> <td> 13.4</td> \n",
"</tr>\n",
"</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/plain": [
"<class 'statsmodels.iolib.summary.Summary'>\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: Fare R-squared: 0.427\n",
"Model: OLS Adj. R-squared: 0.424\n",
"Method: Least Squares F-statistic: 131.9\n",
"Date: Wed, 24 Apr 2019 Prob (F-statistic): 1.92e-104\n",
"Time: 12:07:17 Log-Likelihood: -4495.8\n",
"No. Observations: 891 AIC: 9004.\n",
"Df Residuals: 885 BIC: 9032.\n",
"Df Model: 5 \n",
"Covariance Type: nonrobust \n",
"====================================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------------\n",
"Intercept 79.2989 3.543 22.381 0.000 72.345 86.253\n",
"C(Pclass)[T.2] -59.0955 3.921 -15.073 0.000 -66.790 -51.401\n",
"C(Pclass)[T.3] -68.8790 3.253 -21.174 0.000 -75.264 -62.494\n",
"C(Embarked)[T.Q] -11.8147 5.446 -2.169 0.030 -22.504 -1.126\n",
"C(Embarked)[T.S] -14.9202 3.414 -4.371 0.000 -21.620 -8.220\n",
"FamilySize 7.8256 0.789 9.919 0.000 6.277 9.374\n",
"==============================================================================\n",
"Omnibus: 1043.506 Durbin-Watson: 2.040\n",
"Prob(Omnibus): 0.000 Jarque-Bera (JB): 118621.734\n",
"Skew: 5.718 Prob(JB): 0.00\n",
"Kurtosis: 58.357 Cond. No. 13.4\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"\"\"\""
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# import statsmodels.formula.api as smf\n",
"# result = smf.ols(\"Fare ~ C(Pclass) + C(Embarked) + FamilySize\", data=df_titanic).fit()\n",
"# result.summary()"
"This analysis allowed us to draw several interesting conclusions about the factors that influenced survival and ticket prices on the Titanic. However, it's important to note that this is just a beginning. For a more in-depth analysis, we could consider:\n",
"\n",
"- Using classification techniques to predict survival.\n",
"- Exploring other features or combinations of features.\n",
"- Using more advanced modeling techniques.\n",
"\n",
"This case study illustrates how data analysis can help us understand historical events and draw lessons that could be applicable in other contexts."
]
}
],
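
Note on reading the OLS summary in the cell output above: the following is a minimal sketch, not part of the commit, of how the reported coefficients combine into a fitted Fare. It assumes the default treatment coding used by the `C(...)` terms in the formula, so the levels without a row in the table (taken here to be Pclass 1 and Embarked C) are baselines that contribute nothing. The `COEF` dictionary and the `predict_fare` helper are hypothetical names introduced only for illustration; the numbers are copied from the summary table.

```python
# Coefficients copied from the "OLS Regression Results" table in the cell output.
COEF = {
    "Intercept": 79.2989,
    "C(Pclass)[T.2]": -59.0955,
    "C(Pclass)[T.3]": -68.8790,
    "C(Embarked)[T.Q]": -11.8147,
    "C(Embarked)[T.S]": -14.9202,
    "FamilySize": 7.8256,
}


def predict_fare(pclass, embarked, family_size):
    """Return the fitted Fare for one passenger (hypothetical helper)."""
    fare = COEF["Intercept"]
    # A level with no coefficient is treated as the baseline and adds 0.
    # Assumption: baselines are Pclass 1 and Embarked "C". Note that any
    # unknown level would also silently fall back to the baseline here.
    fare += COEF.get(f"C(Pclass)[T.{pclass}]", 0.0)
    fare += COEF.get(f"C(Embarked)[T.{embarked}]", 0.0)
    # FamilySize enters as a plain numeric term.
    fare += COEF["FamilySize"] * family_size
    return fare


if __name__ == "__main__":
    # First class, baseline port, FamilySize of 2.
    print(round(predict_fare(1, "C", 2), 4))
    # Third class, embarked at S, FamilySize of 1.
    print(round(predict_fare(3, "S", 1), 4))
```

Each categorical coefficient is an offset from the baseline passenger with the other terms held fixed, which is why the Pclass 2 and Pclass 3 rows are large negative numbers relative to the intercept.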