pull #10 (Open)

wants to merge 31 commits into master

Commits (31):
- 609763c Add files via upload (XiaodongAAA, Nov 5, 2017)
- 2b9561f Update index.Rmd (XiaodongAAA, Nov 5, 2017)
- 94b2d33 Create README.Rmd (XiaodongAAA, Nov 5, 2017)
- 8cef790 Add files via upload (XiaodongAAA, Nov 6, 2017)
- ddc4cdc Add files via upload (XiaodongAAA, Nov 6, 2017)
- 20fb646 Add files via upload (XiaodongAAA, Nov 6, 2017)
- 08e574d Set theme jekyll-theme-merlot (XiaodongAAA, Nov 6, 2017)
- b5a7109 Add files via upload (XiaodongAAA, Nov 6, 2017)
- 00e5372 Add files via upload (XiaodongAAA, Nov 6, 2017)
- d978c5f Create data (XiaodongAAA, Nov 11, 2017)
- 83ad44c Delete data (XiaodongAAA, Nov 11, 2017)
- e33d72f Update README.Rmd (XiaodongAAA, Nov 11, 2017)
- e7ce3c8 creat data file (XiaodongAAA, Nov 11, 2017)
- 8e11508 data_creating (XiaodongAAA, Nov 11, 2017)
- 06304e4 data_creating (XiaodongAAA, Nov 11, 2017)
- 1132480 data_creating (XiaodongAAA, Nov 12, 2017)
- 2306a63 data_analysis (XiaodongAAA, Nov 12, 2017)
- 9bd752b data_analysis (XiaodongAAA, Nov 13, 2017)
- e60175f data_analysis (XiaodongAAA, Nov 13, 2017)
- 8fafbf3 data_analysis (XiaodongAAA, Nov 13, 2017)
- 94e1205 Add files via upload (XiaodongAAA, Nov 18, 2017)
- bc9b08f Add files via upload (XiaodongAAA, Nov 18, 2017)
- 4545b22 update all files (XiaodongAAA, Nov 21, 2017)
- 0443a17 update all files (XiaodongAAA, Nov 21, 2017)
- d7e604a update all files (XiaodongAAA, Nov 27, 2017)
- 4118e97 update all files (XiaodongAAA, Nov 28, 2017)
- 4439f0a update all files (XiaodongAAA, Nov 28, 2017)
- 4f3108e update all files (XiaodongAAA, Nov 28, 2017)
- 08a6ae1 update all files (XiaodongAAA, Nov 28, 2017)
- 932c7b7 update exe5 (XiaodongAAA, Dec 2, 2017)
- e64f4b1 change themes (XiaodongAAA, Dec 10, 2017)
6 changes: 6 additions & 0 deletions README.Rmd
@@ -0,0 +1,6 @@
## A short description of the course
This course, Introduction to Open Data Science, aims to help students understand the principles and advantages of using open research tools with open data, and to understand the possibilities of reproducible research. After this course the students should know how to use R, RStudio, R Markdown and GitHub, and how to learn more about these open software tools. Besides, and most importantly for me, the students will also know how to apply certain statistical methods of data science.


*The link of my course diary is listed below:*
https://XiaodongAAA.github.io/IODS-project/
158 changes: 158 additions & 0 deletions README.html

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions _config.yml
@@ -0,0 +1 @@
theme: jekyll-theme-merlot
8 changes: 7 additions & 1 deletion chapter1.Rmd
@@ -1,4 +1,10 @@

# About the project

*Write a short description about the course and add a link to your github repository here. This is an R markdown (.Rmd) file so you can use R markdown syntax. See the 'Useful links' page in the mooc area (chapter 1) for instructions.*

## A short description of the course
This course, Introduction to Open Data Science, aims to help students understand the principles and advantages of using open research tools with open data, and to understand the possibilities of reproducible research. After this course the students should know how to use R, RStudio, R Markdown and GitHub, and how to learn more about these open software tools. Besides, and most importantly for me, the students will also know how to apply certain statistical methods of data science.

*The link of my github repository is listed below:*
https://github.com/XiaodongAAA/IODS-project
162 changes: 162 additions & 0 deletions chapter1.html

Large diffs are not rendered by default.

78 changes: 77 additions & 1 deletion chapter2.Rmd
@@ -1,7 +1,83 @@
# Regression and model validation

*Describe the work you have done this week and summarize your learning.*

- Describe your work and results clearly.
- Assume the reader has an introductory course level understanding of writing and reading R code as well as statistical methods
- Assume the reader has no previous knowledge of your data or the more advanced methods you are using

*Created at 10:00 12.11.2017*
*@author: Xiaodong Li*
*The script for RStudio Exercise 2 -- data analysis*

**Import packages:**
```{r}
library(dplyr)
library(GGally)
library(ggplot2)
```

## Step 1: Read data
```{r}
# Read the tab-separated learning2014 data produced in the data-wrangling exercise
lrn2014 = read.table('/home/xiaodong/IODS_course/IODS-project/data/learning2014.txt', sep = '\t', header = TRUE)
```
Structure of the data
```{r}
str(lrn2014)
```
Dimensions of the data
```{r}
dim(lrn2014)
```
Data description:
According to the structure and dimensions of the data, the data frame contains 7 variables, `gender`, `age`, `attitude`, `deep`, `stra`, `surf` and `points`, each with 166 observations. `gender` records male (M) and female (F) respondents. `age` is the age (in years) derived from the date of birth. The `attitude` column lists the global attitude toward statistics. The columns `deep`, `surf` and `stra` summarize the questions related to deep, surface and strategic learning; the related questions can be found at http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-meta.txt.
The `points` column lists the exam points from the survey.
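As a quick sanity check of the ranges described above, `summary()` prints per-variable quartiles and counts (a minimal sketch, assuming `lrn2014` was read in as in Step 1):
```{r}
# Five-number summary for each numeric column; gender is tabulated as a factor
summary(lrn2014)
```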

## Step 2: Explore the data
Plot the relationships between the variables
```{r}
# Pairwise scatter plots, distributions and correlations, colored by gender
p <- ggpairs(lrn2014, mapping = aes(col = gender, alpha = 0.3))
p
```

The figure shows the relationships between the different variables. From the `gender` column we can see that there are more female than male respondents. Most of the people surveyed are young, around 20 years old. The `attitude` scores of men are higher than those of women, suggesting that men have a more positive attitude towards statistics. The `deep` and strategic learning (`stra`) answers are almost the same for men and women, with mean scores around 3 and 4. However, the surface (`surf`) answers differ clearly between male and female respondents: for women the answers are centered at around 3.0, while for men they are centered at around 2.3. The exam `points` are about the same for male and female respondents, peaking at about 23 for both.

## Step 3: Multiple regression
```{r}
# Fit exam points against attitude, strategic and surface learning
reg_model = lm(points ~ attitude + stra + surf, data = lrn2014)
summary(reg_model)
```

The target variable `points` is fitted to three explanatory variables: `attitude`, `stra` and `surf`. According to the results of the model, `surf` does not have a statistically significant relationship with the target variable, so `surf` is removed from the model and `points` is modelled with `attitude` and `stra` again.

## Step 4: Model again
```{r}
# Refit without the non-significant surf term
reg_model2 = lm(points ~ attitude + stra, data = lrn2014)
summary(reg_model2)
```

The model is fitted with the target `points` and two explanatory variables, `attitude` and `stra`. According to the summary results, the fitted relationship is $points = 8.9729 + 3.4658*attitude + 0.9137*stra$. The summary quantities are interpreted below; the sketch after the list reproduces two of them directly from the model object.
* The `Std. Error` is the standard deviation of the sampling distribution of the coefficient estimate under the standard regression assumptions.
* The `t values` are the estimates divided by their standard errors, i.e. an estimate of how extreme the observed value is relative to the standard error.
* `Pr.` is the `p-value` for the hypothesis for which the `t value` is the test statistic. It tells you the probability of a test statistic at least as unusual as the one you obtained, if the null hypothesis were true (the null hypothesis is usually 'no effect', unless something else is specified). So, if the `p-value` is very low, you are likely seeing data that are counter-indicative of zero effect.
* The `Residual standard error` represents the standard deviation of the residuals. It is a measure of how close the fit is to the points.
* The `Multiple R-squared` is the proportion of the variance in the data that is explained by the model. The more variables you add, the larger this will be. The `Adjusted R-squared` corrects for the number of variables in the model.
* The `F-statistic` on the last line tells you whether the regression as a whole performs 'better than random', in other words, whether your model fits better than you would expect if none of your predictors had a relationship with the response.
* The `p-value` in the last row is the p-value for that test, essentially comparing the full fitted model with an intercept-only model.
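These quantities can be checked by hand from the model object (a small sketch using R's standard `summary.lm()` components; nothing here is new data):
```{r}
coefs <- summary(reg_model2)$coefficients
# t value = Estimate / Std. Error, reproducing the summary column
coefs[, "Estimate"] / coefs[, "Std. Error"]
# The final F-test compares the full model against an intercept-only model
anova(lm(points ~ 1, data = lrn2014), reg_model2)
```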

## Step 5: Diagnostic plots
Residuals vs Fitted values, Normal QQ plot and Residuals vs Leverage are plotted:
```{r}
# Draw diagnostic plots 1 (Residuals vs Fitted), 2 (Normal QQ) and 5 (Residuals vs Leverage)
par(mfrow = c(2, 2))
plot(reg_model2, which = c(1, 2, 5))
```

* The `Residuals vs Fitted` values plot examines whether the errors have constant variance. The graph shows a reasonably constant variance without any pattern.
* The `Normal QQ plot` checks whether the errors are normally distributed. The graph shows a very good fit to the line, indicating normally distributed errors.
* The `Residuals vs Leverage` plot checks whether there are any outliers with high leverage. The graph shows that all the leverage values are below 0.06, so no single observation has an outsized influence on the model.
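The leverage claim can also be checked numerically via the hat values that underlie the `Residuals vs Leverage` plot (a quick sketch with base R's `hatvalues()`):
```{r}
# Largest leverage values in the fitted model; all should stay below about 0.06
head(sort(hatvalues(reg_model2), decreasing = TRUE))
```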

## Conclusion
The data `learning_2014` is explored and analysed using a graphical overview, data summaries, multiple regression and diagnostic plots. The relationships between the different variables are examined in a single plot showing all possible scatter plots of the columns of `learning_2014`. The exam points variable `points` is fitted against the combination of `attitude` and `stra` and shows a reasonably good linear trend, despite the relatively low R-squared value. The validity of the model is checked by means of residual analysis. The model predicts that the exam points of the students are positively correlated with their attitude and their strategic learning scores.



289 changes: 289 additions & 0 deletions chapter2.html

Large diffs are not rendered by default.

141 changes: 141 additions & 0 deletions chapter3.Rmd
@@ -0,0 +1,141 @@
# Chapter 3 Logistic regression
*Created on 18.11.2017*
*@author: Xiaodong Li*
This is the script for RStudio Exercise 3 -- data analysis. The work focuses on exploring data and on performing and interpreting logistic regression analysis of the UCI Machine Learning Repository Student Performance Data Set.

## Step 0: Import packages
```{r}
library(tidyr)
library(dplyr)
library(ggplot2)
```

## Step 1: Read data
```{r}
# Read the joined student alcohol consumption data set
alc = read.csv('/home/xiaodong/IODS_course/IODS-project/data/alc.csv', sep = ',', header = TRUE)
colnames(alc)
```
The data used in this exercise is a joined data set combining the two student alcohol consumption data sets, student-mat.csv and student-por.csv, retrieved from the UCI Machine Learning Repository. The data come from two identical questionnaires about secondary school student alcohol consumption in Portugal. For more background information, please check [here.](https://archive.ics.uci.edu/ml/datasets/Student+Performance) The variables not used for joining the two data sets have been combined by averaging. The `alc_use` column is the average of weekday (`Dalc`) and weekend (`Walc`) alcohol use. The `high_use` column records whether `alc_use` is higher than 2.
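For reference, the two derived columns can be reproduced from the raw `Dalc` and `Walc` variables along these lines (a hedged sketch of the wrangling step, not the exact join script):
```{r}
# alc_use: mean of weekday and weekend use; high_use: TRUE when alc_use exceeds 2
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2, high_use = alc_use > 2)
```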

## Step 2: Hypotheses about relationships with alcohol consumption
Choose four interesting variables in the data and present personal hypotheses about their relationships with alcohol consumption:
- `failures`: positive correlation; the more alcohol consumption, the more failures
- `absences`: positive correlation; the more alcohol consumption, the more absences
- `sex`: more male than female students among the high alcohol users
- `studytime`: negative correlation; the more alcohol consumption, the less study time

## Step 3: Explore the distributions of the chosen variables and their relationships with alcohol consumption

### The relationship between sex and alcohol use
```{r}
# Bar chart of alcohol use counts, filled by sex
bar_sex = ggplot(alc, aes(x = alc_use, fill = sex)) + geom_bar(); bar_sex
```
`sex` ~ `alc_use`:
According to the bar chart of counts versus `alc_use`, split by sex, female students make up most of the low alcohol users (`alc_use` < 2.5), whereas most of the high alcohol users (`alc_use` > 2.5) are male. The chart also shows that most alcohol users are very light users, and the counts decrease quickly with increasing alcohol use level (except for the extreme high users).
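The visual impression can be confirmed with a simple cross tabulation of sex against high use (a quick sketch):
```{r}
# Counts of low/high alcohol use for each sex
table(sex = alc$sex, high_use = alc$high_use)
```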

### The relationship between absences, failures, studytime and alcohol use
The failures, absences and study time are scaled by the number of students at each alcohol use level.
```{r}
# Group by alcohol use level, then scale the per-level sums by the student counts
alc2 = group_by(alc, alc_use)
tab_sum = summarise(alc2, count = n(), absences = sum(absences), failures = sum(failures), studytime = sum(studytime))
tab_sum = mutate(tab_sum, abs_count = absences / count, fai_count = failures / count, styt_count = studytime / count)
tab_sum
```
```{r}
# stat='identity' plots the precomputed per-level averages directly
bar_absences = ggplot(tab_sum, aes(x = alc_use, y = abs_count)) + geom_bar(stat = 'identity'); bar_absences
bar_failures = ggplot(tab_sum, aes(x = alc_use, y = fai_count)) + geom_bar(stat = 'identity'); bar_failures
bar_studytime = ggplot(tab_sum, aes(x = alc_use, y = styt_count)) + geom_bar(stat = 'identity'); bar_studytime
```

`absences` ~ `alc_use`:
There is an increasing trend of absences with alcohol use, which is in line with our hypothesis. When the alcohol use is 4.5 the absences are extremely high, and the second highest absences occur at the highest alcohol use level.
`failures` ~ `alc_use`:
There is a positive correlation between failures and alcohol use. For light alcohol users the failures are also at a low level; the failures reach their highest amount at `alc_use` = 3. After that the failures fall with increasing alcohol use, which we interpret as a lack of samples. For extreme high alcohol users (`alc_use` = 5), the failures are again at the highest level, the same as at `alc_use` = 3.
`studytime` ~ `alc_use`:
The figure shows no obvious relation between study time and alcohol use, which does not agree with our earlier hypothesis. The lowest alcohol users have the most study time, and the study time of the highest users is low compared with the other levels, but the differences are not very pronounced.

### Box plots by groups
Box plots are an excellent way of displaying and comparing distributions. The box visualizes the 25th, 50th and 75th percentiles of the data, while the whiskers show the typical range and the outliers of a variable.
```{r}
# Compare the distributions for high vs. low alcohol use
box_absences = ggplot(alc, aes(x = high_use, y = absences)) + geom_boxplot(); box_absences
box_failures=ggplot(alc,aes(x=high_use,y=failures))+geom_boxplot(); box_failures
box_studytime=ggplot(alc,aes(x=high_use,y=studytime))+geom_boxplot(); box_studytime
```

From the box plot of `absences` vs. `high_use`, it is obvious that the high alcohol users (`alc_use` > 2) are the most likely to be absent from school. The box plot of `studytime` vs. `high_use` shows that high alcohol use also reduces the study time of the students. These conclusions are in line with our earlier hypotheses.

## Step 4: Logistic regression
Logistic regression is used here to identify factors (failures, absences, sex and studytime) related to higher than average student alcohol consumption.
```{r}
# Model high_use with four explanatory variables
m = glm(high_use ~ failures + absences + sex + studytime, data = alc, family = 'binomial')
summary(m)
coef(m)
```

According to the summary results, the estimated coefficients for failures, absences, sexM and studytime are 0.360, 0.087, 0.795 and -0.340, respectively. For failures, absences and sexM the association with high alcohol use is positive, while for studytime it is negative, in agreement with our earlier hypotheses. According to the p-values shown in the summary, the strongest relationship is between `absences` and `high_use`. The relationship between `failures` and `high_use` appears less convincing, which agrees with the box plot shown in the previous part.

```{r}
# Odds ratios and their 95% confidence intervals
OR = coef(m) %>% exp
CI = confint(m) %>% exp
cbind(OR, CI)
```

The ratio of expected "successes" to "failures" is called the odds: p/(1-p). Odds are an alternative way of expressing probabilities. An odds ratio above 1 means that X is positively associated with "success"; an odds ratio below 1 means that X corresponds to lower odds, and hence a lower probability, of "success". The computational target variable in the logistic regression model is the log of the odds, so applying the exponential function to the modelled values gives the odds.
From the summary of the odds ratios one can see that sexM gives the largest odds ratio: being male corresponds to a higher probability of high alcohol use than the other factors. The odds ratio of studytime is lower than 1, indicating that more study time corresponds to a lower probability of high alcohol use.
The 2.5% and 97.5% confidence interval bounds for the odds ratios are also listed in the data frame.
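To connect odds back to probabilities, note that $p = odds/(1+odds)$; as a small illustrative sketch, the baseline probability implied by the model's intercept (all predictors at zero) can be computed directly:
```{r}
# Odds at the intercept, converted to a probability: p = odds / (1 + odds)
baseline_odds <- exp(coef(m)["(Intercept)"])
baseline_odds / (1 + baseline_odds)
# plogis() maps log-odds straight to probability, giving the same number
plogis(coef(m)["(Intercept)"])
```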

## Step 5: Binary predictions
Use `predict()` to compute the probability of `high_use`:
```{r}
probabilities <- predict(m, type = "response")
# Store the predicted probabilities and classify as high_use when p > 0.5
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
select(alc, failures, absences, sex, studytime, high_use, probability, prediction) %>% tail(10)
# Cross tabulation (confusion matrix) of actual versus predicted values
table(high_use = alc$high_use, prediction = alc$prediction) %>% addmargins
g = ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g = g + geom_point()
g
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
```
According to the last 10 rows of the data, most of the predictions are correct, except for the last two: those two respondents are high alcohol users, but the model predicts that they are not.
The cross tabulation of predictions versus actual values shows that 256 out of 268 `FALSE` (non-high alcohol use) values were predicted correctly by the model, while only 34 of 114 `TRUE` (high alcohol use) values were. The correct prediction rate is 95.5% for `FALSE` samples but only 29.8% for `TRUE` samples.
The results show that the model gives relatively good predictions for `FALSE` cases but is not sensitive for `TRUE` cases.
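The two rates quoted above are the specificity and sensitivity of the classifier, and can be computed from the confusion table (a sketch, assuming `alc$prediction` from the chunk above):
```{r}
conf <- table(high_use = alc$high_use, prediction = alc$prediction)
# Specificity: share of actual FALSE cases predicted FALSE (about 95.5%)
conf["FALSE", "FALSE"] / sum(conf["FALSE", ])
# Sensitivity: share of actual TRUE cases predicted TRUE (about 29.8%)
conf["TRUE", "TRUE"] / sum(conf["TRUE", ])
```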

## Step 6: Compute the average number of incorrect predictions
Accuracy: the proportion of correctly classified observations.
Penalty (loss) function: the mean of incorrectly classified observations.
A smaller loss function value means better predictions.
```{r}
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the data
loss_func(class = alc$high_use, prob = alc$probability)
```
So the model's share of wrong predictions is about 24.1%. Combined with the analysis from Step 5, we know that most of the wrong predictions are for `TRUE` cases (12 wrong predictions among the `FALSE` cases versus 80 among the `TRUE` cases).

## Step 7: Cross-validation
Cross-validation is a method of testing a predictive model on unseen data: the value of a penalty (loss) function (here the mean prediction error) is computed on data not used for fitting the model. A lower cross-validation error means better model predictions.
Perform 10-fold cross-validation:
```{r}
library(boot)
# 10-fold cross-validation with the mean prediction error as the cost function
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
```
The 10-fold cross-validation results show that the test-set error is mostly between 0.25 and 0.26. This is not clearly smaller than the prediction error of the model introduced in DataCamp, which is about 0.26.
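Because the folds are chosen at random, the cross-validation error varies slightly from run to run; repeating the procedure gives a feel for that spread (a small sketch):
```{r}
# Repeat 10-fold CV a few times; each run re-partitions the data at random
replicate(5, cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)$delta[1])
```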

354 changes: 354 additions & 0 deletions chapter3.html

Large diffs are not rendered by default.
