pull #10 (Open)

wants to merge 31 commits into master

Commits (31):
- 609763c Add files via upload (XiaodongAAA, Nov 5, 2017)
- 2b9561f Update index.Rmd (XiaodongAAA, Nov 5, 2017)
- 94b2d33 Create README.Rmd (XiaodongAAA, Nov 5, 2017)
- 8cef790 Add files via upload (XiaodongAAA, Nov 6, 2017)
- ddc4cdc Add files via upload (XiaodongAAA, Nov 6, 2017)
- 20fb646 Add files via upload (XiaodongAAA, Nov 6, 2017)
- 08e574d Set theme jekyll-theme-merlot (XiaodongAAA, Nov 6, 2017)
- b5a7109 Add files via upload (XiaodongAAA, Nov 6, 2017)
- 00e5372 Add files via upload (XiaodongAAA, Nov 6, 2017)
- d978c5f Create data (XiaodongAAA, Nov 11, 2017)
- 83ad44c Delete data (XiaodongAAA, Nov 11, 2017)
- e33d72f Update README.Rmd (XiaodongAAA, Nov 11, 2017)
- e7ce3c8 creat data file (XiaodongAAA, Nov 11, 2017)
- 8e11508 data_creating (XiaodongAAA, Nov 11, 2017)
- 06304e4 data_creating (XiaodongAAA, Nov 11, 2017)
- 1132480 data_creating (XiaodongAAA, Nov 12, 2017)
- 2306a63 data_analysis (XiaodongAAA, Nov 12, 2017)
- 9bd752b data_analysis (XiaodongAAA, Nov 13, 2017)
- e60175f data_analysis (XiaodongAAA, Nov 13, 2017)
- 8fafbf3 data_analysis (XiaodongAAA, Nov 13, 2017)
- 94e1205 Add files via upload (XiaodongAAA, Nov 18, 2017)
- bc9b08f Add files via upload (XiaodongAAA, Nov 18, 2017)
- 4545b22 update all files (XiaodongAAA, Nov 21, 2017)
- 0443a17 update all files (XiaodongAAA, Nov 21, 2017)
- d7e604a update all files (XiaodongAAA, Nov 27, 2017)
- 4118e97 update all files (XiaodongAAA, Nov 28, 2017)
- 4439f0a update all files (XiaodongAAA, Nov 28, 2017)
- 4f3108e update all files (XiaodongAAA, Nov 28, 2017)
- 08a6ae1 update all files (XiaodongAAA, Nov 28, 2017)
- 932c7b7 update exe5 (XiaodongAAA, Dec 2, 2017)
- e64f4b1 change themes (XiaodongAAA, Dec 10, 2017)
6 changes: 6 additions & 0 deletions README.Rmd
@@ -0,0 +1,6 @@
## A short description of the course
This course, Introduction to Open Data Science, aims to help students understand the principles and advantages of using open research tools with open data, and to understand the possibilities of reproducible research. After this course the students should know how to use R, RStudio, R Markdown and GitHub, and how to learn more about these open software tools. Besides, and most importantly for me, the students will also know how to apply certain statistical methods of data science.


*The link of my course diary is listed below:*
https://XiaodongAAA.github.io/IODS-project/
158 changes: 158 additions & 0 deletions README.html

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions _config.yml
@@ -0,0 +1 @@
theme: jekyll-theme-merlot
8 changes: 7 additions & 1 deletion chapter1.Rmd
@@ -1,4 +1,10 @@

# About the project

*Write a short description about the course and add a link to your github repository here. This is an R markdown (.Rmd) file so you can use R markdown syntax. See the 'Useful links' page in the mooc area (chapter 1) for instructions.*

## A short description of the course
This course, Introduction to Open Data Science, aims to help students understand the principles and advantages of using open research tools with open data, and to understand the possibilities of reproducible research. After this course the students should know how to use R, RStudio, R Markdown and GitHub, and how to learn more about these open software tools. Besides, and most importantly for me, the students will also know how to apply certain statistical methods of data science.

*The link of my github repository is listed below:*
https://github.com/XiaodongAAA/IODS-project
162 changes: 162 additions & 0 deletions chapter1.html

Large diffs are not rendered by default.

78 changes: 77 additions & 1 deletion chapter2.Rmd
@@ -1,7 +1,83 @@
# Regression and model validation

*Describe the work you have done this week and summarize your learning.*

- Describe your work and results clearly.
- Assume the reader has an introductory course level understanding of writing and reading R code as well as statistical methods
- Assume the reader has no previous knowledge of your data or the more advanced methods you are using

*Created at 10:00 12.11.2017*
*@author: Xiaodong Li*
*The script for RStudio Exercise 2 -- data analysis*

**Import packages:**
```{r}
library(dplyr)
library(GGally)
library(ggplot2)
```

## Step 1: Read data
```{r}
# Read the tab-separated learning2014 data produced in the data-wrangling exercise
lrn2014 = read.table('/home/xiaodong/IODS_course/IODS-project/data/learning2014.txt', sep = '\t', header = TRUE)
```
Structure of the data
```{r}
str(lrn2014)
```
Dimensions of the data
```{r}
dim(lrn2014)
```
Data description:
According to the structure and dimensions of the data, the data frame contains 7 variables, `gender`, `age`, `attitude`, `deep`, `stra`, `surf` and `points`, each with 166 observations. `gender` records male (M) and female (F) respondents. `age` is the age (in years) derived from the date of birth. The `attitude` column lists the global attitude toward statistics. The columns `deep`, `surf` and `stra` summarize the questions related to deep, surface and strategic learning; the related questions can be found at http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-meta.txt.
The `points` column lists the exam points from the survey.
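As a quick sanity check of the ranges described above, `summary()` prints per-variable quartiles and counts (a minimal sketch, assuming `lrn2014` was read in as in Step 1):
```{r}
# Five-number summary for each numeric column; gender is tabulated as a factor
summary(lrn2014)
```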

## Step 2: Explore the data
Plot the relationships between the variables
```{r}
# Pairwise scatter plots, distributions and correlations, colored by gender
p <- ggpairs(lrn2014, mapping = aes(col = gender, alpha = 0.3))
p
```

The figure shows the relationships between the different variables. From the `gender` column we can see that there are more female than male respondents. Most of the people surveyed are young, around 20 years old. The `attitude` scores of men are higher than those of women, suggesting that men have a more positive attitude towards statistics. The `deep` and strategic learning (`stra`) answers are almost the same for men and women, with mean scores around 3 and 4. However, the surface (`surf`) answers differ clearly between male and female respondents: for women the answers are centered at around 3.0, while for men they are centered at around 2.3. The exam `points` are about the same for male and female respondents, peaking at about 23 for both.

## Step 3: Multiple regression
```{r}
# Fit exam points against attitude, strategic and surface learning
reg_model = lm(points ~ attitude + stra + surf, data = lrn2014)
summary(reg_model)
```

The target variable `points` is fitted to three explanatory variables: `attitude`, `stra` and `surf`. According to the results of the model, `surf` does not have a statistically significant relationship with the target variable, so `surf` is removed from the model and `points` is modelled with `attitude` and `stra` again.

## Step 4: Model again
```{r}
# Refit without the non-significant surf term
reg_model2 = lm(points ~ attitude + stra, data = lrn2014)
summary(reg_model2)
```

The model is fitted with the target `points` and two explanatory variables, `attitude` and `stra`. According to the summary results, the fitted relationship is $points = 8.9729 + 3.4658*attitude + 0.9137*stra$. The summary quantities are interpreted below; the sketch after the list reproduces two of them directly from the model object.
* The `Std. Error` is the standard deviation of the sampling distribution of the coefficient estimate under the standard regression assumptions.
* The `t values` are the estimates divided by their standard errors, i.e. an estimate of how extreme the observed value is relative to the standard error.
* `Pr.` is the `p-value` for the hypothesis for which the `t value` is the test statistic. It tells you the probability of a test statistic at least as unusual as the one you obtained, if the null hypothesis were true (the null hypothesis is usually 'no effect', unless something else is specified). So, if the `p-value` is very low, you are likely seeing data that are counter-indicative of zero effect.
* The `Residual standard error` represents the standard deviation of the residuals. It is a measure of how close the fit is to the points.
* The `Multiple R-squared` is the proportion of the variance in the data that is explained by the model. The more variables you add, the larger this will be. The `Adjusted R-squared` corrects for the number of variables in the model.
* The `F-statistic` on the last line tells you whether the regression as a whole performs 'better than random', in other words, whether your model fits better than you would expect if none of your predictors had a relationship with the response.
* The `p-value` in the last row is the p-value for that test, essentially comparing the full fitted model with an intercept-only model.
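These quantities can be checked by hand from the model object (a small sketch using R's standard `summary.lm()` components; nothing here is new data):
```{r}
coefs <- summary(reg_model2)$coefficients
# t value = Estimate / Std. Error, reproducing the summary column
coefs[, "Estimate"] / coefs[, "Std. Error"]
# The final F-test compares the full model against an intercept-only model
anova(lm(points ~ 1, data = lrn2014), reg_model2)
```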

## Step 5: Diagnostic plots
Residuals vs Fitted values, Normal QQ plot and Residuals vs Leverage are plotted:
```{r}
# Draw diagnostic plots 1 (Residuals vs Fitted), 2 (Normal QQ) and 5 (Residuals vs Leverage)
par(mfrow = c(2, 2))
plot(reg_model2, which = c(1, 2, 5))
```

* The `Residuals vs Fitted` values plot examines whether the errors have constant variance. The graph shows a reasonably constant variance without any pattern.
* The `Normal QQ plot` checks whether the errors are normally distributed. The graph shows a very good fit to the line, indicating normally distributed errors.
* The `Residuals vs Leverage` plot checks whether there are any outliers with high leverage. The graph shows that all the leverage values are below 0.06, so no single observation has an outsized influence on the model.
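The leverage claim can also be checked numerically via the hat values that underlie the `Residuals vs Leverage` plot (a quick sketch with base R's `hatvalues()`):
```{r}
# Largest leverage values in the fitted model; all should stay below about 0.06
head(sort(hatvalues(reg_model2), decreasing = TRUE))
```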

## Conclusion
The data `learning_2014` is explored and analysed using a graphical overview, data summaries, multiple regression and diagnostic plots. The relationships between the different variables are examined in a single plot showing all possible scatter plots of the columns of `learning_2014`. The exam points variable `points` is fitted against the combination of `attitude` and `stra` and shows a reasonably good linear trend, despite the relatively low R-squared value. The validity of the model is checked by means of residual analysis. The model predicts that the exam points of the students are positively correlated with their attitude and their strategic learning scores.



289 changes: 289 additions & 0 deletions chapter2.html

Large diffs are not rendered by default.

141 changes: 141 additions & 0 deletions chapter3.Rmd
@@ -0,0 +1,141 @@
# Chapter 3 Logistic regression
*Created on 18.11.2017*
*@author: Xiaodong Li*
This is the script for RStudio Exercise 3 -- data analysis. The work focuses on exploring data and on performing and interpreting logistic regression analysis of the UCI Machine Learning Repository Student Performance Data Set.

## Step 0: Import packages
```{r}
library(tidyr)
library(dplyr)
library(ggplot2)
```

## Step 1: Read data
```{r}
# Read the joined student alcohol consumption data set
alc = read.csv('/home/xiaodong/IODS_course/IODS-project/data/alc.csv', sep = ',', header = TRUE)
colnames(alc)
```
The data used in this exercise is a joined data set combining the two student alcohol consumption data sets, student-mat.csv and student-por.csv, retrieved from the UCI Machine Learning Repository. The data come from two identical questionnaires about secondary school student alcohol consumption in Portugal. For more background information, please check [here.](https://archive.ics.uci.edu/ml/datasets/Student+Performance) The variables not used for joining the two data sets have been combined by averaging. The `alc_use` column is the average of weekday (`Dalc`) and weekend (`Walc`) alcohol use. The `high_use` column records whether `alc_use` is higher than 2.
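For reference, the two derived columns can be reproduced from the raw `Dalc` and `Walc` variables along these lines (a hedged sketch of the wrangling step, not the exact join script):
```{r}
# alc_use: mean of weekday and weekend use; high_use: TRUE when alc_use exceeds 2
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2, high_use = alc_use > 2)
```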

## Step 2: Hypotheses about relationships with alcohol consumption
Choose four interesting variables in the data and present personal hypotheses about their relationships with alcohol consumption:
- `failures`: positive correlation; the more alcohol consumption, the more failures
- `absences`: positive correlation; the more alcohol consumption, the more absences
- `sex`: more male than female students among the high alcohol users
- `studytime`: negative correlation; the more alcohol consumption, the less study time

## Step 3: Explore the distributions of the chosen variables and their relationships with alcohol consumption

### The relationship between sex and alcohol use
```{r}
# Bar chart of alcohol use counts, filled by sex
bar_sex = ggplot(alc, aes(x = alc_use, fill = sex)) + geom_bar(); bar_sex
```
`sex` ~ `alc_use`:
According to the bar chart of counts versus `alc_use`, split by sex, female students make up most of the low alcohol users (`alc_use` < 2.5), whereas most of the high alcohol users (`alc_use` > 2.5) are male. The chart also shows that most alcohol users are very light users, and the counts decrease quickly with increasing alcohol use level (except for the extreme high users).
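The visual impression can be confirmed with a simple cross tabulation of sex against high use (a quick sketch):
```{r}
# Counts of low/high alcohol use for each sex
table(sex = alc$sex, high_use = alc$high_use)
```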

### The relationship between absences, failures, studytime and alcohol use
The failures, absences and study time are scaled by the number of students at each alcohol use level.
```{r}
# Group by alcohol use level, then scale the per-level sums by the student counts
alc2 = group_by(alc, alc_use)
tab_sum = summarise(alc2, count = n(), absences = sum(absences), failures = sum(failures), studytime = sum(studytime))
tab_sum = mutate(tab_sum, abs_count = absences / count, fai_count = failures / count, styt_count = studytime / count)
tab_sum
```
```{r}
# stat='identity' plots the precomputed per-level averages directly
bar_absences = ggplot(tab_sum, aes(x = alc_use, y = abs_count)) + geom_bar(stat = 'identity'); bar_absences
bar_failures = ggplot(tab_sum, aes(x = alc_use, y = fai_count)) + geom_bar(stat = 'identity'); bar_failures
bar_studytime = ggplot(tab_sum, aes(x = alc_use, y = styt_count)) + geom_bar(stat = 'identity'); bar_studytime
```

`absences` ~ `alc_use`:
There is an increasing trend of absences with alcohol use, which is in line with our hypothesis. When the alcohol use is 4.5 the absences are extremely high, and the second highest absences occur at the highest alcohol use level.
`failures` ~ `alc_use`:
There is a positive correlation between failures and alcohol use. For light alcohol users the failures are also at a low level; the failures reach their highest amount at `alc_use` = 3. After that the failures fall with increasing alcohol use, which we interpret as a lack of samples. For extreme high alcohol users (`alc_use` = 5), the failures are again at the highest level, the same as at `alc_use` = 3.
`studytime` ~ `alc_use`:
The figure shows no obvious relation between study time and alcohol use, which does not agree with our earlier hypothesis. The lowest alcohol users have the most study time, and the study time of the highest users is low compared with the other levels, but the differences are not very pronounced.

### Box plots by groups
Box plots are an excellent way of displaying and comparing distributions. The box visualizes the 25th, 50th and 75th percentiles of the data, while the whiskers show the typical range and the outliers of a variable.
```{r}
# Compare the distributions for high vs. low alcohol use
box_absences = ggplot(alc, aes(x = high_use, y = absences)) + geom_boxplot(); box_absences
box_failures=ggplot(alc,aes(x=high_use,y=failures))+geom_boxplot(); box_failures
box_studytime=ggplot(alc,aes(x=high_use,y=studytime))+geom_boxplot(); box_studytime
```

From the box plot of `absences` vs. `high_use`, it is obvious that the high alcohol users (`alc_use` > 2) are the most likely to be absent from school. The box plot of `studytime` vs. `high_use` shows that high alcohol use also reduces the study time of the students. These conclusions are in line with our earlier hypotheses.

## Step 4: Logistic regression
Logistic regression is used here to identify factors (failures, absences, sex and studytime) related to higher than average student alcohol consumption.
```{r}
# Model high_use with four explanatory variables
m = glm(high_use ~ failures + absences + sex + studytime, data = alc, family = 'binomial')
summary(m)
coef(m)
```

According to the summary results, the estimated coefficients for failures, absences, sexM and studytime are 0.360, 0.087, 0.795 and -0.340, respectively. For failures, absences and sexM the association with high alcohol use is positive, while for studytime it is negative, in agreement with our earlier hypotheses. According to the p-values shown in the summary, the strongest relationship is between `absences` and `high_use`. The relationship between `failures` and `high_use` appears less convincing, which agrees with the box plot shown in the previous part.

```{r}
# Odds ratios and their 95% confidence intervals
OR = coef(m) %>% exp
CI = confint(m) %>% exp
cbind(OR, CI)
```

The ratio of expected "successes" to "failures" is called the odds: p/(1-p). Odds are an alternative way of expressing probabilities. An odds ratio above 1 means that X is positively associated with "success"; an odds ratio below 1 means that X corresponds to lower odds, and hence a lower probability, of "success". The computational target variable in the logistic regression model is the log of the odds, so applying the exponential function to the modelled values gives the odds.
From the summary of the odds ratios one can see that sexM gives the largest odds ratio: being male corresponds to a higher probability of high alcohol use than the other factors. The odds ratio of studytime is lower than 1, indicating that more study time corresponds to a lower probability of high alcohol use.
The 2.5% and 97.5% confidence interval bounds for the odds ratios are also listed in the data frame.
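To connect odds back to probabilities, note that $p = odds/(1+odds)$; as a small illustrative sketch, the baseline probability implied by the model's intercept (all predictors at zero) can be computed directly:
```{r}
# Odds at the intercept, converted to a probability: p = odds / (1 + odds)
baseline_odds <- exp(coef(m)["(Intercept)"])
baseline_odds / (1 + baseline_odds)
# plogis() maps log-odds straight to probability, giving the same number
plogis(coef(m)["(Intercept)"])
```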

## Step 5: Binary predictions
Use `predict()` to compute the probability of `high_use`:
```{r}
probabilities <- predict(m, type = "response")
# Store the predicted probabilities and classify as high_use when p > 0.5
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
select(alc, failures, absences, sex, studytime, high_use, probability, prediction) %>% tail(10)
# Cross tabulation (confusion matrix) of actual versus predicted values
table(high_use = alc$high_use, prediction = alc$prediction) %>% addmargins
g = ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g = g + geom_point()
g
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
```
According to the last 10 rows of the data, most of the predictions are correct, except for the last two: those two respondents are high alcohol users, but the model predicts that they are not.
The cross tabulation of predictions versus actual values shows that 256 out of 268 `FALSE` (non-high alcohol use) values were predicted correctly by the model, while only 34 of 114 `TRUE` (high alcohol use) values were. The correct prediction rate is 95.5% for `FALSE` samples but only 29.8% for `TRUE` samples.
The results show that the model gives relatively good predictions for `FALSE` cases but is not sensitive for `TRUE` cases.
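The two rates quoted above are the specificity and sensitivity of the classifier, and can be computed from the confusion table (a sketch, assuming `alc$prediction` from the chunk above):
```{r}
conf <- table(high_use = alc$high_use, prediction = alc$prediction)
# Specificity: share of actual FALSE cases predicted FALSE (about 95.5%)
conf["FALSE", "FALSE"] / sum(conf["FALSE", ])
# Sensitivity: share of actual TRUE cases predicted TRUE (about 29.8%)
conf["TRUE", "TRUE"] / sum(conf["TRUE", ])
```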

## Step 6: Compute the average number of incorrect predictions
Accuracy: the proportion of correctly classified observations.
Penalty (loss) function: the mean of incorrectly classified observations.
A smaller loss function value means better predictions.
```{r}
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the data
loss_func(class = alc$high_use, prob = alc$probability)
```
So the model's share of wrong predictions is about 24.1%. Combined with the analysis from Step 5, we know that most of the wrong predictions are for `TRUE` cases (12 wrong predictions among the `FALSE` cases versus 80 among the `TRUE` cases).

## Step 7: Cross-validation
Cross-validation is a method of testing a predictive model on unseen data: the value of a penalty (loss) function (here the mean prediction error) is computed on data not used for fitting the model. A lower cross-validation error means better model predictions.
Perform 10-fold cross-validation:
```{r}
library(boot)
# 10-fold cross-validation with the mean prediction error as the cost function
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
```
The 10-fold cross-validation results show that the test-set error is mostly between 0.25 and 0.26. This is not clearly smaller than the prediction error of the model introduced in DataCamp, which is about 0.26.
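Because the folds are chosen at random, the cross-validation error varies slightly from run to run; repeating the procedure gives a feel for that spread (a small sketch):
```{r}
# Repeat 10-fold CV a few times; each run re-partitions the data at random
replicate(5, cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)$delta[1])
```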

354 changes: 354 additions & 0 deletions chapter3.html

Large diffs are not rendered by default.
