generated from r4ds/bookclub-template
-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy path02_model-development.Rmd
223 lines (146 loc) · 9.3 KB
/
02_model-development.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
# Model Development
**Learning objectives:**
- Real-World Applications of Predictive Modeling
- Introduce two primary modeling approaches
- Explain the Model Development Process (MDP) framework
- Provide a comprehensive notation guide for model development
- Introduce Notation and Model Representation
## Modeling Approaches {-}
- **Explanatory Modelling**: Models are applied for inferential purposes like testing a hypothesis.
- **Predictive Modelling**: Models are used to predicting the value of a new or future observation.
## Predictive Modeling Examples {-}
- Creating a model for **scoring risks of transactions** in a large financial company _can take several months_
- Creating a model for roughly predicting **the demand for deliveries** in a small pizza-delivery chain can be _quickly updated or even discarded_, without major consequences.
## Importance of Model Development Process {-}
- Help to define resources needed to **develop** and **maintain** a model
- Make sure that **no important steps are missed** when developing the model
![](img/02-model-development/05-importance-of-planning.jpg){width=60% height=80% fig-align="center"}
## CRISP-DM: _Cross-industry Standard Process for Data Mining_ {-}
**It is a tool-agnostic procedure**
![](img/02-model-development/01-crisp_process.jpg){width=60% height=80% fig-align="center"}
## Lifecycle of a Predictive Model {-}
![](img/02-model-development/02-MDP_washmachine.png){width=60% height=80% fig-align="center"}
## MDP: _Model-development Process_ {-}
Consecutive iterations are not identical because the knowledge increases during the process and consecutive iterations are performed with different goals in mind
- The five phases, present in CRISP-DM, are shown in the rows
- A single bar in each row represents a number of resources (for instance, a week-worth workload)
- The stages are indicated at the top of the diagram
![](img/02-model-development/03-mdp_general.png){width=60% height=80% fig-align="center"}
## MDP: _Model-development Process_ {-}
- _Stage 1: Problem Formulation_
- Defining datasets that will be used for training and validation
- Deciding which performance measures will be used for the evaluation of the performance of the final model
- _Stage 2: Crisp Modelling_
- Focuses on the creation of first versions
- Defines how complex the models needs to be
- _Stage 3: Fine-tuning_
- Focuses on improving the initial version(s) of the model
- Selecting the best one according to the pre-defined metrics
- _Stage 4: Maintenance and Decommissioning_
- Monitoring the performance of the model after its implementation
## Notation {-}
|**Description**|**Example**|**Meaning**|
|:--------------|:---------:|:----------|
|Capital letters|$X$ or $Y$ |Scalar random variables|
|Lower case letters|$x$ or $y$ |Observed values of variables|
|Underline capital letters|$\underline{X}$|Matrix|
|Underline lower case letters|$\underline{x}$|Column vector|
|Prime|$\underline{x}'$|Transposition|
## Notation {-}
|**Description**|**Example**|**Meaning**|
|:--------------|:---------:|:----------|
|E function|$E(Y)$|Expected (mean) value|
|Var function|$Var(Y)$|Variance of a random variable|
|Subscript|$E_{Y|X=x}(Y)$ <br> <br> $E_{Y|x}(Y)$ <br> <br> $E_Y(Y|X = x)$|Indicate the distribution used to compute the parameters. <br> <br> In this example, the equation represents the mean of $Y$ given that random variable $X$ assumes the value of $x$|
## Notation {-}
|**Description**|**Example**|**Meaning**|
|:--------------|:---------:|:----------|
|Letter y|$Y$|Represent the dependent (random) variable (number or binary indicator)|
|Lower case i|$y_i$|Identifies one specific observation|
|Lower case n|$n$|Number of observations in the data available for modelling|
|Lower case p|$p$|Number of explanatory variables|
|Caligraphic X|$\mathcal{X} = \mathcal{R}^p$ <br> <br> $\underline{x}_i \ni \mathcal{X}$|P-dimensional space|
|An asterisk in the subscript|$x_*$|An observation of interest|
## Notation {-}
|**Description**|**Example**|**Meaning**|
|:--------------|:---------:|:----------|
|Lower case j|$\underline{x}^j$|Number of coordinate of vector $\underline{x}$|
|Lower case i|$\underline{x}_i = (x^1_i, \dots, x^p_i)'$|Identify one specific observation of a (column) vector of the explanatory variables|
|Parenthesis to apply power|$\left( x_i^j \right)^2$|Denotes the power of the $j$-th variable for the $i$-th observation|
|Caligraphic J|$x^\mathcal{J}$|Denotes a subset of indices|
|Negative Mathematical J|$x^{-\mathcal{J}}$|Removing the $j$-th coordinate|
|$=z$|$\underline{x}^{j|=z}=(x^1, \dots, x^{j-1}, z, x^{j+1}, \dots, x^p)'$|Replaces the $j$-th component of vector $\underline{x}$ with $z$|
## Notation {-}
|**Description**|**Example**|**Meaning**|
|:--------------|:---------:|:----------|
|$*j$|$\underline{X}^{*j}$|Denote a matrix with the values as $\underline{X}$ except of $j$-th which elements are permuted|
|Model residual|$r_i = y_i - f(x_i)$ <br> <br> $r_i = y_i - \hat{y}_i$|The difference between the observed value of the dependent variable $Y$ for the $i$-th observation and the model's prediction for that observation|
## Model description {-}
It is a function $f: \mathcal{X} \rightarrow \mathcal{R}$ that transforms a point from $\mathcal{X}$ into a real number.
<br>
> The methods presented in this book can be used directly for **multivariate dependent variables**.
## Data understanding {-}
- Tabular summaries
- Statistical methods
- Visualization techniques
- Distribution of a single variable
- Histogram (continues or categorical)
- Empirical cumulative-distribution (ECD)
- Relationship between pairs of variables
- Box plot (categorical vs continues)
- Mosaic plot (two categorical variables)
- Scatter plot (two continuous variables)
![](img/02-model-development/04-data-understanding-visualization.png)
## Understanding dependent variable {-}
- Explore the approximation to **normality** and **symmetry** of the distribution, which can be important for models like:
- Linear Regression
- General Linear Models (GLM)
- Ridge and Lasso Regression
- Confirm if the proportion of observations in different **categories is balanced or not**, as most the classification methods do not work well if the categories are _substantial imbalance_.
## Understanding explanatory variables {-}
- Investigation of their distribution to identify variables presenting **little variability**.
- Relationship between explanatory variables themselves
- **Strongly correlated** explanatory variables can produce problem when optimization algorithms
- Their relationship with the dependent variable
- Variable selection/filtering if **do not appear to be related** to the dependent variable
- Variable transformation to make linear the relation for example
## Model assembly (fitting) {-}
- We want to define the **distribution of a dependent variable** $Y$ given $\underline{x}$.
- We usually **do not evaluate the entire distribution**, but just some of its characteristics, like:
- Expected value (mean)
- Variance
- A Quantile
- **A good model** that can approximate conditional expected value $E_{Y|\underline{x}}(Y) \approx f(x)$ needs to reflect a satisfactory **predictive performance**.
- The Model fitting is a procedure of selecting a value for **model coefficients** $\underline{\theta} \in \Theta$ that minimizes some loss function $L()$:
$$
\hat{\underline{\theta}} = \text{arg} \min_{\underline{\theta} \in \Theta} L \{ \underline{y}, f(\theta, X) \}
$$
## The expected squared-error of prediction {-}
$$
\begin{split}
E_{(Y,\hat{\underline{\theta}})|\underline{x_*}} \{ Y-\hat{Y} \}^2 & =
\underbrace{E_{Y|\underline{x} _*} \{ Y - f(\theta; \underline{x}_*) \}^2}_\text{Variability of Y around its conditional expected value} + \\ \\
& \underbrace{[f(\theta;\underline{x}_*) - E_{\hat{\theta}|\underline{x}_*} \{ f(\underline{\hat{\theta}}; \underline{x}_*) \}]^2}_\text{Difference between the expected value and its estimate, Squared Bias} + \\ \\
& \underbrace{E_{\underline{\hat{\theta}}|\underline{x}_*}[f(\underline{\hat{\theta}};\underline{x}_*) - E_{\underline{\hat{\theta}}|\underline{x}_*} \{ f(\underline{\hat{\theta}}; \underline{x}_*) \}]^2}_\text{Change in the model due to the training data used, Variance}
\end{split}
$$
## Difference between explanatory and predictive modelling {-}
- **Explanatory modelling**: Tries to minimize the **squared bias**, as we are interested in obtaining the most _accurate representation of the investigated phenomenon_ and the related theory.
- **Predictive modelling**: Tries to minimize the sum of the **squared bias** and the **estimation variance**, because we are interested in minimization of the prediction error.
## Model audit {-}
This book presents techniques that allow:
- Decomposing a model’s predictions into components that can be attributed to particular explanatory variables
- Conducting a sensitivity analysis for a model's predictions
- Summarizing the predictive performance of a model
- Assessing the importance of an explanatory variable
- Evaluating the effect of an explanatory variable on a model’s predictions
- A detailed examination of both overall and instance-specific model performance
## Meeting Videos {-}
### Cohort 1 {-}
`r knitr::include_url("https://www.youtube.com/embed/URL")`
<details>
<summary> Meeting chat log </summary>
```
LOG
```
</details>