---
title: "Estimating Robustness"
format:
  html:
    code-fold: true
    code-summary: 'Show The Code'
---
A common question is: at what sample size (used to train a model) can I detect an effect? And is the effect I detect a robust signal? MCCV can estimate the robustness of an effect or signal by learning from the data at different sample sizes. For example, a large effect or high performance, say 0.8 AUROC, may be reached when all available data are used for training. But can I still reach 0.8 AUROC with a smaller sample? And is the detected signal robust to the size of the training sample, or does a particular cut of the data drive what the model learns? This article shows how learning from varying sample sizes may or may not reveal a robust signal, that is, detection of an effect representative of the data-generating process.

This first example defines two classes of data (class 0 and class 1) with a predictor drawn from very similar distributions. I expect a robust signal (measured here by AUROC) to be detected as the proportion of samples used for training increases:
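To make the mechanics concrete, here is a minimal sketch of a Monte Carlo cross-validation sweep at a fixed held-out fraction, written with scikit-learn rather than the `mccv` package used below; the model (logistic regression) and split settings are illustrative assumptions, not a description of `mccv`'s internals.

```{python}
#| eval: false
# Minimal sketch of MCCV at one training fraction (illustrative only;
# uses scikit-learn, not the mccv package used later in this article).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def mccv_auroc(X, y, test_size=0.5, n_rounds=200, seed=0):
    """Return AUROC from n_rounds random stratified train/test splits."""
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_rounds):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y,
            random_state=rng.randint(0, 2**31 - 1)
        )
        model = LogisticRegression().fit(X_tr, y_tr)
        scores.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    return np.array(scores)
```

Repeating such a sweep over several `test_size` values is what the examples below do with the `mccv` package.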
```{python}
#| warning: false
# Two classes whose predictors are drawn from very similar Beta distributions
import numpy as np
import pandas as pd

N = 100
np.random.seed(0)
Z1 = np.random.beta(2, 3, size=N)
np.random.seed(0)
Z2 = np.random.beta(2, 2.5, size=N)
Z = np.concatenate([Z1, Z2])
Y = np.concatenate([np.repeat(0, N), np.repeat(1, N)])
df = pd.DataFrame(data={'Y': Y, 'Z': Z})
df.index.name = 'pt'
```
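For reference, how close these two class distributions are can be read straight from the Beta parameters (the mean of Beta(a, b) is a / (a + b)); this quick check is only illustrative and separate from the MCCV analysis.

```{python}
#| eval: false
# How similar are the two class distributions?
# Mean of Beta(a, b) is a / (a + b), so the class means differ by only ~0.04.
from scipy import stats

mean0 = stats.beta(2, 3).mean()    # 0.400 for class 0
mean1 = stats.beta(2, 2.5).mean()  # ~0.444 for class 1
print(mean0, mean1, mean1 - mean0)
```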
```{r}
#| warning: false
library(tidyverse)

# Bring the simulated data over from Python and plot the two classes
df <- tibble::tibble(
  Y = reticulate::py$Y,
  Z = reticulate::py$Z
)
df[['Y']] <- factor(df$Y, levels = c(0, 1))
df %>%
  ggplot(aes(Y, Z)) +
  geom_boxplot(outlier.shape = NA, alpha = 0, linewidth = 2) +
  geom_point(position = position_jitter(width = .2), pch = 21, fill = 'gray', size = 3) +
  labs(x = "Response", y = "Predictor") +
  theme_bw(base_size = 16)
```
```{python}
import mccv

# Sweep the held-out fraction from 10% to 80%,
# i.e. training fractions from 90% down to 20%
perf_dfs = []
for ts in [.1, .2, .3, .4, .5, .6, .7, .8]:
    mccv_obj = mccv.mccv(num_bootstraps=200, n_jobs=4)
    mccv_obj.test_size = ts
    mccv_obj.set_X(df[['Z']])
    mccv_obj.set_Y(df[['Y']])
    mccv_obj.run_mccv()
    perf_df = mccv_obj.mccv_data['Performance']
    perf_df.insert(len(perf_df.columns), 'training_size', 1 - ts)
    perf_df.insert(len(perf_df.columns), 'test_size', ts)
    perf_dfs.append(perf_df)
```
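Before handing the results to R for plotting, a quick summary in Python can show the trend numerically. This sketch assumes only the `value` column (holding AUROC, as used in the R plot below) and the `training_size` column added above.

```{python}
#| eval: false
# Quick numeric summary of the sweep: central tendency and spread of AUROC
# per training fraction. Assumes the concatenated Performance tables carry
# AUROC in the 'value' column, as plotted in the R chunk below.
import pandas as pd

summary = (
    pd.concat(perf_dfs)
      .groupby('training_size')['value']
      .agg(['median', 'mean', 'std'])
      .sort_index()
)
print(summary)
```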
```{r}
# Plot the AUROC distribution at each training fraction
reticulate::py$perf_dfs %>%
  bind_rows() %>%
  ggplot(aes(factor(training_size), value)) +
  geom_boxplot(outlier.shape = NA, alpha = 0, linewidth = 2) +
  geom_point(position = position_jitter(width = .2), pch = 21, fill = 'gray', size = 3) +
  scale_x_discrete(
    labels = function(x) paste0(as.double(x) * 100, "%")
  ) +
  labs(
    x = "Sample Size for MCCV Training", y = "AUROC",
    caption = paste0(
      "As we increase our sample size for learning,\n",
      "performance increases as expected,\n",
      "but so does AUROC variability"
    )
  ) +
  theme_bw(base_size = 16)
```
The second example instead defines two classes of data drawn from two clearly different distributions. Here I would expect a non-robust signal to be detected as the sample size used for training increases.
```{python}
#| warning: false
# Two classes whose predictors are drawn from clearly different Beta distributions
import numpy as np
import pandas as pd

N = 100
np.random.seed(0)
Z1 = np.random.beta(2, 2.5, size=N)
np.random.seed(0)
Z2 = np.random.beta(6, 5, size=N)
Z = np.concatenate([Z1, Z2])
Y = np.concatenate([np.repeat(0, N), np.repeat(1, N)])
df = pd.DataFrame(data={'Y': Y, 'Z': Z})
df.index.name = 'pt'
```
```{r}
#| warning: false
library(tidyverse)

# Bring the simulated data over from Python and plot the two classes
df <- tibble::tibble(
  Y = reticulate::py$Y,
  Z = reticulate::py$Z
)
df[['Y']] <- factor(df$Y, levels = c(0, 1))
df %>%
  ggplot(aes(Y, Z)) +
  geom_boxplot(outlier.shape = NA, alpha = 0, linewidth = 2) +
  geom_point(position = position_jitter(width = .2), pch = 21, fill = 'gray', size = 3) +
  labs(x = "Response", y = "Predictor") +
  theme_bw(base_size = 16)
```
```{python}
import mccv

# Same sweep as before: held-out fraction from 10% to 80%,
# i.e. training fractions from 90% down to 20%
perf_dfs = []
for ts in [.1, .2, .3, .4, .5, .6, .7, .8]:
    mccv_obj = mccv.mccv(num_bootstraps=200, n_jobs=4)
    mccv_obj.test_size = ts
    mccv_obj.set_X(df[['Z']])
    mccv_obj.set_Y(df[['Y']])
    mccv_obj.run_mccv()
    perf_df = mccv_obj.mccv_data['Performance']
    perf_df.insert(len(perf_df.columns), 'training_size', 1 - ts)
    perf_df.insert(len(perf_df.columns), 'test_size', ts)
    perf_dfs.append(perf_df)
```
```{r}
# Plot the AUROC distribution at each training fraction
reticulate::py$perf_dfs %>%
  bind_rows() %>%
  ggplot(aes(factor(training_size), value)) +
  geom_boxplot(outlier.shape = NA, alpha = 0, linewidth = 2) +
  geom_point(position = position_jitter(width = .2), pch = 21, fill = 'gray', size = 3) +
  scale_x_discrete(
    labels = function(x) paste0(as.double(x) * 100, "%")
  ) +
  labs(
    x = "Sample Size for MCCV Training", y = "AUROC",
    caption = paste0(
      "As we increase our sample size for learning,\n",
      "performance increases as expected but also stagnates"
    )
  ) +
  theme_bw(base_size = 16)
```
In short, my thinking is that the data-generating process is captured in a sample only if a robust signal is found. A robust signal can be represented by an average, roughly linear increase in AUROC as the sample size used for training increases. Otherwise, the signal-to-noise ratio is lower than what would be needed to make generalizable predictions from the specified model and data. In this last example, as expected, the evidence is unclear as to whether the two classes of data are generated by the same process (one simple way to check the trend is sketched after the list below). I say this for two reasons:
1. Performance is stagnant between using 30% and 80% of the sample for training.
2. There is a stark difference between using 20% and 90% of the sample for training; I would expect more overlap rather than complete non-overlap.
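As a rough check of the "roughly linear increase" criterion, one could fit a simple line to median AUROC versus training fraction and inspect the slope; this is only a heuristic for illustration, not part of the `mccv` package, and it again assumes AUROC sits in the `value` column.

```{python}
#| eval: false
# Heuristic check of the "roughly linear increase" criterion:
# regress median AUROC on training fraction and inspect the slope.
# Illustrative assumption: AUROC is stored in the 'value' column.
import numpy as np
import pandas as pd

perf = pd.concat(perf_dfs)
medians = perf.groupby('training_size')['value'].median()

slope, intercept = np.polyfit(medians.index.values, medians.values, deg=1)
print(f"median AUROC change per +10% training data: {slope * 0.1:.3f}")
# A consistently positive slope across the sweep points toward a robust signal;
# a near-zero slope over much of the range (as in this second example) points to stagnation.
```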