---
title: "Estimating Robustness"
format:
  html:
    code-fold: true
    code-summary: 'Show The Code'
---
A common question is: at what sample size (used to train a model) can I detect an effect? And is the effect I detect a robust signal? MCCV can estimate the robustness of an effect or signal by learning from the data at different sample sizes. For example, a large effect or high performance, say 0.8 AUROC, may be reached when all available data are used for training. But can I still reach 0.8 AUROC with a smaller sample? And is the detected signal robust to the size of the training sample, or does a particular cut of the data drive what the model learns? This article shows how learning from varying sample sizes may or may not reveal a robust signal, that is, detection of an effect representative of the data-generating process.

This first example defines two classes of data (class 0 and class 1) with a predictor drawn from very similar distributions. I expect a robust signal (measured here by AUROC) to be detected as the proportion of samples used for training increases:
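To make the mechanics concrete, here is a minimal sketch of a Monte Carlo cross-validation sweep at a fixed held-out fraction, written with scikit-learn rather than the `mccv` package used below; the model (logistic regression) and split settings are illustrative assumptions, not a description of `mccv`'s internals.

```{python}
#| eval: false
# Minimal sketch of MCCV at one training fraction (illustrative only;
# uses scikit-learn, not the mccv package used later in this article).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def mccv_auroc(X, y, test_size=0.5, n_rounds=200, seed=0):
    """Return AUROC from n_rounds random stratified train/test splits."""
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_rounds):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y,
            random_state=rng.randint(0, 2**31 - 1)
        )
        model = LogisticRegression().fit(X_tr, y_tr)
        scores.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    return np.array(scores)
```

Repeating such a sweep over several `test_size` values is what the examples below do with the `mccv` package.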
```{python}
#| warning: false
# Two classes whose predictors are drawn from very similar Beta distributions
import numpy as np
import pandas as pd

N = 100
np.random.seed(0)
Z1 = np.random.beta(2, 3, size=N)
np.random.seed(0)
Z2 = np.random.beta(2, 2.5, size=N)
Z = np.concatenate([Z1, Z2])
Y = np.concatenate([np.repeat(0, N), np.repeat(1, N)])
df = pd.DataFrame(data={'Y': Y, 'Z': Z})
df.index.name = 'pt'
```
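For reference, how close these two class distributions are can be read straight from the Beta parameters (the mean of Beta(a, b) is a / (a + b)); this quick check is only illustrative and separate from the MCCV analysis.

```{python}
#| eval: false
# How similar are the two class distributions?
# Mean of Beta(a, b) is a / (a + b), so the class means differ by only ~0.04.
from scipy import stats

mean0 = stats.beta(2, 3).mean()    # 0.400 for class 0
mean1 = stats.beta(2, 2.5).mean()  # ~0.444 for class 1
print(mean0, mean1, mean1 - mean0)
```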
```{r}
#| warning: false
library(tidyverse)

# Bring the simulated data over from Python and plot the two classes
df <- tibble::tibble(
  Y = reticulate::py$Y,
  Z = reticulate::py$Z
)
df[['Y']] <- factor(df$Y, levels = c(0, 1))
df %>%
  ggplot(aes(Y, Z)) +
  geom_boxplot(outlier.shape = NA, alpha = 0, linewidth = 2) +
  geom_point(position = position_jitter(width = .2), pch = 21, fill = 'gray', size = 3) +
  labs(x = "Response", y = "Predictor") +
  theme_bw(base_size = 16)
```
```{python}
import mccv

# Sweep the held-out fraction from 10% to 80%,
# i.e. training fractions from 90% down to 20%
perf_dfs = []
for ts in [.1, .2, .3, .4, .5, .6, .7, .8]:
    mccv_obj = mccv.mccv(num_bootstraps=200, n_jobs=4)
    mccv_obj.test_size = ts
    mccv_obj.set_X(df[['Z']])
    mccv_obj.set_Y(df[['Y']])
    mccv_obj.run_mccv()
    perf_df = mccv_obj.mccv_data['Performance']
    perf_df.insert(len(perf_df.columns), 'training_size', 1 - ts)
    perf_df.insert(len(perf_df.columns), 'test_size', ts)
    perf_dfs.append(perf_df)
```
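Before handing the results to R for plotting, a quick summary in Python can show the trend numerically. This sketch assumes only the `value` column (holding AUROC, as used in the R plot below) and the `training_size` column added above.

```{python}
#| eval: false
# Quick numeric summary of the sweep: central tendency and spread of AUROC
# per training fraction. Assumes the concatenated Performance tables carry
# AUROC in the 'value' column, as plotted in the R chunk below.
import pandas as pd

summary = (
    pd.concat(perf_dfs)
      .groupby('training_size')['value']
      .agg(['median', 'mean', 'std'])
      .sort_index()
)
print(summary)
```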
```{r}
# Plot the AUROC distribution at each training fraction
reticulate::py$perf_dfs %>%
  bind_rows() %>%
  ggplot(aes(factor(training_size), value)) +
  geom_boxplot(outlier.shape = NA, alpha = 0, linewidth = 2) +
  geom_point(position = position_jitter(width = .2), pch = 21, fill = 'gray', size = 3) +
  scale_x_discrete(
    labels = function(x) paste0(as.double(x) * 100, "%")
  ) +
  labs(
    x = "Sample Size for MCCV Training", y = "AUROC",
    caption = paste0(
      "As we increase our sample size for learning,\n",
      "performance increases as expected,\n",
      "but so does AUROC variability"
    )
  ) +
  theme_bw(base_size = 16)
```
The second example instead defines two classes of data drawn from two clearly different distributions. Here I would expect a non-robust signal to be detected as the sample size used for training increases.
```{python}
#| warning: false
# Two classes whose predictors are drawn from clearly different Beta distributions
import numpy as np
import pandas as pd

N = 100
np.random.seed(0)
Z1 = np.random.beta(2, 2.5, size=N)
np.random.seed(0)
Z2 = np.random.beta(6, 5, size=N)
Z = np.concatenate([Z1, Z2])
Y = np.concatenate([np.repeat(0, N), np.repeat(1, N)])
df = pd.DataFrame(data={'Y': Y, 'Z': Z})
df.index.name = 'pt'
```
```{r}
#| warning: false
library(tidyverse)

# Bring the simulated data over from Python and plot the two classes
df <- tibble::tibble(
  Y = reticulate::py$Y,
  Z = reticulate::py$Z
)
df[['Y']] <- factor(df$Y, levels = c(0, 1))
df %>%
  ggplot(aes(Y, Z)) +
  geom_boxplot(outlier.shape = NA, alpha = 0, linewidth = 2) +
  geom_point(position = position_jitter(width = .2), pch = 21, fill = 'gray', size = 3) +
  labs(x = "Response", y = "Predictor") +
  theme_bw(base_size = 16)
```
```{python}
import mccv

# Same sweep as before: held-out fraction from 10% to 80%,
# i.e. training fractions from 90% down to 20%
perf_dfs = []
for ts in [.1, .2, .3, .4, .5, .6, .7, .8]:
    mccv_obj = mccv.mccv(num_bootstraps=200, n_jobs=4)
    mccv_obj.test_size = ts
    mccv_obj.set_X(df[['Z']])
    mccv_obj.set_Y(df[['Y']])
    mccv_obj.run_mccv()
    perf_df = mccv_obj.mccv_data['Performance']
    perf_df.insert(len(perf_df.columns), 'training_size', 1 - ts)
    perf_df.insert(len(perf_df.columns), 'test_size', ts)
    perf_dfs.append(perf_df)
```
```{r}
# Plot the AUROC distribution at each training fraction
reticulate::py$perf_dfs %>%
  bind_rows() %>%
  ggplot(aes(factor(training_size), value)) +
  geom_boxplot(outlier.shape = NA, alpha = 0, linewidth = 2) +
  geom_point(position = position_jitter(width = .2), pch = 21, fill = 'gray', size = 3) +
  scale_x_discrete(
    labels = function(x) paste0(as.double(x) * 100, "%")
  ) +
  labs(
    x = "Sample Size for MCCV Training", y = "AUROC",
    caption = paste0(
      "As we increase our sample size for learning,\n",
      "performance increases as expected but also stagnates"
    )
  ) +
  theme_bw(base_size = 16)
```
In short, my thinking is that the data-generating process is captured in a sample only if a robust signal is found. A robust signal can be represented by an average, roughly linear increase in AUROC as the sample size used for training increases. Otherwise, the signal-to-noise ratio is lower than what would be needed to make generalizable predictions from the specified model and data. In this last example, as expected, the evidence is unclear as to whether the two classes of data are generated by the same process (one simple way to check the trend is sketched after the list below). I say this for two reasons:
1. Performance is stagnant between using 30% and 80% of the sample for training.
2. There is a stark difference between using 20% and 90% of the sample for training; I would expect more overlap rather than complete non-overlap.
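As a rough check of the "roughly linear increase" criterion, one could fit a simple line to median AUROC versus training fraction and inspect the slope; this is only a heuristic for illustration, not part of the `mccv` package, and it again assumes AUROC sits in the `value` column.

```{python}
#| eval: false
# Heuristic check of the "roughly linear increase" criterion:
# regress median AUROC on training fraction and inspect the slope.
# Illustrative assumption: AUROC is stored in the 'value' column.
import numpy as np
import pandas as pd

perf = pd.concat(perf_dfs)
medians = perf.groupby('training_size')['value'].median()

slope, intercept = np.polyfit(medians.index.values, medians.values, deg=1)
print(f"median AUROC change per +10% training data: {slope * 0.1:.3f}")
# A consistently positive slope across the sweep points toward a robust signal;
# a near-zero slope over much of the range (as in this second example) points to stagnation.
```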