-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy pathdata.qmd
1247 lines (874 loc) · 82 KB
/
data.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Dealing with Data {#sec-data}
![](img/chapter_gp_plots/gp_plot_9.svg){width=75% fig-align='center'}
```{r}
#| label: setup-data
#| include: false
df_reviews = read_csv("https://tinyurl.com/moviereviewsdata")
```
It's an inescapable fact that models need data to work. One of the dirty secrets in data science is that getting the data right will do more for your model than any fancy algorithm or hyperparameter tuning. Data is messy, and there are a lot of ways it can be messy. In addition, dealing with a target variable on its terms can lead to more interesting results. In this chapter we'll discuss some of the most common data issues and things to consider. There's a lot to know about data before you ever get into modeling it, so we'll give you some things to think about in this chapter.
## Key Ideas {#sec-data-key-ideas}
- Data transformations can provide many modeling benefits.
- Label and text-based data still needs a numeric representation, and this can be accomplished in a variety of ways.
- The data type for the target may suggest a particular model, but does not necessitate one.
- The data *structure*, for example, temporal, spatial, censored, etc., may suggest a particular modeling domain to use.
- Missing data can be handled in a variety of ways, and the simpler approaches are typically not great.
- Class imbalance is a very common issue in classification problems, and there are a number of ways to deal with it.
- Latent variables are everywhere!
### Why this matters {#sec-data-why-matters}
Knowing your data is one of the most important aspects of any application of *data* science. It's not just about knowing what you have, but also what you can do with it. The data you have will influence the models you can potentially use, the features you can create and manipulate, and have a say on the results you can expect.
### Helpful context {#sec-data-good-to-know}
We're talking very generally about data here, so not much background is needed. The models mentioned here are covered in other chapters, or build upon those, but we're not doing any actual modeling here.
## Feature & Target Transformations {#sec-data-transformations}
Transforming variables from one form to another provides several benefits in modeling, whether applied to the target, features, or both. *Transformation should be used in most model situations*. Just some of these benefits include:
- More comparable feature effects and related parameters
- Faster estimation
- Easier convergence
- Helping with heteroscedasticity
For example, just **centering** features, i.e., subtracting their respective means, provides a more interpretable intercept that will fall within the actual range of the target variable in a standard linear regression. After centering, the intercept tells us what the value of the target variable is when the features are at their means (or reference value if categorical). Centering also puts the intercept within the expected range of the target, which often makes for easier parameter estimation. So even if easier interpretation isn't a major concern, variable transformations can help with convergence and speed up estimation, so can always be of benefit.
### Numeric variables {#sec-data-numeric}
The following table shows the interpretation of some very common transformations applied to numeric variables - logging, and standardizing to mean zero, standard deviation of one[^scaletoany]. Note that for logging, these are approximate interpretations.
[^scaletoany]: Scaling to a mean of zero and standard deviation of one is not the only way to scale variables. You can technically scale to any mean and standard deviation you want, but in tabular data settings you will have some different and possibly less interpretability, and may lose something in model estimation performance (convergence). For deep learning, the actual normalization may be adaptive applied across iterations.
```{r}
#| echo: false
#| eval: true
#| label: create-tbl-transformation
#| tbl-cap: Common Numeric Transformations
#|
# gt doesn't seem to have an actual working answer to latex within cells, though kabel_extra and modelsummary appear to handle it as expected, at least if you just run it. Rendering to pdfseems to fail. tried using \U03B2 for beta, but that didn't work either.
# tibble(
# Target = c("y", "y", "log(y)", "log(y)", "y", "scale(y)", "scale(y)"),
# Feature = c("x", "log(x)", "x", "log(x)", "scale(x)", "x", "scale(x)"),
# Interpretation = c(
# "$\\Delta y = \\beta\\Delta x$",
# "$\\Delta y \\approx (\\beta/100)\\%\\Delta x$",
# "$\\%\\Delta y \\approx 100\\cdot \\beta\\%\\Delta x$",
# "$\\%\\Delta y = \\beta\\%\\Delta x$",
# "$\\Delta y = \\beta\\sigma\\Delta x$",
# "$\\sigma\\Delta y = \\beta\\Delta x$",
# "$\\sigma\\Delta y = \\beta\\sigma\\Delta x$"
# ),
# ) |>
# modelsummary::datasummary_df()
# christ what year is this?
tab = tibble(
Target = c("y", "log(y)", "log(y)", "y", "scale(y)"),
Feature = c("x", "x", "log(x)", "scale(x)", "scale(x)"),
`Change in X` = c("1 unit", "1 unit", "1% change", "1 standard deviation", "1 standard deviation"),
`Change in Y` = c("B unit", "100 * (exp(B) -1)%", "B%", "B unit", "B standard deviation"),
Benefits = c("Interpretation", "Heteroscedasticity in y", "Interpretation, help with feature extremes", "Interpretation, estimation", "Interpretation, estimation"),
)
```
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| eval: true
#| label: tbl-transformation
#| tbl-cap: Common Numeric Transformations
tab |> gt()
```
:::
:::{.content-visible when-format='pdf'}
```{r}
#| echo: false
#| eval: true
#| label: tbl-transformation-pdf
#| tbl-cap: Common Numeric Transformations
# table will overrun even with scriptsize
tab |> select(-Benefits) |> gt()
```
:::
For example, it is very common to use **standardized** or **scaled** variables. Some also call this **normalizing**, as with **batch or layer normalization** in deep learning, but [this term can mean a lot of things](https://en.wikipedia.org/wiki/Normalization_(statistics)), so one should be clear in their communication. If $y$ and $x$ are both standardized, a one unit (i.e., one standard deviation) change in $x$ leads to a $\beta$ standard deviation change in $y$. So, if $\beta$ was .5, a standard deviation change in $x$ leads to a half standard deviation change in $y$. In general, there is nothing to lose by standardizing, so you should employ it often.
Another common transformation, particularly in machine learning, is **min-max scaling**. This involves changing variables to range from a chosen minimum value to a chosen maximum value, and usually this means zero and one respectively. This transformation can make numeric and categorical indicators more comparable, or at least put them on the same scale for estimation purposes, and so can help with convergence and speed up estimation. The following demonstrates how we can employ such approaches.
:::{.panel-tabset}
###### Python
When using [sklearn]{.pack}, it's a bit of a verbose process to do such a simple transformation. However it is beneficial when you want to do more complicated things, especially when using data pipelines.
```{python}
#| eval: false
#| label: py-numeric-transformation
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np
# Create a random sample of integers
data = np.random.randint(low=0, high=100, size=(5, 3))
# Apply StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
# Apply MinMaxScaler
minmax_scaler = MinMaxScaler()
minmax_scaled_data = minmax_scaler.fit_transform(data)
```
###### R
R makes it easy to do simple transformations like standardization and logs without external packages, but you can also use tools like [recipes]{.pack} and [mlr3]{.pack} pipeline operations when needed to make sure your preprocessing is applied appropriately.
```{r}
#| eval: false
#| label: r-numeric-transformation
# Create a sample dataset
data = matrix(sample(1:100, 15), nrow = 5)
# Standardization
scaled_data = scale(data)
# Min-Max Scaling
minmax_scaled_data = apply(data, 2, function(x) {
(x - min(x)) / (max(x) - min(x))
})
```
:::
Using a **log** transformation for numeric targets and features is straightforward, and [comes with several benefits](https://stats.stackexchange.com/questions/107610/what-is-the-reason-the-log-transformation-is-used-with-right-skewed-distribution). For example, it can help with **heteroscedasticity**, which is when the variance of the target is not constant across the range of the predictions[^notnormal]. It can also help to keep predictions positive after transformation, allows for interpretability gains, and more.
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-log-transformation-heteroscedasticity
#| fig-cap: Log Transformation and Heteroscedasticity
#| out-width: 100%
# Generate x values
set.seed(123)
N = 250
x = rpois(N, lambda = 10)
# Generate y values with heteroscedasticity
y = exp(.1*x + rnorm(length(x), mean = 0, sd = .5))
# Create a data frame
data = tibble(x, y)
# Fit a linear regression model without log transformation
model = lm(y ~ x, data = data)
# plot(model)
# Fit a linear regression model with log transformation for y
model_log = lm(log(y) ~ x, data = data)
# plot(model_log)
fitted_residuals = data.frame(
model = rep(c("Model", "Model w/ Log"), each = length(data$x)),
fitted_values = c(model$fitted, model_log$fitted),
residuals = c(resid(model), resid(model_log))
)
# Create a ggplot with facetted scatter plots
p = ggplot(fitted_residuals, aes(x = fitted_values, y = residuals)) +
geom_point() +
facet_wrap(~ model, nrow = 1, scales = 'free') +
labs(
x = "Fitted Values",
y = "Residuals",
caption = "\nVariance increases with predicted values for the model without log transformation,\nbut is consistent across prediction for the model with log transformation."
) +
theme(
plot.caption = element_text(hjust = 0, size = 10)
)
p
ggsave(
"img/data-log_transformation_heteroscedasticity.svg",
p,
width = 8,
height = 6
)
```
:::
:::{.content-visible when-format='pdf'}
![Log transformation and heteroscedasticity](img/data-log_transformation_heteroscedasticity.svg){width=100% #fig-log-transformation-heteroscedasticity}
:::
One issue with logging is that it is not a linear transformation. While this can help capture nonlinear feature-target relationships, it can also make some post-modeling transformations less straightforward. Also if you have a lot of zeros, 'log plus one' transformations are not going to be enough to help you overcome that hurdle[^logp1]. Logging also won't help much when the variables in question have few distinct values, like ordinal variables, which we'll discuss later in @sec-data-ordinal.
[^logp1]: That doesn't mean you won't see many people try (and fail).
[^notnormal]: For the bazillionth time, logging does not make data 'normal' so that you can meet your normality assumption in linear regression.
:::{.callout-note title='Categorizing continuous variables' collapse='true'}
It is rarely necessary or a good idea to transform a numeric feature or target to a categorical one. Doing so potentially throws away useful information by making the feature a less reliable measure of the underlying construct. For example, discretizing age to 'young' and 'old' does not help your model, and you can always get predictions for what you would consider 'young' and 'old' after the fact. It is extremely common to see, particularly in machine learning contexts when applied to target variables so that a classification approach can be used. But it's just not a statistically or practically sound thing to do, and can ultimately hinder interpretation.
:::
### Categorical variables {#sec-data-cat}
Despite their ubiquity in data, we can't analyze raw text information as it is. Character strings, and labeled features like factors, must be converted to a numeric representation before we can analyze them. For categorical features, we can use something called **effects coding** to test for specific types of group differences. Far and away the most common type is called **dummy coding** or **one-hot encoding**[^onehotnotdummy], which we visited previously @sec-lm-categorical-features. In these situations we create columns for each category, and the value of the column is 1 if the observation is in that category, and 0 otherwise. Here is a one-hot encoded version of the `season` feature that was demonstrated previously.
[^onehotnotdummy]: Note that one-hot encoding can refer to just the 1/0 coding for all categories, or to the specific case of dummy coding where one category is dropped. Make sure the context is clear.
```{r}
#| echo: false
#| label: tbl-one-hot
#| tbl-cap: One-hot Encoding
model.matrix(rating ~ -1 + season, df_reviews) |>
as_tibble() |>
bind_cols(df_reviews |> select(season)) |>
head(9) |>
gt() |>
fmt_number(columns = everything(), decimals = 0)
```
:::{.callout-note title='Dummy Coding Explained' collapse='true'}
For statistical models, when doing one-hot encoding all relevant information is incorporated in k-1 groups, where k is the number of groups, so one category will be dropped from the model matrix. This dropped category is called the **reference**. In a standard linear model, the intercept represents the mean of the target for the reference category. The coefficients for the other categories are the difference between the mean for the reference category and the group mean of the category being considered.
As an example, in the case of the `season` feature, if the dropped category is `winter`, the intercept tells us the mean rating for `winter`, and the coefficients for the other categories are the difference between the value for `winter` and the mean of the target for `fall`, `summer` and `spring`.
In other models, we include all categories in the model. The model learns how to use them best and might only consider some or one of them at a time.
:::
When we encode categories for statistical analysis, we can summarize their impact on the target variable in a single result for all categories. For a model with only categorical features, we can use an **ANOVA** (@sec-lm-categorical-anova) for this. But a similar approach can also be used for mixed models, splines, and other models to summarize categorical, spline, and other effects. Techniques like SHAP also provide a way to summarize the total effect of a categorical feature (@sec-knowing-shap-values).
#### Text embeddings {#sec-data-embeddings}
When it comes to other string representations like sentences and paragraphs, we can use other methods to represent them numerically. One important way to encode text is through an **embedding**. This is a way of representing the text as a vector of numbers, at which point the numeric embedding feature is used in the model like any other. The way to do this usually involves a model or a specific part of the model's architecture, one that learns the best way to represent the text or categories numerically. This is commonly used in deep learning, and natural language processing in particular. However, embeddings can also be used as a preprocessing step in *any* modeling situation.
To understand how embeddings work, consider a one-hot encoded matrix for a categorical variable. This matrix then connects to a hidden layer of a neural network. The weights learned for that layer are the embeddings for the categorical variable. While this isn't the exact method used (there are more efficient methods that don't require the actual matrix), the concept is the same. In addition, we normally don't even use whole words. Instead, we break the text into smaller units called **tokens**, like characters or subwords, and then use embeddings for those units. [Tokenization](https://huggingface.co/learn/nlp-course/en/chapter6/5) is used in many of the most successful models for natural language processing, including those such as [ChatGPT](https://www.youtube.com/watch?v=zduSFxRajkE).
```{r}
#| echo: false
#| eval: false
#| label: half-nn-graph-setup
g = DiagrammeR::grViz('img/graphical-embedding.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg("img/graphical-embedding.svg")
```
![Conceptual example of an embedding](img/graphical-embedding.svg){width=25% #fig-embedding}
#### Multiclass targets {#sec-data-multiclass}
We've talked about and demonstrated models with binary targets, but what about when there are more than two classes? In statistical settings, we can use a **multinomial regression**, which is a generalization of (binomial) logistic regression to more than two classes via the multinomial distribution. Depending on the tool, you may have to use a *multivariate* target of the counts, though most commonly they would be zeros and ones for a classification model, which then is just a one-hot encoded target. The following table demonstrates how this might look.
```{r}
#| echo: false
#| label: tbl-multinomial-data
#| tbl-cap: Multinomial Data Example
tibble(
x1 = rnorm(5),
x2 = rpois(5, 5),
target = c("A", "C", "C", "B", "A"),
`Class A` = c(1, 0, 0, 0, 1),
`Class B` = c(0, 0, 0, 1, 0),
`Class C` = c(0, 1, 1, 0, 0)
) |>
gt() |>
fmt_number(columns = -x1, decimals = 0) |>
cols_align("center")
```
With Bayesian tools, it's common to use the **[categorical distribution](https://en.wikipedia.org/wiki/Categorical_distribution)**, which is a different generalization of the Bernoulli distribution to more than two classes. Unlike the bi/multinomial distribution, it is not a count distribution, but an actual distribution over discrete values.
In the machine learning context, we can use a variety of models we'd use for binary classification. How the model is actually implemented will depend on the tool, but one of the more popular methods is to use **one-vs-all** or **one-vs-one** strategies, where you treat each class as the target in a binary classification problem. In the first case of one vs. all, you would have a model for each class that predicts whether an observation is in that class versus the other classes. In the second case, you would have a model for each pair of classes. You should generally be careful with either approach if interpretation is important, as it can make the feature effects very difficult to understand. As an example, we can't expect feature X to have the same effect on the target in a model for class A vs B, as it does in a model for class A vs. (B & C) or A & C. As such, it can be misleading when the models are conducted as if the categories are independent.
Regardless of the context, interpretation is now spread across multiple target outputs, and so it can be difficult to understand the overall effect of a feature on the target. Even in the statistical model setting (e.g., a multinomial regression), you now have coefficients that regard *relative* effects for one class versus a reference group, and so they cannot tell you a *general* effect of a feature on the target. This is where tools like marginal effects and SHAP can be useful (@sec-knowing-feature).
#### Multilabel targets {#sec-data-multilabel}
Multilabel targets are a bit more complicated, and are not as common as multiclass targets. In this case, each observation can have multiple labels. For example, if we wanted to predict genre based on the movie review data, we could choose to allow a movie to be both a comedy and action film, a sci-fi horror, or a romantic comedy. In this setting, labels are not mutually exclusive. If there are not too many unique label settings, we can treat the target as we would other multiclass targets. But if there are many, we might need to use a different model to go about things more efficiently.
:::{.callout-note title='Categorical Objective Functions' collapse='true'}
In many situations where you have a categorical target you will use a form of **cross-entropy loss** for the objective function. You may see other names such as *log loss* or *logistic loss* or *negative log likelihood* depending on the context, but usually it's just a [different name for the same underlying objective](https://en.wikipedia.org/wiki/Cross-entropy#Cross-entropy_loss_function_and_logistic_regression).
:::
### Ordinal variables {#sec-data-ordinal}
So far in our discussion of categorical data, the categories are assumed to have no order. But it's quite common to have labels like "low", "medium", and "high", or "very bad", "bad", "neutral", "good", "very good", or simply are a few numbers, like ratings from 1 to 5. **Ordinal data** is categorical data that has a known ordering, but which still has arbitrary labels. Let us repeat that, *ordinal data is categorical data*.
#### Ordinal features {#sec-data-ordinal-features}
The simplest way to treat ordinal features is as if they were numeric. If you do this, then you're just pretending that it's not categorical. In practice this is usually fine for features. Most of the transformations we mentioned previously aren't going to be as useful, but you can still use them if you want. For example, logging ratings 1-5 isn't going to do anything for you model-wise, but it technically doesn't hurt anything. You should know that typical statistics like means and standard deviations don't really make sense for ordinal data, so the main reason for treating them as numeric is for modeling convenience.
If you choose to treat an ordinal feature as categorical, you can ignore the ordering and do the same as you would with categorical data. This would allow for some nonlinearity since the category means will be whatever they need to be. There are some specific techniques to coding ordinal data for use in linear models, but they are not commonly used, and they generally aren't going to help the model performance or interpreting the feature, so we do not recommend them. You could however use old-school **effects coding** that you would incorporate traditional ANOVA models, but again, you'd need a good reason to do so[^helm].
[^helm]: Try [Helmert coding](https://en.wikipedia.org/wiki/Helmert_transformation) for instance! No, don't do that.
The take home message for ordinal features is generally simple. Treat them as you would numeric features or non-ordered categorical features. Either is fine.
#### Ordinal targets {#sec-data-ordinal-targets}
Ordinal targets can be trickier to deal with. If you treat them as numeric, you're assuming that the difference between 1 and 2 is the same as the difference between 2 and 3, and so on. This is probably not true. If you treat them as categorical and use standard models for that setting, you're assuming that there is no connection between categories. So what should you do?
There are a number of ways to model ordinal targets, but probably the most common is the **proportional odds model**. This model can be seen as a generalization of the logistic regression model, and is very similar to it, and actually identical if you only had two categories. It basically is a model of category 2 vs. category 1, 3 vs. (2 or 1), 4 vs. (3, 2, 1), etc. But other models beyond proportional odds are also possible. As an example, one approach would concern subsequent categories, the 1-2 category change, the 2-3 category change, and so on.
As an example, here are predictions from an ordinal model. In this case, we categorize[^dontcat] rounded movie ratings as 2 or less, 3, or 4 or more, and predict the probability of each category based on the release year of the movie. So we get three sets of predicted probabilities, one for each category. In this example, we see that the probability of a movie being rated 4 or more has increased over time, while the probability of a movie being rated 2 or less has decreased. The probability of a movie being rated 3 has remained relatively constant, and is most likely.
[^dontcat]: This is just for demonstration purposes. You should not categorize a continuous variable unless you have a very good reason to do so.
```{r}
#| echo: false
#| eval: false
#| label: fig-ordinal-target
df_ordinal = df_reviews |>
mutate(
rating = case_when(
round(rating) <= 2 ~ '<= 2',
round(rating) >= 4 ~ '>= 4',
.default = '3'
),
rating = ordered(rating, levels = c('<= 2', '3', '>= 4'))
)
model_ordinal = MASS::polr(rating ~ release_year, data = df_ordinal)
# Plot the model
p_dat = ggeffects::ggpredict(model_ordinal, terms = 'release_year[all]') |>
as_tibble() |>
mutate(response.level = factor(response.level, levels = c('<= 2', '3', '>= 4')))
p_dat |>
ggplot(aes(x = x, y = predicted, group = response.level)) +
geom_line(aes(linetype = response.level), linewidth = 1) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .15) +
scale_linetype_manual(values = c('dashed', 'solid', 'dotted')) +
annotate('text', x = median(p_dat$x), y = .5, label = '3', hjust = 0, size = 4) +
annotate('text', x = 2018, y = .35, label = '>= 4', hjust = 0, size = 4) +
annotate('text', x = 1988, y = .35, label = '<= 2', hjust = 0, size = 4) +
# facet_wrap(~ response.level) +
scale_y_continuous(labels = scales::percent) +
labs(
title = 'Proportional Odds Model for Rating by Release Year',
caption = 'Predicted Probability (as percentage) of a rounded rating score: 2 or less, 3 or 4 or more.',
x = 'Release Year',
y = ''
) +
theme(
legend.key.size = unit(1.5, 'cm'),
plot.caption = element_text(hjust = 0, size = 12)
)
ggsave('img/data-ordinal_target.svg', width = 8, height = 6)
```
![Proportional Odds Model for Rating by Release Year](img/data-ordinal_target.svg){width=100% #fig-ordinal-target}
Ordinality of a categorical outcome is largely ignored in machine learning applications. The outcome is either treated as numeric or multiclass classification. This is not necessarily a bad thing, especially if prediction is the primary goal. But if you need a categorical prediction, treating the target as numeric means you have to make an arbitrary choice to classify the predictions. And if you treat it as multiclass, you're ignoring the ordinality of the target, which may not work as well in terms of performance.
#### Rank data {#sec-data-rank}
Though ranks are ordered, with rank data we are referring to cases where the observations are uniquely ordered. An ordinal vector of 1-6 with numeric labels could be something like [2, 1, 1, 3, 4, 2], where no specific value is required and any value could be repeated.In contrast, rank data would be [2, 1, 3, 4, 5, 6], each being unique (unless you allow for ties). For example, in sports, a ranking problem would regard predicting the actual finish of the runners. Assuming you have a modeling tool that actually handles this situation, the objective will be different from other scenarios. Statistical modeling methods include using the Plackett-Luce distribution (or the simpler variant Bradley-Terry model). In machine learning, you might use so-called [learning to rank methods](https://en.wikipedia.org/wiki/Learning_to_rank), like the [RankNet and LambdaRank algorithms](https://icml.cc/Conferences/2015/wp-content/uploads/2015/06/icml_ranking.pdf), and other variants for [deep learning models](https://github.com/tensorflow/ranking).
## Missing Data {#sec-data-missing}
```{r}
#| echo: false
#| label: tbl-missing-data
#| tbl-cap: A Table with Missing Values
set.seed(123)
df = data.frame(
x1 = rpois(5, 5),
x2 = rpois(5, 3),
x3 = sample(c("A", "B", "C"), 5, replace = TRUE)
) |>
mutate(across(everything(), as.character)) |>
as.matrix()
# Randomly introduce missing values
df[sample(1:15, 4)] = NA
# Print the resulting data frame
df |>
as_tibble() |>
mutate(across(everything(), \(x) replace_na(x, '?'))) |>
gt()
```
Missing data is a common challenge in data science, and there are a number of ways to deal with it, usually by substituting, or **imputing**, the substituted value for the missing one. Here we'll provide an overview of common techniques to deal with missing data.
### Complete case analysis {#sec-data-missing-complete}
The first way to deal with missing data is the simplest - **complete case analysis**. Here we only use observations that have no missing data and drop the rest. Unfortunately, this can lead to a lot of lost data, and can lead to biased statistical results if the data is not **missing completely at random** (missingness is not related to the observed or unobserved data). There are special cases of some models that by their nature can ignore the missingness under an assumption of **missing at random** (missingness is not related to the unobserved data), but even those models would likely benefit from some sort of imputation. If you don't have much missing data though, dropping the missing data is fine for practical purposes[^dropmiss]. How much is too much? Unfortunately that depends on the context, but if you have more than 10% missing, you should probably be looking at alternatives.
[^dropmiss]: While many statisticians will possibly huff and puff at the idea of dropping data, there are two things to consider. With minimal missingness you'll likely never come to a different conclusion unless you have very little data to come to a conclusion about, which is already the bigger problem. Secondly, it's impossible to prove one way or another if the data is missing at random, because doing so would require knowing the missing values.
### Single value imputation {#sec-data-missing-single}
**Single value imputation** involves replacing missing values with a single value, such as the mean, median, mode or some other typical value of the feature. As common an approach as this is, it will rarely help your model for a variety of reasons. Consider a numeric feature that is 50% missing, and for which you replace the missing with the mean. How good do you think that feature will be when at least half the values are identical? Whatever variance it normally would have and share with the target is probably reduced, and possibly dramatically. Furthermore, you've also attenuated correlations it has with the other features, which may mute other modeling issues that you would otherwise deal with in some way (e.g., collinearity), or cause you to miss out on interactions.
Single value imputation makes perfect sense if you *know* that the missingness should be a specific value, like a count feature where missing means a count of zero. If you don't have much missing data, it's unlikely this would have any real benefit over complete case analysis. One exception is imputing the feature allows you to use all the other complete feature samples that would otherwise be dropped. But then, you could just drop this less informative feature while keeping the others, as it will likely not be very useful in the model.
### Model-based imputation {#sec-data-missing-model}
**Model-based imputation** is more complicated, but can be very effective. In essence, you run a model for complete cases in which the feature with missing values is now the target, and all the other features and primary target are used to predict it. You then use that model to predict the missing values, using the predictions as the imputed values. After these predictions are made, you move on to the next feature and do the same. There are no restrictions on which model you use for which feature. If the other features in the imputation model also have missing data, you can use something like mean imputation to get more complete data if necessary as a first step, and then when their turn comes, impute those values.
Although the implication is that you would have one model per feature and then be done, you can do this iteratively for several rounds, such that the initial imputed values are then used in subsequent model rounds to reimpute other features' missing values. You can do this as many times as you want, but the returns will diminish. In this setting, we are assuming you'll ultimately end with a single imputed value for each missing one, which reflects the last round of imputation.
### Multiple imputation {#sec-data-missing-mi}
**Multiple imputation** (MI) is a more complicated technique, but can be very useful in some situations, depending on what you're willing to sacrifice for having better uncertainty estimates. The idea is that you create multiple imputed datasets, each of which is based on the **predictive distribution** of the model used in model-based imputation (See @sec-knowing-model-vis). Say we use a linear regression assuming a normal distribution to impute feature A. We would then draw repeatedly from the predictive distribution of that model to create multiple datasets with (randomly) imputed values for feature A.
Let's say we do this 10 times, and we now have 10 imputed data sets, each with a now complete feature A, but each with somewhat different imputed values. We now run our desired model on each of these datasets. Final model results are averaged in some way to get final parameter estimates. Doing so acknowledges that your single imputation methods have uncertainty in those imputed values, and that uncertainty is incorporated into the final model estimates, including the uncertainty in those estimates.
MI can in theory handle any source of missingness and can be a very powerful technique. But it has some drawbacks that are often not mentioned, but which everyone that's used it has experienced. One is that you need a specified target distribution for all imputation models used, in order to generate random draws with appropriate uncertainty. Your final model presumably is also a probabilistic model with coefficients and variances you are trying to estimate and understand. MI probably isn't going to help boosting or deep learning models that have native methods for dealing with missing values, or at least, offers little over single value imputation for those approaches. In addition, if you have very large data and a complicated model, you could be spending a long time both waiting for the models and debugging them, because you still would need to assess the imputation models much like any other for the most part. Finally, few data or post-model processing tools that you commonly use will work with MI results, especially those regarding visualization. So you will have to hope that whatever package you use for MI will do what you need. As an example, you'd have to figure out how you're going to impute interaction or spline terms if you have them.
*Practically speaking*, MI takes a lot of effort to often come to the same conclusions you would have with a single imputation method, or possibly fewer conclusions for anything beyond GLM coefficients and their standard errors. But if you want the best uncertainty estimates for those models, MI can be the way to go.
### Bayesian imputation {#sec-data-missing-bayesian}
One final option is to run a Bayesian model where the missing values are treated as parameters to be estimated, and they would have priors just like other parameters as well. MI is basically a variant of Bayesian imputation that can be applied to the non-Bayesian model setting, so why not just use the actual Bayesian approach? Some modeling packages can allow you to try this very easily, and it can be very effective. But it is also very computationally intensive, and can be very slow as you may be increasing the number of parameters to estimate dramatically. At least it would be more fun than standard MI, so we recommend exploring it if you were going to do MI anyway.
## Class Imbalance {#sec-data-class-imbalance}
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-class-imbalance
#| fig-cap: Class Imbalance
p = df_reviews |>
mutate(y = ifelse(rating < 2, "Class A", "Class B")) |>
count(y) |>
mutate(prop = n / sum(n)) |>
ggplot(aes(x = y, y = prop)) +
geom_col(width = .05) +
geom_point(aes(y = prop), size = 10, alpha = 1, color = okabe_ito[2]) +
coord_flip() +
scale_y_continuous(breaks = seq(0, 1, .1), labels = scales::percent_format()) +
labs(x = "", y = "", title = "")
p
ggsave(
"img/data-class_imbalance.svg",
p,
width = 6,
height = 4
)
```
:::
:::{.content-visible when-format='pdf'}
![Class imbalance](img/data-class_imbalance.svg){width=100% #fig-class-imbalance}
:::
**Class imbalance** refers to the situation where the target variable has a large difference in the number of observations in each class. For example, if you have a binary target, and 90% of the observations are in one class, and 10% in the other, you would have class imbalance. You'll almost never see a 50/50 split in the real world, but the issue is that as we move further away from that point, we can start to see problems in model estimation, prediction, and interpretation. In this example, if we just predict the majority class in a binary classification problem, our accuracy would be 90%! Under other circumstances that might be a great result for accuracy, but in this case it's not. So right off the bat one of our favorite metrics to use for classification models isn't going to help us much.
For classification problems, *class imbalance is the rule, not the exception*. This is because nature just doesn't sort itself into nice and even bins. The majority of people in a random sample do not have cancer, the vast majority of people have not had a heart attack in the past year, most people do not default on their loans, and so on.
There are a number of ways to help deal with class imbalance, and the method that works best will depend on the situation. Some of the most common are:
- **Use different metrics**: Use metrics that are less affected by class imbalance, such as area under a receiver operating characteristic curve (AUC), or those that balance the tradeoff between precision and recall, like the F1 score, or balance recall and true negative rate, like the balanced accuracy score.
- **Oversampling/Undersampling**: Randomly sample from the minority (majority) class to increase (decrease) the number of observations in that class. For example, we can randomly sample with replacement from the minority class to increase the number of observations in that class. Or we can take a sample from the majority class to balance the resulting dataset.
- **Weighted objectives**: Weight the loss function to give more weight to the minority class. Although commonly employed, and simple to implement with tools like lightgbm and xgboost, it often fails to help, and can cause other issues.
- **Thresholding**: Change the threshold for classification to be more sensitive to the minority class. Nothing says you have to use 0.5 as the threshold for classification, and you can change it to be more sensitive to the minority class. This is a very simple approach, and may be all you need.
These are not necessarily mutually exclusive. For example, it's probably a good idea to switch your focus to a metric besides accuracy even as you employ other techniques to handle imbalance.
### Calibration issues in classification {#sec-data-calibration}
Probability **calibration** is often a concern in classification problems. It is a bit more complex of an issue than just having class imbalance, but is often discussed in the same setting. Having calibrated probabilities refers to the situation where the predicted probabilities of the target match up well to the actual proportion of observed classes. For example, if a model predicts an average 0.5 probability of loan default for a certain segment of the samples, the actual proportion of defaults should be around 0.5.
One way to assess calibration is to use a **calibration curve**, which is a plot of the predicted probabilities vs. the observed proportions. We bin our predicted probabilities, say, into 5 or 10 equal bins. We then calculate the average predicted probability and the average observed proportion of the target in each bin. If the model is well-calibrated, the points should fall along the 45-degree line. If the model is not well-calibrated, the points will fall above or below the line.
In @fig-calibration-plot, one model seems to align well with the observed proportions based on the chosen bins. The other model (dashed line) is not so well calibrated, and is overshooting with its predictions. For example, that model's average prediction for the third bin predicts a ~0.5 probability of the outcome, while the actual proportion is around 0.2.
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-calibration-plot
#| fig-cap: Calibration Plot
#| out-width: 100%
set.seed(42)
N = 2500
p_pred = sort(runif(N))
t_obs = rbinom(N, 1, p_pred)
p_dat = tibble(
t_obs = t_obs,
p_pred = p_pred,
p_pred_bad = p_pred^(.5),
p_bin = ntile(p_pred, 10)
) |>
group_by(p_bin) |>
summarize(
pred_prob = mean(p_pred),
pred_worse = mean(p_pred_bad),
true_prob = mean(t_obs)
) |>
ungroup() |>
mutate(
p_bin = as.factor(p_bin)
) |>
pivot_longer(cols = c(pred_prob, pred_worse), names_to = "pred_type", values_to = "pred_prob")
p = p_dat |>
ggplot(aes(x = pred_prob, y = true_prob, color = pred_type)) +
geom_abline(intercept = 0, slope = 1, color = 'gray', lwd= 1.5) +
geom_line(aes(linetype = pred_type), show.legend = FALSE) +
geom_point(
data = p_dat |> filter(p_bin==3, pred_type == "pred_worse"),
size = 8,
alpha = .5,
show.legend = FALSE
) +
geom_point(size = 4, alpha = 1) +
scale_x_continuous(breaks = seq(0, 1, .1)) +
scale_y_continuous(breaks = seq(0, 1, .1)) +
scale_color_manual(
values = c(okabe_ito[['darkblue']], okabe_ito[['orange']]),
labels = c("Predicted (Better)", "Predicted (Worse)")
) +
labs(
x = "Predicted",
y = "Observed",
caption = 'Points are the average of the binned probabilities/proportions.',
) +
theme(
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
)
p
ggsave("img/data-calibration_plot.svg")
```
:::
:::{.content-visible when-format='pdf'}
![Calibration plot](img/data-calibration_plot.svg){width=100% #fig-calibration-plot}
:::
While the issue is an important one, it's good to keep the issue of calibration and imbalance separate. As miscalibration implies bias, bias can happen irrespective of the class proportions, and can be due to a variety of factors related to the model, target, or features. Furthermore, miscalibration is not inherent to any particular model.
The assessment of calibration in this manner also has a few issues that we haven't seen reported in the documentation for the functionality. For one, the observed 'probabilities' are proportions based on arbitrarily chosen bins. The observed values also have some **measurement error**, and have a natural variability that will partly reflect sample size[^testcalibbinsize]. These plots are often presented such that observed proportions are labeled as the 'true' probabilities. However, you do not have the *true* probabilities, just the *observed* class labels, so whether your model's predicted probabilities match observed proportions is a bit of a different question. The predictions have uncertainty as well, and this will depend on the model, sample size, and other factors. And finally, the number of bins chosen can also affect the appearance of the plot in a notable way if the sample size is small.
[^testcalibbinsize]: Note that each bin will reflect the portion of the test set size in this situation. If you have a small test set, the observed proportions will be more variable, and the calibration plot will be more variable as well.
All this is to say that each point in a calibration plot, 'true' or predicted, has some uncertainty with it, and the difference in those values is not formally tested in any way by a calibration curve plot. Their uncertainty, if it was actually measured, could even overlap while still being statistically different! So, if we're interested in a more rigorous statistical assessment, the differences between models and the 'best case scenario' would need additional steps to suss out.
Some methods are available to calibrate probabilities if they are deemed miscalibrated, but they are not commonly implemented in practice, and often involve another model-based technique, with all of its own assumptions and limitations. It's also not exactly clear that forcing your probabilities to be on the line is helping solve the actual modeling goal in any way[^practicalprob]. But if you are interested, you can read more [here](https://scikit-learn.org/stable/modules/calibration.html).
[^practicalprob]: Oftentimes we are only interested in the ordering of the predictions, and not the actual probabilities. For example, if we are trying to identify the top 10% of people most likely to default on their loans, we'll just take the top 10% of predictions, and the actual probabilities are irrelevant for that goal.
```{r}
#| echo: false
#| eval: false
#| label: r-bootstrap-calibration-plot
#| fig-cap: Bootstrap Calibration Plot
# ultimately I think this gets too far afield
p_dat = read_csv("data/bootstrap_calibration.csv")
p_dat = p_dat |>
filter(type == "true") |>
rename(
true_prob = prob,
true_lower = lower,
true_upper = upper
) |>
bind_cols(
p_dat |>
filter(type == "pred") |>
select(-type) |>
rename(pred_prob = prob, pred_lower = lower, pred_upper = upper)
)
p_dat |>
ggplot(aes(x = pred_prob, y = true_prob)) +
geom_abline(intercept = 0, slope = 1, linetype = 'dashed', color = 'gray', lwd= 1) +
geom_segment(aes(x = pred_lower, y = true_prob, xend = pred_upper, yend = true_prob), color = "gray", size = 0.5) +
geom_segment(aes(x = pred_prob, y = true_lower, xend = pred_prob, yend = true_upper), color = "gray", size = 0.5) +
geom_line() +
geom_point(size = 3, alpha = 1) +
scale_x_continuous(breaks = seq(0, 1, .1)) +
scale_y_continuous(breaks = seq(0, 1, .1)) +
scale_color_manual(values = c(okabe_ito[2], okabe_ito[1])) +
labs(x = "Avg (Binned) Probability", y = "Observed (Binned)\nProportions", title = "")
```
## Censoring and Truncation {#sec-data-censoring}
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-censoring
#| fig-cap: Censoring for Time Until Death
# Create a data frame with age and censoring information
df = tibble(
person = LETTERS[1:5],
age = c(53, 72, 80, 65, 83),
censoring = c("End of Life", "Censored", "Censored", "End of Life", "Censored"))
# Plot the potential censoring
p = ggplot(df, aes(x = person, age, fill = censoring, color = censoring)) +
geom_col(width = .05, alpha = .5, show.legend=FALSE) +
geom_segment(
aes(x = person, y = age, xend = person, yend = 85),
linetype = "dashed",
color = "gray50",
data = df |> filter(censoring == "Censored")
) +
geom_point(size = 5, alpha = 1) +
lims(y = c(0, 90)) +
labs(x = "Person", y = "Age", title = "") +
scale_fill_manual(values = unname(okabe_ito)[c(1,5)], aesthetics = c('color', 'fill')) +
coord_flip()
p
ggsave(
"img/data-censoring.svg",
p,
width = 6,
height = 4
)
```
:::
:::{.content-visible when-format='pdf'}
![Censoring for Time Until Death](img/data-censoring.svg){width=100% #fig-censoring}
:::
Sometimes, we just don't see all the data there is to see. **Censoring** is one situation where the target variable is not fully observed. This is common in techniques where the target is the 'time to an event', like death from a disease, but the event has not yet occurred for some observations in the data (thankfully!). Specifically this is called **right censoring**, and is the most common type of censoring, and depicted in @fig-censoring, where several individuals are only observed to a certain age and were still alive at that time. There is also [**left censoring**](https://stats.stackexchange.com/questions/144037/right-censoring-and-left-censoring), where the censoring happens from the other direction, and data before a certain point is unknown. Finally, there is **interval censoring**, where the event of interest occurs within some interval, but the exact value is unknown.
[^eventhistory]: Survival analysis is also called **event history analysis**, and is widely used in biostatistics, sociology, demography, and other disciplines where the target is the time to an event, such as death, marriage, divorce, etc.
**Survival analysis**[^eventhistory] is a common modeling technique in this situation, especially in fields informed by biostatistics, but you may also be able to keep things even more simple via something like [tobit regression](https://m-clark.github.io/models-by-example/tobit.html). In the tobit model, you assume that the target is fully observed, but that the values are censored, and you model the probability of censoring. This is a common technique in econometrics, and allows you to keep a traditional linear model context.
**Truncation** is a situation where the target variable is only observed if it is above or below some value, even though we know other possibilities exist. One of the issues is that default distributional methods assume a distribution that is not bounded in the way that the data exhibits. In @fig-truncation, we restrict our data to 70 and below for practical or other reasons, but typical modeling methods predicting age would not respect that.
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-truncation
#| fig-cap: Truncation
library(ggplot2)
set.seed(1234)
# Generate a random sample from a normal distribution
data = rpois(100, 60)
# Censor some of the values
key_value = 70
# Create a data frame for plotting
df = tibble(
data = data,
censored = factor(if_else(data > key_value, "Unobserved", "Observed")),
)
# Plot the original data and the censored data
p = ggplot(df, aes(x = data)) +
geom_bar(
aes(
alpha = after_stat(x < key_value),
group = after_scale(fill)
),
# alpha = 0.5,
color = NA,
show.legend = FALSE,
) +
geom_vline(xintercept = key_value, color = "gray", size = 1) +
annotate(
geom = "text",
x = key_value + 1,
y = 3,
label = "Unobserved",
color = "gray",
size = 6,
hjust = 0,
) +
scale_fill_manual(values = c(okabe_ito)) +
labs(x = "Ages", y = "", title = "") +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
)
p
ggsave(
"img/data-truncation.svg",
p,
width = 6,
height = 4
)
```
:::
:::{.content-visible when-format='pdf'}
![Truncation](img/data-truncation.svg){width=100% #fig-truncation}
:::
You could truncate predictions after the fact, but this is a bit of a hack, and often results in lumpiness in the predictions at the boundary that is rarely realistic. Alternatively, Bayesian methods allow you to model the target as a distribution with truncated distributions, and so you can model the probability of the target being above or below some value. There are also models such as **hurdle models** that might prove useful where the truncation is theoretically motivated, for example, a zero-inflated Poisson model for count data where the zero counts are due to a separate process than the non-zero counts.
```{r}
#| echo: false
#| eval: false
library(truncnorm)
library(geomtextpath)
# Define the range for x
set.seed(42)
x = seq(-2, 4, length.out = 1000)
x_pos = x[x > 0]
# Normal distribution
normal_pdf = dnorm(x, mean = 1, sd = 1)
# Folded Normal distribution
folded_normal_pdf = dnorm(abs(x), mean = 1, sd = 1) + dnorm(-abs(x), mean = 1, sd = 1)
# Truncated Normal distribution
truncated_normal_pdf = dtruncnorm(x, a = 0, b = Inf, mean = 1, sd = 1)
# Create a data frame for plotting
data = tibble(
distribution = rep(c('Normal', 'Folded Normal', 'Truncated Normal'), each = length(x)),
x = rep(x, 3),
dens = c(normal_pdf, folded_normal_pdf, truncated_normal_pdf),
) |>
mutate(
dens = ifelse(distribution != 'Normal' & x < 0, NA, dens)
)
data |>
ggplot(aes(x = x, y = dens, color = distribution)) +
geom_line(linewidth = 1, show.legend = FALSE) +
geom_textpath(
data = data |> filter(distribution != 'Normal'),
aes(label = distribution),
vjust = -.1,
hjust = 0.33,
size = 5,
show.legend = FALSE
) +
geom_textpath(
data = data |> filter(distribution == 'Normal'),
aes(label = distribution),
vjust = -.1,
size = 5,
show.legend = FALSE
) +
geom_area(
aes(group = distribution, fill = distribution),
show.legend = FALSE,
position = 'identity',
alpha = .05
) +
scale_color_manual(values = unname(okabe_ito), aesthetics = c('color', 'fill')) +
labs(
title = 'PDF for Normal, Folded Normal, and Truncated Normal Distributions',
x = '',
y = '',
caption = 'Each distribution has a mean of 1 and standard deviation of 1'
) +
theme(
plot.caption = element_text(size = 12),
plot.title = element_text(size = 16, hjust = 0),
# legend.title = element_blank(),
# legend.key.size = unit(3, 'cm'),
# legend.key.height = unit(1, 'cm'),
)
ggsave('img/trunc_vs_folded.svg')
```
:::{.callout-note title="Censoring vs. Truncation" collapse='true'}
One way to distinguish censored and truncated data is that censored data is usually due to some external process such that the target is not observed, but could be possible (capping reported income at $1 million). Truncated data, on the other hand, is due to some internal process that prevents the target from being observed, and is often derived from sample selection (we only want to model non-millionaires). We would not want predictions past the censored point to be unlikely, but we would want predictions past the truncated point to be impossible. Trickier still is that for bounded or truncated distributions that might be applied to the truncated scenario, such as folded vs. truncated distributions, they would not result in the same probability distributions even if they can be applied to the same data situation.
![Folded and Truncated Distributions](img/trunc_vs_folded.svg){width=75%}
:::
## Time Series {#sec-data-time}
```{r}
#| echo: false
#| eval: false
#| label: fig-time-series
#| fig-cap: Time Series Data
set.seed(123)
# Generate date sequence
dates = seq(as.Date("2022-01-01"), as.Date("2022-12-31"), by = "day")
# Generate increasing trend
trend = seq(1, length(dates))
# Generate seasonality
seasonality = sin(2 * pi * (1:length(dates)) / 365 * 7) *10000
# Combine trend and seasonality
y = trend + trend^2 + seasonality
p = tibble(date = dates, value = y) |>
ggplot(aes(x = date, y = value)) +
geom_line(linewidth=3, color = okabe_ito[1]) +
# geom_smooth(se = TRUE, color = okabe_ito[1], method = 'gam', formula = y ~ s(x, bs = 'cs', k = 25)) +
geom_smooth(se = TRUE, method = 'gam') +
labs(x = "", y = "") +
scale_y_continuous(labels = NULL) +
scale_x_date(date_labels = "%b", date_breaks = "1 month") +
theme(
axis.ticks.y = element_blank(),
)
p
ggsave(
"img/data-time_series.svg",
p,
width = 8,
height = 6
)
```
![Time-series data](img/data-time_series.svg){#fig-time-series}
Time series data is any data that incorporates values over a period of time. This could be something like a state's population over years, or the max temperature of an area over days. Time series data is very common in data science, and there are a number of ways to model such data.
### Time-based targets {#sec-data-time-targets}
As in other settings, the most common approach when the target is some value that varies time is to use a linear model of some kind. While the target varies over time, the features may be time-varying or not, and it is very common to have both types of features. There are traditional autoregressive models that use the target's past values (**lags**) as features, for example, autoregressive moving average (**ARIMA**) models. In this case, the target is a function of its past values, and possibly the past values of other features. In these models, **forecasting**, or making predictions about future values, is the primary goal. Care will have to be taken to avoid data leakage in the training process, and to use proper standardization of the data if applicable. In some settings, common transformations such as logging can be used. In other cases, you may want to use a **differencing** approach to make the data stationary, which can be useful for some models. Differencing is the process of subtracting the previous value from the current value, and can be done multiple times to remove trends and seasonality. Often this is done automatically as part of a hyperparameter search, but it can be done manually as well. Boosting models using lagged features can also be used for time series data, and can be very effective in many situations.
**Longitudinal data**[^paneldata] is a special case of time series data, where the target is a function of time (e.g. date), but the data is typically grouped in some fashion, and often of a shorter sequence. An example would be a model for school performance for students over several semesters, where values are clustered within students over time. In this case, you can use some sort of time series regression, though many do not do well with shorter series. Instead, you can use a **mixed model** (@sec-lm-extend-mixed-models), where you model the target as a function of time, but also include a random effect for the grouping variable, in this case, students. This could, and typically does, include both random intercepts and slopes for time. This is a very common approach in many domains, and can be very effective in terms of performance as well. Mixed models can also be used for longer series, where the random effects are based on autoregressive covariance matrices. In this case, an ARIMA component is added to the linear model as a random effect to account for the time series nature of the data. This is fairly common in Bayesian contexts. Generalized additive models also work well in this setting.
Other models provide additional options for incorporating historical information, such as Bayesian methods for marketing data (@sec-unc-bayes) or reinforcement learning approaches (@sec-ml-more-reinforcement). Some models get more complex, like **recurrent neural networks** and their generalizations, and were specifically developed for sequential data. More recently, [transformer-based models](https://research.google/blog/a-decoder-only-foundation-model-for-time-series-forecasting/) have shown promise for time series analysis beyond text. In this area, 'foundational' models like TimeGPT, Moirai, Chronos, or TinyTimeMixers have been proposed, though they often require more data than is typically available in many common settings. In addition, they have not proven to be as effective as simpler models in many cases[^nixtlaexp].
[^nixtlaexp]: For example, see Nixtla's experiments with [Amazon's Chronos vs. a simple combination of univariate models](https://github.com/Nixtla/nixtla/tree/main/experiments/amazon-chronos), and similar. NB: Amazon said that the default settings were not very good, and it [would perform much better with different settings](https://github.com/Nixtla/nixtla/pull/281), but possibly still would not beat the simpler model.
So, many models can be found specific to time series targets, and the choice of model will depend on the data, the questions we want to ask, and the goals we have. Most traditional models are still linear models, but traditional machine learning models, like boosting, and many neural network models have been developed that can handle time series data as well.
:::{.callout-note title="Time Series vs. Longitudinal" collapse='true'}
The primary distinguishing feature for referring data as 'time-series' and 'longitudinal' is the number of time points, where the latter typically has relatively few. This arbitrary though.
:::
[^paneldata]: You may also hear the term **panel data** in econometrics-oriented disciplines.
### Time-based features {#sec-data-time-features}
When it comes to time-series features, we can apply time-specific transformations. One technique is the [**fourier transform**](https://betterexplained.com/articles/an-interactive-guide-to-the-fourier-transform/), which can be used to decompose a time series into its component frequencies, much like how we use PCA (@sec-ml-more-unsuper). This can be useful for identifying periodicity in the data, which can be used as a feature in a model.
In marketing contexts, some perform **adstocking** with features. This approach models the delayed effect of features over time, such that they may have their most important impact immediately, but still can impact the present target from past values. For example, a marketing campaign might have the most significant impact immediately after it's launched, but it can still influence the target variable at later time points. Adstocking helps capture this delayed effect without having to include multiple lagged features in the model. That said, including lagged features is also an option in this setting. In this case, you would have a feature for the current time point (*t*), the same feature for the previous time point (*t-1*), the feature for the time point before that (*t-2*), and so on.
#### Scaling time features {#sec-data-time-scale}
If you have the year as a feature, you can use it as a numeric feature or as a categorical feature. If you treat it as numeric, you need to consider what a zero means. In a linear model, the intercept usually represents the outcome when all features are zero. But with a feature like year, a zero year isn't meaningful in most contexts. To solve this, you can shift the values so that the earliest time point, like the first year in your data, becomes zero. This way, the intercept in your model will represent the outcome for this first time point, which is more meaningful. The same goes if you are using months or days as a numeric feature. It doesn't really matter which year/month/day is zero, just that zero refers to one of the actual time points observed. Shifting your time feature in this manner can also help with convergence for some types of models. In addition, you may want to convert the feature to represent decades, or quarters, or some other time period, to help with interpretation.
Dates and/or times can be a bit trickier. Often you can just split dates out into year, month, day, etc., and proceed with those as features. In other cases you'd want to track the time period to assess possible seasonal effects. You can use something like a **cyclic** approach (e.g., [cyclic spline or sine/cosine transformation](https://developer.nvidia.com/blog/three-approaches-to-encoding-time-information-as-features-for-ml-models/)) to get at yearly or within-day seasonal effects. As mentioned, a fourier transform can also be used to decompose the time series into its component frequencies for use as model features. Time components like hours, minutes, and seconds can often be dealt with in similar ways, but you will more often deal with the periodicity in the data. For example, if you are looking at hourly data, you may want to consider the 24-hour cycle.
:::{.callout-note title="Calendars are hard" collapse='true'}
*Weeks are not universal*. Some start on Sunday, others Monday. Some data contexts only consider weekdays. [Some systems](https://en.wikipedia.org/wiki/ISO_week_date) may have 52 or 53 weeks in a year, and dates may not be in the same week from one year to the next, etc. So use extra caution when considering weeks as a feature.
:::
#### Covariance structures {#sec-data-covariance-struct}
In many cases you'll have features that vary over time but are not a time-oriented feature like year or month. For example, you might have a feature that is the number of people who visited a website over days. This is a time-varying feature, but it's not a time metric in and of itself.
In general, we'd like to account for the time-dependent correlations in our data, and a common way to do so is to posit a covariance structure that accounts for this in some fashion. This helps us understand how data points are related to each other over time, and requires us to estimate the correlations between observations. As a starting point, consider linear regression. In a standard linear regression model, we assume that the samples are independent of one another, with a constant variance and no covariance.
Instead, we can also use something like a **mixed model**, where we include a random effect for each group and estimate the variance attributable to the grouping effect. By default, this ultimately assumes a constant correlation from time point to time point, but many tools allow you to specify a more complex covariance structure. A common method is to use autoregressive covariance structure that allows for correlations further apart in time to lessen. In this sense the covariance comes in as an added *random effect*, rather than being a model in and of itself as with ARIMA. Many such approaches to covariance structures are special cases of **gaussian processes**, which are a very general technique to model time series, spatial, and other types of data.
```{r}
#| echo: false
#| eval: false
#| label: fig-covariance-structure
#| fig-cap: AR (1) Covariance Structure Visualized
library(nlme)
library(lme4)
model_corAR = lme(Reaction ~ 1, random= ~1|Subject, data = sleepstudy, correlation = corAR1(form=~Days))
df = tibble(id = rep(1:2, e= 12), Months = rep(1:12, 2))
model_corAR = lme(rnorm(24) ~ 1, random= ~1|id, data = df, correlation = corAR1(form=~Months))
# scico::scico_palette_show()
# phi = abs(coef(model_corAR$modelStruct$corStruct, unconstrained = F)*3)
phi = .66
init = corMatrix(Initialize(corAR1(phi, form = ~1|id), data=df))
rc_ar = bdsmatrix::bdsmatrix(rep(12, 2), blocks = rep(init[[1]][lower.tri(init[[1]], diag = T)], 2)) |>
as.matrix()
dates = seq.Date(as.Date("2022-01-15"), as.Date("2022-12-15"), by = "month")
dates = factor(month.abb, levels = month.abb)
l = length(dates)
p = tibble(x = rep(dates, e=l), y = rep(dates, times=l)) |>
mutate(
z = c(rc_ar[1:l, 1:l]),
z = ifelse(z == 0, NA, z)
) |>
ggplot(aes(x = x, y = y)) +
geom_tile(aes(fill=z), color = 'white', linewidth=3,show.legend = FALSE) +
# scale_x_date(breaks = '3 months', labels = label_date_short(format = c('%b'))) +
# scale_y_date(breaks = '3 months', labels = label_date_short(format = c('%b'))) +
# scale_x_date(breaks = '1 months', labels = date_format(format = c('%b'))) +
# scale_y_date(breaks = '1 months', labels = date_format(format = c('%b'))) +