-
Notifications
You must be signed in to change notification settings - Fork 2
/
intro-mixed-model.Rmd
1697 lines (1009 loc) · 82.4 KB
/
intro-mixed-model.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Intro Mixed Models"
output: html_document
date: "2023-06-22"
---
# Foundations of Mixed Modelling
```{r, include = FALSE}
suppressPackageStartupMessages({
library(webexercises)
library(glossary)
})
knitr::knit_hooks$set(webex.hide = function(before, options, envir) {
if (before) {
if (is.character(options$webex.hide)) {
hide(options$webex.hide)
} else {
hide()
}
} else {
unhide()
}
})
```
```{r, include = FALSE}
library(tidyverse)
library(rstatix)
library(performance)
library(see)
library(lmerTest)
library(patchwork)
library(DiagrammeR)
library(kableExtra)
library(broom.mixed)
library(ggeffects)
library(DHARMa)
library(sjPlot)
library(MuMIn)
library(emmeans)
library(downloadthis)
library(report)
library(MuMIn)
library(webexercises)
report_p <- function(p, digits = 3) {
reported <- if_else(p < 0.001,
"p < 0.001",
paste("p=", round(p, digits)))
return(reported)
}
```
## What is a mixed model?
Mixed models (also known as linear mixed models or hierarchical linear models) are statistical tests that build on the simpler tests of regression, t-tests and ANOVA. All of these tests are special cases of the general linear model; they all fit a straight line to data to explain variance in a systematic way.
The key difference with a linear mixed-effects model is the inclusion of random effects - variables where our observations are grouped into subcategories that systematically affect our outcome - to account for important structure in our data.
The mixed-effects model can be used in many situations instead of one of our more straightforward tests when this structure may be important. The main advantages of this approach are:
i) mixed-effects models account for more of the variance
ii) mixed-effects models incorporate group and even individual-level differences
iii) mixed-effects models cope well with missing data, unequal group sizes and repeated measurements
## Fixed vs Random effects
Fixed effects and random effects are terms commonly used in mixed modeling, which is a statistical framework that combines both in order to analyze data.
In mixed modeling, the fixed effects are used to estimate the overall relationship between the predictors and the response variable, while the random effects account for the within-group variability and allow for the modeling of the individual differences or group-specific effects.
The hierarchical structure of data refers to a data organization where observations are nested within higher-level groups or clusters. For example, students nested within classrooms, patients nested within hospitals, or employees nested within companies. This hierarchical structure introduces dependencies or correlations within the data, as observations within the same group tend to be more similar to each other than observations in different groups.
The need for mixed models arises when we want to account for these dependencies and properly model the variability at different levels of the hierarchy. Traditional regression models, such as ordinary least squares (OLS), assume that the observations are independent of each other. However, when working with hierarchical data, this assumption is violated, and ignoring the hierarchical structure can lead to biased or inefficient estimates, incorrect standard errors, and misleading inference.
By including random effects, mixed models allow for the estimation of both within-group and between-group variability. They provide a flexible framework for modeling the individual or group-specific effects and can capture the heterogeneity within and between groups. Additionally, mixed models can handle unbalanced or incomplete data, where some groups may have different numbers of observations.
### Fixed effects
In broad terms, fixed effects are variables that we expect will affect the dependent/response variable: they’re what you call explanatory variables in a standard linear regression.
Fixed effects are more common than random effects, at least in their use. Fixed effects estimate different levels with no relationship assumed between the levels. For example, in a model with a dependent variable of body length and a fixed effect for fish sex, you would get an estimate of mean body length for males and then an estimate for females separately.
We can consider this in terms of a very simple linear model, here the estimated intercept is the expected value of the outcome $y$ when the predictor $x$ has a value of 0. The estimated slope is the expected change in $y$ for a single unit change in $x$. These parameters are "fixed", meaning that each individual in the population has the same expected value for the intercept and slope.
The difference between the expected value and true value is called "residual error".
$$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$$
#### Examples:
1. Medical Research: In a clinical trial studying the effectiveness of different medications for treating a specific condition, the fixed effects could include categorical variables such as treatment group (e.g., medication A, medication B, placebo) or dosage level (e.g., low, medium, high). These fixed effects would capture the systematic differences in the response variable (e.g., symptom improvement) due to the specific treatment received.
2. Education Research: Suppose a study examines the impact of teaching methods on student performance in different schools. The fixed effects in this case might include variables such as school type (e.g., public, private), curriculum approach (e.g., traditional, progressive), or classroom size. These fixed effects would help explain the differences in student achievement across schools, accounting for the systematic effects of these factors.
3. Environmental Science: Imagine a study investigating the factors influencing bird species richness across different habitats. The fixed effects in this context could include variables such as habitat type (e.g., forest, grassland, wetland), habitat disturbance level (e.g., low, medium, high), or geographical region. These fixed effects would capture the systematic variations in bird species richness associated with the specific habitat characteristics.
Fixed effects are the default effects that we all learn as we begin to understand statistical concepts, and fixed effects are the default effects in functions like `lm()` and `aov()`.
### Random effects
Random effects are less commonly used but perhaps more widely encountered in nature. Each level can be considered a random variable from an underlying process or distribution in a random effect.
A random effect is a parameter that is allowed to vary across groups or individuals. Random effects do not take a single fixed value, rather they follow a distribution (usually the normal distribution). Random effects can be added to a model to account for variation around an intercept or slope. Each individual or group then gets their own estimated random effect, representing an adjustment from the mean.
So random effects are usually grouping factors for which we are trying to control. They are always categorical, as you can’t force R to treat a continuous variable as a random effect. A lot of the time we are not specifically interested in their impact on the response variable, but we know that they might be influencing the patterns we see.
#### Examples:
1. Longitudinal Health Study: Consider a study tracking the blood pressure of individuals over multiple time points. In this case, a random effect can be included to account for the individual-specific variation in blood pressure. Each individual's blood pressure measurements over time would be treated as repeated measures within that individual, and the random effect would capture the variability between individuals that is not explained by the fixed effects. This random effect allows for modeling the inherent individual differences in blood pressure levels.
2. Social Network Analysis: Suppose a study examines the influence of peer groups on adolescent behavior. The study may collect data on individual behaviors within schools, where students are nested within classrooms. In this scenario, a random effect can be incorporated at the classroom level to account for the shared social environment within each classroom. The random effect captures the variability in behavior among classrooms that is not accounted for by the fixed effects, enabling the study to analyze the effects of individual-level and classroom-level factors simultaneously.
3. Ecological Study: Imagine a research project investigating the effect of environmental factors on species abundance in different study sites. The study sites may be geographically dispersed, and a random effect can be included to account for the variation between study sites. The random effect captures the unexplained heterogeneity in species abundance across different sites, allowing for the examination of the effects of environmental variables while accounting for site-specific differences.
The random effect $U_i$ is often assumed to follow a normal distribution with a mean of zero and a variance estimated during the model fitting process.
$$Y_i = β_0 + U_j + ε_i$$
```{block, type = "warning"}
In the book "Data analysis using regression and multilevel/hierarchical models" (@gelman_hill_2006). The authors examined five definitions of fixed and random effects and found no consistent agreement.
(1) Fixed effects are constant across individuals, and random effects vary. For example, in a growth study, a model with random intercepts a_i and fixed slope b corresponds to parallel lines for different individuals i, or the model y_it = a_i + b t thus distinguish between fixed and random coefficients.
(2) Effects are fixed if they are interesting in themselves or random if there is interest in the underlying population.
(3) “When a sample exhausts the population, the corresponding variable is fixed; when the sample is a small (i.e., negligible) part of the population the corresponding variable is random.”
(4) “If an effect is assumed to be a realized value of a random variable, it is called a random effect.”
(5) Fixed effects are estimated using least squares (or, more generally, maximum likelihood) and random effects are estimated with shrinkage.
Thus it turns out fixed and random effects are not born but made. We must make the decision to treat a variable as fixed or random in a particular analysis.
```
> When determining what should be a fixed or random effect in your study, consider what are you trying to do? What are you trying to make predictions about? What is just variation (a.k.a “noise”) that you need to control for?
```{block, type = "warning"}
More about random effects:
Note that the golden rule is that you generally want your random effect to have at least five levels. So, for instance, if we wanted to control for the effects of fish sex on body length, we would fit sex (a two level factor: male or female) as a fixed, not random, effect.
This is, put simply, because estimating variance on few data points is very imprecise. Mathematically you could, but you wouldn’t have a lot of confidence in it. If you only have two or three levels, the model will struggle to partition the variance - it will give you an output, but not necessarily one you can trust.
Finally, keep in mind that the name random doesn’t have much to do with mathematical randomness. Yes, it’s confusing. Just think about them as the grouping variables for now. Strictly speaking it’s all about making our models representative of our questions and getting better estimates. Hopefully, our next few examples will help you make sense of how and why they’re used.
There’s no firm rule as to what the minimum number of factor levels should be for random effect and really you can use any factor that has two or more levels. However it is commonly reported that you may want five or more factor levels for a random effect in order to really benefit from what the random effect can do (though some argue for even more, 10 levels). Another case in which you may not want to random effect is when you don’t want your factor levels to inform each other or you don’t assume that your factor levels come from a common distribution. As noted above, male and female is not only a factor with only two levels but oftentimes we want male and female information estimated separately and we’re not necessarily assuming that males and females come from a population of sexes in which there is an infinite number and we’re interested in an average.
```
## Why use mixed models?
The provided code generates a dataset that is suitable for testing mixed models. Let's break down the code and annotate each step:
```{r, include = FALSE}
# Generating a fake dataset with different means for each group
set.seed(123) # Setting seed for reproducibility
rand_eff <- data.frame(group = as.factor(seq(1:5)),
b0 = rnorm(5, mean = 0, sd = 20),
b1 = rnorm(5, 0, 0.5))
```
This section creates a data frame called rand_eff containing random effects. It consists of five levels of a grouping variable (group), and for each level, it generates random effects (b0 and b1) using the `rnorm` function.
```{r}
data <- expand.grid(group = as.factor(seq(1:10)),
obs = as.factor(seq(1:100))) %>%
left_join(rand_eff,
by = "group") %>%
mutate(x = runif(n = nrow(.), 0, 10),
B0 = 20,
B1 = 2,
E = rnorm(n = nrow(.), 0, 10)) %>%
mutate(y = B0 + b0 + x * (B1 + b1) + E)
data <- expand.grid(group = as.factor(seq(1:4)),
obs = as.factor(seq(1:100)))
```
This section creates the main dataset (data) for testing mixed models. It uses `expand.grid` to create a combination of levels for the grouping variable (group) and observation variable (obs). It then performs a `left join` with the rand_eff data frame, matching the group variable to incorporate the random effects for each group.
The code continues to `mutate` the dataset by adding additional variables:
- x is a random predictor variable generated using `runif` to have values between 0 and 10.
- B0 and B1 represent fixed effects of the intercept and slope with predetermined values of 20 and 2, respectively.
- E represents the error term, generated using `rnorm` with a mean of 0 and standard deviation of 10.
- Finally, y is created as the response variable using a linear model equation that includes the fixed effects (B0 and B1), random effects (b0 and b1), the predictor variable (x), and the error term (E).
```{r}
data.1 <- expand.grid(group = as.factor(5),
obs = as.factor(seq(1:30)))
data <- bind_rows(data, data.1) %>%
left_join(rand_eff,
by = "group") %>%
mutate(x = runif(n = nrow(.), 0, 10),
B0 = 20,
B1 = 2,
E = rnorm(n = nrow(.), 0, 10)) %>%
mutate(y = B0 + b0 + x * (B1 + b1) + E)
```
This section creates an additional dataset (`data.1`) with a specific group (group = 5) and a smaller number of observations (obs = 30) for testing purposes. This is then appended to the original dataset, we will see the effect of having a smaller group within our random effects when we discuss partial pooling and shrinkage later on.
Now we have three variables to consider in our models: x, y and group (with five levels).
```{r}
data %>%
select(x, y, group, obs) %>%
head()
```
Now that we have a suitable simulated dataset, let's start modelling!
### All in one model
We will begin by highlighting the importance of considering data structure and hierarchy when building linear models. To illustrate this, we will delve into an example that showcases the consequences of ignoring the underlying data structure. We might naively construct a single linear model that ignores the group-level variation and treats all observations as independent. This oversimplified approach fails to account for the fact that observations within groups are more similar to each other due to shared characteristics.
$$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$$
```{r}
basic_model <- lm(y ~ x, data = data)
summary(basic_model)
```
Here we can see that the basic linear model has produced a statistically significant regression analysis (*t*~998~ = 11.24, *p* <0.001) with an $R^2$ of 0.11. There is a medium effect positive relationship between changes in x and y (Estimate = 2.765, S.E. = 0.25).
We can see that clearly if we produce a simple plot of x against y:
```{r, fig.cap = "Simple scatter plot x against y"}
plot(data$x, data$y)
```
Here if we use the function `geom_smooth()` on the scatter plot, the plot also includes a fitted regression line obtained using the "lm" method. This allows us to examine the overall trend and potential linear association between the variables.
```{r, fig.cap = "Scatter plot displaying the relationship between the independent variable and the dependent variable. The points represent the observed data, while the fitted regression line represents the linear relationship between the variables. The plot helps visualize the trend and potential association between the variables."}
ggplot(data, aes(x = x,
y = y)) +
geom_point() +
labs(x = "Independent Variable",
y = "Dependent Variable")+
geom_smooth(method = "lm")
```
The `check_model()` function from the `performance` package (@R-performance) is used to evaluate the performance and diagnostic measures of a statistical model. It provides a comprehensive assessment of the model's fit, assumptions, and predictive capabilities. By calling this function, you can obtain a summary of various evaluation metrics and diagnostic plots for the specified model.
It enables you to identify potential issues, such as violations of assumptions, influential data points, or lack of fit, which can affect the interpretation and reliability of your model's results
```{r}
performance::check_model(basic_model)
```
Looking at the fit of our model we would be tempted to conclude that we have an accurate and robust model.
However, when data is hierarchically structured, such as individuals nested within groups, there is typically correlation or similarity among observations within the same group. By not accounting for this clustering effect, estimates derived from a single model can be biased and inefficient. The assumption of independence among observations is violated, leading to incorrect standard errors and inflated significance levels.
From the figure below we can see the difference in median and range of x values within each of our groups:
```{r, fig.cap = "Linear model conducted on all data"}
ggplot(data, aes(x = group,
y = y)) +
geom_boxplot() +
labs(x = "Groups",
y = "Dependent Variable")
```
In this figure, we colour tag the data points by group, this can be useful for determining if a mixed model is appropriate.
Here's why:
By colour-coding the data points based on the grouping variable, the plot allows you to visually assess the within-group and between-group variability. If there are noticeable differences in the data patterns or dispersion among the groups, it suggests that the data may have a hierarchical structure, where observations within the same group are more similar to each other than to observations in other groups.
```{r}
# Color tagged by group
plot_group <- ggplot(data, aes(x = x,
y = y,
color = group,
group = group)) +
geom_point(alpha = 0.6) +
labs(title = "Data Coloured by Group",
x = "Independent Variable",
y = "Dependent Variable")+
theme(legend.position="none")
plot_group
```
From the above plots, it confirms that our observations from within each of the ranges aren’t independent. We can’t ignore that: as we’re starting to see, it could lead to a completely erroneous conclusion.
### Multiple analyses approach
Running separate linear models per group, also known as stratified analysis, can be a feasible approach in certain situations. However, there are several drawbacks including
- Increased complexity
- Inability to draw direct conclusions on overall variability
- Reduced statistical power
- Inflated Type 1 error risk
- Inconsistent estimates
- Limited ability to handle unbalanced/missing data
```{r, fig.cap = "Scatter plot showing the relationship between the independent variable (x) and the dependent variable (y) colored by group. Each subplot represents a different group. The line represents the group-level linear regression smoothing."}
# Plotting the relationship between x and y with group-level smoothing
ggplot(data, aes(x = x, y = y, color = group, group = group)) +
geom_point(alpha = 0.6) + # Scatter plot of x and y with transparency
labs(title = "Data Colored by Group", x = "Independent Variable", y = "Dependent Variable") +
theme(legend.position = "none") +
geom_smooth(method = "lm") + # Group-level linear regression smoothing
facet_wrap(~group) # Faceting the plot by group
```
```{r}
# Creating nested data by grouping the data by 'group'
nested_data <- data %>%
group_by(group) %>%
nest()
# Fitting linear regression models to each nested data group
models <- map(nested_data$data, ~ lm(y ~ x, data = .)) %>%
map(broom::tidy)
# Combining the model results into a single data frame
combined_models <- bind_rows(models)
# Filtering the rows to include only the 'x' predictor
filtered_models <- combined_models %>%
filter(term == "x")
# Adding a column for the group index using rowid_to_column function
group_indexed_models <- filtered_models %>%
rowid_to_column("group")
# Modifying the p-values using a custom function report_p
final_models <- group_indexed_models %>%
mutate(p.value = report_p(p.value))
final_models
```
In the code above, the dataset data is first grouped by the variable 'group' using the `group_by` function, and then the data within each group is nested using the `nest` function. This results in a new dataset nested_data where each group's data is stored as a nested tibble.
Next, a linear regression model (`lm`) is fit to each nested data group using the `map` function. The `broom::tidy` function is applied to each model using map to extract the model summary statistics, such as coefficients, p-values, and standard errors. The resulting models are stored in the models object.
The `bind_rows` function is used to combine the model results into a single data frame called combined_models. The data frame is then filtered to include only the rows where the predictor is 'x' using the `filter` function, resulting in the filtered_models data frame.
To add a column for the group index, the `rowid_to_column` function is applied to the filtered_models data frame, creating the group_indexed_models data frame with an additional column named 'group'.
Finally, the p-values in the group_indexed_models data frame are modified using a custom function `report_p`
`r hide("report_p function")`
```{r, eval = FALSE}
report_p <- function(p, digits = 3) {
reported <- if_else(p < 0.001,
"p < 0.001",
paste("p=", round(p, digits)))
return(reported)
}
```
`r unhide()`
### Complex model
Using a group level term with an interaction on x as a fixed effect means explicitly including the interaction term between x and the group as a predictor in the model equation. This approach assumes that the relationship between x and the outcome variable differs across groups and that these differences are constant and fixed. It implies that each group has a unique intercept (baseline level) and slope (effect size) for the relationship between x and the outcome variable. By treating the group level term as a fixed effect, the model estimates *specific parameter values for each group*.
If we are not explicitly interested in the outcomes or differences for each individual group (but wish to account for them) - this may not be the best option as it can lead to *overfitting* and it uses a lot more degrees of freedom - impacting estimates and widening our confidence intervals. As with running multiple models above, there is limited ability to make inferences outside of observed groups, and it does not handle missing data or unbalanced designs well.
$$Y_i = \beta_0 + \beta_1X_i + \beta_2.group_i+\beta_3(X_i.group_i)+\epsilon_i$$
```{r}
additive_model <- lm(y ~ x*group, data = data)
summary(additive_model)
```
## Our first mixed model
A mixed model is a good choice here: it will allow us to use all the data we have (higher sample size) and account for the correlations between data coming from the groups. We will also estimate fewer parameters and avoid problems with multiple comparisons that we would encounter while using separate regressions.
We can now join our random effect $U_j$ to the full dataset and define our y values as
$$Y_{ij} = β_0 + β_1*X_{ij} + U_j + ε_{ij}$$.
We have a response variable, and we are attempting to explain part of the variation in test score through fitting an independent variable as a fixed effect. But the response variable has some residual variation (i.e. unexplained variation) associated with group. By using random effects, we are modeling that unexplained variation through variance.
We now want to know if an association between `y ~ x` exists after controlling for the variation in group.
### Running mixed effects models with lmerTest
This section will detail how to run mixed models with the `lmer` function in the R package `lmerTest` (@R-lmerTest). This builds on the older `lme4` (@R-lme4) package, and in particular add p-values that were not previously included. There are other R packages that can be used to run mixed-effects models including the `nlme` package (@R-nlme) and the `glmmTMB` package (@R-glmmTMB). Outside of R there are also other packages and software capable of running mixed-effects models, though arguably none is better supported than R software.
`r hide("Plotting random intercepts")`
```{r, eval = FALSE}
plot_function2 <- function(model, title = "Data Coloured by Group"){
data <- data %>%
mutate(fit.m = predict(model, re.form = NA),
fit.c = predict(model, re.form = NULL))
data %>%
ggplot(aes(x = x, y = y, col = group)) +
geom_point(pch = 16, alpha = 0.6) +
geom_line(aes(y = fit.c, col = group), linewidth = 2) +
coord_cartesian(ylim = c(-40, 100))+
labs(title = title,
x = "Independent Variable",
y = "Dependent Variable")
}
mixed_model <- lmer(y ~ x + (1|group), data = data)
plot_function2(mixed_model, "Random intercept")
```
`r unhide()`
```{r, echo = FALSE}
plot_function2 <- function(model, title = "Data Coloured by Group"){
data <- data %>%
mutate(fit.m = predict(model, re.form = NA),
fit.c = predict(model, re.form = NULL))
data %>%
ggplot(aes(x = x, y = y, col = group)) +
geom_point(pch = 16, alpha = 0.6) +
geom_line(aes(y = fit.c, col = group), linewidth = 2) +
coord_cartesian(ylim = c(-40, 100))+
labs(title = title,
x = "Independent Variable",
y = "Dependent Variable")
}
mixed_model <- lmer(y ~ x + (1|group), data = data)
plot_function2(mixed_model, "Random intercept")
```
```{r}
# random intercept model
mixed_model <- lmer(y ~ x + (1|group), data = data)
summary(mixed_model)
```
Here our groups clearly explain a lot of variance
```
205/(205 + 101) = 0.669 / 66.9%
```
So the differences between groups explain ~67% of the variance that’s “left over” *after* the variance explained by our fixed effects.
### Partial pooling
It is worth noting that random effect estimates are a function of the group-level information and the overall (grand) mean of the random effect. Group levels with low sample size or poor information (i.e., no strong relationship) are more strongly influenced by the grand mean, which adds information to an otherwise poorly-estimated group. However, a group with a large sample size or strong information (i.e., a strong relationship) will have very little influence from the grand mean and largely reflect the information contained entirely within the group. This process is called partial pooling (as opposed to no pooling, where no effect is considered, or total pooling, where separate models are run for the different groups). Partial pooling results in the phenomenon known as shrinkage, which refers to the group-level estimates being shrunk toward the mean. What does all this mean? If you use a random effect, you should be prepared for your factor levels to have some influence from the overall mean of all levels. With a good, clear signal in each group, you won’t see much impact from the overall mean, but you will with small groups or those without much signal.
Below we can take a look at the estimates and standard errors for three of our previously constructed models:
```{r}
pooled <- basic_model %>%
broom::tidy() %>%
mutate(Approach = "Pooled", .before = 1) %>%
select(term, estimate, std.error, Approach)
no_pool <- additive_model %>%
broom::tidy() %>%
mutate(Approach = "No Pooling", .before = 1) %>%
select(term, estimate, std.error, Approach)
partial_pool <- mixed_model %>%
broom.mixed::tidy() %>%
mutate(Approach = "Mixed Model/Partial Pool", .before = 1) %>%
select(Approach, term, estimate, std.error)
bind_rows(pooled, no_pool, partial_pool) %>%
filter(term %in% c("x" , "(Intercept)") )
```
Pooling helps to improve the precision of the estimates by borrowing strength from the entire dataset. However, this can also lead to differences in the estimates and standard errors compared to models without pooling.
- Pooled: in the pooled model the averaged estimates may not accurately reflect the true values within each group. As a result, the estimates in pooled models can be biased towards the average behavior across all groups. We can see this in the too small standard error of the intercept, underestimating the variance in our dataset. At the same time if there is substantial variability in the relationships between groups, the pooled estimates can be less precise. This increased variability across groups can contribute to larger standard errors of the difference (SED) for fixed effects in pooled models.
- No pooling: This model is extremely precise with the smallest errors, however these estimates reflect conditions only for the first group in the model
- Partial pooling/Mixed models: this model reflects the greater uncertainty of the Mean and SE of the intercept. However, the SED in a partial pooling model accounts for both the variability within groups and the uncertainty between groups. Compared to a no pooling approach, the SED in a partial pooling model tends to be smaller because it incorporates the pooled information, which reduces the overall uncertainty. This adjusted SED provides a more accurate measure of the uncertainty associated with the estimated differences between groups or fixed effects.
## Predictions
One misconception about mixed-effects models is that we cannot produce estimates of the relationships for each group.
But how do we do this?
We can use the `coef()` function to extract the estimates (strictly these are predictions) for the random effects. This output has several components.
```{r}
coef(mixed_model)
```
This function produces our 'best linear unbiased predictions' (BLUPs) for the intercept and slope of the regression at each site. The predictions given here are different to those we would get if we ran individual models on each site, as BLUPs are a product of the compromise of complete-pooling and no-pooling models. Now the predicted intercept is influenced by other sites leading to a process called 'shrinkage'.
Why are these called predictions and not estimates? Because we have estimated the variance at each site and from here essentially borrowed information across sites to improve the accuracy, to combine with the fixed effects. So in the strictest sense we are *predicting* relationships rather than through direct observation.
This generous ability to make predictions is one of the main advantages of a mixed-model.
The `summary()` function has already provided the estimates of the fixed effects, but they can also be extracted with the `fixef()` function.
```{r}
fixef(mixed_model)
```
We can also apply `anova()` to a single model to get an F-test for the fixed effect
```{r}
anova(mixed_model)
```
### Shrinkage in mixed models
The graph below demonstrates the compromise between complete pooling and no pooling. It plots the overall regression line/mean (the fixed effects from the `lmer` model), the predicted slopes at each site from the mixed-effects model, and compares this to the estimates from each site (nested lm).
As you can see most of the groups show shrinkage, that is they deviate less from the overall mean, most obviously in group 5, where the sample size is deliberately reduced. Here you can see the predicted line is much closer to the overall regression line, reflecting the greater uncertainty. The slope is drawn towards the overall mean by shrinkage.
```{r}
# Nesting the data by group
nested_data <- data %>%
group_by(group) %>%
nest()
# Fitting linear regression models and obtaining predictions for each group
nested_models <- map(nested_data$data, ~ lm(y ~ x, data = .)) %>%
map(predict)
```
```{r, fig.cap = "Regression relationships from fixed-effects and mixed effects models, note shrinkage in group 5"}
# Creating a new dataframe and adding predictions from different models
data1 <- data %>%
mutate(fit.m = predict(mixed_model, re.form = NA),
fit.c = predict(mixed_model, re.form = NULL)) %>%
arrange(group,obs) %>%
mutate(fit.l = unlist(nested_models))
# Creating a plot to visualize the predictions
data1 %>%
ggplot(aes(x = x, y = y, colour = group)) +
geom_point(pch = 16) +
geom_line(aes(y = fit.l, linetype = "lm"), colour = "black")+
geom_line(aes(y = fit.c, linetype = "lmer"))+
geom_line(aes(y = fit.m, linetype = "Mean"), colour = "grey")+
scale_linetype_manual(name = "Model Type",
labels = c("Mean", "lmer", "lm"),
values = c("dotdash", "solid", "dashed"))+
facet_wrap( ~ group)+
guides(colour = "none")
```
`re.form = NA`: When re.form is set to NA, it indicates that the random effects should be ignored during prediction. This means that the prediction will be based solely on the fixed effects of the model, ignoring the variation introduced by the random effects. This is useful when you are interested in estimating the overall trend or relationship described by the fixed effects, without considering the specific random effects of individual groups or levels.
`re.form = NULL`: Conversely, when re.form is set to NULL, it indicates that the random effects should be included in the prediction. This means that the prediction will take into account both the fixed effects and the random effects associated with the levels of the random effect variable. The model will use the estimated random effects to generate predictions that account for the variation introduced by the random effects. This is useful when you want to visualize and analyze the variation in the response variable explained by different levels of the random effect.
```{r, eval = FALSE, echo = FALSE}
basic_pred <- emmeans(basic_model, specs = ~ x, at = list(x = c(0, 2.5, 5, 7.5, 10))) %>% as_tibble()
mixed_pred <- emmeans(mixed_model, specs = ~ x, at = list(x = c(0, 2.5, 5, 7.5, 10))) %>% as_tibble()
pooled_plot <- data %>%
ggplot(aes(x = x, y = y)) +
geom_point(pch = 16, alpha = 0.6, aes(col = group)) +
geom_line(aes(x = x, y = emmean), linewidth = 1, data = basic_pred) +
geom_line(aes(x = x, y = lower.CL), linewidth = 0.5, linetype = 2, col = "gray50", data = basic_pred) +
geom_line(aes(x = x, y = upper.CL), linewidth = 0.5, linetype = 2, col = "gray50", data = basic_pred) +
coord_cartesian(ylim = c(0, 70))+
labs(title = "Pooled model",
x = "Independent Variable",
y = "Dependent Variable") +
theme(legend.position = "none")
partial_plot <- data %>%
ggplot(aes(x = x, y = y)) +
geom_point(pch = 16, alpha = 0.6, aes(col = group)) +
geom_line(aes(x = x, y = emmean), linewidth = 1, data = basic_pred) +
geom_line(aes(x = x, y = lower.CL), linewidth = 0.5, linetype = 2, col = "gray50", data = mixed_pred) +
geom_line(aes(x = x, y = upper.CL), linewidth = 0.5, linetype = 2, col = "gray50", data = mixed_pred) +
coord_cartesian(ylim = c(0, 70))+
labs(title = "Mixed model",
x = "Independent Variable",
y = "Dependent Variable") +
theme(legend.position = "none")
pooled_plot /
partial_plot
```
It's not always easy/straightforward to make prediciton or confidence intervals from complex general and generalized linear mixed models, luckily there are some excellent R packages that will do this for us.
### `ggeffects`
`ggeffects` (@R-ggeffects) is a light-weight package that aims at easily calculating marginal effects and adjusted predictions
#### `ggpredict`
The argument `type = random` in the `ggpredict` function (from the ggeffects package @R-ggeffects) is used to specify the type of predictions to be generated in the context of mixed effects models. The main difference between using `ggpredict` with and without `type = random` lies in the type of predictions produced:
Without `type = random`: `ggpredict` will generate **fixed-effects predictions**. These estimates are based solely on the fixed effects of the model, ignoring any variability associated with the random effects. The resulting estimate represent the average or expected values of the response variable for specific combinations of predictor values, considering only the fixed components of the model.
Estimated mean fixed effects provide a comprehensive view of the average effect of dependent variables on the response variable. By plotting the estimated mean fixed effects using ggpredict, you can visualize how the response variable changes across different levels or values of the predictors while considering the effects of other variables in the model.
```{r}
library(ggeffects)
ggpredict(mixed_model, terms = "x") %>%
plot(., add.data = TRUE)
```
With `type = random`: ggpredict will generate **predictions** that **incorporate both fixed and random effects**. These predictions take into account the variability introduced by the random effects in the model. The resulting predictions reflect not only the average trend captured by the fixed effects but also the additional variability associated with the random effects at different levels of the grouping factor(s).
By default the figure now produces **prediciton intervals**
```{r}
ggpredict(mixed_model, terms = "x", type = "random") %>%
plot(., add.data = TRUE)
```
```{block, type = "info"}
Confidence Intervals:
A confidence interval is used to estimate the range of plausible values for a population parameter, such as the mean or the regression coefficient, based on a sample from that population. It provides a range within which the true population parameter is likely to fall with a certain level of confidence. For example, a 95% confidence interval implies that if we were to repeat the sampling process many times, 95% of the resulting intervals would contain the true population parameter.
Prediction Intervals:
On the other hand, a prediction interval is used to estimate the range of plausible values for an individual observation or a future observation from the population. It takes into account both the uncertainty in the estimated model parameters and the inherent variability within the data. A 95% prediction interval provides a range within which a specific observed or predicted value is likely to fall with a certain level of confidence. The wider the prediction interval, the greater the uncertainty around the specific value being predicted.
```
Furthermore, `ggpredict()` enables you to explore group-level predictions, which provide a deeper understanding of how the response variable varies across different levels of the grouping variable. By specifying the desired grouping variable in ggpredict when `type = random`, you can generate plots that depict the predicted values for each group separately, allowing for a comparative analysis of group-level effects.
```{r}
ggpredict(mixed_model, terms = c("x", "group"), type = "random") %>%
plot(., add.data = TRUE) +
facet_wrap(~group)+
theme(legend.position = "none")
```
### `sjPlot`
Another way to visualise mixed model results is with the package `sjPlot`(@R-sjPlot), and if you are interested in showing the variation among levels of your random effects, you can plot the departure from the overall model estimate for intercepts - *and slopes, if you have a random slope model*:
```{r}
library(sjPlot)
plot_model(mixed_model, terms = c("x", "group"), type = "re")/
(plot_model(mixed_model, terms = c("x", "group"), type = "est")+ggtitle("Fixed Effects"))
```
```{block, type = "warning"}
The values you see are NOT actual values, but rather the difference between the general intercept or slope value found in your model summary and the estimate for this specific level of random/fixed effect.
```
`sjPlot` is also capable of producing prediction plots in the same way as `ggeffects`:
```{r}
plot_model(mixed_model,type="pred",
terms=c("x", "group"),
pred.type="re",
show.data = T)+
facet_wrap( ~ group)
```
## Checking model assumptions
We need to be just as conscious of testing the assumptions of mixed effects models as we are with any other. The assumptions are:
1. Within-group errors are independent and normally distributed with mean zero and variance $\sigma^2$
2. Within-group errors are independent of the random effects.
3. Random effects are normally distributed with mean zero.
4. Random effects are independent for different groups, except as specified by nesting (more on this later)
Several model check plots would help us to confirm/deny these assumptions, but note that QQ-plots may not be relevant because of the model structure. Two commonly-used plots are:
```{r}
plot(mixed_model)
```
```{r}
qqnorm(resid(mixed_model))
qqline(resid(mixed_model))
```
It can often be useful to check the distribution of residuals in each of the groups (e.g. blocks) to check assumptions 1 and 2. We can do this by plotting the residuals against the fitted values, separately for each level of the random effect, using the `coplot()` function:
```{r}
coplot(resid(mixed_model) ~ fitted(mixed_model) | data$group)
```
Each sub-figure of this plot, which refers to an individual group, doesn’t contain much data. It can be hard to judge whether the spread of residuals around fitted values is the same for each group when observations are low.
We can check assumption 3 with a histogram (here the more levels we have, the more we can be assured):
```{r}
rand_dist <- as.data.frame(ranef(mixed_model)) %>%
mutate(group = grp,
b0_hat = condval,
intercept_cond = b0_hat + summary(mixed_model)$coef[1,1],
.keep = "none")
hist(rand_dist$b0_hat)
```
Overlaying the random distribution of our intercept over the regression line produces a plot like the below:
```{r, fig.cap = "Marginal fit, heavy black line from the random effect model with a histogram of the of the distribution of conditional intercepts"}
data1 %>%
ggplot(aes(x = x, y = y)) +
geom_point(pch = 16, col = "grey") +
geom_violinhalf(data = rand_dist,
aes(x = 0, y = intercept_cond),
trim = FALSE,
width = 4,
adjust = 2,
fill = NA)+
geom_line(aes(y = fit.m), linewidth = 2) +
coord_cartesian(ylim = c(-40, 100))+
labs(x = "Independent Variable",
y = "Dependent Variable")
```
### performance
The `check_model()` function from the performance package in R is a useful tool for evaluating the performance and assumptions of a statistical model, particularly in the context of mixed models. It provides a comprehensive set of diagnostics and visualizations to assess the model's fit, identify potential issues, and verify the assumptions underlying the model, which can be difficult with complex models
```{r}
library(performance)
check_model(mixed_model)
```
### DHARma
Simulation-based methods, like those available with DHARMa (@R-DHARMa), are often preferred for model validation and assumption checking because they offer flexibility and do not rely on specific assumptions. Simulation is particularly useful for evaluating complex models with multiple levels of variability, non-linearity, or complex hierarchical structures. Such models may not be adequately evaluated by solely examining residuals, and simulation provides a more robust approach to assess their assumptions and performance.
Read the authors summary [here](https://theoreticalecology.wordpress.com/2016/08/28/dharma-an-r-package-for-residual-diagnostics-of-glmms/)
```{r}
library(DHARMa)
resid.mm <- simulateResiduals(mixed_model)
plot(resid.mm)
```
```{block, type = "warning"}
When you have a lot of data, even minimal deviations from the expected distribution will become significant (this is discussed in the help/vignette for the DHARMa package). You will need to assess the distribution and decide if it is important for yourself.
```
## Practice Questions
1. In mixed-effects models, independent variables are also called:
`r mcq(c("Random Effects", answer = "Fixed Effects", "Mediators", "Variance"))`
2. A random effect is best described as?
`r longmcq(c("A variable that is noisier than other variables", "An effect that cannot be predicted based on group membership", answer="A variable where subgroups influence the outcome","Choosing slope values with a random number generator"))`
3. Which of the following is **not** an advantage of mixed-effects models?
`r longmcq(c("They can cope with situations where individuals contribute different numbers of observations", "They can handle group-level differences in the outcome", "They can deal with variation due to different groups", answer = "They are guaranteed to give significant results when ANOVAs do not"))`
4. Mixed-effects models cope well with missing data because?
`r longmcq(c(answer = "They can still estimate parameters even when some observations are missing", "They delete observations with missing data", "They set all values to 0"))`
# Worked Example 1 - Dolphins
## How do I decide if it is a fixed or random effect?
One of the most common questions in mixed-effects modelling is how to decide if an effect should be considered as fixed or random. This can be quite a complicated question, as we have touched upon briefly before the definition of a random effect is not universal. It is a considered process that can include the hypothesis or question you are asking.
1 ) Are you directly interested in the effect in question. If the answer is yes it should be a fixed effect.
2 ) Is the variable continuous? If the answer is yes it should be a fixed effect.
3 ) Does the variable have less than five levels? If ther answer is yes it should be a fixed effect.
## Dolphins
This dataset was collected to measure resting lung function in 32 bottlenose dolphins. The main dependent variable was the tidal volume ($V_T$) measured in litres, as an index of lung capacity.
```{r, eval = FALSE, echo = FALSE}
link = "https://raw.githubusercontent.com/UEABIO/intro-mixed-models/main/book/files/dolphins.csv",
```
```{r, echo = FALSE}
dolphins <- read_csv("book/files/dolphins.csv")
dolphins$direction <- factor(dolphins$direction)
dolphins <- drop_na(dolphins)
```
```{r, eval = FALSE}
dolphins <- read_csv("dolphins.csv")
```
Here we are interested in the relationship between ($V_T$) and body mass (kg), we have measurements which are taken on the breath *in* and the breath *out*, and each dolphin has been observed between one and four times.
We need to determine our fixed effects, random effects and model structure:
1) Body Mass `r mcq(c("Random Effect", answer = "Fixed Effect"))`
2) Direction `r mcq(c("Random Effect", answer = "Fixed Effect"))`
3) Animal 1) Body Mass `r mcq(c(answer = "Random Effect", "Fixed Effect"))`
4) With the basic structure `y ~ x + z + (1|group)` what do you think this model should be?:
`fitb(c("vt ~ bodymass + direction + (1|animal)", "vt ~ direction + bodymass + (1|animal)"), ignore_ws = TRUE, width = "20")`
`r hide("Solution")`
1) We are clearly interested in the effect of body mass on ($V_T$) so this is a **fixed effect**.
2) We may think that the relationship with ($V_T$) and body mass may be different on the in and out breath. We may not be *directly* interested in this, but it has fewer than fivel levels so this is a **fixed effect**. (outbreath coded as 1, inbreath coded as 2).
3) Individual dolphins - if we averaged across measurements for each dolphin, our measurement precision would be different for each animal. If we include each data point, we would be double-counting some animals and our observations would not be independent. To account for the multiple observations we should treat animal as a **random effect**.
```{r}
dolphmod <- lmer(vt ~ bodymass + direction + (1|animal), data=dolphins)
```
`r unhide()`
With our basic linear model in place - we should carry out model fit checks with `DHARMa` or `performance::check_model()`, but assuming this is a good fit we can look at interpretation:
```{r}
summary(dolphmod)
```
```{r}
dolphins.1 <- dolphins %>%
mutate(fit.m = predict(dolphmod, re.form = NA),
fit.c = predict(dolphmod, re.form = NULL))
```
```{r, fig.cap = "Scatter plot of VT as a function of body mass for dolphins. Different directions of breath are represented by different colors. The solid lines indicate the marginal fitted values from our model."}
dolphins.1 %>%
ggplot(aes(x = bodymass, y = vt, group = direction)) +
geom_point(pch = 16, aes(colour = direction)) +
geom_line(aes(y = fit.m,
linetype = direction),
linewidth = 1) +
labs(x = "Body Mass",
y = "VT")
```
```{r}
plot_model(dolphmod,type="pred",
terms=c("bodymass", "direction"),
pred.type="fe",
show.data = T)
```
`r hide("A brief summary")`
We fitted a linear mixed model (estimated using REML and nloptwrap optimizer) to predict ($V_T$) with bodymass(kg) and direction (in/out breath). 95% Confidence Intervals (CIs) and p-values were computed using a Wald t-distribution approximation. We included a random intercept effect of animal to account for repeated measurements (of between 1 to 4 observations) across a total of 32 bottlenosed dolphins.
We found that for every 1kg increase in bodymass, ($V_T$) increased by 0.02 litres (95% CI [0.01 - 0.02]), $t_{107}$ = 5.15, p < 0.001. The inbreath had on average a higher volume than the outbreath (1.11 litre difference [0.71 - 1.52]; $t_{107}$ = 5.48, p < 0.001).