-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathSpatial_Prediction_Genomic_Offset.Rmd
326 lines (217 loc) · 13.6 KB
/
Spatial_Prediction_Genomic_Offset.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
---
title: "Visualizing geographic predictions of genetic offset measures"
author: "Olivier François (Université Grenoble-Alpes)"
output:
pdf_document: default
html_document: default
---
### Introduction
Genomic offset statistics predict the maladaptation of populations to rapid habitat alteration based on association of genotypes with environmental variation. This brief tutorial explains how to represent predictions of genomic offset statistics within geographic maps. This can be achieved using standard `R` packages dedicated to spatial analysis.
In the tutorial, spatial prediction of genomic offset will be illustrated by analyzing publicly available genomic data from 1,096 European lines of the model plant *Arabidopsis thaliana* (https://www.1001genomes.org/) using climate models described in the sixth IPCC report (IPCC AR6).
To represent geographic maps, there are many other and often better methods than the ones presented in this tutorial. I do not claim being a cartography specialist, and the approach below is likely to be q suboptimal one. It is at least quite flexible, easy to reproduce, and it can be used as basis for improved representations.
### Loading genomic and environmental data
Displaying genomic offset statistics in geographic space will require that the `R` packages `terra`, `geodata`, `fields`, and `maps` are installed. The tutorial will use `LEA` for performing a genotype-environment association study and for computing genomic offset statistics.
```{r, warning = FALSE, message= FALSE}
# Required packages
# Loading worldclim/cimp6 bioclimatic data
library(terra)
library(geodata)
# displaying images and maps
library(fields)
library(maps)
# Adjusting genotype-environment association models
library(LEA)
```
Genomic and geographic data for A. thaliana samples are available from a previous tutorial on **running structure-like population genetic analyses with `R`**.
The data contain 1,096 genotypes from the first chromosome of the plant and geographic coordinates (latitude and longitude) associated with each sample. They can be downloaded as follows.
```{r, warning = FALSE, message= FALSE}
# default timeout option is 60s -- increase to 300s
options(timeout = max(300, getOption("timeout")))
# download sample genotypes in the working directory (54.4 MB)
url = "http://membres-timc.imag.fr/Olivier.Francois/Arabidopsis/A_thaliana_chr1.geno"
download.file(url = url, destfile = "./A_thaliana_chr1.geno")
# download sample coordinates in the working directory
url = "http://membres-timc.imag.fr/Olivier.Francois/Arabidopsis/at_coord.coord"
download.file(url = url, destfile = "./at_coord.coord")
```
The data are then loaded as `R` objects, and the sample coordinates can be visualized as follows.
```{r, message = FALSE}
genotype = LEA::read.geno("./A_thaliana_chr1.geno")
coordinates = as.matrix(read.table("./at_coord.coord"))
```
```{r, fig.width=5,fig.height=5}
plot(coordinates, cex = .4, col = "darkblue",
xlab = "Longitude", ylab = "Latitude",
main = "Sample coordinates", las = 1)
maps::map(add = TRUE, interior = FALSE, col = "grey40")
```
### Bioclimatic variables
In the following presentation, predictions will be based on bioclimatic variables extracted from the worldclim database. The bioclimatic variables are downloaded below. The `climate` object contains 19 historical temperature and precipitation variables with low resolution. A temporary path is used for downloading. If additional analyses are planed, changing to a non-temporary directory could be useful to avoid reloading the data (which may be slow).
```{r}
# Download global bioclimatic data from worldclim
climate <- geodata::worldclim_global(var = 'bio',
res = 10,
download = TRUE,
path=tempdir())
```
Next, the `climate_future` object contains future temperature and precipitation variables predicted from the SSP2-4.5 scenario developed with respect to the sixth IPCC report (IPCC AR6). Several climate models are available with `geodata`. The one used here is called 'ACCESS-ESM1-5'. If additional analyses are planed, changing to a non-temporary directory could be useful to avoid reloading the data
```{r}
# Download future climate scenario from 'ACCESS-ESM1-5' climate model.
climate_future <- geodata::cmip6_world(model='ACCESS-ESM1-5',
ssp='245',
time='2041-2060',
var='bioc',
download = TRUE,
res=10,
path=tempdir())
```
Now, environmental data can be extracted for each sample site. The extraction command results in an environmental matrix `X.env` having 1,096 rows and 19 columns after removing IDs.
```{r}
# extracting historical environmental data for A. thaliana samples
X.env = terra::extract(x = climate,
y = data.frame(coordinates),
cells = FALSE)
# remove IDs
X.env = X.env[,-1]
# extracting future environmental data for A. thaliana samples
#X.env_fut = terra::extract(x = climate_future, y = data.frame(coordinates), cells=FALSE)
#X.env_fut = X.env_fut[,-1]
```
### Genotype-Environment Association study
To evaluate genomic offset statistics, environmental effect sizes must be estimated at each genomic locus. This can be achieved by applying a latent factor mixed model (LFMM) in `LEA`. Based on a previous analysis, five latent factors are used in the LFMM. Because temperature and precipitation have distinct units, the environmental data are be centered by substracting their mean, and then divided by their standard deviation.
Another approach reduces the dimension of the bioclimatic data set by performing scaled PCA on temperature and precipitation variables separately. New variables could then be defined by retaining the first components in each separate analysis. Working with a large sample size, dimension reduction is not implemented here.
```{r}
# latent factor GEA model
mod_lfmm = LEA::lfmm2(input = genotype,
env = scale(X.env),
K = 5,
effect.sizes = TRUE)
# get environmental effect sizes
B <- mod_lfmm@B
```
The GEA model can also be used to define a subset of candidate loci to be included in genomic offset computation.
```{r}
pv = lfmm2.test(mod_lfmm,
input = genotype,
env = scale(X.env),
full = TRUE)
plot(-log10(pv$pvalue),
xlab = "SNPs",
cex = .3, pch = 19, col = "blue")
```
A set of candidate loci needs to be relatively large. Statistical significance is not a requirement here, and a cut-off threshold at minus log10 (pvalue) greater than 5 is chosen (but try 0 to 4).
```{r}
# define candidate loci for GO analysis
candidates = -log10(pv$pvalue) > 5
# taking all loci for GO analysis
# candidates = -log10(pv$pvalue) > 0
# how many candidate loci?
sum(candidates)
```
### Extracting historical and future climate for Europe
First, a reasonable range of longitude and latitude coordinates must be defined. This range of values includes the European mainland and some islands. Below `nc` is a critical resolution parameter. Increasing the `nc` value produces more precise maps, but at higher costs.
```{r}
## nc = resolution, higher is better but slower
nc = 200
# range of longitude for Europe (deg E)
long.mat <- seq(-10, 40, length = nc)
# range of latitude for Europe (deg N)
lat.mat <- seq(36, 67, length = nc)
# matrix of cells for Europe (nc times nc)
coord.mat <- NULL
for (x in long.mat)
for (y in lat.mat) coord.mat <- rbind(coord.mat, c(x,y))
```
Then, the `R` package `terra` can extract historical climate and future climate data for every cell defined in the above coordinate matrix `coord.mat`.
```{r}
# Extract historical climate
env.new = terra::extract(x = climate,
y = data.frame(coord.mat),
cells = FALSE)
env.new = env.new[,-1]
# Extract future climate
env.pred = terra::extract(x = climate_future,
y = data.frame(coord.mat),
cells=FALSE)
env.pred = env.pred[,-1]
```
### Computing genomic offset for environmental matrices
The R package `LEA` can compute genomic offset statistics by using the `genomic.offset` function. Here, genomic offset statistics will be recalculated without the help of the `genomic.offset` function (which, by the way, is not complicated). The genomic offset value is recalculated in order to allow missing environmental data (NA's). For terrestrial species, environmental NA's are a (tricky) way to display data in land areas only, showing seas as blank areas.
As the lfmm was adjusted on scaled historical environmental predictors, the same scaling must be performed for the future data. This is done below.
```{r}
## scaling bioclimatic variables (with the same scale as in the lfmm)
m.x <- apply(X.env, 2, FUN = function(x) mean(x, na.rm = TRUE))
sd.x <- apply(X.env, 2, function(x) sd(x, na.rm = TRUE))
env.new <- t(t(env.new) - m.x) %*% diag(1/sd.x)
env.pred <- t(t(env.pred) - m.x) %*% diag(1/sd.x)
```
For example, consider a particular site in Germany with longitude around 10.06689 E, and latitude around 50.30769.
```{r}
# Coordinates (long, lat) of a geographic site in Germany, Europe
coord.mat[36139,]
```
The genomic offset at this particular geographic location can be calculated as follows. The statistic corresponds to the geometric GO defined in (Gain et al. 2023).
```{r}
## Geometric genomic offset for long = 10.06689, lat = 50.30769
mean(((env.new - env.pred)[36139,] %*% t(B[candidates,]))^2)
```
Displaying a map requires to repeat the above computation for all geographic locations in the coordinate matrix. In `R`, this is usually be achieved by using the `apply` function. Below, a less elegant approach is carried out to evaluate the genomic offset at each cell. The reason for using a slow method is that the faster method may overload the memory space. So, be patient or modify the code chunk to avoid the loop (run the last line of code only).
```{r}
## gg contains the Gain et al. geometric GO computed at each matrix cell
## be patient, it may be very slow for large nc
gg = NULL
for (i in 1:nrow(env.new)){
gg[i] = mean(((env.new - env.pred)[i,] %*% t(B[candidates,]))^2, na.rm = TRUE)
}
# Impatient users may try this
# gg = rowMeans(((env.new - env.pred) %*% t(B[candidates,]))^2, na.rm = TRUE)
```
The matrix that corresponds to a geographic mapping of all genomic offset statistics can be obtained as follows.
```{r}
## matrix of genomic offset for the Europe map
## NA when below sea level.
go = t(matrix(gg, byrow = FALSE, ncol = nc))
```
Let us check the histogram of GO statistics.
```{r}
hist(as.numeric(go),
main = "Histogram of GO values",
xlab = "Geometric GO")
```
There are several ways to represent the GO matrix in `R`. One option is by using the `R` package `fields`. This option places a color key at the right of the figure.
```{r, fig.width = 6,fig.height=5}
# my colors - they might change the story!
my.colors = colorRampPalette(c("lightblue3", "orange2", "red3"))(100)
## bins extreme values above .1 - see histogram
# go2 = go
# go2[go2 > .1] = .1
fields::image.plot(long.mat, lat.mat, go,
col = my.colors,
las = 1,
xlab = "Longitude",
ylab = "Latitude")
## add contour of Europe and sample locations
maps::map(add = TRUE, interior = FALSE, col = "grey40")
points(coordinates, col = "grey40", cex = .3)
```
Another graphic option is to use `image` in the R package `terra` after conversion as raster. This might then be harder to read the figure axes as longitude and latitude.
```{r, fig.width=5,fig.height=5 }
r <- terra::rast(t(go)[nc:1,])
terra::image(r,
col = my.colors,
las = 1,
xlab = "xcells",
ylab = "ycells")
```
For A. thaliana, the most pessimistic predictions of maladaptation under scenario SSP2-4.5 are for areas in France, Italy, Belgium, Germany and in Central Europe. Regions in the Alps, Northern Europe and in areas under oceanic influence appear to be at lower risk. Of course, this interpretation is specific to the bioclimatic variables considered and to the IPPC scenario used. The result requires further evidence from additional scenarios and combinations of predictors.
## References
1. Alonso-Blanco, C., Andrade, J., Becker, C., Bemm, F., Bergelson, J., Borgwardt, et al. (2016). 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell, 166(2), 481-491.
2. Caye, K., Jumentier, B., Lepeule, J., & François, O. (2019). LFMM 2: fast and accurate inference of gene-environment associations in genome-wide studies. Molecular biology and evolution, 36(4), 852-860.
3. Frichot, E., François, O. (2015). LEA: An R package for landscape and ecological association studies. Methods in Ecology and Evolution, 6(8), 925-929.
4. Gain, C., Rhoné, B., Cubry, P., Salazar, I., Forbes, F., Vigouroux, Y., et al. (2023). A quantitative theory for genomic offset statistics. Molecular Biology and Evolution, 40(6), msad140.
5. IPCC AR6 Synthesis Report: Climate Change 2023, March 2023.
6. Hijmans R (2024). terra: Spatial Data Analysis. R package version 1.7-71.
7. Hijmans RJ, Barbosa M, Ghosh A, Mandel A (2023). `geodata: Download Geographic Data`. R package
version 0.5-9.
8.Douglas Nychka, Reinhard Furrer, John Paige, Stephan Sain (2021). fields: Tools for spatial data. R
package version 15.2.