-
Notifications
You must be signed in to change notification settings - Fork 1
/
rase_supplementary_document.Rmd
250 lines (207 loc) · 10.9 KB
/
rase_supplementary_document.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
---
title: "Quantification of the association of bacterial clones with antibiotic resistance"
subtitle: "Supplementary Document 1 for 'Rapid heuristic inference of antibiotic resistance and susceptibility by genomic neighbor typing'"
author: "Karel Břinda, Alanna Callendrello, Kevin C Ma, Derek R MacFadden, Themoula Charalampous, Robyn S Lee, Lauren Cowley, Crista B Wadsworth, Yonatan H Grad, Gregory Kucherov, Justin O’Grady, Michael Baym, and William P Hanage"
header-includes:
- \usepackage{bbm}
output:
pdf_document:
fig_caption: yes
number_sections: TRUE
keep_tex: TRUE
bibliography: "references.bib"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
---
```{r, include=FALSE}
# todo: add numbers to pg markers
library(knitr)
knitr::opts_chunk$set(
fig.width=4.5,
fig.height=4.5,
fig.align='center',
fig.pos='t!')
knitr::opts_chunk$set(
echo=FALSE,
warning=FALSE,
message=FALSE,
cache=TRUE,
results=FALSE)
```
```{r}
library("knitcitations")
cleanbib()
options("citation_format" = "pandoc")
```
```{r}
source("evaluate-association.R")
library(png)
library(grid)
df.pneumo <- read.delim("spneumoniae_sparc.lineages.tsv")
df.gono <- read.delim("ngonorrhoeae_gisp.lineages.tsv")
all.ants = sort(unique(c(get.ants(df.pneumo), get.ants(df.gono))))
```
# Introduction
In this document, we show how to quantify the association of bacterial clones
with antibiotic resistance. For a given
bacterial species and all antibiotics of interest,
we construct optimal predictors of resistance from lineages and
calculate the associated Receiver Operating Characteristics (ROC)
curve and its Areas under the Curve (AUC).
Comparing the resulting curves and areas among antibiotics helps
to understand the different levels of associations.
We use this framework to show that for the pathogens *Streptococcus pneumoniae* and
*Neisseria gonorrhoeae* antibiotic resistance is highly associated with the population
structure. This provides evidence of the suitability of genomic neighbor
typing as a diagnostic method for these pathogens.
# Optimal lineage-to-resistance classifiers
## Model
Let us consider a bacterial species and assume that it has $g$ lineages.
For the purpose of this document, lineages are arbitrary classes of equivalence;
they can be defined based on sequence typing
(e.g., MLST `r citep("10.1146/annurev.micro.59.030804.121325")`)
or clustering (e.g., using BAPS `r citep("10.1093/molbev/mst028")` or PopPUNK `r citep("10.1101/gr.241455.118")`).
Assume that lineages are equally probable; i.e., a randomly drawn isolate $x$
comes from every lineage $i$ with the probability $\frac{1}{g}$.
Assume that for every isolate $x$, we can always correctly determine its lineage $\ell(x)$.
In a clinical setting, this could mean that we can always
determine isolate's MLST sequence type.
Let us consider an antibiotic and assume that every isolate is
either resistant or susceptible to this antibiotic.
Assume that resistance within a lineage $i$ is
iid with the Bernoulli distribution and let $r_i$
denote the probability of resistance. The constants $r_i$
can be determined based on epidemiological data or as proportions
of resistance isolates in individual lineages from population-level
studies.
<!--
Assume that we have a large number of isolates ($N\gg g$) for which we want to
determine resistance.
-->
## Lineage-to-resistance classifiers
For a given species, an antibiotic and fixed probabilities of resistance within
lineages $r_1, \ldots, r_g$,
we construct memoryless probabilistic resistance classifiers $C$ of the form
\[
C(\ell(x))\to\{\text{'S'},\text{'R'}\}.
\]
In other words, for every isolate we identify its lineage and
the classifier predicts resistance based on the knowledge
of the lineage.
All such classifiers can be parametrized as $C_{(R_1,\ldots,R_g)}$,
where $R_i\in[0, 1]$ is the probability of reporting resistance
given the sample has been identified as belong to the lineage $i$.
For instance, the classifier $C_{(0,\ldots,0)}$ always assigns 'S',
$C_{(\frac{1}{2},\ldots,\frac{1}{2})}$ assigns 'R' and 'S' like a fair coin,
$C_{(1,\ldots,1)}$ always assigns 'R', and
$C_{(1,0,\ldots,0)}$ always assigns 'R' for the first lineage and 'S' for the other ones.
With the knowledge of the probabilities of resistance ($r_1, \ldots, r_g$)
within individual lineages $1,\ldots,g$,
we can express false positive rate (FPR)
and true positive rate (TPR)
as a function of the classifier parameters ($R_1, \ldots, R_g$).
```{r chull, fig.width=4, fig.height=4, out.width="40%", fig.cap = "An illustration of the ROC diagram for a species with 4 lineages with resistance probabilities $(r_i)_{i=1}^4=(0.1, 0.4, 0.8, 0.9)$. The grey area corresponds to all the possible lineage-to-resistance classifiers $C_{(R_1,R_2,R_3,R_4)}$ while the red points are the ``vertex\'\' classifiers with integer parameters; i.e., $R_i\\in\\{0,1\\}$ for every $i\\in\\{1,\\ldots,g\\}$. RL denotes the number of lineages for which resistance is reported by an integer classifier, i.e., $\\sum_i R_i$. The optimal classifiers that maximize the AUC lie on the piece-wise linear red curve, which we term the optimal ROC curve."}
source("example-roc-chull.R")
par(mfrow = c(1, 1), mai = c(0.7, 0.8, 0.3, 0.02))
par(ps = 12, cex = 0.8, cex.main = 1)
#par(mgp=c(2,1,0))
#plot.roc.chull(lineages=1, main="1 lineage")
example.roc.chull(lineages = 4,
p.vec=c(0.1, 0.4, 0.8, 0.9),
seed = 5,
main = "ROC curve for a species with 4 lineages")
```
Let us use the standard notation:
\[
\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}
\qquad \text{and} \qquad
\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}
\]
where
\begin{align*}
\mathrm{TP} &= \text{\#true positives; i.e., resistant isolates predicted as resistant} \\
\mathrm{FP} &= \text{\#false positives; i.e., susceptible isolates predicted as resistant} \\
\mathrm{FN} &= \text{\#false negatives; i.e., resistant isolates predicted as susceptible} \\
\mathrm{TN} &= \text{\#true negatives; i.e., susceptible isolates predicted as susceptible}
\end{align*}
Given our assumptions, we can estimate the performance of a classifier $C_{R_1,\ldots,R_g}$ as
\[
\begin{array}{ll}
\mathrm{TP} \approx \frac{N}{g}\sum_{i=1}^g r_iR_i \qquad & \mathrm{FP} \approx \frac{N}{g}\sum_{i=1}^g (1-r_i)R_i\\
\mathrm{FN} \approx \frac{N}{g}\sum_{i=1}^g r_i(1-R_i) \qquad & \mathrm{TN} \approx \frac{N}{g}\sum_{i=1}^g (1-r_i)(1-R_i)
\end{array}
\]
where $N$ is the number of samples tested. We then obtain the following estimates:
\[
\mathrm{FPR} \approx \frac{\sum_{i=1}^g (1-r_i)R_i}{\sum_{i=1}^g (1-r_i)}\\
\qquad \text{and} \qquad
\mathrm{TPR} \approx \frac{\sum_{i=1}^g r_iR_i}{\sum_{i=1}^g r_i}
\]
Using these formulae, we can map every classifier $C_{(R_1,\ldots,R_g)}$
to the ROC space by
\[
f(R_1, \ldots, R_g) \to \left(
\frac{\sum_{i=1}^g (1-r_i)R_i}{\sum_{i=1}^g (1-r_i)},
\frac{\sum_{i=1}^g r_iR_i}{\sum_{i=1}^g r_i}
\right)
\]
It is easy to see that the map $f$ is linear. Since the set of all possible classifiers is
the cube $[0,1]^g$ in the parameter space, it is convex and its
image in the ROC space must be convex too; an example is provided
in Figure \ref{fig:chull}. Moreover, the image of the cube is equal to the
convex hull of the images of individual cube vertices (red points in Figure \ref{fig:chull}).
## Optimal ROC curves
Our aim now is to find the optimal ROC curve maximizing the AUC (the red curve in Figure \ref{fig:chull}).
Even though the curve corresponds to infinitely
many classifiers, it is a piece-wise linear function which is fully defined
by the $g+1$ ``vertex'' classifiers $C^{(0)}, \ldots, C^{(g)}$ lying on the line intersections.
Enumerating this classifier sequence corresponds to
putting lineages to the order in which they are
switched from susceptible to resistant (i.e., $R_i=0\to R_i=1$) along the red line.
The first classifier $C^{(0)}$, $(0,0)$ in the ROC diagram, corresponds to all
lineages being marked as susceptible, while the last classifier, $C^{(g)}$, $(1,1)$
in the diagram to all lineages being predicted marked as resistant.
The optimal ROC curve can be computed in multiple different ways.
For instance, we can enumerate all cube vertices in the parameter space,
map them to the ROC space, compute the convex hull, and extract its upper part.
More efficiently, we can construct the classifier sequence directly in the ROC
diagram by the following iterative process.
We start at $(0,0)$; i.e., with all lineages marked as susceptible, and at every step
we switch one susceptible lineage to resistant so that
the corresponding step in the ROC graph has maximal possible slope, and
we continue until all lineages are marked as resistant.
If lineages are equally probable, it is easy to see that this order corresponds
to sorting lineages by $r_i$.
```{r pneumo, fig.width=8, fig.height=4, out.width="80%", fig.cap = "\\emph{S. pneumoniae} ROC curves and the corresponding AUCs for ceftriaxone (cro), erythromycin (ery), penicillin (pen), trimethoprim (sxt), and tetracycline (tet)."}
par(mfrow = c(1, 2), mai = c(0.7, 0.8, 0.3, 0.02))
par(ps = 12, cex = 0.8, cex.main = 1)
par(mgp = c(2, 1, 0))
plot.roc(df.pneumo, main="ROC - S. pneumoniae", all.ants = all.ants, norm = 1)
plot.auc(df.pneumo, main="AUC - S. pneumoniae", all.ants = all.ants, norm = 1, annotate = F)
```
```{r gono, fig.width=8, fig.height=4, out.width="80%", fig.cap = "\\emph{N. gonorrhoeae} ROC curves and the corresponding AUCs for azithromycin (azm), cefixime (cfm), ciprofloxacine (cip), and ceftriaxone (cro)."}
par(mfrow = c(1, 2), mai = c(0.7, 0.8, 0.3, 0.02))
par(ps = 12, cex = 0.8, cex.main = 1)
par(mgp = c(2, 1, 0))
plot.roc(df.gono, main = "ROC - N. gonorrhoeae", all.ants = all.ants, norm = 1)
plot.auc(df.gono, main = "AUC - N. gonorrhoeae", all.ants = all.ants, norm = 1, annotate = F)
```
# Results
We applied the method to 616 pneumococcal genomes from a carriage study
in Massachusetts children `r citep(c("10.1038/ng.2625","10.1038/sdata.2015.58"))`
and 1102 clinical gonococcal isolates collected from 2000 to 2013
by the Centers for Disease Control and
Prevention’s Gonococcal Isolate Surveillance Project (GISP) `r citep("10.1093/infdis/jiw420")`.
For all isolates, we inferred the resistance categories
as described in the main manuscript, including the
ancestral state responstruction step.
As lineages we used the sequenced clusters
computed using Bayesian Analysis of
Population Structure (BAPS) `r citep("10.1093/molbev/mst028")`.
Lineages of *S. pneumoniae* are predictive for benzylpenicillin,
ceftriaxone, trimethoprim-sulfamethoxazole, erythromycin,
and tetracycline resistance with AUC ranging from 0.90 to 0.97 (Figure \ref{fig:pneumo}). In *N. gonorrhoeae* ciprofloxacin, ceftriaxone, and cefixime attained comparably large AUCs (from 0.93 to 0.98) whereas azithromycin demonstrated lower association (AUC 0.80) (Figure \ref{fig:gono}).
# Bibliography
```{r}
write.bibtex(file="references.bib")
bibliography()
```