-
Notifications
You must be signed in to change notification settings - Fork 0
Variable Observation SIR Epidemic
Created by Alexander Zarebski
Last edited using
BEAST v2.7.5
remaster v2.3.0
timtam v0.4.0
timtamslamR v0.0.1
This tutorial demonstrates how to use Timtam to estimate the reproduction number and prevalence of infection at two points in time using a sequence alignment and a time series of confirmed cases.
In this tutorial, we will estimate the reproduction number and prevalence of infection in a simulated epidemic. The data consists of a sequence alignment, sample-sequences.fasta (which has calendar dated sequences) and a time series of daily case counts, aggregated-occurrences.csv (again with date shown in calendar dates). The difference between this tutorial and the last one is that here we consider the case where surveillance didn’t start until 2 weeks after the start of the outbreak.
We simulated the data using an SIR model. From our data collection perspective, the ‘removed’ individuals correspond to one of three possible scenarios: i) they recover from the infection without observation, ii) they can be diagnosed and have a genetic sequence generated (in which case we get a time stamped sequence for that case) or they can be diagnosed but generate an unsequenced sample (in which case they appear in the dataset as a confirmed case but there is no sequence data attached to the case). We recorded a calendar date for the sequenced samples. We aggregate the unsequenced samples into daily counts giving rise to the time series data.
The following figure shows the number of infectious individuals (“infectious”) through time along with the cumulative number of people who were previously infectious but no longer are, e.g. because they have recovered or been sampled. The number of infectious individuals is growing exponentially initially, but then slows down as the number of susceptible individuals diminishes.
The following figure shows the daily number of confirmed cases: infectious that have been observed but for which a sequence was not collected. These data form the time series that we will use.
In the SIR model the force of infection decreases as the number of susceptible individuals declines. To account for this in our analysis of these data we will use a time varying birth rate. I.e. we will allow the birth rate the change through time to match the dynamics of the epidemic. We will allow it to change at two points in time at 14 and 7 days prior to the present. To simplify this tutorial we will assume that net removal rate, denoted by sigma, of \(0.2\) is known along with the clock rate (of \(10-3\)) and the origin time (\(40\)).
We will start by using BEAUti to set up an XML file which handles most of the analysis, then tweak the result to estimate the historical prevalence and birth rate.
- Create a work folder in your computer and download the sample-sequences.fasta and aggregated-occurrences.csv files there.
- Run the preprocessing.R script (with
Rscript preprocessing.R
) which- associates a time with each of the sequences;
- generates some diagnostic plots (shown below;)
- formats the time series;
- and saves the timing of the last sequence which we need later.
- Create an XML file using BEAUti.
- Load the sequence FASTA file into BEAUti
- In the Tip Dates tab, parse the tip-dates using the default auto-configured values function.
- Leave the site model tab with the default values.
- In the Clock Model tab, set the clock rate to
0.001
. - In the Priors tab, select the Tim Tam Time Series Model.
- (Optional) set the chain length to a smaller value if you are in a hurry; 1,000,000 MCMC steps should do.
- Save the resulting file as an XML, e.g.
timtam.xml
.
- Now it’s time to tweak the XML by opening it in a text editor and
making the following edits:
- Remove operators relating to the clock rate, as we are working
with a known (from the simulation parameters) evolutionary rate.
There should be two of these clock operators, you can identify
them because they have the clock rate among the parameters that
they act on. Search
clockRate
to find these in the file. - We need to tweak
historySizes
andhistoryChecks
which are the parameters used to estimate the historical prevalence. Make the following edits:- the
stateNode
forTTHistorySizes.t
, should have its value set to something similar to100 200 500 400
as a starting point, (n.b. you may need to adjust this to different values for your own dataset.) - the prior distribution on the history sizes can have its upper
bound set to
4000
(in general this should be the population size), - the
operator
on the history sizes can be left as it is, - the
log
for the history sizes can be left as it is, - and the
historyTimes
parameter of the Timtamdistribution
needs to be set to30.0 20.0 10.0 0.0
so that we estimate the prevalence at thirty, twenty and ten days prior to the present and at the present.
- the
- Copy and paste the two number series from the R script output as
the values of
disasterTimes
anddisasterSizes
respectively, in the Timtam distribution node. You should recognise it by theid
value ofTimTam.t:sample-sequences
. This will format the time series of the sampling times and sizes properly. Note that BEAUti may have addeddimension
to equal'3'
to these nodes, delete these attributes. - Double check that
sigma
(the net removal rate) is set to (0.2\) inTimTam.d:sample-sequences
. - Double check that the value of
originTime
is set to \(40\). - Since we want the R0 to change we need to set the
TTR0
parameter to have adimension
"3"
(or set three initial values!) - To specify that there are no observations initially, we need to
define the sampling proportions to have multiple values, to do
this, find the
TTPropPsi
andTTPropTS
variables and makes sure they look like the following:
- Remove operators relating to the clock rate, as we are working
with a known (from the simulation parameters) evolutionary rate.
There should be two of these clock operators, you can identify
them because they have the clock rate among the parameters that
they act on. Search
<parameter id="TTPropPsi.t:sample-sequences-timed" spec="parameter.RealParameter" lower="0.0" name="stateNode" upper="1.0">0.0 0.1</parameter>
<parameter id="TTPropTS.t:sample-sequences-timed" spec="parameter.RealParameter" lower="0.0" name="stateNode" upper="1.0">0.0 0.4</parameter>
- Since we are assuming that the observation probabilities are
initially zero, we need to remove the random walk operators,
RealRandomWalkOperator
, on these parameters (or comment them out). We can keep theScaleOperators
because any scaling of zero stays at zero. - To specify when the effective reproduction number and the
observation probabilities change, set the change points for it by
adding the XML snippet below to the
TimTam.t:sample-sequences
node and addingconditionOnObservation="false"
to this node as well.
<parameter spec="parameter.RealParameter" estimate="false" name="r0ChangeTimes">14.0 7.0</parameter>
<parameter spec="parameter.RealParameter" estimate="false" name="propPsiChangeTimes">26.0</parameter>
<parameter spec="parameter.RealParameter" estimate="false" name="propTimeSeriesChangeTimes">26.0</parameter>
Finally, save the resulting XML which should look like timtam.xml.
- Run MCMC by executing the XML file on BEAST (you may want to grab a coffee while this is running.)
- Download the simulation data (to use for subsequent visualisations) in tutorial-3-simulation-output.json. Make sure it is in your working directory.
- Plot the results using postprocessing.R. This will produce two figures similar to the ones shown below, summarising some of the epidemiological estimates from your analysis.
By following this tutorial, we have used a multiple sequence alignment and a time series of observed (but not sequenced) case data to estimate the basic reproduction number of a pathogen and the prevalence of the disease at the time of the last observation in our simulated epidemic.
The first figure below shows the posterior distribution of the effective reproduction number through time along with its \(95\%\) credible interval and the value of this parameter in the full simulation. This figure also shows the that the surveillance started with a vertical line. Our estimate of the effective reproduction number changes through time because we allowed it to change at 14 and 7 days before the present in the XML.
Note how in the plot, the 95% HPD includes the real value of the basic reproduction number. We would expect to recover the true value of an estimated parameter credible interval when working with real-world data.
The second figure shows the posterior distribution of the prevalence at the two points in time we have history checks for, i.e. the number of infectious people. The true number of infectious individuals through time in the simulation is shown in red and the blue points show our estimate and its \(95\%\) HPD.
The plots produced by the provided R script from your analysis should look similar to the ones shown above but not identical. Remember that the MCMC samples across parameter space randomly, but while the exact values vary from one run to the next, the distribution of these values should be similar.
The epidemic was simulated using remaster with the model specification here.
To see how the sequences have been assigned times — recalling that
the original FASTA file only contained calendar dates — we can refer
to the following figure which was produced when we used timtamslamR
in the preprocessing of the simulated data.