-
Notifications
You must be signed in to change notification settings - Fork 0
Variable Parameters SIR Epidemic
Created by Alexander Zarebski
Last edited using
BEAST v2.7.5
remaster v1.0.0
timtam v0.4.0
timtamslamR v0.0.1
This tutorial demonstrates how to use Timtam to estimate the reproduction number and prevalence of infection at two points in time using a sequence alignment and a time series of confirmed cases.
In this tutorial, we will estimate the reproduction number and prevalence of infection in a simulated epidemic. The data consists of a sequence alignment, sample-sequences.fasta (which has calendar dated sequences) and a time series of daily case counts, aggregated-occurrences.csv (again with date shown in calendar dates). These data were simulated from an SIR model. From our data collection perspective, the ‘removed’ individuals correspond to one of three possible scenarios: i) they recover from the infection without ever being diagnosed (in which case we don’t observe their infection at all), ii) they can be diagnosed and have a genetic sequence generated (in which case we get a time stamped sequence for that case) or they can be diagnosed but generate an unsequenced sample (in which case they appear in the dataset as a confirmed case but there is no sequence data attached to the case). For the sequenced samples a sequence is recorded with a calendar date attached. For the unsequenced samples, they are aggregated into daily counts giving rise to the time series data.
The following figure shows the number of infectious individuals (“infectious”) through time along with the cumulative number of people who were previously infectious but no longer are, e.g. because they have recovered or been sampled. The number of infectious individuals is growing exponentially initially, but then slows down as the number of susceptible individuals diminishes.
The following figure shows the daily number of confirmed cases: infectious that have been observed but for which a sequence was not collected. These data form the time series that we will use.
In the SIR model the force of infection decreases as the number of susceptible individuals declines. To account for this in our analysis of these data we will use a time varying birth rate. I.e. we will allow the birth rate the change through time to match the dynamics of the epidemic. We will allow it to change at two points in time at 14 and 7 days prior to the present. To simplify this tutorial we will assume that net removal rate, denoted by sigma, of \(0.2\) is known along with the clock rate (of \(10-3\)) and the origin time (\(40\)). There are \(195\) sequences that were sampled.
We will start by using BEAUti to set up an XML file which handles most of the analysis. We will then tweak this so that the historical prevalence is estimated and the birth rate changes through time.
- Create a work folder in your computer and download the sample-sequences.fasta and aggregated-occurrences.csv files there.
- Run the pre-processing.R script (with
Rscript pre-processing.R
) which- associates a time with each of the sequences;
- generates some diagnostic plots (shown below;)
- formats the time series;
- and saves the timing of the last sequence which we need later.
- Create an XML file using BEAUti.
- Load the sequence FASTA file into BEAUti
- In the Tip Dates tab, parse the tip-dates using the default auto-configured values function.
- Leave the site model tab with the default values.
- In the Clock Model tab, set the clock rate to
0.001
. - In the Priors tab, select the Tim Tam Time Series Model.
- (Optional) set the chain length to a smaller value if you are in a hurry; 1,000,000 MCMC steps should do.
- Save the resulting file as an XML, e.g.
timtam.xml
.
- Now it’s time to tweak the XML by opening it in a text editor and making the
following edits:
- Remove operators relating to the clock rate, as we are working with a known
(from the simulation parameters) evolutionary rate. There should be two of
these clock operators, you can identify them because they have the clock
rate among the parameters that they act on. Search
clockRate
to find these in the file. - We need to tweak
historySizes
andhistoryChecks
which are the parameters used to estimate the historical prevalence. There are several things that will need to be edited:- the
stateNode
forTTHistorySizes.t
, should have its value set to something similar to600 400
as a starting point, (n.b. you may need to adjust this to different values for your own dataset.) - the prior distribution on the history sizes can have its upper bound set
to
4000.0
, - the
operator
on the history sizes can be left as it is, - the
log
for the history sizes can be left as it is, - and the
historyTimes
parameter of the Timtamdistribution
needs to be set to10.0 0.0
so that we estimate the prevalence at ten days prior to the present and at the present.
- the
- Copy and paste the two number series from the R script output as the values
of
disasterTimes
anddisasterSizes
respectively, in the Timtam distribution node. You should recognise it by theid
value ofTimTam.t:sample-sequences
. This will format the time series of the sampling times and sizes so that they are included properly. Note that BEAUti may have addeddimension='3'
to these nodes, in which case, delete these attributes. - Double check that
sigma
(the net removal rate) is set to the known value of \(0.2\) inTimTam.d:sample-sequences
. - Double check that the value of
originTime
is set to \(40\). - Since we want the R0 to change we need to set the
TTR0
parameter to have adimension="3"
(or set three initial values!) - To specify when the effective reproduction number changes, set the change
points for it by adding the XML snippet below to the
TimTam.t:sample-sequences
node and addingconditionOnObservation="false"
to this node as well. - Save the resulting XML which should look like timtam.xml.
- Remove operators relating to the clock rate, as we are working with a known
(from the simulation parameters) evolutionary rate. There should be two of
these clock operators, you can identify them because they have the clock
rate among the parameters that they act on. Search
<parameter spec="parameter.RealParameter" estimate="false" name="r0ChangeTimes">14.0 7.0</parameter>
- Run MCMC by executing the XML file on BEAST (you may want to grab a coffee while this is running.)
- Download the simulation data (which is needed for subsequent visualisations) in tutorial-8-simulation-output.json and make sure it is in your working directory.
- Plot the results using post-processing.R. This will produce two figures similar to the ones shown below, summarising some of the epidemiological estimates from your analysis.
By following this tutorial, we have used a multiple sequence alignment and a time series of observed (but not sequenced) case data to estimate the basic reproduction number of a pathogen and the prevalence of the disease at the time of the last observation in our simulated epidemic.
The first figure below shows the posterior distribution of the effective reproduction number through time along with its \(95\%\) credible interval and the value of this parameter in the full simulation. Our estimate of the effective reproduction number changes through time because we allowed it to change at 14 and 7 days before the present in the XML.
Note how in the plot, the 95% HPD includes the real value of the basic reproduction number. We would expect to recover the true value of an estimated parameter credible interval when working with real-world data.
The second figure shows the posterior distribution of the prevalence at the two points in time we have history checks for, i.e. the number of infectious people. The true number of infectious individuals through time in the simulation is shown in red and the blue points show our estimate and its \(95\%\) HPD.
The plots produced by the provided R script from your analysis should look very similar to the ones shown above but not identical. Remember that the MCMC samples across parameter space randomly, but while the exact values vary from one run to the next, the distribution of these values should be mostly the same.
Finally, we might be interested in some of the summary statistics of the reconstructed tree to see how well they have been estimated. If you have run the scripts above in order, a JSON file should have been produced which recorded the true values from the simulation. The following figure shows the posterior marginal distributions of the tree height and length and their true values. Note that the tree height is constrained by our assumption about the known origin time.
To see how the sequences have been assigned times — recalling that
the original FASTA file only contained calendar dates — we can refer
to the following figure which was produced when we used timtamslamR
in the pre-processing of the simulated data.
The epidemic was simulated using remaster with the model specification here.