-
Notifications
You must be signed in to change notification settings - Fork 0
Constant Parameters R0
Created by Alexander Zarebski
Last edited using
BEAST v2.7.1
remaster v1.0.0
timtam v0.4.0
This tutorial demonstrates how to use TimTam to estimate the reproduction number and prevalence of infection using a sequence alignment and a time series of confirmed cases. Often in genomic epidemiology, the available set of sequences is not representative of the total number of recorded cases due to limitations in sequencing capacity, accessibility to samples or other reasons. TimTam facilitates the inclusion of case counts combined with available genomic data to estimate epidemiological parameters from outbreaks.
In this tutorial, we will work with data obtained from a simulated epidemic. The data consists of a sequence alignment, sample-sequences.fasta and a time series of daily case counts, aggregated-occurrences.csv. These data were simulated from a birth-death process in which individuals infect (therefore ‘give birth to’) new individuals and cease to be infectious (therefore ‘die’) after a given time. From the epidemic process, individuals that cease to be infectious as they recover or are removed from the epidemic due to a fatal outcome of the infection. However, from our data collection perspective, these ‘removed’ individuals correspond to one of three possible scenarios: i) they recover from the infection without ever being diagnosed (in which case we don’t observe their infection at all), ii) they can be diagnosed and have a genetic sequence generated (in which case we get a time stamped sequence for that case) or they can be diagnosed but generate an unsequenced sample (in which case they appear in the dataset as a confirmed case but there is no sequence data attached to the case). For the unsequenced samples, they are aggregated into daily counts giving rise to the time series data.
The following figure shows the number of infectious individuals (“infectious”) through time along with the cumulative number of people who were previously infectious, for example, who have recovered or been sampled. Note from the subtitle that there are 1726 infectious individuals at the present. The number of infectious individuals is growing exponentially, as indicated by the linear growth on a logarithmic y-scale.
The following figure shows the daily number of confirmed cases: infectious that have been observed but for which a sequence was not collected. These data form the time series that we will use.
These data were simulated with MASTER, with parameters chosen so the epidemic has a basic reproduction number of \(1.85\). To simplify this tutorial we will assume a known net removal rate, denoted by sigma, of (\(0.1 = 0.046 + 0.008 + 0.046\)), and the pathogen evolves with a substitution rate of (\(10-3\)) substitutions/site/day constantly across the tree. Both values fall within the ballpark of what might be expected for an RNA virus. The origin time for our simulated epidemic took place (\(70\)) days before the end of the simulation.
- Create a work folder in your computer and download the sample-sequences.fasta and aggregated-occurrences.csv files there.
- Create an XML file using BEAUti.
- Load the sequence data into BEAUti
- Parse the tip-dates using the default auto-configured values function.
- Leave the site model tab with the default values.
- Set the clock rate to
0.001
in the clock model tab. - Select the Tim Tam Time Series Model on the tree prior tab.
- (Optional) set the chain length to a smaller value if you are in a hurry; 1,000,000 MCMC steps should do.
- Save the resulting file as an XML, eg
timtam.xml
.
- We now need to estimate some parameter values to add to the XML. To do this, download and run the R script print-disaster-data.R and keep a copy of the data that gets printed out on the Console or Terminal. There should be two sets of values, these are the representation of the time series data that TimTam expects. We do this because the times in the simulation are relative to date of the last sequence we have; we therefore need to shift the times in the XML so they agree with the ones we just computed.
Rscript print-disaster-data.R
- Now it’s time to tweak the XML by opening it in a text editor and making the
following edits:
- Remove operators relating to the clock rate, as we are working with a known
(from the simulation parameters) evolutionary rate. There should be two of
these clock operators, you can identify them because they begin with
<operator id="clock.rate"
- Remove all references to
historySizes
andhistoryChecks
, these are used to estimate historical prevalence and are not needed for this example. There are several things that will need to be removed:- the
stateNode
, - the prior distribution on the history sizes,
- the
operator
on the history sizes, - the
log
for the history sizes, - the
historySizes
attribute of the TimTamdistribution
, - and the
historyTimes
parameter of the TimTamdistribution
.
WARNING removing the history size checks will mean that you can’t estimate the prevalence, if you are interested in how to do this, read the subsequent tutorials or look at the timtam.xml which still has these included.
- the
- Copy and paste the two number series from the R script output as the values
of
disasterTimes
anddisasterSizes
respectively, in the TimTam distribution node. You should recognise it by theid
value ofTimTam.t:sample-sequences
. This will format the time series of the sampling times and sizes so that they are included properly. Note that BEAUti may have addeddimension='3'
to these nodes, in which case delete these attributes. - Double check that
sigma
is set to the known value of \(0.1\) inTimTam.d:sample-sequences
. - Double check that the value of
originTime
is set to \(70\). - Save the resulting XML which should look like timtam.xml.
- Remove operators relating to the clock rate, as we are working with a known
(from the simulation parameters) evolutionary rate. There should be two of
these clock operators, you can identify them because they begin with
- Run MCMC by executing the XML file on BEAST (you may want to grab a coffee while this is running.)
- Download the simulation data (which is needed for subsequent visualisations) in tutorial-6-simulation-output.traj
- Plot the results using post-processing.R. This will produce two figures similar to the ones shown below, summarising some of the epidemiological estimates from your analysis.
By following this tutorial, we have used a multiple sequence alignment and a time series of observed (but not sequenced) case data to estimate the basic reproduction number of a pathogen and the prevalence of the disease at the time of the last observation in our simulated epidemic.
The first figure shows the posterior distribution of the (basic) reproduction number as inferred during your BEAST run. The true value for the reproduction number (defined as a known value in the simulation) is shown as a vertical line.
Note how in the plot, the 95% HPD includes the real value of the basic reproduction number. We would expect to recover the true value of an estimated parameter credible interval when working with real-world data.
If you have run the MCMC with the history size checks included, the post-processing will generate a second figure. The second figure shows the posterior distribution of the prevalence at the time of the last observation, i.e. the number of infectious people at the present time. Recall from the observation we made above that this number should be approximately 1726 infectious cases at that time. How does the estimate from the figure below and the estimate from your own analysis compare to the true number of infectious cases at this last time point?
The plots produced by the provided R script from your analysis should look very similar to the ones shown above but not identical. Remember that the MCMC samples across parameter space randomly, but while the exact values vary from one run to the next, the distribution of these values should be mostly the same.