Variable Observation SIR Epidemic

Example: Variable Parameters to Model an SIR Epidemic Part II

Created by Alexander Zarebski

Last edited using

BEAST v2.7.5
remaster v2.3.0
timtam v0.4.0
timtamslamR v0.0.1

This tutorial demonstrates how to use Timtam to estimate the reproduction number and prevalence of infection at two points in time using a sequence alignment and a time series of confirmed cases.

Data and model

In this tutorial, we will estimate the reproduction number and prevalence of infection in a simulated epidemic. The data consists of a sequence alignment, sample-sequences.fasta (which has calendar dated sequences) and a time series of daily case counts, aggregated-occurrences.csv (again with date shown in calendar dates). The difference between this tutorial and the last one is that here we consider the case where surveillance didn’t start until 2 weeks after the start of the outbreak.

We simulated the data using an SIR model. From our data collection perspective, the ‘removed’ individuals correspond to one of three possible scenarios: i) they recover from the infection without observation, ii) they can be diagnosed and have a genetic sequence generated (in which case we get a time stamped sequence for that case) or they can be diagnosed but generate an unsequenced sample (in which case they appear in the dataset as a confirmed case but there is no sequence data attached to the case). We recorded a calendar date for the sequenced samples. We aggregate the unsequenced samples into daily counts giving rise to the time series data.

The following figure shows the number of infectious individuals (“infectious”) through time along with the cumulative number of people who were previously infectious but no longer are, e.g. because they have recovered or been sampled. The number of infectious individuals is growing exponentially initially, but then slows down as the number of susceptible individuals diminishes.

The following figure shows the daily number of confirmed cases: infectious that have been observed but for which a sequence was not collected. These data form the time series that we will use.

In the SIR model the force of infection decreases as the number of susceptible individuals declines. To account for this in our analysis of these data we will use a time varying birth rate. I.e. we will allow the birth rate the change through time to match the dynamics of the epidemic. We will allow it to change at two points in time at 14 and 7 days prior to the present. To simplify this tutorial we will assume that net removal rate, denoted by sigma, of \(0.2\) is known along with the clock rate (of \(10^-3\)) and the origin time (\(40\)).

Analysis

We will start by using BEAUti to set up an XML file which handles most of the analysis, then tweak the result to estimate the historical prevalence and birth rate.

Setting up with BEAUti

Create a work folder in your computer and download the sample-sequences.fasta and aggregated-occurrences.csv files there.
Run the preprocessing.R script (with Rscript preprocessing.R) which
- associates a time with each of the sequences;
- generates some diagnostic plots (shown below;)
- formats the time series;
- and saves the timing of the last sequence which we need later.
Create an XML file using BEAUti.
1. Load the sequence FASTA file into BEAUti
2. In the Tip Dates tab, parse the tip-dates using the default auto-configured values function.
3. Leave the site model tab with the default values.
4. In the Clock Model tab, set the clock rate to 0.001.
5. In the Priors tab, select the Tim Tam Time Series Model.
6. (Optional) set the chain length to a smaller value if you are in a hurry; 1,000,000 MCMC steps should do.
7. Save the resulting file as an XML, e.g. timtam.xml.

Adjusting the XML for our model

Now it’s time to tweak the XML by opening it in a text editor and making the following edits:
- Remove operators relating to the clock rate, as we are working with a known (from the simulation parameters) evolutionary rate. There should be two of these clock operators, you can identify them because they have the clock rate among the parameters that they act on. Search clockRate to find these in the file.
- We need to tweak historySizes and historyChecks which are the parameters used to estimate the historical prevalence. Make the following edits:
  - the stateNode for TTHistorySizes.t, should have its value set to something similar to 100 200 500 400 as a starting point, (n.b. you may need to adjust this to different values for your own dataset.)
  - the prior distribution on the history sizes can have its upper bound set to 4000 (in general this should be the population size),
  - the operator on the history sizes can be left as it is,
  - the log for the history sizes can be left as it is,
  - and the historyTimes parameter of the Timtam distribution needs to be set to 30.0 20.0 10.0 0.0 so that we estimate the prevalence at thirty, twenty and ten days prior to the present and at the present.
- Copy and paste the two number series from the R script output as the values of disasterTimes and disasterSizes respectively, in the Timtam distribution node. You should recognise it by the id value of TimTam.t:sample-sequences. This will format the time series of the sampling times and sizes properly. Note that BEAUti may have added dimension to equal '3' to these nodes, delete these attributes.
- Double check that sigma (the net removal rate) is set to (0.2\) in TimTam.d:sample-sequences.
- Double check that the value of originTime is set to \(40\).
- Since we want the R0 to change we need to set the TTR0 parameter to have a dimension "3" (or set three initial values!)
- To specify that there are no observations initially, we need to define the sampling proportions to have multiple values, to do this, find the TTPropPsi and TTPropTS variables and makes sure they look like the following:

<parameter id="TTPropPsi.t:sample-sequences-timed" spec="parameter.RealParameter" lower="0.0" name="stateNode" upper="1.0">0.0 0.1</parameter>
<parameter id="TTPropTS.t:sample-sequences-timed" spec="parameter.RealParameter" lower="0.0" name="stateNode" upper="1.0">0.0 0.4</parameter>

Since we are assuming that the observation probabilities are initially zero, we need to remove the random walk operators, RealRandomWalkOperator, on these parameters (or comment them out). We can keep the ScaleOperators because any scaling of zero stays at zero.
To specify when the effective reproduction number and the observation probabilities change, set the change points for it by adding the XML snippet below to the TimTam.t:sample-sequences node and adding conditionOnObservation="false" to this node as well.

	<parameter spec="parameter.RealParameter" estimate="false" name="r0ChangeTimes">14.0 7.0</parameter>
	<parameter spec="parameter.RealParameter" estimate="false" name="propPsiChangeTimes">26.0</parameter>
	<parameter spec="parameter.RealParameter" estimate="false" name="propTimeSeriesChangeTimes">26.0</parameter>

Finally, save the resulting XML which should look like timtam.xml.

Running the analysis and post-processing

Run MCMC by executing the XML file on BEAST (you may want to grab a coffee while this is running.)
Download the simulation data (to use for subsequent visualisations) in tutorial-3-simulation-output.json. Make sure it is in your working directory.
Plot the results using postprocessing.R. This will produce two figures similar to the ones shown below, summarising some of the epidemiological estimates from your analysis.

Results

By following this tutorial, we have used a multiple sequence alignment and a time series of observed (but not sequenced) case data to estimate the basic reproduction number of a pathogen and the prevalence of the disease at the time of the last observation in our simulated epidemic.

The first figure below shows the posterior distribution of the effective reproduction number through time along with its \(95\%\) credible interval and the value of this parameter in the full simulation. This figure also shows the that the surveillance started with a vertical line. Our estimate of the effective reproduction number changes through time because we allowed it to change at 14 and 7 days before the present in the XML.

Note how in the plot, the 95% HPD includes the real value of the basic reproduction number. We would expect to recover the true value of an estimated parameter credible interval when working with real-world data.

The second figure shows the posterior distribution of the prevalence at the two points in time we have history checks for, i.e. the number of infectious people. The true number of infectious individuals through time in the simulation is shown in red and the blue points show our estimate and its \(95\%\) HPD.

The plots produced by the provided R script from your analysis should look similar to the ones shown above but not identical. Remember that the MCMC samples across parameter space randomly, but while the exact values vary from one run to the next, the distribution of these values should be similar.

Epidemic simulation

The epidemic was simulated using remaster with the model specification here.

To see how the sequences have been assigned times — recalling that the original FASTA file only contained calendar dates — we can refer to the following figure which was produced when we used timtamslamR in the preprocessing of the simulated data.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly