Variable Parameters SIR Epidemic

Example: Variable Parameters to Model an SIR Epidemic Part I

Created by Alexander Zarebski

Last edited using

BEAST v2.7.5
remaster v1.0.0
timtam v0.4.0
timtamslamR v0.0.1

This tutorial demonstrates how to use Timtam to estimate the reproduction number and prevalence of infection at two points in time using a sequence alignment and a time series of confirmed cases.

Data and model

In this tutorial, we will estimate the reproduction number and prevalence of infection in a simulated epidemic. The data consists of a sequence alignment, sample-sequences.fasta (which has calendar dated sequences) and a time series of daily case counts, aggregated-occurrences.csv (again with date shown in calendar dates). These data were simulated from an SIR model. From our data collection perspective, the ‘removed’ individuals correspond to one of three possible scenarios: i) they recover from the infection without ever being diagnosed (in which case we don’t observe their infection at all), ii) they can be diagnosed and have a genetic sequence generated (in which case we get a time stamped sequence for that case) or they can be diagnosed but generate an unsequenced sample (in which case they appear in the dataset as a confirmed case but there is no sequence data attached to the case). For the sequenced samples a sequence is recorded with a calendar date attached. For the unsequenced samples, they are aggregated into daily counts giving rise to the time series data.

The following figure shows the number of infectious individuals (“infectious”) through time along with the cumulative number of people who were previously infectious but no longer are, e.g. because they have recovered or been sampled. The number of infectious individuals is growing exponentially initially, but then slows down as the number of susceptible individuals diminishes.

The following figure shows the daily number of confirmed cases: infectious that have been observed but for which a sequence was not collected. These data form the time series that we will use.

In the SIR model the force of infection decreases as the number of susceptible individuals declines. To account for this in our analysis of these data we will use a time varying birth rate. I.e. we will allow the birth rate the change through time to match the dynamics of the epidemic. We will allow it to change at two points in time at 14 and 7 days prior to the present. To simplify this tutorial we will assume that net removal rate, denoted by sigma, of \(0.2\) is known along with the clock rate (of \(10^-3\)) and the origin time (\(40\)). There are \(195\) sequences that were sampled.

Analysis

We will start by using BEAUti to set up an XML file which handles most of the analysis. We will then tweak this so that the historical prevalence is estimated and the birth rate changes through time.

Setting up with BEAUti

Create a work folder in your computer and download the sample-sequences.fasta and aggregated-occurrences.csv files there.
Run the pre-processing.R script (with Rscript pre-processing.R) which
- associates a time with each of the sequences;
- generates some diagnostic plots (shown below;)
- formats the time series;
- and saves the timing of the last sequence which we need later.
Create an XML file using BEAUti.
1. Load the sequence FASTA file into BEAUti
2. In the Tip Dates tab, parse the tip-dates using the default auto-configured values function.
3. Leave the site model tab with the default values.
4. In the Clock Model tab, set the clock rate to 0.001.
5. In the Priors tab, select the Tim Tam Time Series Model.
6. (Optional) set the chain length to a smaller value if you are in a hurry; 1,000,000 MCMC steps should do.
7. Save the resulting file as an XML, e.g. timtam.xml.

Adjusting the XML for our model

Now it’s time to tweak the XML by opening it in a text editor and making the following edits:
- Remove operators relating to the clock rate, as we are working with a known (from the simulation parameters) evolutionary rate. There should be two of these clock operators, you can identify them because they have the clock rate among the parameters that they act on. Search clockRate to find these in the file.
- We need to tweak historySizes and historyChecks which are the parameters used to estimate the historical prevalence. There are several things that will need to be edited:
  - the stateNode for TTHistorySizes.t, should have its value set to something similar to 600 400 as a starting point, (n.b. you may need to adjust this to different values for your own dataset.)
  - the prior distribution on the history sizes can have its upper bound set to 4000.0,
  - the operator on the history sizes can be left as it is,
  - the log for the history sizes can be left as it is,
  - and the historyTimes parameter of the Timtam distribution needs to be set to 10.0 0.0 so that we estimate the prevalence at ten days prior to the present and at the present.
- Copy and paste the two number series from the R script output as the values of disasterTimes and disasterSizes respectively, in the Timtam distribution node. You should recognise it by the id value of TimTam.t:sample-sequences. This will format the time series of the sampling times and sizes so that they are included properly. Note that BEAUti may have added dimension='3' to these nodes, in which case, delete these attributes.
- Double check that sigma (the net removal rate) is set to the known value of \(0.2\) in TimTam.d:sample-sequences.
- Double check that the value of originTime is set to \(40\).
- Since we want the R0 to change we need to set the TTR0 parameter to have a dimension="3" (or set three initial values!)
- To specify when the effective reproduction number changes, set the change points for it by adding the XML snippet below to the TimTam.t:sample-sequences node and adding conditionOnObservation="false" to this node as well.
- Save the resulting XML which should look like timtam.xml.

<parameter spec="parameter.RealParameter" estimate="false" name="r0ChangeTimes">14.0 7.0</parameter>

Running the analysis and post-processing

Run MCMC by executing the XML file on BEAST (you may want to grab a coffee while this is running.)
Download the simulation data (which is needed for subsequent visualisations) in tutorial-8-simulation-output.json and make sure it is in your working directory.
Plot the results using post-processing.R. This will produce two figures similar to the ones shown below, summarising some of the epidemiological estimates from your analysis.

Results

By following this tutorial, we have used a multiple sequence alignment and a time series of observed (but not sequenced) case data to estimate the basic reproduction number of a pathogen and the prevalence of the disease at the time of the last observation in our simulated epidemic.

The first figure below shows the posterior distribution of the effective reproduction number through time along with its \(95\%\) credible interval and the value of this parameter in the full simulation. Our estimate of the effective reproduction number changes through time because we allowed it to change at 14 and 7 days before the present in the XML.

Note how in the plot, the 95% HPD includes the real value of the basic reproduction number. We would expect to recover the true value of an estimated parameter credible interval when working with real-world data.

The second figure shows the posterior distribution of the prevalence at the two points in time we have history checks for, i.e. the number of infectious people. The true number of infectious individuals through time in the simulation is shown in red and the blue points show our estimate and its \(95\%\) HPD.

The plots produced by the provided R script from your analysis should look very similar to the ones shown above but not identical. Remember that the MCMC samples across parameter space randomly, but while the exact values vary from one run to the next, the distribution of these values should be mostly the same.

Finally, we might be interested in some of the summary statistics of the reconstructed tree to see how well they have been estimated. If you have run the scripts above in order, a JSON file should have been produced which recorded the true values from the simulation. The following figure shows the posterior marginal distributions of the tree height and length and their true values. Note that the tree height is constrained by our assumption about the known origin time.

To see how the sequences have been assigned times — recalling that the original FASTA file only contained calendar dates — we can refer to the following figure which was produced when we used timtamslamR in the pre-processing of the simulated data.

Epidemic simulation

The epidemic was simulated using remaster with the model specification here.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly