Skip to content

Tutorial 3

Alex Zarebski edited this page Jul 29, 2022 · 6 revisions

Tutorial 3

This tutorial demonstrates how to use TimTam to estimate the reproduction number and prevalence of infection from a sequence alignment and a time series of confirmed cases generated during an SIR type epidemic.

Data and model

The data we will analyse in this tutorial consists of a sequence alignment, sample-sequences.fasta, and a time series of daily case counts, aggregated-occurrences.csv. These data come from a simulation of an SIR model with MASTER and a sequence simulation using seqgen (using the interface in the R package phangorn). The figure below shows the number of people in the population who are susceptible, infectious or removed through time.

We can plot the time series of the daily number of observed cases (that were not sequenced) as shown in the following figure.

In the SIR model the force of infection decreases as the number of susceptible individuals declines. To account for this in our analysis of these data we will use a time varying birth rate. I.e. we will allow the birth rate the change through time to match the dynamics of the epidemic. We will allow it to change four uniformly spaced times across the last 18 days prior to the present. To simplify this tutorial we will assume that the death rate (of \(0.1\)) is known along with the clock rate (of \(10-3\)) and the origin time (\(40\)).

Analysis

To simplify this tutorial we will assume that the death rate of \(0.1\) is known along with the clock rate (\(0.001\)) and the origin time (\(40\)). We will consider a birth rate that changes four times at \(18.0\), \(13.5\), \(9.0\) and \(4.5\) days before the present.

  1. Create a new folder to work in and put copies of the data there: sample-sequences.fasta, aggregated-occurrences.csv.
  2. Create an XML file using BEAUti.
    1. Load the sequence data into BEAUti
    2. Include the tip-dates, the default auto-configured values should work.
    3. Leave the site model on the default values.
    4. Set the clock rate to 0.001 in the clock model.
    5. Select the tree prior to Tim Tam Model.
    6. (Optional) set the chain length to a smaller value if you are in a hurry.
    7. Save the resulting file as an XML.
  3. Tweak the XML:
    • Remove operators relating to the clock rate (there should be two of them.)
    • Set the removal rate to \(0.1\), this will appear as a parameter with the attribute name equal to "mu".
    • Set the origin time to \(40\), this will appear as a parameter with the attribute name equal to "originTime.
    • Yes, this is cheating, but to avoid a painfully slow burn-in, set the initial values of the parameters to values close to those from the simulation:
      • samplingRate should be close to \(0.02\),
      • nuProb (the scheduled unsequenced sampling probability) should be close to \(0.08\), and
      • birthRate should be close to \(0.4\) and put an upper bound on the birth rate parameter at something about \(1.0\). We will explain this later.
    • Since we want the birth rate to change four times we need to set the birthRate parameter to have a dimension="5".
    • To specify when the birth rate changes, set the change point for the birth rate by adding the XML snippet below to the TimTam.t:sample-sequences node and add conditionOnObservation="false" to this node as well.
  4. Download and run print-disaster-data.R and keep a copy of the data that gets printed out, this is the representation of the time series data that TimTam expects. Because we are working with times relative to date of the last sequence we have to shift the times to agree with that.
    • You can run the R script in RStudio (after setting the working directory to the one containing the time series CSV you downloaded before), or from the command line with Rscript print-disaster-data.R.
  5. Include the output from the R script as the values of disasterTimes and disasterSizes in the TimTam distribution node, if you have followed the steps above, this should have the ID TimTam.t:sample-sequences. The final XML should look like timtam.xml.
  6. Run MCMC (you may want to grab a coffee while this is running.)
  7. Download the simulation data from here and then plot both the simulation data and the estimates of the reproduction number and prevalence using post-processing.R, this should produce the two figures shown below.
<parameter spec="parameter.RealParameter" estimate="false" name="lambdaChangeTimes">18.0 13.5 9.0 4.5</parameter>

Results

To re-cap, we have used a multiple sequence alignment and a time series of observed (but not sequenced) case data to estimate the prevalence of infection and the effective reproduction number through time. The figures below show the true values of these quantities in red, and our estimates of them in blue.

The first figure below shows the posterior distribution of the effective reproduction number through time along with its \(95\%\) credible interval and the value of this parameter in the full simulation. Our estimate of the effective reproduction number changes through time because we allowed it to change at four times before the present in the XML.

The next figure shows the posterior distribution of the prevalence at the time of the last observation, i.e. the number of infectious people at the present, along with the true value of this quantity in the simulation.

As with every BEAST analysis, you should use something like Tracer to check for obvious issues in the MCMC mixing and convergence, but doing so is beyond the scope of this tutorial.

Appendix

Putting bounds on the birth rate

While the chain is converging, it sometimes visits regions of parameter space where numerical issues start to arise. Typically this will be when the birth rate gets unreasonably large. Setting an upper bound on the birth rate constrains the MCMC to explore only plausible regions of parameter space. Since the peak of the posterior is well within the bound we used, this should not influence the final results in a meaningful way, and it substantially speeds up the convergence.

Chain length

The figures above used two million samples which takes less than an hour, but is still substantially longer than people want to wait for a tutorial result. Ensure you carry out a range of posterior checks to ensure you have a sufficient number of posterior samples if you want to use this method on “real” data.

Simulation

The data used in this tutorial was simulated with MASTER. The simulation follows an SIR type epidemic in a closed population of 4000 people. The XML that specified this simulation is epidemic-simulation.xml and the R script to extract sequences and the time series of cases is pre-processing.R.

Running BEAUti and the BEAST

There is a GUI built-into BEAST and that is sufficient to use to complete this tutorial. If you want to run this from the command line though, get a copy of the BEAST2 jar file and you can run these programs with the following:

java -cp <path/to/beast.jar> beast.app.beauti.Beauti

and

java -cp <path/to/beast.jar> beast.app.beastapp.BeastMain <path/to/timtam.xml>