Skip to content

Constant Parameters Canonical

Alex Zarebski edited this page Nov 8, 2022 · 4 revisions

Example: Constant Parameters

Created by Alexander Zarebski

Edited by Bernardo Gutierrez on 12/08/2022

Last edited using

  • BEAST v2.7.1
  • remaster v1.0.0
  • timtam v0.4.0

This tutorial demonstrates how to use TimTam to estimate the reproduction number and prevalence of infection using a sequence alignment and a time series of confirmed cases. Often in genomic epidemiology, the available set of sequences is not representative of the total number of recorded cases due to limitations in sequencing capacity, accessibility to samples or other reasons. TimTam facilitates the inclusion of case counts combined with available genomic data to estimate epidemiological parameters from outbreaks.

Data and model

In this tutorial, we will work with data obtained from a simulated epidemic. The data consists of a sequence alignment, sample-sequences.fasta and a time series of daily case counts, aggregated-occurrences.csv. These data were simulated from a birth-death process in which individuals infect (therefore ‘give birth to’) new individuals and cease to be infectious (therefore ‘die’) after a given time. From the epidemic process, individuals that cease to be infectious as they recover or are removed from the epidemic due to a fatal outcome of the infection. However, from our data collection perspective, these ‘removed’ individuals correspond to one of three possible scenarios: i) they recover from the infection without ever being diagnosed (in which case we don’t observe their infection at all), ii) they can be diagnosed and have a genetic sequence generated (in which case we get a time stamped sequence for that case) or they can be diagnosed but generate an unsequenced sample (in which case they appear in the dataset as a confirmed case but there is no sequence data attached to the case). For the unsequenced samples, they are aggregated into daily counts giving rise to the time series data.

The following figure shows the number of infectious individuals (“infectious”) through time along with the cumulative number of people who were previously infectious, for example, who have recovered or been sampled. Note from the subtitle that there are 2222 infectious individuals at the present. The number of infectious individuals is growing exponentially, as indicated by the linear growth on a logarithmic y-scale.

The following figure shows the daily number of confirmed cases: infectious that have been observed but for which a sequence was not collected. These data form the time series that we will use.

Note that at the end of the simulated epidemic there are approximately 2250 infectious individuals. The parameters used to simulate this epidemic are set to represent a pathogen with a basic reproduction number of \(1.85\). To simplify this tutorial we will assume a known death rate of (\(0.046\)), and that the pathogen evolves with a molecular clock rate of (\(10-3\)) substitutions/site/day. Both values fall within the ballpark of what might be expected for an RNA virus. The origin time for our simulated epidemic took place (\(70\)) days before the end of the simulation.

Analysis

  1. Create a work folder in your computer and download the sample-sequences.fasta and aggregated-occurrences.csv files there.
  2. Create an XML file using BEAUti.
    1. Load the sequence data into BEAUti
    2. Parse the tip-dates using the default auto-configured values function. In order for this to work, you will need to adjust the formatter so that it will take everything after the last underscore (i.e. _).
    3. Leave the site model tab with the default values.
    4. Set the clock rate to 0.001 in the clock model tab.
    5. Select the Tim Tam Model on the tree prior tab.
    6. (Optional) set the chain length to a smaller value if you are in a hurry; 1,000,000 MCMC steps should do.
    7. Save the resulting file as an XML.
  3. We now need to estimate some parameter values to add to the XML. to do this, download and run the R script print-disaster-data.R and keep a copy of the data that gets printed out on the Console or Terminal. There should be two sets of values, these are the representation of the time series data that TimTam expects. We do this because the times in the simulation are relative to date of the last sequence we have; we therefore need to shift the times in the XML so they agree with the ones we just computed.
  4. Now it’s time to tweak the XML by opening it in a text editor and making the following edits:
    • Remove operators relating to the clock rate, as we are working with a known (from the simulation parameters) evolutionary rate. There should be two of these clock operators, you can identify them because they begin with <operator id="clock.rate".
    • Ensure the originTime is set to \(70.0\).
    • Remove all references to historySizes and historyChecks, these are used to estimate historical prevalence and are not needed for this example. WARNING removing the history size checks will mean that you can’t estimate the prevalence, if you are interested in how to do this, read the subsequent tutorials or look at the timtam.xml which still has these included.
    • Copy and paste the two number series from the R script output as the values of disasterTimes and disasterSizes respectively, in the TimTam distribution node. You should recognise it by the id TimTam.t:sample-sequences. This will format the time series of the sampling times and sizes so that they are included properly.
    • The final XML should look like timtam.xml.
  5. Run MCMC by executing the XML file on BEAST (you may want to grab a coffee while this is running.)
  6. Plot the results using post-processing.R. This will produce two figures similar to the ones shown below, summarising some of the epidemiological estimates from your analysis.

Results

By following this tutorial, we have used a multiple sequence alignment and a time series of observed (but not sequenced) case data to estimate the basic reproduction number of a pathogen and the prevalence of the disease at the time of the last observation in our simulated epidemic.

The first figure shows the posterior distribution of the (basic) reproduction number as inferred during your BEAST run. The true value for the reproduction number (defined as a known value in the simulation) is shown as a vertical line.

Note how in the plot, the 95% HPD includes the real value of the basic reproduction number. We would expect to recover the true value of an estimated parameter credible interval when working with real-world data.

If you have run the MCMC with the history size checks included, the post-processing will generate a second figure. The second figure shows the posterior distribution of the prevalence at the time of the last observation, i.e. the number of infectious people at the present time. Recall from the observation we made above that this number should be approximately 2250 infectious cases at that time. How does the estimate from the figure below and the estimate from your own analysis compare to the true number of infectious cases at this last time point?

The plots produced by the provided R script from your analysis should look very similar to the ones shown above but not identical. Remember that the MCMC samples across parameter space randomly, but while the exact values vary from one run to the next, the distribution of these values should be mostly the same.