This project shows what is possible in terms of tracking the ancestral recombination graph (ARG) during a foward-time simulation implemented using fwdpp and msprime. The interface to the implementation is in terms of a Python package implemented using fwdpy11 and pybind11.
GPLv3 or later (see COPYING)
We define a C++ class called "ancestry_tracker", which stores nodes and edges as they appear forwards in time. These data structures, and their updating, are non-intrusive, meaning that they don't care about any of the fwdpp internals. Rather, we simply have to define a new "iterate a generation" function that uses both fwdpp machinery and updates an ancestry_tracker as appropriate.
Using pybind11, we make ancestry_trackers visible to Python as an AncestryTracker class. The Python class has access to the nodes and edges as NumPy structured arrays, which can be viewed "for free", meaning that no copy from C++ to Python is required to look at them.
We define an ArgSimplifier class to bridge the AncestryTracker and msprime. The __call__ operator of ArgSimplifier will accept an AncestryTracker as an argument and use the msprime API to simplify the input data into a set of trees.
This is a proof of principle implementation only. It has been tested that it gives the correct sample properties in distribution and that internal data structures are sane during simulation. The API is not guaranteed to be neither ideal nor idiomatic. Further, various safety checks are missing on the C++ side. We simply do not check for integer overflow or other possible things that may happen if garbage collection occurs too infrequently. These limitations, and others, will be dealt with (and tested for) when we port these ideas into fwdpy11.
This section will get you to a point where you can run the C++ simulations described in the paper. These instructions have been confirmed to work in a clean conda environment using Python3. We strongly recommend that this package is compiled within a clean conda env.
First, install a Python3 environment for your user. It doesn't matter if you use a "full" installer or a "miniconda" installer. At this point, you may need to follow the conda installation instructions.
Note
Linux is the preferred environment here. If you insist on using Apple's OS X, then you must add a --gcc flag to the last command in the following code block. I (KRT) make no guarantees about success on OS X.
Instructions for conda on Linux:
conda create --name ftprime_ms python
source activate ftprime_ms
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install gcc fwdpy11==0.1.4 msprime==0.5.0 pybind11==2.2.1 pandas
git clone https://github.com/molpopgen/fwdpy11_arg_example
cd fwdpy11_arg_example
python setup.py build_ext -i
For example, let's run a simulation with the following parameters:
- N=5e4 diploids
- a region size of theta = rho = 1e4
- deleterious mutation rate equal to one percent of the neutral mutation rate
- simplify ("GC", or garbage-collect) every 1,000 generations
- apply mutations to a sample of 100 individuals at the end of the simulation
- write the timings to a file called timings.txt
The command line for the above is:
python benchmarking.py --popsize 50000 --theta 10000 --rho 10000 --pdel 0.01 --gc 1000 --nsam 100 --outfile1 timings.txt.gz \
--seed $RANDOM
Note
The output file is gzip compressed!
Please be mindful of running these simulations on machines with little RAM! In general, forward simulations are intended to be run on HPC-strength hardware. While tree sequence simplification results in very efficient run times, we are sometimes still using a substantial amount of RAM.
An example of the output is:
prepping sorting appending simplifying fwd_sim_runtime N theta rho simplify_interval
0.05370585599999733 4.384206619999995 0.2173980950000004 2.9446647440000016 5.174604999999977 1000 1000.0 1000.0 100
The fields are:
- prepping: cumulative time spent preparing data for a copy from the C++ side to msprime
- sorting: cumulative time spent sorting tables, which is a requirement for simplification
- simplifying: cumulative time spent simplifying tables
- fwd_sim_runtime: The total time spent simulating
The remaining four columns are the command-line parameters.
The package consists of a mix of C++ and Python code. All source code is in the fwdpy11_arg_example subdirectory of thie main repository.
We define nodes and edges as simple structs, meaning that they are "C-like", consisting only of POD and no constructors or other C++ stuff. This simple design allows C++ vectors of these structs to be treated as NumPy record arrays visible fom Python without needing to make a copy.
- node.hpp defines a node as a simple C-like struct.
- edge.hpp defines and edge as a simple C-like struct.
- ancestry_tracker.hpp defines a C++ struct/class called ancestry_tracker to accumulate nodes and edges during a simulation.
- evolve_generation.hpp handles the details of updating a Wright-Fisher population with an ancestry_tracker.
- handle_recombination.cc/.hpp handles the conversion of fwdpp's recombination breakpoints into types use to make edges.
- wfarg.cc defines a Python module (called wfarg) implemented in C++ via pybind11. It exposes our C++ back-end to Python. The most important user-facing type defined is AncestryTracker, which wraps the C++ ancestry_tracker.
- argsimplifier.py defines ArgSimplifier, which is the bridge between the C++ code to evolve a population and the msprime functionality to simplify the simulated nodes and edges.
- evolve_arg.py defines a function that evolves a population while tracking its ancestry. It integrates concepts from fwdpy11 with the types defined in this package.