tskit-dev · benjeffery · Feb 3, 2025 · hyanwong · Feb 5, 2025 · hyanwong
diff --git a/docs/_static/ancestor_grouping.png b/docs/_static/ancestor_grouping.png
diff --git a/docs/_toc.yml b/docs/_toc.yml
@@ -14,6 +14,7 @@ parts:
 - caption: Inference
   chapters:
   - file: inference
+  - file: large_scale
 - caption: Interfaces
   chapters:
   - file: api

diff --git a/docs/api.rst b/docs/api.rst
@@ -16,6 +16,11 @@ File formats
 Sample data
 +++++++++++
 
+.. autoclass:: tsinfer.VariantData
+    :members:
+    :inherited-members:
+
+
 .. autoclass:: tsinfer.SampleData
     :members:
     :inherited-members:
@@ -60,6 +65,27 @@ Running inference
 
 .. autofunction:: tsinfer.post_process
 
+*****************
+Batched inference
+*****************
+
+.. autofunction:: tsinfer.match_ancestors_batch_init
+
+.. autofunction:: tsinfer.match_ancestors_batch_groups
+
+.. autofunction:: tsinfer.match_ancestors_batch_group_partition
+
+.. autofunction:: tsinfer.match_ancestors_batch_group_finalise
+
+.. autofunction:: tsinfer.match_ancestors_batch_finalise
+
+.. autofunction:: tsinfer.match_samples_batch_init
+
+.. autofunction:: tsinfer.match_samples_batch_partition
+
+.. autofunction:: tsinfer.match_samples_batch_finalise
+
+
 *****************
 Container classes
 *****************

diff --git a/docs/inference.md b/docs/inference.md
@@ -300,4 +300,4 @@ The final phase of a `tsinfer` inference consists of a number steps:
        section
     2. Describe the structure of the output tree sequences; how the
        nodes are mapped, what the time values mean, etc.
-:::
+:::
diff --git a/docs/large_scale.md b/docs/large_scale.md
@@ -0,0 +1,136 @@
+---
+jupytext:
+  text_representation:
+    extension: .md
+    format_name: myst
+    format_version: 0.12
+    jupytext_version: 1.9.1
+kernelspec:
+  display_name: Python 3
+  language: python
+  name: python3
+---
+
+:::{currentmodule} tsinfer
+:::
+
+(sec_large_scale)=
+
+# Large Scale Inference
+
+tsinfer scales well and has been successfully used with datasets up to half a
+million samples. Here we detail considerations and tips for each step of the
+inference process to help you scale up your analysis. A snakemake pipeline
+which implements this parallelisation scheme is available at https://github.com/benjeffery/tsinfer-snakemake.
+
+## Data preparation
+
+For large scale inference the data must be in [VCF Zarr](https://github.com/sgkit-dev/vcf-zarr-spec)
+format, read by the {class}`VariantData` class. [bio2zarr](https://github.com/sgkit-dev/bio2zarr)
+is recommended for conversion from VCF. [sgkit](https://github.com/sgkit-dev/sgkit) can then
+be used to perform initial filtering.
+
+(sec_large_scale_ancestor_generation)=
+
+## Ancestor generation
+
+Ancestor generation is generally the fastest step in inference and is not yet
+parallelised out-of-core in tsinfer. However it scales well on machines with
+many cores and hyperthreading via the `num_threads` argument to
+{meth}`generate_ancestors`. The limiting factor is often that the
+entire genotype array for the contig being inferred needs to fit in RAM.
+This is the high-water mark for memory usage in tsinfer.
+Note the `genotype_encoding` argument: setting this to
+{class}`tsinfer.GenotypeEncoding.ONE_BIT` reduces the memory footprint of
+the genotype array by a factor of 8, for a surprisingly small increase in
+runtime. With this encoding, the RAM needed is roughly
+`num_sites * num_samples * ploidy / 8` bytes.
+
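As a rough planning aid, the memory estimate above can be computed before requesting a machine. This is a minimal sketch using only the formula in the text; `one_bit_genotype_bytes` is a hypothetical helper, not part of tsinfer, and the dataset dimensions are made up for illustration.

```python
def one_bit_genotype_bytes(num_sites: int, num_samples: int, ploidy: int = 2) -> int:
    """Approximate bytes needed to hold the one-bit-encoded genotype array."""
    total_bits = num_sites * num_samples * ploidy
    # Round up to a whole number of bytes.
    return (total_bits + 7) // 8


# Hypothetical dataset: 1 million sites, 500,000 diploid samples.
estimate = one_bit_genotype_bytes(num_sites=1_000_000, num_samples=500_000)
print(f"{estimate / 1e9:.1f} GB")  # 125.0 GB
```

The figure covers the genotype array only, so some headroom should be added for the rest of the process.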
+## Ancestor matching
+
+Ancestor matching is one of the more time-consuming steps of inference. It
+proceeds in groups, progressively growing the tree sequence with younger
+ancestors. At each stage the parallelism is limited to the number of ancestors
+whose possible copying sources are already matched, as every older ancestor
+that an ancestor could inherit from must be matched in an earlier group. For a
+typical human data set the number of ancestors per group varies from single
+digits up to approximately the number of samples.
+The plot below shows the number of ancestors matched in each group for a typical
+human data set:
+
+```{figure} _static/ancestor_grouping.png
+:width: 80%
+```
+
+There are five tsinfer API methods that can be used to parallelise ancestor
+matching. 
+
+Initially {meth}`match_ancestors_batch_init` should be called to 
+set up the batch matching and to determine the groupings of ancestors.
+This method writes a file `metadata.json` to the `work_dir` that contains
+a JSON encoded dictionary with configuration for later steps, and a key
+`ancestor_grouping` which is a list of dictionaries, each containing the
+list of ancestors in that group (key: `ancestors`) and a proposed partitioning of
+those ancestors into sets that can be matched in parallel (key: `partitions`).
+The dictionary is also returned by the method.
+The partitioning is controlled by the `min_work_per_job` and `max_num_partitions`
+arguments. Ancestors are placed in a partition until the sum of their lengths exceeds
+`min_work_per_job`, when a new partition is started. However, the number of partitions
+is not allowed to exceed `max_num_partitions`. It is suggested to set `max_num_partitions`
+to around 3-4x the number of worker nodes available, and `min_work_per_job` to around
+2,000,000 for a typical human data set.
+
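The partitioning rule described above can be illustrated with a small greedy sketch. This is not tsinfer's implementation: `partition_by_work` is a hypothetical function, and the way the `max_num_partitions` cap is enforced here (remaining ancestors are added to the last partition) is an assumption made for illustration.

```python
def partition_by_work(lengths, min_work_per_job, max_num_partitions):
    """Greedily split ancestor indexes into partitions by summed length."""
    if max_num_partitions < 1:
        raise ValueError("max_num_partitions must be at least 1")
    if not lengths:
        return []
    partitions = [[]]
    work = 0
    last_index = len(lengths) - 1
    for index, length in enumerate(lengths):
        partitions[-1].append(index)
        work += length
        # Start a new partition once enough work has accumulated, unless the
        # cap has been reached (assumed behaviour) or nothing is left to place.
        if (
            work > min_work_per_job
            and len(partitions) < max_num_partitions
            and index != last_index
        ):
            partitions.append([])
            work = 0
    return partitions


lengths = [5, 5, 5, 5, 5, 5]
print(partition_by_work(lengths, min_work_per_job=9, max_num_partitions=10))
# [[0, 1], [2, 3], [4, 5]]
print(partition_by_work(lengths, min_work_per_job=9, max_num_partitions=2))
# [[0, 1], [2, 3, 4, 5]]
```

A larger `min_work_per_job` gives fewer, longer jobs, which helps when job start-up time dominates.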
+Each group is then matched in turn, either by calling {meth}`match_ancestors_batch_groups`
+to match without partitioning, or by calling {meth}`match_ancestors_batch_group_partition`
+many times in parallel followed by a single call to {meth}`match_ancestors_batch_group_finalise`.
+Each call to {meth}`match_ancestors_batch_groups` or {meth}`match_ancestors_batch_group_finalise`
+outputs the tree sequence to `work_dir`, which is then used by the next group. The length of
+the `ancestor_grouping` in the metadata dictionary determines the group numbers that these methods
+will need to be called for, and the length of the `partitions` list in each group determines
+the number of calls to {meth}`match_ancestors_batch_group_partition` that are needed (if any).
+
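The calls needed can be enumerated directly from the metadata dictionary. The following is a minimal sketch, not part of tsinfer: `plan_ancestor_jobs` is a hypothetical helper that only builds a list of (function name, keyword arguments) pairs and calls nothing. The keyword names follow the wording of this page and should be checked against the API reference, and treating a missing or empty `partitions` entry as "match without partitioning" is an assumption.

```python
def plan_ancestor_jobs(metadata, work_dir):
    """Return the batch calls needed, in order, as (function name, kwargs) pairs."""
    jobs = []
    for group_index, group in enumerate(metadata["ancestor_grouping"]):
        # Assumption: a missing, None or empty "partitions" entry means the
        # group is matched without partitioning.
        partitions = group.get("partitions") or []
        if not partitions:
            kwargs = {
                "work_dir": work_dir,
                "group_index_start": group_index,
                "group_index_end": group_index + 1,  # end index is exclusive
            }
            jobs.append(("match_ancestors_batch_groups", kwargs))
            continue
        # The partition calls for one group are independent of each other.
        for partition_index in range(len(partitions)):
            kwargs = {
                "work_dir": work_dir,
                "group_index": group_index,
                "partition_index": partition_index,
            }
            jobs.append(("match_ancestors_batch_group_partition", kwargs))
        jobs.append((
            "match_ancestors_batch_group_finalise",
            {"work_dir": work_dir, "group_index": group_index},
        ))
    jobs.append(("match_ancestors_batch_finalise", {"work_dir": work_dir}))
    return jobs


# Made-up metadata with one unpartitioned and one partitioned group.
example_metadata = {
    "ancestor_grouping": [
        {"ancestors": [0], "partitions": None},
        {"ancestors": [1, 2, 3, 4], "partitions": [[1, 2], [3, 4]]},
    ]
}
for name, kwargs in plan_ancestor_jobs(example_metadata, "work"):
    print(name, kwargs)
```

Building the plan as data first makes it easy to hand each entry to a workflow manager or job scheduler.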
+{meth}`match_ancestors_batch_groups` matches groups, without partitioning, from
+`group_index_start` (inclusive) to `group_index_end` (exclusive). Combining
+many groups into one call reduces the overhead from job submission and start-up
+times, but note that on job failure the process can only be resumed from the
+last `group_index_end`.
+
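The trade-off between fewer jobs and finer checkpoints comes down to how many groups each call covers. As a minimal sketch, the hypothetical helper below (not part of tsinfer) splits the group indexes into half-open ranges that could be passed as `group_index_start` and `group_index_end`.

```python
def group_ranges(num_groups, groups_per_call):
    """Split group indexes 0..num_groups-1 into half-open (start, end) ranges."""
    if groups_per_call < 1:
        raise ValueError("groups_per_call must be at least 1")
    return [
        (start, min(start + groups_per_call, num_groups))
        for start in range(0, num_groups, groups_per_call)
    ]


print(group_ranges(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

A failed job covering `(4, 8)` would be rerun from group 4, so smaller ranges lose less work at the cost of more submissions.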
+To match a single group in parallel, call {meth}`match_ancestors_batch_group_partition`
+once for each partition listed in the `ancestor_grouping[group_index]['partitions']` list,
+incrementing `partition_index`. This will match the ancestors, placing the match data in
+the `work_dir`. Once all are complete a single call to
+{meth}`match_ancestors_batch_group_finalise` will then insert the matches and
+output the tree sequence to `work_dir`.
+
+At any point the process can be resumed from the last successfully completed call to
+{meth}`match_ancestors_batch_groups` or {meth}`match_ancestors_batch_group_finalise`, as the
+tree sequences in `work_dir` checkpoint the progress.
+
+Finally, after the last group has been matched, call {meth}`match_ancestors_batch_finalise` to
+combine the groups into a single tree sequence.
+
+The partitioning in `metadata.json` does not have to be used for every group. As early groups are
+not matching to a large tree sequence it is often faster to not partition the first half of the
+groups, depending on job set up and queueing time on your cluster.
+
+Calls to {meth}`match_ancestors_batch_group_partition` will only use a single core, but
+{meth}`match_ancestors_batch_groups` will use as many cores as `num_threads` is set to.
+Therefore this value and the cluster resources requested should be scaled with the number of ancestors,
+which can be read from the metadata dictionary.
+
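One simple way to scale `num_threads` with the number of ancestors is sketched below. This heuristic is an assumption rather than a tsinfer recommendation: `threads_for_groups` is a hypothetical helper that requests no more threads than the largest group in the range contains ancestors, capped at the cores available.

```python
def threads_for_groups(metadata, group_index_start, group_index_end, max_cores):
    """Suggest a thread count for an unpartitioned call over a range of groups."""
    groups = metadata["ancestor_grouping"][group_index_start:group_index_end]
    # More threads than ancestors in the largest group cannot be kept busy.
    largest = max((len(group["ancestors"]) for group in groups), default=1)
    return max(1, min(largest, max_cores))


# Made-up metadata with groups of 1, 3 and 50 ancestors.
example_metadata = {
    "ancestor_grouping": [
        {"ancestors": list(range(1))},
        {"ancestors": list(range(3))},
        {"ancestors": list(range(50))},
    ]
}
print(threads_for_groups(example_metadata, 0, 2, max_cores=16))  # 3
print(threads_for_groups(example_metadata, 0, 3, max_cores=16))  # 16
```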
+
+
+## Sample matching 
+
+Sample matching is far simpler than ancestor matching as it is essentially the same as a single group
+of ancestors. There are three API methods that work together to enable distributed sample matching.
+{meth}`match_samples_batch_init` should be called to set up the batch matching and to determine the
+groupings of samples. Similar to {meth}`match_ancestors_batch_init`, it has `min_work_per_job` and
+`max_num_partitions` arguments to control the level of parallelism. The method writes a file
+`metadata.json` to the directory `work_dir` that contains a JSON encoded dictionary with
+configuration for later steps. This is also returned by the call. The `num_partitions` key in
+this dictionary is the number of times {meth}`match_samples_batch_partition` will need
+to be called, with each partition index as the `partition_index` argument. These calls can happen
+in parallel and write match data to the `work_dir` which is then used by
+{meth}`match_samples_batch_finalise` to output the tree sequence.
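
The ordering constraint for sample matching (all partitions first, then one finalise step) can be sketched with a thread pool. This is an illustration only: `run_sample_matching` is a hypothetical helper, and the two callables are stand-ins for whatever wraps {meth}`match_samples_batch_partition` and {meth}`match_samples_batch_finalise` in your environment. On a cluster these would normally be separate jobs rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sample_matching(metadata, match_partition, finalise, max_workers=4):
    """Run every partition, then finalise once all of them have completed."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for every partition and re-raises the first failure.
        list(pool.map(match_partition, range(metadata["num_partitions"])))
    return finalise()


# Stand-in callables so the sketch runs without tsinfer or any data.
completed = []
result = run_sample_matching(
    {"num_partitions": 3},
    match_partition=completed.append,
    finalise=lambda: "finalised",
)
print(sorted(completed), result)  # [0, 1, 2] finalised
```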
diff --git a/tsinfer/formats.py b/tsinfer/formats.py
@@ -2308,7 +2308,7 @@ class VariantData(SampleData):
         the inference process will have ``inferred_ts.num_samples`` equal to double
         the number returned by ``VariantData.num_samples``.
 
-    :param Union(str, zarr.hierarchy.Group) path_or_zarr: The input dataset in
+    :param Union(str, zarr.Group) path_or_zarr: The input dataset in
         `VCF Zarr <https://github.com/sgkit-dev/vcf-zarr-spec>`_ format.
         This can either a path to the Zarr dataset saved on disk, or the
         Zarr object itself.