Code for working with AGP and TPF files as used within the Tree of Life project, where the combination of long read sequencing and HiC data is used to produce whole genome assemblies. It is therefore not intended to cover the full range of AGP and TPF syntax.
The command line scripts are added to your PATH if the suggested development venv is set up. Run each with `--help` for usage.
Parses and reformats AGP and TPF files, converting into either format.
Takes the AGP file output by PretextView and creates TPF files containing precise coordinates of the curated assembly.
Both TPF and AGP file formats described here contain the same information. AGP is the more appropriate format to use, since it was designed for sequence assembly coordinates, whereas TPF was for listing (cosmid, fosmid, YAC or BAC) clones and their accessions in the order that they were tiled to build a chromosome.
Each line in the AGP v2.1 specification contains 9 tab delimited columns. Of these columns:
- DNA Sequence
  - column 5 the "component_type" contains `W` in our assemblies, meaning a contig from Whole Genome Shotgun (WGS) sequencing.
  - columns 10 and greater are extra tag metadata columns not included in the AGP v2.1 specification. (See below for their possible values.)
- Gaps
  - column 5 the "component_type" contains `U` in our assemblies, for a gap of unknown length. (The other gap type `N` is for gaps of known length.)
  - column 6 is the gap length. The default length in the specification for `U` gaps is 100 base pairs, but we use 200 bp gaps, as produced by yahs.
  - column 7 has `scaffold`, signifying a gap between two contigs in a scaffold.
  - column 8 has `yes`, signifying that there is evidence of linkage between the sequence data on either side of the gap.
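As an illustration of the column layout above, here is a minimal sketch of splitting one AGP line into either a sequence fragment or a gap. It is not the parser used in this repository: the class names `AgpFragment` and `AgpGap`, the function `parse_agp_line` and the sample lines are all hypothetical, and the sketch assumes the standard AGP v2.1 column order.

```python
from dataclasses import dataclass, field


@dataclass
class AgpFragment:
    """A DNA sequence line (hypothetical container, not this project's class)."""
    scaffold: str
    start: int
    end: int
    component_id: str
    component_start: int
    component_end: int
    orientation: str
    tags: list = field(default_factory=list)


@dataclass
class AgpGap:
    """A gap line (hypothetical container, not this project's class)."""
    scaffold: str
    start: int
    end: int
    length: int
    gap_type: str
    linkage: str


def parse_agp_line(line):
    """Parse one AGP line; return None for blank lines and '#' comments."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        raise ValueError(f"Expected at least 9 columns, found {len(cols)}")
    scaffold, start, end = cols[0], int(cols[1]), int(cols[2])
    if cols[4] in ("U", "N"):
        # Gap: column 6 is the length, column 7 the gap type, column 8 linkage
        return AgpGap(scaffold, start, end, int(cols[5]), cols[6], cols[7])
    # Sequence: columns 10 onwards are the extra tag metadata columns
    tags = [t for t in cols[9:] if t]
    return AgpFragment(
        scaffold, start, end, cols[5], int(cols[6]), int(cols[7]), cols[8], tags
    )


# Made-up example lines, for illustration only
seq_line = "Scaffold_1\t1\t1000\t1\tW\tscaffold_12\t1\t1000\t+\tPainted\tHap1"
gap_line = "Scaffold_1\t1001\t1200\t2\tU\t200\tscaffold\tyes\tproximity_ligation"
fragment = parse_agp_line(seq_line)
gap = parse_agp_line(gap_line)
```

Keeping fragments and gaps as separate types means code that consumes the parsed lines cannot mistake a gap length for a component name.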
Single words appended in tab-delimited columns beyond column 9. They can contain:
- `Contaminant`
- `Haplotig` for haplotype-specific contigs.
- Haplotypes: `Hap1`, `Hap2` …
- `Painted` where the fragment has HiC contacts.
- `Unloc` for fragments attached to chromosomes but unlocalised within them.
- Sex Chromosomes:
  - `U`
  - `V`
  - `W` or `W1`, `W2` …
  - `X` or `X1`, `X2` …
  - `Y` or `Y1`, `Y2` …
  - `Z` or `Z1`, `Z2` …
- B Chromosomes: `B1`, `B2`, `B3` …
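The tag vocabulary above is small enough to check with regular expressions. The following is a sketch only: the category names and the `classify_tag` function are hypothetical and are not taken from this repository, and the patterns assume that tags are matched case-sensitively and that numbered tags carry a plain decimal suffix.

```python
import re

# (category, pattern) pairs; the category names are invented for this sketch
TAG_PATTERNS = [
    ("contaminant", re.compile(r"Contaminant")),
    ("haplotig", re.compile(r"Haplotig")),
    ("haplotype", re.compile(r"Hap\d+")),
    ("painted", re.compile(r"Painted")),
    ("unloc", re.compile(r"Unloc")),
    # U and V appear without a number; W, X, Y and Z may be numbered
    ("sex_chromosome", re.compile(r"[UV]|[WXYZ]\d*")),
    ("b_chromosome", re.compile(r"B\d+")),
]


def classify_tag(tag):
    """Return the category of a tag, or None if it is not recognised."""
    for category, pattern in TAG_PATTERNS:
        if pattern.fullmatch(tag):
            return category
    return None
```

Using `fullmatch` rather than `match` stops a tag such as `Haplotig` from being accepted by a shorter pattern, and makes unexpected words return `None` so that they can be reported.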
Our TPF files are highly diverged from the original specification.
- We incorporate assembly coordinates, which was not the purpose of TPF files.
- We do not necessarily include any `##` header lines, which were mandatory in the original specification.
- DNA Sequence
  - column 1 the "accession" is always `?` since the components of our assemblies are not accessioned.
  - column 2 the "clone name" does not contain a clone name, but contains the name of scaffold fragment or whole scaffold, with the format: `<name>:<start>-<end>` i.e. assembly coordinates.
  - column 3 the "local contig identifier" now contains the name of the scaffold each sequence fragment belongs to. Each TPF file used to contain a single chromosome, but we put a whole genome into a single file, and this column groups the fragments into chromosomes / scaffolds.
  - column 4 which in the original specification was used for indicating `CONTAINED` or `CONTAINED_TURNOUT` clones now holds assembly strand information, either `PLUS` or `MINUS`.
- Gaps
  - column 2 is `TYPE-2`, which meant a gap between two clones.
  - column 3 length, using our default of 200 bp.
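To make the sequence line layout concrete, here is a minimal sketch that reads one TPF sequence line in the format described above. The function `parse_tpf_fragment`, the dictionary keys and the sample line are hypothetical and are not this repository's API. The sketch handles sequence lines only, and it assumes that the `<start>` and `<end>` coordinates are plain integers.

```python
import re

# Column 2: <name>:<start>-<end>; the greedy name allows colons within names
FRAGMENT_NAME = re.compile(r"(?P<name>.+):(?P<start>\d+)-(?P<end>\d+)")


def parse_tpf_fragment(line):
    """Parse a TPF sequence line into a dict; raise ValueError if malformed."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 4:
        raise ValueError(f"Expected 4 columns, found {len(cols)}")
    accession, fragment, scaffold, strand = cols[:4]
    match = FRAGMENT_NAME.fullmatch(fragment)
    if match is None:
        raise ValueError(f"Unexpected fragment name: {fragment!r}")
    if strand not in ("PLUS", "MINUS"):
        raise ValueError(f"Unexpected strand: {strand!r}")
    return {
        "accession": accession,
        "name": match["name"],
        "start": int(match["start"]),
        "end": int(match["end"]),
        "scaffold": scaffold,
        "strand": 1 if strand == "PLUS" else -1,
    }


# Made-up example line, for illustration only
record = parse_tpf_fragment("?\tscaffold_12:1-1000\tScaffold_1\tMINUS")
```

Gap lines have fewer columns than sequence lines, so this function rejects them with a `ValueError` and they would need separate handling.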
In your cloned copy of the git repository:
python3 -m venv --prompt asm-utils venv
source venv/bin/activate
pip install --upgrade pip
pip install --editable .
An alias such as this:
alias atu="cd $HOME/git/agp-tpf-utils && source ./venv/bin/activate"
in your shell's `.*rc` file (e.g. `~/.bashrc` for `bash` or `~/.zshrc` for `zsh`) can be convenient.
Some changes, such as adding a new command line script to `pyproject.toml`, require the development environment to be reinstalled:
pip uninstall tola-agp-tpf-utils
pip install --editable .
hash -r
Tests, located in the `tests/` directory, are run with the `pytest` command from the project root.