Graph links
Links are paths through a graph (de Bruijn or unitig); this page focuses on the de Bruijn graph. Links are attached to kmers and connect them to other kmers further along the graph, which helps resolve areas of low complexity. Links are created from reads with the thread command, and they help us assemble longer contigs when traversing the graph.
1. Link information will always occur on a kmer immediately preceding a junction (i.e. in-degree greater than one)
2. Orientation of a link (forward (F) or reverse (R)) is with respect to the 'kmer-key' (the lexically lower of a kmer and its reverse complement)
3. If a link starts at kmer A and ends at kmer B, there exists a link on kmer B that goes from B to A
Rule 1 may also be re-written in terms of unitigs as:
Link information may only occur on the first node of a unitig going in the opposite direction, or the last node going in the same direction. When a unitig is only one node long this means the node may have links going in both directions.
Given three points in a graph: A,B,C, if we can traverse from B->A and from A->C, then we can traverse from B->C.
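The kmer-key and orientation rule above can be sketched in a few lines of Python. This is an illustrative sketch, not McCortex code: the function names `reverse_complement`, `kmer_key` and `orientation` are hypothetical, and it assumes upper-case kmers containing only A, C, G and T.

```python
# Complement table for DNA bases (assumes upper-case A, C, G, T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def kmer_key(kmer: str) -> str:
    """Return the lexically lower of a kmer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def orientation(kmer: str) -> str:
    """'F' if the kmer equals its kmer-key, otherwise 'R'.

    This is one reading of the orientation rule, stated as an assumption.
    """
    return "F" if kmer == kmer_key(kmer) else "R"
```

For instance, the example file further down this page lists the kmer AGCTATGCC with an R link whose seq= field starts GGCATAGCT; `kmer_key("GGCATAGCT")` returns `"AGCTATGCC"`, so that sequence begins at the reverse orientation of the kmer-key.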
To build links, pick a cleaning threshold, and clean the links:
mccortex31 thread -m 1G -o graph.raw.ctp.gz -1 seq.fa graph.clean.ctx
mccortex31 links -T link.stats.txt -L 1000 graph.raw.ctp.gz
LINK_THRESH=`grep 'suggested_cutoff=' link.stats.txt | grep -oE '[0-9,]+$'`
mccortex31 links --clean $LINK_THRESH --out graph.clean.ctp.gz graph.raw.ctp.gz
Link files are stored with the extension .ctp, or .ctp.gz if compressed. They are plain text documents that start with a JSON header, followed by records in this format: a kmer line, followed by one or more link lines which start at the given kmer.
Kmer line: <kmer> <number of links>
- kmer is the kmer-key of the initial kmer
- number of links is the number of links attached to this kmer-key
Only kmers that have links attached are listed. Only kmers that precede a collapse in the graph can have links attached.
Link line: <F|R> <num_juncs> <counts0,counts1,...> <junctions> <seq=... juncpos=... ...>
- F|R indicates if the link starts with the kmer in the forward (F) or reverse (R) orientation
- num_juncs is the number of junctions that this link spans. A junction is where the graph splits (forks) in the direction we are traversing. Edges coming together are not counted as junctions
- counts is the number of times this link is seen in each sample
- junctions is the sequence of junction choices made by this link. If this was AG, at the first junction we meet, we take the A edge. At the next junction we meet, we take the G edge.
- seq=... is the shortest sequence needed to reconstruct this link against the graph (includes initial kmer) [optional]
- juncpos=... using the sequence given above, gives the indices of the kmers that have out-degree > 1. Zero based: index 0 would be the first kmer (seq[0..k]). [optional]
Here is an example:
{
"file_format": "ctp",
"format_version": 4,
"file_key": "8809efe868d2caa9",
"graph": {
"num_colours": 1,
"kmer_size": 9,
"num_kmers_in_graph": 9625,
"colours": [{
"colour": 0,
"sample": "SeqUniq",
"total_sequence": 10014,
"cleaned_tips": false,
"cleaned_unitigs": false
}]
},
"commands": [{
"key": "0e3507b2",
"cmd": ["../../bin/mccortex31", "thread", "--seq", "seq.uniq.fa", "--out", "seq.uniq.k9
"cwd": "/Users/isaac/mccortex/tests/lossless",
"out_path": "/Users/isaac/mccortex/tests/lossless/seq.uniq.k9.ctp.gz",
"out_key": "8809efe868d2caa9",
"date": "2016-03-21 12:10:45",
"mccortex": "v0.0.3-472-g86b4ffe-dirty",
"htslib": "1.3-37-gfc93dfc",
"zlib": "1.2.5",
"user": "isaac",
"host": "Montag.home",
"os": "Darwin",
"osrelease": "15.3.0",
"osversion": "Darwin Kernel Version 15.3.0: Thu Dec 10 18:40:58 PST 2015; root:xnu-32
"hardware": "x86_64",
"prev": [],
"thread": {
"inputs": [{
"files": ["seq.uniq.fa"],
"interleaved": false,
"fq_offset": 0,
"fq_cutoff": 0,
"hp_cutoff": 0,
"matepair": "FR",
"frag_len_min_bp": 0,
"frag_len_max_bp": 1000,
"one_way_gap_fill": true,
"use_end_check": false,
"max_context": 1,
"gap_variance": 0.100000,
"gap_wiggle": 5
}]
"paths": {
"num_kmers_with_paths": 1065,
"num_paths": 1112,
"path_bytes": 77951,
"contig_hists": [{
"lengths": [10014],
"counts": [1]
}]
}
}
# Comment lines begin with a # and are ignored, but must come after the header
# Format is:
# [kmer] [num_paths] ...(ignored)
# [FR] [num_juncs] [counts0,counts1,...] [juncs:ACAGT] [seq=... juncpos=... ...]
#
# Columns are separated by a single space.
# Columns 1-4 are required ([FR],..,[juncs]) everything after than is optional
ACACGGCCC 1
F 67 1 CGGAATTTTACCTAGTAGATGCAAGCAATATGGAATAGTCTTGGCCATCCGATACTTGTACATGGTC seq=ACACGGCCCACGATTGCCTTAACATTTGGGGTG
AGCTATGCC 2
F 524 1 TATGGCGCGGGATAGCCGGGGAGTACATTGTAGAAGATAAGCAGGGATCTGGTTGCCAGGACAATCCCCAGGCTATAGCCGGGGGTTATCCCATCCCTGCTCTT
R 34 1 AATTCCTACTATCTTATCCACCTCGGAGCCAACC seq=GGCATAGCTTACAACAAGCATCTGGGGCTTTACACAGTGGGCAGTGGCGGCTGGACTGAAAAGCGA
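A file laid out like the example above could be read with a short parser along these lines. This is a sketch under assumptions, not the reader McCortex uses: `Link` and `parse_ctp` are hypothetical names, optional fields are kept as raw strings in `extras`, and the sample text is a made-up, heavily trimmed file rather than real tool output.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Link:
    orient: str        # "F" or "R"
    num_juncs: int     # number of junctions the link spans
    counts: list       # one count per sample
    juncs: str         # junction choices, e.g. "AG"
    extras: dict = field(default_factory=dict)  # optional key=value columns


def parse_ctp(text):
    """Return (header, links) where links maps kmer-key -> list of Link."""
    text = text.lstrip()
    # raw_decode parses the leading JSON object and reports where it ends.
    header, end = json.JSONDecoder().raw_decode(text)
    links = {}
    current = None
    for line in text[end:].splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        cols = line.split(" ")
        if cols[0] in ("F", "R"):
            if current is None:
                raise ValueError("link line before any kmer line")
            extras = dict(c.split("=", 1) for c in cols[4:] if "=" in c)
            counts = [int(c) for c in cols[2].split(",")]
            links[current].append(
                Link(cols[0], int(cols[1]), counts, cols[3], extras))
        else:
            # Kmer line: <kmer> <number of links>
            current = cols[0]
            links.setdefault(current, [])
    return header, links


# Hypothetical miniature file, for illustration only.
sample = """{"file_format": "ctp", "graph": {"kmer_size": 3}}
# a comment line
AAC 1
F 2 1 TA seq=AACGTCA
"""
header, links = parse_ctp(sample)
```

For a compressed .ctp.gz file, the text could be obtained with the standard library's `gzip.open(path, "rt").read()` before calling `parse_ctp`.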