Isaac Turner edited this page Sep 12, 2016 · 13 revisions

Links are paths through a graph (de Bruijn or unitig); here we focus on the de Bruijn graph. Links are attached to kmers to connect them to other kmers further along a path. This helps resolve areas of low complexity in the graph and lets us assemble longer contigs when traversing it. Links are created from reads with the thread command.

General Rules:

  1. Link information only occurs on a kmer immediately preceding a junction where paths merge (i.e. a kmer with in-degree greater than one)
  2. Orientation of link (forward (F) or reverse (R)) is w.r.t. 'kmer-key' (the lexically lower of a kmer and its reverse complement)
  3. If a link starts at kmer A and ends at kmer B, there exists a link on kmer B that goes from B to A
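
Rule 2 can be illustrated with a short Python sketch. It is not taken from McCortex; the function names are hypothetical and it assumes kmers contain only the bases A, C, G and T. It computes the kmer-key and recovers the kmer that a link actually starts from, given its F or R orientation. The sample kmer is the one shown in the example file on this page.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_key(kmer: str) -> str:
    """Return the lexically lower of a kmer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def oriented_kmer(key: str, orientation: str) -> str:
    """Return the kmer a link starts from.

    'F' means the kmer-key itself, 'R' means its reverse complement.
    """
    if orientation == "F":
        return key
    if orientation == "R":
        return reverse_complement(key)
    raise ValueError(f"orientation must be 'F' or 'R', got {orientation!r}")


print(kmer_key("GGCATAGCT"))            # AGCTATGCC
print(oriented_kmer("AGCTATGCC", "R"))  # GGCATAGCT
```

This matches the example file, where the R link on AGCTATGCC has a seq= value beginning with GGCATAGCT.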

Rule 1 may also be re-written in terms of unitigs as:

Link information may only occur on the first node of a unitig going in the opposite direction, or the last node going in the same direction. When a unitig is only one node long this means the node may have links going in both directions.

Traversal

Given three points in a graph, A, B and C: if we can traverse from B->A and from A->C, then we can traverse from B->C.
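
As a rough illustration of how a link guides traversal, the following Python sketch walks a toy graph and consumes one junction choice from a link each time the graph forks. It is a simplification under stated assumptions, not McCortex's algorithm: it only follows the forward strand (no reverse complements), ignores colours and counts, and stops when it revisits a kmer. All names and the toy sequences are hypothetical.

```python
def kmers_of(seq, k):
    """Return the set of kmers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def successors(kmer, graph):
    """Return the kmers in the graph that follow kmer (forward strand only)."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in graph]


def walk(graph, start, junctions=""):
    """Extend a contig from start, using junction choices at each fork."""
    contig = start
    kmer = start
    seen = {start}
    choices = iter(junctions)
    while True:
        options = successors(kmer, graph)
        if not options:
            break  # dead end
        if len(options) > 1:
            base = next(choices, None)
            if base is None:
                break  # fork with no link information left
            options = [nxt for nxt in options if nxt[-1] == base]
            if not options:
                break  # the link disagrees with the graph
        kmer = options[0]
        if kmer in seen:
            break  # simplification: do not walk around cycles
        seen.add(kmer)
        contig += kmer[-1]
    return contig


# Two toy sequences that merge at CTG and then fork again (k = 3).
graph = kmers_of("AACTGGA", 3) | kmers_of("TCCTGCA", 3)

print(walk(graph, "ACT"))       # ACTG
print(walk(graph, "ACT", "G"))  # ACTGGA
print(walk(graph, "CCT", "C"))  # CCTGCA
```

Without a link the walk stops at the fork after CTG; with a one-junction link stored on the kmer before the merge, it continues along the path the read came from.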

Building and cleaning

To build links, thread reads through the graph, pick a cleaning threshold and clean the links:

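# Thread reads through the cleaned graph to create raw links
mccortex31 thread -m 1G -o graph.raw.ctp.gz -1 seq.fa graph.clean.ctx
# Write link statistics, including a suggested cleaning cutoff
mccortex31 links -T link.stats.txt -L 1000 graph.raw.ctp.gz
# Extract the suggested cutoff and use it to clean the links
LINK_THRESH=$(grep 'suggested_cutoff=' link.stats.txt | grep -oE '[0-9,]+$')
mccortex31 links --clean "$LINK_THRESH" --out graph.clean.ctp.gz graph.raw.ctp.gz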

Link files

Link files are stored with the extension .ctp, or .ctp.gz if compressed. They are plain text documents that start with a JSON header, followed by a series of records: a kmer line, then one or more link lines for links that start at that kmer.

Kmer line: <kmer> <number of links>

  • kmer is the kmer-key of the initial kmer
  • number of links is the number of links attached to this kmer-key

Only kmers that have links attached are listed. Only kmers that precede a collapse in the graph can have links attached.

Link line: <F|R> <num_juncs> <counts0,counts1,...> <junctions> <seq=... juncpos=... ...>

  • F|R indicates whether the link starts with the kmer in the forward (F) or reverse (R) orientation
  • num_juncs is the number of junctions that this link spans. A junction is where the graph splits (forks) in the direction we are traversing. Edges coming together are not counted as junctions
  • counts is the number of times this link is seen in each sample
  • junctions is the sequence of junction choices made by this link. If this was AG, at the first junction we meet we take the A edge, and at the next junction we meet we take the G edge
  • seq=... is the shortest sequence needed to reconstruct this link against the graph (includes the initial kmer) [optional]
  • juncpos=... gives the indices, within the sequence given above, of the kmers that have out-degree > 1. Indices are zero-based: index 0 is the first kmer (seq[0..k]) [optional]
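
The two line formats above can be parsed with a few lines of Python. This is a minimal sketch, not McCortex code: the class and function names are hypothetical, it assumes columns are separated by single spaces, and it assumes juncpos values are comma-separated (the separator is not stated on this page). The sample link line is made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    orientation: str   # "F" or "R", relative to the kmer-key
    num_juncs: int     # number of junctions the link spans
    counts: list       # times the link is seen, one entry per sample
    junctions: str     # one base per junction choice
    extras: dict = field(default_factory=dict)  # optional key=value columns


def parse_kmer_line(line):
    """Return (kmer_key, number_of_links); any further columns are ignored."""
    fields = line.split(" ")
    return fields[0], int(fields[1])


def parse_link_line(line):
    """Parse one link line into a Link."""
    fields = line.split(" ")
    orientation, num_juncs, counts, junctions = fields[:4]
    if orientation not in ("F", "R"):
        raise ValueError(f"bad orientation: {orientation!r}")
    extras = dict(f.split("=", 1) for f in fields[4:] if "=" in f)
    return Link(orientation, int(num_juncs),
                [int(c) for c in counts.split(",")], junctions, extras)


def junction_kmers(link, k):
    """Return the kmers of seq= at the zero-based indices listed in juncpos=.

    Assumes juncpos is a comma-separated list of integers.
    """
    seq = link.extras["seq"]
    positions = [int(p) for p in link.extras["juncpos"].split(",")]
    return [seq[p:p + k] for p in positions]


# Hypothetical link line, not taken from a real graph.
link = parse_link_line("F 2 3 AG seq=ACTGATG juncpos=1,3")
print(link.junctions, link.counts)  # AG [3]
print(junction_kmers(link, 3))      # ['CTG', 'GAT']
```

The sketch does not check that the junctions string is num_juncs long, because long lines may be shown truncated when a file is displayed.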

Here is an example:

{
    "file_format":  "ctp",
    "format_version":       4,
    "file_key":     "8809efe868d2caa9",
    "graph":        {
            "num_colours":  1,
            "kmer_size":    9,
            "num_kmers_in_graph":   9625,
            "colours":      [{
                            "colour":       0,
                            "sample":       "SeqUniq",
                            "total_sequence":       10014,
                            "cleaned_tips": false,
                            "cleaned_unitigs":      false
                        }]
        },
        "commands":     [{
                        "key":  "0e3507b2",
                        "cmd":  ["../../bin/mccortex31", "thread", "--seq", "seq.uniq.fa", "--out", "seq.uniq.k9
                        "cwd":  "/Users/isaac/mccortex/tests/lossless",
                        "out_path":     "/Users/isaac/mccortex/tests/lossless/seq.uniq.k9.ctp.gz",
                        "out_key":      "8809efe868d2caa9",
                        "date": "2016-03-21 12:10:45",
                        "mccortex":     "v0.0.3-472-g86b4ffe-dirty",
                        "htslib":       "1.3-37-gfc93dfc",
                        "zlib": "1.2.5",
                        "user": "isaac",
                        "host": "Montag.home",
                        "os":   "Darwin",
                        "osrelease":    "15.3.0",
                        "osversion":    "Darwin Kernel Version 15.3.0: Thu Dec 10 18:40:58 PST 2015; root:xnu-32
                        "hardware":     "x86_64",
                        "prev": [],
                        "thread":       {
                                "inputs":       [{
                                                "files":        ["seq.uniq.fa"],
                                                "interleaved":  false,
                                                "fq_offset":    0,
                                                "fq_cutoff":    0,
                                                "hp_cutoff":    0,
                                                "matepair":     "FR",
                                                "frag_len_min_bp":      0,
                                                "frag_len_max_bp":      1000,
                                                "one_way_gap_fill":     true,
                                                "use_end_check":        false,
                                                "max_context":  1,
                                                "gap_variance": 0.100000,
                                                "gap_wiggle":   5
                                        }]
                            }
                    }],
        "paths":        {
                "num_kmers_with_paths": 1065,
                "num_paths":    1112,
                "path_bytes":   77951,
                "contig_hists": [{
                                "lengths":      [10014],
                                "counts":       [1]
                        }]
        }
}

# Comment lines begin with a # and are ignored, but must come after the header
# Format is:
#   [kmer] [num_paths] ...(ignored)
#   [FR] [num_juncs] [counts0,counts1,...] [juncs:ACAGT] [seq=... juncpos=... ...]
#
# Columns are separated by a single space.
# Columns 1-4 are required ([FR],..,[juncs]); everything after that is optional

ACACGGCCC 1
F 67 1 CGGAATTTTACCTAGTAGATGCAAGCAATATGGAATAGTCTTGGCCATCCGATACTTGTACATGGTC seq=ACACGGCCCACGATTGCCTTAACATTTGGGGTG
AGCTATGCC 2
F 524 1 TATGGCGCGGGATAGCCGGGGAGTACATTGTAGAAGATAAGCAGGGATCTGGTTGCCAGGACAATCCCCAGGCTATAGCCGGGGGTTATCCCATCCCTGCTCTT
R 34 1 AATTCCTACTATCTTATCCACCTCGGAGCCAACC seq=GGCATAGCTTACAACAAGCATCTGGGGCTTTACACAGTGGGCAGTGGCGGCTGGACTGAAAAGCGA
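
A file laid out like the example above can be read with a short Python sketch. This is an illustration under stated assumptions rather than a reference reader: the function names are hypothetical, the header is assumed to be complete, valid JSON at the very start of the file, and link lines are returned as raw column lists without validation. The embedded sample body is made up.

```python
import gzip
import json


def parse_ctp(text):
    """Split link-file text into (header, records).

    records is a list of (kmer_key, [link_fields, ...]) tuples, where each
    link_fields is the list of space-separated columns of one link line.
    """
    text = text.lstrip()
    # raw_decode parses one JSON value and reports the index where it ended.
    header, end = json.JSONDecoder().raw_decode(text)
    records = []
    for line in text[end:].splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split(" ")
        if fields[0] in ("F", "R"):
            if not records:
                raise ValueError("link line before any kmer line")
            records[-1][1].append(fields)
        else:
            records.append((fields[0], []))
    return header, records


def read_ctp(path):
    """Read a .ctp or .ctp.gz file from disk."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return parse_ctp(handle.read())


# Hypothetical miniature file: a tiny header and two records.
example = """{"file_format": "ctp", "format_version": 4}
# comment lines are skipped
ACACGGCCC 1
F 2 1 CG
AGCTATGCC 2
F 1 1 T
R 3 1 AAT
"""
header, records = parse_ctp(example)
print(header["file_format"], [(kmer, len(links)) for kmer, links in records])
```

Reading the whole file into memory keeps the sketch short; a streaming reader would be a better fit for large link files.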