Skip to content

Latest commit

 

History

History
executable file
·
96 lines (70 loc) · 5.99 KB

generate_pseudomolecules.step_3.md

File metadata and controls

executable file
·
96 lines (70 loc) · 5.99 KB

<< Back | Workflows ::: Generate pseudomolecules and QC scaffolding errors

Step 3: Sequence editing

Once breakpoints are defined, sequences incorrectly placed by HaploSplit and eventually, erroneous scaffolds, can be broken accordingly to the identified issues using HaploBreak.

In the workflow we assume to have a variable ${HaploSync_path} set in the environment pointing to the installation directory of HaploSync. To do so, run:

export HaploSync_path="/path/to/HaploSync"

HaploBreak

To run the editing do

  • Concatenate sequences files, AGP and and annotation files (if necessary):

    cat Pseudomolecules_hap1.fasta Pseudomolecules_hap2.fasta Pseudomolecules_Un.fasta > genome.fasta
    cat Pseudomolecules_hap1.agp Pseudomolecules_hap2.agp Pseudomolecules_Un.agp > previous_to_actual.genome.agp
    cat Pseudomolecules_hap1.gff3 Pseudomolecules_hap2.gff3 Pseudomolecules_Un.gff3 > genome.annotation.gff3
  • Runs HaploBreak:

    ${HaploSync_path}/HaploBreak.py -f genome.fasta \
    	-b Breakpoints.txt \
    	-g genome.annotation.gff3 \
    	-a previous_to_actual.genome.agp
    	-o ${broken_outname} \
    	-p ${new_sequence_prefix}

where

  • genome.fasta: contains all the pseudomolecules sequences, eventually assembled with HaploSplit.
  • Breakpoints.txt: is the tabular file providing the information on the approximate breakpoint position desired.
  • genome.annotation.gff3: contains the annotation onto the pseudomolecules. It is optional, but if available it is suggested to provide it as :
    • HaploBreak will translate it onto the edited sequences.
    • During the selection of the gaps to break, HaploBreak will prevent the selection of gaps that are bridged by gene loci and that would be eventually disrupted.
  • ${broken_outname}: name prefix for output files.
  • ${new_sequence_prefix}: prefix for ids of the edited sequence.
  • See HaploBreak usage page for further details.

What will HaploBreak do?

HaploBreak will search the nearest available junction between contigs.

  1. First it will search for erroneous junctions in the AGP file to break.

    • It assumes, and tries to correct, HaploSplit errors in first instance.

    • This will appear as sequences (one or more) comprised between two breakpoints corresponding to 2 “gap” components of the AGP file.

      If this is the case, the sequences between breakpoints will be classified as blacklisted. This list can be passed to HaploSplit directly in order to segregate overlapping sequences in different tiling paths, thus to avoiding to perform the same error again (as the tool is not able to do).

  2. If AGP file search fails, gaps within legacy sequences will be searched. The nearest compatible to the suggested breakpoint position will be used to split the scaffold in sub-sequences. To further understand how HaploBreak works, follow the link to HaploBreak usage page or read Minio et al. (2020).

Note: Breakpoint coordinate precision

Breakpoint coordinates given as input are not expected to be precise. HaploBreak will search a suitable junction to associate to a breakpoint with allowing some tolerance. The -d | --distance parameter controls this tolerance by setting a threshold (in bp) on the maximum distance between the breakpoint and its associated junction. By default it is set to 25Kb, which should be relaxed enough to allow a comfortable detection of breakpoints for the user, however in some cases where difference between genotypes may be too high it may be not sufficient. It is suggest therefore to increase slightly the allowed tolerance.

However, as this threshold trigger also the search for in-sequence gap to break in presence of assembly error, it is suggested to increase it carefully in order to hinder the possibility to correct errors.

Output files

  • ${broken_outname}.new_legacy.agp: AGP file reporting the processed sequences (out.new_legacy.fasta) relationship to the legacy sequence structure.

  • ${broken_outname}.new_legacy.annotation.gff3: annotation translated onto the processed sequences (${broken_outname}.new_legacy.fasta).

  • ${broken_outname}.new_legacy.fasta: processed sequences.

  • ${broken_outname}.blacklist.txt: sequences in between two AGP-reported gaps used as breakpoints.

  • ${broken_outname}.breakable_gaps.list.txt: list of input in-sequence gaps suitable for breaking.

  • ${broken_outname}.broken_gaps.given_sequences.list.txt: list of breaking points search result on input sequence:

    The format reports:

    • Header: “>” + the name of the input sequence.

    • Regions: a tabular description of each region for break point search with the first column indicating if the breakpoint fall in a known gap (from the AGP file if given) or within an ungapped region (“sequence”) that requires search for in-sequence gaps. Follow column 2 and 3 with the coordinates of the region.

      Ex:

      >chr01
      sequence        12150000        12857650
      sequence        14110000        20989345
      gap     12957818        12958818
      sequence        14330000        20989345
      >chr02
      sequence        13580000        18271468
      gap     18271468        18272468
      
  • ${broken_outname}.broken_gaps.legacy_sequences.list.txt: list of the broken gaps in terms of the legacy sequences ids (If AGP is given). The format is the same than ${broken_outname}.broken_gaps.given_sequences.list.txt however, only “gap” breakpoints are reported (as the breakpoints derive only from in-sequence gap search).

  • ${broken_outname}.broken_loci.annotation.legacy_updated.txt: Loci that were overlapping AGP-reported gaps and resulted broken by break point selection.

  • ${broken_outname}.legacy_breakable_gaps.list.txt: list of legacy in-sequence gaps suitable for breaking.

Next step

Move on to Step 4 - Build corrected pseudomolecules to create new pseudomolecule sequences from the results of the breaking and solve the previous issues.