Workflow detailing the Remapping of the Phaeodactylum tricornutum genome annotation data to the Telomere-to-telomere assembly
Here contains all the files and scripts used to perform remapping of Phatr3 annoation data from the GCA_000150955.2 (Diatom Consortium) genome assembly to the GCA_914521175.1 (Telomere-to-telomere) genome assembly of Phaeodactylum tricornutum.
Data is hosted on UCSC and can be accessed through the following URL: https://genome.ucsc.edu/cgi-bin/hgGateway?hgsid=1828134926_AOVBj4mafegeRJoItfGJ2k2cDR0R
- Inputs
- Uncleaned up metadata in
.gff
formatgenome_data/old/uncleaned_metadata_file/phatr3_gene_models_with_href.gff3
- Uncleaned up metadata in
- Script
- used
sed
- used
cat genome_data/old/uncleaned_metadata_file/phatr3_gene_models_with_href.gff3 | sed 's=\\==g; s=\/==g; s=%==g; s=(==g; s=)==g; s=|==g; s=&==g; s={==g; s=}==g; s=<==g; s=>==g; s=>==g; s=\*==g; s=?==g; s=\$==g; s=!==g; s=@==g; s="==g; s= ==g; s=:=_=g; s/=//g; s=,==g; s=;=_=g' > tmp.gff
cat tmp.gff | sed "s='==g" > genome_data/old/cleaned_up_metadata_file/old_annotation_data_fully_cleaned_and_w_unique_lines.gff
- Outputs
- cleaned up metadata in
.gff
formatgenome_data/old/cleaned_up_metadata_file/old_annotation_data_fully_cleaned_and_w_unique_lines.gff
- cleaned up metadata in
- Inputs
- old genome assembly in
.fasta
formatassemblies/old_nuclear_assembly.fasta
- cleaned up metadata in
.gff
formatgenome_data/old/cleaned_up_metadata_file/old_annotation_data_fully_cleaned_and_w_unique_lines.gff
- old genome assembly in
- Script
- use
bedtools getfasta
- Need the
–name
and–s
flags to make sure everything downstream works properly
- use
bedtools getfasta -name -s -fi assemblies/old_nuclear_assembly.fasta -bed genome_data/old/cleaned_up_metadata_file/old_annotation_data_fully_cleaned_and_w_unique_lines.gff > gff_sequences.fa
- Outputs
- File containing all sequence in
fasta
formatgff_sequences.fa
- File containing all sequence in
- Inputs
- Names of
.fasta
files in a list (strip the.fa
or.fasta
for this script to work)gff_sequences
- Names of
- Script
./utils/n_check/countNdetailPerFeature.sh
# cd into the folder with gff_sequences
source utils/n_check/countNdetailPerFeature.sh gff_sequences
- Outputs
- List of feature names without N's in the sequence (
.txt
format)gff_sequences_features_with_no_Ns.txt
- List of feature names with N's in the sequence (
.txt
format)gff_sequences_features_with_Ns.txt
- List of feature names with N's in the sequence and how many N's they posess (
.txt
format)gff_sequences_features_withNs.txt
- List of feature names without N's in the sequence (
- Inputs
- file containing all the sequences of genomic features from
.gff
file in.fasta
formatgff_sequences.fa
- list of feature names without N's in the sequence (
.txt
format)gff_sequences_features_with_no_Ns.txt
- file containing all the sequences of genomic features from
- Script
- used
seqtk subseq
- used
seqtk subseq gff_sequences.fa gff_sequences_features_with_no_Ns.txt > gff_sequences_no_ns.fa
- Outputs
- File containing all the sequences of genomic features from
.gff
file excluding those with N's in.fasta
formatgff_sequences_no_ns.fa
- File containing all the sequences of genomic features from
- Inputs
- Names of
.fasta
files in a list (strip the.fa
or.fasta
for this script to work)gff_sequences_no_ns
- Names of
- Script
./utils/sequence_length_check/check_seq_length.sh
- This script requires
bioawk
to work. Make sure you havebioawk
installed in your conda environment and modify the script to source that conda environment from within the script.
- This script requires
# cd into the folder with gff_sequences
source utils/sequence_length_check/check_seq_length.sh gff_sequences_no_ns
- Outputs
- List of feature names without N's in the sequence and that are bigger or equal to 20bp (
.txt
format)gff_sequences_no_ns_bigger_than_or_eq_to_20_bp.txt
- List of feature names without N's in the sequence and that are smaller than 20bp (
.txt
format)gff_sequences_no_ns_than_20_bp.txt
- List of feature names without N's in the sequence and that are smaller than 20bp with their sizes(
.txt
format)gff_sequences_no_ns_than_20_bp_with_their_sizes.txt
- List of feature names without N's in the sequence and that are bigger or equal to 20bp (
- Inputs
- File containing all the sequences of genomic features from
.gff
file in without N's.fasta
formatgff_sequences_no_ns.fa
- List of feature names without N's in the sequence and that are bigger or equal to 20bp (
.txt
format)gff_sequences_no_ns_bigger_than_or_eq_to_20_bp.txt
- File containing all the sequences of genomic features from
- Script
- used
seqtk subseq
- used
seqtk subseq gff_sequences_no_ns.fa gff_sequences_no_ns_bigger_than_or_eq_to_20_bp.txt > gff_sequences_no_ns_bigger_than_or_eq_to_20_bp.fa
- Outputs
- File containing all the sequences of genomic features from
.gff
file excluding those with N's and those that are bigger or equal to 20bp in.fasta
formatgff_sequences_no_ns_bigger_than_or_eq_to_20_bp.fa
- File containing all the sequences of genomic features from
- Inputs
- New genome assembly in
.fasta
formatassemblies/new_nuclear_assembly.fasta
- New genome assembly in
- Script
- use
makeblastdb
- use
makeblastdb -in assemblies/new_nuclear_assembly.fasta -dbtype nucl
- Outputs
- BLAST database index files
- Inputs
- New genome assembly in
.fasta
formatassemblies/new_nuclear_assembly.fasta
- File containing all the sequences of genomic features from
.gff
file excluding those with N's and those that are bigger or equal to 20bp in.fasta
formatgff_sequences_no_ns_bigger_than_or_eq_to_20_bp.fa
- New genome assembly in
- Script
- used
blastn
- Make sure to have
sstrand
included as a column because that takes in stand info
- Make sure to have
- used
blastn -db assemblies/new_nuclear_assembly.fasta -query gff_sequences_no_ns_bigger_than_or_eq_to_20_bp.fa -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sstrand” -out blast_outputs/features_to_new_filtered_blast.tsv
- Outputs
- Table with BLAST outputs in
.tsv
formatblast_outputs/features_to_new_filtered_blast.tsv
- Table with BLAST outputs in
- Inputs
- Table with BLAST outputs in
.tsv
formatblast_outputs/features_to_new_filtered_blast.tsv
- Table with BLAST outputs in
- Script
- used
awk
- used
awk 'BEGIN{OFS="\t"} {print $0, NR-1}' blast_outputs/features_to_new_filtered_blast.tsv > blast_outputs/features_to_new_filtered_blast_rows_appended.tsv
- Outputs
- Table with BLAST outputs and row numbers appended in
.tsv
formatblast_outputs/features_to_new_filtered_blast_rows_appended.tsv
- Note: Rows are 0-indexed
- Table with BLAST outputs and row numbers appended in
- Inputs
- Table with BLAST outputs and row numbers appended in
.tsv
formatblast_outputs/features_to_new_filtered_blast_rows_appended.tsv
- Table with BLAST outputs and row numbers appended in
- Script
utils/format/blastToBedToolsInput.py
python3 blastToBedToolsInput.py -bo blast_outputs/features_to_new_filtered_blast_rows_appended.tsv -of oldGenesBedToolsInput.txt
- Outputs
- BLAST outputs in
.bed
formatfeatures_to_new_filtered_blast.bed
- BLAST outputs in
- Inputs
- New genome assembly in
.fasta
formatassemblies/new_nuclear_assembly.fasta
- BLAST outputs in
.bed
formatfeatures_to_new_filtered_blast.bed
- New genome assembly in
- Script
- used
bedtools getfasta
- used
bedtools getfasta -fi assemblies/new_nuclear_assembly.fasta -bed features_to_new_filtered_blast.bed -name -s -fo features_to_new_filtered_blast.fasta
- Outputs
- BLAST outputs in
.fasta
formatfeatures_to_new_filtered_blast.fasta
- BLAST outputs in
- Inputs
- BLAST outputs in
.fasta
formatfeatures_to_new_filtered_blast.fasta
- BLAST outputs in
- Script
- used
awk
- used
awk -F "\n" 'BEGIN{RS=">"} {print $1}' features_to_new_filtered_blast.fasta > features_to_new_filtered_blast_list.txt
- Outputs
- List of header names from
.fasta
file (.txt
file)features_to_new_filtered_blast_list.txt
- List of header names from
- Inputs
- File with BLAST queries sequences in
.fasta
formatgff_sequences_no_ns_bigger_than_or_eq_to_20_bp.fa
- File with BLAST subject sequences in
.fasta
formatfeatures_to_new_filtered_blast.fasta
- List of header names from
.fasta
file (.txt
file)features_to_new_filtered_blast_list.txt
- Table with BLAST outputs and row numbers appended in
.tsv
formatblast_outputs/features_to_new_filtered_blast_rows_appended.tsv
- File with BLAST queries sequences in
- Script
utils/global_alignment/needleAlignBlastUpdaterV2.py
python3 utils/global_alignment/needleAlignBlastUpdaterV2.py -nf features_to_new_filtered_blast_list.txt -gf gff_sequences_no_ns_bigger_than_or_eq_to_20_bp.fa -sf features_to_new_filtered_blast.fasta -bf blast_outputs/features_to_new_filtered_blast_rows_appended.tsv -obf blast_outputs/all_blasted_to_new_assembly_POST_GLOBAL_ALIGNMENT.tsv
- Outputs
- a
.tsv
file with all the blast info post global alignmentblast_outputs/all_blasted_to_new_assembly_POST_GLOBAL_ALIGNMENT.tsv
- a
- Inputs
- new assembly without chloroplast and mitochondria in
.fasta
formatassemblies/new_nuclear_assembly.fasta
- old assembly without chloroplast and mitochondria in
.fasta
formatassemblies/old_nuclear_assembly.fasta
- new assembly without chloroplast and mitochondria in
- Script
minimap2
alignment of old to new alignment-x asm5
flag specified to optimize aligner for full genome/assembly alignment
minimap2 -x asm5 assemblies/new_nuclear_assembly.fasta assemblies/old_nuclear_assembly.fasta > minimap_outputs/oldToNewMap.paf
- Outputs
- old assembly to new assembly alignment in
.paf
formatminimap_outputs/oldToNewMap.paf
- old assembly to new assembly alignment in
-
Remove all the small and low scoring alignments.
-
A cut off of 450bp was chosen as the minimum alignment length. This is because the smallest scaffold in the old assembly with annotated data is 450bp.
-
A cutoff score of 20 was used for my alignments.
.paf
scores use Phred-scaled quality scores. Generally speaking, a Phred score of 20 or above is acceptable, because this means that whatever it qualifies is >99% accurate, with a <1% chance of error. -
Inputs
- old assembly to new assembly alignment in in
.paf
formatminimap_outputs/oldToNewMap.paf
- old assembly to new assembly alignment in in
-
Script
utils/pre-filter/remove_by_size_and_score.sh
- Take in 2 positional arguments
-
- The minimum size of mapping you want to allow
-
- The minimum quality score you want to allow
-
- Take in 2 positional arguments
./utils/pre-filter/remove_by_size_and_score.sh minimap_outputs/oldToNewMap.paf 450 20 > filtered_alignments/pre-filtered/oldToNewMap-450size-20score.paf
- Outputs
- old assembly to new assembly alignment mappings >450bp and >20 Q-score in
.paf
formatfiltered_alignments/pre-filtered/oldToNewMap-450size-20score.paf
- old assembly to new assembly alignment mappings >450bp and >20 Q-score in
Needed for the following filtering script
- Inputs
- new assembly without chloroplast and mitochondria in
.fasta
formatassemblies/new_nuclear_assembly.fasta
- new assembly without chloroplast and mitochondria in
- Script
utils/chr_sizes/get_sizes.sh
- Make sure you have
bioawk
in your environment before running this script
- Make sure you have
source ./utils/chr_sizes/get_sizes.sh assemblies/new_nuclear_assembly.fasta > ./chr_sizes/new_nuclear_assembly_chr_sizes.txt
- Outputs
- Sequences sizes of all the chromosomes in the new assembly (and
chromosome_
stripped from chromosome name); in a key, value format.- Col 1 = chromosome number
- Col 2 = chromosome size
chr_sizes/new_nuclear_assembly_chr_sizes.txt
- Sequences sizes of all the chromosomes in the new assembly (and
- Inputs
- old assembly to new assembly alignment mappings >450bp and >20 Q-score in
.paf
formatfiltered_alignments/pre-filtered/oldToNewMap-450size-20score.paf
- Sequences sizes of all the chromosomes in the new assembly (and
chromosome_
stripped from chromosome name); in a key, value format.- Col 1 = chromosome number
- Col 2 = chromosome size
chr_sizes/new_nuclear_assembly_chr_sizes.txt
- Your output file name in
.paf
format - The percentage of the size of the mapping that you're willing to allow overlap with another plasmid
a number ranging from 0-1
--noSave
if you don't want to save intermediate tables--noLoad
if you don't want to load intermediate tables
- old assembly to new assembly alignment mappings >450bp and >20 Q-score in
- Script
utils/greedy_lenient_nogaps/filterPaf_lenient_nogaps.py
python3 utils/greedy_lenient_nogaps/filterPaf_lenient_nogaps.py -f1 filtered_alignments/pre-filtered/oldToNewMap-450size-20score.paf -gs chr_sizes/new_nuclear_assembly_chr_sizes.txt -of filtered_alignments/greedy-lenient-nogaps/oldToNewMap-450size-20score_20pct-overlap_nogaps.paf --sensitivity 0.20 --noLoad
- Outputs
- old assembly to new assembly alignment mappings >450bp and >20 Q-score and updated mapping coordiantes to ensure there are no gaps in
.paf
formatfiltered_alignments/greedy-lenient-nogaps/oldToNewMap-450size-20score_20pct-overlap_nogaps.paf
- old assembly to new assembly alignment mappings >450bp and >20 Q-score and updated mapping coordiantes to ensure there are no gaps in
For each entry in the old GFF, check if its coordinates fall within a region that got remapped to the filtered old to new assembly mapping. Append which mapping it got mapped to and where it should fall on the new assembly
- Inputs
- old assembly to new assembly alignment mappings >450bp and >20 Q-score and updated mapping coordiantes to ensure there are no gaps in
.paf
formatfiltered_alignments/greedy-lenient-nogaps/oldToNewMap-450size-20score_20pct-overlap_nogaps.paf
- genome data file in
.gff
format with the length of the feature calculated and appended tocolumn 13
genome_data/old/cleaned_up_metadata_file/old_annotation_data_fully_cleaned_and_w_unique_lines_and_feature_sizes_in_last_col.gff
- old assembly to new assembly alignment mappings >450bp and >20 Q-score and updated mapping coordiantes to ensure there are no gaps in
- Script
utils/extract_gffs_from_mappings/extract_gffs_from_mappings.py
python3 utils/extract_gffs_from_mappings/extract_gffs_from_mappings.py --pafFile filtered_alignments/greedy-lenient-nogaps/oldToNewMap-450size-20score_20pct-overlap_nogaps.paf --gffFile genome_data/old/cleaned_up_metadata_file/old_annotation_data_fully_cleaned_and_w_unique_lines_and_feature_sizes_in_last_col.gff --outputFile genome_data/filtered_by_mapping/old_annotation_data_fully_cleaned_and_w_unique_lines_and_mapping_info_added-450size-20score_20pct-overlap_nogaps.gff
- Outputs
- genome data file in
.gff
format with the length of the feature calculated and appended tocolumn 13
- columns list is
["seqid", "source", "type", "start", "end", "score", "strand", "phase", "attributes", "featureLength", "line_id_on_mapping_file", "target_chrom", "mapping_start_on_old", "mapping_end_on_old"]
genome_data/filtered_by_mapping/old_annotation_data_fully_cleaned_and_w_unique_lines_and_mapping_info_added-450size-20score_20pct-overlap_nogaps.gff
- genome data file in
-
Inputs
- genome data file in
.gff
format with the length of the feature calculated and appended tocolumn 13
- columns list is
["seqid", "source", "type", "start", "end", "score", "strand", "phase", "attributes", "featureLength", "line_id_on_mapping_file", "target_chrom", "mapping_start_on_old", "mapping_end_on_old"]
genome_data/filtered_by_mapping/old_annotation_data_fully_cleaned_and_w_unique_lines_and_mapping_info_added-450size-20score_20pct-overlap_nogaps.gff
- genome data file in
-
Script
awk '{print $12}' genome_data/filtered_by_mapping/old_annotation_data_fully_cleaned_and_w_unique_lines_and_mapping_info_added-450size-20score_20pct-overlap_nogaps.gff | sort | uniq -c | awk 'BEGIN {OFS="\t"} {if ($2 ~ /^chromosome_*/) print $2, $1}' | sort -n -t "_" -k 2 | awk 'BEGIN {OFS="\t"} {total += $2; print $0} END {print "total", total}' > genome_data/new/expected_nuclear_remaps_summary.txt
- Outputs
- Expected number of features aligned to each new chromosome simply based off where that feature falls within the old assembly to new assembly alignment mappings; in a key, value format.
- Col 1 = New chromosome
- Col 2 = Expected number of features that remapped
genome_data/new/expected_nuclear_remaps_summary.txt
- Expected number of features aligned to each new chromosome simply based off where that feature falls within the old assembly to new assembly alignment mappings; in a key, value format.
Want to create a file with all the BLAST info plus the global alignment info plus the original info from the gff plus the info from the assembly to assembly mapping in order to begin performing reciprical best hit and figuring out which BLAST hits are accurate remaps of old features
- Inputs
- a
.tsv
file with all the blast info post global alignmentblast_outputs/all_blasted_to_new_assembly_POST_GLOBAL_ALIGNMENT.tsv
- a
# columns list is ["qseqid", #1, "sseqid", #2, "%_identity", #3, "alignment_length", #4, "mismatch", #5, "gapopen", #6, "qstart", #7, "qend", #8, "sstart", #9, "send", #10, "evalue", #11, "bitscore", #12, "subject_strand", #13, "line_in_og_BLAST", #14, "Needle_score", #15, "First_10_bp_matching", #16, "Last_10_bp_matching", #17, "First_9_bp_matching", #18, "Last_9_bp_matching", #19, "First_8_bp_matching", #20, "Last_8_bp_matching", #21, "First_7_bp_matching", #22, "Last_7_bp_matching", #23, "First_6_bp_matching", #24, "Last_6_bp_matching", #25, "First_5_bp_matching", #26, "Last_5_bp_matching", #27, "First_4_bp_matching", #28, "Last_4_bp_matching", #29, "First_3_bp_matching", #30, "Last_3_bp_matching", #31, "First_2_bp_matching", #32, "Last_2_bp_matching", #33, "First_1_bp_matching", #34, "Last_1_bp_matching", #35]
- .
- genome data file in
.gff
format with the length of the feature calculated and appended tocolumn 13
genome_data/filtered_by_mapping/old_annotation_data_fully_cleaned_and_w_unique_lines_and_mapping_info_added-450size-20score_20pct-overlap_nogaps.gff
- genome data file in
# columns list is ["seqid" #1, "source" #2, "type" #3, "start" #4, "end" #5, "score" #6, "strand" #7, "phase" #8, "attributes" #9, "featureLength" #10, "line_id_on_mapping_file" #11, "target_chrom" #12, "mapping_start_on_old" #13, "mapping_end_on_old" #14, "if_query_and_target_on_same_strand" #15; ‘+’ if query/target on the same strand; ‘-’ if opposite, "mapping_start_on_new" #16, "mapping_end_on_new" #17, "mapping_q_score" #18]
- Script
utils/add_gff_info_to_blast/pull_info_from_gff_into_blast.py
--noLoad
if you don't want to load intermediate tables
nohup python3 -u pull_info_from_gff_into_blast.py --blastFile blast_outputs/all_blasted_to_new_assembly_POST_GLOBAL_ALIGNMENT.tsv --gffFile old_annotation_data_fully_cleaned_and_w_unique_lines_and_mapping_info_added-450size-20score_20pct-overlap_nogaps.gff --outBlastFile blast_outputs/all_blasted_to_new_assembly_WITH_ALL_INFO_FINAL.tsv --noLoad > all.log 2> all.errlog &
- Outputs
- A
.tsv
file with all the BLAST info plus the global alignment info plus the original info from the gff plus the info from the assembly to assembly mappingblast_outputs/all_blasted_to_new_assembly_WITH_ALL_INFO_FINAL.tsv
- A
# columns list is ["qseqid", #1, "sseqid", #2, "%_identity", #3, "alignment_length", #4, "mismatch", #5, "gapopen", #6, "qstart", #7, "qend", #8, "sstart", #9, "send", #10, "evalue", #11, "bitscore", #12, "subject_strand", #13, "line_in_og_BLAST", #14, "Needle_score", #15, "First_10_bp_matching", #16, "Last_10_bp_matching", #17, "First_9_bp_matching", #18, "Last_9_bp_matching", #19, "First_8_bp_matching", #20, "Last_8_bp_matching", #21, "First_7_bp_matching", #22, "Last_7_bp_matching", #23, "First_6_bp_matching", #24, "Last_6_bp_matching", #25, "First_5_bp_matching", #26, "Last_5_bp_matching", #27, "First_4_bp_matching", #28, "Last_4_bp_matching", #29, "First_3_bp_matching", #30, "Last_3_bp_matching", #31, "First_2_bp_matching", #32, "Last_2_bp_matching", #33, "First_1_bp_matching", #34, "Last_1_bp_matching", #35, "query_feature_length", #36, "query_feature_chromosome", #37, "query_feature_start", #38, "query_feature_end", #39, "query_feature_strand", #40, "query_feature_source", #41, "query_feature_type", #42, "query_feature_score", #43, "query_feature_phase", #44, "query_feature_attributes_cleaned", #45, "mapping_row_in_non-sorted_paf_file_0_based", #46, "expected_chromosome_to_map_to", #47, "expected_start_position_to_map_to", #48, "expected_end_position_to_map_to", #49, "subject_length" #50]
Step 9: Filter the BLAST file to perform reciprical best hit and figuring out which BLAST hits are accurate remaps of old features
- Inputs
- A
.tsv
file with all the BLAST info plus the global alignment info plus the original info from the gff plus the info from the assembly to assembly mappingblast_outputs/all_blasted_to_new_assembly_WITH_ALL_INFO_FINAL.tsv
- A
# columns list is ["qseqid", #1, "sseqid", #2, "%_identity", #3, "alignment_length", #4, "mismatch", #5, "gapopen", #6, "qstart", #7, "qend", #8, "sstart", #9, "send", #10, "evalue", #11, "bitscore", #12, "subject_strand", #13, "line_in_og_BLAST", #14, "Needle_score", #15, "First_10_bp_matching", #16, "Last_10_bp_matching", #17, "First_9_bp_matching", #18, "Last_9_bp_matching", #19, "First_8_bp_matching", #20, "Last_8_bp_matching", #21, "First_7_bp_matching", #22, "Last_7_bp_matching", #23, "First_6_bp_matching", #24, "Last_6_bp_matching", #25, "First_5_bp_matching", #26, "Last_5_bp_matching", #27, "First_4_bp_matching", #28, "Last_4_bp_matching", #29, "First_3_bp_matching", #30, "Last_3_bp_matching", #31, "First_2_bp_matching", #32, "Last_2_bp_matching", #33, "First_1_bp_matching", #34, "Last_1_bp_matching", #35, "query_feature_length", #36, "query_feature_chromosome", #37, "query_feature_start", #38, "query_feature_end", #39, "query_feature_strand", #40, "query_feature_source", #41, "query_feature_type", #42, "query_feature_score", #43, "query_feature_phase", #44, "query_feature_attributes_cleaned", #45, "mapping_row_in_sorted_paf_file_0_based", #46, "expected_chromosome_to_map_to", #47, "expected_start_position_to_map_to", #48, "expected_end_position_to_map_to", #49, "subject_length" #50]
- Script
utils/filter_blast_remaps/pre_filter_blast_remaps.sh
./utils/filter_blast_remaps/pre_filter_blast_remaps.sh blast_outputs/all_blasted_to_new_assembly_WITH_ALL_INFO_FINAL.tsv
- Outputs
- A
.tsv
file with all the BLAST info plus the global alignment info plus the original info from the gff plus the info from the assembly to assembly mapping, filtered according to the following pre-filters- Pre-Filter 1: Remove all entries where the first base pair of the subject sequence is not the same as the last base pair of the query sequence
- Pre-Filter 2: Remove all entries where the last base pair of the subject sequence is not the same as the last base pair of the query sequence
- Pre-Filter 3: Remove all entries where the first 3 base pairs of the subject sequence is not the same as the first base pair of the query sequence if the sequence is a CDS or exon
- Pre-Filter 4: Remove all entries where the last 3 base pairs of the subject sequence is not the same as the last base pair of the query sequence if the sequence is a CDS or exon
- Pre-Filter 5: Remove all entries where the first 8 base pairs of the subject sequence are not the same as the first 8 base pairs of the query sequence
- Pre-Filter 6: Remove all entries where the last 8/10 base pairs of the subject sequence are not the same as the last 8 base pairs of the query sequence
- Pre-Filter 7: Remove all entries where the length of the subject sequence is 50% smaller than the length of the query sequence
- Pre-Filter 8: Remove all entries where the length of the subject sequence is 50% larger than the length of the query sequence
- Pre-Filter 9: Remove all entries where the length of the subject sequence is 10% smaller than the length of the query sequence if the query is a CDS or exon
- Pre-Filter 10: Remove all entries where the length of the subject sequence is 10% larger than the length of the query sequence if the query is a CDS or exon
blast_outputs/all_blasted_to_new_assembly_WITH_ALL_INFO_FINAL_and_PRE_FILTERED.tsv
- A
2. Filter the BLAST file to make sure the BLAST mappings fall within a region deemed acceptable by the old assembly to new assembly alignment
-
Inputs
- A prefiltered
.tsv
file with all the BLAST info plus the global alignment info plus the original info from the gff plus the info from the assembly to assembly mappingblast_outputs/all_blasted_to_new_assembly_WITH_ALL_INFO_FINAL_and_PRE_FILTERED.tsv
- old assembly to new assembly alignment mappings >450bp and >20 Q-score and updated mapping coordiantes to ensure there are no gaps in
.paf
formatfiltered_alignments/greedy-lenient-nogaps/oldToNewMap-450size-20score_20pct-overlap_nogaps.paf
--noLoad
if you don't want to load intermediate tables
- A prefiltered
-
Script
utils/filter_blast_remaps/filter_blast_remaps.py
python3 utils/filter_blast_remaps/filter_blast_remaps.py --blastFile blast_outputs/all_blasted_to_new_assembly_WITH_ALL_INFO_FINAL_and_PRE_FILTERED.tsv --pafFile filtered_alignments/greedy-lenient-nogaps/oldToNewMap-450size-20score_20pct-overlap_nogaps.paf --outFile ./blast_outputs/all_blasted_to_new_assembly_WITH_ALL_INFO_FINAL_and_FILTERED_POST_SYNTENY_CHECK.tsv --noLoad
- Outputs
- A filtered
.tsv
file with all the BLAST info plus the global alignment info plus the original info from the gff plus the info from the assembly to assembly mapping, filtered to remove blast hits that don't align with old assembly to new assembly alignment synteny./blast_outputs/all_blasted_to_new_assembly_WITH_ALL_INFO_FINAL_and_FILTERED_POST_SYNTENY_CHECK.tsv
- A filtered
-
Inputs
- A filtered
.tsv
file with all the BLAST info plus the global alignment info plus the original info from the gff plus the info from the assembly to assembly mapping, filtered to remove blast hits that don't align with old assembly to new assembly alignment synteny./blast_outputs/all_blasted_to_new_assembly_WITH_ALL_INFO_FINAL_and_FILTERED_POST_SYNTENY_CHECK.tsv
- .
- ``
- A filtered
-
Script
utils/filter_blast_remaps/filter_blast_remaps.Rmd
- Run in R Studio
-
Outputs
- A
.gff
file of all the features that remapped (without their coordinates updated)genome_data/new/remapped_features_COORDS_NOT_UPDATED.gff
- A
.gff
file of all the features that were found in the blast outputs but that did not successfully remap (without their coordinates updated)genome_data/new/genome_data/new/non_remapped_features_FROM_BLAST_OUTPUT_ONLY.gff
- A
.gff
file of all the features that remapped, merged with all the data at from the blast table (merged by the id column)genome_data/new/merged_remapped_table_with_BLAST_info.gff
- A
# columns list is ["seqid", #1, "source", #2, "type", #3, "start", #4, "end", #5, "score", #6, "strand", #7, "phase", #8, "q_metadata", #9, "qseqid", #10, "sseqid", #11, "pident", #12, "length", #13, "mismatch", #14, "gapopen", #15, "qstart", #16, "qend", #17, "sstart", #18, "send", #19, "evalue", #20, "bitscore", #21, "sstrand", #22, "order", #23, "needle_score", #24, "first_10", #25, "last_10", #26, "first_9", #27, "last_9", #28, "first_8", #29, "last_8", #30, "first_7", #31, "last_7", #32, "first_6", #33, "last_6", #34, "first_5", #35, "last_5", #36, "first_4", #37, "last_4", #38, "first_3", #39, "last_3", #40, "first_2", #41, "last_2", #42, "first_1", #43, "last_1", #44, "q_length", #45, "q_chr", #46, "q_old_start", #47, "q_old_end", #48, "q_old_strand", #49, "q_old_source", #50, "q_old_type", #51, "q_old_score", #52, "q_old_phase", #53, "mapping_row_in_paf_0i", #54, "expected_chromosome", #55, "mapping_start_on_old_chromosome", #56, "mapping_end_on_old_chromosome", #57, "if_query_and_target_on_same_strand", #58, "mapping_start_on_new_chromosome", #59, "mapping_end_on_new_chromosome", #60, "mapping_q_score", #61, "s_length", #62, "in_acceptable_region", #63, "orientation_check" #64]
Update and clean up the remapped GFF file so that each remapped feature has its new chromosome and new start and end coordinates corresponding with the new assembly. Also order the file first by the chromosome number, and then by the start coordinates of the features (ascending order). Then append all the chloroplast and mitochondrial features as well and then finally replace the cleaned up metadata with the original ones.
- Inputs
- A
.gff
file of all the features that remapped, merged with all the data at from the blast table (merged by the id column)genome_data/new/merged_remapped_table_with_BLAST_info.gff
- A
- Script
utils/update_remapping_coordinates/update_remapping_deails.sh
./utils/update_remapping_coordinates/update_remapping_deails.sh genome_data/new/merged_remapped_table_with_BLAST_info.gff
- Outputs
- A
.gff
file with the successfully remapped features updated with their new chromosomes and new start and end coordinates corresponding with the new assembly.genome_data/new/final_remapped_data_updated_coordinates_minus_cp_and_mt_minus_original_metadata.gff
- Also outputs an intermediate
.gff
file with all the blast info still appendedgenome_data/new/final_remapped_data_updated_coordinates_plus_all_blast_data_minus_cp_and_mt_minus_original_metadata.gff
- A
2. Sort the file first by the chromosome number, and then by the start coordinates of the features (ascending order)
- Inputs
- A
.gff
file with the successfully remapped features updated with their new chromosomes and new start and end coordinates corresponding with the new assembly.genome_data/new/final_remapped_data_updated_coordinates_minus_cp_and_mt_minus_original_metadata.gff
- A
- Script
utils/update_remapping_coordinates/sort_gff.R
./utils/update_remapping_coordinates/sort_gff.R genome_data/new/final_remapped_data_updated_coordinates_minus_cp_and_mt_minus_original_metadata.gff
- Outputs
- A
.gff
file with the successfully remapped features updated with their new chromosomes and new start and end coordinates corresponding with the new assembly, sorted by the chromosome number, and then by the start coordinates of the features in ascending ordergenome_data/new/final_remapped_data_updated_coordinates_sorted_minus_cp_and_mt_minus_original_metadata.gff
- A
-
Inputs
- A
.gff
file with the successfully remapped features updated with their new chromosomes and new start and end coordinates corresponding with the new assembly, sorted by the chromosome number, and then by the start coordinates of the features in ascending ordergenome_data/new/final_remapped_data_updated_coordinates_sorted_minus_cp_and_mt_minus_original_metadata.gff
- A
.gff
file containing all the features from the mitochondrial genome (needs to have cleaned up metadata)genome_data/old/cleaned_up_metadata_file/mt_annotation_data_fully_cleaned_and_w_unique_lines.gff
- A
.gff
file containing all the features from the chloroplast genome (needs to have cleaned up metadata)genome_data/old/cleaned_up_metadata_file/cp_annotation_data_fully_cleaned_and_w_unique_lines.gff
- A
-
Script
utils/update_remapping_coordinates/add_mt_and_cp_details.sh
- Takes in 3 positional arguments
-
- The
.gff
file to be updated
- The
-
- The
.gff
file with the mitochondria annotation data
- The
-
- The
.gff
file with the chloroplast annotation data
- The
-
- Takes in 3 positional arguments
./utils/update_remapping_coordinates/add_mt_and_cp_details.sh genome_data/new/final_remapped_data_updated_coordinates_sorted_minus_cp_and_mt_minus_original_metadata.gff genome_data/old/cleaned_up_metadata_file/mt_annotation_data_fully_cleaned_and_w_unique_lines.gff genome_data/old/cleaned_up_metadata_file/cp_annotation_data_fully_cleaned_and_w_unique_lines.gff
- Outputs
- A
.gff
file with the successfully remapped features updated with their new chromosomes and new start and end coordinates corresponding with the new assembly, sorted by the chromosome number, and then by the start coordinates of the features in ascending order, updated with the mitochondria and chloroplast annotation datagenome_data/new/final_remapped_data_updated_coordinates_sorted_plus_cp_and_mt_minus_original_metadata.gff
- A
4.a. Create a key value pair file for old metadata (unique IDs to what they should be replaced with)
# Remove headers from the old features file
awk 'BEGIN {FS = "\t"} {if (NF == 9) print $0}' genome_data/old/uncleaned_metadata_file/phatr3_gene_models_with_href.gff3 > genome_data/old/uncleaned_metadata_file/phatr3_gene_models_with_href_HEADERS_REMOVED.gff3
# Create key value pairs of cleaned up IDS to what they should be replaced with at the end
paste cleaned_up_metadata_file/old_annotation_data_fully_cleaned_and_w_unique_lines.gff genome_data/old/uncleaned_metadata_file/phatr3_gene_models_with_href_HEADERS_REMOVED.gff3 | awk 'BEGIN {FS = "\t"; OFS = "\t"} {print $9, $19}' > txts/new_id_to_old_id_kv_pairs.txt
- Inputs
- A
.gff
file with the successfully remapped features updated with their new chromosomes and new start and end coordinates corresponding with the new assembly, sorted by the chromosome number, and then by the start coordinates of the features in ascending order, updated with the mitochondria and chloroplast annotation datagenome_data/new/final_remapped_data_updated_coordinates_sorted_plus_cp_and_mt_minus_original_metadata.gff
- A key value pair file for old metadata
- Keys: unique IDs
- Values: original metadata that the unique ids should be replaced with
txts/new_id_to_old_id_kv_pairs.txt
- A
- Script
utils/update_remapping_coordinates/revert_to_original_metadata.py
python3 utils/update_remapping_coordinates/revert_to_original_metadata.py --gffFile genome_data/new/final_remapped_data_updated_coordinates_sorted_plus_cp_and_mt_minus_original_metadata.gff --metaTable txts/new_id_to_old_id_kv_pairs.txt --outputGffFile genome_data/new/final_remapped_data_updated_coordinates_sorted_plus_cp_and_mt_plus_original_metadata.gff
- Outputs
- A
.gff
file with the successfully remapped features updated with their new chromosomes and new start and end coordinates corresponding with the new assembly, sorted by the chromosome number, and then by the start coordinates of the features in ascending order, updated with the mitochondria and chloroplast annotation data, and with original metadatagenome_data/new/final_remapped_data_updated_coordinates_sorted_plus_cp_and_mt_plus_original_metadata.gff
- A
Manually add whole chromosome data to the correct lines of the remapped .gff
(before the features start for the given chromsome). The start coordinate is 1 and the end coordinate is the length of the chromosome, and the strand is "."
-
Inputs
- A
.gff
file with the successfully remapped features updated with their new chromosomes and new start and end coordinates corresponding with the new assembly, sorted by the chromosome number, and then by the start coordinates of the features in ascending order, updated with the mitochondria and chloroplast annotation data, and with original metadatagenome_data/new/final_remapped_data_updated_coordinates_sorted_plus_cp_and_mt_plus_original_metadata.gff
- A
-
Script
utils/update_remapping_coordinates/revert_to_original_metadata.py
# For gff with original metadata
python3 utils/update_remapping_coordinates/add_whole_chromosome_data.py --gffFile genome_data/new/final_remapped_data_updated_coordinates_sorted_plus_cp_and_mt_plus_original_metadata.gff --outputFile genome_data/new/final_remapped_data.gff
# For gff with cleaned up metadata
python3 utils/update_remapping_coordinates/add_whole_chromosome_data.py --gffFile genome_data/new/final_remapped_data_updated_coordinates_sorted_plus_cp_and_mt_minus_original_metadata.gff --outputFile genome_data/new/final_remapped_data_minus_original_metadata.gff
- Outputs
- A
.gff
file with the successfully remapped features updated with their new chromosomes and new start and end coordinates corresponding with the new assembly, sorted by the chromosome number, and then by the start coordinates of the features in ascending order, updated with the mitochondria and chloroplast annotation data, with original metadata, and with chromosome informationgenome_data/new/final_remapped_data.gff
- If you choose to keep the cleaned up metadata, can be found here
genome_data/new/final_remapped_data_minus_original_metadata.gff
- A