-
Notifications
You must be signed in to change notification settings - Fork 14
Create a toy dataset
The goal is to create a small working dataset we can use to test pandora’s full workflow.
The genes I will use are selected from the consensus sequence for the Cardio sample /nfs/leia/research/iqbal/mbhall/Projects/Pandora_variation/Analysis/Pandora/local_assembly_test/Cardio/H152180645/compare_consensus_to_ref/H152180645_pandora.consensus.fa
GC00002914
GC00000085_5
GC00000434_2
Pull the PRG sequences for these genes out of the Cardio PRG
prg=/nfs/leia/research/iqbal/rmcolq/projects/pangenome_prg/ecoli/290818_genes_only/ecoli_pangenome_PRG_290818.fa
toy_prg=/nfs/leia/research/iqbal/mbhall/Projects/Pandora_variation/Analysis/Pandora/toy_prg.fa
cat "$prg" | grep -A 1 --no-group-separator 'GC00002914\|GC00000085_5\|GC00000434_2' > "$toy_prg"
Now I will make a "reference" file composed of the pandora consensus sequences for each of the genes and map a Cardio sample to it so I can create a toy fastq with some reads that do and dont map to these genes.
fa=/nfs/leia/research/iqbal/mbhall/Projects/Pandora_variation/Analysis/Pandora/local_assembly_test/Cardio/H152180645/compare_consensus_to_ref/H152180645_pandora.consensus.fa
ref=/nfs/leia/research/iqbal/mbhall/Projects/Pandora_variation/Analysis/Pandora/toy_ref.fa
cat "$fa" | grep -A 1 --no-group-separator 'GC00002914\|GC00000085_5\|GC00000434_2' > "$ref"
Mapping a fastq of nanopore reads to the toy reference
fq=/hps/nobackup/iqbal/mbhall/Pandora_variation/Data/Cardio/ecoli20170616/data/porechopped/BC01.fastq.gz
minimap2 -ax map-ont "$ref" "$fq" | samtools view -b toy.bam
samtools sort -o toy.sorted.bam toy.bam
samtools index toy.sorted.bam
Wrote a script to randomly select a given number of reads that map (100 default) and a given number that dont map (100 default) to this toy reference /nfs/leia/research/iqbal/mbhall/Projects/Pandora_variation/Analysis/Pandora/subset_bam.py
.
Ran this script to generate the toy fastq file /nfs/leia/research/iqbal/mbhall/Projects/Pandora_variation/Analysis/Pandora/toy.fastq
script=/nfs/leia/research/iqbal/mbhall/Projects/Pandora_variation/Analysis/Pandora/subset_bam.py
out=/nfs/leia/research/iqbal/mbhall/Projects/Pandora_variation/Analysis/Pandora/toy.fastq
python3 "$script" --input toy.sorted.bam --output "$out"