
ERROR ~ Error executing process > 'pipeline:reference_assembly:map_reads (1)' #121

Open
physnano opened this issue Sep 30, 2024 · 5 comments
Labels: question (Further information is requested)


physnano commented Sep 30, 2024

My workflow keeps failing at the reference_assembly:map_reads step:

ERROR ~ Error executing process > 'pipeline:reference_assembly:map_reads (1)'

Caused by:
  Process `pipeline:reference_assembly:map_reads (1)` terminated with an error exit status (140)

Command executed:

  minimap2 -t 1 -ax splice -uf genome_index.mmi seqs.fastq.gz \
      | samtools view -q 40 -F 2304 -Sb - \
      | seqkit bam -j 1 -x -T 'AlnContext: { Ref: "GRCh38.primary_assembly.genome.fa", LeftShift: -24, RightShift: 24, RegexEnd: "[Aa]{8,}", Stranded: True, Invert: True, Tsv: "internal_priming_fail.tsv"} ' - \
      | samtools sort --write-index -@ 1 -o "E3_rep2_reads_aln_sorted.bam##idx##E3_rep2_reads_aln_sorted.bam.bai" - ;
  ((cat "E3_rep2_reads_aln_sorted.bam" | seqkit bam -s -j 1 - 2>&1)  | tee E3_rep2_read_aln_stats.tsv ) || true
  
  # Add sample id header and column
  sed "s/$/E3_rep2/" "E3_rep2_read_aln_stats.tsv"         | sed "1 s/E3_rep2/sample_id/" > tmp
  mv tmp "E3_rep2_read_aln_stats.tsv"
  
  if [[ -s "internal_priming_fail.tsv" ]];
      then
          tail -n +2 "internal_priming_fail.tsv" | awk '{print ">" $1 "\n" $4 }' - > "context_internal_priming_fail_start.fasta"
          tail -n +2 "internal_priming_fail.tsv" | awk '{print ">" $1 "\n" $6 }' - > "context_internal_priming_fail_end.fasta"
  fi

Command exit status:
  140

Command output:
  (empty)

Error code 140 suggests a memory/CPU constraint imposed by the scheduler; however, adding the following to the config file has not resolved the issue:

process {
    withName: 'makeReport' {
        queue = 'himem'
        memory = '512.GB'
    }

    withName: 'reference_assembly:map_reads' {
        memory = '32.GB'
    }
}

This only results in the following warning:

WARN: There's no process matching config selector: reference_assembly:map_reads
nrhorner (Contributor) commented:
Hi @physnano

Only the process name, without the workflow prefix, should be used in the process selector, like so:

    withName: 'map_reads' {
        memory = '32.GB'
    }
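
If the step still gets killed as inputs grow, a common Nextflow pattern is to retry the process with escalating resources. A minimal sketch (the memory figure and maxRetries value are illustrative, not workflow defaults):

process {
    withName: 'map_reads' {
        // task.attempt starts at 1, so the first run gets 32 GB,
        // the first retry 64 GB, and so on
        memory = { 32.GB * task.attempt }
        errorStrategy = 'retry'
        maxRetries = 2
    }
}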

physnano (Author) commented Oct 3, 2024

Thanks @nrhorner, that, along with clusterOptions = '--qos=long', seemed to help. Now, however, I am seeing the following:

ERROR ~ Error executing process > 'pipeline:split_bam (2)'

Caused by:
  Process `pipeline:split_bam (2)` terminated with an error exit status (137)

Command executed:

  n=`samtools view -c isob11_rep2_reads_aln_sorted.bam`
  if [[ n -lt 1 ]]
  then
      echo 'There are no reads mapping for isob11_rep2. Exiting!'
      exit 1
  fi
  
  re='^[0-9]+$'
  
  if [[ 50000 =~ $re ]]
  then
      echo "Bundling up the bams"
      seqkit bam -j 4 -N 50000 isob11_rep2_reads_aln_sorted.bam -o  bam_bundles/
      let i=1
      for b in bam_bundles/*.bam; do
          echo $b
          newname="isob11_rep2_batch_${i}.bam"
          mv $b $newname
         ((i++))
      done
  else
      echo 'no bundling'
      ln -s isob11_rep2_reads_aln_sorted.bam isob11_rep2_batch_1.bam
  fi

Command exit status:
  137

It seems that many of the steps of this workflow do not have sufficient default memory allocated to the (sub)processes...
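
Exit status 137 is SIGKILL (128 + 9), which on most schedulers means the out-of-memory killer, so the same bare-name selector fix should apply here too. A minimal sketch (the 16.GB figure is an illustrative starting point to tune against your data, not a workflow default):

process {
    withName: 'split_bam' {
        // raise the allocation for the BAM-splitting step
        memory = '16.GB'
    }
}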

nrhorner (Contributor) commented:
Hi @physnano

Ok, thanks for the update. We will review memory allocations for this workflow. Would you be able to share a bit of information about your data? How many samples and what total number of reads are you using? Also, which version of the workflow are you on, and what command did you use?

Thanks,

Neil

physnano (Author) commented:

Hi @nrhorner, in my case 3 replicates for 2 samples (6 total) were split across two PromethION flow cells, so ~40-50M raw reads per individual barcode. The makeReport step spikes to ~200 GB of memory according to my monitoring. I am using the latest version, v1.4.0. Command used:

nextflow run ${wfPath}wf-transcriptomes \
    --fastq ${fqPath} \
    --de_analysis \
    --ref_genome ${refPath}GRCh38.primary_assembly.genome.fa \
    --ref_annotation ${refPath}gencode.v46.primary_assembly.annotation.gtf \
    --ref_transcriptome ${refPath}gencode.v46.transcripts.fa \
    --sample_sheet ${wfPath}sample_sheet.csv \
    --cdna_kit SQK-PCB114 \
    --out_dir ${wfPath}outdir-de \
    -profile singularity \
    -c ${wfPath}wf-transcriptomes/nextflow.config \
    --threads 4 \
    -resume
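
For reference, a consolidated sketch of the custom config passed via -c above, reflecting the fixes from this thread (bare process names in the selectors; the values are what worked for this dataset and may need tuning elsewhere):

process {
    withName: 'makeReport' {
        // the report step spiked to ~200 GB on this dataset
        queue = 'himem'
        memory = '512.GB'
    }

    withName: 'map_reads' {
        memory = '32.GB'
        clusterOptions = '--qos=long'
    }
}

clusterOptions is shown per-process here, but it can equally be set once at the top of the process scope if every job should request the long QOS.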

nrhorner (Contributor) commented Nov 6, 2024

Hi @physnano

It's not good that the report generation step is using so much memory. I will investigate this.
