Bootstrapping requires a lot of resources? #81

Open
goeckeritz opened this issue Oct 19, 2023 · 1 comment
Comments

@goeckeritz

Hi Lee,

Thanks so much for making this handy tool. I'm trying to use it to calculate genetic differences for some very heterozygous plant species. I am running it in accurate mode (--mindepth 0) with 100 bootstrap replicates, but it seems to be taking quite a long time. So I wanted to do a sanity check and ask whether I'm fundamentally misunderstanding how the bootstrapping is done.

According to the log file, mashtree successfully identifies the valleys of very rare kmers of several sizes and is able to calculate the distances within about five hours (at least 48 threads, 4G - 8G per thread). But when it gets to bootstrapping... it takes more than a day or two. In fact I haven't seen it finish yet. I guess I was surprised by this because I thought the sketches were created first and random subsampling then occurs with the already-made sketches to create the bootstrapped tree; I wasn't expecting this process to be especially CPU time intensive.

For some context, I'm using the conda-installed version 1.4.5 -- https://anaconda.org/bioconda/mashtree
And when I initially install it, the mashtree_bootstrap.pl script is angry about not having List::MoreUtils installed, which I then install in the same environment with this conda recipe -- https://anaconda.org/bioconda/perl-list-moreutils

I mention this because it seems bootstrapping does its multi-threading using perl, and I wonder if there is an issue in my installation. I know you don't oversee the conda or docker installations but I think if I could at least understand the fundamentals of how it is bootstrapping that may give me an idea on how to fix/handle this.

I'm identifying kmers in about 250Gb of compressed short read (Illumina) fastq.gz data. Here's the script I'm running, in case that's helpful:

#!/bin/sh --login
#SBATCH -J case0
#SBATCH --nodes=1
#SBATCH --ntasks=52
#SBATCH --mem-per-cpu=8g
#SBATCH --time=48:00:00
#SBATCH -o /mnt/scratch/goeckeri/mashtree/mashtree_original_cov_per_hap%j

module purge
conda activate mashtree

cd /mnt/scratch/goeckeri/mashtree/

mashtree_bootstrap.pl --outmatrix case0_dist --reps 100 --numcpus 52 --file-of-files case0_files.txt -- --sort-order random \
--genomesize 750000000 --mindepth 0 --kmerlength 25 --sketch-size 10000 > case0_bs_tree.dnd

Thanks so much for your time -- I appreciate any help/coaching you might be able to give!

Kindly,
Charity

@lskatz
Owner

lskatz commented Nov 3, 2023

Hi, I am sorry for the frustration. I don't immediately see anything wrong with how you are running it. Your command looks good and it looks like it's in the right framework. That said, it is possible that you are hitting some disk I/O bandwidth issues if you are running 52 CPUs. Even though you are on the scratch drive and even though the way it runs is embarrassingly parallel, you could be maxing out how much your disk can handle. I would recommend seeing what happens if you run it with 8 CPUs and/or if you can set --tempdir /dev/shm (if you have enough RAM).
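A minimal sketch of that suggestion, reusing the options from the command above. It only prints the command it would run (a dry run), so nothing is executed. The variable names NCPU and RAM_TMP are illustrative, and placing --tempdir before the `--` separator (so that it goes to mashtree_bootstrap.pl rather than to mashtree) is an assumption worth checking against `mashtree_bootstrap.pl --help`.

```shell
#!/bin/sh
# Dry-run sketch: fewer CPUs plus a RAM-backed temp directory.
# NCPU and RAM_TMP are illustrative names, not mashtree settings.
NCPU=8
RAM_TMP=/dev/shm

# Report free space in the RAM-backed directory (in KB), if it exists.
# Temp files written here count against available memory.
avail_kb=$(df -Pk "$RAM_TMP" 2>/dev/null | awk 'NR==2 {print $4}')
echo "Free space in $RAM_TMP: ${avail_kb:-unknown} KB"

# Assumption: --tempdir is given before the "--" separator.
cmd="mashtree_bootstrap.pl --outmatrix case0_dist --reps 100 --numcpus $NCPU --tempdir $RAM_TMP --file-of-files case0_files.txt -- --sort-order random --genomesize 750000000 --mindepth 0 --kmerlength 25 --sketch-size 10000"

# Print the command instead of running it; remove the echo wrapper
# and redirect to case0_bs_tree.dnd once the settings look right.
echo "$cmd > case0_bs_tree.dnd"
```

If the job is submitted through SLURM as in the original script, the --ntasks request would presumably need to be lowered to match the smaller --numcpus value.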

Let me know how it goes.
