Even if the genome is small, shouldn't too many samples be added to build the graph #316

Wwwwwwwyc · 2023-06-28T07:22:54Z

Hello! Sorry I'm bothering you again.

The genome size of this species is about 12 Mb, and I partitioned it by chromosome in advance. The following command ran quickly with 34 samples:
for i in 1 2 3 4 5 6 7 ;do pggb -i chr${i}_pan.fa -o chr${i}.pan -n 34 -t 128 -p 90 -s 5000 -S -m -V '$ref:#' ;done

However, when I tried to build the graph with 1589 samples, the command stalled at the seqwish indexing step:
for i in 1 2 3 4 5 6 7 ;do pggb -i chr${i}_pan.fa -o chr${i}.pan -n 1589 -t 128 -p 90 -s 5000 -S -m -V '$ref:#' ;done

I used top to check the status of seqwish, which was shown as D (uninterruptible sleep).

I'm confused because free -g shows that plenty of memory is still available:

              total        used        free      shared  buff/cache   available
Mem:           1984          51         298           0        1634        1927
Swap:             3           0           3
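
A process in state D is blocked inside the kernel, most often waiting on disk I/O, so free memory alone does not rule out a stall. Below is a minimal sketch for inspecting such a process, assuming a Linux system with procps `ps`. The helper names `describe_state` and `report_procs` are hypothetical and not part of pggb or seqwish.

```shell
#!/bin/sh
# Hypothetical helper: explain the first letter of the STAT column from ps/top.
describe_state() {
  case "$1" in
    R*) echo "running" ;;
    S*) echo "interruptible sleep" ;;
    D*) echo "uninterruptible sleep (usually waiting on I/O)" ;;
    T*) echo "stopped" ;;
    Z*) echo "zombie" ;;
    *)  echo "other" ;;
  esac
}

# Print PID, state and kernel wait channel for every process with the given
# command name. Prints nothing if no such process is running.
report_procs() {
  ps -C "$1" -o pid=,stat=,wchan= | while read -r pid stat wchan; do
    echo "$pid $stat ($(describe_state "$stat")) wchan=$wchan"
  done
}
```

Running `report_procs seqwish` while the job appears stuck would show whether it is still in state D; on some kernels the wait channel is reported only as `-`.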


subwaystation · 2023-06-28T07:45:05Z

seqwish uses disk-backed data structures. Maybe you ran out of disk space?

Wwwwwwwyc · 2023-06-28T08:00:16Z

There is 15T of disk space under the working path, and seqwish writes about 1T of intermediate files there. Maybe the problem is that I did not set the temp dir, so the alignment results are written to the working path while the index files end up somewhere else? I have now set -D to the current working path to test this. Thanks for your help.

ekg · 2023-06-28T11:12:53Z

To be used efficiently by seqwish, the disk needs to be local and ideally SSD. It should support efficient random access.

If you do not have such a disk, one option is to create a ramdisk and use that as the scratch directory.

Take care not to run seqwish on networked storage with high latencies. This will cause the kind of problem you're seeing.
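
To check the advice above against a given scratch directory, a rough sketch follows. It assumes Linux with GNU `df`. The helper names `fs_kind` and `check_scratch` are hypothetical, and the list of network filesystem types is illustrative rather than exhaustive.

```shell
#!/bin/sh
# Hypothetical helper: classify a filesystem type string.
# The network list is an assumption for illustration and is not exhaustive.
fs_kind() {
  case "$1" in
    nfs|nfs4|cifs|smb3|lustre|gpfs|beegfs|ceph|glusterfs|fuse.sshfs) echo network ;;
    tmpfs|ramfs) echo ram ;;
    *) echo local ;;
  esac
}

# Print the filesystem type backing a directory and how it is classified.
check_scratch() {
  fstype=$(df -PT "$1" | awk 'NR==2 {print $2}')
  echo "$1: fstype=$fstype kind=$(fs_kind "$fstype")"
}

# Further manual checks (not run here):
#   lsblk -d -o NAME,ROTA     # ROTA=1 marks a rotational disk (HDD)
# Creating a ramdisk needs root; the size and mount point are placeholders:
#   sudo mount -t tmpfs -o size=<SIZE> tmpfs /mnt/ramdisk
```

For example, `check_scratch chr1.pan` reports what the output directory sits on. If a ramdisk is used, it would be passed to pggb with the -D temp-dir option mentioned earlier in the thread; it must be large enough to hold the seqwish intermediate files and leave RAM for the process itself.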

Exactly where did the seqwish job slow down? Would you share some of the log?

Wwwwwwwyc · 2023-06-30T03:21:39Z

Thanks for your help. We reran the pipeline and have just reached the part where it slows down.

We are running on a local HDD.

seqwish slows down during indexing. The log is as follows:

[seqwish::seqidx] 0.000 indexing sequences
[seqwish::seqidx] 40.197 index built
[seqwish::alignments] 40.197 processing alignments
[seqwish::alignments] 20878.751 indexing

top

NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 0  0.836t 0.831t 0.830t D  16.8 42.9  45082:29 seqwish

The output files look like this:

92K	4I3HBy.sqi
6.3G	RmtjUs.sqq
845G	vXXiRd.sqa

Could this be because I aligned a soft-masked genome?

seqwish --version reports v0.7.9-0-gd9e7ab5

ekg · 2023-06-30T04:00:18Z

I thought we had fixed the issue with softmasking.


Wwwwwwwyc · 2023-07-28T07:39:44Z

Hello! Sorry to bother you again. This time we waited for the run to finish. We ran:

pggb -i chr1_pan_filter_shorter_than_50k.fa -o chr1.pan -n 1589 -t 128 -p 90 -s 5000 -S -m -V ref:#

but got the following error:

[wfmash::skch::Map::mapQuery] mapped 100.00% @ 2.59e+06 bp/s elapsed: 00:00:43:05 remain: 00:00:00:00
[wfmash::skch::Map::mapQuery] count of mapped reads = 2225, reads qualified for mapping = 2228, total input reads = 2228, total input bp = 6704481258
[wfmash::map] time spent mapping the query: 2.59e+03 sec
[wfmash::map] mapping results saved in: /dev/stdout
wfmash -s 5000 -l 25000 -p 90 -n 1588 -k 19 -H 0.001 -X -t 128 --tmp-base chr1.pan chr1_pan_50k.fa --approx-map
322888.33s user 1654.82s system 12388% cpu 2619.72s total 19614516Kb max memory
[wfmash::align] Reference = [chr1_pan_50k.fa]
[wfmash::align] Query = [chr1_pan_50k.fa]
[wfmash::align] Mapping file = chr1.pan/wfmash-1331Yx
[wfmash::align] Alignment identity cutoff = 0.72%
[wfmash::align] Alignment output file = /dev/stdout
[wfmash::align] time spent loading the reference index: 0.164741 sec
[wfmash::align::computeAlignments] aligned 100.00% @ 4.48e+07 bp/s elapsed: 01:07:41:59 remain: 00:00:00:00
[wfmash::align::computeAlignments] count of mapped reads = 2228, total aligned bp = 5109032990714
[wfmash::align] time spent computing the alignment: 1.14e+05 sec
[wfmash::align] alignment results saved in: /dev/stdout
wfmash -s 5000 -l 25000 -p 90 -n 1588 -k 19 -H 0.001 -X -t 128 --tmp-base chr1.pan chr1_pan_50k.fa -i chr1.pan/chr1_pan_50k.fa.ee441bc.mappings.wfmash.paf --invert-filtering
14407243.04s user 128933.60s system 12733% cpu 114153.25s total 14749796Kb max memory
[seqwish::seqidx] 0.000 indexing sequences
[seqwish::seqidx] 38.848 index built
[seqwish::alignments] 38.848 processing alignments
[seqwish::alignments] 23203.418 indexing
[seqwish::alignments] 1714820.808 index built
[seqwish::transclosure] 1714820.852 computing transitive closures
[seqwish::transclosure] 1714823.517 0.00% 0-10000000 overlap_collect
Command terminated by signal 9
seqwish -s chr1_pan_50k.fa -p chr1.pan/chr1_pan_50k.fa.ee441bc.alignments.wfmash.paf -k 19 -f 0 -g chr1.pan/chr1_pan_50k.fa.ee441bc.417fcdf.seqwish.gfa -B 10000000 -t 128 --temp-dir chr1.pan -P
4914907.12s user 80460.79s system 285% cpu 1750681.50s total 1269699096Kb max memory

Does this seem to be caused by insufficient memory?

We are confused because we had already partitioned the data and have only about 5 Mb of sequence per sample. Perhaps it is currently difficult to build graphs from on the order of 1,000 samples?

ekg · 2023-07-28T08:45:28Z

This does look like an out of memory situation. We are working on the sequence space partitioning. Hope to have a solution in the coming month or two. In the meantime, you might use a reference guided approach to collect smaller amounts of sequence.
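
A back-of-the-envelope sketch of why a small genome can still exhaust memory at this sample count. It assumes that alignment work grows roughly with the number of unordered sample pairs, which is a simplification, since the mappings actually kept depend on -n and -p. All figures are copied from the commands and log in this thread, and the "Kb" memory figure is assumed to be in KiB.

```shell
#!/bin/sh
# Number of unordered pairs among n samples: n*(n-1)/2.
pairs() { echo $(( $1 * ($1 - 1) / 2 )); }

small=$(pairs 34)      # sample count of the first run
large=$(pairs 1589)    # sample count of the second run
echo "pairs: $small vs $large (about $(( large / small ))x more)"

# Totals reported by wfmash in the log.
input_bp=6704481258
aligned_bp=5109032990714
echo "aligned bp per input bp: $(( aligned_bp / input_bp ))"

# Peak memory reported for seqwish, converted from KiB to GiB.
peak_kib=1269699096
echo "seqwish peak memory: $(( peak_kib / 1024 / 1024 )) GiB"
```

Under that assumption, going from 34 to 1589 samples multiplies the number of pairs by more than 2,000, and each input base is aligned several hundred times over, so per-sample genome size is not the limiting quantity.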


Wwwwwwwyc · 2023-07-28T08:50:56Z

Thanks for your help. We have already used a reference-guided approach like the one in https://github.com/pangenome/HPRCyear1v2genbank, but we'll try smaller partitions.

Wwwwwwwyc changed the title ~~Even if the genome is small, shouldn't too many samples be added for genome-wide alignment?~~ Even if the genome is small, shouldn't too many samples be added to build the graph Jun 28, 2023
