Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible community detection bug #277

Open
subwaystation opened this issue Jan 16, 2023 · 5 comments
Open

Possible community detection bug #277

subwaystation opened this issue Jan 16, 2023 · 5 comments

Comments

@subwaystation
Copy link
Member

Hi @AndreaGuarracino,

when I am running the default pggb with DRB1-3123 I do get one graph component aka one community:

image

However, running the partition-before-pggb script, I get 2 communites:

docker run -it -v ${PWD}/data/:/data ghcr.io/pangenome/pggb:latest /bin/bash -c "partition-before-pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -o /data/HLA/outputtt -n 12"

image

@subwaystation
Copy link
Member Author

[wfmash::map] Reference = [/data/HLA/DRB1-3123.fa.gz]
[wfmash::map] Query = [/data/HLA/DRB1-3123.fa.gz]
[wfmash::map] Kmer size = 19
[wfmash::map] Window size = 30
[wfmash::map] Segment length = 5000 (read split allowed)
[wfmash::map] Block length min = 25000
[wfmash::map] Chaining gap max = 100000
[wfmash::map] Percentage identity threshold = 70%
[wfmash::map] Skip self mappings
[wfmash::map] Mapping output file = /dev/stdout
[wfmash::map] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[wfmash::map] Execution threads  = 16
[wfmash::skch::Sketch::build] minimizers picked from reference = 10970
[wfmash::skch::Sketch::index] unique minimizers = 2797
[wfmash::skch::Sketch::computeFreqHist] Frequency histogram of minimizers = (1, 19) ... (22, 1)
[wfmash::skch::Sketch::computeFreqHist] With threshold 0.001%, consider all minimizers during lookup.
[wfmash::map] time spent computing the reference index: 0.00974391 sec
[wfmash::skch::Map::mapQuery] mapped  0.00% @ 0.00e+00 bp/s elapsed: 00:00:00:00[wfmash::skch::Map::mapQuery] mapped 100.00% @ 3.26e+05 bp/s elapsed: 00:00:00:00 remain: 00:00:00:00
[wfmash::skch::Map::mapQuery] count of mapped reads = 11, reads qualified for mapping = 12, total input reads = 12, total input bp = 163416
[wfmash::map] time spent mapping the query: 5.02e-01 sec
[wfmash::map] mapping results saved in: /dev/stdout
wfmash -s 5000 -l 25000 -p 70 -n 11 -k 19 -H 0.001 -X -t 16 --tmp-base /data/HLA/outputtt /data/HLA/DRB1-3123.fa.gz --approx-map
0.09s user 0.02s system 24% cpu 0.51s total 63704Kb max memory
python3 /usr/local/bin/scripts/paf2net.py -p /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.mappings.wfmash.paf
0.03s user 0.00s system 97% cpu 0.04s total 10636Kb max memory
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-x0sxurwi because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Detected 2 communities.
python3 /usr/local/bin/scripts/net2communities.py -e /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.mappings.wfmash.paf.edges.list.txt -w /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.mappings.wfmash.paf.edges.weights.txt -n /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.mappings.wfmash.paf.vertices.id2name.txt --accurate-detection --output-prefix /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e
0.83s user 2.05s system 519% cpu 0.55s total 72840Kb max memory
samtools faidx /data/HLA/DRB1-3123.fa.gz gi|568815592:32578768-32589835 gi|568815551:3814534-3830133 gi|568815561:3988942-4004531 gi|568815567:3779003-3792415 gi|568815569:3979127-3993865 gi|28212469:126036-137103 gi|28212470:131613-146345 gi|157702218:147985-163915 gi|528476637:32549024-32560088
0.00s user 0.00s system 100% cpu 0.00s total 3368Kb max memory
samtools faidx /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.community.0.fa
0.00s user 0.00s system 20% cpu 0.01s total 3192Kb max memory
samtools faidx /data/HLA/DRB1-3123.fa.gz gi|568815529:3998044-4011446 gi|345525392:5000-18402 gi|29124352:124254-137656
0.00s user 0.00s system 50% cpu 0.00s total 3484Kb max memory
samtools faidx /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.community.1.fa
0.00s user 0.00s system 28% cpu 0.00s total 3128Kb max memory

pggb -i /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.community.0.fa \
     -o /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.community.0.fa.out \
     -p 5000 -l 25000 -p 70 -n 12 -K 19 -F 0.001 \
     -k 19 -f 0 -B 10000000 \
     -H 12 -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
     --threads 16 --poa-threads 16
pggb -i /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.community.1.fa \
     -o /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.community.1.fa.out \
     -p 5000 -l 25000 -p 70 -n 12 -K 19 -F 0.001 \
     -k 19 -f 0 -B 10000000 \
     -H 12 -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
     --threads 16 --poa-threads 16

@subwaystation
Copy link
Member Author

Is this intended? I would expect 1 community.

@subwaystation
Copy link
Member Author

With -p 98 several communities. But with -p 70 it should be one from my understanding.

@AndreaGuarracino
Copy link
Member

That is a quite small (and short) dataset for applying community detection.

However, it doesn't seem like a bug. Those sequences are more similar to each other than all the others, so they form a community of their own. I suppose they tend to stay together also with different values of -p.

@subwaystation
Copy link
Member Author

I see. The longer the better. But how long? LPA? E. coli? Yeast?
Should one already know, there are several chromosomes in the dataset?
Can you somehow quantify when to apply the community detection?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants