
NEXT LINCLUST #873

Open · wants to merge 14 commits into master

Conversation

@ChunShow (Contributor) commented Aug 14, 2024

Summary

This pull request introduces an algorithm that reduces the total number of clusters while preserving linear runtime. The algorithm captures meaningful information from data that previously went unused in the assignGroup function, allowing for more effective clustering.

Details

1. Extended Search Process

For each k-mer group, the step that combines the representative sequence with every other sequence has been extended. The algorithm now computes sequence dissimilarity using adjacent-sequence information, selects the most dissimilar sequence as the next representative, and repeats this exploration. If several sequences share the same dissimilarity, the most recently explored one is chosen as the representative. In addition, the candidates for the most dissimilar sequence are restricted to sequences that follow the current representative in the search order.
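A minimal sketch of this selection step, with hypothetical names (the actual kmermatcher code differs and derives the dissimilarity from the adjacent-sequence information):

```cpp
#include <cstddef>
#include <vector>

// One entry of a k-mer group, listed in search order (illustrative only).
struct GroupEntry {
    unsigned int seqId;
    float dissimilarityToRep; // assumed to be computed from adjacent-sequence information
};

// Pick the next representative: only sequences after the current representative
// are candidates, the most dissimilar one wins, and ties go to the most
// recently explored candidate (hence '>=').
size_t pickNextRepresentative(const std::vector<GroupEntry>& group, size_t repIdx) {
    size_t next = repIdx;
    float maxDissim = -1.0f;
    for (size_t i = repIdx + 1; i < group.size(); ++i) {
        if (group[i].dissimilarityToRep >= maxDissim) {
            maxDissim = group[i].dissimilarityToRep;
            next = i;
        }
    }
    return next; // equals repIdx when no candidate follows, ending the exploration
}
```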

2. Data Structure Challenges

Implementing this method made it difficult to keep the original in-place data structure. To overcome this, a new data structure has been introduced with an additional buffer (slack space) at the end; the buffer size defaults to 5% of the original size.
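A sketch of allocating with slack space, using a hypothetical helper and a buffer factor analogous to the hashSeqBuffer parameter that appears later in this PR:

```cpp
#include <cmath>
#include <cstddef>

// Allocate 'entries' elements plus slack space at the end.
// A factor of 1.05 corresponds to the default 5% buffer.
template <typename T>
T* allocateWithSlack(size_t entries, double bufferFactor = 1.05) {
    size_t capacity = static_cast<size_t>(std::ceil(static_cast<double>(entries) * bufferFactor));
    return new T[capacity]; // the extra 5% absorbs entries that no longer fit in place
}
```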

3. Dynamic Memory Allocation

If memory becomes insufficient during operation, the structure resizes dynamically by splitting and reallocating memory based on the progress of the previous pass. This keeps memory usage efficient and prevents memory shortages.
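A rough sketch of how such a resize could be sized; this is only an assumed interpretation of "based on the progress of the previous pass", with hypothetical names, not the PR's actual code:

```cpp
#include <cstddef>
#include <cstring>

// Grow an array whose slack space ran out after processing a fraction of the
// input, extrapolating the final size from the expansion observed so far.
template <typename T>
T* growBasedOnProgress(T* data, size_t used, double progressFraction) {
    size_t newCapacity = static_cast<size_t>(used / progressFraction) + 1;
    T* resized = new T[newCapacity];
    std::memcpy(resized, data, used * sizeof(T)); // assumes T is trivially copyable
    delete[] data;
    return resized;
}
```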

Benchmark Results

I benchmarked on datasets randomly sampled from UniParc, with sizes of 1.3M, 2.7M, 5.3M, 11M, 21M, 42M, and 85M sequences. The results confirmed that the new algorithm effectively reduces the number of clusters while maintaining linearity, with no significant runtime difference compared to the existing Linclust. However, the algorithm still lags behind the easy-cluster method and does not fully reach the ideal result obtained by a quadratic search over all possible combinations, so there remains room for further improvement.

ChunShow marked this pull request as ready for review August 14, 2024 05:43
ChunShow changed the title from new linclust to NEXT LINCLUST Aug 14, 2024
@leejoey0921 (Contributor) commented Aug 27, 2024

@martin-steinegger @milot-mirdita
I have questions regarding the integration of this PR with the master branch.

1) Our new algorithm uses 6 additional bytes per sequence to store the adjacency information, so the memory needed per sequence increases from 16 to 22 bytes.

So if we were to pack our new linclust together with the old one, and use a parameter to choose between the two at runtime, quite a lot of memory would be wasted for users of old-linclust.

Should I consider a prettier way of integrating the two, like dividing the structs and functions used for each version under the hood?
Or can we just assume that our users would be happier with the new version regardless of the increased memory usage?

2) If we were to replace linclust with our new version, should we provide the option to use the original version, or should we just totally replace it, or stage it for deprecation? I'm not sure how much this change would affect our users, or how many would want to use the old version instead of the new.

@milot-mirdita (Member) commented Aug 27, 2024

I am a bit hesitant to actually implement this, as this will likely require some ugly C++ magic. Something like the following could work: https://godbolt.org/z/9v6q1r41z

But martin is also right, a 30% increase in RAM is not that much.
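For illustration, one way such conditional layouts can be expressed. This is only a sketch and may differ from the linked godbolt example; the field layout below is chosen to reproduce the 16- and 22-byte figures from the discussion, not copied from kmermatcher.h:

```cpp
#include <cstddef>
#include <cstdio>

// A bool template parameter selects between a layout with and without the
// 6 adjacency bytes, so users of the old linclust pay no extra memory.
// (__attribute__((__packed__)) is GCC/Clang specific.)
template <bool IncludeAdjacentSeq>
struct KmerEntrySketch; // hypothetical stand-in for KmerPosition

template <>
struct __attribute__((__packed__)) KmerEntrySketch<false> {
    size_t kmer;
    unsigned int id;
    unsigned short seqLen;
    unsigned short pos;
};

template <>
struct __attribute__((__packed__)) KmerEntrySketch<true> {
    size_t kmer;
    unsigned int id;
    unsigned short seqLen;
    unsigned short pos;
    unsigned char adjacentSeq[6];
};

int main() {
    std::printf("%zu vs %zu bytes\n",
                sizeof(KmerEntrySketch<false>), sizeof(KmerEntrySketch<true>)); // 16 vs 22
    return 0;
}
```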

@leejoey0921 (Contributor) commented Aug 28, 2024

@milot-mirdita
Some template black magic seems pretty convenient (albeit a little dangerous). I'll try looking into it and check if it breaks anything.

> But martin is also right, a 30% increase in RAM is not that much.

Then if integrating the new linclust into our old version with dynamic memory allocation gets too ugly, I'll consider giving up on the 6 bytes of memory, or even removing the old version entirely.

Thank you!

@martin-steinegger (Member) commented:

We do have templates already implemented. I guess it wouldn’t be too hard to avoid the extra 6 bytes.

@leejoey0921 (Contributor) left a comment

@milot-mirdita @martin-steinegger
I added some changes to improve the integration with our existing codebase.
My changes were tested with MMseqs-Regression to ensure that behavior is identical to the previous code, both when --match-adjacent-seq is true and when it is false.

1) Features added in this PR are made configurable by toggling the --match-adjacent-seq option

  • initially set the default to false (disabled) in order to pass the checks; will change to true after updating the regression tests

2) Data will not be allocated to store adjacentSeq[6] when adjacent sequence matching is not enabled

  • templates are used for polymorphic behavior throughout kmermatcher
  • all methods referencing instances of KmerPosition were annotated with templates, except for size_t assignGroup(...), which needed overloading to provide different behavior for each case (see the dispatch sketch below)
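A hedged sketch of the runtime-to-compile-time dispatch this implies; function and parameter names are illustrative, not the PR's actual signatures:

```cpp
// The --match-adjacent-seq flag picks which template instantiation of the
// k-mer matching routine runs; the templated body then works on
// KmerPosition-like entries with or without the 6-byte adjacency field.
template <typename T, bool IncludeAdjacentSeq>
int runKmerMatching(/* DBReader, Parameters, ... */) {
    return 0; // placeholder body
}

template <typename T>
int dispatchKmerMatching(bool matchAdjacentSeq) {
    return matchAdjacentSeq ? runKmerMatching<T, true>()
                            : runKmerMatching<T, false>();
}
```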

return _adjacentSeq[index];
}

unsigned char _adjacentSeq[6];
@leejoey0921 (Contributor) commented Aug 30, 2024

I wanted to keep this property well encapsulated, to ensure memory safety when Include==false.

However, I kept this public instead of declaring it private, because we were using memset and memcpy to directly access the data from outside the struct, which would conflict with any strict encapsulation.

matchWithAdjacentSeq(par, argc, argv, command);
} else {
// overwrite value (no need for buffer)
par.hashSeqBuffer = 1.0;
@leejoey0921 (Contributor) commented Aug 30, 2024

Needed this line to disable unneeded memory buffering, but this behavior might seem a bit obscure from the outside.
Would it be better to define this behavior somewhere in Parameters.cpp?

}
};

template <typename T, bool IncludeAdjacentSeq = false>
A Contributor commented:

Defined the default as false to minimize the impact of the changes made.

DBReader<unsigned int> & seqDbr, Parameters & par, BaseMatrix * subMat,
size_t KMER_SIZE, size_t chooseTopKmer, float chooseTopKmerScale = 0.0);
template <typename T>
KmerPosition<T> *initKmerPositionMemory(size_t size);
template <typename T, bool IncludeAdjacentSeq = false>
A Contributor commented:

Defined a default value for methods that are also used outside of kmermatcher (e.g., kmersearch, kmerindexdb).
