
NEXT LINCLUST #873

Open · wants to merge 14 commits into master

Conversation

@ChunShow (Contributor) commented Aug 14, 2024

Summary

This pull request introduces an algorithm that reduces the total number of clusters while preserving linear runtime. The algorithm captures meaningful information from data that previously went unused in the assignGroup function, allowing for more effective clustering.

Details

1. Extended Search Process

For each k-mer group, the step that combines the representative sequence with every other sequence has been extended. The algorithm now computes sequence dissimilarity using adjacent-sequence information, selects the most dissimilar sequence as the next representative, and repeats this exploration. If several sequences share the same dissimilarity, the most recently explored one is chosen as the representative. In addition, the candidates for the most dissimilar sequence are restricted to sequences that follow the current representative in the search order.
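A minimal sketch of this selection step, with hypothetical names (the actual kmermatcher code differs and derives the dissimilarity from the adjacent-sequence information):

```cpp
#include <cstddef>
#include <vector>

// One entry of a k-mer group, listed in search order (illustrative only).
struct GroupEntry {
    unsigned int seqId;
    float dissimilarityToRep; // assumed to be computed from adjacent-sequence information
};

// Pick the next representative: only sequences after the current representative
// are candidates, the most dissimilar one wins, and ties go to the most
// recently explored candidate (hence '>=').
size_t pickNextRepresentative(const std::vector<GroupEntry>& group, size_t repIdx) {
    size_t next = repIdx;
    float maxDissim = -1.0f;
    for (size_t i = repIdx + 1; i < group.size(); ++i) {
        if (group[i].dissimilarityToRep >= maxDissim) {
            maxDissim = group[i].dissimilarityToRep;
            next = i;
        }
    }
    return next; // equals repIdx when no candidate follows, ending the exploration
}
```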

2. Data Structure Challenges

Implementing this method made it difficult to keep the original in-place data structure. To overcome this, a new data structure has been introduced with an additional buffer (slack space) at the end; the buffer size defaults to 5% of the original size.
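A sketch of allocating with slack space, using a hypothetical helper and a buffer factor analogous to the hashSeqBuffer parameter that appears later in this PR:

```cpp
#include <cmath>
#include <cstddef>

// Allocate 'entries' elements plus slack space at the end.
// A factor of 1.05 corresponds to the default 5% buffer.
template <typename T>
T* allocateWithSlack(size_t entries, double bufferFactor = 1.05) {
    size_t capacity = static_cast<size_t>(std::ceil(static_cast<double>(entries) * bufferFactor));
    return new T[capacity]; // the extra 5% absorbs entries that no longer fit in place
}
```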

3. Dynamic Memory Allocation

If memory becomes insufficient during operation, the structure resizes dynamically by splitting and reallocating memory based on the progress of the previous pass. This keeps memory usage efficient and prevents memory shortages.
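A rough sketch of how such a resize could be sized; this is only an assumed interpretation of "based on the progress of the previous pass", with hypothetical names, not the PR's actual code:

```cpp
#include <cstddef>
#include <cstring>

// Grow an array whose slack space ran out after processing a fraction of the
// input, extrapolating the final size from the expansion observed so far.
template <typename T>
T* growBasedOnProgress(T* data, size_t used, double progressFraction) {
    size_t newCapacity = static_cast<size_t>(used / progressFraction) + 1;
    T* resized = new T[newCapacity];
    std::memcpy(resized, data, used * sizeof(T)); // assumes T is trivially copyable
    delete[] data;
    return resized;
}
```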

Benchmark Results

I benchmarked on datasets randomly sampled from UniParc, with sizes of 1.3M, 2.7M, 5.3M, 11M, 21M, 42M, and 85M sequences. The results confirmed that the new algorithm effectively reduces the number of clusters while maintaining linearity, with no significant runtime difference compared to the existing Linclust. However, the algorithm still lags behind the easy-cluster method and does not fully reach the ideal result obtained by a quadratic search over all possible combinations, so there remains room for further improvement.

ChunShow marked this pull request as ready for review August 14, 2024 05:43
ChunShow changed the title from new linclust to NEXT LINCLUST Aug 14, 2024
@leejoey0921 (Contributor) commented Aug 27, 2024

@martin-steinegger @milot-mirdita
I have questions regarding the integration of this PR with the master branch.

1) Our new algorithm uses 6 additional bytes per sequence to store the adjacency information, so the memory needed per sequence increases from 16 to 22 bytes.

So if we were to pack our new linclust together with the old one, and use a parameter to choose between the two at runtime, quite a lot of memory would be wasted for users of old-linclust.

Should I consider a prettier way of integrating the two, like dividing the structs and functions used for each version under the hood?
Or can we just assume that our users would be happier with the new version regardless of the increased memory usage?

2) If we were to replace linclust with our new version, should we provide the option to use the original version, or should we just totally replace it, or stage it for deprecation? I'm not sure how much this change would affect our users, or how many would want to use the old version instead of the new.

@milot-mirdita (Member) commented Aug 27, 2024

I am a bit hesitant to actually implement this, as this will likely require some ugly C++ magic. Something like the following could work: https://godbolt.org/z/9v6q1r41z

But martin is also right, a 30% increase in RAM is not that much.
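For illustration, one way such conditional layouts can be expressed. This is only a sketch and may differ from the linked godbolt example; the field layout below is chosen to reproduce the 16- and 22-byte figures from the discussion, not copied from kmermatcher.h:

```cpp
#include <cstddef>
#include <cstdio>

// A bool template parameter selects between a layout with and without the
// 6 adjacency bytes, so users of the old linclust pay no extra memory.
// (__attribute__((__packed__)) is GCC/Clang specific.)
template <bool IncludeAdjacentSeq>
struct KmerEntrySketch; // hypothetical stand-in for KmerPosition

template <>
struct __attribute__((__packed__)) KmerEntrySketch<false> {
    size_t kmer;
    unsigned int id;
    unsigned short seqLen;
    unsigned short pos;
};

template <>
struct __attribute__((__packed__)) KmerEntrySketch<true> {
    size_t kmer;
    unsigned int id;
    unsigned short seqLen;
    unsigned short pos;
    unsigned char adjacentSeq[6];
};

int main() {
    std::printf("%zu vs %zu bytes\n",
                sizeof(KmerEntrySketch<false>), sizeof(KmerEntrySketch<true>)); // 16 vs 22
    return 0;
}
```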

@leejoey0921 (Contributor) commented Aug 28, 2024

@milot-mirdita
Some template black magic seems pretty convenient (albeit a little dangerous). I'll try looking into it and check if it breaks anything.

> But martin is also right, a 30% increase in RAM is not that much.

Then if integrating the new linclust into our old version with dynamic memory allocation gets too ugly, I'll consider giving up on the 6 bytes of memory, or even removing the old version entirely.

Thank you!

@martin-steinegger (Member) commented:

We do have templates already implemented. I guess it wouldn’t be too hard to avoid the extra 6 bytes.

@leejoey0921 (Contributor) left a comment

@milot-mirdita @martin-steinegger
I added some changes to improve the integration with our existing codebase.
My changes were tested with MMseqs-Regression to ensure that behavior is identical to the previous code, both when --match-adjacent-seq is true and when it is false.

1) Features added in this PR are made configurable by toggling the --match-adjacent-seq option

  • initially set the default to false (disabled) in order to pass the checks; will change to true after updating the regression tests

2) Data will not be allocated to store adjacentSeq[6] when adjacent sequence matching is not enabled

  • templates are used for polymorphic behavior throughout kmermatcher
  • all methods referencing instances of KmerPosition were annotated with templates, except for size_t assignGroup(...), which needed overloading to provide different behavior for each case (see the dispatch sketch below)
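A hedged sketch of the runtime-to-compile-time dispatch this implies; function and parameter names are illustrative, not the PR's actual signatures:

```cpp
// The --match-adjacent-seq flag picks which template instantiation of the
// k-mer matching routine runs; the templated body then works on
// KmerPosition-like entries with or without the 6-byte adjacency field.
template <typename T, bool IncludeAdjacentSeq>
int runKmerMatching(/* DBReader, Parameters, ... */) {
    return 0; // placeholder body
}

template <typename T>
int dispatchKmerMatching(bool matchAdjacentSeq) {
    return matchAdjacentSeq ? runKmerMatching<T, true>()
                            : runKmerMatching<T, false>();
}
```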

return _adjacentSeq[index];
}

unsigned char _adjacentSeq[6];
@leejoey0921 (Contributor) commented Aug 30, 2024

I wanted to keep this property well encapsulated, to ensure memory safety when Include==false.

However, I kept this public instead of declaring it private, because we were using memset and memcpy to directly access the data from outside the struct, which would conflict with any strict encapsulation.

matchWithAdjacentSeq(par, argc, argv, command);
} else {
// overwrite value (no need for buffer)
par.hashSeqBuffer = 1.0;
@leejoey0921 (Contributor) commented Aug 30, 2024

Needed this line to disable unneeded memory buffering, but this behavior might seem a bit obscure from the outside.
Would it be better to define this behavior somewhere in Parameters.cpp?

}
};

template <typename T, bool IncludeAdjacentSeq = false>
A Contributor commented:

Defined the default as false to minimize the impact of the changes made.

DBReader<unsigned int> & seqDbr, Parameters & par, BaseMatrix * subMat,
size_t KMER_SIZE, size_t chooseTopKmer, float chooseTopKmerScale = 0.0);
template <typename T>
KmerPosition<T> *initKmerPositionMemory(size_t size);
template <typename T, bool IncludeAdjacentSeq = false>
A Contributor commented:

Defined a default value for methods that are also used outside of kmermatcher (e.g., kmersearch, kmerindexdb).
