-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NEXT LINCLUST #873
base: master
Are you sure you want to change the base?
NEXT LINCLUST #873
Conversation
@martin-steinegger @milot-mirdita 1) Our new algorithm uses 6 additional bytes to store the adjacent information per each sequence, so the memory needed per sequence increases from 16 to 22 bytes. So if we were to pack our new linclust together with the old one, and use a parameter to choose between the two at runtime, quite a lot of memory would be wasted for users of old-linclust. Should I consider a prettier way of integrating the two, like dividing the structs and functions used for each version under the hood? 2) If we were to replace linclust with our new version, should we provide the option to use the original version, or should we just totally replace it, or stage it for deprecation? I'm not sure how much this change would affect our users, or how many would want to use the old version instead of the new. |
I am a bit hesitant to actually implement this, as this will likely require some ugly C++ magic. Something like the following could work: https://godbolt.org/z/9v6q1r41z But martin is also right, a 30% increase in RAM is not that much. |
@milot-mirdita
Then if integrating the new linclust into our old version with dynamic memory allocation gets too ugly, I'll consider giving up on the 6 bytes of memory, or even removing the old version entirely. Thank you! |
We do have templates already implemented. I guess it wouldn’t be too hard to avoid the extra 6 byte. |
Make adjacent sequence matching configurable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@milot-mirdita @martin-steinegger
I added some changes to improve the integration with our existing codebase.
My changes were tested with MMseqs-Regression to ensure that behavior is identical to the previous code, for both cases when --match-adjacent-seq
is true and false.
1) Features added in this PR are made configurable by toggling the --match-adjacent-seq
option
- initially set the default to false(disabled) in order to pass the checks, will change to true after updating regression tests
2) Data will not be allocated to store adjacentSeq[6]
when adjacent sequence matching is not enabled
- templates are used for polymorphic behavior throughout
kmermatcher
- all methods referencing instances of
KmerPosition
were annotated with templates, with the exclusion ofsize_t assignGroup(...)
, which needed overloading to provide different behavior for each case
return _adjacentSeq[index]; | ||
} | ||
|
||
unsigned char _adjacentSeq[6]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to keep this property well encapsulated, to ensure memory safety when Include==false
.
However, I kept this public instead of declaring it private, because we were using memset
and memcpy
to directly access the data from outside the struct, which would conflict with any strict encapsulation.
matchWithAdjacentSeq(par, argc, argv, command); | ||
} else { | ||
// overwrite value (no need for buffer) | ||
par.hashSeqBuffer = 1.0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed this line to disable unneeded memory buffering, but this behavior would seem a bit obscure from the outside.
Would it be more recommended to define this behavior somewhere in Parameters.cpp
?
} | ||
}; | ||
|
||
template <typename T, bool IncludeAdjacentSeq = false> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defined the default as false to minimize the impact of the changes made.
DBReader<unsigned int> & seqDbr, Parameters & par, BaseMatrix * subMat, | ||
size_t KMER_SIZE, size_t chooseTopKmer, float chooseTopKmerScale = 0.0); | ||
template <typename T> | ||
KmerPosition<T> *initKmerPositionMemory(size_t size); | ||
template <typename T, bool IncludeAdjacentSeq = false> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defined default value for methods that are used elsewhere than in kmermatcher
(i.e. kmersearch
, kmerindexdb
)
Summary
Details
1 Extended Search Process :
2. Data Structure Challenges
3. Dynamic Memory Allocation
Benchmark Results