Releases: ekzhu/datasketch
v1.5.5
What's Changed
- Adding minhash_many to WeightedMinHashGenerator. by @jroose-jv in #165
- Add query buffer by @hguhlich in #167
New Contributors
- @jroose-jv made their first contribution in #165
- @hguhlich made their first contribution in #167
Full Changelog: v1.5.4...v1.5.5
v1.5.4
What's Changed
- Fixes #146; MinhashLSH creates mongo index key. by @oisincar in #148
- Add redis_buffer configuration. by @QthCN in #152
- minhash: Get rid of deprecation warning by @xkubov in #156
New Contributors
- @oisincar made their first contribution in #148
- @QthCN made their first contribution in #152
- @xkubov made their first contribution in #156
Full Changelog: 1.5.2...v1.5.4
Improved performance for MinHash and MinHashLSH
- Performance improvement for MinHash's update method.
- Make MinHash updates 4.5X faster by using the update_batch method for bulk update on MinHash. [See API doc](http://ekzhu.com/datasketch/documentation.html#datasketch.MinHash.update_batch).
- Further performance gain from bulk generation of MinHash using MinHash.bulk or MinHash.generator. See API doc and pull request.
- Optional compression for MinHash LSH index by hashing the bucket key produced by MinHashLSH._H. See pull request. This saves memory and storage space used by the index.
Thank you @Sinusoidal36!
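The idea behind the optional key compression can be illustrated with a minimal, standard-library-only sketch. This is not datasketch's implementation: the helper names band_key and compressed_key, the 64-bit packing, and the choice of SHA-1 are all assumptions made for the example.

```python
import hashlib
import struct


def band_key(hashvalues):
    """Serialize one band of 64-bit hash values into a raw bucket key.

    Hypothetical helper: the real library builds its bucket keys internally.
    """
    return struct.pack("<%dQ" % len(hashvalues), *hashvalues)


def compressed_key(hashvalues):
    """Hash the raw bucket key down to a fixed-size 20-byte SHA-1 digest."""
    return hashlib.sha1(band_key(hashvalues)).digest()


# A band of 16 hash values (illustrative numbers, not real MinHash output).
band = list(range(1, 17))
raw = band_key(band)            # 16 values * 8 bytes each
small = compressed_key(band)    # always 20 bytes, regardless of band size
```

The raw key grows with the number of hash values in a band, while the hashed key has a fixed length, which is where the storage saving would come from. The trade-off is a small chance that two different bands collide on the same compressed key.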
Add Cassandra storage layer.
- Minor bug fixes
- Cassandra storage layer, thanks to @ostefano! Now you can specify the Cassandra config just like the Redis one.
from datasketch import MinHashLSH

lsh = MinHashLSH(
    threshold=0.5, num_perm=128, storage_config={
        'type': 'cassandra',
        'cassandra': {
            'seeds': ['127.0.0.1'],
            'keyspace': 'lsh_test',
            'replication': {
                'class': 'SimpleStrategy',
                'replication_factor': '1',
            },
            'drop_keyspace': False,
            'drop_tables': False,
        }
    }
)
hashfunc to replace hashobj
MinHash and HyperLogLog now support the hashfunc parameter. The old parameter hashobj is removed.
# Let's use MurmurHash3.
import mmh3
from datasketch import MinHash

# We need to define a new hash function that outputs an integer that
# can be encoded in 32 bits.
def _hash_func(d):
    # mmh3.hash returns a 32-bit integer; signed=False keeps it non-negative.
    return mmh3.hash(d, signed=False)

# Use this function in MinHash constructor.
m = MinHash(hashfunc=_hash_func)
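If installing a third-party hash library is not an option, a function with the same contract (bytes in, unsigned 32-bit integer out) can be sketched with the standard library alone. The name hash32_from_sha1 is hypothetical and chosen only for this example.

```python
import hashlib
import struct


def hash32_from_sha1(data):
    """Return the first 4 bytes of SHA-1(data) as an unsigned 32-bit integer.

    Hypothetical helper for illustration; `data` must be a bytes object.
    """
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]


h = hash32_from_sha1(b"token")
```

Assuming the hashfunc contract described above, such a function could be passed in the same way, for example MinHash(hashfunc=hash32_from_sha1).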
Better LSH Ensemble
Use dynamic programming to create an optimal partitioning, which allows the LSH Ensemble index to adapt to any set size distribution.
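As a rough illustration of partitioning by dynamic programming, the sketch below splits sorted set sizes into contiguous partitions so that a total cost is minimized. The cost function here is a hypothetical stand-in (the count-weighted relative gap between each set size and its partition's upper bound), not the one datasketch uses, and the function names are assumptions made for the example.

```python
def partition_cost(sizes, counts, i, j):
    """Hypothetical cost of grouping sizes[i..j] into one partition.

    Sets much smaller than the partition's upper bound are penalized more.
    """
    upper = sizes[j]
    return sum(counts[t] * (upper - sizes[t]) / upper for t in range(i, j + 1))


def optimal_partitions(sizes, counts, num_part):
    """Split sorted `sizes` into `num_part` contiguous partitions of minimal total cost.

    Returns (bounds, cost), where bounds is a list of (lower, upper) set sizes.
    """
    n = len(sizes)
    if not 1 <= num_part <= n:
        raise ValueError("num_part must be between 1 and len(sizes)")
    inf = float("inf")
    # best[p][j]: minimal cost of splitting the first j sizes into p partitions.
    best = [[inf] * (n + 1) for _ in range(num_part + 1)]
    split = [[0] * (n + 1) for _ in range(num_part + 1)]
    best[0][0] = 0.0
    for p in range(1, num_part + 1):
        for j in range(p, n + 1):
            for i in range(p - 1, j):
                cost = best[p - 1][i] + partition_cost(sizes, counts, i, j - 1)
                if cost < best[p][j]:
                    best[p][j] = cost
                    split[p][j] = i
    # Walk the split table backwards to recover the partition bounds.
    bounds, j = [], n
    for p in range(num_part, 0, -1):
        i = split[p][j]
        bounds.append((sizes[i], sizes[j - 1]))
        j = i
    return bounds[::-1], best[num_part][n]


# Many small sets and a few large ones (illustrative data).
bounds, cost = optimal_partitions([1, 2, 3, 100, 110], [5, 5, 5, 1, 1], 2)
```

With this skewed input the boundaries follow the data, separating the small sets from the large ones instead of cutting the size range into equal-width intervals.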
Batch removal of keys from Async MinHashLSH index
- Adding batch removal functionality for Async MinHashLSH
- Removed Redis support from Async MinHashLSH, because Redis does not support async operations
For details see Pull #70
Thanks @aastafiev for the contribution.
MongoDB replicas
Add support for MongoDB replica set
Fix bug #68
v1.2.8 (Asynchronous MinHashLSH) fixes critical bug when removing key from L…
Asynchronous MinHash LSH module and storage base name
- Added Asynchronous MinHash LSH module. Thanks @aastafiev!
- Added ability to set the base name in storage config. The base name is used as the prefix for generating keys in the underlying storage (e.g., Redis). This change allows a client to "reconnect" to an existing LSH index in the storage through its base name.
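The effect of a base name can be shown with a toy model in which a plain dict stands in for the external storage. The PrefixedIndex class, its methods, and the "basename:key" layout are hypothetical and only illustrate the prefixing idea; they are not datasketch's storage code.

```python
import uuid


class PrefixedIndex:
    """Toy index that prefixes every storage key with a base name."""

    def __init__(self, backend, basename=None):
        self.backend = backend
        # Without an explicit base name, fall back to a random one,
        # so separate indexes do not see each other's keys.
        self.basename = basename if basename is not None else uuid.uuid4().hex

    def _key(self, key):
        return "%s:%s" % (self.basename, key)

    def insert(self, key, value):
        self.backend[self._key(key)] = value

    def get(self, key):
        return self.backend.get(self._key(key))


backend = {}  # stands in for an external store such as Redis

first = PrefixedIndex(backend, basename="my_index")
first.insert("doc1", [1, 2, 3])

# A second client "reconnects" by reusing the same base name.
reconnected = PrefixedIndex(backend, basename="my_index")
# A client with a different (random) base name sees nothing.
other = PrefixedIndex(backend)
```

Here reconnected.get("doc1") returns the value stored by the first client, while other.get("doc1") returns None because its keys carry a different prefix.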