Improve hashCode() implementation for Tuple2 #2803
Merged
The current implementation of the hashCode() method for Tuple2 has a significant flaw that can easily lead to hash collisions, particularly in data structures like LinkedHashMap.
In this case, m1 and m2 should ideally produce different hash codes, but they don't. This is because the current implementation of Tuple#hashCode() simply sums the hash codes of all elements, which can result in identical hash codes for different tuples.
A potential solution is to combine the hash codes of the elements with the XOR (^) operator instead of adding them. Unlike a sum, XOR does not make two tuples collide merely because their element hash codes add up to the same total. Note, however, that XOR is not order-sensitive: a ^ b equals b ^ a, so on its own it cannot distinguish two tuples that contain the same elements in a different order.
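To make the difference between summing and XOR-ing concrete, here is a minimal sketch. The class and method names (TupleHashSketch, sumHash, xorHash) are hypothetical stand-ins for illustration, not the library's actual code, and the sample values are chosen only to show the arithmetic.

```java
import java.util.Objects;

public class TupleHashSketch {

    // Hypothetical stand-in for a sum-based pair hash, as described above.
    static int sumHash(Object a, Object b) {
        return Objects.hashCode(a) + Objects.hashCode(b);
    }

    // Hypothetical XOR-based alternative.
    static int xorHash(Object a, Object b) {
        return Objects.hashCode(a) ^ Objects.hashCode(b);
    }

    public static void main(String[] args) {
        // Integer.hashCode() is the int value itself, so the arithmetic is easy to follow.
        System.out.println(sumHash(1, 4) == sumHash(2, 3)); // true:  1 + 4 == 2 + 3
        System.out.println(xorHash(1, 4) == xorHash(2, 3)); // false: 1 ^ 4 == 5, 2 ^ 3 == 1
        // Both operations are commutative, so swapped elements still collide.
        System.out.println(sumHash(1, 2) == sumHash(2, 1)); // true
        System.out.println(xorHash(1, 2) == xorHash(2, 1)); // true
    }
}
```

The last two lines show that neither scheme separates (1, 2) from (2, 1); only the kind of collision changes.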
However, XOR has its limitations and is only a suitable solution for tuples with up to two elements. This is because XOR is a commutative and associative operation, meaning that the XOR of multiple elements can still result in rather bad collisions:
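The PR's original snippet is not reproduced here. As a hedged illustration of the same point, the sketch below uses a hypothetical varargs helper (xorHash) to show that any pair of equal elements cancels out under XOR, so very different three-element tuples end up with the same hash.

```java
import java.util.Objects;

public class XorCollisionSketch {

    // Hypothetical helper: XOR of the hash codes of all elements.
    static int xorHash(Object... elements) {
        int h = 0;
        for (Object e : elements) {
            h ^= Objects.hashCode(e);
        }
        return h;
    }

    public static void main(String[] args) {
        // x ^ x == 0 for any x, so a repeated element contributes nothing.
        System.out.println(xorHash(1, 1, 7));     // 7
        System.out.println(xorHash(2, 2, 7));     // 7
        System.out.println(xorHash("x", "x", 7)); // 7
        System.out.println(xorHash(7, 5, 5));     // 7
    }
}
```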
The new implementation is not perfect either:
But it is imperfect in the same way as the standard library:
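Which standard-library behaviour the comparison refers to is not shown in this text. As an assumption, the sketch below demonstrates two well-known cases from the JDK itself: Objects.hash, which combines hash codes with a multiplier of 31, and AbstractMap.SimpleEntry, which XORs the key and value hash codes. Both produce collisions for small, distinct inputs.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.Objects;

public class StdlibHashSketch {

    public static void main(String[] args) {
        // Objects.hash(a, b) computes 31 * (31 * 1 + hash(a)) + hash(b).
        // (0, 31): 31 * 31 + 31 = 992;  (1, 0): 31 * 32 + 0 = 992.
        System.out.println(Objects.hash(0, 31) == Objects.hash(1, 0)); // true

        // Map.Entry implementations XOR the key and value hash codes,
        // so swapping key and value yields the same hash.
        Map.Entry<Integer, Integer> e1 = new AbstractMap.SimpleEntry<>(1, 2);
        Map.Entry<Integer, Integer> e2 = new AbstractMap.SimpleEntry<>(2, 1);
        System.out.println(e1.hashCode() == e2.hashCode()); // true: 1 ^ 2 == 2 ^ 1 == 3
    }
}
```

Collisions like these are unavoidable for any 32-bit hash over a larger input space; the practical question is whether they occur for inputs that are common in real use.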
Related: #2733