Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Functional incorrectness: criterion used for item identity is hash value instead Eq-identity #24

Closed
sanmai-NL opened this issue Oct 10, 2020 · 18 comments
Labels

Comments

@sanmai-NL
Copy link

A very surprising defect. PriorityQueue ignores the Eq-implementation of the item type. Whether or not an item is already in the queue should not be determined by an implementation detail leaked from the underlying data structure, namely its hash value. Item identity should be determined by its Eq-identity.

This defect also decreases performance in cases where comparing identities costs less than comparing hash values, which I presume is always.

@garro95
Copy link
Owner

garro95 commented Oct 10, 2020

I agree this is surprising.

Can you provide an example showing this behavior? The hash table implementation used in PriorityQueue is provided by the IndexMap crate, so probably a fix must be implemented at that level

@sanmai-NL
Copy link
Author

sanmai-NL commented Oct 10, 2020

I think the issue is in your get_priority() implementation actually. The IndexMap is supposed to distinguish keys by their hash values, but a priority queue data structure shouldn't be affected by this.

@garro95
Copy link
Owner

garro95 commented Oct 10, 2020

The IndexMap is supposed to distinguish keys by their hash values, but a priority queue data structure shouldn't be affected by this.

I don't think I agree with this.
The IndexMap is supposed to do the following:

  1. Compute the hash value of the provided key
  2. Lookup for that hash in the table and, if something is found
  3. Check that the found key is actually equal to the provided one.

If 3 returns true, then it found what we were looking for, otherwise, the collision resolution strategy has to be used.

@garro95
Copy link
Owner

garro95 commented Oct 10, 2020

The tests in that repo seem to pass

@garro95
Copy link
Owner

garro95 commented Oct 10, 2020

I did a quick inspection in the IndexMap code and it seems to confirm what I said above. Also the implementation looks correct.

@sanmai-NL
Copy link
Author

sanmai-NL commented Oct 10, 2020

The purpose of a key in a hash table is to have a unique hash value, not to be equal in some other sense. Also, note that IndexMap has its own Equivalent trait, and tries allows to circumvent Eq.

The tests pass, because I manually implemented Hash for the item type (the part of the source code I hyperlinked). However, I want PriorityQueue to be independent of the Hash-implementation of the item type, and instead use its Eq-implementation.

IndexMap is correct, but PriorityQueue isn't (as I wrote).

@garro95
Copy link
Owner

garro95 commented Oct 10, 2020

The Equivalent trait in the IndexMap has a default implementation for Eq items, that is *self == *key.borrow() (https://docs.rs/indexmap/1.6.0/src/indexmap/equivalent.rs.html#18-27)

@garro95
Copy link
Owner

garro95 commented Oct 10, 2020

Oh, Now maybe I understand what you want to say. Would you like to compute eq only on nederlander with your implementation but derive Hash?

@sanmai-NL
Copy link
Author

I want the hash value of a type to be independent of its identity.

@garro95
Copy link
Owner

garro95 commented Oct 10, 2020

That is not possible. As stated in the Standard library documentation, if you implement these yourself, it is important that the following property holds:

k1 == k2 -> hash(k1) == hash(k2)

In other words, if two keys are equal, their hashes must be equal.
https://doc.rust-lang.org/std/collections/struct.HashMap.html

@sanmai-NL
Copy link
Author

From that property (which is an arbitrary restriction in the Rust standard library), it does not follow that if the hash values of a pair of keys are equal, the keys themselves must be equal, however (i.e., vice versa). This means you cannot rely on item types being Hash to make them suitable for a priority queue.

Furthermore, it is a design choice in priority_queue to use a backing storage that depends on hashing. This means priority_queue is not yet a general priority queue. This priority queue now only works correctly for items that are equal if and only if their hash values are equal.

@garro95
Copy link
Owner

garro95 commented Oct 10, 2020

From that property it does not follow that if the hash values of a pair of keys are equal,

This assumption is never made in PriorityQueue. The check that k1 == k2 is performed internally by IndexMap if hash(k1) == hash(k2).

which is an arbitrary restriction in the Rust standard library

The property is not an arbitrary restriction. It depends on how hash tables work (in any language they are implemented).

Furthermore, it is a design choice in priority_queue to use a backing storage that depends on hashing

The choice to use a HashMap as the underlying storage is driven by the fact that it is the only way (I am aware of) to implement the priority change function with a complexity O(log(N)) in time. I would be happy to learn another way though.

This priority queue now only works correctly for items that are equal if and only if their hash values are equal.

Again, this is not true. IndexMap (HashBrown's RawTable actually, the hash table implementation IndexMap and the standard HashMap rely on) ensures that the returned element is exactly the one with the same key you was looking for. If you can provide an example where Hash and Eq are implemented properly (that is, where k1 == k2 -> hash(k1) == hash(k2)) and the priority queue does not work correctly, I would be happy to correct the implementation (the correction should be applied in the Hash Table implementation though)

@sanmai-NL
Copy link
Author

sanmai-NL commented Oct 11, 2020

An example will come, but it would only prove something about priority_queue, not any hash table crate. What will you be doing after my example comes? I sense you're quite defensive about this.

@garro95
Copy link
Owner

garro95 commented Oct 11, 2020

I am not defensive. I already published a new version of the Priority Queue to update the documentation and clarify the property that Hash and Eq implementations must respect (https://docs.rs/priority-queue/1.0.1/priority_queue/) and I also created the issue #25 to allow two different implementations of Eq and Equivalent thank to your issue, so I am grateful to you

@sanmai-NL
Copy link
Author

@garro95 You're right about PriorityQueue being correct, my apologies ... I've fixed the referenced example project.

The criterion used for item identity in PriorityQueue isn't hash value instead of Eq-identity. Working on my example and debugging showed this. You use some sort of bucket in which it is possible to check the item identity (key) independent from its hash value.

However, suppose the hash value is equal for all items. Will PriorityQueue still attain O(log(N)) time in the worst case for the pop() operation?

Due to a design decision at the Rust standard library level, it isn't actually correct to implement Eq on fewer fields than Hash. The same restriction applies to Java, but that's a whole different language. Furthermore, due to PriorityQueue requiring an item type that is Hash, I am forced to hash at least the fields that matter for Eq.

@sanmai-NL
Copy link
Author

sanmai-NL commented Oct 12, 2020

Hmm, in fact if I were to hash more fields than used in Eq, possibly I can run into hash collisions with PriorityQueue despite using a good hash function.

@garro95
Copy link
Owner

garro95 commented Oct 12, 2020

You're right about PriorityQueue being correct, my apologies ...

Yeah, I know (joking LOL, but I was actually quite confident)

You use some sort of bucket in which it is possible to check the item identity (key) independent from its hash value.

Yeah, that is what is called a HashMap. In PriorityQueue, the HashMap key is the PriorityQueue's item, while the value is its priority. Just to refresh you how HashMaps work, they do the following operations on lookup:

  1. Compute the hash value of the provided key
  2. Lookup for that hash in the table (that is a direct access operation, so it's really fast) and, if something is found
  3. Check that the found key is actually equal to the provided one.

If this last check returns false there must have been a hash collision. HashMaps implement collision resolution strategies. According to the strategy, the HashMap will check if the other elements with a colliding hash match with the provided key, until an empty position is reached, which means the key is not present. If you read this carefully, it will be evident why the property mentioned above must hold. The point is, in other words, there is no explicit comparison between the hashes, but the hash is used as an index in the map. The comparison is only performed on the key using the Eq trait implementation (or Equivalent in the case of IndexMap).

Now, if you had to deal with HashMaps that did not implement collision resolution strategies and only used to compare hashes, I am sorry, but that were wrong implementations of a HashMap.

However, suppose the hash value is equal for all items. Will PriorityQueue still attain O(log(N)) time in the worst case for the pop() operation?

That would be a very bad Hash trait implementation. In that case, the time to lookup the HashMap would prevail on the time complexity of the heapify operation (on which depends many operations of the PriorityQueue), making them O(n).
For what concerns PriorityQueue though, only the operations that involve the lookup by key in the map would be affected (e.g. push() and its flavors or change_priority() and flavors), while operations like pop() that only relies on direct access trough index would not change their worst case time complexity.

Due to a design decision at the Rust standard library level, it isn't actually correct to implement Eq on fewer fields than Hash. The same restriction applies to Java, but that's a whole different language.

Again, I don't want to defend anything or anyone, but that's the case with every HashMap implementation, regardless of the language. As I tried to show you above, it's a matter of math.

Furthermore, due to PriorityQueue requiring an item type that is Hash, I am forced to hash at least the fields that matter for Eq.

No, it's the other way around: you have to check for equivalence at least the fields on which you compute the hash value, so that the property k1 == k2 -> hash(k1) == hash(k2) holds. Usually you want to use the same fields, in order to obtain the most effective hash with a correct implementation.

Hmm, in fact if I were to hash more fields than used in Eq, possibly I can run into hash collisions with PriorityQueue despite using a good hash function

Something worse can happen in that case: you could not retrieve an item despite having something Eq in the queue, because of its hash being different then the previously inserted item.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants