Fix dataset hashes not taking labels into account #493
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
You know, I have never seen the approach of hashing each entry, then sorting those hashes before combining them. Very interesting! I think it is analogous to the method I am used to, where you sort all the input values before hashing. If something hits me, I'll let you know, but I really don't see why this wouldn't work well (and the implementation is cleaner).
Maybe add a comment though just to clarify that your hash is not depending on the ordering of the walk function calls? A new reader might see the lack of sorted(...) and immediately get concerned (which is what I did until I saw what you were doing instead :P).
Thanks!
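For readers following along, here is a minimal sketch of the approach discussed in this comment: hash each file, sort the resulting digests, then combine them. The names `file_sha256` and `folder_hash` are hypothetical and are not taken from this PR; the choice of SHA-256 is also an assumption. Note that this sketch hashes file contents only, so renaming or moving a file would not change the result.

```python
import hashlib
import os


def file_sha256(path):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def folder_hash(root):
    """Hypothetical helper: order-independent hash of all files under root."""
    entry_hashes = []
    # os.walk gives no ordering guarantee across platforms, so no
    # assumption is made about the order in which files are visited.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            entry_hashes.append(file_sha256(os.path.join(dirpath, name)))

    # Sorting the per-file digests is what makes the result independent
    # of the walk order; this replaces sorting the paths up front.
    entry_hashes.sort()

    combined = hashlib.sha256()
    for digest in entry_hashes:
        combined.update(digest.encode("utf-8"))
    return combined.hexdigest()
```

Because the digests are sorted before they are combined, two folders holding the same file contents produce the same hash no matter the order in which the files were created or visited.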
Thanks!
This small PR adds the changes required to generate dataset hashes from both the data and the labels. Previously, only the data path was considered, which meant that two datasets with the same input data but different labels would register as the same dataset.
closes #507
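To illustrate the behaviour described above, here is a rough sketch of a dataset hash that covers both the data and the labels. The function names (`entry_hashes`, `dataset_hash`) and the parameters `data_path` and `labels_path` are assumptions made for illustration and do not reflect the actual code in this PR.

```python
import hashlib
import os


def entry_hashes(path):
    """Hypothetical helper: SHA-256 hex digests of a file, or of every file in a folder."""
    if os.path.isfile(path):
        paths = [path]
    else:
        paths = [
            os.path.join(dirpath, name)
            for dirpath, _dirnames, filenames in os.walk(path)
            for name in filenames
        ]
    digests = []
    for p in paths:
        with open(p, "rb") as f:
            digests.append(hashlib.sha256(f.read()).hexdigest())
    return digests


def dataset_hash(data_path, labels_path):
    """Hypothetical helper: one hash covering both data and labels."""
    digests = entry_hashes(data_path)
    # Only walk the labels separately when they live somewhere else;
    # otherwise the data walk has already covered them.
    if os.path.abspath(labels_path) != os.path.abspath(data_path):
        digests += entry_hashes(labels_path)

    combined = hashlib.sha256()
    # Sort the digests so the result does not depend on walk order.
    for digest in sorted(digests):
        combined.update(digest.encode("utf-8"))
    return combined.hexdigest()
```

With a scheme like this, two datasets that share input data but differ in their labels yield different hashes, since the label digests feed into the combined value.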