Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix dataset hashes not taking labels into account #493

Merged
merged 60 commits into from
Dec 1, 2023

Conversation

aristizabal95
Copy link
Contributor

@aristizabal95 aristizabal95 commented Oct 10, 2023

This small PR adds the required changes for generating dataset hashes from both the data and labels. Previously, only the data path was considered, which meant that two datasets with the same input data but different labels would register as the same dataset
closes #507

@aristizabal95 aristizabal95 requested a review from a team as a code owner October 10, 2023 15:09
@github-actions
Copy link
Contributor

github-actions bot commented Oct 10, 2023

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@aristizabal95 aristizabal95 temporarily deployed to testing-external-code October 10, 2023 15:31 — with GitHub Actions Inactive
msheller
msheller previously approved these changes Oct 13, 2023
Copy link

@msheller msheller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know, I have never seen the approach of hashing each entry, then sorting those before combining them. Very interesting! I think it is analogous to the method I am used to, where you sort all the input values before hashing. If something hits me, I'll let you know, but I really don't see why this doesn't work well (and the implementation is cleaner).

Maybe add a comment though just to clarify that your hash is not depending on the ordering of the walk function calls? A new reader might see the lack of sorted(...) and immediately get concerned (which is what I did until I saw what you were doing instead :P).

Thanks!

Copy link
Contributor

@hasan7n hasan7n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@hasan7n hasan7n had a problem deploying to testing-external-code December 1, 2023 10:57 — with GitHub Actions Failure
@hasan7n hasan7n merged commit 5a6a1e3 into mlcommons:main Dec 1, 2023
6 of 7 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Dec 1, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dataset hashes not taking labels into account
3 participants