Fix dataset hashes not taking labels into account #493

aristizabal95 · 2023-10-10T15:09:37Z

This small PR adds the changes required to generate dataset hashes from both the data and the labels. Previously, only the data path was considered, which meant that two datasets with the same input data but different labels would register as the same dataset.
closes #507
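
For illustration, here is a minimal sketch of the problem described above, not the code from this PR. The helper names (`file_digest`, `folders_hash`, `make_dataset`), the choice of SHA-256, and the decision to hash file contents only are all assumptions made for the example. It builds two toy datasets with identical data but different labels and compares a data-only hash against a hash over both folders.

```python
import hashlib
import os
import tempfile


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents (hypothetical helper)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def folders_hash(paths):
    """Hash every file found under the given folders (hypothetical helper).

    Assumption: only file contents are hashed, and the per-file digests are
    sorted before being combined so the result does not depend on walk order.
    """
    digests = []
    for folder in paths:
        for root, _, files in os.walk(folder):
            for name in files:
                digests.append(file_digest(os.path.join(root, name)))
    digests.sort()
    combined = hashlib.sha256()
    for digest in digests:
        combined.update(digest.encode("ascii"))
    return combined.hexdigest()


def make_dataset(base, name, data, labels):
    """Create a toy dataset with one data file and one labels file."""
    data_dir = os.path.join(base, name, "data")
    labels_dir = os.path.join(base, name, "labels")
    os.makedirs(data_dir)
    os.makedirs(labels_dir)
    with open(os.path.join(data_dir, "x.csv"), "w") as f:
        f.write(data)
    with open(os.path.join(labels_dir, "y.csv"), "w") as f:
        f.write(labels)
    return data_dir, labels_dir


with tempfile.TemporaryDirectory() as base:
    a_data, a_labels = make_dataset(base, "a", "1,2,3\n", "cat\n")
    b_data, b_labels = make_dataset(base, "b", "1,2,3\n", "dog\n")

    # Hashing only the data folder cannot tell the two datasets apart.
    data_only_collides = folders_hash([a_data]) == folders_hash([b_data])

    # Hashing data and labels together distinguishes them.
    with_labels_differ = (
        folders_hash([a_data, a_labels]) != folders_hash([b_data, b_labels])
    )
```

In this sketch `data_only_collides` and `with_labels_differ` both end up `True`, which mirrors the behavior change the description talks about.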

github-actions · 2023-10-10T15:09:55Z

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

msheller

You know, I have never seen the approach of hashing each entry, then sorting those hashes before combining them. Very interesting! I think it is analogous to the method I am used to, where you sort all the input values before hashing. If something hits me, I'll let you know, but I really don't see why this wouldn't work well (and the implementation is cleaner).

Maybe add a comment, though, just to clarify that your hash does not depend on the ordering of the walk function calls? A new reader might see the lack of sorted(...) and immediately get concerned (which is what I did until I saw what you were doing instead :P).

Thanks!
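
As a rough sketch of the ordering point raised in this review (not the PR's actual implementation), the snippet below hashes each entry, sorts the resulting digests, and only then combines them. The function names and the use of SHA-256 are assumptions for the example. It also includes a naive variant that combines digests in visit order, to show why a missing sort would normally be a concern.

```python
import hashlib


def entry_digest(content: bytes) -> str:
    """SHA-256 hex digest of a single entry (hypothetical helper)."""
    return hashlib.sha256(content).hexdigest()


def combine_sorted_digests(entries):
    """Hash each entry, sort the digests, then hash their concatenation."""
    digests = sorted(entry_digest(e) for e in entries)
    combined = hashlib.sha256()
    for digest in digests:
        combined.update(digest.encode("ascii"))
    return combined.hexdigest()


def combine_in_visit_order(entries):
    """Naive variant: combine digests in whatever order the entries arrive."""
    combined = hashlib.sha256()
    for e in entries:
        combined.update(entry_digest(e).encode("ascii"))
    return combined.hexdigest()


entries = [b"image-0", b"image-1", b"label-0", b"label-1"]
reordered = entries[::-1]  # stands in for a different directory walk order

# Sorting the per-entry digests makes the result independent of visit order.
stable = combine_sorted_digests(entries) == combine_sorted_digests(reordered)

# Without the sort, the same entries visited in another order hash differently.
unstable = combine_in_visit_order(entries) != combine_in_visit_order(reordered)
```

Sorting the digests plays the same role as sorting the inputs before hashing: both fix a canonical order, so the final hash depends only on the set of entries.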

hasan7n

Thanks!

aristizabal95 added 30 commits March 3, 2023 16:21

Add Data Preparator cookiecutter template

28f7e7f

Rename cookiecutter folder

6f9e19e

Temporarily remove possibly offending files

df6e6a2

Remove cookicutter conditionals

a7db0cf

Inclube back missing pieces of template

a7a6d15

remove cookiecutter typo

e2f7108

Use project_name attribute

581b5bb

Change cookiecutter fields order

fd77804

Create empty directories on hook

6eebd59

Fix empty folders paths

5ef86a2

Create evaluator mlcube cookiecutter template

d04baf8

Fix JSON Syntax Error

02cec01

Update template default values

b3d7a1d

Remove reference to undefined template variable

7338236

Implement model mlcube cookiecutter template

d1cec5e

Update cookiecutter variable default values

7338755

Create medperf CLI command for creating MLCubes

3ae9226

Provide additional options for mlcube create

e07cde2

Start working on tests

68e136a

Add tests for cube create

b8e03ac

Ignore invalid syntax on cookiecutter conditionals

7896b25

Ignore more flake8 errors

4f78981

Remove unused import

f5dab5e

Empty commit for cloudbuild

a03d7f6

Fix inconsistency with labels paths

6bb60d0

Update mlcube.yaml so it can be commented on docs

43b6cab

Don't render noqa comments on template

55b5d22

Remove flake8 specific ignores

135c598

Exclude templates from lint checks

e9e2c32

Remove specific flake8 ignores

e95dab8

aristizabal95 added 11 commits May 16, 2023 15:02

Reformat errors dictionary for printing

3951af8

Merge branch 'main' of https://github.com/mlcommons/medperf

09b8d69

Merge branch 'main' of https://github.com/mlcommons/medperf

e7a8ae4

Merge branch 'main' of https://github.com/mlcommons/medperf

091b69c

Merge branch 'main' of https://github.com/mlcommons/medperf

c69bf4c

Merge branch 'main' of https://github.com/mlcommons/medperf

d0c2b77

Merge branch 'main' of https://github.com/mlcommons/medperf

c70bb9b

Merge branch 'main' of https://github.com/mlcommons/medperf

90b5cb3

Allow passing multiple folders for hashing

40b8296

Generate dataset hashes with data and labels

27a836f

Fix linter issue

d3b5714

aristizabal95 requested a review from a team as a code owner October 10, 2023 15:09

aristizabal95 had a problem deploying to testing-external-code October 10, 2023 15:09 — with GitHub Actions Failure

Fix utils using outdated folder hash function

3fe058f

aristizabal95 had a problem deploying to testing-external-code October 10, 2023 15:23 — with GitHub Actions Failure

Fix tests

375baa7

aristizabal95 temporarily deployed to testing-external-code October 10, 2023 15:31 — with GitHub Actions Inactive

msheller previously approved these changes Oct 13, 2023

Merge branch 'main' into fix-data-hash

7725266

aristizabal95 had a problem deploying to testing-external-code October 16, 2023 17:02 — with GitHub Actions Failure

Add note about hash ordering

002f971

aristizabal95 dismissed msheller’s stale review via 002f971 November 20, 2023 15:19

aristizabal95 had a problem deploying to testing-external-code November 20, 2023 15:19 — with GitHub Actions Failure

hasan7n self-requested a review November 27, 2023 17:30

hasan7n approved these changes Nov 27, 2023

Merge branch 'main' into fix-data-hash

0771cf8

hasan7n had a problem deploying to testing-external-code December 1, 2023 10:57 — with GitHub Actions Failure

hasan7n merged commit 5a6a1e3 into mlcommons:main Dec 1, 2023
6 of 7 checks passed

github-actions bot locked and limited conversation to collaborators Dec 1, 2023
