deeply sort hashes to ensure consistent fingerprints #55
Conversation
Force-pushed from 2bb56fc to 9b50a97:
Fingerprinting is done by either fetching the .to_hash representation or by fetching fields with .get(field). The content of these often contains hashes, whose key order we don't guarantee at insertion time, so this commit recursively sorts hashes. This happens in all modes:

* concatenate_all_sources
* concatenate_fields
* normal single field fingerprint
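The recursive sort described in the commit message can be sketched in plain Ruby. The `deep_sort` name and structure below are illustrative only, not the plugin's actual implementation:

```ruby
# A minimal sketch of the recursive sort described above. The deep_sort
# name is hypothetical; the plugin's real code may differ.
def deep_sort(object)
  case object
  when Hash
    # Sort this level's keys, then recurse into each value.
    object.sort.to_h.transform_values { |value| deep_sort(value) }
  when Array
    # Keep array order, but normalize any hashes nested inside.
    object.map { |item| deep_sort(item) }
  else
    object
  end
end
```

With something like this in place, two hashes holding the same content in different insertion orders serialize identically, so their fingerprints match.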
Force-pushed from 9b50a97 to f5f952d.
LGTM
LGTM
Thinking back to the concerns raised in #41 (review), there are a few items to keep in mind before merging this. The existing implementation already sorts the top-level keys of a hash, either during concatenate_all_fields or if multiple sources are listed. I ran many different scenarios and I can't generate fingerprints that are consistently different from the ones produced by this PR. Here is a link to a spreadsheet where I compared the two implementations against different sets of data and fingerprint strategies (all using SHA1): https://docs.google.com/spreadsheets/d/15tEkBJXk6_f7j4g6IHofsEAWG9ATxGI8ZFZaqMcgujA/edit?usp=sharing So in summary, I can't find a reason to declare this a breaking change. With all of this in mind I'd like a second review from @colinsurprenant and @andsel. Please let me know if I missed any corner case that could lead to a breaking change for the user.
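The corner case worth checking here is a nested hash: sorting only the top level leaves inner key order untouched, while a recursive sort normalizes every level. A hypothetical illustration (the event shape, field names, and the inline `deep` lambda are invented for this example):

```ruby
require 'digest'
require 'json'

# Two logically identical events whose nested hash was built in
# different insertion orders (the "geo" keys are swapped).
a = { "host" => { "name" => "h1", "geo" => { "lat" => 1, "lon" => 2 } } }
b = { "host" => { "geo" => { "lon" => 2, "lat" => 1 }, "name" => "h1" } }

# Sorting only the top level leaves nested key order untouched,
# so the serialized forms (and thus the digests) still differ.
top_a = a.sort.to_h.to_json
top_b = b.sort.to_h.to_json
puts top_a == top_b  # => false

# A recursive sort normalizes every level, so the digests agree.
deep = ->(o) do
  case o
  when Hash  then o.sort.to_h.transform_values { |v| deep.call(v) }
  when Array then o.map { |v| deep.call(v) }
  else o
  end
end
puts Digest::SHA1.hexdigest(deep.call(a).to_json) ==
     Digest::SHA1.hexdigest(deep.call(b).to_json)  # => true
```

This matches the spreadsheet's finding: fingerprints only diverge between the two implementations when a hash below the top level was built in a non-sorted order.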
LGTM
I agree with you that this is not a breaking change: previously the key order wasn't consistent, so every run over the same data could give different results; it was already broken. As the spreadsheet shows, we don't break any condition that wasn't already broken before.
LGTM2
Fingerprinting is done by either fetching the .to_hash representation or by fetching fields with .get(field). The content of these often contains hashes, whose key order we don't guarantee at insertion time, so this commit recursively sorts hashes. This happens in all modes:

* concatenate_all_sources
* concatenate_fields
* normal single field fingerprint

NOTE: I propose we don't consider this a breaking change (therefore not requiring a major version bump). The current ordering is unpredictable and therefore is much more of a bug than a feature. The only two paths forward would have been:

Option 2 was much more likely to produce breaking changes, so we went with option 1.

Fixes #39
Replaces #41, #52