
deeply sort hashes to ensure consistent fingerprints #55

Merged
merged 2 commits on Aug 27, 2020

Conversation

@jsvd (Member) commented Aug 20, 2020

Fingerprinting is done either by fetching the event's .to_hash representation
or by fetching individual fields with .get(field).

The contents of these often include hashes, whose key order we don't
guarantee at insertion time, so this commit recursively sorts hashes.

This happens in all modes:

  • concatenate_all_sources
  • concatenate_fields
  • normal single field fingerprint
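The recursive sort described above can be sketched in plain Ruby. The `deep_sort` helper below is illustrative only: it does not reproduce the plugin's actual method names or its handling of Logstash Event field references.

```ruby
# Illustrative sketch: recursively sort hash keys so that two logically
# equal hashes always serialize (and therefore fingerprint) identically.
def deep_sort(object)
  case object
  when Hash
    # Sort the keys, then rebuild the hash in sorted order,
    # recursing into each value.
    object.keys.sort.each_with_object({}) do |key, sorted|
      sorted[key] = deep_sort(object[key])
    end
  when Array
    # Preserve array order (it is meaningful), but sort any nested hashes.
    object.map { |element| deep_sort(element) }
  else
    object
  end
end
```

With this in place, `deep_sort(h).to_s` produces the same string regardless of the order in which `h`'s keys were inserted.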

NOTE: I propose we don't consider this a breaking change (and therefore don't require a major version bump). The current ordering is unpredictable, which makes it much more of a bug than a feature. The only two paths forward would have been:

  1. sorting hashes like we do here
  2. only allowing fingerprinting on scalar fields (and therefore removing concatenate_all_sources)

Option 2 was much more likely to produce breaking changes, so we went with option 1 here.

Fixes #39
Replaces #41, #52

@jsvd jsvd force-pushed the consistent_fingerprint branch from 2bb56fc to 9b50a97 Compare August 24, 2020 09:09
@jsvd jsvd force-pushed the consistent_fingerprint branch from 9b50a97 to f5f952d Compare August 24, 2020 09:26
@jsvd jsvd changed the title deeply sort hashes when concatenate_all_fields is enabled deeply sort hashes to ensure consistent fingerprints Aug 24, 2020
@jsvd jsvd marked this pull request as ready for review August 24, 2020 09:40
@elasticsearch-bot elasticsearch-bot self-assigned this Aug 24, 2020
@andsel (Contributor) left a comment
LGTM

@andsel andsel assigned jsvd and unassigned elasticsearch-bot Aug 24, 2020
@colinsurprenant (Contributor) left a comment
LGTM

@jsvd jsvd requested review from colinsurprenant and andsel August 25, 2020 12:12
@jsvd (Member, Author) commented Aug 25, 2020

Thinking back to the concerns raised in #41 (review), there are a few items to keep in mind before merging this.

The existing implementation already sorts the top-level keys of a hash, either during concatenate_all_fields or when multiple sources are listed.
From my testing, the differences in results show up when fingerprinting nested objects. In this scenario the current implementation is unpredictable, as the hashmaps may come back in any order, independent of insertion order (since the underlying data structures of a Logstash Event aren't Ruby hashes).

I ran many different scenarios and I can't generate fingerprints that are consistently different from the ones produced by this PR.
For example, when fingerprinting a nested object, the current implementation will, at random, either produce a fingerprint equal to the one this PR produces, or a different one.
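To make that failure mode concrete, here is a small standalone sketch. Plain Ruby hashes stand in for the Java-backed Event data (Ruby preserves insertion order, so differing insertion orders model the unpredictable ordering described above), and `deep_sort` is a hypothetical recursive key sort standing in for this PR's fix:

```ruby
require 'digest'

# Hypothetical recursive key sort, standing in for the fix in this PR.
def deep_sort(object)
  case object
  when Hash  then object.keys.sort.each_with_object({}) { |k, h| h[k] = deep_sort(object[k]) }
  when Array then object.map { |e| deep_sort(e) }
  else object
  end
end

# Two logically equal events whose nested hash was populated in different orders.
a = { "user" => { "name" => "jane", "id" => 7 } }
b = { "user" => { "id" => 7, "name" => "jane" } }

# Naive fingerprinting of the serialized form diverges...
Digest::SHA1.hexdigest(a.to_s) == Digest::SHA1.hexdigest(b.to_s)                        # => false
# ...while sorting first yields one stable fingerprint.
Digest::SHA1.hexdigest(deep_sort(a).to_s) == Digest::SHA1.hexdigest(deep_sort(b).to_s) # => true
```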

Here is a link to a spreadsheet where I compared the two implementations against a different set of data and fingerprint strategies (all using SHA1): https://docs.google.com/spreadsheets/d/15tEkBJXk6_f7j4g6IHofsEAWG9ATxGI8ZFZaqMcgujA/edit?usp=sharing

So in summary, I can't find a reason to declare this a breaking change.

With all of this in mind I'd like a second review from @colinsurprenant and @andsel. Please let me know if I missed any corner case that can lead to breaking change for the user.

@andsel (Contributor) left a comment

LGTM

I agree with you that this is not a breaking change: previously the key order wasn't consistent, and every run over the same data could give different results, so it was already broken. As the spreadsheet shows, we don't break any condition that wasn't already broken before.

@colinsurprenant (Contributor) commented
LGTM2
I also agree: the case of nested hashes was already unpredictable and thus a bug. From my own experience, I believe most usage is flat single/multi-field fingerprinting, with or without concatenation; that case was not affected and still works the same. +1 for non-breaking change on my side.

@jsvd jsvd merged commit 7292935 into logstash-plugins:master Aug 27, 2020
@jsvd jsvd deleted the consistent_fingerprint branch August 27, 2020 18:55
Successfully merging this pull request may close these issues.

The fingerprint is non deterministic for events with a nested map