
deeply sort hashes to ensure consistent fingerprints #55

Merged
merged 2 commits on Aug 27, 2020

Conversation

@jsvd (Member) commented Aug 20, 2020

Fingerprinting is done either by fetching the event's .to_hash representation
or by fetching individual fields with .get(field).

The contents of these often include hashes, whose key order we don't
guarantee at insertion time, so this commit recursively sorts hashes.

This happens in all modes:

  • concatenate_all_sources
  • concatenate_fields
  • normal single field fingerprint
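The recursive sort described above can be sketched in plain Ruby. The `deep_sort` helper below is illustrative only: it does not reproduce the plugin's actual method names or its handling of Logstash Event field references.

```ruby
# Illustrative sketch: recursively sort hash keys so that two logically
# equal hashes always serialize (and therefore fingerprint) identically.
def deep_sort(object)
  case object
  when Hash
    # Sort the keys, then rebuild the hash in sorted order,
    # recursing into each value.
    object.keys.sort.each_with_object({}) do |key, sorted|
      sorted[key] = deep_sort(object[key])
    end
  when Array
    # Preserve array order (it is meaningful), but sort any nested hashes.
    object.map { |element| deep_sort(element) }
  else
    object
  end
end
```

With this in place, `deep_sort(h).to_s` produces the same string regardless of the order in which `h`'s keys were inserted.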

NOTE: I propose we don't consider this a breaking change (and therefore don't require a major version bump). The current ordering is unpredictable, which makes it much more of a bug than a feature. The only two paths forward would have been:

  1. sorting hashes like we do here
  2. only allowing fingerprinting on scalar fields (and therefore removing concatenate_all_sources)

Option 2 was much more likely to produce breaking changes, so we went with option 1 here.

Fixes #39
Replaces #41, #52

@jsvd jsvd force-pushed the consistent_fingerprint branch from 2bb56fc to 9b50a97 Compare August 24, 2020 09:09
@jsvd jsvd force-pushed the consistent_fingerprint branch from 9b50a97 to f5f952d Compare August 24, 2020 09:26
@jsvd jsvd changed the title deeply sort hashes when concatenate_all_fields is enabled deeply sort hashes to ensure consistent fingerprints Aug 24, 2020
@jsvd jsvd marked this pull request as ready for review August 24, 2020 09:40
@elasticsearch-bot elasticsearch-bot self-assigned this Aug 24, 2020
@andsel (Contributor) left a comment
LGTM

@andsel andsel assigned jsvd and unassigned elasticsearch-bot Aug 24, 2020
@colinsurprenant (Contributor) left a comment
LGTM

@jsvd jsvd requested review from colinsurprenant and andsel August 25, 2020 12:12
@jsvd (Member, Author) commented Aug 25, 2020

Thinking back to the concerns raised in #41 (review), there are a few items to keep in mind before merging this.

The existing implementation already sorts the top-level keys of a hash, either during concatenate_all_fields or when multiple sources are listed.
From my testing, the differences in results show up when fingerprinting nested objects. In this scenario the current implementation is unpredictable, as the hashmaps may come back in any order, independent of insertion order (since the underlying data structures of a Logstash Event aren't Ruby hashes).

I ran many different scenarios and I can't generate fingerprints that are consistently different from the ones produced by this PR.
For example, when fingerprinting a nested object, the current implementation will, at random, either produce a fingerprint equal to the one this PR produces, or a different one.
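To make that failure mode concrete, here is a small standalone sketch. Plain Ruby hashes stand in for the Java-backed Event data (Ruby preserves insertion order, so differing insertion orders model the unpredictable ordering described above), and `deep_sort` is a hypothetical recursive key sort standing in for this PR's fix:

```ruby
require 'digest'

# Hypothetical recursive key sort, standing in for the fix in this PR.
def deep_sort(object)
  case object
  when Hash  then object.keys.sort.each_with_object({}) { |k, h| h[k] = deep_sort(object[k]) }
  when Array then object.map { |e| deep_sort(e) }
  else object
  end
end

# Two logically equal events whose nested hash was populated in different orders.
a = { "user" => { "name" => "jane", "id" => 7 } }
b = { "user" => { "id" => 7, "name" => "jane" } }

# Naive fingerprinting of the serialized form diverges...
Digest::SHA1.hexdigest(a.to_s) == Digest::SHA1.hexdigest(b.to_s)                        # => false
# ...while sorting first yields one stable fingerprint.
Digest::SHA1.hexdigest(deep_sort(a).to_s) == Digest::SHA1.hexdigest(deep_sort(b).to_s) # => true
```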

Here is a link to a spreadsheet where I compared the two implementations against a different set of data and fingerprint strategies (all using SHA1): https://docs.google.com/spreadsheets/d/15tEkBJXk6_f7j4g6IHofsEAWG9ATxGI8ZFZaqMcgujA/edit?usp=sharing

So in summary, I can't find a reason to declare this a breaking change.

With all of this in mind I'd like a second review from @colinsurprenant and @andsel. Please let me know if I missed any corner case that can lead to breaking change for the user.

@andsel (Contributor) left a comment

LGTM

I agree with you that this is not a breaking change: previously the key order wasn't consistent, and every run over the same data could give different results, so it was already broken. As the spreadsheet shows, we don't break any condition that wasn't already broken before.

@colinsurprenant (Contributor) commented
LGTM2
I also agree: the case of nested hashes was already unpredictable and thus a bug. From my own experience, I believe most usage is flat single/multi-field fingerprinting, with or without concatenation; that case was not affected and still works the same. +1 for non-breaking change on my side.

@jsvd jsvd merged commit 7292935 into logstash-plugins:master Aug 27, 2020
@jsvd jsvd deleted the consistent_fingerprint branch August 27, 2020 18:55
Successfully merging this pull request may close these issues.

The fingerprint is non deterministic for events with a nested map