Fix non-deterministic fingerprint for nested hashes by sorting #41
Conversation
Force-pushed from 56b8e0b to f7c5d56
Consider this example:

```JSON
{
  "@timestamp": "2023-05-23T23:23:23.555Z",
  "@version": "1",
  "beat": {
    "hostname": "gnu.example.com",
    "name": "gnu.example.com",
    "version": "5.2.2"
  },
  "host": "gnu.example.com"
}
```

Using this filter:

```Logstash
fingerprint {
  concatenate_all_fields => true
  target => "[@metadata][_id]"
  method => "SHA512"
  key => "XXX"
  base64encode => true
}
```

Here, the order of the `.beat` hash is non-deterministic and the plugin did not do a deep sort as part of the serialization. This resulted in different fingerprints for the same event because the order of the three keys (`hostname`, `name`, `version`) changed randomly in the serialization. This has been fixed by recursively checking for hashes and serializing them in sorted order.

Note that this changes the serialization format and thus breaks backwards compatibility; this is also why the unit test on Travis fails. The old format could be emulated in order to not break backwards compatibility. Backwards compatibility in this case means generating the same fingerprint for the same input.

Closes: logstash-plugins#39
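For illustration, here is a minimal sketch of what "recursively checking for hashes and serializing them in sorted order" can look like in plain Ruby; the method name `deep_sort_serialize` and the exact delimiters are invented for this sketch and are not the plugin's actual serialization format:

```ruby
# Illustrative sketch only, not the plugin's code: serialize nested hashes in
# sorted key order so insertion order no longer influences the fingerprint input.
require 'digest'

def deep_sort_serialize(value)
  case value
  when Hash
    "{" + value.sort.map { |k, v| "#{k}:#{deep_sort_serialize(v)}" }.join(",") + "}"
  when Array
    "[" + value.map { |v| deep_sort_serialize(v) }.join(",") + "]"
  else
    value.to_s
  end
end

# Two events whose nested hash differs only in insertion order ...
a = { "beat" => { "hostname" => "gnu.example.com", "name" => "gnu.example.com", "version" => "5.2.2" } }
b = { "beat" => { "version" => "5.2.2", "name" => "gnu.example.com", "hostname" => "gnu.example.com" } }

# ... now serialize identically and therefore fingerprint identically.
puts Digest::SHA512.base64digest(deep_sort_serialize(a)) ==
     Digest::SHA512.base64digest(deep_sort_serialize(b))   # => true
```

With such a deep sort, the insertion order of nested hashes no longer influences the string that gets hashed.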
Force-pushed from f7c5d56 to 357a36e
@ypid-geberit thank you for contributing!
Unfortunately, I believe this to be a breaking change (as shown by the broken spec and otherwise noted in-line), and it will need to either be an opt-in feature or accompany a major version change (and not be bundled with Logstash until 7.0).

Because this plugin is widely used to generate consistent ids, there is a large risk that any breaking change here will have significant and surprising repercussions for many users. Even if we were to ship it with a major version, there would need to be a backward-compatibility fallback with unmodified behaviour. I would advocate instead making this an opt-in feature within the current major.

To do so, we would need a flag (naming is hard, but maybe `force_sort`?) that defaults to `false` but, when set, causes the new sorting behaviour to be invoked.
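A rough sketch of what such an opt-in option could look like, assuming the tentative `force_sort` name; `serialize_sorted` and `serialize_legacy` are hypothetical placeholders for the new and the unmodified behaviour, not real plugin methods:

```ruby
# Hypothetical sketch only -- not the plugin's actual source.
require "logstash/filters/base"
require "logstash/namespace"

class LogStash::Filters::Fingerprint < LogStash::Filters::Base
  config_name "fingerprint"

  # Defaults to false so existing pipelines keep producing the same fingerprints.
  config :force_sort, :validate => :boolean, :default => false

  private

  # Placeholder dispatch: only use deep-sorting serialization when the user
  # explicitly opts in; otherwise fall back to the current behaviour.
  def serialize(value)
    @force_sort ? serialize_sorted(value) : serialize_legacy(value)
  end
end
```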
Since Logstash does maintain iteration order of Hashes once they have been created, a near-term workaround would be to explicitly order the hash before fingerprinting it:
```Logstash
filter {
  ruby {
    # sort beat for consistent fingerprinting
    code => "beat = event.get('beat'); beat && event.set('beat', Hash[beat.sort])"
  }
  fingerprint {
    # ...
  }
}
```
`lib/logstash/filters/fingerprint.rb` (outdated diff)
```diff
- event.to_hash.sort.map do |k,v|
-   to_string << "|#{k}|#{v}"
- end
+ to_string << serialize(event)
```
This would be a breaking change for any pipeline configuration that makes use of `concatenate_all_fields`, because the string changes from pipe-delimited kv to json-ish.
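To make the format change concrete, here is a rough comparison on a made-up pair of fields (illustrative only; the exact strings the plugin builds may differ):

```ruby
# Illustrative comparison of the two concatenation formats.
fields = { "host" => "gnu.example.com", "version" => "1" }

# Old behaviour: pipe-delimited key/value pairs.
old_string = fields.sort.map { |k, v| "|#{k}|#{v}" }.join
# => "|host|gnu.example.com|version|1"

# Proposed behaviour: json-ish output in the style of the new serialize helper.
new_string = "{" + fields.sort.map { |k, v| "#{k}:#{v}," }.join + "}"
# => "{host:gnu.example.com,version:1,}"
```

Because the two strings differ, the resulting fingerprints differ, and any stored ids generated with the old format would change.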
```ruby
if event.respond_to?(:to_hash)
  to_string << "{"
  event.to_hash.sort.map do |k,v|
    to_string << "#{k}:#{serialize(v)},"
```
This would be a breaking change in two ways (see the plain-Ruby reference below):

- the current `Hash#to_s` outputs key/value pairs joined by `=>`, not by `:`
- ordering
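For reference, plain Ruby shows the current formatting (output as produced by Ruby versions before 3.4, which changed the spacing of `Hash#inspect`):

```ruby
# Hash#to_s delegates to Hash#inspect: pairs are joined with "=>", not ":",
# and insertion order is preserved rather than sorted.
{ "b" => 2, "a" => 1 }.to_s
# => "{\"b\"=>2, \"a\"=>1}"
```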
Ideally, `deep_sort` would be the default some day.
Thanks for your review @yaauie 👍 Backwards compatibility is definitely a discussion point here, and I find it reasonable to introduce a parameter like …

This would be interesting, but I did not understand how I could rewrite the … An open question is also if the …
Force-pushed from 16ed4f6 to eb38146
+1 we need this!
Would be really helpful to get this merged. Many thanks in advance.
+1 any updates on when this is getting merged?
#52 is probably a cleaner solution for this issue. Please have a look to get this fixed, as this issue is hard to debug and probably affects all users of the plugin. The question is just whether they have noticed it yet.
Fingerprinting is done by either fetching the `.to_hash` representation or by fetching fields with `.get(field)`. The content of these often contains hashes, whose insertion order is not guaranteed, so this commit recursively sorts hashes. This change affects all modes:

* `concatenate_all_fields => true`
* `concatenate_sources => true`
* normal single-field fingerprint (that may contain a nested Hash)

Fixes #39
Replaces #41, #52
Nice to see this bug finally got fixed by #55. Thanks! Closing this alternative PR.