Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix non-deterministic fingerprint for nested hashes by sorting #41

Closed

Conversation

ypid-geberit
Copy link

@ypid-geberit ypid-geberit commented Sep 21, 2018

Consider this example:

{
  "@timestamp": "2023-05-23T23:23:23.555Z",
  "@version": "1",
  "beat": {
    "hostname": "gnu.example.com",
    "name": "gnu.example.com",
    "version": "5.2.2"
  },
  "host": "gnu.example.com"
}

Using this filter:

fingerprint {
  concatenate_all_fields => true
  target => "[@metadata][_id]"
  method => "SHA512"
  key => "XXX"
  base64encode => true
}

Here, the order of the .beat hash is non-deterministic and the plugin did not do a deep sort as part of the serialization. This resulted in different fingerprints for the same event because the order of the three keys (hostname, name, version) changed randomly in the serialization.

This has been fixed by recursively checking for hashes and serializing them in sorted order.

Note that this changes the serialization format and thus breaks backwards compatibility. The old format could be emulated in order to not break backwards compatibility. Backwards compatibility in this case means to generate the same fingerprint for the same input. This is also the reason why the unit test on Travis fails.

Consider this example:

```JSON
{
  "@timestamp": "2023-05-23T23:23:23.555Z",
  "@Version": "1",
  "beat": {
    "hostname": "gnu.example.com",
    "name": "gnu.example.com",
    "version": "5.2.2"
  },
  "host": "gnu.example.com"
}
```

Using this filter:

```Logstash
fingerprint {
  concatenate_all_fields => true
  target => "[@metadata][_id]"
  method => "SHA512"
  key => "XXX"
  base64encode => true
}
```

Here, the order of the `.beat` hash is non-deterministic and the plugin
did not do a deep sort as part of the serialization. This resulted in
different fingerprints for the same event because the order of the three
keys (hostname, name, version) changed randomly in the serialization.

This has been fixed by recursively checking for hashes and serializing
them in sorted order.

Note that this changes the serialization format and thus breaks
backwards compatibility. The old format could be emulated in order to
not break backwards compatibility. Backwards compatibility in this case
means to generate the same fingerprint for the same input.

Closes: logstash-plugins#39
Copy link
Contributor

@yaauie yaauie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ypid-geberit thank you for contributing!

Unfortunately, I believe this to be a breaking change (shown off by the broken spec and otherwise noted in-line), and will need to be either an opt-in feature or accompany a major version change (and not be bundled with Logstash until 7.0).

Because this plugin is widely used to generate consistent ids, there is a large risk that any breaking change here will have significant and surprising repercussions to many users. Even if we were to ship with a major version, there would need to be a backward-compatibility fallback that has unmodified behaviour. I would advocate instead to make this an opt-in-feature within the current major.

To do so, we would need a flag (naming is hard, but maybe force_sort?) that defaults to false, but when set causes the new sorting behaviour to be invoked.


Since Logstash does maintain iteration order of Hashes once they have been created, a near-term workaround would be to explicitly order the hash before fingerprinting it:

filter {
  ruby {
    # sort beat for consistent fingerprinting
    code => "beat = event.get('beat'); beat && event.set('beat', Hash[beat.sort])"
  }
  fingerprint {
    # ...
  }
}

event.to_hash.sort.map do |k,v|
to_string << "|#{k}|#{v}"
end
to_string << serialize(event)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a breaking change for any pipeline configuration that makes uses of concatenate_all_fields, because the string changes from pipe-delimited kv to json-ish.

if event.respond_to?(:to_hash)
to_string << "{"
event.to_hash.sort.map do |k,v|
to_string << "#{k}:#{serialize(v)},"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would be a breaking change in two ways:

  • current Hash#to_s outputs key/value pairs joined by =>, not by :
  • ordering

@ypid-geberit
Copy link
Author

Thanks for your review @yaauie 👍

Backwards compatibility is definitely a discussion point here and I find it reasonable to introduce a parameter like force_sort (disabled for now). I implemented this now. I would propose the name deep_sort as this is also used in other context/languages (Python, Perl) and because I think deep_sort should ultimately be enabled by default so force would be confusing.

Since Logstash does maintain iteration order of Hashes once they have been created, a near-term workaround would be to explicitly order the hash before fingerprinting it:

This would be interesting but I did not understand who I could rewrite the serialize function.

An open question is also if the serialize function should always be used. I would say yes and did this also in a separate commit.

@BenoitHugonnard
Copy link

+1 we need this !

@dbroeh
Copy link

dbroeh commented Feb 8, 2019

Would be really helpful to get this merged. Many thanks in advance.

@rryenlo
Copy link

rryenlo commented Jul 3, 2019

+1 any updates on when this is getting merged?

@ypid-geberit
Copy link
Author

#52 is probably a cleaner solution for this issue. Please have a look to get this fixed as this issue is hard to debug and probably occurs for all users of the plugin. The question is just if they have noticed it yet or not.

jsvd added a commit that referenced this pull request Aug 27, 2020
Fingerprinting is done by either fetching the .to_hash representation
or by fetching fields with .get(field)

The content of these often contain hashes, which we don't guarantee
order during insertion, so this commit recursively sorts hashes.

This change affects all modes:
* concatenate_all_sources => true
* concatenate_fields => true
* normal single field fingerprint (that may contain a nested Hash)

Fixes #39
Replaces #41, #52
@ypid-geberit
Copy link
Author

Nice to see this bug finally got fixed by #55. Thanks! Closing this alternative PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants