Elastic data types and missing test cases #46

andurin (Maintainer) · 2024-01-30T16:22:09Z

Hi Community,

This is my very first discussion, and I would like to ask for your help in the form of a constructive discussion.

#43 (also #25 and #36) revealed some issues between indexed data and its ability to be searched "as intended" using this Elasticsearch backend, especially with the Lucene backend.

Some points are unclear to me, since I'm not using ES on a daily basis (at the moment). For example: "To quote or not to quote?", "What about asterisk (*) searches against wildcard fields?", etc.

I would like to cover those points with tests within this project, so everyone interested is invited to help.

I'm thinking about test cases for the following:

Test data for different ES field data types, e.g.:
- process.command_line (as keyword): <test data>
- process.command_line.text (as text): <test data>
- process.command_line.wildcard (as wildcard): <test data>
Test cases for the different Sigma modifiers: what should a successful query look like?
- also for contains, endswith, etc.
What else?
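As a rough sketch of what such test data could look like, the snippet below defines an index mapping that indexes process.command_line as keyword, text, and wildcard via multi-fields, plus a few documents. The helper names (command_line_mapping, TEST_DOCS, expected_hits) and the sample command lines are made up for illustration and are not part of this project. expected_hits only computes a plain-Python reference count for a modifier, which an online test could then compare against what ES actually returns.

```python
# Hypothetical helper: one source field indexed three ways through
# Elasticsearch multi-fields (keyword, plus .text and .wildcard sub-fields).
def command_line_mapping() -> dict:
    return {
        "mappings": {
            "properties": {
                "process": {
                    "properties": {
                        "command_line": {
                            "type": "keyword",
                            "fields": {
                                "text": {"type": "text"},
                                "wildcard": {"type": "wildcard"},
                            },
                        }
                    }
                }
            }
        }
    }


# Illustrative documents only; real test data should be chosen per issue.
TEST_DOCS = [
    {"process": {"command_line": "powershell.exe -enc AAAA"}},
    {"process": {"command_line": "cmd.exe /c whoami"}},
    {"process": {"command_line": "C:\\Windows\\System32\\cmd.exe /c dir"}},
]


def expected_hits(modifier: str, needle: str) -> int:
    """Reference count computed in plain Python (case-sensitive here)."""
    match = {
        "contains": lambda hay: needle in hay,
        "startswith": lambda hay: hay.startswith(needle),
        "endswith": lambda hay: hay.endswith(needle),
    }[modifier]
    return sum(match(doc["process"]["command_line"]) for doc in TEST_DOCS)


print(expected_hits("contains", "cmd.exe"))
```

A test could then run the generated query against each of the three field variants and check the hit count against expected_hits. Any mismatch between the variants would point at a field-type-specific problem.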

Starting points for writing tests:

Query an ES node and expect $NUM results: https://github.com/SigmaHQ/pySigma-backend-elasticsearch/blob/main/tests/test_backend_elasticsearch_lucene_connect.py#L144
Changes to the mapping: https://github.com/SigmaHQ/pySigma-backend-elasticsearch/blob/main/tests/test_backend_elasticsearch_lucene_connect.py#L58
Test data: https://github.com/SigmaHQ/pySigma-backend-elasticsearch/blob/main/tests/test_backend_elasticsearch_lucene_connect.py#L87

I'm hoping for a lot of PRs for new and valuable test cases.

Regards,
@andurin

Koen1999 · 2024-01-30T17:41:57Z

I think the 'online' tests are what we should aim for. Although they make the test suite a bit more troublesome to run, they should add to the trust we have in the overall package.

It is important to note that for the Lucene backend, these are the most important resources to take a look at:

Perhaps a wise first step would be to identify which field types are common for the fields that we want to search. Ideally, we would have test cases for each field type. Moreover, I agree with @andurin that tests should cover a wide range of Sigma modifiers. Specifically, boolean fields and date fields are currently not covered by the test cases.

Another important consideration, however, is that pySigma-backend-elasticsearch is of little use on its own and only becomes meaningful in combination with, for example, pySigma-pipeline-sysmon and the rules in the main sigma repo. Therefore, it might be a good idea to work on developing some kind of integration test.
Something that I would expect these integration tests to uncover is that field mappings might be incorrect. It is important to realize that the correctness of a query generated by the Lucene backend might depend on the type of the field the query is executed against.

One particular issue I tried to address in #43, but that is currently not covered by tests, is the possibility of queries containing both wildcards and spaces. We should also include tests using special characters that must be escaped (e.g. quotes) and field types that are currently not tested.
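To make the wildcard-plus-space case concrete, here is a minimal sketch of an escaping routine that could serve as a reference in such tests. It is not the backend's actual implementation. The function name and the input convention (every character is literal except * and ?, which stay wildcards) are assumptions for this example. The reserved set follows the special characters listed for the classic Lucene query parser. && and || are handled by escaping each & and | individually.

```python
# Characters the classic Lucene query parser treats as syntax.
# "*" and "?" are deliberately excluded so they keep acting as wildcards.
LUCENE_RESERVED = set('+-!(){}[]^"~:\\/&|')


def escape_lucene_term(value: str) -> str:
    """Escape a value for use as an unquoted Lucene term (hypothetical helper)."""
    out = []
    for ch in value:
        # Spaces are escaped too, so the value stays a single term
        # instead of being split into several terms by the parser.
        if ch in LUCENE_RESERVED or ch == " ":
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)


print(escape_lucene_term('cmd.exe /c "whoami"*'))
```

Whether the escaped term then matches as intended on keyword, text, and wildcard fields is exactly what the online tests would have to confirm.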

Koen1999 · 2024-01-30T18:01:54Z

On the matter of "to quote, or not to quote", I think the answer is quite simple: we should strive to never quote.

To elaborate, according to the Lucene documentation I mentioned above:

A Phrase is a group of words surrounded by double quotes such as "hello dolly".

Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries).

In other words, if the Lucene backend should support wildcard searches, which is an essential part of the Sigma syntax, we should not generate phrases and hence should not quote.

The only exception I can think of at the moment is when you want to search for a field whose value is the empty string. (Note that this is different from asserting that a field exists.)
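The rule could be sketched as follows. The function name is hypothetical, and it assumes the value has already been escaped for use as an unquoted term. Whether field:"" really matches empty strings on a given field type is an assumption that a test would need to verify.

```python
def lucene_condition(field: str, escaped_value: str) -> str:
    """Render field:value without phrase quotes (hypothetical helper)."""
    if escaped_value == "":
        # Sole exception: an empty term cannot be written unquoted.
        return f'{field}:""'
    # Never quote otherwise, so wildcards in the value remain wildcards.
    return f"{field}:{escaped_value}"


print(lucene_condition("fieldA", "hello\\ dolly*"))
print(lucene_condition("fieldA", ""))
```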

Koen1999 · 2024-01-30T19:38:25Z

Moreover, I have come to realize what causes all the problems we are having with the Lucene backend and Sigma. Sigma promises to translate Sigma rules, which may contain regular expressions, whereas this is a feature Lucene simply cannot support. Lucene only allows for certain wildcards and is less expressive than the Sigma syntax suggests.

A Sigma query such as keywordFieldA|re: value .* spaces results in the following Lucene query: textFieldA:/value .* spaces/. Obviously, this query will not work. Ideally, we would modify the backend to rewrite regular expressions into something using wildcards, such as textFieldA:*value\ *\ spaces*.
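A minimal sketch of such a rewrite, under some loud assumptions: the function name is made up, only literals, . and .* are translated, the regex is treated as unanchored (hence the leading and trailing *, as in the example above), and every other regex construct is rejected rather than approximated.

```python
REGEX_UNSUPPORTED = set("[](){}+?|^$*")
LUCENE_RESERVED = set('+-!(){}[]^"~:\\/&|*?')


def _literal(ch: str) -> str:
    # Escape characters that Lucene would otherwise treat as syntax.
    return "\\" + ch if ch in LUCENE_RESERVED or ch == " " else ch


def regex_to_wildcard(pattern: str) -> str:
    """Rewrite a very simple regex as a Lucene wildcard term (hypothetical)."""
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            nxt = pattern[i + 1 : i + 2]
            if nxt == "" or nxt.isalnum():
                # Dangling backslash, or classes such as \d, \w, \b.
                raise ValueError(f"unsupported escape at position {i}")
            tokens.append(_literal(nxt))
            i += 2
        elif ch == ".":
            if pattern[i + 1 : i + 2] == "*":
                if not tokens or tokens[-1] != "*":
                    tokens.append("*")  # ".*" becomes "*"
                i += 2
            else:
                tokens.append("?")  # "." becomes "?"
                i += 1
        elif ch in REGEX_UNSUPPORTED:
            raise ValueError(f"cannot express {ch!r} with wildcards")
        else:
            tokens.append(_literal(ch))
            i += 1
    # Unanchored match: surround with "*" unless one is already there.
    if not tokens or tokens[0] != "*":
        tokens.insert(0, "*")
    if tokens[-1] != "*":
        tokens.append("*")
    return "".join(tokens)


print(regex_to_wildcard("value .* spaces"))
```

Raising an error for anything beyond these constructs keeps the backend from silently emitting a query that matches something other than what the rule author wrote.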

Of course, this becomes more complicated when queries specify character groups, lookahead, exclusion, or the number of matches. Perhaps this is something the team behind Sigma should take a look at, because it looks like a contradiction at Sigma's core to me.

EDIT: This means that we should also look back at previous issues and PRs such as #9. In the commit that addressed it, a new test was also added: 563c565#diff-8e673d84136778434f31a4b9af2fc02d9afcb5c2bbac2698f57017989f65943aR142-R156
To my understanding, the test_lucene_regex_query_escaped_input test is wrong, first and foremost because the : character should be escaped. I also do not understand the meaning of the \ at the beginning and end of the query on the field. And, of course, it is wrong for the reasons highlighted above. I think the closest approximation Lucene allows for is fieldA:127\.0\.0\.1\:???
