From 51548ae4f5331394293062521c1d973fdbef4d43 Mon Sep 17 00:00:00 2001
From: trinity-1686a
Date: Thu, 1 Aug 2024 17:28:54 +0200
Subject: [PATCH 1/4] start adding documentation on updating mapping

---
 docs/reference/updating-mapper.md | 42 +++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)
 create mode 100644 docs/reference/updating-mapper.md

diff --git a/docs/reference/updating-mapper.md b/docs/reference/updating-mapper.md
new file mode 100644
index 00000000000..19c32c0fedb
--- /dev/null
+++ b/docs/reference/updating-mapper.md
@@ -0,0 +1,42 @@
+# Updating the doc mapping of an index
+
+Quickwit allows updating the mapping it uses, to add more fields to an existing index, or change how they are indexed. In doing so,
+it does not reindex existing data, but still lets you search through older documents where possible.
+
+## Indexing
+
+When receiving a new doc mapping for an index, Quickwit will restart indexing pipelines to take into account the changes. This operation
+is asynchronous. You can't assume documents sent immediately after the update will necessarily use the new mapping. On the other hand,
+this usually happens very fast, possibly while some documents sent before the update haven't been committed yet. This means you can't
+assume documents sent before the update won't be indexed using the new doc mapping.
+
+## Querying
+
+When receiving a query, Quickwit always validates that query against the most recent mapping. If a query was valid under a previous
+mapping but is not compatible with the newer mapping, that query will be rejected. For instance if a field which was indexed no longer
+is, or a field was removed, any query that uses it will become invalid. On the other hand, if a query was not valid for a previous doc
+mapping, but is valid under the new doc mapping, Quickwit will process the query. When querying newer splits, it will behave normally,
+when querying older splits, it will try to execute the query as correctly as possible.
If you find a situation where older splits
+cause a valid request to return an error, please open a bug report.
+
+For instance, if a field was added or newly marked as indexed, Quickwit will behave as if no older document had the right value. This
+means `field1:abc field2:def` where field2 is a new field would return no documents for older splits. However `field1:abc OR field2:def`
+would return the same as `field1:abc` for those older splits, and `NOT field2:def` would return every document.
+
+If a field was previously not marked as fast, and it is now, aggregations using that field will assume the value is missing for older
+documents, and will behave normally for newer.
+
+If you remove stored fields, the documents returned from a query will instantly have that field hidden, as if that field never was
+stored. If you change the type of a field, Quickwit will attempt to convert the old document format into the newer one. If it can't,
+it will show the field as absent. Some conversions are trivially always correct, for instance converting from integer to text. Some are
+dependent on your data, for instance converting text to integer. Finally some will always hide the fields, for instance converting text
+to an object. If a field goes from being an array, to being single-valued, only the first value will be returned, as a direct value.
+When converting an array field to be a part of a json field, Quickwit will decide per document and per field to show it as a direct
+value (if it contained exactly one element for this document), or an array if it contained any other number of values.
+
+## Reversibility
+
+Quickwit doesn't store old doc mappings, it can't revert an update all by itself. However it also does not modify existing data when
+receiving a new doc mapping. If you realize you updated the mapping in a way that's very wrong, you can re-update using the previous
+mapping.
Documents indexed while the mapping was wrong will be impacted, but any document that was committed before the change will be
+back to its original state.

From 26369308f166cd8f442c86594e3706247d8f2845 Mon Sep 17 00:00:00 2001
From: trinity-1686a
Date: Fri, 27 Sep 2024 11:35:44 +0200
Subject: [PATCH 2/4] improve update-api documentation

---
 docs/reference/rest-api.md | 8 +-
 docs/reference/updating-mapper.md | 152 +++++++++++++++++++++++++-----
 2 files changed, 132 insertions(+), 28 deletions(-)

diff --git a/docs/reference/rest-api.md b/docs/reference/rest-api.md
index b70286a01de..7dee9592f98 100644
--- a/docs/reference/rest-api.md
+++ b/docs/reference/rest-api.md
@@ -318,12 +318,10 @@ Updates the configurations of an index. This endpoint follows PUT semantics, whi
 
 - The retention policy update is automatically picked up by the janitor service on its next state refresh.
 - The search settings update is automatically picked up by searcher nodes when the next query is executed.
-- The indexing settings update is not automatically picked up by the indexer nodes, they need to be manually restarted.
-- The doc mapping update is not automatically picked up by the indexer nodes, they have to be manually restarted.
+- The indexing settings update is automatically picked up by the indexer nodes once the control plane emits a new indexing plan.
+- The doc mapping update is automatically picked up by the indexer nodes once the control plane emits a new indexing plan.
 
-Updating the doc mapping doesn't reindex existing data. Queries and answers are mapped on a best effort basis when querying older splits.
-It is also not possible to update the timestamp field, or to modify/remove existing non-default tokenizers (but it is possible to change
-which tokenizer is used for a field).
+Updating the doc mapping doesn't reindex existing data. Queries and answers are mapped on a best effort basis when querying older splits.
For more details, check [the reference](updating-mapper.md)
 
 #### PUT payload
 
diff --git a/docs/reference/updating-mapper.md b/docs/reference/updating-mapper.md
index 19c32c0fedb..b87ac62e4f0 100644
--- a/docs/reference/updating-mapper.md
+++ b/docs/reference/updating-mapper.md
@@ -5,34 +5,24 @@ it does not reindex existing data, still lets you search through older documents
 
 ## Indexing
 
-When receiving a new doc mapping for an index, Quickwit will restart indexing pipelines to take into account the changes. This operation
-is asynchronous. You can't assume documents sent immediately after the update will necessarily use the new mapping. On the other hand,
-this usually happens very fast, possibly while some documents sent before the update haven't been committed yet. This means you can't
-assume documents sent before the update won't be indexed using the new doc mapping.
+When you update a doc mapping for an index, Quickwit will restart indexing pipelines to take the changes into account. As both this operation and the document ingestion are asynchronous, you can't assume documents sent immediately after the update will necessarily use the new mapping nor that documents sent immediately before the update won't be indexed using the new doc mapping.
 
 ## Querying
 
-When receiving a query, Quickwit always validates that query against the most recent mapping. If a query was valid under a previous
-mapping but is not compatible with the newer mapping, that query will be rejected. For instance if a field which was indexed no longer
-is, or a field was removed, any query that uses it will become invalid. On the other hand, if a query was not valid for a previous doc
-mapping, but is valid under the new doc mapping, Quickwit will process the query. When querying newer splits, it will behave normally,
-when querying older splits, it will try to execute the query as correctly as possible.
If you find a situation where older splits
-cause a valid request to return an error, please open a bug report.
+When receiving a query, Quickwit always validates it against the most recent mapping.
+If a query was valid under a previous mapping but is not compatible with the newer mapping, that query will be rejected.
+For instance if a field which was indexed no longer is, any query that uses it will become invalid.
+On the other hand, if a query was not valid for a previous doc mapping, but is valid under the new doc mapping, Quickwit will process the query.
+When querying newer splits, it will behave normally, when querying older splits, it will try to execute the query as correctly as possible.
+If you find a situation where older splits cause a valid request to return an error, please open a bug report.
+See examples 1 and 2 below for clarification.
 
-For instance, if a field was added or newly marked as indexed, Quickwit will behave as if no older document had the right value. This
-means `field1:abc field2:def` where field2 is a new field would return no documents for older splits. However `field1:abc OR field2:def`
-would return the same as `field1:abc` for those older splits, and `NOT field2:def` would return every document.
+Changes in tokenizer affect only newer splits; older splits keep using the tokenizers they were created with.
 
-If a field was previously not marked as fast, and it is now, aggregations using that field will assume the value is missing for older
-documents, and will behave normally for newer.
-
-If you remove stored fields, the documents returned from a query will instantly have that field hidden, as if that field never was
-stored. If you change the type of a field, Quickwit will attempt to convert the old document format into the newer one. If it can't,
-it will show the field as absent. Some conversions are trivially always correct, for instance converting from integer to text.
Some are
-dependent on your data, for instance converting text to integer. Finally some will always hide the fields, for instance converting text
-to an object. If a field goes from being an array, to being single-valued, only the first value will be returned, as a direct value.
-When converting an array field to be a part of a json field, Quickwit will decide per document and per field to show it as a direct
-value (if it contained exactly one element for this document), or an array if it contained any other number of values.
+Documents retrieved are mapped from Quickwit internal format to JSON based on the latest doc mapping. This means if fields are deleted,
+they will stop appearing (see also Reversibility below). If the type of some field changed, it will be converted on a best effort basis:
+integers will get turned into text, text will get turned into integers when it is possible; otherwise, the field is omitted.
+See example 3 for clarification.
 
 ## Reversibility
 
@@ -40,3 +30,119 @@ Quickwit doesn't store old doc mappings, it can't revert an update all by itself
 receiving a new doc mapping. If you realize you updated the mapping in a way that's very wrong, you can re-update using the previous
 mapping. Documents indexed while the mapping was wrong will be impacted, but any document that was committed before the change will be
 back to its original state.
+
+## Examples
+
+In all examples, fields which are not relevant are removed for conciseness; you will not be able to use these index configs as is.
+
+### Example 1
+
+before:
+```yaml
+doc_mapping:
+  field_mappings:
+    - name: field1
+      type: text
+      tokenizer: raw
+```
+
+after:
+```yaml
+doc_mapping:
+  field_mappings:
+    - name: field1
+      type: text
+      indexed: false
+```
+
+A field changed from being indexed to not being indexed.
+A query such as `field1:my_value` was valid, but is now rejected.
+
+### Example 2
+
+before:
+```yaml
+doc_mapping:
+  field_mappings:
+    - name: field1
+      type: text
+      indexed: false
+    - name: field2
+      type: text
+      tokenizer: raw
+
+```
+
+after:
+```yaml
+doc_mapping:
+  field_mappings:
+    - name: field1
+      type: text
+      tokenizer: raw
+    - name: field2
+      type: text
+      tokenizer: raw
+```
+
+A field changed from being not indexed to being indexed.
+A query such as `field1:my_value` was invalid before, and is now valid. When querying older splits, it won't return a match, but won't return an error either.
+A query such as `field1:my_value OR field2:my_value` is now valid too. For old splits, it will return the same results as `field2:my_value` as field1 wasn't indexed before. For newer splits, it will return the expected results.
+A query such as `NOT field1:my_value` would return all documents for old splits, and only documents where `field1` is not `my_value` for newer splits.
+
+
+### Example 3
+
+# show cast (trivial, valid and invalid)
+# show array to single
+
+before:
+```yaml
+doc_mapping:
+  field_mappings:
+    - name: field1
+      type: text
+    - name: field2
+      type: u64
+    - name: field3
+      type: array<text>
+```
+documents present before update:
+```json
+{
+  "field1": "123",
+  "field2": 456,
+  "field3": ["abc", "def"]
+}
+{
+  "field1": "message",
+  "field2": 987,
+  "field3": ["ghi"]
+}
+```
+
+after:
+```yaml
+doc_mapping:
+  field_mappings:
+    - name: field1
+      type: u64
+    - name: field2
+      type: text
+    - name: field3
+      type: text
+```
+
+When querying this index, the documents returned would become:
+```json
+{
+  "field1": 123,
+  "field2": "456",
+  "field3": "abc"
+}
+{
+  // field1 is missing because "message" can't be converted to int
+  "field2": "987",
+  "field3": "ghi"
+}
+```

From 02f2edb29ab5b64adfd4f37d1de55380c8e7bd00 Mon Sep 17 00:00:00 2001
From: trinity-1686a
Date: Fri, 27 Sep 2024 12:31:03 +0200
Subject: [PATCH 3/4] document valid conversions

---
 docs/reference/updating-mapper.md | 37 ++++++++++++++++++-
 .../src/default_doc_mapper/mapping_tree.rs | 5 +++
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/docs/reference/updating-mapper.md b/docs/reference/updating-mapper.md
index b87ac62e4f0..fd2ec40b32a 100644
--- a/docs/reference/updating-mapper.md
+++ b/docs/reference/updating-mapper.md
@@ -20,7 +20,7 @@ See example 1 and 2 below for clarification.
 Changes in tokenizer affect only newer splits; older splits keep using the tokenizers they were created with.
 
 Documents retrieved are mapped from Quickwit internal format to JSON based on the latest doc mapping. This means if fields are deleted,
-they will stop appearing (see also Reversibility below). If the type of some field changed, it will be converted on a best effort basis:
+they will stop appearing (see also Reversibility below) unless mapper mode is Dynamic. If the type of some field changed, it will be converted on a best effort basis:
 integers will get turned into text, text will get turned into integers when it is possible; otherwise, the field is omitted.
 See example 3 for clarification.
 
@@ -31,6 +31,41 @@ receiving a new doc mapping. If you realize you updated the mapping in a way tha
 mapping. Documents indexed while the mapping was wrong will be impacted, but any document that was committed before the change will be
 back to its original state.
+
+## Type update reference
+
+Conversions from a type to itself are omitted. Conversions which never succeed and always omit the field are omitted too.
+
+
+| type before | type after | behavior |
+|-------------|------------|----------|
+| u64/i64/f64 | text | convert to decimal string |
+| date | text | convert to rfc3339 textual representation |
+| ip | text | convert to IPv6 representation.
For IPv4, convert to IPv4-mapped IPv6 address (`::ffff:1.2.3.4`) |
+| bool | text | convert to "true" or "false" |
+| u64/i64/f64 | bool | convert 0/0.0 to false and 1/1.0 to true, otherwise omit |
+| text | bool | convert if "true" or "false" (lowercase), otherwise omit |
+| text | ip | convert if valid IPv4 or IPv6, otherwise omit |
+| text | f64 | convert if valid floating point number, otherwise omit |
+| u64/i64 | f64 | convert, possibly with loss of precision |
+| bool | f64 | convert to 0.0 for false, and 1.0 for true |
+| text | u64 | convert if valid integer in range 0..2\*\*64, otherwise omit |
+| i64 | u64 | convert if in range 0..2\*\*63, otherwise omit |
+| f64 | u64 | convert if in range 0..2\*\*64, possibly with loss of precision, otherwise omit |
+| text | i64 | convert if valid integer in range -2\*\*63..2\*\*63, otherwise omit |
+| u64 | i64 | convert if in range 0..2\*\*63, otherwise omit |
+| f64 | i64 | convert if in range -2\*\*63..2\*\*63, possibly with loss of precision, otherwise omit |
+| bool | i64 | convert to 0 for false, and 1 for true |
+| text | datetime | parse according to current input\_format, otherwise omit |
+| u64 | datetime | parse according to current input\_format, otherwise omit |
+| i64 | datetime | parse according to current input\_format, otherwise omit |
+| f64 | datetime | parse according to current input\_format, otherwise omit |
+| array\<T\> | array\<U\> | convert individual elements, skipping over those which can't be converted |
+| T | array\<U\> | convert element, emitting an array of a single element, or an empty array if it can't be converted |
+| array\<T\> | U | convert individual elements, keeping the first which can be converted |
+| json | object | try to convert individual elements if they exist inside the object, omit individual elements which can't be converted |
+| object | json | convert individual elements. Previous lists of one element are converted to a single element not in an array. |
+
 
 ## Examples
 
 In all examples, fields which are not relevant are removed for conciseness; you will not be able to use these index configs as is.
diff --git a/quickwit/quickwit-doc-mapper/src/default_doc_mapper/mapping_tree.rs b/quickwit/quickwit-doc-mapper/src/default_doc_mapper/mapping_tree.rs
index 28851fd37c3..e28dcef6cd8 100644
--- a/quickwit/quickwit-doc-mapper/src/default_doc_mapper/mapping_tree.rs
+++ b/quickwit/quickwit-doc-mapper/src/default_doc_mapper/mapping_tree.rs
@@ -640,6 +640,11 @@ fn value_to_bool(value: TantivyValue) -> Result {
             1 => Some(true),
             _ => None,
         },
+        TantivyValue::F64(number) => match number {
+            0.0 => Some(false),
+            1.0 => Some(true),
+            _ => None,
+        },
         TantivyValue::Bool(b) => Some(*b),
         _ => None,
     }

From 2a8adbfda2db70061c9e57695b013c12424b72d0 Mon Sep 17 00:00:00 2001
From: trinity-1686a
Date: Fri, 27 Sep 2024 12:48:42 +0200
Subject: [PATCH 4/4] improve section on reversibility

---
 docs/reference/updating-mapper.md | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/docs/reference/updating-mapper.md b/docs/reference/updating-mapper.md
index fd2ec40b32a..ce0ede1d06e 100644
--- a/docs/reference/updating-mapper.md
+++ b/docs/reference/updating-mapper.md
@@ -26,11 +26,7 @@ See example 3 for clarification.
 
 ## Reversibility
 
-Quickwit doesn't store old doc mappings, it can't revert an update all by itself. However it also does not modify existing data when
-receiving a new doc mapping. If you realize you updated the mapping in a way that's very wrong, you can re-update using the previous
-mapping. Documents indexed while the mapping was wrong will be impacted, but any document that was committed before the change will be
-back to its original state.
-
+Quickwit does not modify existing data when receiving a new doc mapping. If you realize that you updated the mapping in a wrong way, you can re-update your index using the previous mapping.
Documents indexed while the mapping was wrong will be impacted, but any document that was committed before the change will be queryable as if nothing happened.
 ## Type update reference
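
The conversion rules listed in the type update reference (PATCH 3/4) can be illustrated with a small model. The following is a minimal Python sketch under stated assumptions, not Quickwit's implementation: every function name is hypothetical, `None` stands for "omit the field", and only a few rows of the table are modeled (text to u64, integer to text, number to bool, and array to single value). It replays the documents from Example 3.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical helper names for illustration only; None means "omit the field".
Converter = Callable[[Any], Optional[Any]]


def text_to_u64(value: str) -> Optional[int]:
    """text -> u64: keep only plain decimal integers in range 0..2**64."""
    if not (value.isascii() and value.isdigit()):
        return None
    number = int(value)
    return number if number < 2**64 else None


def number_to_text(value: int) -> str:
    """u64/i64 -> text: decimal string."""
    return str(value)


def number_to_bool(value: float) -> Optional[bool]:
    """u64/i64/f64 -> bool: only 0 and 1 convert, anything else is omitted."""
    if value == 0:
        return False
    if value == 1:
        return True
    return None


def array_to_single(values: List[Any], convert: Converter) -> Optional[Any]:
    """array<T> -> U: keep the first element that can be converted."""
    for value in values:
        converted = convert(value)
        if converted is not None:
            return converted
    return None


def convert_document(doc: Dict[str, Any], converters: Dict[str, Converter]) -> Dict[str, Any]:
    """Apply one converter per field and drop fields that fail to convert."""
    result = {}
    for field, convert in converters.items():
        if field in doc:
            converted = convert(doc[field])
            if converted is not None:
                result[field] = converted
    return result


# Mapping change from Example 3: text -> u64, u64 -> text, array<text> -> text.
converters = {
    "field1": text_to_u64,
    "field2": number_to_text,
    "field3": lambda values: array_to_single(values, lambda v: v),
}
old_docs = [
    {"field1": "123", "field2": 456, "field3": ["abc", "def"]},
    {"field1": "message", "field2": 987, "field3": ["ghi"]},
]
new_docs = [convert_document(doc, converters) for doc in old_docs]
print(new_docs)
```

In this model the second document loses `field1` because `"message"` is not a valid integer, which matches the outcome described in Example 3. The exact parsing rules Quickwit applies (signs, whitespace, float syntax) are not specified in the patch and may differ from this sketch.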