Handle the cases where a field is both indexed and not indexed within a time range #825

cawaring · 2020-05-28T19:36:25Z

Fields can be added to or removed from the index over time. We do not always handle cases where a field spans a time range in which it is both indexed and not indexed. We need to handle this case appropriately. This would include using the field as an indexed field for the portion of time in which it is indexed, and avoiding its use for pruning queries and ranges when it is not indexed.

ivakegg · 2020-07-31T18:02:49Z

Suggest marking the time ranges in the datawave metadata table, and then using that info in the DefaultQueryPlanner appropriately. The list of indexed fields sent to the QueryIterator will need to be adjusted for each range as appropriate.
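As a rough illustration of adjusting the indexed-field list per range, here is a minimal sketch. `IndexedFieldTimeline`, `markIndexed`, and `indexedFieldsFor` are hypothetical names made up for this example; they are not DataWave or metadata-table APIs. The sketch assumes the indexed date ranges have already been read out of the metadata table.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper (not a DataWave class): remembers the date ranges
// in which each field was indexed.
final class IndexedFieldTimeline {

    // Inclusive date range [start, end].
    static final class DateRange {
        final LocalDate start;
        final LocalDate end;

        DateRange(LocalDate start, LocalDate end) {
            this.start = start;
            this.end = end;
        }

        // True when this range fully contains [from, to].
        boolean covers(LocalDate from, LocalDate to) {
            return !start.isAfter(from) && !end.isBefore(to);
        }
    }

    private final Map<String, List<DateRange>> indexedRanges = new HashMap<>();

    void markIndexed(String field, LocalDate start, LocalDate end) {
        indexedRanges.computeIfAbsent(field, k -> new ArrayList<>()).add(new DateRange(start, end));
    }

    // Fields that can be treated as indexed for the whole of [from, to].
    // Simplification: adjacent recorded ranges are not merged, so a field
    // counts only if a single recorded range covers the query range.
    Set<String> indexedFieldsFor(LocalDate from, LocalDate to) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, List<DateRange>> entry : indexedRanges.entrySet()) {
            for (DateRange range : entry.getValue()) {
                if (range.covers(from, to)) {
                    result.add(entry.getKey());
                    break;
                }
            }
        }
        return result;
    }
}
```

A field that is indexed for only part of the requested range is left out of the result, so it would not be used to prune that range.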

ivakegg · 2020-07-31T18:03:44Z

The Data Dictionary should also show this information.

ivakegg · 2020-10-09T17:59:48Z

FYI, the PushdownMissingIndexRangeNodesVisitor is sort of an initial implementation of this capability. However instead of pushing those nodes down, perhaps range stream index lookup can return a stream of shards for the holes, and also modify the query for those ranges to push the terms down to the evaluation phase. The comment I put into another pull request may help one understand this: #925 (comment).

ivakegg · 2020-10-09T18:09:56Z

BTW having the ingest automatically fill in the metadata for when a field is indexed would be great as well. This should probably be encoded in the same way we are condensing the F column here: #828. For existing systems we would need to manually update (a tool would be nice) the metadata for data already ingested.

jzgithub1 · 2020-11-03T15:27:09Z

> BTW having the ingest automatically fill in the metadata for when a field is indexed would be great as well. This should probably be encoded in the same way we are condensing the F column here: #828. For existing systems we would need to manually update (a tool would be nice) the metadata for data already ingested.

This is done with the new IndexColumnIterator class.

jzgithub1 · 2020-11-03T15:29:57Z

> BTW having the ingest automatically fill in the metadata for when a field is indexed would be great as well. This should probably be encoded in the same way we are condensing the F column here: #828. For existing systems we would need to manually update (a tool would be nice) the metadata for data already ingested.

Already implemented in the pull request for this ticket.

jzgithub1 · 2020-11-03T15:37:48Z

> FYI, the PushdownMissingIndexRangeNodesVisitor is sort of an initial implementation of this capability. However instead of pushing those nodes down, perhaps range stream index lookup can return a stream of shards for the holes, and also modify the query for those ranges to push the terms down to the evaluation phase. The comment I put into another pull request may help one understand this: #925 (comment).

The code inside processTree in the DefaultQueryPlanner looks like it should be able to take care of index holes. The problem is that the index holes are never set inside the ShardQueryConfiguration object before this call on line 1082:
queryTree = PushdownMissingIndexRangeNodesVisitor.pushdownPredicates(queryTree, config, metadataHelper);

I am going to investigate your idea more thoroughly though since this is your area of expertise. I will engage @lbschanno about this issue.

ivakegg · 2020-11-04T15:04:34Z

As I believe you have discovered, the existing "index hole" mechanism is value based (i.e. values are missing in the index from 'a' to 'b' for a specified date range). This ticket will result in a different kind of index hole which is field based, meaning that a field will be missing all entries in the index for a specified date range or date ranges. I suggest we use the terminology of ValueIndexHole and FieldIndexHole. Once we have the FieldIndexHoles encoded in the metadata table and provide a mechanism to get those out, we should put those in the ShardQueryConfiguration alongside the ValueIndexHoles.
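To make the distinction concrete, here is a minimal sketch of the two kinds of hole. The class names follow the terminology proposed above, but the fields, constructors, and the `affects` method are assumptions made for illustration; they are not the actual DataWave types.

```java
import java.time.LocalDate;

// Hypothetical types for illustration only; not the DataWave implementation.
final class IndexHoleSketch {

    // Value-based hole: index entries for values in [startValue, endValue]
    // are missing between startDate and endDate (inclusive).
    static final class ValueIndexHole {
        final String startValue;
        final String endValue;
        final LocalDate startDate;
        final LocalDate endDate;

        ValueIndexHole(String startValue, String endValue, LocalDate startDate, LocalDate endDate) {
            this.startValue = startValue;
            this.endValue = endValue;
            this.startDate = startDate;
            this.endDate = endDate;
        }

        boolean affects(String value, LocalDate date) {
            return value.compareTo(startValue) >= 0
                    && value.compareTo(endValue) <= 0
                    && inRange(date, startDate, endDate);
        }
    }

    // Field-based hole: every index entry for the field is missing
    // between startDate and endDate (inclusive).
    static final class FieldIndexHole {
        final String field;
        final LocalDate startDate;
        final LocalDate endDate;

        FieldIndexHole(String field, LocalDate startDate, LocalDate endDate) {
            this.field = field;
            this.startDate = startDate;
            this.endDate = endDate;
        }

        boolean affects(String queryField, LocalDate date) {
            return field.equals(queryField) && inRange(date, startDate, endDate);
        }
    }

    static boolean inRange(LocalDate date, LocalDate start, LocalDate end) {
        return !date.isBefore(start) && !date.isAfter(end);
    }
}
```

The value hole is keyed on a range of values, while the field hole is keyed only on the field name, which is why the two would need to be stored side by side rather than merged into one structure.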

Currently we have an IndexHoleMarkerJexlNode that is used to mark fields in the query that need to be delayed until evaluation time. When this is added to the query plan, it denotes that we cannot use that term for index lookup across the entire query date range. This avoids missing any results, but it also prevents us from using the index for those portions of the date range where it might still be available to us. It would be desirable for both ValueIndexHoles and FieldIndexHoles that we create separate plans for portions of the query range, so that we can take full advantage of the index where possible. I have a few possibilities in mind for how to achieve this. We should evaluate these and any other ideas before continuing with the implementation.

1. We could create a query planner that uses a separate DefaultQueryPlanner instance per portion of the query date range that is consistent in terms of index holes. This would result in multiple range streams, which could be funneled into one range stream. This way the existing DefaultQueryPlanner would not really change, except for adding a visitor (or extending the existing index hole visitor) to mark fields that have holes for the specified portion of the query date range.
2. We could modify the global index lookups to return exhaustive sets of shard ranges for those portions of the index that are not available for the field being queried. So the index lookup would return the UIDs (document ranges) and shards (fi ranges) when scanning the index, and when it hits a date range that is deemed unavailable/missing, it would simply generate artificial shards (fi ranges). The same thing would need to be done for the field index scans, including the ivarators (DatawaveFieldIndexCachingIterator…).
3. The existing DefaultQueryPlanner would generate multiple plans concurrently, one for each consistent portion of the date range relative to index holes. This is basically like option 1, but the management of the plans would be pushed into the existing DefaultQueryPlanner.
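Options 1 and 3 both depend on splitting the query date range into portions that are consistent with respect to index holes. A minimal sketch of that splitting step follows, assuming day granularity and inclusive date ranges represented as two-element `LocalDate` arrays. `SubRangeSplitter` is a hypothetical name, and this is not how any planner in DataWave is known to do it.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: splits a query date range so that every sub-range
// lies either entirely inside or entirely outside each index hole.
final class SubRangeSplitter {

    // Each range is a two-element array {start, end}, both inclusive.
    static List<LocalDate[]> split(LocalDate queryStart, LocalDate queryEnd, List<LocalDate[]> holes) {
        // Collect every date at which a new sub-range must begin.
        TreeSet<LocalDate> starts = new TreeSet<>();
        starts.add(queryStart);
        for (LocalDate[] hole : holes) {
            LocalDate holeStart = hole[0];
            LocalDate afterHole = hole[1].plusDays(1);
            if (holeStart.isAfter(queryStart) && !holeStart.isAfter(queryEnd)) {
                starts.add(holeStart);
            }
            if (afterHole.isAfter(queryStart) && !afterHole.isAfter(queryEnd)) {
                starts.add(afterHole);
            }
        }

        // Turn consecutive start dates into inclusive sub-ranges.
        List<LocalDate[]> result = new ArrayList<>();
        LocalDate previous = null;
        for (LocalDate start : starts) {
            if (previous != null) {
                result.add(new LocalDate[] {previous, start.minusDays(1)});
            }
            previous = start;
        }
        result.add(new LocalDate[] {previous, queryEnd});
        return result;
    }
}
```

For example, a query over 2020-01-01 to 2020-01-31 with a single hole from 2020-01-10 to 2020-01-15 yields three sub-ranges: before the hole, the hole itself, and after it. Each could then be planned separately, with the affected field treated as non-indexed only in the middle portion.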

Modify the generation of 'i' (indexed rows) and 'ri' (reverse indexed rows) in the metadata table such that the column qualifier contains the event date. This is required as a first step to support efforts for issue #825 so that we can identify dates when an event was ingested and included in a frequency count for an associated 'f' row, but was not indexed.
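One way to read the paragraph above: once both the 'f' rows and the 'i' rows carry an event date, a field index hole is any run of dates that appears in the frequency rows but not in the indexed rows. The sketch below assumes the dates have already been extracted from the metadata table into plain sets, and treats only consecutive calendar days as one hole. `FieldHoleFinder` and `findHoles` are hypothetical names, not metadata-utils APIs.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;

// Hypothetical helper: derives field index holes by comparing the dates
// seen in 'f' (frequency) rows with the dates seen in 'i' (indexed) rows.
final class FieldHoleFinder {

    // Returns inclusive {start, end} ranges of dates that have frequency
    // data but no index data. Only consecutive days are merged into a run.
    static List<LocalDate[]> findHoles(SortedSet<LocalDate> frequencyDates, Set<LocalDate> indexedDates) {
        List<LocalDate[]> holes = new ArrayList<>();
        LocalDate runStart = null;
        LocalDate runEnd = null;
        for (LocalDate date : frequencyDates) {
            if (indexedDates.contains(date)) {
                continue; // the field was indexed on this date, so no hole here
            }
            if (runStart != null && date.equals(runEnd.plusDays(1))) {
                runEnd = date; // extend the current run of missing dates
            } else {
                if (runStart != null) {
                    holes.add(new LocalDate[] {runStart, runEnd});
                }
                runStart = date;
                runEnd = date;
            }
        }
        if (runStart != null) {
            holes.add(new LocalDate[] {runStart, runEnd});
        }
        return holes;
    }
}
```

A real implementation would likely also need to decide how to treat days with no data at all between two missing days, and whether a low index count relative to the frequency count should be considered a hole; this sketch ignores both.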

lbschanno · 2023-10-26T10:55:38Z

Created PR datawave-metadata-utils/pull/29 to add functionality for retrieving field index holes from the metadata table.

* Enrich 'i' and 'ri' rows in metadata table with event date. Modify the generation of 'i' (indexed rows) and 'ri' (reverse indexed rows) in the metadata table such that the column qualifier contains the event date. This is required as a first step to support efforts for issue #825 so that we can identify dates when an event was ingested and included in a frequency count for an associated 'f' row, but was not indexed.
* Add counts to 'i' and 'ri' rows
* Initial federated query planner implementation
* code formatting
* Fixed issues with FederatedQueryIterable
* Fix test failures
* Fix failing tests
* Additional test fixes
* pr feedback
* Use new MetadataHelper function version
* Extract fields to filter index holes
* Correct logic for determining sub date ranges
* Remove unnecessary check
* code formatting
* Add check for null query model
* Limit config arg to function scope
* Update metadata-utils submodule commit
* code formatting
* Fix failing tests
* Additional test fixes
* Ensure all original tests pass
* Add federated planner tests and chained schedulers
* pr feedback
* metadata-utils 3.0.3 tag
* Fixed the index hole data ingest to set appropriate time stamps on the keys. Removed some of the code which I believe was trying to diagnose the test issues
* Updated applyModel to use the passed in script
* Remove unneeded changes
* Make FederatedQueryPlanner the default
* Restore original log4j.properties
* code formatting
* Fix QueryPlanTest
* Updated to test with teardown
* Test debugging edits
* Updated formatting
* Concatenate sub-plans
* Make FederatedQueryPlanner implement Cloneable
* code formatting
* Updated with metadata-utils 4.0.5 (index markers and avoid non-indexed fields for holes)
* Fixed test cases with correct responses and periodic failing test cases
* Updated AncestorQueryLogic to handle federated query planner
* Allow subclasses of ShardQueryConfiguration
* Updated to throw a NoResultsException for an empty query.
* import reorg
* Updated to avoid expanding unfielded if disabled, and to assume no index holes if no query fields.
* Add tests for default query planner with ne and not-eq
* Revert changes to test data format
* Revert changes to log4j.properties
* Ensure query plan updated after any exception type
* Revert all changes to test data format

Co-authored-by: Ivan Bella
Co-authored-by: hgklohr

cawaring added the enhancement New feature or request label May 28, 2020

ivakegg mentioned this issue Oct 9, 2020

WIP Validate lineage for all JexlNode rebuilding visitors #925

Closed

jzgithub1 mentioned this issue Oct 22, 2020

WIP - Datawave #825 - Generate the IndexHole objects for the DefaultQueryPlanner to use to process indexed and unindexed ranges. #971

Closed

ivakegg assigned jzgithub1 Dec 2, 2020

lbschanno unassigned jzgithub1 May 31, 2022

lbschanno self-assigned this Aug 22, 2023

lbschanno mentioned this issue Sep 19, 2023

Enrich 'i' and 'ri' rows in metadata table with event date #2094

Merged

lbschanno linked a pull request Sep 19, 2023 that will close this issue

Enrich 'i' and 'ri' rows in metadata table with event date #2094

Merged

lbschanno mentioned this issue Jan 12, 2024

Add FederatedQueryPlanner #2216

Merged

ivakegg closed this as completed in #2094 Feb 13, 2024

lbschanno reopened this Feb 13, 2024

lbschanno linked a pull request Jun 11, 2024 that will close this issue

Add aggregator for frequency metadata rows NationalSecurityAgency/datawave-metadata-utils#32

Open

hgklohr closed this as completed in #2216 Sep 9, 2024
