Handle the cases where a field is both indexed and not indexed within a time range #825
Comments
Suggest marking the time ranges in the datawave metadata table, and then using that info in the DefaultQueryPlanner appropriately. The list of indexed fields sent to the QueryIterator will need to be adjusted for each range as appropriate.
The Data Dictionary should also show this information.
FYI, the PushdownMissingIndexRangeNodesVisitor is sort of an initial implementation of this capability. However, instead of pushing those nodes down, perhaps the range stream index lookup can return a stream of shards for the holes, and also modify the query for those ranges to push the terms down to the evaluation phase. The comment I put into another pull request may help one understand this: #925 (comment).
BTW, having the ingest automatically fill in the metadata for when a field is indexed would be great as well. This should probably be encoded in the same way we are condensing the F column here: #828. For existing systems we would need to manually update the metadata for data already ingested (a tool would be nice).
This is done with the new IndexColumnIterator class.
Already implemented in the pull request for this ticket.
The code inside processTree in the DefaultQueryPlanner looks like it should be able to take care of index holes. The problem is that the index holes are never set inside the ShardQueryConfiguration object prior to this line on line 1082: I am going to investigate your idea more thoroughly, though, since this is your area of expertise. I will engage @lbschanno about this issue.
As I believe you have discovered, the existing "index hole" mechanism is value based (i.e. values are missing in the index from 'a' to 'b' for a specified date range). This ticket will result in a different kind of index hole which is field based, meaning that a field will be missing all entries in the index for a specified date range or date ranges. I suggest we use the terminology of ValueIndexHole and FieldIndexHole. Once we have the FieldIndexHoles encoded in the metadata table and provide a mechanism to get those out, we should put those in the ShardQueryConfiguration alongside the ValueIndexHoles.

Currently we have an IndexHoleMarkerJexlNode that is used to mark fields in the query that need to be delayed until evaluation time. When this is added to the query plan, it denotes that we cannot use that term for index lookup across the entire query date range. This avoids missing any results, but it also prevents us from using the index for those portions of the date range where it might still be available to us.

It would be desirable for both ValueIndexHoles and FieldIndexHoles that we create separate plans for portions of the query range to be able to take full advantage of the index where possible. I have two possibilities that come to mind on how to achieve this. We should evaluate these and any other ideas before continuing with the implementation.
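To make the idea of separate plans per portion of the query range concrete, here is a minimal sketch of splitting a query date range around the holes known for one field. It is not the DataWave implementation: `SubRangeSplitter`, `DateRange`, `SubRange`, and `split` are hypothetical names introduced for illustration, dates are treated as inclusive whole days, and the records assume Java 16 or later.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SubRangeSplitter {

    /** Inclusive date range. Hypothetical type, for illustration only. */
    public record DateRange(LocalDate start, LocalDate end) {}

    /** A slice of the query range, flagged with whether the field's index is usable there. */
    public record SubRange(DateRange range, boolean indexUsable) {}

    /**
     * Splits the query range into consecutive sub-ranges. Slices covered by a
     * hole are flagged indexUsable=false so the term can be delayed until
     * evaluation; the remaining slices can still use the index.
     */
    public static List<SubRange> split(DateRange query, List<DateRange> holes) {
        List<DateRange> sorted = new ArrayList<>(holes);
        sorted.sort(Comparator.comparing(DateRange::start));

        List<SubRange> result = new ArrayList<>();
        LocalDate cursor = query.start();
        for (DateRange hole : sorted) {
            if (cursor.isAfter(query.end())) {
                break; // the whole query range is already covered
            }
            // Ignore holes that do not overlap the part of the range still to be covered.
            if (hole.end().isBefore(cursor) || hole.start().isAfter(query.end())) {
                continue;
            }
            // Clip the hole to the uncovered part of the query range.
            LocalDate holeStart = hole.start().isBefore(cursor) ? cursor : hole.start();
            LocalDate holeEnd = hole.end().isAfter(query.end()) ? query.end() : hole.end();
            if (cursor.isBefore(holeStart)) {
                result.add(new SubRange(new DateRange(cursor, holeStart.minusDays(1)), true));
            }
            result.add(new SubRange(new DateRange(holeStart, holeEnd), false));
            cursor = holeEnd.plusDays(1);
        }
        if (!cursor.isAfter(query.end())) {
            result.add(new SubRange(new DateRange(cursor, query.end()), true));
        }
        return result;
    }
}
```

Under these assumptions, a planner could build one plan per `SubRange`, pushing the affected terms down to evaluation only where `indexUsable` is false, and then chain the resulting plans.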
Modify the generation of 'i' (indexed rows) and 'ri' (reverse indexed rows) in the metadata table such that the column qualifier contains the event date. This is required as a first step to support efforts for issue #825 so that we can identify dates when an event was ingested and included in a frequency count for an associated 'f' row, but was not indexed.
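The reasoning above (an event counted in an 'f' row for a date but with no 'i' entry for that date) can be sketched as a comparison of two date sets. This is only an illustration under stated assumptions: `FieldIndexHoleFinder`, `Hole`, and `findHoles` are hypothetical names, the two sets are assumed to have already been read from the metadata table for one field and datatype, and a hole is taken to be any run of frequency dates not interrupted by an indexed date. The records assume Java 16 or later.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;

public class FieldIndexHoleFinder {

    /** Inclusive range of dates on which the field had data but no index entries. Hypothetical type. */
    public record Hole(LocalDate start, LocalDate end) {}

    /**
     * @param frequencyDates dates that have an 'f' entry for the field, in ascending order
     * @param indexedDates   dates that have an 'i' entry for the field
     */
    public static List<Hole> findHoles(SortedSet<LocalDate> frequencyDates, Set<LocalDate> indexedDates) {
        List<Hole> holes = new ArrayList<>();
        LocalDate start = null; // first date of the hole currently being built, if any
        LocalDate last = null;  // most recent date added to that hole
        for (LocalDate date : frequencyDates) {
            if (indexedDates.contains(date)) {
                // An indexed date closes any open hole.
                if (start != null) {
                    holes.add(new Hole(start, last));
                    start = null;
                }
            } else {
                // Data was ingested on this date but the field was not indexed.
                if (start == null) {
                    start = date;
                }
                last = date;
            }
        }
        if (start != null) {
            holes.add(new Hole(start, last));
        }
        return holes;
    }
}
```

For example, with frequency entries for January 1 through 5 and index entries only for January 1, 2, and 5, this sketch reports a single hole covering January 3 and 4.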
Created PR datawave-metadata-utils/pull/29 to add functionality for retrieving field index holes from the metadata table.
* Enrich 'i' and 'ri' rows in metadata table with event date. Modify the generation of 'i' (indexed rows) and 'ri' (reverse indexed rows) in the metadata table such that the column qualifier contains the event date. This is required as a first step to support efforts for issue #825 so that we can identify dates when an event was ingested and included in a frequency count for an associated 'f' row, but was not indexed.
* Add counts to 'i' and 'ri' rows
* Initial federated query planner implementation
* code formatting
* Fixed issues with FederatedQueryIterable
* Fix test failures
* Fix failing tests
* Additional test fixes
* pr feedback
* Use new MetadataHelper function version
* Extract fields to filter index holes
* Correct logic for determining sub date ranges
* Remove unnecessary check
* code formatting
* Add check for null query model
* Limit config arg to function scope
* Update metadata-utils submodule commit
* code formatting
* Fix failing tests
* Additional test fixes
* Ensure all original tests pass
* Add federated planner tests and chained schedulers
* pr feedback
* metadata-utils 3.0.3 tag
* Fixed the index hole data ingest to set appropriate time stamps on the keys. Removed some of the code which I believe was trying to diagnose the test issues
* Updated applyModel to use the passed in script
* Remove unneeded changes
* Make FederatedQueryPlanner the default
* Restore original log4j.properties
* code formatting
* Fix QueryPlanTest
* Updated to test with teardown
* Test debugging edits
* Updated formatting
* Concatenate sub-plans
* Make FederatedQueryPlanner implement Cloneable
* code formatting
* Updated with metadata-utils 4.0.5 (index markers and avoid non-indexed fields for holes)
* Fixed test cases with correct responses and periodic failing test cases
* Updated AncestorQueryLogic to handle federated query planner
* Allow subclasses of ShardQueryConfiguration
* Updated to throw a NoResultsException for an empty query.
* import reorg
* Updated to avoid expanding unfielded if disabled, and to assume no index holes if no query fields.
* Add tests for default query planner with ne and not-eq
* Revert changes to test data format
* Revert changes to log4j.properties
* Ensure query plan updated after any exception type
* Revert all changes to test data format
---------
Co-authored-by: Ivan Bella <[email protected]>
Co-authored-by: hgklohr <[email protected]>
Fields can be added to or removed from the index over time. We do not always handle cases where a query spans a time range in which a field is indexed for part of the range and not indexed for the rest. We need to handle this case appropriately. This would include using the field as an indexed field for the portion of time when it is indexed, and avoiding its use to prune queries and ranges for the portion when it is not indexed.