Handle the cases where a field is both indexed and not indexed within a time range #825

Closed
cawaring opened this issue May 28, 2020 · 9 comments · Fixed by #2094 or #2216 · May be fixed by NationalSecurityAgency/datawave-metadata-utils#32
Assignees
Labels
enhancement New feature or request

Comments

@cawaring
Collaborator

Fields can be added to or removed from the index over time. We do not always handle cases where a field spans a time range in which it is both indexed and not indexed. We need to handle this case appropriately: use the field as an indexed field for the portion of time in which it is indexed, and avoid using it to prune queries and ranges when it is not indexed.

@cawaring cawaring added the enhancement New feature or request label May 28, 2020
@ivakegg
Collaborator

ivakegg commented Jul 31, 2020

Suggest marking the time ranges in the datawave metadata table, and then using that info in the DefaultQueryPlanner appropriately. The list of indexed fields sent to the QueryIterator will need to be adjusted for each range as appropriate.
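
A minimal sketch of the per-range adjustment, assuming the metadata lookup can produce, for each field, the inclusive date spans in which it was indexed. `DateSpan` and `IndexedFieldResolver` are hypothetical names used only for illustration; they are not DataWave classes, and the real metadata encoding may differ.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper type (not a DataWave class): an inclusive span of days.
final class DateSpan {
    final LocalDate start; // inclusive
    final LocalDate end;   // inclusive

    DateSpan(LocalDate start, LocalDate end) {
        if (end.isBefore(start)) {
            throw new IllegalArgumentException("end before start");
        }
        this.start = start;
        this.end = end;
    }

    // True when this span fully contains the other span.
    boolean covers(DateSpan other) {
        return !other.start.isBefore(start) && !other.end.isAfter(end);
    }
}

// Hypothetical resolver (not a DataWave class).
final class IndexedFieldResolver {
    // Returns the fields that are indexed for every day of the given range.
    // Assumes each field's spans are already merged, so a range that is fully
    // indexed is covered by a single span.
    static Set<String> indexedFieldsFor(Map<String, List<DateSpan>> indexedSpans, DateSpan range) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, List<DateSpan>> entry : indexedSpans.entrySet()) {
            for (DateSpan span : entry.getValue()) {
                if (span.covers(range)) {
                    result.add(entry.getKey());
                    break;
                }
            }
        }
        return result;
    }
}
```

A field that is only partly indexed within the range is left out of the result, so it would be treated as non-indexed for that range rather than used to prune it.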

@ivakegg
Collaborator

ivakegg commented Jul 31, 2020

The Data Dictionary should also show this information.

@ivakegg
Collaborator

ivakegg commented Oct 9, 2020

FYI, the PushdownMissingIndexRangeNodesVisitor is sort of an initial implementation of this capability. However instead of pushing those nodes down, perhaps range stream index lookup can return a stream of shards for the holes, and also modify the query for those ranges to push the terms down to the evaluation phase. The comment I put into another pull request may help one understand this: #925 (comment).

@ivakegg
Collaborator

ivakegg commented Oct 9, 2020

BTW having the ingest automatically fill in the metadata for when a field is indexed would be great as well. This should probably be encoded in the same way we are condensing the F column here: #828. For existing systems we would need to manually update (a tool would be nice) the metadata for data already ingested.

@jzgithub1
Contributor

> BTW having the ingest automatically fill in the metadata for when a field is indexed would be great as well. This should probably be encoded in the same way we are condensing the F column here: #828. For existing systems we would need to manually update (a tool would be nice) the metadata for data already ingested.

This is done with the new IndexColumnIterator class.

@jzgithub1
Contributor

jzgithub1 commented Nov 3, 2020

> BTW having the ingest automatically fill in the metadata for when a field is indexed would be great as well. This should probably be encoded in the same way we are condensing the F column here: #828. For existing systems we would need to manually update (a tool would be nice) the metadata for data already ingested.

Already implemented in the pull request for this ticket.

@jzgithub1
Contributor

> FYI, the PushdownMissingIndexRangeNodesVisitor is sort of an initial implementation of this capability. However instead of pushing those nodes down, perhaps range stream index lookup can return a stream of shards for the holes, and also modify the query for those ranges to push the terms down to the evaluation phase. The comment I put into another pull request may help one understand this: #925 (comment).

The code inside processTree in the DefaultQueryPlanner looks like it should be able to take care of index holes. The problem is that the index holes are never set in the ShardQueryConfiguration object before this call at line 1082:
queryTree = PushdownMissingIndexRangeNodesVisitor.pushdownPredicates(queryTree, config, metadataHelper);

I am going to investigate your idea more thoroughly though since this is your area of expertise. I will engage @lbschanno about this issue.

@ivakegg
Collaborator

ivakegg commented Nov 4, 2020

As I believe you have discovered, the existing "index hole" mechanism is value based (i.e. values are missing in the index from 'a' to 'b' for a specified date range). This ticket will result in a different kind of index hole which is field based, meaning that a field will be missing all entries in the index for a specified date range or date ranges. I suggest we use the terminology of ValueIndexHole and FieldIndexHole. Once we have the FieldIndexHoles encoded in the metadata table and provide a mechanism to get those out, we should put those in the ShardQueryConfiguration alongside the ValueIndexHoles.
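
To make the distinction concrete, here is a rough sketch of the two kinds of hole. The class shapes, field names, and the `affects` methods are assumptions for illustration only, not the actual DataWave types; in particular, treating the value bounds and date bounds as inclusive is a guess.

```java
import java.time.LocalDate;

// Illustrative only: a value-based hole. Index entries for values between
// startValue and endValue (assumed inclusive) are missing for the date range.
final class ValueIndexHole {
    final String field;
    final String startValue;
    final String endValue;
    final LocalDate startDate; // inclusive
    final LocalDate endDate;   // inclusive

    ValueIndexHole(String field, String startValue, String endValue,
                   LocalDate startDate, LocalDate endDate) {
        this.field = field;
        this.startValue = startValue;
        this.endValue = endValue;
        this.startDate = startDate;
        this.endDate = endDate;
    }

    // True when an index lookup of field == value on the given day would hit the hole.
    boolean affects(String queryField, String value, LocalDate day) {
        return field.equals(queryField)
                && value.compareTo(startValue) >= 0
                && value.compareTo(endValue) <= 0
                && !day.isBefore(startDate)
                && !day.isAfter(endDate);
    }
}

// Illustrative only: a field-based hole. Every index entry for the field is
// missing for the date range, whatever the value.
final class FieldIndexHole {
    final String field;
    final LocalDate startDate; // inclusive
    final LocalDate endDate;   // inclusive

    FieldIndexHole(String field, LocalDate startDate, LocalDate endDate) {
        this.field = field;
        this.startDate = startDate;
        this.endDate = endDate;
    }

    // True when any index lookup on the field for the given day would hit the hole.
    boolean affects(String queryField, LocalDate day) {
        return field.equals(queryField)
                && !day.isBefore(startDate)
                && !day.isAfter(endDate);
    }
}
```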

Currently we have an IndexHoleMarkerJexlNode that is used to mark fields in the query that need to be delayed until evaluation time. When this is added to the query plan, it denotes that we cannot use that term for index lookup across the entire query date range. This avoids missing any results, but it also prevents us from using the index for those portions of the date range where it is still available. It would be desirable for both ValueIndexHoles and FieldIndexHoles that we create separate plans for portions of the query range so that we can take full advantage of the index where possible. I have a few possibilities that come to mind on how to achieve this. We should evaluate these and any other ideas before continuing with the implementation.

  1. We could create a query planner that uses a separate DefaultQueryPlanner instance per portion of the query date range which is consistent in terms of index holes. This would result in multiple range streams which could be funneled into one range stream. This way the existing DefaultQueryPlanner would not really change except for adding a visitor (or extending the existing index hole visitor) to mark fields that have holes for the specified portion of the query date range.
  2. We could modify the global index lookups to return exhaustive sets of shard ranges for those portions of the index that are not available for the field being queried. So the index lookup would return the UIDs (document ranges) and shards (fi ranges) when scanning the index and when it hits the date range that is deemed unavailable/missing, it would simply generate artificial shards (fi ranges). The same thing would need to be done for the field index scans, including the ivarators (DatawaveFieldIndexCachingIterator…).
  3. The existing DefaultQueryPlanner would generate multiple plans concurrently for each consistent portion of the date range relative to index holes. This is basically like option 1, but the management of the plans would be pushed into the existing DefaultQueryPlanner.
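
Options 1 and 3 both depend on cutting the query date range into portions that are consistent in terms of index holes. A minimal sketch of that split, assuming day granularity and inclusive bounds; `FieldHole`, `SubRange`, and `DateRangeSplitter` are hypothetical names and not part of DataWave:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical: a field missing from the index for an inclusive span of days.
final class FieldHole {
    final String field;
    final LocalDate start;
    final LocalDate end;

    FieldHole(String field, LocalDate start, LocalDate end) {
        this.field = field;
        this.start = start;
        this.end = end;
    }
}

// Hypothetical: one portion of the query range plus the fields with holes in it.
final class SubRange {
    final LocalDate start;
    final LocalDate end;
    final Set<String> fieldsWithHoles;

    SubRange(LocalDate start, LocalDate end, Set<String> fieldsWithHoles) {
        this.start = start;
        this.end = end;
        this.fieldsWithHoles = fieldsWithHoles;
    }
}

final class DateRangeSplitter {
    // Cuts [queryStart, queryEnd] wherever a hole begins or ends, so the set of
    // fields with holes is constant inside each returned sub-range.
    static List<SubRange> split(LocalDate queryStart, LocalDate queryEnd, List<FieldHole> holes) {
        TreeSet<LocalDate> cuts = new TreeSet<>();
        cuts.add(queryStart);
        for (FieldHole hole : holes) {
            if (hole.end.isBefore(queryStart) || hole.start.isAfter(queryEnd)) {
                continue; // hole does not intersect the query range
            }
            if (hole.start.isAfter(queryStart)) {
                cuts.add(hole.start);
            }
            LocalDate dayAfter = hole.end.plusDays(1);
            if (!dayAfter.isAfter(queryEnd)) {
                cuts.add(dayAfter);
            }
        }
        List<LocalDate> starts = new ArrayList<>(cuts);
        List<SubRange> result = new ArrayList<>();
        for (int i = 0; i < starts.size(); i++) {
            LocalDate start = starts.get(i);
            LocalDate end = (i + 1 < starts.size()) ? starts.get(i + 1).minusDays(1) : queryEnd;
            // The hole set cannot change between cuts, so testing the first day is enough.
            Set<String> missing = new TreeSet<>();
            for (FieldHole hole : holes) {
                if (!hole.start.isAfter(start) && !hole.end.isBefore(start)) {
                    missing.add(hole.field);
                }
            }
            result.add(new SubRange(start, end, missing));
        }
        return result;
    }
}
```

Each sub-range could then be planned on its own, with the fields in `fieldsWithHoles` treated as non-indexed for that portion only. Adjacent sub-ranges that end up with the same hole set (for example, when two holes for one field abut) are not merged in this sketch.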

@lbschanno lbschanno self-assigned this Aug 22, 2023
lbschanno added a commit that referenced this issue Sep 19, 2023
Modify the generation of 'i' (indexed rows) and 'ri' (reverse indexed
rows) in the metadata table such that the column qualifier contains the
event date. This is required as a first step to support efforts for
issue #825 so that we can identify dates when an event was ingested and
included in a frequency count for an associated 'f' row, but was not
indexed.
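
As a rough illustration of why the event date matters: once both the 'f' and 'i' rows carry a date, field index holes can be derived by comparing the two sets of dates. The sketch below assumes the dates have already been read from the metadata table into plain sets; `HoleRun` and `FieldIndexHoleFinder` are hypothetical names, and treating only calendar-adjacent days as one hole is a simplifying assumption.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical: a run of consecutive days on which a field was ingested but not indexed.
final class HoleRun {
    final LocalDate start; // inclusive
    LocalDate end;         // inclusive, extended while the run continues

    HoleRun(LocalDate start, LocalDate end) {
        this.start = start;
        this.end = end;
    }
}

final class FieldIndexHoleFinder {
    // frequencyDates: days that have an 'f' entry for the field.
    // indexedDates:   days that have an 'i' entry for the field.
    // Returns runs of days present in frequencyDates but absent from indexedDates.
    static List<HoleRun> findHoles(Set<LocalDate> frequencyDates, Set<LocalDate> indexedDates) {
        List<HoleRun> holes = new ArrayList<>();
        HoleRun current = null;
        for (LocalDate day : new TreeSet<>(frequencyDates)) { // ascending order
            if (indexedDates.contains(day)) {
                current = null; // an indexed day closes any open run
                continue;
            }
            if (current != null && current.end.plusDays(1).equals(day)) {
                current.end = day; // extend the open run by one day
            } else {
                current = new HoleRun(day, day);
                holes.add(current);
            }
        }
        return holes;
    }
}
```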
@lbschanno lbschanno linked a pull request Sep 19, 2023 that will close this issue
@lbschanno
Collaborator

Created PR datawave-metadata-utils/pull/29 to add functionality for retrieving field index holes from the metadata table.

ivakegg pushed a commit that referenced this issue Feb 13, 2024
* Enrich 'i' and 'ri' rows in metadata table with event date

Modify the generation of 'i' (indexed rows) and 'ri' (reverse indexed
rows) in the metadata table such that the column qualifier contains the
event date. This is required as a first step to support efforts for
issue #825 so that we can identify dates when an event was ingested and
included in a frequency count for an associated 'f' row, but was not
indexed.

* Add counts to 'i' and 'ri' rows
@lbschanno lbschanno reopened this Feb 13, 2024
rdhayes68 pushed a commit that referenced this issue Feb 21, 2024
* Enrich 'i' and 'ri' rows in metadata table with event date

Modify the generation of 'i' (indexed rows) and 'ri' (reverse indexed
rows) in the metadata table such that the column qualifier contains the
event date. This is required as a first step to support efforts for
issue #825 so that we can identify dates when an event was ingested and
included in a frequency count for an associated 'f' row, but was not
indexed.

* Add counts to 'i' and 'ri' rows
hgklohr added a commit that referenced this issue Sep 9, 2024
* Enrich 'i' and 'ri' rows in metadata table with event date

Modify the generation of 'i' (indexed rows) and 'ri' (reverse indexed
rows) in the metadata table such that the column qualifier contains the
event date. This is required as a first step to support efforts for
issue #825 so that we can identify dates when an event was ingested and
included in a frequency count for an associated 'f' row, but was not
indexed.

* Add counts to 'i' and 'ri' rows

* Initial federated query planner implementation

* code formatting

* Fixed issues with FederatedQueryIterable

* Fix test failures

* Fix failing tests

* Additional test fixes

* pr feedback

* Use new MetadataHelper function version

* Extract fields to filter index holes

* Correct logic for determining sub date ranges

* Remove unnecessary check

* code formatting

* Add check for null query model

* Limit config arg to function scope

* Update metadata-utils submodule commit

* code formatting

* Fix failing tests

* Additional test fixes

* Ensure all original tests pass

* Add federated planner tests and chained schedulers

* pr feedback

* metadata-utils 3.0.3 tag

* Fixed the index hole data ingest to set appropriate time stamps on the keys
Removed some of the code which I believe was trying to diagnose the test issues

* Updated applyModel to use the passed in script

* Remove unneeded changes

* Make FederatedQueryPlanner the default

* Restore original log4j.properties

* code formatting

* Fix QueryPlanTest

* Updated to test with teardown

* Test debugging edits

* Updated formatting

* Concatenate sub-plans

* Make FederatedQueryPlanner implement Cloneable

* code formatting

* * Updated with metadata-utils 4.0.5 (index markers and avoid non-indexed fields for holes)
* Fixed test cases with correct responses and periodic failing test cases
* Updated AncestorQueryLogic to handle federate query planner

* * Allow subclasses of ShardQueryConfiguration

* Updated to throw a NoResultsException for an empty query.

* import reorg

* Updated to avoid expanding unfielded if disabled, and to assume no index holes if no query fields.

* Add tests for default query planner with ne and not-eq

* Revert changes to test data format

* Revert changes to log4j.properties

* Ensure query plan updated after any exception type

* Revert all changes to test data format

---------

Co-authored-by: Ivan Bella <[email protected]>
Co-authored-by: hgklohr <[email protected]>