Skip to content

Secondary Indexing

Eric Robertson edited this page Jun 25, 2015 · 24 revisions
Problem Description

Find the optimal set of one ore more indices to answer a query. A query requests a data store to find features matching the query’s search criteria. A feature is a single instance of data (e.g record, entity, etc.).

Query Syntax: ECQL http://docs.geoserver.org/stable/en/user/filter/ecql_reference.html Query Example: INTERSECTS(LOCATION, "Polygon-109.0448 37.0004, -102.0424 36.9949, -102.0534 41.0006, -109.0489 40.9996, -109.0448 37.0004") AND RECORD_DATE DURING 2006-11-30T00:30:00Z/2006-12-31T00:30:00Z AND PERSONS BETWEEN 1000000 AND 3000000 AND INCOME_CATEGORY in ('10-15', '15-22')

Possible Indices

  1. Primary Geospatial Index on LOCATION

  2. Primary Geospatial-Temporal Index on LOCATION and RECORD_DATE

  3. Secondary Time or Numeric Index on RECORD_DATE

  4. Secondary Text Index on INCOME_CATEGORY

  5. Secondary Numeric Index on PERSONS

Type of Secondary Indices
  1. Text

    1. Optimized for Regular Expression

    2. Optimized for equality

  2. Numeric

    1. Range or Single Value

    2. Numeric, Floating Point, Big Integer/Decimal?

  3. Date/Times

    1. Treat as Numeric (GMT milliseconds from epoch) or Alternate Representation (e.g. YYYYMMDDHHMISSMMMMM)

      1. Range or Single Value

  4. Geometry (Alternate from primary)

.Role of the Primary Index

The primary index maintains a copy of each feature. A primary index may have more than copy of feature. Each copy is given its own unique index identifier (ROW ID). A feature may be stored in multiple primary indices.

.Compound Indices

The generalization of an index should include the ability to create compound keys. Geo-temporal is an example. There may be some limitations. Numeric indices can be managed using techniques for primary indices (ie. Space Filling Curve). Text may be limited to equality. The design should not impede further future development.

Numeric ranges derived from two attributes (e.g. start and end) are considered a type of compound key for this discussion. For the moment, this document will not derive a distinction from a single attribute representative of a range, such as start+end or start+duration.

Persistence Design

The design in this section assumes Accumulo.

A secondary index is a key derived from an attribute (or attributes in a compound scenario) associated with a primary index identifier.

.Tables

A each primary index is a separate table. A primary index table is named according to name space and index type. For example, geospatial and geo-temporal indices are separate tables.

Does each type of secondary index reside in its own table?

Yes. The alternative is to maintain a single table using the column family with locality configured to isolate each type of secondary index. Concerns on efficiently manage the D secondary index - how we persist them - (simplest case is what we do with feature id now - basically another table where the row id is the feature id, and the value is the row-id in the main geowave table) - how we index has an impact on the ability to accelerate some conditions (LIKE/ILIKE (case sensitive/insensitive prefix matching). We don’t have to necessarily support every possible condition - just need to know what we are and aren’t supporting.

statistics - information about cardinality (number of unique values), distribution (histogram), ranges (min/max), etc. for each indexed attribute. - how statistics are set, managed, modified, etc. - making sure the whole workflow "works" - also, exposed in geoserver? - Eric has already done a good bit of this

query planning - I think what we want is basically a cost based optimizer - Basically this is going to take the ECQL portion of a query and determine the best use of indexes to satisfy the query - http://docs.geoserver.org/latest/en/user/filter/ecql_reference.html - no joins for us to optimize across, so problem set is much simpler - typical decision might be something like "where carmake = "toyota" and carcolor = "white""  —  What’s the cardinality of carmake - 100 out of 200 (total count) features? That means, on average, a particular value migh only comprise 2 rows. (if we have histogram values we can do better here). So say a cost of 2.  — What’s the cardinality of carcolor = 2 out of 200? Typical value might still cost us 100 rows? Might be a threshold where it’s not worth using an index - have to experimentally determine where that’s at.  — So in this case the optimizer might decide to do an index lookup for carmake, but just do a full scan on the results of that for carcolor. - Typical choices here, since we don’t have to do joins, are basically going to be  — Do I use an index or not  — Which stats are the best to use to answer the question above. - I think we start out with some sort of cost factor, but will have to run some benchmarks to determine what the final cost factors are (could even turn this into an auto-tune system for different installations) - Would be nice to have a log output optionally available (at some log level) for the results of the optimizer.

predicate pushdown - for the temporal and spatial predicates - (before, during, after, DWITHIN, TOUCHES, etc.) - implementing logic to accelerate these with indexes already in place

Clone this wiki locally