Skip to content

Secondary Indexing

Eric Robertson edited this page Jul 27, 2015 · 24 revisions

Preliminary

Examples in this document refer to a feature type with the following attributes

ATTRIBUTE NAME

ATTRIBUTE TYPE

LOCATION

Geometry

PERSONS

Numeric (Long)

RECORD_DATE

Date

INCOME_CATEGORY

String (specific set of values)

AFFILIATION

String (mixed text)

ID

String

Problem Description

Find the optimal set of one ore more indices to answer a query. A query requests a data store to find features matching the query’s search criteria. A feature is a single instance of data (e.g record, entity, etc.).

Query Syntax

ECQL http://docs.geoserver.org/stable/en/user/filter/ecql_reference.html

Query Example

INTERSECTS(LOCATION, "Polygon-109.0448 37.0004, -102.0424 36.9949, -102.0534 41.0006, -109.0489 40.9996, -109.0448 37.0004") AND
RECORD_DATE DURING 2006-11-30T00:30:00Z/2006-12-31T00:30:00Z AND
PERSONS BETWEEN 1000000 AND 3000000 AND
INCOME_CATEGORY in ('10-15', '15-22')

Example Feature

(ID='xc456',
LOCATION=POINT-109.0448 37.0004,
PERSONS=39474,
INCOME_CATEGORY='10-15',
RECORD_DATE=2006-12-12T13:04:11Z,
AFFILIATION='AFLCIO')

Example Primary Index Identifier (ROW ID)

'0178_xc456'

Note
The primary index identifiers contain associated feature identifiers, as demonstrated in the example 'xc456'.

Possible Indices

  1. Primary Geospatial Index on LOCATION

  2. Primary Geospatial-Temporal Index on LOCATION and RECORD_DATE

  3. Secondary Time or Numeric Index on RECORD_DATE

  4. Secondary Text Index on INCOME_CATEGORY

  5. Secondary Numeric Index on PERSONS

Type of Secondary Indices

  1. Text

    1. Optimized for Regular Expression

    2. Optimized for equality

  2. Numeric

    1. Range or Single Value

    2. Numeric, Floating Point, Big Integer/Decimal?

  3. Date/Times

    1. Treat as Numeric (GMT milliseconds from epoch) or Alternate Representation (e.g. YYYYMMDDHHMISSMMMMM)

    2. Range or Single Value

  4. Geometry (Alternate from primary)

Role of the Primary Index

The primary index maintains a copy of each feature. A primary index may have more than copy of feature. Each copy is given its own unique index identifier (ROW ID). A feature may be stored in multiple primary indices.

For GeoWave, the primary index is range based index (e.g. bounding rectangle, bounding time).

Compound Indices

The generalization of an index should include the ability to create compound keys. Geo-temporal is an example. There may be some limitations. Numeric indices can be managed using techniques for primary indices (ie. Space Filling Curve). Text may be limited to equality. The design should not impede further future development.

Numeric ranges derived from two attributes (e.g. start and end) are considered a type of compound key for this discussion. This document does not derive a distinction from a single attribute representative of a range, such as start+end or start+duration.

The values associated with each compound key should be preserved with the key in order apply finer grained filtering over the secondary index. Fine grain filtering minimizes the amount false positives answered through a secondary index search.

Persistence Design

The design in this section assumes Accumulo as a data store.

A secondary index is a key derived from an attribute value (or attributes in a compound scenario) associated with a primary index identifier. For example, attribute PERSONS with value 29383 is associated with a list of primary index identifiers of those features that have a PERSONS attribute value matching exactly 29383 for a specific primary index.

Table Design

A each primary index is a separate table. A primary index table is named according to name space and index type. For example, geospatial and geo-temporal indices are separate tables.

Does each type of secondary index reside in its own table?

Yes. The approach is consistent with primary index separation. It can be potentially easier manage multiple tables from the perspective of writes and backups.

The alternative is to maintain a single table using the column family with locality configured to isolate each type of secondary index. This approach reduces the growth of secondary index tables, one per each type of secondary index. It is unclear if there are any performance gains over using locality groups. Locality groups organize column families in separate files.

What is the basic structure of a table?

ROW ID

CF

CQ

VIS

VALUE

Key

Index ID

Primary Index Type ID

Attribute Visibility

Primary Index Identifier(ROW ID)

Key

Index ID

Attribute Name

Attribute Visibility

Attribute Value

The table design reflects a separate row for each Primary Index identifier to attribute value association. This permits rapid ingest. Storing a set of index identifiers in a column value would require encoding and decoding the set for each addition/subtraction.

The Storage of the attribute value is only required if the ROW_ID is not same as the attribute value. For example, an ID derived from a Hilbert curve reflects the any Geometry within a cell.

The example below demonstrates three features indexed on INCOME_CATEGORY in ('10-15') for an index called ICAT

ROW ID

CF

CQ

VIS

VALUE

10-15

ICAT

SPATIAL_VECTOR

-

0492_xc284

10-15

ICAT

SPATIAL_VECTOR

-

0912_ks942

10-15

ICAT

SPATIAL_VECTOR

-

0894_kl038

For simplicity, when the ROW_ID reflects the a specific name and type of attribute, the Index ID may reflect both name and type (e.g. INCOME_CATEGORY_STRING).

The structure requires versioning be disabled since the ROW_ID is not unique between the different features.

How are updates and deletions manager?

In Accumulo, updates and deletions are managed using the versioning iterator. For updates, versioning iterating selects each unique row ID with the latest time stamp. For deletions, a deletion flag is marked on a specific ROW ID and timestamp. All rows with the same ROW ID and older timestamp are deleted and eventually removed in a compaction.

Appending some unique identifier to the ROW ID enables deletion and modification. This assumes either the attribute value has a uniform size or range scans can be polluted with false positives requiring additional filtering.

There are several options here to explore:
Forcing uniformity of size for the secondary index.

Research into text indexing may find this is indeed possible and preferable.

Place primary index identifier in the column qualifier.

Since the column family is redundant in the multi-table approach, move the primary index type ID to column family and the primary index identifier into the column qualifier. The column value is attribute value (values for compound key) used to derive the key. Updates and deletions involve removing the column. Accumulo does support wide-rows permitting such an approach.

ROW ID

CF

CQ

VIS

VALUE

Key

Primary Index Type ID

Primary Index identifier(ROW ID)

Attribute Visibility

Attribute Value(s)

Do not adjust index for updates and deletions.

This results in false positives filtered during query processing. Consider that, at present, with the exception of count, updates are not considered separately from inserts in statistics. Treating updates as inserts in the secondary index yields false positives filtered during query processing.

Off-line processing of updates and deletions.

A separate special 'visibility' can be used to flag attribute values associated with one or more features that have changed due to updates or deletions. Periodically, the associated row ids can be reprocessed in a batch job. Reprocessing involves inserting a delete flag for the ROW ID (attribute value) and then reinserting the valid rows(with newer timestamps), omitting invalid rows.

The example below uses 'xxchangedxx' as the flag.

ROW ID

CF

CQ

VIS

VALUE

Key

Index ID

Primary Index Type ID

'xxchangedxx'

'-'

Which primary index ROW ID is referenced in a secondary index for features indexed in more than one primary index?

Consistent with approaches used for primary index and statistic selection, as communicated by the data adapter, the adapter provides a default option and permits customization. The default option uses the index with the least number if dimensions (e.g. geo-spatial over geo-temporal). In cases where the decision requires a change or there is a conflict (e.g. two primary index with same number of dimensions), the adapter provides additional guidance.

Should secondary indices reference primary index identifiers from all primary indices?

In this solution, only one ID is required to look up a feature. For the purposes of extracting a feature associated with a secondary index, the extra IDs are redundant, requiring more space and logic.

The chosen approach precludes intersecting joins of look-ups between secondary and primary indices. Consider the example data stored in both geospatial and geotemporal indices. Suppose, to efficiently answer the query, the system chooses a geotemporal primary index (for LOCATION and RECORD_DATE) and a secondary index on PERSONS. If the secondary index only maintains primary index identifiers for the geospatial index, then intersecting join between index inquiries cannot be processed, as the identifier does not match between the two indices.

To support primary and secondary index joins, each primary identifier along with the primary index type must be maintained with the secondary index. The answer to the next question provides additional details.

A primary index may have more than one identifier per feature. Does the secondary index reference all index identifiers?

The secondary index references only one selected index identifier. The index references the first identifier (ROW ID), lexicographically. As part of a robust solution for intersecting joins when inquiring multiple secondary indices, the feature ID is extracted from the index identifier.

This approach precludes intersecting joins between primary indices and secondary indices. A Cost Based Optimizer (CBO) fulfills a query criteria choosing between a single secondary index, intersecting joins between a set of secondary index, or a primary index. To reason the benefit/complexity analysis of this decision, consider each action separately.

In a primary index scan, the query is decomposed into a set of ranges. Distributed filtering is applied to each discovered row to eliminate features that do not meet the query’s criteria. Using the example query in a geospatial primary index, the LOCATION attribute is decomposed into a set of ROW ranges, provided to the index table scanner. The distributed filter (iterator) further eliminates rows that fail to meet the query’s criteria (e.g. INCOME_CATEGORY IN ('10-15', '15-22')).

In a secondary index scan, the CBO may choose any or all of the secondary indices. The final result of ROW IDs is the intersection across the set of ROW IDS from each secondary index inquiry: Index(INCOME_CATEGORY) ∩ Index(PERSONS) ∩ Index(RECORD_DATE)

For example, consider the returned primary index identifiers from each index.

  1. Index(INCOME_CATEGORY) = { 0178_xc456, 0199_xt356, 0394_jj593}

  2. Index(RECORD_DATE) = { 0394_jj593, 0142_yh636, 0199_xt356, 0393_uh383}

  3. Index(PERSONS) = { 0142_yh636,0394_jj593, 0178_xc456, 0384_kf0394,0199_xt356}

The intersection set of identifiers is {0199_xt356, 0394_jj593}. To protect against the possibility that each secondary index may choose a different primary identifier (in the case of duplicates) for a feature, the intersection process keys off the ID extracted from the identifier (e.g. jj593 and xt356).

Suppose decomposition of the LOCATION attribute using the primary index strategy provided ranges (prefixes) 0197 - 0200 and 0201 - 0202. Intersecting with the secondary index, the only valid identifier with is 0199_xt356.

Assuming some statistics can be applied to provide some evidence on the actual number of rows that meet a set of ranges provided through primary index decomposition, a CBO can make a clear choice on using either a primary index or a set of one or more secondary indices to resolve features for a given query.

One can intuit that there is not a clear benefit of intersecting joins between secondary and primary indices. A sufficiently constrained primary index query eliminates a small set of non-matching feature using distributed filtering. Similarly, secondary indices, when used appropriately, sufficiently reduces the number of specific features to inspect based on a query’s constraints. The scenario for using a secondary index set join with a primary index result set occurs when statistics indicate that set a select set of secondary of indices and a primary index are, in isolation, insufficiently constraining. However, attribute level statistics cannot determine a measurable outcome of a intersecting join. Consider statistics indicate, for a given query, secondary index results in 5000 features and a primary index results in 6000 features. Without compound indexing with its own statistics, a CBO cannot determine the overlap between the two sets of 5000 and 6000 features. Rather, the CBO determines the size of final result set is less than or equal to 5000. Engaging the secondary indexing, incurring the associated overhead, cannot guarantee consistent results when the secondary indices do not sufficient eliminate the result set.

Secondary Index Intersection and De-duplication

Statistics

The following statistics are used for cost based appraisal of indices for query processing.

Statistic

Type

Application

Histogram

Numeric

Provides an estimate of the number of features that have an attribute with a specific value

Range

Numeric

Provides evidence of the range of values in an index

Cardinality

Non-Numeric

Provides the number of unique values in an index.

Count MinSketch

Non-Numeric

Provides the number of matching features for a specific value to an attribute. Not recommended for attributes that do not have have a constrained set of values (e.g. descriptions, comments, etc.).

Distribution

Any

Generalization of a histogram (applicable to text, geometries, etc.)

Query Planning

Query planning ascertains, probabilistically, selectivity of an index given query constraints. The selectivity can be in terms of the average number of features. For example, if the cardinality of the attribute for AFFILIATION is one thousand and there are ten thousand features, the planner could assume a uniform distribution, estimating ten features per each unique value AFFILIATION. Worst case, 9001 features may be discovered in the secondary index. Query execution could abandon a secondary index scan once a threshold has been met if the results turn out much worse than the estimate.

Note
Since the cardinality of AFFILIATION is low, a count min-sketch statistic can be used give a more accurate indication for a specific value.

Cost factors and thresholds per index will be part of the discovery process to determine the optimal use of indexing. Cost factors include considerations for overhead when processing a secondary index, as the results are individual primary index identifiers, not a range.

Limitations

This section will include limitations of secondary index, such as effectiveness of LIKE and ILIKE filtering on an indexed text attribute.

Features in Consideration

  1. Log of optimizer output to evaluate and tune.

  2. Benchmarks to evaluate optimizer.

  3. Auto-tune capabilities

  4. predicate pushdown

Text Indexing

To reiterate, the secondary index does not need to answer a query with primary index identifiers that exactly match a specific query criteria. The intent of the secondary index is to answer with a small set of index identifiers to significantly narrow the search space. False positives are permitted. False negative are not permitted.

ILIKE and LIKE expression processing

A n-gram expression can be used to process these type of expressions. ILIKE and LIKE expressions are considerably simpler than Regular Expressions. n-grams include a representation of start and end of expressions using a special character ('_' as an example). As an example, here are the set of 3-grams for the text 'cow jumped over': {'_co', 'cow', 'ow ','w j',' ju', 'jum', 'ump', 'mpe', 'ped', 'ed ', 'd o', ' ov', 'ove', 'ver'}. When search for an expression larger than n, the search expression is divided in n-grams. Each n-gram is used as a look-up to find the set of associated primary index identifiers. The final result is the intersection of all the sets.

Here are some sample expressions with their corresponding 3-gram expressions:

Expression

Description

Sample 3-gram

'%exp'

Text ends with exp

'xp_' & 'exp'

'exp%'

Text starts with exp

'_ex' & 'exp'

'%exp%'

Text contains exp

'exp'

'%expen%'

Text contains expen

'exp' & 'xpe' & 'pen'

The choice of N for N-grams is based on the following factors.

  1. Distribution of data over each unique N-gram.

  2. Expected Query Size.

Suppose N=2 and the index contains many text values containing 'ea'. To process a query for 'beach', rather than search for all 2-grams composed from 'beach', the system picks the 2-gram with the smallest cardinality. If, for example, 'be' occurs infrequently, then searching on 'be' alone is sufficient. If the system can determine that both 'be' and 'ac' reduce the possible resulting set of identifiers, the text index can perform an in-memory set intersection of the results of scans for both 2-grams.

Using a larger value of N reduces the search space by reducing the possibility of any one N-gram representing a disproportionate number of text values. The caveat with a larger value of N is that query criteria with text smaller than N can only be answered with range scans over multiple N-grams. As an example, consider N=3 and the following different search criteria (assuming lower case alphabetic):

  1. '%ea%' - The set of N-grams include 'ea','aea','bea','cea'…​'eaa','eab','eac'…​'ea'.

  2. '%a' - The set of N-grams include 'aa_','ba_'…​'za_'.

  3. 'e%' - A single range over N-grams is '_ea' ending in '_ez'.

Of three cases, only the last one is efficient. In the first two cases, one cannot assume the limited character set. If the character set is UTF-16, then the search space is quite large.

Some index systems choose multiple values of N, choosing the most optimal to answer a specific query. Another common approach is to use a hierarchical approach. Thus, the highest tier could 2-gram, the next tier is 3-gram, etc. Each tier reduces the search space.

The proposal at the moment is to index over a configurable number of N-grams where N starts at a configurable number (e.g. 2) and increases up to a configurable upper bound (e.g. 4). Consider using n-grams of 2 and 3. A text item "hallibut" produces n-grams 'ha', 'al', 'll', 'li', 'ib', 'bu', 'ut', 'hal', 'all', 'lli', 'ibu' and 'but'.

Recall secondary indices efficacy is predicated on significant and measurable reduction of the searchable set of primary index identifiers. Ideally, n-grams whose cardinality exceed some predefined limit can be deemed not useful, cleared and flagged.

In this approach, the insertion time for an index text increases as the top value N increases. Although less n-grams are produced for a item text as N increases ("hallibut" has seven 2-grams and five 3-grams), each level of n must be populated with all possible n-grams.

In the proposal, rather than intersecting the results of all n-grams composed from the filter expression, the n-gram with smallest cardinality is used. If the result is still not small enough, then the n-gram with the next smallest cardinality is applied and those results are intersected with the initial result. This can continue until the cardinality of the next n-gram exceeds some threshold or the result set falls below a some other threshold.

Range Processing

In this section, processing a range over text alphanumeric ordering. As an example, consider finding all surnames from Smith to Tran.

An index using the full text surname as the index ID takes of advantage of simple range scans. It is possible to answer a range constraint with an n-gram approach, as discussed in the prior section. First, a range scan occurs from 'xx' ; '' indicates the start of expression and xx is the (n-1) characters from the n-gram. The results need to be furthered filtered. Using the example, the range scan covers '_sm' to "_tr". Surnames with prefix 'sm' and suffix values prior to 'ith' are filtered through a distributed filter. Surnames with prefix 'tr' and suffix following 'an' are filtered through a distributed filter.