Proposal: Metadata Cache #25538
Comments
All of that looks good. Some thoughts on the questions:
I think doing a scan would be acceptable for the size that the metacache will grow to, for now; I will try to think of ways to avoid doing a scan while implementing, but I think supporting pagination will be important to this cache. We could potentially bake the offset into the
Ah, shall we allow for
Makes sense.
If a column gets added to a table after the metadata cache is created, do we have some mechanism to add that new column to the cache?
@praveen-influx good call out - in the original doc outlining the cache, we did not plan to allow updates to existing caches, so in that case, it would need to be deleted and re-added with the desired columns/their order.
We could allow for
The thing I was thinking of would be if they are building some dashboard and are using
Can we have a default value configured for
Default to 24h in that case
I came here to say that you probably want some limits (total cardinality, column cardinality, tree depth, number of columns, etc.), but I see Paul has asked for a total cardinality limit already. If I cache all columns and values and don't set an age, I believe the metadata cache here nearly (re)implements the "index" that influxdb tsm uses for tags and fields, which has been a source of slowdown in the db when the cardinality is (very) high. For tsm, this metadata index is consulted for almost every query (hence the major impact when it is high cardinality). It appears the monolith metadata cache won't be consulted for general queries (as datafusion doesn't need it) and it'll only be consulted when specifically queried
Right, for now we will have the total cardinality limit. Tree depth and number of columns in this case should be the same thing, and are limited at the time the cache is created, since the columns used are specified up front, and I don't think we aim to have newly added columns added to existing caches (they would have to re-create the cache to get a different column set). Having a per-column cardinality is one thing I didn't consider that could also be added at some point.
Right, users need to explicitly query it with
Will creating new I wonder if it's worth exploring concepts like aliasing (if it's not been done already) such that the new cache is created in the background whilst the old cache is still accessible via an alias. Then once the new cache is available we swap the alias to point to the new one in the background. It is not something to be addressed as part of this issue by the way.
Yes. The intention was to not do cache population from object store in OSS, but do so in Pro. This does highlight a bit of a usability issue with the caches. Part of the issue is that users need to write to their tables in order to create them (vs. having the ability to explicitly set up their schema ahead of writing to the database). Since the cache can only be created for an existing table, if the cache is not pre-populating, then it will miss out on the data that was written before the cache was created.
This might require that we assign IDs to caches.
FWIW, we will want to create an API to create a table where the user can specify schema, last caches, meta caches, and for Pro, which columns get put into the file index.
I don't fully follow the answer above about how values will be populated in the cache. Maybe it helps to ask about what the user experience would be. When I make a metadata cache, what values will be present?
The cache is filled by writes. So when you create it, it's empty. When you reboot, it starts empty.
I recommend putting that in the description of the feature as I think it is something users will want to know
For now, I added a Limitations section to the main issue description that mentions this detail.
See related epic: #25539
Context
It is very common for `influxdb` users to want to quickly return the unique values for a given column, or to return the unique values for a given column given the value of some other column. For example, the unique region names, or unique hostnames, or, given the region ‘us-west’, the unique hostnames within that region.
These kinds of lookups are typically used in selectors at the top of dashboards and are frequently accessed. Performing these lookups in tens of milliseconds, rather than hundreds of milliseconds, represents a significant improvement in user experience.
The metadata cache will provide a feature to make such queries fast.
Requirements
`query` interface, i.e., via queries through a FlightSQL connection or through the `/api/v3/query_sql` API, to a user-defined function, similar to the `last_cache`.
Querying the `meta_cache`
Queries to the metadata cache will be fulfilled via a user-defined function called `meta_cache` and invoked like so:
Here, `host` is a column on the `cpu` table. This query will return the distinct/unique values that have been seen for the `host` column in incoming writes to the database.
The entries returned should be sorted in ascending order by default.
If a cache is configured for multiple columns, one could select from both:
Queries to the cache support `LIMIT` and `WHERE` clauses, e.g.,
In the latter example, the cache would need to be configured on both the `region` and `host` columns.
Each cache has a unique name, such that if there are multiple caches configured on a given table, they can be queried by their name:
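As a rough illustration of the query semantics described in this section (distinct values, ascending order by default, `WHERE` and `LIMIT` support), the Python sketch below models the behavior over an in-memory list of rows. It is not the `meta_cache` function or its SQL syntax; the `query_distinct` helper, its parameters, and the sample rows are hypothetical names and data chosen for illustration only.

```python
# Hypothetical rows seen in writes to a `cpu` table, as (region, host) pairs.
ROWS = [
    ("us-west", "host-b"),
    ("us-east", "host-c"),
    ("us-west", "host-a"),
    ("us-west", "host-b"),  # repeated write; only distinct values are returned
]
COLUMNS = ("region", "host")


def query_distinct(rows, columns, select, where=None, limit=None):
    """Return distinct tuples of the `select` columns, sorted ascending.

    `where` maps column name -> required value (equality only, an assumption);
    `limit` caps the number of returned entries.
    """
    where = where or {}
    idx = {name: i for i, name in enumerate(columns)}
    matches = {
        tuple(row[idx[c]] for c in select)
        for row in rows
        if all(row[idx[c]] == v for c, v in where.items())
    }
    result = sorted(matches)
    return result if limit is None else result[:limit]


# Distinct hosts, ascending.
print(query_distinct(ROWS, COLUMNS, ["host"]))
# Distinct hosts within one region (needs both columns in the cache).
print(query_distinct(ROWS, COLUMNS, ["host"], where={"region": "us-west"}))
# Both columns, limited to two entries.
print(query_distinct(ROWS, COLUMNS, ["region", "host"], limit=2))
```

The second call mirrors the point made above: filtering hosts by region only works if the cache is configured on both `region` and `host`.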
System Table
There will be a system table that can be queried via:
The `table` predicate is required. This will return results with columns for:
- `table`: the cache's target table name
- `name`: the name of the cache
- `column_ids`: the list of `ColumnId`s that the cache is configured on
- `column_names`: the list of column names that the cache is configured on
- `max_age`: the configured `max_age`, or null if it is not configured for the cache
Configuration API
Metadata caches can be created via the following API:
with body parameters:
- `db`: database the cache is configured on
- `table`: table the cache is configured on
- `name`: (optional) name of the cache; must be unique within the db/table, and will be generated if not provided
- `columns`: ordered list of columns used to construct the cache
- `max_age`: (optional) values that have not been seen for this amount of time will be evicted from the cache
and deleted via:
with body or URL parameters:
- `db`: database the cache is configured on
- `table`: table the cache is configured on
- `name`: the name of the cache
Hierarchical Cache Structure
The metadata cache, like the last-n-value cache, should have a hierarchical structure. For example, for a cache configured on two columns, `region` and `host`, in that order, the values will be stored in a hierarchy:
Eviction
Entries in the cache will be evicted after they have reached the `max_age` configured on the cache. If the cache has a configured `max_age`, an eviction process will need to be run to prune entries that are older than the `max_age`.
Limitations
Cache Population
The cache is populated strictly by incoming writes. So, values that were written to the database before the cache was created - or in the event of a restart, values that were written to the database before the restart - will not be cached.
There are plans to have caches be pre-populated on cache creation and server restart in the Pro version of InfluxDB 3.
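To make the hierarchical structure, `max_age` eviction, and write-only population described above more concrete, here is a minimal Python sketch of one way such a cache could behave. It is only a toy model under stated assumptions, not the proposed implementation: the `MetaCacheSketch` class and its methods are hypothetical, `max_age` is assumed to be in seconds, last-seen timestamps are assumed to be tracked at the leaves, and pruning of emptied parent levels is also an assumption.

```python
import time


class MetaCacheSketch:
    """Toy model: nested dicts keyed by column value, one level per column.

    Leaves map the last column's values to the time they were last seen.
    """

    def __init__(self, columns, max_age=None):
        self.columns = list(columns)  # ordered, fixed at creation
        self.max_age = max_age        # seconds (assumed unit); None disables eviction
        self.root = {}                # starts empty; filled only by writes

    def write(self, row, now=None):
        """Record the column values of one incoming row."""
        now = time.time() if now is None else now
        node = self.root
        for col in self.columns[:-1]:
            node = node.setdefault(row[col], {})
        node[row[self.columns[-1]]] = now

    def evict(self, now=None):
        """Prune leaf values not seen for at least `max_age`."""
        if self.max_age is None:
            return
        now = time.time() if now is None else now
        self._prune(self.root, now, depth=0)

    def _prune(self, node, now, depth):
        is_leaf_level = depth == len(self.columns) - 1
        for key in list(node):
            if is_leaf_level:
                if now - node[key] >= self.max_age:
                    del node[key]
            else:
                self._prune(node[key], now, depth + 1)
                if not node[key]:  # drop parents left with no children
                    del node[key]

    def values(self, *prefix):
        """Distinct values at the level below `prefix`, sorted ascending."""
        if len(prefix) >= len(self.columns):
            raise ValueError("prefix must be shorter than the column list")
        node = self.root
        for value in prefix:
            node = node.get(value, {})
        return sorted(node)


cache = MetaCacheSketch(["region", "host"], max_age=60)
print(cache.values())  # empty until writes arrive
cache.write({"region": "us-west", "host": "a"}, now=0)
cache.write({"region": "us-west", "host": "b"}, now=50)
cache.write({"region": "us-east", "host": "c"}, now=0)
print(cache.values(), cache.values("us-west"))
cache.evict(now=100)
print(cache.values(), cache.values("us-west"))
```

Because the sketch starts with an empty `root` and only `write` adds entries, it also shows the limitation noted above: anything written before the cache exists, or before a restart, is simply absent.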
Questions
`OFFSET` clauses when the contents of the cache could change between queries?
`max_size` for a cache? And if so, is it size in memory usage, or number of elements?