[Stack Monitoring] Metricbeat writes to .monitoring-*-8-mb indice instead of datastream after 8.0 upgrade #30769

klacabane · 2022-03-10T13:10:55Z

Summary

In 8.x, Metricbeat writes stack monitoring data to .monitoring-{product}-8-mb index patterns. These patterns are backed by data stream templates created by Elasticsearch at startup, which define the ILM policies and mappings that Stack Monitoring (SM) needs to function.

We may have a race condition where an unmapped, unusable index is created instead of the expected data stream if Elasticsearch has not set up the backing template by the time Metricbeat starts writing data to .monitoring-{product}-8-mb.

If this scenario is confirmed, we should work on a fix that prevents data from being written to the .monitoring-{product}-8-mb patterns until the templates are successfully set up.

This ticket tracks the investigation and resolution of that potential scenario.


jbaiera · 2022-03-14T17:54:51Z

Potential scenario steps:

1. Upgrade Beats and Elasticsearch to 8.0
2. Beats sends metric data to Elasticsearch targeting .monitoring-es-8-mb
3. Elasticsearch has not yet automatically installed the data stream template (Installed automatically, but after nodes become available for the first time).
4. .monitoring-es-8-mb is created as a regular index - field aliases from the template are missing
5. Querying the data in Stack Monitoring UI breaks because of incorrectly mapped data
6. 3 days later, the local ES Monitoring Plugin CleanerService marks and deletes the .monitoring-es-8-mb index since it is an index that matches .monitoring-* (source)
7. Beats continues to send metric data, which then correctly creates a data stream using the present template
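
To tell which state a cluster is in, one option is to look at how the name resolves. The sketch below assumes the response shape of the Elasticsearch resolve index API (GET /_resolve/index/<name>), which returns separate `indices`, `aliases`, and `data_streams` arrays. The function name and the two sample bodies are hand-written illustrations, not output captured from a real cluster.

```python
from typing import Any, Dict

MONITORING_TARGET = ".monitoring-es-8-mb"


def classify_target(resolve_body: Dict[str, Any], name: str) -> str:
    """Classify `name` from a resolve-index style response body.

    Returns "data_stream", "index", "alias", or "missing".
    """
    checks = (
        ("data_stream", "data_streams"),
        ("index", "indices"),
        ("alias", "aliases"),
    )
    for kind, key in checks:
        if any(entry.get("name") == name for entry in resolve_body.get(key, [])):
            return kind
    return "missing"


# Hypothetical sample: the template was present, so a data stream exists.
healthy = {
    "indices": [],
    "aliases": [],
    "data_streams": [
        {
            "name": MONITORING_TARGET,
            "backing_indices": [".ds-.monitoring-es-8-mb-2022.03.21-000001"],
            "timestamp_field": "@timestamp",
        }
    ],
}

# Hypothetical sample: the write raced the template, so a regular index exists.
broken = {
    "indices": [{"name": MONITORING_TARGET, "attributes": ["open"]}],
    "aliases": [],
    "data_streams": [],
}

print(classify_target(healthy, MONITORING_TARGET))
print(classify_target(broken, MONITORING_TARGET))
```

A result of "index" for this name would correspond to step 4 above, and "data_stream" to step 7.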

lucabelluccini · 2022-03-16T19:06:45Z

Since Elasticsearch 7.10 bulk requests have require_alias to avoid creating an index (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html).

I've not tested whether require_alias also works against data streams. If it does, Metricbeat could add require_alias to ensure it will not create indices prior to the correct setup.

klacabane · 2022-03-21T17:27:42Z

Confirming that I was able to replicate the issue with a large cloud deployment: the .monitoring-es-8-mb index was created while the template and ILM policies were nowhere to be found. During the rolling upgrade, we upgrade one node at a time, and each upgraded node runs Metricbeat 8.x while the rest of the cluster is still running 7.x. Once this first node upgrade is done (or at least once Metricbeat is ready), Metricbeat 8.x may route write requests to a 7.x node in the cluster while the templates are not installed. @mat do you know where I could find resources on cloud upgrade internals?

Looking at the logs for this cluster, the .monitoring-es-mb index template is created at 16:53:29.677 UTC and the first 8.x document is indexed at 16:46:31.922 UTC. Dropping events until the templates are available is questionable with a 7-minute gap, unless the node being monitored does not accept traffic until that setup is complete. Do you happen to know if that's the case, @jbaiera?
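
For reference, the gap between those two log timestamps can be worked out as below. This is just arithmetic on the values quoted above, assuming both timestamps fall on the same day.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"

# Timestamps copied from the log lines quoted above (UTC, same day assumed).
first_8x_document = datetime.strptime("16:46:31.922", FMT)
template_created = datetime.strptime("16:53:29.677", FMT)

gap = template_created - first_8x_document
gap_seconds = gap.total_seconds()

print(f"{gap_seconds:.3f} s")  # prints "417.755 s", i.e. just under 7 minutes
```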

Since Elasticsearch 7.10 bulk requests have require_alias to avoid creating an index

Unfortunately this does not appear to work with data streams, as I'm getting this error both when the template exists (v8.x) and when it does not (v7.x):

...
(status=404): {\"type\":\"index_not_found_exception\",\"reason\":\"no such index [.monitoring-es-8-mb] and [require_alias] request flag is [true] and [.monitoring-es-8-mb] is not an alias\",\"index_uuid\":\"_na_\",\"index\":\".monitoring-es-8-mb\"}, dropping event!"
...
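
For context, a bulk body carrying the flag could look roughly like the sketch below. It uses the per-action `require_alias` metadata field of the bulk API and the `create` action that data streams expect. The helper name and the document fields are placeholders for illustration, not what Metricbeat actually sends.

```python
import json
from typing import Any, Dict, Iterable


def build_bulk_body(target: str, docs: Iterable[Dict[str, Any]], require_alias: bool) -> str:
    """Build an NDJSON body for POST /_bulk with one `create` action per doc."""
    lines = []
    for doc in docs:
        # Action metadata line: with require_alias=true, Elasticsearch rejects
        # the action unless `target` is an alias, instead of auto-creating an index.
        action = {"create": {"_index": target, "require_alias": require_alias}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Placeholder document, not real monitoring output.
sample_docs = [{"@timestamp": "2022-03-21T16:46:31.922Z", "message": "placeholder"}]
bulk_body = build_bulk_body(".monitoring-es-8-mb", sample_docs, require_alias=True)
print(bulk_body, end="")
```

As the error above shows, a request like this is rejected whenever the target is not an alias, which includes the case where it is (or would become) a data stream.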

jbaiera · 2022-03-22T19:29:57Z

Looking at the logs for this cluster, the .monitoring-es-mb index template is created at 16:53:29.677 UTC and the first 8.x document is indexed at 16:46:31.922 UTC. Dropping events until the templates are available is questionable with a 7-minute gap, unless the node being monitored does not accept traffic until that setup is complete. Do you happen to know if that's the case, @jbaiera?

The template registry is configured to only run on the master node of the cluster. I think during a rolling upgrade all nodes are updated to the newer version with the master node being updated last. This could create a window of time where there are 8.0 nodes in the cluster but no monitoring templates are available.

jbaiera · 2022-03-22T19:41:53Z

I just validated that this is the case, at least on cloud. I'll open a discussion issue in the ES repo to see if there are any steps we can take on the ES side of things so that we don't have to drop events during a rolling upgrade.

jbaiera · 2022-03-22T20:23:58Z

I opened elastic/elasticsearch#85247 to see if we can help address the availability window for the data during a rolling upgrade.

klacabane · 2022-03-23T16:42:12Z

On the Beats side we could prevent the SM modules from writing monitoring events until the templates are set up. The downside is that we'll fly blind during the upgrade for the nodes that have completed the process, but we could cache those events (i.e., the last n bytes) as a best effort and eventually backfill them.
I'll check whether there is an existing Beats mechanism that we can reuse for that - maybe @kvch @ruflin have suggestions?
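
To make the "last n bytes" idea concrete, here is a minimal sketch of a byte-bounded buffer that keeps the newest events and evicts the oldest. It is written in Python for brevity and is not an existing Beats mechanism; the class and method names are made up for illustration.

```python
from collections import deque
from typing import Deque, List


class BoundedEventBuffer:
    """Keep the most recent serialized events within a fixed byte budget."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self._events: Deque[bytes] = deque()
        self._size = 0

    def add(self, event: bytes) -> None:
        if len(event) > self.max_bytes:
            return  # a single oversized event can never fit, so drop it
        self._events.append(event)
        self._size += len(event)
        # Evict the oldest events until the buffer fits the budget again.
        while self._size > self.max_bytes:
            self._size -= len(self._events.popleft())

    def drain(self) -> List[bytes]:
        """Return buffered events oldest-first and empty the buffer."""
        events = list(self._events)
        self._events.clear()
        self._size = 0
        return events


buffer = BoundedEventBuffer(max_bytes=10)
for payload in (b"aaaa", b"bbbb", b"cccc"):
    buffer.add(payload)
backfill = buffer.drain()  # the oldest event was evicted to stay within 10 bytes
```

Once the templates are available, the drained events would be what gets backfilled; anything evicted before that point is lost, which is the best-effort part.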

ruflin · 2022-03-24T14:04:19Z

I remember we had many discussions around this before data streams existed, because of the alias problem more than the template problem. In general I agree: if the template / data stream is not there, we should not write to the data stream, as it might cause issues. The part I'm not sure about is whether we should really solve this problem on the ingest side. Beats / Elastic Agent would need a lot of otherwise unnecessary permissions to do the checks needed. On the one hand it is a permissions issue, but I would argue it is also partially a performance issue, as the Beat likely needs to do these checks each time it reconnects. Instead I would rather see this enforced on the Elasticsearch side: if the template / data stream is not there, Elasticsearch does not allow data to be ingested. I vaguely remember that we even wanted to build such a param; not sure if that ever landed in Elasticsearch. @kvch you might remember?

klacabane · 2022-03-28T12:49:29Z

Agreed on a flag similar to require_alias that rejects indexing requests targeting a non-existing template. However, I'm concerned about the loss of data during the interval where the template is not yet installed (15 minutes on a 5-node cluster); this is large enough to miss important patterns and alert states. Backfilling would not solve this problem, but could allow detection of the patterns (without acting on them, since the alerts could not run in time). If a suitable mechanism already exists, this could be a cheap improvement. Improving the time it takes to land these templates would make that effort unnecessary, so let's see if there are improvements on that front first :)
I'm wondering if enabling self-monitoring before the upgrade starts could solve that issue, so that we're still ingesting data during the process?

ruflin · 2022-03-28T14:58:42Z

15 minutes on a 5-node cluster

I was hoping we were talking about a much shorter time period. One of the key problems here, I think, is that Metricbeat runs inside the Docker container in Cloud. Ideally, Beats would only be updated after the full cluster is updated, so this issue would not happen in the first place. But this is not something we can fix for 8.0 :-(

Alternative workaround idea: What if the 7.last migration assistant would put the right template already in place?

jasonrhodes · 2022-03-29T15:29:31Z

@ruflin do Beat modules expect you to run mybeat setup on every version upgrade of the beat? I'm curious how other beat modules handle this (when MB is the one responsible for installing mappings for most modules)

ruflin · 2022-03-30T14:09:39Z

Beats by default checks whether the templates and ingest pipelines are there on each reconnect. You don't need to run setup for this; Beats just does it. As it is all one template, it is just one check for all the modules.

jasonrhodes · 2022-04-18T16:23:05Z

3. Elasticsearch has not yet automatically installed the data stream template (Installed automatically, but after nodes become available for the first time).

I'm looking through this again and wondering about this line. @jbaiera it sounds like you're saying that ES waits to create the data stream templates until the nodes are available. @klacabane is saying that we've noticed as much as a 7 minute gap between MB writing data and the data stream template being created.

What I'm curious about is, if the template is created as soon as the nodes are available, what is MB writing to for 7+ minutes?

klacabane · 2022-04-19T14:03:34Z

The templates will be installed by the master node once it is upgraded to 8.x. In a rolling upgrade the master node will be upgraded last (or at least after some data nodes have been upgraded; I need to verify the cloud upgrade orchestration details). When a non-master node upgrades, its sidecar Metricbeat will write to the monitoring index pattern, but since the templates are not there yet, it creates a mapping-less index instead of the expected data stream, which breaks SM querying. The gap is the time it took for the templates to be installed in a small-sized cluster.

Ideally we want to maintain monitoring visibility during the upgrade so the templates need to be installed as soon as the first node is upgraded:

can we install the templates via the upgrade assistant?
can we install the templates with a non-master node?
can Metricbeat install the templates?

Otherwise we could prevent Metricbeat from writing to the index pattern if no underlying template has been installed, with something like the require_alias query parameter.
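
A client-side version of that gate could be sketched as follows, using the existence check that Elasticsearch exposes as HEAD /_index_template/<name>. The `head_status` callable is a hypothetical stand-in for whatever HTTP client is in use, and the template name is the one from the logs quoted earlier in the thread. This is only an illustration of the idea, not how Metricbeat behaves today, and it would require the extra permissions discussed above.

```python
from typing import Callable

# Template name as it appears in the cluster logs quoted earlier in the thread.
TEMPLATE_NAME = ".monitoring-es-mb"


def template_installed(head_status: Callable[[str], int], name: str = TEMPLATE_NAME) -> bool:
    """Return True if the composable index template exists.

    `head_status(path)` is a hypothetical helper that issues a HEAD request
    against Elasticsearch and returns the HTTP status code.
    """
    return head_status(f"/_index_template/{name}") == 200


def should_publish(head_status: Callable[[str], int]) -> bool:
    """Gate monitoring writes on the backing template being present."""
    return template_installed(head_status)


# Fake transports standing in for a 7.x master (no template) and an 8.x master.
def before_master_upgrade(path: str) -> int:
    return 404


def after_master_upgrade(path: str) -> int:
    return 200


print(should_publish(before_master_upgrade))  # False: hold or buffer events
print(should_publish(after_master_upgrade))   # True: safe to write
```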

jasonrhodes · 2022-04-19T14:59:51Z

I like the options of installing templates for each node as it's upgraded, but I don't know what the implications of that might be. I haven't thought too hard about index templates and their relationship to master v non-master nodes ... each node doesn't have its own copy of the templates?

jbaiera · 2022-04-19T18:08:46Z

... each node doesn't have its own copy of the templates?

Each node has information about the templates because they are stored in the cluster state, but the master node is where all template creation operations are applied to the cluster state.

Most of the automatic template creation logic waits for the master node to be upgraded before installing templates. This is because we added composable index templates as a feature in the 7.x line (7.8 to be exact). Master nodes are usually the last to be upgraded in a rolling upgrade (for safety purposes). If you are upgrading from an early version on the 7.x line that doesn't have support for composable templates to a version that does, then you need to wait for the master to be upgraded so that it knows how to apply those templates to the cluster state.

Now that we're in 8.0 territory the oldest version a node can be during a rolling upgrade is likely to be 7.17.X which is safely always aware of composable index templates. Waiting for the master node to be upgraded for the above reason isn't required at this point, but if we ever add new functionality to the templates that doesn't exist in earlier 8.x versions, then we must again wait for the master to be fully updated before templates can be installed. In that situation, metric collection will likely have to wait again.

jbaiera · 2022-04-19T18:21:01Z

So ultimately what I'm saying is that we could get the templates to install earlier, but I'm worried that we will just kick the can down the road until we have new functionality we want to add to the templates (perhaps TSDB stuff?) and this becomes a problem again. I am going to see if we can discuss the crux of the issue (elastic/elasticsearch#85247) in the next data management area meeting and figure out where we go on to next.

jasonrhodes · 2022-04-19T22:14:06Z

Thanks, @jbaiera -- keep us updated. We could tell users that they lose monitoring visibility during a rolling upgrade but that feels like it calls into suspicion the point of a rolling upgrade, a bit.

jbaiera · 2022-05-02T15:20:48Z

We met and discussed some things about template installation and we determined that even though we could add template installation to be done before the master is available currently, any feature additions to templates in the future would just cause us to be in the same state. We discussed some options around how to avoid gaps in monitoring data when doing rolling upgrades:

Changes to the orchestration logic - Our upgrade documentation says to fully update Elasticsearch before upgrading collectors (like metricbeat, agent). This is also the workflow we have in mind on the ES team when making changes. It would be more correct to have the collecting agents upgraded all at once after the full rolling restart has finished and rely on the older version of metric collection in the gap time since the resources on the cluster should be stable. Probably easier to do this with k8s but worth exploring as an option.
We could make changes to beats so that it supports writing to the older indices/resources until the templates are available for ingestion. This would require some slight overhead in the agents to check the templates and switch to an older resource, potentially changing the shape of the data if that is a concern. Alternatively, mappings could be written to be backwards compatible, though that makes breaking changes much harder to implement (the root of the problem we're seeing right now).

jasonrhodes · 2022-05-03T14:26:53Z

Thanks, @jbaiera! I wonder if we should consider just making MB write/update these mappings since this is a MB-only problem. I think that would also solve the issue, right? cc: @klacabane / @matschaffer

matschaffer · 2022-05-09T05:06:51Z

@jasonrhodes I've seen some challenges with that approach as well. Namely:

Access - in some cases (particularly ECE), we don't want the typical "collection mode" authorization to be able to take cluster-level actions like template updates. It should just be able to set its own data and that's it.
Template update pressure - if we naively tell beats to push a template on every startup, you can hit start up loop situations (kernel OOM + docker restart behavior) where beats re-puts templates very frequently. These requests involve a cluster state update, so can be costly.

Both problems can be overcome by an extra operator action (or orchestration action) to push templates before upgrading metricbeat.

In general I agree with the sentiment that the publisher of the data should probably own the template rather than embedding templates with elasticsearch.

klacabane · 2022-05-09T11:31:42Z

Metricbeat installing the templates would solve the problem for now, but I think we'll hit the same wall if/when we want to use new template functionality (i.e., waiting for the master to be upgraded). Moving the template installation to a different registry (mentioned in elastic/elasticsearch#85247 (comment)) would achieve similar behavior, if sooner means before Metricbeat gets a chance to write to the index in a cloud deployment. I'd be in favor of the latter since it should be much lower effort for the same result (not discounting that having the templates closer to the publisher is a good design, but it does not solve any existing problems).

We could make changes to beats so that it supports writing to the older indices/resources until the templates are available for ingestion.

This would mean bringing back the code that handled the legacy document format (completely removed in 8.x) and make the metricbeat code a complex beast. I'm not convinced such effort is justified for now but could be when we eventually face that problem.

Changes to the cloud orchestration to reflect the upgrade order that we recommend for on-prem deployments would be ideal

kunisen · 2022-05-10T05:33:54Z

Hi team, is there a way to fix this without simply deleting the offending regular indices (as that is destructive)?
i.e., reindex the old data to another index name, delete the offending indices, and then, after the data stream has been created, reindex the data back.

matschaffer · 2022-05-10T06:47:35Z

It'd be good to test out before adding it to docs, but I don't see why that wouldn't be an option. You could probably even skip the final reindex in favor of reindexing to a name covered by typical index patterns (for example, .monitoring-{product}-8-mb-reindex1).
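
As an untested sketch of that sequence, the requests could be laid out like this. The helper and its return shape are made up for illustration; only POST /_reindex and DELETE /<index> are real Elasticsearch endpoints. Whether the destination needs mappings created up front, or `op_type: create` because a data stream template matches its name, depends on the cluster and would need checking first.

```python
from typing import Any, Dict, List, Tuple

# (HTTP method, path, JSON body) triples; an empty dict means no body.
Request = Tuple[str, str, Dict[str, Any]]


def reindex_plan(product: str, suffix: str = "reindex1") -> List[Request]:
    """List the requests that move data out of the wrongly created index."""
    source = f".monitoring-{product}-8-mb"
    dest = f"{source}-{suffix}"
    return [
        # 1. Copy the documents to a differently named index.
        ("POST", "/_reindex", {"source": {"index": source}, "dest": {"index": dest}}),
        # 2. Remove the regular index so the name is free for the data stream.
        ("DELETE", f"/{source}", {}),
    ]


plan = reindex_plan("es")
for method, path, body in plan:
    print(method, path, body)
```

The delete should only be issued after the reindex has completed and the document counts have been compared.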

jasonrhodes · 2022-05-23T12:20:30Z

Metricbeat installing the templates would solve the problem for now, but I think we'll hit the same wall if/when we want to use new template functionality (i.e., waiting for the master to be upgraded). Moving the template installation to a different registry (mentioned in elastic/elasticsearch#85247 (comment)) would achieve similar behavior, if sooner means before Metricbeat gets a chance to write to the index in a cloud deployment. I'd be in favor of the latter since it should be much lower effort for the same result (not discounting that having the templates closer to the publisher is a good design, but it does not solve any existing problems).

Hm ok, I don't 100% follow the upgrade flow but if this is true, then I agree that the separate registry is the easier solution for now (with the same ongoing risks). What I also still don't quite understand is how this architecture could ever support a rolling upgrade for anything that involves changes to index templates? Is there a plan for Elasticsearch to fix this at some point?

We could make changes to beats so that it supports writing to the older indices/resources until the templates are available for ingestion.

This would mean bringing back the code that handled the legacy document format (completely removed in 8.x) and make the metricbeat code a complex beast. I'm not convinced such effort is justified for now but could be when we eventually face that problem.

Very much agree here. I think this would be a pretty dangerous road to go down.

botelastic · 2024-01-24T12:01:51Z

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1.
Thank you for your contribution!

botelastic · 2025-01-25T20:45:55Z

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1.
Thank you for your contribution!

klacabane added the bug, Metricbeat, v8.2.0, and Team:Infra Monitoring UI - DEPRECATED labels Mar 10, 2022

lucabelluccini mentioned this issue Mar 16, 2022

[DOCS] Metricbeat Elasticsearch/Kibana/Logstash modules for 8.0 documentation is wrong #30877

Closed

smith mentioned this issue Apr 14, 2022

Stack Monitoring Tech Debt Plan elastic/kibana#127224

Closed

39 tasks

roshan-elastic added the Feature:Stack Monitoring label Jan 24, 2023

botelastic bot added the Stalled label Jan 24, 2024

smith added the Team:Monitoring label and removed the Team:Infra Monitoring UI - DEPRECATED label Jan 26, 2024

botelastic bot removed the Stalled label Jan 26, 2024

smith removed the v8.2.0 label Jan 26, 2024

botelastic bot added the Stalled label Jan 25, 2025
