[Stack Monitoring] Metricbeat writes to .monitoring-*-8-mb indice instead of datastream after 8.0 upgrade #30769
Comments
Potential scenario steps:
|
Since Elasticsearch 7.10 bulk requests have I've not tested if |
Confirming that I was able to replicate the issue with a large cloud deployment, Looking at the logs for this cluster, the
Unfortunately this does not appear to work with datastreams as I'm getting this error both when the template exists (v8.x) or not (v7.x)
|
The template registry is configured to only run on the master node of the cluster. I think during a rolling upgrade all nodes are updated to the newer version with the master node being updated last. This could create a window of time where there are 8.0 nodes in the cluster but no monitoring templates are available. |
I just validated that this is the case, at least on cloud. I'll open a discussion issue in the ES repo to see if there are any steps we can take on the ES side of things so that we don't have to drop events during a rolling upgrade. |
I opened elastic/elasticsearch#85247 to see if we can help address the availability window for the data during a rolling upgrade. |
On the beats side we could prevent the SM modules from writing monitoring events until the templates are set up. The downside is that we'll fly blind during the upgrade for the nodes that have completed the process, but we could cache those events (i.e. the last n bytes) as a best effort and eventually backfill them. |
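To make the "cache the last n bytes" idea concrete, here is a minimal sketch of a size-bounded buffer that keeps only the most recent events and hands them back for a later backfill. `BoundedEventBuffer`, `add` and `drain` are hypothetical names for illustration; this is not how Beats queues events.

```python
from collections import deque
from typing import Deque, List


class BoundedEventBuffer:
    """Keep the most recent events whose total size fits within max_bytes.

    Hypothetical helper for illustration only; not part of Beats.
    """

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self._events: Deque[bytes] = deque()
        self._size = 0

    def add(self, event: bytes) -> None:
        # An event larger than the whole budget can never be kept.
        if len(event) > self.max_bytes:
            return
        self._events.append(event)
        self._size += len(event)
        # Evict the oldest events until the buffer is back under budget.
        while self._size > self.max_bytes:
            dropped = self._events.popleft()
            self._size -= len(dropped)

    def drain(self) -> List[bytes]:
        """Return buffered events oldest-first and empty the buffer."""
        events = list(self._events)
        self._events.clear()
        self._size = 0
        return events
```

Once the templates exist, the caller would `drain()` the buffer and publish the result. Anything evicted before that point is lost, which is the "best effort" part.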
I remember we had many discussions around this before data streams existed, more because of the alias problem than the template problem. In general I agree: if the template / data streams are not there, we should not write to the data streams as it might cause issues. The part I'm not sure about is whether we should really solve this problem on the ingest side. Beats / Elastic Agent would need a lot of otherwise unnecessary permissions to do the required checks. On the one hand it is a permissions issue, but I would argue it is also partially a performance issue, as the beat likely needs to do these checks each time it reconnects. Instead I would like to see this enforced on the Elasticsearch side: if the template / data stream is not there, Elasticsearch does not allow data to be ingested. I vaguely remember that we even wanted to build such a param, not sure if that ever landed in Elasticsearch. @kvch you might remember? |
Agreed on a flag similar to |
I was hoping we were talking about a much shorter time period. One of the key problems here, I think, is that metricbeat runs inside the docker container in Cloud. Ideally, Beats would only be updated after the full cluster is updated, so this issue would not happen in the first place. But this is not something we can fix for 8.0 :-( Alternative workaround idea: what if the 7.last migration assistant put the right template in place already? |
@ruflin do Beat modules expect you to run |
Beats by default checks whether the templates and ingest pipelines are there on each reconnect. You don't need to run setup for this, Beats just does it. As it is all 1 template, it is just 1 check for all the modules. |
I'm looking through this again and wondering about this line. @jbaiera it sounds like you're saying that ES waits to create the data stream templates until the nodes are available. @klacabane is saying that we've noticed as much as a 7 minute gap between MB writing data and the data stream template being created. What I'm curious about is, if the template is created as soon as the nodes are available, what is MB writing to for 7+ minutes? |
The templates will be installed by the master node once it is upgraded to 8.x. In a rolling upgrade the master node will be upgraded last (or at least after some data nodes have been upgraded, need to verify the cloud upgrade orchestration details). When a non-master node upgrades, its sidecar metricbeat will write to the monitoring index pattern, but since the templates are not there yet it creates a mapping-less index instead of the expected data stream, which will break SM querying. The gap is the time it took for the templates to be installed in a small-sized cluster. Ideally we want to maintain monitoring visibility during the upgrade, so the templates need to be installed as soon as the first node is upgraded:
Otherwise we could prevent metricbeat from writing to the index pattern if no underlying template has been installed, something like the require_alias query parameter. |
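A toy in-memory model may help make the failure mode and the proposed guard concrete. It is only a sketch: `ToyCluster` is not Elasticsearch code, the `require_template` flag is a hypothetical stand-in for "something like require_alias", and the `.monitoring-*-8-mb` pattern is an assumed template pattern used for illustration.

```python
from fnmatch import fnmatchcase
from typing import List, Set


class TemplateMissingError(Exception):
    """Raised when a write requires a template that is not installed."""


class ToyCluster:
    """In-memory stand-in for the auto-create behaviour described above."""

    def __init__(self) -> None:
        self.data_stream_templates: List[str] = []  # index patterns
        self.data_streams: Set[str] = set()
        self.plain_indices: Set[str] = set()

    def install_template(self, pattern: str) -> None:
        self.data_stream_templates.append(pattern)

    def index(self, target: str, require_template: bool = False) -> str:
        """Write one document and return the kind of target that received it."""
        if target in self.data_streams:
            return "data_stream"
        if target in self.plain_indices:
            # A plain index created earlier keeps receiving the writes.
            return "index"
        if any(fnmatchcase(target, p) for p in self.data_stream_templates):
            self.data_streams.add(target)
            return "data_stream"
        if require_template:
            # Hypothetical flag: refuse instead of auto-creating an index.
            raise TemplateMissingError(target)
        self.plain_indices.add(target)
        return "index"
```

In this model, a write that arrives before `install_template` leaves a plain index behind, and later writes keep landing in it even after the template is installed. With `require_template=True` the early write is rejected, so the first successful write creates the data stream.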
I like the options of installing templates for each node as it's upgraded, but I don't know what the implications of that might be. I haven't thought too hard about index templates and their relationship to master v non-master nodes ... each node doesn't have its own copy of the templates? |
Each node has information about the templates because they are stored in the cluster state, but the master node is where all template creation operations are applied to the cluster state. Most of the automatic template creation logic waits for the master node to be upgraded before installing templates. This is because we added composable index templates as a feature in the 7.x line (7.8 to be exact). Master nodes are usually the last to be upgraded in a rolling upgrade (for safety purposes). If you are upgrading from an early version on the 7.x line that doesn't have support for composable templates to a version that does, then you need to wait for the master to be upgraded so that it knows how to apply those templates to the cluster state. Now that we're in 8.0 territory the oldest version a node can be during a rolling upgrade is likely to be 7.17.X which is safely always aware of composable index templates. Waiting for the master node to be upgraded for the above reason isn't required at this point, but if we ever add new functionality to the templates that doesn't exist in earlier 8.x versions, then we must again wait for the master to be fully updated before templates can be installed. In that situation, metric collection will likely have to wait again. |
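As a rough sketch of the trade-off described above: installation could be gated on the oldest node version in the cluster rather than on the master being upgraded. The function names are hypothetical, the version strings are assumed to be plain `major.minor.patch`, and this is not the actual template registry logic.

```python
from typing import Iterable, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse a plain 'major.minor.patch' string (assumed format)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def can_install_templates(node_versions: Iterable[str], feature_since: str) -> bool:
    """True when every node already understands the template feature.

    feature_since is the first version that supports whatever the
    templates rely on, e.g. "7.8.0" for composable index templates.
    """
    required = parse_version(feature_since)
    return all(parse_version(v) >= required for v in node_versions)
```

With nodes at 7.17.1 and 8.0.0 and a feature introduced in 7.8.0, the gate opens immediately. If the templates started relying on something introduced in a hypothetical 8.1.0, the same mixed cluster would have to wait, which is the recurrence being described.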
So ultimately what I'm saying is that we could get the templates to install earlier, but I'm worried that we will just kick the can down the road until we have new functionality we want to add to the templates (perhaps TSDB stuff?) and this becomes a problem again. I am going to see if we can discuss the crux of the issue (elastic/elasticsearch#85247) in the next data management area meeting and figure out where we go next. |
Thanks, @jbaiera -- keep us updated. We could tell users that they lose monitoring visibility during a rolling upgrade, but that feels like it calls into question the point of a rolling upgrade, a bit. |
We met and discussed template installation. We determined that even though we could currently move template installation to before the master is available, any future feature additions to templates would put us back in the same state. We discussed some options for avoiding gaps in monitoring data during rolling upgrades:
|
Thanks, @jbaiera! I wonder if we should consider just making MB write/update these mappings since this is a MB-only problem. I think that would also solve the issue, right? cc: @klacabane / @matschaffer |
@jasonrhodes I've seen some challenges with that approach as well. Namely:
Both problems can be overcome by an extra operator action (or orchestration action) to push templates before upgrading metricbeat. In general I agree with the sentiment that the publisher of the data should probably own the template rather than embedding templates with elasticsearch. |
Metricbeat installing the templates would solve the problem for now, but I think we'll hit the same wall if/when we want to use new template functionality (i.e. waiting for the master to be upgraded). Moving the template installation to a different registry (mentioned in elastic/elasticsearch#85247 (comment)) would achieve similar behavior if sooner means before metricbeat gets a chance to write to the index in a cloud deployment, and I'd be in favor of the latter since it should be much lower effort for the same result (not discounting that having the templates closer to the publisher is a good design, but it does not solve any existing problems).
This would mean bringing back the code that handled the legacy document format (completely removed in 8.x) and would make the metricbeat code a complex beast. I'm not convinced such effort is justified for now, but it could be when we eventually face that problem. Changes to the cloud orchestration to reflect the upgrade order that we recommend for on-prem deployments would be ideal. |
Hi team, is there a way to fix this without deleting the offending regular indices (as it's destructive)? |
It'd be good to test out before adding it to docs, but I don't see why that wouldn't be an option. You could probably even skip the final reindex in favor of reindexing to a name covered by typical index patterns (for example, |
Hm ok, I don't 100% follow the upgrade flow but if this is true, then I agree that the separate registry is the easier solution for now (with the same ongoing risks). What I also still don't quite understand is how this architecture could ever support a rolling upgrade for anything that involves changes to index templates? Is there a plan for Elasticsearch to fix this at some point?
Very much agree here. I think this would be a pretty dangerous road to go down. |
Hi! We're labeling this issue as |
Summary
In 8.x, metricbeat writes stack monitoring data to .monitoring-{product}-8-mb index patterns. These patterns are backed by data stream templates created by elasticsearch at startup, which define the ILM and mappings necessary for SM to function.

We may have a race condition where an unmapped, unusable index is created instead of the expected data stream if elasticsearch has not set up the backing template by the time metricbeat starts writing data to .monitoring-{product}-8-mb.

If the scenario is confirmed, we should work on a fix that prevents data from being written to the .monitoring-{product}-8-mb patterns until the templates are successfully set up.

This ticket tracks the investigation and resolution of that potential scenario.
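One possible shape for "hold writes until the templates are set up" is a bounded poll before the first publish. The sketch below is only an illustration: `wait_until_ready` is a hypothetical helper, and `check` stands in for whatever lookup the publisher is allowed to perform, which the thread notes may need extra permissions.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    attempts: int,
    delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() up to `attempts` times; return True once it succeeds.

    `sleep` is injectable so the loop can be exercised without waiting.
    """
    for attempt in range(attempts):
        if check():
            return True
        # No point in sleeping after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

A caller would start publishing only when this returns `True`, and otherwise drop or buffer events. It does not close the race entirely, since the template could still be removed afterwards, but it covers the upgrade window described here.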