Process config::update/delete cluster events gracefully #9980
This could explain
ref/IP/48850
There's one more point to clarify! Before this PR, faulty objects with the icinga2/lib/cli/daemoncommand.cpp Lines 314 to 317 in 01a6c4c
|
23d30d0
to
c485909
Compare
I agree.
You can probably construct cases where the object might survive if it references an object that was temporarily removed before the object was added and is added back before the next reload. But given that
There's something else where the behavior probably changed: instantiating the object in memory isn't the only operation that can fail, creating the file might fail as well. Please explain what happens in this case. If you want to try it, you probably don't even have to consider the cluster sync; simply observing what happens on the initial node should be enough. Such an error could be induced by removing write permissions, for example.
Not possible? I doubt that, it would need a different approach though.
Then what about creating the AtomicFile where the whole file has been created before, but committing where it's created now?
Tests with ---
[2024-01-26 13:34:27 +0000] critical/ApiListener: Error: Import references unknown template: 'generic-service'
[2024-01-26 13:34:27 +0000] critical/ApiListener: Could not create object 'example-999!service-999':
[2024-01-26 13:34:27 +0000] critical/ApiListener: Error: Import references unknown template: 'generic-service'
[2024-01-26 13:34:27 +0000] critical/ApiListener: Could not create object 'example-1000!service-1000':
[2024-01-26 13:34:27 +0000] critical/ApiListener: Error: Import references unknown template: 'generic-service'
--- File sync triggered an Icinga 2 daemon reload: ---
[2024-01-26 13:34:29 +0000] information/ApiListener: 'api' stopped.
[2024-01-26 13:34:29 +0000] information/ApiListener: 'api' started.
[2024-01-26 13:34:29 +0000] information/ApiListener: Started new listener on '[10.27.0.198]:5665'
---
[2024-01-26 13:34:29 +0000] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones-stage/global-templates' (2168 Bytes).
[2024-01-26 13:34:29 +0000] information/ApiListener: Received configuration updates (1) from endpoint 'master2' are equal to production, skipping validation and reload.
[2024-01-26 13:34:31 +0000] critical/ApiListener: Could not create object 'example-602':
[2024-01-26 13:34:31 +0000] critical/ApiListener: Error: Object 'example-602' of type 'Host' re-defined: in /var/lib/icinga2/api/packages/_api/97469d41-cc06-4486-b8c8-788c6ecb3aab/conf.d/hosts/example-602.conf: 1:0-1:24; previous definition: in /var/lib/icinga2/api/packages/_api/97469d41-cc06-4486-b8c8-788c6ecb3aab/conf.d/hosts/example-602.conf: 1:0-1:24
[2024-01-26 13:34:31 +0000] critical/ApiListener: Could not create object 'example-669':
[2024-01-26 13:34:31 +0000] critical/ApiListener: Error: Object 'example-669' of type 'Host' re-defined: in /var/lib/icinga2/api/packages/_api/97469d41-cc06-4486-b8c8-788c6ecb3aab/conf.d/hosts/example-669.conf: 1:0-1:24; previous definition: in /var/lib/icinga2/api/packages/_api/97469d41-cc06-4486-b8c8-788c6ecb3aab/conf.d/hosts/example-669.conf: 1:0-1:24
---
[root@config-sync-2 ~]# ls /var/lib/icinga2/api/packages/_api/97469d41-cc06-4486-b8c8-788c6ecb3aab/conf.d/hosts | wc -l
1000
[root@config-sync-2 ~]# ls /var/lib/icinga2/api/packages/_api/97469d41-cc06-4486-b8c8-788c6ecb3aab/conf.d/services | wc -l
1000
Without reworking the entire message processing? Note that this not only affects the However, I can at least prevent the configuration updates from being processed simultaneously, if you wish! |
Same inconsistency! Icinga 2 fails with:
[2024-01-26 14:47:24 +0000] critical/ApiListener: Error: Function call 'mkstemp' for file '/var/lib/icinga2/api/packages/_api/166b6876-61cc-4feb-b10a-0efbe051dc2c/conf.d/hosts/example-2001.conf.tmp.oGHmxh' failed with error code 13, 'Permission denied'
[2024-01-26 14:47:24 +0000] critical/ApiListener: Error: Function call 'mkstemp' for file '/var/lib/icinga2/api/packages/_api/166b6876-61cc-4feb-b10a-0efbe051dc2c/conf.d/hosts/example-2002.conf.tmp.SK6dgw' failed with error code 13, 'Permission denied'
However, the objects can still be found via the API!
~/Workspace/icinga2 (Api-connect_timeout-DNS ✗) curl -k -s -S -i -u root:icinga 'https://10.27.0.198:5665/v1/objects/hosts/example-2001?pretty=1'
---
{
"results": [
{
"attrs": {
"__name": "example-2001",
"source_location": {
"first_column": 0,
"first_line": 1,
"last_column": 25,
"last_line": 1,
"path": "/var/lib/icinga2/api/packages/_api/166b6876-61cc-4feb-b10a-0efbe051dc2c/conf.d/hosts/example-2001.conf"
},
--- |
There's something else where the behavior probably changed: instantiating the object in memory isn't the only operation that can fail, creating the file might fail as well.
Then what about creating the AtomicFile where the whole file has been created before, but committing where it's created now?
Additionally, I'd also write everything ASAP, just as before. Only the commit has to be where it is right now in this PR.
Which benefit do you get from doing so? It would just waste a write operation if the file isn't going to be committed.
You'd ensure you have the permission (Hello SELinux!) and enough disk space to write the file. Also, while on it, call boost::iostreams::stream#flush() immediately after <<, just to be sure.
@yhabteab Did you also try what happened when trying that without this PR? That should have returned an error without leaving behind the object in memory. Also, as your logs show the error twice, I presume you also tested this in the config sync scenario. However, the same difference in behavior should show on the node handling an HTTP request for creating an object. If it couldn't be written to disk before, it wasn't created at all. With this PR, it's created in memory but not persisted on disk (and even though the client probably receives an error in this case, a subsequent GET request would return the object).
r2.14.2-1): [2024-01-26 15:28:08 +0000] warning/HttpServerConnection: Error while processing HTTP request: Function call 'std::ifstream::open' for file '/var/lib/icinga2/api/packages/_api/166b6876-61cc-4feb-b10a-0efbe051dc2c/conf.d/hosts/example-2001.conf' failed with error code 2, 'No such file or directory'
---
{
"error": 404,
"status": "The requested path 'v1/objects/hosts/example-2002' could not be found or the request method is not valid for this path."
}
This PR: Not so good!
{
"results": [
{
"code": 500,
"errors": [
"Error: Function call 'mkstemp' for file '/var/lib/icinga2/api/packages/_api/166b6876-61cc-4feb-b10a-0efbe051dc2c/conf.d/hosts/example-2005.conf.tmp.zdkA6q' failed with error code 13, 'Permission denied'\n"
],
"status": "Object could not be created."
}
]
}
e5b9bd2
to
087c8b5
Compare
f5f993d
to
81d9909
Compare
Damn, forgot to send my review.
Defer unlockAndNotify([&listener, &ptype, &objName]{
	listener->m_ObjectConfigChangeLock.UnLock(ptype, objName);
});
This just screams for a RAII style class like CpuBoundWork.
Defer unlockAndNotify([&listener, &ptype, &objName]{
	listener->m_ObjectConfigChangeLock.UnLock(ptype, objName);
});
Especially as there's >1 usage...
	std::condition_variable m_CV;
	std::unordered_map<Type*, std::set<String>> m_LockedObjectNames;
};
... and even a dedicated class.
I have absolutely no idea what you're talking about! This class literally locks and unlocks Mutex and you want another Mutex on top of that?
Your class is like std::mutex. And I'd like something like std::unique_lock on top. Btw. depending on how you name your methods, you could actually use std::unique_lock itself! 💡
This just screams for a RAII style class like CpuBoundWork.
I mean that would be nice to have.
Especially as there's >1 usage...
Well, the number is 2, so IMHO that's also fine with Defer.
Btw. depending on how you name your methods, you could actually use std::unique_lock itself!
I also thought about that before. Names aren't the only issue here. You can't pass extra parameters, and std::unique_lock takes a reference to the underlying mutex, whereas with this class, if a name is not locked, there's no memory behind it, so nothing that could be referenced.
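For readers following this thread, here is a minimal sketch of what a per-name lock plus an RAII guard could look like. It is not the code from this PR: `ObjectNameLock`, `ObjectNameGuard`, `Lock`, `Unlock` and `IsLocked` are hypothetical names, and `std::string` stands in for Icinga 2's `Type*` and `String`. The guard keeps the extra parameters (type and name) itself, which is the part `std::unique_lock` cannot do.

```cpp
#include <cassert>
#include <condition_variable>
#include <map>
#include <mutex>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for the class discussed above: it locks (type, name)
// pairs without allocating a mutex per object. Unlocked names use no memory.
class ObjectNameLock
{
public:
	void Lock(const std::string& type, const std::string& name)
	{
		std::unique_lock<std::mutex> lock(m_Mutex);
		// Wait until nobody else holds this (type, name) pair, then claim it.
		m_CV.wait(lock, [&] { return !IsLockedUnsafe(type, name); });
		m_LockedObjectNames[type].insert(name);
	}

	void Unlock(const std::string& type, const std::string& name)
	{
		{
			std::lock_guard<std::mutex> lock(m_Mutex);
			auto it = m_LockedObjectNames.find(type);
			assert(it != m_LockedObjectNames.end() && it->second.count(name) == 1);
			it->second.erase(name);
			if (it->second.empty())
				m_LockedObjectNames.erase(it);
		}
		m_CV.notify_all(); // wake waiters so they can re-check their own name
	}

	bool IsLocked(const std::string& type, const std::string& name)
	{
		std::lock_guard<std::mutex> lock(m_Mutex);
		return IsLockedUnsafe(type, name);
	}

private:
	// Caller must hold m_Mutex.
	bool IsLockedUnsafe(const std::string& type, const std::string& name) const
	{
		auto it = m_LockedObjectNames.find(type);
		return it != m_LockedObjectNames.end() && it->second.count(name) == 1;
	}

	std::mutex m_Mutex;
	std::condition_variable m_CV;
	std::map<std::string, std::set<std::string>> m_LockedObjectNames;
};

// RAII guard: locks in the constructor, unlocks in the destructor.
class ObjectNameGuard
{
public:
	ObjectNameGuard(ObjectNameLock& lock, std::string type, std::string name)
		: m_Lock(lock), m_Type(std::move(type)), m_Name(std::move(name))
	{
		m_Lock.Lock(m_Type, m_Name);
	}

	~ObjectNameGuard() { m_Lock.Unlock(m_Type, m_Name); }

	ObjectNameGuard(const ObjectNameGuard&) = delete;
	ObjectNameGuard& operator=(const ObjectNameGuard&) = delete;

private:
	ObjectNameLock& m_Lock;
	std::string m_Type;
	std::string m_Name;
};
```

With such a guard, a handler would declare `ObjectNameGuard guard(lock, type, name);` at the top of its scope instead of pairing a manual lock call with a `Defer`. Whether that is worth a dedicated class for two call sites is the judgment call debated above.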
81d9909
to
456144c
Compare
Code is fine for me now, please update the PR description though to reflect the changes to the implementation.
config::update/delete cluster events gracefully
Warning
This PR was reverted in #10018 because it caused other problems with the config sync as described in #10012. A new fix will be developed in #10013.
Since the internal config::Update cluster events also use ConfigObjectUtility::CreateObject() to create received runtime objects, we shouldn't persist the config file to disk first and then load and validate it with ConfigCompiler::CompileFile(). Otherwise, we are forced to remove the newly created file whenever we can't validate, commit or activate it. This approach also has the downside that two cluster events for the same object arriving at the same moment from two different endpoints result in two different threads simultaneously creating and loading the same configuration file. Only one of them passes validation, while the other one hits an object re-definition error and tries to remove the configuration file it mistakenly thinks it has created. As a consequence, an object successfully created by the former thread is implicitly deleted by the latter, causing objects to mysteriously disappear.
This PR fixes this by keeping the config in a temporary file for the entire validation process and moving it to the actual config path only once the object has been created successfully; otherwise the temp file is discarded. This guarantees that two different threads simultaneously creating and loading the same configuration file do not interfere with each other. When one thread fails object validation, it only deletes its own temporary file and does not affect the other thread in any way. This PR additionally prevents two cluster events relating to the same object from being processed simultaneously.
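The stage, validate, then commit flow described above can be sketched roughly as follows. This is an illustration using only the C++17 standard library, not the PR's implementation, which works with Icinga 2's own AtomicFile and ConfigCompiler. `StageValidateCommit`, the `validate` callback and the `tmpSuffix` parameter are hypothetical names introduced for the example.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <functional>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper, not Icinga 2's API: stage `content` in a temp file next
// to `target`, run `validate` on it, and rename it into place only on success.
bool StageValidateCommit(const fs::path& target, const std::string& content,
	const std::function<bool(const fs::path&)>& validate, const std::string& tmpSuffix)
{
	assert(!target.empty());

	fs::path tmp = target;
	tmp += tmpSuffix; // assumption: each caller passes a unique suffix, so temp files are never shared

	std::error_code ec;

	{
		std::ofstream out(tmp, std::ios::trunc);
		out << content;
		out.flush(); // surface permission or disk-space problems before validating
		if (!out) { // covers open, write and flush failures
			out.close();
			fs::remove(tmp, ec);
			return false;
		}
	}

	if (!validate(tmp)) {
		fs::remove(tmp, ec); // only this caller's temp file is discarded
		return false;
	}

	fs::rename(tmp, target, ec); // on POSIX this replaces an existing regular file
	if (ec) {
		fs::remove(tmp, ec);
		return false;
	}

	return true;
}
```

The point of the sketch is that a failing caller only ever touches its own temp file, so it cannot delete a config file that another thread has already committed for the same object.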
Tests
Master 1
Master 2
Satellite
Results
Before
Master 1
Master 2
[root@config-sync centos]# ls /var/lib/icinga2/api/packages/_api/97d86dee-d245-468f-b7dc-bd6044376e2a/conf.d/hosts | wc -l
1000
Satellite
To make things even more mysterious, when you query one of the missing hosts via the API, you get the desired result, but it refers to a non-existent path. Nevertheless, the host disappears forever and cannot even be found via the API when the icinga2 daemon is restarted.
After
Satellite:
fixes #9721