Child host becomes unreachable in redundant group when parent is unreachable #10014

Open
jjuanino opened this issue Mar 3, 2024 · 6 comments · May be fixed by #10228
Labels: area/runtime (Downtimes, comments, dependencies, events), bug (Something isn't working)

@jjuanino

jjuanino commented Mar 3, 2024

Describe the bug

Dear community,
unless I am misunderstanding something, redundancy groups do not work as expected.
Consider the following setup:

   poc_grand_parent 
             |
             |
             | (classic dependency)
             |
             | 
       poc_parent_0                                     poc_parent_1
        \                                                /
          \        (redundant dependency)              /   
            \                                        / 
              \                                    /
                 \                               /
                    \                          /
                        \                 /
                             poc_child
  • poc_parent_0 has a classic dependency on poc_grand_parent.
  • poc_child has a redundant dependency on poc_parent_0 and poc_parent_1.

The issue is as follows: when poc_grand_parent goes down, poc_parent_0 becomes unreachable (as expected), but poc_child becomes unreachable as well, which is unexpected.

To Reproduce

Consider the following setup:

object Host "poc_grand_parent" { check_command = "dummy"; vars.dummy_state = 2; }
object Host "poc_parent_0" { check_command = "dummy"; vars.dummy_state = 0; }
object Host "poc_parent_1" { check_command = "dummy"; vars.dummy_state = 0; }
object Host "poc_child" { check_command = "dummy"; vars.dummy_state = 0;}

object Dependency "parent_0_to_grandparent" {
    child_host_name = "poc_parent_0"
    parent_host_name = "poc_grand_parent"
}

for (i in range(2)) {
    object Dependency "dep-" + i use (i) {
        child_host_name = "poc_child"
        parent_host_name = "poc_parent_" + i
        redundancy_group = "broken_red_deps"
    }
}

Expected behavior

The expected behavior is that the poc_child host remains reachable regardless of the state of poc_grand_parent, because poc_parent_1 is still up and reachable.

Screenshots

image

Your Environment


  • Version used (icinga2 --version):
$ /usr/local/icinga2/sbin/icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.14.2-1)

Copyright (c) 2012-2024 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <https://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: Red Hat Enterprise Linux
  Platform version: 8.9 (Ootpa)
  Kernel: Linux
  Kernel version: 4.18.0-513.11.1.el8_9.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 8.5.0
  Build host: ol8-template.localdomain
  OpenSSL version: OpenSSL 1.1.1k  FIPS 25 Mar 2021

Application information:

General paths:
  Config directory: /usr/local/icinga2/etc/icinga2
  Data directory: /usr/local/icinga2/var/lib/icinga2
  Log directory: /usr/local/icinga2/var/log/icinga2
  Cache directory: /usr/local/icinga2/var/cache/icinga2
  Spool directory: /usr/local/icinga2/var/spool/icinga2
  Run directory: /usr/local/icinga2/var/run/icinga2

Old paths (deprecated):
  Installation root: /usr/local/icinga2
  Sysconf directory: /usr/local/icinga2/etc
  Run directory (base): /usr/local/icinga2/var/run
  Local state directory: /usr/local/icinga2/var

Internal paths:
  Package data directory: /usr/local/icinga2/share/icinga2
  State path: /usr/local/icinga2/var/lib/icinga2/icinga2.state
  Modified attributes path: /usr/local/icinga2/var/lib/icinga2/modified-attributes.conf
  Objects path: /usr/local/icinga2/var/cache/icinga2/icinga2.debug
  Vars path: /usr/local/icinga2/var/cache/icinga2/icinga2.vars
  PID path: /usr/local/icinga2/var/run/icinga2/icinga2.pid
  • Operating System and version:
$ cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.9 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.9 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.9
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.9"
  • Enabled features (icinga2 feature list):
# icinga2 feature list
Disabled features: command compatlog debuglog elasticsearch gelf graphite ido-mysql ido-pgsql influxdb2 livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker icingadb influxdb mainlog notification
  • Icinga Web 2 version and modules (System - About):
    image
@Al2Klimov
Member

Hello Jose!

Does only Icinga Web misreport the reachability, or does the Icinga 2 API report it incorrectly too?

Best,
A/K

@jjuanino
Author

Hi Alexander,

in the icinga2 console I get the following (output snipped):

<1> => get_host("poc_child")
{
	__name = "poc_child"
	check_attempt = 1.000000
	check_command = "dummy"
	check_interval = 300.000000
	last_check_result = {
		active = true
		command = "dummy"
		exit_status = 0.000000
		output = "Check was successful."
		previous_hard_state = 99.000000
		vars_after = {
			attempt = 1.000000
			reachable = false    ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
			state = 0.000000
			state_type = 1.000000
		}
		vars_before = {
			attempt = 1.000000
			reachable = true
			state = 0.000000
			state_type = 1.000000
		}
	}
	last_reachable = false   ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
}

Best regards

@Al2Klimov
Member

The issue is as follows: when poc_grand_parent goes down, poc_parent_0 becomes unreachable (as expected), but poc_child becomes unreachable as well, which is unexpected.

poc_child does indeed seem to misbehave, but only once it has been checked again after poc_grand_parent went down.

@jjuanino
Author

Yes, that is right: you have to check the services several times to reproduce the issue. The test presented is a bit contrived in order to show the behavior, but in the real world the issue shows up more naturally. Regards.

Al2Klimov added the bug (Something isn't working) label Jun 18, 2024
@nilmerg
Member

nilmerg commented Aug 30, 2024

I'm also able to reproduce this with just a single check now on the child.

@nilmerg
Member

nilmerg commented Oct 16, 2024

I just read the code for an unrelated reason and noticed the cause of this issue. Checkable::IsReachable checks whether any parent is unreachable before considering any redundancy groups. As a result, redundancy groups only take effect if all parents are reachable.
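
To make the evaluation order from the comment above concrete, here is a minimal Python model of the reproduction setup. It is only a sketch and not the Icinga 2 C++ code: the names Dep, UP, DEPS, deps_of, groups_ok, reachable_parents_first and reachable_group_aware are made up for illustration, and host state is reduced to a single up/down flag. The first function follows the order described in the thread (parents first, groups second). The second is one assumed reading of the expected behavior, in which a parent's reachability is treated as part of its own dependency.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dep:
    child: str
    parent: str
    group: Optional[str] = None  # None models a dependency without redundancy_group


# Host states from the reproduction config: True means the dummy check returns UP.
UP = {
    "poc_grand_parent": False,
    "poc_parent_0": True,
    "poc_parent_1": True,
    "poc_child": True,
}

DEPS = [
    Dep("poc_parent_0", "poc_grand_parent"),
    Dep("poc_child", "poc_parent_0", "broken_red_deps"),
    Dep("poc_child", "poc_parent_1", "broken_red_deps"),
]


def deps_of(host):
    """Return all dependencies in which `host` is the child."""
    return [d for d in DEPS if d.child == host]


def groups_ok(host, available):
    """Ungrouped dependencies must all hold; each group needs one member that holds."""
    group_state = {}
    for d in deps_of(host):
        ok = available(d)
        if d.group is None:
            if not ok:
                return False
        else:
            group_state[d.group] = group_state.get(d.group, False) or ok
    return all(group_state.values())


def reachable_parents_first(host):
    """Order described in the thread: any unreachable parent fails the child outright."""
    if any(not reachable_parents_first(d.parent) for d in deps_of(host)):
        return False
    return groups_ok(host, lambda d: UP[d.parent])


def reachable_group_aware(host):
    """Assumed expected order: a parent's reachability is part of its own dependency."""
    return groups_ok(host, lambda d: UP[d.parent] and reachable_group_aware(d.parent))


print(reachable_parents_first("poc_child"))  # False
print(reachable_group_aware("poc_child"))    # True
```

In this model poc_parent_0 is unreachable under both variants, since its only dependency on poc_grand_parent fails. The variants differ only for poc_child: checking the parents first returns early before the redundancy group is ever looked at, while the group-aware variant still finds poc_parent_1 satisfying the group.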

yhabteab self-assigned this Nov 12, 2024
yhabteab added the area/runtime (Downtimes, comments, dependencies, events) label Nov 12, 2024