
Mention plugin exit codes outside [0..3] in the plugin output and warning log #10021

Merged: 2 commits were merged on Apr 23, 2024.


Al2Klimov
Member

This simplifies debugging: customers who see a misbehaving plugin no longer have to turn on debug logs.

ref/IP/52294

Test

                "last_check_result": {
...
                    "exit_status": 128,
                    "output": "execvpe(/Users/aklimov/NET/WS/prefix/usr/lib/nagios/plugins/check_disk) failed: No such file or directory\n<Terminated with exit code 128.>",
...
                },

👍

@Al2Klimov added the labels enhancement (New feature or request), area/checks (Check execution and results) and ref/IP on Mar 11, 2024
@cla-bot added the cla/signed label on Mar 11, 2024
@stevie-sy
Contributor

@Al2Klimov looks fine.

If there is no output from the check plugin, any information helps, because then you don't have to enable the debug log on an agent. That matters especially if you don't have any (direct) access to the server and always need different colleagues for it. So a big thumbs up and thank you!

Should we test it? If so, where can we get the Windows setup files for this patch? But I'm sure that if you patched it, it will work fine ;-)

@Al2Klimov
Member Author

On top of which Icinga version shall I cherry-pick this patch?

@stevie-sy
Contributor

We use 2.14.2

@Al2Klimov
Member Author

You can download the artifacts here: https://git.icinga.com/packaging/windows-icinga2/-/jobs/490746

@stevie-sy
Contributor

We installed it on an affected Windows server. The output looks good for us:
[screenshot: new plugin output]
instead of this:
[screenshot: old plugin output]
Thank you!

With this information we can search the web for hints on why Windows, PowerShell etc. misbehave. And we don't have to turn on the debug log and wait for it to happen again!

@julianbrost
Contributor

[screenshot]

There seems to be an overflow though. The logs you shared in the ticket show a positive value instead:

notice/Process: [...] terminated with exit code 3221225477.

And indeed, 3221225477 is exactly $-1073741819 + 2^{32}$, i.e. a uint32 was interpreted as an int32.

I think it would also make sense to (maybe additionally) show the hexadecimal value on Windows, as that's how these codes are typically presented: for example, 3221225477 is listed as 0xC0000005 STATUS_ACCESS_VIOLATION in the Windows NTSTATUS reference.

Side note: you don't have to resort to the logs; the exit code is also part of the check result. I still think it makes sense to have it in the output, just as is already done for signals on Linux. The exit_status attribute seems to be affected by the same overflow, though.

@stevie-sy
Contributor

@julianbrost you're right, any improvement to the output is appreciated. But it's already better than the empty output we had before. ;-)

In the end, every "normal" user (not an admin like us) should be able to see what's going on directly in the plugin output in the UI.

@Al2Klimov
Member Author

@julianbrost execvpe(/Users/aklimov/NET/WS/prefix/usr/lib/nagios/plugins/check_ping) failed: No such file or directory<Terminated with exit code 128 (0x80).> 👍

@stevie-sy https://git.icinga.com/packaging/windows-icinga2/-/jobs/495257

@stevie-sy
Contributor

My colleague reinstalled the agent with the new package on an affected server.

The output looks better:
[screenshot]

And the benefit now:
Copying and pasting the hex exit code into a search engine (e.g. Google) returns the right answers for the root cause among the top 3 results. And that without accessing the server (maybe without even having permission to), without turning on a debug log (and waiting for the next occurrence), and without analysing the Windows event log for a first look. That saves a lot of time!

Thank you. Great work!

@julianbrost added this to the 2.15.0 milestone on Apr 5, 2024
@julianbrost added the consider backporting label (Should be considered for inclusion in a bugfix release) on Apr 5, 2024
@@ -25,7 +26,7 @@ struct ProcessResult
 	pid_t PID;
 	double ExecutionStart;
 	double ExecutionEnd;
-	long ExitStatus;
+	int_fast64_t ExitStatus;
Member


I honestly don't understand why you are changing this. Why not keep using long? What benefit does this fancy type have over the original type in this specific context and its intended usage? I would simply change the type of the CheckResult#exit_status attribute from int to long instead.

Member Author


I'm addressing a 32-bit overflow here, so I need 64+ bits.

Member


> I'm addressing a 32-bit overflow here, so I need 64+ bits.

First of all, it is not ProcessResult#ExitStatus that's overflowing int32, but CheckResult#exit_status, which has been of type int so far. Secondly, int_fast64_t is just an alias for another 64-bit type (e.g. on my Mac it is an alias of int64_t). The compiler might do some optimisations based on this type when doing sophisticated arithmetic, but it is not intended to prevent overflows beyond 64 bits.

I just don't think it's needed here, and IMHO long should be way more than enough for signals/exit codes, given that there are no systems beyond 64 bits anyway.

Member Author


  1. CheckResult#exit_status gets set from ProcessResult#ExitStatus, so both need a large type
  2. Just grep for int_fast in the code: if we need X bits, we declare that explicitly
  3. long has 32 bits, even on x64 Windows https://learn.microsoft.com/en-us/cpp/build/common-visual-cpp-64-bit-migration-issues?view=msvc-170&redirectedfrom=MSDN

Member


> 3. long has 32 bits, even on x64 Windows https://learn.microsoft.com/en-us/cpp/build/common-visual-cpp-64-bit-migration-issues?view=msvc-170&redirectedfrom=MSDN

> An int and a long are 32-bit values on 64-bit Windows operating systems.

🤦‍♂️! I wasn't aware of this sh*tty operating system.

std::stringstream crOutput;

crOutput << "<Terminated with exit code " << pr.ExitStatus
	<< " (0x" << std::noshowbase << std::hex << std::uppercase << pr.ExitStatus << ").>";
Member


> Notice also that no base prefix is implicitly prepended to the number unless the showbase format flag is set.

std::hex doesn't prepend the base unless you explicitly request it to.

Member Author


This is nice, but obvious code is even nicer.

Member

@yhabteab left a comment


Tested only on my MacBook! LGTM!

Before:

...
"execution_end": 1712328047.745857,
"execution_start": 1712328047.74055,
"exit_status": -2147483648,
"output": "Overflowing exit code (2^31)\\n",

After:

...
"execution_end": 1712328344.578195,
"execution_start": 1712328344.571937,
"exit_status": 2147483648,
"output": "Overflowing exit code (2^31)\\n<Terminated with exit code 2147483648 (0x80000000).>",

so that they can hold Windows exit codes like 3221225477 (>2147483647).
…ide 0..3

in the plugin output as well, in addition to the warning log.
@Al2Klimov merged commit 62512bb into master on Apr 23, 2024 (26 checks passed)
@Al2Klimov deleted the output-exit-code-52294 branch on April 23, 2024, 17:46
Labels: area/checks (Check execution and results), cla/signed, consider backporting (Should be considered for inclusion in a bugfix release), enhancement (New feature or request), ref/IP
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Plugin-Output "Output unavailable" on Windows, if antiVirus-software kills process (e.g. PowerShell)
4 participants