Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HANA_CALL - handle all timeout return codes, not only 124 #248

Open
PeterPitterling opened this issue Mar 29, 2024 · 18 comments
Open

HANA_CALL - handle all timeout return codes, not only 124 #248

PeterPitterling opened this issue Mar 29, 2024 · 18 comments

Comments

@PeterPitterling
Copy link
Contributor

https://man7.org/linux/man-pages/man1/timeout.1.html

Exit status:
124 if COMMAND times out, and --preserve-status is not specified
125 if the timeout command itself fails
126 if COMMAND is found but cannot be invoked
127 if COMMAND cannot be found
137 if COMMAND (or timeout itself) is sent the KILL (9) signal (128+9)
- the exit status of COMMAND otherwise

SAPHanaSR/ra/SAPHana

Lines 692 to 697 in 8db3d75

#
# on timeout ...
#
if [ $rc -eq 124 ]; then
super_ocf_log warn "RA: HANA_CALL timed out after $timeOut seconds running command '$cmd'"
fi

@fmherschel
Copy link
Member

Why?

@fmherschel
Copy link
Member

What was the CONCRETE error condition?

@fdanapfel
Copy link
Contributor

@fmherschel in the SAPHanaTopology RA there are already checks for other return values, for example

if [ "$hanalrc" -ge 124 ]; then

if [ "$hanalrc" -ge 124 ]; then

if [ "$hanalrc" -ge 124 ]; then

if [ "$hanalrc" -ge 124 ]; then

if [ "$hanalrc" -ge 124 ]; then

So the question for me would be why the other error codes for the timeout command are takein ito account in the SAPHanaTopology RA but not in the SAPHana RA?

@fmherschel
Copy link
Member

@fdanapfel RC 124 happens regularly. I have never seen 125, 126, 127. RC 124 is by plan.
So again did we have had a concrete situation where we did catch other RCs?

@ja9fuchs
Copy link

@fmherschel Yes we do. :)
Although that particular problem only affects scale-out due to a minor difference in the Topology code. But the new merged code seems to have that gap again.

We have a case of RC 134 in a scale-out environment which leads to a failover, due to the role being set to some rubbish from that error.
Although the RC means an ABRT of the python command, it's similar to a timeout and should not cause an immediate HANA resource failure.

In the scale-up SAPHanaTopology RA this happens to be automatically covered by the -ge 124 conditionals, while the same cases in the scale-out SAPHanaTopology RA are strictly -eq 124 in the monitor function.
It also looks like it is again restricted to exact matches of only 124 in the new combined NG code.

Do you have specific concerns about generally allowing this bit of tolerance of any RC >= 124?
Keeping this in some places would reduce the impact of at least a bad role set, caused by issues that are not related to the HANA landscape.
Very similar to a command timeout, this could just be a temporary problem on the system.

@fmherschel
Copy link
Member

@fdanapfel , @ja9fuchs Ah, so you just like to change from "-eq 124" to "-ge 124", right? Then I did misunderstand the initial issue description. I had the impression the initial request was to differ the rc codes. This would be every test intensive. To change from "-eq 124" to "-ge 124" should be fine. However I would like to test that also with my semi-automated tester. Sorry for the misunderstanding. Good catch.

@PeterPitterling
Copy link
Contributor Author

I'm fine with that change, but the RC should be part of the warn message

@ja9fuchs
Copy link

ja9fuchs commented Jun 4, 2024

@fdanapfel , @ja9fuchs Ah, so you just like to change from "-eq 124" to "-ge 124", right? Then I did misunderstand the initial issue description. I had the impression the initial request was to differ the rc codes. This would be every test intensive. To change from "-eq 124" to "-ge 124" should be fine. However I would like to test that also with my semi-automated tester. Sorry for the misunderstanding. Good catch.

@fmherschel Correct, this is my suggestion. Although it is not exactly the same as @PeterPitterling originally requested. But I think it would be a good solution to cover these RCs in a consistent way in the different RAs. Maybe with a slight change of wording in the log output, to make it more generic and not refer to the result as a timeout only. For example call it "command error" instead of "timeout". By including the RC in the output this should be enough info for individual troubleshooting.

@fmherschel
Copy link
Member

@ja9fuchs Yes I also want to change that. Also the angi code will have some changes on that part. It will take some time as this also needs some testing. I was afraid to differ e.g. rc==125 in the RAs reaction. Because that is really not easy. But we could of course treat a (e.g.) rc==125 in the same way as a rc==124 as a "failed answer". And following the comment of @PeterPitterling we will add the RC to the log messages. Please give me some time to do that.

@ja9fuchs
Copy link

ja9fuchs commented Jun 4, 2024

@fmherschel If I understand it correctly, you will work on adjustments of the angi code. In the meanwhile, if that is ok, I would create a small PR in the separate scale-out repository to make the minimum of adjustment there to cover the agreed change. This way the work is distributed and we can clarify details in the PR to get it right.

@fmherschel
Copy link
Member

@ja9fuchs That would be great!

@fdanapfel
Copy link
Contributor

@fmherschel My main question actually was why the resource agents use "-eq 124" in some cases and "-ge 124" in others.

I agree with @ja9fuchs that it would be good to make it more consistent that only one type of check is used, preferably "-ge 124" to also cover other values that might be returned. But there is no need in my opinion to differentiate between the RCs.

And I also like the suggestion from @PeterPitterling to include the RC in the messages that get logged.

@fmherschel
Copy link
Member

@fdanapfel Why does code has different handlings? It might be that we have learned during the "aging" of SAPHanaSR and did miss the other code sections. I currently do not expect that there is a good reason why we should differ between -eq and -ge. But I will also ask @angelabriel, if I miss something which explains the difference.
If I look at the timeout-callling/using code I see that we have had the need to change the call from default (which uses SIGTERM) to SIGKILL (9). This unfortunately changes the timeout return code from 124 to 137 (following the man pages and my local tests). So in these case the -eq 124 seams to be wrong.
Sorry for this error.

@ja9fuchs
Copy link

ja9fuchs commented Jun 6, 2024

@fmherschel Thank you for accepting and merging the PR in scale-out! 👍

@PeterPitterling Would you please check if that covers your request well enough, for classic scale-out? If yes, it might help aligning the angi code in a consistent way.

@fmherschel
Copy link
Member

@ja9fuchs Do you also check the scale-up-classic code in the branch maintenence-classic or should I do this?
Angi code will be adapted be me. Then we have all 3 projects (including smells) updated.

@ja9fuchs
Copy link

ja9fuchs commented Jun 6, 2024

@ja9fuchs Do you also check the scale-up-classic code in the branch maintenence-classic or should I do this? Angi code will be adapted be me. Then we have all 3 projects (including smells) updated.

@fmherschel I actually compared to the classic scale-up code side-by-side while doing the scale-out changes. The scale-up sht_monitor_clone function uses -ge 124 only already.

Note:
The HANA_CALL function still contains conditionals with -eq 124, but this seems really about timeout handling and it only adds log messages. That is similar in both Topology agents and I did not touch it. This function does not take a decision, but provides the RC to the parent function that called it, like the monitor.This is where actual decisions are made based on the conditionals, which only needed the adjustment in the classic scale-out Topology code.

@ja9fuchs
Copy link

ja9fuchs commented Jun 6, 2024

I checked all conditionals around 124 in each of the classic RAs and those that are still -eq 124 are not taking decisions or influencing the health. They might be enhanced for cosmetical/logging reasons, but I considered those not part of issues that needed fixing.

@PeterPitterling
Copy link
Contributor Author

Note: The HANA_CALL function still contains conditionals with -eq 124, but this seems really about timeout handling and it only adds log messages. That is similar in both Topology agents and I did not touch it. This function does not take a decision, but provides the RC to the parent function that called it, like the monitor.This is where actual decisions are made based on the conditionals, which only needed the adjustment in the classic scale-out Topology code.

That was actually my initial request. Also this section should handle ge 124 and provide the RC in the error log

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants