HANA_CALL - handle all timeout return codes, not only 124 #248

PeterPitterling · 2024-03-29T17:56:19Z

https://man7.org/linux/man-pages/man1/timeout.1.html

Exit status:
124 if COMMAND times out, and --preserve-status is not specified
125 if the timeout command itself fails
126 if COMMAND is found but cannot be invoked
127 if COMMAND cannot be found
137 if COMMAND (or timeout itself) is sent the KILL (9) signal (128+9)
- the exit status of COMMAND otherwise

SAPHanaSR/ra/SAPHana

Lines 692 to 697 in 8db3d75

    
           # 
        
           # on timeout ... 
        
           # 
        
           if [ $rc -eq 124 ]; then 
        
               super_ocf_log warn "RA: HANA_CALL timed out after $timeOut seconds running command '$cmd'" 
        
           fi

fmherschel · 2024-04-02T09:49:54Z

Why?

fmherschel · 2024-04-02T09:51:00Z

What was the CONCRETE error condition?

fdanapfel · 2024-05-28T08:57:37Z

@fmherschel in the SAPHanaTopology RA there are already checks for other return values, for example

SAPHanaSR/ra/SAPHanaTopology

Line 981 in 8db3d75

if [ "$hanalrc" -ge 124 ]; then

SAPHanaSR/ra/SAPHanaTopology

Line 1091 in 8db3d75

if [ "$hanalrc" -ge 124 ]; then

SAPHanaSR/ra/SAPHanaTopology

Line 1095 in 8db3d75

if [ "$hanalrc" -ge 124 ]; then

SAPHanaSR/ra/SAPHanaTopology

Line 1114 in 8db3d75

if [ "$hanalrc" -ge 124 ]; then

SAPHanaSR/ra/SAPHanaTopology

Line 1118 in 8db3d75

if [ "$hanalrc" -ge 124 ]; then

So the question for me would be why the other error codes for the timeout command are takein ito account in the SAPHanaTopology RA but not in the SAPHana RA?

fmherschel · 2024-05-28T10:25:57Z

@fdanapfel RC 124 happens regularly. I have never seen 125, 126, 127. RC 124 is by plan.
So again did we have had a concrete situation where we did catch other RCs?

ja9fuchs · 2024-05-29T18:06:38Z

@fmherschel Yes we do. :)
Although that particular problem only affects scale-out due to a minor difference in the Topology code. But the new merged code seems to have that gap again.

We have a case of RC 134 in a scale-out environment which leads to a failover, due to the role being set to some rubbish from that error.
Although the RC means an ABRT of the python command, it's similar to a timeout and should not cause an immediate HANA resource failure.

In the scale-up SAPHanaTopology RA this happens to be automatically covered by the -ge 124 conditionals, while the same cases in the scale-out SAPHanaTopology RA are strictly -eq 124 in the monitor function.
It also looks like it is again restricted to exact matches of only 124 in the new combined NG code.

Do you have specific concerns about generally allowing this bit of tolerance of any RC >= 124?
Keeping this in some places would reduce the impact of at least a bad role set, caused by issues that are not related to the HANA landscape.
Very similar to a command timeout, this could just be a temporary problem on the system.

fmherschel · 2024-06-04T06:31:24Z

@fdanapfel , @ja9fuchs Ah, so you just like to change from "-eq 124" to "-ge 124", right? Then I did misunderstand the initial issue description. I had the impression the initial request was to differ the rc codes. This would be every test intensive. To change from "-eq 124" to "-ge 124" should be fine. However I would like to test that also with my semi-automated tester. Sorry for the misunderstanding. Good catch.

PeterPitterling · 2024-06-04T06:58:33Z

I'm fine with that change, but the RC should be part of the warn message

ja9fuchs · 2024-06-04T08:13:51Z

@fdanapfel , @ja9fuchs Ah, so you just like to change from "-eq 124" to "-ge 124", right? Then I did misunderstand the initial issue description. I had the impression the initial request was to differ the rc codes. This would be every test intensive. To change from "-eq 124" to "-ge 124" should be fine. However I would like to test that also with my semi-automated tester. Sorry for the misunderstanding. Good catch.

@fmherschel Correct, this is my suggestion. Although it is not exactly the same as @PeterPitterling originally requested. But I think it would be a good solution to cover these RCs in a consistent way in the different RAs. Maybe with a slight change of wording in the log output, to make it more generic and not refer to the result as a timeout only. For example call it "command error" instead of "timeout". By including the RC in the output this should be enough info for individual troubleshooting.

fmherschel · 2024-06-04T08:22:06Z

@ja9fuchs Yes I also want to change that. Also the angi code will have some changes on that part. It will take some time as this also needs some testing. I was afraid to differ e.g. rc==125 in the RAs reaction. Because that is really not easy. But we could of course treat a (e.g.) rc==125 in the same way as a rc==124 as a "failed answer". And following the comment of @PeterPitterling we will add the RC to the log messages. Please give me some time to do that.

ja9fuchs · 2024-06-04T08:29:00Z

@fmherschel If I understand it correctly, you will work on adjustments of the angi code. In the meanwhile, if that is ok, I would create a small PR in the separate scale-out repository to make the minimum of adjustment there to cover the agreed change. This way the work is distributed and we can clarify details in the PR to get it right.

fmherschel · 2024-06-04T08:33:21Z

@ja9fuchs That would be great!

fdanapfel · 2024-06-04T12:15:35Z

@fmherschel My main question actually was why the resource agents use "-eq 124" in some cases and "-ge 124" in others.

I agree with @ja9fuchs that it would be good to make it more consistent that only one type of check is used, preferably "-ge 124" to also cover other values that might be returned. But there is no need in my opinion to differentiate between the RCs.

And I also like the suggestion from @PeterPitterling to include the RC in the messages that get logged.

fmherschel · 2024-06-04T12:51:02Z

@fdanapfel Why does code has different handlings? It might be that we have learned during the "aging" of SAPHanaSR and did miss the other code sections. I currently do not expect that there is a good reason why we should differ between -eq and -ge. But I will also ask @angelabriel, if I miss something which explains the difference.
If I look at the timeout-callling/using code I see that we have had the need to change the call from default (which uses SIGTERM) to SIGKILL (9). This unfortunately changes the timeout return code from 124 to 137 (following the man pages and my local tests). So in these case the -eq 124 seams to be wrong.
Sorry for this error.

ja9fuchs · 2024-06-06T07:27:30Z

@fmherschel Thank you for accepting and merging the PR in scale-out! 👍

@PeterPitterling Would you please check if that covers your request well enough, for classic scale-out? If yes, it might help aligning the angi code in a consistent way.

fmherschel · 2024-06-06T07:37:56Z

@ja9fuchs Do you also check the scale-up-classic code in the branch maintenence-classic or should I do this?
Angi code will be adapted be me. Then we have all 3 projects (including smells) updated.

ja9fuchs · 2024-06-06T08:24:24Z

@ja9fuchs Do you also check the scale-up-classic code in the branch maintenence-classic or should I do this? Angi code will be adapted be me. Then we have all 3 projects (including smells) updated.

@fmherschel I actually compared to the classic scale-up code side-by-side while doing the scale-out changes. The scale-up sht_monitor_clone function uses -ge 124 only already.

Note:
The HANA_CALL function still contains conditionals with -eq 124, but this seems really about timeout handling and it only adds log messages. That is similar in both Topology agents and I did not touch it. This function does not take a decision, but provides the RC to the parent function that called it, like the monitor.This is where actual decisions are made based on the conditionals, which only needed the adjustment in the classic scale-out Topology code.

ja9fuchs · 2024-06-06T08:36:22Z

I checked all conditionals around 124 in each of the classic RAs and those that are still -eq 124 are not taking decisions or influencing the health. They might be enhanced for cosmetical/logging reasons, but I considered those not part of issues that needed fixing.

PeterPitterling · 2024-06-07T11:04:46Z

Note: The HANA_CALL function still contains conditionals with -eq 124, but this seems really about timeout handling and it only adds log messages. That is similar in both Topology agents and I did not touch it. This function does not take a decision, but provides the RC to the parent function that called it, like the monitor.This is where actual decisions are made based on the conditionals, which only needed the adjustment in the classic scale-out Topology code.

That was actually my initial request. Also this section should handle ge 124 and provide the RC in the error log

ja9fuchs mentioned this issue Jun 5, 2024

cover command err rc >= 124 like in scale-up SUSE/SAPHanaSR-ScaleOut#124

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HANA_CALL - handle all timeout return codes, not only 124 #248

HANA_CALL - handle all timeout return codes, not only 124 #248

PeterPitterling commented Mar 29, 2024

fmherschel commented Apr 2, 2024

fmherschel commented Apr 2, 2024

fdanapfel commented May 28, 2024

fmherschel commented May 28, 2024

ja9fuchs commented May 29, 2024

fmherschel commented Jun 4, 2024

PeterPitterling commented Jun 4, 2024

ja9fuchs commented Jun 4, 2024

fmherschel commented Jun 4, 2024

ja9fuchs commented Jun 4, 2024 •

edited

Loading

fmherschel commented Jun 4, 2024

fdanapfel commented Jun 4, 2024

fmherschel commented Jun 4, 2024

ja9fuchs commented Jun 6, 2024

fmherschel commented Jun 6, 2024

ja9fuchs commented Jun 6, 2024

ja9fuchs commented Jun 6, 2024

PeterPitterling commented Jun 7, 2024

HANA_CALL - handle all timeout return codes, not only 124 #248

HANA_CALL - handle all timeout return codes, not only 124 #248

Comments

PeterPitterling commented Mar 29, 2024

fmherschel commented Apr 2, 2024

fmherschel commented Apr 2, 2024

fdanapfel commented May 28, 2024

fmherschel commented May 28, 2024

ja9fuchs commented May 29, 2024

fmherschel commented Jun 4, 2024

PeterPitterling commented Jun 4, 2024

ja9fuchs commented Jun 4, 2024

fmherschel commented Jun 4, 2024

ja9fuchs commented Jun 4, 2024 • edited Loading

fmherschel commented Jun 4, 2024

fdanapfel commented Jun 4, 2024

fmherschel commented Jun 4, 2024

ja9fuchs commented Jun 6, 2024

fmherschel commented Jun 6, 2024

ja9fuchs commented Jun 6, 2024

ja9fuchs commented Jun 6, 2024

PeterPitterling commented Jun 7, 2024

ja9fuchs commented Jun 4, 2024 •

edited

Loading