Periodically run execution node health checks with a system job #13339

AlanCoding · 2022-12-14T21:23:37Z

SUMMARY

We have this gap in the heartbeat logic, which @kdelee has complained about.

If we have no reason to believe that an execution node is broken, we will never discover automatically that it is. A node in an error state is re-checked periodically, but a node that meets both of these conditions is never checked:

its last known health check status was successful
its receptor daemon is connected to the mesh, and the receptor health check sees it in the output of the receptor status command

The existing logic, somewhat intentionally, does not catch the case where the remote node's receptor is working but its ansible-runner install is broken, for example because someone re-installed it incorrectly. We will never mark that node as offline unless someone manually runs a health check against it.
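
To make the gap concrete, here is a simplified model of the skip decision described above. It is only a sketch: `NodeState`, its fields, and `is_rechecked_automatically` are hypothetical names for illustration, not the actual AWX code.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical snapshot of what the controller knows about a node."""
    hostname: str
    last_health_check_ok: bool      # last known health check succeeded
    seen_in_receptor_status: bool   # node is visible on the receptor mesh


def is_rechecked_automatically(node: NodeState) -> bool:
    """Return True if the heartbeat logic, as modeled here, re-checks the node.

    Assumption: a node is skipped only when both conditions hold.
    """
    skipped = node.last_health_check_ok and node.seen_in_receptor_status
    return not skipped
```

Under this model, a node whose ansible-runner install broke after its last successful check still has both flags set, so nothing ever triggers a new health check for it.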

The problem is that it's poor form for the system to have full control over this kind of check. Users will have their own reasons to expect that an execution node might get disrupted, and they should control how often this purely periodic fallback check happens, or whether it happens at all. The best tool for that is a system job.
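
As a rough illustration of what such a fallback job would do on each scheduled run, the sketch below checks every execution node regardless of its last known status. The `Node` class, the `node_type` labels, and the `check` callback are assumptions made for this example; the real job would go through the existing health check machinery.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Node:
    """Hypothetical stand-in for an instance record."""
    hostname: str
    node_type: str  # assumed labels, e.g. "execution", "control", "hop"


def run_fallback_health_checks(
    nodes: Iterable[Node],
    check: Callable[[Node], bool],
) -> Dict[str, bool]:
    """Run `check` on every execution node and return results by hostname.

    Nodes of any other type are skipped, and no node is exempted for
    having passed its previous health check.
    """
    results: Dict[str, bool] = {}
    for node in nodes:
        if node.node_type != "execution":
            continue
        results[node.hostname] = check(node)
    return results
```

How often this runs, and whether it runs at all, would then be decided by the schedule the user attaches to the system job rather than by the function itself.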

ISSUE TYPE

New or Enhanced Feature

COMPONENT NAME

API

AlanCoding · 2022-12-15T14:40:36Z

There would be a strong argument for making this run health checks just on execution nodes. So, if anyone wants to vote for that, I think it would tip the scales.

github-actions bot added the component:api label Dec 14, 2022

jbradberry self-requested a review January 4, 2023 20:08

AlanCoding force-pushed the health_check_mgnt branch from 46803c0 to 0ea76fe Compare February 1, 2023 21:39

AlanCoding changed the title ~~Periodically run cluster health checks with a system job~~ Periodically run execution node health checks with a system job Feb 2, 2023

AlanCoding added 3 commits July 27, 2023 10:23

Periodically run cluster health checks with a system job

58fe292

Limit scope of management command to execution nodes

78c68fb

bump migration

562bef2

AlanCoding force-pushed the health_check_mgnt branch from 40b9cbf to 562bef2 Compare July 27, 2023 14:23

AlanCoding added 2 commits July 27, 2023 10:25

Bump migration

e86ddbb

Run black on migration files

3147376

jbradberry removed their request for review June 7, 2024 15:04
