Periodically run execution node health checks with a system job #13339

Draft
wants to merge 5 commits into base: devel


AlanCoding
Member

SUMMARY

We have this gap in the heartbeat logic, which @kdelee has complained about.

Basically, if an execution node breaks but we have no reason to believe it is broken, we will never discover this automatically. If an execution node is in an error state, we periodically re-check it. However, if it meets both of these conditions, we don't check it:

  • its last known health check status was successful
  • its receptor daemon is connected to the mesh, and we see it from the receptor health check in the receptor status command
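To make the gap concrete, here is a minimal sketch of the decision described above. It is not the actual heartbeat code: `NodeState`, its fields, and `is_rechecked_automatically` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical snapshot of what the control plane knows about a node."""

    hostname: str
    last_health_check_ok: bool     # result of the most recent health check
    seen_in_receptor_status: bool  # node appears as connected in receptor status


def is_rechecked_automatically(node: NodeState) -> bool:
    """Return True if the logic described above would re-check this node."""
    looks_healthy = node.last_health_check_ok and node.seen_in_receptor_status
    # A node that looks healthy on both counts is never re-checked.
    return not looks_healthy


# A node that passed its last check and is still visible on the mesh.
looks_fine = NodeState("exec-1", last_health_check_ok=True, seen_in_receptor_status=True)
# A node already in an error state.
errored = NodeState("exec-2", last_health_check_ok=False, seen_in_receptor_status=True)
```

Under these assumptions, `errored` keeps getting re-checked, while `looks_fine` is skipped indefinitely, even if something on it has broken since its last successful check.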

The current logic, somewhat intentionally, will not catch the case where the remote node's receptor is working but its ansible-runner install is broken somehow. Just imagine that someone re-installs it incorrectly: we will never mark that node as offline unless someone manually runs a health check against it.

The problem is, it's poor form for the system to have full control over this kind of check. Users will have their own particular reasons why an execution node might get disrupted, and they should control how often this (purely periodic) fallback check happens, or whether it happens at all. The best tool for that is a system job.
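As a rough sketch of what such a fallback job would do on each scheduled run (not the code in this PR), it checks every execution node regardless of its last known status. The function name, the `check` callable, and the hostnames are all hypothetical stand-ins.

```python
from typing import Callable, Dict, Iterable


def run_fallback_health_checks(
    hostnames: Iterable[str],
    check: Callable[[str], bool],
) -> Dict[str, bool]:
    """Run `check` against every node, ignoring its last known status."""
    results: Dict[str, bool] = {}
    for hostname in hostnames:
        try:
            results[hostname] = check(hostname)
        except Exception:
            # A check that raises is treated as a failed check.
            results[hostname] = False
    return results


def fake_check(hostname: str) -> bool:
    """Stand-in for a real health check; pretends exec-2 is broken."""
    if hostname == "exec-3":
        raise RuntimeError("node unreachable")
    return hostname != "exec-2"


results = run_fallback_health_checks(["exec-1", "exec-2", "exec-3"], fake_check)
```

How often this runs, or whether it runs at all, would be left to the schedule the user attaches to the system job.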

ISSUE TYPE
  • New or Enhanced Feature
COMPONENT NAME
  • API

@AlanCoding
Member Author

There would be a strong argument for making this run health checks just on execution nodes. So, if anyone wants to vote for that, I think it would tip the scales.

@jbradberry jbradberry self-requested a review January 4, 2023 20:08
@AlanCoding AlanCoding changed the title Periodically run cluster health checks with a system job Periodically run execution node health checks with a system job Feb 2, 2023
@jbradberry jbradberry removed their request for review June 7, 2024 15:04