Configure coral2-dws's k8s connections to tolerate more lost connections #159

Open
jameshcorbett opened this issue May 10, 2024 · 1 comment

Comments

@jameshcorbett
Member

Problem: the coral2-dws service on elcap sometimes loses connection to the k8s server, logging

WARNING - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))': /apis/dataworkflowservices.github.io/v1alpha2/namespaces/default/workflows?resourceVersion=0&timeoutSeconds=1&watch=True

Sometimes this can cause workflows to become stuck.

The service should become more resilient to these dropped connections and keep retrying with a backoff.
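As a rough illustration of what "keep retrying with a backoff" could look like, here is a minimal sketch using only the Python standard library. It is not the coral2-dws code: `retry_with_backoff` and all of its parameters are hypothetical names. It also assumes the failure surfaces as a `ConnectionError` subclass (such as `ConnectionResetError`). The k8s client may wrap it in its own exception type, in which case that type would need to be passed via `retriable`.

```python
import random
import time


def retry_with_backoff(func, retriable=(ConnectionError,), base_delay=1.0,
                       max_delay=60.0, max_attempts=None, sleep=time.sleep):
    """Call func() until it succeeds, backing off between failed attempts.

    Hypothetical helper for illustration only. With max_attempts=None it
    retries forever; otherwise it re-raises after that many failed calls.
    """
    attempt = 0
    delay = base_delay
    while True:
        try:
            return func()
        except retriable:
            attempt += 1
            if max_attempts is not None and attempt >= max_attempts:
                raise
            # Full jitter: sleep a random amount in [0, delay] so that
            # several clients do not all reconnect at the same instant.
            sleep(random.uniform(0, delay))
            # Double the upper bound for the next failure, capped at max_delay.
            delay = min(max_delay, delay * 2)
```

A watch loop could then wrap each reconnect attempt, e.g. `retry_with_backoff(start_watch)`, where `start_watch` is a placeholder for whatever function opens the watch on the workflows resource.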

@roehrich-hpe
Collaborator

roehrich-hpe commented May 23, 2024

We have a fix in #165 for the dangling finalizer, which is what prevents the workflow from being deleted.

Also, note that the warning above says that it did 2 retries, after which I'm guessing it finally succeeded, because your notes don't say it was followed by an error indicating that it failed. We've seen this same warning when we restart the haproxy on the control plane node that has the VIP.
