Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: store the informer events even if the agent is not connected #271

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

chetan-rns
Copy link
Collaborator

@chetan-rns chetan-rns commented Jan 7, 2025

What does this PR do / why we need it:

See #225 (comment) for more details.

Currently, the principal removes the queue when the agent disconnects and drops all the incoming events from the informer. When the agent connects back it may miss the updates and be out of sync from the principal. For example:

  1. The principal and the agent are in sync
  2. Agent disconnects
  3. The user updates the application. But the principal discards them since there is no agent
  4. The agent connects back and is now out of sync with the principal.

There are two solutions to this problem:

  1. The principal shouldn't remove the queue and reuse it when the agent connects again. The principal will just send the events from the queue when the agent reconnects.
  2. The principal could list all the latest resources from the cluster and send them to the agent.

This PR explores solution 1. However, it doesn't handle orphaned resources. For example, if a resource is deleted when the principal is down it cannot transmit this event to the agent and the resource on the agent will be orphaned.

Changes introduced in this PR:

  1. Keep the queue open even if the agent is not connected and store the informer events
  2. Since the namespace name matches the agent ID, we remove the queue when the namespace associated with the agent is deleted.

Which issue(s) this PR fixes:

Fixes #225 (comment)

How to test changes / Special notes to the reviewer:

Checklist

  • Documentation update is required by this PR (and has been updated) OR no documentation update is required.

@jannfis
Copy link
Collaborator

jannfis commented Jan 7, 2025

I have not yet looked at the code, but can this lead to potential unlimited queue growth when there is no consumer for the particular queue?

@chetan-rns
Copy link
Collaborator Author

I have not yet looked at the code, but can this lead to potential unlimited queue growth when there is no consumer for the particular queue?

That's a valid concern. The logic in this PR assumes that the queue is required as long as an agent namespace (same as the agent ID) is present on the control plane. We remove the queue if the user has deleted the agent namespace thereby deleting apps on the principal. AFAIU the unlimited queue growth may only happen when the agent is permanently gone and the user being unaware is continuously updating the apps on the control plane. I'm assuming this happens rarely because if there is no agent the user might also delete the agent namespace or wouldn't update the apps or the agent is down for a short period.

}
}

func (s *Server) removeQueueIfUnused(ctx context.Context, agentName string) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the better solution would be to remove the queue when the agent (in case of Kubernetes backend: namespace) is deleted., instead of executing on an interval.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants