[admin-tool][server] Add two admin tool commands for dumping consumer ingestion states and heartbeat states; Add logging for stale heartbeat replicas #1260

Open
wants to merge 1 commit into
base: main

sixpluszero
Contributor

We already have an admin-tool command to dump the ingestion context of a specific topic partition on a specific server host. However, when multiple partitions are assigned to the host, it is not easy to detect which partition replica is actually lagging. We can iterate over all of them via the Helix UI, but that is time-consuming.

This PR adds three things to improve usability:

  1. Add a heartbeat scan thread that runs periodically and logs lagging resources (every minute by default, which should be infrequent enough not to spam the logs). These logs can be picked up by other log collection systems, so we can easily detect which replica is lagging, on which host, and by how much.
  2. Add a command to dump heartbeat status from a host. It has three optional filters: a topic filter, a partition filter, and a lag filter. You can choose to see only a specific topic or topic-partition, or only the resources that are lagging. This serves as a manual helper for cases where (1) might miss something.
  3. Add a command to dump the state of all consumer services. Filters are not added yet; if you think they are important, I can add them too. I think this will be very useful for seeing the overall distribution and consumption rate for each region / consumer service / consumer level.
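To make the scan in (1) and the filters in (2) concrete, here is a minimal Java sketch. It is not the code in this PR: `HeartbeatLagSketch`, `ReplicaHeartbeat`, `filter`, and `startScan` are hypothetical names, and it assumes that lag is measured as the time elapsed since a replica's last heartbeat timestamp.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class HeartbeatLagSketch {
  /** Hypothetical snapshot of one replica's last heartbeat (not a type from this PR). */
  static final class ReplicaHeartbeat {
    final String topic;
    final int partition;
    final long lastHeartbeatMs;

    ReplicaHeartbeat(String topic, int partition, long lastHeartbeatMs) {
      this.topic = topic;
      this.partition = partition;
      this.lastHeartbeatMs = lastHeartbeatMs;
    }

    /** Assumption: lag is the time elapsed since the last heartbeat, floored at zero. */
    long lagMs(long nowMs) {
      return Math.max(0L, nowMs - lastHeartbeatMs);
    }
  }

  /** Applies the three optional filters; a null topic or partition filter matches everything. */
  static List<ReplicaHeartbeat> filter(
      List<ReplicaHeartbeat> all,
      String topicFilter,
      Integer partitionFilter,
      boolean lagOnly,
      long nowMs,
      long lagThresholdMs) {
    List<ReplicaHeartbeat> out = new ArrayList<>();
    for (ReplicaHeartbeat r : all) {
      if (topicFilter != null && !topicFilter.equals(r.topic)) {
        continue;
      }
      if (partitionFilter != null && partitionFilter != r.partition) {
        continue;
      }
      if (lagOnly && r.lagMs(nowMs) <= lagThresholdMs) {
        continue;
      }
      out.add(r);
    }
    return out;
  }

  /** Periodically logs every replica whose lag exceeds the threshold. */
  static ScheduledExecutorService startScan(
      Supplier<List<ReplicaHeartbeat>> source, long periodSec, long lagThresholdMs) {
    ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
    exec.scheduleAtFixedRate(() -> {
      long now = System.currentTimeMillis();
      for (ReplicaHeartbeat r : filter(source.get(), null, null, true, now, lagThresholdMs)) {
        System.out.printf("Stale heartbeat: %s-%d lag=%dms%n", r.topic, r.partition, r.lagMs(now));
      }
    }, periodSec, periodSec, TimeUnit.SECONDS);
    return exec;
  }
}
```

In this sketch the scan thread reuses the same `filter` call as the dump command with only the lag filter enabled, so the log output and the manual dump would agree on what counts as lagging.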

My long-term hope is that these commands can be extended to run periodically when you choose to "attach" to a specific host, collecting data for a certain period and generating local visualizations that help us understand ingestion performance and fairness. We need this because we avoid emitting too many low-level metrics in order to prevent metric explosion. These endpoints are what we need to get started.

How was this PR tested?

Added an integration test that exercises the two admin tool commands.

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.
