Any chance to add new metric: nats_stream_total_messages_(per_subject) ? #305

Open
tltsaia opened this issue Sep 3, 2024 · 4 comments
Labels
proposal Enhancement idea or proposal

Comments


tltsaia commented Sep 3, 2024

What motivated this proposal?

This would be very helpful for monitoring the health and performance of ephemeral, pull-type consumers in the system.

What is the proposed change?

This project already has the metrics nats_stream_total_messages and nats_stream_total_bytes, but they are not enough for the ephemeral, pull-type consumer situation.

Who benefits from this change?

No response

What alternatives have you evaluated?

No response

@tltsaia tltsaia added the proposal Enhancement idea or proposal label Sep 3, 2024
Member

Jarema commented Sep 3, 2024

Hey.

Thanks for the report.
I'm a bit confused about which metric exactly you would like to see. Can you please elaborate?

Author

tltsaia commented Sep 4, 2024

For example, take a JetStream stream S (RetentionPolicy=workqueue) that handles the subjects domain.>:
APP-1 creates an ephemeral consumer to consume subject: domain.app1
APP-2 creates an ephemeral consumer to consume subject: domain.app2

If we only monitor the total count (e.g., total messages/bytes), it is impossible to tell which consumer (APP-x) is abnormal, whether because it is processing slowly or because it has crashed.
Since APP-x is not a durable consumer, when APP-x terminates abnormally we cannot detect the anomaly from the metric jetstream_consumer_num_pending.

If there were a metric such as nats_stream_total_messages_per_subject that reported the current number of messages retained in the stream per subject, it would reduce the time needed to locate anomalies and help prevent JetStream from being overwhelmed by unprocessable messages.
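
To make the request concrete, here is a minimal sketch of what such a metric could look like in the Prometheus text exposition format. The metric name is the one proposed in this issue; the label names `stream_name` and `subject` and both helper functions are assumptions for illustration, not anything the exporter provides today.

```python
def escape_label(value: str) -> str:
    """Escape a label value as the Prometheus text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_per_subject(stream: str, counts: dict[str, int]) -> str:
    """Render one gauge sample per subject, sorted for stable output."""
    # Hypothetical metric name, taken from this proposal.
    name = "nats_stream_total_messages_per_subject"
    lines = [
        f"# HELP {name} Messages currently retained in the stream, per subject.",
        f"# TYPE {name} gauge",
    ]
    for subject in sorted(counts):
        # Label names are assumptions, not the exporter's actual labels.
        labels = (
            f'stream_name="{escape_label(stream)}",'
            f'subject="{escape_label(subject)}"'
        )
        lines.append(f"{name}{{{labels}}} {counts[subject]}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up counts for the two subjects in the example above.
    print(render_per_subject("S", {"domain.app1": 12, "domain.app2": 4500}), end="")
```

With one sample per subject, a backlog building up on domain.app2 would be visible even though APP-2 has no durable consumer to report on.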

Member

Jarema commented Sep 4, 2024

Ok, I got it! Thanks for the explanation.

There are a few issues with this approach:

  1. It is an expensive API call to make (to get those exact numbers), and it could impact the performance of the cluster.
  2. It would be useful in your case only if you are using a workqueue or limits stream. Otherwise, the consumer would not affect the number of messages in the stream.

However, the NATS server does not know why the client stopped using the ephemeral consumer. Nor does it know what the apps intended if they abandoned a client, and that is hard to derive from server metrics.
It sounds like server metrics are not what should help you here?

Also keep in mind that having unprocessed messages does not mean that JetStream is overwhelmed. It only means that you are publishing faster than you are consuming. And again, that holds only for workqueue streams; it would not tell you anything for limits-based streams.

Of course, I'm aware that the above points hold for your use case; I'm only pointing out that they can't be generalized.

Maybe let's try to look at the problem differently: Why use ephemeral consumers instead of durable ones?

Author

tltsaia commented Sep 5, 2024

I agree that this doesn't sound like a server metric.

From my limited understanding, NATS can have the following use cases (modes):

  1. MQTT mode: use NATS core (only retains the latest msg)
  2. Streaming mode: use JetStream with RetentionPolicy=limits
  3. MQ mode: use JetStream with RetentionPolicy=workqueue

For a NATS cluster, the health metrics required may vary depending on the mode. Existing metrics are already categorized by perspective (consumer, server, stream); could we similarly establish the necessary metrics for each usage mode?

Clearly, my use case is in MQ mode. Since messages in workqueue will only be consumed once, there is no need to rely on durable features to remember which record was last consumed. Additionally, there might be 1000+ ephemeral consumers (and 100+ durable ones) in the entire system; choosing ephemeral reduces load.

In MQ mode, a high message count/percentage under a specific stream subject is usually considered abnormal regardless of whether it's due to slow or dead consumers.
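
As a rough sketch of that rule, the check below flags subjects whose backlog exceeds either an absolute count or a share of the stream total. The function name, the thresholds, and the sample counts are all assumptions for illustration; real thresholds would depend on the workload.

```python
def find_backlogged(
    counts: dict[str, int], max_msgs: int, max_share: float
) -> list[str]:
    """Return subjects whose backlog exceeds an absolute count
    or a share of all messages retained in the stream."""
    total = sum(counts.values())
    flagged = []
    for subject, n in sorted(counts.items()):
        # Guard against an empty stream to avoid dividing by zero.
        share = n / total if total else 0.0
        if n > max_msgs or share > max_share:
            flagged.append(subject)
    return flagged


if __name__ == "__main__":
    # Made-up per-subject counts for a workqueue stream.
    counts = {"domain.app1": 12, "domain.app2": 4500}
    print(find_backlogged(counts, max_msgs=1000, max_share=0.9))
```

A check like this cannot say whether the consumer is slow or dead; it only shows which subject the backlog sits on.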

The suggestion for "msg per subject" is based on two observations:

  1. The NATS API provides an attribute setting for JetStream: MaxMsgsPerSubject.
  2. In NATS CLI you can quickly and easily get it using the command: nats stream subjects <stream-name>.
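
For reference, my understanding is that these per-subject numbers come from the JetStream stream info API, which accepts a `subjects_filter` field in the request and returns a `state.subjects` map in the response. The sketch below only builds the request payload and parses a response, with no network access; the helper names are hypothetical and the sample response is made up and heavily trimmed. Large subject sets may need paging, which this sketch ignores.

```python
import json


def build_info_request(subjects_filter: str = ">") -> bytes:
    """Payload for a stream info request asking for per-subject counts."""
    return json.dumps({"subjects_filter": subjects_filter}).encode()


def subject_counts(response: bytes) -> dict[str, int]:
    """Extract the subject -> message count map from a stream info response."""
    info = json.loads(response)
    if "error" in info:
        description = info["error"].get("description", "stream info request failed")
        raise RuntimeError(description)
    # The map is absent when no subject matches or the stream is empty.
    return dict(info.get("state", {}).get("subjects") or {})


# Made-up, trimmed response for illustration only.
SAMPLE = b"""
{
  "state": {
    "messages": 4512,
    "subjects": {"domain.app1": 12, "domain.app2": 4500}
  }
}
"""

if __name__ == "__main__":
    print(build_info_request().decode())
    print(subject_counts(SAMPLE))
```

An exporter would have to issue this request on every scrape for every stream, which is where the cost concern raised above comes from.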

Thank you again for your reply.


2 participants