Any chance to add new metric: nats_stream_total_messages_(per_subject) ? #305

tltsaia · 2024-09-03T10:32:48Z

What motivated this proposal?

This would be very helpful for monitoring the health and performance of ephemeral, pull-type consumers in the system.

What is the proposed change?

This project already exposes the metrics nats_stream_total_messages and nats_stream_total_bytes, but they are not enough for the ephemeral, pull-type consumer case.

Who benefits from this change?

No response

What alternatives have you evaluated?

No response

Jarema · 2024-09-03T19:15:07Z

Hey.

Thanks for the report.
I'm a bit confused about which metric exactly you would like to see. Can you please elaborate?

tltsaia · 2024-09-04T03:01:04Z

For example, take a JetStream stream S (RetentionPolicy=workqueue) that handles the subjects domain.>:
APP-1 creates an ephemeral consumer to consume subject: domain.app1
APP-2 creates an ephemeral consumer to consume subject: domain.app2

If we only monitor the totals (e.g., total messages/bytes), it is impossible to tell which consumer (APP-x) is abnormal, whether because it is processing slowly or because it has crashed.
Since APP-x is not a durable consumer, when APP-x terminates abnormally we cannot detect the anomaly from the metric jetstream_consumer_num_pending.

If there were a metric, nats_stream_total_messages_per_subject, reporting the current number of messages retained in the stream per subject, it would reduce the time needed to locate anomalies and help prevent JetStream from being overwhelmed by unprocessable messages.
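To make the request concrete, here is a minimal sketch of what such a metric could look like in the Prometheus text exposition format. It assumes the per-subject counts are already available as a mapping from subject to message count. The metric name is the one proposed in this issue, and the `stream_name` and `subject` labels are hypothetical choices, not something the exporter provides today.

```python
def escape_label(value: str) -> str:
    """Escape a label value per the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_per_subject(stream: str, subject_counts: dict) -> str:
    """Render one gauge sample per subject for a single stream."""
    # Name proposed in this issue; it is not an existing exporter metric.
    name = "nats_stream_total_messages_per_subject"
    lines = [
        f"# HELP {name} Messages currently retained in the stream, per subject.",
        f"# TYPE {name} gauge",
    ]
    # Sort subjects so the output is stable between scrapes.
    for subject in sorted(subject_counts):
        # Label names are hypothetical.
        labels = f'stream_name="{escape_label(stream)}",subject="{escape_label(subject)}"'
        lines.append(f"{name}{{{labels}}} {subject_counts[subject]}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_per_subject("S", {"domain.app1": 3, "domain.app2": 0}), end="")
```

Each subject becomes its own time series, so the label cardinality grows with the number of subjects in the stream.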

Jarema · 2024-09-04T11:41:37Z

Ok, I got it! Thanks for the explanation.

There are a few issues with this approach:

It is an expensive API call to make (to get those exact numbers), and it could impact the performance of the cluster.
It would be useful in your case only if you are using a workqueue or interest stream. Otherwise, the consumer would not affect the number of messages in the stream.

However, the NATS server does not know why the client stopped using the ephemeral consumer. Nor does it know what the apps intended if they did abandon a client, and that is hard to derive from server metrics.
It sounds like it's not server metrics that should help you here?

Also keep in mind that having unprocessed messages does not mean that JetStream is overwhelmed. It only means that you are publishing faster than you are consuming. And again, that holds only in workqueue streams. It would not tell you anything in limits-based streams.
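As a rough illustration of that point, the change in a subject's message count between two samples is just the publish rate minus the consume rate over that interval. The sketch below assumes two hypothetical samples of the same count and restricts the interpretation to workqueue retention, as described above. The function names are made up for this example.

```python
def net_backlog_rate(count_then: int, count_now: int, seconds_elapsed: float) -> float:
    """Net messages per second over the interval: publish rate minus consume rate."""
    if seconds_elapsed <= 0:
        raise ValueError("seconds_elapsed must be positive")
    return (count_now - count_then) / seconds_elapsed


def backlog_is_meaningful(retention: str) -> bool:
    """Only read the count as a backlog when consuming removes messages from the stream."""
    # "workqueue" is the retention value discussed in this thread; other policies are skipped.
    return retention == "workqueue"


if __name__ == "__main__":
    # 100 messages at the first sample, 160 messages 30 seconds later.
    print(net_backlog_rate(100, 160, 30.0))  # 2.0 messages/s more published than consumed
    print(backlog_is_meaningful("limits"))   # False
```

A positive rate says nothing about why the consumer is behind. It only shows that, over the interval, publishing outpaced consuming.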

Of course, I'm aware that the above holds for your use case; I'm only pointing out that it can't be generalized.

Maybe let's try to look at the problem differently: Why use ephemeral consumers instead of durable ones?

tltsaia · 2024-09-05T01:52:32Z

I agree that this doesn't sound like a server metric.

From my limited understanding, NATS can have the following use cases (modes):

MQTT mode: use NATS core (only retains the latest msg)
Streaming mode: use JetStream with RetentionPolicy=limits
MQ mode: use JetStream with RetentionPolicy=workqueue

For a NATS cluster, the health metrics required may vary depending on the mode. Just as the existing metrics are categorized by consumer, server, and stream, could we establish the necessary metrics for each usage mode?

Clearly, my use case is in MQ mode. Since messages in a workqueue stream are consumed only once, there is no need to rely on durable features to remember which record was last consumed. Additionally, there might be 1000+ ephemeral consumers (and 100+ durable ones) in the entire system; choosing ephemeral reduces load.

In MQ mode, a high message count/percentage under a specific stream subject is usually considered abnormal regardless of whether it's due to slow or dead consumers.
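A minimal sketch of that check, assuming the per-subject counts are available as a mapping, might look like the following. The function name and both thresholds are hypothetical and would need tuning per system.

```python
def flag_subjects(subject_counts: dict, max_count: int, max_share: float) -> dict:
    """Return subjects whose count or share of the stream exceeds a threshold.

    The result maps each flagged subject to a (count, share) tuple.
    """
    total = sum(subject_counts.values())
    flagged = {}
    for subject, count in subject_counts.items():
        # Share of all retained messages that sit under this subject.
        share = count / total if total else 0.0
        if count > max_count or share > max_share:
            flagged[subject] = (count, share)
    return flagged


if __name__ == "__main__":
    counts = {"domain.app1": 900, "domain.app2": 100}
    # Flag anything above 500 messages or above 80% of the stream.
    print(flag_subjects(counts, max_count=500, max_share=0.8))
```

The check flags a subject for either reason, so it does not distinguish a slow consumer from a dead one, which matches the intent described above.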

The suggestion for "msg per subject" rests on two observations:

The NATS API provides an attribute setting for JetStream: MaxMsgsPerSubject.
In NATS CLI you can quickly and easily get it using the command: nats stream subjects <stream-name>.
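For reference, the per-subject counts can be requested through the JetStream stream info API by passing a subject filter. The sketch below only builds the request and parses a reply using the standard library; it does not talk to a server, and the helper names are hypothetical. It assumes the reply carries the counts under `state.subjects`, and whether the CLI command uses exactly this request is also an assumption.

```python
import json


def stream_info_request(stream: str, subjects_filter: str):
    """Return the (API subject, JSON payload) pair for a filtered stream info request."""
    api_subject = f"$JS.API.STREAM.INFO.{stream}"
    payload = json.dumps({"subjects_filter": subjects_filter}).encode()
    return api_subject, payload


def subject_counts_from_response(raw: bytes) -> dict:
    """Extract the subject -> message count mapping from a stream info reply."""
    reply = json.loads(raw)
    if "error" in reply:
        raise RuntimeError(reply["error"].get("description", "stream info request failed"))
    # The mapping is absent when no subjects filter was sent or the stream is empty.
    return dict(reply.get("state", {}).get("subjects") or {})


if __name__ == "__main__":
    subject, payload = stream_info_request("S", "domain.>")
    print(subject, payload.decode())
    # Hand-written sample reply for illustration only, not captured from a real server.
    sample = json.dumps(
        {"state": {"messages": 4, "subjects": {"domain.app1": 3, "domain.app2": 1}}}
    ).encode()
    print(subject_counts_from_response(sample))
```

As noted earlier in the thread, this request can be expensive on streams with many subjects, so an exporter would probably need to make it opt-in or limit the filter.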

Thank you again for your reply.

tltsaia added the proposal Enhancement idea or proposal label Sep 3, 2024
