Collect metrics about CHT endpoint performance #60

jkuester · 2023-06-12T17:34:58Z

I came across this express.js prometheus middleware that collects response status and duration information which might be a really clean way to monitor performance of certain endpoints (eg: replication) without doing much bespoke development.

kennsippell · 2023-06-14T06:11:09Z

This does look clean. Is there also a plan to bring log data into Grafana (via Loki et al)? If that work is already being done, the response_time and status are in the medic-api logs and it may be straightforward to bring that into Grafana. Does anybody have experience with these? Does anybody have a preference for the approach taken?

jkuester · 2023-06-14T18:53:18Z

We have had some long-term discussions about the possibility of bringing in log data (via something like Loki)! It is not the top of the list of priorities, but is definitely something that would be valuable and folks have already shown interest in.

garethbowen · 2023-06-14T22:13:20Z

We definitely want to improve logging, particularly for self-hosted installs, which may be Loki+Grafana. The downside to that is it exposes PHI where there wasn't before so it may not be the right approach. I've heard mixed feedback about it, but interestingly MoH Kenya are using it for other services and they like it and might be the ideal candidate for evaluating our approach.

Looking at one of the demo videos for Loki it can pull out response times and failure rates for different endpoints so it would also do what we need. Definitely worth investigating as well as any other options to do the same thing.

kennsippell · 2023-06-27T22:08:39Z

Looking at a few options. Welcome thoughts and feedback on these findings. I'm new to the Grafana stack so quite possible there is more to be considered here.

joao-fontenele/express-prometheus-middleware

https://github.com/joao-fontenele/express-prometheus-middleware

I fear the package may be end of life. I can't find any official announcement, but the repository has been archived and no updates in 2 years. I asked about the project's future here and await a response.
This uses the response-time library for endpoint performance measurement. This records "... the elapsed time from when a request enters this middleware to when the headers are written". I believe this is not the metric we want because the time reported for _changes requests is often the heartbeat time and so this wouldn't meaningfully monitor the performance of offline user replication requests.
Monitors endpoints by path (unlike prometheus-api-metrics which uses the express route name). Has a useful interface extraMasks interface for custom grouping and differentiation of endpoints. Need this to differentiate similar routes (like offline _changes requests vs longpoll) and group requests (like /medic/resources/*, or /medic/{uuid}).
The importable dashboard is definitely the best available - even just its presentation of the prom-client standard metrics.

matteodisabatino/express-prometheus-middleware

https://github.com/matteodisabatino/express-prometheus-middleware

Official option on the Grafana site. Very basic with poor extensibility. No segmentation of metrics by endpoint at all. I don't think we should consider this.

prometheus-api-metrics

https://github.com/PayU/prometheus-api-metrics

Stewarded by PayU.com a global payment provider.
This measure endpoint performance via the total_time from request to response finished so this accurately reports offline user _changes requests. I believe this is what we want.
Routes are grouped by the express route (eg. /+medic/+:docId/{0,}) and not the path (like express-prometheus-middleware). I like it, but some issues with CHT requests (eg. resources are not properly grouped). No support for differentiating routes or custom grouping. I personally feel it is really important to monitor longpoll _changes requests differently from offline user _changes requests during replication.
The premade dashboards are decent for GC stuff but seem way worse than joao-fontenele for request duration. I can try to combine them?
Good support for custom metrics through prom-client, so we have custom metrics like yield or replication yield as first-class citizens in CHT monitoring.

prom-client

https://github.com/siimon/prom-client

Under the hood all these projects use prom-client.
A lot of the process/GC metrics by default.

Recommendation

Short-List of Tooling Options

Use prometheus-api-metrics and contribute a feature to improve route differentiation/grouping
Custom implementation based on prom-client
Implementation based on logs (not described)

Recommendations

Look into contributing the needed feature to prometheus-api-metrics.
Make CHT-specific dashboards forked from express-prometheus-middleware dashboards (regardless of tooling option chosen).
Monitor endpoint performance based on total_time (last byte) not response_time (first byte)
Avoid joao-fontenele due to end-of-life + endpoint performance measurement is only by time-to-first-byte.

garethbowen · 2023-06-27T22:48:45Z

Great write up, thanks!

One thing to consider is assuming this requires a cht-core api code change, then it'll presumably be released in 4.3.0 and not backported to previous versions. If so, then the endpoints to monitor will change completely with the new replication algorithm. This means the _changes endpoint likely no longer needs to be monitored, which in turn means the issues you've cited with longpoll, heartbeat, etc should not be a problem.

kennsippell · 2023-06-27T23:28:49Z

True. My thinking is so 4.2.

So there is no heartbeat or anything similarly affecting time-to-first-byte in 4.3 replication requests?
How do you feel about adding a dependency on express-prometheus-middleware - an archived/likely unmaintained project?
Is there a preference for monitoring time to first-byte vs last-byte?

dianabarsan · 2023-06-28T19:16:40Z

there is no heartbeat or anything similarly affecting time-to-first-byte in 4.3 replication requests

This is correct, _changes endpoint will no longer be used as of 4.3. We still need endpoint filtering, even if we're changing the protocol, we will still have at least one resource intensive endpoint - but these endpoints will less likely be shared between use-cases (there won't be a "longpoll" variant to be ignored for example).

I trust PayU, they are very big in Europe.

kennsippell · 2023-06-29T08:24:57Z

I like PayU's focus on Apdex. Pretty cool, but may take some thinking and data to calibrate for us.

With the PayU lib, we get a bunch of noisy reporting because of our default '*' route. I'm seeing unique routes in Grafana for each unique path handled by this route. The routes _local/{uuid} and resources are the noisiest. It would be better for these to be grouped somehow.

I've made a PR which would group them all under a single route '*'. Here are the paths that I know of that would be grouped. Is there anything here we'd definitely want to look at individually (not all together in this single group)?

/medic/
/medic/resources/*
/medic/branding/*
/*-users-meta/
/-users-meta/_local
/*-users-meta/_rev_diff
/-users-meta/_design/
/*-users-meta/_bulk_docs
…. probably more I don't know about

I'm also investigating a second issue which is causing a different pattern of noise in the prometheus routes. This seems less critical, but would be nice to fix.

Last issue I've found is that requests for static content (service-worker.js, runtime.js, etc.) are grouped together under a route called "N/A" because of our use of app.use(express.static());. I haven't filed a bug for this because I don't know how I'd go about fixing it exactly. I can't find anything else which triggers this N/A case - though it isn't guaranteed to be all static content.

kennsippell self-assigned this Jun 27, 2023

kennsippell mentioned this issue Jun 30, 2023

feat(#8426): API express internals + endpoint monitoring via Prometheus medic/cht-core#8354

Merged

5 tasks

This was referenced Jul 18, 2023

feat(#60): add prometheus job to scrape express-metrics #74

Merged

feat(#60): dashboards with CHT api express metrics #75

Merged

garethbowen mentioned this issue Aug 1, 2023

Expose an API with endpoint performance metrics medic/cht-core#8426

Closed

kennsippell added a commit that referenced this issue Sep 7, 2023

feat(#60): add prometheus job to scrape express-metrics (#74)

5ea380c

kennsippell added a commit that referenced this issue Sep 21, 2023

feat(#60): CHT API Server and CHT Replication dashboards (#75)

e57487d

kennsippell closed this as completed Sep 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Collect metrics about CHT endpoint performance #60

Collect metrics about CHT endpoint performance #60

jkuester commented Jun 12, 2023

kennsippell commented Jun 14, 2023 •

edited

Loading

jkuester commented Jun 14, 2023

garethbowen commented Jun 14, 2023

kennsippell commented Jun 27, 2023 •

edited

Loading

garethbowen commented Jun 27, 2023

kennsippell commented Jun 27, 2023 •

edited

Loading

dianabarsan commented Jun 28, 2023

kennsippell commented Jun 29, 2023 •

edited

Loading

Collect metrics about CHT endpoint performance #60

Collect metrics about CHT endpoint performance #60

Comments

jkuester commented Jun 12, 2023

kennsippell commented Jun 14, 2023 • edited Loading

jkuester commented Jun 14, 2023

garethbowen commented Jun 14, 2023

kennsippell commented Jun 27, 2023 • edited Loading

joao-fontenele/express-prometheus-middleware

matteodisabatino/express-prometheus-middleware

prometheus-api-metrics

prom-client

Recommendation

garethbowen commented Jun 27, 2023

kennsippell commented Jun 27, 2023 • edited Loading

dianabarsan commented Jun 28, 2023

kennsippell commented Jun 29, 2023 • edited Loading

kennsippell commented Jun 14, 2023 •

edited

Loading

kennsippell commented Jun 27, 2023 •

edited

Loading

kennsippell commented Jun 27, 2023 •

edited

Loading

kennsippell commented Jun 29, 2023 •

edited

Loading