-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Collect metrics about CHT endpoint performance #60
Comments
This does look clean. Is there also a plan to bring log data into Grafana (via Loki et al)? If that work is already being done, the response_time and status are in the medic-api logs and it may be straightforward to bring that into Grafana. Does anybody have experience with these? Does anybody have a preference for the approach taken? |
We have had some long-term discussions about the possibility of bringing in log data (via something like Loki)! It is not the top of the list of priorities, but is definitely something that would be valuable and folks have already shown interest in. |
We definitely want to improve logging, particularly for self-hosted installs, which may be Loki+Grafana. The downside to that is it exposes PHI where there wasn't before so it may not be the right approach. I've heard mixed feedback about it, but interestingly MoH Kenya are using it for other services and they like it and might be the ideal candidate for evaluating our approach. Looking at one of the demo videos for Loki it can pull out response times and failure rates for different endpoints so it would also do what we need. Definitely worth investigating as well as any other options to do the same thing. |
Looking at a few options. Welcome thoughts and feedback on these findings. I'm new to the Grafana stack so quite possible there is more to be considered here. joao-fontenele/express-prometheus-middlewarehttps://github.com/joao-fontenele/express-prometheus-middleware
matteodisabatino/express-prometheus-middlewarehttps://github.com/matteodisabatino/express-prometheus-middleware Official option on the Grafana site. Very basic with poor extensibility. No segmentation of metrics by endpoint at all. I don't think we should consider this. prometheus-api-metricshttps://github.com/PayU/prometheus-api-metrics
prom-clienthttps://github.com/siimon/prom-client
RecommendationShort-List of Tooling Options
Recommendations
|
Great write up, thanks! One thing to consider is assuming this requires a cht-core api code change, then it'll presumably be released in 4.3.0 and not backported to previous versions. If so, then the endpoints to monitor will change completely with the new replication algorithm. This means the _changes endpoint likely no longer needs to be monitored, which in turn means the issues you've cited with longpoll, heartbeat, etc should not be a problem. |
True. My thinking is so 4.2. So there is no heartbeat or anything similarly affecting time-to-first-byte in 4.3 replication requests? |
This is correct, _changes endpoint will no longer be used as of 4.3. We still need endpoint filtering, even if we're changing the protocol, we will still have at least one resource intensive endpoint - but these endpoints will less likely be shared between use-cases (there won't be a "longpoll" variant to be ignored for example). I trust PayU, they are very big in Europe. |
I like PayU's focus on Apdex. Pretty cool, but may take some thinking and data to calibrate for us. With the PayU lib, we get a bunch of noisy reporting because of our default '*' route. I'm seeing unique routes in Grafana for each unique path handled by this route. The routes _local/{uuid} and resources are the noisiest. It would be better for these to be grouped somehow. I've made a PR which would group them all under a single route '*'. Here are the paths that I know of that would be grouped. Is there anything here we'd definitely want to look at individually (not all together in this single group)?
I'm also investigating a second issue which is causing a different pattern of noise in the prometheus routes. This seems less critical, but would be nice to fix. Last issue I've found is that requests for static content (service-worker.js, runtime.js, etc.) are grouped together under a route called "N/A" because of our use of |
From @garethbowen's Slack message:
The text was updated successfully, but these errors were encountered: