
Feat/prometheus exporter #172

Closed · wants to merge 3 commits

Conversation

@gu (Contributor) commented Apr 17, 2022

Introduces a very basic set of functionality to expose a new API endpoint that returns application metrics for consumption by Prometheus.

The existing issue regarding performance logging has a few notes about Elastic APM usage; however, I've personally had a better experience using the Prometheus/Grafana stack than ELK on a separate project. I'm more than happy to list a few pros/cons I've noticed from using them both! I'm also open to ideas on how to make this more modular so we aren't necessarily forcing/prescribing Prometheus usage.

By default, the endpoint returns Prometheus metrics for Node, including garbage collection, heap, and event loop metrics, among others.
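As a rough sketch of the wiring (assumed from this description and the output below, not necessarily the PR's exact code):

const express = require('express');
const client = require('prom-client');

const app = express();

// Register prom-client's default Node.js collectors (CPU, memory, event loop, GC)
client.collectDefaultMetrics();

app.get('/api/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  // register.metrics() resolves to the Prometheus text exposition format
  res.end(await client.register.metrics());
});

app.listen(3001);

For example: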

❯ curl http://localhost:3001/api/metrics
# HELP process_cpu_user_seconds_total Total user CPU time spent in seconds.
# TYPE process_cpu_user_seconds_total counter
process_cpu_user_seconds_total 0.390205

# HELP process_cpu_system_seconds_total Total system CPU time spent in seconds.
# TYPE process_cpu_system_seconds_total counter
process_cpu_system_seconds_total 0.07347

# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.463675

# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1650217730

# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 98426880

# HELP nodejs_eventloop_lag_seconds Lag of event loop in seconds.
# TYPE nodejs_eventloop_lag_seconds gauge
nodejs_eventloop_lag_seconds 0.005162209

# HELP nodejs_eventloop_lag_min_seconds The minimum recorded event loop delay.
# TYPE nodejs_eventloop_lag_min_seconds gauge
nodejs_eventloop_lag_min_seconds 0.009125888

# HELP nodejs_eventloop_lag_max_seconds The maximum recorded event loop delay.
# TYPE nodejs_eventloop_lag_max_seconds gauge
nodejs_eventloop_lag_max_seconds 0.016310271

# HELP nodejs_eventloop_lag_mean_seconds The mean of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_mean_seconds gauge
nodejs_eventloop_lag_mean_seconds 0.010989794412060301

# HELP nodejs_eventloop_lag_stddev_seconds The standard deviation of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_stddev_seconds gauge
nodejs_eventloop_lag_stddev_seconds 0.00037691298714976

# HELP nodejs_eventloop_lag_p50_seconds The 50th percentile of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_p50_seconds gauge
nodejs_eventloop_lag_p50_seconds 0.011051007

# HELP nodejs_eventloop_lag_p90_seconds The 90th percentile of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_p90_seconds gauge
nodejs_eventloop_lag_p90_seconds 0.011116543

# HELP nodejs_eventloop_lag_p99_seconds The 99th percentile of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_p99_seconds gauge
nodejs_eventloop_lag_p99_seconds 0.011583487

# HELP nodejs_active_handles Number of active libuv handles grouped by handle type. Every handle type is C++ class name.
# TYPE nodejs_active_handles gauge
nodejs_active_handles{type="Socket"} 15
nodejs_active_handles{type="WriteStream"} 1
nodejs_active_handles{type="Server"} 1

# HELP nodejs_active_handles_total Total number of active handles.
# TYPE nodejs_active_handles_total gauge
nodejs_active_handles_total 17

# HELP nodejs_active_requests Number of active libuv requests grouped by request type. Every request type is C++ class name.
# TYPE nodejs_active_requests gauge

# HELP nodejs_active_requests_total Total number of active requests.
# TYPE nodejs_active_requests_total gauge
nodejs_active_requests_total 0

# HELP nodejs_heap_size_total_bytes Process heap size from Node.js in bytes.
# TYPE nodejs_heap_size_total_bytes gauge
nodejs_heap_size_total_bytes 40181760

# HELP nodejs_heap_size_used_bytes Process heap size used from Node.js in bytes.
# TYPE nodejs_heap_size_used_bytes gauge
nodejs_heap_size_used_bytes 38077696

# HELP nodejs_external_memory_bytes Node.js external memory size in bytes.
# TYPE nodejs_external_memory_bytes gauge
nodejs_external_memory_bytes 19946357

# HELP nodejs_heap_space_size_total_bytes Process heap space size total from Node.js in bytes.
# TYPE nodejs_heap_space_size_total_bytes gauge
nodejs_heap_space_size_total_bytes{space="read_only"} 151552
nodejs_heap_space_size_total_bytes{space="new"} 1048576
nodejs_heap_space_size_total_bytes{space="old"} 26636288
nodejs_heap_space_size_total_bytes{space="code"} 622592
nodejs_heap_space_size_total_bytes{space="map"} 1839104
nodejs_heap_space_size_total_bytes{space="large_object"} 9834496
nodejs_heap_space_size_total_bytes{space="code_large_object"} 49152
nodejs_heap_space_size_total_bytes{space="new_large_object"} 0

# HELP nodejs_heap_space_size_used_bytes Process heap space size used from Node.js in bytes.
# TYPE nodejs_heap_space_size_used_bytes gauge
nodejs_heap_space_size_used_bytes{space="read_only"} 150392
nodejs_heap_space_size_used_bytes{space="new"} 595104
nodejs_heap_space_size_used_bytes{space="old"} 25928792
nodejs_heap_space_size_used_bytes{space="code"} 359648
nodejs_heap_space_size_used_bytes{space="map"} 1280448
nodejs_heap_space_size_used_bytes{space="large_object"} 9764864
nodejs_heap_space_size_used_bytes{space="code_large_object"} 2976
nodejs_heap_space_size_used_bytes{space="new_large_object"} 0

# HELP nodejs_heap_space_size_available_bytes Process heap space size available from Node.js in bytes.
# TYPE nodejs_heap_space_size_available_bytes gauge
nodejs_heap_space_size_available_bytes{space="read_only"} 0
nodejs_heap_space_size_available_bytes{space="new"} 452320
nodejs_heap_space_size_available_bytes{space="old"} 608584
nodejs_heap_space_size_available_bytes{space="code"} 1120
nodejs_heap_space_size_available_bytes{space="map"} 555960
nodejs_heap_space_size_available_bytes{space="large_object"} 0
nodejs_heap_space_size_available_bytes{space="code_large_object"} 0
nodejs_heap_space_size_available_bytes{space="new_large_object"} 1047424

# HELP nodejs_version_info Node.js version info.
# TYPE nodejs_version_info gauge
nodejs_version_info{version="v14.18.3",major="14",minor="18",patch="3"} 1

# HELP nodejs_gc_duration_seconds Garbage collection duration by kind, one of major, minor, incremental or weakcb.
# TYPE nodejs_gc_duration_seconds histogram
nodejs_gc_duration_seconds_bucket{le="0.001",kind="minor"} 1
nodejs_gc_duration_seconds_bucket{le="0.01",kind="minor"} 2
nodejs_gc_duration_seconds_bucket{le="0.1",kind="minor"} 2
nodejs_gc_duration_seconds_bucket{le="1",kind="minor"} 2
nodejs_gc_duration_seconds_bucket{le="2",kind="minor"} 2
nodejs_gc_duration_seconds_bucket{le="5",kind="minor"} 2
nodejs_gc_duration_seconds_bucket{le="+Inf",kind="minor"} 2
nodejs_gc_duration_seconds_sum{kind="minor"} 0.002840583
nodejs_gc_duration_seconds_count{kind="minor"} 2
nodejs_gc_duration_seconds_bucket{le="0.001",kind="incremental"} 6
nodejs_gc_duration_seconds_bucket{le="0.01",kind="incremental"} 6
nodejs_gc_duration_seconds_bucket{le="0.1",kind="incremental"} 6
nodejs_gc_duration_seconds_bucket{le="1",kind="incremental"} 6
nodejs_gc_duration_seconds_bucket{le="2",kind="incremental"} 6
nodejs_gc_duration_seconds_bucket{le="5",kind="incremental"} 6
nodejs_gc_duration_seconds_bucket{le="+Inf",kind="incremental"} 6
nodejs_gc_duration_seconds_sum{kind="incremental"} 0.001397041
nodejs_gc_duration_seconds_count{kind="incremental"} 6
nodejs_gc_duration_seconds_bucket{le="0.001",kind="major"} 0
nodejs_gc_duration_seconds_bucket{le="0.01",kind="major"} 3
nodejs_gc_duration_seconds_bucket{le="0.1",kind="major"} 3
nodejs_gc_duration_seconds_bucket{le="1",kind="major"} 3
nodejs_gc_duration_seconds_bucket{le="2",kind="major"} 3
nodejs_gc_duration_seconds_bucket{le="5",kind="major"} 3
nodejs_gc_duration_seconds_bucket{le="+Inf",kind="major"} 3
nodejs_gc_duration_seconds_sum{kind="major"} 0.009879624
nodejs_gc_duration_seconds_count{kind="major"} 3

When adding the Express middleware, we can also expose individual route stats. Currently this is limited to a request counter and a request duration histogram, as sketched below.
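A minimal sketch of that middleware with prom-client (the metric definitions mirror the output below; the PR's actual implementation may differ):

const client = require('prom-client');

const requestsTotal = new client.Counter({
  name: 'starter_http_requests_total',
  help: 'Counter for total requests received',
  labelNames: ['route', 'method', 'status'],
});

const requestDuration = new client.Histogram({
  name: 'starter_http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['route', 'method', 'status'],
});

// Express middleware: record count and duration once the response finishes
const metricsMiddleware = (req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const labels = { route: req.path, method: req.method, status: res.statusCode };
    requestsTotal.inc(labels);
    requestDuration.observe(labels, (Date.now() - start) / 1000);
  });
  next();
};

Sample output after a few requests: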

# HELP starter_http_requests_total Counter for total requests received
# TYPE starter_http_requests_total counter
starter_http_requests_total{route="/config",method="GET",status="200"} 4

# HELP starter_http_request_duration_seconds Duration of HTTP requests in seconds
# TYPE starter_http_request_duration_seconds histogram
starter_http_request_duration_seconds_bucket{le="0.005",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="0.01",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="0.025",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="0.05",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="0.1",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="0.25",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="0.5",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="1",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="2.5",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="5",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="10",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="+Inf",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_sum{route="/config",method="GET",status="200"} 0.019
starter_http_request_duration_seconds_count{route="/config",method="GET",status="200"} 4

Freddy Gu added 3 commits April 10, 2022 21:15
…PI and API request metrics via a metrics endpoint. Also added an express middleware function to track count and duration of requests
metrics: {
  enabled: false,
  prefix: 'starter',
  collectDefaultMetrics: true
}
Contributor:

@gu - are there any performance / memory considerations for collecting metrics by default? What if a fork doesn't use them?

Contributor (Author):

@brianghig I haven't done any direct performance comparisons with default metrics disabled. If they are disabled, the output of the metrics route would essentially be an empty object, which Prometheus can handle.

Hoping that the implementation for our team will give us a better idea of the performance impact.
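As a sketch of the opt-out path (the config shape comes from the snippet above; the config loader is hypothetical):

const client = require('prom-client');
const config = require('./config'); // hypothetical config loader

if (config.metrics.enabled && config.metrics.collectDefaultMetrics) {
  // Only register the default collectors when a fork opts in, so forks
  // that never scrape metrics skip this work entirely
  client.collectDefaultMetrics();
}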

const getMetrics = () => {
  if (!metricsEnabled) {
    return Promise.reject('Metrics not configured');
Contributor:

@gu - should we throw an error here instead of rejecting the promise? It seems like we're moving toward async/await, which would favor that style of error handling.

Also, should this read "Metrics not enabled" instead of not configured?
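For comparison, the throw-based style suggested here might look like this (sketch only):

const getMetrics = async () => {
  if (!metricsEnabled) {
    // Throwing an Error (rather than rejecting with a string) preserves
    // the stack trace and reads naturally under try/catch with await
    throw new Error('Metrics not enabled');
  }
  return register.metrics();
};

An async function that throws still produces a rejected promise, so existing .catch() callers keep working.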

    return Promise.reject('Metrics not configured');
  }

  return register.metrics();
Contributor:

@gu - how should we handle situations where the start() method is not called prior to a getMetrics() request? Understanding that this is now in the middleware initialization, we may want to provide a more explicit error message.
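One hypothetical shape for that guard (the started flag is made up for illustration; start()'s internals aren't shown in this excerpt):

let started = false;

const start = () => {
  // ...existing initialization...
  started = true;
};

const getMetrics = async () => {
  if (!metricsEnabled) {
    throw new Error('Metrics not enabled');
  }
  if (!started) {
    // Explicit message for the case raised above
    throw new Error('getMetrics() called before start()');
  }
  return register.metrics();
};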

// Track start of request duration
const start = Date.now();

res.on('finish', () => {
Contributor:

@gu - does the finish event handle failures? For example, if the request throws an error or times out?

Contributor (Author):

I believe it does, but I'll verify that.
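For reference: 'finish' fires when the response is fully flushed, but a client disconnect or timeout emits 'close' without 'finish'. A sketch covering both inside the middleware (reusing start and the metric names from the sketch above):

// Record exactly once, whether the response completed or the socket closed early
let recorded = false;
const record = () => {
  if (recorded) return;
  recorded = true;
  const labels = { route: req.path, method: req.method, status: res.statusCode };
  requestsTotal.inc(labels);
  requestDuration.observe(labels, (Date.now() - start) / 1000);
};
res.on('finish', record); // normal completion
res.on('close', record);  // also fires on abort/disconnect (after 'finish' on Node >= 12.17)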

@tzfx (Contributor) commented May 17, 2022

It's always good to have more options available. I will say that the APM route was amazing in that the only required configuration for it was what I outlined in #32. The default configuration without any additional setup captured any request going into express in a very granular way, and metrics delivery was seamless.

@gu (Contributor, Author) commented Jun 7, 2022

@tzfx Yeah, I agree that APM was very convenient on another project where I used it. Over time, though, I've leaned toward the exporter/Prometheus/Grafana side, since I personally prefer that pattern a bit more. I'm happy to update this to make it less 'use-this-or-else' if you have any suggestions for making it more modular!

I'm also fine with nixing this PR since I haven't been able to keep an eye on it :(

@gu closed this Jun 24, 2022
@gu (Contributor, Author) commented Jun 24, 2022

Closing since it's stale and I haven't been able to keep it up to date.
