
Feat/prometheus exporter #172

Closed · wants to merge 3 commits

Conversation

@gu (Contributor) commented Apr 17, 2022

Introduces a very basic set of functionality to expose a new API endpoint that returns application metrics for consumption by Prometheus.

The existing issue regarding performance logging has a few notes about Elastic APM usage; however, I've personally had a better experience using the Prometheus/Grafana stack than ELK on a separate project. I'm more than happy to list a few pros/cons I've noticed from using them both! I'm also open to ideas on how to make this more modular so we aren't necessarily forcing/prescribing Prometheus usage.

By default, the endpoint returns Prometheus metrics for Node, including garbage collection, heap, and event loop metrics, among others.
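As a rough sketch of the wiring (assumed from this description and the output below, not necessarily the PR's exact code):

const express = require('express');
const client = require('prom-client');

const app = express();

// Register prom-client's default Node.js collectors (CPU, memory, event loop, GC)
client.collectDefaultMetrics();

app.get('/api/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  // register.metrics() resolves to the Prometheus text exposition format
  res.end(await client.register.metrics());
});

app.listen(3001);

For example: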

❯ curl http://localhost:3001/api/metrics
# HELP process_cpu_user_seconds_total Total user CPU time spent in seconds.
# TYPE process_cpu_user_seconds_total counter
process_cpu_user_seconds_total 0.390205

# HELP process_cpu_system_seconds_total Total system CPU time spent in seconds.
# TYPE process_cpu_system_seconds_total counter
process_cpu_system_seconds_total 0.07347

# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.463675

# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1650217730

# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 98426880

# HELP nodejs_eventloop_lag_seconds Lag of event loop in seconds.
# TYPE nodejs_eventloop_lag_seconds gauge
nodejs_eventloop_lag_seconds 0.005162209

# HELP nodejs_eventloop_lag_min_seconds The minimum recorded event loop delay.
# TYPE nodejs_eventloop_lag_min_seconds gauge
nodejs_eventloop_lag_min_seconds 0.009125888

# HELP nodejs_eventloop_lag_max_seconds The maximum recorded event loop delay.
# TYPE nodejs_eventloop_lag_max_seconds gauge
nodejs_eventloop_lag_max_seconds 0.016310271

# HELP nodejs_eventloop_lag_mean_seconds The mean of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_mean_seconds gauge
nodejs_eventloop_lag_mean_seconds 0.010989794412060301

# HELP nodejs_eventloop_lag_stddev_seconds The standard deviation of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_stddev_seconds gauge
nodejs_eventloop_lag_stddev_seconds 0.00037691298714976

# HELP nodejs_eventloop_lag_p50_seconds The 50th percentile of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_p50_seconds gauge
nodejs_eventloop_lag_p50_seconds 0.011051007

# HELP nodejs_eventloop_lag_p90_seconds The 90th percentile of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_p90_seconds gauge
nodejs_eventloop_lag_p90_seconds 0.011116543

# HELP nodejs_eventloop_lag_p99_seconds The 99th percentile of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_p99_seconds gauge
nodejs_eventloop_lag_p99_seconds 0.011583487

# HELP nodejs_active_handles Number of active libuv handles grouped by handle type. Every handle type is C++ class name.
# TYPE nodejs_active_handles gauge
nodejs_active_handles{type="Socket"} 15
nodejs_active_handles{type="WriteStream"} 1
nodejs_active_handles{type="Server"} 1

# HELP nodejs_active_handles_total Total number of active handles.
# TYPE nodejs_active_handles_total gauge
nodejs_active_handles_total 17

# HELP nodejs_active_requests Number of active libuv requests grouped by request type. Every request type is C++ class name.
# TYPE nodejs_active_requests gauge

# HELP nodejs_active_requests_total Total number of active requests.
# TYPE nodejs_active_requests_total gauge
nodejs_active_requests_total 0

# HELP nodejs_heap_size_total_bytes Process heap size from Node.js in bytes.
# TYPE nodejs_heap_size_total_bytes gauge
nodejs_heap_size_total_bytes 40181760

# HELP nodejs_heap_size_used_bytes Process heap size used from Node.js in bytes.
# TYPE nodejs_heap_size_used_bytes gauge
nodejs_heap_size_used_bytes 38077696

# HELP nodejs_external_memory_bytes Node.js external memory size in bytes.
# TYPE nodejs_external_memory_bytes gauge
nodejs_external_memory_bytes 19946357

# HELP nodejs_heap_space_size_total_bytes Process heap space size total from Node.js in bytes.
# TYPE nodejs_heap_space_size_total_bytes gauge
nodejs_heap_space_size_total_bytes{space="read_only"} 151552
nodejs_heap_space_size_total_bytes{space="new"} 1048576
nodejs_heap_space_size_total_bytes{space="old"} 26636288
nodejs_heap_space_size_total_bytes{space="code"} 622592
nodejs_heap_space_size_total_bytes{space="map"} 1839104
nodejs_heap_space_size_total_bytes{space="large_object"} 9834496
nodejs_heap_space_size_total_bytes{space="code_large_object"} 49152
nodejs_heap_space_size_total_bytes{space="new_large_object"} 0

# HELP nodejs_heap_space_size_used_bytes Process heap space size used from Node.js in bytes.
# TYPE nodejs_heap_space_size_used_bytes gauge
nodejs_heap_space_size_used_bytes{space="read_only"} 150392
nodejs_heap_space_size_used_bytes{space="new"} 595104
nodejs_heap_space_size_used_bytes{space="old"} 25928792
nodejs_heap_space_size_used_bytes{space="code"} 359648
nodejs_heap_space_size_used_bytes{space="map"} 1280448
nodejs_heap_space_size_used_bytes{space="large_object"} 9764864
nodejs_heap_space_size_used_bytes{space="code_large_object"} 2976
nodejs_heap_space_size_used_bytes{space="new_large_object"} 0

# HELP nodejs_heap_space_size_available_bytes Process heap space size available from Node.js in bytes.
# TYPE nodejs_heap_space_size_available_bytes gauge
nodejs_heap_space_size_available_bytes{space="read_only"} 0
nodejs_heap_space_size_available_bytes{space="new"} 452320
nodejs_heap_space_size_available_bytes{space="old"} 608584
nodejs_heap_space_size_available_bytes{space="code"} 1120
nodejs_heap_space_size_available_bytes{space="map"} 555960
nodejs_heap_space_size_available_bytes{space="large_object"} 0
nodejs_heap_space_size_available_bytes{space="code_large_object"} 0
nodejs_heap_space_size_available_bytes{space="new_large_object"} 1047424

# HELP nodejs_version_info Node.js version info.
# TYPE nodejs_version_info gauge
nodejs_version_info{version="v14.18.3",major="14",minor="18",patch="3"} 1

# HELP nodejs_gc_duration_seconds Garbage collection duration by kind, one of major, minor, incremental or weakcb.
# TYPE nodejs_gc_duration_seconds histogram
nodejs_gc_duration_seconds_bucket{le="0.001",kind="minor"} 1
nodejs_gc_duration_seconds_bucket{le="0.01",kind="minor"} 2
nodejs_gc_duration_seconds_bucket{le="0.1",kind="minor"} 2
nodejs_gc_duration_seconds_bucket{le="1",kind="minor"} 2
nodejs_gc_duration_seconds_bucket{le="2",kind="minor"} 2
nodejs_gc_duration_seconds_bucket{le="5",kind="minor"} 2
nodejs_gc_duration_seconds_bucket{le="+Inf",kind="minor"} 2
nodejs_gc_duration_seconds_sum{kind="minor"} 0.002840583
nodejs_gc_duration_seconds_count{kind="minor"} 2
nodejs_gc_duration_seconds_bucket{le="0.001",kind="incremental"} 6
nodejs_gc_duration_seconds_bucket{le="0.01",kind="incremental"} 6
nodejs_gc_duration_seconds_bucket{le="0.1",kind="incremental"} 6
nodejs_gc_duration_seconds_bucket{le="1",kind="incremental"} 6
nodejs_gc_duration_seconds_bucket{le="2",kind="incremental"} 6
nodejs_gc_duration_seconds_bucket{le="5",kind="incremental"} 6
nodejs_gc_duration_seconds_bucket{le="+Inf",kind="incremental"} 6
nodejs_gc_duration_seconds_sum{kind="incremental"} 0.001397041
nodejs_gc_duration_seconds_count{kind="incremental"} 6
nodejs_gc_duration_seconds_bucket{le="0.001",kind="major"} 0
nodejs_gc_duration_seconds_bucket{le="0.01",kind="major"} 3
nodejs_gc_duration_seconds_bucket{le="0.1",kind="major"} 3
nodejs_gc_duration_seconds_bucket{le="1",kind="major"} 3
nodejs_gc_duration_seconds_bucket{le="2",kind="major"} 3
nodejs_gc_duration_seconds_bucket{le="5",kind="major"} 3
nodejs_gc_duration_seconds_bucket{le="+Inf",kind="major"} 3
nodejs_gc_duration_seconds_sum{kind="major"} 0.009879624
nodejs_gc_duration_seconds_count{kind="major"} 3

When adding the Express middleware, we can also expose individual route stats. Currently this is limited to a request counter and a request duration histogram, as sketched below.
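A minimal sketch of that middleware with prom-client (the metric definitions mirror the output below; the PR's actual implementation may differ):

const client = require('prom-client');

const requestsTotal = new client.Counter({
  name: 'starter_http_requests_total',
  help: 'Counter for total requests received',
  labelNames: ['route', 'method', 'status'],
});

const requestDuration = new client.Histogram({
  name: 'starter_http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['route', 'method', 'status'],
});

// Express middleware: record count and duration once the response finishes
const metricsMiddleware = (req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const labels = { route: req.path, method: req.method, status: res.statusCode };
    requestsTotal.inc(labels);
    requestDuration.observe(labels, (Date.now() - start) / 1000);
  });
  next();
};

Sample output after a few requests: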

# HELP starter_http_requests_total Counter for total requests received
# TYPE starter_http_requests_total counter
starter_http_requests_total{route="/config",method="GET",status="200"} 4

# HELP starter_http_request_duration_seconds Duration of HTTP requests in seconds
# TYPE starter_http_request_duration_seconds histogram
starter_http_request_duration_seconds_bucket{le="0.005",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="0.01",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="0.025",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="0.05",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="0.1",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="0.25",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="0.5",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="1",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="2.5",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="5",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="10",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_bucket{le="+Inf",route="/config",method="GET",status="200"} 4
starter_http_request_duration_seconds_sum{route="/config",method="GET",status="200"} 0.019
starter_http_request_duration_seconds_count{route="/config",method="GET",status="200"} 4

Freddy Gu added 3 commits April 10, 2022 21:15
…PI and API request metrics via a metrics endpoint. Also added an express middleware function to track count and duration of requests
metrics: {
  enabled: false,
  prefix: 'starter',
  collectDefaultMetrics: true
}
Contributor:

@gu - are there any performance / memory considerations for collecting metrics by default? What if a fork doesn't use them?

Contributor (Author):

@brianghig I haven't done any direct performance comparisons with default metrics disabled. If they are disabled, the output of the metrics route would essentially be an empty object, which Prometheus can handle.

Hoping that the implementation for our team will give us a better idea of the performance impact.
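As a sketch of the opt-out path (the config shape comes from the snippet above; the config loader is hypothetical):

const client = require('prom-client');
const config = require('./config'); // hypothetical config loader

if (config.metrics.enabled && config.metrics.collectDefaultMetrics) {
  // Only register the default collectors when a fork opts in, so forks
  // that never scrape metrics skip this work entirely
  client.collectDefaultMetrics();
}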

const getMetrics = () => {
  if (!metricsEnabled) {
    return Promise.reject('Metrics not configured');
Contributor:

@gu - should we throw an error here instead of rejecting the promise? It seems like we're moving toward async/await, which would favor that style of error handling.

Also, should this read "Metrics not enabled" instead of not configured?
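For comparison, the throw-based style suggested here might look like this (sketch only):

const getMetrics = async () => {
  if (!metricsEnabled) {
    // Throwing an Error (rather than rejecting with a string) preserves
    // the stack trace and reads naturally under try/catch with await
    throw new Error('Metrics not enabled');
  }
  return register.metrics();
};

An async function that throws still produces a rejected promise, so existing .catch() callers keep working.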

    return Promise.reject('Metrics not configured');
  }

  return register.metrics();
Contributor:

@gu - how should we handle situations where the start() method is not called prior to a getMetrics() request? Understanding that this is now in the middleware initialization, we may want to provide a more explicit error message.
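One hypothetical shape for that guard (the started flag is made up for illustration; start()'s internals aren't shown in this excerpt):

let started = false;

const start = () => {
  // ...existing initialization...
  started = true;
};

const getMetrics = async () => {
  if (!metricsEnabled) {
    throw new Error('Metrics not enabled');
  }
  if (!started) {
    // Explicit message for the case raised above
    throw new Error('getMetrics() called before start()');
  }
  return register.metrics();
};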

// Track start of request duration
const start = Date.now();

res.on('finish', () => {
Contributor:

@gu - does the finish event handle failures? For example, if the request throws an error or times out?

Contributor (Author):

I believe it does, but I'll verify that.
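For reference: 'finish' fires when the response is fully flushed, but a client disconnect or timeout emits 'close' without 'finish'. A sketch covering both inside the middleware (reusing start and the metric names from the sketch above):

// Record exactly once, whether the response completed or the socket closed early
let recorded = false;
const record = () => {
  if (recorded) return;
  recorded = true;
  const labels = { route: req.path, method: req.method, status: res.statusCode };
  requestsTotal.inc(labels);
  requestDuration.observe(labels, (Date.now() - start) / 1000);
};
res.on('finish', record); // normal completion
res.on('close', record);  // also fires on abort/disconnect (after 'finish' on Node >= 12.17)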

@tzfx (Contributor) commented May 17, 2022

It's always good to have more options available. I will say that the APM route was amazing in that the only required configuration for it was what I outlined in #32. The default configuration without any additional setup captured any request going into express in a very granular way, and metrics delivery was seamless.

@gu (Contributor, Author) commented Jun 7, 2022

@tzfx Yeah, I agree that APM was very convenient on another project where I used it. Over time, though, I've leaned toward the exporter/Prometheus/Grafana side, since I personally prefer that pattern a bit more. I'm happy to update this to make it less 'use-this-or-else' if you have any suggestions for making it more modular!

I'm also fine with nixing this PR since I haven't been able to keep an eye on it :(

@gu closed this Jun 24, 2022
@gu (Contributor, Author) commented Jun 24, 2022

Closing since it's stale and I haven't been able to keep it up to date.
