Ported over from Sonja's most excellent OSS repo
This demonstrates how to configure Tyk Gateway, Tyk Pump, Prometheus and Grafana OSS to set up a dashboard with SLIs and SLOs for your APIs managed by Tyk.
You can use it to explore the Prometheus metrics exposed by Tyk Pump and use them in a Grafana dashboard.
- Run the up.sh script with the slo-prometheus-grafana parameter:
./up.sh slo-prometheus-grafana
- Generate traffic
K6 is used to generate traffic to the API endpoints. The load script load.js will run for 15 minutes.
./docker-compose-command.sh run k6 run /scripts/load.js
You will see K6 output in your terminal.
- Check out the dashboard in Grafana
Go to Grafana in your browser (initial user/pwd: admin/admin) and open the dashboard called SLOs for APIs managed by Tyk.
You should see the data coming in.
You can also filter the data per API.
- Tyk API Gateway is configured to expose two API endpoints:
- httpbin (see .json config)
- httpstatus (see .json config)
- K6 will use the load script load.js to generate demo traffic to the API endpoints
- Tyk Pump is configured to expose a metric endpoint for Prometheus (see config) with two custom metrics called tyk_http_requests_total and tyk_http_latency. Tyk Pump version 1.6 or later is needed for custom metrics.
- Prometheus
- prometheus.yml is configured to automatically scrape Tyk Pump's metric endpoint
- slos.rules.yml is used to calculate additional metrics needed for the remaining error budget
- Grafana
- prometheus_ds.yml is configured to connect Grafana automatically to Prometheus
- SLOs-for-APIs-managed-by-Tyk.json is the dashboard definition
Definition and example inspired by https://sre.google/workbook/slo-document/, https://landing.google.com/sre/workbook/chapters/alerting-on-slos/ and https://github.com/google/prometheus-slo-burn-example/blob/master/prometheus/slos.rules.yml.
You will see different indicators displayed on the Grafana dashboard.
To calculate the SLO and the displayed error budget remaining, we use the following SLI/SLO:
- SLI: the proportion of successful HTTP requests, as measured from Tyk API Gateway
- Any HTTP status other than 500–599 is considered successful.
- count of http_requests which do not have a 5XX status code divided by count of all http_requests
- SLO: 95% successful requests
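The SLI and SLO defined above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the rule used by the demo: the function name `success_ratio`, the `SLO_TARGET` constant and the request counts are all hypothetical.

```python
def success_ratio(status_counts):
    """Return the fraction of requests whose status is not 5xx.

    status_counts maps an HTTP status code (int) to a request count.
    Returns None when there were no requests at all.
    """
    total = sum(status_counts.values())
    if total == 0:
        return None  # no traffic, so the SLI is undefined
    # Only 500-599 counts as a failure; every other status is a success.
    errors = sum(n for code, n in status_counts.items() if 500 <= code <= 599)
    return (total - errors) / total


SLO_TARGET = 0.95  # 95% successful requests, as stated in the SLO

# Hypothetical counts, for illustration only.
counts = {200: 940, 404: 20, 500: 30, 503: 10}
sli = success_ratio(counts)
print(f"SLI = {sli:.3f}, SLO met: {sli >= SLO_TARGET}")
```

Note that a 404 counts as a successful request under this definition, since only 5XX responses are treated as errors.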
In slos.rules.yml we calculate the ratio of errors per request over the last 10 minutes in job:slo_errors_per_request:ratio_rate10m. With job:error_budget:remaining we calculate the error budget remaining in percent. This is what we display in the Grafana dashboard. We use a threshold of 95% in the dashboard (every value below 95% is red).
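To make the idea of an error ratio over a 10-minute window concrete, here is a minimal Python sketch. It assumes two cumulative counter readings taken at the start and end of the window, and it deliberately ignores counter resets and the extrapolation that Prometheus applies. The function name and the sample values are hypothetical, and this is not the expression used in slos.rules.yml.

```python
def error_ratio_over_window(start, end):
    """Ratio of the 5xx increase to the total increase between two samples.

    Each sample is a (total_requests, error_requests) tuple of cumulative
    counter values. Because both rates cover the same window, the window
    length cancels out and the ratio of rates equals the ratio of increases.
    Returns None when no requests arrived during the window.
    """
    total_increase = end[0] - start[0]
    error_increase = end[1] - start[1]
    if total_increase <= 0:
        return None  # no traffic (or a counter reset, which is not handled)
    return error_increase / total_increase


# Hypothetical counter readings taken 10 minutes apart.
sample_start = (10_000, 200)
sample_end = (12_000, 260)
ratio = error_ratio_over_window(sample_start, sample_end)
print(f"errors per request over the window: {ratio:.2%}")
```

With these made-up readings, 60 of the 2000 requests in the window failed, so the ratio is 3%, which is within a 5% error allowance.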