SLIs and SLOs with Prometheus and Grafana for your APIs managed by Tyk

Ported over from Sonja's most excellent OSS repo

About

This demonstrates how to configure Tyk Gateway, Tyk Pump, Prometheus and Grafana OSS to set up a dashboard with SLIs and SLOs for your APIs managed by Tyk.

You can use it to explore the Prometheus metrics exposed by Tyk Pump and use them in a Grafana dashboard.

[Image: SLOs for APIs managed by Tyk, Grafana dashboard]

Setup

  1. Run the up.sh script with the slo-prometheus-grafana parameter:

./up.sh slo-prometheus-grafana

  2. Generate traffic

K6 is used to generate traffic to the API endpoints. The load script load.js will run for 15 minutes.

./docker-compose-command.sh run k6 run /scripts/load.js

You will see K6 output in your terminal:

[Image: K6 output]

  3. Check out the dashboard in Grafana

Open Grafana in your browser (initial user/password: admin/admin) and open the dashboard called SLOs for APIs managed by Tyk.

You should see the data coming in:

[Image: tyk_grafana_initial]

You can also filter the data per API:

[Image: tyk_grafana_select_api]

How this works

[Image: slo_grafana]

Configuration

  • Tyk API Gateway is configured to expose two API endpoints.
  • K6 will use the load script load.js to generate demo traffic to the API endpoints
  • Tyk Pump is configured to expose a metrics endpoint for Prometheus (see config) with two custom metrics called tyk_http_requests_total and tyk_http_latency. Tyk Pump version 1.6 or later is required for custom metrics.
  • Prometheus
    • prometheus.yml is configured to automatically scrape Tyk Pump's metric endpoint
    • slos.rules.yml is used to calculate additional metrics needed for the remaining error budget
  • Grafana
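To get a feel for what Prometheus scrapes from Tyk Pump, the following is a minimal sketch that parses text in the Prometheus exposition format and sums tyk_http_requests_total per status code. The label names response_code and api_name, the API names and all sample values are assumptions made for illustration; check the actual output of your Tyk Pump metrics endpoint for the labels configured in this setup.

```python
import re
from collections import defaultdict

# One sample line of the Prometheus text format: name{label="value",...} 123
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def requests_by_status(metrics_text, metric="tyk_http_requests_total",
                       label="response_code"):
    """Sum the samples of one counter, grouped by the value of one label."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m is None or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        totals[labels.get(label, "unknown")] += float(m.group("value"))
    return dict(totals)


# Hypothetical scrape output; label names and values are invented.
SAMPLE = '''\
# HELP tyk_http_requests_total HTTP requests handled by the gateway
# TYPE tyk_http_requests_total counter
tyk_http_requests_total{api_name="api-a",response_code="200"} 950
tyk_http_requests_total{api_name="api-a",response_code="500"} 50
tyk_http_requests_total{api_name="api-b",response_code="200"} 100
'''

print(requests_by_status(SAMPLE))
```

Passing label="api_name" instead groups the same counter per API, which mirrors the per-API filter in the dashboard.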

SLIs and SLOs

The definitions and examples are inspired by https://sre.google/workbook/slo-document/, https://landing.google.com/sre/workbook/chapters/alerting-on-slos/ and https://github.com/google/prometheus-slo-burn-example/blob/master/prometheus/slos.rules.yml.

You will see different indicators displayed on the Grafana dashboard.

To calculate the SLO and the displayed error budget remaining, we use the following SLI/SLO:

  • SLI: the proportion of successful HTTP requests, as measured from Tyk API Gateway
    • Any HTTP status other than 500–599 is considered successful.
    • count of http_requests which do not have a 5XX status code divided by count of all http_requests
  • SLO: 95% successful requests
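The SLI definition above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the Prometheus query used by the dashboard; the request counts are made-up numbers.

```python
SLO_TARGET = 0.95  # 95% successful requests, as stated above


def is_success(status_code):
    """Any HTTP status outside 500-599 counts as successful."""
    return not 500 <= status_code <= 599


def availability_sli(counts_by_status):
    """Proportion of successful requests; None when there was no traffic."""
    total = sum(counts_by_status.values())
    if total == 0:
        return None
    good = sum(n for code, n in counts_by_status.items() if is_success(code))
    return good / total


# Made-up request counts per HTTP status code.
counts = {200: 940, 404: 20, 500: 30, 503: 10}
sli = availability_sli(counts)
print(f"SLI = {sli:.3f}, SLO met: {sli >= SLO_TARGET}")
```

Note that the 404 responses count as successful here, because only 5XX responses are treated as errors.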

In slos.rules.yml we calculate the ratio of errors per request over the last 10 minutes in job:slo_errors_per_request:ratio_rate10m. With job:error_budget:remaining we calculate the remaining error budget as a percentage, which is what the Grafana dashboard displays. The dashboard uses a threshold of 95%: every value below 95% is shown in red.
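The arithmetic behind a windowed error ratio can be sketched as follows. This is a simplification under stated assumptions, not a copy of slos.rules.yml: it takes two readings of each counter at the start and end of a window, whereas Prometheus' rate() also handles counter resets and extrapolation. The function success_percent assumes that the displayed percentage is 100 × (1 − error ratio), which would fit a 95% threshold but is not confirmed by the text above; budget_left_percent shows an alternative definition that normalises by the allowed error ratio (1 − SLO). All function names and numbers are hypothetical.

```python
SLO_TARGET = 0.95          # SLO from the text
THRESHOLD_PERCENT = 95.0   # dashboard threshold from the text


def error_ratio(errors_start, errors_end, total_start, total_end):
    """Errors per request over a window, from two readings of each counter."""
    d_total = total_end - total_start
    if d_total <= 0:
        return 0.0  # no traffic in the window: treat as no errors
    return (errors_end - errors_start) / d_total


def success_percent(ratio):
    """ASSUMPTION: percentage shown is 100 * (1 - error ratio)."""
    return 100.0 * (1.0 - ratio)


def budget_left_percent(ratio, slo=SLO_TARGET):
    """Alternative: share of the allowed error ratio (1 - slo) still unused."""
    return 100.0 * (1.0 - ratio / (1.0 - slo))


# Made-up counter readings at the start and end of a 10-minute window.
ratio = error_ratio(errors_start=120, errors_end=150,
                    total_start=9000, total_end=10000)
pct = success_percent(ratio)
print(f"error ratio: {ratio:.3f}")
print(f"success percent: {pct:.1f} (red: {pct < THRESHOLD_PERCENT})")
print(f"budget left (normalised): {budget_left_percent(ratio):.1f}")
```

The two percentages answer different questions: the first says how many requests succeeded, the second how much of the permitted 5% of errors is still available.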