
Ported monitoring stack to k3s #449

Open

wants to merge 113 commits into base: main

Conversation

@wtripp180901 (Contributor) commented Oct 14, 2024

The monitoring stack (Prometheus, node exporter, Grafana, Alertmanager) binary installs have been removed from site and fatimage; the kube-prometheus-stack Helm chart is now installed into the k3s cluster during the site run. Containers are pre-pulled by podman and exported into k3s during the fatimage build.
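
For illustration, the Helm install step implies something along these lines; this is a hedged sketch, not the PR's actual code, and the chart repository, namespace, kubeconfig path and values variable are assumptions:

```yaml
# Hedged sketch: install kube-prometheus-stack into the k3s cluster with kubernetes.core.helm.
# Repo URL, namespace, kubeconfig path and values variable are assumptions, not taken from this PR.
- name: Install kube-prometheus-stack Helm chart
  kubernetes.core.helm:
    name: kube-prometheus-stack
    chart_ref: kube-prometheus-stack
    chart_repo_url: https://prometheus-community.github.io/helm-charts
    release_namespace: monitoring
    create_namespace: true
    kubeconfig: /etc/rancher/k3s/k3s.yaml  # default k3s kubeconfig location
    values: "{{ kube_prometheus_stack_chart_values | default({}) }}"  # hypothetical variable
```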

As a consequence, the grafana, alertmanager and node exporter groups have been removed, and the associated roles are now all managed by the prometheus role (short for kube_prometheus_stack).

The metrics collected by node exporter have also been reduced to the minimal set described in docs/monitoring-and-logging.README.md, which was previously documented but not implemented.
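
Such a reduction could be expressed as a chart values override roughly like the following (a hedged sketch; the actual collector list is defined in docs/monitoring-and-logging.README.md and is not reproduced here):

```yaml
# Hedged sketch: disable node exporter's default collectors and re-enable a minimal set.
# The specific collectors shown are examples, not the list used by this PR.
prometheus-node-exporter:
  extraArgs:
    - --collector.disable-defaults
    - --collector.cpu
    - --collector.meminfo
    - --collector.filesystem
```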

Note that because of how OOD's proxying interacts with Grafana's server config and Kubernetes, enabling OOD means that Grafana is only accessible through the OOD proxy. In the caas environment, this means that accessing Grafana requires authenticating with OOD's basic auth. Accessing Grafana through caas therefore no longer logs you in as the admin user; instead you access the dashboards anonymously.
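
For context, this behaviour corresponds roughly to Grafana chart values like the following (a hedged sketch; the proxy prefix is a placeholder, not the path used by this PR):

```yaml
# Hedged sketch: serve Grafana from a sub-path behind the OOD proxy and allow anonymous viewing.
# The root_url prefix is a placeholder, not taken from this PR.
grafana:
  grafana.ini:
    server:
      root_url: "%(protocol)s://%(domain)s/ood-proxy-prefix/"  # hypothetical OOD proxy path
      serve_from_sub_path: true
    auth.anonymous:
      enabled: true
      org_role: Viewer
```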

Tests as of 63c3094:

  • Prometheus data from cloudalchemy roles successfully migrated to containerised Prometheus, although it will likely be under a different job label than the one kube-prometheus-stack is hardcoded to use
  • Reimage and upgrade from cloudalchemy: done, legacy data needed permission changes
  • Final caas test: TODO

@sjpb (Collaborator) commented Nov 15, 2024

@wtripp180901 not a high priority, but it would be nice to know if this PR reduces the size of the data in the image, and/or whether we can reduce the required root disk size at all - which isn't the same thing, because e.g. dnf caches which we throw away still require additional space during the build.

I think you'd need qemu-img info to see the former. And monitoring disk usage during build to see the latter.
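
A hedged sketch of how those two checks might be expressed (image name and paths are placeholders, not taken from this PR):

```yaml
# Hedged sketch: report the built image's virtual vs. on-disk size, and snapshot root
# filesystem usage on the build host during the fatimage build. Names are illustrative only.
- name: Report virtual and on-disk size of the built image
  ansible.builtin.command: qemu-img info --output json openhpc-RL9.qcow2  # hypothetical image name
  register: image_info
  changed_when: false
  delegate_to: localhost

- name: Snapshot root filesystem usage during the build
  ansible.builtin.command: df -h /
  register: build_disk_usage
  changed_when: false
```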

Base automatically changed from feature/k3s-ansible-init to main November 19, 2024 09:58

@sjpb (Collaborator) commented Jan 15, 2025

@wtripp180901 when we get back to this we should look at whether https://github.com/stackhpc/ansible-slurm-appliance/blob/main/environments/common/inventory/group_vars/all/systemd.yml needs to be modified.

@wtripp180901 (Contributor, Author) commented Jan 20, 2025

Hit an issue with Prometheus when upgrading an existing cluster, don't merge yet.

@wtripp180901 (Contributor, Author)
Now Prometheus is getting the wrong IP for OOD.

@wtripp180901 (Contributor, Author)
^ It might be nice to have some CI to catch targets being down from Prometheus' point of view.
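
A minimal sketch of the kind of check such CI could run (the Prometheus address and variable names are assumptions, not taken from this PR):

```yaml
# Hedged sketch: fail if any Prometheus target reports up == 0.
# The address is a placeholder; adjust to wherever the kube-prometheus-stack Prometheus is exposed.
- name: Check that no Prometheus targets are down
  ansible.builtin.uri:
    url: "http://{{ prometheus_address | default('localhost:9090') }}/api/v1/query?query=up==0"
    return_content: true
  register: down_targets
  failed_when: (down_targets.json.data.result | length) > 0
```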

@sjpb (Collaborator) commented Jan 22, 2025

@wtripp180901 also re TASK [systemd : Add dropins for unit files]: it's a bit worrying that this is a change when running site.yml:

Wednesday 22 January 2025  16:54:02 +0000 (0:00:01.746)       0:00:15.618 ***** 
changed: [RL9-control] => (item={'key': 'opensearch', 'value': {'group': 'opensearch', 'content': '[Unit]\nRequiresMountsFor=/var/lib/state\n'}})
changed: [RL9-control] => (item={'key': 'grafana-server', 'value': {'group': 'grafana', 'content': '[Unit]\nRequiresMountsFor=/var/lib/state\n'}})
changed: [RL9-control] => (item={'key': 'slurmdbd', 'value': {'group': 'openhpc', 'content': '[Unit]\nRequiresMountsFor=/var/lib/state\n'}})
changed: [RL9-control] => (item={'key': 'slurmctld', 'value': {'group': 'openhpc', 'content': '[Unit]\nRequiresMountsFor=/var/lib/state\n'}})
changed: [RL9-control] => (item={'key': 'prometheus', 'value': {'group': 'prometheus', 'content': '[Unit]\nRequiresMountsFor=/var/lib/state\n'}})

If we're using /var/lib/state for the kube-prom-stack, we should ensure the relevant dropins are installed in the image, so that when k3s comes up on boot it waits until the volume is mounted before starting 😬
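
A hedged sketch of what such a dropin entry might look like, following the shape of the items in the task output above (the enclosing variable name is not visible in the output, so this is illustrative only):

```yaml
# Hedged sketch: a k3s dropin entry mirroring the existing ones shown above,
# so the k3s unit waits for /var/lib/state to be mounted before starting.
k3s:
  group: k3s
  content: |
    [Unit]
    RequiresMountsFor=/var/lib/state
```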

Development

Successfully merging this pull request may close these issues:

SELinux not disabled by default, causes Prometheus install to fail