
Ported monitoring stack to k3s #449

Open

wants to merge 113 commits into base: main

Conversation

@wtripp180901 (Contributor) commented Oct 14, 2024

The monitoring stack (Prometheus, node exporter, Grafana, Alertmanager) binary installs have been removed from site and fatimage; the kube-prometheus-stack Helm chart is now installed into the k3s cluster during the site run. Containers are pre-pulled by podman and exported into k3s during the fatimage build.
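
For illustration, the Helm install step implies something along these lines; this is a hedged sketch, not the PR's actual code, and the chart repository, namespace, kubeconfig path and values variable are assumptions:

```yaml
# Hedged sketch: install kube-prometheus-stack into the k3s cluster with kubernetes.core.helm.
# Repo URL, namespace, kubeconfig path and values variable are assumptions, not taken from this PR.
- name: Install kube-prometheus-stack Helm chart
  kubernetes.core.helm:
    name: kube-prometheus-stack
    chart_ref: kube-prometheus-stack
    chart_repo_url: https://prometheus-community.github.io/helm-charts
    release_namespace: monitoring
    create_namespace: true
    kubeconfig: /etc/rancher/k3s/k3s.yaml  # default k3s kubeconfig location
    values: "{{ kube_prometheus_stack_chart_values | default({}) }}"  # hypothetical variable
```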

As a consequence, the grafana, alertmanager and node exporter groups have been removed, and the associated roles are now all managed by the prometheus role (short for kube_prometheus_stack).

The metrics collected by node exporter have also been reduced to the minimal set described in docs/monitoring-and-logging.README.md, which was previously documented but not implemented.
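
Such a reduction could be expressed as a chart values override roughly like the following (a hedged sketch; the actual collector list is defined in docs/monitoring-and-logging.README.md and is not reproduced here):

```yaml
# Hedged sketch: disable node exporter's default collectors and re-enable a minimal set.
# The specific collectors shown are examples, not the list used by this PR.
prometheus-node-exporter:
  extraArgs:
    - --collector.disable-defaults
    - --collector.cpu
    - --collector.meminfo
    - --collector.filesystem
```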

Note that because of how OOD's proxying interacts with Grafana's server config and Kubernetes, enabling OOD means that Grafana is only accessible through the OOD proxy. In the caas environment, this means that accessing Grafana requires authenticating with OOD's basic auth. Accessing Grafana through caas therefore no longer logs you in as the admin user; instead you access the dashboards anonymously.
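
For context, this behaviour corresponds roughly to Grafana chart values like the following (a hedged sketch; the proxy prefix is a placeholder, not the path used by this PR):

```yaml
# Hedged sketch: serve Grafana from a sub-path behind the OOD proxy and allow anonymous viewing.
# The root_url prefix is a placeholder, not taken from this PR.
grafana:
  grafana.ini:
    server:
      root_url: "%(protocol)s://%(domain)s/ood-proxy-prefix/"  # hypothetical OOD proxy path
      serve_from_sub_path: true
    auth.anonymous:
      enabled: true
      org_role: Viewer
```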

Tests as of 63c3094:

  • Prometheus data from cloudalchemy roles successfully migrated to containerised Prometheus, although it will likely be under a different job label than the one kube-prometheus-stack is hardcoded to use
  • Reimage and upgrade from cloudalchemy: done, legacy data needed permission changes
  • Final caas test: TODO

@sjpb (Collaborator) commented Nov 15, 2024

@wtripp180901 not a high priority, but it would be nice to know if this PR reduces the size of the data in the image, and/or whether we can reduce the required root disk size at all - which isn't the same thing, because e.g. dnf caches which we throw away still require additional space during the build.

I think you'd need qemu-img info to see the former. And monitoring disk usage during build to see the latter.
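
A hedged sketch of how those two checks might be expressed (image name and paths are placeholders, not taken from this PR):

```yaml
# Hedged sketch: report the built image's virtual vs. on-disk size, and snapshot root
# filesystem usage on the build host during the fatimage build. Names are illustrative only.
- name: Report virtual and on-disk size of the built image
  ansible.builtin.command: qemu-img info --output json openhpc-RL9.qcow2  # hypothetical image name
  register: image_info
  changed_when: false
  delegate_to: localhost

- name: Snapshot root filesystem usage during the build
  ansible.builtin.command: df -h /
  register: build_disk_usage
  changed_when: false
```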

Base automatically changed from feature/k3s-ansible-init to main November 19, 2024 09:58

@sjpb (Collaborator) commented Jan 15, 2025

@wtripp180901 when we get back to this we should look at whether https://github.com/stackhpc/ansible-slurm-appliance/blob/main/environments/common/inventory/group_vars/all/systemd.yml needs to be modified.

@wtripp180901 (Contributor, Author) commented Jan 20, 2025

Hit an issue with Prometheus when upgrading an existing cluster, don't merge yet.

@wtripp180901 (Contributor, Author)
Now Prometheus is getting the wrong IP for OOD.

@wtripp180901 (Contributor, Author)
^ It might be nice to have some CI to catch targets being down from Prometheus' point of view.
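
A minimal sketch of the kind of check such CI could run (the Prometheus address and variable names are assumptions, not taken from this PR):

```yaml
# Hedged sketch: fail if any Prometheus target reports up == 0.
# The address is a placeholder; adjust to wherever the kube-prometheus-stack Prometheus is exposed.
- name: Check that no Prometheus targets are down
  ansible.builtin.uri:
    url: "http://{{ prometheus_address | default('localhost:9090') }}/api/v1/query?query=up==0"
    return_content: true
  register: down_targets
  failed_when: (down_targets.json.data.result | length) > 0
```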

@sjpb (Collaborator) commented Jan 22, 2025

@wtripp180901 also re TASK [systemd : Add dropins for unit files]: it's a bit worrying that this is a change when running site.yml:

Wednesday 22 January 2025  16:54:02 +0000 (0:00:01.746)       0:00:15.618 ***** 
changed: [RL9-control] => (item={'key': 'opensearch', 'value': {'group': 'opensearch', 'content': '[Unit]\nRequiresMountsFor=/var/lib/state\n'}})
changed: [RL9-control] => (item={'key': 'grafana-server', 'value': {'group': 'grafana', 'content': '[Unit]\nRequiresMountsFor=/var/lib/state\n'}})
changed: [RL9-control] => (item={'key': 'slurmdbd', 'value': {'group': 'openhpc', 'content': '[Unit]\nRequiresMountsFor=/var/lib/state\n'}})
changed: [RL9-control] => (item={'key': 'slurmctld', 'value': {'group': 'openhpc', 'content': '[Unit]\nRequiresMountsFor=/var/lib/state\n'}})
changed: [RL9-control] => (item={'key': 'prometheus', 'value': {'group': 'prometheus', 'content': '[Unit]\nRequiresMountsFor=/var/lib/state\n'}})

If we're using /var/lib/state for the kube-prom-stack, we should ensure the relevant dropins are installed in the image, so that when k3s comes up on boot it waits until the volume is mounted before starting 😬
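
A hedged sketch of what such a dropin entry might look like, following the shape of the items in the task output above (the enclosing variable name is not visible in the output, so this is illustrative only):

```yaml
# Hedged sketch: a k3s dropin entry mirroring the existing ones shown above,
# so the k3s unit waits for /var/lib/state to be mounted before starting.
k3s:
  group: k3s
  content: |
    [Unit]
    RequiresMountsFor=/var/lib/state
```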

Development

Successfully merging this pull request may close these issues:

SELinux not disabled by default, causes Prometheus install to fail