Ported monitoring stack to k3s #449
base: main
Conversation
Co-authored-by: Steve Brasier <[email protected]>
…slurm-appliance into feature/k3s-monitoring
@wtripp180901 not a high priority, but it would be nice to know whether this PR reduces the size of the data in the image. And/or whether we can reduce the required root disk size at all - which isn't the same thing, because e.g. dnf caches which we throw away still require additional space during build. I think you'd need `qemu-img info` to see the former, and to monitor disk usage during build to see the latter.
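For reference, a sketch of the commands involved (the image path and poll interval are illustrative, not from this repo):

```shell
# Compare the image's virtual size vs. actual on-disk size after build
# (image filename is hypothetical)
qemu-img info openhpc-image.qcow2

# During the build, poll root disk usage inside the builder VM to find
# the peak usage that dnf caches etc. contribute
watch -n 30 df -h /
```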
@wtripp180901 when we get back to this we should look at whether https://github.com/stackhpc/ansible-slurm-appliance/blob/main/environments/common/inventory/group_vars/all/systemd.yml needs to be modified.
Now Prometheus is getting the wrong IP for OOD.
^ might be nice to have some CI to catch targets being down from Prometheus' POV
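A minimal sketch of what such a CI check could look like, assuming it queries Prometheus' standard `/api/v1/targets` endpoint (the endpoint is real; the function name and sample data are illustrative):

```python
import json


def down_targets(targets_json):
    """Return (job, instance) pairs for scrape targets Prometheus reports as not 'up'.

    targets_json is the parsed JSON body of GET /api/v1/targets.
    """
    return [
        (t["labels"].get("job", "?"), t["labels"].get("instance", "?"))
        for t in targets_json["data"]["activeTargets"]
        if t["health"] != "up"
    ]


if __name__ == "__main__":
    # In CI this would come from the live API, e.g.
    # requests.get("http://<prometheus>:9090/api/v1/targets").json()
    sample = {"data": {"activeTargets": [
        {"labels": {"job": "node", "instance": "ood:9100"}, "health": "down"},
        {"labels": {"job": "node", "instance": "login:9100"}, "health": "up"},
    ]}}
    bad = down_targets(sample)
    if bad:
        raise SystemExit(f"targets down: {bad}")
```

The CI job would then fail whenever any configured target (such as the OOD node exporter above) is down from Prometheus' point of view.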
@wtripp180901 also re TASK [systemd : Add dropins for unit files]: it's a bit worrying that this is a change when running site.yml:
If we're using /var/lib/state for the kube-prom-stack, we should ensure the relevant dropins are installed in the image, so that when k3s comes up on boot it waits until the volume is mounted before starting 😬
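A minimal sketch of such a dropin, assuming the k3s unit is named `k3s.service` and the state volume is mounted at `/var/lib/state` (the dropin filename is hypothetical):

```ini
# /etc/systemd/system/k3s.service.d/wait-for-state.conf
# Hypothetical dropin: delay k3s startup until the state volume is mounted
[Unit]
RequiresMountsFor=/var/lib/state
```

`RequiresMountsFor` makes the unit both order after and require the mount unit for that path, so k3s will not start against an empty mountpoint.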
Monitoring stack (Prometheus/node exporter/Grafana/Alertmanager) binary installs have been removed from site and fatimage; the site run now installs the kube-prometheus-stack Helm chart into the k3s cluster instead. Containers are pre-pulled by podman and exported into k3s during the fatimage build.
As a consequence, the `grafana`, `alertmanager` and `node exporter` groups have been removed, and the associated roles are now all managed by the `prometheus` role, which is short for `kube_prometheus_stack`. Also reduced the metrics collected by node exporter down to the minimal set described in `docs/monitoring-and-logging.README.md`, which was previously unimplemented.

Note that because of how OOD's proxying interacts with Grafana's server config and Kubernetes, enabling OOD means that Grafana is only accessible through the OOD proxy. In the caas environment, this means that accessing Grafana requires authenticating with OOD's basic auth. Therefore, accessing Grafana through caas no longer logs you in as the admin user; you instead access the dashboards anonymously.
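A hedged sketch of the flow described above; the chart and repo names follow the upstream kube-prometheus-stack conventions, while the release name, namespace and example image tag are illustrative rather than taken from this PR:

```shell
# During the site run: install the chart into the k3s cluster
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

# During the fatimage build: pre-pull an image with podman and
# import it into k3s's embedded containerd image store
podman pull quay.io/prometheus/prometheus:v2.50.0   # tag illustrative
podman save quay.io/prometheus/prometheus:v2.50.0 | k3s ctr images import -
```

Pre-importing the images this way means a deployed cluster does not need to reach external registries on first boot.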
Tests as of 63c3094: