WIP: Containerise prometheus #308

Open
wants to merge 15 commits into
base: main

sjpb (Collaborator) commented Sep 8, 2023

  • Bumps prometheus from 2.27.0 to 2.48.1
  • Adds a new prometheus Ansible role to run Prometheus containerised using podman/systemd:
    • Supports the previous cloudalchemy.prometheus role config options in use by the appliance.
    • Provides install.yml and runtime.yml playbooks so that image build and runtime configuration can be run separately, for speed.
  • Extends the podman role with service templates from azimuth-images, to start convergence of the codebases.

WIP, with problems - see https://wiki.stackhpc.com/doc/containerised-prometheus-oLfxe5Es6K for notes

TODO:

  • Test fat image build
  • Test upgrade on Azimuth


sjpb commented Jan 4, 2024

Tested that an update from bfa719f (via branch switch and re-running site.yml) works, i.e. monitoring data from before the update is still visible afterwards.

@sjpb
Copy link
Collaborator Author

sjpb commented Jan 4, 2024

Fat image build: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/7411834185 - openhpc-240104-1602-262d12b5

…an-run-1001/libpod/tmp/pause.pid existing in image, causing podman startup failure
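The commit title above points at a `libpod/tmp/pause.pid` file baked into the image as the cause of the podman startup failure. As a rough sketch only (not the fix used in this PR), an image-build cleanup step could remove any such file before the image is captured. The function name `remove_stale_pause_pids` and the idea of scanning from a caller-supplied root directory are assumptions made for illustration; the truncated path in the commit title is not reconstructed here.

```python
from pathlib import Path


def remove_stale_pause_pids(root):
    """Delete every .../libpod/tmp/pause.pid under root; return the removed paths."""
    removed = []
    # Materialise the matches first so the tree is not modified while it is walked.
    for path in list(Path(root).rglob("pause.pid")):
        in_libpod_tmp = (
            path.parent.name == "tmp" and path.parent.parent.name == "libpod"
        )
        if path.is_file() and in_libpod_tmp:
            path.unlink()
            removed.append(str(path))
    return sorted(removed)
```

Files named `pause.pid` outside a `libpod/tmp` directory are deliberately left alone, so the step cannot remove anything it was not aimed at.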

sjpb commented Jan 5, 2024


sjpb commented Jan 9, 2024

Checked at 3cff73b that upgrading an Azimuth-deployed Slurm cluster to this branch works, on an Azimuth deployed from azimuth-config d2e6ee2 / v0.3.2:

  • Deployed Azimuth at azimuth-config v0.3.2
  • Deployed Slurm with hpctests
  • Checked prometheus: clicked through from the jobs list to the prometheus data for a job
  • Updated the caas slurm definition to 3cff73b and the corresponding image, then re-provisioned Azimuth
  • Patched the cluster
  • Checked prometheus: could still click through from the slurm jobs dashboard and see data for a job run before the patch.
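The check above was done by clicking through the dashboards. A scripted version of the same idea might look roughly like the sketch below, which uses the Prometheus HTTP API `query_range` endpoint and looks for samples older than the patch time. The helper names, the `up` query and the `http://localhost:9090` base URL in the usage note are illustrative assumptions, not something taken from this PR.

```python
from urllib.parse import urlencode


def build_query_range_url(base_url, query, start, end, step="60s"):
    """Build a Prometheus /api/v1/query_range URL for the given time window."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def has_samples_before(response, cutoff):
    """Return True if any series in a decoded query_range response has a
    sample with a timestamp earlier than cutoff (Unix seconds)."""
    if response.get("status") != "success":
        return False
    for series in response.get("data", {}).get("result", []):
        # Each entry in "values" is a [timestamp, "value-as-string"] pair.
        if any(float(ts) < cutoff for ts, _value in series.get("values", [])):
            return True
    return False
```

For example, fetching `build_query_range_url("http://localhost:9090", "up", start, end)` after the patch, decoding the JSON body and passing it to `has_samples_before` with the patch time as `cutoff` would indicate whether pre-patch data survived.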

@sjpb sjpb marked this pull request as ready for review January 9, 2024 12:07
@sjpb sjpb requested a review from a team as a code owner January 9, 2024 12:07