Add role to install NVIDIA DOCA on top of an existing "fat" image #492

Merged: sjpb merged 23 commits into main from feat/doca-v2 on Dec 12, 2024

@sjpb (Collaborator) commented Dec 10, 2024

  • Adds a new doca role to install NVIDIA DOCA:

    • It is only run during image builds where inventory_groups includes doca.
    • It may be run on an existing "fat image", e.g. as part of an "extra" build.
    • It rebuilds the DOCA kernel packages to match the installed kernel.

    This role should be preferred over the ofed role, which may be deprecated at a later date.
    NB: This uses DOCA packages from upstream repos, not StackHPC's ark.

  • Adds a workflow to test the DOCA build during CI on the current RL8/RL9 fat images.
    Unless run manually, the built image is deleted on completion.
    NB: the resulting DOCA image is not tested by CI.

  • Simplifies the configuration for packer builds:

    • Only a single build is now defined. Behaviour is controlled via variables, rather than by selecting a build.
    • All variables are now simple (instead of some being maps keyed by the OS version).
    • The inventory groups to add the build VM to are now specified via inventory_groups, which takes a comma-separated list (instead of groups, which took a map).
  • All image build workflows have been adjusted to use the new packer configuration.
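
To make the comma-separated `inventory_groups` convention concrete, here is a minimal Python sketch. It is not the appliance's implementation (the real logic lives in the packer template and Ansible). The helper names and the template filename `build.pkr.hcl` are hypothetical; only the `inventory_groups` variable name, the `doca` group, and packer's `-var 'key=value'` syntax are taken as given.

```python
def parse_inventory_groups(value: str) -> list[str]:
    """Split a comma-separated group string into a list of group names.

    Whitespace around names is stripped and empty entries are dropped,
    so an empty string yields an empty list (the "empty groups" case).
    """
    return [name.strip() for name in value.split(",") if name.strip()]


def doca_enabled(inventory_groups: str) -> bool:
    """Return True only when 'doca' is one of the listed groups.

    This mirrors the rule in the description above: the doca role is
    only run for builds whose inventory_groups includes doca.
    """
    return "doca" in parse_inventory_groups(inventory_groups)


def packer_build_argv(groups: list[str], template: str = "build.pkr.hcl") -> list[str]:
    """Build an argv list for `packer build` with a single simple variable.

    `template` is a hypothetical filename used for illustration only.
    """
    joined = ",".join(groups)
    return ["packer", "build", "-var", f"inventory_groups={joined}", template]
```

For example, `packer_build_argv(["doca"])` produces the arguments for a build that adds the VM to the `doca` group only, which is the "extra" build case described above. Passing a flat string rather than a map keyed by OS version is what allows a single build definition to cover all cases.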

The following manually-triggered checks related to packer configuration changes have been completed:

  • Check build works with empty groups
  • Check fatimage workflow works
  • Check nightly workflow works

Ticket: PLATFORM-537

@sjpb sjpb mentioned this pull request Dec 10, 2024
@sjpb sjpb force-pushed the feat/doca-v2 branch 2 times, most recently from b92641c to d2c387c Compare December 10, 2024 21:08
@sjpb sjpb changed the title from 'add doca role run by fatimage' to 'Add role to install Mellanox DOCA on top of an existing "fat" image' Dec 10, 2024
@sjpb sjpb force-pushed the feat/doca-v2 branch 2 times, most recently from a694dab to 066d31f Compare December 10, 2024 21:18
@sjpb sjpb force-pushed the feat/doca-v2 branch 2 times, most recently from 1e54cbc to 5fbed66 Compare December 11, 2024 10:38
@sjpb sjpb force-pushed the feat/doca-v2 branch 4 times, most recently from 73d9f33 to e3af80a Compare December 11, 2024 11:19
@sjpb (Collaborator, Author) commented Dec 11, 2024

Fat image build: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/12277251098. NB: the doca workflows fail here because the previous fat image has the signature_verified property.

@sjpb (Collaborator, Author) commented Dec 11, 2024

Note that the above didn't actually build DOCA for some reason, even though the workflow ran with the right groups.

@sjpb (Collaborator, Author) commented Dec 11, 2024

Tested the nightly build here: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/12280868395
It at least started running Ansible, and I checked that it had the correct groups. I cancelled it before it could produce an image.

@sjpb sjpb marked this pull request as ready for review December 11, 2024 16:58
@sjpb sjpb requested a review from a team as a code owner December 11, 2024 16:58
@sjpb sjpb changed the title from 'Add role to install Mellanox DOCA on top of an existing "fat" image' to 'Add role to install NVIDIA DOCA on top of an existing "fat" image' Dec 11, 2024
bertiethorpe previously approved these changes Dec 11, 2024

@bertiethorpe (Member) left a comment

Packer and Workflows LGTM

@bertiethorpe (Member) left a comment

Docs LGTM

@sjpb sjpb merged commit efd2883 into main Dec 12, 2024
7 checks passed
@sjpb sjpb deleted the feat/doca-v2 branch December 12, 2024 13:40