Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

T6231: Mellanox OFED #665

Merged
merged 5 commits into from
Jul 25, 2024
Merged

Conversation

sempervictus
Copy link

Change Summary

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes)
  • Migration from an old Vyatta component to vyos-1x, please link to related PR inside obsoleted component
  • Other (please describe):

Related Task(s)

T6231

Component(s) name

Mellanox (Nvidia) OFED kernel and userspace

Proposed changes

OFED stack: drivers, userspace utilities, libraries, and DPDK/VPP interfaces.

How to test

  1. run included script in build stack
  2. deploy the debs with the kernel built from the referenced sources
  3. boot on host with ConnectX card
  4. verify which module is loaded and features available

Checklist:

  • I have read the CONTRIBUTING document
  • I have linked this PR to one or more Phabricator Task(s)
  • My commit headlines contain a valid Task id
  • My change requires a change to the documentation
  • I have updated the documentation accordingly

@sever-sever
Copy link
Member

There are unrelated to this task commits
Why do you add them?

@sempervictus
Copy link
Author

sempervictus commented Jun 21, 2024

Git fun it seems :-/ - I just checked out -b and cherry picked my commit on it then added the Jenkinsfile one... Not sure how GH turned that into this. Rebased

@zdc
Copy link
Contributor

zdc commented Jun 21, 2024

Are there any examples of specific issues with upstream drivers fixed in OFED, or important for VyOS case benefits available in OFED and not available upstream?

@sempervictus
Copy link
Author

,

Are there any examples of specific issues with upstream drivers fixed in OFED, or important for VyOS case benefits available in OFED and not available upstream?

Addresses a few things:
Ever try to get line rate comms above connectx4, hw offload for complex functions, or vendor support when using upstream? First thing nv says in the support process is to use their drivers... We were struggling to route/nat more than 40gbit on upstream, were able to get nearly 200 going through inside lacp bond through outside lacp bond on OFED on cx6 and 7+ are a whole other ballgame. Its not as bad as difference between nouveau and actual GPU drivers from nv but well worth the change despite making kCFI that much more of a challenge (they hardcode GCC, for now) since this is a NOS.

@zdc
Copy link
Contributor

zdc commented Jun 21, 2024

We were struggling to route/nat more than 40gbit on upstream, were able to get nearly 200 going through inside lacp bond through outside lacp bond on OFED on cx6 and 7+

Can you confirm that in the described case a driver replacement was the only difference? The same device, same connections, same performance tester, same configuration, just another driver?

I have no objections against adding OFED drivers, but we need to understand the real profit from this.

@sempervictus
Copy link
Author

Yes, same device rebooted to an image containing ofed

Build OFED drivers and userspace components against the kernel
source tree similar to Intel's NIC drivers.

OFED installers create Debian packages of their own tageting the
kernel version defined in the build invocation if DKMS is omitted.
Script builds with supporting components for VPP to permit handoff
of function to the underlying hardware as appropriate. Updating the
version is fairly trivial along with adding patching as needed to
handle kCFI and hardening measures as they are introduced.

Testing:
  Tested against GCC-built Linux Hardened kernel with the various
additions from PR 132 - sustained line-rate testing against 4x100g
links on a single machine at a hair below 200g for each LACP pair.
@sempervictus sempervictus force-pushed the feature/mellanox_ofed_drivers branch from edd5236 to c0365df Compare June 22, 2024 02:45
Copy link
Member

@c-po c-po left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can both drivers co-exist or does the stock driver needs to be removed from the Kernel?

How does the system determine which to load? Or dies it go the usual way in detecting "there is a newer driver in /lib/modules/6.6.35/updates/dkms?

Issue: -amd64-vyos is missing from the installation path of the new modules, see KERNEL_SUFFIX environment variable also present in packages/linux-kernel/kernel-vers

@sempervictus
Copy link
Author

Can both drivers co-exist or does the stock driver needs to be removed from the Kernel?

both can coexist and you can insmod by path if you want to use the upstream ones.

How does the system determine which to load? Or dies it go the usual way in detecting "there is a newer driver in /lib/modules/6.6.35/updates/dkms?

the updates paths are how DKMS and such do this as well - upstream Linux semantic

Issue: -amd64-vyos is missing from the installation path of the new modules, see KERNEL_SUFFIX environment variable also present in packages/linux-kernel/kernel-vers

i havent found this to be an issue on any distribution over the ages and all of our kernels on all of our own OS forks and upstream ones utilize suffixes while DKMS and many out-of-tree drivers don't. Remedy would require changing the vendors source which is doable, but adds hassle/maintenance burden as their code changes rev to rev.

On that note, the suffix thing is rather constraining as implemented - we use different suffixes to denote specific kernel build types and tag our images with them (LLVM+hardened, GCC+hardened, GCC+grsec/pax, etc) because we cannot build an image with multiple kernels to let the user choose what to boot. IIRC this was was one of the big blockers to #132 - we couldn't get people to test anything to figure out what and how we should implement. I do have a much better "piecemeal it in" strategy for that code moving forward (actually been testing with this PR in our builders - linux-hardened and the xtables stuff drops in nicely now though i think we should just pull from linux-hardened for the source tree and adjust the version tag) but we still have to build dedicated images with those kernels as opposed to making a boot option (in part due to how the build scripts generate initramfs)

c-po added 4 commits July 25, 2024 15:30
All VyOS kernel modules must live in the appropriate module directory,
example: /lib/modules/6.6.41-amd64-vyos/

In addition we do not abbreviate script options to make reading easier,
without call --help all the time.
@c-po
Copy link
Member

c-po commented Jul 25, 2024

Moved drivers to appropriate directory and added some other minor changes to this pr

-rw-r--r-- root/root 5218104 2012-12-31 13:38 ./lib/modules/6.6.41-amd64-vyos/updates/dkms/mlx5_core.ko

Copy link

👍
No issues in PR Title / Commit Title

@c-po c-po enabled auto-merge July 25, 2024 13:37
@c-po c-po merged commit 02379d3 into vyos:current Jul 25, 2024
8 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

4 participants