Add Support for NVIDIA Network Operator to the Kubernetes Scheduler #804
Comments
Thanks for the proposal. I think this is something worth adding. Happy to take a look at the PR if you're up to contribute!
How coupled are these to host types? One way to achieve this today is by defining your custom named resource (see docs) and adding the resources/devices as part of
Take a look at the AWS named resources as an example. The
the kubernetes scheduler in torchx runs with
How coupled are these to host types? (e.g. how often would a user need to configure this?) If the answer is "per-job" then we'd want to add this as part of the scheduler arguments.
And in |
Great, thanks for the feedback and pointers. I'd be open to putting together a PR if this seems the right thing to do.
I'm not sure what you mean by host types in this context, but the use-cases I'm interested in would always involve multiple hosts (K8s nodes) that are networked together via RDMA. The two network types,
Great, I'll try adding a custom resource in this way.
We've been using one container per node as well, so I think that assumption is OK to continue with.
OK, then I don't think the annotation would be needed for K8s, but has OCP (OpenShift) ever been tested/supported? That's one platform supported by the GPU Operator/Network Operator that's restricted security-wise.
It wouldn't change from job to job. The value is the name of network you'd like the pod to attach to, e.g., what you see from |
Great, looking forward to the PR! Since you're going to be registering custom named resources (this would be in your project), the only thing that would need to be done in the PR is to add the Follow ups below:
I was referring to a physical machine type. Looks like these are semi-static "resource" configurations (e.g. once you set up a couple of resource definitions, you can reuse them for the jobs you launch onto the cluster until the cluster's resources change: new host types added, deprecated ones removed, etc). So defining a few "named resources" that the user can select when launching the job would work nicely here.
Note that while torchx currently requires the named resource static factory methods to be no-argument, if you had a valid use-case for a dynamic resource parameter, you could set this in the custom component (e.g. write your own version of
In the context of k8s, the torchx named resource would map to an enumeration of the most commonly used
But, as in your case, there are valid use-cases where the user might want to define more than one named resource for a specific machine type in the cluster. And this is what I meant by "host type".
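To make the "more than one named resource per machine type" idea concrete, here is a minimal plain-Python sketch. It deliberately does not use the TorchX classes: `ResourceSketch`, `gpu_node`, `gpu_node_rdma`, `NAMED_RESOURCES`, and `resolve` are hypothetical names, and the CPU/GPU/memory sizes are made-up values. Only the `rdma/rdma_shared_device_a` resource name comes from this issue.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ResourceSketch:
    """Hypothetical stand-in for a resource definition (not the TorchX class)."""
    cpu: int
    gpu: int
    mem_mb: int
    devices: Dict[str, int] = field(default_factory=dict)


def gpu_node() -> ResourceSketch:
    # Made-up sizes for one machine type, without any RDMA device.
    return ResourceSketch(cpu=64, gpu=8, mem_mb=512_000)


def gpu_node_rdma() -> ResourceSketch:
    # Same machine type, plus one shared RDMA device.
    return ResourceSketch(
        cpu=64,
        gpu=8,
        mem_mb=512_000,
        devices={"rdma/rdma_shared_device_a": 1},
    )


# Registry of no-argument factories, keyed by the name a user would select.
NAMED_RESOURCES: Dict[str, Callable[[], ResourceSketch]] = {
    "gpu_node": gpu_node,
    "gpu_node_rdma": gpu_node_rdma,
}


def resolve(name: str) -> ResourceSketch:
    """Look up a named resource and call its no-argument factory."""
    return NAMED_RESOURCES[name]()
```

Both names describe the same physical host; the user picks the RDMA variant only for jobs that need the extra device.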
AFAIK no. I'm not too familiar with OCP but skimming over their docs it seems like we'd have to implement a |
OK. If we get there in the future, the Volcano FAQ mentions some modifications needed to run on OCP. |
With the network operator, when we configure a secondary network like this

    apiVersion: mellanox.com/v1alpha1
    kind: MacvlanNetwork
    metadata:
      name: macvlannetwork
    spec:
      networkNamespace: "default"
      master: "enp141s0f0np0"
      mode: "bridge"
      mtu: 1500
      ipam: |
        {
          "type": "whereabouts",
          "datastore": "kubernetes",
          "kubernetes": {
            "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
          },
          "range": "192.168.2.225/28",
          "log_file" : "/var/log/whereabouts.log",
          "log_level" : "info",
          "gateway": "192.168.2.1"
        }

The attached pods are then configured with an additional interface called

    [root@pod2 /]# ip -4 addr show
    . . .
    4: eth0@if800 < . . . >
        inet 192.168.32.54/32 scope global eth0
    5: net1@if26: < . . . >
        inet 192.168.2.226/28 brd 192.168.2.239 scope global net1

However, it seems that Volcano defines its service endpoints with addresses of the primary interfaces (ones that are not on the RDMA network). And TorchX then gives the
Does that problem description make sense? What I'd like is for the service endpoints to use the IP addresses of the secondary RDMA-based network, not the primary one. I will also ask the Network Operator team about this use-case.
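As a rough illustration of the distinction between the two addresses, the sketch below parses `ip -4 addr show` text like the output above and picks the address that falls inside the whereabouts range. This is only a way to identify the secondary address from inside a pod; it does not change what Volcano or TorchX advertise. The helper names (`parse_ipv4_addrs`, `address_in_range`) and the `SAMPLE` constant are hypothetical, and the parser assumes the plain-text layout shown in this comment.

```python
import ipaddress
import re
from typing import Dict, Optional

# Sample text mirroring the `ip -4 addr show` output quoted in this issue.
SAMPLE = """\
4: eth0@if800 < . . . >
    inet 192.168.32.54/32 scope global eth0
5: net1@if26: < . . . >
    inet 192.168.2.226/28 brd 192.168.2.239 scope global net1
"""

# "5: net1@if26: <...>" -> captures "net1" (the name before '@' or ':').
_IFACE = re.compile(r"^\d+:\s+([^\s:@]+)")
# "    inet 192.168.2.226/28 ..." -> captures the dotted-quad address.
_INET = re.compile(r"^\s*inet\s+(\d+\.\d+\.\d+\.\d+)/\d+")


def parse_ipv4_addrs(ip_output: str) -> Dict[str, str]:
    """Map interface name to its first IPv4 address."""
    addrs: Dict[str, str] = {}
    current: Optional[str] = None
    for line in ip_output.splitlines():
        iface = _IFACE.match(line)
        if iface:
            current = iface.group(1)
            continue
        inet = _INET.match(line)
        if inet and current is not None and current not in addrs:
            addrs[current] = inet.group(1)
    return addrs


def address_in_range(addrs: Dict[str, str], cidr: str) -> Optional[str]:
    """Return the first address that belongs to the given CIDR range, if any."""
    # strict=False accepts a host address such as 192.168.2.225/28.
    network = ipaddress.ip_network(cidr, strict=False)
    for addr in addrs.values():
        if ipaddress.ip_address(addr) in network:
            return addr
    return None
```

With the sample above, `address_in_range(parse_ipv4_addrs(SAMPLE), "192.168.2.225/28")` selects the `net1` address rather than the `eth0` one.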
It looks like TorchX facilitates communication between nodes by spinning up a K8s service and passing the name of the master pod's endpoint in
I believe this means that we'll have to pass in the IP address of the master pod a different way. How can we manually set |
It would be more in line with Kubernetes design principles to make the change in the network operator to support name resolution. You can see torchx/components/dist.py for an example of how the rdzv is set up there.
Description
Provide a way to use the NVIDIA Network Operator through the CLI and API of the Kubernetes scheduler.
Motivation/Background
The NVIDIA Network Operator enables RDMA devices and other fast networking components to be used in containerized environments. Fast networking is critical for the performance of workloads that span multiple nodes.
The network operator can provide access to RDMA devices by using either a MacvlanNetwork or a HostDeviceNetwork. This example shows how a pod can be attached to a MacvlanNetwork, and this example shows how a pod can be attached to a HostDeviceNetwork. In either case, the critical parts are:
- the resource request (rdma/rdma_shared_device_a: 1 or nvidia.com/hostdev: '1')
- the IPC_LOCK security context capability
- the network annotation (k8s.v1.cni.cncf.io/networks: rdma-net-ipam)
Detailed Proposal
Before detailing a specific proposal, I'd like to hear from the team about how feasible this sounds so far and whether any existing facilities might already help with some of this.
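For reference, the critical parts named under Motivation/Background (an RDMA resource, the IPC_LOCK capability, and the networks annotation) could land in a pod manifest roughly as sketched below. This is plain Python that builds a dictionary and is not tied to TorchX or the Kubernetes client; the function name `rdma_pod_manifest` and the pod and image names in the usage line are hypothetical.

```python
import json
from typing import Any, Dict


def rdma_pod_manifest(
    name: str,
    image: str,
    network: str,
    rdma_resource: str,
    count: int = 1,
) -> Dict[str, Any]:
    """Build a minimal pod manifest carrying the three RDMA-related settings."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Attach the pod to the secondary network.
            "annotations": {"k8s.v1.cni.cncf.io/networks": network},
        },
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Allow the container to lock memory.
                    "securityContext": {"capabilities": {"add": ["IPC_LOCK"]}},
                    # Request the RDMA device as an extended resource.
                    "resources": {"limits": {rdma_resource: str(count)}},
                }
            ]
        },
    }


if __name__ == "__main__":
    manifest = rdma_pod_manifest(
        name="rdma-test-pod",               # hypothetical pod name
        image="example.com/rdma-test:tag",  # hypothetical image
        network="rdma-net-ipam",
        rdma_resource="rdma/rdma_shared_device_a",
    )
    print(json.dumps(manifest, indent=2))
```

For a HostDeviceNetwork, the same shape would apply with `nvidia.com/hostdev` as the resource name and the matching network name in the annotation.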