NCCL_SOCKET_IFNAME has no effect during pytorch distributed training with multiple NICs #1580
Comments
I am also getting the same error. DeepSpeed and NCCL appear to be bugged on Debian 12. DeepSpeed is built on NCCL, and NCCL is defaulting to the 1 gigabit adapter even when it is configured to use only the 10 gigabit adapter. NCCL_SOCKET_IFNAME does not appear to work properly on Debian systems. |
I'm surprised that NCCL didn't choose the fastest network adapter. If you could share the |
I'm guessing this is because both I'm curious: during NCCL communication, what are the local and remote IP addresses of the socket connections used by NCCL -- could you check with something like Is there a reason why you don't use separate subnets for different NICs? That would be the classic solution to such problems... I'm guessing you can tweak the routing table by flipping the order of entries or increasing the metric of Your |
I collected this log file on the master and worker nodes: nccl_master.log
I also ran ifstat on both nodes.
On the master node: ifstat -i enp37s0f0,eno1 1 > ifstat_master.log
On the worker node: ifstat -i enp37s0f0,eno1 1 > ifstat_worker.log |
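For anyone who wants the same per-interface comparison without ifstat, here is a minimal Python sketch that reads the byte counters Linux exposes in /proc/net/dev. The helper names (parse_net_dev, read_counters, byte_deltas) are made up for this illustration and are not part of NCCL or PyTorch.

```python
from pathlib import Path


def parse_net_dev(text: str) -> dict[str, tuple[int, int]]:
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if len(fields) < 16:
            continue
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def read_counters(path: str = "/proc/net/dev") -> dict[str, tuple[int, int]]:
    """Read the current counters from the running Linux system."""
    return parse_net_dev(Path(path).read_text())


def byte_deltas(before, after):
    """Per-interface (rx, tx) byte differences between two snapshots."""
    return {
        name: (after[name][0] - before[name][0], after[name][1] - before[name][1])
        for name in after
        if name in before
    }
```

Calling read_counters() before and after a training step and passing both snapshots to byte_deltas() shows which of eno1 and enp37s0f0 actually carried the traffic in each direction.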
Is the linux kernel choosing the network interface to use? Or is NCCL choosing the network interface? If I
Is there a way to stop NCCL from using a network interface if I don't want it to use that interface?
|
This system appears to have
In this instance
NCCL_SOCKET_IFNAME=ino1,10gigbit1,^docker ? Can you use ^ after the "=", or do you have to use ^=? |
i run ss-t_worker.log
I modified the ip route table like this, and it works
|
This is a connection between the 192.168.5.15 10 gigabit interface and the 192.168.3.12 eno1 interface. There is no connection to 192.168.3.12 shown in the logs for either the client or the server. Search for "192.168.3." in the worker and master logs here: https://github.com/user-attachments/files/18467806/nccl_master.log https://github.com/user-attachments/files/18467841/nccl_worker.log There is no logging of
Clearly some logging message is missing.
Could this be out-of-band data, or because the out-of-band interface was not set, so the out-of-band listening/sending connection is not being logged? Maybe because NCCL_OOB_NET_IFNAME is not set? Is NCCL_OOB_NET_ENABLE enabled by default now? Is NCCL_OOB_NET_IFNAME missing logging prints for connections? |
Thank you! These logs confirm that NCCL connects to the correct destination IP addresses (192.168.5.12 and 192.168.5.15) but, given the routing tables, the Linux kernel chooses to get there over the
Yes, exactly, that should work -- thank you for confirming. |
I understand your disappointment that the outcome ends up being different than what you requested. But NCCL's ability to follow your request is subject to the assumptions made in the code regarding how a network should be configured, as well as to the limitations of the available programming interfaces exposed by the underlying Linux kernel.
Short answer: the Linux kernel.
Longer answer: NCCL is choosing the destination IP address, which is subject to the
That assumption breaks if multiple network interfaces are on the same subnet, which is what we are dealing with here. NCCL could possibly be made to work in this scenario by using the bind-before-connect technique to request a specific source IP address (the address of the local interface we want to use), but we currently don't do it. Who knows what unforeseen scenarios that currently happen to work fine would break if we decided to force such a change.
Strictly speaking, selecting the source IP address is not equivalent to selecting a particular interface either -- what if two local interfaces have the same IP address? The Linux kernel does in fact have an API that allows binding a socket to a device by interface name, but it's a privileged operation guarded by
What I'm trying to say is that everything has limitations; every option has pros and cons.
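To illustrate the bind-before-connect technique mentioned above, here is a minimal Python sketch. It is not NCCL code (NCCL does not currently do this), and connect_from is a hypothetical helper name. It pins only the source address; as noted above, that is not the same as pinning the outgoing interface, which still depends on the routing setup.

```python
import socket


def connect_from(source_ip: str, dest_ip: str, dest_port: int,
                 timeout: float = 5.0) -> socket.socket:
    """Open a TCP connection to (dest_ip, dest_port) with an explicit source address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.settimeout(timeout)
        # Port 0 lets the kernel pick an ephemeral port; only the address is pinned.
        sock.bind((source_ip, 0))
        sock.connect((dest_ip, dest_port))
    except OSError:
        sock.close()
        raise
    return sock
```

The standard library offers the same behavior through the source_address argument of socket.create_connection; the explicit version is shown only to make the order of bind() and connect() visible.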
In practice, no, because the Docker interface is on a different subnet.
See above. C does not need to be specified.
No, that syntax doesn't work.
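As a rough illustration of why a ^ in the middle of the list has no special meaning, here is a Python sketch of the filter semantics as I read them in the NCCL environment variable documentation: an optional leading ^ inverts the whole list, an optional = after it requests exact names instead of prefixes, and the rest is a comma-separated list. This is an assumption based on the documentation, not NCCL's actual parser, and select_interfaces is a made-up name.

```python
def select_interfaces(spec: str, interfaces: list[str]) -> list[str]:
    """Filter interface names the way NCCL_SOCKET_IFNAME is documented to."""
    exclude = spec.startswith("^")  # a leading ^ applies to the whole list
    if exclude:
        spec = spec[1:]
    exact = spec.startswith("=")  # a leading = means exact names, not prefixes
    if exact:
        spec = spec[1:]
    patterns = [p for p in spec.split(",") if p]

    def matches(name: str) -> bool:
        if exact:
            return name in patterns
        return any(name.startswith(p) for p in patterns)

    # Keep matching names normally, non-matching names when excluding.
    return [name for name in interfaces if matches(name) != exclude]
```

Under this reading, "=enp37s0f0" selects exactly that interface and "^docker" excludes every interface whose name starts with docker, while the "^docker" entry in "eno1,enp37s0f0,^docker" is just a literal prefix that matches nothing.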
Correct. You won't find these in the log, because the first one is not being done, and as to the second one, the address is not chosen explicitly by NCCL, but rather the Linux kernel chooses it implicitly.
In fact, the connection is FROM "anywhere" (unspecified) TO 192.168.5.*. Because NCCL does not specify the FROM address, the Linux kernel chooses it on its own, based on the TO address and the routing table. That's how FROM ends up being 192.168.3.*.
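The kernel's implicit choice of the FROM address can be observed directly. The sketch below is a small Python illustration (kernel_chosen_source is a hypothetical name, and port 9 is an arbitrary placeholder): it connects a UDP socket, which sends no packets but makes the kernel perform the route lookup, and then reads back the local address that was picked.

```python
import socket


def kernel_chosen_source(dest_ip: str, dest_port: int = 9) -> str:
    """Return the source IP the kernel would use to reach dest_ip."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # connect() on a UDP socket transmits nothing; it only fixes the
        # destination, which makes the kernel select a local address
        # from its routing table.
        sock.connect((dest_ip, dest_port))
        return sock.getsockname()[0]
```

Running kernel_chosen_source("192.168.5.15") on the node in question would show whether the kernel answers with a 192.168.3.* or a 192.168.5.* address under the current routing table.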
Both of the above statements are true. NCCL never listens on 192.168.3.* and also never connects to 192.168.3.*. As I said, the source, not the destination, ends up being 192.168.3.*.
No. OOB is used during bootstrap (early initialization) and is separate from the code used later for user data exchange, although it is subject to the same assumptions/limitations as listed above (but in a way it "doesn't matter", since OOB data exchanges are relatively low-volume).
No, it's still opt-in.
That's possible, given that many bootstrap connections are extremely short-lived. We try to limit the logging to what we think adds value. |
I am trying to use pytorch for multi-node distributed parallel training on 2 Debian servers with 3 RTX 3090s installed.
Each server has 2 NICs. One 1 Gb port (eno1) is assigned to the 192.168.3.* network segment, and one 10 Gb port (enp37s0f0) is assigned to the 192.168.5.* network segment.
I want them to use the 10 Gb port to communicate during training. However, they use the 1 Gb port to send data and the 10 Gb port to receive data.
I tried setting NCCL_SOCKET_IFNAME=enp37s0f0 as an environment variable, writing it to /etc/nccl.conf, and adding it to the Python file (using os.environ['NCCL_SOCKET_IFNAME'] = 'enp37s0f0'). None of them worked.
For now, I can only work around this problem temporarily by modifying the routing table.
However, I want to add more 10 Gb NICs later to improve the communication capacity during training, which will require NCCL_SOCKET_IFNAME to specify multiple network ports, but the routing table does not have this capability.
What is the reason why NCCL_SOCKET_IFNAME does not work? How should I solve it?
training script
ddp.py
launch on master node
launch on worker node
environment
system info
network info
on master
on worker