prov/verbs: establishing verbs connection using the GID doesn't work #10472

Open
SnaKyEyeS opened this issue Oct 18, 2024 · 0 comments

Describe the bug
Using the verbs provider with RxM enabled on NICs where IPoIB is disabled doesn't work. This relies on @sydidelot's feature (#5605) for establishing verbs connections using the GID, and the call to fi_domain fails. The likely cause is that RxM assumes the address format is FI_SOCKADDR rather than FI_SOCKADDR_IB.

To Reproduce
Steps to reproduce the behavior:

  • Use MPICH on a NIC where IPoIB is disabled (with FI_PROVIDER=verbs,ofi_rxm)
  • Unfortunately, I don't have a simple reproducer. The failing call is fi_domain on a NIC addressed by its GID rather than through IPoIB.

Expected behavior
Replacing the linked line with .addr_format = FI_SOCKADDR_IB makes everything work as expected: the call to fi_domain on a NIC with IPoIB disabled (which is therefore addressed by its GID) succeeds.

Output
OFI fails with ENODATA

[3] libfabric:1347797:1729070846:ofi_rxm:verbs:fabric:vrb_get_rai_id():301<warn> rdma_resolve_addr: Invalid argument (22)
[3] libfabric:1347797:1729070846:ofi_rxm:verbs:fabric:vrb_get_rai_id():303<info> src addr: fi_sockaddr_ib://[fe80::88e9:a4ff:ff1c:5860]:0xffff:0x13f:0x0
[3] libfabric:1347797:1729070846:ofi_rxm:verbs:fabric:vrb_get_rai_id():305<info> dst addr: (null)
[3] libfabric:1347797:1729070846:ofi_rxm:verbs:fabric:vrb_get_match_infos():1825<info> handling of the socket address fails - -22
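In the log above, the source address is printed in libfabric's fi_sockaddr_ib:// string form, while the destination is (null). To make lines like this easier to inspect, here is a minimal, hypothetical C helper that splits such a string into the bracketed GID and its three trailing hex fields. It is not part of libfabric. The names parse_sockaddr_ib and struct ib_addr_fields are illustrative, and reading the fields as P_Key, service ID and scope ID (in that order) is an assumption about libfabric's formatting that this issue does not confirm.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for "fi_sockaddr_ib://[GID]:0xA:0xB:0xC".
 * The field names below are assumptions, not libfabric definitions. */
struct ib_addr_fields {
    char gid[64];       /* textual IPv6-style GID, e.g. "fe80::..." */
    uint64_t pkey;      /* assumed: partition key (first hex field) */
    uint64_t sid;       /* assumed: service ID (second hex field) */
    uint64_t scope_id;  /* assumed: scope ID (third hex field) */
};

/* Parses the string form; returns 0 on success, -1 on any mismatch
 * (for example the "(null)" printed for a missing dst addr). */
static int parse_sockaddr_ib(const char *s, struct ib_addr_fields *out)
{
    static const char prefix[] = "fi_sockaddr_ib://[";
    size_t plen = sizeof(prefix) - 1;

    if (strncmp(s, prefix, plen) != 0)
        return -1;

    /* The GID sits between '[' and the matching ']'. */
    const char *gid_start = s + plen;
    const char *gid_end = strchr(gid_start, ']');
    if (gid_end == NULL)
        return -1;
    size_t gid_len = (size_t)(gid_end - gid_start);
    if (gid_len == 0 || gid_len >= sizeof(out->gid))
        return -1;
    memcpy(out->gid, gid_start, gid_len);
    out->gid[gid_len] = '\0';

    /* %x accepts an optional "0x" prefix; %n records how much was read. */
    int consumed = 0;
    if (sscanf(gid_end, "]:%" SCNx64 ":%" SCNx64 ":%" SCNx64 "%n",
               &out->pkey, &out->sid, &out->scope_id, &consumed) != 3)
        return -1;

    /* Reject trailing garbage after the last field. */
    if (gid_end[consumed] != '\0')
        return -1;
    return 0;
}
```

Given the src addr string from the log, the helper yields the GID fe80::88e9:a4ff:ff1c:5860 and the hex fields 0xffff, 0x13f and 0x0. The (null) destination is rejected, which matches the missing dst addr passed to rdma_resolve_addr.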

Environment:
fi_info's relevant output:

fi_info:
    caps: [ FI_MULTI_RECV, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_HMEM ]
    mode: [ FI_BUFFERED_RECV ]
    addr_format: FI_SOCKADDR_IB
    src_addrlen: 48
    dest_addrlen: 0
    src_addr: fi_sockaddr_ib://[fe80::88e9:a4ff:ff4a:997c]:0xffff:0x13f:0x0
    dest_addr: (null)
    handle: (nil)
[...]
SnaKyEyeS added the bug label on Oct 18, 2024