Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/udp: detect and use interface MTU to support large messages over UDP #10493

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mpokorny
Copy link

This commit supports a maximum UDP message size that depends on the MTU size of the interface. The use case is UDP messaging on an Ethernet network configured to use jumbo frames.

For each interface detected by the udp provider, determine the MTU of the interface, and use that value to set the max_msg_size field of the fi_ep_attr and fi_tx_attr values of the fi_info element. When the MTU cannot be determined, the MTU value assumed by previous code versions (1500) is used.

@mpokorny
Copy link
Author

I have a small command-line test application, but I'm not sure of the best way to modify it into a unit test and integrate it into the code base.

#include <rdma/fabric.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char** argv) {
  int msg_size = 1000;
  if (argc > 1) {
    if (sscanf(argv[1], "%d", &msg_size) != 1 || msg_size <= 0) {
      printf("failed to convert argument '%s' to a positive integer\n", argv[1]);
      return 1;
    }
  }

  struct fi_fabric_attr* fabric_attr = calloc(1, sizeof(struct fi_fabric_attr));
  fabric_attr->prov_name = strdup("udp");

  struct fi_ep_attr* ep_attr = calloc(1, sizeof(struct fi_ep_attr));
  ep_attr->max_msg_size = msg_size;

  struct fi_info* hints = fi_allocinfo();
  hints->caps = FI_SEND | FI_RECV;
  hints->mode = FI_CONTEXT | FI_CONTEXT2;
  hints->addr_format = FI_SOCKADDR_IN;
  hints->ep_attr = ep_attr;
  hints->fabric_attr = fabric_attr;

  struct fi_info* info = NULL;
  struct fi_info* cur;
  fi_getinfo(
    FI_VERSION(FI_MAJOR_VERSION, FI_MINOR_VERSION),
    NULL,
    NULL,
    0,
    hints,
    &info);
  for (cur = info; cur; cur = cur->next)
    printf("domain %s, max_msg_size %d\n",
           cur->domain_attr->name,
           cur->ep_attr ? cur->ep_attr->max_msg_size : 0);
  fi_freeinfo(info);
  fi_freeinfo(hints);
  return 0;
}

@mpokorny mpokorny force-pushed the jumbo-udp branch 2 times, most recently from e2be47f to a1a89ce Compare October 26, 2024 21:09
Comment on lines 46 to 50
ofi_mutex_lock(&init_lock);
udpx_util_prov_init(version, node, service, flags);
ofi_mutex_unlock(&init_lock);
return util_getinfo(&udpx_util_prov, version, node, service, flags,
hints, info);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

udpx_util_prov is initialized once with the "node", "service" and "flags" from the first fi_getinfo() call. That likely won't work properly for later fi_getinfo calls with different parameters.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course. Please check the updated code, which should address this.

src/linux/osd.c Outdated
Comment on lines 271 to 273
size_t filename_len = strlen(mtu_filename_prefix) +
strlen(mtu_filename_prefix) +
IF_NAMESIZE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please align the 2nd and 3rd lines with the 'strlen' of the first line.

src/linux/osd.c Outdated
Comment on lines 286 to 288
if (getline(&line, &len, fd) == -1) {
goto err2;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the braces here.

src/linux/osd.c Outdated

int ofi_ifaddr_get_mtu(const struct ifaddrs *ifa)
{
FILE *fd;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use fp instead. fd is usually used for the integer file descriptor returned from open.

src/linux/osd.c Outdated
goto err2;
}

if (sscanf(line, "%d", &mtu) != 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can use fscanf instead. that eliminates the need of calling getline.

src/linux/osd.c Outdated
Comment on lines 279 to 280
snprintf(mtu_filename, filename_len, "%s%s%s",
mtu_filename_prefix, ifa->ifa_name, mtu_filename_suffix);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use asprintf instead. That way, you don't need to calculate the size of the string, and thus don't need to define the variable for the prefix and suffix.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions. All applied in the update.

ifaddrs.ifa_netmask = NULL;
ifaddrs.ifa_name = info->domain_attr->name;
ifaddrs.ifa_addr = info->src_addr;
return ofi_ifaddr_get_mtu(&ifaddrs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The window version of ofi_ifaddr_get_mtu returns ifaddrs.mtu which is undefined here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the updated version.

@mpokorny mpokorny force-pushed the jumbo-udp branch 2 times, most recently from ed17f9a to f9814bb Compare October 28, 2024 19:16
Copy link
Contributor

@aingerson aingerson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few small things, nothing glaring. Thanks for the patch!
As far as your unit test, I would forgo putting it into our testing as it's not really testing anything but rather just printing out the values as far as I can tell.

man/fi_udp.7.md Show resolved Hide resolved
prov/udp/src/udpx.h Outdated Show resolved Hide resolved
prov/udp/src/udpx_attr.c Outdated Show resolved Hide resolved
src/linux/osd.c Outdated Show resolved Hide resolved
prov/udp/src/udpx_attr.c Outdated Show resolved Hide resolved
prov/udp/src/udpx_attr.c Show resolved Hide resolved
@mpokorny mpokorny force-pushed the jumbo-udp branch 3 times, most recently from d218488 to 2d26352 Compare October 28, 2024 22:14
aingerson
aingerson previously approved these changes Oct 29, 2024
Copy link
Contributor

@aingerson aingerson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks!
You can include the man page updates in this PR btw, just have it in a separate patch. But you can also do it in a separate PR - up to you!
Our CI failed but it was unrelated to your changes. Rerunning for fun but don't consider it a blocker. Not sure what the AWS failure is though.

@mpokorny
Copy link
Author

I pushed a commit to update the man page.

struct fi_info* info;

if (udpx_util_prov.info == NULL) {
udpx_util_prov.info = &udpx_info;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is redundant since it is overwritten by L153.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But udpx_util_prov is used in the call to ofi_ip_getinfo(), which won't work correctly otherwise.

ofi_get_list_of_addr(&udpx_prov, "iface", &addr_list);
for (cur = info; cur; cur = cur->next)
set_mtu_from_addr_list(cur, &addr_list);
*(struct fi_info**)&udpx_util_prov.info = info;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be simplified to udpx_util_prov.info = info;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The info field of struct util_prov is declared const, and the compiler complains without the cast.

}
}

void udpx_util_prov_fin()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to udpx_util_prov_fini to be consistent with other similar cases in the repo.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done with next commit

@@ -2182,7 +2190,7 @@ void ofi_get_list_of_addr(const struct fi_provider *prov, const char *env_name,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addr_entry->mtu is unitialized in this case

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fairly unfamiliar with Windows, which is where this code is used. Given what I see in include/windows/config.h, it's not clear to me that this code will actually be compiled, but I'm probably missing something. In any case, initializing addr_entry->mtu to -1 suffices to make the program correct, although the standard default MTU will then be assumed.

addr_entry = container_of(entry,
struct ofi_addr_list_entry,
entry);
max_msg_size = UDPX_MAX_MSG_SIZE(addr_entry->mtu);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to handle the case that addr_entry->mtu == -1.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That case is handled in line 125. (The value of UDPX_MAX_MSG_SIZE() is strictly less than its argument value.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mtu valid check should be direct, not hidden behind a macro which uses an incorrect value that's then checked. The code is assuming too much about an invalid mtu value and requires too much of the reader to figure out what's happening.

For each interface detected by the udp provider, determine the MTU of
the interface, and use that value to set the max_msg_size field of the
fi_ep_attr and fi_tx_attr values of the fi_info element. When the MTU
cannot be determined, the MTU value assumed by previous code
versions (1500) is used.

Signed-off-by: Martin Pokorny <[email protected]>
Change "limitations" section to describe that the maximum transfer
size depends on interface MTU, and add FI_UDP_IFACE parameter
description.

Signed-off-by: Martin Pokorny <[email protected]>
@@ -497,6 +498,11 @@ size_t ofi_ifaddr_get_speed(struct ifaddrs *ifa)
return ifa->speed;
}

int ofi_ifaddr_get_mtu(const struct ifaddrs *ifa)
{
return ifa->mtu;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation is off.

@@ -76,6 +76,11 @@ static inline size_t ofi_ifaddr_get_speed(struct ifaddrs *ifa)
return 0;
}

static inline int ofi_ifaddr_get_mtu(const struct ifaddrs *ifa)
{
return -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just return 1500 as the default? Setting mtu = -1 as a default looks like a recipe for problems and missed error checking

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your feedback. One reason for using the value -1 is that it removes the need to put default MTU size values in a bunch more files. If there is there a good, single place to put such a definition, I must have missed it. I'm open to suggestions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually disagree with Sean for once. Returning -1 is similar to an enosys indicating that finding the MTU size is not supported/possible. Returning 1500 seems like it's falsely stating that it was able to determine that to be the MTU size. I think it's more appropriate for the default to be in the caller where it defaults to 1500 if were unable to find it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then you need to add checks everywhere that the mtu size is valid prior to using it. There are several places where mtu == -1, but that value is used anyway, with, maybe, checks coming later that the calculation is valid. If the mtu can't be determined, then remove that interface from the list. I can't tell that these changes work at all on windows.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aaaand back to agreeing with Sean. Yes, proper error checking is absolutely necessary for that to work.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't removing an interface whose mtu can't be determined potentially break existing client code? Doesn't the code in the main branch implicitly assume mtu=1500 for any interface supporting udp to determine the endpoint max_msg_size and inject_size? Falling back to that behavior seems like it might be safer (from that perspective) than removing the interface from the list. From a correctness point of view, dropping the interface seems preferable. For my use case, either choice is fine, so I could use some advice about how to proceed.

.info = &udpx_info,
.flags = 0,
.info = NULL,
.flags = 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced tab with spaces here - revert change

set_mtu_from_addr_list(cur, &addr_list);
*(struct fi_info**)&udpx_util_prov.info = info;
ofi_free_list_of_addr(&addr_list);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't embed the entire function inside an if statement. Revert the if check and return, then outdent the rest of the code. Please add some spaces between lines.


if (udpx_util_prov.info == NULL) {
udpx_util_prov.info = &udpx_info;
info = fi_allocinfo();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must check for failure

if (max_msg_size > 0) {
info->tx_attr->inject_size = max_msg_size;
info->ep_attr->max_msg_size = max_msg_size;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_msg_size > 0 should be an assert. We shouldn't leave these sizes unset.

((ifa->ifa_addr->sa_family != AF_INET) &&
(ifa->ifa_addr->sa_family != AF_INET6)))
continue;
if (ifa->ifa_flags & IFF_LOOPBACK) {
mtu = ofi_ifaddr_get_mtu(ifa);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to loopback_mtu to make it clear how this is used. On first glance, this value looks like it's ignored.

@@ -2174,6 +2182,7 @@ void ofi_get_list_of_addr(const struct fi_provider *prov, const char *env_name,
addr_entry->ipaddr.sin.sin_family = AF_INET;
addr_entry->ipaddr.sin.sin_addr.s_addr =
iptbl->table[i].dwAddr;
addr_entry->mtu = -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everywhere mtu = -1 should probably use a better default than an unusable value

@@ -655,6 +655,7 @@ struct ofi_addr_list_entry {
char ipstr[INET6_ADDRSTRLEN];
union ofi_sock_ip ipaddr;
size_t speed;
int mtu;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please separate out adding / setting the mtu to ofi_addr_list_entry into its own patch. That will make it clearer which changes are generic, versus tied specifically to udp.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I'll do that with the next update.

udpx_util_prov.info = &udpx_info;
info = fi_allocinfo();
ofi_ip_getinfo(&udpx_util_prov, version, NULL, NULL, 0, NULL,
&info);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way to handle this is to add a callback to ofi_ip_getinfo(). That function is already iterating over the interfaces. The only part that's missing is the ability of the provider to update the fi_info based on the specific interface. For most callers, the callback will be null. For UDP, the callback will receive the fi_info and ofi_addr_list_entry, so that it can update the attributes based on the mtu. The callback may need to be passed through to subfunctions.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will do that. It seems like a more extensive change to the code, but if you're looking over my shoulder, I'll give it a try.

@mpokorny
Copy link
Author

I should have some time to pick up work on this PR again this week. I propose doing the following:

  • use the "old" assumed value for the MTU when it cannot be determined at run time
  • add a callback to ofi_ip_getinfo() that allows providers to update the fi_info based on the interface
    Any comments on this outline are appreciated.

@shefty
Copy link
Member

shefty commented Nov 18, 2024

@mpokorny - There are other places where callbacks are used within the getinfo() path. See struct util_prov::alter_defaults. I would see about adding a new callback to struct util_prov, maybe alter_dev_defaults? If that call is not null, call it for each fi_info created for each network interface. The udp provider would just need to define the function and set it.

@mpokorny
Copy link
Author

I have a question about a possible alternative implementation to take. I noticed that fi_link_attr has an mtu field already, and was wondering whether that might be filled in by the udp provider for subsequent use by the util_prov::alter_defaults callback. That callback would then set the attribute values that depend on the MTU value (in fi_tx_attr and fi_ep_attr). Would such an implementation be correct, and, if so, how might it compare with what I've done so far?

@shefty
Copy link
Member

shefty commented Nov 20, 2024

I'm not entirely certain, but reading the code, I don't think the fi_link_attr (via fi_info::nic) is being set for the tcp/udp providers. If that were available, it would be great to make use of it. Unfortunately, the current alter_defaults() callback occurs too early in the getinfo flow to be used.

ofi_ip_getinfo() -> util_getinfo()

This flow checks against the provider's template fi_info list. For each fi_info that matches, the fi_info is dup'ed, then possibly altered by the provider. This alter flow allows custom changes to the template fi_info based on the input hints.

The NIC interface list is checked after this flow (back in ofi_ip_getinfo()). For every fi_info returned from the above flow, a new fi_info is created, one per interface. (The exception is when a src_addr limits us to a single NIC.) The places to add nic information would be adjacent to the calls to:

util_set_netif_names()

Sooo... here's what I would consider:

  • Rename ofi_addr_list_entry to something like ofi_netif. Rename related functions and variables. This is just to clarify that it's capturing more than simple address information.
  • Extend ofi_netif with mtu, plus additional device information which might be available.
  • Update util_set_netif_names() -> util_set_netif(). In addition to changing the addressing, allocate and attach a fid_nic structure, plus sub-attributes, to the fi_info. Note that it's okay if most of the data is left 0/null. A future patch fill out more information later. The fi_link_attr structure should be filled out and returned.
  • The udp provider can iterate over the returned fi_info list from ofi_ip_getinfo() to update the msg_size based on the link mtu.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants