-
Notifications
You must be signed in to change notification settings - Fork 380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/cxi: peer infrastructure support #10436
base: main
Are you sure you want to change the base?
Conversation
Hey Amir - Sorry for taking a while to get to this. Can you please rebase this commit on top of main? I have merged a series of CXI provider updates. |
41d814f
to
020893d
Compare
prov/cxi/src/cxip_msg_hpc.c
Outdated
struct cxip_ux_send *ux_send) | ||
{ | ||
union cxip_match_bits ux_mb; | ||
struct fi_peer_rx_entry *entry = calloc(sizeof(*entry), 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this still looks wrong to me. cxi as a peer should never be allocating peer entries. that all should happen in the owner space. I understand that from the cxi perspective, the messages all look like unexpected messages but I don't understand how this fits into that flow. it looks like you're basically bypassing the get_msg api and sending all messages straight into the owner unexpected queue which does not match the peer API flow definition.
the only place I see things getting added to the sw_pending_ux_list
that this is going through is in the else where you're not in the owner srx case. is this even hit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error is in the caller of this function. It should be going through rxc->sw_ux_list. However now that I look at it more closely I have to admit I'm not entirely sure if we need to do something special similar to what cxip_post_ux_onload_sw() and cxip_post_ux_onload_fc() do. @iziemba do you have any advice here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, as I was thinking it over I don't think this is going to be an issue. This assumes hardware offload is enabled and handles the case when we need to onload to the software. However, LINKx currently disables hardware offload, so this path should never get hit. In the latest update I removed all this code, and I think this should resolve this issue.
/* this is used when the owner calls start_msg */ | ||
rx_entry->peer_context = ux; | ||
return -FI_ENOMSG; | ||
} else if (ret) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drop else and just use if
@@ -861,6 +861,9 @@ struct cxip_domain { | |||
ofi_spin_t lock; | |||
ofi_atomic32_t ref; | |||
|
|||
struct fid_ep rx_ep; | |||
struct fid_peer_srx *owner_srx; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting this comment here but it's unrelated - rewrite commit message as the callback is no longer used
Implement the shared completion queues in CXI. This implementation mainly relies on the util CQ common code. The util code set the correct completion queue callbacks based on the FI_PEER flag Signed-off-by: Amir Shehata <[email protected]>
Restructure the code to allow for posting on the owner provider's shared receive queues. Signed-off-by: Amir Shehata <[email protected]>
Add the FI_PEER capability bit to the CXI provider fi_info Signed-off-by: Amir Shehata <[email protected]>
020893d
to
40bb48e
Compare
@@ -1271,6 +1274,9 @@ struct cxip_req { | |||
uint64_t trig_thresh; | |||
struct cxip_cntr *trig_cntr; | |||
|
|||
/* pointer to the shared receive entry */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't care enough about this to make you fix and repush but if you do another update, probably drop this comment. I think it's unnecessary
@@ -138,7 +138,7 @@ | |||
(FI_SOURCE | FI_SOURCE_ERR | FI_LOCAL_COMM | \ | |||
FI_REMOTE_COMM | FI_RMA_EVENT | FI_MULTI_RECV | FI_FENCE | FI_TRIGGER) | |||
#define CXIP_EP_CAPS (CXIP_EP_PRI_CAPS | CXIP_EP_SEC_CAPS) | |||
#define CXIP_DOM_CAPS (FI_LOCAL_COMM | FI_REMOTE_COMM | FI_AV_USER_ID) | |||
#define CXIP_DOM_CAPS (FI_LOCAL_COMM | FI_REMOTE_COMM | FI_AV_USER_ID | FI_PEER) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think to fully support FI_PEER, you need to support counters too. I know link doesn't use them so it's not the biggest of deals but it probably wouldn't be too hard - how hard do you think it would be to add peer counter support?
Looks pretty good to me! But definitely get @iziemba's approval before merging. |
This PR introduces support for the peer infrastructure in the CXI provider.