Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/cxi: peer infrastructure support #10436

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

amirshehataornl
Copy link
Contributor

This PR introduces support for the peer infrastructure in the CXI provider.

@iziemba
Copy link
Contributor

iziemba commented Oct 17, 2024

Hey Amir - Sorry for taking a while to get to this. Can you please rebase this commit on top of main? I have merged a series of CXI provider updates.

@amirshehataornl amirshehataornl force-pushed the pr01_cxip_peer branch 4 times, most recently from 41d814f to 020893d Compare October 23, 2024 21:58
struct cxip_ux_send *ux_send)
{
union cxip_match_bits ux_mb;
struct fi_peer_rx_entry *entry = calloc(sizeof(*entry), 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still looks wrong to me. cxi as a peer should never be allocating peer entries. that all should happen in the owner space. I understand that from the cxi perspective, the messages all look like unexpected messages but I don't understand how this fits into that flow. it looks like you're basically bypassing the get_msg api and sending all messages straight into the owner unexpected queue which does not match the peer API flow definition.
the only place I see things getting added to the sw_pending_ux_list that this is going through is in the else where you're not in the owner srx case. is this even hit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error is in the caller of this function. It should be going through rxc->sw_ux_list. However now that I look at it more closely I have to admit I'm not entirely sure if we need to do something special similar to what cxip_post_ux_onload_sw() and cxip_post_ux_onload_fc() do. @iziemba do you have any advice here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, as I was thinking it over I don't think this is going to be an issue. This assumes hardware offload is enabled and handles the case when we need to onload to the software. However, LINKx currently disables hardware offload, so this path should never get hit. In the latest update I removed all this code, and I think this should resolve this issue.

/* this is used when the owner calls start_msg */
rx_entry->peer_context = ux;
return -FI_ENOMSG;
} else if (ret) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop else and just use if

@@ -861,6 +861,9 @@ struct cxip_domain {
ofi_spin_t lock;
ofi_atomic32_t ref;

struct fid_ep rx_ep;
struct fid_peer_srx *owner_srx;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting this comment here but it's unrelated - rewrite commit message as the callback is no longer used

Implement the shared completion queues in CXI. This implementation
mainly relies on the util CQ common code.

The util code set the correct completion queue callbacks based on the
FI_PEER flag

Signed-off-by: Amir Shehata <[email protected]>
Restructure the code to allow for posting on the owner provider's shared
receive queues.

Signed-off-by: Amir Shehata <[email protected]>
Add the FI_PEER capability bit to the CXI provider fi_info

Signed-off-by: Amir Shehata <[email protected]>
@@ -1271,6 +1274,9 @@ struct cxip_req {
uint64_t trig_thresh;
struct cxip_cntr *trig_cntr;

/* pointer to the shared receive entry */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't care enough about this to make you fix and repush but if you do another update, probably drop this comment. I think it's unnecessary

@@ -138,7 +138,7 @@
(FI_SOURCE | FI_SOURCE_ERR | FI_LOCAL_COMM | \
FI_REMOTE_COMM | FI_RMA_EVENT | FI_MULTI_RECV | FI_FENCE | FI_TRIGGER)
#define CXIP_EP_CAPS (CXIP_EP_PRI_CAPS | CXIP_EP_SEC_CAPS)
#define CXIP_DOM_CAPS (FI_LOCAL_COMM | FI_REMOTE_COMM | FI_AV_USER_ID)
#define CXIP_DOM_CAPS (FI_LOCAL_COMM | FI_REMOTE_COMM | FI_AV_USER_ID | FI_PEER)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to fully support FI_PEER, you need to support counters too. I know link doesn't use them so it's not the biggest of deals but it probably wouldn't be too hard - how hard do you think it would be to add peer counter support?

@aingerson
Copy link
Contributor

Looks pretty good to me! But definitely get @iziemba's approval before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants