prov/shm: proposal for new shm architecture #10693

Draft
wants to merge 13 commits into main

Conversation

aingerson (Contributor)

This is an early look at what I am planning for shm. I am out on vacation for the next 2 weeks and will not be updating this PR but will check back in for any conversations or questions that come up. I'm dropping this here to get eyes on it while I'm gone and to get some feedback on it. Thanks in advance!

For context:
We've been working on rearchitecting the provider for a while (by "we" I mean Intel, AWS, and ORNL) because of various limitations and difficulties with the existing protocol. sm2 was an attempt at implementing new methods for handling rendezvous-type protocols, but full performance was never achieved and development was abandoned in favor of smaller optimizations in the existing provider. However, that did not solve some of the limitations regarding receiver-side resources and the polling method for the response queue (for example #9853).
This draft PR is a proposal for redoing shm that solves these issues while preserving/improving the performance.

This is a very rough draft and is NOT intended to be reviewed line by line (please don't, it will be a waste of your time). It is not polished. The individual commits also do not build on their own, but I split the work into smaller chunks to make it easier to follow. The big one is "new shm" which implements the new queues.

Here's the general gist though:
- The command queue is kept as is since it has proven performant and allows us to implement an easy inline protocol. The queue now has a ptr to the command being used plus a built-in command. The built-in command is only used for the inline protocol; for all other protocols, the ptr is set to a sender-side command (located in a stack in the shm region) but translated into peer space (see the sketch after this list).
- Inline messages do not have to be returned - all data for the command is saved within the inline command.
- Inject, iov, and sar messages (mmap is removed) all use a sender-side command which needs to be returned by the receiver. To return the command, the receiver translates the command pointer back into a sender command and inserts it into another atomic queue dedicated to returned entries. There are two atomic queues to poll, but this doesn't add overhead and removes assumptions about insertion into the return queue (since it runs in parallel with the command stack, we are always guaranteed space in the queue and do not have to handle retries).
- Sar messages no longer need a sar list to poll (like before) since the return and resubmission of the command now acts as the trigger for more data.
- Sender-side commands can be held onto by the receiver and returned out of submission order, which means subsequent messages (specifically iovs) do not get blocked because previous messages have not been matched.
- This also opens the door to adding a CMA-IPC fallback protocol (not in this patch set, will come later).
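
To make the queue layout and return path concrete, here is a minimal sketch of how I read it. All names (`demo_cmd`, `demo_cmd_entry`, etc.) are hypothetical stand-ins, not the actual smr structures, and the real queues are atomic; this only illustrates the built-in inline command versus the sender-side pointer, and the guaranteed-space return path:

```c
#include <stdint.h>
#include <stddef.h>

#define DEMO_INLINE_SIZE 64

/* Hypothetical command layout (not the real smr command struct). */
struct demo_cmd {
	uint32_t proto;			/* inline, inject, iov, or sar */
	uint32_t tx_id;			/* index of this command in the sender's stack */
	uint64_t size;
	uint8_t  data[DEMO_INLINE_SIZE];/* payload, used by the inline protocol only */
};

/*
 * One slot in the command queue. Every slot carries a built-in command,
 * but it is only used for inline messages, which never need to be
 * returned. For all other protocols, cmd points at a sender-side
 * command allocated from a stack in the shm region and translated into
 * the peer's address space.
 */
struct demo_cmd_entry {
	struct demo_cmd *cmd;		/* inline: &entry->inline_cmd; otherwise sender-side */
	struct demo_cmd  inline_cmd;
};

/*
 * Returning a command: the receiver translates its mapped pointer back
 * into the sender's address space and inserts the result into a second
 * atomic queue reserved for returned entries. That queue runs in
 * parallel with the command stack, so the insert always has room and no
 * retry path is needed.
 */
static inline uintptr_t
demo_to_sender_space(struct demo_cmd *mapped_cmd, ptrdiff_t peer_to_sender_offset)
{
	return (uintptr_t) mapped_cmd + peer_to_sender_offset;
}
```

The real implementation still needs the atomic queue operations and the stack allocator, but the key point is that the built-in command only ever carries inline data, while everything else is a pointer that round-trips back to the sender.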

Like I mentioned, this is a draft and not final. I'm opening this up to get feedback and more eyes on the new implementation, and to see if there are any anticipated issues with moving to a model like this.

The PR in its current state passes all fabtests and ubertest (all.test and verify.test), and works with OMPI using lnx (shm+tcp;ofi_rxm and shm+verbs;ofi_rxm), so I feel confident it's in a solid enough state for more folks to run tests on it and gather more performance and functional data. One caveat: DSA is not currently working, though the implementation logic is all there; this shouldn't affect anyone at the moment.

@shijin-aws @a-szegel @amirshehataornl please take a look if you can and give me a good sanity check. Any performance data (or simply a thumbs up or down) would be much appreciated!

My least favorite things about the implementation are the DSA status triggering and the SAR overflow list. I'm trying to figure out a smooth way for the SAR progress to get triggered through the command queue without the potential for the insert to fail. My vision would require modification to the atomic queue code so I'm putting it off for now to just get something working.

Performance-wise, inline, CMA, and SAR look stable and competitive, and inject shows a significant improvement (the inject spin lock was removed).

Let me know what you think and if this is a direction you want to move towards!

@zachdworkin @alex-mckinley

Turn response queue into return queue for local commands
Inline commands are still receive-side.
All commands have an inline option as well as a common ptr to the command
being used for remote commands. These commands have to be returned
to the sender, but the receive side can hold onto them as long as
needed for the lifetime of the message.

Signed-off-by: Alexia Ingerson <[email protected]>
shm has self and peer caps for each p2p interface (right now just
CMA and xpmem). The support for each of these interfaces is saved
in separate fields, which wastes memory and is
confusing. Merge these into two fields (one for self and one for
peer) which hold the information for all p2p interfaces and are
accessed by the P2P type enums. CMA also needs a flag to indicate whether
CMA support has been queried yet or not.

This also moves some shm fields around for alignment

Signed-off-by: Alexia Ingerson <[email protected]>
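
A rough sketch of the merged caps layout described above. The names and the bitmask encoding are my assumptions; the real code may instead use arrays indexed by the P2P type enums:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical P2P interface ids standing in for the real P2P type enums. */
enum demo_p2p_type {
	DEMO_P2P_CMA,
	DEMO_P2P_XPMEM,
	DEMO_P2P_MAX,
};

/*
 * One field for self support and one for peer support, each covering
 * every P2P interface, instead of separate per-interface fields.
 */
struct demo_p2p_caps {
	uint8_t self_caps;	/* bit (1 << demo_p2p_type) set if supported locally */
	uint8_t peer_caps;	/* same encoding for the peer */
	bool	cma_queried;	/* CMA support is discovered lazily, so track that */
};

static inline bool
demo_p2p_usable(const struct demo_p2p_caps *caps, enum demo_p2p_type type)
{
	uint8_t bit = 1u << type;

	return (caps->self_caps & bit) && (caps->peer_caps & bit);
}
```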
Simplifies access to the map to remove the need for a container

Signed-off-by: Alexia Ingerson <[email protected]>
There is a 1:1 relationship between the av and the map, so just reuse
the util av lock for access to the map as well.
This requires some reorganizing of the locking semantics.

Signed-off-by: Alexia Ingerson <[email protected]>
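
Just to illustrate the locking change with hypothetical types (the real code reuses the util AV lock rather than a raw pthread mutex):

```c
#include <pthread.h>
#include <stddef.h>

struct demo_peer;

/*
 * The peer map is 1:1 with the AV, so a single lock guards both the AV
 * entries and the map instead of keeping a dedicated map lock.
 */
struct demo_av {
	pthread_mutex_t lock;		/* stands in for the util AV lock */
	struct demo_peer **map;		/* protected by 'lock' as well */
	size_t count;
};
```

The benefit is one fewer lock to order against; the cost, as the commit notes, is reworking where the lock is taken.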
There is nothing in smr_fabric, just use the util_fabric directly

Signed-off-by: Alexia Ingerson <[email protected]>
Just like on the send side, make the progress functions an array
of function pointers indexed by the command proto.
This cleans up the parameters of the progress calls and streamlines
the calls.

This also renames proto_ops to send_ops to make the distinction clearer.

Signed-off-by: Alexia Ingerson <[email protected]>
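
A minimal sketch of the dispatch-by-proto idea; all names below are placeholders rather than the provider's actual tables or signatures:

```c
struct demo_ep;
struct demo_cmd;

enum demo_proto {
	DEMO_PROTO_INLINE,
	DEMO_PROTO_INJECT,
	DEMO_PROTO_IOV,
	DEMO_PROTO_SAR,
	DEMO_PROTO_MAX,
};

typedef int (*demo_progress_fn)(struct demo_ep *ep, struct demo_cmd *cmd);

static int demo_progress_inline(struct demo_ep *ep, struct demo_cmd *cmd) { return 0; }
static int demo_progress_inject(struct demo_ep *ep, struct demo_cmd *cmd) { return 0; }
static int demo_progress_iov(struct demo_ep *ep, struct demo_cmd *cmd)    { return 0; }
static int demo_progress_sar(struct demo_ep *ep, struct demo_cmd *cmd)    { return 0; }

/* Receive-side table, mirroring the send-side send_ops table:
 * progressing a command is just demo_progress_ops[proto](ep, cmd). */
static demo_progress_fn demo_progress_ops[DEMO_PROTO_MAX] = {
	[DEMO_PROTO_INLINE]	= demo_progress_inline,
	[DEMO_PROTO_INJECT]	= demo_progress_inject,
	[DEMO_PROTO_IOV]	= demo_progress_iov,
	[DEMO_PROTO_SAR]	= demo_progress_sar,
};
```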
proto data isn't really needed since it's only used for the inject offset
now, and the cmd stack and inject buffers run in parallel.
Use a simple inject buf array accessed by index, where the index is the same
as the command's index in its stack.

Signed-off-by: Alexia Ingerson <[email protected]>
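
A sketch of the parallel stack/buffer layout; the sizes and names are made up, and the command fields are trimmed for brevity:

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_CMD_CNT	 256
#define DEMO_INJECT_SIZE 4096

struct demo_cmd {
	uint32_t proto;
	uint64_t size;
};

/*
 * The inject buffers run in parallel with the command stack, so a
 * command's position in the stack is also its inject buffer index and
 * no separate per-command offset ("proto data") is needed.
 */
struct demo_region {
	struct demo_cmd	cmd_stack[DEMO_CMD_CNT];
	uint8_t		inject_bufs[DEMO_CMD_CNT][DEMO_INJECT_SIZE];
};

static inline uint8_t *
demo_inject_buf(struct demo_region *region, struct demo_cmd *cmd)
{
	size_t index = cmd - region->cmd_stack;	/* index within the stack */

	return region->inject_bufs[index];
}
```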
DSA copies happen asynchronously, so we need a way to notify the receiver
when the copy is done and the data is available. This used to be done with
the response queue and sar list. The response queue notification can still be
done via the return of the command, but the sar list was removed since more
data is sent by returning the command on a subsequent loop.
DSA, however, still needs a list to track asynchronous copies. This refactors the
async ipc list into a generic async list that tracks asynchronous copies.
If a DSA copy is not ready, the entry is inserted into the async list and polled
until the copy completes, at which point it resumes the regular SAR protocol and
the command is returned to the sender.

Tracking the status of the sar is done through the existing sar status of the peer
but to check for the correct status, the rx id is also needed by the receiver for
proper status exchange.

This also refactors the ids to make it clearer when an id is used for the transmitter (tx)
or target (rx).

Signed-off-by: Alexia Ingerson <[email protected]>
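
To show how I read the generic async-copy list (again, hypothetical names; the real list is the refactored async ipc list):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

struct demo_cmd;

/* One outstanding asynchronous copy (DSA today, IPC previously). */
struct demo_async_copy {
	struct demo_async_copy	*next;
	struct demo_cmd		*cmd;	/* command to return once the copy lands */
	uint16_t		 rx_id;	/* receiver id used for the SAR status check */
	bool (*done)(struct demo_async_copy *copy);	/* poll the copy engine */
	void (*complete)(struct demo_async_copy *copy);	/* resume SAR: return cmd */
};

struct demo_async_list {
	struct demo_async_copy *head;
};

/*
 * Poll every pending copy; completed entries are unlinked and finish
 * the normal SAR flow by returning the command to the sender.
 */
static void demo_async_progress(struct demo_async_list *list)
{
	struct demo_async_copy **prev_next = &list->head;

	while (*prev_next) {
		struct demo_async_copy *copy = *prev_next;

		if (!copy->done(copy)) {
			prev_next = &copy->next;
			continue;
		}

		*prev_next = copy->next;	/* unlink before completing */
		copy->complete(copy);
	}
}
```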