#0: Hoist SubDeviceManager/Lock-Step Allocator to MeshDevice #16878

cfjchu · 2025-01-18T01:10:57Z

Ticket

Link to Github Issue

Problem description

As part of the TT-Mesh scope of work, we want to natively virtualize the devices in the mesh. Part of this virtualization process involves deferring to a single set of allocators (global, per-subdevice) at the MeshDevice level, instead of issuing repeated allocations on each of the per-device allocators.
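To make the lock-step idea concrete, here is a minimal sketch, assuming a simple bump allocator: one allocator at the mesh level picks each address once, and every device in the mesh reuses it. `BumpAllocator`, `ToyMeshBuffer` and `ToyMeshDevice` are hypothetical names for illustration only and are not the tt-metal API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical bump allocator standing in for a real L1/DRAM allocator.
class BumpAllocator {
public:
    BumpAllocator(std::uint64_t base, std::uint64_t size)
        : base_(base), end_(base + size), next_(base) {}

    // Returns the start address of the new region, or nullopt when out of space.
    std::optional<std::uint64_t> allocate(std::uint64_t bytes) {
        if (bytes > end_ - next_) {
            return std::nullopt;
        }
        const std::uint64_t addr = next_;
        next_ += bytes;
        return addr;
    }

    std::uint64_t bytes_allocated() const { return next_ - base_; }

private:
    std::uint64_t base_;
    std::uint64_t end_;
    std::uint64_t next_;
};

// Hypothetical handle: a single address that is valid on every device in the mesh.
struct ToyMeshBuffer {
    std::uint64_t address;
    std::uint64_t size;
    std::size_t num_devices;
};

// Hypothetical mesh wrapper: one allocation decision is made for the whole mesh,
// instead of asking each per-device allocator separately.
class ToyMeshDevice {
public:
    ToyMeshDevice(std::size_t num_devices, std::uint64_t base, std::uint64_t size)
        : num_devices_(num_devices), allocator_(base, size) {
        assert(num_devices > 0);
    }

    std::optional<ToyMeshBuffer> allocate(std::uint64_t bytes) {
        const auto addr = allocator_.allocate(bytes);
        if (!addr) {
            return std::nullopt;
        }
        return ToyMeshBuffer{*addr, bytes, num_devices_};
    }

    std::uint64_t bytes_allocated() const { return allocator_.bytes_allocated(); }

private:
    std::size_t num_devices_;
    BumpAllocator allocator_;
};
```

Because the address is chosen once, the per-device allocators cannot drift apart, which is the property the description refers to as lock-step allocation.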

What's changed

Following the helpful cleanup work from @tt-aho in "#16625: Refactor tracking of sub-device managers from Device to a new class" (#16683), this PR introduces the SubDeviceManagerTracker at the MeshDevice level.
This change adds support for SubDevice management and lock-step allocation of L1/DRAM buffers across devices.
This change also helps implement the IDevice interface APIs responsible for querying allocation state.
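As a rough illustration of the ownership described above, the sketch below has a mesh-level object own a single tracker and forward allocation-state queries to it. It assumes a tracker that holds one global allocator state plus one per sub-device; `ToyAllocatorState`, `ToyTracker` and `ToyMesh` are hypothetical names, and the real SubDeviceManagerTracker and IDevice interfaces differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical allocation-state record; a real allocator tracks far more.
struct ToyAllocatorState {
    std::uint64_t capacity;
    std::uint64_t used;
};

// Hypothetical tracker: one global allocator state plus one per sub-device.
class ToyTracker {
public:
    ToyTracker(std::uint64_t global_capacity, const std::vector<std::uint64_t>& sub_device_capacities)
        : global_{global_capacity, 0} {
        for (const std::uint64_t capacity : sub_device_capacities) {
            sub_devices_.push_back(ToyAllocatorState{capacity, 0});
        }
    }

    bool reserve_global(std::uint64_t bytes) { return reserve(global_, bytes); }

    bool reserve_on_sub_device(std::size_t id, std::uint64_t bytes) {
        assert(id < sub_devices_.size());
        return reserve(sub_devices_[id], bytes);
    }

    std::uint64_t global_free() const { return global_.capacity - global_.used; }

    std::uint64_t sub_device_free(std::size_t id) const {
        assert(id < sub_devices_.size());
        return sub_devices_[id].capacity - sub_devices_[id].used;
    }

private:
    // Records the reservation only if it fits in the remaining capacity.
    static bool reserve(ToyAllocatorState& state, std::uint64_t bytes) {
        if (bytes > state.capacity - state.used) {
            return false;
        }
        state.used += bytes;
        return true;
    }

    ToyAllocatorState global_;
    std::vector<ToyAllocatorState> sub_devices_;
};

// Hypothetical mesh-level device: owns the tracker and answers state queries through it.
class ToyMesh {
public:
    ToyMesh(std::uint64_t global_capacity, std::vector<std::uint64_t> sub_device_capacities)
        : tracker_(std::make_unique<ToyTracker>(global_capacity, std::move(sub_device_capacities))) {}

    ToyTracker& tracker() { return *tracker_; }

    std::uint64_t free_bytes() const { return tracker_->global_free(); }
    std::uint64_t free_bytes(std::size_t sub_device_id) const {
        return tracker_->sub_device_free(sub_device_id);
    }

private:
    std::unique_ptr<ToyTracker> tracker_;
};
```

With this shape, a caller asks the mesh for allocation state once instead of polling each device, and the answer is the same for every device by construction.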

Checklist

Post commit CI passes: https://github.com/tenstorrent/tt-metal/actions/runs/12944333124
T3000 Regressions: https://github.com/tenstorrent/tt-metal/actions/runs/12937359046

github-actions

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

tests/tt_metal/distributed/test_mesh_allocator.cpp

omilyutin-tt

Some high-level questions for now. In general, I think it is worth exploring a better "allocator" abstraction that is independent of the details of mesh / single device / sub-device.

tt_metal/distributed/mesh_buffer.cpp

tt_metal/distributed/mesh_device.cpp

tt_metal/impl/sub_device/sub_device_manager.cpp

tt-aho

Sub-Device related changes look okay to me. Cleanup of how we want to expose these apis/interfaces instead of having a bunch of wrapping fns is a different discussion/issue from this pr.

tt_metal/distributed/mesh_device.cpp

tt_metal/api/tt-metalium/sub_device_manager_tracker.hpp

omilyutin-tt · 2025-01-24T06:27:43Z

tt_metal/api/tt-metalium/mesh_device.hpp

-
-    void initialize();
+    std::unique_ptr<SubDeviceManagerTracker> sub_device_manager_tracker_;
+    std::unique_ptr<WorkExecutor> work_executor_;


Hmm, isn't this new?

cfjchu · 2025-01-24

I mentioned this yesterday... This is being added to provide implementations of the IDevice interface's executor methods, such as push_work. Right now I'm defaulting the executor to synchronous mode. It was added because Buffer::create calls push_work, which requires executor methods on MeshDevice.

A separate discussion is needed about removing this from the IDevice interface, but my goal is to unify the APIs as-is, and to do so incrementally, in preparation for TT-NN integration.
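The synchronous default mentioned in this reply can be pictured with a minimal sketch, assuming an executor that either runs work at the call site or queues it for later. `ToyWorkExecutor`, `ToyExecutorMode` and `drain` are hypothetical names; the real WorkExecutor in tt-metal has its own API, and only the method name `push_work` is taken from the comment above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical execution modes for the toy executor.
enum class ToyExecutorMode { Synchronous, Deferred };

class ToyWorkExecutor {
public:
    explicit ToyWorkExecutor(ToyExecutorMode mode = ToyExecutorMode::Synchronous) : mode_(mode) {}

    // In synchronous mode the work runs before push_work returns;
    // in deferred mode it is queued until drain() is called.
    void push_work(std::function<void()> work) {
        assert(static_cast<bool>(work));
        if (mode_ == ToyExecutorMode::Synchronous) {
            work();
            return;
        }
        pending_.push(std::move(work));
    }

    // Runs all queued work in FIFO order (a no-op in synchronous mode).
    void drain() {
        while (!pending_.empty()) {
            std::function<void()> work = std::move(pending_.front());
            pending_.pop();
            work();
        }
    }

    std::size_t pending() const { return pending_.size(); }

private:
    ToyExecutorMode mode_;
    std::queue<std::function<void()>> pending_;
};
```

Defaulting to the synchronous branch means a caller such as a buffer-creation path can rely on its work having completed when `push_work` returns, without any queue or worker thread being involved.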

cfjchu requested review from aliuTT, tt-asaigal, omilyutin-tt, abhullar-tt, pgkeller, tt-aho, tt-dma and ubcheema as code owners January 18, 2025 01:10

cfjchu force-pushed the jchu/mesh-allocator branch from 1f34d17 to 92a2dd9 Compare January 18, 2025 02:34

github-actions bot reviewed Jan 18, 2025

tests/tt_metal/distributed/test_mesh_allocator.cpp (outdated)

cfjchu force-pushed the jchu/mesh-allocator branch from 92a2dd9 to d91d8e5 Compare January 18, 2025 02:44

omilyutin-tt reviewed Jan 18, 2025

tt-asaigal reviewed Jan 20, 2025

cfjchu force-pushed the jchu/mesh-allocator branch 2 times, most recently from 5968a64 to 2f3c743 Compare January 21, 2025 23:59

cfjchu requested a review from davorchap as a code owner January 21, 2025 23:59

cfjchu force-pushed the jchu/mesh-allocator branch 4 times, most recently from 89a2fd9 to a04ce94 Compare January 23, 2025 21:05

tt-aho approved these changes Jan 23, 2025

tt-asaigal approved these changes Jan 23, 2025

omilyutin-tt reviewed Jan 23, 2025

tt_metal/distributed/mesh_device.cpp (outdated)

omilyutin-tt reviewed Jan 24, 2025

tt_metal/api/tt-metalium/sub_device_manager_tracker.hpp

cfjchu force-pushed the jchu/mesh-allocator branch from a04ce94 to 6558bf9 Compare January 24, 2025 06:12

#0: Hoist SubDeviceManager/Lock-Step Allocator to MeshDevice

86d6a3b

cfjchu force-pushed the jchu/mesh-allocator branch from 6558bf9 to 86d6a3b Compare January 24, 2025 06:14

omilyutin-tt reviewed Jan 24, 2025

omilyutin-tt approved these changes Jan 24, 2025

ayerofieiev-tt approved these changes Jan 24, 2025

cfjchu merged commit 2c2110c into main Jan 24, 2025
218 of 220 checks passed

cfjchu deleted the jchu/mesh-allocator branch January 24, 2025 09:27
