I have a question about the row-wise sharding behavior in TorchRec when running on multiple machines.
My setup:
- Embedding table size: 1.6B rows
- Number of machines: 2
- GPUs per machine: 8 A100 GPUs (16 GPUs in total)
- Sharding strategy: row-wise sharding
I observed that each GPU gets about 200M rows (1.6B rows per machine, sharded among its 8 local GPUs), rather than the 100M rows I expected (1.6B rows divided evenly across all 16 GPUs).
This suggests that the table is effectively replicated across machines and row-wise sharded only among the 8 GPUs within each machine (1.6B / 8 = 200M rows per GPU), rather than sharded across all available GPUs directly (which would give 1.6B / 16 = 100M rows per GPU).
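To make the arithmetic behind the two possible schemes concrete, here is a minimal sketch in plain Python. None of the names below are TorchRec APIs; they are just illustrative constants for this setup:

```python
# Compare the two candidate row-wise sharding schemes for this setup.
TOTAL_ROWS = 1_600_000_000   # embedding table rows
NUM_MACHINES = 2
GPUS_PER_MACHINE = 8
TOTAL_GPUS = NUM_MACHINES * GPUS_PER_MACHINE  # 16

# Scheme A: shard rows globally across all 16 GPUs.
rows_per_gpu_global = TOTAL_ROWS // TOTAL_GPUS        # 100_000_000

# Scheme B: replicate the table per machine, shard rows only among
# the 8 local GPUs (consistent with the observed ~200M rows per GPU).
rows_per_gpu_local = TOTAL_ROWS // GPUS_PER_MACHINE   # 200_000_000

print(rows_per_gpu_global, rows_per_gpu_local)
```

The observed ~200M rows per GPU matches Scheme B, which is what prompts the questions below.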
Could someone explain:
1. Is this the expected behavior?
2. What's the rationale behind this design?
3. Which part of the source code controls this sharding behavior?
Thanks in advance!