I have a question about the row-wise sharding behavior in TorchRec when running on multiple machines.
My setup:
- Embedding table size: 1.6B rows
- Number of machines: 2
- GPUs per machine: 8 A100 GPUs (16 GPUs in total)
- Sharding strategy: row-wise sharding
I observed that each GPU gets about 200M rows (1.6B rows per machine, sharded among its 8 local GPUs), rather than the 100M rows I expected (1.6B rows divided evenly across all 16 GPUs).
This suggests that the table is effectively replicated across machines and row-wise sharded only among the 8 GPUs within each machine (1.6B / 8 = 200M rows per GPU), rather than sharded across all available GPUs directly (which would give 1.6B / 16 = 100M rows per GPU).
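To make the arithmetic behind the two possible schemes concrete, here is a minimal sketch in plain Python. None of the names below are TorchRec APIs; they are just illustrative constants for this setup:

```python
# Compare the two candidate row-wise sharding schemes for this setup.
TOTAL_ROWS = 1_600_000_000   # embedding table rows
NUM_MACHINES = 2
GPUS_PER_MACHINE = 8
TOTAL_GPUS = NUM_MACHINES * GPUS_PER_MACHINE  # 16

# Scheme A: shard rows globally across all 16 GPUs.
rows_per_gpu_global = TOTAL_ROWS // TOTAL_GPUS        # 100_000_000

# Scheme B: replicate the table per machine, shard rows only among
# the 8 local GPUs (consistent with the observed ~200M rows per GPU).
rows_per_gpu_local = TOTAL_ROWS // GPUS_PER_MACHINE   # 200_000_000

print(rows_per_gpu_global, rows_per_gpu_local)
```

The observed ~200M rows per GPU matches Scheme B, which is what prompts the questions below.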
Could someone explain:
1. Is this the expected behavior?
2. What's the rationale behind this design?
3. Which part of the source code controls this sharding behavior?
Thanks in advance!