
Question about Row-wise Sharding behavior across multiple machines #2510

Open
whn09 opened this issue Oct 23, 2024 · 0 comments

whn09 commented Oct 23, 2024

I have a question about the row-wise sharding behavior in TorchRec when running on multiple machines.

My setup:

  • Embedding table size: 1.6B rows
  • Number of machines: 2
  • GPUs per machine: 8 A100 GPUs (16 GPUs in total)
  • Sharding strategy: Row-wise sharding

I observed that each GPU gets about 200M rows (1.6B rows divided among the 8 GPUs within a machine), rather than 100M rows (1.6B rows divided across all 16 GPUs).

This suggests that each machine ends up holding the full table, with row-wise sharding applied only across its own 8 GPUs (1.6B / 8 = 200M rows per GPU), rather than the table being sharded across all 16 available GPUs at once (1.6B / 16 = 100M rows per GPU).
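
For context, here is a simplified sketch of how I force row-wise sharding and inspect the resulting plan (not my actual training code; the table name, embedding_dim, and feature name are placeholders, and storage caps may need tuning for a table this large):

```python
import torch
from torchrec.modules.embedding_configs import EmbeddingBagConfig
from torchrec.modules.embedding_modules import EmbeddingBagCollection
from torchrec.distributed.embeddingbag import EmbeddingBagCollectionSharder
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType

# Single 1.6B-row table (embedding_dim / feature name are placeholders).
tables = [
    EmbeddingBagConfig(
        name="large_table",
        embedding_dim=128,
        num_embeddings=1_600_000_000,
        feature_names=["f1"],
    )
]
ebc = EmbeddingBagCollection(tables=tables, device=torch.device("meta"))

# 2 machines x 8 GPUs, with row-wise sharding forced for the table.
planner = EmbeddingShardingPlanner(
    topology=Topology(world_size=16, local_world_size=8, compute_device="cuda"),
    constraints={
        "large_table": ParameterConstraints(
            sharding_types=[ShardingType.ROW_WISE.value]
        )
    },
)
plan = planner.plan(ebc, [EmbeddingBagCollectionSharder()])

# The plan's shard metadata (row offsets/sizes per rank) shows whether the
# table is split 16 ways (~100M rows each) or only 8 ways (~200M rows each).
print(plan)
```

Printing the plan is how I'm checking the per-rank shard sizes; the ~200M-row shards described above are what I see there.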

Could someone explain:

  1. Is this the expected behavior?
  2. What's the rationale behind this design?
  3. Which part of the source code controls this sharding behavior?

Thanks in advance!
