Introduction of the raft::device_resources_snmg type #2487

Open
wants to merge 3 commits into
base: branch-24.12


viclafargue
Contributor

@viclafargue viclafargue commented Nov 8, 2024

Introduces the raft::device_resources_snmg type to hold all resources required for the NCCL clique.

Answers #2459
Removed call to raft::comms::build_comms_nccl_only (#2465)

@viclafargue viclafargue requested a review from a team as a code owner November 8, 2024 17:16
@github-actions github-actions bot added the cpp label Nov 8, 2024
@cjnolet cjnolet added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Nov 8, 2024
@codecov-commenter

codecov-commenter commented Nov 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.01%. Comparing base (fccf33e) to head (a269555).
Report is 1 commit behind head on branch-24.12.

❗ There is a different number of reports uploaded between BASE (fccf33e) and HEAD (a269555).

HEAD has 16 fewer uploads than BASE
| Flag | BASE (fccf33e) | HEAD (a269555) |
|------|----------------|----------------|
|      | 19             | 3              |
Additional details and impacted files
@@               Coverage Diff                @@
##           branch-24.12    #2487      +/-   ##
================================================
- Coverage         87.17%   81.01%   -6.17%     
================================================
  Files                25       17       -8     
  Lines               546      511      -35     
================================================
- Hits                476      414      -62     
- Misses               70       97      +27     

@viclafargue viclafargue changed the title from "NCCL clique initialization function" to "Introduction of the raft::device_resources_snmg type" Nov 15, 2024

~device_resources_snmg()
{
#pragma omp parallel for // necessary to avoid hangs
Member


Please add proper documentation to each of the methods in this class (even private methods). It's important for users of this API to know that they'll need to have multiple threads available for this, otherwise it'll hang.
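To illustrate the threading requirement, here is a minimal sketch of why a per-rank teardown can need one thread per rank. It assumes the teardown behaves like a rendezvous that blocks until every rank has entered it, which is only one plausible reading of the "necessary to avoid hangs" comment. `mock_rendezvous` and `teardown_all_ranks` are hypothetical names, not RAFT or NCCL APIs, and `std::thread` stands in for the OpenMP loop used in the PR.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a collective-style call: arrive() returns only
// once all n_ranks participants have called it.
class mock_rendezvous {
 public:
  explicit mock_rendezvous(int n_ranks) : remaining_(n_ranks) {}

  void arrive()
  {
    std::unique_lock<std::mutex> lock(mtx_);
    if (--remaining_ == 0) {
      cv_.notify_all();  // last rank in releases everyone
    } else {
      cv_.wait(lock, [this] { return remaining_ == 0; });
    }
  }

 private:
  std::mutex mtx_;
  std::condition_variable cv_;
  int remaining_;
};

/**
 * @brief Tears down every rank of a mock clique (illustration only).
 *
 * @note One thread is launched per rank. Calling arrive() for each rank
 *       sequentially from a single thread would block forever on the first
 *       rank, because the remaining ranks could never arrive.
 *
 * @param n_ranks number of ranks (GPUs) in the mock clique
 * @return number of ranks whose teardown completed
 */
int teardown_all_ranks(int n_ranks)
{
  mock_rendezvous rendezvous(n_ranks);
  std::vector<int> done(n_ranks, 0);
  std::vector<std::thread> workers;
  for (int rank = 0; rank < n_ranks; ++rank) {
    workers.emplace_back([&rendezvous, &done, rank] {
      rendezvous.arrive();  // blocks until all ranks are here
      done[rank] = 1;       // each thread writes its own slot
    });
  }
  for (auto& w : workers) { w.join(); }

  int completed = 0;
  for (int d : done) { completed += d; }
  return completed;
}
```

The Doxygen `@note` shows the kind of statement the documentation could carry so that callers know a single-threaded environment is not enough.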


namespace raft {

class device_resources_snmg : public resources {
Member


Since this is a device_resources instance, I think it would be better to extend device_resources. That way you can override important methods like get_stream and get_worker_stream to return a main stream. In general, we should expect device_resources_snmg to behave the same as device_resources when passed into an algorithm that only operates on a single GPU, while being able to operate in SNMG mode when passed into an algorithm that supports single-node multi-GPU. When used in single-GPU mode, I think it should by default store and use the id of the GPU that was selected when the SNMG handle was created.
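A rough sketch of the shape being suggested, using mock types only. `mock_device_resources`, `mock_device_resources_snmg`, `mock_stream`, `root_rank`, and `single_gpu_algo` are hypothetical names for illustration; the real raft::device_resources interface differs, and its methods are not assumed to be virtual.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a CUDA stream handle.
using mock_stream = int;

// Mock single-GPU handle (NOT the real raft::device_resources).
class mock_device_resources {
 public:
  explicit mock_device_resources(int device_id)
    : device_id_(device_id), stream_(100 + device_id)
  {
  }
  virtual ~mock_device_resources() = default;

  virtual int get_device() const { return device_id_; }
  virtual mock_stream get_stream() const { return stream_; }

 private:
  int device_id_;
  mock_stream stream_;
};

// Mock SNMG handle: extends the single-GPU handle and remembers which rank
// acts as the "main" GPU when used by single-GPU algorithms.
class mock_device_resources_snmg : public mock_device_resources {
 public:
  mock_device_resources_snmg(const std::vector<int>& device_ids, int root_rank)
    : mock_device_resources(device_ids.at(root_rank)), root_rank_(root_rank)
  {
    for (int id : device_ids) { per_rank_.emplace_back(id); }
  }

  // Single-GPU view: delegate to the root rank's resources.
  int get_device() const override { return per_rank_.at(root_rank_).get_device(); }
  mock_stream get_stream() const override { return per_rank_.at(root_rank_).get_stream(); }

  // Multi-GPU view: expose per-rank resources.
  int n_ranks() const { return static_cast<int>(per_rank_.size()); }
  const mock_device_resources& rank_resources(int rank) const { return per_rank_.at(rank); }

 private:
  int root_rank_;
  std::vector<mock_device_resources> per_rank_;
};

// An algorithm that only knows about the single-GPU interface.
mock_stream single_gpu_algo(const mock_device_resources& res) { return res.get_stream(); }
```

With this layout, passing the SNMG handle to `single_gpu_algo` uses the root GPU's stream, while an SNMG-aware algorithm can iterate over `rank_resources(rank)`.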

Labels
cpp, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)
Projects
Status: In Progress

3 participants