-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specify the kineto filepath explicitly when running HTA analysis #167
Specify the kineto filepath explicitly when running HTA analysis #167
Conversation
Without specifying the kineto filepath explicitly, HTA may pick arbitrary files from the `trace_dir` and either provide incorrect analysis results, or fail in some weird ways.
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
Thank you for the PR! The code changes look good overall.
I’ll run further tests once I have your traces. |
Yes, it behaves correctly. In my specific use case it also finishes successfully without the patch, but that's rather by accident. Additionally, it also analyzes more files thus wasting some extra time. nanoGPT with DDP, two pairs of traces
Attached, these are coming from nanoGPT with a single training iteration. There are two cases: "normal" run, and a DDP with 2 GPUs.
In case the |
This will likely fix #163 Thanks! |
Thank you so much! |
I have tested this PR in my environment, and it runs well. |
Thank you for checking @JoongunPark, highly appreciated! Is there anything else I can do to make this PR mergeable? |
This PR looks good to me as it is. @AlexDenisov, thank you so much for your efforts! May I kindly ask for your help to merge this PR, @TaekyungHeo and @srinivas212? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acknowledging Joongun's review of the PR and merging it. Thank you Alex.
Summary
Without specifying the kineto filepath explicitly, HTA may pick arbitrary files from the
trace_dir
and either provide incorrect analysis results, or fail in some weird ways.Test Plan
I tested it locally within my setup, and also run the existing test suite, but I guess this is not enough to move forward.
What would be the best way to add a test case capturing such a behavior? Shall I add some more test data?
Would appreciate any suggestions on how to make this PR better 🙌
Additional Notes
As a concrete example, I have the following files in one folder:
When I run the linker as follows:
then the underlying HTA analysis will pick all the files in the
trace_dir
, despite the fact that I explicitly specified which files to link.This is one of the errors I hit within my setup:
Failure
And here is the output with the patched version:
Success