Support GPU-to-CPU synchronization dependency with HolisticTraceAnalysis #57
Hi, how is the review going?
Hi, @JoongunPark. We did not get a chance to review and test because we have an urgent task internally. Thank you for your patience.
@JoongunPark - we may need 1-2 more weeks since we are setting up integration tests internally as we speak. We will try to expedite this ASAP. Thank you for your patience.
I have tested with Taekyung's latest enhancement. It works well in my environment (Python 3.10.13, Linux 5.15.0-105-generic).
Also, as he mentioned, the code now builds the sync dependency with the closest next CPU operator instead of the cuda_runtime op.
Update et_feeder for compatibility with Chakra schema v0.0.4
@JoongunPark can you please resolve the merge conflicts? We can merge this PR.
Co-authored-by: Joongun Park <[email protected]>
…onverter Co-authored-by: Joongun Park <[email protected]>
…akra_trace_link Co-authored-by: Joongun Park <[email protected]>
Co-authored-by: Joongun Park <[email protected]>
Merging based on @TaekyungHeo's feedback and review. Thanks for the PR @JoongunPark and thanks for the review @TaekyungHeo.
My apologies for the delayed recognition of the merge conflicts. Thank you so much for reviewing and managing this PR, @srinivas212 and @TaekyungHeo!
Summary
This PR introduces dependencies from GPU operators to CPU operators using the critical path analysis in HolisticTraceAnalysis (HTA). Chakra's simulation flow requires postprocessors such as the trace linker and the converter, which merge Chakra host traces with Chakra device traces and encode dependencies. Currently, the postprocessors only encode dependencies from CPU operators to GPU operators. However, a dependency can also run from a GPU operator to a CPU operator when the CPU operator has to wait for the GPU operator. To identify such dependencies, this PR uses the synchronization dependencies reported by HTA's critical path analysis. A synchronization dependency occurs when a CPU operator has to wait for a dispatched GPU operator to complete, which makes it a direct fit for identifying GPU-to-CPU dependencies.
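To make the two edge directions concrete, here is a minimal sketch of how launch edges (CPU to GPU) and synchronization edges (GPU to CPU) could be combined into one dependency map. It is not the code in this PR: the function `build_dependencies` and the operator names are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def build_dependencies(launch_edges, sync_edges):
    """Map each operator to the list of operators it depends on.

    launch_edges: (cpu_op, gpu_op) pairs; the GPU op depends on the
                  CPU op that launched it.
    sync_edges:   (gpu_op, cpu_op) pairs; the CPU op waits for the
                  GPU op to finish (a synchronization dependency).
    """
    deps = defaultdict(list)
    for cpu_op, gpu_op in launch_edges:
        deps[gpu_op].append(cpu_op)
    for gpu_op, cpu_op in sync_edges:
        deps[cpu_op].append(gpu_op)
    return dict(deps)


# Hypothetical operator names, for illustration only.
deps = build_dependencies(
    launch_edges=[("launch_op", "gpu_kernel")],
    sync_edges=[("gpu_kernel", "next_cpu_op")],
)
print(deps)
```

Without the `sync_edges` pass, `next_cpu_op` would have no recorded dependency on `gpu_kernel`, which is the gap this PR describes.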
Please note that:
--rank for chakra_trace_link.
Test Plan
Download and install HTA.
Next, you need to collect traces by following the instructions here: pytorch/pytorch#105187.
After that, you can load sync dependencies and print them out with the following script:
You can run it with the following command:
cuda-sync.zip
/tmp/out
The script identifies two synchronization dependencies. In this test, we focus on the dependency between 'ncclDevKernel_ReduceScatter_Sum_f32_RING_LL(ncclDevComm*, unsigned long, ncclWork*)' and 'cudaDeviceSynchronize'.
Let's confirm our observation with a trace visualizer. You can open Kineto traces at https://perfetto.dev/. Searching for ncclDevKernel_ReduceScatter_Sum_f32_RING_LL shows that it is a GPU kernel (category field) with an external ID of 13847. Near that operator, but in the CPU row of the visualization, you can find cudaDeviceSynchronize, a cuda_runtime operator with an external ID of 94792. Because the toolchain does not treat cuda_runtime operators as simulatable, the dependency is attached to the closest later CPU operator instead: aten::empty, with an external ID of 16392, should depend on the GPU kernel.
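The "closest but later CPU operator" rule can be sketched as a lookup over CPU operators sorted by start time. This is an illustration, not the trace linker's actual code. The `Op` class, the function name, all timestamps, and the `aten::add` / `aten::mul` entries are hypothetical; only the three operator names and external IDs from the paragraph above come from the trace. It also assumes "later" means a start time strictly after the start of the cuda_runtime op.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    cat: str          # e.g. "cpu_op", "cuda_runtime", "kernel"
    external_id: int
    ts: int           # start timestamp (hypothetical values below)


def closest_later_cpu_op(sync_op, cpu_ops):
    """Return the first CPU op that starts strictly after sync_op, or None."""
    ordered = sorted(cpu_ops, key=lambda op: op.ts)
    idx = bisect_right([op.ts for op in ordered], sync_op.ts)
    return ordered[idx] if idx < len(ordered) else None


kernel = Op("ncclDevKernel_ReduceScatter_Sum_f32_RING_LL", "kernel", 13847, ts=100)
sync = Op("cudaDeviceSynchronize", "cuda_runtime", 94792, ts=110)
cpu_ops = [
    Op("aten::add", "cpu_op", 16000, ts=90),     # hypothetical earlier op
    Op("aten::empty", "cpu_op", 16392, ts=150),
    Op("aten::mul", "cpu_op", 16400, ts=200),    # hypothetical later op
]

# Re-target the sync dependency from the cuda_runtime op to a simulatable CPU op.
target = closest_later_cpu_op(sync, cpu_ops)
sync_deps = {target.external_id: [kernel.external_id]}
print(target.name, sync_deps)
```

With these sample values, the dependency lands on aten::empty rather than on cudaDeviceSynchronize, matching what the trace visualizer suggests.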
Let's see if the synchronization dependency is properly encoded in trace_link. Make sure you install Chakra.
Run chakra_trace_link.
You can review ~/megatron_0.json and find that sync dependencies are encoded.
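One way to review the linked file without reading it by hand is to walk the JSON and collect every object that carries a non-empty sync-dependency field. The sketch below assumes the field is named `sync_dep`; that key, the helper name, and the sample document are hypothetical, so check the key that chakra_trace_link actually writes before relying on it.

```python
import json


def find_nodes_with_key(obj, key):
    """Yield every dict nested anywhere in obj whose `key` is non-empty."""
    if isinstance(obj, dict):
        if obj.get(key):
            yield obj
        for value in obj.values():
            yield from find_nodes_with_key(value, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_nodes_with_key(item, key)


# Stand-in for the linked trace; the layout and the "sync_dep" key are assumptions.
sample = json.loads("""
{
  "nodes": [
    {"name": "aten::empty", "sync_dep": [13847]},
    {"name": "aten::add", "sync_dep": []},
    {"name": "aten::mul"}
  ]
}
""")

hits = list(find_nodes_with_key(sample, "sync_dep"))
for node in hits:
    print(node["name"], node["sync_dep"])
```

To run it against the real output, replace `sample` with the result of `json.load` on the file produced by chakra_trace_link.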
Run chakra_converter.
Here are the traces that I used.
cuda-sync.zip
Resnet-50.zip
llama2.zip