Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify the kineto filepath explicitly when running HTA analysis #167

Conversation

AlexDenisov
Copy link
Member

Summary

Without specifying the kineto filepath explicitly, HTA may pick arbitrary files from the trace_dir and either provide incorrect analysis results, or fail in some weird ways.

Test Plan

I tested it locally within my setup, and also run the existing test suite, but I guess this is not enough to move forward.
What would be the best way to add a test case capturing such a behavior? Shall I add some more test data?
Would appreciate any suggestions on how to make this PR better 🙌

Additional Notes

As a concrete example, I have the following files in one folder:

trace_linked.0.json
trace_kineto.0.json
trace_pytorch.0.json

When I run the linker as follows:

chakra_trace_link --rank 0 \
  --chakra-host-trace trace_pytorch.0.json \
  --chakra-device-trace trace_kineto.0.json \
  --output-file trace_linked.0.json

then the underlying HTA analysis will pick all the files in the trace_dir, despite the fact that I explicitly specified which files to link.

This is one of the errors I hit within my setup:

Failure
> chakra_trace_link --rank 0 --chakra-host-trace trace_pytorch.0.json --chakra-device-trace trace_kineto.0.json --output-file trace_linked.0.json
[2024-11-11 12:20:19,449] trace.py:328 [INFO]: ~/traces
[2024-11-11 12:20:19,514] trace_file.py:61 [ERROR]: If the trace file does not have the rank specified in it, then add the following snippet key to the json files to use HTA; "distributedInfo": {"rank": 0}. If there are multiple traces files, then each file should have a unique rank value.
[2024-11-11 12:20:19,521] trace_file.py:61 [ERROR]: If the trace file does not have the rank specified in it, then add the following snippet key to the json files to use HTA; "distributedInfo": {"rank": 0}. If there are multiple traces files, then each file should have a unique rank value.
[2024-11-11 12:20:19,529] trace_file.py:61 [ERROR]: If the trace file does not have the rank specified in it, then add the following snippet key to the json files to use HTA; "distributedInfo": {"rank": 0}. If there are multiple traces files, then each file should have a unique rank value.
[2024-11-11 12:20:19,529] trace_file.py:92 [WARNING]: There is no item in the rank to trace file map.
[2024-11-11 12:20:19,529] trace.py:474 [INFO]: ranks=[]
[2024-11-11 12:20:19,529] trace.py:480 [ERROR]: The list of ranks to be parsed is empty.
[2024-11-11 12:20:19,529] trace.py:483 [WARNING]: leaving parse_traces duration=0.00 seconds

Traceback (most recent call last):
  File "~/chakra/.venv/bin/chakra_trace_link", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "~/chakra/.venv/lib/python3.11/site-packages/chakra/src/trace_link/trace_link.py", line 47, in main
    linker.link(args.rank, args.chakra_host_trace, args.chakra_device_trace, args.output_file)
  File "~/chakra/.venv/lib/python3.11/site-packages/chakra/src/trace_link/trace_linker.py", line 70, in link
    sync_deps = self.load_sync_dependencies(rank, chakra_device_trace)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/chakra/.venv/lib/python3.11/site-packages/chakra/src/trace_link/trace_linker.py", line 120, in load_sync_dependencies
    trace_analysis = TraceAnalysis(trace_dir=trace_dir)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/chakra/.venv/lib/python3.11/site-packages/hta/trace_analysis.py", line 37, in __init__
    self.t.load_traces(include_last_profiler_step)
  File "~/chakra/.venv/lib/python3.11/site-packages/hta/common/trace.py", line 353, in load_traces
    self.align_and_filter_trace(include_last_profiler_step)
  File "~/chakra/.venv/lib/python3.11/site-packages/hta/common/trace.py", line 493, in align_and_filter_trace
    self._align_all_ranks()
  File "~/chakra/.venv/lib/python3.11/site-packages/hta/common/trace.py", line 615, in _align_all_ranks
    self.min_ts = min(trace_df["ts"].min() for trace_df in self.traces.values())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: min() arg is an empty sequence

And here is the output with the patched version:

Success
> chakra_trace_link --rank 0 --chakra-host-trace trace_pytorch.0.json --chakra-device-trace trace_kineto.0.json --output-file trace_linked.0.json
[2024-11-11 12:24:35,997] trace.py:328 [INFO]: ~/traces
[2024-11-11 12:24:35,997] trace.py:474 [INFO]: ranks=[0]
[2024-11-11 12:24:36,071] trace_parser.py:107 [WARNING]: Parsed ~/traces/trace_kineto.0.json time = 0.07 seconds
[2024-11-11 12:24:36,086] trace_parser.py:317 [WARNING]: Rounding down ns resolution events due to issue with events overlapping. ts dtype = float64, dur dtype = float64.Please see https://github.com/pytorch/pytorch/pull/122425
[2024-11-11 12:24:36,127] trace_parser.py:430 [WARNING]: Parsed ~/traces/trace_kineto.0.json backend=ParserBackend.JSON in 0.13 seconds; current PID:3256
[2024-11-11 12:24:36,148] trace.py:236 [WARNING]: Overall parsing of ~/traces/trace_kineto.0.json in 0.15 seconds; current PID:3256
[2024-11-11 12:24:36,151] trace.py:449 [WARNING]: leaving parse_multiple_ranks duration=0.15 seconds
[2024-11-11 12:24:36,151] trace.py:483 [WARNING]: leaving parse_traces duration=0.15 seconds
[2024-11-11 12:24:36,151] trace.py:654 [WARNING]: There is only one iteration in the trace. The analysis result may not be accurate.
[2024-11-11 12:24:36,152] critical_path_analysis.py:1467 [WARNING]: Trace does not contain CUDA Synchronization events so the results of analysis could be inaccurate.
[2024-11-11 12:24:36,152] critical_path_analysis.py:1471 [WARNING]: Please see this PR to learn how to enable CUDA sync events https://github.com/pytorch/pytorch/pull/105187
[2024-11-11 12:24:36,152] critical_path_analysis.py:1493 [INFO]: Looking up events under [0, 0) instance(s) of 'ProfilerStep' annotation.
[2024-11-11 12:24:36,153] critical_path_analysis.py:1509 [INFO]: Looking up events within the window (1859, 1400577.0)
[2024-11-11 12:24:36,161] critical_path_analysis.py:1533 [INFO]: Clipped dataframe has 7575 events
[2024-11-11 12:24:36,163] critical_path_analysis.py:1540 [INFO]: Preprocessing took 0.01 seconds
[2024-11-11 12:24:36,445] critical_path_analysis.py:1544 [INFO]: CPGraph construction took 0.28 seconds
[2024-11-11 12:24:36,448] critical_path_analysis.py:1171 [WARNING]: Ignoring negative weight of -1 for CPEdge(begin=9022, end=9023, weight=np.float64(-1.0), type=<CPEdgeType.OPERATOR_KERNEL: 'critical_path_operator'>)
[2024-11-11 12:24:36,448] critical_path_analysis.py:1171 [WARNING]: Ignoring negative weight of -1 for CPEdge(begin=9032, end=9033, weight=np.float64(-1.0), type=<CPEdgeType.OPERATOR_KERNEL: 'critical_path_operator'>)
[2024-11-11 12:24:36,448] critical_path_analysis.py:1171 [WARNING]: Ignoring negative weight of -1 for CPEdge(begin=9220, end=9221, weight=np.float64(-1.0), type=<CPEdgeType.OPERATOR_KERNEL: 'critical_path_operator'>)
[2024-11-11 12:24:36,448] critical_path_analysis.py:1171 [WARNING]: Ignoring negative weight of -1 for CPEdge(begin=9286, end=9287, weight=np.float64(-1.0), type=<CPEdgeType.OPERATOR_KERNEL: 'critical_path_operator'>)
[2024-11-11 12:24:36,448] critical_path_analysis.py:1171 [WARNING]: Ignoring negative weight of -1 for CPEdge(begin=9362, end=9363, weight=np.float64(-1.0), type=<CPEdgeType.OPERATOR_KERNEL: 'critical_path_operator'>)
[2024-11-11 12:24:36,468] critical_path_analysis.py:1143 [INFO]: calculating critical path took 0.023477 seconds
[2024-11-11 12:24:36,490] trace_parser.py:107 [WARNING]: Parsed ~/traces/trace_kineto.0.json time = 0.02 seconds
[2024-11-11 12:24:36,492] trace_linker.py:280 [INFO]: Enforcing sync order in Kineto traces.
[2024-11-11 12:24:36,492] trace_linker.py:326 [INFO]: Thread PyTorch Profiler: Identifying synchronization dependency.
[2024-11-11 12:24:36,492] trace_linker.py:326 [INFO]: Thread 0: Identifying synchronization dependency.
[2024-11-11 12:24:36,493] trace_linker.py:326 [INFO]: Thread Trace PyTorch Profiler: Identifying synchronization dependency.
[2024-11-11 12:24:36,493] trace_linker.py:326 [INFO]: Thread 17810: Identifying synchronization dependency.
[2024-11-11 12:24:36,493] trace_linker.py:326 [INFO]: Thread 17809: Identifying synchronization dependency.
[2024-11-11 12:24:36,493] trace_linker.py:326 [INFO]: Thread 17808: Identifying synchronization dependency.
[2024-11-11 12:24:36,493] trace_linker.py:326 [INFO]: Thread 17779: Identifying synchronization dependency.
[2024-11-11 12:24:36,493] trace_linker.py:326 [INFO]: Thread 7: Identifying synchronization dependency.
[2024-11-11 12:24:36,493] trace_linker.py:326 [INFO]: Thread 17841: Identifying synchronization dependency.
[2024-11-11 12:24:36,493] trace_linker.py:326 [INFO]: Thread : Identifying synchronization dependency.
[2024-11-11 12:24:37,150] trace_link.py:49 [INFO]: Linking process successful. Output file is available at trace_linked.0.json.
[2024-11-11 12:24:37,150] trace_link.py:50 [INFO]: Please run the chakra_converter for further postprocessing.

Without specifying the kineto filepath explicitly, HTA may pick
arbitrary files from the `trace_dir` and either provide incorrect analysis
results, or fail in some weird ways.
@AlexDenisov AlexDenisov requested a review from a team as a code owner November 11, 2024 11:31
Copy link

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@JoongunPark
Copy link
Contributor

Thank you for the PR! The code changes look good overall.
I have a few minor questions:

  1. Have you tested your updates with multiple kineto files in the same directory?
  2. Could you share the traces here for testing purposes?
  3. Do you have any insights into the cause of the malfunction?

I’ll run further tests once I have your traces.

@AlexDenisov
Copy link
Member Author

  1. Have you tested your updates with multiple kineto files in the same directory?

Yes, it behaves correctly. In my specific use case it also finishes successfully without the patch, but that's rather by accident. Additionally, it also analyzes more files thus wasting some extra time.
Here is an example output of running against the nanoGPT_DDP traces:

nanoGPT with DDP, two pairs of traces
chakra_trace_link --rank 0 --chakra-host-trace trace_pytorch.0.json --chakra-device-trace trace_kineto.0.json --output-file trace_linked.0.json
[2024-11-12 09:15:18,820] trace.py:328 [INFO]: ~/traces/DDP
[2024-11-12 09:15:18,821] trace_file.py:61 [ERROR]: If the trace file does not have the rank specified in it, then add the following snippet key to the json files to use HTA; "distributedInfo": {"rank": 0}. If there are multiple traces files, then each file should have a unique rank value.
[2024-11-12 09:15:18,822] trace_file.py:61 [ERROR]: If the trace file does not have the rank specified in it, then add the following snippet key to the json files to use HTA; "distributedInfo": {"rank": 0}. If there are multiple traces files, then each file should have a unique rank value.
[2024-11-12 09:15:18,822] trace_file.py:61 [ERROR]: If the trace file does not have the rank specified in it, then add the following snippet key to the json files to use HTA; "distributedInfo": {"rank": 0}. If there are multiple traces files, then each file should have a unique rank value.
[2024-11-12 09:15:18,822] trace_file.py:61 [ERROR]: If the trace file does not have the rank specified in it, then add the following snippet key to the json files to use HTA; "distributedInfo": {"rank": 0}. If there are multiple traces files, then each file should have a unique rank value.
[2024-11-12 09:15:18,822] trace_file.py:94 [INFO]: Rank to trace file map:
{1: '~/traces/DDP/trace_kineto.1.json', 0: '~/traces/DDP/trace_kineto.0.json'}
[2024-11-12 09:15:18,822] trace.py:474 [INFO]: ranks=[0, 1]
[2024-11-12 09:15:19,752] trace_parser.py:107 [WARNING]: Parsed ~/traces/DDP/trace_kineto.0.json time = 0.92 seconds
[2024-11-12 09:15:19,752] trace_parser.py:107 [WARNING]: Parsed ~/traces/DDP/trace_kineto.1.json time = 0.92 seconds
[2024-11-12 09:15:19,869] trace_parser.py:317 [WARNING]: Rounding down ns resolution events due to issue with events overlapping. ts dtype = float64, dur dtype = float64.Please see https://github.com/pytorch/pytorch/pull/122425
[2024-11-12 09:15:19,869] trace_parser.py:317 [WARNING]: Rounding down ns resolution events due to issue with events overlapping. ts dtype = float64, dur dtype = float64.Please see https://github.com/pytorch/pytorch/pull/122425
[2024-11-12 09:15:20,167] trace_parser.py:430 [WARNING]: Parsed ~/traces/DDP/trace_kineto.0.json backend=ParserBackend.JSON in 1.34 seconds; current PID:17857
[2024-11-12 09:15:20,168] trace_parser.py:430 [WARNING]: Parsed ~/traces/DDP/trace_kineto.1.json backend=ParserBackend.JSON in 1.34 seconds; current PID:17858
[2024-11-12 09:15:20,441] trace.py:236 [WARNING]: Overall parsing of ~/traces/DDP/trace_kineto.1.json in 1.61 seconds; current PID:17858
[2024-11-12 09:15:20,441] trace.py:236 [WARNING]: Overall parsing of ~/traces/DDP/trace_kineto.0.json in 1.61 seconds; current PID:17857
[2024-11-12 09:15:20,548] trace.py:449 [WARNING]: leaving parse_multiple_ranks duration=1.73 seconds
[2024-11-12 09:15:20,548] trace.py:483 [WARNING]: leaving parse_traces duration=1.73 seconds
[2024-11-12 09:15:20,582] critical_path_analysis.py:1467 [WARNING]: Trace does not contain CUDA Synchronization events so the results of analysis could be inaccurate.
[2024-11-12 09:15:20,582] critical_path_analysis.py:1471 [WARNING]: Please see this PR to learn how to enable CUDA sync events https://github.com/pytorch/pytorch/pull/105187
[2024-11-12 09:15:20,583] critical_path_analysis.py:1493 [INFO]: Looking up events under [0, 0) instance(s) of 'ProfilerStep' annotation.
[2024-11-12 09:15:20,583] critical_path_analysis.py:1509 [INFO]: Looking up events within the window (2598, 401406.0)
[2024-11-12 09:15:20,595] critical_path_analysis.py:1533 [INFO]: Clipped dataframe has 7753 events
[2024-11-12 09:15:20,598] critical_path_analysis.py:1540 [INFO]: Preprocessing took 0.02 seconds
[2024-11-12 09:15:21,049] critical_path_analysis.py:1544 [INFO]: CPGraph construction took 0.45 seconds
[2024-11-12 09:15:21,052] critical_path_analysis.py:1171 [WARNING]: Ignoring negative weight of -1 for CPEdge(begin=8970, end=8971, weight=np.float64(-1.0), type=<CPEdgeType.OPERATOR_KERNEL: 'critical_path_operator'>)
[2024-11-12 09:15:21,052] critical_path_analysis.py:1171 [WARNING]: Ignoring negative weight of -1 for CPEdge(begin=9036, end=9037, weight=np.float64(-1.0), type=<CPEdgeType.OPERATOR_KERNEL: 'critical_path_operator'>)
[2024-11-12 09:15:21,052] critical_path_analysis.py:1171 [WARNING]: Ignoring negative weight of -1 for CPEdge(begin=9058, end=9059, weight=np.float64(-1.0), type=<CPEdgeType.OPERATOR_KERNEL: 'critical_path_operator'>)
[2024-11-12 09:15:21,052] critical_path_analysis.py:1171 [WARNING]: Ignoring negative weight of -1 for CPEdge(begin=9066, end=9067, weight=np.float64(-1.0), type=<CPEdgeType.OPERATOR_KERNEL: 'critical_path_operator'>)
[2024-11-12 09:15:21,052] critical_path_analysis.py:1171 [WARNING]: Ignoring negative weight of -1 for CPEdge(begin=9110, end=9111, weight=np.float64(-1.0), type=<CPEdgeType.OPERATOR_KERNEL: 'critical_path_operator'>)
[2024-11-12 09:15:21,052] critical_path_analysis.py:1171 [WARNING]: Ignoring negative weight of -1 for CPEdge(begin=9230, end=9231, weight=np.float64(-1.0), type=<CPEdgeType.OPERATOR_KERNEL: 'critical_path_operator'>)
[2024-11-12 09:15:21,052] critical_path_analysis.py:1171 [WARNING]: Ignoring negative weight of -1 for CPEdge(begin=9316, end=9317, weight=np.float64(-1.0), type=<CPEdgeType.OPERATOR_KERNEL: 'critical_path_operator'>)
[2024-11-12 09:15:21,052] critical_path_analysis.py:1171 [WARNING]: Ignoring negative weight of -1 for CPEdge(begin=9518, end=9519, weight=np.float64(-1.0), type=<CPEdgeType.OPERATOR_KERNEL: 'critical_path_operator'>)
[2024-11-12 09:15:21,052] critical_path_analysis.py:1171 [WARNING]: Ignoring negative weight of -1 for CPEdge(begin=9532, end=9533, weight=np.float64(-1.0), type=<CPEdgeType.OPERATOR_KERNEL: 'critical_path_operator'>)
[2024-11-12 09:15:21,052] critical_path_analysis.py:1171 [WARNING]: Ignoring negative weight of -1 for CPEdge(begin=9578, end=9579, weight=np.float64(-1.0), type=<CPEdgeType.OPERATOR_KERNEL: 'critical_path_operator'>)
[2024-11-12 09:15:21,052] critical_path_analysis.py:1171 [WARNING]: Ignoring negative weight of -1 for CPEdge(begin=9644, end=9645, weight=np.float64(-1.0), type=<CPEdgeType.OPERATOR_KERNEL: 'critical_path_operator'>)
[2024-11-12 09:15:21,071] critical_path_analysis.py:1143 [INFO]: calculating critical path took 0.022226 seconds
[2024-11-12 09:15:21,742] trace_parser.py:107 [WARNING]: Parsed ~/traces/DDP/trace_kineto.0.json time = 0.67 seconds
[2024-11-12 09:15:21,759] trace_linker.py:280 [INFO]: Enforcing sync order in Kineto traces.
[2024-11-12 09:15:21,759] trace_linker.py:326 [INFO]: Thread PyTorch Profiler: Identifying synchronization dependency.
[2024-11-12 09:15:21,759] trace_linker.py:326 [INFO]: Thread 0: Identifying synchronization dependency.
[2024-11-12 09:15:21,759] trace_linker.py:326 [INFO]: Thread Trace PyTorch Profiler: Identifying synchronization dependency.
[2024-11-12 09:15:21,759] trace_linker.py:326 [INFO]: Thread 1979522: Identifying synchronization dependency.
[2024-11-12 09:15:21,759] trace_linker.py:326 [INFO]: Thread 1979521: Identifying synchronization dependency.
[2024-11-12 09:15:21,759] trace_linker.py:326 [INFO]: Thread 1979520: Identifying synchronization dependency.
[2024-11-12 09:15:21,759] trace_linker.py:326 [INFO]: Thread 1979519: Identifying synchronization dependency.
[2024-11-12 09:15:21,759] trace_linker.py:326 [INFO]: Thread 1978831: Identifying synchronization dependency.
[2024-11-12 09:15:21,759] trace_linker.py:326 [INFO]: Thread 7: Identifying synchronization dependency.
[2024-11-12 09:15:21,760] trace_linker.py:326 [INFO]: Thread 22: Identifying synchronization dependency.
[2024-11-12 09:15:21,760] trace_linker.py:326 [INFO]: Thread 1980072: Identifying synchronization dependency.
[2024-11-12 09:15:21,760] trace_linker.py:326 [INFO]: Thread 960489024: Identifying synchronization dependency.
[2024-11-12 09:15:21,760] trace_linker.py:326 [INFO]: Thread : Identifying synchronization dependency.
[2024-11-12 09:15:24,793] trace_link.py:49 [INFO]: Linking process successful. Output file is available at trace_linked.0.json.
[2024-11-12 09:15:24,793] trace_link.py:50 [INFO]: Please run the chakra_converter for further postprocessing.
  1. Could you share the traces here for testing purposes?

Attached, these are coming from nanoGPT with a single training iteration. There are two cases: "normal" run, and a DDP with 2 GPUs.

nanoGPT_DDP_2x.zip

nanoGPT_no_DDP.zip

  1. Do you have any insights into the cause of the malfunction?

In case the trace_files dict is empty, hta attempts to construct the dictionary by collecting all the files in the trace_dir
https://github.com/facebookresearch/HolisticTraceAnalysis/blob/b0d4eee4df2c162c2f48ce987bb638aadac9c605/hta/common/trace_file.py#L32-L34

@fh-TurbaAI
Copy link

This will likely fix #163

Thanks!

@JoongunPark
Copy link
Contributor

JoongunPark commented Nov 13, 2024

Thank you so much!
I will also test your PR locally in my machine and share the result here.

@JoongunPark
Copy link
Contributor

JoongunPark commented Nov 14, 2024

I have tested this PR in my environment, and it runs well.
Also, it reduces trace loading time for HTA's since it does not need to load all traces anymore.

@AlexDenisov
Copy link
Member Author

Thank you for checking @JoongunPark, highly appreciated!

Is there anything else I can do to make this PR mergeable?

@JoongunPark
Copy link
Contributor

Thank you for checking @JoongunPark, highly appreciated!

Is there anything else I can do to make this PR mergeable?

This PR looks good to me as it is. @AlexDenisov, thank you so much for your efforts!

May I kindly ask for your help to merge this PR, @TaekyungHeo and @srinivas212?
Thank you! 😊

Copy link
Contributor

@tushar-krishna tushar-krishna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledging Joongun's review of the PR and merging it. Thank you Alex.

@tushar-krishna tushar-krishna merged commit b915ab8 into mlcommons:main Nov 16, 2024
9 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Nov 16, 2024
@AlexDenisov AlexDenisov deleted the alexdenisov/specify-kineto-file-explicitly branch November 18, 2024 08:04
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants