Identify process group init nodes as METADATA nodes #109

Closed
TaekyungHeo wants to merge 1 commit from the comm-group branch

Conversation

TaekyungHeo
Contributor

@TaekyungHeo TaekyungHeo commented Jun 25, 2024

Summary

Identify process group init nodes as METADATA nodes: the "## process_group:init ##" node is now classified as a METADATA node, as shown in the example trace below.
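
For illustration only, a minimal sketch of the classification rule this change implies; the NodeType enum and classify_node helper below are hypothetical and do not reflect the actual chakra converter API:

# Hypothetical sketch of mapping a host-trace node name to a node category,
# with "## process_group:init ##" treated as metadata rather than compute.
from enum import Enum, auto

class NodeType(Enum):
    METADATA = auto()   # descriptive nodes such as process-group init
    COMM_COLL = auto()  # collective communication ops (e.g. record_param_comms)
    COMP = auto()       # regular compute operators

def classify_node(name: str) -> NodeType:
    if name == "## process_group:init ##":
        return NodeType.METADATA
    if name == "record_param_comms":
        return NodeType.COMM_COLL
    return NodeType.COMP

assert classify_node("## process_group:init ##") is NodeType.METADATA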

{
  "schema": "1.1.0-chakra.0.0.4", "pid": 1967424, "time": "2024-06-25 22:13:31", "start_ts": 7067932814,
  "nodes": [
...
    {
      "id": 3, "name": "## process_group:init ##", "ctrl_deps": 2,
      "inputs": {"values": ["[{\"pg_name\": \"0\", \"pg_desc\": \"default_pg\", \"backend_config\": \"cuda:nccl\", \"ranks\": [], \"group_size\": 8, \"group_count\": 67}, {\"pg_name\": \"3\", \"pg_desc\": \"undefined\", \"backend_config\": \"cuda:nccl\", \"ranks\": [1, 3], \"group_size\": 2, \"group_count\": 67}, {\"pg_name\": \"4\", \"pg_desc\": \"undefined\", \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [1, 3], \"group_size\": 2, \"group_count\": 67}, {\"pg_name\": \"11\", \"pg_desc\": \"undefined\", \"backend_config\": \"cuda:nccl\", \"ranks\": [1, 3], \"group_size\": 2, \"group_count\": 67}, {\"pg_name\": \"12\", \"pg_desc\": \"undefined\", \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [1, 3], \"group_size\": 2, \"group_count\": 67}, {\"pg_name\": \"20\", \"pg_desc\": \"undefined\", \"backend_config\": \"cuda:nccl\", \"ranks\": [3], \"group_size\": 1, \"group_count\": 67}, {\"pg_name\": \"26\", \"pg_desc\": \"undefined\", \"backend_config\": \"cuda:nccl\", \"ranks\": [2, 3, 6, 7], \"group_size\": 4, \"group_count\": 67}, {\"pg_name\": \"28\", \"pg_desc\": \"undefined\", \"backend_config\": \"cuda:nccl\", \"ranks\": [2, 3], \"group_size\": 2, \"group_count\": 67}, {\"pg_name\": \"40\", \"pg_desc\": \"undefined\", \"backend_config\": \"cuda:nccl\", \"ranks\": [3, 7], \"group_size\": 2, \"group_count\": 67}, {\"pg_name\": \"41\", \"pg_desc\": \"undefined\", \"backend_config\": \"cuda:nccl\", \"ranks\": [3, 7], \"group_size\": 2, \"group_count\": 67}, {\"pg_name\": \"42\", \"pg_desc\": \"undefined\", \"backend_config\": \"cuda:nccl\", \"ranks\": [3], \"group_size\": 1, \"group_count\": 67}, {\"pg_name\": \"43\", \"pg_desc\": \"undefined\", \"backend_config\": \"cuda:nccl\", \"ranks\": [0, 1, 2, 3], \"group_size\": 4, \"group_count\": 67}, {\"pg_name\": \"45\", \"pg_desc\": \"undefined\", \"backend_config\": \"cuda:nccl\", \"ranks\": [0, 1, 2, 3], \"group_size\": 4, \"group_count\": 67}, {\"pg_name\": \"48\", \"pg_desc\": \"undefined\", \"backend_config\": \"cuda:nccl\", \"ranks\": [2, 3], \"group_size\": 2, \"group_count\": 67}, {\"pg_name\": \"54\", \"pg_desc\": \"undefined\", \"backend_config\": \"cuda:nccl\", \"ranks\": [3], \"group_size\": 1, \"group_count\": 67}, {\"pg_name\": \"61\", \"pg_desc\": \"undefined\", \"backend_config\": \"cuda:nccl\", \"ranks\": [1, 3], \"group_size\": 2, \"group_count\": 67}, {\"pg_name\": \"62\", \"pg_desc\": \"undefined\", \"backend_config\": \"cpu:gloo,cuda:gloo\", \"ranks\": [1, 3], \"group_size\": 2, \"group_count\": 67}]"], "shapes": [[]], "types": ["String"]},
      "outputs": {"values": [], "shapes": [], "types": []},
      "attrs": [{"name": "rf_id", "type": "uint64", "value": 1},{"name": "fw_parent", "type": "uint64", "value": 0},{"name": "seq_id", "type": "int64", "value": -1},{"name": "scope", "type": "uint64", "value": 7},{"name": "tid", "type": "uint64", "value": 1},{"name": "fw_tid", "type": "uint64", "value": 0},{"name": "op_schema", "type": "string", "value": ""},{"name": "kernel_backend", "type": "string", "value": ""},{"name": "kernel_file", "type": "string", "value": ""}]
    },
...
    {
      "id": 801, "name": "record_param_comms", "ctrl_deps": 800,
      "inputs": {"values": [[[782,783,0,6291456,2,"cuda:3"]],9465,["28","undefined"],1,"allreduce",[],[],2,1,2], "shapes": [[[4,2048,768]],[],[[],[]],[],[],[],[],[],[],[]], "types": ["GenericList[Tensor(c10::BFloat16)]","Int","Tuple[String,String]","Int","String","GenericList[]","GenericList[]","Int","Int","Int"]},
      "outputs": {"values": [[[782,783,0,6291456,2,"cuda:3"]]], "shapes": [[[4,2048,768]]], "types": ["GenericList[Tensor(c10::BFloat16)]"]},
      "attrs": [{"name": "rf_id", "type": "uint64", "value": 492},{"name": "fw_parent", "type": "uint64", "value": 0},{"name": "seq_id", "type": "int64", "value": -1},{"name": "scope", "type": "uint64", "value": 0},{"name": "tid", "type": "uint64", "value": 1},{"name": "fw_tid", "type": "uint64", "value": 0},{"name": "op_schema", "type": "string", "value": ""},{"name": "kernel_backend", "type": "string", "value": ""},{"name": "kernel_file", "type": "string", "value": ""}, 
  {"name": "collective_name", "type": "string", "value": "allreduce"}, 
  {"name": "dtype", "type": "string", "value": "BFloat16"}, 
  {"name": "in_msg_nelems", "type": "uint64", "value": 6291456}, 
  {"name": "out_msg_nelems", "type": "uint64", "value": 6291456}, 
  {"name": "in_split_size", "type": "string", "value": "[]"}, 
  {"name": "out_split_size", "type": "string", "value": "[]"}, 
  {"name": "global_rank_start", "type": "uint64", "value": 2}, 
  {"name": "global_rank_stride", "type": "uint64", "value": 1}, 
  {"name": "pg_name", "type": "string", "value": "28"}, 
  {"name": "pg_desc", "type": "string", "value": "undefined"}, 
  {"name": "pg_size", "type": "uint64", "value": 2}]
    },
...

gpt3_126m_1.1.0-chakra.0.0.4.tgz
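
The inputs.values entry of the "## process_group:init ##" node above is itself a JSON-encoded string describing the process groups. A minimal sketch of decoding it with only the standard library (the file name et_0.json and the node-walking loop are illustrative, not part of this PR):

# Read the process-group descriptors out of a Chakra host trace (JSON format).
import json

with open("et_0.json") as f:  # hypothetical trace file name
    trace = json.load(f)

for node in trace["nodes"]:
    if node["name"] == "## process_group:init ##":
        # inputs.values[0] is a JSON-encoded string of group descriptors
        groups = json.loads(node["inputs"]["values"][0])
        for g in groups:
            print(g["pg_name"], g["pg_desc"], g["backend_config"],
                  g["ranks"], g["group_size"])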

Test Plan

$ pip install .
Processing /Users/theo/chakra-official
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: protobuf==4.* in /Users/theo/venv/lib/python3.10/site-packages (from chakra==0.0.4) (4.23.4)
Requirement already satisfied: graphviz in /Users/theo/venv/lib/python3.10/site-packages (from chakra==0.0.4) (0.20.1)
Requirement already satisfied: networkx in /Users/theo/venv/lib/python3.10/site-packages (from chakra==0.0.4) (3.2.1)
Requirement already satisfied: pydot in /Users/theo/venv/lib/python3.10/site-packages (from chakra==0.0.4) (2.0.0)
Requirement already satisfied: pyparsing>=3 in /Users/theo/venv/lib/python3.10/site-packages (from pydot->chakra==0.0.4) (3.1.1)
Building wheels for collected packages: chakra
  Building wheel for chakra (pyproject.toml) ... done
  Created wheel for chakra: filename=chakra-0.0.4-py3-none-any.whl size=51949 sha256=829718ccc530f42e01796e6205a2fc4d3a3908415362617aef716b1da476398e
  Stored in directory: /Users/theo/Library/Caches/pip/wheels/a7/4b/e3/99576bcd5b74d73e757364a32af2372bf8fc73262affb0c1e4
Successfully built chakra
Installing collected packages: chakra
  Attempting uninstall: chakra
    Found existing installation: chakra 0.0.4
    Uninstalling chakra-0.0.4:
      Successfully uninstalled chakra-0.0.4
Successfully installed chakra-0.0.4

$ python3 ci_tools/integration_tests.py --tgz_path tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05.tgz --num_ranks 8 --tolerance 0.05 --expected_times_ms 14597 14597 14968 14638 14649 14700 14677 14735                                                                    
Extracting tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05.tgz to tests/data/1.0.2-chakra.0.0.4
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_0.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_0.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_0.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_1.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_1.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_1.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_2.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_2.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_2.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_3.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_3.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_3.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_4.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_4.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_4.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_6.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_6.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_6.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_5.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_5.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_5.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_7.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_7.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_7.json
Validation successful for /tmp/rank_0.log: 14802300us is within the acceptable range.
Validation successful for /tmp/rank_1.log: 14785782us is within the acceptable range.
Validation successful for /tmp/rank_2.log: 15233261us is within the acceptable range.
Validation successful for /tmp/rank_3.log: 14878058us is within the acceptable range.
Validation successful for /tmp/rank_4.log: 14892945us is within the acceptable range.
Validation successful for /tmp/rank_5.log: 14993779us is within the acceptable range.
Validation successful for /tmp/rank_6.log: 14936348us is within the acceptable range.
Validation successful for /tmp/rank_7.log: 15031147us is within the acceptable range.

@TaekyungHeo TaekyungHeo requested a review from a team as a code owner June 25, 2024 16:24

github-actions bot commented Jun 25, 2024

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@TaekyungHeo TaekyungHeo marked this pull request as draft June 25, 2024 17:23
@TaekyungHeo TaekyungHeo added the enhancement New feature or request label Jun 27, 2024
@JoongunPark
Contributor

JoongunPark commented Jul 3, 2024

Hi!
I have tested this PR in my environment (Python 3.10, Linux 5.15.0-113-generic).

I encountered an error when running chakra_trace_link, which appears to be caused by a PARAM version mismatch: this PR relies on the previous version, while I am using the newer one specified here.

(py3.10) un-gpu@ungpu-Default-string:~/Project/chakra$ chakra_trace_link --pytorch-et-file gpt3_126m_1.1.0-chakra.0.0.4/et_0.json --kineto-file gpt3_126m_1.1.0-chakra.0.0.4/kineto_0.json --output-file gpt3_126m_1.1.0-chakra.0.0.4/rank_0.json
[2024-07-03 18:08:41,302] trace_linker.py:118 [INFO]: Starting to load PyTorch Execution Trace.
Traceback (most recent call last):
  File "/home/un-gpu/py3.10/bin/chakra_trace_link", line 8, in <module>
    sys.exit(main())
  File "/home/un-gpu/py3.10/lib/python3.10/site-packages/chakra/src/trace_link/trace_link.py", line 35, in main
    linker.link(args.pytorch_et_file, args.kineto_file, args.output_file)
  File "/home/un-gpu/py3.10/lib/python3.10/site-packages/chakra/src/trace_link/trace_linker.py", line 57, in link
    pytorch_ops, kineto_data = self.load_traces(pytorch_et_file, kineto_file)
  File "/home/un-gpu/py3.10/lib/python3.10/site-packages/chakra/src/trace_link/trace_linker.py", line 101, in load_traces
    pytorch_ops = self.load_pytorch_et(pytorch_et_file)
  File "/home/un-gpu/py3.10/lib/python3.10/site-packages/chakra/src/trace_link/trace_linker.py", line 119, in load_pytorch_et
    pytorch_et = load_execution_trace_file(pytorch_et_file)
  File "/home/un-gpu/py3.10/lib/python3.10/site-packages/param_bench/train/compute/python/tools/utility.py", line 36, in load_execution_trace_file
    return ExecutionTrace(data)
  File "/home/un-gpu/py3.10/lib/python3.10/site-packages/param_bench/train/compute/python/tools/execution_trace.py", line 313, in __init__
    raise ValueError(
ValueError: No corresponding node creation function found for schema version 1.1.0-chakra.0.0.4

This code should be rebased onto that PR (673d190). After rebasing, I was able to run it properly. Below is the result.

[2024-07-03 18:36:08,235] trace_linker.py:118 [INFO]: Starting to load PyTorch Execution Trace.
[2024-07-03 18:36:09,681] trace_linker.py:123 [INFO]: Original ops in PyTorch ET: 26704
[2024-07-03 18:36:09,681] trace_linker.py:124 [INFO]: PyTorch Execution Trace loaded successfully.
[2024-07-03 18:36:09,686] trace_linker.py:164 [INFO]: Starting to load Kineto Trace.
[2024-07-03 18:36:10,435] trace_linker.py:198 [INFO]: Categorizing Kineto operators and calculating timing boundaries.
[2024-07-03 18:36:10,575] trace_linker.py:278 [INFO]: Calculating exclusive durations for Kineto operators in parallel.
[2024-07-03 18:36:10,576] trace_linker.py:281 [INFO]: Processing 10313 operators in thread.
[2024-07-03 18:36:10,581] trace_linker.py:281 [INFO]: Processing 16105 operators in thread.
[2024-07-03 18:36:12,161] trace_linker.py:325 [INFO]: Exclusive durations for Kineto operators calculated successfully.
[2024-07-03 18:36:12,173] trace_linker.py:177 [INFO]: Processed Kineto trace with 26418 CPU ops, 3910 CPU launcher ops, and 4595 GPU ops.
[2024-07-03 18:36:12,173] trace_linker.py:182 [INFO]: Kineto Trace loaded successfully.
[2024-07-03 18:36:12,213] trace_linker.py:415 [INFO]: Enforcing inter-thread order in Kineto traces.
[2024-07-03 18:36:12,214] trace_linker.py:445 [INFO]: Thread 1967421: Identifying gaps for dependency linking with threshold 1000us.
[2024-07-03 18:36:12,219] trace_linker.py:445 [INFO]: Thread 1972231: Identifying gaps for dependency linking with threshold 1000us.
[2024-07-03 18:36:13,297] trace_linker.py:521 [INFO]: Starting the process of linking PyTorch and Kineto traces.
[2024-07-03 18:36:13,298] trace_linker.py:577 [INFO]: Adding process and thread annotations to Kineto operators.
....
....
....
[2024-07-03 18:36:13,401] trace_linker.py:656 [WARNING]: Number of PyTorch operators (26704) is larger than the number of Kineto operators (26439). Expected PyTorch ops (CPU only) to be fewer than Kineto ops (CPU and GPU). Logging this rare but possible scenario.
[2024-07-03 18:36:13,458] trace_linker.py:679 [INFO]: Completed mapping of PyTorch operators to Kineto operators.
[2024-07-03 18:36:13,459] trace_linker.py:926 [INFO]: Constructing ET+ data.
[2024-07-03 18:36:14,877] trace_linker.py:557 [INFO]: Traces have been successfully linked.
[2024-07-03 18:36:14,878] trace_linker.py:1049 [INFO]: Starting to dump ET+ data to gpt3_126m_1.1.0-chakra.0.0.4/rank_0.json.
[2024-07-03 18:36:18,194] trace_linker.py:1061 [INFO]: ET+ data dumped to gpt3_126m_1.1.0-chakra.0.0.4/rank_0.json.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 26, 2024
@TaekyungHeo TaekyungHeo deleted the comm-group branch July 29, 2024 20:35