
[Tutorial] Many nodes have a common parent node, but the node doesn't exist in PyTorch ET. #45

Closed
jeongyoonm opened this issue May 1, 2024 · 7 comments


@jeongyoonm
Describe the Bug

I was following the Chakra trace collection tutorial. I was able to collect both PyTorch ET and Kineto trace, but I couldn't link them using trace_link.py. trace_link.py emitted the following error:

```
$ python3 trace_link.py --et-file matmul_et.json --kineto-file kineto_trace_matmul.json --exact-match

[2024-04-30 13:46:33,291] execution_trace.py:455 [INFO]: Iteration node ids list = [1]
[2024-04-30 13:46:33,291] trace_link.py:306 [INFO]: Number of original ops in execution trace: 2
[2024-04-30 13:46:33,291] trace_link.py:225 [INFO]: Kineto trace has 0 segments
[2024-04-30 13:46:33,291] trace_link.py:338 [WARNING]: Could not find annotation DataLoader in kineto file using the whole file, processing could be very slow!!
[2024-04-30 13:46:33,291] trace_link.py:343 [INFO]: Number of original cpu ops in kineto trace: 46
[2024-04-30 13:46:33,291] trace_link.py:344 [INFO]: Number of original gpu ops in kineto trace: 6
[2024-04-30 13:46:33,291] trace_link.py:350 [INFO]: Average iteration latency: 4282.0
Traceback (most recent call last):
  File "/home/jmoon/workspace/transport/collect_et/trace_link.py", line 891, in <module>
    main()  # pragma: no cover
    ^^^^^^
  File "/home/jmoon/workspace/transport/collect_et/trace_link.py", line 880, in main
    dump_et_file(
  File "/home/jmoon/workspace/transport/collect_et/trace_link.py", line 818, in dump_et_file
    node["parent"] = assigned_ids[node["parent"]]
                     ~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: 3
```

I looked into the collected PyTorch ET to debug the issue further. Many nodes have a parent attribute with a value of 3, but no node with id 3 exists (please refer to the screenshot), which I believe causes the error above. Is my trace collection procedure wrong, or is this a known bug? If it is a known bug, is there a way to resolve it? Any pointers or answers would be appreciated.
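For anyone hitting the same `KeyError`, a quick way to confirm the dangling-parent problem is to scan the ET JSON for parent ids that never appear as node ids. This is a minimal sketch, not part of the Chakra tooling; it assumes the ET file stores its nodes under a top-level `"nodes"` list whose entries carry `"id"` and `"parent"` fields (the same fields the traceback touches) — adjust the keys if your trace layout differs.

```python
import json

def find_dangling_parents(et_path: str) -> set:
    """Return the set of parent ids referenced by nodes in a PyTorch
    execution trace that do not correspond to any node id.

    Assumes the ET JSON has a top-level "nodes" list whose entries
    carry "id" and "parent" fields; adjust if your layout differs.
    """
    with open(et_path) as f:
        trace = json.load(f)
    nodes = trace["nodes"]
    known_ids = {node["id"] for node in nodes}
    # Collect parent references that resolve to no existing node
    return {node["parent"] for node in nodes
            if "parent" in node and node["parent"] not in known_ids}

if __name__ == "__main__":
    missing = find_dangling_parents("matmul_et.json")
    if missing:
        print(f"Dangling parent ids: {sorted(missing)}")
    else:
        print("All parent references resolve to existing nodes.")
```

On a trace like the one described above, this should report 3 as an unresolvable parent id.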

Steps to Reproduce

Below is the PyTorch code that I used for the ET and Kineto trace collection:

```python
import torch
import numpy as np
from torch.profiler import ExecutionTraceObserver, profile

def trace_handler(prof):
    prof.export_chrome_trace("kineto_trace_matmul.json")

def gpu_matrix_multiplication(matrix1: np.ndarray, matrix2: np.ndarray) -> torch.Tensor:
    """
    Perform matrix multiplication on the GPU using PyTorch.

    Args:
        matrix1 (np.ndarray): The first input matrix as a NumPy array.
        matrix2 (np.ndarray): The second input matrix as a NumPy array.

    Returns:
        torch.Tensor: The result of the matrix multiplication, as a PyTorch tensor.

    Raises:
        ValueError: If matrices have incompatible shapes for multiplication.
    """
    if matrix1.shape[1] != matrix2.shape[0]:
        raise ValueError("Matrices have incompatible shapes for multiplication.")

    # Convert numpy arrays to PyTorch tensors and set dtype to float
    matrix1_torch = torch.tensor(matrix1, dtype=torch.float)
    matrix2_torch = torch.tensor(matrix2, dtype=torch.float)

    # Transfer tensors to GPU if available
    if torch.cuda.is_available():
        matrix1_torch = matrix1_torch.to('cuda')
        matrix2_torch = matrix2_torch.to('cuda')

    # Perform matrix multiplication using GPU
    result_gpu = torch.matmul(matrix1_torch, matrix2_torch)

    return result_gpu

if __name__ == "__main__":
    # for ET
    et = ExecutionTraceObserver()
    et_filename = "matmul_et.json"
    et.register_callback(et_filename)

    # for Kineto traces
    with profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        # Skip the first 10 iterations (warmup);
        # record 1 iteration after the first 10.
        schedule=torch.profiler.schedule(wait=0, warmup=10, active=1),
        on_trace_ready=trace_handler,
    ) as prof:
        # Define larger matrices (1024x1024) using NumPy
        matrix_a = np.random.rand(1024, 1024)
        matrix_b = np.random.rand(1024, 1024)
        for epoch in range(20):
            # training function goes here
            result_on_gpu = gpu_matrix_multiplication(matrix_a, matrix_b)
            result2_on_gpu = gpu_matrix_multiplication(matrix_a, result_on_gpu)
            if epoch == 11:
                et.stop()
            if epoch == 10:
                et.start()
            prof.step()

    et.unregister_callback()
```

trace_link.py is from the PARAM GitHub repository, and I executed it with the command below.

```
$ python3 trace_link.py --et-file matmul_et.json --kineto-file kineto_trace_matmul.json --exact-match
```

The PyTorch version is 2.1.2, as higher versions have some issues (related to #40).

Versions of relevant libraries:

```
[pip3] numpy==1.26.4
[pip3] torch==2.1.2+cu121
[pip3] torchaudio==2.1.2+cu121
[pip3] torchvision==0.16.2+cu121
[pip3] triton==2.1.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] torch                     2.1.2+cu121              pypi_0    pypi
[conda] torchaudio                2.1.2+cu121              pypi_0    pypi
[conda] torchvision               0.16.2+cu121             pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi
```

Expected Behavior

I expected that PyTorch ET would be collected without missing dependencies so that the link procedure would succeed without an error.

Screenshots

[Screenshot: Screen Shot 2024-05-01 at 2 54 48 PM]

@TaekyungHeo (Contributor)

The trace_link and et_converter tools have been updated by the following PRs. First, please use the updated tools:

Second, please collect traces with a PyTorch nightly build for the time being, as we rely on the latest features of the PyTorch profiler to correlate traces properly.

@jeongyoonm (Author)

Thank you so much! The updated tools resolve the issue.
FYI, I had to modify the code snippet that imports the param_bench tools to make it work, because the tools were refactored. (commit link)

I have one more question regarding the behavior of trace_link.py. I would appreciate it if you could answer the question.

Q. The linking procedure seems to create edges between tensors. Is this expected? If so, what do these edges mean?
Below are the visualized PyTorch execution traces for matrix multiplication. (I visualized them with the param tools.)

Before trace_link.py:

[Image: PyTorch_ET_before_trace_link]

After trace_link.py:

[Image: PyTorch_ET_after_trace_link]

@TaekyungHeo (Contributor)

TaekyungHeo commented May 9, 2024

Let's make it clear: did you use the latest trace_link.py to plot it?

@jeongyoonm (Author)

I cloned the Chakra repository this morning and used that version to produce the figure above, so no, not the very latest one.

I just tried the latest version, which was updated an hour ago, but the result is still the same: there are edges between tensor nodes.

@TaekyungHeo (Contributor)

Let me share some comments. Chakra has many downstream tools. When replaying traces on a real system using the actual PyTorch framework, tensors are crucial; in simulation, however, the tools do not care about tensors. trace_link is one of the simulation tools, and it disregards any side effects involving tensors. Perhaps this is why you are observing additional edges.

@jeongyoonm (Author)

I see. They will be ignored in downstream tools anyway.

Thanks for all your answers; they really help a lot.

@jeongyoonm (Author)

FYI, the problem turned out to be a fix I had made to the PyTorch visualization tool so it would work with previous versions of the Chakra tools. There were actually no edges between tensors in the collected traces :)

rvinaybharadwaj pushed a commit to rvinaybharadwaj/chakra that referenced this issue Sep 23, 2024