
Test out Reactant + Enzyme for the benchmarks? #5

Open
avik-pal opened this issue Nov 11, 2024 · 14 comments

Comments
@avik-pal
Contributor

Might be worth trying out https://lux.csail.mit.edu/stable/manual/compiling_lux_models here. Looking at the code, it should "just work", and the compiled code should be much faster.
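A minimal sketch of what that looks like, following the linked Lux manual page (the model shape and input sizes here are made up for illustration, not taken from the benchmark code):

```julia
using Lux, Reactant, Random

# A toy MLP standing in for the benchmarked models (shape is illustrative).
model = Chain(Dense(2 => 32, tanh), Dense(32 => 1))
ps, st = Lux.setup(Random.default_rng(), model)

# Move inputs and parameters to Reactant's array representation.
xdev = reactant_device()
x_ra, ps_ra, st_ra = xdev(rand(Float32, 2, 128)), xdev(ps), xdev(st)

# Compile once, then call the compiled function like the original model.
model_compiled = @compile model(x_ra, ps_ra, Lux.testmode(st_ra))
y_ra, _ = model_compiled(x_ra, ps_ra, Lux.testmode(st_ra))
```

The compiled function traces the model into StableHLO and runs it through XLA, which is where the expected speedup comes from.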

@vpuri3
Owner

vpuri3 commented Nov 12, 2024

@avik-pal
Contributor Author

Nice, thanks for trying it out. I will look into the generated HLO to check why it isn't faster than Zygote.

@vpuri3
Owner

vpuri3 commented Nov 12, 2024

These are the results on GPU.

julia> CUDA.device()
CuDevice(0): NVIDIA GeForce RTX 2080 Ti

[Screenshot (2024-11-12, 12:44 PM): GPU benchmark timings]

@avik-pal
Contributor Author

Just to confirm, did you synchronize on the GPU? You can call Reactant.synchronize(returned_array).

After we tag master, we will support compiling a synchronized function as @compile sync=true ...
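Concretely, sync-aware timing would look something like this (a sketch; `model_compiled`, `x_ra`, `ps_ra`, and `st_ra` stand for the compiled function and Reactant arrays from the benchmark script):

```julia
using BenchmarkTools

# Without a sync, @btime on GPU mostly measures dispatch/launch latency.
# Synchronizing on the returned array forces the computation to finish
# before the timer stops.
@btime Reactant.synchronize($model_compiled($x_ra, $ps_ra, $st_ra)[1])

# Once `sync=true` lands, the sync can be baked into the compiled function:
# model_sync = @compile sync=true model(x_ra, ps_ra, st_ra)
```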

@vpuri3
Owner

vpuri3 commented Nov 12, 2024

This might be a related issue:

cuDNN errors on the first attempt if it's loaded after Reactant, but loads fine thereafter.

julia> using Reactant

julia> using CUDA

julia> using LuxCUDA
ERROR: InitError: could not load library "/home/vedantpu/.julia/artifacts/8e7456794f147517aa9ba5a1147e4ecedffbbfa1/lib/libcudnn_cnn.so"
/home/vedantpu/.julia/artifacts/8e7456794f147517aa9ba5a1147e4ecedffbbfa1/lib/libcudnn_cnn.so: undefined symbol: _ZTVN5cudnn7backend23PagedCacheLoadOperationE, version libcudnn_graph.so.9
Stacktrace:
  [1] dlopen(s::String, flags::UInt32; throw_error::Bool)
    @ Base.Libc.Libdl ./libdl.jl:120
  [2] dlopen(s::String, flags::UInt32)
    @ Base.Libc.Libdl ./libdl.jl:119
  [3] macro expansion
    @ ~/.julia/packages/JLLWrappers/jXOYx/src/products/library_generators.jl:63 [inlined]
  [4] __init__()
    @ CUDNN_jll ~/.julia/packages/CUDNN_jll/8XeAL/src/wrappers/x86_64-linux-gnu-cuda+12.0.jl:28
...

julia> using LuxCUDA

julia> cuDNN
cuDNN

@vpuri3
Owner

vpuri3 commented Nov 12, 2024

> Just to confirm did you synchronize for GPU? You can call Reactant.synchronize(returned_array).
>
> After we tag master, we will support compiling a synchronized function as @compile sync=true ...

Did not sync here. Lemme include that.

@avik-pal
Contributor Author

avik-pal commented Nov 12, 2024

> This might be related issue:
>
> cuDNN errors on first attempt if it's being called after Reactant, but loads thereafter.
Flip the ordering of imports for now 😓 (i.e. using LuxCUDA before using Reactant). Ideally users never load both Reactant and CUDA in the same session.

For context, the error comes from Reactant shipping its own cuDNN build that is mismatched with the version cuDNN ships.
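So the workaround is just the load order at the top of the session:

```julia
# Workaround: load the CUDA stack first so cuDNN resolves against
# CUDA's artifacts rather than the copy Reactant bundles.
using LuxCUDA   # pulls in CUDA + cuDNN first
using Reactant  # loaded second, after cuDNN is initialized
```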

@vpuri3
Owner

vpuri3 commented Nov 12, 2024

Thanks. Here are the timings on GPU with syncs added.

println("\n# FWD Vanilla\n")
@btime CUDA.@sync $mlp( $x_zy, $pM_zy , $stM_zy )
@btime CUDA.@sync $kan1($x_zy, $pK1_zy, $stK1_zy)
@btime CUDA.@sync $kan2($x_zy, $pK2_zy, $stK2_zy)
println("\n# FWD Reactant\n")
@btime Reactant.synchronize($mlp_comp( $x_ra, $pM_ra , $stM_ra )[1])
@btime Reactant.synchronize($kan1_comp($x_ra, $pK1_ra, $stK1_ra)[1])
@btime Reactant.synchronize($kan2_comp($x_ra, $pK2_ra, $stK2_ra)[1])
#------------------------#
println("\n# BWD Zygote\n")
@btime CUDA.@sync $grad_zy($mlp , $pM , $stM , $x, $y)
@btime CUDA.@sync $grad_zy($kan1, $pK1, $stK1, $x, $y)
@btime CUDA.@sync $grad_zy($kan2, $pK2, $stK2, $x, $y)
println("\n# BWD Reactant\n")
@btime Reactant.synchronize($grad_ra_comp_M( $mlp , $pM_ra , $stM_ra , $x_ra, $y_ra))
@btime Reactant.synchronize($grad_ra_comp_K1($kan1, $pK1_ra, $stK1_ra, $x_ra, $y_ra))
@btime Reactant.synchronize($grad_ra_comp_K2($kan2, $pK2_ra, $stK2_ra, $x_ra, $y_ra))

[Screenshot (2024-11-12, 1:27 PM): GPU benchmark timings with synchronization]

@avik-pal
Contributor Author

Nice! The numbers do look promising. (cc @wsmoses the speedup is quite nice)

@wsmoses

wsmoses commented Nov 12, 2024

Ooh yeah this is fantastic!

@vpuri3 would you be interested in adding this to our Reactant benchmark suite? (And potentially to the docs listing cool use cases :) )

@vpuri3
Owner

vpuri3 commented Nov 12, 2024

I'd be happy to. Can you link me to it?

@wsmoses

wsmoses commented Nov 12, 2024

I think they're in here https://github.com/EnzymeAD/Reactant.jl/tree/main/benchmark but @avik-pal would know best how to set it up

@avik-pal
Contributor Author

Add a function like https://github.com/EnzymeAD/Reactant.jl/blob/9e8eec051c61c4c122c694ac2fb68b1598968cc0/benchmark/setup.jl#L43-L51 for KANs. The penultimate arg to setup_lux_forward_pass_benchmark is the size of the inputs.

For now, add just the forward pass; I forgot to set it up for the reverse pass.
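A hedged sketch of what such an entry might look like; the helper and model here are placeholders, and the real setup_lux_forward_pass_benchmark signature in Reactant.jl's benchmark/setup.jl should be followed instead:

```julia
using BenchmarkTools, Lux, Random

# Placeholder helper patterned on setup_lux_forward_pass_benchmark;
# the real one also wires up backends and Reactant compilation.
function setup_forward_pass_benchmark!(suite, tag, model, input_size)
    ps, st = Lux.setup(Random.default_rng(), model)
    x = randn(Float32, input_size...)  # input_size plays the "penultimate arg" role
    suite[tag] = @benchmarkable $model($x, $ps, $(Lux.testmode(st)))
end

suite = BenchmarkGroup()
# A Dense chain stands in here for the KAN layers being benchmarked.
setup_forward_pass_benchmark!(
    suite, "KAN forward",
    Chain(Dense(2 => 32, tanh), Dense(32 => 1)), (2, 128),
)
```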

@vpuri3
Owner

vpuri3 commented Nov 12, 2024

Great, lemme register this package and get to it.
