
Test out Reactant + Enzyme for the benchmarks? #5

Open
avik-pal opened this issue Nov 11, 2024 · 14 comments

Comments
@avik-pal
Contributor

Might be worth trying out https://lux.csail.mit.edu/stable/manual/compiling_lux_models here. Looking at the code, it should "just work", and the compiled code should be much faster.
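A minimal sketch of what that looks like, following the linked Lux manual page (the model shape and input sizes here are made up for illustration, not taken from the benchmark code):

```julia
using Lux, Reactant, Random

# A toy MLP standing in for the benchmarked models (shape is illustrative).
model = Chain(Dense(2 => 32, tanh), Dense(32 => 1))
ps, st = Lux.setup(Random.default_rng(), model)

# Move inputs and parameters to Reactant's array representation.
xdev = reactant_device()
x_ra, ps_ra, st_ra = xdev(rand(Float32, 2, 128)), xdev(ps), xdev(st)

# Compile once, then call the compiled function like the original model.
model_compiled = @compile model(x_ra, ps_ra, Lux.testmode(st_ra))
y_ra, _ = model_compiled(x_ra, ps_ra, Lux.testmode(st_ra))
```

The compiled function traces the model into StableHLO and runs it through XLA, which is where the expected speedup comes from.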

@vpuri3
Owner

vpuri3 commented Nov 12, 2024

@avik-pal
Contributor Author

Nice, thanks for trying it out. I will look into the generated HLO to check why it isn't faster than Zygote.

@vpuri3
Owner

vpuri3 commented Nov 12, 2024

These are the results on GPU.

julia> CUDA.device()
CuDevice(0): NVIDIA GeForce RTX 2080 Ti

[Screenshot (2024-11-12, 12:44 PM): GPU benchmark timings]

@avik-pal
Contributor Author

Just to confirm, did you synchronize on the GPU? You can call Reactant.synchronize(returned_array).

After we tag master, we will support compiling a synchronized function as @compile sync=true ...
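Concretely, sync-aware timing would look something like this (a sketch; `model_compiled`, `x_ra`, `ps_ra`, and `st_ra` stand for the compiled function and Reactant arrays from the benchmark script):

```julia
using BenchmarkTools

# Without a sync, @btime on GPU mostly measures dispatch/launch latency.
# Synchronizing on the returned array forces the computation to finish
# before the timer stops.
@btime Reactant.synchronize($model_compiled($x_ra, $ps_ra, $st_ra)[1])

# Once `sync=true` lands, the sync can be baked into the compiled function:
# model_sync = @compile sync=true model(x_ra, ps_ra, st_ra)
```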

@vpuri3
Owner

vpuri3 commented Nov 12, 2024

This might be a related issue:

cuDNN errors on the first attempt if it's loaded after Reactant, but loads fine thereafter.

julia> using Reactant

julia> using CUDA

julia> using LuxCUDA
ERROR: InitError: could not load library "/home/vedantpu/.julia/artifacts/8e7456794f147517aa9ba5a1147e4ecedffbbfa1/lib/libcudnn_cnn.so"
/home/vedantpu/.julia/artifacts/8e7456794f147517aa9ba5a1147e4ecedffbbfa1/lib/libcudnn_cnn.so: undefined symbol: _ZTVN5cudnn7backend23PagedCacheLoadOperationE, version libcudnn_graph.so.9
Stacktrace:
  [1] dlopen(s::String, flags::UInt32; throw_error::Bool)
    @ Base.Libc.Libdl ./libdl.jl:120
  [2] dlopen(s::String, flags::UInt32)
    @ Base.Libc.Libdl ./libdl.jl:119
  [3] macro expansion
    @ ~/.julia/packages/JLLWrappers/jXOYx/src/products/library_generators.jl:63 [inlined]
  [4] __init__()
    @ CUDNN_jll ~/.julia/packages/CUDNN_jll/8XeAL/src/wrappers/x86_64-linux-gnu-cuda+12.0.jl:28
...

julia> using LuxCUDA

julia> cuDNN
cuDNN

@vpuri3
Owner

vpuri3 commented Nov 12, 2024

> Just to confirm did you synchronize for GPU? You can call Reactant.synchronize(returned_array).
>
> After we tag master, we will support compiling a synchronized function as @compile sync=true ...

Did not sync here. Lemme include that.

@avik-pal
Contributor Author

avik-pal commented Nov 12, 2024

> This might be related issue:
>
> cuDNN errors on first attempt if it's being called after Reactant, but loads thereafter.
Flip the ordering of imports for now 😓 (i.e. using LuxCUDA before using Reactant). Ideally users never load both Reactant and CUDA in the same session.

For context, the error comes from Reactant shipping its own cuDNN build that is mismatched with the version cuDNN ships.
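So the workaround is just the load order at the top of the session:

```julia
# Workaround: load the CUDA stack first so cuDNN resolves against
# CUDA's artifacts rather than the copy Reactant bundles.
using LuxCUDA   # pulls in CUDA + cuDNN first
using Reactant  # loaded second, after cuDNN is initialized
```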

@vpuri3
Owner

vpuri3 commented Nov 12, 2024

Thanks. Here are the timings on GPU with syncs added.

println("\n# FWD Vanilla\n")
@btime CUDA.@sync $mlp( $x_zy, $pM_zy , $stM_zy )
@btime CUDA.@sync $kan1($x_zy, $pK1_zy, $stK1_zy)
@btime CUDA.@sync $kan2($x_zy, $pK2_zy, $stK2_zy)
println("\n# FWD Reactant\n")
@btime Reactant.synchronize($mlp_comp( $x_ra, $pM_ra , $stM_ra )[1])
@btime Reactant.synchronize($kan1_comp($x_ra, $pK1_ra, $stK1_ra)[1])
@btime Reactant.synchronize($kan2_comp($x_ra, $pK2_ra, $stK2_ra)[1])
#------------------------#
println("\n# BWD Zygote\n")
@btime CUDA.@sync $grad_zy($mlp , $pM , $stM , $x, $y)
@btime CUDA.@sync $grad_zy($kan1, $pK1, $stK1, $x, $y)
@btime CUDA.@sync $grad_zy($kan2, $pK2, $stK2, $x, $y)
println("\n# BWD Reactant\n")
@btime Reactant.synchronize($grad_ra_comp_M( $mlp , $pM_ra , $stM_ra , $x_ra, $y_ra))
@btime Reactant.synchronize($grad_ra_comp_K1($kan1, $pK1_ra, $stK1_ra, $x_ra, $y_ra))
@btime Reactant.synchronize($grad_ra_comp_K2($kan2, $pK2_ra, $stK2_ra, $x_ra, $y_ra))

[Screenshot (2024-11-12, 1:27 PM): GPU benchmark timings with synchronization]

@avik-pal
Contributor Author

Nice! The numbers do look promising. (cc @wsmoses the speedup is quite nice)

@wsmoses

wsmoses commented Nov 12, 2024

Ooh yeah this is fantastic!

@vpuri3 would you be interested in adding this to our Reactant benchmark suite? (And potentially to the docs listing cool use cases :) )

@vpuri3
Owner

vpuri3 commented Nov 12, 2024

I'd be happy to. Can you link me to it?

@wsmoses

wsmoses commented Nov 12, 2024

I think they're in here https://github.com/EnzymeAD/Reactant.jl/tree/main/benchmark but @avik-pal would know best how to set it up

@avik-pal
Contributor Author

Add a function like https://github.com/EnzymeAD/Reactant.jl/blob/9e8eec051c61c4c122c694ac2fb68b1598968cc0/benchmark/setup.jl#L43-L51 for KANs. The penultimate arg to setup_lux_forward_pass_benchmark is the size of the inputs.

For now, add just the forward pass; I forgot to set it up for the reverse pass.
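A hedged sketch of what such an entry might look like; the helper and model here are placeholders, and the real setup_lux_forward_pass_benchmark signature in Reactant.jl's benchmark/setup.jl should be followed instead:

```julia
using BenchmarkTools, Lux, Random

# Placeholder helper patterned on setup_lux_forward_pass_benchmark;
# the real one also wires up backends and Reactant compilation.
function setup_forward_pass_benchmark!(suite, tag, model, input_size)
    ps, st = Lux.setup(Random.default_rng(), model)
    x = randn(Float32, input_size...)  # input_size plays the "penultimate arg" role
    suite[tag] = @benchmarkable $model($x, $ps, $(Lux.testmode(st)))
end

suite = BenchmarkGroup()
# A Dense chain stands in here for the KAN layers being benchmarked.
setup_forward_pass_benchmark!(
    suite, "KAN forward",
    Chain(Dense(2 => 32, tanh), Dense(32 => 1)), (2, 128),
)
```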

@vpuri3
Owner

vpuri3 commented Nov 12, 2024

Great, lemme register this package and get to it.
