
Cannot iterate over a Tuple of mixed Type #607

Open · leios opened this issue Jul 18, 2024 · 1 comment
Labels: bug, upstream

Comments


leios commented Jul 18, 2024

I am posting this as a "bug", but I am not sure it's fixable. To be clear, the actual error in JuliaGPU/CUDA.jl#2450 comes from the fact that we cannot iterate through a Tuple of mixed element types. I think this is a Julia-specific problem because, honestly, no one else is crazy enough to send a container of mixed type to the GPU. Julia has to, though: every function has its own inherent type, so there is no other way to pass in a collection of functions.
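(For illustration, and not from the original report: every Julia function is a singleton instance of its own type, so a tuple of distinct functions is always a mixed-type tuple:)

julia> f1(x) = x + 1;

julia> f2(x) = 2x;

julia> typeof((f1, f2))
Tuple{typeof(f1), typeof(f2)}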

Anyway, here's an example of something that fails (in KernelAbstractions.jl, sorry):

@kernel function check(a, t)
    idx = @index(Global, Linear)
    for i = 1:length(t)
        a[idx] = t[i]
    end
end
...
a = CuArray(ones(10))  # or ROCArray(ones(10)) on AMD

check(get_backend(a))(a, (1, 3.0), ndrange = length(a))

The workarounds for JuliaGPU/CUDA.jl#2450 also work here.

CUDA error:

ERROR: InvalidIRError: compiling kernel #gpu_check_kernel(KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to jl_f_getfield)
Stacktrace:
 [1] getindex
   @ ./tuple.jl:29
 [2] macro expansion
   @ ~/projects/sketches/tuple_test.jl:29
 [3] gpu_check_kernel
   @ ~/.julia/packages/KernelAbstractions/C8flJ/src/macros.jl:81
 [4] gpu_check_kernel
   @ ./none:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_check_kernel), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/validation.jl:141
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:418 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/LHjFw/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:417 [inlined]
  [5] emit_llvm(job::GPUCompiler.CompilerJob, method_instance::Any; libraries::Bool, deferred_codegen::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool, ctx::LLVM.ThreadSafeContext)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/utils.jl:83
  [6] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.ThreadSafeContext)
    @ CUDA ~/.julia/packages/CUDA/ZdCxS/src/compiler/execution.jl:360
  [7] #221
    @ ~/.julia/packages/CUDA/ZdCxS/src/compiler/execution.jl:354 [inlined]
  [8] LLVM.ThreadSafeContext(f::CUDA.var"#221#222"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_check_kernel), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}}}}})
    @ LLVM ~/.julia/packages/LLVM/HykgZ/src/executionengine/ts_module.jl:14
  [9] JuliaContext(f::CUDA.var"#221#222"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_check_kernel), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:74
 [10] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/ZdCxS/src/compiler/execution.jl:353
 [11] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/cache.jl:90
 [12] cufunction(f::typeof(gpu_check_kernel), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}}}; name::Nothing, always_inline::Bool, kwargs::Base.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:maxthreads,), Tuple{Int64}}})
    @ CUDA ~/.julia/packages/CUDA/ZdCxS/src/compiler/execution.jl:306
 [13] macro expansion
    @ ~/.julia/packages/CUDA/ZdCxS/src/compiler/execution.jl:102 [inlined]
 [14] (::KernelAbstractions.Kernel{CUDADevice{false, false}, KernelAbstractions.NDIteration.StaticSize{(256,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_check_kernel)})(::CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, ::Vararg{Any}; ndrange::Int64, dependencies::CUDAKernels.CudaEvent, workgroupsize::Nothing, progress::Function)
    @ CUDAKernels ~/.julia/packages/CUDAKernels/3IKLV/src/CUDAKernels.jl:283
 [15] check(a::CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, t::Tuple{typeof(f1), typeof(f1)}, t_args::Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, t_kwargs::Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}})
    @ Main ~/projects/sketches/tuple_test.jl:21
 [16] top-level scope
    @ REPL[9]:1
 [17] top-level scope
    @ ~/.julia/packages/CUDA/ZdCxS/src/initialization.jl:155

AMDGPU error:

julia> check(a, t, t_args, t_kwargs)
ERROR: InvalidIRError: compiling kernel gpu_check_kernel(KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, AMDGPU.Device.ROCDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to ijl_get_nth_field_checked)
Stacktrace:
 [1] getindex
   @ ./tuple.jl:29
 [2] macro expansion
   @ ~/projects/Fae.jl/sketches/KA_test.jl:28
 [3] gpu_check_kernel
   @ ~/.julia/packages/KernelAbstractions/C8flJ/src/macros.jl:81
 [4] gpu_check_kernel
   @ ./none:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.ROCCompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_check_kernel), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, AMDGPU.Device.ROCDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/validation.jl:141
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:418 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/LHjFw/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:417 [inlined]
  [5] emit_llvm(job::GPUCompiler.CompilerJob, method_instance::Any; libraries::Bool, deferred_codegen::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool, ctx::LLVM.ThreadSafeContext)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/utils.jl:83
  [6] emit_llvm
    @ ~/.julia/packages/GPUCompiler/S3TWf/src/utils.jl:77 [inlined]
  [7] (::AMDGPU.Compiler.var"#59#62"{GPUCompiler.CompilerJob{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.ROCCompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_check_kernel), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, AMDGPU.Device.ROCDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}}}}, Core.MethodInstance})(ctx::LLVM.ThreadSafeContext)
    @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/bzHD4/src/compiler/codegen.jl:183
  [8] LLVM.ThreadSafeContext(f::AMDGPU.Compiler.var"#59#62"{GPUCompiler.CompilerJob{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.ROCCompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_check_kernel), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, AMDGPU.Device.ROCDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}}}}, Core.MethodInstance})
    @ LLVM ~/.julia/packages/LLVM/HykgZ/src/executionengine/ts_module.jl:14
  [9] JuliaContext(f::AMDGPU.Compiler.var"#59#62"{GPUCompiler.CompilerJob{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.ROCCompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_check_kernel), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, AMDGPU.Device.ROCDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}}}}, Core.MethodInstance})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:74
 [10] rocfunction_compile(job::GPUCompiler.CompilerJob)
    @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/bzHD4/src/compiler/codegen.jl:182
 [11] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(AMDGPU.Compiler.rocfunction_compile), linker::typeof(AMDGPU.Compiler.rocfunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/cache.jl:90
 [12] rocfunction(f::typeof(gpu_check_kernel), tt::Type; name::String, device::ROCDevice, global_hooks::NamedTuple{(), Tuple{}})
    @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/bzHD4/src/compiler/codegen.jl:165
 [13] rocfunction
    @ ~/.julia/packages/AMDGPU/bzHD4/src/compiler/codegen.jl:154 [inlined]
 [14] macro expansion
    @ ~/.julia/packages/AMDGPU/bzHD4/src/highlevel.jl:430 [inlined]
 [15] (::KernelAbstractions.Kernel{ROCDevice, KernelAbstractions.NDIteration.StaticSize{(256,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_check_kernel)})(::ROCVector{Float64}, ::Vararg{Any}; ndrange::Int64, dependencies::Nothing, workgroupsize::Nothing, progress::Nothing)
    @ ROCKernels ~/.julia/packages/ROCKernels/TyQpD/src/ROCKernels.jl:197
 [16] check(a::ROCVector{Float64}, t::Tuple{typeof(f1), typeof(f1)}, t_args::Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, t_kwargs::Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}})
    @ Main ~/projects/Fae.jl/sketches/KA_test.jl:21
 [17] top-level scope
    @ REPL[9]:1
leios added the bug label on Jul 18, 2024
maleadt transferred this issue from JuliaGPU/CUDA.jl on Jul 19, 2024
maleadt commented Jul 19, 2024

This is not easily fixable on the GPUCompiler.jl side, as it's Julia's codegen generating runtime-reliant code here. For example, a Metal-based MWE:

using Metal

function kernel(a, t)
    for j in 1:2
        @inbounds a[1] = t[j]
    end
    return
end

function main()
    a = Metal.ones(1)
    @metal kernel(a, (1, 3.0))
end

Generating CPU code for this requires the runtime:

julia> code_llvm(kernel, Tuple{Vector{Float32}, Tuple{Float32,Float64}})
;  @ /Users/tim/Julia/pkg/Metal/wip.jl:3 within `kernel`
define void @julia_kernel_4986({}* noundef nonnull align 16 dereferenceable(40) %0, { float, double }* nocapture noundef nonnull readonly align 8 dereferenceable(16) %1) #0 {
top:
  %gcframe19 = alloca [3 x {}*], align 16
  %gcframe19.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe19, i64 0, i64 0
  %2 = bitcast [3 x {}*]* %gcframe19 to i8*
  call void @llvm.memset.p0i8.i64(i8* align 16 %2, i8 0, i64 24, i1 true)
  %3 = call {}*** inttoptr (i64 6926966180 to {}*** (i64)*)(i64 261) #4
  %4 = bitcast [3 x {}*]* %gcframe19 to i64*
  store i64 4, i64* %4, align 16
  %5 = load {}**, {}*** %3, align 8
  %6 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe19, i64 0, i64 1
  %7 = bitcast {}** %6 to {}***
  store {}** %5, {}*** %7, align 8
  %8 = bitcast {}*** %3 to {}***
  store {}** %gcframe19.sub, {}*** %8, align 8
  %9 = bitcast { float, double }* %1 to i8*
  %10 = bitcast {}* %0 to float**
;  @ /Users/tim/Julia/pkg/Metal/wip.jl:5 within `kernel`
; ┌ @ tuple.jl:31 within `getindex`
   %ptls_field20 = getelementptr inbounds {}**, {}*** %3, i64 2
   %11 = bitcast {}*** %ptls_field20 to i8**
   %ptls_load2122 = load i8*, i8** %11, align 8
   %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc(i8* %ptls_load2122, i32 800, i32 32) #10
   %12 = bitcast {}* %box to i64*
   %13 = getelementptr inbounds i64, i64* %12, i64 -1
   store atomic i64 4859175776, i64* %13 unordered, align 8
   %14 = bitcast {}* %box to i8*
   call void @llvm.memcpy.p0i8.p0i8.i64(i8* noundef nonnull align 8 dereferenceable(16) %14, i8* noundef nonnull align 8 dereferenceable(16) %9, i64 16, i1 false)
   %15 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe19, i64 0, i64 2
   store {}* %box, {}** %15, align 16
   %16 = call {}* @ijl_get_nth_field_checked({}* nonnull %box, i64 0)
; └
  %17 = bitcast {}* %16 to i64*
  %18 = getelementptr inbounds i64, i64* %17, i64 -1
  %19 = load atomic i64, i64* %18 unordered, align 8
  %20 = and i64 %19, -16
  switch i64 %20, label %L16 [
    i64 4904455440, label %L7
    i64 4904455376, label %L12
  ]

I don't think we can easily specialize this; one solution would be to add specialized codegen support to essentially do union splitting, but I'm not sure that's worth it.

For a workaround, try unrolling the loop at the Julia level (i.e., before codegen, not with LLVMLoopInfo.jl), so that every tuple access uses a statically-known index; see the sketch below.
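For instance, here is a minimal sketch of that kind of unrolling (illustrative, not from the thread; store_all! is a hypothetical helper, not an API):

# Recurse over the tuple instead of indexing it with a runtime `i`;
# each recursive call sees a different concrete tuple type, so the
# compiler specializes each step and no dynamic getfield is emitted.
using KernelAbstractions

store_all!(a, idx, ::Tuple{}) = nothing        # base case: empty tuple
function store_all!(a, idx, t::Tuple)
    @inbounds a[idx] = first(t)                # first(t) is t[1]: a constant index
    store_all!(a, idx, Base.tail(t))           # recurse on the remaining elements
end

@kernel function check(a, t)
    idx = @index(Global, Linear)
    store_all!(a, idx, t)
end

Because every call in the recursion is fully typed, the compiler effectively unrolls the loop at compile time, so the CUDA and AMDGPU examples above should compile.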
