
Cannot iterate over a Tuple of mixed Type #607

Open · leios opened this issue Jul 18, 2024 · 1 comment
Labels: bug, upstream

Comments


leios commented Jul 18, 2024

I am posting this as a "bug", but I am not sure it's fixable. To be clear, the actual error in JuliaGPU/CUDA.jl#2450 comes from the fact that we cannot iterate through a Tuple of mixed element types. I think this is a Julia-specific problem because, honestly, no one else is crazy enough to send a container of mixed type to the GPU. Julia has to, though: every function has its own inherent type, so there is no other way to pass in a collection of functions.
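(For illustration, and not from the original report: every Julia function is a singleton instance of its own type, so a tuple of distinct functions is always a mixed-type tuple:)

julia> f1(x) = x + 1;

julia> f2(x) = 2x;

julia> typeof((f1, f2))
Tuple{typeof(f1), typeof(f2)}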

Anyway, here's an example of something that fails (in KernelAbstractions.jl, sorry):

@kernel function check(a, t)
    idx = @index(Global, Linear)
    for i = 1:length(t)
        a[idx] = t[i]
    end
end
...
a = CuArray(ones(10))  # or ROCArray(ones(10)) on AMD

check(get_backend(a))(a, (1, 3.0), ndrange = length(a))

The workarounds for JuliaGPU/CUDA.jl#2450 also work here.

CUDA error:

ERROR: InvalidIRError: compiling kernel #gpu_check_kernel(KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to jl_f_getfield)
Stacktrace:
 [1] getindex
   @ ./tuple.jl:29
 [2] macro expansion
   @ ~/projects/sketches/tuple_test.jl:29
 [3] gpu_check_kernel
   @ ~/.julia/packages/KernelAbstractions/C8flJ/src/macros.jl:81
 [4] gpu_check_kernel
   @ ./none:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_check_kernel), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/validation.jl:141
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:418 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/LHjFw/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:417 [inlined]
  [5] emit_llvm(job::GPUCompiler.CompilerJob, method_instance::Any; libraries::Bool, deferred_codegen::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool, ctx::LLVM.ThreadSafeContext)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/utils.jl:83
  [6] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.ThreadSafeContext)
    @ CUDA ~/.julia/packages/CUDA/ZdCxS/src/compiler/execution.jl:360
  [7] #221
    @ ~/.julia/packages/CUDA/ZdCxS/src/compiler/execution.jl:354 [inlined]
  [8] LLVM.ThreadSafeContext(f::CUDA.var"#221#222"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_check_kernel), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}}}}})
    @ LLVM ~/.julia/packages/LLVM/HykgZ/src/executionengine/ts_module.jl:14
  [9] JuliaContext(f::CUDA.var"#221#222"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_check_kernel), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:74
 [10] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/ZdCxS/src/compiler/execution.jl:353
 [11] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/cache.jl:90
 [12] cufunction(f::typeof(gpu_check_kernel), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}}}; name::Nothing, always_inline::Bool, kwargs::Base.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:maxthreads,), Tuple{Int64}}})
    @ CUDA ~/.julia/packages/CUDA/ZdCxS/src/compiler/execution.jl:306
 [13] macro expansion
    @ ~/.julia/packages/CUDA/ZdCxS/src/compiler/execution.jl:102 [inlined]
 [14] (::KernelAbstractions.Kernel{CUDADevice{false, false}, KernelAbstractions.NDIteration.StaticSize{(256,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_check_kernel)})(::CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, ::Vararg{Any}; ndrange::Int64, dependencies::CUDAKernels.CudaEvent, workgroupsize::Nothing, progress::Function)
    @ CUDAKernels ~/.julia/packages/CUDAKernels/3IKLV/src/CUDAKernels.jl:283
 [15] check(a::CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, t::Tuple{typeof(f1), typeof(f1)}, t_args::Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, t_kwargs::Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}})
    @ Main ~/projects/sketches/tuple_test.jl:21
 [16] top-level scope
    @ REPL[9]:1
 [17] top-level scope
    @ ~/.julia/packages/CUDA/ZdCxS/src/initialization.jl:155

AMDGPU error:

julia> check(a, t, t_args, t_kwargs)
ERROR: InvalidIRError: compiling kernel gpu_check_kernel(KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, AMDGPU.Device.ROCDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to ijl_get_nth_field_checked)
Stacktrace:
 [1] getindex
   @ ./tuple.jl:29
 [2] macro expansion
   @ ~/projects/Fae.jl/sketches/KA_test.jl:28
 [3] gpu_check_kernel
   @ ~/.julia/packages/KernelAbstractions/C8flJ/src/macros.jl:81
 [4] gpu_check_kernel
   @ ./none:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.ROCCompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_check_kernel), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, AMDGPU.Device.ROCDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/validation.jl:141
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:418 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/LHjFw/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:417 [inlined]
  [5] emit_llvm(job::GPUCompiler.CompilerJob, method_instance::Any; libraries::Bool, deferred_codegen::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool, ctx::LLVM.ThreadSafeContext)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/utils.jl:83
  [6] emit_llvm
    @ ~/.julia/packages/GPUCompiler/S3TWf/src/utils.jl:77 [inlined]
  [7] (::AMDGPU.Compiler.var"#59#62"{GPUCompiler.CompilerJob{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.ROCCompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_check_kernel), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, AMDGPU.Device.ROCDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}}}}, Core.MethodInstance})(ctx::LLVM.ThreadSafeContext)
    @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/bzHD4/src/compiler/codegen.jl:183
  [8] LLVM.ThreadSafeContext(f::AMDGPU.Compiler.var"#59#62"{GPUCompiler.CompilerJob{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.ROCCompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_check_kernel), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, AMDGPU.Device.ROCDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}}}}, Core.MethodInstance})
    @ LLVM ~/.julia/packages/LLVM/HykgZ/src/executionengine/ts_module.jl:14
  [9] JuliaContext(f::AMDGPU.Compiler.var"#59#62"{GPUCompiler.CompilerJob{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.ROCCompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_check_kernel), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, AMDGPU.Device.ROCDeviceVector{Float64, 1}, Tuple{typeof(f1), typeof(f1)}, Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}}}}}, Core.MethodInstance})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:74
 [10] rocfunction_compile(job::GPUCompiler.CompilerJob)
    @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/bzHD4/src/compiler/codegen.jl:182
 [11] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(AMDGPU.Compiler.rocfunction_compile), linker::typeof(AMDGPU.Compiler.rocfunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/cache.jl:90
 [12] rocfunction(f::typeof(gpu_check_kernel), tt::Type; name::String, device::ROCDevice, global_hooks::NamedTuple{(), Tuple{}})
    @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/bzHD4/src/compiler/codegen.jl:165
 [13] rocfunction
    @ ~/.julia/packages/AMDGPU/bzHD4/src/compiler/codegen.jl:154 [inlined]
 [14] macro expansion
    @ ~/.julia/packages/AMDGPU/bzHD4/src/highlevel.jl:430 [inlined]
 [15] (::KernelAbstractions.Kernel{ROCDevice, KernelAbstractions.NDIteration.StaticSize{(256,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_check_kernel)})(::ROCVector{Float64}, ::Vararg{Any}; ndrange::Int64, dependencies::Nothing, workgroupsize::Nothing, progress::Nothing)
    @ ROCKernels ~/.julia/packages/ROCKernels/TyQpD/src/ROCKernels.jl:197
 [16] check(a::ROCVector{Float64}, t::Tuple{typeof(f1), typeof(f1)}, t_args::Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}, t_kwargs::Tuple{NamedTuple{(:c,), Tuple{Int64}}, NamedTuple{(:d,), Tuple{Int64}}})
    @ Main ~/projects/Fae.jl/sketches/KA_test.jl:21
 [17] top-level scope
    @ REPL[9]:1
leios added the bug label on Jul 18, 2024
maleadt transferred this issue from JuliaGPU/CUDA.jl on Jul 19, 2024
maleadt commented Jul 19, 2024

This is not easily fixable on the GPUCompiler.jl side, as it's Julia's codegen generating runtime-reliant code here. For example, a Metal-based MWE:

using Metal

function kernel(a, t)
    for j in 1:2
        @inbounds a[1] = t[j]
    end
    return
end

function main()
    a = Metal.ones(1)
    @metal kernel(a, (1, 3.0))
end

Generating CPU code for this requires the runtime:

julia> code_llvm(kernel, Tuple{Vector{Float32}, Tuple{Float32,Float64}})
;  @ /Users/tim/Julia/pkg/Metal/wip.jl:3 within `kernel`
define void @julia_kernel_4986({}* noundef nonnull align 16 dereferenceable(40) %0, { float, double }* nocapture noundef nonnull readonly align 8 dereferenceable(16) %1) #0 {
top:
  %gcframe19 = alloca [3 x {}*], align 16
  %gcframe19.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe19, i64 0, i64 0
  %2 = bitcast [3 x {}*]* %gcframe19 to i8*
  call void @llvm.memset.p0i8.i64(i8* align 16 %2, i8 0, i64 24, i1 true)
  %3 = call {}*** inttoptr (i64 6926966180 to {}*** (i64)*)(i64 261) #4
  %4 = bitcast [3 x {}*]* %gcframe19 to i64*
  store i64 4, i64* %4, align 16
  %5 = load {}**, {}*** %3, align 8
  %6 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe19, i64 0, i64 1
  %7 = bitcast {}** %6 to {}***
  store {}** %5, {}*** %7, align 8
  %8 = bitcast {}*** %3 to {}***
  store {}** %gcframe19.sub, {}*** %8, align 8
  %9 = bitcast { float, double }* %1 to i8*
  %10 = bitcast {}* %0 to float**
;  @ /Users/tim/Julia/pkg/Metal/wip.jl:5 within `kernel`
; ┌ @ tuple.jl:31 within `getindex`
   %ptls_field20 = getelementptr inbounds {}**, {}*** %3, i64 2
   %11 = bitcast {}*** %ptls_field20 to i8**
   %ptls_load2122 = load i8*, i8** %11, align 8
   %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc(i8* %ptls_load2122, i32 800, i32 32) #10
   %12 = bitcast {}* %box to i64*
   %13 = getelementptr inbounds i64, i64* %12, i64 -1
   store atomic i64 4859175776, i64* %13 unordered, align 8
   %14 = bitcast {}* %box to i8*
   call void @llvm.memcpy.p0i8.p0i8.i64(i8* noundef nonnull align 8 dereferenceable(16) %14, i8* noundef nonnull align 8 dereferenceable(16) %9, i64 16, i1 false)
   %15 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe19, i64 0, i64 2
   store {}* %box, {}** %15, align 16
   %16 = call {}* @ijl_get_nth_field_checked({}* nonnull %box, i64 0)
; └
  %17 = bitcast {}* %16 to i64*
  %18 = getelementptr inbounds i64, i64* %17, i64 -1
  %19 = load atomic i64, i64* %18 unordered, align 8
  %20 = and i64 %19, -16
  switch i64 %20, label %L16 [
    i64 4904455440, label %L7
    i64 4904455376, label %L12
  ]

I don't think we can easily specialize this; one solution would be to add specialized codegen support to essentially do union splitting, but I'm not sure that's worth it.

For a workaround, try unrolling the loop at the Julia level (i.e., before codegen, not with LLVMLoopInfo.jl), so that every tuple access uses a statically-known index; see the sketch below.
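For instance, here is a minimal sketch of that kind of unrolling (illustrative, not from the thread; store_all! is a hypothetical helper, not an API):

# Recurse over the tuple instead of indexing it with a runtime `i`;
# each recursive call sees a different concrete tuple type, so the
# compiler specializes each step and no dynamic getfield is emitted.
using KernelAbstractions

store_all!(a, idx, ::Tuple{}) = nothing        # base case: empty tuple
function store_all!(a, idx, t::Tuple)
    @inbounds a[idx] = first(t)                # first(t) is t[1]: a constant index
    store_all!(a, idx, Base.tail(t))           # recurse on the remaining elements
end

@kernel function check(a, t)
    idx = @index(Global, Linear)
    store_all!(a, idx, t)
end

Because every call in the recursion is fully typed, the compiler effectively unrolls the loop at compile time, so the CUDA and AMDGPU examples above should compile.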
