If I generate a kernel with the following command in the CLI tool:
taco "C(i, j, b) = C(i, j, b) + A(l, j, b) * B(i, k, l, b) * w(k, b)" -cuda -d=A:32,32,25866 -d=B:32,32,32,25866 -d=C:32,32,25866 -d=w:32,25866 -t=A:float -t=B:float -t=C:float -t=w:float -print-nocolor
the generated kernel has the parameter C twice in the launcher function. Here is the generated code:
// Generated by the Tensor Algebra Compiler (tensor-compiler.org)
__global__
void computeDeviceKernel0(taco_tensor_t * __restrict__ A, taco_tensor_t * __restrict__ B, taco_tensor_t * __restrict__ C, taco_tensor_t * __restrict__ w){
  int A2_dimension = (int)(A->dimensions[1]);
  int A3_dimension = (int)(A->dimensions[2]);
  float* __restrict__ A_vals = (float*)(A->vals);
  int B2_dimension = (int)(B->dimensions[1]);
  int B3_dimension = (int)(B->dimensions[2]);
  int B4_dimension = (int)(B->dimensions[3]);
  float* __restrict__ B_vals = (float*)(B->vals);
  int C1_dimension = (int)(C->dimensions[0]);
  int C2_dimension = (int)(C->dimensions[1]);
  int C3_dimension = (int)(C->dimensions[2]);
  float* __restrict__ C_vals = (float*)(C->vals);
  int w1_dimension = (int)(w->dimensions[0]);
  int w2_dimension = (int)(w->dimensions[1]);
  float* __restrict__ w_vals = (float*)(w->vals);
  int32_t i161 = blockIdx.x;
  int32_t i162 = (threadIdx.x % (256));
  if (threadIdx.x >= 256) {
    return;
  }
  int32_t i = i161 * 256 + i162;
  if (i >= C1_dimension)
    return;

  for (int32_t j = 0; j < C2_dimension; j++) {
    int32_t jC = i * C2_dimension + j;
    for (int32_t b = 0; b < C3_dimension; b++) {
      int32_t bC = jC * C3_dimension + b;
      float tl_val = 0.0;
      for (int32_t l = 0; l < B3_dimension; l++) {
        int32_t jA = l * A2_dimension + j;
        int32_t bA = jA * A3_dimension + b;
        float tk_val = 0.0;
        for (int32_t k = 0; k < w1_dimension; k++) {
          int32_t kB = i * B2_dimension + k;
          int32_t lB = kB * B3_dimension + l;
          int32_t bB = lB * B4_dimension + b;
          int32_t bw = k * w2_dimension + b;
          tk_val = tk_val + (A_vals[bA] * B_vals[bB]) * w_vals[bw];
        }
        tl_val = tl_val + tk_val;
      }
      C_vals[bC] = C_vals[bC] + tl_val;
    }
  }
}

int compute(taco_tensor_t *C, taco_tensor_t *A, taco_tensor_t *B, taco_tensor_t *w, taco_tensor_t *C) {
  int C1_dimension = (int)(C->dimensions[0]);
  computeDeviceKernel0<<<(C1_dimension + 255) / 256, 256>>>(A, B, C, w);
  cudaDeviceSynchronize();
  return 0;
}
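As an aside, nvcc (like any C++ compiler) should reject this signature with a "redefinition of parameter 'C'" error, so the emitted file does not compile as-is. For reference, this is the launcher I would have expected: the same body, with the trailing duplicate C parameter dropped:

int compute(taco_tensor_t *C, taco_tensor_t *A, taco_tensor_t *B, taco_tensor_t *w) {
  int C1_dimension = (int)(C->dimensions[0]);
  computeDeviceKernel0<<<(C1_dimension + 255) / 256, 256>>>(A, B, C, w);
  cudaDeviceSynchronize();
  return 0;
}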
I built taco from source with gcc-11 (optimization flags -fPIC -O3), with pybind11, and with the CUDA toolkit bundled with nvhpc 23.9. The commit hash (output of git log -n 1, which I hope is the right command to use here) is:
git log -n 1
commit 2b8ece4
Also, one more question: I want to use the index "b" to express batched tensor contractions here, and I hoped the kernel would distribute the workload over the last dimension of C (C->dimensions[2]), since with 25866 entries it is by far the biggest one. I also provided it in the command-line arguments in the hope that it would be used. What am I doing wrong? Should I provide a schedule, and if so, what should it look like?
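For concreteness: the generated launch configuration is <<<(C1_dimension + 255) / 256, 256>>>, which with C1_dimension = 32 is a single block of 256 threads, of which only the 32 with i < 32 survive the bounds check. Each surviving thread then serially executes the j, b, l, and k loops, i.e. 32 * 25866 * 32 * 32 ≈ 8.5e8 multiply-accumulates per thread, while the 25866-element b dimension is not parallelized at all.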
I tried:
taco "C(i, j, b) = C(i, j, b) + A(l, j, b) * B(i, k, l, b) * w(k, b)" -cuda -d=A:32,32,25866 -d=B:32,32,32,25866 -d=C:32,32,25866 -d=w:32,25866 -t=A:float -t=B:float -t=C:float -t=w:float -print-nocolor -s="parallelize(b, GPUBlock, NoRaces)"
// Generated by the Tensor Algebra Compiler (tensor-compiler.org)
terminate called after throwing an instance of 'taco::TacoException'
what(): Compiler bug at /home/primrose/Installed/taco/src/codegen/codegen_cuda.cpp:374 in visit
Please report it to developers
Condition failed: blockIDVars.size() == threadIDVars.size()
No matching GPUThread parallelize
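From the assertion it looks like the CUDA backend requires every parallelize(..., GPUBlock, ...) to be paired with a matching parallelize(..., GPUThread, ...). Here is my guess at a schedule that satisfies this, splitting b into a block variable b0 and a thread variable b1; I have not verified that it compiles, I am assuming the CLI accepts comma-separated directives in a single -s string, and b may additionally need to be reordered outermost first:

taco "C(i, j, b) = C(i, j, b) + A(l, j, b) * B(i, k, l, b) * w(k, b)" -cuda -d=A:32,32,25866 -d=B:32,32,32,25866 -d=C:32,32,25866 -d=w:32,25866 -t=A:float -t=B:float -t=C:float -t=w:float -print-nocolor -s="split(b,b0,b1,256),parallelize(b0,GPUBlock,NoRaces),parallelize(b1,GPUThread,NoRaces)"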