MatrixMultiplication example ERROR : clBuildProgram -> Returned: -11 #397
Comments
Ok, sorry, I was part of it since I jumped first to the docker-tornado repo. I will take a look and let you know. |
My take here is that it is related to the thread-block size being used.
Task info: s0.t0
Backend : OPENCL
Device : NVIDIA GeForce RTX 3070 CL_DEVICE_TYPE_GPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [512, 512]
Local work size : [32, 32, 1]
Number of workgroups : [16, 16]
[TornadoVM-OCL-JNI] ERROR : [JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
clEnqueueNDRangeKernel -> Returned: [JNI] uk.ac.manchester.tornado.drivers.opencl> CL_OUT_OF_RESOURCES error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 3070 (Device 0).
Tornado does not fail during compilation, but during the kernel dispatch. If I select blocks of 16x16, then I get correct executions:
Task info: s0.t0
Backend : OPENCL
Device : NVIDIA GeForce RTX 3070 CL_DEVICE_TYPE_GPU (available)
Dims : 2
Global work offset: [0, 0, 0]
Global work size : [512, 512, 1]
Local work size : [16, 16, 1]
Number of workgroups : [32, 32, 1]
The diff:
diff --git a/example/pom.xml b/example/pom.xml
index 30c2487..785dc2d 100644
--- a/example/pom.xml
+++ b/example/pom.xml
@@ -29,12 +29,12 @@
<dependency>
<groupId>tornado</groupId>
<artifactId>tornado-api</artifactId>
- <version>1.0.3-dev</version>
+ <version>1.0.4-dev</version>
</dependency>
<dependency>
<groupId>tornado</groupId>
<artifactId>tornado-matrices</artifactId>
- <version>1.0.3-dev</version>
+ <version>1.0.4-dev</version>
</dependency>
<dependency>
<groupId>junit</groupId>
diff --git a/example/src/main/java/example/MatrixMultiplication.java b/example/src/main/java/example/MatrixMultiplication.java
index a37b0d6..d97439e 100644
--- a/example/src/main/java/example/MatrixMultiplication.java
+++ b/example/src/main/java/example/MatrixMultiplication.java
@@ -19,6 +19,7 @@ package example;
import java.util.Random;
import java.util.stream.IntStream;
+import uk.ac.manchester.tornado.api.*;
import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
@@ -88,7 +89,12 @@ public class MatrixMultiplication {
ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();
TornadoExecutionPlan executor = new TornadoExecutionPlan(immutableTaskGraph);
- executor.withWarmUp();
+
+ WorkerGrid workerGrid = new WorkerGrid2D(matrixA.getNumRows(), matrixA.getNumColumns());
+ GridScheduler gridScheduler = new GridScheduler("s0.t0", workerGrid);
+ workerGrid.setLocalWork(16, 16, 1);
+
+ executor.withGridScheduler(gridScheduler).withWarmUp();
// 1. Warm up Tornado
for (int i = 0; i < WARMING_UP_ITERATIONS; i++) {
diff --git a/example/target/example-1.0-SNAPSHOT.jar b/example/target/example-1.0-SNAPSHOT.jar
index df83b80..9fff7be 100644
Binary files a/example/target/example-1.0-SNAPSHOT.jar and b/example/target/example-1.0-SNAPSHOT.jar differ
Funny, because on my native system I can run with blocks of 32x32, so this is a driver issue within the Docker image. |
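The two thread-info dumps above differ only in the local work size, so the arithmetic behind them can be sketched in plain Java. This is an illustration under stated assumptions, not TornadoVM code: the class and method names are hypothetical, and no OpenCL call is made. It shows that a 32x32 block asks for 1024 work-items per work-group while 16x16 asks for 256, and that the number of work-groups is the global size divided by the local size in each dimension.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of TornadoVM or OpenCL.
final class WorkGroupMath {

    /** Work-items in one work-group: the product of the local sizes. */
    static long threadsPerGroup(long... local) {
        long product = 1;
        for (long size : local) {
            product *= size;
        }
        return product;
    }

    /** Work-groups per dimension: global size divided by local size. */
    static long[] groupsPerDim(long[] global, long[] local) {
        long[] groups = new long[global.length];
        for (int d = 0; d < global.length; d++) {
            if (global[d] % local[d] != 0) {
                throw new IllegalArgumentException("global size must be a multiple of local size in dim " + d);
            }
            groups[d] = global[d] / local[d];
        }
        return groups;
    }

    public static void main(String[] args) {
        long[] global = {512, 512, 1};
        long[][] candidates = {{32, 32, 1}, {16, 16, 1}};
        for (long[] local : candidates) {
            System.out.println("local " + Arrays.toString(local)
                    + ": " + threadsPerGroup(local) + " work-items per group, groups "
                    + Arrays.toString(groupsPerDim(global, local)));
        }
    }
}
```

A smaller block lowers the per-group resource demand at dispatch time, which is consistent with 16x16 succeeding where 32x32 returned CL_OUT_OF_RESOURCES, although the thread does not establish the root cause.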
I pushed the "fix" in the tornado-docker repo: beehive-lab/docker-tornadovm@cb5f48c |
For reference:
./run_nvidia_openjdk.sh tornado -cp example/target/example-1.0-SNAPSHOT.jar example.MatrixMultiplication 1024
WARNING: Using incubator modules: jdk.incubator.vector
Computing MxM of 1024x1024
[TornadoVM] Warning: The loop bounds will be configured by the GridScheduler. Check the grid by using the flag --threadInfo.
Single Threaded CPU Execution: 0.92 GFlops, Total time = 2326 ms
Streams Execution: 10.28 GFlops, Total time = 209 ms
TornadoVM Execution on GPU (Accelerated): 1073.74 GFlops, Total Time = 2 ms
Speedup: 1163.0x
Verification true
I am using a new docker image for a new TornadoVM version coming up tomorrow ;-) |
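The GFlops figures in this output can be cross-checked with a short calculation. The sketch below assumes the example counts 2·N³ floating-point operations for an N x N matrix multiplication (one multiply and one add per inner-loop step); that assumption is not stated in the thread, but it matches the reported numbers. The class and method names are hypothetical.

```java
// Hypothetical helper for illustration only; assumes 2 * N^3 operations per multiplication.
final class GFlopsCheck {

    /** GFlops for an N x N matrix multiplication that took the given time in milliseconds. */
    static double gflops(int n, double millis) {
        double flops = 2.0 * n * n * n;          // one multiply and one add per inner iteration
        return flops / (millis / 1000.0) / 1e9;  // operations per second, scaled to giga
    }

    public static void main(String[] args) {
        // Times taken from the 1024x1024 run reported above.
        System.out.printf("CPU:     %.2f GFlops%n", gflops(1024, 2326));
        System.out.printf("Streams: %.2f GFlops%n", gflops(1024, 209));
        System.out.printf("GPU:     %.2f GFlops%n", gflops(1024, 2));
        System.out.printf("Speedup: %.1fx%n", 2326.0 / 2.0);
    }
}
```

Because the times are reported in whole milliseconds, the accelerated figure is very sensitive to rounding: a 2 ms measurement could correspond to a noticeably different real throughput.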
Thanks for looking into it so quickly. I'll try out tomorrow with the new version and report back! |
New docker images: https://hub.docker.com/r/beehivelab/tornadovm-nvidia-openjdk |
Hi @jjfumero, I tried with the new docker image, but I'm still getting the same error.
so I assume it's running the correct version. Is there any way I can get more information out of OpenCL? Or is there any information I can provide that helps you? It's also not completely unrealistic that there is something wrong due to NixOS, but my understanding of how things interact with each other isn't deep enough here. (I'm currently trying to manually build TornadoVM from source, but it fails due to cmake not finding the jni/jawt header files, which I also don't understand, as they are present.) |
The TornadoVM automatic installer should bring all necessary dependencies (Java, cmake, maven, etc). I do not know if there is something specific for NixOS, but to install TornadoVM from source, can you try the following, assuming you want the OpenCL backend and you have the NVIDIA driver installed for your NVIDIA GPU:
$ ./bin/tornadovm-installer --jdk jdk21 --backend=opencl |
On my system (Ubuntu 23.10), I tested the script that failed for you and it is working:
./run_nvidia_openjdk.sh tornado -cp example/target/example-1.0-SNAPSHOT.jar example.MatrixMultiplication --fullDebug -pk
WARNING: Using incubator modules: jdk.incubator.vector
Computing MxM of 512x512
[INFO] Loading Backend: uk.ac.manchester.tornado.drivers.opencl.OCLTornadoDriverProvider@52aa2946
TornadoGraph dependency matrix...
+----+---------------+
| 5 [data]| <none>
|----+---------------+
| 6 [data]| <none>
|----+---------------+
| 7 [data]| <none>
|----+---------------+
| 8 [data]| <none>
|----+---------------+
| 9 [data]| 10
|----+---------------+
| 10 [task]| 6 7 8
|----+---------------+
| 11 [data]| 10
|----+---------------+
| 12 [data]| 11
|----+---------------+
| 13 [data]| 11
|----+---------------+
| 14 [data]| 11
|----+---------------+
-----------------------------------
Device Table:
[0]: [NVIDIA CUDA] -- NVIDIA RTX A2000 8GB Laptop GPU
Constant Table:
[0]: 512
Object Table:
[0]: 0x16f7c8c1 MatrixFloat <512 x 512>
[1]: 0x573906eb MatrixFloat <512 x 512>
[2]: 0x4ebff610 MatrixFloat <512 x 512>
Task Table:
[0]: task s0.t0 - matrixMultiplication
-----------------------------------
-----------------------------------
TaskGraph:
[0]: constant 0
[1]: object 0
[2]: object 1
[3]: object 2
[4]: context device=0, [ 5 6 7 8 10 11 12 13 14 ]
[5]: persist node
[6]: copy in object 0
[7]: copy in object 1
[8]: copy in object 2
[9]: dependent write on object 2 by task 10
[10]: task=0, args=[ 6 7 8 0 ]
[11]: copy out object 2 after task 10
[12]: deallocate object 1 after 11
[13]: deallocate object 2 after 11
[14]: deallocate object 3 after 11
-----------------------------------
[TornadoVM] Warning: The loop bounds will be configured by the GridScheduler. Check the grid by using the flag --threadInfo.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
__kernel void matrixMultiplication(__global long *_kernel_context, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics, __global uchar *A, __global uchar *B, __global uchar *C, __private int size)
{
int i_3, i_34, i_28, i_9, i_41, i_8, i_11, i_10, i_42, i_37, i_4, i_17, i_13, i_12, i_15, i_14, i_27, i_20;
long l_39, l_29, l_30, l_21, l_22, l_38;
float f_24, f_16, f_32, f_33;
ulong ul_19, ul_18, ul_23, ul_25, ul_26, ul_31, ul_1, ul_0, ul_35, ul_2, ul_5, ul_36, ul_7, ul_6, ul_40;
// BLOCK 0
ul_0 = (ulong) A;
ul_1 = (ulong) B;
ul_2 = (ulong) C;
i_3 = get_global_size(0);
i_4 = get_global_size(1);
ul_5 = ul_2 + 32L;
ul_6 = ul_1 + 32L;
ul_7 = ul_0 + 32L;
i_8 = get_global_id(0);
i_9 = get_global_id(1);
i_10 = _kernel_context[0];
// BLOCK 1 MERGES [0 8 ]
i_11 = i_9;
for(;i_11 < i_10;)
{
// BLOCK 2
i_12 = i_11 << 9;
i_13 = i_12 + 6;
// BLOCK 3 MERGES [2 7 ]
i_14 = i_8;
for(;i_14 < i_10;)
{
// BLOCK 4
i_15 = i_14 + 6;
// BLOCK 5 MERGES [4 6 ]
f_16 = 0.0F;
i_17 = 0;
for(;i_17 < i_10;)
{
// BLOCK 6
ul_18 = *((__global ulong *) ul_7);
ul_19 = ul_0 + ul_18;
i_20 = i_13 + i_17;
l_21 = (long) i_20;
l_22 = l_21 << 2;
ul_23 = ul_19 + l_22;
f_24 = *((__global float *) ul_23);
ul_25 = *((__global ulong *) ul_6);
ul_26 = ul_1 + ul_25;
i_27 = i_17 << 9;
i_28 = i_27 + i_15;
l_29 = (long) i_28;
l_30 = l_29 << 2;
ul_31 = ul_26 + l_30;
f_32 = *((__global float *) ul_31);
f_33 = fma(f_24, f_32, f_16);
i_34 = i_17 + 1;
f_16 = f_33;
i_17 = i_34;
} // B6
// BLOCK 7
ul_35 = *((__global ulong *) ul_5);
ul_36 = ul_2 + ul_35;
i_37 = i_14 + i_13;
l_38 = (long) i_37;
l_39 = l_38 << 2;
ul_40 = ul_36 + l_39;
*((__global float *) ul_40) = f_16;
i_41 = i_3 + i_14;
i_14 = i_41;
} // B7
// BLOCK 8
i_42 = i_4 + i_11;
i_11 = i_42;
} // B8
// BLOCK 9
return;
} // kernel
Single Threaded CPU Execution: 1.64 GFlops, Total time = 164 ms
Streams Execution: 14.91 GFlops, Total time = 18 ms
TornadoVM Execution on GPU (Accelerated): 268.44 GFlops, Total Time = 1 ms
Speedup: 164.0x
Verification true
cleanup: programs ..........0.000256351 s
cleanup: context ..........0.000006982 s
cleanup: total ..........0.000263333 s
./run_nvidia_openjdk.sh tornado --version
version=1.0.4
branch=master
commit=585574e
Backends installed:
- opencl
I share the same opinion as @jjfumero: it will be easier to understand the problem with your platform if you can install it locally and run the example without Docker. Do you have any modified source code? |
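For readers following the generated kernel in the log above, its loop structure can be read back as ordinary Java. The sketch below is an interpretation, not the example's actual source: it emulates each work-item sequentially, uses flat row-major arrays instead of MatrixFloat, and uses hypothetical names throughout. Each work-item starts at its global id and strides by the global size, which is what the `i_11`/`i_14` loops in the kernel do, and the inner loop accumulates with `fma`. In the 512x512 kernel the row stride appears as a shift by 9.

```java
// Hypothetical sequential emulation of the kernel's loop structure; not TornadoVM code.
final class StridedMxM {

    /** Work done by one emulated work-item (gidX, gidY) in a gsX x gsY global range. */
    static void workItem(float[] a, float[] b, float[] c, int size,
                         int gidX, int gidY, int gsX, int gsY) {
        for (int i = gidY; i < size; i += gsY) {          // rows, strided by global size in dim 1
            for (int j = gidX; j < size; j += gsX) {      // columns, strided by global size in dim 0
                float sum = 0.0f;
                for (int k = 0; k < size; k++) {
                    sum = Math.fma(a[i * size + k], b[k * size + j], sum);
                }
                c[i * size + j] = sum;
            }
        }
    }

    /** Runs every work-item of the global range one after another. */
    static float[] multiply(float[] a, float[] b, int size, int gsX, int gsY) {
        float[] c = new float[size * size];
        for (int y = 0; y < gsY; y++) {
            for (int x = 0; x < gsX; x++) {
                workItem(a, b, c, size, x, y, gsX, gsY);
            }
        }
        return c;
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4};
        float[] b = {5, 6, 7, 8};
        System.out.println(java.util.Arrays.toString(multiply(a, b, 2, 2, 2)));
    }
}
```

The result does not depend on the global range chosen, since every output element is written by exactly one work-item; only the amount of work per work-item changes.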
Sadly there is, dynamically linked binaries typically cannot be executed without patching. That's why I tried the manual installation, I guess I have to figure out what's up with it not finding JNI headers... I'll let you know when I get it to run. I'm just confused that it doesn't work in the docker container, especially as it finds the device. |
It seems that the problem is in the driver when it compiles the generated kernel from TornadoVM. I would suggest trying one of the polyglot images, if possible. Those images are bigger, but I installed the NVIDIA OpenCL driver in the container image manually. For example, the tornadovm-polyglot-graalpy image. Also, this container is built on a commit point prior to the latest release, but it may help to see if the driver in the container fails. |
Thanks, I gave that a try and ran |
Going back to the initial problem:
Which is different from the error I encountered. To debug your error, I suggest 2 things:
|
Thanks for the pointers. https://gist.github.com/SirYwell/d9ae4b5393de135ec15429c54d031820 already contains the output from running with The 2. seems to work, I can compile the example kernel and the one from the output I got from the MatrixMultiplication example. The compileKernel program crashes, but from my debugging, that only happens after the kernelBin.bin is already written (after the output of https://github.com/jjfumero/scripts/blob/c8e52c3e83bb7db529ab11f9ae1d61e738792d8d/opencl/compileKernel/compileKernel.cpp#L302 ) |
Describe the bug
A clear and concise description of what the bug is.
I'm getting the following error:
when trying to run the MatrixMultiplication example using Docker.
My current assumption is that this is caused by my GPU not supporting FP16 (from my understanding, output of clinfo here: https://gist.github.com/SirYwell/bdc347db5c4b5f66e2c664666fb0313f), but TornadoVM unconditionally emitting
TornadoVM/tornado-drivers/opencl/src/main/java/uk/ac/manchester/tornado/drivers/opencl/graal/asm/OCLAssembler.java
Line 77 in e8faf81
As I haven't yet figured out how I can build and run TornadoVM on NixOS directly, I can't test if changing that already fixes the problem.
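The idea behind this assumption, emitting an extension pragma only when the device advertises that extension, can be sketched as follows. This is a hypothetical illustration and not how OCLAssembler.java is written: the class, the method names, and the sample extension string are all made up. The only fact relied on is that OpenCL reports device extensions as a space-separated list of names.

```java
// Hypothetical sketch of conditional pragma emission; not TornadoVM's implementation.
final class PragmaSketch {

    private static final String[] WANTED = {
        "cl_khr_fp64", "cl_khr_fp16", "cl_khr_int64_base_atomics"
    };

    /** True if the space-separated extension list contains the exact extension name. */
    static boolean supports(String deviceExtensions, String extension) {
        for (String candidate : deviceExtensions.trim().split("\\s+")) {
            if (candidate.equals(extension)) {
                return true;
            }
        }
        return false;
    }

    /** Builds the kernel header, skipping pragmas for extensions the device lacks. */
    static String header(String deviceExtensions) {
        StringBuilder sb = new StringBuilder();
        for (String extension : WANTED) {
            if (supports(deviceExtensions, extension)) {
                sb.append("#pragma OPENCL EXTENSION ").append(extension).append(" : enable\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up extension string for a device without FP16 support.
        String extensions = "cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_icd";
        System.out.print(header(extensions));
    }
}
```

Whether the unconditional FP16 pragma is actually what triggers the -11 build error is not confirmed anywhere in this issue, so this only shows what the proposed change would look like in principle.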
How To Reproduce
After cloning the docker-tornado repository, I'm running
./run_nvidia_openjdk.sh tornado -cp example/target/example-1.0-SNAPSHOT.jar example.MatrixMultiplication --fullDebug -pk
The output can be found here: https://gist.github.com/SirYwell/d9ae4b5393de135ec15429c54d031820
Note: I made a few adjustments to make things work:
- pom.xml: I changed the version of tornado-api and tornado-matrices from 1.0.3-dev to 1.0.3.
- MatrixMultiplication.java: I changed WARMING_UP_ITERATIONS from 100 to 1 to reduce output.
- run_nvidia_openjdk.sh: I changed #!/bin/bash to #!/usr/bin/env bash to make it work on NixOS. I also added the --gpus all flag to the docker command.
Expected behavior
Expecting the example to run.
Computing system setup (please complete the following information):
Output of ./run_nvidia_openjdk.sh tornado --devices:
Output of ./run_nvidia_openjdk.sh tornado --version:
Additional context
I'm using the nvidia-container-toolkit as described in beehive-lab/docker-tornadovm#8, but I don't think it is related to this issue.