Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test #47: clblast_test_xconvgemm ...........Subprocess aborted***Exception: 0.45 sec #563

Open
vt-alt opened this issue Nov 20, 2024 · 12 comments

Comments

@vt-alt
Copy link

vt-alt commented Nov 20, 2024

clblast_test_xconvgemm fails on x86-64:

root@x86_64:~/RPM/BUILD/clblast-1.6.3/x86_64-alt-linux# ctest -R clblast_test_xconvgemm --verbose
UpdateCTestConfiguration  from :/usr/src/RPM/BUILD/clblast-1.6.3/x86_64-alt-linux/DartConfiguration.tcl
UpdateCTestConfiguration  from :/usr/src/RPM/BUILD/clblast-1.6.3/x86_64-alt-linux/DartConfiguration.tcl
Test project /usr/src/RPM/BUILD/clblast-1.6.3/x86_64-alt-linux
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 47
    Start 47: clblast_test_xconvgemm

47: Test command: /usr/src/RPM/BUILD/clblast-1.6.3/x86_64-alt-linux/clblast_test_xconvgemm
47: Working Directory: /usr/src/RPM/BUILD/clblast-1.6.3/x86_64-alt-linux
47: Test timeout computed to be: 10000000
47: LLVM ERROR: Instruction Combining did not reach a fixpoint after 1 iterations
1/1 Test #47: clblast_test_xconvgemm ...........Subprocess aborted***Exception:   0.41 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   0.42 sec

The following tests FAILED:
         47 - clblast_test_xconvgemm (Subprocess aborted)
Errors while running CTest
Output from these tests are in: /usr/src/RPM/BUILD/clblast-1.6.3/x86_64-alt-linux/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
root@x86_64:~/RPM/BUILD/clblast-1.6.3/x86_64-alt-linux# cat /usr/src/RPM/BUILD/clblast-1.6.3/x86_64-alt-linux/Testing/Temporary/LastTest.log
Start testing: Nov 20 23:22 UTC
----------------------------------------------------------
47/53 Testing: clblast_test_xconvgemm
47/53 Test: clblast_test_xconvgemm
Command: "/usr/src/RPM/BUILD/clblast-1.6.3/x86_64-alt-linux/clblast_test_xconvgemm"
Directory: /usr/src/RPM/BUILD/clblast-1.6.3/x86_64-alt-linux
"clblast_test_xconvgemm" start time: Nov 20 23:22 UTC
Output:
----------------------------------------------------------
LLVM ERROR: Instruction Combining did not reach a fixpoint after 1 iterations
<end of output>
Test time =   0.41 sec
----------------------------------------------------------
Test Failed.
"clblast_test_xconvgemm" end time: Nov 20 23:22 UTC
"clblast_test_xconvgemm" time elapsed: 00:00:00
----------------------------------------------------------

End testing: Nov 20 23:22 UTC

Interestingly, when I first built it on 2024-07-08, all tests passed.

@CNugteren
Copy link
Owner

Could you compile CLBlast in VERBOSE mode to get additional output? And perhaps also share the output of the clblast_test_diagnostics program? Thanks!

@vt-alt
Copy link
Author

vt-alt commented Nov 21, 2024

There is temporary build log with VERBOSE build as you requested: https://git.altlinux.org/tasks/363103/build/100/x86_64/log
It includes run of clblast_test_diagnostics too.

@CNugteren
Copy link
Owner

Thanks, but I think either CMake is suppressing the output of the test itself, or the VERBOSE build did not work. If you run ./clblast_test_xconvgemm you should get a lot of output.

@vt-alt
Copy link
Author

vt-alt commented Nov 21, 2024

Maybe there is not a lot of output because it's aborted early?

root@x86_64:~/RPM/BUILD/clblast-1.6.3/x86_64-alt-linux# ./clblast_test_xconvgemm

* NOTE: This non-BLAS routine is tested against a custom implementation,
  not against clBLAS or a CPU BLAS library. Thus, the arguments '-clblas'
  and '-cblas' have no effect.

* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]
    -full_test [false]
    -verbose [false]
    -cblas 1 [=default]

* Running on OpenCL device 'cpu-x86-64-QEMU Virtual CPU version 2.5+'.
* Starting tests for the 'SCONVGEMM' routine. Legend:
   : -> Test produced correct results
   . -> Test returned the correct error code
   X -> Test produced incorrect results
   / -> Test returned an incorrect error code
   \ -> Test not executed: OpenCL-kernel compilation error
   o -> Test not executed: Unsupported precision
   - -> Test not completed: Reference CBLAS doesn't output error codes
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for 'default':
   [DEBUG] Searching database for kernel 'Xconvgemm'
[DEBUG] Device type 'CPU'; vendor 'Intel'
[DEBUG] Device name 'cpu-x86-64-QEMU Virtual CPU version 2.5+'; architecture ''
[DEBUG] Found architectures of vendor 'Intel' and type 'CPU'
[DEBUG] Found devices of architecture type 'default'
[DEBUG] Found parameters for device type 'default'
[DEBUG] CONVGEMM_Xconvgemm_1_16_32_32_8_0_0_1_1_32
[DEBUG] Compiling routine 'CONVGEMM-32 (single)'
[DEBUG] Completed compilation in 84.57 ms
LLVM ERROR: Instruction Combining did not reach a fixpoint after 1 iterations
Aborted

In gdb:

...
LLVM ERROR: Instruction Combining did not reach a fixpoint after 1 iterations

Thread 1 "clblast_test_xc" received signal SIGABRT, Aborted.
0x00007ffff74b567c in ?? () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff74b567c in ?? () from /lib64/libc.so.6
#1  0x00007ffff7466982 in raise () from /lib64/libc.so.6
#2  0x00007ffff744f4ef in abort () from /lib64/libc.so.6
#3  0x00007fffefe92115 in llvm::report_fatal_error(llvm::Twine const&, bool) () from /lib64/libLLVM.so.18.1
#4  0x00007ffff0c537a6 in ?? () from /lib64/libLLVM.so.18.1
#5  0x00007ffff0c5313e in llvm::InstCombinePass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) ()
   from /lib64/libLLVM.so.18.1
#6  0x00007ffff5a12d7d in ?? () from /lib64/libclang-cpp.so.18.1
#7  0x00007ffff009ac62 in llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) () from /lib64/libLLVM.so.18.1
#8  0x00007ffff1ea4a5d in ?? () from /lib64/libLLVM.so.18.1
#9  0x00007ffff009e082 in llvm::ModuleToFunctionPassAdaptor::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) ()
   from /lib64/libLLVM.so.18.1
#10 0x00007ffff1ea489d in ?? () from /lib64/libLLVM.so.18.1
#11 0x00007ffff0099ae4 in llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) () from /lib64/libLLVM.so.18.1
#12 0x00007ffff72bd4ae in TwoStagePoCLModulePassManager::run(llvm::Module&) () from /lib64/libpocl.so.2.13.0
#13 0x00007ffff72c1521 in ?? () from /lib64/libpocl.so.2.13.0
#14 0x00007ffff72bef07 in pocl_llvm_generate_workgroup_function_nowrite () from /lib64/libpocl.so.2.13.0
#15 0x00007ffff724664b in pocl_check_kernel_disk_cache () from /lib64/libpocl.so.2.13.0
#16 0x00007ffff724756d in pocl_check_kernel_dlhandle_cache () from /lib64/libpocl.so.2.13.0
#17 0x00007ffff733cf54 in pocl_basic_compile_kernel () from /lib64/pocl/libpocl-devices-basic.so
#18 0x00007ffff724b7fd in pocl_driver_build_poclbinary () from /lib64/libpocl.so.2.13.0
#19 0x00007ffff721401b in ?? () from /lib64/libpocl.so.2.13.0
#20 0x00007ffff7213e2a in POclGetProgramInfo () from /lib64/libpocl.so.2.13.0
#21 0x00007ffff795dc20 in clGetProgramInfo () from /lib64/libOpenCL.so.1
#22 0x00007ffff7a0e61d in clblast::Program::GetIR[abi:cxx11]() const (this=0x555555aa7800)
    at /usr/src/RPM/BUILD/clblast-1.6.3/src/clpp11.hpp:526
#23 clblast::Routine::InitProgram (this=this@entry=0x7fffffffca10, source=...)
    at /usr/src/RPM/BUILD/clblast-1.6.3/src/routine.cpp:139
#24 0x00007ffff7a0ef11 in clblast::Routine::Routine (this=<optimized out>, queue=..., event=<optimized out>, name=...,
    kernel_names=..., precision=<optimized out>, userDatabase=..., source=..., this=<optimized out>, queue=...,
    event=<optimized out>, name=..., kernel_names=..., precision=<optimized out>, userDatabase=..., source=...)
    at /usr/src/RPM/BUILD/clblast-1.6.3/src/routine.cpp:66
#25 0x00007ffff7baeae4 in clblast::Xconvgemm<float>::Xconvgemm (this=<optimized out>, queue=..., event=<optimized out>,
    name=..., method=<optimized out>, this=<optimized out>, queue=..., event=<optimized out>, name=...,
    method=<optimized out>) at /usr/src/RPM/BUILD/clblast-1.6.3/src/routines/levelx/xconvgemm.cpp:40
#26 0x00007ffff7a58ab0 in clblast::Convgemm<float> (kernel_mode=clblast::KernelMode::kCrossCorrelation, channels=1,
    height=7, width=7, kernel_h=1, kernel_w=1, pad_h=0, pad_w=0, stride_h=1, stride_w=1, dilation_h=1, dilation_w=1,
    num_kernels=1, batch_count=1, im_buffer=0x5555558b9b40, im_offset=0, kernel_buffer=0x5555558b9d80, kernel_offset=0,
    result_buffer=0x5555558b9fc0, result_offset=0, queue=0x7fffffffcbc0, event=0x7fffffffcbb8)
    at /usr/src/RPM/BUILD/clblast-1.6.3/src/clblast.cpp:2315
#27 0x00005555555d9fb7 in clblast::TestXconvgemm<float>::RunRoutine (args=..., buffers=..., queue=...)
    at /usr/src/RPM/BUILD/clblast-1.6.3/test/routines/levelx/xconvgemm.hpp:95
#28 0x00005555555d99a2 in std::__invoke_impl<clblast::StatusCode, clblast::StatusCode (*&)(clblast::Arguments<float> const&, clblast::Buffers<float>&, clblast::Queue&), clblast::Arguments<float> const&, clblast::Buffers<float>&, clblast::Queue&--Type <RET> for more, q to quit, c to continue without paging--
> (__f=<optimized out>) at /usr/include/c++/14/bits/invoke.h:60
#29 std::__invoke_r<clblast::StatusCode, clblast::StatusCode (*&)(clblast::Arguments<float> const&, clblast::Buffers<float>&, clblast::Queue&), clblast::Arguments<float> const&, clblast::Buffers<float>&, clblast::Queue&> (__fn=<optimized out>)
    at /usr/include/c++/14/bits/invoke.h:138
#30 std::_Function_handler<clblast::StatusCode (clblast::Arguments<float> const&, clblast::Buffers<float>&, clblast::Queue&), clblast::StatusCode (*)(clblast::Arguments<float> const&, clblast::Buffers<float>&, clblast::Queue&)>::_M_invoke(std::_Any_data const&, clblast::Arguments<float> const&, clblast::Buffers<float>&, clblast::Queue&) (__functor=...,
    __args#0=..., __args#1=..., __args#2=...) at /usr/include/c++/14/bits/std_function.h:290
#31 0x00005555555ba8bc in std::function<clblast::StatusCode (clblast::Arguments<float> const&, clblast::Buffers<float>&, clblast::Queue&)>::operator()(clblast::Arguments<float> const&, clblast::Buffers<float>&, clblast::Queue&) const (
    this=0x7fffffffe138, __args#0=..., __args#1=..., __args#2=...) at /usr/include/c++/14/bits/std_function.h:591
#32 clblast::TestBlas<float, float>::TestRegular (this=0x7fffffffdf60, test_vector=..., name=...)
    at /usr/src/RPM/BUILD/clblast-1.6.3/test/correctness/testblas.cpp:142
#33 0x00005555555d8327 in clblast::RunTests<clblast::TestXconvgemm<float>, float, float> (argc=argc@entry=1,
    argv=argv@entry=0x7fffffffe3d8, silent=silent@entry=false, name="SCONVGEMM")
    at /usr/src/RPM/BUILD/clblast-1.6.3/test/correctness/testblas.hpp:480
#34 0x00005555555d9293 in main (argc=1, argv=0x7fffffffe3d8)
    at /usr/src/RPM/BUILD/clblast-1.6.3/test/correctness/routines/levelx/xconvgemm.cpp:18

@vt-alt
Copy link
Author

vt-alt commented Nov 21, 2024

Another run in gdb with debuginfo installed:

Thread 1 "clblast_test_xc" received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
44            return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007ffff74b56df in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007ffff7466982 in __GI_raise (sig=6, 
    sig@entry=<error reading variable: Offset from DW_FORM_GNU_str_index or DW_FORM_strx pointing outside of .debug_str.dwo section in CU at offset 0x89018 [in module /usr/lib/debug/usr/lib/llvm-18.1/lib64/libLLVM.so.18.1.debug]>) at ../sysdeps/posix/raise.c:26
(gdb) 

@CNugteren
Copy link
Owner

CNugteren commented Nov 21, 2024

Thanks, that's helpful. Indeed, it does not even seem to start the first kernel. It fails before that, during one of the compilation steps of the kernel. In particular, from your first backtrace, it seems to fail in GetIR() in one of the clGetProgramInfo calls in that function. I'm not sure this is a CLBlast issue, unless clGetProgramInfo is somehow called with arguments that make it crash. However, in the other tests this function is also called, so it seems related to the kernel that is being compiled. But in any case this piece of code hasn't been changed since a long time...

@vt-alt
Copy link
Author

vt-alt commented Nov 21, 2024

Yes and it's worked at 2024-07-08 so it's obscure. Maybe it's related to clang update, I will try to compile with older clang later.

@vt-alt
Copy link
Author

vt-alt commented Nov 22, 2024

I bisected our build process and found that the test failure happened since this change in the build env:

-rpmi: clang17.0-17.0.6-alt4.2
-rpmi: clang17.0-support-17.0.6-alt4.2
-rpmi: libclang-cpp17-17.0.6-alt4.2
+rpmi: libomp18.1-18.1.8-alt0.1
-rpmi: libpocl2-5.0-alt0.2
+rpmi: libpocl2-6.0-alt0.1
-rpmi: llvm17.0-filesystem-17.0.6-alt4.2
-rpmi: llvm17.0-gold-17.0.6-alt4.2
-rpmi: llvm17.0-libs-17.0.6-alt4.2
-rpmi: llvm17.0-polly-17.0.6-alt4.2
-rpmi: opencl-headers-1:2023.12.14-alt1
+rpmi: opencl-headers-1:2024.05.08-alt1
-rpmi: pocl-devel-5.0-alt0.2
-rpmi: pocl-opencl-icd-5.0-alt0.2
+rpmi: pocl-devel-6.0-alt0.1
+rpmi: pocl-opencl-icd-6.0-alt0.1

(Yes llvm17.0 is removed, but llvm18.1 (18.1.8) is present in both envs.)

@vt-alt
Copy link
Author

vt-alt commented Nov 22, 2024

Googled this llvm/llvm-project#92648 and it's inspired me to think that it's possible this is a compiler bug that is happening not even in clblast codebase, but where and how to workaround it...

@vt-alt
Copy link
Author

vt-alt commented Nov 22, 2024

Also this pocl/pocl#1608

@gspr
Copy link
Contributor

gspr commented Dec 4, 2024

A bit late to the party, but I somehow missed that Debian's testsuite started failing at the same test on September 29. Sadly the data retention policy means that the logs for the last passed test (September 28) are gone, so I can't pinpoint exactly what changed for us on the 29th.

We also see a failure with LLVM ERROR: Instruction Combining did not reach a fixpoint after 1 iterations.

This is with POCL 6.0. But POCL 6.0 entered Debian in June, and we definitely have successful tests between that and the first failure.

@CNugteren
Copy link
Owner

CNugteren commented Dec 10, 2024

Thanks @gspr for joining in. I see that @vt-alt was recently active in pocl/pocl#1608 about this matter.

If there is anything I can do, please let me know. If I would be able to reproduce things locally (and had some time), here's what I would do:

  1. Add a printf("CLBlast kernel:\n%s\n", kernel_string.c_str()); in compile.cpp#L117.
  2. Re-compile and run the program, and make note of the OpenCL kernel that is causing the issue.
  3. Write a new stand-alone simple program (outside of CLBlast) with the printed kernel, run it, and make sure that it crashes still.
  4. Start removing parts of the kernel until it no longer crashes.
  5. Report which parts cause the crash, and perhaps we can find a work-around.

Alternatively you can also run the CLBlast GEMM kernel tuner. That will compile a lot of different flavours of the kernel, and perhaps not all of them cause the issue: that why we could also find out which variant(s) cause the crash.

I hope this is helpful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants