Support for newer CUDA drivers? #1129
Comments
You mean this?
Those are external links. They are hosted on
From the README:
Unfortunately, we simply don't have the resources to maintain this, so you will have to customize it yourself (though I don't see why it wouldn't work with the correct version of LibTorch, unless LibTorch is not backward-compatible). We could use some help from the community, though! For example, things like this:
Could I ask you to submit a PR with those fixes? That would help not only us but every other user as well, so we'd be quite grateful.
First of all, thanks for the quick reply. I'd be very happy to prepare a PR. I initially thought the CUDA versions listed in those comments were the only supported ones. I'll experiment and finish my setup. Once my game trains on my GPU with the examples, I will submit a PR. 👍
I have tried a lot: the old recommended PyTorch install links from global_variables.sh, and newer versions with matching local CUDA/cuDNN installations. I still cannot get rid of the following error when executing open_spiel/build/examples/alpha_zero_torch_example. (I reran ./install, rebuilt, and ran all tests; everything succeeded.)
I used the latest PyTorch CUDA link inside global_variables.sh (GPU): CUDA 10.2 https://download.pytorch.org/libtorch/cu102/libtorch-cxx11-abi-shared-with-deps-1.5.1.zip
(venv) (base) caspar@caspar-5801:~/repos/open_spiel/build/examples$ ./alpha_zero_torch_example
Logging directory: /tmp/az
Creating model: /tmp/az/vpnet.pb
Playing game: tic_tac_toe
Loading model from step 0
[W TensorCompare.cpp:519] Warning: where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version of PyTorch. Use a boolean condition instead. (function operator())
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 128 n 1024 k 9 mat1_ld 9 mat2_ld 9 result_ld 128 abcType 0 computeType 68 scaleType 0
Exception raised from gemm_and_bias at ../aten/src/ATen/cuda/CUDABlas.cpp:813 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f363b06e38b in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbf (0x7f363b068f3f in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libc10.so)
frame #2: <unknown function> + 0x31b6135 (0x7f35cfbb6135 in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x31e372d (0x7f35cfbe372d in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x2f51541 (0x7f35cf951541 in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x2f515f0 (0x7f35cf9515f0 in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cuda.so)
frame #6: at::_ops::addmm::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0xab (0x7f361fffdc2b in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x3a6a247 (0x7f3621c6a247 in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x3a6afc2 (0x7f3621c6afc2 in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #9: at::_ops::addmm::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0x1b1 (0x7f3620062b41 in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #10: torch::nn::LinearImpl::forward(at::Tensor const&) + 0xb3 (0x7f36233cd743 in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x413e5b (0x55ede6fc9e5b in ./alpha_zero_torch_example)
frame #12: <unknown function> + 0x4172e0 (0x55ede6fcd2e0 in ./alpha_zero_torch_example)
frame #13: <unknown function> + 0x417bb1 (0x55ede6fcdbb1 in ./alpha_zero_torch_example)
frame #14: <unknown function> + 0x42b00c (0x55ede6fe100c in ./alpha_zero_torch_example)
frame #15: <unknown function> + 0x4022b8 (0x55ede6fb82b8 in ./alpha_zero_torch_example)
frame #16: <unknown function> + 0x4067bc (0x55ede6fbc7bc in ./alpha_zero_torch_example)
frame #17: <unknown function> + 0x8bf60 (0x55ede6c41f60 in ./alpha_zero_torch_example)
frame #18: <unknown function> + 0x29d90 (0x7f35c5829d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: __libc_start_main + 0x80 (0x7f35c5829e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: <unknown function> + 0x8af25 (0x55ede6c40f25 in ./alpha_zero_torch_example)
Aborted (core dumped)
Is this problem known? The only thing I could find when researching this error is that it might be caused by a CUDA out-of-memory condition, but I only changed the device from CPU to GPU for training, and my PC is quite powerful: an RTX 3070 Ti and 16 GB of RAM, which was enough to run the examples on my CPU, so this can't be the issue.
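When comparing traces like the one above across different LibTorch/CUDA combinations, it can help to reduce each trace to the failing cuBLAS call and the libraries that appear in the frames. The following is a minimal sketch, assuming the trace has the same line shape as the log above; `summarize_trace` and `SAMPLE` are names made up for illustration, and `SAMPLE` is an abbreviated excerpt of that log.

```python
import re
from collections import Counter

# Matches the module named at the end of a frame line,
# e.g. "... (0x7f35cfbb6135 in /path/to/libtorch_cuda.so)".
FRAME_RE = re.compile(r"^frame #\d+: .* in (?P<module>\S+)\)$")
# Matches the CUDA status code and the call that reported it.
ERROR_RE = re.compile(r"CUDA error: (?P<status>\w+) when calling (?P<call>\w+)")


def summarize_trace(log_text):
    """Return (status, call, Counter of modules) for a c10::Error trace."""
    status = call = None
    modules = Counter()
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        error = ERROR_RE.search(line)
        if error:
            status, call = error.group("status"), error.group("call")
        frame = FRAME_RE.match(line)
        if frame:
            modules[frame.group("module")] += 1
    return status, call, modules


# Abbreviated excerpt of the trace posted in this thread.
SAMPLE = """
what(): CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0
frame #2: <unknown function> + 0x31b6135 (0x7f35cfbb6135 in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x31e372d (0x7f35cfbe372d in /home/caspar/repos/open_spiel/open_spiel/libtorch/libtorch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x413e5b (0x55ede6fc9e5b in ./alpha_zero_torch_example)
"""

if __name__ == "__main__":
    status, call, modules = summarize_trace(SAMPLE)
    print(status, call)
    for module, count in modules.most_common():
        print(count, module)
```

The module paths are the useful part here: they show which LibTorch installation the binary actually loaded at run time, which is worth confirming after swapping download links.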
Is there a way that @CasparQuast can post the code here in its current state, maybe as a branch? Then others could try out what he has and collaborate to move the ball forward. It would be really great if we could eventually get things working on the latest versions of the entire Torch/CUDA/etc. stack. I could try to get it going on Windows, which may offer some additional debugging capabilities. But I recently had to blow away my dev environment, and sadly real life has prevented me from rebuilding it just yet. :( Speaking of debugging, do you have NVIDIA's full-stack profiling/debugging tooling installed? I am uncertain what they expose on Linux, but I would actually expect it to be more robust than what is offered on Windows, at least when it comes to CUDA development.
@TheSQLGuru I agree, that would be great. It will require some community coordination. @CasparQuast, are you willing to post your code somewhere in a fork or pull request (which creates a branch)? (Apologies for the late reply!)
What I currently have is a CUDA 11.3 version: not the newest, but newer than the latest version in the project documentation. I didn't manage to get the newest CUDA to run with the Torch version. Next week my university project will be over, and then I'll create a pull request for my game implementation and updated config files. Basically, the only thing I changed is the link inside the install script, so that it matches the Torch GPU download link for CUDA 11.3 (https://download.pytorch.org/libtorch/cu113/libtorch-cxx11-abi-shared-with-deps-1.10.1%2Bcu113.zip), and I followed the already posted workarounds to build successfully (#966).
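The two download links quoted in this thread differ only in the CUDA tag, the Torch version, and whether the file name carries a URL-encoded `+cuXXX` suffix. A minimal sketch of that pattern follows; `libtorch_url` and its `tagged_filename` parameter are hypothetical names, and the function only mirrors the shape of those two links. It does not check that any other version/tag combination actually exists on the server.

```python
from urllib.parse import quote

BASE_URL = "https://download.pytorch.org/libtorch"


def libtorch_url(torch_version, cuda_tag, tagged_filename=True):
    """Build a LibTorch (cxx11 ABI, shared, with deps) download URL.

    tagged_filename=True appends "+<cuda_tag>" to the version in the file
    name (as in the cu113 link); False leaves the bare version (as in the
    cu102 link). This flag is an assumption drawn from those two links only.
    """
    local_version = f"{torch_version}+{cuda_tag}" if tagged_filename else torch_version
    # quote() percent-encodes "+" as "%2B", matching the cu113 link.
    filename = f"libtorch-cxx11-abi-shared-with-deps-{quote(local_version)}.zip"
    return f"{BASE_URL}/{cuda_tag}/{filename}"


if __name__ == "__main__":
    print(libtorch_url("1.10.1", "cu113"))
    print(libtorch_url("1.5.1", "cu102", tagged_filename=False))
```

The resulting string is what would replace the link in the install script, as described above.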
Hello,
I'm working on a game implementation and want to train an agent for the game with the AlphaZero approach.
I managed to compile and run the tests with the workaround described in #966.
I also had to remove the dqn_torch examples and every reference to them in the CMake files to run all tests successfully. Some files from those examples try to import a game for the test suite that no longer exists in this repo.
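For anyone repeating that cleanup, a minimal sketch for locating the references first is shown below. It assumes the references live in files named `CMakeLists.txt` and contain the literal string `dqn_torch`; `find_references` is a name made up for illustration, and the script only lists matches without editing anything.

```python
from pathlib import Path


def find_references(root, needle="dqn_torch"):
    """Return (path, line_number, line) for each CMakeLists.txt line containing needle."""
    matches = []
    for cmake_file in sorted(Path(root).rglob("CMakeLists.txt")):
        text = cmake_file.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                matches.append((cmake_file, number, line.strip()))
    return matches
```

Calling `find_references("open_spiel")` from the repository root and printing each tuple gives the list of lines to review by hand.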
My question now is whether I can use newer CUDA drivers, for example 12.2 with the latest cuDNN. In global_variables.sh I could only see options for 10.2 and lower, which is very old. Has anybody tested newer CUDA drivers?
According to the NVIDIA CUDA Toolkit download archive, there isn't even a CUDA 10.2 release that supports Ubuntu 22 (the recommended OS for this project), only 18 and lower.
https://developer.nvidia.com/cuda-10.2-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu
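One check that bears on the version question above is whether a given CUDA toolkit release can target the GPU architecture at all. The sketch below assumes the RTX 3070 Ti reports compute capability 8.6 (Ampere) and uses a small, abbreviated table of minimum toolkit releases per compute capability; treat the table as an assumption to verify against NVIDIA's release notes. `MIN_CUDA_FOR_CAPABILITY` and `toolkit_can_target` are hypothetical names. The check says nothing about whether a particular LibTorch build or OpenSpiel itself works with that release.

```python
# Assumed minimum CUDA toolkit release able to target each compute capability.
# Abbreviated; verify against NVIDIA's CUDA release notes before relying on it.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (data center)
    (8, 6): (11, 1),  # Ampere (RTX 30 series, e.g. RTX 3070 Ti)
    (8, 9): (11, 8),  # Ada Lovelace
}


def parse_version(text):
    """Turn "11.3" or "11.3.1" into the tuple (11, 3)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def toolkit_can_target(capability, cuda_version):
    """Return True if the CUDA release is at least the assumed minimum for the GPU."""
    required = MIN_CUDA_FOR_CAPABILITY[parse_version(capability)]
    return parse_version(cuda_version) >= required


if __name__ == "__main__":
    for release in ("10.2", "11.3", "12.2"):
        print(release, toolkit_can_target("8.6", release))
```

Under these assumptions, CUDA 10.2 predates the architecture of the card in question, while 11.3 and 12.2 do not, which is consistent with the 11.3 setup reported in the comments.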
If this whole approach is only there to be viewed by outsiders and is not really maintained, is there a better recommended alternative to using ./alpha_zero_torch_example? My goal is just to use my RTX 3070 Ti with CUDA/cuDNN to accelerate training, which seemed straightforward at first sight but turns out to be quite hard to even set up. Has anyone experimented with this recently, or can anyone help me with some tips? Thanks in advance 👍