Error after a few iterations of training #34
Comments
Please change the batch size or num_workers.
I have changed both the batch size and num_workers but still get the same error. I am using a Quadro RTX 8000 (48 GB).
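For reference, a minimal sketch of the advice above (this is not TRACER's actual data pipeline; the `TensorDataset` of random tensors is just a stand-in for the real image/mask dataset): start with a small batch size and `num_workers=0`, and only raise them once training runs without the assert.

```python
# Minimal sketch, not TRACER's actual loader: rebuild the DataLoader with a
# smaller batch size and fewer workers, which is what the advice above amounts to.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the real image/mask dataset (random tensors, purely illustrative).
images = torch.rand(16, 3, 320, 320)
masks = torch.rand(16, 1, 320, 320)
dataset = TensorDataset(images, masks)

# Start conservatively; raise batch_size/num_workers only once training is stable.
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0, pin_memory=True)

for batch_images, batch_masks in loader:
    pass  # feed each batch into the training step here
```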
Did you use a single GPU? If yes, change
It seems there are negative or NaN values in your outputs or labels.
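To act on that suggestion, here is a hedged sketch of a sanity check that could run on the model outputs and ground-truth masks before `loss.backward()`; `check_tensor`, `masks`, and `logits` are illustrative names, not part of TRACER.

```python
# Sketch of a pre-loss sanity check for NaN/Inf/out-of-range values.
import torch

def check_tensor(name, t, lo=0.0, hi=1.0):
    """Raise a descriptive error if `t` contains NaN/Inf or leaves [lo, hi]."""
    if torch.isnan(t).any():
        raise ValueError(f"{name} contains NaN values")
    if torch.isinf(t).any():
        raise ValueError(f"{name} contains Inf values")
    t_min, t_max = t.min().item(), t.max().item()
    if t_min < lo or t_max > hi:
        raise ValueError(f"{name} outside [{lo}, {hi}]: min={t_min:.4f}, max={t_max:.4f}")

# Inside the training loop, before computing the loss (names are illustrative):
# check_tensor("masks", masks)                    # labels should lie in [0, 1]
# check_tensor("outputs", torch.sigmoid(logits))  # if the loss expects probabilities
```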
I want to train the network with --arch 7 on my custom 62k-image dataset, which is similar to DUTS. I am using a 48 GB CUDA GPU and batch size 8. After a few iterations, I get the following error:
Traceback (most recent call last):
File "main.py", line 55, in
main(args)
File "main.py", line 35, in main
Trainer(args, save_path)
File "/root/TRACER/trainer.py", line 58, in init
train_loss, train_mae = self.training(args)
File "/root/TRACER/trainer.py", line 117, in training
loss.backward()
File "/root/TRACER/venv/lib/python3.6/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/root/TRACER/venv/lib/python3.6/site-packages/torch/autograd/init.py", line 156, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
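As the message suggests, rerunning with synchronous kernel launches usually points at the real failing line. One way (a sketch, assuming you can edit the entry point) is to set the variable before torch is imported, e.g. at the very top of main.py, or to prefix the launch command with CUDA_LAUNCH_BLOCKING=1.

```python
# Sketch: force synchronous CUDA launches so the device-side assert is reported
# at the Python line that actually triggered it. Must run before any CUDA work,
# ideally before importing torch (e.g. at the top of main.py).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is set
```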