Error after a few iterations of training #34
Comments
Please change the batch size or num_workers.
I have changed both the batch size and num_workers but still get the same error. I am using a Quadro RTX 8000 (48 GB).
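For reference, a minimal sketch of the advice above (this is not TRACER's actual data pipeline; the `TensorDataset` of random tensors is just a stand-in for the real image/mask dataset): start with a small batch size and `num_workers=0`, and only raise them once training runs without the assert.

```python
# Minimal sketch, not TRACER's actual loader: rebuild the DataLoader with a
# smaller batch size and fewer workers, which is what the advice above amounts to.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the real image/mask dataset (random tensors, purely illustrative).
images = torch.rand(16, 3, 320, 320)
masks = torch.rand(16, 1, 320, 320)
dataset = TensorDataset(images, masks)

# Start conservatively; raise batch_size/num_workers only once training is stable.
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0, pin_memory=True)

for batch_images, batch_masks in loader:
    pass  # feed each batch into the training step here
```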
Did you use a single GPU? If yes, change
It seems there are negative or NaN values in your outputs or labels.
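To act on that suggestion, here is a hedged sketch of a sanity check that could run on the model outputs and ground-truth masks before `loss.backward()`; `check_tensor`, `masks`, and `logits` are illustrative names, not part of TRACER.

```python
# Sketch of a pre-loss sanity check for NaN/Inf/out-of-range values.
import torch

def check_tensor(name, t, lo=0.0, hi=1.0):
    """Raise a descriptive error if `t` contains NaN/Inf or leaves [lo, hi]."""
    if torch.isnan(t).any():
        raise ValueError(f"{name} contains NaN values")
    if torch.isinf(t).any():
        raise ValueError(f"{name} contains Inf values")
    t_min, t_max = t.min().item(), t.max().item()
    if t_min < lo or t_max > hi:
        raise ValueError(f"{name} outside [{lo}, {hi}]: min={t_min:.4f}, max={t_max:.4f}")

# Inside the training loop, before computing the loss (names are illustrative):
# check_tensor("masks", masks)                    # labels should lie in [0, 1]
# check_tensor("outputs", torch.sigmoid(logits))  # if the loss expects probabilities
```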
I want to train the network with --arch 7 on my custom 62k-image dataset, which is similar to DUTS. I am using a 48 GB CUDA GPU and batch size 8. After a few iterations, I get the following error:
Traceback (most recent call last):
File "main.py", line 55, in
main(args)
File "main.py", line 35, in main
Trainer(args, save_path)
File "/root/TRACER/trainer.py", line 58, in init
train_loss, train_mae = self.training(args)
File "/root/TRACER/trainer.py", line 117, in training
loss.backward()
File "/root/TRACER/venv/lib/python3.6/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/root/TRACER/venv/lib/python3.6/site-packages/torch/autograd/init.py", line 156, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
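As the message suggests, rerunning with synchronous kernel launches usually points at the real failing line. One way (a sketch, assuming you can edit the entry point) is to set the variable before torch is imported, e.g. at the very top of main.py, or to prefix the launch command with CUDA_LAUNCH_BLOCKING=1.

```python
# Sketch: force synchronous CUDA launches so the device-side assert is reported
# at the Python line that actually triggered it. Must run before any CUDA work,
# ideally before importing torch (e.g. at the top of main.py).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is set
```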