TypeError: forward() missing 1 required positional argument: 'input' When training on CityScape DataSet #2
Comments
Hi, thanks for the message. Could you tell me a little more about the context in which you are getting these errors? Are they appearing after some modification to the code, or is it the same code as on GitHub? The second error, which appears when validation starts, seems related to a change in the labels. Are you using 0-19 as in Cityscapes, or something different? The previous code used labels.py, but I recently uploaded new custom code (iouEval.py) to evaluate IoU without relying on the Cityscapes scripts. In the new code the default ignore label is 19, so you need to change this if you use different labels. Are you using the new iouEval.py or the previous code?
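For context, the labeling convention described above can be sketched in a few lines (a minimal illustration only; `remap_for_eval` is a hypothetical helper, not a function from the repository):

```python
NUM_CLASSES = 20   # 19 Cityscapes train classes + 1 ignore class
IGNORE_LABEL = 19  # default ignore label in the new iouEval.py

def remap_for_eval(label_id):
    """Map a raw label id to what the evaluator expects.

    Valid train ids are 0-18; anything else (e.g. the common 255
    'unlabeled' marker) is folded into the ignore label 19.
    """
    return label_id if 0 <= label_id < IGNORE_LABEL else IGNORE_LABEL

print(remap_for_eval(5))    # 5  (valid class, unchanged)
print(remap_for_eval(255))  # 19 (folded into the ignore label)
```

If your labels use a different ignore id, the `IGNORE_LABEL` value would need to match it, as the comment above notes.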
@Eromera Thanks for your response. I cloned the latest code from GitHub again and trained it without any other changes except adding some print statements, but I met the same error. The error occurs in:
The same error occurs when I train on my own data.
Hi, I am not able to reproduce this error on my end, so I think it must be related in some way to your version of PyTorch. Which version are you using? I'm using the latest 0.2. Another difference is that you are using CUDA 9 and I am using CUDA 8; I don't think PyTorch fully supports CUDA 9 just yet, so maybe this is causing a bug in PyTorch? The problem seems to be in "parallel_apply", which is related to DataParallel when using the GPU. Also, are you using 1 GPU or multiple? Thanks
Hi,
That is why I use CUDA 9.0. Maybe you are right: there is not full support in PyTorch for CUDA 9 just yet, and this could be causing a bug in PyTorch. @Eromera Thanks!!
@HyuanTan Great, I'm glad the problem was found and solved by downgrading to CUDA 8. According to the latest posts in the PyTorch issues, like this one, compiling with CUDA 9.0 and cuDNN 7 should have recently been fixed and be possible by now, but maybe your problem would only be fixed by applying a workaround mentioned in those posts: installing NCCL (the NVIDIA Collective Communications Library, for multi-GPU). If you did not have NCCL installed, that would be consistent with your error in DataParallel, so if you prefer CUDA 9 you could still try that. Otherwise, sticking to CUDA 8.0 should be fine for some time; you will not gain much unless you have one of the newest GPUs.
@Eromera Thanks. I found that I may not need to install NCCL when using CUDA 8.0, but with CUDA 9.0 I will try your advice. Thanks!!
I got the same issue with CUDA 9.0 and PyTorch 0.4. I realized the reason is that my batch size was not divisible by the number of GPUs I have. After I fixed that, the error disappeared. Hope this is helpful for someone.
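The divisibility issue can be illustrated without PyTorch at all. DataParallel scatters each batch into per-GPU chunks; if the batch produces fewer chunks than there are model replicas, some replica's forward() receives no tensor, which matches the "missing 1 required positional argument: 'input'" message. A simplified sketch of the scatter step (not PyTorch's actual implementation):

```python
def scatter_batch(batch, num_gpus):
    """Split a batch into roughly equal per-GPU chunks, mimicking
    how DataParallel scatters inputs across devices."""
    chunk = -(-len(batch) // num_gpus)  # ceiling division
    return [batch[i:i + chunk] for i in range(0, len(batch), chunk)]

# A batch of 6 on 4 GPUs yields chunks of size 2: only 3 chunks
# are produced, so the 4th replica gets no input at all.
chunks = scatter_batch(list(range(6)), 4)
print(len(chunks))  # 3 -> fewer chunks than replicas
```

This is why a batch size divisible by the GPU count (8, 16, 32, ... on 4 GPUs) avoids the error, as several comments below also observe.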
One simple solution may be to just use one GPU: "CUDA_VISIBLE_DEVICES=0 python main.py"
@ChengshuLi Can you please share your updated version of the code? You use PyTorch 0.4, and some of the functions used here have been deprecated in PyTorch 0.4. Thanks!
I got the same error when feeding a batch smaller than the number of GPUs on my machine.
This did not solve my problem. I had 4 GPUs and a batch size of 64. My PyTorch version is 0.4 and my CUDA version is 9.0. It was still crashing with this error trace:
If you have 4 GPUs on the machine, just change the batch size to 8, 16, 32, etc. (if CUDA memory is enough), not the 6 from the tutorial; 6 is not divisible by 4.
Hi,
I tried this with batch sizes of 64, 32, 16 and even 4. It gave the same error in all these cases. Without data parallelisation it doesn't generate such an error.
Thank you,
Sahil
… On 24-Jan-2019, at 4:24 PM, Edu ***@***.***> wrote:
I'm reopening the issue since people are still having trouble, please @vsahil confirm if last suggestion by @phdsky worked. Thanks!
Hi @vsahil, can you check the shape of your dataloader's output? The problem might come from the dataloader as well. For example, if you have 5 samples and set batch_size to 4, your second batch will consist of only 1 sample instead of 4, and this causes the parallel error as well.
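The last-batch situation described above is easy to check with plain arithmetic (a sketch; in PyTorch the usual remedy is passing drop_last=True to the DataLoader so the short final batch is simply discarded):

```python
def last_batch_size(num_samples, batch_size):
    """Size of the final batch a DataLoader would yield when the
    dataset size is not a multiple of batch_size (without drop_last)."""
    remainder = num_samples % batch_size
    return remainder if remainder else batch_size

# 5 samples with batch_size 4: the second batch holds only 1 sample,
# which cannot be split across 4 GPUs.
print(last_batch_size(5, 4))  # 1
```

So even a "good" batch size can still trigger the error on the very last batch of an epoch if the dataset size leaves an awkward remainder.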
I believe @siqims is right; that was my issue as well. I adjusted my total dataset size to be a multiple of my batch_size and everything seems to work smoothly. It's a quick enough test to try out for yourself.
@vsahil Sorry to say that I tested again and think @siqims's answer is right. Assume you have N samples and the batch size is b: make sure that N / b is divisible by the number of GPUs. If not, the problem occurs.
I met the same problem when training yolov3. Actually the problem is that the remainder of the test-set size divided by the batch size cannot itself be divided by the number of GPUs. For instance, the initial test-set size for yolov3 is 450 and the initial batch size is 16, so the remainder is 2, which cannot be split across 4 GPUs, and the problem appears. So the best way to solve this issue is to change the test-set size: in the above instance, the problem no longer appears when the test-set size is changed to 452.
Hello, thanks for sharing your work.
I want to train on the Cityscapes dataset using /train/main.py, but I often meet errors in the encoder stage during training or validation, like:
Traceback (most recent call last):
  File "main.py", line 538, in <module>
    main(parser.parse_args())
  File "main.py", line 492, in main
    model = train(args, model, True) #Train encoder
  File "main.py", line 251, in train
    outputs = model(inputs, only_encode=enc)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 68, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 78, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 42, in _worker
    output = module(*input, **kwargs)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'input'
I debugged in PyCharm and found that the images and labels were loaded correctly, but at
inputs = Variable(images)
I found an error: "cannot call .data on torch.Tensor". Did I really load the data correctly, or did I make a mistake somewhere else? Besides, NUM_CLASSES = 20 for the Cityscapes dataset, but during training I also met an error in validation:
----- VALIDATING - EPOCH 1 -----
VAL loss: 0.6922 (epoch: 1, step: 0) // Avg time/img: 0.2710 s
ERROR: Unknown label with id 19
So, do the labels range from 0 to 19, or should I use the trainId from labels.py?
I use Ubuntu 16.04, Python 3.6.3 and CUDA 9.0.
Thanks!