TypeError: forward() missing 1 required positional argument: 'input' When training on CityScape DataSet #2
Comments
Hi, thanks for the message. Could you tell me a little more about the context in which you are getting these errors? Are they appearing after some modification to the code, or is it the same code as on GitHub? The second error, which appears when validation starts, seems related to a change in the labels. Are you using 0-19 as in Cityscapes, or something different? The previous code used labels.py, but I recently uploaded new custom code (iouEval.py) to evaluate IoU without relying on the Cityscapes scripts. In the new code the default ignore label is 19, so you need to change this if you use different labels. Are you using the new iouEval.py or the previous code?
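For context, the labeling convention described above can be sketched in a few lines (a minimal illustration only; `remap_for_eval` is a hypothetical helper, not a function from the repository):

```python
NUM_CLASSES = 20   # 19 Cityscapes train classes + 1 ignore class
IGNORE_LABEL = 19  # default ignore label in the new iouEval.py

def remap_for_eval(label_id):
    """Map a raw label id to what the evaluator expects.

    Valid train ids are 0-18; anything else (e.g. the common 255
    'unlabeled' marker) is folded into the ignore label 19.
    """
    return label_id if 0 <= label_id < IGNORE_LABEL else IGNORE_LABEL

print(remap_for_eval(5))    # 5  (valid class, unchanged)
print(remap_for_eval(255))  # 19 (folded into the ignore label)
```

If your labels use a different ignore id, the `IGNORE_LABEL` value would need to match it, as the comment above notes.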
@Eromera Thanks for your response. I cloned the latest code from GitHub again and trained it without any other changes except adding some print statements, but I met the same error. The error occurs in:
The same error occurs when I train on my own data.
Hi, I am not able to reproduce this error on my end, so I think it must be related in some way to your version of PyTorch. Which version are you using? I'm using the latest 0.2. Another difference is that you are using CUDA 9 and I am using CUDA 8; I don't think PyTorch fully supports CUDA 9 just yet, so maybe this is causing a bug in PyTorch? The problem seems to be in "parallel_apply", which is related to DataParallel when using the GPU. Also, are you using 1 GPU or multiple? Thanks
Hi,
That is why I use CUDA 9.0. Maybe you are right: there is not full support in PyTorch for CUDA 9 just yet, and this could be causing a bug in PyTorch. @Eromera Thanks!!
@HyuanTan Great, I'm glad the problem was found and solved by downgrading to CUDA 8. According to the latest posts in the PyTorch issues, like this one, compiling with CUDA 9.0 and cuDNN 7 should have recently been fixed and be possible by now, but maybe your problem would only be fixed by applying a workaround mentioned in those posts: installing NCCL (the NVIDIA Collective Communications Library, for multi-GPU). If you did not have NCCL installed, that would be consistent with your error in DataParallel, so if you prefer CUDA 9 you could still try that. Otherwise, sticking to CUDA 8.0 should be fine for some time; you will not gain much unless you have one of the newest GPUs.
@Eromera Thanks. I found that I may not need to install NCCL when using CUDA 8.0, but with CUDA 9.0 I will try your advice. Thanks!!
I got the same issue with CUDA 9.0 and PyTorch 0.4. I realized the reason is that my batch size was not divisible by the number of GPUs I have. After I fixed that, the error disappeared. Hope this is helpful for someone.
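The divisibility issue can be illustrated without PyTorch at all. DataParallel scatters each batch into per-GPU chunks; if the batch produces fewer chunks than there are model replicas, some replica's forward() receives no tensor, which matches the "missing 1 required positional argument: 'input'" message. A simplified sketch of the scatter step (not PyTorch's actual implementation):

```python
def scatter_batch(batch, num_gpus):
    """Split a batch into roughly equal per-GPU chunks, mimicking
    how DataParallel scatters inputs across devices."""
    chunk = -(-len(batch) // num_gpus)  # ceiling division
    return [batch[i:i + chunk] for i in range(0, len(batch), chunk)]

# A batch of 6 on 4 GPUs yields chunks of size 2: only 3 chunks
# are produced, so the 4th replica gets no input at all.
chunks = scatter_batch(list(range(6)), 4)
print(len(chunks))  # 3 -> fewer chunks than replicas
```

This is why a batch size divisible by the GPU count (8, 16, 32, ... on 4 GPUs) avoids the error, as several comments below also observe.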
One simple solution may be to just use one GPU: "CUDA_VISIBLE_DEVICES=0 python main.py"
@ChengshuLi Can you please share your updated version of the code? You use PyTorch 0.4, and some of the functions used here have been deprecated in PyTorch 0.4. Thanks!
I got the same error when feeding a batch smaller than the number of GPUs on my machine.
This did not solve my problem. I had 4 GPUs and a batch size of 64. My PyTorch version is 0.4 and my CUDA version is 9.0. It was still crashing with this error trace:
If you have 4 GPUs on the machine, just change the batch size to 8, 16, 32, etc. (if CUDA memory is enough), not the 6 from the tutorial; 6 is not divisible by 4.
Hi,
I tried this with batch sizes of 64, 32, 16 and even 4. It gave the same error in all these cases. Without data parallelisation it doesn't generate such an error.
Thank you,
Sahil
… On 24-Jan-2019, at 4:24 PM, Edu ***@***.***> wrote:
I'm reopening the issue since people are still having trouble, please @vsahil confirm if last suggestion by @phdsky worked. Thanks!
Hi @vsahil, can you check the shape of your dataloader's output? The problem might come from the dataloader as well. For example, if you have 5 samples and set batch_size to 4, your second batch will consist of only 1 sample instead of 4, and this causes the parallel error as well.
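The last-batch situation described above is easy to check with plain arithmetic (a sketch; in PyTorch the usual remedy is passing drop_last=True to the DataLoader so the short final batch is simply discarded):

```python
def last_batch_size(num_samples, batch_size):
    """Size of the final batch a DataLoader would yield when the
    dataset size is not a multiple of batch_size (without drop_last)."""
    remainder = num_samples % batch_size
    return remainder if remainder else batch_size

# 5 samples with batch_size 4: the second batch holds only 1 sample,
# which cannot be split across 4 GPUs.
print(last_batch_size(5, 4))  # 1
```

So even a "good" batch size can still trigger the error on the very last batch of an epoch if the dataset size leaves an awkward remainder.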
I believe @siqims is right; that was my issue as well. I adjusted my total dataset size to be a multiple of my batch_size and everything seems to work smoothly. It's a quick enough test to try out for yourself.
@vsahil Sorry to say that I tested again and think @siqims's answer is right. Assume you have N samples and the batch size is b: make sure that N / b is divisible by the number of GPUs. If not, the problem occurs.
I met the same problem when training yolov3. Actually the problem is that the remainder of the test-set size divided by the batch size cannot itself be divided by the number of GPUs. For instance, the initial test-set size for yolov3 is 450 and the initial batch size is 16, so the remainder is 2, which cannot be split across 4 GPUs, and the problem appears. So the best way to solve this issue is to change the test-set size: in the above instance, the problem no longer appears when the test-set size is changed to 452.
Hello, thanks for sharing your work.
I want to train on the Cityscapes dataset using /train/main.py, but I often meet errors in the encoder stage during training or validation, like:
Traceback (most recent call last):
  File "main.py", line 538, in <module>
    main(parser.parse_args())
  File "main.py", line 492, in main
    model = train(args, model, True) #Train encoder
  File "main.py", line 251, in train
    outputs = model(inputs, only_encode=enc)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 68, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 78, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 42, in _worker
    output = module(*input, **kwargs)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'input'
I debugged in PyCharm and found that the images and labels were loaded correctly, but at
inputs = Variable(images)
I found an error: "cannot call .data on torch.Tensor". Did I really load the data correctly, or did I make a mistake somewhere else? Besides, NUM_CLASSES = 20 for the Cityscapes dataset, but during training I also met an error in validation:
----- VALIDATING - EPOCH 1 -----
VAL loss: 0.6922 (epoch: 1, step: 0) // Avg time/img: 0.2710 s
ERROR: Unknown label with id 19
So, do the labels range from 0 to 19, or should I use the trainId from labels.py?
I use Ubuntu 16.04, Python 3.6.3 and CUDA 9.0.
Thanks!