NaN #26

Open
cymdhx opened this issue Apr 1, 2021 · 28 comments

Comments

@cymdhx

cymdhx commented Apr 1, 2021

When I use involution in training, I get NaN:
[screenshot]

@d-li14
Owner

d-li14 commented Apr 2, 2021

Please specify the experimental details.

@cymdhx
Author

cymdhx commented Apr 2, 2021

I added it to a YOLO network, replacing the conv layers in the PANet neck with involution. With conv the loss never becomes NaN, but after switching to involution it does.

@cymdhx
Author

cymdhx commented Apr 2, 2021

like this:
[screenshot]

@cymdhx
Author

cymdhx commented Apr 2, 2021

Is there any way to solve this? I tried scaling the loss down, but it didn't help.

@d-li14
Owner

d-li14 commented Apr 2, 2021

You may try gradient clipping, which we also use sometimes when training our detection models; see, for example, https://github.com/d-li14/involution/blob/main/det/configs/involution/retinanet_red50_neck_fpn_1x_coco.py#L8
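For reference, a minimal sketch of gradient clipping in a plain PyTorch training step; the model and the max_norm value here are illustrative assumptions, and the linked MMDetection config achieves the same effect declaratively through its optimizer_config entry.

```python
import torch
import torch.nn as nn

# Illustrative model/optimizer; substitute your own network.
model = nn.Conv2d(16, 16, 3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(2, 16, 32, 32)
loss = model(x).pow(2).mean()

optimizer.zero_grad()
loss.backward()
# Clip the global gradient norm just before the update; max_norm=35.0 is
# an assumed value, not necessarily the one in the linked config.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=35.0)
optimizer.step()
```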

@cymdhx
Author

cymdhx commented Apr 5, 2021

Thank you so much.

@cymdhx
Author

cymdhx commented Apr 5, 2021

[screenshot]
Even after I applied gradient clipping, I still seem to get NaN.

@songyonger

I replaced the conv in the resblocks of the super-resolution model EDSR with involution and used the gradient clipping method, but the loss is still inf.

@cymdhx
Author

cymdhx commented Apr 7, 2021

Have you solved it yet?

@songyonger

Not yet.

@cymdhx
Author

cymdhx commented Apr 7, 2021

Me neither; we could discuss it.

@545088212

Have any of you solved this yet?

@NNPanNPU

My loss on the training set is fine, but on the validation set some batches are NaN.
It's definitely not gradient explosion. I don't know how to locate the problem and debug it.
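Not from the thread, but a common way to localize where NaNs first appear is PyTorch's anomaly detection together with explicit finiteness checks on each validation batch; a minimal sketch (the helper name is made up):

```python
import torch

# Debugging only (slow): error out at the first backward op producing
# NaN/Inf, with a traceback pointing at the responsible forward op.
torch.autograd.set_detect_anomaly(True)

def check_finite(name: str, tensor: torch.Tensor) -> None:
    # Raise as soon as any activation, logit, or loss goes non-finite.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"non-finite values detected in {name}")

# Example: call check_finite("val_loss", loss) inside the validation loop.
```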

@songwaimai

Maybe your dataset isn't clean?

@songwaimai

I also met this problem in a generation task. I replaced the 3x3 conv with involution, and the loss is NaN or inf.

@cymdhx
Author

cymdhx commented Apr 15, 2021

I haven't solved it either, so I'm about to give up on using involution.

@songwaimai

I also tried the gradient clipping method, but the NaN problem was not solved; I will try to find some other methods that may work.

@cymdhx
Author

cymdhx commented Apr 15, 2021

I tried the gradient clipping method too, but it didn't work. If you find any good methods, please share them. Thank you.

@lygsbw

lygsbw commented Apr 15, 2021

#26 (comment)

I also met the same problem when dealing with the pose estimation task.

@songwaimai

OK.

@LJill

LJill commented Apr 27, 2021

[screenshot]
When I replaced the CA module in RCAN with involution, the loss was also very large.

@songyonger

I replaced the standard conv with involution and added BN, and then the loss seems normal. But the final result is worse than the EDSR baseline with BN layers, even though I increased the parameter count of the EDSR-involution model. I have given up for now. You can have a try and we can talk. @LJill
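A minimal sketch of the involution-plus-BN replacement described above, assuming the pure-PyTorch involution module from this repo with a signature like involution(channels, kernel_size, stride); the wrapper name is my own:

```python
import torch.nn as nn
from involution_naive import involution  # this repo's PyTorch implementation


class InvolutionBN(nn.Module):
    """Drop-in replacement for a 3x3 conv: involution followed by BN."""

    def __init__(self, channels: int, kernel_size: int = 7, stride: int = 1):
        super().__init__()
        self.inv = involution(channels, kernel_size, stride)
        self.bn = nn.BatchNorm2d(channels)  # keeps activations in range

    def forward(self, x):
        return self.bn(self.inv(x))
```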

@LJill

LJill commented Apr 27, 2021

Thanks for your reply. I tried your method on EDSR and RCAN; it works, and the loss is normal now. I will run experiments to observe the final result.

@songwaimai

When I replace the conv with involution and add BN, the training loss seems normal, but the validation loss is still NaN. Has this happened to your model?

@whf9527

whf9527 commented May 5, 2021

After I switched to involution, the parameters don't seem to optimize: the training loss keeps decreasing, but the validation loss stays at one value. Does anyone know whether this is overfitting or a code error?
I don't think it's overfitting, because the training loss decreases while the validation loss barely changes. I still haven't solved this.

@whf9527

whf9527 commented May 6, 2021

What causes this problem? I met this issue too: the training loss improves, but the validation loss is unchanged.

@ChristophReich1996

I implemented a pure PyTorch 2D involution and faced a similar issue of NaNs occurring during training when using involution as a plug-in replacement for convolutions. In my case this was caused by exploding activations. The issue was solved by using a higher momentum (0.3) in the batch-normalization layer (after the reduction). I guess the distribution of the activations changes so much that batch norm, with track_running_stats=True and momentum=0.1, cannot follow the shifting distribution, resulting in exploding activations. This was my conclusion after looking at the PyTorch batch-norm implementation, which also uses the running stats for normalization during training (correct me if I'm wrong).
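Concretely, the fix above only changes the momentum of the BatchNorm inside the kernel-generation branch; a minimal sketch of such a branch (channel counts, reduction ratio, and layer names are assumptions, not this repo's exact code):

```python
import torch.nn as nn

channels, reduction, kernel_size, groups = 64, 4, 7, 4

# reduce -> BN -> ReLU -> span, as in a typical involution kernel branch.
# momentum=0.3 (vs. the 0.1 default) lets the running stats track the
# quickly shifting activation distribution, per the comment above.
kernel_branch = nn.Sequential(
    nn.Conv2d(channels, channels // reduction, kernel_size=1),
    nn.BatchNorm2d(channels // reduction, momentum=0.3, track_running_stats=True),
    nn.ReLU(inplace=True),
    nn.Conv2d(channels // reduction, kernel_size * kernel_size * groups, kernel_size=1),
)
```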

@weiguangzhao

weiguangzhao commented May 24, 2021

@cymdhx @songwaimai @whf9527

I solved the NaN problem I encountered; here is my solution, though I don't know whether it applies to your cases:
Problem description: when Unet + resnet was changed to Unet + rednet50, NaN and inf appeared.
Solution: remove the following code from the program; do not initialize the BatchNorm weight and bias manually.

```python
def set_bn_init(m):
    classname = m.__class__.__name__
    if classname.find('BatchNorm') != -1:
        m.weight.data.fill_(1.0)
        m.bias.data.fill_(0.0)
```
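For context, an init hook like this is normally applied over the whole network with Module.apply; dropping that call (the fix above) leaves BatchNorm at PyTorch's default initialization. A minimal sketch of the pattern being removed, with a placeholder model:

```python
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16))  # placeholder

# This is the call the fix removes: it forces every BatchNorm's affine
# parameters via the set_bn_init hook defined above.
model.apply(set_bn_init)
```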
