
question about training CIFAR10 #19

Open
namedBen opened this issue Oct 12, 2018 · 6 comments


@namedBen

namedBen commented Oct 12, 2018

Hello. When I train CIFAR-10 with the network architecture and initial learning rate given in the paper, training fails: the loss explodes and goes straight to NaN. Your VGG7 reference architecture is "2×(128-C3) + MP2 + 2×(256-C3) + MP2 + 2×(512-C3) + MP2 + 1024-FC + Softmax". I would like to ask you two questions:
1. Before the 1024-FC layer the feature map has size batch × 512 × 4 × 4. How does this 1024-FC layer reduce the 8192 dimensions down to 10?
2. Also, using the BPWNs architecture (2×1024-FC) − 10-SVM with your reference base_lr = 0.1, the training loss is NaN. What could be the reason?
Thanks!

@fengfu-chris
Owner

fengfu-chris commented Oct 14, 2018

@namedBen My analysis:

  1. The input to the 1024-FC layer is the previous layer's 512×4×4 output flattened into a row vector (8192 dimensions), and its output is 1024 nodes. It is followed by another fully connected layer with 1024 inputs and as many outputs as there are classes (10 here), and a Softmax layer is applied on top; see the sketch below.
  2. When the loss explodes, consider lowering the learning rate. On MNIST, for example, start from lr = 0.01.
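A minimal PyTorch sketch of the classifier head described in point 1 (the class name VGG7Head is mine, not from the repo; Softmax is left to the loss function, as is usual with nn.CrossEntropyLoss):

```python
import torch
import torch.nn as nn

class VGG7Head(nn.Module):
    """512x4x4 feature map -> flatten (8192) -> FC 1024 -> FC num_classes."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(512 * 4 * 4, 1024)   # 8192 -> 1024
        self.fc2 = nn.Linear(1024, num_classes)   # 1024 -> 10

    def forward(self, x):
        x = x.view(x.size(0), -1)   # flatten the 512x4x4 map to an 8192-dim vector
        x = self.fc1(x)
        return self.fc2(x)          # logits; Softmax is applied inside the loss

head = VGG7Head()
logits = head(torch.randn(2, 512, 4, 4))   # -> shape (2, 10)
```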

@namedBen
Author

Thank you for the reply. On the first point, that is exactly how I implemented it. On the second point, I am currently training CIFAR-10 with VGG7, but when I train the full precision weight networks with the training strategy from your paper (initial learning rate = 0.1, optimizer = SGD), the loss explodes (already during epoch 1). How exactly did you train the FPWNs to reach 92.88%?

@fengfu-chris
Owner

Could you post your main training parameters and the log output?

@namedBen
Author

namedBen commented Oct 14, 2018

The network model is defined as follows:
(module): VGG(
(layer): Sequential(
(0): BasicBlock(
(conv1): Conv2d (3, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
)
(1): BasicBlock(
(conv1): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
)
(2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
(3): BasicBlock(
(conv1): Conv2d (128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
(4): BasicBlock(
(conv1): Conv2d (256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
(5): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
(6): BasicBlock(
(conv1): Conv2d (256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
)
(7): BasicBlock(
(conv1): Conv2d (512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
)
(8): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
)
(classifier1): Linear(in_features=8192, out_features=1024)
(classifier3): Linear(in_features=1024, out_features=10)
)

The training log is as follows:
Training [epoch:1, iter:1, load:10/50000] LR:1.0e-01 Loss: 2.297 | Acc: 30.000%
Training [epoch:1, iter:2, load:20/50000] LR:1.0e-01 Loss: 88.866 | Acc: 20.000%
Training [epoch:1, iter:3, load:30/50000] LR:1.0e-01 Loss: 205.516 | Acc: 16.667%
Training [epoch:1, iter:4, load:40/50000] LR:1.0e-01 Loss: 8440.584 | Acc: 12.500%
Training [epoch:1, iter:5, load:50/50000] LR:1.0e-01 Loss: 214870.594 | Acc: 10.000%
Training [epoch:1, iter:6, load:60/50000] LR:1.0e-01 Loss: 195956608.000 | Acc: 10.000%
Training [epoch:1, iter:7, load:70/50000] LR:1.0e-01 Loss: 110850178285568.000 | Acc: 8.571%
Training [epoch:1, iter:8, load:80/50000] LR:1.0e-01 Loss: 16265785521874871328440320.000 | Acc: 7.500%
Training [epoch:1, iter:9, load:90/50000] LR:1.0e-01 Loss: nan | Acc: 7.778%
Training [epoch:1, iter:10, load:100/50000] LR:1.0e-01 Loss: nan | Acc: 9.000%
Training [epoch:1, iter:11, load:110/50000] LR:1.0e-01 Loss: nan | Acc: 8.182%
Training [epoch:1, iter:12, load:120/50000] LR:1.0e-01 Loss: nan | Acc: 8.333%
Training [epoch:1, iter:13, load:130/50000] LR:1.0e-01 Loss: nan | Acc: 8.462%

Training hyperparameters:
SGD: base_lr=0.1, momentum=0.9, weight_decay=1e-4
batch_size: 10
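For reference, a minimal sketch of this optimizer configuration in PyTorch (the placeholder model stands in for the VGG7 instance printed above):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(8192, 10)   # placeholder for the VGG7 model above
optimizer = optim.SGD(model.parameters(),
                      lr=0.1,            # base_lr
                      momentum=0.9,
                      weight_decay=1e-4)
# batch_size=10 is passed to the DataLoader, not to the optimizer
```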

@fengfu-chris
Owner

Is this implemented with the PyTorch framework? There are a few differences from the original Caffe repo:

  1. I use a batch size of 100 (vs. 10);
  2. My conv layers have a bias (vs. no bias);
  3. The BatchNorm implementation is different (this is the key point!!!); see the sketch after this list.
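A hedged sketch of how the BasicBlock from the printed PyTorch model above could be adjusted toward points 2 and 3 (the class name comes from the poster's printout; whether this exactly reproduces Caffe's BatchNorm/Scale behaviour is an assumption):

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Conv + BN block adjusted to sit closer to the original Caffe setup."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Point 2: keep the bias term in the convolution
        # (the PyTorch model above used bias=False).
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=True)
        # Point 3: Caffe splits normalization (BatchNorm) and the learnable
        # scale/shift (Scale) into two layers, while nn.BatchNorm2d fuses both;
        # the eps/momentum values below are PyTorch defaults, not checked
        # against the Caffe prototxt.
        self.bn1 = nn.BatchNorm2d(out_channels, eps=1e-5, momentum=0.1,
                                  affine=True)

    def forward(self, x):
        return self.bn1(self.conv1(x))
```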

@namedBen
Author

Yes, it is PyTorch.
1. I have tried the batch size; both 100 and 10 give NaN.
2. I did consider the conv bias you mentioned at first, but then decided the quantization should drop the bias so that the effect of quantizing the weights can be evaluated accurately.
3. That one I really had not noticed, thanks for pointing it out! I will retrain with Caffe, or with a modified BN layer in PyTorch, and see how it goes.
Thanks again for your answers! (finger heart)
