
gradient explosion in TAPT DAPT pretraining #2

Open
Danshi-Li opened this issue Apr 30, 2021 · 2 comments

@Danshi-Li
Hi, I am trying to reproduce the results of AdaptSum and ran into a problem when pretraining the model in the TAPT setting. It worked quite well for the science and debate datasets, where the data size is small. However, when I trained TAPT on the social media domain, the loss exploded.

I run pretraining with the following command:
python ./src/tapt_pretraining.py -path=./dataset/'social media'/TAPT-data/train.source \
    -dm='social media' \
    -visible_gpu=1 \
    -save_interval=1000 \
    -recadam \
    -logging_Euclid_dist

and during training the loss explodes to NaN:
(Epoch 0) LOSS: 2.291335 Euclid dist: 322.301648 13% 1999/15089 [17:55<1:47:14, 2.03it/s]
(Epoch 0) LOSS: 2.246833 Euclid dist: 959.653581 20% 2999/15089 [26:46<1:39:52, 2.01it/s]
(Epoch 0) LOSS: 9.272711 Euclid dist: 1541903563718079518205927655211008.00000 33% 3999/15089 [35:40<1:46:22, 1.74it/s]
(Epoch 0) LOSS: nan Euclid dist: nan 40% 4999/15089 [44:14<1:21:29, 2.16it/s]
(Epoch 0) LOSS: nan Euclid dist: nan 46% 5999/15089 [52:34<1:10:48, 1.80it/s]
(Epoch 0) LOSS: nan Euclid dist: nan 53% 6999/15089 [1:01:15<1:14:45, 1.49it/s]

I tried lowering the learning rate to 0.01 and adjusting the gradient clipping value; this delayed the loss explosion but did not solve the problem. Am I missing something or doing something wrong? What should I do to keep the training under control?
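
For reference, this is the kind of clipping adjustment I mean (a minimal, generic PyTorch sketch, not the repo's actual training loop):

import torch
import torch.nn as nn

# Minimal, generic PyTorch sketch (not the code in tapt_pretraining.py):
# one optimization step with global gradient-norm clipping, which is the
# kind of adjustment I experimented with.
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
# clip the global gradient norm before the optimizer step (the value I tuned)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()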

@TysonYu
Owner

TysonYu commented May 5, 2021

Hi Danshi,

When using RecAdam in TAPT, you can also try different "anneal_t0" and "anneal_k" values, because the RecAdam optimizer is very sensitive to these two parameters. In our experiments, as reported in the paper, "we select the best t0 and k in {500, 600, 700, 800, 900, 1000} and {1e−2, 1e−3, 1e−4, 1e−5, 1e−6}". So the default values of these two parameters may cause the loss explosion problem.
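
To make the sensitivity concrete, here is a rough sketch of the sigmoid annealing coefficient that RecAdam uses, as described in the RecAdam paper (the function name and the settings printed below are only illustrative, not values from this repo):

import math

def recadam_anneal_lambda(step, k, t0):
    # Sigmoid annealing coefficient from the RecAdam paper:
    #   lambda(t) = 1 / (1 + exp(-k * (t - t0)))
    # It balances the task loss against the quadratic pull toward the
    # pretrained weights; a small k and a large t0 keep the task loss
    # down-weighted for longer at the start of training.
    return 1.0 / (1.0 + math.exp(-k * (step - t0)))

# Two settings from the search ranges above ramp up very differently:
for step in (0, 1000, 2000, 4000, 8000):
    print(step,
          round(recadam_anneal_lambda(step, k=1e-2, t0=1000), 4),
          round(recadam_anneal_lambda(step, k=1e-4, t0=500), 4))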

@EngSalem

Hi @TysonYu,
Great work! I have a similar question regarding SDPT pretraining: the gradient explodes very early. I was wondering what the optimal values of t0 and k were in that case?
