
Training doubts #25

Open
Rutvik21 opened this issue Jan 23, 2020 · 6 comments

@Rutvik21

Hello,
I have cloned your GitHub repo and run it in Colab with the configuration file available in the repo, but after some steps the loss explodes (reaches >10^9). What could the problem be, and what configuration did you use when you trained the model?

@vijendra1125
Owner

Hi,
I used the same configuration as given in this repo for training.
The information you are providing is too little to say what could be going wrong. Below is a similar discussion from the tensorflow/models repo (which this repo is based on): tensorflow/models#3868
I hope you find a solution there; otherwise we can continue further on this thread.

@Rutvik21
Author

Okay, will check that and let you know.

Thanks.

@DynamicCodes

I'm facing the same problem, is there any updated solution?

@Rutvik21
Author

Rutvik21 commented May 9, 2020

Actually, I haven't tried it since then, as I was working on other projects. But I will try it again and let you know.

@joelbudu

@Rutvik21 @DynamicCodes @vijendra1125 I realised what the issue is after running into it myself. The gradient explosion happens when the class names in label.pbtxt do not match those in your tfrecord file during training.
The label.pbtxt file provided to your training script should be something like

item {
  id: 1
  name: 'speaker'
}
item {
  id: 2
  name: 'cup'
}

instead of

item {
  id: 1
  name: 'speaker076'
}
item {
  id: 2
  name: 'cup026'
}
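As a quick way to verify this, here is a minimal sanity-check sketch (my own, not part of this repo) that compares the class-name strings stored in a .record file against the names in label.pbtxt. The file paths are placeholders, and the feature key image/object/class/text assumes the records were generated in the standard TF Object Detection API format.

import re
import tensorflow as tf

LABEL_MAP_PATH = "label.pbtxt"   # placeholder path
RECORD_PATH = "train.record"     # placeholder path

# Names declared in the label map (assumes single-quoted names, as in the examples above).
with open(LABEL_MAP_PATH) as f:
    label_map_names = set(re.findall(r"name:\s*'([^']+)'", f.read()))

# Class-name strings actually stored in the tfrecord.
record_names = set()
for raw in tf.data.TFRecordDataset(RECORD_PATH):
    example = tf.train.Example()
    example.ParseFromString(raw.numpy())
    text_feature = example.features.feature["image/object/class/text"]
    record_names.update(name.decode("utf-8") for name in text_feature.bytes_list.value)

print("Only in label map:", label_map_names - record_names)
print("Only in tfrecord :", record_names - label_map_names)

Any name printed by either line is a mismatch that will map to a wrong (or missing) class id during training.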

@parthlathiya2697

parthlathiya2697 commented Dec 1, 2020

Still got the same loss explosion after 25000 steps.
I am training MobileNet V1 from the TensorFlow model zoo to build an object detector that detects only balls 🎾. Using MobileNet's configuration pipeline, I've edited num_classes to 1 and set label_map_path=ball.pbtxt, which has only one item, i.e. 'ball' itself (cross-checked against the .record file too). I also tried reducing batch_size, if that matters, but still get the same issue.
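For reference, a short self-contained sketch like the one below (my own; the file names from the description above are assumptions) double-checks that num_classes in pipeline.config matches the number of items in the label map.

import re

# Placeholder file names taken from the description above; adjust as needed.
with open("pipeline.config") as f:
    config_text = f.read()
with open("ball.pbtxt") as f:
    label_map_text = f.read()

num_classes = int(re.search(r"num_classes:\s*(\d+)", config_text).group(1))
num_items = len(re.findall(r"\bitem\s*\{", label_map_text))

print("num_classes in config:", num_classes)
print("items in label map   :", num_items)
assert num_classes == num_items, "num_classes does not match the label map"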

Edit:
I annotated all images again, generated the xml files, and then converted them to .record files. Now the loss explosion problem does not show up and training goes smoothly.

New issue
With the same .record files, when I started training MobileNet V2, the loss explosion occurred again. I checked pipeline.config again: num_classes=1.

This article gives quite a clear picture, but I can't work out how to fix it.

Stack Overflow answers suggest:

To discuss the potential reasons for this explosion: it could probably be because of a nasty combination of the random initialisation of weights, the learning rate, and possibly the batch of training data that was passed during the iteration.

Without knowing the exact details of the model, you should try a smaller learning rate and probably shuffle your training data well. Hope this somewhat helps.

In the case of deep neural networks, this can occur due to the exploding/vanishing gradient. You may want to either do weight clipping or adjust the weight initialization so that weights are closer to 1, which reduces the chance of explosion.

Also, if your learning rate is big, then such a problem can occur. In that case, you can either lower the learning rate or use learning rate decay.

Could be an exploding gradient, i.e. one very big gradient step makes your model "jump" to some extremely far-away point where it gets a really bad loss, and then it has to "recover" from this slowly. This is a problem especially for RNNs.
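For what it's worth, the two remedies quoted above (a lower, decaying learning rate and gradient clipping) look roughly like this in plain TensorFlow/Keras. This is a generic illustrative sketch on a toy model, not the Object Detection API training loop, and all numbers are placeholders.

import tensorflow as tf

# Lower base learning rate with cosine decay, as suggested above.
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.008, decay_steps=90000)

# clipnorm caps each gradient's norm, which blunts exploding-gradient steps.
optimizer = tf.keras.optimizers.SGD(
    learning_rate=schedule, momentum=0.9, clipnorm=10.0)

# Toy model and data, just to show the optimizer in use.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer=optimizer, loss="mse")
model.fit(tf.random.normal((32, 4)), tf.random.normal((32, 1)), epochs=1, verbose=0)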

Has anyone come to any other conclusions, or does anyone know how to make these changes when training MobileNet V2 from the object detection zoo?

An update on my work further
The only lead I had left for the loss explosion was the gradient itself, so I looked up exploding/vanishing gradients, which led me to change the settings in ssd_mobilenet_v2.config: lowering learning_rate_base to 0.008 from 0.800000011920929 and warmup_learning_rate to 0.0013333 from 0.13333000242710114.

Now my training is quite stable. It fluctuates by decimal points, but that's fine with me. The trade-off is that training now takes longer, if you're okay with that. Stop training when the loss reaches the desired value.
Refer to TensorBoard to visualise the loss drop.

Config that gave the loss explosion (original learning rates)
optimizer {
  momentum_optimizer: {
    learning_rate: {
      cosine_decay_learning_rate {
        learning_rate_base: 0.800000011920929
        total_steps: 90000
        warmup_learning_rate: 0.13333000242710114
        warmup_steps: 1000
      }
    }
    momentum_optimizer_value: 0.9
  }
  use_moving_average: false
}
max_number_of_boxes: 100
unpad_groundtruth_tensors: false
}

Edited config

optimizer {
  momentum_optimizer: {
    learning_rate: {
      cosine_decay_learning_rate {
        learning_rate_base: 0.008
        total_steps: 90000
        warmup_learning_rate: 0.0013333
        warmup_steps: 1000
      }
    }
    momentum_optimizer_value: 0.9
  }
  use_moving_average: false
}
max_number_of_boxes: 100
unpad_groundtruth_tensors: false
}

Good luck 🤩
I've also posted an article about this in depth; you can check it out here.
