
Training doubts #25

Open
Rutvik21 opened this issue Jan 23, 2020 · 6 comments

@Rutvik21

Hello,
I have cloned your GitHub repo and run it in Colab with the configuration file available in the repo, but after some steps the loss explodes (reaches >10^9). What could the problem be, and what configuration did you use when you trained the model?

@vijendra1125
Owner

Hi,
I used the same configuration as given in this repo for training.
The information you are providing is too little to say what could be going wrong. Below is a similar discussion from the tensorflow/models repo (which this repo is based on): tensorflow/models#3868
I hope you find a solution there; otherwise we can continue further on this thread.

@Rutvik21
Author

Okay, will check that and let you know.

Thanks.

@DynamicCodes

I'm facing the same problem, is there any updated solution?

@Rutvik21
Author

Rutvik21 commented May 9, 2020

Actually, I haven't tried it since then, as I was working on other projects. But I will try it again and let you know.

@joelbudu

@Rutvik21 @DynamicCodes @vijendra1125 I realised what the issue is after running into it myself. The gradient explosion happens when the class names in label.pbtxt do not match those in your tfrecord file during training.
The label.pbtxt file provided to your training script should be something like

item {
  id: 1
  name: 'speaker'
}
item {
  id: 2
  name: 'cup'
}

instead of

item {
  id: 1
  name: 'speaker076'
}
item {
  id: 2
  name: 'cup026'
}
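As a quick way to verify this, here is a minimal sanity-check sketch (my own, not part of this repo) that compares the class-name strings stored in a .record file against the names in label.pbtxt. The file paths are placeholders, and the feature key image/object/class/text assumes the records were generated in the standard TF Object Detection API format.

import re
import tensorflow as tf

LABEL_MAP_PATH = "label.pbtxt"   # placeholder path
RECORD_PATH = "train.record"     # placeholder path

# Names declared in the label map (assumes single-quoted names, as in the examples above).
with open(LABEL_MAP_PATH) as f:
    label_map_names = set(re.findall(r"name:\s*'([^']+)'", f.read()))

# Class-name strings actually stored in the tfrecord.
record_names = set()
for raw in tf.data.TFRecordDataset(RECORD_PATH):
    example = tf.train.Example()
    example.ParseFromString(raw.numpy())
    text_feature = example.features.feature["image/object/class/text"]
    record_names.update(name.decode("utf-8") for name in text_feature.bytes_list.value)

print("Only in label map:", label_map_names - record_names)
print("Only in tfrecord :", record_names - label_map_names)

Any name printed by either line is a mismatch that will map to a wrong (or missing) class id during training.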

@parthlathiya2697

parthlathiya2697 commented Dec 1, 2020

Still got the same loss explosion after 25000 steps.
I am training MobileNet V1 from the TensorFlow model zoo to build an object detector that detects only balls 🎾. Using MobileNet's configuration pipeline, I've edited num_classes to 1 and set label_map_path=ball.pbtxt, which has only one item, i.e. 'ball' itself (cross-checked against the .record file too). I also tried reducing batch_size, if that matters, but still get the same issue.
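For reference, a short self-contained sketch like the one below (my own; the file names from the description above are assumptions) double-checks that num_classes in pipeline.config matches the number of items in the label map.

import re

# Placeholder file names taken from the description above; adjust as needed.
with open("pipeline.config") as f:
    config_text = f.read()
with open("ball.pbtxt") as f:
    label_map_text = f.read()

num_classes = int(re.search(r"num_classes:\s*(\d+)", config_text).group(1))
num_items = len(re.findall(r"\bitem\s*\{", label_map_text))

print("num_classes in config:", num_classes)
print("items in label map   :", num_items)
assert num_classes == num_items, "num_classes does not match the label map"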

Edit:
I annotated all images again, generated the xml files, and then converted them to .record files. Now the loss explosion problem does not show up and training goes smoothly.

New issue
With the same .record files, when I started training MobileNet V2, the loss explosion occurred again. I checked pipeline.config again: num_classes=1.

This article gives quite a clear picture, but I can't work out how to fix it.

Stack Overflow answers suggest:

To discuss the potential reasons for this explosion: it could probably be because of a nasty combination of the random initialisation of weights, the learning rate, and possibly the batch of training data that was passed during the iteration.

Without knowing the exact details of the model, you should try a smaller learning rate and probably shuffle your training data well. Hope this somewhat helps.

In the case of deep neural networks, this can occur due to the exploding/vanishing gradient. You may want to either do weight clipping or adjust the weight initialization so that weights are closer to 1, which reduces the chance of explosion.

Also, if your learning rate is big, then such a problem can occur. In that case, you can either lower the learning rate or use learning rate decay.

Could be an exploding gradient, i.e. one very big gradient step makes your model "jump" to some extremely far-away point where it gets a really bad loss, and then it has to "recover" from this slowly. This is a problem especially for RNNs.
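For what it's worth, the two remedies quoted above (a lower, decaying learning rate and gradient clipping) look roughly like this in plain TensorFlow/Keras. This is a generic illustrative sketch on a toy model, not the Object Detection API training loop, and all numbers are placeholders.

import tensorflow as tf

# Lower base learning rate with cosine decay, as suggested above.
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.008, decay_steps=90000)

# clipnorm caps each gradient's norm, which blunts exploding-gradient steps.
optimizer = tf.keras.optimizers.SGD(
    learning_rate=schedule, momentum=0.9, clipnorm=10.0)

# Toy model and data, just to show the optimizer in use.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer=optimizer, loss="mse")
model.fit(tf.random.normal((32, 4)), tf.random.normal((32, 1)), epochs=1, verbose=0)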

Has anyone come to any other conclusions, or does anyone know how to make these changes when training MobileNet V2 from the object detection zoo?

An update on my work further
The only lead I had left for the loss explosion was the gradient itself, so I looked up exploding/vanishing gradients, which led me to change the settings in ssd_mobilenet_v2.config: lowering learning_rate_base to 0.008 from 0.800000011920929 and warmup_learning_rate to 0.0013333 from 0.13333000242710114.

Now my training is quite stable. It fluctuates by decimal points, but that's fine with me. The trade-off is that training now takes longer, if you're okay with that. Stop training when the loss reaches the desired value.
Refer to TensorBoard to visualise the loss drop.

Config that gave the loss explosion (original learning rates)
optimizer {
  momentum_optimizer: {
    learning_rate: {
      cosine_decay_learning_rate {
        learning_rate_base: 0.800000011920929
        total_steps: 90000
        warmup_learning_rate: 0.13333000242710114
        warmup_steps: 1000
      }
    }
    momentum_optimizer_value: 0.9
  }
  use_moving_average: false
}
max_number_of_boxes: 100
unpad_groundtruth_tensors: false
}

Edited config

optimizer {
  momentum_optimizer: {
    learning_rate: {
      cosine_decay_learning_rate {
        learning_rate_base: 0.008
        total_steps: 90000
        warmup_learning_rate: 0.0013333
        warmup_steps: 1000
      }
    }
    momentum_optimizer_value: 0.9
  }
  use_moving_average: false
}
max_number_of_boxes: 100
unpad_groundtruth_tensors: false
}

Good luck 🤩
I've also posted an article about this in depth; you can check it out here.
