
tom-bleher/ALLS-Multivariable-Momentum-Gradient-Descent-


Project description

This version of the optimization algorithm uses momentum gradient descent. Instead of the usual gradient descent update:

$$W_{n+1}=W_{n}-\gamma \nabla f(W_{n})$$

By adding a 'momentum' term, we arrive at the modified expression:

$$W_{n+1}=W_{n}+\beta(W_{n}-W_{n-1})-\gamma \nabla f(W_{n})$$

where $\gamma$ is the learning rate familiar from the classic gradient descent algorithm, and $\beta$ is the new momentum constant.

In Python, this formula takes the following form:

```python
self.new_focus = self.focus_history[-1] + self.momentum*(self.focus_history[-1] - self.focus_history[-2]) - self.focus_learning_rate*self.count_focus_der[-1]
```
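
For context, here is a minimal standalone sketch of the same update rule on a toy one-dimensional function (the names `momentum_step`, `grad_f`, `gamma`, and `beta` are illustrative choices, not identifiers from this repository):

```python
# Minimal sketch of W_{n+1} = W_n + beta*(W_n - W_{n-1}) - gamma*grad_f(W_n).
# gamma (learning rate) and beta (momentum constant) are illustrative values.
def momentum_step(w_curr, w_prev, grad_f, gamma=0.01, beta=0.9):
    return w_curr + beta * (w_curr - w_prev) - gamma * grad_f(w_curr)

# Toy example: minimize f(w) = w**2, whose gradient is 2*w.
grad_f = lambda w: 2.0 * w
w_prev = w_curr = 5.0                 # start from rest (no initial velocity)
for _ in range(200):
    w_prev, w_curr = w_curr, momentum_step(w_curr, w_prev, grad_f)
print(w_curr)                          # approaches the minimum at w = 0
```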

This upgrade promises to help the optimization:

  • Converge in fewer steps
  • Avoid getting stuck in local extremum points

Faster optimization

As is evident from comparing the graphs of the optimization process, the momentum-accelerated plot (right) reaches the maximum in fewer steps.

[Figure: Momentum Gradient Descent vs. Vanilla Gradient Descent optimization plots]

Unlike classic gradient descent, momentum gradient descent takes less sharp turns. Essentially, where classic gradient descent depends only on the current gradient, momentum gradient descent incorporates an exponentially weighted moving average of past gradients, allowing it to smooth out variations in the optimization path.
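
This can be made explicit by rewriting the update in terms of the velocity $V_{n}=W_{n}-W_{n-1}$. Assuming the iteration starts from rest ($W_{-1}=W_{0}$, so $V_{0}=0$), the recursion unrolls to:

$$V_{n+1}=\beta V_{n}-\gamma \nabla f(W_{n})=-\gamma\sum_{k=0}^{n}\beta^{k}\nabla f(W_{n-k})$$

so each step is an exponentially weighted sum of all past gradients, with older gradients discounted by powers of $\beta$.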


Absolute optimization

In order to test the promise of momentum gradient descent to find the absolute maximum and minimum points, I tested it on the following function:

$$C(f,\phi_{2})=(0.1(f+\phi_{2}))^{2}\cdot \sin(0.01(f+\phi_{2}))$$

When plotted, it takes on the following form:
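
A quick sketch for reproducing this surface is shown below; the grid range and plotting choices are illustrative and not taken from the repository:

```python
import numpy as np
import matplotlib.pyplot as plt

# Test function C(f, phi2) = (0.1*(f + phi2))**2 * sin(0.01*(f + phi2)).
def C(f, phi2):
    u = f + phi2
    return (0.1 * u) ** 2 * np.sin(0.01 * u)

# Illustrative grid; u = f + phi2 spans roughly two periods of the sine term.
f_vals = np.linspace(-600.0, 600.0, 300)
phi2_vals = np.linspace(-600.0, 600.0, 300)
F, P = np.meshgrid(f_vals, phi2_vals)

plt.contourf(F, P, C(F, P), levels=50)
plt.colorbar(label="C(f, phi2)")
plt.xlabel("f")
plt.ylabel("phi2")
plt.title("Test surface for the optimization")
plt.show()
```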

Focusing on the region near zero, I get the following optimization test region, which contains both local and 'absolute' (global) minimum points.

I will initialize the algorithm at the red point seen above. Where classic gradient descent gets stuck in the local minimum, momentum gradient descent should be able to pass the local minimum and the saddle point and reach the deeper 'absolute' minimum.

Now, by running the algorithm as implemented in momentum_main.py, I will verify these assumptions.

Using vanilla gradient descent ($\beta=0$), we arrive at the local minimum point as expected:

Using momentum gradient descent, we arrive at the deeper minimum point:

As shown, the algorithm was able to reach the global minimum. Mission complete? Not exactly. The system is sensitive to the learning rate and momentum constant and will not always arrive at the optimal solution.
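
This sensitivity can be explored with a small one-dimensional sketch along the direction $u=f+\phi_{2}$ (since $C$ depends on $f$ and $\phi_{2}$ only through their sum). The starting point and hyperparameters below are illustrative, not the values used in momentum_main.py:

```python
import math

# 1-D slice of the test function along u = f + phi2, with its analytic derivative.
def C(u):
    return (0.1 * u) ** 2 * math.sin(0.01 * u)

def dC(u):
    return 0.02 * u * math.sin(0.01 * u) + 0.0001 * u ** 2 * math.cos(0.01 * u)

def run(u0, gamma, beta, steps=5000):
    """Momentum gradient descent on C; beta = 0 recovers vanilla gradient descent."""
    u_prev = u_curr = u0
    for _ in range(steps):
        u_prev, u_curr = u_curr, u_curr + beta * (u_curr - u_prev) - gamma * dC(u_curr)
    return u_curr

# Illustrative starting point to the left of the local minimum.
u0 = -350.0
for beta in (0.0, 0.5, 0.9):
    u_final = run(u0, gamma=0.5, beta=beta)
    print(f"beta={beta}: ended at u={u_final:7.1f}, C={C(u_final):9.1f}")
```

Depending on gamma, beta, and the starting point, a run may settle in the local minimum or carry past it, which is exactly the sensitivity noted above.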

About

Multivariable optimization for count using momentum gradient descent
