Understanding Gradient Descent in Deep Learning

1. Mathematical Definition

  • Derivative at a point (the derivative coefficient): the numerical value representing the rate of change of a function at a specific point, i.e., the slope of the tangent line to the graph at that point.
  • In Deep Learning, calculating this value for the loss with respect to each parameter is a prerequisite for running Gradient Descent (a small numerical sketch follows this list).
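
As a concrete illustration, the derivative at a point can be approximated with a central finite difference. This is only a minimal sketch: the function \(f(x) = x^2\), the step size `h`, and the helper name `derivative_at` are arbitrary choices for this example, not part of any library.

```python
# Minimal sketch: estimating the derivative of f at a single point
# with a central finite difference. f and h are illustrative choices.

def f(x):
    return x ** 2  # example function; its exact derivative is 2x

def derivative_at(func, x, h=1e-5):
    """Approximate the slope of the tangent line of `func` at `x`."""
    return (func(x + h) - func(x - h)) / (2 * h)

print(derivative_at(f, 3.0))  # ~6.0, matching the exact derivative 2 * 3
```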

2. Derivative = 0 vs. Gradient Descent

Why do we use Gradient Descent instead of simply finding the point where the derivative is zero?

  • Mathematical Ideal (Closed-form Solution): For a simple function such as a quadratic, we can solve \(f'(x) = 0\) algebraically and find the minimum in one step (compare the sketch after this list).
  • The Reality of Deep Learning:
    • No Closed-form: Most loss functions in DL are non-convex, and the equation \(f'(x) = 0\) has no closed-form solution.
    • High Complexity: Loss functions involve millions of parameters (\(W\), \(b\)), so solving the full system of equations for zero simultaneously is computationally intractable.
    • Efficiency: On massive datasets, iterative Gradient Descent is far more computationally efficient than any attempt at a direct solution.
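
A toy comparison of the two approaches, assuming a one-dimensional quadratic loss \(f(w) = (w - 3)^2\): the closed-form route solves \(f'(w) = 0\) directly, while gradient descent reaches the same minimum iteratively. The starting point, learning rate, and iteration count are arbitrary choices for this sketch.

```python
# Toy loss with a known minimum at w = 3: f(w) = (w - 3)^2, so f'(w) = 2(w - 3)

def grad(w):
    return 2.0 * (w - 3.0)

# Closed-form: solve f'(w) = 0  ->  w = 3 (possible only for simple functions)
w_closed_form = 3.0

# Gradient descent: repeat w <- w - lr * f'(w)
w = 0.0    # arbitrary starting point
lr = 0.1   # learning rate (alpha)
for _ in range(100):
    w -= lr * grad(w)

print(w_closed_form, round(w, 4))  # both approaches land near w = 3
```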

3. The Logic of Gradient Descent

The direction of the update depends on the sign of the derivative. The update rule is as follows (a minimal numerical sketch appears after the list below):

\[W_{new} = W_{old} - \alpha \cdot \frac{\partial Loss}{\partial W}\]
  • If the derivative is positive (+):
    • The loss is increasing as \(W\) increases.
    • Action: Subtracting the positive gradient decreases \(W\), moving it in the opposite direction, toward the minimum.
  • If the derivative is negative (-):
    • The loss is decreasing as \(W\) increases.
    • Action: Subtracting the negative gradient increases \(W\), continuing in that direction toward the minimum.
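
A minimal sketch of the update rule above, showing that a positive derivative pushes \(W\) down and a negative derivative pushes it up. The numbers and the helper name `update` are made up for illustration.

```python
def update(w_old, grad, lr=0.1):
    """One gradient-descent step: W_new = W_old - alpha * dLoss/dW."""
    return w_old - lr * grad

# Positive derivative: the loss rises as W increases, so W is decreased.
print(update(w_old=2.0, grad=+4.0))  # 1.6  (moved left)

# Negative derivative: the loss falls as W increases, so W is increased.
print(update(w_old=2.0, grad=-4.0))  # 2.4  (moved right)
```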

4. The Essence of Training

Ultimately, training a deep learning model is the continuous process of updating its many weights (\(W\)) using the derivatives of the loss function, repeating the update until the gradient is close to zero, which indicates the model has reached a minimum (in practice, a good local minimum) of the loss.
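
As a rough sketch of this process, the loop below repeats the same update for two parameters until their gradients are near zero. The tiny linear model, the synthetic noise-free data, the fixed learning rate, and the stopping threshold are all illustrative assumptions, not a real training setup.

```python
import numpy as np

# Illustrative data for a one-feature linear model y = W*x + b (true W = 2, b = 1).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0

W, b, lr = 0.0, 0.0, 0.1
for step in range(1000):
    error = W * x + b - y
    # Gradients of the mean-squared-error loss with respect to W and b.
    dW = 2.0 * np.mean(error * x)
    db = 2.0 * np.mean(error)
    W -= lr * dW   # the same update rule, applied to every parameter
    b -= lr * db
    if abs(dW) < 1e-6 and abs(db) < 1e-6:  # stop once the gradient is near zero
        break

print(round(W, 3), round(b, 3))  # approaches the true values 2 and 1
```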

This post is licensed under CC BY 4.0 by the author.