Convergence issues of gradient descent


  • The divergence function minimized (e.g. softmax with cross-entropy) is only a proxy for classification error
  • Minimizing the divergence may not minimize classification error
    • It may fail to separate the points even when they are linearly separable
    • This is because the separating solution is not a feasible optimum for the loss function
  • Compare to the perceptron
    • The perceptron rule has low bias (it makes no errors if a separator exists)
      • But high variance (it swings wildly in response to small changes in the input)
    • Backprop is minimally changed by new training instances
      • It prefers consistency over perfection (which is good)


Univariate inputs

  • For quadratic surfaces

$$\text{Minimize } E=\frac{1}{2} a w^{2}+b w+c$$

$$w^{(k+1)}=w^{(k)}-\eta \frac{d E\left(w^{(k)}\right)}{d w}$$

  • Gradient descent with fixed step size $\eta$ to estimate the scalar parameter $w$
  • Using Taylor expansion

$$E(w) \approx E\left(w^{(k)}\right)+E^{\prime}\left(w^{(k)}\right)\left(w-w^{(k)}\right)+\frac{1}{2} E^{\prime\prime}\left(w^{(k)}\right)\left(w-w^{(k)}\right)^{2}$$

  • So we can get the optimal step size $\eta_{opt} = E^{\prime\prime}(w^{(k)})^{-1}$
    • For $\eta < \eta_{opt}$ the algorithm converges monotonically
    • For $\eta_{opt} < \eta < 2\eta_{opt}$, we have oscillating convergence
    • For $\eta > 2\eta_{opt}$, we get divergence
  • For generic differentiable convex objectives
    • We can also use a Taylor expansion to estimate the local curvature
    • This gives Newton's method, with step size

$$\eta_{opt}=\left(\frac{d^{2} E\left(w^{(k)}\right)}{d w^{2}}\right)^{-1}$$
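The three step-size regimes above can be checked numerically. A minimal sketch in Python, using made-up illustrative values $a=2$, $b=-4$ (so the minimum is at $w^*=2$ and $\eta_{opt}=1/a$):

```python
# Minimize E(w) = (1/2) a w^2 + b w + c by gradient descent.
# dE/dw = a w + b, minimum at w* = -b/a, and eta_opt = 1/E''(w) = 1/a.
a, b = 2.0, -4.0          # illustrative values; the minimum is at w* = 2.0
eta_opt = 1.0 / a

def descend(eta, w0=0.0, steps=50):
    """Run gradient descent on E with fixed step size eta."""
    w = w0
    for _ in range(steps):
        w = w - eta * (a * w + b)
    return w

print(descend(0.5 * eta_opt))   # eta < eta_opt: monotonic convergence to 2.0
print(descend(1.5 * eta_opt))   # eta_opt < eta < 2*eta_opt: oscillating convergence to 2.0
print(descend(2.5 * eta_opt))   # eta > 2*eta_opt: divergence
```

With $\eta = \eta_{opt}$ exactly, a single step lands on the minimum, since the second-order Taylor expansion of a quadratic is exact.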

Multivariate inputs

  • Quadratic convex function

$$E=\frac{1}{2} \mathbf{w}^{T} \mathbf{A} \mathbf{w}+\mathbf{w}^{T} \mathbf{b}+c$$

  • If $\mathbf{A}$ is diagonal

$$E=\frac{1}{2} \sum_{i} a_{i i} w_{i}^{2}+\sum_{i} b_{i} w_{i}+c$$

  • We can then optimize each coordinate independently
    • e.g. $\eta_{1,opt} = a_{11}^{-1}$, $\eta_{2,opt} = a_{22}^{-1}$
    • But the optimal learning rate differs across coordinates
  • If we instead update the entire vector with a single learning rate, convergence requires

$$\eta < 2 \min_i \eta_{i,opt}$$

  • This, however, makes the learning very slow if $\frac{\max_i \eta_{i,opt}}{\min_i \eta_{i,opt}}$ is large
  • Solution: Normalize the objective to have identical eccentricity in all directions
    • Then all of them will have identical optimal learning rates
    • Easier to find a working learning rate
  • Target

$$E=\frac{1}{2} \widehat{\mathbf{w}}^{T} \widehat{\mathbf{w}}+\hat{\mathbf{b}}^{T} \widehat{\mathbf{w}}+c$$

  • So let $\widehat{\mathbf{w}}=\mathbf{S} \mathbf{w}$ with $\mathbf{S} = \mathbf{A}^{0.5}$, i.e. $\widehat{\mathbf{w}} = \mathbf{A}^{0.5} \mathbf{w}$ and $\hat{\mathbf{b}} = \mathbf{A}^{-0.5}\mathbf{b}$
  • Gradient descent rule

$$\widehat{\mathbf{w}}^{(k+1)}=\widehat{\mathbf{w}}^{(k)}-\eta \nabla_{\widehat{\mathbf{w}}} E\left(\widehat{\mathbf{w}}^{(k)}\right)^{T}$$

$$\mathbf{w}^{(k+1)}=\mathbf{w}^{(k)}-\eta \mathbf{A}^{-1} \nabla_{\mathbf{w}} E\left(\mathbf{w}^{(k)}\right)^{T}$$

  • So we just need to calculate $\mathbf{A}^{-1}$, and the optimal step size is then the same (1) in every direction
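As a quick numerical check of the scaled update (a sketch; the matrix $\mathbf{A}$ and vector $\mathbf{b}$ below are made-up illustrative values):

```python
import numpy as np

# E(w) = 0.5 w^T A w + w^T b + c has gradient A w + b and minimizer w* = -A^{-1} b.
A = np.array([[10.0, 2.0],
              [2.0, 1.0]])      # illustrative SPD matrix with high eccentricity
b = np.array([1.0, -1.0])
w_star = -np.linalg.solve(A, b)

# Scaled update with eta = 1: w <- w - A^{-1} (A w + b)
w = np.zeros(2)
w = w - np.linalg.solve(A, A @ w + b)
print(np.allclose(w, w_star))   # True: one step reaches the optimum
```

Using `np.linalg.solve` rather than forming `np.linalg.inv(A)` explicitly is the standard numerically stable way to apply $\mathbf{A}^{-1}$ to a vector.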

  • For generic differentiable multivariate convex functions

    • Also use Taylor expansion

    • $E(\mathbf{w}) \approx E\left(\mathbf{w}^{(k)}\right)+\nabla_{\mathbf{w}} E\left(\mathbf{w}^{(k)}\right)\left(\mathbf{w}-\mathbf{w}^{(k)}\right)+\frac{1}{2}\left(\mathbf{w}-\mathbf{w}^{(k)}\right)^{T} H_{E}\left(\mathbf{w}^{(k)}\right)\left(\mathbf{w}-\mathbf{w}^{(k)}\right)+\cdots$

    • We get the normalized update rule

    • $\mathbf{w}^{(k+1)}=\mathbf{w}^{(k)}-\eta H_{E}\left(\mathbf{w}^{(k)}\right)^{-1} \nabla_{\mathbf{w}} E\left(\mathbf{w}^{(k)}\right)^{T}$

    • Use quadratic approximations to get to the minimum



  • For complex models such as neural networks, with a very large number of parameters, the Hessian is extremely difficult to compute
  • For non-convex functions, the Hessian may not be positive semi-definite, in which case the algorithm can diverge
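For small problems where the Hessian is cheap and positive definite, the Hessian-scaled (Newton) update can be sketched as follows; the objective $E(\mathbf{w}) = \cosh(w_0) + w_1^2$ is a made-up convex example with its minimum at the origin:

```python
import numpy as np

# Newton's update on a generic (non-quadratic) convex function:
# E(w) = cosh(w0) + w1^2, minimized at w = (0, 0).
def grad(w):
    return np.array([np.sinh(w[0]), 2.0 * w[1]])

def hessian(w):
    # Diagonal, positive-definite Hessian for this particular E
    return np.diag([np.cosh(w[0]), 2.0])

w = np.array([1.0, 3.0])
for _ in range(10):
    # w <- w - eta * H^{-1} grad, with eta = 1
    w = w - np.linalg.solve(hessian(w), grad(w))
print(w)   # very close to (0, 0)
```

For a network with millions of parameters this exact Hessian solve is infeasible, which is precisely the limitation noted above.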

Learning rate

  • For complex models such as neural networks, the loss function is often not convex
    • $\eta > 2\eta_{opt}$ can actually help escape local optima
  • However, always having $\eta > 2\eta_{opt}$ ensures that you never actually settle on a solution
  • Solution: use a decaying learning rate
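One common family of decaying schedules (an illustrative choice; the notes only say "decaying learning rate") divides an initial rate by a factor that grows with the step count, so early steps are large enough to escape poor regions and later steps shrink so the iterates settle:

```python
# eta_k = eta_0 / (1 + k / T): large early, small late.
# eta0 and decay_steps (T) are hypothetical hyperparameters for illustration.
def decayed_lr(eta0, k, decay_steps=100):
    return eta0 / (1.0 + k / decay_steps)

print(decayed_lr(0.1, 0))     # 0.1
print(decayed_lr(0.1, 100))   # 0.05
```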
