• Momentum and Nestorov’s method improve convergence by normalizing the mean (first moment) of the derivatives
  • Considering the second moments
    • RMS Prop / Adagrad / AdaDelta / ADAM1
  • Simple gradient and momentum methods still demonstrate oscillatory behavior in some directions2
    • Depends on magic step size parameters (learning rate)
  • Need to dampen step size in directions with high motion
    • Second order term (use variation to smooth it)
    • Scale down updates with large mean squared derivatives
    • scale up updates with small mean squared derivatives

RMS Prop

  • Notion
    • The squared derivative is w2D=(wD)2\partial_{w}^{2} D=\left(\partial_{w} D\right)^{2}
    • The mean squared derivative is E[W2D]E\left[\partial_{W}^{2} D\right]
  • This is a variant on the basic mini-batch SGD algorithm
    • Updates are by parameter

E[w2D]k=γE[w2D]k1+(1γ)(w2D)k E\left[\partial_{w}^{2} D\right]_{k}=\gamma E\left[\partial_{w}^{2} D\right]_{k-1}+(1-\gamma)\left(\partial_{w}^{2} D\right)_{k}

wk+1=wkηE[w2D]k+ϵwD w_{k+1}=w_{k}-\frac{\eta}{\sqrt{E\left[\partial_{w}^{2} D\right]_{k}+\epsilon}} \partial_{w} D

  • If using the same step over a long period, E[w2D]k+ϵwD\sqrt{E\left[\partial_{w}^{2} D\right]_{k}+\epsilon} \approx |\partial_{w} D|
    • So wk+1=wksign(wD)ηw_{k+1}=w_{k}-\text{sign} (\partial_{w} D )\eta
    • Only the sign remain, similar to RProp


  • RMS prop only considers a second-moment normalized version of the current gradient
  • ADAM utilizes a smoothed version of the momentum-augmented gradient
    • Considers both first and second moments

mk=δmk1+(1δ)(wD)k m_{k}=\delta m_{k-1}+(1-\delta)\left(\partial_{w} D\right)_{k}

vk=γvk1+(1γ)(w2D)k v_{k}=\gamma v_{k-1}+(1-\gamma)\left(\partial_{w}^{2} D\right)_{k}

mk^=mk1δk,v^k=vk1γk \hat{m_k}=\frac{m_{k}}{1-\delta^{k}}, \quad \quad \hat{v}_{k}=\frac{v_{k}}{1-\gamma^{k}}

wk+1=wkηv^k+ϵm^k w_{k+1}=w_{k}-\frac{\eta}{\sqrt{\hat{v}_{k}+\epsilon}} \hat{m}_{k}

  • Typically δ1\delta \approx 1, initalize mk1,vk10m_{k-1}, v_{k-1} \approx 0 , so 1δ01- \delta \approx 0, will be very slow to update in the beginning
    • So we need mk^=mk1δk\hat{m_k}=\frac{m_{k}}{1-\delta^{k}} term to scale up in the beginning


  • To make the network converge better, we can consider the following aspects
    • The Divergence
    • Dropout
    • Batch normalization
    • Gradient clipping
    • Data augmentation


  • What shape do we want the divergence function would be?
    • Must be smooth and not have many poor local optima
    • The best type of divergence is steep far from the optimum, but shallow at the optimum
      • But not too shallow(hard to converge to minimum)
  • The choice of divergence affects both the learned network and results

  • Common choices

    • L2 divergence

    Div=12i(yidi)2 D i v=\frac{1}{2} \sum_{i}\left(y_{i}-d_{i}\right)^{2}

    • KL divergence

      Div=idilog(di)idilog(yi) D i v=\sum_{i} d_{i} \log \left(d_{i}\right)-\sum_{i} d_{i} \log \left(y_{i}\right)

  • L2 is particularly appropriate when attempting to perform regression

    • Numeric prediction
    • For L2 divergence the derivative w.r.t. the pre-activation of the output layer is :
      • z12yd2=(yd)Jy(z)\nabla_{z} \frac{1}{2}\|y-d\|^{2}=(y-d) J_{y}(z)
    • We literally “propagate” the error (yd)(y-d) backward
    • Which is why the method is sometimes called “error backpropagation
  • The KL divergence is better when the intent is classification

    • The output is a probability vector

Batch normalization

  • Covariate shifts problem

    • Training assumes the training data are all similarly distributed (So as mini-batch)
    • In practice, each minibatch may have a different distribution
    • Which may occur in each layer of the network
    • Minimize one batch cannot give the correction of other batches
  • Solution

    • Move all batches to have a mean of 0 and unit standard deviation
    • Eliminates covariate shift between batches
  • Batch normalization is a covariate adjustment unit that happens after the weighted addition of inputs (affine combination) but before the application of activation 5

  • Steps

    • Covariate shift to standard position

    ui=ziμBσB2+ϵ u_{i}=\frac{z_{i}-\mu_{B}}{\sqrt{\sigma_{B}^{2}+\epsilon}}

    • Shift to right position

      zi^=γui+β \hat{z_i} = \gamma u_i + \beta


  • The outputs are now functions of μB\mu_B and σB2\sigma_B^2 which are functions of the entire minibatch

Div(MB)=1BtDiv(Yt(Xt,μB,σB2),dt(Xt)) \operatorname{Div}(M B)=\frac{1}{B} \sum_{t} \operatorname{Div}\left(Y_{t}\left(X_{t}, \mu_{B}, \sigma_{B}^{2}\right), d_{t}\left(X_{t}\right)\right)

  • The divergence for each YtY_t depends on all the XtX_t within the mini-batch
    • Is a vector function over the mini-batch
  • Using influence diagram to caculate derivatives3

  • Goal

    • We need to caculate the learnable parameters dDivγ,dDivβ\frac{d D i v}{\gamma}, \frac{d D i v}{\beta}, and the affine combination dDivzi\frac{d D i v}{z_i}

    Divzi=Divuiuizi+DivσB2σB2zi+DivμBμBzi \frac{\partial D i v}{\partial z_{i}}=\frac{\partial D i v}{\partial u_{i}} \cdot \frac{\partial u_{i}}{\partial z_{i}}+\frac{\partial D i v}{\partial \sigma_{B}^{2}} \cdot \frac{\partial \sigma_{B}^{2}}{\partial z_{i}}+\frac{\partial D i v}{\partial \mu_{B}} \cdot \frac{\partial \mu_{B}}{\partial z_{i}}

    • So we need extra Divui,DivσB2,DivμB\frac{\partial D i v}{\partial u_{i}},\frac{\partial D i v}{\partial \sigma_{B}^{2}},\frac{\partial D i v}{\partial \mu_{B}}
  • Preparation

μB=1Bi=1BziσB2=1Bi=1B(ziμB)2 \mu_{B}=\frac{1}{B} \sum_{i=1}^{B} z_{i}\quad \quad \sigma_{B}^{2}=\frac{1}{B} \sum_{i=1}^{B}\left(z_{i}-\mu_{B}\right)^{2}

ui=ziμBσB2+ϵzi^=γui+β u_{i}=\frac{z_{i}-\mu_{B}}{\sqrt{\sigma_{B}^{2}+\epsilon}} \quad \quad \hat{z_i} = \gamma u_i + \beta

  • For the first term Divuiuizi\frac{\partial D i v}{\partial u_{i}} \cdot \frac{\partial u_{i}}{\partial z_{i}}

    • First caculate dDivγ,dDivβ\frac{d D i v}{\gamma}, \frac{d D i v}{\beta} dDivdβ=dDivdz^dDivdγ=udDivdz^ \frac{d D i v}{d \beta}=\frac{d D i v}{d \hat{z}} \quad \quad \frac{d D i v}{d \gamma}=u \frac{d D i v}{d \hat{z}}

    • uizi=1σB2+ϵ\frac{\partial u_{i}}{\partial z_{i}} = \frac{1}{\sqrt{\sigma^2_B +\epsilon}}, so the first term = Divui1σB2+ϵ\frac{\partial D i v}{\partial u_{i}} \cdot \frac{1}{\sqrt{\sigma_{B}^{2}+\epsilon}}

  • For the second term DivσB2σB2zi\frac{\partial D i v}{\partial \sigma_{B}^{2}} \cdot \frac{\partial \sigma_{B}^{2}}{\partial z_{i}}

    • Caculate DivσB2\frac{\partial D i v}{\partial \sigma_{B}^{2}}

    DivσB2=DivuiuiσB2 \frac{\partial Div}{\partial \sigma_{B}^{2}}=\sum \frac{\partial Div}{\partial u_{i}} \frac{\partial u_{i}}{\partial \sigma_{B}^{2}}

    DivσB2=12(σB2+ϵ)3/2i=1BDivui(ziμB) \frac{\partial D i v}{\partial \sigma_{B}^{2}}=\frac{-1}{2}\left(\sigma_{B}^{2}+\epsilon\right)^{-3 / 2} \sum_{i=1}^{B} \frac{\partial D i v}{\partial u_{i}}\left(z_{i}-\mu_{B}\right)

    • And σB2zi\frac{\partial \sigma_{B}^{2}}{\partial z_{i}}

    σB2zi=2(ziμB)B \frac{\partial \sigma_{B}^{2}}{\partial z_{i}}=\frac{2\left(z_{i}-\mu_{B}\right)}{B}

    • So the second term = DivσB22(ziμB)B\frac{\partial D i v}{\partial \sigma_{B}^{2}} \cdot \frac{2\left(z_{i}-\mu_{B}\right)}{B}
  • Finally for the third term DivμBμBzi\frac{\partial D i v}{\partial \mu_{B}} \cdot \frac{\partial \mu_{B}}{\partial z_{i}}

    • Caculate DivμB\frac{\partial D i v}{\partial \mu_{B}}

    DivμB=DivμiμiμB+DivσB2σB2μB \frac{\partial D i v}{\partial \mu_B}=\sum \frac{\partial Div}{\partial \mu_{i}} \frac{\partial \mu_{i}}{\partial \mu_{B}}+\frac{\partial Div}{\partial\sigma_{B}^{2} } \frac{\partial \sigma_{B}^{2}}{\partial \mu_{B}}

    DivμB=(i=1BDivui1σB2+ϵ)+DivσB2i=1B2(ziμB)B \frac{\partial D i v}{\partial \mu_{B}}=\left(\sum_{i=1}^{B} \frac{\partial D i v}{\partial u_{i}} \cdot \frac{-1}{\sqrt{\sigma_{B}^{2}+\epsilon}}\right)+\frac{\partial D i v}{\partial \sigma_{B}^{2}} \cdot \frac{\sum_{i=1}^{B}-2\left(z_{i}-\mu_{B}\right)}{B}

    • The last term is zero, and because μz=1Bzi\mu_z = \frac{1}{B} \sum z_i

    μBzi=1B \frac{\partial \mu_{B}}{\partial z_{i}}=\frac{1}{B}

  • So the third term = DivμB1B\frac{\partial D i v}{\partial \mu_{B}} \cdot \frac{1}{B}
  • Overall

Divzi=Divui1σB2+ϵ+DivσB22(ziμB)B+DivμB1B \frac{\partial D i v}{\partial z_{i}}=\frac{\partial D i v}{\partial u_{i}} \cdot \frac{1}{\sqrt{\sigma_{B}^{2}+\epsilon}}+\frac{\partial D i v}{\partial \sigma_{B}^{2}} \cdot \frac{2\left(z_{i}-\mu_{B}\right)}{B}+\frac{\partial D i v}{\partial \mu_{B}} \cdot \frac{1}{B}

DivσB2=12(σB2+ϵ)3/2i=1BDivui(ziμB) \frac{\partial D i v}{\partial \sigma_{B}^{2}}=\frac{-1}{2}\left(\sigma_{B}^{2}+\epsilon\right)^{-3 / 2} \sum_{i=1}^{B} \frac{\partial D i v}{\partial u_{i}}\left(z_{i}-\mu_{B}\right)

DivμB=1σB2+ϵi=1BDivui \frac{\partial D i v}{\partial \mu_{B}}=\frac{-1}{\sqrt{\sigma_{B}^{2}+\epsilon}}\sum_{i=1}^{B} \frac{\partial D i v}{\partial u_{i}}


  • On test data, BN requires μB\mu_B and σB2\sigma_B^2
  • We will use the average over all training minibatches

μBN=1NbatchesbatμB(batch) \mu_{B N}=\frac{1}{\text {Nbatches}} \sum_{b a t} \mu_{B}(\text {batch})

σBN2=B(B1)NbatchesbatchσB2(batch) \sigma_{B N}^{2}=\frac{B}{(B-1) N b a t c h e s} \sum_{b a t c h} \sigma_{B}^{2}(\text {batch})

  • Note: these are neuron-specific
    • μB(batch),σBbatch\mu_B(batch), \sigma_B{batch} are obtained from the final converged network
    • The 𝐵/(𝐵 − 1) term gives us an unbiased estimator for the variance

What can it do

  • Improves both convergence rate and neural network performance
    • Anecdotal evidence that BN eliminates the need for dropout
  • To get maximum benefit from BN, learning rates must be increased and learning rate decay can be faster
    • Since the data generally remain in the high-gradient regions of the activations
    • e.g. For sigmoid function, move data to the linear part, the gradient is high
  • Also needs better randomization of training data order


  • Smoothness through network structure
    • MLPs are universal approximators
    • For a given number of parameters, deeper networks impose more smoothness than shallow&wide ones
    • Each layer restricts the shape of the function
  • Smoothness through weight constrain


  • The "desired” output is generally smooth
    • Capture statistical or average trends
  • Overfitting
    • But an unconstrained model will model individual instances instead
    • Why overfitting?4
  • Using a sigmoid activation, as w|w| increases, the response becomes steeper

  • Constraining the weights to be low will force slower perceptrons and smoother output response
  • Regularized training: minimize the loss while also minimizing the weights

L(W1,W2,,WK)=Loss(W1,W2,,WK)+12λkWk22 L\left(W_{1}, W_{2}, \ldots, W_{K}\right)=\operatorname{Loss}\left(W_{1}, W_{2}, \ldots, W_{K}\right)+\frac{1}{2} \lambda \sum_{k}\left\|W_{k}\right\|_{2}^{2}

  • λ\lambda is the regularization parameter whose value depends on how important it is for us to want to minimize the weights
  • Increasing assigns greater importance to shrinking the weights
    • Make greater error on training data, to obtain a more acceptable network


  • “Dropout” is a stochastic data/model erasure method that sometimes forces the network to learn more robust models
  • Bagging method
    • Using ensemble classifiers to improve prediction
  • Dropout

    • For each input, at each iteration, “turn off” each neuron with a probability 1α1-\alpha
    • Also turn off inputs similarly
  • Backpropagation is effectively performed only over the remaining network

    • The effective network is different for different inputs
    • Effectively learns a network that averages over all possible networks (Bagging)
  • Dropout as a mechanism to increase pattern density

    • Dropout forces the neurons to learn “rich” and redundant patterns
    • E.g. without dropout, a noncompressive layer may just “clone” its input to its output
    • Transferring the task of learning to the rest of the network upstream


  • The expected output of the neuron is yi(k)=ασ(jwji(k)yj(k1)+bi(k)) y_{i}^{(k)}=\alpha \sigma\left(\sum_{j} w_{j i}^{(k)} y_{j}^{(k-1)}+b_{i}^{(k)}\right)

  • During test, push the a to all outgoing weights

zi(k)=jwji(k)yj(k1)+bi(k)=jwji(k)ασ(zj(k1))+bi(k)=j(αwji(k))σ(zj(k1))+bi(k) \begin{aligned} z_{i}^{(k)} &=\sum_{j} w_{j i}^{(k)} y_{j}^{(k-1)}+b_{i}^{(k)} \\\\ &=\sum_{j} w_{j i}^{(k)} \alpha \sigma\left(z_{j}^{(k-1)}\right)+b_{i}^{(k)} \\\\ &=\sum_{j}\left(\alpha w_{j i}^{(k)}\right) \sigma\left(z_{j}^{(k-1)}\right)+b_{i}^{(k)} \end{aligned}

  • So Wtest=αWtrainedW_{test} = \alpha W_{trained}
    • Instead of multiplying every output by all weights by α\alpha, multiply all weight by α\alpha
  • Alternate implementation
    • During training, replace the activation of all neurons in the network by α1σ(.)\alpha ^{-1} \sigma(.)
    • Use σ(.)\sigma(.) as the activation during testing, and not modify the weights

More tricks

  • Obtain training data
    • Use appropriate representation for inputs and outputs
    • Data Augmentation
  • Choose network architecture
    • More neurons need more data
    • Deep is better, but harder to train
  • Choose the appropriate divergence function
    • Choose regularization
  • Choose heuristics
    • batch norm, dropout ...
  • Choose optimization algorithm
    • Adagrad / Adam / SGD
  • Perform a grid search for hyper parameters (learning rate, regularization parameter, …) on held-out data
  • Train
    • Evaluate periodically on validation data, for early stopping if required
1. A good summary of recent optimizers can be seen in here.
2. Animations for optimization algorithms
3. A simple and clear demostration of 2 variables in a single network
4. The perceptrons in the network are individually capable of sharp changes in output
5. Batch normalization in Neural Networks

results matching ""

    No results matching ""