Convolution

  • Each position in $z$ is the result of a convolution over the previous layer's maps

  • Ways of shrinking the maps
    • Stride greater than 1
    • Downsampling (not strictly necessary)
      • Typically performed with strides > 1
  • Pooling
    • Maxpooling
      • Note: keep track of the location of the max (needed during backprop); see the sketch after this list
    • Mean pooling
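
A minimal NumPy sketch of strided max pooling that records the location of each maximum for later use in backprop. The function name `max_pool_forward`, the single-map 2-D layout, and the use of NumPy are illustrative assumptions, not taken from the source:

```python
import numpy as np

def max_pool_forward(Y, K, stride):
    """Strided max pooling over one 2-D map.
    Returns the pooled map U and the (row, col) of each selected max,
    which backprop needs to route gradients (see 'Derivative of Max pooling')."""
    H, W = Y.shape
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    U = np.zeros((out_h, out_w))
    argmax = np.zeros((out_h, out_w, 2), dtype=int)
    for i in range(out_h):
        for j in range(out_w):
            patch = Y[i*stride:i*stride+K, j*stride:j*stride+K]
            r, c = np.unravel_index(np.argmax(patch), patch.shape)
            U[i, j] = patch[r, c]
            argmax[i, j] = (i*stride + r, j*stride + c)
    return U, argmax
```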

Learning the CNN

  • Training is as in the case of the regular MLP
    • The only difference is in the structure of the network
  • Define a divergence between the desired output and true output of the network in response to any input
  • Network parameters are trained through variants of gradient descent
  • Gradients are computed through backpropagation

Final flat layers

  • Backpropagation continues in the usual manner through the flat layers, until we have the derivative of the divergence w.r.t. the (unrolled) output of the final convolution/pooling layer
  • Recall in Backpropagation
    • Step 1: compute $\frac{\partial Div}{\partial z^{n}}$ and $\frac{\partial Div}{\partial y^{n}}$
    • Step 2: compute $\frac{\partial Div}{\partial w^{n}}$ according to Step 1

Convolutional layer

Computing $\nabla_{Z(l)} Div$

  • $\frac{dDiv}{dz(l,m,x,y)}=\frac{dDiv}{dY(l,m,x,y)}\, f'(z(l,m,x,y))$

  • Simple component-wise computation
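
The component-wise rule above is a one-liner in NumPy; a minimal sketch assuming a tanh activation (the source does not fix a particular $f$):

```python
import numpy as np

# Component-wise: dDiv/dz(l,m,x,y) = dDiv/dY(l,m,x,y) * f'(z(l,m,x,y)).
# f = tanh is an assumption for illustration, so f'(z) = 1 - tanh(z)^2.
def backprop_through_activation(dDiv_dY, z):
    return dDiv_dY * (1.0 - np.tanh(z) ** 2)
```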

Computing $\nabla_{Y(l-1)} Div$

  • Each $Y(l-1,m,x,y)$ affects several $z(l,n,x',y')$ terms in every map $n$

    • Through $w_l(m,n,x-x',y-y')$
    • Affects terms in all $l$-th layer maps
    • All of them contribute to the derivative of the divergence w.r.t. $Y(l-1,m,x,y)$
  • Derivative w.r.t. a specific $Y$ term

$$\frac{dDiv}{dY(l-1,m,x,y)}=\sum_{n}\sum_{x',y'}\frac{dDiv}{dz(l,n,x',y')}\,\frac{dz(l,n,x',y')}{dY(l-1,m,x,y)}$$

$$\frac{dDiv}{dY(l-1,m,x,y)}=\sum_{n}\sum_{x',y'}\frac{dDiv}{dz(l,n,x',y')}\, w_l(m,n,x-x',y-y')$$
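
A direct transcription of this summation into NumPy loops, for a single layer. The array layouts (maps, height, width for activations; `(n_in, n_out, K, K)` for filters) and the function name are assumptions; the forward map implied by the notes is $z(l,n,x',y')=\sum_m\sum_{x,y} w_l(m,n,x,y)\,Y(l-1,m,x'+x,y'+y)$:

```python
import numpy as np

def backprop_to_Y(dz, w, in_h, in_w):
    """dz: (n_out, out_h, out_w) derivatives dDiv/dz at layer l.
    w:  (n_in, n_out, K, K) filters w_l(m, n, ., .).
    Returns dDiv/dY(l-1) of shape (n_in, in_h, in_w)."""
    n_in, n_out, K, _ = w.shape
    out_h, out_w = dz.shape[1:]
    dY = np.zeros((n_in, in_h, in_w))
    for m in range(n_in):
        for x in range(in_h):
            for y in range(in_w):
                # sum over maps n and positions (x', y') that used Y(l-1, m, x, y)
                for n in range(n_out):
                    for xp in range(out_h):
                        for yp in range(out_w):
                            if 0 <= x - xp < K and 0 <= y - yp < K:
                                dY[m, x, y] += dz[n, xp, yp] * w[m, n, x - xp, y - yp]
    return dY
```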

Computing $\nabla_{w(l)} Div$

  • Each weight $w_l(m,n,x',y')$ also affects several $z(l,n,x,y)$ terms
    • Affects terms in only one $Z$ map (the $n$-th map)
    • All entries in that map contribute to the derivative of the divergence w.r.t. $w_l(m,n,x',y')$
  • Derivative w.r.t. a specific $w$ term

$$\frac{dDiv}{dw_l(m,n,x,y)}=\sum_{x',y'}\frac{dDiv}{dz(l,n,x',y')}\,\frac{dz(l,n,x',y')}{dw_l(m,n,x,y)}$$

$$\frac{dDiv}{dw_l(m,n,x,y)}=\sum_{x',y'}\frac{dDiv}{dz(l,n,x',y')}\, Y(l-1,m,x'+x,y'+y)$$
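
The same transcription for the weight gradient, under the same assumed layouts (naive NumPy loops, for illustration only):

```python
import numpy as np

def backprop_to_w(dz, Y_prev, K):
    """dz: (n_out, out_h, out_w) derivatives dDiv/dz at layer l.
    Y_prev: (n_in, in_h, in_w) activations Y(l-1).
    Returns dDiv/dw of shape (n_in, n_out, K, K)."""
    n_out, out_h, out_w = dz.shape
    n_in = Y_prev.shape[0]
    dw = np.zeros((n_in, n_out, K, K))
    for m in range(n_in):
        for n in range(n_out):
            for x in range(K):
                for y in range(K):
                    # every position (x', y') of the n-th output map uses w(m, n, x, y)
                    for xp in range(out_h):
                        for yp in range(out_w):
                            dw[m, n, x, y] += dz[n, xp, yp] * Y_prev[m, xp + x, yp + y]
    return dw
```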

Summary

In practice

$$\frac{dDiv}{dY(l-1,m,x,y)}=\sum_{n}\sum_{x',y'}\frac{dDiv}{dz(l,n,x',y')}\, w_l(m,n,x-x',y-y')$$

  • This is a convolution, but with the indices in a different order
    • Flipping the filter (up-down and left-right) turns it into a regular convolution

  • In practice, the derivative at each $(x,y)$ location is obtained from all $Z$ maps

  • This is just a convolution of $\frac{\partial Div}{\partial z(l,n,x,y)}$ with the flipped (inverted) filter
    • After first zero-padding it with $K-1$ zeros on every side

  • Note: $x'$ and $y'$ refer to locations within the filter
  • Shift the derivative map down and to the right by $K-1$, so that position $(0,0)$ moves to $(K-1,K-1)$

$$z_{\text{shift}}(l,n,x,y)=z(l,n,x-K+1,y-K+1)$$

$$\frac{\partial Div}{\partial Y(l-1,m,x,y)}=\sum_{n}\sum_{x',y'}\widehat{w}(l,n,m,x',y')\,\frac{\partial Div}{\partial z_{\text{shift}}(l,n,x+x',y+y')}$$

  • A regular convolution run on the shifted derivative maps, using the flipped filter $\widehat{w}$
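
A sketch of this equivalence using `scipy.signal` (the library choice is an assumption; the source only describes the operation). For a single input/output map pair, the backward summation is a "full" convolution of the derivative map with the filter, which is the same as a regular correlation with the flipped filter on the zero-padded map:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

# Single input map m and output map n, for brevity.
K = 3
dz = np.random.randn(6, 6)          # dDiv/dz for one l-th layer map
w = np.random.randn(K, K)           # filter w_l(m, n, ., .)

# Full convolution: sum over (x', y') of dz[x', y'] * w[x - x', y - y']
dY_conv = convolve2d(dz, w, mode="full")

# Same result as a regular correlation with the flipped filter,
# after zero-padding the derivative map with K-1 zeros on every side.
dz_pad = np.pad(dz, K - 1, mode="constant")
dY_corr = correlate2d(dz_pad, w[::-1, ::-1], mode="valid")

assert np.allclose(dY_conv, dY_corr)
```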

Pooling

  • Pooling is typically performed with strides > 1
    • Results in shrinking of the map
    • Downsampling

Derivative of Max pooling

$$\frac{dDiv}{dY(l,m,k,l)}=\begin{cases}\frac{dDiv}{dU(l,m,i,j)} & \text{if } (k,l)=P(l,m,i,j)\\ 0 & \text{otherwise}\end{cases}$$

  • Max pooling selects the largest element from the pool [1]
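
A minimal sketch of the corresponding backward pass, reusing the argmax locations recorded by the forward sketch shown earlier (names and layouts are assumptions):

```python
import numpy as np

def max_pool_backward(dU, argmax, in_shape):
    """Route dDiv/dU back to the locations that produced each max.
    dU: (out_h, out_w) derivative of the divergence w.r.t. the pooled map.
    argmax: (out_h, out_w, 2) locations recorded in the forward pass.
    All other positions of the input map get zero gradient."""
    dY = np.zeros(in_shape)
    out_h, out_w = dU.shape
    for i in range(out_h):
        for j in range(out_w):
            r, c = argmax[i, j]
            dY[r, c] += dU[i, j]   # += in case pools overlap
    return dY
```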

Derivative of Mean pooling

  • The derivative of mean pooling is distributed over the pool

$$dy(l,m,k,n)=\frac{1}{K_{\text{lpool}}^{2}}\, du(l,m,k,n)$$
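
A matching sketch for mean pooling: every input inside a pool receives an equal $1/K^2$ share of the incoming derivative (function name and single-map layout are assumptions):

```python
import numpy as np

def mean_pool_backward(dU, K, stride, in_shape):
    """Each input inside a KxK pool receives an equal 1/K^2 share of dDiv/dU."""
    dY = np.zeros(in_shape)
    out_h, out_w = dU.shape
    for i in range(out_h):
        for j in range(out_w):
            dY[i*stride:i*stride+K, j*stride:j*stride+K] += dU[i, j] / (K * K)
    return dY
```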

Transposed Convolution

  • We’ve always assumed that subsequent steps shrink the size of the maps
  • Can subsequent maps increase in size? [2]

  • Output size is typically an integer multiple of input
    • +1 if filter width is odd
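
A minimal single-channel sketch of a transposed convolution (function name and layout assumed): each input value stamps a weighted copy of the filter onto a larger output map, and overlapping stamps add up. With stride 2 and a 3x3 filter, a 4x4 input produces a 9x9 output, i.e. 2·4 + 1, matching the "integer multiple, +1 if the filter width is odd" rule above:

```python
import numpy as np

def transposed_conv2d(Y, w, stride):
    """Each input value 'stamps' a weighted copy of the KxK filter onto
    the (larger) output map; overlapping stamps accumulate."""
    in_h, in_w = Y.shape
    K = w.shape[0]
    out_h = (in_h - 1) * stride + K
    out_w = (in_w - 1) * stride + K
    Z = np.zeros((out_h, out_w))
    for x in range(in_h):
        for y in range(in_w):
            Z[x*stride:x*stride+K, y*stride:y*stride+K] += Y[x, y] * w
    return Z
```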

Model variations

  • Very deep networks
    • 100 or more layers in an MLP
    • A formalism called “ResNet”
  • Depth-wise convolutions
    • Instead of multiple independent filters with independent parameters, use common layer-wise weights and combine the layers differently for each filter

Depth-wise convolutions

  • In depth-wise convolution the convolution step is performed only once
  • The simple summation is replaced by a weighted sum across channels
    • Different weights (for summation) produce different output channels
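
A shape-level sketch of this two-step structure (depth-wise spatial filtering followed by a weighted sum across channels). The function name, the `scipy.signal.correlate2d` call, and the array layouts are assumptions:

```python
import numpy as np
from scipy.signal import correlate2d

def depthwise_separable_conv(Y, w_spatial, w_mix):
    """Y: (n_in, H, W) input maps.
    w_spatial: (n_in, K, K) one spatial filter per input channel
               (the convolution step is performed only once per channel).
    w_mix: (n_out, n_in) weights for the per-output-channel weighted sum."""
    n_in = Y.shape[0]
    # spatial step: each channel is filtered once with its own KxK filter
    spatial = np.stack([correlate2d(Y[m], w_spatial[m], mode="valid")
                        for m in range(n_in)])
    # mixing step: different summation weights produce different output channels
    return np.tensordot(w_mix, spatial, axes=([1], [0]))
```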

Models

  • For CIFAR-10

    • LeNet-5 [3]
  • For ILSVRC (ImageNet Large Scale Visual Recognition Challenge)

    • AlexNet
      • NN contains 60 million parameters and 650,000 neurons
      • 5 convolutional layers, some of which are followed by max-pooling layers
      • 3 fully-connected layers
    • VGGNet
      • Only used 3x3 filters, stride 1, pad 1
      • Only used 2x2 pooling filters, stride 2
      • ~140 million parameters in all
    • GoogLeNet
      • Multiple filter sizes simultaneously
  • For ImageNet

    • ResNet (see the sketch after this list)
      • The last layer before the addition must have the same number of filters as the input to the module
      • Batch normalization after each convolution

    • DenseNet
      • All convolutional
      • Each layer looks at the union of maps from all previous layers
        • Instead of just the set of maps from the immediately previous layer
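
A shape-level sketch of the two connectivity patterns mentioned above (function names and the callable `conv` arguments are assumptions, and batch normalization is omitted): ResNet adds a module's output to its input, which is why the last convolution must produce as many maps as the input; DenseNet feeds each layer the concatenation of the maps from all previous layers:

```python
import numpy as np

def residual_block(Y, conv1, conv2):
    """ResNet-style module: conv2 must output the same number of maps as Y
    so that the element-wise addition is well defined.
    (Batch normalization after each convolution is omitted in this sketch.)"""
    h = np.maximum(conv1(Y), 0)          # conv + ReLU
    return np.maximum(conv2(h) + Y, 0)   # add the module input, then ReLU

def dense_layer_input(previous_maps):
    """DenseNet-style input: the union (channel-wise concatenation) of the maps
    from all previous layers, not just the immediately preceding one."""
    return np.concatenate(previous_maps, axis=0)
```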
References

[1] Backprop Through Max-Pooling Layers?
[2] Transposed Convolution Demystified
[3] https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
