Modelling Series

  • In many situations one must consider a series of inputs to produce an output

    • Outputs too may be a series
  • Finite response model

    • Can use a convolutional neural net applied to series data

      • Also called a Time-Delay Neural Network (TDNN)
    • Something that happens today only affects the output of the system for N days into the future

      • Y_{t}=f\left(X_{t}, X_{t-1}, \ldots, X_{t-N}\right)
  • Infinite response systems

    • Systems often have long-term dependencies

    • What happens today can continue to affect the output forever

      • Y_{t}=f\left(X_{t}, X_{t-1}, \ldots, X_{t-\infty}\right)
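As a sketch (the filter weights and tanh nonlinearity are assumptions for illustration), a finite response model is just a 1-D convolution over the input series, and an impulse at t=0 stops affecting the output after N steps:

```python
import numpy as np

# Finite response model: Y_t = f(X_t, ..., X_{t-N}) as a 1-D convolution
# (a TDNN layer) over a scalar series.  The filter w and the tanh
# nonlinearity are illustrative choices, not taken from the slides.

def finite_response(x, w, b=0.0):
    """x: input series; w: filter over the last N+1 inputs (N = len(w)-1)."""
    N = len(w) - 1
    y = np.zeros(len(x))
    for t in range(len(x)):
        # gather X_t, X_{t-1}, ..., X_{t-N}, treating times before 0 as zero
        window = [x[t - k] if t - k >= 0 else 0.0 for k in range(N + 1)]
        y[t] = np.tanh(np.dot(w, window) + b)
    return y

x = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])   # an impulse at t = 0
y = finite_response(x, w=np.array([0.5, 0.3, 0.1]))
# With N = 2, the impulse influences y[0..2] only; y[3:] are all zero.
```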

Infinite response systems

  • A one-tap NARX network

    • “Nonlinear AutoRegressive network with eXogenous inputs”
    • Y_t = f(X_t, Y_{t-1})
    • An input at t=0 affects outputs forever
  • An explicit memory variable whose job it is to remember

    • \begin{array}{c} m_{t}=r\left(y_{t-1}, h_{t-1}, m_{t-1}\right) \\ h_{t}=f\left(x_{t}, m_{t}\right) \\ y_{t}=g\left(h_{t}\right) \end{array}

  • Jordan Network

    • Memory unit simply retains a running average of past outputs
    • Memory has fixed structure; does not “learn” to remember
  • Elman Networks

    • Separate memory state from output
    • Only the weight from the memory unit to the hidden unit is learned
      • But during training no gradient is backpropagated over the “1” link (the state is simply cloned)
  • Problem

    • “Simple” (or partially recurrent) because during learning the current error does not actually propagate to the past
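A minimal sketch of the one-tap NARX recurrence Y_t = f(X_t, Y_{t-1}), where f(x, y) = tanh(w_x·x + w_y·y) is an assumed form for illustration; note how a single impulse at t=0 echoes in every later output:

```python
import numpy as np

# One-tap NARX recurrence Y_t = f(X_t, Y_{t-1}).  The specific
# f(x, y) = tanh(w_x * x + w_y * y) is an assumed illustration.

def narx(xs, w_x=1.0, w_y=0.5):
    y, ys = 0.0, []
    for x in xs:
        y = np.tanh(w_x * x + w_y * y)   # previous output fed back as input
        ys.append(y)
    return np.array(ys)

y = narx([1.0, 0.0, 0.0, 0.0, 0.0])
# The impulse at t = 0 decays but never vanishes: every later output is
# nonzero, i.e. the system has an infinite response.
```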

State-space model

\begin{array}{c} h_{t}=f\left(x_{t}, h_{t-1}\right) \\ y_{t}=g\left(h_{t}\right) \end{array}

  • h_t is the state of the network
    • Model directly embeds the memory in the state
    • State summarizes information about the entire past
  • Recurrent neural network¹


  • All columns are identical

  • The simplest structures are most popular

Recurrent neural network

Forward pass
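A minimal numpy sketch of the forward pass in the slides' notation (Z^{(1)}, h, Z^{(2)}, with weights W^{(1)}, W^{(11)}, W^{(2)}); all dimensions and random weight values are placeholders, with f = tanh and g = identity assumed:

```python
import numpy as np

# Forward pass of the state-space RNN:
#   Z1(t) = W1 X(t) + W11 h(t-1),  h(t) = f(Z1(t))
#   Z2(t) = W2 h(t),               Y(t) = g(Z2(t))
# f = tanh and g = identity here; all sizes and weights are placeholders.

rng = np.random.default_rng(0)
D, H, O, T = 3, 4, 2, 5              # input, hidden, output dims; length
W1 = rng.normal(size=(H, D)) * 0.1
W11 = rng.normal(size=(H, H)) * 0.1
W2 = rng.normal(size=(O, H)) * 0.1

def forward(X):
    h = np.zeros(H)                  # h(-1): the initial state
    Hs, Ys = [], []
    for t in range(len(X)):
        Z1 = W1 @ X[t] + W11 @ h     # same weights at every time step
        h = np.tanh(Z1)              # h(t) summarizes the entire past
        Hs.append(h)
        Ys.append(W2 @ h)            # output computed from the current state
    return np.array(Hs), np.array(Ys)

X = rng.normal(size=(T, D))
Hs, Ys = forward(X)                  # shapes (T, H) and (T, O)
```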

Backward pass

  • BPTT
    • Back Propagation Through Time
    • Defining a divergence between the actual and desired output sequences
    • Backpropagating gradients over the entire chain of recursion
      • Backpropagation through time
    • Pooling gradients with respect to individual parameters over time


  • The divergence computed is between the sequence of outputs by the network and the desired sequence of outputs
    • DIV is a scalar function of a series of vectors
    • This is not just the sum of the divergences at individual times
  • Y(t) is the output at time t
    • Y_i(t) is the i-th output
  • Z^{(2)}(t) is the pre-activation value of the neurons at the output layer at time t
  • h(t) is the output of the hidden layer at time t


  • Y(t) is a column vector
  • DIV is a scalar
  • \frac{dDIV}{dY(t)} is a row vector
Derivative at time T
  1. Compute \frac{dDIV}{dY_i(T)} for all i

    • In general we will be required to compute \frac{dDIV}{dY_i(t)} for all i and t, as we will see

      • This can be a source of significant difficulty in many scenarios
    • Special case: when the overall divergence is a simple sum of local divergences at each time

      • \frac{dDIV}{dY_{i}(t)}=\frac{dDiv(t)}{dY_{i}(t)}
  2. Compute \nabla_{Z^{(2)}(T)} DIV

    • \nabla_{Z^{(2)}(T)} DIV=\nabla_{Y(T)} DIV \, \nabla_{Z^{(2)}(T)} Y(T)

    • For scalar output activation

      • \frac{dDIV}{dZ_{i}^{(2)}(T)}=\frac{dDIV}{dY_{i}(T)} \frac{dY_{i}(T)}{dZ_{i}^{(2)}(T)}
    • For vector output activation

      • \frac{dDIV}{dZ_{i}^{(2)}(T)}=\sum_{j} \frac{dDIV}{dY_{j}(T)} \frac{dY_{j}(T)}{dZ_{i}^{(2)}(T)}
  3. Compute \nabla_{h(T)} DIV

    • W^{(2)} h(T) = Z^{(2)}(T)

    • \frac{dDIV}{dh_{i}(T)}=\sum_{j} \frac{dDIV}{dZ_{j}^{(2)}(T)} \frac{dZ_{j}^{(2)}(T)}{dh_{i}(T)}=\sum_{j} w_{ij}^{(2)} \frac{dDIV}{dZ_{j}^{(2)}(T)}

    • \nabla_{h(T)} DIV=\nabla_{Z^{(2)}(T)} DIV \, W^{(2)}

  4. Compute \nabla_{W^{(2)}} DIV

    • \frac{dDIV}{dw_{ij}^{(2)}}=\frac{dDIV}{dZ_{j}^{(2)}(T)} h_{i}(T)

    • \nabla_{W^{(2)}} DIV=h(T) \nabla_{Z^{(2)}(T)} DIV

  5. Compute \nabla_{Z^{(1)}(T)} DIV

    • \frac{dDIV}{dZ_{i}^{(1)}(T)}=\frac{dDIV}{dh_{i}(T)} \frac{dh_{i}(T)}{dZ_{i}^{(1)}(T)}

    • \nabla_{Z^{(1)}(T)} DIV=\nabla_{h(T)} DIV \, \nabla_{Z^{(1)}(T)} h(T)

  6. Compute \nabla_{W^{(1)}} DIV

    • W^{(1)} X(T) + W^{(11)} h(T-1) = Z^{(1)}(T)

    • \frac{dDIV}{dw_{ij}^{(1)}}=\frac{dDIV}{dZ_{j}^{(1)}(T)} X_{i}(T)

    • \nabla_{W^{(1)}} DIV=X(T) \nabla_{Z^{(1)}(T)} DIV

  7. Compute \nabla_{W^{(11)}} DIV

    • \frac{dDIV}{dw_{ij}^{(11)}}=\frac{dDIV}{dZ_{j}^{(1)}(T)} h_{i}(T-1)

    • \nabla_{W^{(11)}} DIV=h(T-1) \nabla_{Z^{(1)}(T)} DIV

Derivative at time T-1
  1. Compute \nabla_{Z^{(2)}(T-1)} DIV

    • \nabla_{Z^{(2)}(T-1)} DIV=\nabla_{Y(T-1)} DIV \, \nabla_{Z^{(2)}(T-1)} Y(T-1)

    • For scalar output activation

      • \frac{dDIV}{dZ_{i}^{(2)}(T-1)}=\frac{dDIV}{dY_{i}(T-1)} \frac{dY_{i}(T-1)}{dZ_{i}^{(2)}(T-1)}
    • For vector output activation

      • \frac{dDIV}{dZ_{i}^{(2)}(T-1)}=\sum_{j} \frac{dDIV}{dY_{j}(T-1)} \frac{dY_{j}(T-1)}{dZ_{i}^{(2)}(T-1)}
  2. Compute \nabla_{h(T-1)} DIV

    • \frac{dDIV}{dh_{i}(T-1)}=\sum_{j} w_{ij}^{(2)} \frac{dDIV}{dZ_{j}^{(2)}(T-1)}+\sum_{j} w_{ij}^{(11)} \frac{dDIV}{dZ_{j}^{(1)}(T)}

    • \nabla_{h(T-1)} DIV=\nabla_{Z^{(2)}(T-1)} DIV \, W^{(2)}+\nabla_{Z^{(1)}(T)} DIV \, W^{(11)}

  3. Compute \nabla_{W^{(2)}} DIV

    • \frac{dDIV}{dw_{ij}^{(2)}} += \frac{dDIV}{dZ_{j}^{(2)}(T-1)} h_{i}(T-1)

    • \nabla_{W^{(2)}} DIV += h(T-1) \nabla_{Z^{(2)}(T-1)} DIV

  4. Compute \nabla_{Z^{(1)}(T-1)} DIV

    • \frac{dDIV}{dZ_{i}^{(1)}(T-1)}=\frac{dDIV}{dh_{i}(T-1)} \frac{dh_{i}(T-1)}{dZ_{i}^{(1)}(T-1)}

    • \nabla_{Z^{(1)}(T-1)} DIV=\nabla_{h(T-1)} DIV \, \nabla_{Z^{(1)}(T-1)} h(T-1)

  5. Compute \nabla_{W^{(1)}} DIV

    • \frac{dDIV}{dw_{ij}^{(1)}} += \frac{dDIV}{dZ_{j}^{(1)}(T-1)} X_{i}(T-1)

    • \nabla_{W^{(1)}} DIV += X(T-1) \nabla_{Z^{(1)}(T-1)} DIV

  6. Compute \nabla_{W^{(11)}} DIV

    • \frac{dDIV}{dw_{ij}^{(11)}} += \frac{dDIV}{dZ_{j}^{(1)}(T-1)} h_{i}(T-2)

    • \nabla_{W^{(11)}} DIV += h(T-2) \nabla_{Z^{(1)}(T-1)} DIV

Back Propagation Through Time

\frac{dDIV}{dh_{i}(-1)}=\sum_{j} w_{ij}^{(11)} \frac{dDIV}{dZ_{j}^{(1)}(0)}

\frac{dDIV}{dh_{i}^{(k)}(t)}=\sum_{j} w_{i,j}^{(k+1)} \frac{dDIV}{dZ_{j}^{(k+1)}(t)}+\sum_{j} w_{i,j}^{(k,k)} \frac{dDIV}{dZ_{j}^{(k)}(t+1)}

\frac{dDIV}{dZ_{i}^{(k)}(t)}=\frac{dDIV}{dh_{i}^{(k)}(t)} f_{k}^{\prime}\left(Z_{i}^{(k)}(t)\right)

\frac{dDIV}{dw_{ij}^{(1)}}=\sum_{t} \frac{dDIV}{dZ_{j}^{(1)}(t)} X_{i}(t)

\frac{dDIV}{dw_{ij}^{(11)}}=\sum_{t} \frac{dDIV}{dZ_{j}^{(1)}(t)} h_{i}(t-1)
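The backward recursion and time-pooled weight gradients above can be sketched as a numpy BPTT loop for a single tanh hidden layer; the sum-of-squared-error divergence and all sizes and weights are assumptions for illustration:

```python
import numpy as np

# BPTT sketch for a single-hidden-layer RNN.  The divergence is
# DIV = sum_t 0.5 * ||Y(t) - Ytarget(t)||^2 (an assumed choice), with
# f = tanh, g = identity, and placeholder sizes/weights.

rng = np.random.default_rng(1)
D, H, O, T = 3, 4, 2, 6
W1 = rng.normal(size=(H, D)) * 0.1
W11 = rng.normal(size=(H, H)) * 0.1
W2 = rng.normal(size=(O, H)) * 0.1
X = rng.normal(size=(T, D))
Ytarget = rng.normal(size=(T, O))

# forward pass, caching h(t) (tanh'(Z1(t)) is recoverable as 1 - h(t)^2)
h = np.zeros(H)
Hs, Ys = [], []
for t in range(T):
    h = np.tanh(W1 @ X[t] + W11 @ h)
    Hs.append(h)
    Ys.append(W2 @ h)

# backward pass through time, pooling gradients over t
dW1, dW11, dW2 = np.zeros_like(W1), np.zeros_like(W11), np.zeros_like(W2)
dZ1_next = np.zeros(H)                   # dDIV/dZ1(t+1); zero past the end
for t in reversed(range(T)):
    dZ2 = Ys[t] - Ytarget[t]             # dDIV/dZ2(t) for squared error
    dh = W2.T @ dZ2 + W11.T @ dZ1_next   # contributions from Z2(t) and Z1(t+1)
    dZ1 = dh * (1.0 - Hs[t] ** 2)        # chain through tanh
    dW2 += np.outer(dZ2, Hs[t])
    dW1 += np.outer(dZ1, X[t])
    h_prev = Hs[t - 1] if t > 0 else np.zeros(H)
    dW11 += np.outer(dZ1, h_prev)
    dZ1_next = dZ1
```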


Bidirectional RNN

  • Two independent RNNs, one processing the series forward and one processing it backward
  • Clearly, this is not an online process and requires the entire input data
  • It is easy to learn the two RNNs independently
    • Forward pass: compute both the forward and backward networks and the final output
    • Backpropagation
      • A basic backprop routine that we will call twice
      • Two calls to the routine within a higher-level wrapper
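A sketch of that wrapper: two independent single-layer tanh RNNs (all weights are hypothetical placeholders), one run left-to-right and one right-to-left, with their states concatenated at each time step:

```python
import numpy as np

# Bidirectional wrapper: two independent single-layer tanh RNNs, one
# reading the series left-to-right and one right-to-left; their states
# are concatenated per time step.  Weights/sizes are placeholders.

rng = np.random.default_rng(2)
D, H, T = 3, 4, 5
Wf, Uf = rng.normal(size=(H, D)) * 0.1, rng.normal(size=(H, H)) * 0.1
Wb, Ub = rng.normal(size=(H, D)) * 0.1, rng.normal(size=(H, H)) * 0.1

def run(X, W, U):
    """The basic forward routine, called twice by the wrapper below."""
    h, out = np.zeros(H), []
    for x in X:
        h = np.tanh(W @ x + U @ h)
        out.append(h)
    return np.array(out)

X = rng.normal(size=(T, D))              # the ENTIRE series must be known
h_fwd = run(X, Wf, Uf)                   # forward net, left to right
h_bwd = run(X[::-1], Wb, Ub)[::-1]       # backward net, then re-align in time
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)   # combined state, (T, 2H)
```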

1: The Unreasonable Effectiveness of Recurrent Neural Networks
