Variants on recurrent nets

  • Architectures
    • How to train recurrent networks of different architectures
  • Synchrony
    • The target output is time-synchronous with the input
    • The target output is order-synchronous, but not time-synchronous

One to one

  • No recurrence in model

    • Exactly as many outputs as inputs
    • One to one correspondence between desired output and actual output
  • Common assumption: $\nabla_{Y(t)} \operatorname{Div}\left(Y_{\text{target}}(1 \ldots T), Y(1 \ldots T)\right) = w_{t} \nabla_{Y(t)} \operatorname{Div}\left(Y_{\text{target}}(t), Y(t)\right)$

    • $w_t$ is typically set to 1.0
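
A minimal numpy sketch of this assumption (the function name and the uniform-weight default are illustrative, not from the slides): because the total divergence is a weighted sum of per-timestep divergences, its gradient with respect to $Y(t)$ reduces to $w_t$ times the gradient of the divergence at time $t$ alone.

```python
import numpy as np

def time_synchronous_divergence(Y, Y_target, w=None):
    """Weighted sum of per-timestep divergences (cross-entropy here).

    Y        : (T, K) array, output distribution over K classes at each time t
    Y_target : (T,)   array of desired class indices
    w        : (T,)   per-timestep weights w_t, defaulting to 1.0 everywhere
    """
    T = Y.shape[0]
    w = np.ones(T) if w is None else w
    # Xent(Y_target(t), Y(t)) = -log Y(t)[Y_target(t)]
    xent = -np.log(Y[np.arange(T), Y_target] + 1e-12)
    return float(np.sum(w * xent))

# Because the total divergence is a weighted sum over time, its gradient with
# respect to Y(t) is simply w_t times the gradient of the divergence at time t.
```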

Many to many

  • The divergence computed is between the sequence of outputs by the network and the desired sequence of outputs
  • This is not just the sum of the divergences at individual times
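
One hedged illustration of the second point (not necessarily the divergence used for training): when the output is order-synchronous but not time-synchronous, the decoded and desired symbol sequences may have different lengths, so a sequence-level measure such as edit distance cannot be written as a sum of per-timestep divergences.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two symbol sequences of possibly different
    lengths -- an example of a divergence defined over whole sequences rather
    than as a sum over aligned time instants."""
    D = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        D[i][0] = i                           # deleting all of hyp[:i]
    for j in range(len(ref) + 1):
        D[0][j] = j                           # inserting all of ref[:j]
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + cost) # substitution
    return D[-1][-1]

print(edit_distance(["B", "IY", "F", "IY"], ["B", "IY", "IY", "F", "IY"]))  # 1
```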

Language modelling: Representing words

  • Represent words as one-hot vectors

    • The representation is very high-dimensional and sparse
    • Makes no assumptions about the relative importance of words
  • Projected word vectors

    • Replace every one-hot vector $W_i$ by $PW_i$
    • $P$ is an $M \times N$ matrix
  • How to learn projections

    • Soft bag of words
      • Predict word based on words in immediate context
      • Without considering specific position
    • Skip-grams
      • Predict adjacent words based on current word
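
A small numpy sketch of the projection idea and the skip-gram prediction (the matrix names, sizes, and the separate output matrix are assumptions for illustration): multiplying the $M \times N$ matrix $P$ by a one-hot vector $W_i$ simply selects the $i$-th column of $P$, and the skip-gram model scores every vocabulary word as a potential neighbour of the current word.

```python
import numpy as np

N, M = 10000, 300                        # vocabulary size N and projection size M (illustrative)
rng = np.random.default_rng(0)
P = rng.normal(scale=0.01, size=(M, N))  # projection matrix: one M-dim column per word
C = rng.normal(scale=0.01, size=(N, M))  # output matrix used to score context words (assumed)

def project(i):
    # P @ one_hot(i) is just the i-th column of P
    return P[:, i]

def skipgram_context_distribution(i):
    # Predict adjacent (context) words from the current word: softmax over the vocabulary
    scores = C @ project(i)
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()

context_probs = skipgram_context_distribution(42)   # P(context word | current word 42)
```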

Many to one

  • Example
    • Question answering
      • Input : Sequence of words
      • Output: Answer at the end of the question
    • Speech recognition
      • Input : Sequence of feature vectors (e.g. Mel spectra)
      • Output: Phoneme ID at the end of the sequence

  • Outputs are actually produced for every input

    • We only read the output at the end of the sequence
  • How to train

    • Define the divergence everywhere
      • $\operatorname{DIV}\left(Y_{\text{target}}, Y\right)=\sum_{t} w_{t} \operatorname{Xent}(Y(t), \text{Phoneme})$
    • Typical weighting scheme for speech
      • All are equally important
    • Problem like question answering
      • Answer only expected after the question ends
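
A hedged sketch of this weighted divergence (array shapes and variable names are assumptions): the network emits an output at every step, each step's cross-entropy against the single target is weighted by $w_t$, and the two cases above differ only in the choice of weights.

```python
import numpy as np

def many_to_one_divergence(Y, target, w):
    """DIV(Y_target, Y) = sum_t w_t * Xent(Y(t), target).

    Y      : (T, K) per-timestep output distributions
    target : index of the single desired label (e.g. a phoneme ID or an answer)
    w      : (T,) per-timestep weights
    """
    xent = -np.log(Y[:, target] + 1e-12)
    return float(np.sum(w * xent))

T = 50
w_speech = np.ones(T)               # speech: every timestep is equally important
w_qa = np.zeros(T)                  # question answering: the answer is only expected
w_qa[-1] = 1.0                      # after the question ends, so only the last step counts
```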

Sequence-to-sequence

  • How do we know when to output symbols?
    • In fact, the network produces outputs at every time step
    • Which of these are the real outputs?
      • Outputs that represent the definitive occurrence of a symbol

  • Option 1: Simply select the most probable symbol at each time (a sketch appears at the end of this section)
    • Merge adjacent repeated symbols, and place the actual emission of the symbol in the final instant
    • Cannot distinguish between an extended symbol and repetitions of the symbol
    • Resulting sequence may be meaningless
  • Option 2: Impose external constraints on what sequences are allowed
    • Only allow sequences corresponding to dictionary words
    • Sub-symbol units
  • How to train when no timing information is provided

  • Only the sequence of output symbols is provided for the training data
    • But no indication of which one occurs where
  • How do we compute the divergence?
    • And how do we compute its gradient?
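
The sketch below (function name is an assumption) illustrates Option 1 above: take the most probable symbol at each time, merge adjacent repeats, and treat the final instant of each run as the actual emission. It also exhibits the stated weakness: a genuinely repeated symbol is indistinguishable from a single extended one.

```python
import numpy as np

def greedy_decode(Y):
    """Option 1: most probable symbol at each time, adjacent repeats merged.

    Y : (T, K) per-timestep symbol probabilities
    Returns a list of (symbol, final_instant) pairs.
    """
    best = Y.argmax(axis=1)               # most probable symbol at each time
    decoded = []
    for t, s in enumerate(best):
        if decoded and decoded[-1][0] == int(s):
            decoded[-1] = (int(s), t)      # extend the run; emission placed at its final instant
        else:
            decoded.append((int(s), t))
    return decoded

# A run such as [F, IY, IY, IY] collapses to a single IY, so "F IY" and "F IY IY"
# decode identically; the resulting sequence may also not be a valid word, which
# is what Option 2's external constraints (dictionary words, sub-symbol units) address.
```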
