Variants on recurrent nets

Architectures
- How to train recurrent networks of different architectures
Synchrony
- The target output is time-synchronous with the input
- The target output is order-synchronous, but not time-synchronous

No recurrence in model
- Exactly as many outputs as inputs
- One-to-one correspondence between desired output and actual output
Common assumption: $\nabla_{Y(t)} \operatorname{Div}\left(Y_{\text {target}}(1 \ldots T), Y(1 \ldots T)\right)=w_{t} \nabla_{Y(t)} \operatorname{Div}\left(Y_{\text {target}}(t), Y(t)\right)$
- $w_t$ is typically set to 1.0
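
The common assumption above amounts to treating the sequence divergence as a weighted sum of per-time divergences. The following is a minimal sketch, assuming cross-entropy as the per-time divergence and integer class indices as targets; the function names and toy probabilities are illustrative, not part of the original notes.

```python
import math

def xent(target_index, probs):
    """Cross-entropy between a one-hot target and a probability vector."""
    return -math.log(probs[target_index])

def sequence_divergence(targets, outputs, weights=None):
    """Weighted sum over time: sum_t w_t * Xent(target(t), Y(t))."""
    if weights is None:
        weights = [1.0] * len(outputs)  # w_t = 1.0, the typical choice
    return sum(w * xent(d, y) for w, d, y in zip(weights, targets, outputs))

def grad_wrt_output(t, targets, outputs, weights=None):
    """Gradient of the sequence divergence w.r.t. Y(t).

    Under the assumption, only the term at time t contributes.
    """
    w = 1.0 if weights is None else weights[t]
    y, d = outputs[t], targets[t]
    return [-w / y[i] if i == d else 0.0 for i in range(len(y))]

# Toy example (assumed values): T = 3 time steps, 3 classes
outputs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.25, 0.25, 0.5]]
targets = [0, 1, 2]
total = sequence_divergence(targets, outputs)
grad_t1 = grad_wrt_output(1, targets, outputs)
```

Here `grad_t1` is nonzero only at the target class of time step 1, which is what makes the per-time decomposition convenient for backpropagation through time.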

The divergence computed is between the sequence of outputs by the network and the desired sequence of outputs
In general, this is not just the sum of the divergences at individual times

Represent words as one-hot vectors
- Sparsity problem: each vector has a single nonzero entry
- Makes no assumptions about the relative importance of words
Projected word vectors
- Replace every one-hot vector $W_i$ by $PW_i$
- $P$ is an $M\times N$ matrix, so $PW_i$ is an $M$-dimensional vector
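
Because $W_i$ is one-hot, the product $PW_i$ simply selects one column of $P$. A small sketch of this, with hypothetical sizes ($N = 4$ words, $M = 2$ dimensions) and made-up matrix entries:

```python
def one_hot(index, size):
    """N-dimensional one-hot vector with a 1 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def project(P, w):
    """Compute P @ w for an M x N matrix P (list of rows) and an N-vector w."""
    return [sum(p * x for p, x in zip(row, w)) for row in P]

# Hypothetical projection matrix: M = 2 rows, N = 4 columns
P = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8]]

vec = project(P, one_hot(2, 4))   # projected vector for word index 2
column_2 = [row[2] for row in P]  # the same values, read off directly
```

In practice this means the projection can be implemented as a table lookup rather than a full matrix multiplication.
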
How to learn projections
- Soft bag of words
  - Predict word based on words in immediate context
  - Without considering specific position
- Skip-grams
  - Predict adjacent words based on current word
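
The two schemes differ mainly in which (input, target) pairs they train on. The sketch below builds those pairs from a token list; the `window` parameter, the function names, and the sorting of context words (used here only to make the position-independence explicit) are assumptions for illustration.

```python
def bag_of_words_pairs(tokens, window=2):
    """(context words, target word) pairs; context order is discarded."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((sorted(context), target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """(current word, adjacent word) pairs, one per neighbour in the window."""
    pairs = []
    for i, current in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((current, tokens[j]))
    return pairs

tokens = ["the", "cat", "sat", "down"]
bow = bag_of_words_pairs(tokens, window=1)
skip = skipgram_pairs(tokens, window=1)
```

With `window=1`, the bag-of-words pair for "cat" has the context `["sat", "the"]`, while skip-grams yield the separate pairs `("cat", "the")` and `("cat", "sat")`.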

Outputs are actually produced for every input
- We only read the output at the end of the sequence
How to train
- Define the divergence everywhere
  - $\operatorname{Div}\left(Y_{\text {target}}, Y\right)=\sum_{t} w_{t} \operatorname{Xent}(Y(t), \text { Phoneme})$
- Typical weighting scheme for speech
  - All are equally important
- Problems like question answering
  - Answer only expected after the question ends
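
The two weighting schemes can be compared with a short sketch. It assumes the same target label at every time step and cross-entropy per step; the helper names and toy outputs are hypothetical.

```python
import math

def divergence(outputs, label, weights):
    """Div = sum_t w_t * Xent(Y(t), label), same label at every time."""
    return sum(-w * math.log(y[label]) for w, y in zip(weights, outputs))

def uniform_weights(T):
    """All time steps equally important (e.g. the speech case)."""
    return [1.0] * T

def final_only_weights(T):
    """Only the last output counts (e.g. answer expected at the end)."""
    return [0.0] * (T - 1) + [1.0]

# Toy outputs (assumed values): T = 3 time steps, 2 classes, target class 1
outputs = [[0.5, 0.5], [0.4, 0.6], [0.2, 0.8]]
div_uniform = divergence(outputs, 1, uniform_weights(3))
div_final = divergence(outputs, 1, final_only_weights(3))
```

With final-only weights the earlier outputs receive no direct gradient from the divergence; they are shaped only through the recurrent state.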

How do we know when to output symbols?
- In fact, the network produces outputs at every time
- Which of these are the real outputs?
  - Outputs that represent the definitive occurrence of a symbol

Option 1: Simply select the most probable symbol at each time
- Merge adjacent repeated symbols, and place the actual emission of the symbol in the final instant
- Cannot distinguish between an extended symbol and repetitions of the symbol
- Resulting sequence may be meaningless
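
Option 1 can be sketched as follows: take the most probable symbol at each time, then collapse each run of identical symbols and record the emission at the run's final instant. The symbol set and output probabilities are made-up illustration values.

```python
def greedy_decode(outputs, symbols):
    """Pick the most probable symbol per time step, then merge adjacent repeats.

    Returns the per-time best symbols and a list of (symbol, emission_time)
    pairs, where emission_time is the last time step of each run.
    """
    best = [symbols[max(range(len(y)), key=y.__getitem__)] for y in outputs]
    decoded = []
    for t, s in enumerate(best):
        # A run ends at the last time step or when the next symbol differs
        if t + 1 == len(best) or best[t + 1] != s:
            decoded.append((s, t))
    return best, decoded

symbols = ["a", "b"]
outputs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7],
           [0.4, 0.6], [0.1, 0.9], [0.6, 0.4]]
best, decoded = greedy_decode(outputs, symbols)
```

The per-time choices here are `a a b b b a`, which collapse to `a b a`. The same collapse would occur whether the intended sequence was "aba" or "aabba", which is the ambiguity noted above.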
Option 2: Impose external constraints on what sequences are allowed
- Only allow sequences corresponding to dictionary words
- Sub-symbol units
How to train when no timing information is provided

Only the sequence of output symbols is provided for the training data
- But no indication of which one occurs where
How do we compute the divergence?
- And how do we compute its gradient?

14 Divergence Of RNN