Sequence to sequence

  • Sequence goes in, sequence comes out
  • No notion of “time synchrony” between input and output
    • May not even maintain the order of symbols (e.g. when translating from one language to another)

With order synchrony

  • The input and output sequences happen in the same order
    • Although they may be time asynchronous
    • E.g. Speech recognition
      • The input speech corresponds to the phoneme sequence output
  • Question
    • How do we know when to output symbols?
    • In fact, the network produces an output at every time step
    • Which of these are the real outputs?

Option 1

  • Simply select the most probable symbol at each time
  • Merge adjacent repeated symbols, and place the actual emission of the symbol in the final instant
  • Problem
    • Cannot distinguish between an extended symbol and repetitions of the symbol
    • Resulting sequence may be meaningless
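A minimal sketch of Option 1, assuming the network emits a T × L matrix of per-frame symbol probabilities (the function and variable names here are illustrative, not from the lecture):

```python
import numpy as np

def greedy_decode(probs, symbols):
    """Option 1: pick the most probable symbol at each time step,
    then merge adjacent repetitions into a single emission.

    probs:   (T, L) array of per-frame symbol probabilities
    symbols: list of the L symbol labels
    """
    best = np.argmax(probs, axis=1)          # most probable symbol index per frame
    merged = []
    for idx in best:
        if not merged or merged[-1] != idx:  # merge adjacent repeats
            merged.append(idx)
    return [symbols[i] for i in merged]

# A frame-level decode such as "a a a b b c" collapses to ['a', 'b', 'c'];
# the merge cannot tell an extended symbol from a genuine repetition.
```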

Option 2

  • Impose external constraints on what sequences are allowed
    • E.g. only allow sequences corresponding to dictionary words

Decoding

  • The process of obtaining an output from the network

  • Time-synchronous & order-synchronous sequence

    • aaabbbbbbccc => abc (probability 0.5)
      aabbbbbbbccc => abc (probability 0.001)
      cccddddddeee => cde (probability 0.4)
      cccddddeeeee => cde (probability 0.4)
      
    • So abc is the most likely time-synchronous output sequence (its single best alignment, aaabbbbbbccc, has probability 0.5)
    • But cde is the most likely order-synchronous sequence (its alignments sum to 0.4 + 0.4 = 0.8, versus 0.5 + 0.001 = 0.501 for abc)
  • Option 2 is in fact a suboptimal decode: it actually finds the most likely time-synchronous output sequence, not the most likely order-synchronous one

  • The “merging” heuristics do not guarantee optimal order-synchronous sequences

No timing information provided

  • Only the sequence of output symbols is provided for the training data
    • But no indication of where each symbol occurs

Guess the alignment

  • Initialize
    • Assign an initial alignment
    • Either randomly, based on some heuristic, or any other rationale
  • Iterate
    • Train the network using the current alignment
    • Reestimate the alignment for each training instance

Constraining the alignment

  • Try 1
    • Block out all rows that do not include symbols from the target sequence
    • E.g. Block out rows that are not /B/ /IY/ or /F/
  • Only decode on reduced grid
    • We are now assured that only the appropriate symbols will be hypothesized
  • But this still doesn't assure that the decoded sequence correctly expands the target symbol sequence
    • The order of the symbols may still vary
  • Try 2
    • Explicitly arrange the constructed table so that, from top to bottom, it has the exact sequence of symbols required
      • If a symbol occurs multiple times, we repeat the row in the appropriate location

  • Constrain that the first symbol in the decode must be the top left block

    • The last symbol must be the bottom right
    • The rest of the symbols must follow a sequence that monotonically travels down from top left to bottom right
    • This guarantees that the sequence is an expansion of the target sequence
  • Compose a graph such that every path in the graph from source to sink represents a valid alignment

    • The “score” of a path is the product of the probabilities of all nodes along the path
    • Find the most probable path from source to sink using dynamic programming (the Viterbi algorithm)

Viterbi algorithm

  • Main idea
    • The best path to any node must be an extension of the best path to one of its parent nodes
  • Dynamically track the best path (and the score of the best path) from the source node to every node in the graph

    • At each node, keep track of
      • The best incoming parent edge (BP)
      • The score of the best path from the source to the node through this best parent edge (Bscr)
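A sketch of this dynamic program on the constrained alignment grid, assuming log-probabilities `log_y` of shape T × L from the network and a target sequence `S` given as symbol indices (names are illustrative):

```python
import numpy as np

def viterbi_align(log_y, S):
    """Find the most probable monotonic path from (t=0, r=0) to (t=T-1, r=K-1).

    log_y: (T, L) log symbol probabilities
    S:     target sequence as a list of K symbol indices
    Returns the best alignment: the target position r occupied at each time t.
    """
    T, K = log_y.shape[0], len(S)
    bscr = np.full((T, K), -np.inf)   # score of the best path reaching each node
    bp = np.zeros((T, K), dtype=int)  # best incoming parent (previous position r)
    bscr[0, 0] = log_y[0, S[0]]       # the first symbol must be the top-left node
    for t in range(1, T):
        for r in range(K):
            parents = [r] if r == 0 else [r - 1, r]          # stay, or advance by one
            best = max(parents, key=lambda p: bscr[t - 1, p])
            bp[t, r] = best
            bscr[t, r] = bscr[t - 1, best] + log_y[t, S[r]]
    # Backtrace from the bottom-right node
    path = [K - 1]
    for t in range(T - 1, 0, -1):
        path.append(bp[t, path[-1]])
    return path[::-1]
```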

Gradients

$$DIV = \sum_{t} \mathrm{Xent}\left(Y_{t}, \mathrm{symbol}_{t}^{bestpath}\right) = -\sum_{t} \log Y\left(t, \mathrm{symbol}_{t}^{bestpath}\right)$$

  • The gradient w.r.t. the $t$-th output vector $Y_t$: $\nabla_{Y_{t}} DIV = \left[0 \quad 0 \quad \cdots \quad \frac{-1}{Y\left(t, \mathrm{symbol}_{t}^{bestpath}\right)} \quad \cdots \quad 0\right]$

  • Problem

    • Approach heavily dependent on initial alignment
    • Prone to poor local optima
      • Because we commit to the single “best” estimated alignment
      • This can be way off, particularly in early iterations, or if the model is poorly initialized
  • Alternate view

    • There is a probability distribution over alignments of the target symbol sequence (to the input)
    • Selecting a single alignment is the same as drawing a single sample from it
  • Instead of only selecting the most likely alignment, use the statistical expectation over all possible alignments

    • $DIV = E\left[-\sum_{t} \log Y\left(t, s_{t}\right)\right]$

    • Use the entire distribution of alignments

    • This will mitigate the issue of suboptimal selection of alignment

  • Using the linearity of expectation

    • $DIV = -\sum_{t} E\left[\log Y\left(t, s_{t}\right)\right]$

    • $DIV = -\sum_{t} \sum_{S \in S_{0} \ldots S_{K-1}} P\left(s_{t}=S \mid \mathbf{S}, \mathbf{X}\right) \log Y\left(t, s_{t}=S\right)$

A posteriori probabilities of symbols

  • $P(s_{t}=S \mid \mathbf{S}, \mathbf{X})$ is the probability of seeing the specific symbol $S$ at time $t$, given that the symbol sequence is an expansion of $\mathbf{S} = S_{0}, \ldots, S_{K-1}$ and given the input sequence $\mathbf{X} = X_{0}, \ldots, X_{N-1}$

    • $P\left(s_{t}=S_{r} \mid \mathbf{S}, \mathbf{X}\right) \propto P\left(s_{t}=S_{r}, \mathbf{S} \mid \mathbf{X}\right)$
  • $P\left(s_{t}=S_{r}, \mathbf{S} \mid \mathbf{X}\right)$ is the total probability of all valid paths in the graph for target sequence $\mathbf{S}$ that go through the symbol $S_{r}$ (the $r^{th}$ symbol in the sequence $S_{0}, \ldots, S_{K-1}$) at time $t$

  • For a recurrent network without feedback from the output we can make the conditional independence assumption

$$P\left(s_{t}=S_{r}, \mathbf{S} \mid \mathbf{X}\right) = P\left(S_{0} \ldots S_{r}, s_{t}=S_{r} \mid \mathbf{X}\right) P\left(s_{t+1} \in \operatorname{succ}\left(S_{r}\right), \operatorname{succ}\left(S_{r}\right), \ldots, S_{K-1} \mid \mathbf{X}\right)$$

  • We will call the first term the forward probability $\alpha(t, r)$
  • We will call the second term the backward probability $\beta(t, r)$

Forward algorithm

  • In practice the recursion will generally underflow

$$\alpha(t, l) = \left(\alpha(t-1, l) + \alpha(t-1, l-1)\right) y_{t}^{S(l)}$$

  • Instead we can do it in the log domain

$$\log \alpha(t, l) = \log\left(e^{\log \alpha(t-1, l)} + e^{\log \alpha(t-1, l-1)}\right) + \log y_{t}^{S(l)}$$
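A sketch of this log-domain recursion for the simple (no-blank) graph, using `numpy.logaddexp` for the log-sum-exp (same assumed shapes and names as in the earlier sketches):

```python
import numpy as np

def log_forward(log_y, S):
    """log alpha(t, l) for the simple (no-blank) alignment graph.

    log_y: (T, L) log symbol probabilities; S: target symbol indices (length K).
    """
    T, K = log_y.shape[0], len(S)
    log_alpha = np.full((T, K), -np.inf)
    log_alpha[0, 0] = log_y[0, S[0]]             # must start on the first symbol
    for t in range(1, T):
        for l in range(K):
            stay = log_alpha[t - 1, l]                            # same symbol
            move = log_alpha[t - 1, l - 1] if l > 0 else -np.inf  # previous symbol
            log_alpha[t, l] = np.logaddexp(stay, move) + log_y[t, S[l]]
    return log_alpha
```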

Backward algorithm

$$\beta(t, r) = P\left(s_{t+1} \in \operatorname{succ}\left(S_{r}\right), \operatorname{succ}\left(S_{r}\right), \ldots, S_{K-1} \mid \mathbf{X}\right)$$
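The notes only give the definition of β; under the same no-blank graph the corresponding recursion runs backwards from the final symbol. A sketch, consistent with the α convention above (β excludes the current frame's output probability):

```python
import numpy as np

def log_backward(log_y, S):
    """log beta(t, r) for the simple (no-blank) alignment graph."""
    T, K = log_y.shape[0], len(S)
    log_beta = np.full((T, K), -np.inf)
    log_beta[T - 1, K - 1] = 0.0                 # must end on the last symbol
    for t in range(T - 2, -1, -1):
        for r in range(K):
            stay = log_beta[t + 1, r] + log_y[t + 1, S[r]]           # stay on S_r
            move = (log_beta[t + 1, r + 1] + log_y[t + 1, S[r + 1]]  # advance to S_{r+1}
                    if r + 1 < K else -np.inf)
            log_beta[t, r] = np.logaddexp(stay, move)
    return log_beta
```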

The joint probability

$$P\left(s_{t}=S_{r}, \mathbf{S} \mid \mathbf{X}\right) = \alpha(t, r)\,\beta(t, r)$$

  • This needs to be normalized to obtain the posterior probability

$$\gamma(t, r) = P\left(s_{t}=S_{r} \mid \mathbf{S}, \mathbf{X}\right) = \frac{P\left(s_{t}=S_{r}, \mathbf{S} \mid \mathbf{X}\right)}{\sum_{r^{\prime}} P\left(s_{t}=S_{r^{\prime}}, \mathbf{S} \mid \mathbf{X}\right)} = \frac{\alpha(t, r)\,\beta(t, r)}{\sum_{r^{\prime}} \alpha\left(t, r^{\prime}\right) \beta\left(t, r^{\prime}\right)}$$

  • We can also write this using the modified backward probability $\hat{\beta}$ (the form you will see in papers)

$$\gamma(t, r) = \frac{\frac{1}{y_{t}^{S(r)}}\,\alpha(t, r)\,\hat{\beta}(t, r)}{\sum_{r^{\prime}} \frac{1}{y_{t}^{S(r^{\prime})}}\,\alpha\left(t, r^{\prime}\right) \hat{\beta}\left(t, r^{\prime}\right)}$$
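In the log domain, the posterior is simply a row-wise normalization of log α + log β; a small sketch using the functions above:

```python
import numpy as np

def posteriors(log_alpha, log_beta):
    """gamma(t, r): posterior of occupying target position r at time t,
    normalized over r at each time t (the formula above)."""
    log_gamma = log_alpha + log_beta
    log_norm = np.logaddexp.reduce(log_gamma, axis=1, keepdims=True)
    return np.exp(log_gamma - log_norm)
```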

The expected divergence

$$DIV = -\sum_{t} \sum_{s \in S_{0} \ldots S_{K-1}} P\left(s_{t}=s \mid \mathbf{S}, \mathbf{X}\right) \log Y\left(t, s_{t}=s\right)$$

$$DIV = -\sum_{t} \sum_{r} \gamma(t, r) \log y_{t}^{S(r)}$$
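In code, the expected divergence is a γ-weighted sum of log-outputs along the target sequence (a sketch with the same assumed shapes and names):

```python
import numpy as np

def expected_divergence(log_y, gamma, S):
    """DIV = -sum_t sum_r gamma(t, r) * log y_t^{S(r)}."""
    T, K = gamma.shape
    return -sum(gamma[t, r] * log_y[t, S[r]]
                for t in range(T) for r in range(K))
```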

  • The derivative of the divergence w.r.t. the output $Y_t$ of the net at any time $t$:

$$\nabla_{Y_{t}} DIV = \left[\frac{dDIV}{dy_{t}^{0}} \quad \frac{dDIV}{dy_{t}^{1}} \quad \cdots \quad \frac{dDIV}{dy_{t}^{L-1}}\right]$$

  • Compared to the Viterbi-based gradient, components are now non-zero for every symbol that occurs in the training instance (and zero for all other symbols)

  • Compute derivative

$$\frac{dDIV}{dy_{t}^{l}} = -\sum_{r: S(r)=l} \frac{d}{dy_{t}^{S(r)}} \gamma(t, r) \log y_{t}^{S(r)}$$

$$\frac{d}{dy_{t}^{S(r)}} \gamma(t, r) \log y_{t}^{S(r)} = \frac{\gamma(t, r)}{y_{t}^{S(r)}} + \frac{d\gamma(t, r)}{dy_{t}^{S(r)}} \log y_{t}^{S(r)}$$

  • Normally we ignore the second term (treating $\gamma(t, r)$ as a constant) and view this as a maximum-likelihood estimate
  • If a symbol occurs at multiple positions in the target sequence, the derivatives at all of those positions must be summed

$$\frac{dDIV}{dy_{t}^{l}} = -\sum_{r: S(r)=l} \frac{\gamma(t, r)}{y_{t}^{S(r)}}$$
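A sketch of this gradient computation; note that contributions from repeated symbols accumulate in the same column (names and shapes as assumed above):

```python
import numpy as np

def divergence_grad(y, gamma, S):
    """Gradient of the expected divergence w.r.t. the network outputs.

    y:     (T, L) symbol probabilities
    gamma: (T, K) posteriors over target positions
    S:     target symbol indices (length K)
    Only columns for symbols appearing in S are non-zero; repeats are summed.
    """
    grad = np.zeros_like(y)
    for r, l in enumerate(S):
        grad[:, l] -= gamma[:, r] / y[:, l]
    return grad
```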

Repetitive decoding problem

  • Cannot distinguish between an extended symbol and repetitions of the symbol
    • We have a decode: R R R O O O O O D
    • Is this the symbol sequence ROD or ROOD?
  • Solution: Introduce an explicit extra symbol which serves to separate discrete versions of a symbol
    • A “blank” (represented by “-”)
    • RRR---OO---DDD = ROD
    • RR-R---OO---D-DD = RRODD
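A sketch of the collapse rule with the blank: merge adjacent repeats first, then drop the blanks (the two examples above are reproduced in the comments):

```python
def ctc_collapse(path, blank='-'):
    """Collapse a frame-level decode: merge adjacent repeats, then remove blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:  # new, non-blank symbol
            out.append(sym)
        prev = sym
    return ''.join(out)

# ctc_collapse("RRR---OO---DDD")   -> "ROD"
# ctc_collapse("RR-R---OO---D-DD") -> "RRODD"
```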

Modified forward algorithm

  • A blank is mandatory between repetitions of a symbol but not required between distinct symbols
  • Skips are permitted across a blank, but only if the symbols on either side are different
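A sketch of the modified (blank-aware) forward recursion, assuming the blank has a known index in the output layer (names illustrative). The target is expanded to [-, S0, -, S1, -, ..., S_{K-1}, -], and a skip over a blank is only allowed when the symbols on either side differ:

```python
import numpy as np

def ctc_log_forward(log_y, S, blank=0):
    """Modified forward pass over the blank-expanded target sequence.

    log_y: (T, L) log probabilities; S: target symbol indices; blank: blank index.
    """
    ext = [blank]
    for s in S:
        ext += [s, blank]                    # blank-separated expansion, length 2K+1
    T, M = log_y.shape[0], len(ext)
    log_alpha = np.full((T, M), -np.inf)
    log_alpha[0, 0] = log_y[0, ext[0]]       # may start on the initial blank
    log_alpha[0, 1] = log_y[0, ext[1]]       # ... or on the first real symbol
    for t in range(1, T):
        for l in range(M):
            acc = log_alpha[t - 1, l]                                 # stay
            if l >= 1:
                acc = np.logaddexp(acc, log_alpha[t - 1, l - 1])      # advance by one
            if l >= 2 and ext[l] != blank and ext[l] != ext[l - 2]:
                acc = np.logaddexp(acc, log_alpha[t - 1, l - 2])      # skip the blank
            log_alpha[t, l] = acc + log_y[t, ext[l]]
    return log_alpha, ext
```

The total (log-)probability of the target sequence is then the log-sum of the last two entries of the final row (paths ending on the last symbol or on the trailing blank).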

Modified backward algorithm

  • The backward recursion is modified analogously: it runs over the same blank-expanded target, in reverse, with the same rule that a blank may only be skipped when the symbols on either side differ

Overall training procedure

  1. Setup the network

    • Typically many-layered LSTM
  2. Initialize all parameters of the network

    • Include a "blank" symbol in the vocabulary
  3. For each training instance

    • Pass the training instance through the network and obtain all symbol probabilities at each time
    • Construct the graph representing the specific symbol sequence in the instance
  4. Backward pass

    • Perform the forward-backward algorithm to compute $\alpha(t, r)$ and $\beta(t, r)$

    • Compute the derivative of the divergence $\nabla_{Y_{t}} DIV$ for each $Y_t$

    • $\nabla_{Y_{t}} DIV = \left[\frac{dDIV}{dy_{t}^{0}} \quad \frac{dDIV}{dy_{t}^{1}} \quad \cdots \quad \frac{dDIV}{dy_{t}^{L-1}}\right]$
    • $\frac{dDIV}{dy_{t}^{l}} = -\sum_{r: S(r)=l} \frac{\gamma(t, r)}{y_{t}^{S(r)}}$
    • Aggregate derivatives over minibatch and update parameters
  5. CTC: Connectionist Temporal Classification

    • The overall framework is referred to as CTC
    • The simple greedy decode (pick the most likely symbol at each time, then collapse) is in fact suboptimal: it finds the most likely time-synchronous output sequence, not the most likely order-synchronous one
  6. Actual decoding objective
    • $\hat{\mathbf{S}} = \underset{\mathbf{S}}{\operatorname{argmax}}\ \alpha_{\mathbf{S}}\left(S_{K-1}, T-1\right)$
    • Explicit computation of this would require evaluating an exponential number of symbol sequences
  • Instead, beam search is used; it is only needed at test time
  • It uses breadth-first search to build its search tree
    • At each step only the top-k candidates are kept for the next prediction (pruning); a simplified sketch follows below
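A much-simplified beam-search sketch: it keeps the top-k frame-level hypotheses at each step and collapses the winner at the end. A full CTC beam search additionally merges hypotheses that collapse to the same output prefix, which this sketch omits:

```python
import numpy as np

def simple_beam_search(log_y, k=3, blank=0):
    """Keep the top-k frame-level hypotheses at each time step (pruning),
    then collapse the best one into an output sequence."""
    T, L = log_y.shape
    beams = [((), 0.0)]                               # (frame-level path, log score)
    for t in range(T):
        candidates = []
        for path, score in beams:
            for sym in range(L):                      # extend every beam with every symbol
                candidates.append((path + (sym,), score + log_y[t, sym]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]                        # prune to the top-k
    best_path, best_score = beams[0]
    out, prev = [], None                              # collapse: merge repeats, drop blanks
    for sym in best_path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out, best_score
```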
