Non-linear extensions of linear Gaussian models.

EM for PCA

With complete information

  • If we knew $z$ for each $x$, estimating $A$ and $D$ would be simple

$$x = Az + E$$

$$P(x \mid z) = N(Az, D)$$

  • Given complete information $(x_1, z_1), (x_2, z_2), \ldots$

$$\underset{A, D}{\operatorname{argmax}} \sum_{(x, z)} \log P(x, z) = \underset{A, D}{\operatorname{argmax}} \sum_{(x, z)} \log P(x \mid z)$$

$$= \underset{A, D}{\operatorname{argmax}} \sum_{(x, z)} \log \frac{1}{\sqrt{(2\pi)^{d}|D|}} \exp\left(-0.5 (x - Az)^{T} D^{-1} (x - Az)\right)$$

  • We can get a closed-form solution: $A = XZ^{+}$, where $Z^{+}$ is the pseudo-inverse of $Z$ (a minimal sketch follows below)
  • But we don't have $Z$ => it is missing (latent)
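
A minimal NumPy sketch of this closed-form estimate on hypothetical noise-free toy data (dimensions chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, N = 5, 2, 100                       # hypothetical dimensions
Z = rng.standard_normal((K, N))           # known latent coordinates, one column per point
A_true = rng.standard_normal((d, K))
X = A_true @ Z                            # observations lying exactly on the plane

# Closed-form estimate: A = X Z^+, with Z^+ the Moore-Penrose pseudo-inverse of Z
A_hat = X @ np.linalg.pinv(Z)
print(np.allclose(A_hat, A_true))         # True in this noise-free toy example
```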

With incomplete information

  • Initialize the plane (i.e., the matrix $A$)
  • Complete the data by computing the appropriate $z$ for each $x$ given the current plane
    • $P(z \mid x; A)$ is a delta function, because the noise $E$ is orthogonal to the plane spanned by $A$
  • Re-estimate the plane using the completed $z$
  • Iterate until convergence (a minimal sketch follows below)
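
A minimal sketch of this EM-style iteration for PCA on hypothetical centered toy data; the E-step projection is the delta-posterior completion described above:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, N = 5, 2, 200
X = rng.standard_normal((d, K)) @ rng.standard_normal((K, N))   # toy data lying on a plane
X = X - X.mean(axis=1, keepdims=True)       # center the data

A = rng.standard_normal((d, K))             # initialize the plane
for _ in range(50):
    Z = np.linalg.pinv(A) @ X               # complete the data: project x onto the current plane
    A = X @ np.linalg.pinv(Z)               # re-estimate the plane from the completed data
# The columns of A now span the same K-dimensional principal subspace found by PCA
```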

Linear Gaussian Model

  • PCA assumes the noise is always orthogonal to the principal subspace (the hyperplane)
    • This is not always true
  • In a linear Gaussian model, the noise added to the output of the decoder ($Az$) can lie in any direction (it is uncorrelated across dimensions)
  • We want a generative model that can generate any point:
    • Take a Gaussian step on the hyperplane
    • Add full-rank Gaussian uncorrelated noise that is independent of the position on the hyperplane
      • Uncorrelated: diagonal covariance matrix
      • Direction of noise is unconstrained
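
A short sketch of this generative process with hypothetical dimensions; `A` and the diagonal covariance `D` are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
d, K = 5, 2                                   # hypothetical data / latent dimensions
A = rng.standard_normal((d, K))               # linear map onto the K-dimensional hyperplane
D = np.diag(rng.uniform(0.1, 0.5, size=d))    # diagonal (uncorrelated) noise covariance

z = rng.standard_normal(K)                    # Gaussian step on the hyperplane: z ~ N(0, I)
e = rng.multivariate_normal(np.zeros(d), D)   # full-rank noise, independent of z, any direction
x = A @ z + e                                 # generated point, not necessarily on the plane
```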

With complete information

$$x = Az + e$$

$$P(x \mid z) = N(Az, D)$$

  • Given complete information $X = [x_1, x_2, \ldots], \; Z = [z_1, z_2, \ldots]$

$$\underset{A, D}{\operatorname{argmax}} \sum_{(x, z)} \log P(x, z) = \underset{A, D}{\operatorname{argmax}} \sum_{(x, z)} \log P(x \mid z)$$

$$= \underset{A, D}{\operatorname{argmax}} \sum_{(x, z)} \log \frac{1}{\sqrt{(2\pi)^{d}|D|}} \exp\left(-0.5 (x - Az)^{T} D^{-1} (x - Az)\right)$$

$$= \underset{A, D}{\operatorname{argmax}} \sum_{(x, z)} \left(-\frac{1}{2} \log |D| - 0.5 (x - Az)^{T} D^{-1} (x - Az)\right)$$

  • We can also get a closed-form solution (a sketch follows below)
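
A sketch of these closed-form estimates, assuming $D$ is constrained to be diagonal; the function name and shapes are illustrative:

```python
import numpy as np

def fit_lgm_complete(X, Z):
    """Closed-form ML estimates of A and diagonal D from complete data.

    X: d x N observations, Z: K x N latent vectors (one column per instance).
    """
    A = X @ np.linalg.pinv(Z)               # A = X Z^+
    R = X - A @ Z                           # residuals x - Az
    D = np.diag((R * R).mean(axis=1))       # diagonal noise covariance estimate
    return A, D
```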

With incomplete information

Option 1

  • Complete the data in every possible way, weighted in proportion to $P(z \mid x)$ (which is Gaussian)
  • Compute the solution from the completed data

$$\underset{A, D}{\operatorname{argmax}} \sum_{x} \int_{-\infty}^{\infty} P(z \mid x) \left(-\frac{1}{2} \log |D| - 0.5 (x - Az)^{T} D^{-1} (x - Az)\right) dz$$

  • This yields the same form of solution as before, with expectations under $P(z \mid x)$ replacing the missing $z$ (the EM approach)

Option 2

  • Complete the data by drawing samples of $z$ from $P(z \mid x)$ (see the sketch below)
  • Compute the solution from the completed data
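
For the linear Gaussian model the posterior $P(z \mid x)$ is itself Gaussian with a standard closed form, so Option 2 can be sketched as follows (the function name and shapes are illustrative):

```python
import numpy as np

def sample_posterior_z(x, A, D, rng, n_samples=1):
    """Draw samples from P(z | x) for the linear Gaussian model x = Az + e.

    With z ~ N(0, I) and e ~ N(0, D), the posterior is Gaussian:
      Sigma_z = (I + A^T D^{-1} A)^{-1},   mu_z = Sigma_z A^T D^{-1} x
    """
    K = A.shape[1]
    D_inv = np.linalg.inv(D)
    Sigma_z = np.linalg.inv(np.eye(K) + A.T @ D_inv @ A)
    mu_z = Sigma_z @ A.T @ D_inv @ x
    return rng.multivariate_normal(mu_z, Sigma_z, size=n_samples)
```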

The intuition behind the Linear Gaussian Model

  • $z \sim N(0, I)$ => $Az$
    • The linear transform stretches and rotates the $K$-dimensional input space onto a $K$-dimensional hyperplane in the data space
  • $x = Az + e$
    • Add Gaussian noise to produce points that aren't necessarily on the plane

  • The posterior probability $P(z \mid x)$ gives you the locations of all the points on the plane that could have generated $x$, and their probabilities

  • What about data that are not Gaussian-distributed close to a plane?
    • Linear Gaussian Models fail
  • How can we model such data?

Non-linear Gaussian Model

  • $f(z)$ is a non-linear function that produces a curved manifold
    • Like the decoder of a non-linear AE
  • Generating process
    • Draw a sample $z$ from a standard Gaussian $N(0, I)$
    • Transform $z$ by $f(z)$
      • This places the sample on the curved manifold
    • Add uncorrelated Gaussian noise to get the final observation

  • Key requirements
    • Identifying the dimensionality $K$ of the curved manifold
    • Having a function that can transform the (linear) $K$-dimensional input space (the space of $z$) to the desired $K$-dimensional manifold in the data space
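
A sketch of this generating process, using a small hypothetical MLP in place of $f(z; \theta)$ (PyTorch, sizes chosen purely for illustration):

```python
import torch
import torch.nn as nn

K, d = 2, 5                                    # hypothetical latent / data dimensions

# f(z; theta): a small MLP "decoder" whose image is a curved K-dimensional manifold
f = nn.Sequential(nn.Linear(K, 64), nn.Tanh(), nn.Linear(64, d))
log_d = torch.zeros(d)                         # log-diagonal of the noise covariance D

z = torch.randn(K)                             # draw z from a standard Gaussian N(0, I)
x_manifold = f(z)                              # transform z: places the sample on the manifold
x = x_manifold + torch.exp(0.5 * log_d) * torch.randn(d)   # add uncorrelated Gaussian noise
```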

With complete data

$$x = f(z; \theta) + e$$

$$P(x \mid z) = N(f(z; \theta), D)$$

  • Given complete information $X = [x_1, x_2, \ldots], \; Z = [z_1, z_2, \ldots]$

$$\theta^{\star}, D^{\star} = \underset{\theta, D}{\operatorname{argmax}} \sum_{(x, z)} \log P(x, z) = \underset{\theta, D}{\operatorname{argmax}} \sum_{(x, z)} \log P(x \mid z)$$

$$= \underset{\theta, D}{\operatorname{argmax}} \sum_{(x, z)} \log \frac{1}{\sqrt{(2\pi)^{d}|D|}} \exp\left(-0.5 (x - f(z; \theta))^{T} D^{-1} (x - f(z; \theta))\right)$$

$$= \underset{\theta, D}{\operatorname{argmax}} \sum_{(x, z)} \left(-\frac{1}{2} \log |D| - 0.5 (x - f(z; \theta))^{T} D^{-1} (x - f(z; \theta))\right)$$

  • There isn't a nice closed-form solution, but we can learn the parameters using backpropagation (a sketch follows below)
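
A sketch of learning $\theta$ and a diagonal $D$ by backpropagation on hypothetical complete data $(X, Z)$; the network and optimizer settings are illustrative:

```python
import torch
import torch.nn as nn

K, d, N = 2, 5, 256
Z = torch.randn(N, K)                          # hypothetical complete latent data
X = torch.randn(N, d)                          # hypothetical observations paired with Z

f = nn.Sequential(nn.Linear(K, 64), nn.Tanh(), nn.Linear(64, d))    # f(z; theta)
log_d = nn.Parameter(torch.zeros(d))           # learn the diagonal of D in log space
opt = torch.optim.Adam(list(f.parameters()) + [log_d], lr=1e-2)

for _ in range(200):
    resid = X - f(Z)                           # x - f(z; theta)
    # Per-instance negative log-likelihood (up to constants):
    #   0.5 * log|D| + 0.5 * (x - f(z))^T D^{-1} (x - f(z))
    nll = 0.5 * log_d.sum() + 0.5 * (resid * resid / torch.exp(log_d)).sum(dim=1)
    loss = nll.mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```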

Incomplete data

  • The posterior probability is given by

$$P(z \mid x) = \frac{P(x \mid z)\, P(z)}{P(x)}$$

  • The denominator

$$P(x) = \int_{-\infty}^{\infty} N(x; f(z; \theta), D)\, N(z; 0, I)\, dz$$

  • This integral does not have a closed-form solution
    • We try to approximate it instead

  • We approximate $P(z \mid x)$ as a Gaussian:

$$P(z \mid x) \approx Q(z, x) = N(z; \mu(x), \Sigma(x))$$

  • Sample $z$ from $N(z; \mu(x; \varphi), \Sigma(x; \varphi))$ for each training instance
    • Draw a $K$-dimensional vector $\varepsilon$ from $N(0, I)$
    • Compute $z = \mu(x; \varphi) + \Sigma(x; \varphi)^{0.5} \varepsilon$
  • Reestimate θ\theta from the entire “complete” data
    • Using backpropagation

$$L(\theta, D) = \sum_{(x, z)} \left( \log |D| + (x - f(z; \theta))^{T} D^{-1} (x - f(z; \theta)) \right)$$

$$\theta^{\star}, D^{\star} = \underset{\theta, D}{\operatorname{argmin}}\, L(\theta, D)$$
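
A sketch of the reparameterized sampling step and the loss $L(\theta, D)$ above, assuming a diagonal $\Sigma(x; \varphi)$ parameterized by its log-variance (function names are illustrative):

```python
import torch

def reparameterized_sample(mu, log_var):
    """z = mu(x) + Sigma(x)^0.5 * eps with eps ~ N(0, I); Sigma is assumed diagonal.

    mu, log_var: (batch, K) outputs of the approximating network Q(z, x).
    """
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def reconstruction_loss(x, x_hat, log_d):
    """L(theta, D): sum over instances of log|D| + (x - f(z))^T D^{-1} (x - f(z))."""
    resid = x - x_hat
    return (log_d.sum() + (resid * resid / torch.exp(log_d)).sum(dim=1)).sum()
```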

  • Estimate $\varphi$ using the entire "complete" data
    • Recall that $Q(z, x) = N(z; \mu(x; \varphi), \Sigma(x; \varphi))$ must approximate $P(z \mid x)$ as closely as possible
    • Define a divergence between $Q(z, x)$ and $P(z \mid x)$, typically the KL divergence (see the sketch below)
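
In the standard VAE formulation this divergence is the KL divergence; after rearranging the objective, the term actually computed is the KL between $Q(z, x)$ and the prior $N(0, I)$, which has a closed form for a diagonal Gaussian. A minimal sketch:

```python
import torch

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the batch.

    Closed form per instance: 0.5 * sum_k (exp(log_var_k) + mu_k^2 - 1 - log_var_k)
    """
    return 0.5 * (torch.exp(log_var) + mu ** 2 - 1.0 - log_var).sum(dim=1).sum()
```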

Variational AutoEncoder

  • Non-linear extensions of linear Gaussian models
  • $f(z; \theta)$ is generally modelled by a neural network (the decoder)
  • $\mu(x; \varphi)$ and $\Sigma(x; \varphi)$ are generally modelled by a common network (the encoder) with two output heads (a sketch follows below)
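
A minimal sketch of this architecture (sizes and layer choices are illustrative), with a shared encoder trunk, two output heads for $\mu(x; \varphi)$ and $\log \Sigma(x; \varphi)$, and a decoder for $f(z; \theta)$:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal sketch: shared encoder trunk with two heads (mu, log-variance) and a decoder."""

    def __init__(self, d=5, K=2, hidden=64):          # hypothetical sizes
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(d, hidden), nn.Tanh())    # common encoder body
        self.mu_head = nn.Linear(hidden, K)            # mu(x; phi)
        self.log_var_head = nn.Linear(hidden, K)       # log of the diagonal of Sigma(x; phi)
        self.decoder = nn.Sequential(nn.Linear(K, hidden), nn.Tanh(),
                                     nn.Linear(hidden, d))              # f(z; theta)

    def forward(self, x):
        h = self.trunk(x)
        mu, log_var = self.mu_head(h), self.log_var_head(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)        # reparameterized sample
        return self.decoder(z), mu, log_var
```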

  • However, the VAE cannot be used to compute the likelihood of the data
    • $P(x; \theta)$ is intractable
  • Latent space
    • The latent space $z$ often captures the underlying structure of the data $x$ in a smooth manner
    • Varying $z$ continuously in different directions results in plausible variations of the generated output (see the sketch below)
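
A small illustrative sketch of such a latent traversal; the decoder here is an untrained stand-in for a trained $f(z; \theta)$:

```python
import torch
import torch.nn as nn

K, d = 2, 5                                     # hypothetical dimensions
decoder = nn.Sequential(nn.Linear(K, 64), nn.Tanh(), nn.Linear(64, d))  # stands in for f(z; theta)

# Vary one latent coordinate continuously and decode; with a trained decoder the outputs
# change smoothly, illustrating the structure captured by the latent space.
z = torch.zeros(1, K)
for t in torch.linspace(-3.0, 3.0, steps=7):
    z[0, 0] = t
    x_hat = decoder(z)
```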
