Logistic regression

This is the perceptron with a sigmoid activation
- It actually computes the probability that the input belongs to class 1
- Decision boundaries may be obtained by comparing the probability to a threshold
- These boundaries will be lines (hyperplanes in higher dimensions)
- The sigmoid perceptron is a linear classifier
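
The points above can be sketched in a few lines of NumPy. This is only an illustration: the weights `w`, bias `w0`, and sample inputs are made-up values, not parameters from the notes. Thresholding the probability at $p$ is the same as testing $w_0 + w^T X > \log\frac{p}{1-p}$, which is why the boundary is a hyperplane.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, w0):
    """P(class 1 | x) for each row of X under the sigmoid perceptron."""
    return sigmoid(X @ w + w0)

def predict(X, w, w0, threshold=0.5):
    """Class decision: 1 where the probability exceeds the threshold."""
    return (predict_proba(X, w, w0) > threshold).astype(int)

# Hypothetical parameters: at threshold 0.5 the boundary is the line x1 + x2 = 1.
w = np.array([1.0, 1.0])
w0 = -1.0
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]])

probs = predict_proba(X, w, w0)   # the last point lies exactly on the boundary
labels = predict(X, w, w0)
```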

Estimating the model

Given: Training data: $\left(X_{1}, y_{1}\right),\left(X_{2}, y_{2}\right), \ldots,\left(X_{N}, y_{N}\right)$
$X$ are vectors, $y$ are binary class values, encoded as $\pm 1$ in the formulas below (with 0/1 labels, map $y \rightarrow 2y-1$)
Total probability of data

$\begin{array}{l} P\left(\left(X_{1}, y_{1}\right),\left(X_{2}, y_{2}\right), \ldots,\left(X_{N}, y_{N}\right)\right)= \prod_{i} P\left(X_{i}, y_{i}\right) \\ =\prod_{i} P\left(y_{i} \mid X_{i}\right) P\left(X_{i}\right)=\prod_{i} \frac{1}{1+e^{-y_{i}\left(w_{0}+w^{T} X_{i}\right)}} P\left(X_{i}\right) \end{array}$

Likelihood

$P(\text {Training data})=\prod_{i} \frac{1}{1+e^{-y_{i}\left(w_{0}+w^{T} X_{i}\right)}} P\left(X_{i}\right)$

Log likelihood

$\begin{array}{l} \log P(\text {Training data})= \sum_{i} \log P\left(X_{i}\right)-\sum_{i} \log \left(1+e^{-y_{i}\left(w_{0}+w^{T} X_{i}\right)}\right) \end{array}$

Maximum Likelihood Estimate

$w_{0}, w=\underset{w_{0}, w}{\operatorname{argmax}} \log P(\text {Training data})$

Since $\sum_{i} \log P\left(X_{i}\right)$ does not depend on the parameters, this equals (note argmin rather than argmax)

$w_{0}, w=\underset{w_{0}, w}{\operatorname{argmin}} \sum_{i} \log \left(1+e^{-y_{i}\left(w_{0}+w^{T} X_{i}\right)}\right)$

Identical to minimizing the KL divergence between the desired output and actual output $\frac{1}{1+e^{-\left(w_{0}+w^{T} X_{i}\right)}}$
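
As a rough sketch of how this minimization might be carried out, the code below runs plain gradient descent on the loss $\sum_{i} \log \left(1+e^{-y_{i}\left(w_{0}+w^{T} X_{i}\right)}\right)$. It assumes labels encoded as $\pm 1$, which is what the $y_i$ in the exponent requires. The synthetic data, learning rate, step count, and function names are illustrative choices, not part of the notes.

```python
import numpy as np

def logistic_loss(w0, w, X, y):
    """Sum over i of log(1 + exp(-y_i (w0 + w^T x_i))), with y_i in {-1, +1}."""
    margins = y * (w0 + X @ w)
    return np.sum(np.logaddexp(0.0, -margins))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Gradient descent on the logistic loss (gradient averaged over samples)."""
    n, d = X.shape
    w0, w = 0.0, np.zeros(d)
    for _ in range(steps):
        margins = y * (w0 + X @ w)
        # d/dz log(1 + exp(-z)) = -1 / (1 + exp(z)), chained through z = y * (w0 + w^T x)
        coeff = -y / (1.0 + np.exp(margins))
        w0 -= lr * coeff.mean()
        w -= lr * (X.T @ coeff) / n
    return w0, w

# Assumed toy data: two overlapping Gaussian blobs, labelled -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y = np.concatenate([-np.ones(100), np.ones(100)])

w0, w = fit_logistic(X, y)
initial_loss = logistic_loss(0.0, np.zeros(2), X, y)
final_loss = logistic_loss(w0, w, X, y)
accuracy = np.mean(np.sign(w0 + X @ w) == y)
```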

MLP

Separable case

The rest of the network (everything below the output layer) may be viewed as a transformation that maps classes that are not linearly separable in the input space onto linearly separable features
- We can now attach any linear classifier above it for perfect classification
- Need not be a perceptron
- Could even train an SVM on top of the features!
For insufficient structures (networks too small to separate the classes exactly), the network will still attempt to transform the inputs to linearly separable features
- Will fail to separate exactly, but will try to minimize error
The network up to the second-to-last layer is a non-linear function $f(X)$ that converts the input space of $X$ into a feature space where the classes are maximally linearly separable
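
A small worked example of this view, using XOR: the classes are not linearly separable in the input space, but a two-unit hidden layer maps them to features that a single linear threshold separates. The hidden weights below are hand-set for illustration (not learned, and not taken from the notes).

```python
import numpy as np

def step(z):
    """Threshold activation: 1 where z > 0, else 0."""
    return (z > 0).astype(float)

# XOR: no single line in the input space separates the two classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Hand-set hidden layer f(X): unit 1 fires for "at least one input on",
# unit 2 fires for "both inputs on". Each column holds one unit's weights.
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])
features = step(X @ W_hidden + b_hidden)

# In feature space one linear threshold is enough: h1 - h2 > 0.5.
w_out = np.array([1.0, -1.0])
b_out = -0.5
predictions = step(features @ w_out + b_out).astype(int)
```

Any other linear classifier trained on `features` would do equally well here, since the transformed points are linearly separable.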

Lower layers

Manifold hypothesis: For separable classes, the classes are linearly separable on a non-linear manifold
Layers sequentially “straighten” the data manifold
The “feature extraction” layer transforms the data such that the posterior probability may now be modelled by a logistic

Weight as a template

In high dimensional space, all vectors are more or less the same length
- Which means all $x$ lie approximately on the surface of a sphere
The perceptron fires if the input is within a specified angle of the weight
- Represents a convex region on the surface of the sphere!
- The network is a Boolean function over these regions
Neuron fires if the input vector is close enough to the weight vector
- If the input pattern matches the weight pattern closely enough
The perceptron is a correlation filter!
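
The sketch below illustrates both claims under stated assumptions: inputs are drawn with i.i.d. Gaussian components (an assumption made here to show the length concentration), all vectors are normalized to unit length, and the 60-degree firing angle is an arbitrary choice. On the unit sphere, the test $w^T x \geq T$ is the same as requiring the angle between $x$ and $w$ to be at most $\arccos T$.

```python
import numpy as np

def fires(x, w, threshold):
    """Threshold perceptron: fires when the correlation w^T x reaches the threshold."""
    return bool(w @ x >= threshold)

def angle_between(x, w):
    """Angle in radians between the input and the weight vector."""
    cos = (w @ x) / (np.linalg.norm(w) * np.linalg.norm(x))
    return np.arccos(np.clip(cos, -1.0, 1.0))

rng = np.random.default_rng(0)
dim = 1000

# In high dimensions, vector lengths concentrate around a common value.
norms = np.linalg.norm(rng.normal(size=(200, dim)), axis=1)
relative_spread = norms.std() / norms.mean()

# Unit-length weight "template" and an assumed firing angle of 60 degrees.
w = rng.normal(size=dim)
w /= np.linalg.norm(w)
max_angle = np.deg2rad(60.0)
threshold = np.cos(max_angle)

# A noisy copy of the template versus an unrelated random pattern.
match = w + 0.5 * rng.normal(size=dim) / np.sqrt(dim)
match /= np.linalg.norm(match)
unrelated = rng.normal(size=dim)
unrelated /= np.linalg.norm(unrelated)

match_fires = fires(match, w, threshold)
unrelated_fires = fires(unrelated, w, threshold)
```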

Autoencoder

The lowest layers of a network detect significant features in the signal
The signal could be (partially) reconstructed using these features
- Will retain all the significant components of the signal

Simplest autoencoder

With a single linear hidden unit, this is just PCA!
The autoencoder finds the direction of maximum energy
Simply varying the hidden representation will result in an output that lies along the major axis
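
One way to check this numerically is sketched below, assuming a single linear hidden unit with tied encoder and decoder weights ($h = w^T x$, $\hat{x} = w h$) trained by gradient descent on squared reconstruction error. The tied weights, toy data, learning rate, and step count are all assumptions of this sketch, not details from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean 2-D data stretched along an axis 30 degrees from the x-axis.
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = (rng.normal(size=(500, 2)) * np.array([2.0, 0.5])) @ R.T
X -= X.mean(axis=0)
C = X.T @ X / len(X)          # sample covariance

# Loss L(w) = mean ||x - w w^T x||^2 = tr(C) - 2 w^T C w + (w^T w)(w^T C w).
w = np.array([0.1, 0.1])
lr = 0.01
for _ in range(2000):
    Cw = C @ w
    grad = -4.0 * Cw + 2.0 * w * (w @ Cw) + 2.0 * (w @ w) * Cw
    w -= lr * grad

# Compare with the first principal component (top eigenvector of C).
eigvals, eigvecs = np.linalg.eigh(C)
pca_axis = eigvecs[:, -1]
alignment = abs(w @ pca_axis) / np.linalg.norm(w)

# Varying the hidden value moves the decoder output along that single axis.
hidden_values = np.linspace(-3.0, 3.0, 7)
generated = np.outer(hidden_values, w)
```

Here `alignment` is the absolute cosine between the learned weight and the PCA direction; the sign of `w` is arbitrary, which is why the absolute value is taken.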

Terminology

Encoder
- The “Analysis” net which computes the hidden representation
Decoder
- The “Synthesis” net which recomposes the data from the hidden representation

Nonlinearity

When the hidden layer has a linear activation, the decoder represents the best linear manifold to fit the data
- Varying the hidden value will move along this linear manifold
When the hidden layer has non-linear activation, the net performs nonlinear PCA
- The decoder represents the best non-linear manifold to fit the data
- Varying the hidden value will move along this non-linear manifold

The model is specific to the training data
- Varying the hidden layer value only generates data along the learned manifold
- Any input will result in an output along the learned manifold
- But may not generalize beyond the manifold
  - Unseen inputs may behave in unintuitive ways, since nothing constrains the network away from the training data
  - The decoder can only generate data on the manifold that the training data lie on
This also makes it an excellent “generator” of the distribution of the training data

Dictionary-based techniques

The decoder represents a source-specific generative dictionary
- Exciting it will produce typical data from the source!

Signal separation

Separation: Identify the combination of entries from both dictionaries that compose the mixed signal

Given mixed signal and source dictionaries, find excitation that best recreates mixed signal
- Simple backpropagation
Intermediate results are separated signals
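
A minimal sketch of this procedure follows, with one loud simplification: each source "dictionary" is a fixed linear decoder (a random matrix standing in for a trained decoder network), so backpropagating to the excitations reduces to gradient descent on a least-squares objective. The names `D1`, `D2`, `h1`, `h2` and all numeric values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k = 16, 2

# Assumed stand-ins for trained decoders: one linear dictionary per source.
D1 = rng.normal(size=(dim, k))
D2 = rng.normal(size=(dim, k))

# Build a mixed signal from known excitations so the result can be checked.
true_s1 = D1 @ np.array([1.0, -0.5])
true_s2 = D2 @ np.array([0.3, 2.0])
mixed = true_s1 + true_s2

# Dictionaries stay fixed; only the excitations are updated, by gradient
# descent on 0.5 * ||D1 h1 + D2 h2 - mixed||^2.
h1 = np.zeros(k)
h2 = np.zeros(k)
lr = 1.0 / np.linalg.norm(np.hstack([D1, D2]), 2) ** 2   # safe step size
for _ in range(5000):
    residual = D1 @ h1 + D2 @ h2 - mixed
    h1 -= lr * (D1.T @ residual)
    h2 -= lr * (D2.T @ residual)

# The intermediate decoder outputs are the separated signals.
sep1 = D1 @ h1
sep2 = D2 @ h2
```

Exact recovery here relies on the combined dictionary columns being linearly independent; with non-linear decoders the same loop would use the decoders' gradients with respect to their inputs instead of `D1.T` and `D2.T`.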
