Training Hopfield nets

Geometric approach

  • Behavior of $E(\mathbf{y})=\mathbf{y}^{T} \mathbf{W} \mathbf{y}$ with $\mathbf{W}=\mathbf{Y} \mathbf{Y}^{T}-N_{p} \mathbf{I}$ is identical to behavior with $\mathbf{W}=\mathbf{Y} \mathbf{Y}^{T}$

    • Energy landscape only differs by an additive constant
    • Gradients and locations of minima remain the same (both matrices have the same eigenvectors)
  • Since $\mathbf{y}^{T}\left(\mathbf{Y} \mathbf{Y}^{T}-N_{p} \mathbf{I}\right) \mathbf{y}=\mathbf{y}^{T} \mathbf{Y} \mathbf{Y}^{T} \mathbf{y}-N N_{p}$

  • We use $\mathbf{y}^{T} \mathbf{Y} \mathbf{Y}^{T} \mathbf{y}$ for the analysis

  • A pattern $\mathbf{y}_{p}$ is stored if:

    • $\operatorname{sign}\left(\mathbf{W} \mathbf{y}_{p}\right)=\mathbf{y}_{p}$ for all target patterns
  • Training: design $\mathbf{W}$ such that this holds

  • Simple solution: $\mathbf{y}_{p}$ is an eigenvector of $\mathbf{W}$ (a quick check of this condition is sketched below)
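
As a quick concrete check, here is a minimal NumPy sketch (assuming a couple of arbitrary bipolar example patterns) that builds the Hebbian weights $\mathbf{W}=\mathbf{Y}\mathbf{Y}^{T}-N_{p}\mathbf{I}$ and tests whether $\operatorname{sign}(\mathbf{W}\mathbf{y}_{p})=\mathbf{y}_{p}$ holds:

```python
import numpy as np

# Two arbitrary bipolar (+1/-1) patterns, one per column: N = 6 neurons, N_p = 2 patterns.
Y = np.array([[ 1,  1],
              [ 1, -1],
              [-1,  1],
              [-1, -1],
              [ 1,  1],
              [-1, -1]])

W = Y @ Y.T                 # Hebbian weights Y Y^T
np.fill_diagonal(W, 0)      # zeroing the diagonal is the same as subtracting N_p I for +/-1 patterns

# Storage condition: sign(W y_p) = y_p for every target pattern.
for p in range(Y.shape[1]):
    y_p = Y[:, p]
    print(f"pattern {p} stored:", np.array_equal(np.sign(W @ y_p), y_p))
```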

Storing K orthogonal patterns

  • Let $\mathbf{Y}=\left[\mathbf{y}_{1}\, \mathbf{y}_{2} \ldots \mathbf{y}_{K}\right]$
    • $\mathbf{W}=\mathbf{Y} \Lambda \mathbf{Y}^{T}$
    • $\lambda_{1}, \ldots, \lambda_{K}$ are positive
    • For $\lambda_{1}=\lambda_{2}=\cdots=\lambda_{K}=1$ this is exactly the Hebbian rule (sketched after this list)
  • If all $N$ orthogonal patterns are stored (i.e. $K=N$), any pattern $\mathbf{y}$ can be written as
    • $\mathbf{y}=a_{1} \mathbf{y}_{1}+a_{2} \mathbf{y}_{2}+\cdots+a_{N} \mathbf{y}_{N}$
    • $\mathbf{W} \mathbf{y}=a_{1} \mathbf{W} \mathbf{y}_{1}+a_{2} \mathbf{W} \mathbf{y}_{2}+\cdots+a_{N} \mathbf{W} \mathbf{y}_{N}=\mathbf{y}$
  • All patterns are stable
    • Remembers everything
    • Completely useless network
  • Even if we store fewer than $N$ patterns
    • Let $\mathbf{Y}=\left[\mathbf{y}_{1}\, \mathbf{y}_{2} \ldots \mathbf{y}_{K}\, \mathbf{r}_{K+1}\, \mathbf{r}_{K+2} \ldots \mathbf{r}_{N}\right]$
    • $\mathbf{W}=\mathbf{Y} \Lambda \mathbf{Y}^{T}$
    • $\mathbf{r}_{K+1}, \mathbf{r}_{K+2}, \ldots, \mathbf{r}_{N}$ are orthogonal to $\mathbf{y}_{1}, \mathbf{y}_{2}, \ldots, \mathbf{y}_{K}$
    • $\lambda_{1}=\lambda_{2}=\cdots=\lambda_{K}=1$
    • Problems arise because the eigenvalues are all 1.0
      • Ensures stationarity of vectors in the subspace
      • All stored patterns are equally important
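
A minimal sketch of this construction, assuming the stored patterns are taken as columns of a Hadamard matrix so that they are mutually orthogonal (the names `H2`, `Lam`, `K` are illustrative):

```python
import numpy as np

# N = 8 mutually orthogonal bipolar patterns: the columns of a Hadamard matrix.
H2 = np.array([[1, 1], [1, -1]])
H8 = np.kron(np.kron(H2, H2), H2)       # shape (8, 8)

K = 3                                    # store only the first K patterns
Y = H8[:, :K]                            # (N, K)
Lam = np.eye(K)                          # lambda_1 = ... = lambda_K = 1
W = Y @ Lam @ Y.T                        # W = Y Lambda Y^T; with Lambda = I this is the Hebbian rule

# Each stored pattern is an eigenvector of W (eigenvalue N = 8 for the
# unnormalized +/-1 columns), so sign(W y_p) = y_p and it is stationary.
for p in range(K):
    y_p = Y[:, p]
    print(np.array_equal(np.sign(W @ y_p), y_p))    # True for every stored pattern
```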

General (nonorthogonal) vectors

  • $w_{j i}=\sum_{p \in\{p\}} y_{i}^{p} y_{j}^{p}$
  • The maximum number of stationary patterns is actually exponential in $N$ (McEliece and Posner, 1984)
  • For a specific set of $K$ patterns, we can always build a network for which all $K$ patterns are stable, provided $K \le N$
    • But this may come with many “parasitic” memories (illustrated in the sketch below)
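
The parasitic memories are easy to see by brute force for a tiny network. A sketch (with three made-up non-orthogonal patterns; enumerating all $2^N$ states is only feasible for very small $N$):

```python
import numpy as np
from itertools import product

# Three non-orthogonal bipolar patterns to store, one per column (N = 5).
Y = np.array([[ 1,  1,  1],
              [ 1, -1,  1],
              [ 1,  1, -1],
              [-1,  1,  1],
              [-1, -1, -1]])
N = Y.shape[0]

W = Y @ Y.T                  # w_ji = sum_p y_i^p y_j^p  (Hebbian rule)
np.fill_diagonal(W, 0)       # no self-connections

# A state is stationary if no single-bit flip strictly lowers the energy,
# i.e. every unit agrees in sign with its field (ties allowed).
stationary = [y for y in product([-1, 1], repeat=N)
              if np.all(np.array(y) * (W @ np.array(y)) >= 0)]

print(f"{len(stationary)} stationary states out of {2**N}")
# Besides the three stored patterns and their negations, extra "parasitic"
# states (e.g. mixtures of the stored patterns) also turn out to be stationary.
```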

Optimization

  • Energy function
    • $E=-\frac{1}{2} \mathbf{y}^{T} \mathbf{W} \mathbf{y}-\mathbf{b}^{T} \mathbf{y}$
    • This must be maximally low for target patterns
    • Must be maximally high for all other patterns
      • So that they are unstable and evolve into one of the target patterns
  • Estimate $\mathbf{W}$ such that
    • $E$ is minimized for $\mathbf{y}_{1},\ldots,\mathbf{y}_{P}$
    • $E$ is maximized for all other $\mathbf{y}$
  • Minimize total energy of target patterns
    • $E(\mathbf{y})=-\frac{1}{2} \mathbf{y}^{T} \mathbf{W} \mathbf{y}, \quad \widehat{\mathbf{W}}=\underset{\mathbf{W}}{\operatorname{argmin}} \sum_{\mathbf{y} \in \mathbf{Y}_{P}} E(\mathbf{y})$
    • However, this might also pull down all the neighboring states
  • Maximize the total energy of all non-target patterns
    • $E(\mathbf{y})=-\frac{1}{2} \mathbf{y}^{T} \mathbf{W} \mathbf{y}$
    • $\widehat{\mathbf{W}}=\underset{\mathbf{W}}{\operatorname{argmin}}\left(\sum_{\mathbf{y} \in \mathbf{Y}_{P}} E(\mathbf{y})-\sum_{\mathbf{y} \notin \mathbf{Y}_{P}} E(\mathbf{y})\right)$
  • Simple gradient descent (sketched after this list)
    • $\mathbf{W}=\mathbf{W}+\eta\left(\sum_{\mathbf{y} \in \mathbf{Y}_{P}} \mathbf{y} \mathbf{y}^{T}-\sum_{\mathbf{y} \notin \mathbf{Y}_{P}} \mathbf{y} \mathbf{y}^{T}\right)$
    • Lowers the energy at the target patterns
    • Raises the energy at all non-target patterns
      • Do we need to raise everything?
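
A sketch of one such gradient step in its literal form, assuming the non-target sum runs over every remaining bipolar state (only feasible for tiny $N$; the function and variable names are illustrative):

```python
import numpy as np
from itertools import product

def gradient_step(W, targets, eta=0.01):
    """One step of W <- W + eta * (sum_{y in Y_P} y y^T - sum_{y not in Y_P} y y^T).

    targets: (N, K) array of bipolar target patterns (Y_P), one per column.
    The non-target sum runs over all remaining 2^N - K bipolar states.
    """
    N = W.shape[0]
    target_set = {tuple(int(v) for v in targets[:, p]) for p in range(targets.shape[1])}

    pos = sum(np.outer(targets[:, p], targets[:, p]) for p in range(targets.shape[1]))
    neg = sum(np.outer(y, y) for y in product([-1, 1], repeat=N) if y not in target_set)

    W = W + eta * (pos - neg)
    np.fill_diagonal(W, 0)       # keep zero self-connections
    return W

# Example: N = 4 neurons, two target patterns.
targets = np.array([[ 1,  1],
                    [ 1, -1],
                    [-1,  1],
                    [-1, -1]])
W = gradient_step(np.zeros((4, 4)), targets, eta=0.1)
# The non-target sum has vastly more terms than the target sum, which is
# exactly why the next section asks whether we need to raise everything.
```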

Raise negative class

  • Focus on raising the valleys
    • If you raise every valley, eventually they’ll all move up above the target patterns, and many will even vanish
  • How do you identify the valleys for the current WW?
    • Initialize the network randomly and let it evolve
    • It will settle in a valley

  • Should we randomly sample valleys?
    • Are all valleys equally important?
    • Major requirement: memories must be stable
      • They must be broad valleys
  • Solution: initialize the network at valid memories and let it evolve (sketched after this list)
    • It will settle in a valley
    • If this is not the target pattern, raise it
  • What if there's another target pattern down the valley?
    • no need to raise the entire surface, or even every valley
      • Raise the neighborhood of each target memory
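
A simplified sketch of this procedure (illustrative names; the "evolve" step here is just a fixed number of asynchronous sign updates standing in for letting the network settle):

```python
import numpy as np

def evolve(W, y, n_steps=200, rng=None):
    """Run asynchronous sign updates: a crude stand-in for letting the net settle into a valley."""
    if rng is None:
        rng = np.random.default_rng(0)
    y = y.copy()
    for _ in range(n_steps):
        i = rng.integers(len(y))
        y[i] = 1 if W[i] @ y >= 0 else -1
    return y

def train_by_raising_valleys(targets, n_epochs=50, eta=0.01, seed=0):
    """targets: (N, K) bipolar memories. Lower the energy at each memory and
    raise the valley the net actually falls into when started at that memory."""
    rng = np.random.default_rng(seed)
    N, K = targets.shape
    W = np.zeros((N, N))
    for _ in range(n_epochs):
        for p in range(K):
            y_p = targets[:, p]
            y_valley = evolve(W, y_p, rng=rng)     # settle into a nearby valley
            # If the valley is the target itself, the two outer products cancel
            # and nothing changes; otherwise that spurious valley gets raised.
            W += eta * (np.outer(y_p, y_p) - np.outer(y_valley, y_valley))
            np.fill_diagonal(W, 0)
    return W
```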

Storing more than N patterns

  • Visible neurons
    • The neurons that store the actual patterns of interest
  • Hidden neurons
    • The neurons that only serve to increase the capacity but whose actual values are not important

  • The maximum number of patterns the net can store is bounded by the width $N$ of the patterns
  • So let's pad the patterns with $K$ "don't care" bits
    • The new width of the patterns is $N+K$
    • Now we can store $N+K$ patterns!
  • Taking advantage of don’t care bits
    • Simply setting the don't-care bits randomly and using the usual training and recall strategies for Hopfield nets should work (a padding sketch follows this list)
    • However, to exploit it properly, it helps to view the Hopfield net differently: as a probabilistic machine
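
A sketch of the simple random-setting strategy for the don't-care bits (an illustrative helper; one pattern per column):

```python
import numpy as np

def pad_with_dont_cares(patterns, K, seed=0):
    """Append K random +/-1 "don't care" rows to an (N, P) array of bipolar patterns.

    Only the first N rows are the visible content; the extra K rows exist purely
    to increase capacity, and their values at recall time are not of interest.
    """
    rng = np.random.default_rng(seed)
    dont_care = rng.choice([-1, 1], size=(K, patterns.shape[1]))
    return np.vstack([patterns, dont_care])

# Example: pad N = 6 bit patterns with K = 4 don't-care bits, giving width N + K = 10;
# the padded patterns are then stored with the usual training and recall strategies.
patterns = np.array([[1, -1], [1, 1], [-1, 1], [-1, -1], [1, 1], [-1, 1]])
padded = pad_with_dont_cares(patterns, K=4)
print(padded.shape)     # (10, 2)
```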

A probabilistic interpretation

  • For binary $\mathbf{y}$, the energy of a pattern is the analog of the negative log likelihood of a Boltzmann distribution
    • Minimizing energy maximizes log likelihood
    • $E(\mathbf{y})=-\frac{1}{2} \mathbf{y}^{T} \mathbf{W} \mathbf{y}, \quad P(\mathbf{y})=C \exp (-E(\mathbf{y}))$
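
For a tiny network this correspondence can be computed exactly; a brute-force sketch (an illustrative function, normalizing over all $2^N$ bipolar states):

```python
import numpy as np
from itertools import product

def boltzmann_distribution(W, b=None, kT=1.0):
    """Exact P(y) = C exp(-E(y)/kT) over all 2^N bipolar states (tiny N only)."""
    N = W.shape[0]
    b = np.zeros(N) if b is None else b
    states = np.array(list(product([-1, 1], repeat=N)))           # (2^N, N)
    E = np.array([-0.5 * y @ W @ y - b @ y for y in states])      # E(y) for every state
    p = np.exp(-E / kT)
    return states, p / p.sum()                                    # C = 1 / sum_y exp(-E(y)/kT)
```

Lower-energy states receive higher probability, so lowering the energy of the target patterns raises their log likelihood.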

Boltzmann Distribution

  • $E(\mathbf{y})=-\frac{1}{2} \mathbf{y}^{T} \mathbf{W} \mathbf{y}-\mathbf{b}^{T} \mathbf{y}$
  • $P(\mathbf{y})=C \exp \left(\frac{-E(\mathbf{y})}{k T}\right)$
  • $C=\frac{1}{\sum_{\mathbf{y}} \exp \left(\frac{-E(\mathbf{y})}{k T}\right)}$
  • $k$ is the Boltzmann constant, $T$ is the temperature of the system
  • Optimizing $\mathbf{W}$
    • $E(\mathbf{y})=-\frac{1}{2} \mathbf{y}^{T} \mathbf{W} \mathbf{y}, \quad \widehat{\mathbf{W}}=\underset{\mathbf{W}}{\operatorname{argmin}}\left(\sum_{\mathbf{y} \in \mathbf{Y}_{P}} E(\mathbf{y})-\sum_{\mathbf{y} \notin \mathbf{Y}_{P}} E(\mathbf{y})\right)$
    • Simple gradient descent
    • $\mathbf{W}=\mathbf{W}+\eta\left(\sum_{\mathbf{y} \in \mathbf{Y}_{P}} \alpha_{\mathbf{y}} \mathbf{y} \mathbf{y}^{T}-\sum_{\mathbf{y} \notin \mathbf{Y}_{P}} \beta(E(\mathbf{y})) \mathbf{y} \mathbf{y}^{T}\right)$
    • $\alpha_{\mathbf{y}}$ gives more importance to more frequently presented memories
    • $\beta(E(\mathbf{y}))$ gives more importance to more attractive spurious memories
    • Looks like an expectation (sketched below)
    • $\mathbf{W}=\mathbf{W}+\eta\left(E_{\mathbf{y} \sim \mathbf{Y}_{P}}\left[\mathbf{y} \mathbf{y}^{T}\right]-E_{\mathbf{y} \sim Y}\left[\mathbf{y} \mathbf{y}^{T}\right]\right)$
  • The behavior of the Hopfield net is analogous to annealed dynamics of a spin glass characterized by a Boltzmann distribution
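
A sketch of the expectation form of the update, assuming `model_samples` holds states drawn from the network's own Boltzmann distribution (how those samples are obtained, e.g. by running the stochastic network, is not shown here):

```python
import numpy as np

def expectation_gradient_step(W, targets, model_samples, eta=0.01):
    """W <- W + eta * ( E_{y ~ Y_P}[y y^T] - E_{y ~ Y}[y y^T] ).

    targets:       (N, K) bipolar training memories (the first expectation).
    model_samples: (N, M) states sampled from the net's Boltzmann distribution
                   (the second expectation, approximated by an average).
    """
    pos = (targets @ targets.T) / targets.shape[1]
    neg = (model_samples @ model_samples.T) / model_samples.shape[1]
    W = W + eta * (pos - neg)
    np.fill_diagonal(W, 0)
    return W
```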
