Encouraging similar covariance matrices between classes

This page is a companion to the article on Hierarchical Gaussian mixtures.

Model

Let $n \in 1 \dots N$ index subjects, $i \in 1 \dots I$ index voxels and $k \in 1 \dots K$ index classes. Let us write the raw images as $\mathbf{x}_n$. Let us assume that the (expected) class membership of each voxel is known and stored in $\mathbf{z}_n$ (i.e., a “segmentation” or “responsibility” image). This is the situation we are in after the “Expectation” step of an EM (or variational EM) algorithm.

The conditional prior probability for each voxel is:
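Under the mixture model of the parent article, this conditional presumably takes the form of a class-conditional Gaussian, with $\boldsymbol\Lambda_{k,n}$ the precision matrix of class $k$ in subject $n$ (a sketch, not the original formula):

$$p\!\left(\mathbf{x}_{n,i} \mid z_{n,i,k}=1\right) = \mathcal{N}\!\left(\mathbf{x}_{n,i} \mid \boldsymbol\mu_{k,n},\, \boldsymbol\Lambda_{k,n}^{-1}\right)$$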

Each Gaussian is assumed to stem from a Normal-Wishart distribution:
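Presumably the usual conjugate form, with class-specific hyper-parameters $\boldsymbol\mu_{k,0}$, $\lambda_{k,0}$, $\mathbf{W}_{k,0}$ and $\nu_{k,0}$ (the quantities that appear in the updates below); as a sketch:

$$p\!\left(\boldsymbol\mu_{k,n}, \boldsymbol\Lambda_{k,n}\right) = \mathcal{N}\!\left(\boldsymbol\mu_{k,n} \mid \boldsymbol\mu_{k,0},\, \left(\lambda_{k,0}\boldsymbol\Lambda_{k,n}\right)^{-1}\right)\, \mathcal{W}\!\left(\boldsymbol\Lambda_{k,n} \mid \mathbf{W}_{k,0},\, \nu_{k,0}\right)$$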

All prior inverse scale matrices stem from a shared Wishart distribution:
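That is, with hyper-parameters $\boldsymbol\Psi_0$ and $m_0$ shared across classes (a sketch of the presumed form):

$$p\!\left(\mathbf{W}_{k,0}^{-1}\right) = \mathcal{W}\!\left(\mathbf{W}_{k,0}^{-1} \mid \boldsymbol\Psi_0,\, m_0\right)$$

Sharing $\boldsymbol\Psi_0$ and $m_0$ across classes is what encourages similar covariance matrices between classes.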

We make the mean field approximation that the posterior factorises over $(\boldsymbol\mu_{k,n}, \boldsymbol\Lambda_{k,n})$, $\mathbf{W}_{k,0}^{-1}$ and $\boldsymbol\Psi_0$. We’ll be searching for mode estimates for all the other variables ($\boldsymbol\mu_{k,0}$, $\lambda_{k,0}$, $\nu_{k,0}$, $m_0$).
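Written out, the assumed factorisation reads

$$q = q\!\left(\boldsymbol\Psi_0\right)\, \prod_{k=1}^K \left[\, q\!\left(\mathbf{W}_{k,0}^{-1}\right) \prod_{n=1}^N q\!\left(\boldsymbol\mu_{k,n}, \boldsymbol\Lambda_{k,n}\right) \right].$$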

Posterior

Gaussian parameters (x Subject x Class)

with

yielding
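A sketch of the presumed update, assuming the Normal-Wishart parameterisation above and writing $M$ for the dimension of $\mathbf{x}_{n,i}$. With the responsibilities fixed, form the usual sufficient statistics $s^0_{k,n} = \sum_i z_{n,i,k}$, $\bar{\mathbf{x}}_{k,n} = \frac{1}{s^0_{k,n}}\sum_i z_{n,i,k}\mathbf{x}_{n,i}$ and $\mathbf{S}_{k,n} = \frac{1}{s^0_{k,n}}\sum_i z_{n,i,k}\left(\mathbf{x}_{n,i}-\bar{\mathbf{x}}_{k,n}\right)\left(\mathbf{x}_{n,i}-\bar{\mathbf{x}}_{k,n}\right)^{\mathrm{T}}$ (notation introduced here for illustration). Each $q\!\left(\boldsymbol\mu_{k,n},\boldsymbol\Lambda_{k,n}\right)$ is then Normal-Wishart with parameters

$$\lambda_{k,n}^\star = \lambda_{k,0} + s^0_{k,n}, \qquad \nu_{k,n}^\star = \nu_{k,0} + s^0_{k,n}, \qquad \boldsymbol\mu_{k,n}^\star = \frac{\lambda_{k,0}\boldsymbol\mu_{k,0} + s^0_{k,n}\bar{\mathbf{x}}_{k,n}}{\lambda_{k,n}^\star},$$

$$\left(\mathbf{W}_{k,n}^\star\right)^{-1} = \mathbb{E}_q\!\left[\mathbf{W}_{k,0}^{-1}\right] + s^0_{k,n}\mathbf{S}_{k,n} + \frac{\lambda_{k,0}\, s^0_{k,n}}{\lambda_{k,n}^\star}\left(\bar{\mathbf{x}}_{k,n}-\boldsymbol\mu_{k,0}\right)\left(\bar{\mathbf{x}}_{k,n}-\boldsymbol\mu_{k,0}\right)^{\mathrm{T}},$$

the only difference with the usual (non-hierarchical) case being that $\mathbf{W}_{k,0}^{-1}$ is replaced by its posterior expectation.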

Wishart scale matrix (x Class)

with

yielding
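A sketch of the presumed update, assuming the shared Wishart hyper-prior above: each $q\!\left(\mathbf{W}_{k,0}^{-1}\right)$ is again Wishart, say $\mathcal{W}\!\left(\boldsymbol\Psi_k^\star, m_k^\star\right)$ (notation introduced here for illustration), with

$$m_k^\star = m_0 + N\nu_{k,0}, \qquad \left(\boldsymbol\Psi_k^\star\right)^{-1} = \boldsymbol\Psi_0^{-1} + \sum_{n=1}^N \mathbb{E}_q\!\left[\boldsymbol\Lambda_{k,n}\right] = \boldsymbol\Psi_0^{-1} + \sum_{n=1}^N \nu_{k,n}^\star\,\mathbf{W}_{k,n}^\star,$$

so that $\mathbb{E}_q\!\left[\mathbf{W}_{k,0}^{-1}\right] = m_k^\star\boldsymbol\Psi_k^\star$ in the Gaussian-parameter update above.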

Lower bound

The lower bound on the model evidence (or negative free energy) can be written as

with
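Schematically (a sketch, not the exact expression), with $\boldsymbol\theta$ collecting all variables that receive a posterior approximation:

$$\mathcal{F} = \mathbb{E}_q\!\left[\ln p\!\left(\mathbf{X}, \boldsymbol\theta \mid \mathbf{Z}\right)\right] - \mathbb{E}_q\!\left[\ln q\!\left(\boldsymbol\theta\right)\right],$$

where the expected log-joint splits into the likelihood, Normal-Wishart and shared Wishart terms of the model section, and the second term is the entropy of the approximate posterior.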

Hyper-parameter optimisation

Normal mean degrees of freedom ($\lambda_{k,0}$)

Keeping only terms that depend on $\lambda_{k,0}$, we have:

By differentiating with respect to $\lambda_{k,0}$, we find:

Setting the derivative to zero yields
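As a sketch, under the Normal-Wishart parameterisation written above, the $\lambda_{k,0}$-dependent part of the bound is, up to an additive constant,

$$\mathcal{F}\!\left(\lambda_{k,0}\right) = \frac{NM}{2}\ln\lambda_{k,0} - \frac{\lambda_{k,0}}{2}\sum_{n=1}^N\left(\frac{M}{\lambda_{k,n}^\star} + \nu_{k,n}^\star\left(\boldsymbol\mu_{k,n}^\star - \boldsymbol\mu_{k,0}\right)^{\mathrm{T}}\mathbf{W}_{k,n}^\star\left(\boldsymbol\mu_{k,n}^\star - \boldsymbol\mu_{k,0}\right)\right),$$

whose derivative vanishes at

$$\lambda_{k,0} = \frac{NM}{\mathcal{L} + \mathcal{S}},$$

with $\mathcal{L}$ and $\mathcal{S}$ as defined in the next subsection.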

Normal mean ($\boldsymbol\mu_{k,0}$)

Keeping only terms that depend on $\boldsymbol\mu_{k,0}$, and substituting the above result for $\lambda_{k,0}$, we have:

Let us write $\mathcal{S} = \sum_{n=1}^N \nu_{k,n}^\star(\boldsymbol\mu_{k,n}^\star - \boldsymbol\mu_{k,0})^{\mathrm{T}}\mathbf{W}_{k,n}^\star(\boldsymbol\mu_{k,n}^\star - \boldsymbol\mu_{k,0})$ and $\mathcal{L} = \sum_{n=1}^N \frac{M}{\lambda_{k,n}^\star}$ so that $\mathcal{E} = MN\frac{\mathcal{S}}{\mathcal{L} + \mathcal{S}}$, with

Differentiating the complete objective function yields

Consequently
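A sketch of the presumed conclusion: after substituting $\lambda_{k,0} = NM/(\mathcal{L}+\mathcal{S})$, the objective depends on $\boldsymbol\mu_{k,0}$ only through $\mathcal{S}$ and is strictly decreasing in $\mathcal{S}$, so the stationarity condition reduces to $\nabla_{\boldsymbol\mu_{k,0}}\mathcal{S} = \mathbf{0}$, i.e.

$$\sum_{n=1}^N \nu_{k,n}^\star\,\mathbf{W}_{k,n}^\star\left(\boldsymbol\mu_{k,n}^\star - \boldsymbol\mu_{k,0}\right) = \mathbf{0} \quad\Longrightarrow\quad \boldsymbol\mu_{k,0} = \left(\sum_{n=1}^N \nu_{k,n}^\star\,\mathbf{W}_{k,n}^\star\right)^{-1} \sum_{n=1}^N \nu_{k,n}^\star\,\mathbf{W}_{k,n}^\star\,\boldsymbol\mu_{k,n}^\star,$$

a precision-weighted average of the subject-specific posterior means.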

Normal variance degrees of freedom ($\nu_{k,0}$)

Keeping only terms that depend on $\nu_{k,0}$, we have:

with:

yielding:

Differentiating yields:

There is no closed-form solution, so we will instead use the gradient and Hessian of the objective function within a Gauss-Newton optimisation scheme:
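As a sketch under the same parameterisation, writing $\Gamma_M$ and $\psi_M$ for the multivariate gamma and digamma functions, the $\nu_{k,0}$-dependent part of the bound is, up to an additive constant,

$$\mathcal{F}\!\left(\nu_{k,0}\right) = \frac{\nu_{k,0}}{2}\sum_{n=1}^N\left(\mathbb{E}_q\!\left[\ln\left|\boldsymbol\Lambda_{k,n}\right|\right] + \mathbb{E}_q\!\left[\ln\left|\mathbf{W}_{k,0}^{-1}\right|\right] - M\ln 2\right) - N\ln\Gamma_M\!\left(\frac{\nu_{k,0}}{2}\right),$$

so that

$$\frac{\partial\mathcal{F}}{\partial\nu_{k,0}} = \frac{1}{2}\sum_{n=1}^N\left(\mathbb{E}_q\!\left[\ln\left|\boldsymbol\Lambda_{k,n}\right|\right] + \mathbb{E}_q\!\left[\ln\left|\mathbf{W}_{k,0}^{-1}\right|\right] - M\ln 2\right) - \frac{N}{2}\psi_M\!\left(\frac{\nu_{k,0}}{2}\right), \qquad \frac{\partial^2\mathcal{F}}{\partial\nu_{k,0}^2} = -\frac{N}{4}\psi_M'\!\left(\frac{\nu_{k,0}}{2}\right),$$

which provide the gradient and Hessian needed for the Gauss-Newton updates.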

Hyper-Wishart inverse scale matrix ($\boldsymbol\Psi_0^{-1}$)

Keeping only terms that depend on $\boldsymbol\Psi_0$, we have:

Differentiating yields:
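A sketch, assuming $\boldsymbol\Psi_0$ receives a point estimate as the heading suggests: the $\boldsymbol\Psi_0$-dependent terms of the bound are (up to constants) $\sum_{k=1}^K\left(\frac{m_0}{2}\ln\left|\boldsymbol\Psi_0^{-1}\right| - \frac{1}{2}\mathrm{Tr}\!\left(\boldsymbol\Psi_0^{-1}\,\mathbb{E}_q\!\left[\mathbf{W}_{k,0}^{-1}\right]\right)\right)$, and setting the derivative with respect to $\boldsymbol\Psi_0^{-1}$ to zero gives

$$\boldsymbol\Psi_0 = \frac{1}{K m_0}\sum_{k=1}^K \mathbb{E}_q\!\left[\mathbf{W}_{k,0}^{-1}\right].$$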

Hyper-Wishart degrees of freedom ($m_0$)

Keeping only terms that depend on $m_0$, we have:

Note: In order to find the global optimum at once, I first tried replacing $\boldsymbol\Psi_0$ with its optimum value (above), which depends on $m_0$. However, this breaks the objective function’s convexity, resulting in an ill-conditioned optimisation problem. It is thus preferable to alternate between optimising $m_0$ by Gauss-Newton and updating $\boldsymbol\Psi_0$ based on this new value.

Once again, we will resort to numerical optimisation:
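For illustration only, here is a minimal Newton sketch in Python (not the author's implementation). It assumes, consistently with the Wishart parameterisation sketched above, that the stationarity condition for $m_0$ reduces to $\psi_M(m_0/2) = g$ with $g = \frac{1}{K}\sum_k \mathbb{E}_q[\ln|\mathbf{W}_{k,0}^{-1}|] - \ln|\boldsymbol\Psi_0| - M\ln 2$; the names `psi_m`, `psi_m_prime`, `update_m0` and the constant `g` are illustrative.

    import numpy as np
    from scipy.special import digamma, polygamma

    def psi_m(a, M):
        """Multivariate digamma: d/da ln Gamma_M(a) = sum_{i=1..M} digamma(a + (1 - i)/2)."""
        i = np.arange(1, M + 1)
        return digamma(a + (1 - i) / 2).sum()

    def psi_m_prime(a, M):
        """Derivative of the multivariate digamma (a sum of trigamma terms)."""
        i = np.arange(1, M + 1)
        return polygamma(1, a + (1 - i) / 2).sum()

    def update_m0(m0, g, K, M, n_iter=20, tol=1e-8):
        """Newton updates for the hyper-Wishart degrees of freedom m0.

        `g` is the data-dependent constant assumed above, so that the gradient
        and Hessian of the m0-dependent part of the bound are taken to be
        0.5*K*(g - psi_M(m0/2)) and -0.25*K*psi_M'(m0/2).
        """
        for _ in range(n_iter):
            grad = 0.5 * K * (g - psi_m(m0 / 2, M))
            hess = -0.25 * K * psi_m_prime(m0 / 2, M)
            step = grad / hess
            m0 = max(m0 - step, M - 1 + 1e-3)  # stay in the valid Wishart range
            if abs(step) < tol:
                break
        return m0

The same kind of one-dimensional Newton iteration would serve for the $\nu_{k,0}$ update sketched earlier.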


Created by Yaël Balbastre on 11 April 2018. Last edited on 13 April 2018.