This page accompanies the article on Hierarchical Gaussian mixtures.
Let $n \in 1 \dots N$ index subjects, $i \in 1 \dots I$ index voxels and $k \in 1 \dots K$ index classes. Let us write the raw images as $\mathbf{x}_n$. Let us assume that the (expected) class membership of each voxel is known and stored in $\mathbf{z}_n$ (i.e., a “segmentation” or “responsibility” image). This is the situation after the “Expectation” step of an EM (or variational EM) algorithm.
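Concretely, the convention assumed here is that $\mathbf{z}_n$ stores, for each voxel, a responsibility for every class (a sketch of the assumed notation):

$$z_{n,i,k} \in [0, 1], \qquad \sum_{k=1}^K z_{n,i,k} = 1 .$$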
The conditional prior probability for each voxel is:
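A plausible form for this conditional, assuming the usual mixture likelihood with subject-specific class means $\boldsymbol\mu_{k,n}$ and precision matrices $\boldsymbol\Lambda_{k,n}$, would be:

$$p\!\left(\mathbf{x}_{n,i} \mid z_{n,i} = k\right) = \mathcal{N}\!\left(\mathbf{x}_{n,i} \,\middle|\, \boldsymbol\mu_{k,n},\, \boldsymbol\Lambda_{k,n}^{-1}\right).$$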
Each Gaussian is assumed to stem from a Normal-Wishart distribution:
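With the hyperparameters named later on this page ($\boldsymbol\mu_{k,0}$, $\lambda_{k,0}$, $\mathbf{W}_{k,0}$, $\nu_{k,0}$), and assuming the standard Normal-Wishart parameterisation, this would read:

$$p\!\left(\boldsymbol\mu_{k,n}, \boldsymbol\Lambda_{k,n}\right) = \mathcal{N}\!\left(\boldsymbol\mu_{k,n} \,\middle|\, \boldsymbol\mu_{k,0},\, \left(\lambda_{k,0}\boldsymbol\Lambda_{k,n}\right)^{-1}\right)\, \mathcal{W}\!\left(\boldsymbol\Lambda_{k,n} \,\middle|\, \mathbf{W}_{k,0},\, \nu_{k,0}\right).$$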
All prior inverse scale matrices stem from a shared Wishart distribution:
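A sketch of this shared hyperprior, assuming $\boldsymbol\Psi_0$ is its scale matrix and $m_0$ its degrees of freedom:

$$p\!\left(\mathbf{W}_{k,0}^{-1}\right) = \mathcal{W}\!\left(\mathbf{W}_{k,0}^{-1} \,\middle|\, \boldsymbol\Psi_0,\, m_0\right), \qquad k = 1 \dots K.$$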
We make the mean-field approximation that the posterior factorises over $(\boldsymbol\mu_{k,n}, \boldsymbol\Lambda_{k,n})$, $\mathbf{W}_{k,0}^{-1}$ and $\boldsymbol\Psi_0$. We will look for mode estimates of all the other variables ($\boldsymbol\mu_{k,0}$, $\lambda_{k,0}$, $\nu_{k,0}$, $m_0$).
with
yielding
with
yielding
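Consistently with the starred notation used in the updates below, the per-subject factor is itself Normal-Wishart; a sketch, assuming the standard conjugate form:

$$q\!\left(\boldsymbol\mu_{k,n}, \boldsymbol\Lambda_{k,n}\right) = \mathcal{N}\!\left(\boldsymbol\mu_{k,n} \,\middle|\, \boldsymbol\mu_{k,n}^\star,\, \left(\lambda_{k,n}^\star\boldsymbol\Lambda_{k,n}\right)^{-1}\right)\, \mathcal{W}\!\left(\boldsymbol\Lambda_{k,n} \,\middle|\, \mathbf{W}_{k,n}^\star,\, \nu_{k,n}^\star\right).$$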
The lower bound on the model evidence (or negative free energy) can be written as
with
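At the most general level, this bound has the usual variational form (a sketch, writing $\boldsymbol\theta$ for all latent parameters and $\mathcal{F}$ for the bound):

$$\mathcal{F} = \mathbb{E}_q\!\left[\ln p\!\left(\{\mathbf{x}_n\}, \{\mathbf{z}_n\}, \boldsymbol\theta\right)\right] - \mathbb{E}_q\!\left[\ln q\!\left(\boldsymbol\theta\right)\right].$$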
Keeping only terms that depend on $\lambda_{k,0}$, we have:
By differentiating with respect to $\lambda_{k,0}$, we find:
Solving for a null derivative yields
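Given the quantities $\mathcal{S}$ and $\mathcal{L}$ defined in the next paragraph (with $M$ presumably the number of channels, i.e. the data dimension), this stationary point should take the form:

$$\lambda_{k,0} = \frac{MN}{\displaystyle\sum_{n=1}^N \left[\frac{M}{\lambda_{k,n}^\star} + \nu_{k,n}^\star\left(\boldsymbol\mu_{k,n}^\star - \boldsymbol\mu_{k,0}\right)^{\mathrm{T}}\mathbf{W}_{k,n}^\star\left(\boldsymbol\mu_{k,n}^\star - \boldsymbol\mu_{k,0}\right)\right]} = \frac{MN}{\mathcal{L} + \mathcal{S}},$$

which is consistent with the substituted term $\mathcal{E} = MN\frac{\mathcal{S}}{\mathcal{L} + \mathcal{S}}$ appearing below.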
Keeping only terms that depend on $\boldsymbol\mu_{k,0}$, and substituting the above result for $\lambda_{k,0}$, we have:
Let us write $\mathcal{S} = \sum_{n=1}^N \nu_{k,n}^\star(\boldsymbol\mu_{k,n}^\star - \boldsymbol\mu_{k,0})^{\mathrm{T}}\mathbf{W}_{k,n}^\star(\boldsymbol\mu_{k,n}^\star - \boldsymbol\mu_{k,0})$ and $\mathcal{L} = \sum_{n=1}^N \frac{M}{\lambda_{k,n}^\star}$ so that $\mathcal{E} = MN\frac{\mathcal{S}}{\mathcal{L} + \mathcal{S}}$, with
Differentiating the complete objective function yields
Consequently
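Since $\mathcal{L}$ does not depend on $\boldsymbol\mu_{k,0}$, the stationary point of $\mathcal{E}$ coincides with that of $\mathcal{S}$, which should give the precision-weighted average:

$$\boldsymbol\mu_{k,0} = \left(\sum_{n=1}^N \nu_{k,n}^\star\mathbf{W}_{k,n}^\star\right)^{-1} \sum_{n=1}^N \nu_{k,n}^\star\mathbf{W}_{k,n}^\star\boldsymbol\mu_{k,n}^\star.$$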
Keeping only terms that depend on $\nu_{k,0}$, we have:
with:
yielding:
Differentiating yields:
There is no closed-form solution, so we will instead use the gradient and Hessian of the objective function in a Gauss-Newton optimisation scheme:
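Writing the objective restricted to $\nu_{k,0}$ as $f(\nu_{k,0})$, each iteration then takes a Newton-type step of the form:

$$\nu_{k,0} \leftarrow \nu_{k,0} - \left(\frac{\partial^2 f}{\partial \nu_{k,0}^2}\right)^{-1}\frac{\partial f}{\partial \nu_{k,0}},$$

keeping $\nu_{k,0} > M - 1$ so that the Wishart component of the prior remains proper.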
Keeping only terms that depend on $\boldsymbol\Psi_0$, we have:
Differentiating yields:
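Under the shared Wishart hyperprior sketched above, and for fixed $m_0$, setting this derivative to zero would give a closed-form update of the type:

$$\boldsymbol\Psi_0 = \frac{1}{K m_0}\sum_{k=1}^K \mathbb{E}_q\!\left[\mathbf{W}_{k,0}^{-1}\right],$$

whose dependence on $m_0$ is what motivates the alternated scheme described in the note below.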
Keeping only terms that depend on $m_0$, we have:
Note: In order to find the global optimum at once, I first tried replacing $\boldsymbol\Psi_0$ with its optimal value (above), which depends on $m_0$. However, this breaks the convexity of the objective function, resulting in an ill-conditioned optimisation problem. It is thus preferable to alternate between optimising $m_0$ by Gauss-Newton and updating $\boldsymbol\Psi_0$ based on this new value.
Once again, we resort to numerical optimisation:
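As a minimal sketch of such a scheme, the following Python snippet implements a generic damped Newton ascent for a scalar hyperparameter such as $m_0$ (or $\nu_{k,0}$). The callables `grad_fn` and `hess_fn` are hypothetical placeholders for the gradient and Hessian derived above, and `lower_bound` stands for the admissible domain of the parameter:

```python
def newton_scalar(x0, grad_fn, hess_fn, lower_bound=0.0,
                  max_iter=100, tol=1e-8):
    """Damped Newton ascent for a scalar hyperparameter.

    x0          : initial value (e.g. current estimate of m_0), > lower_bound
    grad_fn(x)  : first derivative of the objective at x (hypothetical)
    hess_fn(x)  : second derivative of the objective at x (hypothetical)
    lower_bound : iterates must stay strictly above this bound
    """
    x = x0
    assert x > lower_bound, "initial value must lie in the admissible domain"
    for _ in range(max_iter):
        g = grad_fn(x)
        h = hess_fn(x)
        # Newton step if the Hessian is negative (we maximise a lower bound);
        # otherwise fall back to a plain gradient step.
        step = -g / h if h < 0 else g
        # Halve the step until the iterate stays in the admissible domain.
        while x + step <= lower_bound:
            step *= 0.5
        x_new = x + step
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

In practice such an update would be interleaved, at each outer (variational) EM iteration, with the $\boldsymbol\Psi_0$ update sketched above, as described in the note.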
Created by Yaël Balbastre on 11 April 2018. Last edited on 13 April 2018.