This section describes the conditional posterior for each unknown parameter in order to implement a Gibbs sampler. In particular, it details the cumbersome task of sampling the full textured images (Section 4.4) and the labels (Section 4.5). These two sampling processes represent the major algorithmic challenges of our approach.
Each conditional posterior can be deduced from the joint posterior (8) by retaining the factors that are functions of the considered parameter.
4.1 Precision parameters
Regarding the noise parameter γ_{n} and the texture scale parameters γ_{k}, from (8), we have:
$$\begin{array}{@{}rcl@{}} \gamma_{\mathrm{n}} &\sim& \gamma_{\mathrm{n}}^{\,\alpha_{\mathrm{n}}^{0}+P-1} \, \exp -\gamma_{\mathrm{n}} \left[\beta_{\mathrm{n}}^{0} + \left\|{\boldsymbol{y}}-{\mathbf{H}}{\boldsymbol{z}}\right\|^{2}\right] \,{,}\\ \gamma_{k} &\sim& \gamma_{k}^{\,\alpha_{k}^{0}+P-1} \, \exp -\gamma_{k}\left[\beta_{k}^{0} + \left\|{\boldsymbol{x}}_{k}\right\|^{2}_{{\boldsymbol{\Lambda}}_{k}({\boldsymbol{\theta}}_{k})}\right] \,{.} \end{array} $$
They must be sampled under Gamma densities \({\mathcal {G}}(\gamma ; \alpha, \beta)\) with respective parameters:
$$\alpha = \alpha_{\mathrm{n}}^{0}+P ~~\text{and}~~\beta=\beta_{\mathrm{n}}^{0} + \left\|{\boldsymbol{y}}-{\mathbf{H}}{\boldsymbol{z}}\right\|^{2} $$
for the noise parameter γ_{n} and
$$\alpha = \alpha_{k}^{0}+ P ~~\text{and}~~\beta=\beta_{k}^{0} + \left\|{\boldsymbol{x}}_{k}\right\|^{2}_{{\boldsymbol{\Lambda}}_{k}({\boldsymbol{\theta}}_{k})} $$
for the texture parameters γ_{k}. As Gamma variables, they can be straightforwardly sampled. In addition, given the hierarchical structure (see Fig. 2), they are mutually (a posteriori) independent.
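These conjugate updates are straightforward to implement; the following is a minimal sketch (in Python with NumPy), where the names `alpha0`, `beta0`, `Hz` and the toy data are purely illustrative stand-ins for the quantities above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gamma_n(y, Hz, alpha0, beta0, rng):
    """Draw the noise precision from its conditional Gamma posterior:
    shape alpha0 + P and rate beta0 + ||y - Hz||^2, P being the data size."""
    alpha = alpha0 + y.size
    beta = beta0 + np.sum((y - Hz) ** 2)
    # NumPy parameterizes the Gamma by shape and *scale* = 1 / rate
    return rng.gamma(shape=alpha, scale=1.0 / beta)

# Toy check: unit-variance noise around a zero "blurred image" Hz,
# so the posterior concentrates near precision 1
y = rng.standard_normal(64 * 64)
Hz = np.zeros_like(y)
gamma_n = sample_gamma_n(y, Hz, alpha0=1.0, beta0=1.0, rng=rng)
```

The draw of each γ_k is identical, with the Λ_k-weighted norm of x_k in the rate.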
4.2 Shape texture parameters
Regarding the shape parameters θ_{k} of the textured image PSD, the problem is made far more complicated by the intricate relation between the density, the PSD and the parameter θ_{k}; see Eqs. (2) and (3). As a consequence, the conditional posterior has a nonstandard form:
$${\boldsymbol{\theta}}_{k} \sim {\mathcal{U}}_{[{\boldsymbol{\theta}}_{k}^{\mathrm{m}},{\boldsymbol{\theta}}_{k}^{\mathrm{M}}]}({\boldsymbol{\theta}}_{k}) \prod_{p} \lambda_{p}({\boldsymbol{\theta}}_{k}) \, \exp -\gamma_{k}\lambda_{p}({\boldsymbol{\theta}}_{k}) \, \overset{\circ}{x}_{k,p}^{2} $$
nevertheless, it can be sampled using a Metropolis-Hastings (MH) step. Basically, it consists in drawing a proposal from a proposition law, evaluating an acceptance probability, and then, at random according to this probability, setting the new value to the proposal (acceptance) or to the current value (duplication). There are numerous options for formulating a proposition law, and both the convergence rate and the mixing properties are influenced by its adequacy to the (conditional) posterior. Thus, designing a proposition law that embeds information about the posterior will significantly enhance performance. In this context, the directional Random Walk MH (RWMH) algorithms taking advantage of first- or second-order derivatives of the posterior are relevant. A standard case is the Metropolis-adjusted Langevin algorithm (MALA) [51], which takes advantage of the posterior derivative. The preconditioned MALA [52] and the quasi-Newton proposals [53] exploit the posterior curvature. More advanced versions rely on the Fisher matrix (instead of the Hessian) and lead to an efficient sampler called the Fisher-RWMH: [54] proposes a general statement and our previous paper [55] (see also [56]) focuses on texture parameters.
Explicitly, from the current value θ_{c}, the algorithm formulates the proposal θ_{p}:
$$ {\boldsymbol{\theta}}_{\mathrm{p}} = {\boldsymbol{\theta}}_{\mathrm{c}} + \frac{1}{2} \varepsilon^{2} \, {\mathcal{I}}^{-1}({\boldsymbol{\theta}}_{\mathrm{c}}) \, {\mathcal{L}}^{\prime}({\boldsymbol{\theta}}_{\mathrm{c}}) + \varepsilon \, {\mathcal{I}}({\boldsymbol{\theta}}_{\mathrm{c}})^{-1/2} \, {\boldsymbol{u}} $$
where \({\mathcal {I}}({\boldsymbol {\theta }})\) is the Fisher matrix, \({\mathcal {L}}({\boldsymbol {\theta }})\) is the log of the conditional posterior and \({\mathcal {L}}^{\prime }({\boldsymbol {\theta }})\) its gradient, ε is a tuning parameter, and \({\boldsymbol {u}}\sim {\mathcal {N}}(0,{\mathbf {I}})\) is a standard Gaussian sample.
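One such directional RWMH step can be sketched on a toy one-dimensional target, where the exact Fisher information of the toy target stands in for \({\mathcal {I}}({\boldsymbol {\theta }})\); all names and values below are illustrative and not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D target standing in for the conditional posterior: N(2, 0.5^2)
mu, sigma = 2.0, 0.5

def logpost(t):
    return -0.5 * ((t - mu) / sigma) ** 2

def grad(t):
    return -(t - mu) / sigma ** 2

fisher = 1.0 / sigma ** 2      # exact Fisher information of this toy target

def fisher_rwmh_step(theta_c, eps, rng):
    """One Fisher-preconditioned RWMH step following the proposal above."""
    def prop_mean(t):
        return t + 0.5 * eps ** 2 * grad(t) / fisher
    theta_p = prop_mean(theta_c) + eps * fisher ** -0.5 * rng.standard_normal()
    # log of the asymmetric proposal density q(to | from)
    def logq(to, frm):
        return -0.5 * (to - prop_mean(frm)) ** 2 * fisher / eps ** 2
    log_a = (logpost(theta_p) - logpost(theta_c)
             + logq(theta_c, theta_p) - logq(theta_p, theta_c))
    return theta_p if np.log(rng.uniform()) < log_a else theta_c

theta, chain = 0.0, []
for _ in range(5000):
    theta = fisher_rwmh_step(theta, eps=1.0, rng=rng)
    chain.append(theta)
est_mean = float(np.mean(chain[1000:]))
```

The correction terms `logq` make the move exact despite the asymmetric drifted proposal.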
4.3 Potts parameter
The granularity coefficient β conditionally follows an intricate density also deduced from (8):
$$\beta \sim {\mathcal{Z}}(\beta)^{-1} \, \exp\left[ \beta {\sum}_{p \sim q} \delta(\ell_{p}; \ell_{q})\right] ~ \mathcal{U}_{[0,B]}(\beta) \,{.} $$
Sampling it is a difficult task, first because the density does not have a standard form. The major problem, moreover, is that \({\mathcal {Z}}(\beta)\) is intractable, so the density cannot even be evaluated for a given value of β.
To overcome this obstacle, the partition function \({\mathcal {Z}}(\beta)\) has been precomputed on a fine grid of values of β, ranging from β=0 to β=B=3 with a step of 0.01, for several numbers of pixels P and numbers of classes K. Details are given in Annex 6. It is then easy to compute the cumulative distribution function F(β) by standard numerical integration / interpolation. It suffices to sample a uniform variable u on [0,1] and to compute β=F^{−1}(u) to obtain the desired sample. This step is therefore inexpensive (since the table of values of \({\mathcal {Z}}(\beta)\) is precomputed).
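The inverse-CDF step can be sketched as follows, with an illustrative stand-in for the tabulated \({\mathcal {Z}}(\beta)\) (the true table comes from the precomputation described above):

```python
import numpy as np

rng = np.random.default_rng(2)

# Grid of beta values matching the precomputed table (step 0.01 on [0, 3])
B, step = 3.0, 0.01
grid = np.arange(0.0, B + step / 2, step)

# Illustrative stand-ins: S for the sum of delta terms, logZ for log Z(beta)
S = 50.0
logZ = 100.0 * np.log1p(grid)
log_unnorm = grid * S - logZ      # log of Z(beta)^(-1) exp(beta * S) on the grid

def sample_beta(log_unnorm, grid, rng):
    """Inverse-CDF sampling on a grid: normalize, accumulate, invert at u ~ U[0,1]."""
    p = np.exp(log_unnorm - log_unnorm.max())   # stable exponentiation
    cdf = np.cumsum(p)
    cdf /= cdf[-1]
    return grid[np.searchsorted(cdf, rng.uniform())]

beta = sample_beta(log_unnorm, grid, rng)
```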
Remark 7
Although it allows for very efficient computations, this approach has a limitation: \({\mathcal {Z}}\) must be precomputed for the considered numbers of pixels P and classes K.
The procedure is identical to the one presented in our previous papers [40, 41, 57]. The reader is invited to consult [29, 42–46] for alternatives and complementary results.
4.4 Textured image
Remark 8
To improve readability, in the following we use the simplified notation Λ_{k}=Λ_{k}(θ_{k}).
The textured image x_{k} has a Gaussian density, deduced from (8):
$$ {\boldsymbol{x}}_{k} \sim \exp - \left[ \gamma_{\mathrm{n}} \left\|{\boldsymbol{y}}-{\mathbf{H}}\sum_{l}{\mathbf{S}}_{l}{\boldsymbol{x}}_{l}\right\|^{2} + \gamma_{k}\left\|{\boldsymbol{x}}_{k}\right\|^{2}_{{\boldsymbol{\Lambda}}_{k}}\right] $$
(9)
and it is easy to show that the mean μ_{k} and the covariance Σ_{k} write:
$$\begin{array}{@{}rcl@{}} {\boldsymbol{\Sigma}}_{k}^{-1} &=& \gamma_{\mathrm{n}} {\mathbf{S}}_{k}^{\dag} {\mathbf{H}}^{\dag}\mathbf{H}\mathbf{S}_{k} + \gamma_{k}{\boldsymbol{\Lambda}}_{k}\\ {\boldsymbol{\mu}}_{k} &=& \gamma_{\mathrm{n}} {\boldsymbol{\Sigma}}_{k} {\mathbf{S}}_{k}^{\dag}{\mathbf{H}}^{\dag}\bar {\boldsymbol{y}}_{k} \end{array} $$
where \(\bar {\boldsymbol {y}}_{k}= {\boldsymbol {y}}-{\mathbf {H}}\sum _{l\neq k}{\mathbf {S}}_{l}{\boldsymbol {x}}_{l}\). This quantity extracts the contribution of the image x_{k} from the data: \(\bar {\boldsymbol {y}}_{k}\) subtracts from the observations y the convolution of all the parts of the image z that are not labelled k.
However, the practical sampling of this Gaussian density is a thorny issue due to the high dimension of the variable. Usually, sampling a Gaussian density requires handling the covariance or the precision matrix, for instance through factorization (e.g., Cholesky), diagonalization, or inversion, which is impossible here in general; it would be possible for special structures, e.g., sparse or circulant. Here, Λ_{k}, H and by extension H^{†}H are CbC; however, the presence of the S_{k} breaks the circularity: Σ_{k} is not diagonalizable by FFT and, consequently, the sampling of x_{k} cannot be performed efficiently in the Fourier domain.
Nevertheless, the literature offers alternatives based on the strong links between matrix factorization, diagonalization, inversion, linear systems and optimization of quadratic criteria [58–62]. We resort here to our previous work [61] (see also [63]) based on a perturbation-optimization (PO) principle: adequate stochastic perturbation of a quadratic criterion, then optimization of the perturbed criterion. It can be shown that the criterion optimizer is a sample of the target density. The method is applicable if the precision matrix and the mean can be written as sums of the form:
$${\boldsymbol{\Sigma}}_{k}^{-1} = \sum_{n=1}^{N} {\mathbf{M}}_{n}^{\mathrm{t}} {\mathbf{C}}_{n}^{-1} {\mathbf{M}}_{n} ~~\text{and}~~ {\boldsymbol{\mu}}_{k} = {\boldsymbol{\Sigma}}_{k} \sum_{n=1}^{N} {\mathbf{M}}_{n}^{\mathrm{t}} {\mathbf{C}}_{n}^{-1} {\boldsymbol{m}}_{n} $$
By identification, with N=2:
$$\begin{aligned} \left\{\begin{array}{ll} {\mathbf{M}}_{1} & =~ {\mathbf{H}}{\mathbf{S}}_{k} \\ {\mathbf{C}}_{1} & =~ \gamma_{\mathrm{n}}^{-1} {\mathbf{I}}_{P}\\ {\boldsymbol{m}}_{1} & =~ \bar{\boldsymbol{y}}_{k} \end{array}\right. \hspace{1cm} \left\{\begin{array}{ll} {\mathbf{M}}_{2}& =~ {\mathbf{I}}_{P}\\ {\mathbf{C}}_{2}& =~ \gamma_{k}^{-1} {\boldsymbol{\Lambda}}_{k}^{-1}\\ {\boldsymbol{m}}_{2}& =~ {\mathbf{O}}_{P} \end{array}\right. \end{aligned} $$
4.4.1 Perturbation
The perturbation phase of this algorithm consists in drawing the following Gaussian samples:
$${\boldsymbol{\xi}}_{1} \sim {\mathcal{N}}\left({\boldsymbol{m}}_{1},{\mathbf{C}}_{1}\right) ~~\text{and\:}~~ {\boldsymbol{\xi}}_{2} \sim {\mathcal{N}}\left({\boldsymbol{m}}_{2},{\mathbf{C}}_{2}\right) $$
The cost of these sampling steps is not prohibitive: ξ_{1} is a realization of a white noise, and ξ_{2} is a realization of the prior model for x_{k}, computed by FFT.
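The two draws can be sketched as follows, under the assumption (consistent with the model) that Λ_k is diagonalized by FFT with eigenvalues λ_p; the spectrum `lam`, the sizes, and the zero means are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 64
gamma_n, gamma_k = 4.0, 1.0

# Illustrative eigenvalues lam of Lambda_k (diagonal in the Fourier basis);
# an even spectrum keeps the synthesized field real
f = np.fft.fftfreq(N)
lam = 1.0 + f[:, None] ** 2 + f[None, :] ** 2

# xi_1: white Gaussian noise with precision gamma_n (around a zero mean here)
xi1 = rng.standard_normal((N, N)) / np.sqrt(gamma_n)

# xi_2: realization of the prior N(0, (gamma_k Lambda_k)^(-1)), obtained by
# shaping white noise in the Fourier domain (the unitary FFT keeps variances exact)
w = rng.standard_normal((N, N))
xi2 = np.fft.ifft2(np.fft.fft2(w, norm="ortho") / np.sqrt(gamma_k * lam),
                   norm="ortho").real
```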
4.4.2 Optimization
In order to obtain a sample of the image x_{k}, the following criterion must be optimized w.r.t. x:
$$J_{k}({\boldsymbol{x}}) = \gamma_{\mathrm{n}} \left\|{\boldsymbol{\xi}}_{1}-{\mathbf{H}}{\mathbf{S}}_{k}{\boldsymbol{x}}\right\|^{2}+\gamma_{k}\left\|{\boldsymbol{\xi}}_{2}-{\boldsymbol{x}}\right\|^{2}_{{\boldsymbol{\Lambda}}_{k}} \,{.} $$
For notational convenience, let us rewrite:
$$\begin{array}{@{}rcl@{}} J_{k}({\boldsymbol{x}}) & = & {\boldsymbol{x}}^{\dag} {\mathbf{Q}}_{k} {\boldsymbol{x}} - 2 {\boldsymbol{x}}^{\dag} {\boldsymbol{q}}_{k} + J_{k}(0) \end{array} $$
where the matrix \({\mathbf {Q}}_{k}=\gamma _{\mathrm {n}} {\mathbf {S}}_{k}^{\dag } {\mathbf {H}}^{\dag }{\mathbf {H}}{\mathbf {S}}_{k}+\gamma _{k}{\boldsymbol {\Lambda }}_{k}\) is half the Hessian (and the precision matrix) and the vector \({\boldsymbol {q}}_{k}=\gamma _{\mathrm {n}} {\mathbf {S}}_{k}^{\dag }{\mathbf {H}}^{\dag } {\boldsymbol {\xi }}_{1} + \gamma _{k} {\boldsymbol {\Lambda }}_{k}{\boldsymbol {\xi }}_{2}\) is, up to a factor 2, the opposite of the gradient at the origin. The gradient at x itself is: g_{k}=2(Q_{k}x−q_{k}).
Theoretically, there is no constraint on the optimization technique to be used and the literature on the subject is abundant [64–66]. We have only considered algorithms that are guaranteed to converge (to the unique minimizer), and among them, two basic descent directions: the gradient and the conjugate gradient.
We have first used the conjugate gradient direction, since it is particularly efficient for high-dimensional problems and quadratic criteria. However, we have experienced convergence difficulties, making the overall algorithm very slow: in practice, the step length at each iteration was extremely small, probably due to conditioning issues, so the differences between the iterates were almost insignificant. The solution relies on a preconditioner, defined as a CbC approximation of the inverse Hessian of J_{k}:
$$ {\boldsymbol{\Pi}}_{k}= \left(\gamma_{\mathrm{n}} {\mathbf{H}}^{\dag}{\mathbf{H}} + \gamma_{k}{\boldsymbol{\Lambda}}_{k}\right)^{-1}/2 $$
(10)
obtained by eliminating the S_{k} matrix from Q_{k} and chosen for computational efficiency. It is used for both of the aforementioned directions. In this context, the two methods have yielded similar results, and we have finally focused on the preconditioned gradient.
The second necessary ingredient is the step length s in the considered direction, at each iteration. Here again, a variety of strategies is available. We have used an optimal step, explicitly given by:
$$s = \frac{{{\boldsymbol{g}}_{k}}^{\dag} {\boldsymbol{\Pi}}_{k}^{\dag} {\boldsymbol{g}}_{k}}{ {{\boldsymbol{g}}_{k}}^{\dag} {\boldsymbol{\Pi}}_{k}^{\dag} {\mathbf{Q}}_{k} {\boldsymbol{\Pi}}_{k} {\boldsymbol{g}}_{k}} $$
and efficiently computable.
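The optimization loop can be sketched on a toy symmetric positive definite system, with a diagonal preconditioner standing in for the CbC matrix Π_k and an exact line search for the quadratic criterion (a sketch, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50

# Toy SPD system standing in for Q_k x = q_k (whose solution minimizes J_k)
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)
q = rng.standard_normal(n)

# Diagonal preconditioner standing in for the CbC approximation Pi_k
Pi = np.diag(1.0 / np.diag(Q))

x = np.zeros(n)
for _ in range(200):
    r = Q @ x - q                 # half the gradient of J at x
    d = Pi @ r                    # preconditioned descent direction
    denom = d @ (Q @ d)
    if denom < 1e-30:             # already at the minimizer
        break
    s = (d @ r) / denom           # exact line search for a quadratic criterion
    x = x - s * d

residual = float(np.linalg.norm(Q @ x - q))
```

In the paper's setting, the products by Q_k and Π_k are of course not dense matrix products but the FFT/zero-forcing computations detailed in the next subsection.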
4.4.3 Practical implementation
The algorithm requires at each iteration the computation of the preconditioned gradient and the step length. Finally, the required computations for performing the optimization are the vector q_{k} and the products of a vector by the matrices Π_{k} and Q_{k}.

The vector q_{k} writes:
$$ {\boldsymbol{q}}_{k} = \gamma_{\mathrm{n}} \underbrace{{\mathbf{S}}_{k}^{\dag}\underbrace{{\mathbf{H}}^{\dag} {\boldsymbol{\xi}}_{1}}_{\text{FFT}}}_{\text{ZF}}+ \gamma_{k} \underbrace{{\boldsymbol{\Lambda}}_{k}{\boldsymbol{\xi}}_{2}}_{\text{FFT}} $$
(11)
and is thus efficiently computed through an FFT and zero-forcing (ZF).

The product Q_{k}x writes:
$${\mathbf{Q}}_{k} {\boldsymbol{x}} = \gamma_{\mathrm{n}} \underbrace{{\mathbf{S}}_{k}^{\dag} \underbrace{\underbrace{{\mathbf{H}}^{\dag}{\mathbf{H}}}_{\text{FFT}}\underbrace{{\mathbf{S}}_{k} {\boldsymbol{x}}}_{\text{ZF}}}_{\text{FFT}} }_{\text{ZF}} + \gamma_{k} \underbrace{{\boldsymbol{\Lambda}}_{k}{\boldsymbol{x}}}_{\text{FFT}} $$
and thus also efficiently computed through a series of FFT and ZF.

Regarding Π_{k}g_{k}, since the matrix Π_{k} is CbC, the product can also be efficiently computed by FFT.
The zero-forcing process is achieved in the spatial domain (it amounts to setting some pixels of images to zero), while the costly products by matrices are performed in the Fourier domain (all of them by FFT).
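The product by Q_k can be sketched as follows, with an illustrative blur kernel and spectrum; `mask` plays the role of S_k (zero-forcing in the spatial domain) and all circulant products go through the FFT:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 32
gamma_n, gamma_k = 1.0, 2.0

# Illustrative circulant operators: a Gaussian blur kernel for H and a
# smooth even spectrum lam for the eigenvalues of Lambda_k
i = np.arange(N)
psf = np.exp(-0.5 * ((i[:, None] - N // 2) ** 2 + (i[None, :] - N // 2) ** 2) / 4.0)
h_hat = np.fft.fft2(psf / psf.sum())
f = np.fft.fftfreq(N)
lam = 1.0 + f[:, None] ** 2 + f[None, :] ** 2

mask = rng.uniform(size=(N, N)) < 0.5      # S_k: indicator of the pixels labelled k

def circ(mult, x):
    """Product by a circulant matrix given by its transfer function (one FFT pair)."""
    return np.fft.ifft2(mult * np.fft.fft2(x)).real

def Q_times(x):
    # gamma_n S_k^† H^†H S_k x  (ZF -> FFT -> ZF)  +  gamma_k Lambda_k x  (FFT)
    return gamma_n * mask * circ(np.abs(h_hat) ** 2, mask * x) + gamma_k * circ(lam, x)

x, x2 = rng.standard_normal((2, N, N))
sym_lhs = float((x * Q_times(x2)).sum())   # x^† Q x2
sym_rhs = float((x2 * Q_times(x)).sum())   # x2^† Q x: equal, since Q is symmetric
quad = float((x * Q_times(x)).sum())       # positive, since Q is positive definite
```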
4.5 Labels
The label set has a multidimensional categorical distribution:
$${\boldsymbol{\ell}} \sim \exp\left[ \beta \sum_{r\sim s} \, \delta(\ell_{r},\ell_{s}) - \gamma_{\mathrm{n}} \left\|{\boldsymbol{y}}-{\mathbf{H}}\sum_{k}{\mathbf{S}}_{k}({\boldsymbol{\ell}}){\boldsymbol{x}}_{k}\right\|^{2} \right] $$
This is a nonseparable and nonstandard form, so its sampling is not an easy task. A solution is to sample the ℓ_{p} one by one, conditionally on the others and on the rest of the variables, in a Gibbs scheme.
To this end, let us introduce the notation \({\boldsymbol {z}}_{k}^{p}\) for the image identical to z at every pixel except pixel p, where it takes the value of pixel p of x_{k}. Let us also note \({\mathcal {E}}_{p,k}=\left\|{\boldsymbol{y}}-{\mathbf{H}}{\boldsymbol{z}}_{k}^{p}\right\|^{2}\). This error quantifies the discrepancy between the data and class k regarding pixel p.
Sampling a label \(\ell _{p_{0}}\) requires its conditional probability. A precise analysis of the conditional distribution for \(\ell _{p_{0}}\) yields:
$$\text{Pr}(\ell_{p_{0}}=k) \propto \exp\left[\beta \sum_{r;r\sim p_{0}} \delta(\ell_{r},k)-\gamma_{\mathrm{n}} {\mathcal{E}}_{p_{0},k}\right] $$
for k=1,…,K. This computation is performed up to a multiplicative constant, which can be determined knowing that the probabilities sum to 1.
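For illustration, once the two terms of the exponent are available at a given pixel, the draw reduces to a categorical sample (the neighbour counts and errors below are illustrative values, not computed from data):

```python
import numpy as np

rng = np.random.default_rng(6)
K, beta, gamma_n = 3, 1.2, 0.5

# Illustrative ingredients at pixel p0: neighbour label counts and errors E_{p0,k}
neigh_counts = np.array([3, 1, 0])     # neighbours of p0 carrying each label k
E = np.array([10.0, 4.0, 7.0])         # data discrepancy E_{p0,k} for each class

log_p = beta * neigh_counts - gamma_n * E
p = np.exp(log_p - log_p.max())        # stable exponentiation
p /= p.sum()                           # normalize: probabilities sum to 1
label = rng.choice(K, p=p)
```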
To compute these probabilities, we must evaluate the two terms of the argument of the exponential function, at pixel p_{0}. The first term is the contribution of the prior and it can be easily computed for each k by counting the neighbours of pixel p_{0} having the label k. Let us now focus on the second term, \({\mathcal {E}}_{p,k}\). To write this term in a more convenient form, we introduce:

- A vector \({\mathbbm{1}}_{p}\in {\mathbb {R}}^{P}\): its pth entry is 1 and the others are 0.
- A quantity \(\Delta _{p,k}\in {\mathbb {R}}\) that records the difference between the pth pixel of the image z and that of image x_{k}: \(\Delta _{p,k} = {\mathbbm{1}}_{p}^{\dag } ({\boldsymbol {z}}-{\boldsymbol {x}}_{k})\).
We then have \({\boldsymbol {z}}_{k}^{p}={\boldsymbol {z}}-\Delta _{p,k} {\mathbbm{1}}_{p}\), so \({\mathcal {E}}_{p,k}\) writes:
$$ \begin{aligned} {\mathcal{E}}_{p,k} &=\left\|{\boldsymbol{y}}-{\mathbf{H}}\left({\boldsymbol{z}}-\Delta_{p,k}{\mathbbm{1}}_{p}\right)\right\|^{2}\\ &=\left\|({\boldsymbol{y}}-{\mathbf{H}}{\boldsymbol{z}}) + \Delta_{p,k}{\mathbf{H}}{\mathbbm{1}}_{p} \right\|^{2}\\ &=\bar{\boldsymbol{y}}^{\dag} \bar{\boldsymbol{y}} + \Delta_{p,k}^{2} {\mathbbm{1}}_{p}^{\dag}{\mathbf{H}}^{\dag}{\mathbf{H}} {\mathbbm{1}}_{p} + 2 \Delta_{p,k} {\mathbbm{1}}_{p}^{\dag} {\mathbf{H}}^{\dag} \bar{\boldsymbol{y}} \end{aligned} $$
(12)
where \(\bar {\boldsymbol {y}} = {\boldsymbol {y}}-{\mathbf {H}}{\boldsymbol {z}}\). Then, to complete the description, let us analyse each term.

1. The first term \(\bar {\boldsymbol {y}}^{\dag } \bar {\boldsymbol {y}}\) does not depend on p or k. Consequently, its value is not required in the sampling process and it can be included in a multiplicative factor.

2. The term \({\mathbbm{1}}_{p}^{\dag }{\mathbf {H}}^{\dag }{\mathbf {H}} {\mathbbm{1}}_{p}=\|{\mathbf {H}} {\mathbbm{1}}_{p}\|^{2}\) does not depend on p due to the CbC form of the H matrix. Moreover, this norm only needs to be computed once and for all, since the H matrix does not change throughout the iterations. In fact, this norm amounts to the sum \(\sum_{q} \overset{\circ}{h}_{q}^{2}\).

3. Finally, in the third term \({\mathbbm{1}}_{p}^{\dag } {\mathbf {H}}^{\dag } \bar {\boldsymbol {y}}\), the product \({\mathbf {H}}^{\dag } \bar {\boldsymbol {y}}\) is a convolution efficiently computable by FFT, and the product with \({\mathbbm{1}}_{p}^{\dag }\) selects pixel p. Under this form, however, the computation would not be efficient, since \({\mathbf {H}}^{\dag } \bar {\boldsymbol {y}}\) would have to be recomputed at each iteration. A far better alternative is to update \({\mathbf {H}}^{\dag } \bar {\boldsymbol {y}}\) after each label update.