This section describes the conditional posterior for each unknown parameter in order to implement a Gibbs sampler. In particular, it details the cumbersome task of sampling the full textured images (Section 4.4) and the labels (Section 4.5). These two sampling processes represent the major algorithmic challenges of our approach.
Each conditional posterior can be deduced from the joint posterior (8) by retaining the factors that are functions of the considered parameter.
4.1 Precision parameters
Regarding the noise parameter γ_{n} and the texture scale parameters γ_{k}, from (8), we have:
$$\begin{array}{@{}rcl@{}} \gamma_{\mathrm{n}} &\sim& \gamma_{\mathrm{n}}^{\,\alpha_{\mathrm{n}}^{0}+P-1} \, \exp -\gamma_{\mathrm{n}} \left[\beta_{\mathrm{n}}^{0} + \left\|{\boldsymbol{y}}-{\mathbf{H}}{\boldsymbol{z}}\right\|^{2}\right] \,{,}\\ \gamma_{k} &\sim& \gamma_{k}^{\,\alpha_{k}^{0}+P-1} \, \exp -\gamma_{k}\left[\beta_{k}^{0} + \left\|{\boldsymbol{x}}_{k}\right\|^{2}_{{\boldsymbol{\Lambda}}_{k}({\boldsymbol{\theta}}_{k})}\right] \,{.} \end{array} $$
They must be sampled under Gamma densities \({\mathcal {G}}(\gamma ; \alpha, \beta)\) with respective parameters:
$$\alpha = \alpha_{\mathrm{n}}^{0}+P ~~\text{and}~~\beta=\beta_{\mathrm{n}}^{0} + \left\|{\boldsymbol{y}}-{\mathbf{H}}{\boldsymbol{z}}\right\|^{2} $$
for the noise parameter γ_{n} and
$$\alpha = \alpha_{k}^{0}+ P ~~\text{and}~~\beta=\beta_{k}^{0} + \left\|{\boldsymbol{x}}_{k}\right\|^{2}_{{\boldsymbol{\Lambda}}_{k}({\boldsymbol{\theta}}_{k})} $$
for the texture parameters γ_{k}. As Gamma variables, they can be straightforwardly sampled. In addition, given the hierarchical structure (see Fig. 2), they are mutually (a posteriori) independent.
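These conjugate updates are straightforward to implement; the following is a minimal sketch (in Python with NumPy), where the names `alpha0`, `beta0`, `Hz` and the toy data are purely illustrative stand-ins for the quantities above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gamma_n(y, Hz, alpha0, beta0, rng):
    """Draw the noise precision from its conditional Gamma posterior:
    shape alpha0 + P and rate beta0 + ||y - Hz||^2, P being the data size."""
    alpha = alpha0 + y.size
    beta = beta0 + np.sum((y - Hz) ** 2)
    # NumPy parameterizes the Gamma by shape and *scale* = 1 / rate
    return rng.gamma(shape=alpha, scale=1.0 / beta)

# Toy check: unit-variance noise around a zero "blurred image" Hz,
# so the posterior concentrates near precision 1
y = rng.standard_normal(64 * 64)
Hz = np.zeros_like(y)
gamma_n = sample_gamma_n(y, Hz, alpha0=1.0, beta0=1.0, rng=rng)
```

The draw of each γ_k is identical, with the Λ_k-weighted norm of x_k in the rate.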
4.2 Shape texture parameters
Regarding the shape parameters θ_{k} of the textured image PSD, the problem is made far more complicated by the intricate relation between the density, the PSD and the parameter θ_{k}; see Eqs. (2) and (3). As a consequence, the conditional posterior has a nonstandard form:
$${\boldsymbol{\theta}}_{k} \sim {\mathcal{U}}_{[{\boldsymbol{\theta}}_{k}^{\mathrm{m}},{\boldsymbol{\theta}}_{k}^{\mathrm{M}}]}({\boldsymbol{\theta}}_{k}) \prod_{p} \lambda_{p}({\boldsymbol{\theta}}_{k}) \, \exp -\gamma_{k}\lambda_{p}({\boldsymbol{\theta}}_{k}) \, \overset{\circ}{x}_{k,p}^{2} $$
nevertheless, it can be sampled using a Metropolis-Hastings (MH) step. Basically, it consists in drawing a proposal from a proposition law, evaluating an acceptance probability, and then, at random according to this probability, setting the new value to the proposal (acceptance) or to the current value (duplication). There are numerous options for formulating a proposition law, and both the convergence rate and the mixing properties are influenced by its adequacy to the (conditional) posterior. Thus, designing a proposition law that embeds information about the posterior will significantly enhance performance. In this context, the directional Random Walk MH (RWMH) algorithms taking advantage of first- or second-order derivatives of the posterior are relevant. A standard case is the Metropolis-adjusted Langevin algorithm (MALA) [51], which takes advantage of the posterior derivative. The preconditioned MALA [52] and the quasi-Newton proposals [53] exploit the posterior curvature. More advanced versions rely on the Fisher matrix (instead of the Hessian) and lead to an efficient sampler called the Fisher-RWMH: [54] proposes a general statement and our previous paper [55] (see also [56]) focuses on texture parameters.
Explicitly, from the current value θ_{c}, the algorithm formulates the proposal θ_{p}:
$$ {\boldsymbol{\theta}}_{\mathrm{p}} = {\boldsymbol{\theta}}_{\mathrm{c}} + \frac{1}{2} \varepsilon^{2} \, {\mathcal{I}}^{-1}({\boldsymbol{\theta}}_{\mathrm{c}}) \, {\mathcal{L}}^{\prime}({\boldsymbol{\theta}}_{\mathrm{c}}) + \varepsilon \, {\mathcal{I}}({\boldsymbol{\theta}}_{\mathrm{c}})^{-1/2} \, {\boldsymbol{u}} $$
where \({\mathcal {I}}({\boldsymbol {\theta }})\) is the Fisher matrix, \({\mathcal {L}}({\boldsymbol {\theta }})\) is the log of the conditional posterior and \({\mathcal {L}}^{\prime }({\boldsymbol {\theta }})\) its gradient, ε is a tuning parameter, and \({\boldsymbol {u}}\sim {\mathcal {N}}(0,{\mathbf {I}})\) is a standard Gaussian sample.
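One such directional RWMH step can be sketched on a toy one-dimensional target, where the exact Fisher information of the toy target stands in for \({\mathcal {I}}({\boldsymbol {\theta }})\); all names and values below are illustrative and not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D target standing in for the conditional posterior: N(2, 0.5^2)
mu, sigma = 2.0, 0.5

def logpost(t):
    return -0.5 * ((t - mu) / sigma) ** 2

def grad(t):
    return -(t - mu) / sigma ** 2

fisher = 1.0 / sigma ** 2      # exact Fisher information of this toy target

def fisher_rwmh_step(theta_c, eps, rng):
    """One Fisher-preconditioned RWMH step following the proposal above."""
    def prop_mean(t):
        return t + 0.5 * eps ** 2 * grad(t) / fisher
    theta_p = prop_mean(theta_c) + eps * fisher ** -0.5 * rng.standard_normal()
    # log of the asymmetric proposal density q(to | from)
    def logq(to, frm):
        return -0.5 * (to - prop_mean(frm)) ** 2 * fisher / eps ** 2
    log_a = (logpost(theta_p) - logpost(theta_c)
             + logq(theta_c, theta_p) - logq(theta_p, theta_c))
    return theta_p if np.log(rng.uniform()) < log_a else theta_c

theta, chain = 0.0, []
for _ in range(5000):
    theta = fisher_rwmh_step(theta, eps=1.0, rng=rng)
    chain.append(theta)
est_mean = float(np.mean(chain[1000:]))
```

The correction terms `logq` make the move exact despite the asymmetric drifted proposal.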
4.3 Potts parameter
The granularity coefficient β conditionally follows an intricate density also deduced from (8):
$$\beta \sim {\mathcal{Z}}(\beta)^{-1} \, \exp\left[ \beta {\sum}_{p \sim q} \delta(\ell_{p}; \ell_{q})\right] ~ \mathcal{U}_{[0,B]}(\beta) \,{.} $$
Sampling it is a difficult task, first because the density does not have a standard form. The major problem, moreover, is that \({\mathcal {Z}}(\beta)\) is intractable, so the density cannot even be evaluated for a given value of β.
To overcome this obstacle, the partition function \({\mathcal {Z}}(\beta)\) has been precomputed on a fine grid of values of β, ranging from β=0 to β=B=3 with a step of 0.01, for several numbers of pixels P and numbers of classes K. Details are given in Annex 6. It is then easy to compute the cumulative distribution function F(β) by standard numerical integration / interpolation. It suffices to sample a uniform variable u on [0,1] and to compute β=F^{−1}(u) to obtain the desired sample. This step is therefore inexpensive (since the table of values of \({\mathcal {Z}}(\beta)\) is precomputed).
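The inverse-CDF step can be sketched as follows, with an illustrative stand-in for the tabulated \({\mathcal {Z}}(\beta)\) (the true table comes from the precomputation described above):

```python
import numpy as np

rng = np.random.default_rng(2)

# Grid of beta values matching the precomputed table (step 0.01 on [0, 3])
B, step = 3.0, 0.01
grid = np.arange(0.0, B + step / 2, step)

# Illustrative stand-ins: S for the sum of delta terms, logZ for log Z(beta)
S = 50.0
logZ = 100.0 * np.log1p(grid)
log_unnorm = grid * S - logZ      # log of Z(beta)^(-1) exp(beta * S) on the grid

def sample_beta(log_unnorm, grid, rng):
    """Inverse-CDF sampling on a grid: normalize, accumulate, invert at u ~ U[0,1]."""
    p = np.exp(log_unnorm - log_unnorm.max())   # stable exponentiation
    cdf = np.cumsum(p)
    cdf /= cdf[-1]
    return grid[np.searchsorted(cdf, rng.uniform())]

beta = sample_beta(log_unnorm, grid, rng)
```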
Remark 7
Although it allows for very efficient computations, this approach has a limitation: \({\mathcal {Z}}\) must be precomputed for the considered numbers of pixels P and classes K.
The procedure is identical to the one presented in our previous papers [40, 41, 57]. The reader is invited to consult [29, 42–46] for alternatives and complementary results.
4.4 Textured image
Remark 8
To improve readability, in the following we use the simplified notation Λ_{k}=Λ_{k}(θ_{k}).
The textured image x_{k} has a Gaussian density, deduced from (8):
$$ {\boldsymbol{x}}_{k} \sim \exp - \left[ \gamma_{\mathrm{n}} \left\|{\boldsymbol{y}}-{\mathbf{H}}\sum_{l}{\mathbf{S}}_{l}{\boldsymbol{x}}_{l}\right\|^{2} + \gamma_{k}\left\|{\boldsymbol{x}}_{k}\right\|^{2}_{{\boldsymbol{\Lambda}}_{k}}\right] $$
(9)
and it is easy to show that the mean μ_{k} and the covariance Σ_{k} write:
$$\begin{array}{@{}rcl@{}} {\boldsymbol{\Sigma}}_{k}^{-1} &=& \gamma_{\mathrm{n}} {\mathbf{S}}_{k}^{\dag} {\mathbf{H}}^{\dag}\mathbf{H}\mathbf{S}_{k} + \gamma_{k}{\boldsymbol{\Lambda}}_{k}\\ {\boldsymbol{\mu}}_{k} &=& \gamma_{\mathrm{n}} {\boldsymbol{\Sigma}}_{k} {\mathbf{S}}_{k}^{\dag}{\mathbf{H}}^{\dag}\bar {\boldsymbol{y}}_{k} \end{array} $$
where \(\bar {\boldsymbol {y}}_{k}= {\boldsymbol {y}}-{\mathbf {H}}\sum _{l\neq k}{\mathbf {S}}_{l}{\boldsymbol {x}}_{l}\). This quantity extracts the contribution of the image x_{k} from the data: \(\bar {\boldsymbol {y}}_{k}\) subtracts from the observations y the convolution of all the parts of the image z that are not labelled k.
However, the practical sampling of this Gaussian density is a thorny issue due to the high dimension of the variable. Usually, sampling a Gaussian density requires handling the covariance or the precision matrix, for instance through factorization (e.g., Cholesky), diagonalization, or inversion, which is impossible here in general; it would be possible for special structures, e.g., sparse or circulant. Here, Λ_{k}, H and by extension H^{†}H are CbC; however, the presence of the S_{k} breaks the circularity: Σ_{k} is not diagonalizable by FFT and, consequently, the sampling of x_{k} cannot be performed efficiently in the Fourier domain.
Nevertheless, the literature offers alternatives based on the strong links between matrix factorization, diagonalization, inversion, linear systems and optimization of quadratic criteria [58–62]. We resort here to our previous work [61] (see also [63]) based on a perturbation-optimization (PO) principle: adequate stochastic perturbation of a quadratic criterion, then optimization of the perturbed criterion. It can be shown that the criterion optimizer is a sample of the target density. The method is applicable if the precision matrix and the mean can be written as sums of the form:
$${\boldsymbol{\Sigma}}_{k}^{-1} = \sum_{n=1}^{N} {\mathbf{M}}_{n}^{\mathrm{t}} {\mathbf{C}}_{n}^{-1} {\mathbf{M}}_{n} ~~\text{and}~~ {\boldsymbol{\mu}}_{k} = {\boldsymbol{\Sigma}}_{k} \sum_{n=1}^{N} {\mathbf{M}}_{n}^{\mathrm{t}} {\mathbf{C}}_{n}^{-1} {\boldsymbol{m}}_{n} $$
By identification, with N=2:
$$\begin{aligned} \left\{\begin{array}{ll} {\mathbf{M}}_{1} & =~ {\mathbf{H}}{\mathbf{S}}_{k} \\ {\mathbf{C}}_{1} & =~ \gamma_{\mathrm{n}}^{-1} {\mathbf{I}}_{P}\\ {\boldsymbol{m}}_{1} & =~ \bar{\boldsymbol{y}}_{k} \end{array}\right. \hspace{1cm} \left\{\begin{array}{ll} {\mathbf{M}}_{2}& =~ {\mathbf{I}}_{P}\\ {\mathbf{C}}_{2}& =~ \gamma_{k}^{-1} {\boldsymbol{\Lambda}}_{k}^{-1}\\ {\boldsymbol{m}}_{2}& =~ {\mathbf{O}}_{P} \end{array}\right. \end{aligned} $$
4.4.1 Perturbation
The perturbation phase of this algorithm consists in drawing the following Gaussian samples:
$${\boldsymbol{\xi}}_{1} \sim {\mathcal{N}}\left({\boldsymbol{m}}_{1},{\mathbf{C}}_{1}\right) ~~\text{and\:}~~ {\boldsymbol{\xi}}_{2} \sim {\mathcal{N}}\left({\boldsymbol{m}}_{2},{\mathbf{C}}_{2}\right) $$
The cost of these sampling steps is not prohibitive: ξ_{1} is a realization of a white noise, and ξ_{2} is a realization of the prior model for x_{k}, computed by FFT.
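The two draws can be sketched as follows, under the assumption (consistent with the model) that Λ_k is diagonalized by FFT with eigenvalues λ_p; the spectrum `lam`, the sizes, and the zero means are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 64
gamma_n, gamma_k = 4.0, 1.0

# Illustrative eigenvalues lam of Lambda_k (diagonal in the Fourier basis);
# an even spectrum keeps the synthesized field real
f = np.fft.fftfreq(N)
lam = 1.0 + f[:, None] ** 2 + f[None, :] ** 2

# xi_1: white Gaussian noise with precision gamma_n (around a zero mean here)
xi1 = rng.standard_normal((N, N)) / np.sqrt(gamma_n)

# xi_2: realization of the prior N(0, (gamma_k Lambda_k)^(-1)), obtained by
# shaping white noise in the Fourier domain (the unitary FFT keeps variances exact)
w = rng.standard_normal((N, N))
xi2 = np.fft.ifft2(np.fft.fft2(w, norm="ortho") / np.sqrt(gamma_k * lam),
                   norm="ortho").real
```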
4.4.2 Optimization
In order to obtain a sample of the image x_{k}, the following criterion must be optimized w.r.t. x:
$$J_{k}({\boldsymbol{x}}) = \gamma_{\mathrm{n}} \left\|{\boldsymbol{\xi}}_{1}-{\mathbf{H}}{\mathbf{S}}_{k}{\boldsymbol{x}}\right\|^{2}+\gamma_{k}\left\|{\boldsymbol{\xi}}_{2}-{\boldsymbol{x}}\right\|^{2}_{{\boldsymbol{\Lambda}}_{k}} \,{.} $$
For notational convenience, let us rewrite:
$$\begin{array}{@{}rcl@{}} J_{k}({\boldsymbol{x}}) & = & {\boldsymbol{x}}^{\dag} {\mathbf{Q}}_{k} {\boldsymbol{x}} - 2 {\boldsymbol{x}}^{\dag} {\boldsymbol{q}}_{k} + J_{k}(0) \end{array} $$
where the matrix \({\mathbf {Q}}_{k}=\gamma _{\mathrm {n}} {\mathbf {S}}_{k}^{\dag } {\mathbf {H}}^{\dag }{\mathbf {H}}{\mathbf {S}}_{k}+\gamma _{k}{\boldsymbol {\Lambda }}_{k}\) is half the Hessian (and the precision matrix) and the vector \({\boldsymbol {q}}_{k}=\gamma _{\mathrm {n}} {\mathbf {S}}_{k}^{\dag }{\mathbf {H}}^{\dag } {\boldsymbol {\xi }}_{1} + \gamma _{k} {\boldsymbol {\Lambda }}_{k}{\boldsymbol {\xi }}_{2}\) is, up to a factor 2, the opposite of the gradient at the origin. The gradient at x itself is: g_{k}=2(Q_{k}x−q_{k}).
Theoretically, there is no constraint on the optimization technique to be used and the literature on the subject is abundant [64–66]. We have only considered algorithms that are guaranteed to converge (to the unique minimizer), and among them, two basic descent directions: the gradient and the conjugate gradient.
We have first used the conjugate gradient direction, since it is particularly efficient for high-dimensional problems and quadratic criteria. However, we have experienced convergence difficulties, making the overall algorithm very slow: in practice, the step length at each iteration was extremely small, probably due to conditioning issues, so the differences between the iterates were almost insignificant. The solution relies on a preconditioner, defined as a CbC approximation of the inverse Hessian of J_{k}:
$$ {\boldsymbol{\Pi}}_{k}= \left(\gamma_{\mathrm{n}} {\mathbf{H}}^{\dag}{\mathbf{H}} + \gamma_{k}{\boldsymbol{\Lambda}}_{k}\right)^{-1}/2 $$
(10)
obtained by eliminating the S_{k} matrix from Q_{k} and chosen for computational efficiency. It is used for both of the aforementioned directions. In this context, the two methods have yielded similar results, and we have finally focused on the preconditioned gradient.
The second necessary ingredient is the step length s in the considered direction, at each iteration. Here again, a variety of strategies is available. We have used an optimal step, explicitly given by:
$$s = \frac{{{\boldsymbol{g}}_{k}}^{\dag} {\boldsymbol{\Pi}}_{k}^{\dag} {\boldsymbol{g}}_{k}}{ {{\boldsymbol{g}}_{k}}^{\dag} {\boldsymbol{\Pi}}_{k}^{\dag} {\mathbf{Q}}_{k} {\boldsymbol{\Pi}}_{k} {\boldsymbol{g}}_{k}} $$
and efficiently computable.
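The optimization loop can be sketched on a toy symmetric positive definite system, with a diagonal preconditioner standing in for the CbC matrix Π_k and an exact line search for the quadratic criterion (a sketch, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50

# Toy SPD system standing in for Q_k x = q_k (whose solution minimizes J_k)
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)
q = rng.standard_normal(n)

# Diagonal preconditioner standing in for the CbC approximation Pi_k
Pi = np.diag(1.0 / np.diag(Q))

x = np.zeros(n)
for _ in range(200):
    r = Q @ x - q                 # half the gradient of J at x
    d = Pi @ r                    # preconditioned descent direction
    denom = d @ (Q @ d)
    if denom < 1e-30:             # already at the minimizer
        break
    s = (d @ r) / denom           # exact line search for a quadratic criterion
    x = x - s * d

residual = float(np.linalg.norm(Q @ x - q))
```

In the paper's setting, the products by Q_k and Π_k are of course not dense matrix products but the FFT/zero-forcing computations detailed in the next subsection.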
4.4.3 Practical implementation
The algorithm requires at each iteration the computation of the preconditioned gradient and the step length. Finally, the required computations for performing the optimization are the vector q_{k} and the products of a vector by the matrices Π_{k} and Q_{k}.

The vector q_{k} writes:
$$ {\boldsymbol{q}}_{k} = \gamma_{\mathrm{n}} \underbrace{{\mathbf{S}}_{k}^{\dag}\underbrace{{\mathbf{H}}^{\dag} {\boldsymbol{\xi}}_{1}}_{\text{FFT}}}_{\text{ZF}}+ \gamma_{k} \underbrace{{\boldsymbol{\Lambda}}_{k}{\boldsymbol{\xi}}_{2}}_{\text{FFT}} $$
(11)
and is thus efficiently computed through an FFT and zero-forcing (ZF).

The product Q_{k}x writes:
$${\mathbf{Q}}_{k} {\boldsymbol{x}} = \gamma_{\mathrm{n}} \underbrace{{\mathbf{S}}_{k}^{\dag} \underbrace{\underbrace{{\mathbf{H}}^{\dag}{\mathbf{H}}}_{\text{FFT}}\underbrace{{\mathbf{S}}_{k} {\boldsymbol{x}}}_{\text{ZF}}}_{\text{FFT}} }_{\text{ZF}} + \gamma_{k} \underbrace{{\boldsymbol{\Lambda}}_{k}{\boldsymbol{x}}}_{\text{FFT}} $$
and thus also efficiently computed through a series of FFT and ZF.

Regarding Π_{k}g_{k}, since the matrix Π_{k} is CbC, the product can also be efficiently computed by FFT.
The zero-forcing process is achieved in the spatial domain (it amounts to setting some pixels of images to zero), while the costly products by matrices are performed in the Fourier domain (all of them by FFT).
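The product by Q_k can be sketched as follows, with an illustrative blur kernel and spectrum; `mask` plays the role of S_k (zero-forcing in the spatial domain) and all circulant products go through the FFT:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 32
gamma_n, gamma_k = 1.0, 2.0

# Illustrative circulant operators: a Gaussian blur kernel for H and a
# smooth even spectrum lam for the eigenvalues of Lambda_k
i = np.arange(N)
psf = np.exp(-0.5 * ((i[:, None] - N // 2) ** 2 + (i[None, :] - N // 2) ** 2) / 4.0)
h_hat = np.fft.fft2(psf / psf.sum())
f = np.fft.fftfreq(N)
lam = 1.0 + f[:, None] ** 2 + f[None, :] ** 2

mask = rng.uniform(size=(N, N)) < 0.5      # S_k: indicator of the pixels labelled k

def circ(mult, x):
    """Product by a circulant matrix given by its transfer function (one FFT pair)."""
    return np.fft.ifft2(mult * np.fft.fft2(x)).real

def Q_times(x):
    # gamma_n S_k^† H^†H S_k x  (ZF -> FFT -> ZF)  +  gamma_k Lambda_k x  (FFT)
    return gamma_n * mask * circ(np.abs(h_hat) ** 2, mask * x) + gamma_k * circ(lam, x)

x, x2 = rng.standard_normal((2, N, N))
sym_lhs = float((x * Q_times(x2)).sum())   # x^† Q x2
sym_rhs = float((x2 * Q_times(x)).sum())   # x2^† Q x: equal, since Q is symmetric
quad = float((x * Q_times(x)).sum())       # positive, since Q is positive definite
```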
4.5 Labels
The label set has a multidimensional categorical distribution:
$${\boldsymbol{\ell}} \sim \exp\left[ \beta \sum_{r\sim s} \, \delta(\ell_{r},\ell_{s}) - \gamma_{\mathrm{n}} \left\|{\boldsymbol{y}}-{\mathbf{H}}\sum_{k}{\mathbf{S}}_{k}({\boldsymbol{\ell}}){\boldsymbol{x}}_{k}\right\|^{2} \right] $$
This is a nonseparable and nonstandard form, so its sampling is not an easy task. A solution is to sample the ℓ_{p} one by one, conditionally on the others and on the rest of the variables, in a Gibbs scheme.
To this end, let us introduce the notation \({\boldsymbol {z}}_{k}^{p}\) for the image identical to z at every pixel except pixel p, where it takes the value of pixel p of x_{k}. Let us also note \({\mathcal {E}}_{p,k}=\left\|{\boldsymbol{y}}-{\mathbf{H}}{\boldsymbol{z}}_{k}^{p}\right\|^{2}\). This error quantifies the discrepancy between the data and class k regarding pixel p.
Sampling a label \(\ell _{p_{0}}\) requires its conditional probability. A precise analysis of the conditional distribution for \(\ell _{p_{0}}\) yields:
$$\text{Pr}(\ell_{p_{0}}=k) \propto \exp\left[\beta \sum_{r;r\sim p_{0}} \delta(\ell_{r},k)-\gamma_{\mathrm{n}} {\mathcal{E}}_{p_{0},k}\right] $$
for k=1,…,K. This computation is performed up to a multiplicative constant, which can be determined knowing that the probabilities sum to 1.
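For illustration, once the two terms of the exponent are available at a given pixel, the draw reduces to a categorical sample (the neighbour counts and errors below are illustrative values, not computed from data):

```python
import numpy as np

rng = np.random.default_rng(6)
K, beta, gamma_n = 3, 1.2, 0.5

# Illustrative ingredients at pixel p0: neighbour label counts and errors E_{p0,k}
neigh_counts = np.array([3, 1, 0])     # neighbours of p0 carrying each label k
E = np.array([10.0, 4.0, 7.0])         # data discrepancy E_{p0,k} for each class

log_p = beta * neigh_counts - gamma_n * E
p = np.exp(log_p - log_p.max())        # stable exponentiation
p /= p.sum()                           # normalize: probabilities sum to 1
label = rng.choice(K, p=p)
```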
To compute these probabilities, we must evaluate the two terms of the argument of the exponential function, at pixel p_{0}. The first term is the contribution of the prior and it can be easily computed for each k by counting the neighbours of pixel p_{0} having the label k. Let us now focus on the second term, \({\mathcal {E}}_{p,k}\). To write this term in a more convenient form, we introduce:

- A vector \({\mathbbm{1}}_{p}\in {\mathbb {R}}^{P}\): its pth entry is 1 and the others are 0.
- A quantity \(\Delta _{p,k}\in {\mathbb {R}}\) that records the difference between the pth pixel of the image z and that of image x_{k}: \(\Delta _{p,k} = {\mathbbm{1}}_{p}^{\dag } ({\boldsymbol {z}}-{\boldsymbol {x}}_{k})\).
We then have \({\boldsymbol {z}}_{k}^{p}={\boldsymbol {z}}-\Delta _{p,k} {\mathbbm{1}}_{p}\), so \({\mathcal {E}}_{p,k}\) writes:
$$ \begin{aligned} {\mathcal{E}}_{p,k} &=\left\|{\boldsymbol{y}}-{\mathbf{H}}\left({\boldsymbol{z}}-\Delta_{p,k}{\mathbbm{1}}_{p}\right)\right\|^{2}\\ &=\left\|({\boldsymbol{y}}-{\mathbf{H}}{\boldsymbol{z}}) + \Delta_{p,k}{\mathbf{H}}{\mathbbm{1}}_{p} \right\|^{2}\\ &=\bar{\boldsymbol{y}}^{\dag} \bar{\boldsymbol{y}} + \Delta_{p,k}^{2} {\mathbbm{1}}_{p}^{\dag}{\mathbf{H}}^{\dag}{\mathbf{H}} {\mathbbm{1}}_{p} + 2 \Delta_{p,k} {\mathbbm{1}}_{p}^{\dag} {\mathbf{H}}^{\dag} \bar{\boldsymbol{y}} \end{aligned} $$
(12)
where \(\bar {\boldsymbol {y}} = {\boldsymbol {y}}-{\mathbf {H}}{\boldsymbol {z}}\). Then, to complete the description, let us analyse each term.

1. The first term \(\bar {\boldsymbol {y}}^{\dag } \bar {\boldsymbol {y}}\) does not depend on p or k. Consequently, its value is not required in the sampling process and it can be included in a multiplicative factor.

2. The term \({\mathbbm{1}}_{p}^{\dag }{\mathbf {H}}^{\dag }{\mathbf {H}} {\mathbbm{1}}_{p}=\|{\mathbf {H}} {\mathbbm{1}}_{p}\|^{2}\) does not depend on p due to the CbC form of the H matrix. Moreover, this norm only needs to be computed once and for all, since the H matrix does not change throughout the iterations. In fact, this norm amounts to the sum \(\sum_{q} \overset{\circ}{h}_{q}^{2}\).

3. Finally, in the third term \({\mathbbm{1}}_{p}^{\dag } {\mathbf {H}}^{\dag } \bar {\boldsymbol {y}}\), the product \({\mathbf {H}}^{\dag } \bar {\boldsymbol {y}}\) is a convolution efficiently computable by FFT, and the product with \({\mathbbm{1}}_{p}^{\dag }\) selects pixel p. Under this form, however, the computation would not be efficient, since \({\mathbf {H}}^{\dag } \bar {\boldsymbol {y}}\) would have to be recomputed at each iteration. A far better alternative is to update \({\mathbf {H}}^{\dag } \bar {\boldsymbol {y}}\) after each label update.