Adaptive independent sticky MCMC algorithms
 Luca Martino^{1},
 Roberto Casarin^{2},
 Fabrizio Leisen^{3} and
 David Luengo^{4}
https://doi.org/10.1186/s13634-017-0524-6
© The Author(s) 2018
Received: 27 February 2017
Accepted: 20 December 2017
Published: 11 January 2018
Abstract
Monte Carlo methods have become essential tools to solve complex Bayesian inference problems in different fields, such as computational statistics, machine learning, and statistical signal processing. In this work, we introduce a novel class of adaptive Monte Carlo methods, called adaptive independent sticky Markov Chain Monte Carlo (MCMC) algorithms, to sample efficiently from any bounded target probability density function (pdf). The new class of algorithms employs adaptive nonparametric proposal densities, which become closer and closer to the target as the number of iterations increases. The proposal pdf is built using interpolation procedures based on a set of support points which is constructed iteratively from previously drawn samples. The algorithm’s efficiency is ensured by a test that supervises the evolution of the set of support points. This extra stage controls the computational cost and the convergence of the proposal density to the target. Each part of the novel family of algorithms is discussed and several examples of specific methods are provided. Although the novel algorithms are presented for univariate target densities, we show how they can be easily extended to the multivariate context by embedding them within a Gibbs-type sampler or the hit and run algorithm. The ergodicity is ensured and discussed. An overview of the related works in the literature is also provided, emphasizing that several well-known existing methods (like the adaptive rejection Metropolis sampling (ARMS) scheme) are encompassed by the new class of algorithms proposed here. Eight numerical examples (including the inference of the hyperparameters of Gaussian processes, widely used in machine learning for signal processing applications) illustrate the efficiency of sticky schemes, both as standalone methods to sample from complicated one-dimensional pdfs and within Gibbs samplers in order to draw from multidimensional target distributions.
Keywords
1 Introduction
Markov chain Monte Carlo (MCMC) methods [1, 2] are very important tools for Bayesian inference and numerical approximation, which are widely employed in signal processing [3–7] and other related fields [1, 8]. A crucial issue in MCMC is the choice of a proposal probability density function (pdf), as this can strongly affect the mixing of the MCMC chain when the target pdf has a complex structure, e.g., multimodality and/or heavy tails. Thus, over the last decade, a substantial body of literature has focused on adaptive proposal pdfs, which allow for self-tuning procedures of the MCMC algorithms, flexible movements within the state space, and improved acceptance rates [9, 10].
Adaptive MCMC algorithms are used in many statistical applications and several schemes have been proposed in the literature [8–11]. There are two main families of methods: parametric and nonparametric. The first strategy consists of adapting the parameters of a parametric proposal pdf according to the past values of the chain [10]. However, even if the parameters are perfectly adapted, a discrepancy between the target and the proposal pdfs remains. A second strategy attempts to adapt the entire shape of the proposal density using nonparametric procedures [12, 13]. Most authors have paid more attention to the first family, designing local adaptive random-walk algorithms [9, 10], due to the difficulty of approximating the full target distribution by nonparametric schemes with any degree of generality.
In this work, we describe a general framework to design suitable adaptive MCMC algorithms with nonparametric proposal densities. After describing the different building blocks and the general features of the novel class, we introduce two specific algorithms. Firstly, we describe the adaptive independent sticky Metropolis (AISM) algorithm to draw efficiently from any bounded univariate target distribution.^{1} Then, we also propose a more efficient scheme that is based on the multiple try Metropolis (MTM) algorithm: the adaptive independent sticky Multiple Try Metropolis (AISMTM) method. The ergodicity of the adaptive sticky MCMC methods is ensured and discussed. The underlying theoretical support is based on the approach introduced in [14]. The new schemes are particularly suitable for sampling from complicated full-conditional pdfs within a Gibbs sampler [5–7].
The main contributions of this work are the following:
1. A very general framework that allows practitioners to design proper adaptive MCMC methods by employing a nonparametric proposal.
2. Two algorithms (AISM and AISMTM) that can be used off-the-shelf in signal processing applications.
3. An exhaustive overview of the related algorithms proposed in the literature, showing that several well-known methods (such as ARMS) are encompassed by the proposed framework.
4. A theoretical analysis of the AISM algorithm, proving its ergodicity and the convergence of the adaptive proposal to the target.
The structure of the paper is the following. Section 2 introduces the generalities of the class of sticky MCMC methods and the AISM scheme. Sections 3 and 4 present the general properties, together with specific examples, of the proposal constructions and the update control tests. Section 5 introduces some theoretical results. Section 6 discusses several related works and highlights some specific techniques belonging to the class of sticky methods. Section 7 introduces the AISMTM method. Section 8 describes the range of applicability of the proposed framework, including its use within other Monte Carlo methods (like the Gibbs sampler or the hit and run algorithm) to sample from multivariate distributions. Eight numerical examples (including the inference of hyperparameters of Gaussian processes) are then provided in Section 9.^{2} Finally, Section 10 contains some conclusions and possible future lines of research.^{3}
2 Adaptive independent sticky MCMC algorithms
1. Construction of the nonparametric proposal: given the nodes in \(\mathcal {S}_{t}\), the function q_{ t } is built using a suitable nonparametric procedure that provides a function which is closer and closer to the target as the number of points m_{ t } increases. Section 3 describes the general properties that must be fulfilled by a suitable proposal construction, as well as specific procedures to build this proposal.
2. MCMC stage: some MCMC method is applied in order to move the chain from its current state, x_{ t }, employing \(\widetilde {q}_{t}(x|\mathcal {S}_{t})\) as (part of the) proposal pdf. This stage produces the next state of the chain, x_{t+1}, and an auxiliary variable z (see Tables 1 and 4), which is used in the following update stage.
3. Update stage: a statistical test on the auxiliary variable z is performed in order to decide whether to increase the number of points in \(\mathcal {S}_{t}\) or not, defining a new support set, \(\mathcal {S}_{t+1}\), which is used to construct the proposal at the next iteration. The update stage controls the computational cost and ensures the ergodicity of the generated chain (see Appendix A). Section 4 is devoted to the design of different suitable update rules.
In the following section, we describe the simplest possible sticky method, obtained by using the MH algorithm, whereas in Section 7 we consider a more sophisticated technique that employs the MTM scheme.^{5}
2.1 Adaptive independent sticky Metropolis
where \( \eta _{t}(z,d): \mathcal {X}\times \mathbb {R}^{+}\rightarrow [0,1] \) is an increasing test function w.r.t. the variable d, such that η_{ t }(z,0)=0, and \( d=d_{t}(z)=\left | \pi (z)-q_{t}(z|\mathcal {S}_{t})\right |\) is the pointwise distance between π and q_{ t } at z. The rationale behind this test is to use information from the target density in order to include in the support set only those points where the proposal pdf differs substantially from the target value at z. Note that, since z is always different from the current state (i.e., z≠x_{ t } for all t), the proposal pdf is independent of the current state according to Holden’s definition [14], and thus the theoretical analysis is greatly simplified.
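The three stages described above can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions that are not taken from the text: the bimodal `target` on the bounded domain [-10, 10], the piecewise-constant construction in `q_value` (the maximum of the target at the two enclosing nodes), the standard independent Metropolis-Hastings acceptance ratio, and the choice of z as the point that does not become the new state. The update test uses Rule 3 from Section 4.

```python
import bisect
import math
import random


def target(x):
    """Unnormalized bimodal target, assumed to live on [-10, 10] (illustrative)."""
    return math.exp(-0.5 * (x + 4.0) ** 2) + 0.7 * math.exp(-0.5 * (x - 3.0) ** 2)


def q_value(x, nodes, vals):
    """Piecewise-constant proposal: max of the target at the two enclosing nodes."""
    i = min(max(bisect.bisect_right(nodes, x) - 1, 0), len(nodes) - 2)
    return max(vals[i], vals[i + 1])


def q_sample(nodes, vals, rng):
    """Exact draw: pick an interval by its area, then draw uniformly inside it."""
    areas = [max(vals[i], vals[i + 1]) * (nodes[i + 1] - nodes[i])
             for i in range(len(nodes) - 1)]
    i = rng.choices(range(len(areas)), weights=areas)[0]
    return rng.uniform(nodes[i], nodes[i + 1])


def aism(n_iter, init_nodes, x0, rng):
    """Hypothetical AISM loop: proposal construction, MH stage, update stage."""
    nodes = sorted(init_nodes)
    vals = [target(s) for s in nodes]
    x, chain = x0, []
    for _ in range(n_iter):
        x_prop = q_sample(nodes, vals, rng)
        # Independent MH acceptance ratio (assumed standard form).
        ratio = (target(x_prop) * q_value(x, nodes, vals)) / (
            target(x) * q_value(x_prop, nodes, vals))
        if rng.random() < min(1.0, ratio):
            x, z = x_prop, x   # accepted: the previous state becomes z
        else:
            z = x_prop         # rejected: the candidate becomes z
        # Update stage with Rule 3: eta = d / max(pi(z), q(z)).
        pz, qz = target(z), q_value(z, nodes, vals)
        if rng.random() < abs(pz - qz) / max(pz, qz):
            k = bisect.bisect_left(nodes, z)
            nodes.insert(k, z)
            vals.insert(k, pz)
        chain.append(x)
    return chain, nodes
```

For instance, `aism(2000, [-10.0, -5.0, 0.0, 5.0, 10.0], 0.0, random.Random(0))` returns the chain and the final support set. In this sketch z never coincides with the new state, which is the property the paragraph above relies on.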
3 Construction of the sticky proposals
There are many alternatives available for the construction of a suitable sticky proposal (SP). However, in order to be able to provide some theoretical results in Section 5, let us define precisely what we understand here by a sticky proposal.
Definition 1
1. The proposal function is positive, i.e., \(q_{t}(x|\mathcal {S}_{t})>0\) for all \(x\in \mathcal {X}\) and for all possible sets \(\mathcal {S}_{t}\) with \(t\in \mathbb {N}\).
2. Samples can be drawn directly and easily from the resulting proposal, \(\widetilde {q}_{t}(x|\mathcal {S}_{t})\propto q_{t}(x|\mathcal {S}_{t})\), using some exact sampling procedure.
3. For any bounded target, π(x), the resulting proposal function, \(q_{t}(x|\mathcal {S}_{t})\), is also bounded. Furthermore, defining \(\mathcal {I}_{t} = (s_{1},s_{m_{t}}]\), we have
$$ \max_{x \in \mathcal{I}_{t}} q_{t}(x|\mathcal{S}_{t}) \le \max_{x \in \mathcal{I}_{t}} \pi(x). $$
4. The proposal function, \(q_{t}(x|\mathcal {S}_{t})\), has heavier tails than the target, i.e., defining \(\mathcal {I}_{t}^{c} = (-\infty,s_{1}] \cup (s_{m_{t}},\infty)\), we have
$$ q_{t}(x|\mathcal{S}_{t}) \ge \pi(x) \qquad \forall x \in \mathcal{I}_{t}^{c}. $$
Condition 1 guarantees that the function \(q_{t}(x|\mathcal {S}_{t})\) leads to a valid pdf, \(\widetilde {q}_{t}(x|\mathcal {S}_{t})\), that covers the entire support of the target.
Condition 2 is required from a practical point of view to obtain efficient algorithms. Finally, conditions 3 and 4 are required by the proofs of Theorems 3 and 1, respectively, and also make sense from a practical point of view: if the target is bounded, we would expect the proposal learnt from it to be also bounded and this proposal should be heavier tailed than the target in order to avoid undersampling the tails. Now we can define precisely what we understand by a “sticky” proposal.
Definition 2
where \(d_{t}(z) = |\pi (z)-q_{t}(z|\mathcal {S}_{t})|\) is the L_{1} distance between π(x) and q_{ t }(x) evaluated at \(z \in \mathcal {X}\), and (2) implies almost everywhere (a.k.a., almost surely) convergence of q_{ t }(x) to π(x).
In the following, we provide some examples of constructions that fulfill all the conditions in Definitions 1 and 2. All of them approximate the target pdf by interpolating points that belong to the graph of the target function π.
3.1 Examples of constructions
A more sophisticated and costly construction has been proposed for the ARMS method in [12]. However, note that this construction does not fulfill Condition 3 in Definition 1. A similar construction based on B-spline interpolation methods has been proposed in [22, 23] to build a nonadaptive random walk proposal pdf for an MH algorithm. Other alternative procedures can also be found in the literature [13, 16, 18–20].
4 Update of the set of support points
In AISM, a suitable choice of the function η_{ t }(z,d) is required. Although more general functions could be employed, we concentrate on test functions that fulfill the conditions provided in the following definition.
Definition 3
1. \(\eta _{t}(z,d): \mathcal {X}\times \mathbb {R}^{+}\rightarrow [0,1]\).
2. η_{ t }(z,0)=0 for all \(z\in \mathcal {X}\) and \(t\in \mathbb {N}\).
3. \(\lim \limits _{d\rightarrow \infty }\eta _{t}(z,d)=1\) for all \(z\in \mathcal {X}\) and \(t\in \mathbb {N}\).
4. η_{ t }(z,d) is a strictly increasing function w.r.t. d, i.e., η_{ t }(z,d_{2})>η_{ t }(z,d_{1}) for any d_{2}>d_{1}.
4.1 Examples of update rules
where β>0 is a constant parameter. Note that this is the cdf associated to an exponential random variable.
where 0<ε_{ t }<M_{ π }, with \(M_{\pi }=\max \limits _{z\in \mathcal {X}}\{\pi (z)\}\),^{6} is some appropriate time-varying threshold that can either follow some user-predefined rule or be updated automatically.^{7} Alternatively, we can also set this threshold to a fixed value, ε_{ t }=ε, as done in the simulations. In this case, setting ε≥M_{ π } implies that the update of \(\mathcal {S}_{t}\) never happens (i.e., new support points are never added to the support set), whereas candidate nodes would be incorporated into \(\mathcal {S}_{t}\) almost surely by setting ε→0. For any other value of ε (i.e., 0<ε<M_{ π }), the adaptation would eventually stop and no support points would be added after some random number of iterations. Note that this update rule does not fulfill Condition 4 in Definition 3, implying that some of the theoretical results of Section 5 (e.g., Conjecture 1) are not applicable. However, we have included it here because it is a very simple rule that has shown good performance in practice and can be useful to limit the number of support points by using a fixed value of ε. Finally, note also that Eq. (6) corresponds to the cdf associated with a Dirac delta located at ε_{ t }.
Table 2 Examples of test function η_{ t }(z,d) for different update rules (recall that \(d=d_{t}(z)=|q_{t}(z|\mathcal {S}_{t})-\pi (z)|\))

| Update rule | Test function |
|---|---|
| Rule 1 | η_{ t }(d)=1−e^{−βd} |
| Rule 2 | \(\eta _{t}(d)=\left \{ \begin {array}{ll} 1, & \text {if} ~d> \varepsilon _{t},\\ 0, & \text {if}~ d\leq \varepsilon _{t} \end {array} \right.\) |
| Rule 3 | \(\eta _{t}(z,d)=\frac {d}{\max \{\pi (z),q_{t}(z\mid \mathcal {S}_{t})\}}\) |
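As a minimal sketch, the three test functions listed above can be written as plain Python functions. The function names and the default values of `beta` and `eps` are illustrative choices, not values prescribed by the text.

```python
import math


def rule1(d, beta=2.0):
    """Rule 1: cdf of an exponential random variable, 1 - exp(-beta * d)."""
    return 1.0 - math.exp(-beta * d)


def rule2(d, eps=0.01):
    """Rule 2: hard threshold, add the point only if d exceeds eps."""
    return 1.0 if d > eps else 0.0


def rule3(pi_z, q_z):
    """Rule 3: distance normalized by the larger of target and proposal values."""
    d = abs(pi_z - q_z)
    m = max(pi_z, q_z)
    return d / m if m > 0.0 else 0.0
```

Each function returns the probability of adding the candidate z to the support set. Rule 2 is a step function rather than a strictly increasing one, which is why it fails Condition 4 of Definition 3.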
5 Theoretical results
In this section, we provide some theoretical results regarding the ergodicity of the proposed approach, the convergence of a sticky proposal to the target, and the expected growth of the number of support points of the proposal. First of all, regarding the ergodicity of the AISM, we have the following theorem.
Theorem 1
with c_{ π } and c_{ ℓ } denoting the normalizing constants of π(x) and \(q_{\ell }(x|\mathcal {S}_{\ell })\), respectively.
Proof
See Appendix A. □
Theorem 1 ensures that the pdf of the states of the Markov chain becomes closer and closer to the target pdf as t increases, since 0≤1−a_{ t }≤1 and thus the product on the right-hand side of (9) is a decreasing function of t. This theorem is a direct consequence of Theorem 2 in [14], and ensures the ergodicity of the proposed adaptive MCMC approach. Regarding the convergence of a sticky proposal to the target, we consider the following conjecture.
Conjecture 1
An intuitive argument is provided in Appendix A.
Note that Conjecture 1 essentially shows that the “sticky” condition is fulfilled for PWC and PWL proposals and continuous, bounded targets with bounded first and second derivatives. Note also that this conjecture implies that \(q_{t}(x|\mathcal {S}_{t}) \to \pi (x)\) almost everywhere. Combining Theorem 1 and Conjecture 1 we get the following corollary.
Corollary 2
Proof
Finally, we also have a bound on the expected growth of the number of support points, as provided by the following theorem.
Theorem 3
where \(D_{1}(\pi, q_{t}) = \int _{\mathcal {X}}{d_{t}(z) dz}\), and \(C = \max _{z\in \mathcal {X}} \widetilde {q}_{t}(z|\mathcal {S}_{t})\) is a constant that depends on the sticky proposal used. Furthermore, under the conditions of Conjecture 1, \(E[P_{a}|x_{t-1}, \mathcal {S}_{t}] \to 0\) as t→∞.
Proof
See Appendix C.1. □
Theorem 3 sets a bound on the expected probability of adding new support points, and thus on the expected rate of growth of the number of support points. Furthermore, under certain smoothness assumptions for the target (i.e., that π(x) is twice continuously differentiable), it also guarantees that this expectation tends to zero as the number of iterations increases, hence implying that fewer points are added as the algorithm evolves. Finally, note that Theorem 3 has been derived only for η_{ t }(z,d)=η_{ t }(d). However, under certain mild assumptions, it can be easily extended to more general test functions, as stated in the following corollary.
Corollary 4
where \(\widetilde {D}_{1}(\pi, q_{t}) = \int _{\mathcal {X}}{\widetilde {d}_{t}(z) dz}\) and \(C = \max _{z\in \mathcal {X}} \widetilde {q}_{t}(z|\mathcal {S}_{t})\). Furthermore, under the conditions of Conjecture 1, \(E\left [P_{a}|x_{t-1}, \mathcal {S}_{t}\right ] \to 0\) as t→∞.
Proof
See Appendix C.2. □
Note that Corollary 4 allows us to extend the results of Theorem 3 to update rule 3, which corresponds to \(\eta _{t}(z,d_{t}(z)) = \widetilde {d}_{t}(z)\) with \(\widetilde {d}_{t}(z) = \frac {d}{\max \{\pi (z),q_{t}(z|\mathcal {S}_{t})\}}\) and d denoting the L_{1} norm.
6 Related works
6.1 Other examples of sticky MCMC methods
Table 3 Special cases of sticky MCMC algorithms

| Features | Griddy Gibbs | ARMS | IA^{2}RMS |
|---|---|---|---|
| Main reference | [15] | [12] | [13] |
| Proposal pdf p_{ t }(x) | \(p_{t}(x)=\widetilde {q}_{t}(x)\) | p_{ t }(x)∝ min{q_{ t }(x),π(x)} | p_{ t }(x)∝ min{q_{ t }(x),π(x)} |
| Proposal Constr. | Eq. (3) | | |
| Update rule or P_{ a }(z) | Never update, i.e., Rule 2 with ε=∞, i.e., P_{ a }(z)=0 for all z. | If q_{ t }(z)≥π(z) then Rule 3; if q_{ t }(z)<π(z) then no update, i.e., Rule 2 with ε=∞; i.e., \(P_{a}(z)=\max \left [1-\frac {\pi (z)}{q_{t}(z)},0\right ]\) | Rule 3 |

1. Draw \(x'\sim {\widetilde q}_{t}(x) \propto q_{t}(x)\) and \(u' \sim \mathcal {U}([0,1])\).
2. If \(u'\leq \frac {\pi (x')}{q_{t}(x')}\), then set x_{ a }=x^{′}.
3. Otherwise, if \(u' > \frac {\pi (x')}{q_{t}(x')}\), repeat from 1.
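The three accept/reject steps listed above can be sketched as a small Python function. The callables `pi`, `q`, and `q_sample`, as well as the `max_tries` guard, are hypothetical names introduced only for this illustration. When q_t lies below π at the candidate, the ratio exceeds one and the candidate is always accepted, so the returned point is distributed as p_t(x) ∝ min{q_t(x), π(x)}, in line with Table 3.

```python
import random


def rejection_step(pi, q, q_sample, rng, max_tries=100_000):
    """Repeat steps 1-3 until a candidate x' is accepted; return x_a = x'."""
    for _ in range(max_tries):
        x_prop = q_sample(rng)              # step 1: x' ~ q_t (normalized)
        u = rng.random()                    # step 1: u' ~ U([0, 1])
        if u <= pi(x_prop) / q(x_prop):     # step 2: accept
            return x_prop
        # step 3: otherwise repeat from step 1
    raise RuntimeError("no candidate accepted within max_tries")
```

As a usage example, with a flat proposal function `q = lambda x: 0.8` on [-3, 3], a candidate drawn by `lambda rng: rng.uniform(-3.0, 3.0)`, and `pi = lambda x: math.exp(-0.5 * x * x)`, candidates near the origin (where π exceeds 0.8) are always kept, whereas candidates in the tails are thinned.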
6.2 Related algorithms
Other related methods, using nonparametric proposals, can be found in the literature. Samplers for drawing from univariate pdfs, using similar proposal constructions, have been proposed in [20], but the sequence of adaptive proposals does not converge to the target. Interpolation procedures for building the proposal pdf are also employed in [22, 23]. The authors in [22, 23] suggest building the proposal by B-spline procedures. However, in this case, the resulting proposal is of random-walk type (not independent) and the resulting algorithm is not adaptive. Furthermore, the shape of the proposal does not converge to the shape of the target; only local approximations via B-spline interpolation are obtained. The methods [12, 13, 15] are included in the sticky class of algorithms, as pointed out in Section 6.1. In [16], the authors suggest an alternative proposal construction based on second-order polynomial pieces, in order to be used with the ARMS structure [12].
The adaptive rejection sampling (ARS) method [19, 26] is not an MCMC technique, but it is strongly related to the sticky approach, since it also employs an adaptive nonparametric proposal pdf. ARS needs to be able to build a proposal such that q_{ t }(x)≥π(x), \(\forall x\in \mathcal {X}\) and \(\forall t\in \mathbb {N}\). This is possible only when more requirements about the target are assumed (for instance, log-concavity). For this reason, several extensions of the standard ARS have also been proposed [25, 27, 28], for tackling wider classes of target distributions. In [29], the nonparametric proposal is still adapted, but in this case the number of support points remains constant, fixed in advance by the user. Different nonparametric construction procedures that address multivariate distributions have also been presented [21, 30, 31].
Other techniques have been developed specifically for the Monte Carlo-within-Gibbs scenario, when it is not possible to draw directly from the full-conditional pdfs. In [32], an importance sampling approximation of the univariate target pdf is employed and a resampling step is performed in order to provide an “approximate” sample from the full-conditional. In [18], the authors suggest a nonadaptive strategy for building a suitable nonparametric proposal via interpolation. In this work, the interpolation procedure is first performed using a very large number of nodes and then many of them are discarded, according to a suitable criterion. Several other alternatives involving MH-type algorithms have been used for sampling efficiently from the full-conditional pdfs within a Gibbs sampler [5–7, 15, 33–35].
7 Adaptive independent sticky MTM
AISMTM provides a better choice of the new support points than AISM (see Section 9). The price to pay for this increased efficiency is the higher computational cost per iteration. However, since the proposal quickly approaches the target, it is possible to design strategies with a decreasing number of tries (M_{1}≥M_{2}≥⋯≥M_{ t }≥⋯≥M_{ T }) in order to reduce the computational cost.
7.1 Update rules for AISMTM
where we have used \(1-\sum _{i=1}^{(r)} P_{\mathcal {Z}}(z_{i})=\frac {M}{\sum _{j=1}^{M}\varphi _{t}\left (z_{j}\right)}\).
8 Range of applicability and multivariate generation
The range of applicability of the sticky MCMC methods is briefly discussed below. On the one hand, sticky MCMC methods can be employed as standalone algorithms. Indeed, in many applications, it is necessary to draw samples from complicated univariate target pdfs (for example, in signal processing; see [38]). In this case, the sticky schemes provide virtually independent samples (i.e., with correlation close to zero) very efficiently. It is also important to remark that AISM and AISMTM automatically provide an estimate of the normalizing constant of the target (a.k.a. marginal likelihood or Bayesian evidence), since, with a suitable choice of the update test, the proposal approaches the target pdf almost everywhere. This is usually a hard task using MCMC methods [1, 2, 11].
For instance, this happens in blind equalization and source separation, or spectral analysis [3, 4]. For simplicity, in the following we denote the target pdf as \(\widetilde {\pi }(\mathbf {x})\). When direct sampling from \(\widetilde {\pi }(\mathbf {x})\) in the space \(\mathbb {R}^{L}\) is unfeasible, a common approach is the use of Gibbs-type samplers [2]. This type of method splits the complex sampling problem into simpler univariate cases. Below we briefly summarize some well-known Gibbs-type algorithms.
1. Draw \(x_{\ell }^{(k)} \sim \widetilde {\pi }_{\ell }\left (x|\mathbf {x}_{1:\ell -1}^{(k)}, \mathbf {x}_{\ell +1:L}^{(k-1)}\right)\) for ℓ=1,…,L.
2. Set \(\mathbf {x}^{(k)}=\left [x_{1}^{(k)},\ldots,x_{L}^{(k)}\right ]^{\top }\).
The steps above are repeated for k=1,…,N_{ G }, where N_{ G } is the total number of Gibbs iterations. However, even sampling from \(\widetilde {\pi }_{\ell }\) can often be complicated. In some specific situations, rejection samplers [41–45] and their adaptive versions, adaptive rejection sampling (ARS) algorithms, are employed to generate (one) sample from \(\widetilde {\pi }_{\ell }\) [12, 19, 25, 27–29, 40, 46, 47]. The ARS algorithms are very appealing techniques since they construct a nonparametric proposal in order to mimic the shape of the target pdf, yielding in general excellent performance (i.e., independent samples from \(\widetilde {\pi }_{\ell }\) with a high acceptance rate). However, their range of application is limited to some specific classes of densities [19, 47].
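The Gibbs recursion above can be illustrated with a toy case in which the full-conditionals are available in closed form. The following sketch assumes a zero-mean bivariate Gaussian target with unit variances and correlation `rho` (an illustrative choice, not an example from the text), for which each full-conditional is Gaussian with mean `rho` times the other coordinate and variance 1 − rho². In the harder cases discussed next, the exact draw in the inner loop would be replaced by an internal sampler.

```python
import math
import random


def gibbs_bivariate_gaussian(n_gibbs, rho, rng, x_init=(0.0, 0.0)):
    """Gibbs sampler for a bivariate Gaussian with correlation rho (toy target)."""
    x = list(x_init)
    cond_std = math.sqrt(1.0 - rho ** 2)
    samples = []
    for _ in range(n_gibbs):                 # k = 1, ..., N_G
        for ell in range(2):                 # step 1: loop over the components
            other = x[1 - ell]               # most recent value of the other one
            x[ell] = rng.gauss(rho * other, cond_std)
        samples.append(tuple(x))             # step 2: store x^(k)
    return samples
```

Note that each component is updated using the most recent value of the other component, which matches the conditioning on components 1:ℓ−1 at iteration k and ℓ+1:L at iteration k−1.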
More generally, when it is impossible to draw from a full-conditional pdf \(\widetilde {\pi }_{\ell }\) (and a rejection sampler cannot be applied either), an additional MCMC sampler is required in order to draw from \(\widetilde {\pi }_{\ell }\) [33]. Thus, in many practical scenarios, we have an MCMC (e.g., an MH sampler) inside another MCMC scheme (i.e., the Gibbs sampler). In the so-called MH-within-Gibbs approach, only one MH step is often performed within each Gibbs iteration, in order to draw from each complicated full-conditional. This hybrid approach preserves the ergodicity of the Gibbs sampler and provides good performance in many cases. On the other hand, several authors have noticed that using a single MH step for the internal MCMC is not always the best solution in terms of performance (cf. [48]). Other approximate approaches have also been proposed, considering the application of importance sampling within the Gibbs sampler [32].
Using a larger number of iterations for the MH algorithm, there is a higher probability of avoiding the “burn-in” period, so that the last sample is distributed as the full-conditional [33–35]. Thus, this case is closer to the ideal situation, i.e., sampling directly from the full-conditional pdf. However, unless the proposal is very well tailored to the target, a properly designed adaptive MCMC algorithm should provide less correlated samples than a standard MH algorithm. Several more sophisticated (adaptive or not) MH schemes for the application “within-Gibbs” have been proposed in the literature [12, 13, 16, 18, 20, 23, 49, 50]. In general, these techniques employ a nonparametric proposal pdf in the same fashion as the ARS schemes (and as the sticky MCMC methods). It is important to remark that performing more steps of a standard or adaptive MH within a Gibbs sampler can provide better performance than performing a longer Gibbs chain applying only one MH step (see, e.g., [12, 13, 16, 17]).
Recycling Gibbs sampling. Recently, an alternative Gibbs scheme, called Recycling Gibbs (RG) sampler, has been proposed in the literature [51]. The combined use of RG with a sticky algorithm is particularly interesting since RG recycles and employs all the samples drawn from each full-conditional pdf in the final estimators. Clearly, this scheme is especially well suited to the use of an adaptive sticky MCMC algorithm, where different MCMC steps are performed for each full-conditional pdf.
1. Choose uniformly a direction d^{(k)} in \(\mathbb {R}^{L}\). For instance, it can be done drawing L samples v_{ ℓ } from a standard Gaussian \(\mathcal {N}(0,1)\), and setting
$$ \mathbf{d}^{(k)}=\frac{\mathbf{v}}{\sqrt{\mathbf{v}\mathbf{v}^{\top}} }, $$
where v=[v_{1},…,v_{ L }].
2. Set x^{(k)}=x^{(k−1)}+λ^{(k)}d^{(k)}, where λ^{(k)} is drawn from the univariate pdf
$$p(\lambda)\propto \widetilde{\pi}\left(\mathbf{x}^{\left(k-1\right)}+\lambda \mathbf{d}^{(k)}\right), $$
where \(\widetilde {\pi }\left (\mathbf {x}^{(k-1)}+\lambda \mathbf {d}^{(k)}\right)\) is a slice of the target pdf along the direction d^{(k)}.
Also in this case, we need to be able to draw from the univariate pdf p(λ) using either some direct sampling technique or another Monte Carlo method (e.g., see [50]).
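One iteration of the two steps above can be sketched as follows. The direction is drawn exactly as described (a normalized Gaussian vector). For the univariate draw of λ, this sketch uses a crude stand-in that is not taken from the text: p(λ) is evaluated on a fixed grid and λ is drawn from the resulting discrete weights. The grid limits `lam_max` and `n_grid` are hypothetical tuning values, and in practice a sticky sampler or another Monte Carlo method would take the place of this grid draw.

```python
import math
import random


def random_direction(dim, rng):
    """Uniform direction on the unit sphere: normalize a standard Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]


def hit_and_run_step(x, target, rng, lam_max=10.0, n_grid=401):
    """One hit-and-run move; lambda is drawn on a grid (illustrative stand-in)."""
    d = random_direction(len(x), rng)
    lams = [-lam_max + 2.0 * lam_max * i / (n_grid - 1) for i in range(n_grid)]
    # Slice of the target along the direction d, evaluated on the grid.
    weights = [target([xi + lam * di for xi, di in zip(x, d)]) for lam in lams]
    lam = rng.choices(lams, weights=weights)[0]
    return [xi + lam * di for xi, di in zip(x, d)]
```

With an odd `n_grid`, the grid contains λ=0, so the current point always carries positive weight whenever the target is positive there.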
There are several methods similar to the Hit and Run where drawing from a univariate pdf is required; for instance, the most popular one is the Adaptive Direction Sampling [52].
Sampling from univariate pdfs is also required inside other types of MCMC methods. For instance, this is the case of exchange-type MCMC algorithms [53] for handling models with intractable partition functions. In this case, efficient techniques for generating artificial observations are needed. Techniques which generalize the ARS method, using nonparametric proposals, have been applied for this purpose (see [54]).
9 Numerical simulations
In this section, we provide several numerical results comparing the sticky methods with several well-known MCMC schemes, such as the ARMS technique [12], the adaptive MH method in [10], and the slice sampler [55].^{9} The first two experiments (which can be easily reproduced by interested users) correspond to bimodal one-dimensional and two-dimensional targets, respectively, and are used as benchmarks to compare different variants of the AISM and AISMTM methods with other techniques. They allow us to show the benefits of the nonparametric proposal construction, even in these two simple experiments. Then, in the third example, we approximate the hyperparameters of a Gaussian process (GP) [56], which is often used for regression purposes in machine learning for signal processing applications.
9.1 Multimodal target distribution

- P1: the construction given in [12] formed by exponential pieces, specifically designed for ARMS.
- P2: alternative construction formed by exponential pieces obtained by a linear interpolation in the log-pdf domain, given in [13].
- P3: the construction using uniform pieces in Eq. (3).
- P4: the construction using linear pieces in Eq. (4).
Furthermore, for AISM and AISMTM, we consider the Update Rule 1 (R1) with different values of the parameter β, the Update Rule 2 (R2) with different values of the parameter ε, and the Update Rule 3 (R3) for the inclusion of a new node in the set \(\mathcal {S}_{t}\) (see Section 4). More specifically, we first test AISM and AISMTM with all the construction procedures P1, P2, P3, and P4 jointly with the rule R3. Then, we test AISM with the construction P4 and the update test R2 with ε∈{0.005,0.01,0.1,0.2}. For Rule 1 we consider β∈{0.3,0.5,0.7,2,3,4}. All the algorithms start with \(\mathcal {S}_{0}=\{-10,-8,5,10\}\) and initial state x_{0}=−6.6. For AISMTM, we have set M∈{10,50}. For each independent run, we perform T=5000 iterations of the chain.
(ExSect9.1). For each algorithm, the table shows the mean square error (MSE), the autocorrelation (ρ(τ)) at different lags, the effective sample size (ESS), the final number of support points (m_{ T }), and the computing times normalized w.r.t. ARMS (Time).

| Algorithm | MSE | ρ(1) | ρ(10) | ρ(50) | ESS | m_{ T } | Time |
|---|---|---|---|---|---|---|---|
| ARMS [12] | 10.04 | 0.4076 | 0.3250 | 0.2328 | 89.12 | 118.19 | 1.00 |
| AISM-P1-R3 | 3.0277 | 0.1284 | 0.1099 | 0.0934 | 235.76 | 152.63 | 1.23 |
| AISM-P2-R3 | 2.9952 | 0.1306 | 0.1125 | 0.0929 | 235.01 | 71.14 | 0.27 |
| AISM-P3-R3 | 0.0290 | 0.0535 | 0.0165 | 0.0077 | 609.05 | 279.65 | 0.65 |
| AISM-P4-R3 | 0.0354 | 0.0354 | 0.0195 | 0.0086 | 608.76 | 84.87 | 0.33 |
| AISMTM-P1-R3 (M=10) | 0.6720 | 0.0726 | 0.0696 | 0.0624 | 336.84 | 159.01 | 2.35 |
| AISMTM-P1-R3 (M=50) | 0.1666 | 0.0430 | 0.0395 | 0.0316 | 617.10 | 160.75 | 5.45 |
| AISMTM-P2-R3 (M=10) | 0.5632 | 0.0588 | 0.0525 | 0.0443 | 440.23 | 72.16 | 1.13 |
| AISMTM-P2-R3 (M=50) | 0.1156 | 0.0345 | 0.0303 | 0.0231 | 746.45 | 72.53 | 4.38 |
| AISMTM-P3-R3 (M=10) | 0.0105 | 0.0045 | 0.0001 | 0.0001 | 4468.10 | 315.78 | 2.60 |
| AISMTM-P3-R3 (M=50) | 0.0099 | 0.0041 | 0.0001 | 0.0001 | 4843.81 | 360.73 | 10.59 |
| AISMTM-P4-R3 (M=10) | 0.0108 | 0.0036 | 0.0011 | 0.0014 | 3678.79 | 92.67 | 1.86 |
| AISMTM-P4-R3 (M=50) | 0.0098 | 0.0001 | 0.0001 | 0.0001 | 4912.07 | 101.78 | 7.25 |
| AISM-P4-R2 (ε=0.01) | 0.0412 | 0.0407 | 0.0213 | 0.0074 | 604.95 | 35.01 | 0.11 |
| AISM-P4-R2 (ε=0.005) | 0.0321 | 0.0360 | 0.0181 | 0.0072 | 610.01 | 43.32 | 0.20 |
| AISM-P4-R1 (β=0.3) | 0.1663 | 0.2710 | 0.1368 | 0.0593 | 216.75 | 25.56 | 0.08 |
| AISM-P4-R1 (β=0.7) | 0.1046 | 0.1781 | 0.0866 | 0.0441 | 356.21 | 33.55 | 0.11 |
| AISM-P4-R1 (β=2) | 0.0824 | 0.0947 | 0.0408 | 0.0204 | 677.73 | 46.81 | 0.21 |
| AISM-P4-R1 (β=3) | 0.0371 | 0.0720 | 0.0281 | 0.0099 | 714.90 | 52.76 | 0.23 |
| AISM-P4-R1 (β=4) | 0.0310 | 0.0621 | 0.0253 | 0.0096 | 802.18 | 58.66 | 0.24 |
AISM and AISMTM outperform ARMS, providing a smaller MSE and correlation (both close to zero). This is because ARMS does not allow a complete adaptation of the proposal pdf, as highlighted in [13]. The adaptation in AISM and AISMTM provides a better approximation of the target than ARMS, as also indicated by the ESS, which is substantially higher in the proposed methods. ARMS is in general slower than AISM for two main reasons. Firstly, the construction P1 (used by ARMS) is more costly since it requires the computation of several intersection points [12]. This is not required for the procedures P2, P3, and P4. Secondly, the effective number of iterations in ARMS is higher than T=5000 (the averaged value is ≈5057.83) due to the discarded samples in the rejection step (in this case, the chain is not moved forward).
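For readers who want to compute similar diagnostics on their own chains, the following sketch estimates the sample autocorrelation ρ(τ) and an effective sample size. The ESS formula used here, T/(1+2Σρ(τ)) with the sum truncated at the first non-positive autocorrelation, is a common convention and only an assumption; the exact definition behind the ESS column above is not stated in this section.

```python
def autocorr(chain, lag):
    """Sample autocorrelation of a scalar chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[i] - mean) * (chain[i + lag] - mean)
              for i in range(n - lag)) / n
    return cov / var


def effective_sample_size(chain, max_lag=100):
    """ESS = T / (1 + 2 * sum of positive-lag autocorrelations), truncated."""
    total = 0.0
    for lag in range(1, max_lag + 1):
        rho = autocorr(chain, lag)
        if rho <= 0.0:       # stop at the first non-positive estimate
            break
        total += rho
    return len(chain) / (1.0 + 2.0 * total)
```

Under this convention, a chain of nearly independent samples yields an ESS close to its length, whereas a strongly correlated chain yields a much smaller value.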
9.2 Missing mode experiment
(ExSect9.2). Mean absolute error (MAE) in the estimation of var[X]=49.55, for different techniques and different scale parameters σ_{ p } (T=10^{4})

| Algorithm | σ_{ p }=2 | σ_{ p }=3 | σ_{ p }=8 | σ_{ p }=10 |
|---|---|---|---|---|
| Standard MH | 13.51 | 0.94 | 0.27 | 0.35 |
| Adaptive MH | 3.28 | 0.29 | 0.24 | 0.28 |
| Robust AISM | 1.79 | 0.16 | 0.13 | 0.14 |
9.3 Heavy-tailed target distribution
∀x≥λ. Given a random variable \(X \sim {\bar \pi }(x)\), we have that E[X]=∞ and Var[X]=∞ due to the heavy tail of the Lévy distribution. However, the normalizing constant, \(\frac {1}{c_{\pi }}\), such that \({\bar \pi }(x) = \frac {1}{c_{\pi }} \pi (x)\) integrates to one, can be determined analytically, and is given by \(\frac {1}{c_{\pi }} = \sqrt {\frac {\nu }{2\pi }}\).
Our goal is to estimate the normalizing constant \(\frac {1}{c_{\pi }}\) via Monte Carlo simulation when λ=0 and ν=2. In general, it is difficult to estimate a normalizing constant using MCMC outputs [2, 58, 59]. However, in the sticky MCMC algorithms (with update rules such as R1 and R3 in Table 2), the normalizing constant of the adaptive nonparametric proposal approaches the normalizing constant of the target. We compare AISMP4R3 with different multiple-try Metropolis (MTM) schemes. For the MTM schemes, we use the following procedure: given the MTM outputs obtained in one run, we use these samples as nodes, construct the approximating function with the construction P4 on these nodes, and finally compute the normalizing constant of this approximating function. Note that we use the same construction procedure, P4, for a fair comparison.
For AISM, we start with only m_{0}=3 support points, \(\mathcal {S}_{0}=\{s_{1}=0,s_{2},s_{3}\}\), where two nodes are randomly chosen at each run, i.e., \(s_{2},s_{3} \sim \mathcal {U}([1,10])\) with s_{2}<s_{3}. We also test three different MTM techniques, two of them using an independent proposal pdf (MTMind) and the last one a random walk proposal pdf (MTMrw). For the MTM schemes, we set M=1000 tries and importance weights designed again to choose the best candidate in each step [37]. We set T=5000 for all the methods. Note that the total number of target evaluations E of AISM is only E=T=5000, whereas we have E=MT=5·10^{6} for the MTMind schemes and E=2MT=10^{7} for the MTMrw algorithm (see [37] for further details). For the MTMind methods, we use an independent proposal \(\widetilde {q}(x)\propto \exp \left (-(x-\mu)^{2}/\left (2\sigma ^{2}\right)\right)\) with μ∈{10,100} and σ^{2}=2500. In MTMrw, we have a random walk proposal \(\widetilde {q}(x|x_{t-1})\propto \exp \left (-(x-x_{t-1})^{2}/\left (2\sigma ^{2}\right)\right)\) with σ^{2}=2500. Note that we need to choose large values of σ^{2} because of the heavy-tailed nature of the target.
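The evaluation budgets quoted above follow from simple arithmetic. The short Python sketch below reproduces them; the helper name `target_evaluations` and its arguments are hypothetical and only encode the counts stated in the text (one evaluation per iteration for AISM, M per iteration for the independent MTM schemes, and 2M per iteration for the random walk MTM scheme).

```python
def target_evaluations(n_iter, n_tries=1, random_walk=False):
    """Total number of target evaluations E for a run of n_iter iterations.

    Hypothetical helper: it only encodes the counts quoted in the text,
    i.e., E = M*T for independent proposals and E = 2*M*T for a random walk.
    """
    per_iteration = 2 * n_tries if random_walk else n_tries
    return per_iteration * n_iter


T = 5000   # iterations used for all the methods
M = 1000   # number of tries in the MTM schemes

e_aism = target_evaluations(T)                          # E = T
e_mtm_ind = target_evaluations(T, M)                    # E = M*T
e_mtm_rw = target_evaluations(T, M, random_walk=True)   # E = 2*M*T
print(e_aism, e_mtm_ind, e_mtm_rw)
```

Under these counts, the MTM schemes spend three to four orders of magnitude more target evaluations than AISM for the same T.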
Estimation of the normalizing constant \(\frac {1}{c_{\pi }}=0.5642\) for the Lévy distribution (T=5000)
Technique  MSE  Target evaluation 

AISMP4R3  0.0015  E=T=5000 
MTMind  0.0028  E=MT=5·10^{6} 
0.0024  
MTMrw  0.0054  E=2MT=10^{7} 
9.4 Sticky MCMC methods within Gibbs sampling
9.4.1 Example 1: comparing different MCMC-within-Gibbs schemes
(ExSect9.4.1). Mean absolute error (MAE) in the estimation of four statistics (first component) and normalized computing time
Technique  T  N _{ G }  Init.  MAE  Avg. MAE  Time  

Mean  Variance  Skewness  Kurtosis  
Panel I  
AISMP4  3  2000  In1  0.878  0.781  0.437  0.223  0.579  0.066 
5  0.749  0.576  0.389  0.160  0.468  0.098  
10  0.266  0.057  0.136  0.020  0.120  0.178  
50  0.101  0.041  0.051  0.003  0.049  0.741  
AISMTMP4  3  2000  In1  0.251  0.056  0.128  0.017  0.113  0.202 
(M=5)  10  0.096  0.031  0.048  0.003  0.044  0.642  
ARMS  3  2000  In1  3.408  11.580  3.384  11.572  7.486  0.077 
5  3.151  9.839  2.650  7.079  5.679  0.116  
10  2.798  7.665  2.024  4.124  4.152  0.223  
50  1.918  3.407  1.134  1.292  1.937  1.000  
MH (σ_{ p }=1)  100  2000  In1  3.509  12.308  3.671  13.666  8.288  0.602 
MH (σ_{ p }=2)  1.756  3.077  0.978  0.963  1.693  0.602  
MH (σ_{ p }=10)  0.075  0.037  0.036  0.002  0.038  0.602  
MH (σ_{ p }=1)  1000  2000  In1  3.508  12.302  3.665  13.624  8.274  4.052 
MH (σ_{ p }=2)  1.601  2.560  0.874  0.769  1.451  4.052  
MH (σ_{ p }=10)  0.074  0.036  0.036  0.002  0.037  4.052  
MH (σ_{ p }=10)  1  2000  In1  0.697  11.598  0.883  3.622  4.200  0.033 
10000  0.493  9.881  0.611  2.905  3.472  0.162  
3  2000  0.352  6.510  0.290  0.927  2.019  0.042  
10  0.085  1.411  0.043  0.160  0.424  0.081  
Adaptive MH  100  2000  In1  0.415  0.304  0.234  0.068  0.255  0.634 
1000  0.075  0.038  0.037  0.002  0.038  4.107  
HMC  10  2000  In1  0.091  1.509  0.042  0.123  0.441  0.092 
100  0.078  0.037  0.039  0.002  0.039  0.630  
Slice  3  2000  In1  0.810  1.174  0.415  0.231  0.658  0.156 
10  0.607  0.372  0.306  0.096  0.345  0.463  
50  0.156  0.043  0.077  0.007  0.071  2.311  
Panel II  
AISMP4  3  2000  In2  0.138  0.055  0.070  0.006  0.067  0.066 
5  0.112  0.050  0.057  0.004  0.056  0.098  
10  0.093  0.045  0.046  0.002  0.046  0.178  
3  10000  0.095  0.023  0.050  0.002  0.042  0.335  
AISMTMP4  3  2000  In2  0.085  0.036  0.043  0.002  0.042  0.202 
(M=5)  4000  0.083  0.028  0.042  0.002  0.038  0.400  
(M=10)  2000  0.073  0.031  0.036  0.002  0.035  0.316 
MH (σ_{ p }=10)  1  10000  In2  0.178  0.126  0.091  0.012  0.102  0.162 
20000  0.151  0.112  0.090  0.008  0.090  0.331  
30000  0.138  0.063  0.068  0.007  0.069  0.492  
2  10000  0.130  0.062  0.066  0.006  0.066  0.196  
3  0.125  0.066  0.063  0.006  0.065  0.223  
10  2000  0.149  0.083  0.075  0.009  0.079  0.081  
Adaptive MH  10  2000  In2  0.158  0.082  0.087  0.012  0.084  0.090 
100  0.146  0.076  0.073  0.010  0.076  0.634  
HMC  10  2000  In2  0.152  0.092  0.079  0.015  0.084  0.092 
100  0.148  0.081  0.070  0.012  0.077  0.630  
Slice  3  2000  In2  0.204  0.105  0.103  0.022  0.108  0.156 
10  0.188  0.091  0.095  0.018  0.098  0.463  
3  10000  0.132  0.051  0.066  0.007  0.064  0.783 
The results are provided in Table 8. First of all, we notice that AISM outperforms ARMS and the slice sampler for all values of T and N_{ G }, in terms of performance and computational time. Regarding the use of the MH algorithm within Gibbs, the results depend largely on the choice of the variance of the proposal, \(\sigma _{p}^{2}\), and the initialization, showing the need for adaptive MCMC strategies. For a fixed value of T×N_{ G }, the AISM schemes provide results close to the smallest averaged MAE for In1 and the best results for In2 with a slight increase in the computing time, w.r.t. the standard MH algorithm. Finally, Table 8 shows the advantage of the nonparametric adaptive independent sticky approach w.r.t. the parametric adaptive approach [8, 10].
9.4.2 Example 2: comparison with an ideal Gibbs sampler
with ξ_{1}=1 and ξ_{2}=0.2. The joint pdf is a bivariate Gaussian pdf with mean vector μ=[0,0]^{⊤} and covariance matrix Σ=[1.08 0.54; 0.54 0.31]. We apply a Gibbs sampler with N_{ G } iterations to estimate both the mean and the covariance of the joint pdf. Then, we calculate the MSE in the estimation of all the elements in μ and Σ, averaged over 2000 independent runs. We use this simple case, where we can draw directly from the full-conditionals, to check the performance of MH and AISMP3R3 within Gibbs as a function of T and N_{ G }. For the MH scheme, we use a Gaussian random walk proposal, \(\widetilde {q}\left (x_{\ell,t}^{(k)}\left |x_{\ell,t-1}^{(k)}\right.\right) \propto \exp \left (-\left.\left (x_{\ell,t}^{(k)}-0.5x_{\ell,t-1}^{(k)}\right)^{2}\right /\left (2\sigma _{p}^{2}\right)\right)\) for ℓ∈{1,2}, 1≤t≤T and 1≤k≤N_{ G }. For AISMP3R3, we start with \(\mathcal {S}_{0}=\{-2,0,2\}\).
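For reference, the ideal Gibbs sampler used as a benchmark in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the full-conditionals are derived from the covariance matrix Σ given above through the standard Gaussian conditioning formulas (they are not copied from the model definition), and all function names are hypothetical.

```python
import math
import random

SIGMA = ((1.08, 0.54), (0.54, 0.31))  # covariance matrix stated in the text


def conditional_params(sigma):
    """Slope and variance of each Gaussian full-conditional (zero-mean joint)."""
    (s11, s12), (_, s22) = sigma
    cond_1 = (s12 / s22, s11 - s12 ** 2 / s22)  # x1 | x2
    cond_2 = (s12 / s11, s22 - s12 ** 2 / s11)  # x2 | x1
    return cond_1, cond_2


def ideal_gibbs(n_iter, sigma, rng, x_init=(0.0, 0.0)):
    """Gibbs sampler drawing exactly from both full-conditionals."""
    (b1, v1), (b2, v2) = conditional_params(sigma)
    x1, x2 = x_init
    chain = []
    for _ in range(n_iter):
        x1 = rng.gauss(b1 * x2, math.sqrt(v1))
        x2 = rng.gauss(b2 * x1, math.sqrt(v2))
        chain.append((x1, x2))
    return chain


def sample_mean_cov(chain):
    """Sample mean vector and (unbiased) sample covariance of the chain."""
    n = len(chain)
    m1 = sum(x for x, _ in chain) / n
    m2 = sum(y for _, y in chain) / n
    c11 = sum((x - m1) ** 2 for x, _ in chain) / (n - 1)
    c22 = sum((y - m2) ** 2 for _, y in chain) / (n - 1)
    c12 = sum((x - m1) * (y - m2) for x, y in chain) / (n - 1)
    return (m1, m2), ((c11, c12), (c12, c22))


chain = ideal_gibbs(20000, SIGMA, random.Random(1))
mean_hat, cov_hat = sample_mean_cov(chain)
print(mean_hat, cov_hat)
```

Replacing each exact conditional draw with T steps of an MH or sticky chain gives the MCMC-within-Gibbs schemes compared in this example.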
9.5 Sticky MCMC methods within Recycling Gibbs sampling
9.5.1 Optimal scale parameter for MH
9.5.2 Comparison among different schemes
For AISMP3R3, we start with the set of support points \(\mathcal {S}_{0}=\{-10,-6,-2,2,6,10\}\). We have averaged the MSE values over 10^{5} independent runs for each Gibbs scheme.
In Fig. 10b (shown in log-scale), we fix N_{ G }=1000 and vary T. When a standard Gibbs (SG) sampler is used, the curves show a horizontal asymptote as T grows, since the internal chains converge after some value T≥T^{∗}. With an RG scheme, increasing T yields a lower MSE, since the internal samples are now recycled. Figure 10b shows the advantage of using AISMP3R3 even when compared with the optimized MH method. The advantage of AISMP3R3 is clearer for small values of T (10<T<30; recall that in this experiment N_{ G }=1000 is kept fixed). The performance of AISMP3R3 and the optimized MH (within Gibbs) becomes more similar as T increases. This is because, with a high enough value of T, the MH chain is able to exceed its burn-in period and eventually converges.
9.6 Tuning of the hyperparameters of a Gaussian process (GP)
9.6.1 Exponential Power kernel function
(ExSect9.6.1). MSE in the estimation of the hyperparameters θ^{∗} with N_{ G }=2000
Algorithm  MH (σ_{ p }=1)  MH (σ_{ p }=2)  MH (σ_{ p }=3)  IA^{2}RMSP4  AISMP4R3 

MSE  6.21  5.08  6.83  3.12  3.46 
Time  1  1  1  1.64  1.42 
(ExSect9.6.1). MSE in the estimation of the hyperparameters θ^{∗} employing a Riemann quadrature, i.e., using a grid approximation of [0,100]^{3} with step ε_{ g }
MSE  10.52  8.72  4.09  2.67  1.01 
ε _{ g }  2  1  0.5  0.2  0.1 
Time  0.11  0.36  1.25  7.31  20.71 
9.6.2 Automatic Relevance Determination kernel function
i.e., all the parameters of the kernel function in Eq. (22) and the standard deviation σ of the observation noise. We assume \(p({\boldsymbol {\theta }})=\prod _{\ell =1}^{d_{Z}+1}\frac {1}{\theta _{\ell }^{\alpha }}\mathbb {I}_{\theta _{\ell }}\), where α=1.3, \(\mathbb {I}_{v}=1\) if v>0, and \(\mathbb {I}_{v}=0\) if v≤0. We wish to compute the expected value \({\mathbb E}[{\boldsymbol {\Theta }}]\) with Θ∼p(θ|y,Z,κ) via Monte Carlo quadrature.
More specifically, we apply AISMP4R3 within Gibbs (with \(\mathcal {S}_{0}=\{0.01,0.5,1,2,5,8,10,15\}\)) and the Single Component Adaptive Metropolis (SCAM) algorithm [63] within Gibbs to draw from π(θ)∝p(θ|y,Z,κ). Note that the dimension of the problem is D=d_{ Z }+1, since \({\boldsymbol {\theta }}\in \mathbb {R}^{D}\). For SCAM, we use the Gaussian random walk proposal \(q(x_{\ell,t}|x_{\ell,t-1}) \propto \exp \left (-(x_{\ell,t}-x_{\ell,t-1})^{2}/\left (2\gamma _{\ell,t}^{2}\right)\right)\). In SCAM, the scale parameters γ_{ℓ,t} are adapted (one for each component) considering all the previous corresponding samples (starting with γ_{ℓ,0}=1).
We generated P=500 pairs of data, \(\{y_{j},\mathbf {z}_{j}\}_{j=1}^{P}\), drawing \(\mathbf {z}_{j}\sim \mathcal {U}\left ([0,10]^{d_{Z}}\right)\) and y_{ j } according to the model in Eq. (21). We considered d_{ Z }∈{1,3,5,7,9}, so that D∈{2,4,6,8,10}, and set \(\sigma ^{*}=\frac {1}{2}\) and \(\delta _{\ell }^{*}=2\), ∀ℓ, for all the experiments (recall that θ^{∗}=[δ^{∗},σ^{∗}]). We take θ^{∗} as the ground truth and compute the MSE obtained by the different Monte Carlo techniques.
(ExSect9.6.2). MSE for different techniques and different dimensions D=d_{ Z }+1 of the inference problem (number of hyperparameters), with T=20 and N_{ G }=1000 for both schemes
Algorithm  D=2  D=4  D=6  D=8  D=10 

SCAM withinGibbs  0.0452  0.3013  1.61  2.87  4.68 
AISMP4R3 withinGibbs  0.0170  0.1521  0.5821  1.33  2.67 
10 Conclusions
In this work, we have introduced a new class of adaptive MCMC algorithms for any-purpose stochastic simulation. We have discussed the general features of the novel family, describing the different parts that form a generic sticky adaptive MCMC algorithm. The proposal density used in the new class is adapted online and constructed by employing nonparametric procedures. The name “sticky” highlights that the proposal pdf becomes progressively more similar to the target; namely, a complete adaptation of the shape of the proposal is obtained (unlike with parametric proposals). The role of the update control test for the inclusion of new support points has been investigated. The design of this test is extremely important, since it controls the tradeoff between the computational cost and the efficiency of the resulting algorithm. Moreover, we have discussed how the combined design of a suitable proposal construction and a proper update test ensures the ergodicity of the generated chain.
Two specific sticky schemes, AISM and AISMTM, have been proposed and tested exhaustively in different numerical simulations. The numerical results show the efficiency of the proposed algorithms with respect to other state-of-the-art adaptive MCMC methods. Furthermore, we have shown that other well-known algorithms already introduced in the literature are encompassed by the novel class of methods proposed. A detailed description of the related works in the literature and their range of applicability is also provided, which is particularly useful for interested practitioners and researchers. The novel methods can be applied either as stand-alone algorithms or within any Monte Carlo approach that requires sampling from univariate densities (e.g., the Gibbs sampler, the hit-and-run algorithm, or adaptive direction sampling). A promising future line of work is designing suitable constructions of the proposal density that allow direct sampling from multivariate target distributions (similarly to [21, 30, 31, 39, 40]). However, we remark that the structure of the novel class of methods is valid regardless of the dimension of the target.
11 Appendix A: Proof of Theorem 1
Therefore, we conclude that 0<a_{ t }≤1, the strong Doeblin condition is satisfied and thus all the conditions for Theorem 2 in [14] are fulfilled.
12 Appendix B: Argumentation for Conjecture 1
with \(C_{t}^{(0)} = \max _{x \in \mathcal {I}_{t}}|\dot {\pi }(x)|\) and \(C_{t}^{(1)} = \frac {1}{2} \max _{x \in \mathcal {I}_{t}}|\ddot {\pi }(x)|\).
where the last inequality is obtained by applying Newton’s binomial theorem, which states that A^{ℓ+1}+B^{ℓ+1}<(A+B)^{ℓ+1} for any A,B>0, using A=s^{′}−s_{ k }>0 and B=s_{k+1}−s^{′}>0. Hence, the bound in Eq. (36) can never increase when a new support point is incorporated and indeed tends to decrease as new points are added to the support set.
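The binomial step can be written out explicitly. Assuming that ℓ is a non-negative integer (as the superscripts of \(C_{t}^{(0)}\) and \(C_{t}^{(1)}\) suggest), for A,B>0 we have
$$ (A+B)^{\ell+1} = A^{\ell+1} + B^{\ell+1} + \sum_{k=1}^{\ell} \binom{\ell+1}{k} A^{k} B^{\ell+1-k} \ge A^{\ell+1} + B^{\ell+1}, $$
where the sum is empty for ℓ=0 (so the two sides coincide) and strictly positive for ℓ≥1. For instance, ℓ=1 gives A^{2}+B^{2}<A^{2}+2AB+B^{2}. With A=s^{′}−s_{ k } and B=s_{k+1}−s^{′}, the right-hand side of the inequality corresponds to the two new intervals and the left-hand side to the original interval, since A+B=s_{k+1}−s_{ k }.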
Note that we could still have \(L_{t}^{(\ell)} \to K > 0\) as t→∞. However, the conditions of Definition 1 ensure that the support of the proposal always contains the support of the target (i.e., \(q_{t}(x|\mathcal {S}_{t})>0\) whenever π(x)>0 for any t and \(\mathcal {S}_{t}\)) and that the proposal has uniformly heavier tails (implying that \(q_{t}(x|\mathcal {S}_{t}) \to 0\) more slowly than π(x) as x→±∞). Consequently, support points can be added anywhere inside the support of the target, \(\mathcal {X} \subseteq \mathbb {R}\). This implies that \(L_{t}^{(\ell)} \to 0\) as t→∞, since (s_{i+1}−s_{ i })→0 as more points are added inside \(\mathcal {I}_{t}\), and thus also \(D_{\mathcal {I}_{t}}(\pi,q_{t}) \to 0\) as t→∞. Let us focus now on \(D_{\mathcal {I}_{t}^{c}}(\pi,q_{t})\). Let us assume, without loss of generality, that a new point, s^{′}∈(−∞,s_{1}],^{14} is added at some iteration t^{′}>t using the mechanism described in the AISM algorithm (see Table 1) and that no other points have been incorporated into the support set for t+1,…,t^{′}−1. In this case, it is clear that the distance in the tails decreases (i.e., \(D_{\mathcal {I}_{t'}^{c}}(\pi,q_{t}) < D_{\mathcal {I}_{t}^{c}}(\pi,q_{t})\)) at the expense of increasing the distance in the central part of the target (i.e., \(D_{\mathcal {I}_{t'}}(\pi,q_{t}) > D_{\mathcal {I}_{t}}(\pi,q_{t})\)). However, even if this leads to a momentary increase in the overall distance, note that we still have \(D_{\mathcal {I}_{t'}}(\pi,q_{t}) \to 0\) as t^{′}→∞ as long as new support points can be added inside \(\mathcal {I}_{t'}\), something which is guaranteed by the AISM algorithm. Finally, there is always a non-null probability of incorporating points in the tails,^{15} which implies that \(D_{\mathcal {I}_{t}^{c}}(\pi,q_{t}) \to 0\) as t→∞, since \(\mathcal {I}_{t}^{c}\) becomes smaller and smaller as t increases.
Therefore, we can guarantee that using the AISM algorithm in Table 1, with a valid proposal that fulfills Definition 1 and an acceptance rule according to Definition 3, we obtain a sticky proposal that fulfills Definition 2.
13 Appendix C: Support points
In this appendix we provide the proofs of Theorem 3 and Corollary 4, which bound the expected growth of the number of support points.
13.1 C.1 Proof of Theorem 3
Finally, noting that C<∞, that both d_{ t }(x_{t−1})→0 and D_{1}(q_{ t },π)→0 as t→∞ by Conjecture 1, and that η_{ t }(0)=0 by condition 2 in Definition 3, we have \(E[P_{a}(z)|x_{t-1}, \mathcal {S}_{t}]\to 0\) as t→∞.
13.2 C.2 Proof of Corollary 4
14 Appendix D: Variate generation
 1. Compute the area A_{ i } below each piece composing \(q_{t}(x|\mathcal {S}_{t})\), i=0,…,m_{ t }. This is straightforward for the construction procedures in Eqs. (3)–(4), since \(q_{t}(x|\mathcal {S}_{t})\) is formed by linear or constant pieces whose areas can be computed analytically. Moreover, since the tails are exponential functions, the areas A_{0} and \(A_{m_{t}}\) can also be computed analytically. Then, normalize them,$$ \eta_{i} = \frac{A_{i}}{\sum_{j=0}^{m_{t}} A_{j}}, \quad \text{for} \quad i=0,\ldots, m_{t}. $$
 2. Choose a piece (i.e., an index j^{∗}∈{0,…,m_{ t }}) according to the weights η_{ i } for i=0,…,m_{ t }.
 3. Given the index j^{∗}, draw a sample x^{′} in the interval \(\mathcal {I}_{j^{*}}\) with pdf \(\phi _{j^{*}}(x)\), i.e., \(x' \sim \phi _{j^{*}}(x)\).
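The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the construction used in the experiments: the interior pieces are taken to be constant, with height equal to the larger of the two target values at the end points of each interval, and the two tails are exponential with hypothetical rate parameters `rate_left` and `rate_right`. All function names are hypothetical.

```python
import random


def piece_areas(s, pi_vals, rate_left, rate_right):
    """Areas A_0, ..., A_m of the m+1 pieces defined by m support points."""
    areas = [pi_vals[0] / rate_left]  # left exponential tail, x <= s[0]
    for i in range(len(s) - 1):
        height = max(pi_vals[i], pi_vals[i + 1])  # assumed constant piece
        areas.append(height * (s[i + 1] - s[i]))
    areas.append(pi_vals[-1] / rate_right)  # right exponential tail, x >= s[-1]
    return areas


def normalized_weights(areas):
    """Step 1: normalize the areas so that the weights sum to one."""
    total = sum(areas)
    return [a / total for a in areas]


def draw_from_proposal(s, pi_vals, rate_left, rate_right, rng):
    """Steps 2 and 3: choose a piece, then draw a sample inside it."""
    weights = normalized_weights(piece_areas(s, pi_vals, rate_left, rate_right))
    j = rng.choices(range(len(weights)), weights=weights)[0]
    if j == 0:  # left tail: reflected exponential starting at s[0]
        return s[0] - rng.expovariate(rate_left)
    if j == len(weights) - 1:  # right tail: exponential starting at s[-1]
        return s[-1] + rng.expovariate(rate_right)
    return rng.uniform(s[j - 1], s[j])  # constant piece: uniform draw


support = [-1.0, 0.0, 2.0]        # illustrative support points
target_vals = [0.5, 1.0, 0.25]    # illustrative target values at the points
rng = random.Random(0)
samples = [draw_from_proposal(support, target_vals, 1.0, 1.0, rng)
           for _ in range(2000)]
```

For linear pieces, only the area formula (a trapezoid) and the within-piece draw (inversion of a quadratic cdf) would change; the selection of the piece by its normalized area stays the same.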
15 Appendix E: Robust algorithms
In this appendix, we briefly discuss how to increase the robustness of the method, both with respect to a bad choice of the initial set \(\mathcal {S}_{0}\) (e.g., when information about the range of the target pdf is not available) and w.r.t. the heavy tails that appear in many target pdfs.
15.1 E.1 Mixture of proposal densities
where \(\widetilde {q}_{2}(x|\mathcal {S}_{t})\) is a sticky proposal pdf built as described in Section 3. The density \(\widetilde {q}_{1}(x)\) is a generic proposal function with an explorative task. The explorative behavior of \(\widetilde {q}_{1}\) can be controlled by its scale parameter. The weight α_{ t } can be kept constant, α_{ t }=α_{0}=0.5 for all t (this is the most defensive strategy), or it can be decreased with the iteration t, i.e., α_{ t }→0 as t→∞. The joint adaptation of the weight α_{ t }, of the scale parameter of \(\widetilde {q}_{1}\), and of \(\widetilde {q}_{2}\) (using a sticky procedure) deserves further study.
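Drawing from such a mixture can be sketched as follows, assuming that the proposal has the usual two-component form with weight α_{ t } on the explorative density and 1−α_{ t } on the sticky one (an assumption, since the mixture expression is not reproduced in this passage). The decaying schedule α_{0}/(1+decay·t) and all function names are hypothetical choices made only for illustration.

```python
import random


def alpha_schedule(t, alpha0=0.5, decay=0.0):
    """Mixture weight at iteration t.

    decay=0 keeps the weight constant (the defensive choice); decay>0 gives
    a hypothetical schedule alpha0 / (1 + decay * t) that tends to zero.
    """
    return alpha0 / (1.0 + decay * t)


def draw_from_mixture(t, draw_explorative, draw_sticky, rng,
                      alpha0=0.5, decay=0.0):
    """Draw from the mixture by first selecting one of the two components."""
    if rng.random() < alpha_schedule(t, alpha0, decay):
        return draw_explorative(rng), "explorative"
    return draw_sticky(rng), "sticky"


rng = random.Random(0)
draws = [
    draw_from_mixture(
        t,
        lambda r: r.gauss(0.0, 10.0),    # stand-in for the explorative pdf
        lambda r: r.uniform(-1.0, 1.0),  # stand-in for the sticky pdf
        rng,
    )
    for t in range(4000)
]
```

Note that, in an MH-type acceptance test, the density of the whole mixture (not only of the selected component) would have to be evaluated at the candidate and at the current state.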
15.2 E.2 Heavy tails
Moreover, we can also draw samples easily from each Pareto tail using the inversion method [2].
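As an illustration of the inversion method, the sketch below assumes a right tail with density proportional to (x−μ)^{−γ} for x≥a>μ and γ>1. This parameterization and the function names are assumptions made for the example, since the tail expression is not reproduced in this passage. The cdf of the normalized tail is F(x)=1−((a−μ)/(x−μ))^{γ−1}, which can be inverted in closed form.

```python
import random


def pareto_tail_cdf(x, a, gamma, mu=0.0):
    """Cdf of the normalized tail with density proportional to (x-mu)^(-gamma), x >= a."""
    return 1.0 - ((a - mu) / (x - mu)) ** (gamma - 1.0)


def pareto_tail_inverse(u, a, gamma, mu=0.0):
    """Inverse of the tail cdf, valid for u in [0, 1)."""
    return mu + (a - mu) * (1.0 - u) ** (-1.0 / (gamma - 1.0))


def draw_pareto_tail(a, gamma, rng, mu=0.0):
    """Inversion method: transform a uniform draw on [0, 1) through the inverse cdf."""
    return pareto_tail_inverse(rng.random(), a, gamma, mu)


rng = random.Random(0)
tail_samples = [draw_pareto_tail(2.0, 3.0, rng) for _ in range(1000)]
```

The left tail can be handled in the same way by reflecting the draw around μ.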
The adjective “sticky” highlights the ability of the proposed schemes to generate a sequence of proposal densities that progressively “stick” to the target.
The purpose of this work is to provide a family of methods applicable to a wide range of signal processing problems. A generic Matlab code (not focusing on any specific application) is provided at http://www.lucamartino.altervista.org/STICKY.zip.
A preliminary version of this work has been published in [64]. With respect to that paper, the following major changes have been made: we discuss exhaustively the general structure of the new family (not just a particular algorithm); we perform a complete theoretical analysis of the AISM algorithm; we extend substantially the discussion about related works; we introduce the AISMTM algorithm; we show how sticky methods can be used to sample from multivariate pdfs by embedding them within a Gibbs sampler or the hit-and-run algorithm; and we provide additional numerical simulations (including comparisons with other benchmark sampling algorithms and the estimation of the hyperparameters of Gaussian processes).
For simplicity, we assume that π(x) is bounded. However, the case of unbounded target pdfs can also be tackled by designing a suitable proposal construction that takes into account the vertical asymptotes of the target function. Similarly, we consider a target function defined in a continuous space \(\mathcal {X}\) for the sake of simplicity, although the support domain could also be discrete.
Note that \(d_{t}(z) \le \max \{\pi (z), q_{t}(z|\mathcal {S}_{t})\} \le M_{\pi }\), since \(M_{t}=\max \limits _{z\in \mathcal {X}} q_{t}(z|\mathcal {S}_{t})\le M_{\pi }\) for all of the constructions described in Section 3 for the proposal function. Therefore, all values ε_{ t }≥M_{ π } lead to equivalent update rules.
Regarding the definition of ε_{ t }, this threshold should decrease over time (to guarantee that new support points can always be added), but not too fast (to avoid adding too many points and thus increasing the computational cost). Selecting the optimum threshold can be very challenging, but many of the rules that have been used in the area of stochastic filtering for the update parameter could be used here. For instance, good update rules could be ε_{ t }=κM_{ π }·e^{−γt} or \(\varepsilon _{t} = \frac {\kappa M_{\pi }}{t+1}\) for some appropriate values of 0<κ<1 and γ>0. Exploring this issue is out of the scope of this paper, but we plan to address this in future works.
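The two example thresholds mentioned above can be written as short Python functions. This is only a sketch: the function names are hypothetical, and the values of κ, γ, and M_{ π } used below are arbitrary illustrative choices.

```python
import math


def epsilon_exponential(t, kappa, m_pi, gamma):
    """Threshold kappa * M_pi * exp(-gamma * t), with 0 < kappa < 1 and gamma > 0."""
    return kappa * m_pi * math.exp(-gamma * t)


def epsilon_hyperbolic(t, kappa, m_pi):
    """Threshold kappa * M_pi / (t + 1), with 0 < kappa < 1."""
    return kappa * m_pi / (t + 1)


# Illustrative values only: kappa = 0.5, M_pi = 2, gamma = 0.1.
exp_path = [epsilon_exponential(t, 0.5, 2.0, 0.1) for t in range(100)]
hyp_path = [epsilon_hyperbolic(t, 0.5, 2.0) for t in range(100)]
```

Both sequences start at κM_{ π } and decrease monotonically, but the exponential rule eventually decays faster, so it allows more points to be added in the long run.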
We have used the equality \(d_{t}(z_{i})=|\pi (z_{i})-q_{t}(z_{i}|\mathcal {S}_{t})|=\max \{\pi (z_{i}),q_{t}(z_{i}|\mathcal {S}_{t})\}-\min \{\pi (z_{i}),q_{t}(z_{i}|\mathcal {S}_{t})\}\).
Preliminary Matlab code for the AISM algorithm, with the constructions described in Section 3.1 and the update control rule R3, is provided at https://www.mathworks.com/matlabcentral/fileexchange/54701adaptiveindependentstickymetropolis–aism–algorithm.
Note that we can always guarantee that \(q_{t}(x|\mathcal {S}_{t})\) is heavier tailed than π(x) by using an appropriate construction for the tails of the proposal, as discussed in Section 3 and Appendix E.2 Heavy tails.
If we consider the complementary case (i.e., π(s_{i+1})≥π(s_{ i }) and thus \(q_{t}(x)=\pi (s_{i+1})\ \forall x \in \mathcal {I}_{t,i}\)) we obtain exactly the same bound following an identical procedure.
Note that the proposals are assumed to be uniformly heavier tailed than the target by Condition 4 of Definition 1. Therefore, we can guarantee that enough candidate samples are generated in the tails.
Declarations
Funding
This work has been supported by the Spanish Ministry of Economy and Competitiveness (MINECO) through the MIMODPLC (TEC201564835C33R) and KERMES (TEC201681900REDT/AEI) projects; by the Italian Ministry of Education, University and Research (MIUR); by PRIN 201011 grant; and by the European Union (Seventh Framework Programme FP7/20072013) under grant agreement no:630677.
Authors’ contributions
All the authors have participated in writing the manuscript and have revised the final version. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. JS Liu, Monte Carlo Strategies in Scientific Computing (Springer-Verlag, 2004).
2. CP Robert, G Casella, Monte Carlo Statistical Methods (Springer, 2004).
3. WJ Fitzgerald, Markov chain Monte Carlo methods with applications to signal processing. Signal Process. 81, 3–18 (2001).
4. A Doucet, X Wang, Monte Carlo methods for signal processing: a review in the statistical signal processing context. IEEE Signal Process. Mag. 22, 152–17 (2005).
5. M Davy, C Doncarli, JY Tourneret, Classification of chirp signals using hierarchical Bayesian learning and MCMC methods. IEEE Trans. Signal Process. 50, 377–388 (2002).
6. N Dobigeon, JY Tourneret, CI Chang, Semi-supervised linear spectral unmixing using a hierarchical Bayesian model for hyperspectral imagery. IEEE Trans. Signal Process. 56, 2684–2695 (2008).
7. T Elguebaly, N Bouguila, Bayesian learning of finite generalized Gaussian mixture models on images. Signal Process. 91, 801–820 (2011).
8. GO Roberts, JS Rosenthal, Examples of adaptive MCMC. J. Comput. Graph. Stat. 18, 349–367 (2009).
9. C Andrieu, J Thoms, A tutorial on adaptive MCMC. Stat. Comput. 18, 343–373 (2008).
10. H Haario, E Saksman, J Tamminen, An adaptive Metropolis algorithm. Bernoulli 7, 223–242 (2001).
11. F Liang, C Liu, R Caroll, Advanced Markov Chain Monte Carlo Methods: Learning from Past Samples (Wiley Series in Computational Statistics, England, 2010).
12. WR Gilks, NG Best, KKC Tan, Adaptive rejection Metropolis sampling within Gibbs sampling. Appl. Stat. 44, 455–472 (1995).
13. L Martino, J Read, D Luengo, Independent doubly adaptive rejection Metropolis sampling within Gibbs sampling. IEEE Trans. Signal Process. 63, 3123–3138 (2015).
14. L Holden, R Hauge, M Holden, Adaptive independent Metropolis-Hastings. Ann. Appl. Probab. 19, 395–413 (2009).
15. C Ritter, MA Tanner, Facilitating the Gibbs sampler: The Gibbs stopper and the griddy-Gibbs sampler. J. Am. Stat. Assoc. 87, 861–868 (1992).
16. R Meyer, B Cai, F Perron, Adaptive rejection Metropolis sampling using Lagrange interpolation polynomials of degree 2. Comput. Stat. Data Anal. 52, 3408–3423 (2008).
17. L Martino, J Read, D Luengo, Independent doubly adaptive rejection Metropolis sampling. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2014).
18. L Martino, H Yang, D Luengo, J Kanniainen, J Corander, A fast universal self-tuned sampler within Gibbs sampling. Digital Signal Process. 47, 68–83 (2015).
19. WR Gilks, P Wild, Adaptive rejection sampling for Gibbs sampling. Appl. Stat. 41, 337–348 (1992).
20. B Cai, R Meyer, F Perron, Metropolis-Hastings algorithms with adaptive proposals. Stat. Comput. 18, 421–433 (2008).
21. W Hörmann, J Leydold, G Derflinger, Automatic nonuniform random variate generation (Springer, 2003).
22. G Krzykowski, W Mackowiak, Metropolis Hastings simulation method with spline proposal kernel. An Isaac Newton Institute Workshop (2006).
23. W Shao, G Guo, F Meng, S Jia, An efficient proposal distribution for Metropolis-Hastings using a B-splines technique. Comput. Stat. Data Anal. 53, 465–478 (2013).
24. L Tierney, Markov chains for exploring posterior distributions. Ann. Stat. 22, 1701–1728 (1994).
25. L Martino, J Míguez, Generalized rejection sampling schemes and applications in signal processing. Signal Process. 90, 2981–2995 (2010).
26. WR Gilks, Derivative-free adaptive rejection sampling for Gibbs sampling. Bayesian Stat. 4, 641–649 (1992).
27. D Görür, YW Teh, Concave convex adaptive rejection sampling. J. Comput. Graph. Stat. 20, 670–691 (2011).
28. W Hörmann, A rejection technique for sampling from T-concave distributions. ACM Trans. Math. Softw. 21, 182–193 (1995).
29. L Martino, F Louzada, Adaptive rejection sampling with fixed number of nodes. (to appear) Communications in Statistics - Simulation and Computation, 1–11 (2017). doi:10.1080/03610918.2017.1395039.
30. J Leydold, A rejection technique for sampling from log-concave multivariate distributions. ACM Trans. Model. Comput. Simul. 8, 254–280 (1998).
31. J Leydold, W Hörmann, A sweep plane algorithm for generating random tuples in simple polytopes. Math. Comput. 67, 1617–1635 (1998).
32. KR Koch, Gibbs sampler by sampling-importance-resampling. J. Geodesy 81, 581–591 (2007).
33. AE Gelfand, TM Lee, Discussion on the meeting on the Gibbs sampler and other Markov Chain Monte Carlo methods. J. R. Stat. Soc. Ser. B 55, 72–73 (1993).
34. C Fox, A Gibbs sampler for conductivity imaging and other inverse problems. Proc. SPIE Image Reconstruction Incomplete Data VII 8500, 1–6 (2012).
35. P Müller, A generic approach to posterior integration and Gibbs sampling. Technical Report 9109 (Department of Statistics of Purdue University, 1991).
36. JS Liu, F Liang, WH Wong, The multiple-try method and local optimization in Metropolis sampling. J. Am. Stat. Assoc. 95, 121–134 (2000).
37. L Martino, J Read, On the flexibility of the design of multiple try Metropolis schemes. Comput. Stat. 28, 2797–2823 (2013).
38. D Luengo, L Martino, Almost rejectionless sampling from Nakagami-m distributions (m≥1). IET Electron. Lett. 48, 1559–1561 (2012).
39. R Karawatzki, The multivariate Ahrens sampling method. Technical Report 30, Department of Statistics and Mathematics (2006).
40. W Hörmann, A universal generator for bivariate log-concave distributions. Computing 52, 89–96 (1995).
41. BS Caffo, JG Booth, AC Davison, Empirical supremum rejection sampling. Biometrika 89, 745–754 (2002).
42. W Hörmann, A note on the performance of the Ahrens algorithm. Computing 69, 83–89 (2002).
43. W Hörmann, J Leydold, G Derflinger, Inverse transformed density rejection for unbounded monotone densities. Research Report Series/ Department of Statistics and Mathematics (Economy and Business) (Vienna University, 2007).
44. G Marrelec, H Benali, Automated rejection sampling from product of distributions. Comput. Stat. 19, 301–315 (2004).
45. H Tanizaki, On the nonlinear and nonnormal filter using rejection sampling. IEEE Trans. Automatic Control 44, 314–319 (1999).
46. M Evans, T Swartz, Random variate generation using concavity properties of transformed densities. J. Comput. Graph. Stat. 7, 514–528 (1998).
47. L Martino, J Míguez, A generalization of the adaptive rejection sampling algorithm. Stat. Comput. 21, 633–647 (2011).
48. M Brewer, C Aitken, Discussion on the meeting on the Gibbs sampler and other Markov Chain Monte Carlo methods. J. R. Stat. Soc. Ser. B 55, 69–70 (1993).
49. F Lucka, Fast Gibbs sampling for high-dimensional Bayesian inversion (2016). arXiv:1602.08595.
50. H Zhang, Y Wu, L Cheng, I Kim, Hit and run ARMS: adaptive rejection Metropolis sampling with hit and run random direction. J. Stat. Comput. Simul. 86, 973–985 (2016).
51. L Martino, V Elvira, G Camps-Valls, Recycling Gibbs sampling. 25th European Signal Processing Conference (EUSIPCO), 181–185 (2017).
52. WR Gilks, GO Roberts, EI George, Adaptive direction sampling. The Statistician 43, 179–189 (1994).
53. I Murray, Z Ghahramani, DJC MacKay, in Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06). MCMC for doubly-intractable distributions (2006), pp. 359–366.
54. D Rohde, J Corcoran, in Statistical Signal Processing (SSP), 2014 IEEE Workshop on. MCMC methods for univariate exponential family models with intractable normalization constants (2014), pp. 356–359.
55. RM Neal, Slice sampling. Ann. Stat. 31, 705–767 (2003).
56. CE Rasmussen, CKI Williams, Gaussian processes for machine learning (2006).
57. D Gamerman, Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference (Chapman and Hall/CRC, 1997).
58. BP Carlin, S Chib, Bayesian model choice via Markov chain Monte Carlo methods. J. R. Stat. Soc. Series B (Methodological) 3, 473–484 (1995).
59. S Chib, I Jeliazkov, Marginal likelihood from the Metropolis-Hastings output. J. Am. Stat. Assoc. 96, 270–281 (2001).
60. R Neal, Chapter 5 of the Handbook of Markov Chain Monte Carlo, ed. by S Brooks, A Gelman, G Jones, XL Meng (Chapman and Hall/CRC Press, 2011).
61. IT Nabney, Netlab: Algorithms for Pattern Recognition (Springer, 2008).
62. C Bishop, Pattern Recognition and Machine Learning (Springer, 2006).
63. H Haario, E Saksman, J Tamminen, Componentwise adaptation for high dimensional MCMC. Comput. Stat. 20, 265–273 (2005).
64. L Martino, R Casarin, D Luengo, Sticky proposal densities for adaptive MCMC methods. IEEE Workshop on Statistical Signal Processing (SSP) (2016).
65. PJ Davis, Interpolation and approximation (Courier Corporation, 1975).
66. GH Hardy, JE Littlewood, G Pólya, Inequalities (Cambridge Univ. Press, 1952).