Optimal Nonparametric Covariance Function Estimation for Any Family of Nonstationary Random Processes

A covariance function estimate of a zero-mean nonstationary random process in discrete time is accomplished from one observed realization by weighting observations with a kernel function. Several kernel functions have been proposed in the literature. In this paper, we prove that the mean square error (MSE) optimal kernel function for any parameterized family of random processes can be computed as the solution to a system of linear equations. Even though the resulting kernel is optimized for members of the chosen family, it seems to be robust in the sense that it is often close to optimal for many other random processes as well. We also investigate a few examples of families, including a family of locally stationary processes, nonstationary AR-processes, and chirp processes, and their respective MSE optimal kernel functions.


Introduction
In several applications, including statistical time-frequency analysis [1][2][3][4], the covariance function of a nonstationary random process has to be estimated from one single observed realization. We assume that the complex-valued process, which we denote by {x(t), t ∈ Z}, is in discrete time and has finite support: x(t) = 0 for all t / ∈ T n = {1, . . . , n}. Most often, the mean of the process is assumed to be known or already estimated and, hereby, we can, without loss of generality, assume that the mean of the process is zero. An estimate of the covariance function defined and denoted by r x (s, t) = E[x(s)x(t) * ] is then accomplished by a weighted average of observations of x(s + k)x(t + k) * with different weights for different k, [5,6], where * denotes complex conjugate. Presumably, the weights, also known as the kernel function, are allowed to vary with the time-lag τ = s − t. We denote and define this estimator by R x;H : T 2 n → C: where K τ is the set {−n + 1 + |τ|, . . . , n − 1 − |τ|}, and H is a kernel function which belongs to the set H = {H : K τ ×T = {−n + 1, . . . , n − 1} → C} of all possible kernel functions, and where we denote the cardinality of a set S by |S|. Some care has to be taken in order for this estimate to be nonnegative definite, but as this problem has appropriate solutions [7], we will not discuss it further. Naturally, one wishes to choose the kernel function H carefully so that the estimate does not suffer from large bias or variance. If nothing except zero mean is assumed about the process, no other estimator of r(s, t) than x(s)x(t) * can be justified. This is equivalent with H(k, τ) = 1 for k = 0 and zero otherwise. We will assume that there is some a priori knowledge about the process. If, for example, the process is quasistationary in the sense that r(s, t) ≈ r(s + δ, t + δ) for integers |δ| < D, then it may be wise to use a kernel function which is "bell-shaped" in the k-direction with a bandwidth proportional to D. And if, for example, it is known that the process decorrelates at large time lags, meaning that r(s, s + τ) ≈ 0 for |τ| > T, then it makes sense to use a kernel function H(k, τ) ≈ 0 for |τ| > T, [8].
The kernel function that gives the least mean square error (MSE) of the estimate (squared bias plus variance) for processes in continuous time was derived by Sayeed and Jones in the ambiguity domain as a function of the process characteristics, including second-and fourth-order moments up to amplitude scaling, frequency shift, and 2 EURASIP Journal on Advances in Signal Processing time shift [9]. Their result has been used in, for example, [10,11]. If the formula found by Sayeed and Jones is used in the discrete ambiguity domain, the resulting covariance estimator will not be MSE optimal, as discussed in [12]. However, for these processes the MSE optimal covariance function estimator can be computed in time-lag domain [12]. This can be used if one can, by prior knowledge, construct a random process for which the optimal kernel function is similar to the optimal kernel function for {x(t)}. In this paper, we prove that the MSE optimal kernel function for any parameterized family of random processes with an a priori parameter distribution can be computed by solving a system of linear equations. The solution is optimal in the sense that there is no other choice of kernel function H which gives a covariance function estimate R x;H with less expected MSE if the observed realization belongs to a process in the chosen family with the presumed parameter distribution.
The result derived in this paper is useful when so little is known about the random process that a nonparametric covariance function estimator of the form (1) ought to be used (rather than estimating parameters in a model). For such a situation, one has to decide which kernel function H to use. This choice of H must be guided by some prior knowledge about the random process. If this knowledge can be condensed into a parameterized family of processes where we can assign a probability distribution to the parameter space, then the optimal kernel function for the whole family of processes, computed as described in Section 2, can be used. It is important to stress that we do not need to assume that the realization we are about to observe is a realization from a process in this parameterized family. Rather, we believe that the realization comes from a process which has a suitable kernel function in common with some processes in the family.
The remainder of this paper is organized as follows. The MSE optimal solution is presented in Section 2. An example with a family of locally stationary processes is described in Section 3.1 and a family of nonstationary AR(1)-processes is described in Section 3.2. In Section 3.3, we compute the MSE optimal kernel function for a family of chirp processes which we use on a set of heart rate variability data. Conclusions and final remarks are given in the last section. Proofs are found in the Appendix.

The MSE Optimal Kernel Estimator for a Family of Processes
Let (Ω, F , P) be a fixed probability space. Let {x q (t), t ∈ Z, q ∈ Q} be a family of random variables parameterized by t ∈ Z ("time") and q ∈ Q, where Q ("parameter space for the family of random processes") is a fixed subset of R L , for some fixed integer L. For a fixed q ∈ Q, we think of {x q (t), t ∈ Z} as a random process in discrete time, and we assume that this process has the following properties: (a) it has zero mean: E[x q (t)] = 0, for all t ∈ Z, (b) it has finite moments, and (c) it has finite support: x q (t) = 0, for all t / ∈ T n = {1, . . . , n}, where n is a fixed integer. We also assume that x q (t) and x p (s) are independent for q / = p. Now, let Q be a random element in Q with distribution F Q . Conditional on Q = q, we have observed a realization of the process {x q (t), t ∈ Z}, and we shall estimate the covariance function. Since Q is random, that is we do not know which process in the family we have an observation of, we would like to compute a kernel function which gives a low MSE for as many random processes within the family as possible. Also, we would like to take into account that we have a probability distribution of Q and, hence, it seems natural to let the optimization give a higher weight to members of the family with a high probability according to the distribution F Q . In order to achieve this, we need a few definitions. The covariance function of {x q (t), t ∈ Z}, for a fixed q ∈ Q, is denoted by r xq (s, t), and, for a random element Q in Q, we let r xQ (s, t) denote the covariance function conditional on Q, that is, Given a kernel function H, the estimator of r xQ (s, t) is given by as in (1). We are now ready to define the kernel function that minimizes the expected error of R xQ;H . Definition 1. The MSE optimal kernel function for a family {x Q (t), t ∈ Z} of random processes is defined and denoted by Remark 1. The definition can equivalently be written as As stated in the following theorem, the MSE optimal kernel function can be computed by solving a system of linear equations.
Theorem 1 (MSE optimal kernel for a family of random processes). For a fixed τ ∈ T , the MSE optimal kernel function H xQ−opt (k, τ) for the family {x Q (t), t ∈ Z}, is found as the solution to the following system of linear equations: where ρ xq is the fourth-order moment of {x q (t)} defined by Proof. See appendix.
We note that for a fixed τ ∈ T , the theorem gives |K τ | = 2n − 2|τ| − 1 linear equations, by which we can compute H xQ−opt (k, τ) for all k ∈ K τ using any standard routine available for solving linear equations. The bias of the estimator R x; and the variance is The MSE optimal kernel function is invariant to some manipulations of the family {x Q (t)}, including amplitude scaling when the whole family is scaled by the same scalar. It is also invariant to frequency shift of each process in the family, in the following sense: let {y q (t), t ∈ Z, q ∈ Q} be defined by y q (t) = x q (t)e −i2π fqt , where f q ∈ R. Then H yQ−opt = H xQ−opt . This can be proved by inserting the relations r yq (s, t) = e −i2π(s−t) fq r xq (s, t) and R yq;H (s, t) = e −i2π fq(s−t) R xq;H (s, t) into (4). The MSE optimal kernel function is also approximately time-shift invariant in the following sense: let {z q (t), t ∈ Z, q ∈ Q}, z q (t) = 0 for all t / ∈ T n be defined by z q (t) = x q (t − δ q ), where |δ q | is an integer much smaller than n. Then H zQ−opt is given by We see that the only difference between (10) and (5) is that the summation in (5) is made for s = 1, . . . , n and t = 1, . . . , n whereas in (10), the summation is made for s = 1 − δ q , . . . , n − δ q and t = 1 − δ q , . . . , n − δ q . Since there is only a small shift in the area for which the minimization is performed, H xQ−opt will be approximately as optimal as H zQ−opt on the family {z q (t), t ∈ Z,q ∈ Q}.

A Family of Locally Stationary Processes.
We will now consider a family of processes which are approximately locally stationary. Let {x α,β (t), t ∈ Z, (α, β) ∈ Q}, where Q = {(α, β) : 0 < α ≤ β ≤ 1} be a set of jointly Gaussian random variables such that x q (t) and x p (s), q / = p, . Each random process {x α,β (t), t ∈ Z} is approximately locally stationary in Silverman's sense [13,14]. Such processes have been widely used in the literature, see for example [10,15]. Now, let Q be a random element of Q with uniform distribution on Q, that is, the density function is 2 in Q and 0 otherwise. The MSE optimal kernel function for this family, computed by the use of Theorem 1, is shown in Figure 1, where n = 64.
The optimal kernel function, H xQ−opt , for this family can be compared with the optimal kernel function for each member of the family. Figure 2 shows the ratio between the MSE when H xQ−opt is used and the MSE when H xα,β−opt is used on realizations from {x α,β (t), t ∈ Z}, where H xα,β−opt is the MSE optimal kernel function for the process {x α,β (t), t ∈ Z}.  The kernel function H xQ−opt works remarkably well for every member of the family, except when α is close to zero. In fact, for more than 50% of the members of this family, the use of the kernel function optimized for the whole family results in less than 8% larger MSE than the kernel optimized for each member.

A Family of Nonstationary AR(1)-Processes.
Let e(t) be a stationary AR(1)-process: e θ1 (t) = θ 1 e θ1 (t − 1) + (t), |θ 1 | < 1, where { (t), t ∈ Z} is a white Gaussian noise process with variance (1 − |θ 1 |) 1.5 . This process is enveloped in order to get a nonstationary random process: x θ1, θ2 (t) = e θ1 (t)e −(t−n/2−0.5) 2 /(θ2n) 2 . As seen, the process x θ1, θ2 is described by two parameters, θ 1 and θ 2 . In this example, we will compare two different kernels. The first one, H rect-opt , is optimized using Theorem 1 for the rectangular parameter set 0.5 ≤ θ 1 ≤ 0.9 and 0.5 ≤ θ 2 ≤ 1. More formally, we apply Theorem 1 on the family {x θ1, θ2 (t), t ∈ Z, (θ 1 , θ 2 ) ∈ Q}, where Q = [0.5, 0.9] × [0.5, 1] and where Q is a random element of Q with uniform distribution. The uniform distribution is approximated with an equidistant grid of point masses in order to simplify the integral expression. The second kernel is a separable kernel function, that is, a kernel function that can be separated into one function dependent on k and one function dependent on τ. Such kernels are well suited for covariance function estimation of the random processes that we consider in this example. We choose the separable kernel H sep-opt (k, τ) = h 1 (k)h 2 (τ), where h 1 and h 2 are Hanning windows, each with a length and amplitude that have been numerically MSE optimized for (θ 1 , θ 2 ) = (0.7, 0.75).
Thus, we now have two kernels, the first one optimized on the rectangular space 0.5 ≤ θ 1 ≤ 0.9 and 0.5 ≤ θ 2 ≤ 1 using Theorem 1, and the second one numerically optimized in its four parameters (length of the two Hanning windows and their amplitudes). We will now compare these for all processes 0 < θ 1 < 1.5, 0 < θ 2 < 1. Note that we The separable kernel, H sep , is numerically optimized at this point. The kernel, H rect-opt , is optimized for the rectangular area using Theorem 1.

Except for this region,
H rect-opt is MSE superior to H sep . Figure 4: The ratio between the MSE of the optimal kernel function for the family of nonstationary AR(1)-processes and the MSE of a separable kernel function. The first kernel has been optimized, as given by Theorem 1, for the processes with parameters inside the rectangle. The lengths and the amplitudes of the two Hanning windows of the separable kernel have been optimized for the parameter values at the circle. The black contour shows the border where the two kernels give equally MSE. Outside this region the kernel optimized as described in this paper is MSE superior to the separable kernel.
include processes outside the rectangular, where H rect-opt is optimized. The ratio between the MSE of the kernels is given in Figure 4 as a function of the parameter space. We see that the first kernel, H rect-opt , is better than the separable kernel nearly everywhere.

A Family of Chirp Processes.
In this example, we will study measurements of heart rate variability (HRV), [16]. Such measurements are often modeled as an observed realization from a nonstationary random process with stationary mean. The second-order moments are considered to be of greatest value from a medical perspective, [17]. Our HRV measurements can be expected to have an increasing frequency as the recording is made during an experiment with increasing respiratory rate. Our data consist of n = 170 HRV measurements with sampling rate 2 Hz. After the mean of the data has been removed, we consider it to be an observation of a nonstationary zero-meaned random process. In order to estimate the covariance function of this process, we use an estimator of the form (1), where we compute the kernel function to be MSE optimal to the following family of jointly Gaussian distributed enveloped chirps: let {x (α,β,γ) (t), t ∈ Z, (α, β, γ) ∈ Q}, where x (α,β,γ) (t) = Aw γ (t) sin(2πν (α,β) (t)t + ν 0 ) for all t ∈ T n and 0 otherwise, ν (α,β) (t) = αt/n + β, w γ (t) = 1/γ e −(t−n/2−0.5) 2 /(γn) 2 , ν 0 is a random variable uniformly distributed in [0, 2π), and A is a Rayleigh-distributed random variable independent of ν 0 . The parameters α and β can be thought of as the raise and starting point of the chirp frequency and γ as the width of the envelope. We choose the a priori distribution of the parameters to be uniform on −0.1 ≤ α ≤ 0.1, −0.25 ≤ β ≤ 0.25, 0.1 ≤ γ ≤ 1, but in order to simplify the computations we approximate this distribution with a uniform point distribution. The MSE optimal kernel function is computed as described in Theorem 1 and can be seen in Figure 5. As mentioned in the introduction, nonparametric covariance function estimators are not guaranteed to be non-negative definite, [7]. We make the resulting covariance matrix estimate non-negative definite by writing the estimate as an eigenvalue decomposition and removing the negative eigenvalues and their respective eigenvectors. The corresponding Wigner spectrum is computed using Jeong and Williams discrete Wigner representation, [4,18].