Study on adaptive compressed sensing & reconstruction of quantized speech signals

Yunyun, Ji; Zhen, Yang

doi:10.1186/1687-6180-2012-232

Research
Open access
Published: 31 October 2012


EURASIP Journal on Advances in Signal Processing volume 2012, Article number: 232 (2012)


Abstract

Compressed sensing (CS) has attracted growing attention in recent years because it samples and compresses sparse signals simultaneously. Owing to their natural characteristics, speech signals can be considered approximately sparse, or compressible, in some domains, which makes the application of CS to speech signals promising. This paper addresses three aspects. Firstly, the sparsity and the sparsifying matrix of speech signals are analyzed, and an adaptive sparsifying matrix based on the long-term prediction of voiced speech signals is constructed. Secondly, a CS matrix called the two-block diagonal (TBD) matrix is constructed for speech signals on the basis of existing block diagonal matrix theory; its performance is found empirically to be superior to that of the dense Gaussian random matrix when the sparsifying matrix is the DCT basis. Finally, we consider the effect of quantizing the projections. Two corollaries are derived about the impact of adaptive and nonadaptive quantization on reconstruction performance with two different matrices, the TBD matrix and the dense Gaussian random matrix. We find that adaptive quantization and the TBD matrix are two effective ways to mitigate the quantization effect on the reconstruction of speech signals in the framework of CS.

1 Introduction

In recent years, compressed sensing (CS) [1–4] has become a popular paradigm of signal acquisition and compression in applied science and engineering, with applications in image processing, wireless communication, magnetic resonance imaging (MRI) and so on. In contrast with conventional sampling at the Nyquist rate, CS theory demonstrates that a sparse signal can be exactly recovered from far fewer projections, provided that the sensing matrix is highly incoherent with the sparsifying matrix.

As an important branch of signal processing, speech signal processing has developed considerably in the past decades, and the application of CS theory to speech signal processing is becoming an active research focus. In [5, 6], the sparsity of the residual excitation is utilized to construct sparsifying matrices for voiced speech signals. However, in these two works, the sparsifying matrix constructed from the impulse response for voiced speech is impractical because it depends on the very signal that is being reconstructed. Therefore, a codebook of impulse response vectors generated from training speech data is proposed as the sparsifying matrix in [5].

This work also constructs an adaptive sparsifying matrix for voiced speech, based on the quasi-periodicity of voiced segments. This adaptive sparsifying matrix is a symmetric cyclic matrix generated on the basis of long-term prediction. It therefore depends on the previously reconstructed signal instead of the current signal.

Then, a CS matrix called the two-block diagonal (TBD) matrix is constructed for voiced speech signals. The concentration inequality of the TBD matrix is established in Section 4. It then follows from a theorem in [8] that the TBD matrix satisfies the restricted isometry property (RIP) [7].

The third key point of this work is quantization. It is well known that analog signals must be sampled, quantized and then encoded before transmission, so the quantization of CS projections is of great importance. The distortion-rate performance and some measures to mitigate the impact of quantization noise on reconstruction have been considered in [9–15]. In this paper, we apply uniform scalar quantization to the measurements of the speech signal and show quantitatively how adaptive quantization affects the reconstruction quality compared with nonadaptive quantization. In addition, we find that the TBD matrix is more robust to quantization noise than the dense Gaussian matrix, because the TBD matrix can effectively restrict the impact of quantization noise on the reconstruction of speech signals.

The rest of the paper is organized as follows. In Section 2, we briefly review the principle of CS. Section 3 presents the construction of an adaptive sparsifying matrix for voiced speech signals. In Section 4, a sensing matrix is constructed for voiced speech signals. And in Section 5, the effect of quantization of projections on reconstruction is discussed. Section 6 then concludes our work.

2 Compressed sensing background

Suppose that a vector $x = [x(1), x(2), \dots, x(N)]^{T}$ can be represented as a linear combination of basis vectors $\{ϕ_{1}, ϕ_{2}, \dots, ϕ_{N}\}$. Then we have

x = Ψ θ = \sum_{i = 1}^{N} ϕ_{i} θ (i)

(1)

where $Ψ = [ϕ_{1}, ϕ_{2}, \dots, ϕ_{N}]$ and $θ = [θ(1), θ(2), \dots, θ(N)]^{T}$. If the number of nonzero entries of θ, denoted by ${‖θ‖}_{l_{0}}$, satisfies

{‖θ‖}_{l_{0}} \leq K

(2)

x is considered to be K-sparse with respect to Ψ. Then Ψ is called a sparsifying matrix.

A matrix $Φ \in R^{M \times N}$ with M < N can be employed to project an N-dimensional vector onto an M-dimensional subspace. We then acquire a low-dimensional vector y, and we have

y = Φ x = Φ Ψ θ = A θ

(3)

where Φ is called the sensing matrix and A is called the CS matrix. The CS matrix must satisfy certain conditions for effective reconstruction of the coefficient vector θ, and RIP is a sufficient condition. In the following, we first recall the definition of the restricted isometry constant.

Definition 1 (Restricted isometry constant) ([7, 16]). The restricted isometry constant δ_K of matrix A is defined as the smallest quantity such that

(1 - δ_{K}) {‖θ‖}_{l_{2}}^{2} \leq {‖Α θ‖}_{l_{2}}^{2} \leq (1 + δ_{K}) {‖θ‖}_{l_{2}}^{2}

(4)

holds for all K-sparse vectors θ. The matrix A is then said to satisfy the K-order RIP with constant δ_K.

Although Eq. (3) is underdetermined, it is demonstrated in [16] that if

δ_{2 K} < \sqrt{2} - 1

(5)

we can find the exact solution for K-sparse vector θ from

\min {‖θ‖}_{l_{1}} \quad \text{s.t.} \quad y = A θ

(6)

which is called the basis pursuit (BP) algorithm [17].

When the measurement vector is corrupted by bounded noise and can be represented as

y = A θ + t

(7)

we can employ the basis pursuit denoising (BPDN) algorithm[17]

\min {‖θ‖}_{l_{1}} \quad \text{s.t.} \quad {‖y - A θ‖}_{l_{2}} \leq ε

(8)

to achieve effective reconstruction, where ε is an upper bound on the l₂-norm of the noise vector t. A theorem from [16] that characterizes the reconstruction performance of the BPDN algorithm is recalled in Section 5.

Another class of reconstruction algorithms is the greedy pursuit algorithms, which include orthogonal matching pursuit (OMP) [18], subspace pursuit (SP) [19], stagewise orthogonal matching pursuit (StOMP) [20], regularized orthogonal matching pursuit (ROMP) [21] and sparsity adaptive matching pursuit (SAMP) [22].
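To make the measurement model of Eq. (3) and the idea of greedy recovery concrete, the following is a minimal NumPy sketch of OMP. It is an illustration only, not the implementation used in this paper: the function name `omp`, the problem sizes, the random seed and the choice of the identity as sparsifying matrix (so that A = Φ) are all assumptions made for the example.

```python
import numpy as np

def omp(A, y, K):
    """Greedy OMP sketch: select K columns of A, then least-squares fit on them."""
    N = A.shape[1]
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(K):
        # Pick the column most correlated with the current residual.
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0  # never pick the same column twice
        support.append(int(np.argmax(correlations)))
        # Re-fit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    theta_hat = np.zeros(N)
    theta_hat[support] = coef
    return theta_hat

# Illustrative setup (assumed values): a 3-sparse vector and a dense Gaussian
# CS matrix with i.i.d. entries of mean zero and variance 1/M.
rng = np.random.default_rng(0)
N, M, K = 256, 128, 3
theta = np.zeros(N)
theta[rng.choice(N, size=K, replace=False)] = rng.choice([-1.0, 1.0], size=K)
A = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))
y = A @ theta                      # noiseless measurements, Eq. (3)
theta_hat = omp(A, y, K)
print("max abs error:", np.max(np.abs(theta_hat - theta)))
```

The sketch assumes the sparsity level K is known in advance; algorithms such as SAMP are designed to avoid that assumption.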

3 Sparsity and sparsifying matrix of speech signals

Because of natural characteristics such as their rich frequency content, speech signals are not exactly sparse in a strict sense. They can only be regarded as compressible, with many nonzero but small coefficients in some basis such as the DCT. Since sparsity is the precondition of CS, we first construct an adaptive sparsifying matrix for voiced segments in the following.

3.1 Sparsifying matrix and sparsity of voiced speech

The sparsity of voiced speech is related to its quasi-periodicity. In conventional speech coding systems, long-term prediction is commonly used to minimize the mean-square error between the predicted and the true values of voiced speech signals [23]. Suppose that a voiced segment includes several pitch periods (the pitch period being the reciprocal of the vibration frequency of the vocal cords), and let x_i and x_{i+1} denote the vectors of the ith and the (i+1)th periods respectively. According to the principle of long-term prediction, we have

x_{i + 1} (n) \approx β (- 1) x_{i} (n - T + 1) + β (0) x_{i} (n - T) + β (1) x_{i} (n - T - 1), \quad n = i T, i T + 1, \dots, (i + 1) T - 1

(9)

where T denotes the pitch period expressed as a number of samples. Given the quasi-periodicity of voiced speech, some assumptions are made below.

For the first point and the last point of the (i+1)th period, we have

x_{i + 1} (i T) \approx β (- 1) x_{i} ((i - 1) T + 1) + β (0) x_{i} ((i - 1) T) + β (1) x_{i} ((i - 1) T - 1)

and

x_{i + 1} ((i + 1) T - 1) \approx β (- 1) x_{i} (i T) + β (0) x_{i} (i T - 1) + β (1) x_{i} (i T - 2) .

However, the time-domain range of x_i is from (i − 1)T to iT − 1, so the samples x_i((i − 1)T − 1) and x_i(iT) required above fall outside it. We therefore extend x_i cyclically, setting x_i((i − 1)T − 1) and x_i(iT) in Eq. (9) equal to x_i(iT − 1) and x_i((i − 1)T) respectively, and then we have

[\begin{array}{c} x_{i + 1} (i T) \\ x_{i + 1} (i T + 1) \\ x_{i + 1} (i T + 2) \\ ⋮ \\ x_{i + 1} ((i + 1) T - 1) \end{array}] \approx [\begin{array}{c} x_{i} ((i - 1) T) & x_{i} ((i - 1) T + 1) & \dots & x_{i} (i T - 1) \\ x_{i} ((i - 1) T + 1) & x_{i} ((i - 1) T + 2) & \dots & x_{i} ((i - 1) T) \\ x_{i} ((i - 1) T + 2) & x_{i} ((i - 1) T + 3) & \dots & x_{i} ((i - 1) T + 1) \\ ⋮ & ⋮ & ⋮ & ⋮ \\ x_{i} (i T - 1) & x_{i} ((i - 1) T) & \dots & x_{i} (i T - 2) \end{array}] [\begin{array}{c} β (0) \\ β (- 1) \\ 0 \\ ⋮ \\ β (1) \end{array}]

(10)

Furthermore, in terms of Eq. (10), we establish that

Ψ = [\begin{array}{c} x_{i} ((i - 1) T) & x_{i} ((i - 1) T + 1) & x_{i} ((i - 1) T + 2) & \dots & x_{i} (i T - 1) \\ x_{i} ((i - 1) T + 1) & x_{i} ((i - 1) T + 2) & x_{i} ((i - 1) T + 3) & \dots & x_{i} ((i - 1) T) \\ x_{i} ((i - 1) T + 2) & x_{i} ((i - 1) T + 3) & x_{i} ((i - 1) T + 4) & \dots & x_{i} ((i - 1) T + 1) \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ x_{i} (i T - 1) & x_{i} ((i - 1) T) & x_{i} ((i - 1) T + 1) & \dots & x_{i} (i T - 2) \end{array}]

(11)

β = {[\begin{array}{c} β (0) & β (- 1) & 0 & \dots & β (1) \end{array}]}^{T}

(12)

and

x_{i + 1} \approx Ψ β

(13)

Thus, the vector β is called the coefficient vector of x_{i+ 1} with respect to the adaptive sparsifying matrix Ψ and we have

{‖β‖}_{l_{0}} = 3.

(14)

It is clear that x_{i+1} is approximately sparse with respect to the matrix Ψ defined in Eq. (11), which is composed of the components of x_i. Thus, at the decoder, the recovered signal of the current pitch period can be used to construct the sparsifying matrix for the signal of the next pitch period.
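The construction in Eqs. (10)–(13) can be sketched in a few lines of NumPy. The function names, the toy values standing in for a previous pitch period and the three prediction coefficients below are illustrative assumptions, not data from the experiments; the sketch only checks that Ψβ reproduces the three-tap prediction of Eq. (9) with cyclic extension.

```python
import numpy as np

def adaptive_sparsifying_matrix(x_prev):
    """Build the T x T matrix of Eq. (11): entry (r, c) is x_prev[(r + c) mod T]."""
    x_prev = np.asarray(x_prev, dtype=float)
    T = x_prev.size
    idx = (np.arange(T)[:, None] + np.arange(T)[None, :]) % T
    return x_prev[idx]

def prediction_coefficients(T, beta_m1, beta_0, beta_p1):
    """Coefficient vector of Eq. (12): [beta(0), beta(-1), 0, ..., 0, beta(1)]."""
    beta = np.zeros(T)
    beta[0] = beta_0
    beta[1] = beta_m1
    beta[-1] = beta_p1
    return beta

# Toy "previous pitch period" (hypothetical values, not speech data).
T = 8
x_prev = np.cos(2 * np.pi * np.arange(T) / T) + 0.3 * np.arange(T)
Psi = adaptive_sparsifying_matrix(x_prev)
beta = prediction_coefficients(T, beta_m1=0.2, beta_0=0.7, beta_p1=0.1)

# Eq. (13): the predicted next period is a 3-sparse combination of columns of Psi.
x_next = Psi @ beta
print(np.count_nonzero(beta), x_next.shape)
```

Because each row of Ψ is a cyclic shift of the previous period, the product Ψβ is the same as weighting the previous period and its two neighbouring cyclic shifts, which is why only three entries of β are nonzero.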

As the adaptive sparsifying matrix Ψ is a real symmetric cyclic matrix, its eigenvalues can be obtained in closed form [24]; we denote them by λ_m (m = 0, 1, …, T − 1). We define

f (z) = \sum_{l = 0}^{T - 1} x_{i} ((i - 1) T + l) z^{l}

(15)

and

ω = e^{j \frac{2 π}{T}}

(16)

Suppose that T is even. Then we have

λ_{0} = f (1)

(17)

λ_{m} = |f (ω^{m})| (m = 1, 2 \dots \frac{T}{2} - 1)

(18)

λ_{m} = - |f (ω^{m - \frac{T}{2} + 1})| (m = \frac{T}{2}, \frac{T}{2} + 1, \dots T - 2)

(19)

and

λ_{T - 1} = f (- 1)

(20)

Otherwise, when T is odd, we have

λ_{0} = f (1)

(21)

λ_{m} = |f (ω^{m})| (m = 1, 2 \dots \frac{T - 1}{2})

(22)

and

λ_{m} = - |f (ω^{m - \frac{T - 1}{2}})| (m = \frac{T - 1}{2} + 1, \frac{T - 1}{2} + 2, \dots T - 1) .

(23)

Moreover, recall that the DFT of x_i can be expressed as $X_{i} (k) = \sum_{l = 0}^{T - 1} x_{i} ((i - 1) T + l) ω^{- l k}$ . Since x_i is real, f(ω^m) is the complex conjugate of X_i(m) and the two have the same magnitude. We thus obtain the relation between the eigenvalues of the adaptive sparsifying matrix Ψ and the spectrum of the signal x_i. When T is even, we have

λ_{m} = |X_{i} (m)| (m = 1, 2 \dots \frac{T}{2} - 1)

(24)

and

λ_{m} = - |X_{i} (m - \frac{T}{2} + 1)| (m = \frac{T}{2}, \frac{T}{2} + 1, \dots T - 2)

(25)

And when T is odd, we have

λ_{m} = |X_{i} (m)| (m = 1, 2 \dots \frac{T - 1}{2})

(26)

and

λ_{m} = - |X_{i} (m - \frac{T - 1}{2})| (m = \frac{T - 1}{2} + 1, \frac{T - 1}{2} + 2, \dots T - 1)

(27)

And we define

g = λ_{0} λ_{T - 1} \prod_{m = 1}^{T - 2} λ_{m}

(28)

Since g is the product of all the eigenvalues of Ψ, that is, its determinant, the adaptive sparsifying matrix Ψ defined in Eq. (11) is invertible if and only if g ≠ 0.
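The relation between the eigenvalues of Ψ and the DFT of x_i in Eqs. (17)–(27) can be checked numerically. The sketch below is an illustrative check under stated assumptions: the function names are hypothetical, the test vectors are random numbers rather than speech, and `np.fft.fft` is used because it follows the same sign convention as the DFT written above.

```python
import numpy as np

def adaptive_sparsifying_matrix(x_prev):
    """Matrix of Eq. (11): entry (r, c) is x_prev[(r + c) mod T]."""
    x_prev = np.asarray(x_prev, dtype=float)
    T = x_prev.size
    idx = (np.arange(T)[:, None] + np.arange(T)[None, :]) % T
    return x_prev[idx]

def eigenvalues_from_dft(x_prev):
    """Eigenvalues predicted by Eqs. (17)-(27), returned in ascending order."""
    X = np.fft.fft(np.asarray(x_prev, dtype=float))
    T = X.size
    eig = [X[0].real]                       # f(1), Eqs. (17) and (21)
    if T % 2 == 0:
        eig.append(X[T // 2].real)          # f(-1), Eq. (20), only for even T
    for k in range(1, (T - 1) // 2 + 1):
        eig.extend([abs(X[k]), -abs(X[k])]) # +|X(k)| and -|X(k)| pairs
    return np.sort(np.array(eig))

rng = np.random.default_rng(1)
for T in (8, 7):                            # one even and one odd period length
    x_prev = rng.standard_normal(T)
    direct = np.linalg.eigvalsh(adaptive_sparsifying_matrix(x_prev))
    predicted = eigenvalues_from_dft(x_prev)
    g = np.prod(predicted)                  # Eq. (28): product of all eigenvalues
    print(T, np.allclose(direct, predicted), g != 0.0)
```

In this reading, Ψ fails to be invertible exactly when some DFT coefficient of the previous period vanishes, which ties the invertibility condition to the spectrum of x_i.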

Although this adaptive sparsifying matrix is not a canonical basis in the conventional sense, it has two advantages. On the one hand, because it is constructed from the recovered signal, the decoder needs no additional storage, and the encoder need not spend time collecting training data to construct a codebook and transmit it to the decoder, as in the approach proposed in [5]. On the other hand, speech signals are more nearly sparse with respect to this adaptive sparsifying matrix than with respect to the DCT basis, which is verified by the comparison of reconstruction performance in Subsection 3.3.

3.2 Sparsity of unvoiced speech signals

The transform coefficients of unvoiced speech signals are nearly uniformly distributed in the frequency domain with no obvious decay. Consequently, unvoiced speech is poorly sparse with respect to the DCT basis, and we have not found a satisfactory sparsifying matrix for it. The usual practice in the framework of CS is therefore to apply the scheme to the entire speech signal without distinguishing voiced from unvoiced segments in advance. We find that the overall performance is not greatly affected, as verified by the simulation results in the following subsection. The reason is that voiced speech makes up more than seventy percent of the signal and carries the dominant information of speech. Constructing a basis or a redundant dictionary for unvoiced speech signals nevertheless remains of great significance and is the focus of our future work.

3.3 Simulation

Some simulation results are presented in this subsection to show the performance of the adaptive sparsifying matrix. The testing speech signals are sampled at 16 kHz with a frame length of N = 320. There are 152 frames, including 135 frames of voiced speech and 17 frames of unvoiced speech. The sensing matrix used in this section is the dense Gaussian random matrix whose entries are i.i.d. Gaussian random variables with mean zero and variance $\frac{1}{M}$ . The BP algorithm is used in this subsection to reconstruct the speech signals.

It should be pointed out that the first pitch period in each frame is recovered with respect to the DCT basis. Moreover, the following pitch periods are compressed with the same compression rate and at the decoder we achieve reconstruction with respect to the adaptive sparsifying matrix. The compression rate is defined as

u = \frac{M}{N}

(29)

Moreover, it is necessary for us to distinguish the compression rate denoted as u_f for the first pitch period and the compression rate denoted as u_s for the following periods. Thus, we have

u_{f} = \frac{M_{f}}{N}

(30)

and

u_{s} = \frac{M_{s}}{N}

(31)

where M_f and M_s represent the number of measurements for the first period and the following ones respectively. Moreover, it is required that

u_{f} \geq u_{s}

(32)

for mitigating error propagation.

The measure used to evaluate the reconstruction performance is signal to noise ratio (SNR) which is defined as

SNR = 10 {log}_{10} \frac{{‖x‖}_{l_{2}}^{2}}{{‖x - x^{*}‖}_{l_{2}}^{2}}

(33)

where x^* is the reconstructed signal vector.
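For reference, Eqs. (29) and (33) translate directly into code. The following is a small sketch with hypothetical function names and made-up example vectors, intended only to pin down the definitions.

```python
import numpy as np

def compression_rate(M, N):
    """Eq. (29): ratio of the number of measurements M to the signal length N."""
    return M / N

def snr_db(x, x_rec):
    """Eq. (33): reconstruction SNR in dB for a signal x and its reconstruction x_rec."""
    x = np.asarray(x, dtype=float)
    x_rec = np.asarray(x_rec, dtype=float)
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_rec) ** 2))

# Made-up example: a frame of length N = 320 with M = 160 measurements,
# and a two-sample toy signal with a small reconstruction error.
print(compression_rate(160, 320))
print(snr_db([3.0, 4.0], [3.0, 3.0]))
```

Note that the SNR is undefined (infinite) for a perfect reconstruction, so averaged results implicitly assume a nonzero error in every frame.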

As the adaptive sparsifying matrix is constructed according to the quasi-periodicity of voiced speech, it is necessary to analyze the reconstruction performance for different types of voiced speech signals. We analyze the testing speech signals and identify three types of voiced speech signals, which are shown in Figure 1a, Figure 2a and Figure 3a. In the testing speech signals there are 41 frames, 21 frames and 18 frames of voiced speech similar to the first, the second and the third type respectively. Figure 1b, Figure 2b and Figure 3b show the average compression rate for the above three types of voiced speech signals.

Figure 4, Figure 5 and Figure 6 compare the reconstruction quality obtained with the adaptive sparsifying matrix and with the DCT basis for the above three types of voiced speech signals. Figure 4a, Figure 5a and Figure 6a show the average SNR of each pitch period at different compression rates, and Figure 4b, Figure 5b and Figure 6b show the average SNR of each frame.

Regardless of the type of pitch period, when u_s ≤ 0.5 the reconstruction performance of the adaptive sparsifying matrix is far better than that of the DCT basis. When u_s > 0.5, the adaptive sparsifying matrix and the DCT basis perform similarly for the first and third types of voiced speech. For the second type of voiced speech, however, the reconstruction performance of the adaptive sparsifying matrix is slightly worse than that of the DCT basis. The reason is that the amplitude of the second type of voiced speech attenuates strongly, which weakens its quasi-periodicity.

Figure 7 illustrates the average reconstruction performance over all the voiced speech signals in the testing speech signals. The adaptive sparsifying matrix achieves better reconstruction performance for voiced speech than the DCT basis when u_s ≤ 0.5. However, Figure 7 also shows that with u_s = 0.7 the reconstruction performance with respect to the adaptive sparsifying matrix is slightly worse than that of the DCT basis. The reason is that voiced speech is far more nearly sparse with respect to the adaptive sparsifying matrix than with respect to the DCT basis, but the overall approximation accuracy of the adaptive sparsifying matrix is slightly lower than that of the DCT basis.

Finally, we apply the adaptive sparsifying matrix to the entire speech signals, including voiced and unvoiced speech, and illustrate the reconstruction performance in Figure 8. Compared with Figure 7, the performance in this case degrades only slightly.

4 Sensing matrix for speech signals

4.1 Two-block diagonal matrix

A sufficient condition for successful reconstruction of a sparse vector from undersampled measurements is that the CS matrix satisfies RIP with a required constant. It has been shown in the literature [1, 2, 8] that a dense Gaussian random matrix whose entries are i.i.d. random variables drawn from the normal distribution with mean zero and variance $\frac{1}{M}$ satisfies RIP with high probability.

In this section, a sensing matrix is constructed according to the characteristics of voiced speech signals. In [25–29], a kind of structured random matrix called the block diagonal matrix is applied to achieve CS in wireless communication and image processing. In [25, 26], many identical blocks are used to construct a block diagonal sensing matrix for image processing, without a proof that it satisfies RIP. From the viewpoint of information theory, [27] proposes a block diagonal matrix for natural images, also without a proof that it satisfies RIP. In addition, [28, 29] establish RIP for block diagonal matrices.

In this work, a specific block diagonal matrix with just two different blocks, called the two-block diagonal (TBD) matrix, is constructed for voiced speech signals, and a simple proof of its RIP is given, although proofs of RIP for general block diagonal matrices have already been given in [28, 29].

As we know, the spectral energy of voiced speech signals is concentrated in low-frequency domain and decays rapidly. Thus, the high-frequency coefficients of a voiced speech signal in DCT domain are much sparser than the low-frequency coefficients. In the following, the definition of the TBD matrix is stated.

Definition 2 (TBD matrix) A matrix $A \in R^{M \times N}$ is defined as the TBD matrix if it is endowed with the following structure

Α = [\begin{array}{c} Φ_{1} & 0 \\ 0 & Φ_{2} \end{array}]

(34)

where $Φ_{1} \in R^{M_{1} \times N_{1}}$ is a Gaussian random matrix whose entries are i.i.d. random variables drawn according to normal distribution with mean zero and variance $\frac{1}{M_{1}}$ and $Φ_{2} \in R^{M_{2} \times N_{2}}$ is also a Gaussian random matrix whose entries are i.i.d. random variables drawn according to normal distribution with mean zero and variance $\frac{1}{M_{2}}$ .

In line with the concentration of the DCT coefficients of voiced speech in the low-frequency domain, a matrix $Φ = [\begin{array}{cc} Φ_{1} & 0 \\ 0 & Φ_{2} \end{array}] Ψ^{T}$ is constructed as a sensing matrix for voiced speech signals, where Ψ^T is the transpose of an orthonormal basis. In addition, it is required that

M_{1} \geq M_{2}

(35)

M_{1} + M_{2} = M

(36)

and

N_{1} + N_{2} = N

(37)

And then we have

y = Φ x = [\begin{array}{c} Φ_{1} & 0 \\ 0 & Φ_{2} \end{array}] Ψ^{T} x = [\begin{array}{c} Φ_{1} & 0 \\ 0 & Φ_{2} \end{array}] θ = Α θ

(38)

where θ = Ψ^T x. It therefore remains to prove that A satisfies RIP.
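The structure of Definition 2 and Eq. (38) can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the DCT basis is taken to be the orthonormal DCT-II, and the block sizes N₁, M₁ and M₂ in the example are arbitrary choices that merely satisfy Eqs. (35)–(37), not the values used in the experiments.

```python
import numpy as np

def dct_basis(N):
    """Orthonormal DCT-II basis Psi (columns are basis vectors), so theta = Psi.T @ x."""
    n = np.arange(N)
    k = n[:, None]
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n[None, :] + 1) * k / (2 * N))
    C[0, :] /= np.sqrt(2.0)
    return C.T

def tbd_matrix(M1, N1, M2, N2, rng):
    """TBD matrix of Eq. (34): two Gaussian blocks with variances 1/M1 and 1/M2."""
    A = np.zeros((M1 + M2, N1 + N2))
    A[:M1, :N1] = rng.normal(0.0, 1.0 / np.sqrt(M1), size=(M1, N1))
    A[M1:, N1:] = rng.normal(0.0, 1.0 / np.sqrt(M2), size=(M2, N2))
    return A

rng = np.random.default_rng(0)
N, N1 = 320, 80            # assumed split into low- and high-frequency coefficients
M1, M2 = 96, 16            # assumed numbers of measurements, with M1 >= M2
Psi = dct_basis(N)
A = tbd_matrix(M1, N1, M2, N - N1, rng)
Phi = A @ Psi.T            # sensing matrix applied directly to the time-domain frame

x = rng.standard_normal(N)            # stand-in for a speech frame
theta = Psi.T @ x                     # DCT coefficients
print(np.allclose(Phi @ x, A @ theta))  # both sides of Eq. (38)
```

The first M₁ measurements depend only on the first N₁ DCT coefficients and the remaining M₂ measurements only on the other N₂, which is what allows more measurements to be spent on the low-frequency part.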

Lemma 1 (Concentration inequality of TBD matrix) Suppose that the matrix A is a TBD matrix as defined in Definition 2. Then, for a prescribed constant δ, the matrix obeys the concentration inequality

P (|{‖Α θ‖}_{l_{2}}^{2} - {‖θ‖}_{l_{2}}^{2}| \geq δ {‖θ‖}_{l_{2}}^{2}) \leq 2 e^{- M C (δ)}

(39)

where C(δ) is a constant depending on δ.
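The behaviour described by Lemma 1 can be observed empirically with a small Monte Carlo experiment. The sketch below proves nothing; it only draws many TBD matrices for one fixed coefficient vector and looks at how tightly ‖Aθ‖² clusters around ‖θ‖². The function name, the block sizes, the decaying test vector and the number of trials are all assumptions chosen for illustration.

```python
import numpy as np

def tbd_energy_ratios(theta, N1, M1, M2, trials, rng):
    """Draw `trials` TBD matrices and return ||A theta||^2 / ||theta||^2 for each."""
    theta1, theta2 = theta[:N1], theta[N1:]
    ratios = np.empty(trials)
    for t in range(trials):
        Phi1 = rng.normal(0.0, 1.0 / np.sqrt(M1), size=(M1, theta1.size))
        Phi2 = rng.normal(0.0, 1.0 / np.sqrt(M2), size=(M2, theta2.size))
        energy = np.sum((Phi1 @ theta1) ** 2) + np.sum((Phi2 @ theta2) ** 2)
        ratios[t] = energy / np.sum(theta ** 2)
    return ratios

rng = np.random.default_rng(0)
N, N1, M1, M2 = 64, 16, 24, 8
theta = 1.0 / (1.0 + np.arange(N))   # coefficients decaying with frequency index
ratios = tbd_energy_ratios(theta, N1, M1, M2, trials=2000, rng=rng)

print("mean ratio:", ratios.mean())
for delta in (0.3, 0.6, 0.9):
    # Empirical estimate of the left-hand side of Eq. (39).
    print(delta, np.mean(np.abs(ratios - 1.0) >= delta))
```

Increasing M₁ and M₂ in this experiment makes the deviations rarer, in line with the exponential dependence on M in Eq. (39).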

The proof of Lemma 1 can be found in the Appendix. In order to prove that the TBD matrix satisfies RIP, a theorem in [8] is first recalled.

Theorem 1 ([8]) Suppose that a CS matrix A satisfies the concentration inequality. If

K \leq c_{1} M / log (N / K)

(40)

the matrix A satisfies the K-order RIP with the prescribed constant δ with probability $\geq 1 - 2 e^{- c_{2} M}$ , where c₁ and c₂ are constants depending on δ.

Therefore, in light of Lemma 1 and Theorem 1, the TBD matrix A satisfies RIP with high probability. In fact, the TBD matrix can also be employed as the CS matrix when the sparsifying matrix is the adaptive sparsifying matrix of Section 3, because the coefficient vector β in Eq. (12) exhibits a concentration characteristic similar to that of the DCT coefficients. However, it is inappropriate to employ the adaptive sparsifying matrix and the TBD matrix simultaneously in a CS system. Firstly, the adaptive sparsifying matrix must be orthonormalized in this case, which increases the computational complexity of the CS system. Secondly, more parameters need to be adjusted. Last but not least, the TBD matrix cannot considerably improve the reconstruction performance with respect to the adaptive sparsifying matrix, because the coefficient vector β is already extremely compressible and the approximation accuracy is limited. Thus, we employ the DCT basis as the sparsifying matrix for speech signals in Section 4 and Section 5.

4.2 Simulation

The testing speech signals used in the experiments of this subsection are the same as in Section 3. The BP algorithm is also employed in this subsection to achieve reconstruction. At first, we define

u_{l} = \frac{M_{1}}{N} \quad \text{and} \quad u_{h} = \frac{M_{2}}{N}

(41)

and then we have

u = u_{l} + u_{h} \quad \text{and} \quad u_{l} \geq u_{h}

(42)

In this subsection, we first compare the reconstruction performance of the TBD matrix and the dense Gaussian random matrix with respect to the adaptive sparsifying matrix. Figure 9a and Figure 9b show the average SNR of pitch periods and of frames respectively for the two matrices in the case of the adaptive sparsifying matrix. Figure 9 shows that the TBD matrix does not bring a worthwhile improvement in reconstruction performance with respect to the adaptive sparsifying matrix. Therefore, we focus on the reconstruction performance when the TBD matrix is used as the CS matrix with respect to the DCT basis.

Figure 10a compares the average SNR of the 135 frames of voiced speech signals for the TBD matrix and the dense Gaussian random matrix when the sparsifying matrix is the DCT basis. With suitable values of u_l and u_h, the TBD matrix performs much better than the dense Gaussian random matrix, especially when the overall compression rate u is relatively small.

Figure 10b compares the average SNR of the entire testing speech signals for the TBD matrix and the dense Gaussian random matrix. Although the overall reconstruction performance degrades slightly, the TBD matrix with suitable values of u_l and u_h still performs much better than the dense Gaussian random matrix.

More importantly, as the TBD matrix can restrict the impact of quantization noise on the reconstruction of speech signals, it attains better reconstruction performance than the dense Gaussian matrix when the measurements are quantized, which is described in detail in the next section.

5 Quantization effect on speech signals with compressed sensing

5.1 Quantization of speech signals in the framework of CS

In this paper, we apply CS to speech signals to achieve efficient compression. However, we still need to quantize the projections before transmission. We first analyze the distribution of the projections. When the sensing matrix is the dense Gaussian random matrix, we have

y = Φ x = \sum_{i = 1}^{N} x (i) ϕ_{i}

(43)

where $Φ = [\begin{array}{cccc} ϕ_{1} & ϕ_{2} & \dots & ϕ_{N} \end{array}] = [\begin{array}{cccc} ϕ_{1, 1} & ϕ_{1, 2} & \dots & ϕ_{1, N} \\ ϕ_{2, 1} & ϕ_{2, 2} & \dots & ϕ_{2, N} \\ ⋮ & ⋮ & ⋮ & ⋮ \\ ϕ_{M, 1} & ϕ_{M, 2} & \dots & ϕ_{M, N} \end{array}]$ .

And then, we can obtain

y (k) = \sum_{i = 1}^{N} x (i) ϕ_{k, i} (k = 1, 2 \dots M)

(44)

where each ϕ_{k,i} is an i.i.d. Gaussian random variable with mean zero and variance $\frac{1}{M}$ . Thus, for a given x, the y(k) are independent Gaussian random variables with

E (y (k)) = 0

(45)

D (y (k)) = \frac{1}{M} \sum_{i = 1}^{N} {(x (i))}^{2} = \frac{1}{M} {‖x‖}_{l_{2}}^{2}

(46)

However, when the CS matrix is the TBD matrix, we have

y = Α θ = [\begin{array}{c} Φ_{1} & 0 \\ 0 & Φ_{2} \end{array}] [\begin{array}{c} θ_{1} \\ θ_{2} \end{array}]

(47)

where $Φ_{1} = [\begin{array}{c} ϕ_{1}^{1} & ϕ_{2}^{1} & \dots & ϕ_{N_{1}}^{1} \end{array}]$ and $Φ_{2} = [\begin{array}{c} ϕ_{1}^{2} & ϕ_{2}^{2} & \dots & ϕ_{N_{2}}^{2} \end{array}]$ are both dense Gaussian random matrices. Thus, we have

y (k) = \sum_{i = 1}^{N_{1}} ϕ_{k, i}^{1} θ_{1} (i) (k = 1, 2 \dots M_{1})

(48)

and

y (k) = \sum_{i = 1}^{N_{2}} ϕ_{k, i}^{2} θ_{2} (i) (k = M_{1} + 1, M_{1} + 2, \dots M_{1} + M_{2})

(49)

Then y(k) is also an independent Gaussian random variable with

E (y (k)) = 0 (k = 1, 2 \dots M_{1} + M_{2})

(50)

D (y (k)) = \frac{1}{M_{1}} {‖θ_{1}‖}_{l_{2}}^{2} (k = 1, 2 \dots M_{1})

(51)

and

D (y (k)) = \frac{1}{M_{2}} {‖θ_{2}‖}_{l_{2}}^{2} (k = M_{1} + 1, M_{1} + 2, \dots M_{1} + M_{2})

(52)

We apply uniform scalar quantization to the projections. In [30], the noise power generated by uniform scalar quantization of a Gaussian-distributed input is analyzed, and a table of the optimal finite quantization range for different numbers of quantization levels is provided, which supports our following analysis of adaptive quantization.

Speech signals are time-variant, and the energy of different segments can vary greatly. Furthermore, in expectation, the energy of the measurement vector is equal to that of the signal vector. Therefore, it is necessary to apply adaptive quantization to the projections. In the following, the effect of adaptive quantization on reconstruction performance is discussed in the framework of CS.

As we know, the noise of a uniform scalar quantizer is induced by quantization and saturation. Let ∆ denote the quantization interval, Q the number of quantization intervals and σ_i the standard deviation of the projections of the ith frame of voiced speech. [−mσ_i, mσ_i] is the quantization range for the ith frame. When the quantization is adaptive, [−mσ_{i+1}, mσ_{i+1}] is the quantization range for the (i+1)th frame. Otherwise, when the quantizer is fixed, [−mσ_i, mσ_i] is used, for convenience of analysis, as the quantization range for the (i+1)th frame as well. In other words, the fixed quantizer performs nonadaptive quantization. Let EN_a denote the noise power of the adaptive quantizer and EN_f that of the fixed quantizer. Hence, for the adaptive quantizer, we have

Δ = \frac{2 m σ_{i + 1}}{Q}

(53)

and

\begin{array}{l} E N_{a} = \frac{Δ^{2}}{12} (1 - 2 \int_{m}^{+ \infty} \frac{1}{\sqrt{2 π}} e^{- \frac{t^{2}}{2}} d t) \\ + 2 ((m^{2} + 1) σ_{i + 1}^{2} \int_{m}^{+ \infty} \frac{1}{\sqrt{2 π}} e^{- \frac{t^{2}}{2}} d t - \frac{m σ_{i + 1}^{2}}{\sqrt{2 π}} e^{- \frac{m^{2}}{2}}) \end{array}

(54)

For a fixed quantizer, we have

Δ = \frac{2 m σ_{i}}{Q}

(55)

and

\begin{array}{l} E N_{f} = \frac{Δ^{2}}{12} (1 - 2 \int_{m \frac{σ_{i}}{σ_{i + 1}}}^{+ \infty} \frac{1}{\sqrt{2 π}} e^{- \frac{t^{2}}{2}} d t) \\ + 2 (σ_{i + 1}^{2} (1 + \frac{m^{2} σ_{i}^{2}}{σ_{i + 1}^{2}}) \int_{m \frac{σ_{i}}{σ_{i + 1}}}^{+ \infty} \frac{1}{\sqrt{2 π}} e^{- \frac{t^{2}}{2}} d t \\ - \frac{m σ_{i} σ_{i + 1}}{\sqrt{2 π}} e^{- \frac{m^{2} σ_{i}^{2}}{2 σ_{i + 1}^{2}}}) \end{array}

(56)

From Eq. (56), it is clear that the noise power of a fixed quantizer depends not only on the variance of the current frame but also on the ratio of the variances of the two successive frames.
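The two noise-power expressions, Eqs. (54) and (56), can be evaluated numerically as sketched below, together with a simple uniform scalar quantizer. Several things here are assumptions made for illustration: the function names, the loading factor m = 3 (the optimal values tabulated in [30] are not reproduced here), and the mid-rise quantizer design, which clips out-of-range inputs to the outermost reconstruction level and is therefore only an approximation of the saturation model behind the formulas.

```python
import math
import numpy as np

def gaussian_tail(a):
    """P(t > a) for a standard normal variable t."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))

def noise_power_adaptive(m, Q, sigma_next):
    """Eq. (54): quantization range [-m*sigma_next, m*sigma_next]."""
    delta = 2.0 * m * sigma_next / Q                       # Eq. (53)
    tail = gaussian_tail(m)
    granular = delta ** 2 / 12.0 * (1.0 - 2.0 * tail)
    overload = 2.0 * ((m ** 2 + 1.0) * sigma_next ** 2 * tail
                      - m * sigma_next ** 2 / math.sqrt(2.0 * math.pi)
                      * math.exp(-m ** 2 / 2.0))
    return granular + overload

def noise_power_fixed(m, Q, sigma_prev, sigma_next):
    """Eq. (56): range [-m*sigma_prev, m*sigma_prev] applied to a frame with std sigma_next."""
    delta = 2.0 * m * sigma_prev / Q                       # Eq. (55)
    a = m * sigma_prev / sigma_next                        # normalised clipping point
    tail = gaussian_tail(a)
    granular = delta ** 2 / 12.0 * (1.0 - 2.0 * tail)
    overload = 2.0 * (sigma_next ** 2 * (1.0 + a ** 2) * tail
                      - m * sigma_prev * sigma_next / math.sqrt(2.0 * math.pi)
                      * math.exp(-a ** 2 / 2.0))
    return granular + overload

def uniform_quantize(y, half_range, Q):
    """Mid-rise uniform quantizer with Q intervals on [-half_range, half_range]."""
    delta = 2.0 * half_range / Q
    index = np.clip(np.floor((np.asarray(y, dtype=float) + half_range) / delta), 0, Q - 1)
    return -half_range + (index + 0.5) * delta

m, Q = 3.0, 32                                             # m = 3 is an assumed loading factor
print("adaptive:", noise_power_adaptive(m, Q, 1.0))
for ratio in (1.25, 0.75):                                 # sigma_i / sigma_{i+1}
    print("fixed, ratio", ratio, ":", noise_power_fixed(m, Q, ratio, 1.0))
```

With this assumed m, a range that is too wide (ratio 1.25) mainly inflates the granular term through a larger ∆, while a range that is too narrow (ratio 0.75) mainly inflates the saturation term.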

Theorem 2 ([16]): Suppose that θ is an approximately sparse vector in R^N. Assuming that the 2K-order restricted isometry constant of the CS matrix satisfies

δ_{2 K} < \sqrt{2} - 1

(57)

the solution θ^* to Eq. (8) obeys

{‖θ^{*} - θ‖}_{l_{2}} \leq C_{1} ε + C_{2} \frac{{‖θ - θ_{K}‖}_{l_{1}}}{\sqrt{K}}

(58)

where C₁ and C₂ are constants depending on δ_2K.

For an adaptive quantizer, the reconstruction SNR is written as SNR_a, and for a fixed quantizer, it is written as SNR_f. In the following, two corollaries about the impact of adaptive quantization on reconstruction performance are derived. As we focus on the effect of quantization noise, in the two corollaries below we assume that ${‖θ - θ_{K}‖}_{l_{1}}$ tends to zero.

Corollary 1: Suppose that x is a voiced speech signal vector and the sparsifying matrix is an orthonormal basis Ψ. Provided that the sensing matrix is the dense Gaussian random matrix whose entries are i.i.d. Gaussian variables with mean 0 and variance $\frac{1}{M}$ , there exists a constant C_q such that the reconstruction SNR for an adaptive quantizer with Q = 32 obeys

{SNR}_{a} \geq 24.792 - 10 {log}_{10} C_{1}^{2} C_{q}

(59)

Assuming that $\frac{σ_{i}}{σ_{i + 1}} = 1.25$ , the reconstruction SNR for a fixed quantizer with Q = 32 obeys

{SNR}_{f} \geq 23.656 - 10 {log}_{10} C_{1}^{2} C_{q}

(60)

Assuming that $\frac{σ_{i}}{σ_{i + 1}} = 0.75$ , the reconstruction SNR for a fixed quantizer with Q = 32 obeys

{SNR}_{f} \geq 20.8067 - 10 {log}_{10} C_{1}^{2} C_{q}

(61)

Corollary 2: Suppose that x is a voiced speech signal vector and the sparsifying matrix is the DCT basis. Provided that the CS matrix is the TBD matrix and u_l and u_h are defined as in Eq. (41), there exists a constant C_p such that the reconstruction SNR for an adaptive quantizer with Q = 32 obeys

{SNR}_{a} \geq 10 {log}_{10} \frac{u_{l}}{C_{1}^{2} C_{p} (3.317 \times 10^{- 3} u_{l} + 2.738 \times 10^{- 3} u_{h})}

(62)

Assuming that $\frac{σ_{i}}{σ_{i + 1}} = 1.25$ , the reconstruction SNR for a fixed quantizer with Q = 32 obeys

{SNR}_{f} \geq 10 {log}_{10} \frac{u_{l}}{C_{1}^{2} C_{p} (4.309 \times 10^{- 3} u_{l} + 4.2775 \times 10^{- 3} u_{h})}

(63)

Assuming that $\frac{σ_{i}}{σ_{i + 1}} = 0.75$ , the reconstruction SNR for a fixed quantizer with Q = 32 obeys

{SNR}_{f} \geq 10 {log}_{10} \frac{u_{l}}{C_{1}^{2} C_{p} (8.305 \times 10^{- 3} u_{l} + 1.534 \times 10^{- 3} u_{h})}

(64)
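To see how the bounds of Corollary 2 behave as functions of u_l and u_h, the sketch below evaluates the right-hand sides of Eqs. (62) and (64). The value u_h = 0.05 is the one used in the simulation of Section 5.2, while u_l = 0.5 and the setting C₁²C_p = 1 are hypothetical placeholders chosen only for illustration. The case of Eq. (63) follows the same pattern with its own pair of coefficients.

```python
import math

def tbd_snr_bound_db(u_l, u_h, a, b, c1_sq_cp=1.0):
    """Right-hand side of Eqs. (62) and (64) in dB.

    a and b are the noise coefficients that multiply u_l and u_h.
    c1_sq_cp stands for the unknown product C1^2 * Cp (placeholder = 1).
    """
    return 10.0 * math.log10(u_l / (c1_sq_cp * (a * u_l + b * u_h)))

# Coefficient pairs taken from Eq. (62) and Eq. (64).
ADAPTIVE = (3.317e-3, 2.738e-3)
FIXED_075 = (8.305e-3, 1.534e-3)

u_l, u_h = 0.5, 0.05  # u_l is an assumed value; u_h as in Section 5.2
snr_adaptive = tbd_snr_bound_db(u_l, u_h, *ADAPTIVE)
snr_fixed_075 = tbd_snr_bound_db(u_l, u_h, *FIXED_075)
print(f"adaptive bound:         {snr_adaptive:.2f} dB")
print(f"fixed (ratio 0.75):     {snr_fixed_075:.2f} dB")

# With u_h = 0 the high-frequency term vanishes and the adaptive bound
# reduces to the constant 10*log10(1 / 3.317e-3).
snr_limit = tbd_snr_bound_db(u_l, 0.0, *ADAPTIVE)
print(f"adaptive bound, u_h = 0: {snr_limit:.2f} dB")
```

Under these assumed values, the adaptive bound stays above the fixed-quantizer bound, and a smaller u_h moves the adaptive bound toward the constant 24.792 dB that appears in Eq. (59).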

5.2 Simulation

The test speech signals used in this subsection are the same as those in Section 3. The sparsifying matrix is the DCT basis, and reconstruction is performed with the BPDN algorithm. Performance is again measured by the average SNR.
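As an illustration of the performance measure, the sketch below computes a per-frame reconstruction SNR and its average over frames. It assumes that the average SNR is the arithmetic mean of the per-frame SNR values in dB, which may differ from the exact definition given in Section 3. The frames used here are hypothetical toy data, not speech.

```python
import math

def snr_db(x, x_rec):
    """Reconstruction SNR of one frame: 10*log10(||x||^2 / ||x - x_rec||^2)."""
    signal = sum(v * v for v in x)
    noise = sum((a - b) ** 2 for a, b in zip(x, x_rec))
    if noise == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal / noise)

def average_snr_db(frames, rec_frames):
    """Mean of the per-frame SNR values in dB (assumed averaging rule)."""
    values = [snr_db(x, r) for x, r in zip(frames, rec_frames)]
    return sum(values) / len(values)

# Toy frames and their hypothetical reconstructions.
frames = [[1.0, 0.0, -1.0, 0.0], [2.0, 2.0]]
rec_frames = [[0.9, 0.0, -1.1, 0.0], [2.0, 1.8]]

per_frame = [snr_db(x, r) for x, r in zip(frames, rec_frames)]
avg = average_snr_db(frames, rec_frames)
print([round(v, 2) for v in per_frame], round(avg, 2))
```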

First, we compare adaptive quantization with nonadaptive quantization for both the TBD matrix and the dense Gaussian random matrix in the framework of CS, with Q fixed at 32. Figure 11a illustrates the quantization effect on the voiced segments of the test speech signals, and Figure 11b illustrates the effect on the entire test speech signals. It is evident that adaptive quantization greatly improves the reconstruction performance compared with nonadaptive quantization. Moreover, Figure 11 shows that the TBD matrix with u_h = 0.05 outperforms the dense Gaussian random matrix under both adaptive and nonadaptive quantization. The reason is that the TBD matrix is more robust to quantization noise, since it effectively restricts the impact of quantization on speech signals.
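The difference between adaptive and nonadaptive quantization can be illustrated with a small Monte Carlo experiment on Gaussian projections. The sketch below assumes a uniform mid-rise quantizer with Q = 32 levels on [−mσ, mσ] with m = 2.9 and clipping outside this range. For the adaptive quantizer, σ is the standard deviation of the current frame. For the fixed quantizer, σ is assumed to be that of the previous frame, expressed through the ratio σ_i/σ_{i+1}. The quantizer design, the sample size and the seed are assumptions of this sketch, so the estimates are only expected to show the same ranking and roughly the same magnitudes as the coefficients quoted in the corollaries, not to match them exactly.

```python
import math
import random

def quantize(y, half_range, levels):
    """Uniform mid-rise quantizer on [-half_range, half_range] with clipping."""
    step = 2.0 * half_range / levels
    idx = math.floor(y / step)                          # index of the cell
    idx = max(-levels // 2, min(levels // 2 - 1, idx))  # saturate at the edges
    return (idx + 0.5) * step                           # cell midpoint

def noise_power(ratio, m=2.9, levels=32, n=100_000, seed=0):
    """Estimate E[e^2] / sigma_cur^2 for Gaussian samples.

    ratio = sigma_design / sigma_cur: 1.0 models the adaptive quantizer,
    other values model a fixed quantizer tuned to the previous frame.
    """
    rng = random.Random(seed)
    sigma_cur = 1.0
    half_range = m * ratio * sigma_cur
    acc = 0.0
    for _ in range(n):
        y = rng.gauss(0.0, sigma_cur)
        acc += (quantize(y, half_range, levels) - y) ** 2
    return acc / n / sigma_cur ** 2

adaptive = noise_power(1.0)
fixed_125 = noise_power(1.25)
fixed_075 = noise_power(0.75)
print(f"adaptive:   {adaptive:.3e}")
print(f"ratio 1.25: {fixed_125:.3e}")
print(f"ratio 0.75: {fixed_075:.3e}")
```

In this sketch a ratio above one wastes quantization levels on a range that is too wide, while a ratio below one causes frequent saturation, which is the more harmful of the two cases.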

Next, we examine how adaptive quantization affects the reconstruction of speech signals at different quantization levels. Figures 12a, 12b, 13a and 13b show the average reconstruction SNR of voiced speech signals for Q = 8, 16, 32 and 64, respectively, when adaptive quantization is applied to the projections obtained with the TBD matrices and with the dense Gaussian matrix. On the one hand, the reconstruction performance under adaptive quantization improves as the quantization level increases. On the other hand, with suitable values of u_l and u_h, the TBD matrix performs much better than the dense Gaussian random matrix in the presence of quantization noise, regardless of the quantization level. In addition, Figures 14a, 14b, 15a and 15b show the average reconstruction SNR of the entire speech signals for Q = 8, 16, 32 and 64, respectively. The above findings also hold for the entire speech signals, which include both voiced and unvoiced segments. Thus, we can conclude that adaptive quantization and the TBD matrix effectively mitigate the impact of quantization noise on reconstruction in the framework of CS.

6 Conclusions

This paper demonstrates the potential of applying CS to speech signals, especially voiced speech signals. From the viewpoint of long-term prediction, we analyze the sparsity of voiced speech signals and construct an adaptive sparsifying matrix. Moreover, a CS matrix called the TBD matrix is constructed according to the spectral characteristics of voiced speech signals. Finally, the distribution of the projections is analyzed in order to carry out quantization, and the reconstruction performance of adaptive and nonadaptive quantization is studied. In addition, under adaptive quantization, the reconstruction quality of the TBD matrix and of the dense Gaussian matrix is compared empirically for different numbers of quantization bits. We find that the TBD matrix and adaptive quantization effectively mitigate the quantization effect on the reconstruction of speech signals in the framework of CS.

Appendix

Proof of Lemma 1 Let $θ = [\begin{array}{c} θ_{1} \\ θ_{2} \end{array}]$ , where θ₁ and θ₂ are also column vectors. Then, we have

Α θ = [\begin{array}{cc} Φ_{1} & 0 \\ 0 & Φ_{2} \end{array}] θ = [\begin{array}{cc} Φ_{1} & 0 \\ 0 & Φ_{2} \end{array}] [\begin{array}{c} θ_{1} \\ θ_{2} \end{array}] = [\begin{array}{c} Φ_{1} θ_{1} \\ Φ_{2} θ_{2} \end{array}]

(65)

and

{‖Α θ‖}_{l_{2}}^{2} = {‖Φ_{1} θ_{1}‖}_{l_{2}}^{2} + {‖Φ_{2} θ_{2}‖}_{l_{2}}^{2}

(66)

As Φ₁ is an M₁ × N₁ Gaussian matrix whose entries are i.i.d. random variables drawn according to normal distribution with mean zero and variance $\frac{1}{M_{1}}$ and Φ₂ is an M₂ × N₂ Gaussian matrix whose entries are i.i.d. random variables drawn according to normal distribution with mean zero and variance $\frac{1}{M_{2}}$ , we establish

E ({‖Φ_{1} θ_{1}‖}_{l_{2}}^{2}) = {‖θ_{1}‖}_{l_{2}}^{2}

(67)

and

E ({‖Φ_{2} θ_{2}‖}_{l_{2}}^{2}) = {‖θ_{2}‖}_{l_{2}}^{2}

(68)

Hence, we have

E ({‖Α θ‖}_{l_{2}}^{2}) = {‖θ_{1}‖}_{l_{2}}^{2} + {‖θ_{2}‖}_{l_{2}}^{2} = {‖θ‖}_{l_{2}}^{2}

(69)
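Eq. (69) can be checked numerically with a small Monte Carlo experiment. The sketch below draws a two-block diagonal matrix with Gaussian blocks of variance 1/M₁ and 1/M₂, applies it to a fixed vector θ, and averages the ratio ‖Αθ‖²/‖θ‖² over many draws. The block sizes, the number of trials and the test vector are hypothetical choices made only to keep the example fast.

```python
import math
import random

def gaussian_block(rows, cols, rng):
    """rows x cols matrix with i.i.d. N(0, 1/rows) entries."""
    std = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

def sq_norm(v):
    return sum(t * t for t in v)

def matvec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]

def mean_energy_ratio(m1, n1, m2, n2, trials=300, seed=1):
    """Average of ||A theta||^2 / ||theta||^2 over random block-diagonal A."""
    rng = random.Random(seed)
    theta1 = [rng.uniform(-1.0, 1.0) for _ in range(n1)]
    theta2 = [rng.uniform(-1.0, 1.0) for _ in range(n2)]
    total = sq_norm(theta1) + sq_norm(theta2)
    acc = 0.0
    for _ in range(trials):
        phi1 = gaussian_block(m1, n1, rng)
        phi2 = gaussian_block(m2, n2, rng)
        # Block-diagonal structure: the two blocks act on disjoint parts of theta.
        energy = sq_norm(matvec(phi1, theta1)) + sq_norm(matvec(phi2, theta2))
        acc += energy / total
    return acc / trials

ratio = mean_energy_ratio(m1=24, n1=40, m2=12, n2=40)
print(f"mean energy ratio: {ratio:.3f}")
```

The averaged ratio should be close to one, in line with Eq. (69). Individual draws fluctuate around this value, which is what the concentration inequalities that follow quantify.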

Moreover, it is proved in[31] and[32] that

P (|{‖Φ_{1} θ_{1}‖}_{l_{2}}^{2} - {‖θ_{1}‖}_{l_{2}}^{2}| \geq δ {‖θ_{1}‖}_{l_{2}}^{2}) \leq 2 e^{- \frac{M_{1} δ^{2}}{8}}

(70)

and

P (|{‖Φ_{2} θ_{2}‖}_{l_{2}}^{2} - {‖θ_{2}‖}_{l_{2}}^{2}| \geq δ {‖θ_{2}‖}_{l_{2}}^{2}) \leq 2 e^{- \frac{M_{2} δ^{2}}{8}}

(71)

Therefore, we have

P (- δ {‖θ_{1}‖}_{l_{2}}^{2} \leq {‖Φ_{1} θ_{1}‖}_{l_{2}}^{2} - {‖θ_{1}‖}_{l_{2}}^{2} \leq δ {‖θ_{1}‖}_{l_{2}}^{2}) \geq 1 - 2 e^{- \frac{M_{1} δ^{2}}{8}}

(72)

and

P (- δ {‖θ_{2}‖}_{l_{2}}^{2} \leq {‖Φ_{2} θ_{2}‖}_{l_{2}}^{2} - {‖θ_{2}‖}_{l_{2}}^{2} \leq δ {‖θ_{2}‖}_{l_{2}}^{2}) \geq 1 - 2 e^{- \frac{M_{2} δ^{2}}{8}}

(73)

Then, it suffices to show that

\begin{array}{l} P (\{|{‖Φ_{1} θ_{1}‖}_{l_{2}}^{2} - {‖θ_{1}‖}_{l_{2}}^{2}| \leq δ {‖θ_{1}‖}_{l_{2}}^{2}\} \\ \cap \{|{‖Φ_{2} θ_{2}‖}_{l_{2}}^{2} - {‖θ_{2}‖}_{l_{2}}^{2}| \leq δ {‖θ_{2}‖}_{l_{2}}^{2}\}) \geq 1 - 2 e^{- \frac{M_{1} δ^{2}}{8}} - 2 e^{- \frac{M_{2} δ^{2}}{8}} \end{array}

(74)

We can use the union bound to show that

\begin{array}{l} P (|{‖Α θ‖}_{l_{2}}^{2} - {‖θ‖}_{l_{2}}^{2}| \geq δ {‖θ‖}_{l_{2}}^{2}) \leq P (\{|{‖Φ_{1} θ_{1}‖}_{l_{2}}^{2} - {‖θ_{1}‖}_{l_{2}}^{2}| \\ \geq δ {‖θ_{1}‖}_{l_{2}}^{2} \}\cup \{|{‖Φ_{2} θ_{2}‖}_{l_{2}}^{2} - {‖θ_{2}‖}_{l_{2}}^{2}| \geq δ {‖θ_{2}‖}_{l_{2}}^{2}\}) \\ \leq P (|{‖Φ_{1} θ_{1}‖}_{l_{2}}^{2} - {‖θ_{1}‖}_{l_{2}}^{2}| \geq δ {‖θ_{1}‖}_{l_{2}}^{2}) \\ + P (|{‖Φ_{2} θ_{2}‖}_{l_{2}}^{2} - {‖θ_{2}‖}_{l_{2}}^{2}| \geq δ {‖θ_{2}‖}_{l_{2}}^{2}) \leq 2 e^{- \frac{M_{1} δ^{2}}{8}} + 2 e^{- \frac{M_{2} δ^{2}}{8}} \end{array}

(75)

There exists a constant C(δ) > 0 for δ ∈ (0, 1) such that

e^{- \frac{M_{1} δ^{2}}{8}} + e^{- \frac{M_{2} δ^{2}}{8}} = e^{- M C (δ)}

(76)

which yields that

C (δ) = - \frac{\log (e^{- \frac{M_{1} δ^{2}}{8}} + e^{- \frac{M_{2} δ^{2}}{8}})}{M}

(77)

Thus, we can conclude that

P (|{‖Α θ‖}_{l_{2}}^{2} - {‖θ‖}_{l_{2}}^{2}| \geq δ {‖θ‖}_{l_{2}}^{2}) \leq 2 e^{- M C (δ)}

(78)
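The constant C(δ) of Eq. (77) is easy to evaluate for given block sizes. The sketch below does so under the assumption that M = M₁ + M₂, and checks that 2e^{−MC(δ)} reproduces the union bound of Eq. (75). The block sizes and δ are hypothetical. Note that C(δ) is positive only when the two exponentials in Eq. (76) sum to less than one, which requires M₁δ² and M₂δ² to be large enough.

```python
import math

def concentration_constant(m1, m2, delta):
    """C(delta) from Eq. (77), with M taken as m1 + m2 (assumption)."""
    m = m1 + m2
    s = math.exp(-m1 * delta ** 2 / 8) + math.exp(-m2 * delta ** 2 / 8)
    return -math.log(s) / m

m1, m2, delta = 160, 80, 0.5  # hypothetical block sizes and tolerance
c = concentration_constant(m1, m2, delta)

# Eq. (78) written with C(delta) versus the explicit union bound of Eq. (75).
tail_bound = 2.0 * math.exp(-(m1 + m2) * c)
union_bound = (2.0 * math.exp(-m1 * delta ** 2 / 8)
               + 2.0 * math.exp(-m2 * delta ** 2 / 8))
print(f"C(delta) = {c:.5f}, tail bound = {tail_bound:.4f}")

# For very small blocks the exponentials sum to more than one and the
# formula returns a negative value, so the bound is then uninformative.
c_small = concentration_constant(4, 4, 0.1)
print(f"C(delta) for tiny blocks = {c_small:.5f}")
```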

Proof of Corollary 1 The class X of interest is the finite set of voiced speech frames x, that is,

X = \{x_{k} : x_{k} is the k^{th} frame of voiced speech signals\} .

(79)

When the sensing matrix is the dense Gaussian random matrix, the projection vector of the (i + 1)th voiced speech frame $x_{i + 1}$ is denoted by $y_{i + 1}$ , so that

y_{i + 1} = Φ x_{i + 1} .

(80)

In terms of Eq. (45) and Eq. (46), the entries of $y_{i + 1}$ are i.i.d. Gaussian random variables with mean 0 and variance $\frac{1}{M} {‖x_{i + 1}‖}_{l_{2}}^{2}$ . The quantized version of $y_{i + 1}$ is denoted by

{\hat{y}}_{i + 1} = y_{i + 1} + e_{i + 1} = Φ x_{i + 1} + e_{i + 1}

(81)

where $e_{i + 1} = {[\begin{array}{cccc} e_{i + 1} (1) & e_{i + 1} (2) & \dots & e_{i + 1} (M) \end{array}]}^{T}$ is the quantization error vector of the (i + 1)th frame. The quantization error vectors for all the voiced segments in X can be collected in a matrix $\bar{e} = [\begin{array}{cccc} e_{1} & e_{2} & \dots & e_{|X|} \end{array}]$ , where |X| denotes the cardinality of the set X. When Q = 32, according to the results in [30], m = 2.9. Then, for an adaptive quantizer, in light of Eq. (54), we have

E ({(e_{i + 1} (k))}^{2}) = 3.317 \times 10^{- 3} σ_{i + 1}^{2} \quad (k = 1, 2, \dots, M)

(82)

and

E ({‖e_{i + 1}‖}_{l_{2}}^{2}) = M E ({(e_{i + 1} (k))}^{2}) = 3.317 \times 10^{- 3} M σ_{i + 1}^{2} .

(83)

We can find a subset in X denoted by V that can be represented as

V = \{k : {‖x_{k}‖}_{l_{2}}^{2} = {‖x_{i + 1}‖}_{l_{2}}^{2}, x_{k} \in X\} .

(84)

Define ε > 0 by

ε^{2} = sup_{j \in V} {‖e_{j}‖}_{l_{2}}^{2} .

(85)

There exists a constant C_a such that

ε^{2} = C_{a} E ({‖e_{i + 1}‖}_{l_{2}}^{2})

(86)

As ${‖e_{i + 1}‖}_{l_{2}}^{2} \leq ε^{2}$ , we have

{‖e_{i + 1}‖}_{l_{2}}^{2} \leq C_{a} E ({‖e_{i + 1}‖}_{l_{2}}^{2})

(87)

In this paper, we are concerned only with the impact of quantization on reconstruction. Therefore, we assume that ${‖θ - θ_{K}‖}_{l_{1}}$ tends to zero. Since the voiced speech signal is compressible with respect to the orthonormal basis Ψ, we have

\begin{array}{l} {‖x_{i + 1} - x_{i + 1}^{*}‖}_{l_{2}}^{2} = {‖Ψ (θ_{i + 1} - θ_{i + 1}^{*})‖}_{l_{2}}^{2} \\ = {‖θ_{i + 1} - θ_{i + 1}^{*}‖}_{l_{2}}^{2} \leq 3.317 \times 10^{- 3} C_{1}^{2} C_{a} M σ_{i + 1}^{2} \end{array}

(88)

where $x_{i + 1}^{*} = Ψ θ_{i + 1}^{*}$ and $θ_{i + 1}^{*}$ is the solution to

\min {‖θ_{i + 1}‖}_{l_{1}} s.t. {‖{\hat{y}}_{i + 1} - Φ Ψ θ_{i + 1}‖}_{l_{2}} \leq ε .

(89)

Therefore,

\begin{array}{l} {SNR}_{a} \geq 10 {log}_{10} (\frac{M σ_{i + 1}^{2}}{3.317 \times 10^{- 3} C_{1}^{2} C_{a} M σ_{i + 1}^{2}}) \\ = 24.792 - 10 {log}_{10} (C_{1}^{2} C_{a}) . \end{array}

(90)

However, for a fixed quantizer, when $\frac{σ_{i}}{σ_{i + 1}} = 1.25$ , according to Eq. (56), we establish

E ({(e_{i + 1} (k))}^{2}) = 4.309 \times 10^{- 3} σ_{i + 1}^{2} .

(91)

Then, we have

E ({‖e_{i + 1}‖}_{l_{2}}^{2}) = M E ({(e_{i + 1} (k))}^{2}) = 4.309 \times 10^{- 3} M σ_{i + 1}^{2} .

(92)

Define ε₁ > 0 by

ε_{1}^{2} = sup_{j \in V} {‖e_{j}‖}_{l_{2}}^{2} .

(93)

There exists a constant $C_{f_{1}}$ such that

ε_{1}^{2} = C_{f_{1}} E ({‖e_{i + 1}‖}_{l_{2}}^{2}) = 4.309 \times 10^{- 3} C_{f_{1}} M σ_{i + 1}^{2}

(94)

Then we have

\begin{array}{l} {SNR}_{f} \geq 10 {log}_{10} (\frac{M σ_{i + 1}^{2}}{4.309 \times 10^{- 3} C_{1}^{2} C_{f_{1}} M σ_{i + 1}^{2}}) \\ = 23.656 - 10 {log}_{10} (C_{1}^{2} C_{f_{1}}) \end{array}

(95)

Similarly, when $\frac{σ_{i}}{σ_{i + 1}} = 0.75$ , we have

E ({(e_{i + 1} (k))}^{2}) = 8.305 \times 10^{- 3} σ_{i + 1}^{2}

(96)

Thus, we can establish that

\begin{array}{l} {SNR}_{f} \geq 10 {log}_{10} (\frac{M σ_{i + 1}^{2}}{8.305 \times 10^{- 3} C_{1}^{2} C_{f_{2}} M σ_{i + 1}^{2}}) \\ = 20.8067 - 10 {log}_{10} (C_{1}^{2} C_{f_{2}}) \end{array}

(97)

Let $C_{q} = \max (C_{a}, C_{f_{1}}, C_{f_{2}})$ . Then we obtain

{SNR}_{a} \geq 24.792 - 10 {log}_{10} (C_{1}^{2} C_{q}) .

When $\frac{σ_{i}}{σ_{i + 1}} = 1.25$ , we have

{SNR}_{f} \geq 23.656 - 10 {log}_{10} (C_{1}^{2} C_{q}) .

When $\frac{σ_{i}}{σ_{i + 1}} = 0.75$ , we have

{SNR}_{f} \geq 20.8067 - 10 {log}_{10} (C_{1}^{2} C_{q}) .

Proof of Corollary 2 When the CS matrix is the TBD matrix, we have

y_{i + 1} = Α θ_{i + 1} = [\begin{array}{cc} Φ_{1} & 0 \\ 0 & Φ_{2} \end{array}] θ_{i + 1} = [\begin{array}{cc} Φ_{1} & 0 \\ 0 & Φ_{2} \end{array}] [\begin{array}{c} θ_{i + 1, 1} \\ θ_{i + 1, 2} \end{array}]

where $θ_{i + 1}$ is the coefficient vector of $x_{i + 1}$ with respect to the DCT basis. In terms of Eqs. (50), (51) and (52), denote

σ_{i + 1, 1}^{2} = \frac{1}{M_{1}} {‖θ_{i + 1, 1}‖}_{l_{2}}^{2}

(98)

and

σ_{i + 1, 2}^{2} = \frac{1}{M_{2}} {‖θ_{i + 1, 2}‖}_{l_{2}}^{2} .

(99)

Moreover, according to the characteristics of the voiced segments, $σ_{i + 1, 1} ≫ σ_{i + 1, 2}$ . For an adaptive quantizer, $[- m σ_{i + 1, 1}, m σ_{i + 1, 1}]$ is used as the quantization range of the (i + 1)th projection vector $y_{i + 1}$ , and $Δ = \frac{2 m σ_{i + 1, 1}}{Q}$ . In light of Eq. (54), for an adaptive quantizer, we have

E ({(e_{i + 1} (k))}^{2}) = 3.317 \times 10^{- 3} σ_{i + 1, 1}^{2} \quad (k = 1, 2, \dots, M_{1})

(100)

And in terms of Eq. (56), we have

E ({(e_{i + 1} (k))}^{2}) \approx \frac{Δ^{2}}{12} = 2.738 \times 10^{- 3} σ_{i + 1, 1}^{2} \quad (k = M_{1} + 1, M_{1} + 2, \dots, M_{1} + M_{2})

(101)

and

\begin{array}{l} E ({‖e_{i + 1}‖}_{l_{2}}^{2}) = M_{1} E ({(e_{i + 1} (M_{1}))}^{2}) \\ + M_{2} E ({(e_{i + 1} (M_{1} + M_{2}))}^{2}) \\ = 3.317 \times 10^{- 3} M_{1} σ_{i + 1, 1}^{2} \\ + 2.738 \times 10^{- 3} M_{2} σ_{i + 1, 1}^{2} \end{array}

(102)
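The granular-noise coefficient in Eq. (101) follows directly from Δ = 2mσ_{i+1,1}/Q with m = 2.9 and Q = 32, and Eq. (102) combines it with the coefficient 3.317 × 10⁻³ of Eq. (100). The sketch below reproduces this arithmetic. The block sizes M₁ = 64 and M₂ = 16 are hypothetical, and the `ratio` argument is an assumption of this sketch that scales the quantization range for the fixed-quantizer cases.

```python
def granular_coeff(m=2.9, levels=32, ratio=1.0):
    """Delta^2 / 12 in units of sigma_{i+1,1}^2.

    Delta = 2 * m * ratio * sigma / levels, where ratio = 1 corresponds to
    the adaptive quantizer and other values to a range set by another frame.
    """
    step = 2.0 * m * ratio / levels
    return step ** 2 / 12.0

def expected_noise_energy(m1, m2, sigma1_sq, low_coeff=3.317e-3):
    """Eq. (102): low-frequency block uses low_coeff, the rest Delta^2 / 12."""
    return (low_coeff * m1 + granular_coeff() * m2) * sigma1_sq

coeff_adaptive = granular_coeff()
coeff_125 = granular_coeff(ratio=1.25)
coeff_075 = granular_coeff(ratio=0.75)
print(f"adaptive:   {coeff_adaptive:.4e}")
print(f"ratio 1.25: {coeff_125:.4e}")
print(f"ratio 0.75: {coeff_075:.4e}")

# Hypothetical block sizes with unit variance in the low-frequency block.
energy = expected_noise_energy(m1=64, m2=16, sigma1_sq=1.0)
print(f"expected noise energy: {energy:.4f}")
```

The adaptive value rounds to 2.738 × 10⁻³ and the value for a ratio of 1.25 to 4.2775 × 10⁻³, matching the coefficients of u_h used in this proof. For a ratio of 0.75 this simple formula gives about 1.540 × 10⁻³, close to the 1.534 × 10⁻³ quoted in the corollary.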

We can find a subset in X denoted by V that can be represented as

\begin{array}{l} V = \{k : {‖θ_{k, 1}‖}_{l_{2}}^{2} = {‖θ_{i + 1, 1}‖}_{l_{2}}^{2}, {‖θ_{k, 2}‖}_{l_{2}}^{2} = {‖θ_{i + 1, 2}‖}_{l_{2}}^{2}, \\ θ_{k} \text{ is the DCT coefficient vector of } x_{k}, x_{k} \in X\} \end{array}

(103)

We define $ε^{2} = sup_{j \in V} {‖e_{j}‖}_{l_{2}}^{2}$ . There exists a constant C_b such that

ε^{2} = C_{b} E ({‖e_{i + 1}‖}_{l_{2}}^{2})

(104)

Therefore, we establish

{‖e_{i + 1}‖}_{l_{2}}^{2} \leq C_{b} E ({‖e_{i + 1}‖}_{l_{2}}^{2})

(105)

As in Corollary 1, we let ${‖θ - θ_{K}‖}_{l_{1}}$ tend to zero. Thus, we establish ${‖x_{i + 1} - x_{i + 1}^{*}‖}_{l_{2}}^{2} = {‖Ψ (θ_{i + 1} - θ_{i + 1}^{*})‖}_{l_{2}}^{2} = {‖θ_{i + 1} - θ_{i + 1}^{*}‖}_{l_{2}}^{2} \leq C_{1}^{2} C_{b} (3.317 \times 10^{- 3} M_{1} σ_{i + 1, 1}^{2} + 2.738 \times 10^{- 3} M_{2} σ_{i + 1, 1}^{2})$ , where $θ_{i + 1}^{*}$ is the solution to

\min {‖θ_{i + 1}‖}_{l_{1}} s.t . {‖{\hat{y}}_{i + 1} - Α θ_{i + 1}‖}_{l_{2}} \leq ε

(106)

and then we have

x_{i + 1}^{*} = Ψ θ_{i + 1}^{*} .

(107)

Then, we have

\begin{array}{l} {SNR}_{a} \geq 10 {log}_{10} \\ \frac{M_{1} σ_{i + 1, 1}^{2} + M_{2} σ_{i + 1, 2}^{2}}{C_{1}^{2} C_{b} (3.317 \times 10^{- 3} M_{1} σ_{i + 1, 1}^{2} + 2.738 \times 10^{- 3} M_{2} σ_{i + 1, 1}^{2})} \\ \geq 10 {log}_{10} \\ \frac{M_{1} σ_{i + 1, 1}^{2}}{C_{1}^{2} C_{b} (3.317 \times 10^{- 3} M_{1} σ_{i + 1, 1}^{2} + 2.738 \times 10^{- 3} M_{2} σ_{i + 1, 1}^{2})} \\ = 10 {log}_{10} \frac{u_{l}}{C_{1}^{2} C_{b} (3.317 \times 10^{- 3} u_{l} + 2.738 \times 10^{- 3} u_{h})} \end{array}

(108)

Moreover, for a fixed quantizer, when $\frac{σ_{i, 1}}{σ_{i + 1, 1}} = 0.75$ , we can prove in the same way that there exists a constant $C_{f_{3}}$ such that

{SNR}_{f} \geq 10 {log}_{10} \frac{u_{l}}{C_{1}^{2} C_{f_{3}} (8.305 \times 10^{- 3} u_{l} + 1.534 \times 10^{- 3} u_{h})} .

(109)

When $\frac{σ_{i, 1}}{σ_{i + 1, 1}} = 1.25$ , we can prove in the same way that there exists a constant $C_{f_{4}}$ such that

{SNR}_{f} \geq 10 {log}_{10} \frac{u_{l}}{C_{1}^{2} C_{f_{4}} (4.309 \times 10^{- 3} u_{l} + 4.2775 \times 10^{- 3} u_{h})}

(110)

Let $C_{p} = \max (C_{b}, C_{f_{3}}, C_{f_{4}})$ , and then we can conclude that

{SNR}_{a} \geq 10 {log}_{10} \frac{u_{l}}{C_{1}^{2} C_{p} (3.317 \times 10^{- 3} u_{l} + 2.738 \times 10^{- 3} u_{h})} .

When $\frac{σ_{i, 1}}{σ_{i + 1, 1}} = 0.75$ , we have

{SNR}_{f} \geq 10 {log}_{10} \frac{u_{l}}{C_{1}^{2} C_{p} (8.305 \times 10^{- 3} u_{l} + 1.534 \times 10^{- 3} u_{h})} .

When $\frac{σ_{i, 1}}{σ_{i + 1, 1}} = 1.25$ , we have

{SNR}_{f} \geq 10 {log}_{10} \frac{u_{l}}{C_{1}^{2} C_{p} (4.309 \times 10^{- 3} u_{l} + 4.2775 \times 10^{- 3} u_{h})} .

References

[1] Donoho D: Compressed sensing. IEEE Trans Inf Theory 2006, 52(4):1289-1306.
[2] Candès EJ: Compressive sampling. Proceedings of the International Congress of Mathematicians, Madrid, Spain; 2006:1433-1452.
[3] Baraniuk RG: Compressive sensing. IEEE Signal Process Mag 2007, 24(4):118-121.
[4] Reeves G, Gastpar M: “Compressed” compressed sensing. IEEE International Symposium on Information Theory, Austin; 2010:1548-1552.
[5] Sreenivas TV, Kleijn WB: Compressive sensing for sparsely excited speech signals. IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei; 2009:4125-4128.
[6] Giacobello D, Christensen MG, Murthi MN, Jensen SH, Moonen M: Retrieving sparse patterns using a compressed sensing framework: applications to speech coding based on sparse linear prediction. IEEE Signal Process Lett 2010, 17(1):103-106.
[7] Candès EJ, Tao T: Decoding by linear programming. IEEE Trans Inf Theory 2005, 51(12):4203-4215. 10.1109/TIT.2005.858979
[8] Baraniuk RG, Davenport MA, DeVore R, Wakin MB: A simple proof of the restricted isometry property for random matrices. Constr Approx 2008, 28(3):253-263. 10.1007/s00365-007-9003-x
[9] Goyal VK, Fletcher AK, Rangan S: Compressive sampling and lossy compression. IEEE Signal Process Mag 2008, 25(2):48-56.
[10] Laska JN, Boufounos PT, Davenport MA, Baraniuk RG: Democracy in action: quantization, saturation, and compressive sensing. Appl Comput Harmon Anal 2011, 31(3):429-443. 10.1016/j.acha.2011.02.002
[11] Dai W, Pham HV, Milenkovic O: Quantized compressive sensing. 2009. arXiv preprint: http://arxiv.org./abs/0901.0749
[12] Laska JN, Boufounos P, Baraniuk RG: Finite range scalar quantization for compressive sensing. Proc. International Conference on Sampling Theory and Applications, Marseille; 2009:1433-1452.
[13] Boufounos PT, Baraniuk RG: 1-Bit compressive sensing. Proc. 42nd Annual Conference on Information Science and Systems, Princeton, NJ; 2008:16-21.
[14] Baig Y, Lai EM-K, Lewis JP: Quantization effects on compressed sensing video. 17th International Conference on Telecommunications, Doha; 2010:935-940.
[15] Boufounos P, Baraniuk R: Quantization of sparse representation. Data Compression Conference, Snowbird, UT; 2007:378.
[16] Candès EJ: The restricted isometry property and its implications for compressed sensing. Comptes Rendus de l’Académie des Sciences, Paris; 2008:589-592.
[17] Chen S, Donoho DL, Saunders MA: Atomic decomposition by basis pursuit. SIAM Rev 2001, 43(1):33-61.
[18] Tropp J, Gilbert A: Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans Inf Theory 2007, 53(12):4655-4666.
[19] Dai W, Milenkovic O: Subspace pursuit for compressive sensing signal reconstruction. IEEE Trans Inf Theory 2009, 55(5):2230-2249.
[20] Donoho DL, Tsaig Y, Starck JL: Sparse solution of underdetermined linear equation by stagewise orthogonal matching pursuit. IEEE Trans Inf Theory 2012, 58(2):1094-1121.
[21] Needell D, Vershynin R: Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit. IEEE J Selected Topics in Signal Processing 2010, 4(2):310-316.
[22] Do TT, Gan L, Nguyen N, Tran TD: Sparsity adaptive matching pursuit algorithm for practical compressed sensing. 42nd Asilomar Conference on Signals, Systems and Computers, Pacific Grove, USA; 2008:581-587.
[23] Atal BS, Schroeder MR: Adaptive predictive coding of speech signals. Bell Syst Techn J 1970, 49:1973-1986.
[24] Dong JX, Zhou JJ, Chao YZ: The structure of symmetric r-cyclic matrices and their eigenvalues. J Centl Chin Normal Univ 1997, 31(2):129-132.
[25] Li XB, Zhao RZ, Hu SH: Blocked polynomial deterministic matrix for compressed sensing. 6th International Conference on Wireless Communication, Networking and Mobiles, Chengdu; 2010:1-4.
[26] Gan L, Do T, Tran T: Fast compressive imaging using scrambled block Hadamard ensemble. Proc. European Signal Processing Conference, Switzerland; 2008:1281-1284.
[27] Chang HS, Weiss Y, Freeman WT: Informative sensing of natural images. IEEE International Conference on Image Processing, Cairo, Egypt; 2009:3025-3028.
[28] Yap HL, Eftekhari A, Wakin MB, Rozell CJ: The restricted isometry property for block diagonal matrices. 45th Annual Conference on Information Science and Systems, Baltimore; 2011:1-6.
[29] Park JY, Yap HL, Rozell CJ, Wakin MB: Concentration of measure for block diagonal matrices with application to compressive sensing. IEEE Trans Signal Process 2011, 59(12):5859-5875.
[30] Gray GA, Zeoli GW: Quantization and saturation noise due to analog-to-digital conversion. IEEE Trans Aerosp Elect Syst 1971, 7(1):222-223.
[31] Dasgupta S, Gupta A: An elementary proof of the Johnson-Lindenstrauss lemma. Random Struc Algorithms 2003, 22(1):60-65. 10.1002/rsa.10073
[32] Laska JN, Davenport MA, Baraniuk RG: Exact signal recovery from sparsely corrupted measurements through the pursuit of justice. 43rd Asilomar Conference on Signals, Systems and Computers, Pacific Grove; 2009:1556-1560.

Acknowledgement

This work is supported by the National Science Foundation of China (Grant No. 60971129 & 61201326 & 61271335), the National Research Program of China (973 Program) (Grant No. 2011CB302303), the Natural Science Fund for Higher Education of Jiangsu Province (Grant No. 12KJB510021) and the Scientific Innovation Research Program of College Graduate in Jiangsu Province (Grant No. CXLX11_0408).

Author information

Authors and Affiliations

College of Communication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, 210003, China
Ji Yunyun
Key Lab of Broadband Wireless Communication and Sensor Network Technology, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, 210003, China
Yang Zhen

Corresponding author

Correspondence to Ji Yunyun.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Yunyun, J., Zhen, Y. Study on adaptive compressed sensing & reconstruction of quantized speech signals. EURASIP J. Adv. Signal Process. 2012, 232 (2012). https://doi.org/10.1186/1687-6180-2012-232

Received: 02 December 2011
Accepted: 10 October 2012
Published: 31 October 2012
DOI: https://doi.org/10.1186/1687-6180-2012-232
