Scrambling-based speech encryption via compressed sensing

Zeng, Li; Zhang, Xiongwei; Chen, Liang; Fan, Zhangjun; Wang, Yonggang

doi:10.1186/1687-6180-2012-257

Research
Open access
Published: 28 December 2012

EURASIP Journal on Advances in Signal Processing volume 2012, Article number: 257 (2012)

Abstract

Conventional speech scramblers suffer from three disadvantages: heavy communication overhead, underexploitation of signal features, and low attack resistance. In this study, we propose a scrambling-based speech encryption scheme via compressed sensing (CS). Unlike conventional scramblers, the proposed scheme addresses these problems in a unified framework by exploiting the advantages of CS. The presented encryption idea is general and applies readily to speech communication systems. Compared with state-of-the-art methods, the proposed scheme provides lower residual intelligibility and demands greater cryptanalytic effort. Meanwhile, it ensures desirable channel usage and notable resistance to hostile attacks. Extensive experimental results confirm the effectiveness of the proposed scheme.

1. Introduction

Encryption, which dates back to antiquity, is essential for information security in modern society [1]. Information espionage, including illegal surveillance and wiretapping, has emerged with the wide application of speech communication in national defense, economy and trade, scientific research, and social affairs. With security an ever more vital requirement of communication systems, speech encryption has gained wide acceptance as an effective means of enhancing protection in both military and civilian applications.

Two main categories of technologies have been developed for this purpose. The first one is content protection through encryption, e.g., speech scrambler [2–9]. Proper decryption of the data requires a key or the so-called scrambling matrix. The second one is digital watermarking, which aims at embedding messages into the multimedia data [10]. Intuitively, the time domain sample scrambling method is thus far the most attractive and desired, because it simply takes a segment of time domain sample values and directly scrambles them into a different segment of samples. In this article, we focus on the scrambling-based encryption.

Earlier speech scramblers disorder the original signal using a specific sequence or matrix, such as a pseudorandom sequence, the Fibonacci transform [2], or the Hadamard matrix [3, 4]. The main disadvantage shared by these methods is that the decryption key is invariable. Given the dramatic improvement in computer hardware performance, these methods can easily be deciphered. To alleviate this problem, researchers proposed new key schedules, such as the stochastic matrix [5] and the Latin square [6], to improve the strength of security [7, 8]. However, the improved algorithms also result in a heavy transmission load because they cannot compress the original signal. Consequently, speech compression methods have been integrated into the encryption process, e.g., G.729, mixed excitation linear prediction (MELP), and AMR [9]. Indeed, combining compression and scrambling yields less costly encrypted data, but parametric coding algorithms have low robustness in the presence of noise or other hostile attacks. Besides, the performance of such algorithms depends heavily on the encryption operator, and the characteristics of speech itself are not well utilized. More recently, research on nonlinear processes has shown that chaotic signals are well suited to secure communications. However, a chaotic system is sensitive to disturbance and requires strict self-synchronization, which limits its practical applications [11, 12].

According to Del Re et al. [13], the degree of security (deciphering difficulty) provided by a speech encryption system is related to (1) residual intelligibility (the amount of intelligibility left over in the encrypted signal) and (2) keyspace (the number of keys available for encryption). In general, the lower a scrambling system's residual intelligibility and the larger its keyspace, the higher its degree of security. Since the stochastic approach was proposed [14], the keyspace of a scrambler has commonly been measured by the number of encryption operators, namely scrambling matrices [3, 4, 6, 13].

To summarize, designing a speech scrambler that saves channel capacity and resists attacks is a major open issue. At the same time, such a scrambler should attain residual intelligibility as low as possible and provide a keyspace as large as possible to increase the cryptogram's immunity to cryptanalysis. Despite the improvements achieved by the aforementioned works, few investigations address these problems simultaneously. In light of this, we apply compressed sensing (CS) [15–17] to speech encryption, owing to its promising capability in signal compression and its notable robustness to hostile attacks.

In this article, we tackle scrambling-based speech encryption via CS by exploiting the sparsity of speech over the Karhunen–Loève (K–L) incoherent dictionary [18]. Unlike existing schemes, we scramble the dimension-reduced measurements instead of the original speech. The proposed algorithm is motivated by the following idea: if two independent signals x₁ and x₂ are aliased and scrambled in the same space by a stochastic matrix, the intelligibility of the original signal can be eliminated, and it is hard for eavesdroppers to extract any information from the mixture alone [19]. Observations show that the measurement vector of speech is noise-like in nature. However, the envelope of the compressed data still retains considerable information about the original speech [20, 21]. We therefore alias and scramble the two measurement vectors of the separated speech instead of using the compressed data directly as the ciphered data.

To be specific, the presented scheme contains two stages: encryption and decryption. At the encryption stage, we compress and encrypt the speech. The original signal is separated into two independent parts, each sparsely represented over its corresponding K–L incoherent dictionary. Next, the sparse vectors of the two parts are measured by stochastic matrices. Afterwards, these low-dimensional measurements are mixed and scrambled by the scrambling matrix, which is constructed from the null space basis (NSB) of the aligned sensing matrix using singular value decomposition (SVD). At the decryption stage, the inverse operator is constructed to eliminate each aliased measurement part individually. Finally, the separated speech parts are reconstructed using orthogonal matching pursuit (OMP) [22] and assembled to recover the speech. Experimental results demonstrate that the encrypted signal of the proposed scheme has low transmission cost and low residual intelligibility. The scheme provides immense cryptogram immunity and exhibits notable attack resistance.

The rest of this article is outlined as follows. In Section 2, we introduce the K–L incoherent dictionary to sparsely represent speech signals. Section 3 explains the encryption idea that we seek to address, and expatiates on the detailed procedure of encryption and decryption. The residual intelligibility, encryption strength, and robustness performance of the proposed scheme, together with experimental results, are presented and discussed in Section 4. Conclusions are drawn in Section 5.

2. The sparse representation of speech

Sparse representation is a critical step of CS [16], since one can obtain the dimension-reduced signal on the basis of sparse vectors. In this section, we employ a practical sparsifying dictionary to sparsely represent the speech signal.

The K–L expansion [18] describes a stochastic process in terms of incoherent random principal components and the corresponding deterministic orthogonal basis. Thus, the main structure of the process can be captured by a few expansion terms. For a real stochastic process {x(t), t ∈ [0, 1]} with finite second-order moments, the K–L expansion is

x (t) = \sum_{k = 1}^{\infty} a_{k} ϕ_{k} (t)

(1)

where a_k=∫₀¹x(t)ϕ_k(t)dt. The orthogonal bases {ϕ_k(t)} are eigenfunctions of the signal autocorrelation R_x(t, u) and can be used to design the atoms of the signal dictionary. {ϕ_k(t)} and the corresponding eigenvalues y_k satisfy the Fredholm integral equation [18, 23]

y_{k} ϕ_{k} (t) = \int_{0}^{1} R_{x} (t, u) ϕ_{k} (u) du

(2)

However, for a practical signal, Equation (2) is hard to solve due to the complexity of the signal autocorrelation. Since the autocorrelation of speech signal decreases rapidly within a low delay, we approximate it by the exponential function: R_x(t, u)=R_x(0)e^−μ|t−u|, where μ is the attenuation coefficient. Substituting R_x(t,u) into (2) yields

\begin{array}{l} y_{k} ϕ (t) = \int_{0}^{1} R_{x} (t, u) ϕ (u) du \\ = R_{x} (0) (e^{- μt} \int_{0}^{t} e^{μu} ϕ (u) du + e^{μt} \int_{t}^{1} e^{- μu} ϕ (u) du) \end{array}

(3)

By solving (3) with the boundary conditions and eliminating the zero particular solution ϕ₀(t)=0 (it cannot be used as an eigenfunction), we obtain the orthogonal basis

ϕ_{k} (t) = (\frac{kπ}{μ}) cos (kπt) + sin (kπt), k \in Z ∖ {0}

(4)

After adding ϕ₀(t)=1, the complete orthogonal K–L dictionary E is represented by

E = \{ϕ_{0} (t)\} ⋃ \{ϕ_{k} (t), k \in Z ∖ \{0\}\}

(5)

where k stands for the number of atoms in the dictionary. For digital signal processing, the bases of E are sampled uniformly over the range 0≤t≤1. Let μ* denote the optimal value of the parameter μ; it is estimated by solving the following optimization problem:

μ * = \underset{μ > 0}{arg min} {‖R_{x} (τ) - {\hat{R}}_{x} (μ)‖}_{2}^{2}

(6)

where $R_{x} (τ) = \frac{1}{(n - τ)} \sum_{i = 1}^{n - τ} x (i) x (i + τ)$ is the unbiased estimation of the autocorrelation of speech frame x∈Rⁿ with delay τ and ${\hat{R}}_{x} (τ) = R_{x} (0) e^{- μ |τ|}, τ = 0, 1, \dots, n - 1$ .
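The fit in (6) can be illustrated numerically. The following is a minimal sketch, assuming a simple grid search over μ; the function names `unbiased_autocorr`, `fit_mu`, and `estimate_mu`, as well as the search range, are hypothetical and not taken from the article.

```python
import numpy as np

def unbiased_autocorr(x):
    """R_x(tau) = 1/(n - tau) * sum_i x(i) x(i + tau) for tau = 0..n-1."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[: n - tau], x[tau:]) / (n - tau) for tau in range(n)])

def fit_mu(r, mu_grid=None):
    """Grid-search mu > 0 minimizing ||r(tau) - r(0) exp(-mu |tau|)||_2^2."""
    if mu_grid is None:
        mu_grid = np.linspace(0.01, 2.0, 200)  # assumed search range, not from the article
    taus = np.arange(len(r))
    costs = [np.sum((r - r[0] * np.exp(-mu * taus)) ** 2) for mu in mu_grid]
    return float(mu_grid[int(np.argmin(costs))])

def estimate_mu(frame):
    """Estimate mu* for one speech frame via its unbiased autocorrelation."""
    return fit_mu(unbiased_autocorr(frame))
```

A grid search is used here only for transparency; any one-dimensional minimizer could replace it.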

Thus, the optimal discrete atoms are e_k=[e_k(1), …, e_k(i), …, e_k(n)]^T, where

\begin{matrix} e_{k} (i) = \frac{kπ}{μ^{*}} cos (\frac{kπ (i - 1)}{n - 1}) \\ + sin (\frac{kπ (i - 1)}{n - 1}), \quad i = 1, \dots, n \end{matrix}

(7)

Then, by adding e₀=[1, …, 1]^T, we construct the complete discrete speech dictionary as

D = \{e_{0}\} ⋃ \{e_{k}, k \in Z ∖ \{0\}\}

(8)

In this case, discrete uniform sampling changes the orthogonality of bases {ϕ_k(t)}. Though they are not mathematically orthogonal, atoms in D are of low coherence and subsequently, D turns out to be an overcomplete incoherent dictionary.
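The discrete atoms in (7) and the dictionary in (8) can be sketched as follows. This is an illustrative construction only: the function name `kl_dictionary` and the unit-norm column scaling are assumptions. Only positive k is used, because the atom for −k is the negative of the atom for k and therefore adds nothing new.

```python
import numpy as np

def kl_dictionary(n, mu, num_atoms):
    """Build an n x num_atoms dictionary: e_0 = ones, then the atoms of Eq. (7) for k >= 1."""
    t = np.arange(n) / (n - 1)                 # (i - 1)/(n - 1) for i = 1..n
    atoms = [np.ones(n)]                       # e_0 = [1, ..., 1]^T
    for k in range(1, num_atoms):
        atoms.append((k * np.pi / mu) * np.cos(k * np.pi * t) + np.sin(k * np.pi * t))
    D = np.stack(atoms, axis=1)
    return D / np.linalg.norm(D, axis=0)       # unit-norm columns (normalization assumed here)
```

For a 200-sample frame, `kl_dictionary(200, mu_star, 200)` would give a square dictionary with as many atoms as samples.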

In short, the presented dictionary is jointly determined by the sinusoidal atoms (given in (7)) and the parameter μ*. The structure of the sinusoidal atoms is based on the K–L expansion. It is the general paradigm of the dictionary and can thus be shared by both the compression and the recovery sides. On the other hand, μ* depends on the characteristics of each frame and determines the detailed structure of the current dictionary. Hence, given μ*, the corresponding dictionary can be rebuilt at the recovery side to recover the original speech.

Three types of speech frame (unvoiced, voiced, and transition sound) and their corresponding sparse vectors over the K–L incoherent dictionaries are shown in Figure 1. Here, k is set equal to the length of a speech frame.

3. Proposed encryption scheme

This section specifies the proposed scheme. Section 3.1 derives the proposed encryption scheme. The design of the scrambling matrix is addressed in Section 3.2. Section 3.3 describes the decryption and recovery process.

3.1. CS-based scrambling

According to Candès and Wakin [17], the implication of sparsity is clear: when a signal has a sparse expansion, one can discard the small coefficients without much perceptual loss. Hence, some minor but non-zero entries of the sparse vectors can be discarded before measuring to further reduce the compression rate. In addition, they show that, for a K-sparse signal s∈Rⁿ and a fixed matrix Φ∈R^{m×n} with atoms selected uniformly at random, exact reconstruction of s from the measurements y=Φs∈R^m (m≪n) succeeds with overwhelming probability, as long as the number of observations obeys

m \geq C \cdot K \cdot log n

(9)

for some real positive constant C. In this case, the original speech is compressed and the compression ratio is m:n. Here, we concentrate on the issue of speech encryption, and the quality of reconstructed speech will be given in Section 4.
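The two operations described here, discarding small coefficients and taking m random measurements, can be sketched as below. This is a hedged illustration: the helper names `keep_largest` and `measure` are hypothetical, and a Gaussian sensing matrix scaled by 1/sqrt(m) is assumed as one common choice of stochastic matrix.

```python
import numpy as np

def keep_largest(s, K):
    """Zero all but the K largest-magnitude entries of s (K >= 1)."""
    s = np.asarray(s, dtype=float)
    out = np.zeros_like(s)
    idx = np.argsort(np.abs(s))[-K:]
    out[idx] = s[idx]
    return out

def measure(s, m, rng):
    """Draw a random m x n sensing matrix and return (Phi, y) with y = Phi @ s."""
    Phi = rng.standard_normal((m, len(s))) / np.sqrt(m)   # assumed Gaussian ensemble
    return Phi, Phi @ s
```

With n = 200 and m = 10, the measurement vector has one twentieth of the original length, matching a compression ratio of 1:20.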

Speech has proved to be a robust signal that can be perturbed in many different ways while remaining intelligible [24]. As depicted in Figure 2 (with the compression ratio set to 1:20), although the measurement vector is noise-like in nature (Figure 2b), the envelopes (red, dashed line) of the original speech and the compressed signal are highly similar. This means that CS retains considerable information within the low-dimensional measurements.

In fact, neurons in the auditory brainstem that are sensitive to the speech envelope have been found in mammalian physiological studies [25]. Studies of the envelope extracted using the Hilbert transform reveal that the envelope is most important for speech reception; that is, words are identified according to the envelope [20]. Research on the relationship between the speech envelope and auditory perception is still ongoing [26]. More recently, Mehmet Cenk et al. [21] investigated perceptual features for automatic emotion recognition with temporal envelopes.

Given the strong connection between residual intelligibility and the speech envelope, we aim to develop a new algorithm that utilizes the sparsity of the speech signal and also decreases the residual intelligibility of the compressed data. To this end, the CS-based scrambling approach is employed for its straightforwardness.

Previous studies [5, 7, 8] have demonstrated the security of using a stochastic matrix as the key. Conveniently, the sensing matrix in CS is also a stochastic matrix. It can therefore be used as the scrambling matrix to decrease the residual intelligibility of the dimension-reduced signal. Nevertheless, the dimension of the compressed signal does not match that of the sensing matrix. In other words, y∈R^m cannot be scrambled directly by Φ∈R^{m×n} without changing its dimension. One feasible solution is to select a group of random atoms from Φ to form an m×m scrambling matrix, but the selection schedule brings additional communication load. Consequently, based on compressed speech sensing, we design a new paradigm of scrambling matrix that aliases and scrambles two volumes of compressed data $[\begin{array}{l} y_{1} \\ y_{2} \end{array}]$ together.

3.2. Design of the scrambler

As mentioned above, one can hardly extract any information from the result alone if two independent signals x₁ and x₂ are aliased and scrambled in the same space. As shown in Figure 3, the original speech is therefore separated into two parts. To keep the new speech segments independent, we take every four consecutive frames (800 samples at a sampling rate of 8 kHz) as a segmentation piece, considering two facts: (1) the quasi-periodic property of speech lasts about 50 ms; (2) the auditory tolerance to delay is about 200 ms. Each cube in Figure 3 represents such a piece.

Next, the new signals are sparsely represented over the corresponding K–L incoherent dictionaries, and then measured by different stochastic matrices to get the compressed signals individually. Since using only one fixed matrix does not always hold the restricted isometry property [17] and will result in undesirable reconstructions, we randomly choose the matrices from the stochastic matrix dictionary B={Φ_j: j=1, 2, …, L}.

Such a matrix dictionary is constructed in advance. During construction, each randomly generated stochastic matrix is tested with a group of different speech frames. If the correct reconstruction rate of the matrix is acceptable, it is chosen as a dictionary atom. In this study, we set the acceptance threshold for the correct reconstruction rate to 80%. As a matter of fact, almost all random matrices are CS matrices [15]; thus, the number of atoms L in the dictionary can be set according to practical requirements.

The compressed data y_i are pre-reconstructed until the final selected sensing matrices Φ_i, i=1, 2 ensure precise reconstruction. Similarly, let α_i, i=1,2 denote the normalized indices of these matrices in the dictionary and let D_i, i=1,2 stand for the sparsifying matrices of the two speech parts. Then, the encrypted signal y^D is obtained by aliasing and scrambling the two low-dimensional measurement vectors with the selected matrices Φ_i.

y^{D} = f (y_{1}, y_{2}; Φ_{1}, Φ_{2})

(10)

Subsequently, one heuristic approach is to design a scrambling matrix schedule of high security together with its inverse operator for decryption. Owing to the independence of the two speech parts, their corresponding measurements y_i=Φ_iD_i^Tx_i, i=1, 2 are also incoherent. We can remove either of them by its orthocomplement without damaging the other. Without loss of generality, we take y₁ for the following illustrations.

Assume there exists a vector z from (Φ₁^T)^⊥ that is orthogonal to Φ₁^T, i.e., z^TΦ₁=0. Multiplying z with the encrypted data y^D yields

z^{T} y_{1} = (z^{T} Φ_{1}) D_{1}^{T} x_{1} = 0

(11)

thus the y₁ part comprised in y^D is eliminated. Then, we can reconstruct x₁, x₂ by reconstruction algorithms and assemble them to further obtain the recovered speech $\hat{x}$ .

Since the objects operated on in practice are matrices and vectors, the orthocomplement design problem reduces to designing orthogonal vectors. Following linear space theory [27], z can be expressed as a linear combination of the vectors in a non-trivial NSB of Φ₁^T, denoted as Null(Φ₁^T). The rank of the stochastic matrix Φ₁∈R^{m×n} (m<n) is m. However, the rank of Φ₁^T, written dim(Φ₁^T), and the dimension of its null space are related by

dim [Null (Φ_{1}^{T})] = m - dim (Φ_{1}^{T}) = 0

(12)

According to (12), if we choose Φ_i to scramble y₁ and y₂ directly, the two measurement parts cannot be separated for decryption, since Null(Φ₁^T) contains only the zero vector. To ensure that the inverse operation exists, we construct a matrix $Φ_{1}^{D} = [\begin{array}{l} Φ_{1} \\ Φ_{1} \end{array}] \in R^{2 m \times n}$ that is not of full row rank. The rank of Φ₁^D, written dim(Φ₁^D), is m. By employing the conclusion drawn from [27], we have

dim ({Φ_{1}}^{D}) + dim [Null ({({Φ_{1}}^{D})}^{H})] = 2 m

(13)

Then,

\begin{matrix} dim [Null ({({Φ_{1}}^{D})}^{H})] = 2 m - dim ({Φ_{1}}^{D}) \\ = m \end{matrix}

(14)

This ensures the existence of null space of Φ₁^D and it can be constructed by SVD: for a matrix Φ₁^D ∈ R^2m×n with rank(Φ₁^D)=m, it can be decomposed as

{Φ_{1}}^{D} = U Σ V^{H}

(15)

Then the m left singular vectors {u_{m+1}, u_{m+2}, …, u_{2m}} that correspond to the zero singular values form an orthonormal basis of the null space of the conjugate transpose matrix (Φ₁^D)^H, that is

Null ({({Φ_{1}}^{D})}^{H}) = Span \{u_{m + 1}, u_{m + 2}, \dots, u_{2 m}\}

(16)

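where U∈R^{2m×2m} and V∈R^{n×n} are unitary matrices. Here $Σ = [\begin{array}{cc} Σ_{1} & 0 \\ 0 & 0 \end{array}] \in R^{2 m \times n},$ where Σ₁=diag(σ₁, σ₂, …, σ_m), and σ_i are the non-zero singular values of Φ₁^D, i.e., the square roots of the non-zero eigenvalues of (Φ₁^D)^HΦ₁^D.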

With (16), we obtain the NSB of (Φ₁^D)^H: N∈R^2m×m, denoted as $N = [\begin{array}{l} N_{1} \\ N_{2} \end{array}]$ with property N₁=−N₂. N₁, N₂ are full rank matrices and therefore have inverse matrices, as proved in Appendix. In this case, the matrix Φ₁^D provides available NSB. In other words, if we use the NSB matrix to construct a scrambling matrix for the aligned measurements $[\begin{array}{l} y_{1} \\ y_{2} \end{array}]$ , the inverse operator for decryption is available.
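The NSB construction and the property N₁ = −N₂ can be checked numerically. The following is a minimal sketch, assuming a real Gaussian Φ₁ and NumPy's full SVD; the function name `null_space_blocks` is hypothetical.

```python
import numpy as np

def null_space_blocks(Phi):
    """Return N1, N2 (each m x m) whose stack [N1; N2] spans Null((Phi_D)^T), Phi_D = [Phi; Phi]."""
    m = Phi.shape[0]
    Phi_D = np.vstack([Phi, Phi])          # 2m x n with rank m
    U, sing, Vt = np.linalg.svd(Phi_D)     # full SVD: U is 2m x 2m, singular values sorted descending
    N = U[:, m:]                           # left singular vectors for the m zero singular values
    return N[:m, :], N[m:, :]

# Example with m = 10, n = 200 (a 5% compression rate for a 200-sample frame).
rng = np.random.default_rng(0)
Phi1 = rng.standard_normal((10, 200))
N1, N2 = null_space_blocks(Phi1)
```

Because every vector in this null space has the form [a; −a], the two blocks differ only in sign, and each block is a scaled orthogonal matrix and hence invertible.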

For this purpose, the scrambling matrix is designed as $S = [\begin{array}{cc} α_{1} I & N_{1} \\ N_{2}^{- 1} & α_{2} I \end{array}] \in R^{2 m \times 2 m}$ to alias and disorder the measurements y₁, y₂. Here, α_i are the corresponding normalized indices of Φ_i in the dictionary B, so that every frame of the encrypted signal is aliased in a different proportion to enhance the encryption complexity. The identity matrix I∈R^{m×m} is used to match the dimensions of S and $[\begin{array}{l} y_{1} \\ y_{2} \end{array}]$ . The final encrypted signal y^D is given in (17). Every compressed part y_i is scrambled and the scrambled data are aliased with each other. The whole process of encryption is shown in Figure 4.

y^{D} = [\begin{array}{cc} α_{1} I & N_{1} \\ N_{2}^{- 1} & α_{2} I \end{array}] [\begin{array}{l} y_{1} \\ y_{2} \end{array}] = [\begin{array}{l} α_{1} y_{1} + N_{1} y_{2} \\ N_{2}^{- 1} y_{1} + α_{2} y_{2} \end{array}]

(17)
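The block product in (17) can be written out directly. The following is an illustrative sketch only; the function name `scramble` and the example values of α₁ and α₂ are hypothetical, and N₁, N₂ are assumed to be supplied by the NSB construction described above.

```python
import numpy as np

def scramble(y1, y2, N1, N2, a1, a2):
    """Form S = [[a1*I, N1], [N2^{-1}, a2*I]] and return (y_D, S) with y_D = S @ [y1; y2]."""
    m = len(y1)
    I = np.eye(m)
    S = np.block([[a1 * I, N1],
                  [np.linalg.inv(N2), a2 * I]])
    return S @ np.concatenate([y1, y2]), S
```

The encrypted vector has length 2m, so two m-dimensional measurement vectors are transmitted as one 2m-dimensional vector and the scrambling adds no extra samples.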

For communication, the encrypted data y^D, the dictionary parameters μ_i^∗, and the indices α_i of sensing matrix are transmitted to the decryption end. Then the decryption operator can be constructed by Φ_i. As such, the stochastic matrix dictionary B is also the key book shared by both the encryption and the decryption parts. This key book will never be exposed in the channel. In practice, it is irregularly updated and the number of its atoms L can adaptively be set to meet the requirements of the system.

3.3. Decryption

Figure 5 depicts the decryption procedure. We obtain Φ₁ from the key book using the index α₁ in the transmitted data, and construct the matrix S'=[N₂⁻¹ −α₁I] from Φ₁^D by SVD. The y₁ part is removed as follows.

\begin{array}{l} S' y^{D} = [\begin{array}{cc} N_{2}^{- 1} & - α_{1} I \end{array}] [\begin{array}{l} α_{1} y_{1} + N_{1} y_{2} \\ N_{2}^{- 1} y_{1} + α_{2} y_{2} \end{array}] \\ = (N_{2}^{- 1} N_{1} - α_{1} α_{2} I) y_{2} \\ = t \in R^{m \times 1} \end{array}

(18)

As mentioned above, N₁ and N₂ are full rank matrices. With the inverse matrix N₂⁻¹, we can get y₂ by multiplying the y₁-removed data t by the matrix (N₂⁻¹N₁ −α₁α₂I)⁻¹. Analogously, y₁ is decrypted as

y_{1} = {(α_{1} α_{2} I - N_{1} N_{2}^{- 1})}^{- 1} [α_{2} I - N_{1}] y^{D}

(19)
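The elimination in (19) and its counterpart for y₂ can be checked with a round trip. The following is a minimal sketch, assuming the block form of y^D in (17); the function name `descramble` is hypothetical. The operator used for y₂, [N₂⁻¹, −α₁I], is chosen here so that the y₁ terms of (17) cancel.

```python
import numpy as np

def descramble(yD, N1, N2, a1, a2):
    """Recover (y1, y2) from y_D = [a1*y1 + N1 y2; N2^{-1} y1 + a2*y2]."""
    m = N1.shape[0]
    I = np.eye(m)
    N2_inv = np.linalg.inv(N2)
    top, bottom = yD[:m], yD[m:]
    # Eq. (19): [a2*I, -N1] y_D = (a1*a2*I - N1 N2^{-1}) y1
    y1 = np.linalg.solve(a1 * a2 * I - N1 @ N2_inv, a2 * top - N1 @ bottom)
    # Counterpart for y2: [N2^{-1}, -a1*I] y_D = (N2^{-1} N1 - a1*a2*I) y2
    y2 = np.linalg.solve(N2_inv @ N1 - a1 * a2 * I, N2_inv @ top - a1 * bottom)
    return y1, y2
```

Solving the linear systems with `np.linalg.solve` avoids forming the explicit inverses that appear in the formulas.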

When the measurement vectors y_i and the dictionary parameters μ_i^∗ are derived, the two speech parts can be recovered using OMP algorithm [22]. Finally, by assembling the two reconstructed parts, we obtain the recovered speech $\hat{x}$ .
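For readers unfamiliar with OMP, a compact textbook-style version is sketched below. It is not the authors' implementation: the function name `omp`, the fixed iteration count K, and the stopping tolerance are assumptions. In the setting of this article, A would play the role of the product of the sensing matrix and the sparsifying dictionary.

```python
import numpy as np

def omp(A, y, K, tol=1e-10):
    """Greedy search for a K-sparse s with y ~= A s (orthogonal matching pursuit)."""
    y = np.asarray(y, dtype=float)
    n = A.shape[1]
    s = np.zeros(n)
    residual = y.copy()
    support = []
    for _ in range(K):
        corr = np.abs(A.T @ residual)          # correlation of each atom with the residual
        if support:
            corr[support] = 0.0                # never pick the same atom twice
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        s = np.zeros(n)
        s[support] = coef                      # least-squares fit on the current support
        residual = y - A[:, support] @ coef
        if np.linalg.norm(residual) < tol:
            break
    return s
```

Each iteration adds the atom most correlated with the residual and then refits all selected coefficients, which keeps the residual orthogonal to the chosen atoms.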

In theory, without the key book B, it is very hard for the eavesdroppers to decipher the encrypted signal even though they have cryptanalyzed the encryption mechanism. The strength of security is discussed in the following section.

4. Experimental results and discussions

In this section, the performance of the proposed encryption scheme is evaluated from three perspectives: (1) the residual intelligibility of the encrypted signal; (2) the strength of security; (3) resistance to hostile attacks. We test over 20,000 frames of speech from several speakers with different characteristics (gender, age, pitch, regional accent). These test signals are taken randomly from the TIMIT database and are sampled at 8 kHz with a frame length of 25 ms, that is, 200 samples per frame. In Section 4.1, residual intelligibility test results and discussions are presented. Section 4.2 analyzes the strength of security and considers a possible deciphering technique. Section 4.3 verifies the robustness of the proposed scheme under two conditions: in the presence of noise and under low-pass filtering (LPF).

4.1. Residual intelligibility

The amount of intelligibility left over in the encrypted signal is measured by the envelope relevance between the original speech and the processed signal, given as

ρ = \frac{< E_{o}, E_{p} >}{{({‖E_{o}‖}_{2} \cdot {‖E_{p}‖}_{2})}^{\frac{1}{2}}}

(20)

where E_o and E_p denote the envelopes of the original speech and the processed signal, respectively. Because (20) involves an inner product, we interpolate the vector E_p to the same dimension as E_o. We test two kinds of processed signal: the compressed signal (CoS) and the encrypted signal (ES).
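An envelope relevance of this kind can be computed as sketched below. Several choices here are assumptions rather than details from the article: the envelope is taken as the magnitude of the FFT-based analytic signal (Hilbert envelope), E_p is linearly interpolated to the length of E_o, and the inner product is normalized by the plain product of the two ℓ2 norms (cosine similarity), which may differ from the exact normalization printed in (20). The function names are hypothetical.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed with the FFT."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1 : n // 2] = 2.0
    else:
        h[1 : (n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * h))

def envelope_relevance(x_orig, x_proc):
    """Cosine similarity between the envelopes of the original and processed signals."""
    e_o = hilbert_envelope(x_orig)
    e_p = hilbert_envelope(x_proc)
    # Linearly interpolate E_p onto the (longer) grid of E_o.
    e_p = np.interp(np.linspace(0.0, 1.0, len(e_o)),
                    np.linspace(0.0, 1.0, len(e_p)), e_p)
    return float(e_o @ e_p / (np.linalg.norm(e_o) * np.linalg.norm(e_p)))
```

Under this normalization the value is 1 when the two envelopes have the same shape, regardless of amplitude.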

According to the experimental statistics, when the compression rate (m/n) is above 5%, the salient information of speech can be captured, and acceptable reconstruction quality is obtained with the K–L incoherent dictionary. On the other hand, although reconstruction quality improves as the compression rate increases, a high compression rate defeats the purpose of signal compression. Therefore, the average residual intelligibilities are evaluated at compression rates ranging from 5 to 10%.

As seen in Table 1, despite their noise-like nature, the low-dimensional measurements still retain considerable information about the original speech. In contrast, the envelope relevance between the aliased, scrambled signal and the original speech is dramatically lower, indicating much reduced residual intelligibility. Meanwhile, the compression rate and the residual intelligibility are not markedly related, which means one can choose the compression ratio adaptively without increasing the residual intelligibility. In addition, as a subjective method for predicting the quality of narrow-band speech, the mean opinion score (MOS) recommended by ITU-T P.862 [28] is adopted to evaluate the perceptual quality of the recovered speech, and the results are also presented in Table 1 to illustrate the relationship between speech quality and compression rate.

Table 1 Residual intelligibilities of processed signals at different compression rates and the corresponding MOS scores of the recovered speech

4.2. Strength of security

Following Shannon’s landmark article [14], most of the literature on key generation may roughly be categorized into four basic approaches: the information theory approach, the system theory approach, the complexity theory approach, and the stochastic approach. In terms of its key schedule, our encryption scheme belongs to the stochastic approach, and its security is generally measured by the size of the keyspace. Thus, the keyspace of the proposed scheme is analyzed in two cases to evaluate its strength of security.

First, we consider the optimal case, in which the scrambling mechanism is entirely unknown to the unauthorized listener, including the key schedule, the structure of the scrambling matrix, and the sparsifying dictionary. Consequently, the essential steps for decryption are given by

\left. \begin{array}{l} y^{D} \\ S (unknown) \end{array} \right\} \to \left. \begin{array}{l} y_{i} \\ Φ_{i} (unknown) \end{array} \right\} \to \left. \begin{array}{l} s_{i} \\ D_{i} (unknown) \end{array} \right\} \to {\hat{x}}_{i} \to \hat{x}

(21)

In this case, all three unknown matrices above can be regarded as the key, since one cannot obtain the original information if any one of them is missing. Meanwhile, the extremely low residual intelligibility and the noise-like nature of the encrypted signal hamper statistical analysis methods [29]. Therefore, the most feasible way is to search for the key, and the complexity of this search is directly determined by the size of the keyspace. Given the dimensions of S, Φ_i, and D_i, their keyspace sizes are $O (10^{{(2 m)}^{2}})$ , O(10^{mn}), and O(10^{mn}), respectively. These are also the computational complexities of the search process.

In the second case, we assume that an eavesdropper has complete knowledge of the system and has the necessary hardware to synchronize and isolate the frames. In other words, he knows that the scrambling matrix S can be constructed from Φ_i and that the sparsifying matrix D_i can simply be rebuilt from its characteristic structure. Hence, the security of the system is assumed to reside entirely in the selection of a key Φ_i. For the eavesdropper, the only task is to find the key. Owing to the randomness of Φ_i, the keyspace size is O(10^{mn}). In practical situations, the speech frames are of length n=180–220, and if we choose a compression rate of 5%, the length of the compressed signal is m=9–11. Therefore, the order of magnitude of the keyspace is about 10²⁰⁰⁰.
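The arithmetic behind this estimate can be reproduced in a few lines. The sketch below reads O(10^{mn}) as a base-10 exponent equal to the number of entries of Φ_i; the helper name `keyspace_exponent` is hypothetical, and the compression rate is passed as an integer percentage to keep the arithmetic exact.

```python
def keyspace_exponent(n, rate_percent):
    """Return (m, m*n): measurement length and the base-10 exponent of the keyspace O(10^(m*n))."""
    m = n * rate_percent // 100      # e.g. 5% of a 200-sample frame gives m = 10
    return m, m * n

for n in (180, 200, 220):
    m, exponent = keyspace_exponent(n, 5)
    print(f"n={n}, m={m}, keyspace ~ 10^{exponent}")
```

For n = 200 this gives m = 10 and an exponent of 2000, and the exponent ranges from 1620 to 2420 over the stated frame lengths.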

Table 2 compares the keyspace sizes and the compression ratios of the proposed scheme and some prior scramblers, which employ representative key schedules, including Hadamard matrix [4], Latin Square [6], dimensional-variable matrix [7], and chaos system [12]. As seen in Table 2, the proposed scheme provides larger keyspace and requires lower communication overhead.

Table 2 Comparison of keyspace size and compression ratio

Now let us consider a possible deciphering technique based on dictionary learning, disregarding the deciphering delay. A cryptanalyst trying to break the system may possess large amounts of encrypted signal but not the key book, since the key book is held by both the encryption and decryption sides and is never transmitted through the channel. He knows the complete specification of the system (scrambling mechanism, structure of the sparsifying dictionary) and would like to deduce the key without considering the real-time requirement.

In mathematical terms, the j-th encryption operation is represented by

y_{(j)}^{D} = S^{(j)} [\begin{array}{l} y_{1}^{(j)} \\ y_{2}^{(j)} \end{array}]

(22)

By wiretapping, the cryptanalyst has obtained enough encrypted signals y_(j)^D. He would like to learn ${\hat{S}}^{(j)}$ from Y^D, where Y^D=[y₍₁₎^D y₍₂₎^D ⋯ y_(j)^D]. Each $[\begin{array}{l} y_{1}^{(j)} \\ y_{2}^{(j)} \end{array}]$ is unknown. Unfortunately, this is an optimization problem with no constraint conditions and thus cannot be solved, not to mention that the scrambling matrix S varies rather than being fixed. Even if he were able to find some fixed $\hat{S}$ , he still could not rebuild the sensing matrix Φ. This can be verified through Equations (15) and (16).

In fact, there would not be enough data or delay tolerance for cryptanalysis. For instance, in secure communications of military information or intelligence on espionage activities, the key information is expected to be as brief as possible to ensure short durations. Therefore, dictionary learning may not be a feasible approach in real cases.

4.3. Robustness performance

Since readily decipherable unintelligible signals may also be generated with a large keyspace, other factors, including bandwidth expansion, delay time, and channel resistance (to noise, distortion, etc.), cannot be ignored in assessing security. Two types of attack are applied to the encrypted signal: (1) additive white Gaussian noise (AWGN); (2) LPF.

Representative speech scrambling schemes are chosen for comparison with the CS scheme. To be specific, time-domain scrambling (TDS) [5] is adopted to stand for non-compressional scramblers. In parallel, the approximate 13-line μ-law pulse code modulation (PCM) and MELP [9] are chosen to represent waveform coding and parametric coding, respectively, among compressional scramblers.

The MOS is chosen as the subjective criterion. In addition, the average segmental signal-to-noise ratio (SNRseg) [30] is adopted as the objective criterion to evaluate the quality of the recovered speech, given by (23).

SNRseg = 10 \lg (\frac{1}{N_{frame}} \sum_{j = 1}^{N_{frame}} SNRseg_{j})

(23)

where $SNRseg_{j} = \sum_{i = 1}^{n} \frac{x_{j}^{2} (i)}{{[x_{j} (i) - {\hat{x}}_{j} (i)]}^{2}}$ and N_frame denotes the total number of frames. The results are calculated and averaged for a test set of approximately 100 sentences randomly selected from the TIMIT database.
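A frame-averaged SNR of this kind can be sketched as follows. The outer structure follows (23): per-frame ratios are averaged and then converted to decibels. The per-frame ratio is taken here as total signal energy over total error energy, a common convention that is an assumption in this sketch, as are the function name `snr_seg` and the small constant guarding against division by zero.

```python
import numpy as np

def snr_seg(x, x_hat, frame_len=200, eps=1e-12):
    """10*log10 of the mean per-frame ratio (signal energy / error energy)."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    n_frames = len(x) // frame_len
    ratios = []
    for j in range(n_frames):
        seg = slice(j * frame_len, (j + 1) * frame_len)
        signal_energy = np.sum(x[seg] ** 2)
        error_energy = np.sum((x[seg] - x_hat[seg]) ** 2) + eps   # eps is an assumed guard
        ratios.append(signal_energy / error_energy)
    return float(10.0 * np.log10(np.mean(ratios)))
```

The default frame length of 200 samples corresponds to 25 ms at 8 kHz, as used in the experiments.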

4.3.1. Noise resistance

AWGN is added to the encrypted signal of each scheme. The performances of the proposed and comparative schemes are compared. The compression ratio of CS is set as 1:10.

As shown in Figure 6a, the CS scheme outperforms the comparative schemes at all degrees of contamination. As the signal-to-noise ratio (SNR) becomes higher, the superiority of the CS scheme becomes more obvious; the compressional schemes, including PCM and MELP, perform worse, as the SNRseg decrements also verify (Figure 6b).

These results are mainly due to the use of a stochastic matrix, which has extremely low column coherence. The studies [31, 32] have shown that for a noiseless signal y=Φs, if the K-sparse vector s satisfies ${‖s‖}_{0} < \frac{1}{2} spark (Φ)$, then the reconstruction $\hat{s}$ from the contaminated measurements (y+e) satisfies

{‖s - \hat{s}‖}_{2}^{2} \leq O (\frac{{‖e‖}_{2}^{2}}{1 - M (2 K - 1)})

(24)

where spark(Φ) stands for the minimum number of columns of Φ that are linearly dependent, and M is the “mutual coherence” of Φ [31], defined over its columns Φ_i as

M = \max_{1 \leq i, j \leq n, i \neq j} | \frac{Φ_{i}^{T} Φ_{j}}{{‖Φ_{i}‖}_{2} {‖Φ_{j}‖}_{2}} |

(25)

In short, one can stably reconstruct the sparse vector s with an error proportional to the noise level, provided that (1) the columns of the sensing matrix Φ are weakly mutually correlated, and (2) the vector s is sufficiently sparse. The reason why the reconstruction error is bounded is geometric in nature: in summary, the reconstruction $\hat{s}$ is confined within a tiny tubular wedge that surrounds the original vector s, which ensures the stability of the recovery (for further details, please refer to [31], subsection 5.3). In particular, the method of convex relaxation can identify a sparse signal in AWGN [33], and more sophisticated stable recovery schemes and bounds have been investigated [34].
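The mutual coherence in (25), and the sparsity level for which the denominator of (24) stays positive, can be checked numerically. The following is a minimal sketch, assuming a real-valued matrix stored as a list of rows with no all-zero columns; the helper names `mutual_coherence` and `max_sparsity_for_bound` are hypothetical.

```python
import math


def mutual_coherence(phi):
    """Mutual coherence M of (25); phi is an m x n matrix as a list of rows.

    Assumption: phi is real-valued and has no all-zero column.
    """
    m, n = len(phi), len(phi[0])
    cols = [[phi[r][c] for r in range(m)] for c in range(n)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in cols]
    best = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            inner = sum(a * b for a, b in zip(cols[i], cols[j]))
            best = max(best, abs(inner) / (norms[i] * norms[j]))
    return best


def max_sparsity_for_bound(coherence):
    """Largest integer K with M * (2K - 1) < 1, as required by (24).

    Returns None when the coherence is zero (orthogonal columns), because
    the condition then places no limit on K.
    """
    if coherence <= 0.0:
        return None
    k = math.floor((1.0 / coherence + 1.0) / 2.0)
    if coherence * (2 * k - 1) >= 1.0:  # the inequality in (24) is strict
        k -= 1
    return k
```

Applied to a matrix with independent Gaussian entries, `mutual_coherence` gives a quick way to see how the coherence, and hence the admissible sparsity K, changes with the matrix dimensions.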

Technically speaking, the waveform coding scheme has no protection against noise and is thus vulnerable to it. In the parametric coding scheme, noise introduces errors into feature parameters such as the pitch period and the voiced/unvoiced decision. Once such parameters are contaminated, undesirable reconstruction distortion occurs. For these reasons, the noise resistance of the proposed scheme is better than that of its counterparts.

4.3.2. LPF

For LPF, the decrements in MOS and SNRseg between speech reconstructed from the filtered encrypted data and speech reconstructed from the original encrypted data are compared. The compression ratio of CS is again set to 1:10.
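A low-pass filtering attack can be imitated with a simple FIR filter. The text does not specify the filter used in the experiments, so the sketch below rests on assumptions: a Hamming-windowed sinc design with 101 taps and a sampling rate of 8000 Hz, both hypothetical, as are the function names.

```python
import math


def lowpass_fir(cutoff_hz, fs_hz, num_taps=101):
    """Hamming-windowed sinc low-pass taps, normalized to unit DC gain."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd")
    fc = cutoff_hz / fs_hz  # normalized cutoff in cycles per sample
    mid = (num_taps - 1) // 2
    taps = []
    for k in range(num_taps):
        t = k - mid
        if t == 0:
            ideal = 2.0 * fc
        else:
            ideal = math.sin(2.0 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * k / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [h / gain for h in taps]


def lowpass_attack(signal, cutoff_hz, fs_hz=8000.0, num_taps=101):
    """Filter signal and return an output aligned with the input.

    The filter delay is compensated, and samples outside the input are
    treated as zero, so the output has the same length as the input.
    """
    taps = lowpass_fir(cutoff_hz, fs_hz, num_taps)
    mid = (num_taps - 1) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            idx = n + mid - k
            if 0 <= idx < len(signal):
                acc += h * signal[idx]
        out.append(acc)
    return out
```

With these settings a call such as `lowpass_attack(encrypted, 2400.0)` would correspond to the lowest cutoff frequency considered in this subsection; the first and last 50 output samples are affected by the zero padding at the edges.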

As seen in Figure 7, the CS scheme slightly outperforms the comparative schemes when the cutoff frequency ranges from 2400 to 2600 Hz. Its advantage grows as the cutoff frequency increases and eventually becomes obvious. The TDS scheme ranks second, and the PCM scheme shows the worst performance.

The speech characteristics of the encrypted signal obtained from CS have been largely removed by aliasing and scrambling, so it has the best resistance to LPF. In contrast, the TDS scheme still retains some speech structure, which makes its performance inferior to that of the CS scheme. The TDS scheme outperforms the other two comparative schemes owing to its robustness to time-domain perturbations [24]. As for the PCM and MELP schemes, their encoded signals have no spectral structure: all parts of the signal are of similar importance and are therefore vulnerable to this type of attack. Any damage to the encrypted signal brings about serious reconstruction errors and degrades the auditory quality. As a consequence, these two schemes have the worst robustness to LPF.

5. Conclusions

This article presented a scrambling-based speech encryption algorithm via CS. A high degree of security is achieved owing to low residual intelligibility and a large keyspace. The immense complexity of finding the scrambling matrix ensures the effectiveness of the encryption. The scheme affords notable robustness to common hostile attacks, while requiring lower communication costs and introducing only a slight (about 200 ms) processing delay. Experimental results demonstrate the improved performance of the scheme compared with state-of-the-art speech scramblers. As future work, we plan to investigate more sophisticated sparse representations of speech and reconstruction algorithms to further reduce the compression ratio and improve the auditory quality of the recovered speech.

Appendix

The proof of the full rank property mentioned in Section 3.2 is given as follows.

For a full row rank matrix P∈R^{m×n} (m<n) with rank(P)=m, denote its SVD as P=UΣV^H, where Σ=[Σ_m O_{m×(n−m)}]∈R^{m×n} and Σ_m=diag(σ₁, σ₂,…,σ_m). Here U and V are m×m and n×n unitary matrices, respectively, and σ_i>0 denote the square roots of the non-zero eigenvalues of P^HP. With the SVD of P, we have

\begin{array}{l} P^{H} P = (V Σ^{H} U^{H}) (U Σ V^{H}) \\ = V [\begin{array}{ll} Σ_{m}^{2} & O_{m \times (n - m)} \\ O_{(n - m) \times m} & O_{(n - m) \times (n - m)} \end{array}] V^{H} \end{array}

(26)

Then for $Q = [\begin{array}{l} P \\ P \end{array}] = U' Σ' V'^{H} \in R^{2 m \times n}$ and rank(Q)=m, Q^HQ is represented as

\begin{array}{l} Q^{H} Q = [\begin{array}{ll} P^{H} & P^{H} \end{array}] [\begin{array}{l} P \\ P \end{array}] = 2 P^{H} P \\ = 2 V' [\begin{array}{ll} Σ_{m}^{2} & O_{m \times (n - m)} \\ O_{(n - m) \times m} & O_{(n - m) \times (n - m)} \end{array}] V'^{H} \end{array}

(27)

where U′∈R^{2m×2m}, Σ′∈R^{2m×n}, and V′∈R^{n×n}. Comparing (26) and (27), it is noticed that V′=V, $Σ' = [\begin{array}{ll} Σ_{m} & O_{m \times (n - m)} \\ O_{m \times m} & O_{m \times (n - m)} \end{array}]$, and U′U′^H=2I_{2m}, where I_{2m} denotes the 2m×2m identity matrix. In this case, if there is a proper U′, the SVD of Q can be obtained. It is verified that $U' = [\begin{array}{ll} U & - U \\ U & U \end{array}]$ satisfies U′U′^H=2I_{2m}; then (27) is rewritten as

\begin{array}{l} Q^{H} Q = 2 V' [\begin{array}{ll} Σ_{m}^{2} & O_{m \times (n - m)} \\ O_{(n - m) \times m} & O_{(n - m) \times (n - m)} \end{array}] V'^{H} \\ = 2 V' Σ'^{H} Σ' V'^{H} = V' Σ'^{H} (2 I_{2 m}) Σ' V'^{H} \\ = {(U' Σ' V'^{H})}^{H} (U' Σ' V'^{H}) \end{array}

(28)

Thus, with the SVD Q=U′Σ′V′^H, the null space basis of Q^H can be constructed as

\begin{array}{l} Null (Q^{H}) = Span \{u'_{m + 1}, u'_{m + 2}, \dots, u'_{2 m}\} \\ = [\begin{array}{l} - U \\ U \end{array}] = [\begin{array}{l} N_{1} \\ N_{2} \end{array}] \in R^{2 m \times m} \end{array}

(29)

where u′_{m+i} denotes the (m+i)th column vector of U′.

Owing to the randomness of P, dim[Null(Q^H)]=m; namely, the NSB matrix $N = [\begin{array}{l} N_{1} \\ N_{2} \end{array}]$ is a full column rank matrix.

By elementary row operations, N can be converted to the form $[\begin{array}{l} N_{1} \\ O \end{array}]$ without changing its rank. Therefore, rank(N₁)=m, which means that N_i, i=1, 2, are full rank matrices and possess inverses. This completes the proof.
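The central step of the proof, that every column of N=[−U; U] is annihilated by Q^H when Q=[P; P], can be checked numerically. The sketch below assumes real matrices, so that the conjugate transpose reduces to the transpose, and it replaces U by an arbitrary m×m matrix W, because the cancellation does not depend on the particular choice. The helper names are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]


def stacked_null_space_residual(p, w):
    """Largest absolute entry of Q^T N for Q = [P; P] and N = [-W; W].

    Assumption: p is m x n and w is m x m, both real-valued. W stands in
    for the matrix U of the proof.
    """
    q = p + p  # stack P on top of itself: 2m x n
    n_mat = [[-v for v in row] for row in w] + [list(row) for row in w]  # 2m x m
    product = matmul(transpose(q), n_mat)  # n x m, expected to vanish
    return max(abs(v) for row in product for v in row)
```

A residual at the level of rounding error confirms Q^H N = O. When W is additionally invertible, as U is, both blocks N₁=−W and N₂=W are invertible, which is the property used in Section 3.2.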

Abbreviations

AMR:: Adaptive multi-rate
AWGN:: Additive white Gaussian noise
BC:: Before Christ
CoS:: Compressed signal
CS:: Compressed sensing
ES:: Encrypted signal
K–L:: Karhunen–Loeve
LPF:: Low-pass filtering
MELP:: Mixed excitation linear prediction
MOS:: Mean opinion score
NSB:: Null space basis
OMP:: Orthogonal matching pursuit
PCM:: Pulse code modulation
SNR:: Signal-to-noise ratio
SNRseg:: Average-subsection signal-to-noise ratio
SVD:: Singular value decomposition
TDS:: Time-domain scrambling.

References

1. Clark JA: Nature-inspired cryptography: Past, Present and Future. In Congress on Evolutionary Computation. 3rd edition. Newport Beach, USA; 2003:1647-1654.
2. Nan L, Yanhong S, Jiancheng Z: An audio scrambling method based on Fibonacci transformation. J. North China Univ. Technol. 2004, 16(3):8-11.
3. Senk V, Delic VD, Milosevic VS: A new speech scrambling concept based on Hadamard matrices. IEEE Signal Process. Lett. 1997, 4(6):161-163.
4. Pal SK: Fast, reliable & secure digital communication using Hadamard matrices. In Proceedings of the International Conference on Computing: Theory and Applications. 1st edition. Kolkata, India; 2007:526-532.
5. Li H, Qin Z, Zhang XP, Shao LP: An n-dimensional space audio scrambling algorithm based on random matrix. J. Xi’an Jiaotong Univ. 2010, 44(4):13-17.
6. Satti M, Kak S: Multilevel indexed quasi-group encryption for data and speech. IEEE Trans. Broadcast. 2009, 55(2):270-281.
7. Li H, Qin Z: Audio scrambling algorithm based on variable dimension spaces. In International Conference on Industrial and Information Systems. 1st edition. West Bengal, India; 2009:316-319.
8. Honggang W, Michael H, Hamid S, Peng DM, Wang W, Hsiao-Hwa C: Index-based selective audio encryption for wireless multimedia sensor networks. IEEE Trans. Multimed. 2010, 12(3):215-223.
9. Antonio S, Juan Carlos M: Perception-based partial encryption of compressed speech. IEEE Trans. Speech Audio Process. 2002, 10(8):637-643.
10. Pierre M, O’Sullivan JA: Information-theoretic analysis of information hiding. IEEE Trans. Inf. Theory 2003, 49(3):563-593.
11. Li-Lian H, Qi-tian Y: A chaos synchronization secure communication system based on output control. J. Electron. Inf. Technol. 2009, 31(10):2402-2405.
12. Liangrui T, Lin Z, Xue Y: Chaos synchronization based on observer and its application in speech secure communication. In Proceedings of IC-NIDC. 2nd edition. Beijing, China; 2010:773-777.
13. Del Re E, Fantacci R, Maffucci D: A new speech signal scrambling method for secure communications: theory, implementation, and security evaluation. IEEE J. Sel. Areas Commun. 1989, 7(4):474-480.
14. Shannon CE: Communication theory of secret systems. Bell Syst. Tech. J. 1949, 28(4):656-715.
15. Donoho DL: Compressed sensing. IEEE Trans. Inf. Theory 2006, 52(4):1289-1306.
16. Baraniuk RG: Lecture notes: compressive sensing. IEEE Signal Process. Mag. 2007, 24(4):118-121.
17. Candès EJ, Wakin MB: An introduction to compressive sampling. IEEE Signal Process. Mag. 2008, 25(2):21-30.
18. Tian-jing W, Bao-yu Z, Zhen Y: A speech signal sparse representation algorithm based on adaptive overcomplete dictionary. J. Electron. Inf. Technol. 2011, 33(10):2372-2377.
19. Zhang-hua C, Yuan-sheng T: Secure communication based on network coding. J. Commun. 2010, 31(8A):188-194.
20. Smith ZM, Bertrand D, Oxenham AJ: Chimaeric sounds reveal dichotomies in auditory perception. Nature 2002, 416:87-90.
21. Mehmet Cenk S, Bilge G, Gunes Karabulut K: Perceptual audio features for emotional detection. EURASIP J. Audio Speech Music Process. 2012, 16:1-21.
22. Tropp JA, Gilbert AC: Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. Inf. Theory 2007, 53(12):4655-4666.
23. Navarro-Moreno J, Ruiz-Molina JC: Nonlinear estimation using correlation information. IEEE Trans. Signal Process. 2006, 54(7):2822-2827.
24. Saberi K, Perrot DR: Cognitive restoration of reversed speech. Nature 1999, 398:760.
25. Joris PX, Yin TC: Envelope coding in the lateral superior olive. I. Sensitivity to interaural time differences. J. Neurophysiol. 1995, 73:1043-1062.
26. Bendor D, Wang X: The neuronal representation of pitch in primate auditory cortex. Nature 2005, 436:1161-1165.
27. Xian-da Z: Matrix Analysis and Applications. Tsinghua University Press, Beijing; 2004.
28. ITU-T P.862: Perceptual Evaluation of Speech Quality (PESQ), an Objective Method for End-to-End Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs. ITU-T Recommendation, Geneva; 2001.
29. Georgiev P, Theis F, Cichocki A: Sparse component analysis and blind source separation of underdetermined mixtures. IEEE Trans. Neural Netw. 2005, 16(4):992-996.
30. Tingting X, Zhen Y, Xi S: Novel speech secure communication system based on information hiding and compressed sensing. In The 4th International Conference on System and Networks Communications. 4th edition. 2009:201-206.
31. Donoho DL, Michael E, Vladimir T: Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inf. Theory 2006, 52(1):6-18.
32. Babaie-Zadeh M, Jutten C: On the stable recovery of the sparsest overcomplete representations in presence of noise. IEEE Trans. Signal Process. 2010, 58(10):5396-5400.
33. Tropp JA: Just relax: convex programming methods for identifying sparse signal in noise. IEEE Trans. Inf. Theory 2006, 52(3):1030-1051.
34. Sun Q: Sparse approximation property and stable recovery of sparse signal from noisy measurements. IEEE Trans. Signal Process. 2011, 59(10):5086-5090.

Acknowledgments

This study was supported by the National Natural Science Foundation, China (61072042), the Natural Science Foundation of Jiangsu Province, China (BK2012510), and the Pre-research Foundations of PLA University of Science and Technology (20110211). The authors sincerely thank Professor Shou-sheng Liu and Professor Zu-ping Qian for their useful discussions and valuable suggestions. The authors would also like to thank the anonymous reviewers for their constructive comments and questions, which greatly improved the article.

Author information

Authors and Affiliations

College of Command Information Systems, PLA University of Science and Technology, Nanjing, China
Li Zeng & Xiongwei Zhang
College of Communications Engineering, PLA University of Science and Technology, Nanjing, China
Liang Chen, Zhangjun Fan & Yonggang Wang

Corresponding author

Correspondence to Li Zeng.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Zeng, L., Zhang, X., Chen, L. et al. Scrambling-based speech encryption via compressed sensing. EURASIP J. Adv. Signal Process. 2012, 257 (2012). https://doi.org/10.1186/1687-6180-2012-257

Received: 15 December 2011
Accepted: 04 December 2012
Published: 28 December 2012
DOI: https://doi.org/10.1186/1687-6180-2012-257
