Open Access

Scrambling-based speech encryption via compressed sensing

  • Li Zeng1 (corresponding author),
  • Xiongwei Zhang1,
  • Liang Chen2,
  • Zhangjun Fan2 and
  • Yonggang Wang2
EURASIP Journal on Advances in Signal Processing 2012, 2012:257

Received: 15 December 2011

Accepted: 4 December 2012

Published: 28 December 2012


Conventional speech scramblers have three disadvantages: heavy communication overhead, underexploitation of signal features, and low attack resistance. In this study, we propose a scrambling-based speech encryption scheme via compressed sensing (CS). Unlike conventional scramblers, the proposed scheme solves these problems in a unified framework by exploiting the advantages of CS. The presented encryption idea is general and applies readily to speech communication systems. Compared with state-of-the-art methods, the proposed scheme provides lower residual intelligibility and demands greater cryptanalytic effort. Meanwhile, it ensures efficient channel usage and notable resistance to hostile attacks. Extensive experimental results confirm the effectiveness of the proposed scheme.


Keywords: Speech encryption; Scrambling; Compressed sensing; Residual intelligibility; Keyspace

1. Introduction

Encryption, which dates back to antiquity, is essential for information security in modern society [1]. Information espionage, including illegal surveillance and wiretapping, has emerged with the wide application of speech communication in national defense, economy and trade, scientific research, and social affairs. With security an ever more vital requisite of communications systems, speech encryption has gained substantial acceptance as an effective means of enhancing protection in both military and civilian applications.

Two main categories of technologies have been developed for this purpose. The first is content protection through encryption, e.g., the speech scrambler [2–9]. Proper decryption of the data requires a key, the so-called scrambling matrix. The second is digital watermarking, which aims at embedding messages into the multimedia data [10]. Intuitively, time domain sample scrambling is thus far the most attractive method, because it simply takes a segment of time domain sample values and directly scrambles them into a different segment of samples. In this article, we focus on scrambling-based encryption.

Earlier speech scramblers disorder the original signal using a specific sequence or matrix, such as a pseudorandom sequence, the Fibonacci transform [2], the Hadamard matrix [3, 4], and so on. The main disadvantage shared by these methods is that the decryption key is invariable. Because the performance of computer hardware has improved enormously, these methods can easily be deciphered. To alleviate this problem, researchers proposed new key schedules, such as the stochastic matrix [5] and the Latin square [6], to improve the strength of security [7, 8]. However, the improved algorithms also result in a heavy transmission load because they cannot compress the original signal. Consequently, speech compression methods were integrated into the encryption process, e.g., G.729, mixed excitation linear prediction (MELP), and AMR [9]. Indeed, the combination of compression and scrambling leads to less costly encrypted data. But the parametric coding algorithms have low robustness in the presence of noise or other hostile attacks. Besides, the performance of such algorithms depends heavily upon the encryption operator, and the character of speech itself is not well utilized. More recently, research on nonlinear processes has shown that chaotic signals are very suitable for secure communications. However, a chaotic system is sensitive to disturbance and requires strict self-synchronization, which limits its practical applications [11, 12].

According to Del Re et al. [13], the degree of security (deciphering difficulty) provided by a speech encryption system is related to (1) residual intelligibility (the amount of intelligibility left over in the encrypted signal) and (2) keyspace (the number of keys available for encryption). In general, the lower a scrambling system’s residual intelligibility and the bigger its keyspace, the higher its degree of security. Since the stochastic approach was proposed [14], the keyspace of a scrambler has commonly been measured by the number of encryption operators, namely the scrambling matrices [3, 4, 6, 13].

To summarize, a channel-saving and attack-resistant speech scrambler is a major issue to be addressed. At the same time, it should attain residual intelligibility as low as possible and provide a keyspace as large as possible to increase the cryptogram's immunity to cryptanalysis. Despite the improvements achieved by the aforementioned works, few investigations manage to address these problems simultaneously. In light of this consideration, we apply compressed sensing (CS) [15–17] to speech encryption, owing to its promising capability in signal compression and its notable robustness to hostile attacks.

In this article, we tackle scrambling-based speech encryption via CS by exploiting the sparsity of speech over the Karhunen–Loeve (K–L) incoherent dictionary [18]. Unlike existing schemes, we scramble the dimension-reduced measurements instead of the original speech. The proposed algorithm is motivated by the following idea: if two independent signals x1 and x2 are aliased and scrambled in the same space by a stochastic matrix, the intelligibility of the original signal can be eliminated, and it is hard for eavesdroppers to get any information from the mixture alone [19]. Observations show that the measurement vector of speech exhibits a noise-like nature. However, the envelope of the compressed data still retains considerable information about the original speech [20, 21]. We therefore alias and scramble the two measurement vectors of the separated speech instead of using the compressed data directly as the ciphered data.

To be specific, the presented scheme contains two stages: encryption and decryption. At the encryption stage, we compress and encrypt the speech. The original signal is separated into two independent parts, each sparsely represented by the corresponding K–L incoherent dictionary. Next, the sparse vectors of the two parts are measured by stochastic matrices. Afterwards, these low-dimensional measurements are mixed and scrambled by the scrambling matrix, which is constructed from the null space basis (NSB) of the aligned sensing matrix using singular value decomposition (SVD). At the decryption stage, the inverse operator is constructed to eliminate each aliased measurement part individually. Finally, the separated speech parts are reconstructed using orthogonal matching pursuit (OMP) [22] and assembled to recover the speech. Experimental results demonstrate that the encrypted signal of the proposed scheme has low transmission cost and low residual intelligibility. The scheme provides immense cryptogram immunity and exhibits notable attack resistance.

The rest of this article is outlined as follows. In Section 2, we introduce the K–L incoherent dictionary to sparsely represent speech signals. Section 3 explains the encryption idea that we seek to address, and expatiates on the detailed procedure of encryption and decryption. The residual intelligibility, encryption strength, and robustness performance of the proposed scheme, together with experimental results, are presented and discussed in Section 4. Conclusions are drawn in Section 5.

2. The sparse representation of speech

Sparse representation is a critical step of CS [16], since one can obtain the dimension-reduced signal on the basis of sparse vectors. In this section, we employ a practical sparsifying dictionary to sparsely represent the speech signal.

The K–L expansion [18] describes a stochastic process in terms of incoherent random principal components and the corresponding deterministic orthogonal basis. Thus, the main structure of the process can be captured by a few expansion terms. For a real second-order moment stochastic process $\{x(t),\ t \in [0, 1]\}$, the K–L expansion is
$$x(t) = \sum_{k=1}^{\infty} a_k \phi_k(t) \tag{1}$$
where $a_k = \int_0^1 x(t)\,\phi_k(t)\,dt$. The orthogonal bases $\{\phi_k(t)\}$ are eigenfunctions of the signal autocorrelation $R_x(t, u)$ and can be used to design the atoms of the signal dictionary. $\{\phi_k(t)\}$ and the corresponding eigenvalues $y_k$ satisfy the Fredholm integral equation [18, 23]
$$y_k \phi_k(t) = \int_0^1 R_x(t, u)\,\phi_k(u)\,du \tag{2}$$
However, for a practical signal, Equation (2) is hard to solve due to the complexity of the signal autocorrelation. Since the autocorrelation of a speech signal decreases rapidly within a short delay, we approximate it by the exponential function $R_x(t, u) = R_x(0)\,e^{-\mu|t-u|}$, where $\mu$ is the attenuation coefficient. Substituting $R_x(t, u)$ into (2) yields
$$y_k \phi(t) = \int_0^1 R_x(t, u)\,\phi(u)\,du = R_x(0)\left[ e^{-\mu t}\int_0^t e^{\mu u}\phi(u)\,du + e^{\mu t}\int_t^1 e^{-\mu u}\phi(u)\,du \right] \tag{3}$$
By solving (3) with the boundary conditions and eliminating the zero particular solution $\phi_0(t) = 0$ (it cannot be used as an eigenfunction), we obtain the orthogonal basis
$$\phi_k(t) = \mu\cos(k\pi t) + \sin(k\pi t), \quad k \in \mathbb{Z}\setminus\{0\} \tag{4}$$
After adding $\phi_0(t) = 1$, the complete orthogonal K–L dictionary $E$ is represented by
$$E = \{\phi_0(t)\} \cup \{\phi_k(t),\ k \in \mathbb{Z}\setminus\{0\}\} \tag{5}$$
where k stands for the number of atoms in the dictionary. For digital signal processing, the bases of $E$ are sampled uniformly in the range $0 \le t \le 1$. Let $\mu^*$ denote the optimal value of the parameter $\mu$; it is estimated by solving the following optimization problem:
$$\mu^* = \arg\min_{\mu > 0} \left\| R_x(\tau) - \hat{R}_x(\tau) \right\|_2^2 \tag{6}$$

where $R_x(\tau) = \frac{1}{n-\tau}\sum_{i=1}^{n-\tau} x_i x_{i+\tau}$ is the unbiased estimate of the autocorrelation of the speech frame $x \in \mathbb{R}^n$ at delay $\tau$, and $\hat{R}_x(\tau) = R_x(0)\,e^{-\mu\tau}$, $\tau = 0, 1, \ldots, n-1$.
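The estimation of μ* in (6) can be illustrated with a minimal sketch. It is not the authors' implementation: the grid search, the search range of `mu_grid`, and the function names are assumptions introduced here, and the source does not state which optimizer was used.

```python
import numpy as np

def unbiased_autocorr(x):
    """R_x(tau) = 1/(n - tau) * sum_{i=1}^{n-tau} x_i x_{i+tau}, for tau = 0..n-1."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[:n - tau], x[tau:]) / (n - tau) for tau in range(n)])

def fit_mu(x, mu_grid=None):
    """Grid search for the mu > 0 minimising ||R_x(tau) - R_x(0) exp(-mu tau)||_2^2."""
    if mu_grid is None:
        mu_grid = np.linspace(0.01, 5.0, 500)  # hypothetical search range
    r = unbiased_autocorr(x)
    tau = np.arange(len(r))
    errors = [np.sum((r - r[0] * np.exp(-mu * tau)) ** 2) for mu in mu_grid]
    return float(mu_grid[int(np.argmin(errors))])

# Example: one 200-sample frame of synthetic data standing in for speech.
rng = np.random.default_rng(0)
frame = rng.standard_normal(200)
mu_star = fit_mu(frame)
```

A grid search is used only because it is easy to verify; any one-dimensional minimizer over μ > 0 would serve the same purpose.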

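Thus, the optimal discrete atoms are $e_k = [e_k(1), \ldots, e_k(i), \ldots, e_k(n)]^T$, where
$$e_k(i) = \mu\cos\!\left(k\pi\,\frac{i-1}{n-1}\right) + \sin\!\left(k\pi\,\frac{i-1}{n-1}\right), \quad i = 1, \ldots, n \tag{7}$$
Adding $e_0 = [1, \ldots, 1]^T$, we construct the complete discrete speech dictionary as
$$D = \{e_0\} \cup \{e_k,\ k \in \mathbb{Z}\setminus\{0\}\} \tag{8}$$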

In this case, discrete uniform sampling changes the orthogonality of bases {ϕ k (t)}. Though they are not mathematically orthogonal, atoms in D are of low coherence and subsequently, D turns out to be an overcomplete incoherent dictionary.
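As a hedged illustration of the construction above, the sketch below samples the atoms at t = (i−1)/(n−1) and stacks them as columns. The range of k (here ±1, …, ±`num_k`), the unit-norm scaling, and the value μ = 1 are assumptions made for the example; the source only fixes the form of the atoms.

```python
import numpy as np

def kl_dictionary(n, mu, num_k):
    """Columns: e_0 = [1,...,1]^T followed by e_k for k = +1, -1, ..., +num_k, -num_k."""
    t = np.arange(n) / (n - 1)               # (i - 1)/(n - 1) for i = 1..n
    atoms = [np.ones(n)]
    for k in range(1, num_k + 1):
        for signed_k in (k, -k):
            atoms.append(mu * np.cos(signed_k * np.pi * t)
                         + np.sin(signed_k * np.pi * t))
    D = np.stack(atoms, axis=1)
    return D / np.linalg.norm(D, axis=0)     # unit-norm atoms (an assumption)

D = kl_dictionary(n=200, mu=1.0, num_k=150)  # 301 atoms for 200-sample frames
gram = D.T @ D
off_diagonal = np.abs(gram - np.eye(D.shape[1]))
print(D.shape, off_diagonal.max())           # non-zero maximum: atoms are not orthogonal
```

Because the dictionary has more atoms than samples, the sampled atoms cannot all be mutually orthogonal, which the non-zero off-diagonal entries of the Gram matrix make visible.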

Concisely, the presented dictionary is codetermined by the sinusoidal atoms (given in (7)) and the parameter μ*. The structure of the sinusoidal atoms is based on the K–L expansion. It is the general paradigm of the dictionary, thus can be shared by both the compression and the recovering part. On the other hand, μ* is affected by the character of each frame. It determines the detailed structure of the current dictionary. Hence, with μ*, the corresponding dictionary can be rebuilt to recover the original speech in the recovering part.

Three types of speech frame (unvoiced, voiced, and transition sound) and their corresponding sparse vectors over the K–L incoherent dictionaries are shown in Figure 1. Here, k is set equal to the length of a speech frame.
Figure 1. Sparsity of three types of speech over the K–L incoherent dictionary. The horizontal axes stand for the amplitudes of signals, the vertical axes stand for the indices of vectors.

3. Proposed encryption scheme

This section details the specification of the proposed scheme. Section 3.1 presents the derivation of the proposed encryption scheme. The design of the scrambling matrix is addressed in Section 3.2. Section 3.3 describes the decryption and recovery process.

3.1. CS-based scrambling

According to Candès and Wakin [17], the implication of sparsity is clear: when a signal has a sparse expansion, one can discard the small coefficients without much perceptual loss. Hence, some minor but non-zero entries of the sparse vectors can be discarded before measuring to further reduce the compression rate. In addition, they prove that, for a K-sparse signal $s \in \mathbb{R}^n$ and a fixed basis $\Phi \in \mathbb{R}^{m \times n}$ with atoms selected uniformly at random, $s$ can be reconstructed exactly from the measurements $y = \Phi s \in \mathbb{R}^m$ ($m \ll n$) with overwhelming probability, as long as the number of observations obeys
$$m \geq C \cdot K \cdot \log n \tag{9}$$

for some real positive constant C. In this case, the original speech is compressed and the compression ratio is m:n. Here, we concentrate on the issue of speech encryption, and the quality of reconstructed speech will be given in Section 4.
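The two steps described here, discarding minor coefficients and taking m random measurements, can be sketched as follows. The Gaussian sensing matrix, the values of `n`, `m`, and `K`, and the synthetic coefficient vector are illustrative assumptions, not parameters reported in the article.

```python
import numpy as np

def keep_largest(s, K):
    """Zero all but the K largest-magnitude entries of the coefficient vector s."""
    s = np.asarray(s, dtype=float)
    out = np.zeros_like(s)
    if K <= 0:
        return out
    idx = np.argsort(np.abs(s))[-K:]
    out[idx] = s[idx]
    return out

rng = np.random.default_rng(0)
n, m, K = 200, 20, 4                            # compression ratio m:n = 1:10
coeffs = rng.standard_normal(n) * rng.random(n) ** 8   # stand-in for a compressible vector
s = keep_largest(coeffs, K)                     # K-sparse approximation
Phi = rng.standard_normal((m, n)) / np.sqrt(m)  # stochastic sensing matrix (assumed Gaussian)
y = Phi @ s                                     # low-dimensional measurements
```

The measurement vector `y` has length m, so the frame is transmitted with m values instead of n.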

The speech has proved to be a robust signal that can be perturbed in many different ways while remaining intelligible [24]. As depicted in Figure 2 (the compression ratio is set as 1:20), though the measurement vector exhibits some noise-like nature (Figure 2b), it is observed that the envelopes (red, dashed line) of the original speech and the compressed signal are of high similarity. This means the CS retains considerable information within the low-dimensional measurements.
Figure 2. Comparison of the envelopes (red, dashed line) of original and compressed signal.

In fact, neurons in the auditory brainstem that are sensitive to the speech envelope have been found in mammalian physiological studies [25]. Studies of the envelope extracted using the Hilbert transform reveal that the envelope is most important for speech reception, namely that words are identified according to the envelope [20]. Research on the relationship between speech envelope and auditory perception is still ongoing [26]. More recently, Mehmet Cenk et al. [21] have investigated perceptual features for automatic emotion recognition with temporal envelopes.

Given the strong connection between residual intelligibility and speech envelope, our goal is to devise an algorithm that exploits the sparsity of the speech signal while decreasing the residual intelligibility of the compressed data. In view of this consideration, the CS-based scrambling approach is employed for its straightforwardness.

Previous studies [5, 7, 8] have demonstrated the security of using a stochastic matrix as the key. Coincidentally, the sensing matrix in CS is also a stochastic matrix. It can therefore be used as the scrambling matrix to decrease the residual intelligibility of the dimension-reduced signal. Nevertheless, the dimension of the compressed signal does not match that of the sensing matrix. In other words, $y \in \mathbb{R}^m$ cannot be scrambled directly by $\Phi \in \mathbb{R}^{m \times n}$ without changing its dimension. One feasible way to solve this problem is to select a group of random atoms from $\Phi$ to form an $m \times m$ scrambling matrix, but the selection schedule would bring additional communication load. Consequently, based on compressed speech sensing, we have designed a new paradigm of scrambling matrix that aliases and scrambles two volumes of compressed data $\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$ together.

3.2. Design of the scrambler

As mentioned above, one can hardly get any information from the result alone if two independent signals x1 and x2 are aliased and scrambled in the same space. As shown in Figure 3, the original speech is therefore separated into two parts. To keep the new speech segments independent, we set every four consecutive frames (800 samples at a sampling rate of 8 kHz) as a segmentation piece by considering two facts: (1) the quasi-periodic property of speech lasts about 50 ms; (2) the auditory tolerance to delay is about 200 ms. Each cube in Figure 3 represents such a piece.
Figure 3. Sketch map of speech separation.
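A minimal sketch of the separation step is given below. The piece length of 800 samples comes from the text; the alternating assignment of pieces to the two parts is an assumption, since the exact pattern is defined only by Figure 3, and the helper names `separate` and `assemble` are hypothetical.

```python
import numpy as np

PIECE = 800  # four consecutive 200-sample frames at 8 kHz

def separate(x, piece=PIECE):
    """Assign consecutive pieces alternately to two parts (assumed pattern)."""
    x = np.asarray(x)
    pieces = [x[i:i + piece] for i in range(0, len(x), piece)]
    x1 = np.concatenate(pieces[0::2]) if pieces[0::2] else x[:0]
    x2 = np.concatenate(pieces[1::2]) if pieces[1::2] else x[:0]
    return x1, x2

def assemble(x1, x2, piece=PIECE):
    """Inverse of separate(): re-interleave the pieces in their original order."""
    p1 = [x1[i:i + piece] for i in range(0, len(x1), piece)]
    p2 = [x2[i:i + piece] for i in range(0, len(x2), piece)]
    out = []
    for i in range(max(len(p1), len(p2))):
        if i < len(p1):
            out.append(p1[i])
        if i < len(p2):
            out.append(p2[i])
    return np.concatenate(out) if out else np.asarray(x1)[:0]

speech = np.arange(2800)            # stand-in for 350 ms of samples
part1, part2 = separate(speech)
restored = assemble(part1, part2)
```

The same `assemble` routine would be applied at the receiver after the two parts have been reconstructed.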

Next, the new signals are sparsely represented over the corresponding K–L incoherent dictionaries, and then measured by different stochastic matrices to get the compressed signals individually. Since using only one fixed matrix does not always hold the restricted isometry property [17] and will result in undesirable reconstructions, we randomly choose the matrices from the stochastic matrix dictionary B={Φ j : j=1, 2, …, L}.

Such matrix dictionary is constructed in advance. During the construction process, each randomly generated stochastic matrix is tested with a group of different speech frames. If the correct reconstruction rate of this matrix is acceptable, it is chosen as a dictionary atom. In this study, we set the accepting threshold of correct reconstruction rate as 80%. As a matter of fact, almost all random matrices are CS matrices [15], thus the number of atoms L in the dictionary can be set according to practical requirements.
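The construction of the matrix dictionary B can be sketched as an accept/reject loop. The 80% threshold is taken from the text; everything else is an assumption for illustration: synthetic K-sparse vectors stand in for the test speech frames, the matrices are Gaussian, and the names `omp`, `reconstruction_rate`, and `build_matrix_dictionary` are hypothetical.

```python
import numpy as np

def omp(A, y, K):
    """Orthogonal matching pursuit: greedy K-sparse solution of y = A s."""
    residual, support, coef = y.copy(), [], np.zeros(0)
    for _ in range(K):
        if np.linalg.norm(residual) < 1e-12:
            break
        support.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    s = np.zeros(A.shape[1])
    s[support] = coef
    return s

def reconstruction_rate(Phi, K, trials, rng, tol=1e-6):
    """Fraction of random K-sparse test vectors that OMP recovers from Phi @ s."""
    n = Phi.shape[1]
    hits = 0
    for _ in range(trials):
        s = np.zeros(n)
        s[rng.choice(n, K, replace=False)] = rng.standard_normal(K)
        error = np.linalg.norm(omp(Phi, Phi @ s, K) - s)
        hits += int(error <= tol * np.linalg.norm(s))
    return hits / trials

def build_matrix_dictionary(L, m, n, K, rng, threshold=0.8, trials=50, max_tries=1000):
    """Keep randomly generated matrices whose reconstruction rate reaches the threshold."""
    B = []
    for _ in range(max_tries):
        if len(B) == L:
            return B
        Phi = rng.standard_normal((m, n)) / np.sqrt(m)
        if reconstruction_rate(Phi, K, trials, rng) >= threshold:
            B.append(Phi)
    if len(B) == L:
        return B
    raise RuntimeError("too few matrices passed the acceptance test")

rng = np.random.default_rng(0)
B = build_matrix_dictionary(L=3, m=50, n=100, K=3, rng=rng)
```

In practice the acceptance test would use real speech frames represented over the K–L dictionary rather than synthetic sparse vectors.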

The compressed data $y_i$ are pre-reconstructed until the finally selected sensing matrices $\Phi_i$, $i = 1, 2$, ensure precise reconstruction. Let $\alpha_i$, $i = 1, 2$, denote the normalized indices of these matrices in the dictionary, and let $D_i$, $i = 1, 2$, stand for the sparsifying matrices of the two speech parts. Then, the encrypted signal $y^D$ is obtained by aliasing and scrambling the two low-dimensional measurement vectors with the selected matrices $\Phi_i$:
$$y^D = f(y_1, y_2; \Phi_1, \Phi_2) \tag{10}$$

Subsequently, one heuristic approach is to design a scrambling matrix schedule that is of high security, together with its inverse operator for decryption. Owing to the independence of the two speech parts, their corresponding measurements $y_i = \Phi_i D_i^T x_i$, $i = 1, 2$, are also incoherent. We can remove either of them by its orthocomplement without damaging the other. Without loss of generality, we take $y_1$ for the following illustration.

Assume there exists a vector $z$ from $(\Phi_1^T)^{\perp}$ that is orthogonal to $\Phi_1^T$, i.e., $z^T \Phi_1 = 0$. Multiplying the encrypted data $y^D$ by $z$ yields
$$z^T y_1 = z^T \Phi_1 D_1^T x_1 = 0 \tag{11}$$

thus the $y_1$ part contained in $y^D$ is eliminated. Then, we can reconstruct $x_1$, $x_2$ by reconstruction algorithms and assemble them to obtain the recovered speech $\hat{x}$.

Since the objects operated on are matrices and vectors in practice, the orthocomplement design problem becomes one of designing orthogonal vectors. Following linear space theory [27], $z$ can be expressed as a linear combination of the vectors in the non-trivial NSB of $\Phi_1$, denoted as Null($\Phi_1$). The rank of the stochastic matrix $\Phi_1 \in \mathbb{R}^{m \times n}$ ($m < n$) is $m$. However, the dimensions of $\Phi_1$ and the null space of $\Phi_1^H$ have the following relationship
$$\dim \mathrm{Null}(\Phi_1^T) = m - \dim(\Phi_1^T) = 0 \tag{12}$$
According to (12), if we choose $\Phi_i$ to scramble $y_1$ and $y_2$ directly, the two measurement parts cannot be separated for decryption, since this null space contains only the zero vector. To ensure the inverse operation, we construct a non-full row rank matrix $\Phi_1^D = \begin{bmatrix} \Phi_1 \\ \Phi_1 \end{bmatrix} \in \mathbb{R}^{2m \times n}$. The dimension (rank) of $\Phi_1^D$ is $m$. By employing the conclusion drawn from [27], we have
$$\dim(\Phi_1^D) + \dim \mathrm{Null}\big((\Phi_1^D)^H\big) = 2m \tag{13}$$
$$\dim \mathrm{Null}\big((\Phi_1^D)^H\big) = 2m - \dim(\Phi_1^D) = m \tag{14}$$
This ensures the existence of a non-trivial null space of $(\Phi_1^D)^H$, which can be constructed by SVD: a matrix $\Phi_1^D \in \mathbb{R}^{2m \times n}$ with rank($\Phi_1^D$) = $m$ can be decomposed as
$$\Phi_1^D = U \Sigma V^H \tag{15}$$
Then the $m$ left singular vectors $\{u_{m+1}, u_{m+2}, \ldots, u_{2m}\}$ that correspond to the zero singular values form an orthonormal basis of the null space of the conjugate transpose matrix $(\Phi_1^D)^H$, that is
$$\mathrm{Null}\big((\Phi_1^D)^H\big) = \mathrm{Span}\{u_{m+1}, u_{m+2}, \ldots, u_{2m}\} \tag{16}$$

where $U \in \mathbb{R}^{2m \times 2m}$ and $V \in \mathbb{R}^{n \times n}$ are unitary matrices. Here $\Sigma = \begin{bmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{bmatrix}$, where $\Sigma_1 = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_m)$, and $\sigma_i$ are the non-zero singular values of $\Phi_1^D$, i.e., the square roots of the non-zero eigenvalues of $(\Phi_1^D)^H \Phi_1^D$.

With (16), we obtain the NSB of $(\Phi_1^D)^H$: $N \in \mathbb{R}^{2m \times m}$, denoted as $N = \begin{bmatrix} N_1 \\ N_2 \end{bmatrix}$ with the property $N_1 = -N_2$. $N_1$ and $N_2$ are full rank matrices and therefore have inverse matrices, as proved in the Appendix. In this case, the matrix $\Phi_1^D$ provides an available NSB. In other words, if we use the NSB matrix to construct a scrambling matrix for the aligned measurements $\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$, the inverse operator for decryption is available.
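The properties stated for the NSB can be checked numerically. The sketch below is illustrative only: it uses a Gaussian matrix as a stand-in for Φ1 and example sizes m = 10, n = 200, and relies on `numpy.linalg.svd` returning the full 2m×2m matrix U with singular values in descending order.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 10, 200
Phi1 = rng.standard_normal((m, n))        # stand-in stochastic matrix of rank m
Phi1_D = np.vstack([Phi1, Phi1])          # aligned matrix, 2m x n, rank m

U, sigma, Vh = np.linalg.svd(Phi1_D)      # full SVD: U is 2m x 2m
N = U[:, m:]                              # left singular vectors of the zero singular values
N1, N2 = N[:m, :], N[m:, :]

print(np.linalg.matrix_rank(Phi1_D))      # rank of the aligned matrix
print(np.abs(N.T @ Phi1_D).max())         # close to zero: N spans Null((Phi1_D)^H)
print(np.abs(N1 + N2).max())              # close to zero: N1 = -N2
```

Since every vector of the left null space has the form [a; −a], any basis returned by the SVD satisfies N1 = −N2, and both blocks are invertible scaled orthogonal matrices.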

For this consideration, the scrambling matrix is designed as $S = \begin{bmatrix} \alpha_1 I & N_1 \\ N_2^{-1} & \alpha_2 I \end{bmatrix} \in \mathbb{R}^{2m \times 2m}$ to alias and disorder the measurements $y_1$, $y_2$. Here, $\alpha_i$ are the corresponding normalized indices of $\Phi_i$ in the dictionary B, so that every frame of the encrypted signal is aliased in a different proportion to enhance the encryption complexity. The identity matrix $I \in \mathbb{R}^{m \times m}$ is used to match the dimensions of $S$ and $\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$. The final encrypted signal $y^D$ is given in (17). Every compressed part $y_i$ is scrambled and the scrambled data are aliased with each other. The whole process of encryption is shown in Figure 4.
$$y^D = \begin{bmatrix} \alpha_1 I & N_1 \\ N_2^{-1} & \alpha_2 I \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} \alpha_1 y_1 + N_1 y_2 \\ N_2^{-1} y_1 + \alpha_2 y_2 \end{bmatrix} \tag{17}$$
Figure 4. The overall framework of encryption.

For communication, the encrypted data y D , the dictionary parameters μ i , and the indices α i of sensing matrix are transmitted to the decryption end. Then the decryption operator can be constructed by Φ i . As such, the stochastic matrix dictionary B is also the key book shared by both the encryption and the decryption parts. This key book will never be exposed in the channel. In practice, it is irregularly updated and the number of its atoms L can adaptively be set to meet the requirements of the system.

3.3. Decryption

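Figure 5 depicts the decryption procedure. We get $\Phi_1$ with the index $\alpha_1$ from the transmitted data, and construct the matrix $S' = \begin{bmatrix} N_2^{-1} & -\alpha_1 I \end{bmatrix}$ with $\Phi_1^D$ by SVD. The $y_1$ part is removed as follows.
$$S' y^D = \begin{bmatrix} N_2^{-1} & -\alpha_1 I \end{bmatrix} \begin{bmatrix} \alpha_1 y_1 + N_1 y_2 \\ N_2^{-1} y_1 + \alpha_2 y_2 \end{bmatrix} = \left( N_2^{-1} N_1 - \alpha_1 \alpha_2 I \right) y_2 = t \in \mathbb{R}^{m \times 1} \tag{18}$$
Figure 5. The overall framework of decryption.

As mentioned above, $N_1$ and $N_2$ are full rank matrices. With the inverse matrix $N_2^{-1}$, we can get $y_2$ by multiplying the $y_1$-removed data $t$ by the matrix $\left( N_2^{-1} N_1 - \alpha_1 \alpha_2 I \right)^{-1}$. Analogously, $y_1$ is decrypted as
$$y_1 = \left( \alpha_1 \alpha_2 I - N_1 N_2^{-1} \right)^{-1} \begin{bmatrix} \alpha_2 I & -N_1 \end{bmatrix} y^D \tag{19}$$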

When the measurement vectors y i and the dictionary parameters μ i are derived, the two speech parts can be recovered using OMP algorithm [22]. Finally, by assembling the two reconstructed parts, we obtain the recovered speech x ^ .
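The scrambling and its inverse can be exercised end to end with a short sketch. It assumes the block form S = [[α1·I, N1], [N2⁻¹, α2·I]] and eliminates each part by a row combination of the two halves of the encrypted vector; the sizes, the values of `a1` and `a2`, and the function names are illustrative assumptions, and the measurement vectors are random stand-ins.

```python
import numpy as np

def null_space_blocks(Phi):
    """N1, N2 from the left null space of the aligned matrix [Phi; Phi]."""
    m = Phi.shape[0]
    U, _, _ = np.linalg.svd(np.vstack([Phi, Phi]))
    N = U[:, m:]
    return N[:m, :], N[m:, :]

def encrypt(y1, y2, N1, N2, a1, a2):
    """y_D = S [y1; y2] with S = [[a1*I, N1], [inv(N2), a2*I]]."""
    I = np.eye(len(y1))
    S = np.block([[a1 * I, N1], [np.linalg.inv(N2), a2 * I]])
    return S @ np.concatenate([y1, y2])

def decrypt(yD, N1, N2, a1, a2):
    """Eliminate one part at a time, then invert the remaining m x m operator."""
    I = np.eye(len(yD) // 2)
    N2_inv = np.linalg.inv(N2)
    t = np.hstack([N2_inv, -a1 * I]) @ yD              # y1 part removed
    y2 = np.linalg.solve(N2_inv @ N1 - a1 * a2 * I, t)
    u = np.hstack([a2 * I, -N1]) @ yD                  # y2 part removed
    y1 = np.linalg.solve(a1 * a2 * I - N1 @ N2_inv, u)
    return y1, y2

rng = np.random.default_rng(2)
m, n = 10, 200
Phi1 = rng.standard_normal((m, n))                     # stand-in for the selected key matrix
N1, N2 = null_space_blocks(Phi1)
y1, y2 = rng.standard_normal(m), rng.standard_normal(m)
yD = encrypt(y1, y2, N1, N2, a1=0.3, a2=0.7)
r1, r2 = decrypt(yD, N1, N2, a1=0.3, a2=0.7)
```

Both ends derive N1 and N2 from the same key matrix, so only the encrypted vector and the indices need to travel over the channel.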

In theory, without the key book B, it is very hard for the eavesdroppers to decipher the encrypted signal even though they have cryptanalyzed the encryption mechanism. The strength of security is discussed in the following section.

4. Experimental results and discussions

In this section, the performance of the proposed encryption scheme is evaluated from three perspectives: (1) the residual intelligibility of the encrypted signal; (2) the strength of security; (3) resistance to hostile attacks. We test over 20,000 frames of speech from several speakers with different characteristics (gender, age, pitch, regional accent). These test signals are taken randomly from the TIMIT database and are sampled at 8 kHz with a frame length of 25 ms, that is, 200 samples per frame. In Section 4.1, residual intelligibility test results and discussions are presented. Section 4.2 analyzes the strength of security, and a possible deciphering technique is considered. Section 4.3 verifies the robustness of the proposed scheme under two conditions: in the presence of noise and under low-pass filtering (LPF).

4.1. Residual intelligibility

The amount of intelligibility left over in the encrypted signal is measured by the envelope relevance between the original speech and the processed signal, given as
$$\rho = \frac{\langle E_o, E_p \rangle}{\left( \|E_o\|^2 \, \|E_p\|^2 \right)^{1/2}} \tag{20}$$

where $E_o$ and $E_p$ denote the envelopes of the original speech and the processed signal, respectively. Because the inner product in (20) requires vectors of equal length, we interpolate $E_p$ to the same dimension as $E_o$. We test two kinds of processed signal: the compressed signal (CoS) and the encrypted signal (ES).
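The envelope relevance in (20) reduces to a normalized inner product once the two envelopes have the same length. The sketch below assumes the envelopes are already available (for example as Hilbert-transform magnitudes) and uses linear interpolation, which is an assumption since the interpolation method is not specified.

```python
import numpy as np

def envelope_relevance(E_o, E_p):
    """rho = <E_o, E_p> / sqrt(||E_o||^2 ||E_p||^2), with E_p interpolated to len(E_o)."""
    E_o = np.asarray(E_o, dtype=float)
    E_p = np.asarray(E_p, dtype=float)
    grid_o = np.linspace(0.0, 1.0, len(E_o))
    grid_p = np.linspace(0.0, 1.0, len(E_p))
    E_p = np.interp(grid_o, grid_p, E_p)   # linear interpolation (assumed)
    return float(np.dot(E_o, E_p) / np.sqrt(np.dot(E_o, E_o) * np.dot(E_p, E_p)))

rho_same = envelope_relevance([1.0, 2.0, 3.0], [1.0, 3.0])   # same shape after interpolation
rho_orth = envelope_relevance([1.0, 0.0], [0.0, 1.0])        # no overlap
```

Values near 1 indicate that the processed signal still follows the original envelope; values near 0 indicate that the envelope information has been removed.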

According to the experimental statistics, when the compression rate (m/n) is above 5%, the salient information of speech can be captured, and acceptable reconstruction quality is obtained with the K–L incoherent dictionary. On the other hand, although reconstruction quality improves as the compression rate increases, a high compression rate defeats the purpose of signal compression. Therefore, the average residual intelligibilities are evaluated at compression rates ranging from 5 to 10%.

As seen in Table 1, despite their noise-like nature, the low-dimensional measurements still retain considerable information about the original speech. The envelope relevance between the aliased, scrambled signal and the original speech is dramatically lower, indicating much lower residual intelligibility. In the meantime, the compression rate and the residual intelligibility are not markedly related, which means one can choose the compression ratio adaptively without increasing the residual intelligibility. In addition, the mean opinion score (MOS) recommended by ITU-T P.862 [28] for predicting the quality of narrow-band speech is adopted to evaluate the perceptual quality of the recovered speech, and the results are also presented in Table 1 to illustrate the relationship between speech quality and compression rate.
Table 1. Residual intelligibilities of processed signals at different compression rates and the corresponding MOS scores of the recovered speech

Compression rate (%)

4.2. Strength of security

Following Shannon’s landmark article [14], most of the literature on key generation may roughly be categorized into four basic approaches: the information theory approach, the system theory approach, the complexity theory approach, and the stochastic approach. Considering the key schedule, our encryption scheme belongs to the stochastic approach, and its security is generally measured by the scale of the keyspace. Thus, the keyspace of the proposed scheme is analyzed in two cases to evaluate its strength of security.

First, we consider the optimal case, in which the scrambling mechanism is entirely unknown to the unauthorized listener, including the key schedule, the structure of the scrambling matrix, and the sparsifying dictionary. Consequently, the essential steps for decryption are given by
$$y^D \xrightarrow{\ S\ \text{unknown}\ } y_i \xrightarrow{\ \Phi_i\ \text{unknown}\ } s_i \xrightarrow{\ D_i\ \text{unknown}\ } \hat{x}_i \rightarrow \hat{x} \tag{21}$$

In this case, all three unknown matrices can be regarded as the key, since one cannot obtain the original information if any one of them is missing. Meanwhile, the extremely low residual intelligibility and the noise-like nature of the encrypted signal hamper statistical analysis methods [29]. Therefore, the most feasible way is to search for the key, and the complexity of the search is determined directly by the scale of the keyspace. Given the dimensions of $S$, $\Phi_i$, and $D_i$, their keyspace sizes are $O(10^{(2m)^2})$, $O(10^{mn})$, and $O(10^{mn})$, respectively. These are also the computational complexities of the search process.

In the second case, we assume that an eavesdropper has complete knowledge of the system and has the necessary hardware to synchronize and isolate the frames. In other words, he knows that the scrambling matrix $S$ can be constructed from $\Phi_i$ and that the sparsifying matrix $D_i$ can simply be rebuilt from its characteristic structure. Hence, the security of the system is assumed to reside entirely in the selection of a key $\Phi_i$. For the eavesdropper, the only task is to find the key. Owing to the randomness of $\Phi_i$, the keyspace size is $O(10^{mn})$. In practical situations, the speech frames are of length n = 180–220, and if we choose the compression rate as 5%, the length of the compressed signal is m = 9–11. Therefore, the order of magnitude of the keyspace is about $10^{2000}$.
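The order of magnitude quoted above follows directly from the exponent mn. A short check, with the helper name `keyspace_exponent` introduced here for illustration:

```python
def keyspace_exponent(n, rate_percent):
    """Return (m, m*n) for frame length n and compression rate m/n given in percent."""
    m = n * rate_percent // 100
    return m, m * n

results = {n: keyspace_exponent(n, 5) for n in (180, 200, 220)}
for n, (m, exponent) in results.items():
    print(f"n = {n}, m = {m}, keyspace ~ 10^{exponent}")
# n = 180, m = 9, keyspace ~ 10^1620
# n = 200, m = 10, keyspace ~ 10^2000
# n = 220, m = 11, keyspace ~ 10^2420
```

The figure of about 10^2000 corresponds to the mid-range frame length n = 200 with m = 10.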

Table 2 compares the keyspace sizes and the compression ratios of the proposed scheme and some prior scramblers, which employ representative key schedules, including Hadamard matrix [4], Latin Square [6], dimensional-variable matrix [7], and chaos system [12]. As seen in Table 2, the proposed scheme provides larger keyspace and requires lower communication overhead.
Table 2. Comparison of keyspace size and compression ratio

Schemes: Hadamard [4]; Latin Square [6]; ASVDS [7]; Chaos [12]

Keyspace size: O(10n 2); O(10 n (n!)2

Compression ratio

Now let us consider a possible deciphering technique based on dictionary learning, regardless of deciphering delay. A cryptanalyst trying to break the system may possess large amounts of encrypted signal but not the key book, since it is held by both the encryption and decryption parts and never transmitted through the channel. He knows the complete specification of the system (scrambling mechanism, structure of the sparsifying dictionary) and would like to deduce the key without considering the real-time requirement.

In mathematical language, the j-th encryption operation is represented by
$$y^D(j) = S(j) \begin{bmatrix} y_1(j) \\ y_2(j) \end{bmatrix} \tag{22}$$

By wiretapping, the cryptanalyst has obtained enough encrypted signal $y^D(j)$. He would like to learn $\hat{S}(j)$ from $Y^D$, where $Y^D = [y^D(1), y^D(2), \ldots, y^D(j)]$. Each $\begin{bmatrix} y_1(j) \\ y_2(j) \end{bmatrix}$ is unknown. Unfortunately, this is an optimization problem with no constraint conditions and thus cannot be solved, all the more so because the scrambling matrix $S$ is variable rather than fixed. Even if he were able to find some fixed $\hat{S}$, he still could not rebuild the sensing matrix $\Phi$. This can be verified through Equations (15) and (16).

In fact, there would not be enough data or delay tolerance for cryptanalysis. For instance, in secure communications of military information or intelligence on espionage activities, the key information is expected to be as brief as possible to ensure short durations. Therefore, dictionary learning may not be a feasible approach in real cases.

4.3. Robustness performance

Since unintelligible signals that are nevertheless readily decipherable may also be generated with a large keyspace, other factors, including bandwidth expansion, delay time, and channel resistance (to noise, distortion, etc.), cannot be ignored in assessing security. Two types of attack are performed on the encrypted signal: (1) additive white Gaussian noise (AWGN); (2) LPF.

Representative speech scrambling schemes are chosen for comparison with the CS scheme. To be specific, time-domain scrambling (TDS) [5] is adopted to stand for non-compressional scramblers. In parallel, the approximate 13 line μ-law pulse code modulation (PCM) and MELP [9] are chosen to represent waveform coding and parametric coding, respectively, among compressional scramblers.

The MOS is chosen as the subjective criterion. In addition, the average segmental signal-to-noise ratio (SNRseg) [30] is adopted as the objective criterion to evaluate the quality of recovered speech, given by (23).
$$\mathrm{SNRseg} = 10 \lg \left( \frac{1}{N_{\mathrm{frame}}} \sum_{j=1}^{N_{\mathrm{frame}}} \mathrm{SNRseg}_j \right) \tag{23}$$

where $\mathrm{SNRseg}_j = \sum_{i=1}^{n} \frac{x_j^2(i)}{\left[ x_j(i) - \hat{x}_j(i) \right]^2}$ and $N_{\mathrm{frame}}$ denotes the total number of frames. The results are calculated and averaged for a test set of approximately 100 sentences randomly selected from the TIMIT database.
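A sketch of the SNRseg computation is given below. It follows the structure of (23), taking the logarithm of the frame-averaged ratio, but as a stated assumption it uses the common frame-level ratio of signal energy to error energy for each frame; the small `eps` guard against division by zero is also an addition made here.

```python
import numpy as np

def snr_seg(x, x_hat, frame_len=200, eps=1e-12):
    """10*log10 of the frame-averaged ratio of signal energy to error energy."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    n_frames = len(x) // frame_len
    ratios = []
    for j in range(n_frames):
        seg = slice(j * frame_len, (j + 1) * frame_len)
        error_energy = np.sum((x[seg] - x_hat[seg]) ** 2)
        ratios.append(np.sum(x[seg] ** 2) / max(error_energy, eps))
    return float(10.0 * np.log10(np.mean(ratios)))

original = np.ones(400)              # two 200-sample frames
recovered = 0.9 * original           # 10% amplitude error in every sample
value_db = snr_seg(original, recovered)
```

For the toy input every frame has an energy ratio of 100, which corresponds to 20 dB.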

4.3.1. Noise resistance

AWGN is added to the encrypted signal of each scheme. The performances of the proposed and comparative schemes are compared. The compression ratio of CS is set as 1:10.

As shown in Figure 6a, it is observed that CS scheme always outperforms the comparative schemes for all degrees of contamination. As the signal-to-noise ratio (SNR) becomes higher, the superiority of CS scheme becomes more obvious and leads to a more favorable comparison; the compressional schemes, including PCM and MELP, perform worse, and these are verified by the SNRseg decrements as well (Figure 6b).
Figure 6. Comparison of scramblers in the presence of AWGN. (a) MOS decrements. (b) SNRseg decrements.

The results are mainly due to the use of a stochastic matrix, which has extremely low column coherence. The studies [31, 32] have shown that for a noiseless signal $y = \Phi s$, if the K-sparse vector $s$ satisfies $\|s\|_0 < \frac{1}{2}\,\mathrm{spark}(\Phi)$, then the reconstruction $\hat{s}$ from the contaminated measurements $(y + e)$ satisfies
$$\|s - \hat{s}\|_2^2 \le O\!\left( \frac{\|e\|_2^2}{1 - M(2K - 1)} \right) \tag{24}$$
where spark($\Phi$) stands for the minimum number of columns of $\Phi$ that are linearly dependent, and $M$ is the “mutual coherence” of $\Phi$ [31], defined as
$$M = \max_{1 \le i, j \le n,\ i \ne j} \frac{\left| \Phi_i^T \Phi_j \right|}{\|\Phi_i\|_2 \, \|\Phi_j\|_2} \tag{25}$$

In short, one can stably reconstruct the sparse vector $s$ with error proportional to the noise level, provided that (1) the columns of the sensing matrix $\Phi$ are weakly mutually correlated, and (2) the vector $s$ is sufficiently sparse. The reason why the reconstruction error is bounded is geometric in nature: the reconstruction $\hat{s}$ is confined within a tiny tubular wedge that surrounds the original vector $s$, which ensures the stability of recovery (for further details please refer to [31], subsection 5.3). In particular, the method of convex relaxation can identify a sparse signal in AWGN [33], and more sophisticated stable recovery schemes and bounds have been investigated [34].
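The mutual coherence in (25) and the denominator of the bound in (24) are straightforward to evaluate for a given matrix. In the sketch below, the Gaussian matrix and its size are illustrative assumptions, and `bound_denominator` is a hypothetical helper name.

```python
import numpy as np

def mutual_coherence(Phi):
    """M = max over i != j of |Phi_i^T Phi_j| / (||Phi_i||_2 ||Phi_j||_2), columns Phi_i."""
    cols = Phi / np.linalg.norm(Phi, axis=0)
    gram = np.abs(cols.T @ cols)
    np.fill_diagonal(gram, 0.0)
    return float(gram.max())

def bound_denominator(M, K):
    """1 - M(2K - 1); the error bound in (24) is meaningful only when this is positive."""
    return 1.0 - M * (2 * K - 1)

rng = np.random.default_rng(3)
Phi = rng.standard_normal((20, 200))     # stand-in sensing matrix at ratio 1:10
M = mutual_coherence(Phi)
print(M, bound_denominator(M, K=1))
```

The smaller M is, the larger the sparsity level K for which the denominator stays positive, which is the sense in which low column coherence supports stable recovery.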

Technically speaking, the waveform coding scheme has no noise-resistance mechanism and is thus vulnerable to noise. For the parametric coding scheme, noise introduces errors into feature parameters such as the pitch period and the voiced/unvoiced decision. Once such parameters are contaminated, undesirable reconstruction distortion occurs. For these reasons, the noise resistance of the proposed scheme is better than that of its counterparts.

4.3.2. LPF

For LPF, the MOS and SNRseg decrements between the speech reconstructed from the filtered encrypted data and the speech reconstructed from the original encrypted data are compared. Again, the compression ratio of CS is set to 1:10.
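One simple way to emulate such a low-pass attack on an encrypted signal is an ideal (brick-wall) filter in the frequency domain, sketched below. The function name `lowpass_fft`, the 8 kHz sampling rate, and the two-tone test signal are assumptions for illustration; the article does not specify the filter that was used.

```python
import numpy as np

def lowpass_fft(x, cutoff_hz, fs):
    """Zero every spectral component above cutoff_hz (ideal low-pass filter)."""
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

fs = 8000                                   # assumed sampling rate in Hz
t = np.arange(fs) / fs                      # one second of samples
low = np.sin(2 * np.pi * 1000 * t)          # component below the cutoff
high = np.sin(2 * np.pi * 3000 * t)         # component above the cutoff
filtered = lowpass_fft(low + high, cutoff_hz=2500, fs=fs)
```

With a 2500 Hz cutoff, the 3000 Hz tone is removed and the 1000 Hz tone passes unchanged.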

As seen in Figure 7, the CS scheme slightly outperforms the comparative schemes when the cutoff frequency ranges from 2400 to 2600 Hz. Its advantage grows as the cutoff frequency increases and eventually becomes clearly evident. The TDS scheme ranks second and the PCM scheme shows the worst performance.
Figure 7

Comparison of scramblers for LPF performances. (a) MOS decrements. (b) SNRseg decrements.

The encrypted signal obtained from CS has had its speech characteristics largely removed by aliasing and scrambling; thus, it has the best resistance to LPF. In contrast, the TDS scheme still retains some speech structure, which makes its performance inferior to that of the CS scheme. The TDS scheme nevertheless outperforms the other two comparative schemes owing to its robustness to time domain perturbations [24]. As for the PCM and MELP schemes, their encoded signals have no spectral structure. All parts of the signal are of similar importance and are therefore vulnerable to this type of attack. Any damage to the encrypted signal brings about serious reconstruction errors and degrades the auditory quality. As a consequence, these two schemes have the worst robustness to LPF.

5. Conclusions

This article presented a scrambling-based speech encryption algorithm via CS. A high degree of security is achieved owing to low residual intelligibility and a large keyspace. The immense complexity of finding the scrambling matrix ensures the effectiveness of the encryption. The scheme affords notable robustness to common hostile attacks, while requiring lower communication costs and introducing only a slight (about 200 ms) processing delay. The experimental results demonstrate the improved performance of the scheme compared with state-of-the-art speech scramblers. As future work, we plan to investigate more sophisticated sparse representations of speech and reconstruction algorithms to further reduce the compression ratio and improve the auditory quality of the recovered speech.


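Appendix

The proof of the full rank property mentioned in Section 3.2 is given as follows.

For a full row rank matrix $P \in \mathbb{R}^{m \times n}$ $(m < n)$ with rank(P) = m, denote its SVD as $P = U \Sigma V^H$, where $\Sigma = [\Sigma_m \;\; O_{m \times (n-m)}] \in \mathbb{R}^{m \times n}$ and $\Sigma_m = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_m)$. Here U and V are m×m and n×n unitary matrices, respectively, and $\sigma_i > 0$ denote the square roots of the non-zero eigenvalues of $P^H P$. With the SVD of P, we have
$$P^H P = V \Sigma^H U^H U \Sigma V^H = V \begin{bmatrix} \Sigma_m^2 & O_{m \times (n-m)} \\ O_{(n-m) \times m} & O_{(n-m) \times (n-m)} \end{bmatrix} V^H \tag{26}$$
Then, for $Q = \begin{bmatrix} P \\ P \end{bmatrix} = U' \Sigma' V'^H \in \mathbb{R}^{2m \times n}$ with rank(Q) = m, $Q^H Q$ is represented as
$$Q^H Q = \begin{bmatrix} P^H & P^H \end{bmatrix} \begin{bmatrix} P \\ P \end{bmatrix} = 2 P^H P = 2 V' \begin{bmatrix} \Sigma_m^2 & O_{m \times (n-m)} \\ O_{(n-m) \times m} & O_{(n-m) \times (n-m)} \end{bmatrix} V'^H \tag{27}$$
where $U' \in \mathbb{R}^{2m \times 2m}$, $\Sigma' \in \mathbb{R}^{2m \times n}$, and $V' \in \mathbb{R}^{n \times n}$. Comparing (26) and (27), it is noticed that $V' = V$, $\Sigma' = \begin{bmatrix} \Sigma_m & O_{m \times (n-m)} \\ O_{m \times m} & O_{m \times (n-m)} \end{bmatrix}$, and $U' U'^H = 2 I_{2m}$, where $I_{2m}$ denotes the 2m×2m identity matrix. In this case, if there is a proper U′, the SVD of Q can be obtained. It is verified that $U' = \begin{bmatrix} U & U \\ U & -U \end{bmatrix}$ satisfies $U' U'^H = 2 I_{2m}$; then (27) is rewritten as
$$Q^H Q = 2 V' \begin{bmatrix} \Sigma_m^2 & O_{m \times (n-m)} \\ O_{(n-m) \times m} & O_{(n-m) \times (n-m)} \end{bmatrix} V'^H = 2 V' \Sigma'^H \Sigma' V'^H = V' \Sigma'^H (2 I_{2m}) \Sigma' V'^H = (U' \Sigma' V'^H)^H (U' \Sigma' V'^H)$$
Thus, with the SVD $Q = U' \Sigma' V'^H$, the null space basis of $Q^H$ can be constructed:
$$\mathrm{Null}(Q^H) = \mathrm{Span}\{u'_{m+1}, u'_{m+2}, \ldots, u'_{2m}\} = \begin{bmatrix} U \\ -U \end{bmatrix} = \begin{bmatrix} N_1 \\ N_2 \end{bmatrix} \in \mathbb{R}^{2m \times m}$$

where $u'_{m+i}$ denotes the (m+i)-th column vector of U′.

Owing to the randomness of P, $\dim[\mathrm{Null}(Q^H)] = m$; namely, the NSB matrix $N = \begin{bmatrix} N_1 \\ N_2 \end{bmatrix}$ is a full column rank matrix.

By elementary row operations, N can be converted to the form $\begin{bmatrix} N_1 \\ O \end{bmatrix}$ without changing its rank. Therefore, rank($N_1$) = m, which means that $N_i$, i = 1, 2, are full rank matrices and possess inverse matrices. This completes the proof.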
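The construction used in the proof can be verified numerically. The sketch below draws a random real Gaussian matrix P as a stand-in for the scheme's random matrix (an assumption), builds U′ and Σ′ from the SVD of P, and returns the stacked matrix Q with the two blocks of the null space basis. The function name `null_space_blocks` and the 4×10 size are illustrative.

```python
import numpy as np

def null_space_blocks(P):
    """Build Q = [P; P] and the NSB blocks N1 = U, N2 = -U from the SVD of P."""
    m, n = P.shape
    U, sigma, Vh = np.linalg.svd(P, full_matrices=True)  # U: m x m, Vh: n x n
    U_prime = np.block([[U, U], [U, -U]])                # satisfies U' U'^H = 2 I
    Sigma_prime = np.zeros((2 * m, n))
    Sigma_prime[:m, :m] = np.diag(sigma)
    Q = np.vstack([P, P])
    # The factorization Q = U' Sigma' V^H from the proof should hold.
    if not np.allclose(U_prime @ Sigma_prime @ Vh, Q):
        raise RuntimeError("factorization check failed")
    N = U_prime[:, m:]                                   # last m columns of U'
    return Q, N[:m, :], N[m:, :]

rng = np.random.default_rng(0)
P = rng.standard_normal((4, 10))                         # random full row rank matrix
Q, N1, N2 = null_space_blocks(P)
N = np.vstack([N1, N2])

residual = np.max(np.abs(Q.T @ N))                       # should be numerically zero
rank_N1 = np.linalg.matrix_rank(N1)
rank_N2 = np.linalg.matrix_rank(N2)
```

For real matrices the conjugate transpose reduces to the ordinary transpose, which is why `Q.T` is used in the check.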



Abbreviations

AMR: Adaptive multi-rate
AWGN: Additive white Gaussian noise
BC: Before Christ
Compressed signal
CS: Compressed sensing
Encrypted signal
LPF: Low-pass filtering
MELP: Mixed excitation linear prediction
MOS: Mean opinion score
NSB: Null space basis
OMP: Orthogonal matching pursuit
PCM: Pulse code modulation
SNR: Signal-to-noise ratio
SNRseg: Average-subsection signal-to-noise ratio
SVD: Singular value decomposition
TDS: Time-domain scrambling.



Acknowledgements

This study was supported by the National Natural Science Foundation, China (61072042), the Natural Science Foundation of Jiangsu Province, China (BK2012510), and the Pre-research Foundations of PLA University of Science and Technology (20110211). The authors sincerely thank Professor Shou-sheng Liu and Professor Zu-ping Qian for their useful discussions and valuable suggestions. The authors would also like to thank the anonymous reviewers for their constructive comments and questions, which greatly improved the article.

Authors’ Affiliations

(1) College of Command Information Systems, PLA University of Science and Technology, Nanjing, China
(2) College of Communications Engineering, PLA University of Science and Technology, Nanjing, China


References

  1. Clark JA: Nature-inspired cryptography: Past, Present and Future. In Congress on Evolutionary Computation. 3rd edition. Newport Beach, USA; 2003:1647-1654.
  2. Nan L, Yanhong S, Jiancheng Z: An audio scrambling method based on Fibonacci transformation. J. North China Univ. Technol. 2004, 16(3):8-11.
  3. Senk V, Delic VD, Milosevic VS: A new speech scrambling concept based on Hadamard matrices. IEEE Signal Process. Lett. 1997, 4(6):161-163.
  4. Pal SK: Fast, reliable & secure digital communication using Hadamard matrices. In Proceedings of the International Conference on Computing: Theory and Applications. 1st edition. Kolkata, India; 2007:526-532.
  5. Li H, Qin Z, Zhang XP, Shao LP: An n-dimensional space audio scrambling algorithm based on random matrix. J. Xi’an Jiaotong Univ. 2010, 44(4):13-17.
  6. Satti M, Kak S: Multilevel indexed quasi-group encryption for data and speech. IEEE Trans. Broadcast. 2009, 55(2):270-281.
  7. Li H, Qin Z: Audio scrambling algorithm based on variable dimension spaces. In International Conference on Industrial and Information Systems. 1st edition. West Bengal, India; 2009:316-319.
  8. Honggang W, Michael H, Hamid S, Peng DM, Wang W, Hsiao-Hwa C: Index-based selective audio encryption for wireless multimedia sensor networks. IEEE Trans. Multimed. 2010, 12(3):215-223.
  9. Antonio S, Juan Carlos M: Perception-based partial encryption of compressed speech. IEEE Trans. Speech Audio Process. 2002, 10(8):637-643.
  10. Pierre M, O’Sullivan JA: Information-theoretic analysis of information hiding. IEEE Trans. Inf. Theory. 2003, 49(3):563-593.
  11. Li-Lian H, Qi-tian Y: A chaos synchronization secure communication system based on output control. J. Electron. Inf. Technol. 2009, 31(10):2402-2405.
  12. Liangrui T, Lin Z, Xue Y: Chaos synchronization based on observer and its application in speech secure communication. In Proceedings of IC-NIDC. 2nd edition. Beijing, China; 2010:773-777.
  13. Del Re E, Fantacci R, Maffucci D: A new speech signal scrambling method for secure communications: theory, implementation, and security evaluation. IEEE J. Sel. Areas Commun. 1989, 7(4):474-480.
  14. Shannon CE: Communication theory of secret systems. Bell Syst. Tech. J. 1949, 28(4):656-715.
  15. Donoho DL: Compressed sensing. IEEE Trans. Inf. Theory. 2006, 52(4):1289-1306.
  16. Baraniuk RG: Lecture notes: compressive sensing. IEEE Signal Process. Mag. 2007, 24(4):118-121.
  17. Candès EJ, Wakin MB: An introduction to compressive sampling. IEEE Signal Process. Mag. 2008, 25(2):21-30.
  18. Tian-jing W, Bao-yu Z, Zhen Y: A speech signal sparse representation algorithm based on adaptive overcomplete dictionary. J. Electron. Inf. Technol. 2011, 33(10):2372-2377.
  19. Zhang-hua C, Yuan-sheng T: Secure communication based on network coding. J. Commun. 2010, 31(8A):188-194.
  20. Smith ZM, Bertrand D, Oxenham AJ: Chimaeric sounds reveal dichotomies in auditory perception. Nature. 2002, 416: 87-90.
  21. Mehmet Cenk S, Bilge G, Gunes Karabulut K: Perceptual audio features for emotional detection. EURASIP J. Audio Speech Music Process. 2012, 16: 1-21.
  22. Tropp JA, Gilbert AC: Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. Inf. Theory. 2007, 53(12):4655-4666.
  23. Navarro-Moreno J, Ruiz-Molina JC: Nonlinear estimation using correlation information. IEEE Trans. Signal Process. 2006, 54(7):2822-2827.
  24. Saberi K, Perrot DR: Cognitive restoration of reversed speech. Nature. 1999, 398: 760.
  25. Joris PX, Yin TC: Envelope coding in the lateral superior olive. I. Sensitivity to interaural time differences. J. Neurophysiol. 1995, 73: 1043-1062.
  26. Bendor D, Wang X: The neuronal representation of pitch in primate auditory cortex. Nature. 2005, 436: 1161-1165.
  27. Xian-da Z: Matrix Analysis and Applications. Tsinghua University Press, Beijing; 2004.
  28. ITU-T P.862: Perceptual Evaluation of Speech Quality (PESQ), an Objective Method for End-to-End Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs. ITU-T Recommendation, Geneva; 2001.
  29. Georgiev P, Theis F, Cichocki A: Sparse component analysis and blind source separation of underdetermined mixtures. IEEE Trans. Neural Netw. 2005, 16(4):992-996.
  30. Tingting X, Zhen Y, Xi S: Novel speech secure communication system based on information hiding and compressed sensing. The 4th International Conference on System and Networks Communications. 4th edition. 2009, 201-206.
  31. Donoho DL, Michael E, Vladimir T: Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inf. Theory. 2006, 52(1):6-18.
  32. Babaie-Zadeh M, Jutten C: On the stable recovery of the sparsest overcomplete representations in presence of noise. IEEE Trans. Signal Process. 2010, 58(10):5396-5400.
  33. Tropp JA: Just relax: convex programming methods for identifying sparse signal in noise. IEEE Trans. Inf. Theory. 2006, 52(3):1030-1051.
  34. Sun Q: Sparse approximation property and stable recovery of sparse signal from noisy measurements. IEEE Trans. Signal Process. 2011, 59(10):5086-5090.


© Zeng et al.; licensee Springer. 2012

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.