Scrambling-based speech encryption via compressed sensing

Conventional speech scramblers suffer from three disadvantages: heavy communication overhead, underexploitation of signal features, and low attack resistance. In this study, we propose a scrambling-based speech encryption scheme via compressed sensing (CS). Distinguished from conventional scramblers, our scheme addresses these problems in a unified framework by exploiting the advantages of CS. The presented encryption idea is general and applies readily to speech communication systems. Compared with state-of-the-art methods, the proposed scheme provides lower residual intelligibility and demands greater cryptanalytic effort, while ensuring efficient channel usage and notable resistance to hostile attacks. Extensive experimental results confirm the effectiveness of the proposed scheme.


Introduction
Encryption, whose history dates back to antiquity, is essential for information security in modern society [1]. Information espionage, including illegal surveillance and wiretapping, has emerged with the wide application of speech communication in national defense, trade, scientific research, and social affairs. With security an ever more vital requirement of communication systems, speech encryption has gained broad acceptance as an effective means of protection in both military and civilian applications.
Two main categories of technology have been developed for this purpose. The first is content protection through encryption, e.g., speech scramblers [2][3][4][5][6][7][8][9]; proper decryption of the data requires a key, the so-called scrambling matrix. The second is digital watermarking, which embeds messages into the multimedia data [10]. Among scramblers, time-domain sample scrambling is thus far the most attractive, because it simply takes a segment of time-domain sample values and directly scrambles them into a different segment of samples. In this article, we focus on scrambling-based encryption.
Earlier speech scramblers disorder the original signal using a specific sequence or matrix, such as a pseudorandom sequence, the Fibonacci transform [2], or the Hadamard matrix [3,4]. The main disadvantage shared by these methods is that the decryption key is fixed; given the enormous improvements in computer hardware, they can easily be deciphered. To alleviate this problem, researchers proposed new key schedules, such as the stochastic matrix [5] and the Latin square [6], to strengthen security [7,8]. However, the improved algorithms incur heavy transmission loads because they cannot compress the original signal. Consequently, speech compression methods have been integrated into the encryption process, e.g., G.729, mixed excitation linear prediction (MELP), and AMR [9]. Indeed, combining compression and scrambling yields less costly encrypted data, but parametric coding algorithms have low robustness in the presence of noise or other hostile attacks. Besides, the performance of such algorithms depends heavily on the encryption operator, and the character of speech itself is not well utilized. More recently, research on nonlinear processes has shown that chaotic signals are well suited to secure communications; however, a chaotic system is sensitive to disturbance and requires strict self-synchronization, which limits its practical application [11,12].
According to Del Re et al. [13], the degree of security (deciphering difficulty) provided by a speech encryption system is related to (1) residual intelligibility (the amount of intelligibility left over in the encrypted signal) and (2) keyspace (the number of keys available for encryption). In general, the lower a scrambling system's residual intelligibility and the larger its keyspace, the higher its degree of security. Since the proposal of the stochastic approach [14], the keyspace of a scrambler has commonly been measured by the number of encryption operators, namely scrambling matrices [3,4,6,13].
To summarize, designing a channel-saving, attack-resistant speech scrambler remains a major open issue. At the same time, such a scrambler should attain residual intelligibility as low as possible and a keyspace as large as possible to increase the cryptogram's immunity to cryptanalysis. Despite the improvements achieved by the aforementioned works, few investigations manage to address these problems simultaneously. In light of this, we apply compressed sensing (CS) [15][16][17] to speech encryption, owing to its promising capability for signal compression and its notable robustness to hostile attacks.
In this article, we tackle scrambling-based speech encryption via CS by exploiting the sparsity of speech over the Karhunen-Loeve (K-L) incoherent dictionary [18]. Distinguished from existing schemes, we scramble the dimension-reduced measurements instead of the original speech. The proposed algorithm is motivated by the following idea: if two independent signals x_1 and x_2 are aliased and scrambled in the same space by a stochastic matrix, the intelligibility of the original signal can be eliminated, and it is hard for eavesdroppers to extract any information from the mixture alone [19]. Observations show that the measurement vector of speech exhibits a noise-like nature; however, the envelope of the compressed data still retains considerable information about the original speech [20,21]. We therefore alias and scramble the two measurement vectors of the separated speech instead of transmitting the envelope-bearing compressed data directly as the ciphertext.
To be specific, the presented scheme contains two stages: encryption and decryption. At the encryption stage, we compress and encrypt the speech: the original signal is separated into two independent parts and sparsely represented over the corresponding K-L incoherent dictionaries; the sparse vectors of the two parts are then measured by stochastic matrices; finally, these low-dimensional measurements are mixed and scrambled by the scrambling matrix, which is constructed from the null space basis (NSB) of the aligned sensing matrix using the singular value decomposition (SVD). At the decryption stage, the inverse operator is constructed to eliminate each aliased measurement part individually; the separated speech parts are then reconstructed using orthogonal matching pursuit (OMP) [22] and assembled to recover the speech. Experimental results demonstrate that the encrypted signal of the proposed scheme has low transmission cost and low residual intelligibility, provides immense cryptogram immunity, and exhibits notable attack resistance.
The rest of this article is organized as follows. Section 2 introduces the K-L incoherent dictionary used to sparsely represent speech signals. Section 3 explains the encryption idea and details the encryption and decryption procedures. The residual intelligibility, encryption strength, and robustness of the proposed scheme, together with experimental results, are presented and discussed in Section 4. Conclusions are drawn in Section 5.

The sparse representation of speech
Sparse representation is a critical step of CS [16], since one can obtain the dimension-reduced signal on the basis of sparse vectors. In this section, we employ a practical sparsifying dictionary to sparsely represent the speech signal.
The K-L expansion [18] describes a stochastic process in the form of incoherent random principal components and a corresponding deterministic orthogonal basis, so the main structure of the process can be captured by a few expansion terms. Assume a real second-order-moment stochastic process {x(t), t ∈ [0, 1]}; its K-L expansion is

x(t) = Σ_k a_k φ_k(t),  (1)

where a_k = ∫_0^1 x(t)φ_k(t)dt. The orthogonal bases {φ_k(t)} are eigenfunctions of the signal autocorrelation R_x(t,u); they can be used to design the atoms of a signal dictionary. {φ_k(t)} and the corresponding eigenvalues y_k satisfy the Fredholm integral equation [18,23]

∫_0^1 R_x(t,u)φ_k(u)du = y_k φ_k(t).  (2)

For a practical signal, however, Equation (2) is hard to solve because of the complexity of the signal autocorrelation. Since the autocorrelation of a speech signal decreases rapidly within a short delay, we approximate it by the exponential function R_x(t,u) = R_x(0)e^{−μ|t−u|}, where μ is the attenuation coefficient. Substituting this R_x(t,u) into (2) yields the differential equation (3). Solving (3) with the boundary conditions and discarding the zero particular solution φ_0(t) = 0 (which cannot serve as an eigenfunction), we obtain the sinusoidal orthogonal basis given in (4). After adding φ_0(t) = 1, the complete orthogonal K-L dictionary E is given in (5), where k stands for the number of atoms in the dictionary.

For digital signal processing, the bases of E are uniformly sampled over 0 ≤ t ≤ 1. Let μ* denote the optimal value of the parameter μ; it is estimated by solving the optimization problem

μ* = argmin_μ Σ_{τ=0}^{n−1} [R_x(τ) − R̂_x(τ)]²,  (6)

where R_x(τ) = (1/(n−τ)) Σ_{i=1}^{n−τ} x(i)x(i+τ) is the unbiased estimate of the autocorrelation of a speech frame x ∈ R^n at delay τ, and R̂_x(τ) = R_x(0)e^{−μ|τ|}, τ = 0, 1, ..., n−1. The optimal discrete atoms are then e_k = [e_k(1), ..., e_k(i), ..., e_k(n)]^T, where e_k(i) is the sampled form of the sinusoidal basis given in (7). Adding e_0 = [1, ..., 1]^T, we construct the complete discrete speech dictionary D = [e_0, e_1, ..., e_k]. Discrete uniform sampling, however, alters the orthogonality of the bases {φ_k(t)}.
Although the atoms in D are not mathematically orthogonal, they are of low coherence, and D turns out to be an overcomplete incoherent dictionary.
Concisely, the presented dictionary is codetermined by the sinusoidal atoms (given in (7)) and the parameter μ*. The structure of the sinusoidal atoms is based on the K-L expansion; it is the general template of the dictionary and thus can be shared by both the compression side and the recovery side. On the other hand, μ* depends on the character of each frame and determines the detailed structure of the current dictionary. Hence, given μ*, the corresponding dictionary can be rebuilt at the recovery side to reconstruct the original speech.
Three types of speech frame (unvoiced, voiced, and transition sound) and their corresponding sparse vectors over the K-L incoherent dictionaries are shown in Figure 1. Here, k is set equal to the length of a speech frame.
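As a rough numerical illustration of the dictionary construction above, the following sketch estimates μ* per frame and builds a discrete sinusoidal dictionary. This is a minimal sketch, not the paper's exact procedure: the grid search stands in for solving (6), and plain sine atoms stand in for the exact K-L eigenfunctions of (7).

```python
import numpy as np

def estimate_mu(x, mu_grid=np.linspace(0.01, 5.0, 500)):
    """Fit R_x(0)*exp(-mu*tau) to the frame's unbiased autocorrelation
    by a coarse grid search (a stand-in for solving (6))."""
    n = len(x)
    taus = np.arange(n)
    # Unbiased autocorrelation estimate at each delay tau
    r = np.array([x[:n - t] @ x[t:] / (n - t) for t in taus])
    errs = [np.sum((r - r[0] * np.exp(-mu * taus)) ** 2) for mu in mu_grid]
    return float(mu_grid[int(np.argmin(errs))])

def kl_dictionary(n, k):
    """Discrete dictionary D: the constant atom e_0 plus k sinusoidal
    atoms sampled uniformly on (0, 1); plain sines are used here as an
    illustrative stand-in for the exact K-L eigenfunctions."""
    t = (np.arange(n) + 0.5) / n            # midpoint sampling avoids zero atoms
    atoms = [np.ones(n)] + [np.sin(j * np.pi * t) for j in range(1, k + 1)]
    D = np.stack(atoms, axis=1)             # n x (k + 1)
    return D / np.linalg.norm(D, axis=0)    # unit-norm columns
```

In practice, only μ* (plus the shared atom template) needs to be transmitted to rebuild the dictionary at the recovery side.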

Proposed encryption scheme
This section details the specification of the proposed scheme. Section 3.1 illustrates the derivation of the proposed encryption scheme. The design of the scrambling matrix is addressed in Section 3.2. Section 3.3 describes the decryption and recovery process.

CS-based scrambling
According to Candès and Wakin [17], the implication of sparsity is clear: when a signal has a sparse expansion, one can discard the small coefficients without much perceptual loss. Hence, some minor but non-zero entries of the sparse vectors can be discarded before measurement to further reduce the compression rate. In addition, they show that, for a K-sparse signal s ∈ R^n and a fixed basis Φ ∈ R^{m×n} with atoms selected uniformly at random, exact reconstruction of s from the measurements y = Φs ∈ R^m (m ≪ n) holds with overwhelming probability, as long as the number of observations obeys m ≥ C·K·log(n/K) for some real positive constant C. In this case, the original speech is compressed, with compression ratio m:n. Here we concentrate on speech encryption; the quality of the reconstructed speech is reported in Section 4. Speech has proved to be a robust signal that can be perturbed in many different ways while remaining intelligible [24]. As depicted in Figure 2 (with the compression ratio set to 1:20), though the measurement vector exhibits a noise-like nature (Figure 2b), the envelopes (red, dashed line) of the original speech and the compressed signal are highly similar. This means CS retains considerable information within the low-dimensional measurements.
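The measure-then-reconstruct pipeline above can be sketched numerically. Everything below is an illustrative choice of ours, not the paper's exact setup: a Gaussian sensing matrix, a synthetic K-sparse vector, the heuristic constant C = 2 in the observation bound, and a bare-bones OMP solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 5
m = int(np.ceil(2 * K * np.log(n / K)))          # m >= C*K*log(n/K) with C = 2

Phi = rng.standard_normal((m, n)) / np.sqrt(m)   # random Gaussian sensing matrix
s = np.zeros(n)
s[rng.choice(n, K, replace=False)] = rng.standard_normal(K)
y = Phi @ s                                      # m-dimensional measurements

def omp(Phi, y, K):
    """Bare-bones orthogonal matching pursuit: greedily pick the atom
    most correlated with the residual, then re-fit by least squares."""
    residual, support = y.copy(), []
    for _ in range(K):
        support.append(int(np.argmax(np.abs(Phi.T @ residual))))
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    s_hat = np.zeros(Phi.shape[1])
    s_hat[support] = coef
    return s_hat

s_hat = omp(Phi, y, K)
print(np.linalg.norm(s_hat - s))                 # near-perfect recovery expected
```

With n = 200 and K = 5 this gives m = 37 measurements, i.e., a compression ratio of about 1:5.4 while still permitting exact reconstruction with high probability.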
Indeed, mammalian physiological studies have found neurons in the auditory brainstem that are sensitive to the speech envelope [25]. Envelopes extracted using the Hilbert transform reveal that the envelope is paramount for speech reception; that is, words are identified largely from the envelope [20]. Research on the relationship between the speech envelope and auditory perception is still actively ongoing [26]. More recently, Mehmet Cenk et al. [21] investigated perceptual features based on temporal envelopes for automatic emotion recognition.
Given the strong connection between residual intelligibility and the speech envelope, our goal is a new algorithm that utilizes the sparsity of the speech signal while decreasing the residual intelligibility of the compressed data. With this consideration in view, the CS-based scrambling approach is employed for its straightforwardness.
Previous studies [5,7,8] have demonstrated the security of using a stochastic matrix as the key. Coincidentally, the sensing matrix in CS is also a stochastic matrix; it can therefore be used as the scrambling matrix to decrease the residual intelligibility of the dimension-reduced signal. Nevertheless, the dimension of the compressed signal does not agree with that of the sensing matrix: y ∈ R^m cannot be scrambled directly by Φ ∈ R^{m×n} without a dimensional change. One feasible fix is to select a group of random atoms from Φ to form an m×m scrambling matrix, but the selection schedule would add communication load. Consequently, building on the compressed sensing of speech, we design a new paradigm of scrambling matrix that aliases and scrambles two volumes of compressed data, [y_1^T, y_2^T]^T, together.

Design of the scrambler
As mentioned above, one can hardly extract any information from the result alone if two independent signals x_1 and x_2 are aliased and scrambled in the same space. As shown in Figure 3, the original speech is therefore separated into two parts. To ensure the independence of the new speech segments, we take every four consecutive frames (800 samples at a sampling rate of 8 kHz) as one segmentation piece, considering two facts: (1) the quasi-periodic property of speech endures for about 50 ms; (2) auditory tolerance to delay is about 200 ms. Each cube in Figure 3 represents such a piece. Next, the new signals are sparsely represented over the corresponding K-L incoherent dictionaries and then measured by different stochastic matrices to obtain the compressed signals individually. Since a single fixed matrix does not always satisfy the restricted isometry property [17] and may result in undesirable reconstructions, we randomly choose the matrices from a stochastic matrix dictionary B = {Φ_j : j = 1, 2, ..., L}.
This matrix dictionary is constructed in advance. During construction, each randomly generated stochastic matrix is tested on a group of different speech frames; if its correct-reconstruction rate is acceptable, it is chosen as a dictionary atom. In this study, we set the acceptance threshold for the correct-reconstruction rate to 80%. As a matter of fact, almost all random matrices are CS matrices [15], so the number of atoms L in the dictionary can be set according to practical requirements.
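The acceptance test above can be sketched as follows. This is our own illustrative approximation: synthetic K-sparse test frames replace real speech frames, exact support recovery under OMP serves as the "correct reconstruction" criterion, and all dimensions are placeholders.

```python
import numpy as np

def omp(Phi, y, K):
    """Bare-bones orthogonal matching pursuit (same greedy scheme as in CS)."""
    residual, support = y.copy(), []
    for _ in range(K):
        support.append(int(np.argmax(np.abs(Phi.T @ residual))))
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    s_hat = np.zeros(Phi.shape[1])
    s_hat[support] = coef
    return s_hat

def build_matrix_dictionary(m, n, L, K=5, trials=20, threshold=0.8, seed=0):
    """Accept a random candidate as a dictionary atom only if its
    correct-reconstruction rate on random K-sparse test frames
    meets the threshold (80% in the paper)."""
    rng = np.random.default_rng(seed)
    B = []
    while len(B) < L:
        Phi = rng.standard_normal((m, n)) / np.sqrt(m)   # candidate matrix
        ok = 0
        for _ in range(trials):
            s = np.zeros(n)
            s[rng.choice(n, K, replace=False)] = rng.standard_normal(K)
            ok += np.linalg.norm(omp(Phi, Phi @ s, K) - s) < 1e-6
        if ok / trials >= threshold:
            B.append(Phi)
    return B
```

Since almost all random matrices pass such a test, the loop terminates quickly and L can be chosen freely.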
The compressed data y_i are pre-reconstructed until the finally selected sensing matrices Φ_i, i = 1, 2 ensure precise reconstruction. Let α_i, i = 1, 2 denote the normalized indices of these matrices in the dictionary, and let D_i, i = 1, 2 stand for the sparsifying matrices of the two speech parts. Then, the encrypted signal y_D is obtained by aliasing and scrambling the two low-dimensional measurement vectors with the selected matrices Φ_i.
Subsequently, a heuristic goal is to design a scrambling matrix schedule of high security together with its inverse operator for decryption. Owing to the independence of the two speech parts, their corresponding measurements are mutually independent, and we can remove either one of them by its orthocomplement without damaging the other. Without loss of generality, we take y_1 for the following illustration.
Assume there exists a vector z in (Φ_1^T)^⊥ that is orthogonal to Φ_1^T, i.e., z^T Φ_1 = 0. Multiplying z^T with the encrypted data y_D then eliminates the y_1 part comprised in y_D. Afterwards, we can reconstruct x_1 and x_2 by sparse reconstruction algorithms and assemble them to obtain the recovered speech x̂.
Since the operation objects in practice are matrices and vectors, the orthocomplement design problem reduces to orthogonal vector design. Following linear space theory [27], z can be represented as a linear combination of the vectors in the non-trivial NSB of Φ_1. The rank of the stochastic matrix Φ_1 ∈ R^{m×n} (m < n) is m, so the dimension of the null space of Φ_1^H satisfies

dim[Null(Φ_1^H)] = m − rank(Φ_1) = 0.  (12)

According to (12), if we chose Φ_i to scramble y_1 and y_2 directly, the two measurement parts could not be separated for decryption, since a non-trivial Null(Φ_1^H) does not exist. To enable the inverse operation, we construct the non-full-row-rank matrix

Φ_1^D = [Φ_1^T, Φ_1^T]^T ∈ R^{2m×n}.  (13)

By employing the conclusion drawn from [27], we have rank(Φ_1^D) = rank(Φ_1) = m (14), and then

dim[Null((Φ_1^D)^H)] = 2m − rank(Φ_1^D) = m.  (15)

This ensures the existence of the null space of (Φ_1^D)^H, and it can be constructed by the SVD: a matrix Φ_1^D ∈ R^{2m×n} with rank(Φ_1^D) = m can be decomposed as

Φ_1^D = UΣV^H,  (16)

where U ∈ R^{2m×2m} and V ∈ R^{n×n} are unitary matrices. The m left singular vectors {u_{m+1}, u_{m+2}, ..., u_{2m}} that correspond to the zero singular values are an orthonormal basis of the null space of the conjugate transpose matrix (Φ_1^D)^H.
Here σ_i are the square roots of the non-zero eigenvalues of (Φ_1^D)^H Φ_1^D. With (16), we obtain the NSB of (Φ_1^D)^H, denoted N = [N_1^T, N_2^T]^T, where N_1, N_2 ∈ R^{m×m} are full-rank matrices and therefore invertible, as proved in the Appendix. In this way, the matrix Φ_1^D provides an available NSB; in other words, if we use the NSB matrix to construct a scrambling matrix for the aligned measurements [y_1^T, y_2^T]^T, the inverse operator for decryption is available. Accordingly, the scrambling matrix S ∈ R^{2m×2m}, built from the NSB blocks N_i and the scaled identity blocks α_i I, is used to alias and disorder the measurements y_1, y_2. Here α_i are the corresponding normalized indices of Φ_i in the dictionary B, so that every frame of the encrypted signal is aliased in a different proportion to enhance the encryption complexity, and the identity matrix I ∈ R^{m×m} adapts the dimensions of S to [y_1^T, y_2^T]^T. The final encrypted signal y_D is given in (17): every compressed part y_i is scrambled, and the scrambled data are aliased with each other. The whole encryption process is shown in Figure 4.
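The elimination mechanism behind this construction is easy to demonstrate numerically. The sketch below (with arbitrary placeholder dimensions) builds the NSB of the stacked matrix via the SVD and shows that multiplying by it removes the Φ_1-borne component of an aliased vector while leaving the other component untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 10, 50
Phi1 = rng.standard_normal((m, n))

# Stack Phi1 on itself: the stacked matrix has rank m, so the null
# space of its conjugate transpose has dimension 2m - m = m.
Phi1_D = np.vstack([Phi1, Phi1])              # 2m x n
U, sing, Vh = np.linalg.svd(Phi1_D)
N = U[:, m:2 * m]                             # left singular vectors of the zero singular values

# Every column z of N satisfies z^T Phi1_D = 0 ...
print(np.max(np.abs(N.T @ Phi1_D)))           # numerically ~ 0

# ... so multiplying by N^T removes the y1-related component
# Phi1_D @ s1 from an aliased vector without damaging the rest:
s1 = rng.standard_normal(n)
v = rng.standard_normal(2 * m)                # stand-in for the other aliased part
y_aliased = Phi1_D @ s1 + v
print(np.allclose(N.T @ y_aliased, N.T @ v))  # True: y1 part eliminated
```

This is the same orthocomplement trick used by the decryption operator, here stripped of the paper's specific block layout.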
For communication, the encrypted data y_D, the dictionary parameters μ_i*, and the indices α_i of the sensing matrices are transmitted to the decryption end. The decryption operator can then be constructed from Φ_i. Note that the stochastic matrix dictionary B is also the key book shared by both the encryption and decryption parties.
This key book is never exposed in the channel. In practice, it is updated irregularly, and the number of its atoms L can be set adaptively to meet the requirements of the system. Figure 5 depicts the decryption procedure: we retrieve Φ_1 with the index α_1 from the transmitted data, build Φ_1^D, and construct the matrix S' from its NSB by the SVD. The y_1 part is then removed as follows.

Decryption
As mentioned above, N_1 and N_2 are full-rank matrices. With the inverse matrix N_2^{−1}, we obtain y_2 by multiplying the y_1-removed data t by the operator built from N_2^{−1}, as given in (18); analogously, y_1 is decrypted as in (19). Once the measurement vectors y_i and the dictionary parameters μ_i* are derived, the two speech parts can be recovered using the OMP algorithm [22]. Finally, by assembling the two reconstructed parts, we obtain the recovered speech x̂.
In theory, without the key book B, it is very hard for the eavesdroppers to decipher the encrypted signal even though they have cryptanalyzed the encryption mechanism. The strength of security is discussed in the following section.

Experimental results and discussions
In this section, the performance of the proposed encryption scheme is evaluated from three perspectives: (1) the residual intelligibility of the encrypted signal; (2) the strength of security; (3) resistance to hostile attacks. We test over 20,000 frames of speech from several speakers with different characteristics (gender, age, pitch, regional accent). The test signals are taken randomly from the TIMIT database and are sampled at 8 kHz with frame length 25 ms, that is, 200 samples per frame. In Section 4.1, residual intelligibility test results and discussions are presented. Section 4.2 analyzes the strength of security, and a possible deciphering technique is considered. Section 4.3 verifies the robustness of the proposed scheme under two conditions: in the presence of noise and under low-pass filtering (LPF).

Residual intelligibility
The amount of intelligibility left over in the encrypted signal is measured by the envelope relevance between the original speech and the processed signal,

ρ = ⟨E_o, E_p⟩ / (‖E_o‖ ‖E_p‖),  (20)

where E_o and E_p denote the envelopes of the original speech and the processed signal, respectively. Naturally, we interpolate the vector E_p to the same dimension as E_o before taking the inner product in (20). We test two kinds of processed signal: the compressed signal (CoS) and the encrypted signal (ES). According to the experimental statistics, when the compression rate (m/n) is above 5%, the salient information of speech can be captured, and acceptable reconstruction quality is attained with the K-L incoherent dictionary. On the other hand, though reconstruction quality improves as the compression rate increases, a high compression rate defeats the purpose of signal compression. Therefore, the average residual intelligibilities are measured at compression rates ranging from 5 to 10%.
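The envelope-relevance measure in (20) can be sketched as follows, assuming (as in the references above) a Hilbert-transform envelope; the FFT-based analytic-signal construction below is a standard numpy-only equivalent of `scipy.signal.hilbert`, and the linear interpolation step is our illustrative choice.

```python
import numpy as np

def envelope(x):
    """Temporal envelope |analytic signal|, via an FFT-based
    Hilbert transform (equivalent to scipy.signal.hilbert)."""
    n = len(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * h))

def envelope_relevance(orig, proc):
    """Normalized envelope inner product, as in (20); the processed
    envelope is interpolated to the original's length first."""
    Eo, Ep = envelope(orig), envelope(proc)
    if len(Ep) != len(Eo):
        Ep = np.interp(np.linspace(0, 1, len(Eo)),
                       np.linspace(0, 1, len(Ep)), Ep)
    return float(Eo @ Ep / (np.linalg.norm(Eo) * np.linalg.norm(Ep)))
```

Since envelopes are non-negative, the relevance lies in [0, 1], with 1 meaning identical envelope shape.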
As seen in Table 1, despite their noise-like nature, the low-dimensional measurements still retain considerable information about the original speech, whereas the envelope relevance between the aliased, scrambled signal and the original speech drops dramatically, i.e., the residual intelligibility decreases sharply. Meanwhile, the compression rate and the residual intelligibility are not strongly related, which means one can choose the compression ratio adaptively without increasing the residual intelligibility. In addition, the mean opinion score (MOS) predicted by ITU-T P.862 [28], a standard method for estimating the perceptual quality of narrow-band speech, is adopted to evaluate the recovered speech; the results are also presented in Table 1 to illustrate the relationship between speech quality and compression rate.

Strength of security
Following Shannon's landmark article [14], the majority of the literature on key generation can roughly be categorized into four basic approaches: the information theory approach, the system theory approach, the complexity theory approach, and the stochastic approach. Considering its key schedule, our encryption scheme belongs to the stochastic approach, and its security is generally measured by the scale of the keyspace. The keyspace of the proposed scheme is therefore analyzed in two cases to evaluate its strength of security. First, we consider the optimal case, in which the scrambling mechanism is thoroughly unknown to the unauthorized listener, including the key schedule, the structure of the scrambling matrix, and the sparsifying dictionary. Consequently, decryption requires knowledge of S, Φ_i, and D_i, as given in (21). In this case, all three unknown matrices can be regarded as keys, since one cannot obtain the original information without any one of them. Meanwhile, the extremely low residual intelligibility and the noise-like nature of the encrypted signal hamper statistical analysis methods [29]. Therefore, the most feasible attack is to search for the key, and its complexity is directly decided by the scale of the keyspace. Given the dimensions of S, Φ_i, and D_i, their keyspace sizes are O(10^{4m²}), O(10^{mn}), and O(10^{mn}), respectively; these are also the computational complexities of the searching process.
In the second case, we assume that an eavesdropper has complete knowledge of the system and has the necessary hardware to synchronize and isolate the frames. In other words, he knows that the scrambling matrix S can be constructed from Φ_i and that the sparsifying matrix D_i can simply be rebuilt from its characteristic structure. Hence, the security of the system resides entirely in the selection of the key Φ_i, and the eavesdropper's only task is to find it. Owing to the randomness of Φ_i, the keyspace size is O(10^{mn}). In practical situations, a speech frame has length n = 180-220, and if we choose the compression rate as 5%, the length of the compressed signal is m = 9-11; the order of magnitude of the keyspace is therefore about 10^2000. Table 2 compares the keyspace sizes and compression ratios of the proposed scheme and some prior scramblers that employ representative key schedules, including the Hadamard matrix [4], the Latin square [6], the dimensional-variable matrix [7], and the chaos system [12]. As seen in Table 2, the proposed scheme provides a larger keyspace and requires lower communication overhead. Now consider a possible deciphering technique based on dictionary learning, deciphering delay aside. A cryptanalyst trying to break the system may possess large amounts of encrypted signal but not the key book, since the latter is held by both the encryption and decryption parties and never transmitted through the channel. He knows the complete specification of the system (scrambling mechanism, structure of the sparsifying dictionary) and would like to deduce the key without regard to real-time requirements.
In mathematical language, the jth encryption operation is represented as y_D^{(j)} = S^{(j)} [y_1^{(j)T}, y_2^{(j)T}]^T (22). By wiretapping, the cryptanalyst can obtain plenty of encrypted signal y_D^{(j)} and would like to learn an estimate Ŝ from it. Unfortunately for him, this is an optimization problem with no constraint conditions and thus cannot be solved, let alone that the scrambling matrix S is variable rather than fixed. Moreover, even if he were able to find some fixed Ŝ, he still could not rebuild the sensing matrix Φ, as can be verified through Equations (15) and (16). In fact, there would not be enough data or delay tolerance for such cryptanalysis: in secure communications of military information or espionage intelligence, for instance, key information is expected to be as brief as possible to ensure short durations. Therefore, dictionary learning is unlikely to be a feasible approach in real cases.

Robustness performance
Since unintelligible yet readily decipherable signals can also be generated under a large keyspace, other factors, including bandwidth expansion, delay time, and channel resistance (to noise, distortion, etc.), cannot be ignored in assessing security. Two types of attack are performed on the encrypted signal: (1) additive white Gaussian noise (AWGN); (2) LPF.
Representative speech scrambling schemes are chosen for comparison with the CS scheme. Specifically, time-domain scrambling (TDS) [5] is adopted to represent non-compressional scramblers. In parallel, approximate 13-segment μ-law pulse code modulation (PCM) and MELP [9] are chosen to represent waveform coding and parametric coding, respectively, among compressional scramblers. The MOS is chosen as the subjective criterion. In addition, the average segmental signal-to-noise ratio (SNRseg) [30] is adopted as the objective criterion for the quality of the recovered speech:

SNRseg = (1/N_frame) Σ_j 10 log_10 [ Σ_i x_j(i)² / Σ_i (x_j(i) − x̂_j(i))² ],  (23)

where x_j(i) and x̂_j(i) denote the ith samples of the jth original and recovered frames, and N_frame denotes the total number of frames. The results are calculated and averaged over a test set of approximately 100 sentences randomly selected from the TIMIT database.
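The segmental SNR criterion of (23) is straightforward to compute; the sketch below assumes non-overlapping frames of the paper's 200-sample length (the framing policy is our assumption).

```python
import numpy as np

def snr_seg(x, x_hat, frame_len=200):
    """Average segmental SNR (dB) over non-overlapping frames, per (23)."""
    n_frames = len(x) // frame_len
    vals = []
    for j in range(n_frames):
        seg = slice(j * frame_len, (j + 1) * frame_len)
        signal = np.sum(x[seg] ** 2)
        noise = np.sum((x[seg] - x_hat[seg]) ** 2)
        vals.append(10.0 * np.log10(signal / noise))
    return float(np.mean(vals))
```

Averaging per-frame SNRs (rather than one global SNR) keeps quiet frames from being swamped by loud ones, which is why SNRseg is the usual objective measure for speech.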

Noise resistance
AWGN is added to the encrypted signal of each scheme, and the performances of the proposed and comparative schemes are compared, with the compression ratio of CS set to 1:10. As shown in Figure 6a, the CS scheme outperforms the comparative schemes at all degrees of contamination. As the signal-to-noise ratio (SNR) becomes higher, the superiority of the CS scheme becomes more pronounced and leads to a more favorable comparison, while the compressional schemes, including PCM and MELP, perform worse; this is also verified by the SNRseg decrements (Figure 6b).
These results are mainly due to the use of the stochastic matrix, which has extremely low column coherence. Studies [31,32] have shown that for a noiseless signal y = Φs, if the K-sparse vector s satisfies ‖s‖_0 < (1/2) spark(Φ), then the reconstruction ŝ from the contaminated measurements y + e, ‖e‖_2 ≤ ε, satisfies an error bound of the form ‖ŝ − s‖_2² ≤ Cε² (24), where the constant depends on the mutual coherence of Φ and the sparsity of s. Here, spark(Φ) stands for the minimum number of columns of Φ that are linearly dependent, and the "mutual coherence" M of Φ [31] is defined as

M = max_{i≠j} |φ_i^T φ_j| / (‖φ_i‖_2 ‖φ_j‖_2).  (25)

In short, one can stably reconstruct the sparse vector s with error proportional to the noise level, provided that (1) the columns of the sensing matrix Φ are weakly mutually correlated and (2) the vector s is sufficiently sparse. The reason the reconstruction error is bounded is geometric in nature: the reconstruction ŝ is confined to a narrow tube surrounding the original vector s, which ensures the stability of recovery (for further details, see [31], Subsection 5.3). In particular, the method of convex relaxation can identify a sparse signal in AWGN [33], and more sophisticated stable recovery schemes and bounds have been investigated [34].
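Mutual coherence as defined in (25) can be computed directly from the Gram matrix of the normalized columns; the following is a minimal sketch.

```python
import numpy as np

def mutual_coherence(Phi):
    """M(Phi) = max over distinct columns i != j of
    |<phi_i, phi_j>| / (||phi_i|| ||phi_j||), as in (25)."""
    norms = np.linalg.norm(Phi, axis=0)
    G = np.abs(Phi.T @ Phi) / np.outer(norms, norms)  # normalized Gram matrix
    np.fill_diagonal(G, 0.0)                          # ignore self-correlations
    return float(G.max())
```

An orthonormal matrix has coherence 0, duplicated columns give coherence 1, and random Gaussian sensing matrices fall in between, with coherence shrinking as the number of rows grows, which is what makes them good CS matrices.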
Technically speaking, the waveform coding scheme takes no noise-resistant precaution and is thus vulnerable to noise. As for the parametric coding scheme, noise introduces errors into feature parameters such as the pitch period and the voiced/unvoiced decision; once such parameters are contaminated, undesirable reconstruction distortion occurs. For these reasons, the noise resistance of the proposed scheme is better than that of its counterparts.

LPF
For LPF, we compare the decrements in MOS and SNRseg between speech reconstructed from the filtered encrypted data and from the original encrypted data. Again, the compression ratio of CS is set to 1:10.
As seen in Figure 7, the CS scheme slightly outperforms the comparative schemes when the cutoff frequency ranges from 2400 to 2600 Hz; as the cutoff frequency increases, it performs progressively better and exhibits an obvious competitive advantage. The TDS scheme ranks second, and the PCM scheme shows the worst performance.
The encrypted signal obtained from CS has had the speech characteristics dramatically removed by aliasing and scrambling, and thus offers the best resistance to LPF. By contrast, the TDS scheme still retains some speech structure, which makes its performance inferior to the CS scheme; it nevertheless outperforms the other two comparative schemes owing to its robustness to time-domain perturbations [24]. As for the PCM and MELP schemes, their encoded signals have no spectral structure: all parts of the signal are of similar importance and are therefore vulnerable to this type of attack. Any damage to the encrypted signal brings about serious reconstruction errors and deteriorates the auditory quality; consequently, these two schemes have the worst robustness to LPF.

Conclusions
This article presented a scrambling-based speech encryption algorithm via CS. A high degree of security is achieved thanks to low residual intelligibility and a large keyspace. The immense complexity of finding the scrambling matrix ensures the effectiveness of the encryption, and the scheme affords notable robustness to common hostile attacks while requiring lower communication costs and introducing only a slight (about 200 ms) processing delay. Experimental results demonstrate the improved performance of the scheme compared with state-of-the-art speech scramblers. As future work, we plan to investigate more sophisticated speech sparse representation and reconstruction algorithms to further reduce the compression ratio and improve the auditory quality of the recovered speech.

Appendix
A proof of the full-rank property mentioned in Section 3.2 is given as follows.
For a full-row-rank matrix P ∈ R^{m×n} (m < n), rank(P) = m, denote its SVD as P = UΣV^H, where Σ = [Σ_m  O_{m×(n−m)}] ∈ R^{m×n} and Σ_m = diag(σ_1, σ_2, ..., σ_m). Here U and V are m×m and n×n unitary matrices, respectively, and σ_i > 0 are the square roots of the non-zero eigenvalues of P^H P. With the SVD of P, we have

P^H P = V Σ^H Σ V^H.  (26)

Then, for Q = [P^T, P^T]^T = U'Σ'V'^H ∈ R^{2m×n} with rank(Q) = m, where U' ∈ R^{2m×2m}, Σ' ∈ R^{2m×n}, and V' ∈ R^{n×n}, Q^H Q is represented as

Q^H Q = 2P^H P = V'(Σ'^H Σ')V'^H.  (27)
Comparing (26) and (27), the non-zero singular values of Q are √2·σ_i, i = 1, ..., m, so Σ'^H Σ' has exactly m non-zero diagonal entries. Thus, with the SVD Q = U'Σ'V'^H, the null space basis of Q^H can be constructed as

Null(Q^H) = span{u'_{m+1}, u'_{m+2}, ..., u'_{2m}},

where u'_{m+i} denotes the (m+i)th column vector of U'.
Owing to the randomness of P, dim[Null(Q^H)] = m; namely, the NSB matrix N = [N_1^T, N_2^T]^T ∈ R^{2m×m} is a full-column-rank matrix.
By elementary row operations, N can be converted to the form [N_1^T, O^T]^T without changing its rank. Therefore, rank(N_1) = m, which means that N_i, i = 1, 2 are full-rank matrices and possess inverses. This completes the proof.
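The appendix's conclusion is also easy to check numerically. The sketch below (with arbitrary placeholder dimensions) stacks a random full-row-rank P, extracts the NSB of Q^H from the SVD, and verifies that both m×m blocks are invertible.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 8, 30
P = rng.standard_normal((m, n))     # full row rank with probability 1
Q = np.vstack([P, P])               # rank(Q) = rank(P) = m

# NSB of Q^H: the left singular vectors of the m zero singular values.
U, sing, Vh = np.linalg.svd(Q)
N = U[:, m:2 * m]                   # 2m x m, full column rank
N1, N2 = N[:m, :], N[m:, :]

# The claim proved above: both blocks of N are full rank (invertible).
print(np.linalg.matrix_rank(N1), np.linalg.matrix_rank(N2))  # both m
```

Intuitively, every null vector of Q^H has the form [u^T, −u^T]^T, so the two blocks are (up to sign) the same well-conditioned m×m matrix.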