Scrambling-based speech encryption via compressed sensing
EURASIP Journal on Advances in Signal Processing volume 2012, Article number: 257 (2012)
Abstract
Conventional speech scramblers have three disadvantages: heavy communication overhead, underexploitation of signal features, and low attack resistance. In this study, we propose a scrambling-based speech encryption scheme via compressed sensing (CS). Unlike conventional scramblers, the proposed scheme addresses these problems in a unified framework by exploiting the advantages of CS. The presented encryption idea is general and applies readily to speech communication systems. Compared with state-of-the-art methods, the proposed scheme provides lower residual intelligibility and demands greater cryptanalytic effort. Meanwhile, it ensures desirable channel usage and notable resistance to hostile attacks. Extensive experimental results confirm the effectiveness of the proposed scheme.
1. Introduction
Encryption, which dates back to antiquity, is essential for information security in modern society [1]. Information espionage, including illegal surveillance and wiretapping, has emerged with the wide application of speech communication in national defense, economy and trade, scientific research, and social affairs. With security an ever more vital requisite of communications systems, speech encryption has gained wide acceptance as an effective means of enhancing protection in both military and civilian applications.
Two main categories of technologies have been developed for this purpose. The first is content protection through encryption, e.g., the speech scrambler [2–9]. Proper decryption of the data requires a key, the so-called scrambling matrix. The second is digital watermarking, which aims at embedding messages into the multimedia data [10]. Intuitively, time-domain sample scrambling is thus far the most attractive method, because it simply takes a segment of time-domain sample values and directly scrambles them into a different segment of samples. In this article, we focus on scrambling-based encryption.
Earlier speech scramblers disorder the original signal using a specific sequence or matrix, such as a pseudorandom sequence, the Fibonacci transform [2], the Hadamard matrix [3, 4], and so on. The main disadvantage shared by these methods is that the decryption key is invariable. Since the performance of computer hardware has improved dramatically, these methods can easily be deciphered. To alleviate this problem, researchers proposed new key schedules, such as the stochastic matrix [5] and the Latin square [6], to improve the strength of security [7, 8]. However, the improved algorithms also result in heavy transmission load because they cannot compress the original signal. Consequently, speech compression methods are integrated into the process of encryption, e.g., G.729, mixed excitation linear prediction (MELP), and AMR [9]. Indeed, the combination of compression and scrambling leads to less costly encrypted data. But parametric coding algorithms have low robustness in the presence of noise or other hostile attacks. Besides, the performance of such algorithms depends heavily upon the encryption operator, and the character of speech itself is not well utilized. More recently, research on nonlinear processes has shown that chaotic signals are very suitable for secure communications. However, a chaotic system is sensitive to disturbance and requires strict self-synchronization, which limits its practical applications [11, 12].
According to Del Re et al. [13], the degree of security (deciphering difficulty) provided by a speech encryption system is related to (1) residual intelligibility (the amount of intelligibility left over in the encrypted signal) and (2) keyspace (the number of keys available for encryption). In general, the lower a scrambling system’s residual intelligibility and the bigger its keyspace, the higher its degree of security. Since the stochastic approach was proposed [14], the keyspace of a scrambler has commonly been measured by the number of encryption operators, namely the scrambling matrices [3, 4, 6, 13].
To summarize, a channel-saving and attack-resistant speech scrambler is a major issue to be addressed. At the same time, it should attain residual intelligibility as low as possible and provide a keyspace as large as possible to increase the cryptogram's immunity to cryptanalysis. Despite the improvements achieved by the aforementioned works, few investigations manage to address these problems simultaneously. In light of this consideration, we apply compressed sensing (CS) [15–17] to speech encryption, owing to its promising capability in signal compression and its notable robustness to hostile attacks.
In this article, we tackle scrambling-based speech encryption via CS by exploiting the sparsity of speech over the Karhunen–Loeve (K–L) incoherent dictionary [18]. Unlike existing schemes, we scramble the dimension-reduced measurements instead of the original speech. The algorithm proposed in this article is motivated by the following idea: if two independent signals x_{1} and x_{2} are aliased and scrambled in the same space by a stochastic matrix, the intelligibility of the original signal can be eliminated, and it is hard for eavesdroppers to get any information from the mixture alone [19]. Observations show that the measurement vector of speech exhibits a noise-like nature. However, the envelope of the compressed data still retains considerable information about the original speech [20, 21]. We therefore alias and scramble the two measurement vectors of the separated speech instead of using the compressed data directly as ciphered data.
To be specific, the presented scheme contains two stages: encryption and decryption. At the encryption stage, we compress and encrypt the speech. The original signal is separated into two independent parts and sparsely represented by the corresponding K–L incoherent dictionaries. Next, the sparse vectors of the two parts are measured by stochastic matrices. Afterwards, these low-dimensional measurements are mixed and scrambled by the scrambling matrix, which is constructed from the null space basis (NSB) of the aligned sensing matrix using singular value decomposition (SVD). At the decryption stage, the inverse operator is constructed to eliminate each aliased measurement part individually. Finally, the separated speech parts are reconstructed using orthogonal matching pursuit (OMP) [22] and assembled to recover the speech. Experimental results demonstrate that the encrypted signal of the proposed scheme has low transmission cost and low residual intelligibility. It provides immense cryptogram immunity and exhibits notable attack resistance.
The rest of this article is outlined as follows. In Section 2, we introduce the K–L incoherent dictionary to sparsely represent speech signals. Section 3 explains the encryption idea that we seek to address, and expatiates on the detailed procedure of encryption and decryption. The residual intelligibility, encryption strength, and robustness performance of the proposed scheme, together with experimental results, are presented and discussed in Section 4. Conclusions are drawn in Section 5.
2. The sparse representation of speech
Sparse representation is a critical step of CS [16], since one can obtain the dimension-reduced signal on the basis of sparse vectors. In this section, we employ a practical sparsifying dictionary to sparsely represent the speech signal.
The K–L expansion [18] describes a stochastic process in the form of incoherent random principal components and the corresponding deterministic orthogonal basis. Thus, the main structure of the process can be captured by a few expansion terms. For a real second-order moment stochastic process {x(t), t ∈ [0, 1]}, the K–L expansion is
x(t)=\sum_{k}a_{k}ϕ_{k}(t) (1)
where a_{ k }=∫_{0}^{1}x(t)ϕ_{ k }(t)dt. The orthogonal bases {ϕ_{ k }(t)} are eigenfunctions of the signal autocorrelation R_{ x }(t, u). They can be used to design the atoms of the signal dictionary. {ϕ_{ k }(t)} and the corresponding eigenvalues y_{ k } satisfy the Fredholm integral equation [18, 23]
∫_{0}^{1}R_{ x }(t, u)ϕ_{ k }(u)du=y_{ k }ϕ_{ k }(t) (2)
However, for a practical signal, Equation (2) is hard to solve due to the complexity of the signal autocorrelation. Since the autocorrelation of a speech signal decreases rapidly within a short delay, we approximate it by the exponential function R_{ x }(t, u)=R_{ x }(0)e^{−μ|t−u|}, where μ is the attenuation coefficient. Substituting R_{ x }(t, u) into (2) yields
By solving (3) with the boundary conditions and eliminating the zero particular solution ϕ_{0}(t)=0 (it cannot be used as an eigenfunction), we obtain the orthogonal basis
After adding ϕ_{0}(t)=1, the complete orthogonal K–L dictionary E is represented by
where k stands for the number of atoms in the dictionary. For digital signal processing, the bases of E are sampled in the range 0≤t≤1 by uniform sampling. Let μ* denote the optimal value of the parameter μ; it is estimated by solving the following optimization problem:
where R_{x}(τ)=\frac{1}{n−τ}\sum_{i=1}^{n−τ}x(i)x(i+τ) is the unbiased estimate of the autocorrelation of the speech frame x∈R^{n} with delay τ, and \widehat{R}_{x}(τ)=R_{x}(0)e^{−μ|τ|}, τ=0, 1, …, n−1.
Thus, the optimal discrete atoms are e_{ k }=[e_{ k }(1), …, e_{ k }(i), …, e_{ k }(n)]^{T}, where
Then, adding e_{0}=[1, …, 1]^{T}, we construct the complete discrete speech dictionary as
In this case, discrete uniform sampling changes the orthogonality of bases {ϕ_{ k }(t)}. Though they are not mathematically orthogonal, atoms in D are of low coherence and subsequently, D turns out to be an overcomplete incoherent dictionary.
In short, the presented dictionary is jointly determined by the sinusoidal atoms (given in (7)) and the parameter μ*. The structure of the sinusoidal atoms is based on the K–L expansion. It is the general paradigm of the dictionary and thus can be shared by both the compression and the recovery parts. On the other hand, μ* is affected by the character of each frame and determines the detailed structure of the current dictionary. Hence, with μ*, the corresponding dictionary can be rebuilt to recover the original speech in the recovery part.
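As an illustration of how μ* might be estimated per frame, the following minimal sketch fits the exponential model R_{x}(0)e^{−μ|τ|} to the unbiased autocorrelation estimate. It assumes a squared-error objective and a plain grid search over μ, neither of which is specified by the text; the function names, the grid, and the synthetic autoregressive frame (a stand-in for real speech) are likewise hypothetical.

```python
import numpy as np

def unbiased_autocorr(x):
    """R_x(tau) = 1/(n - tau) * sum_i x(i) x(i + tau), for tau = 0..n-1."""
    n = len(x)
    return np.array([np.dot(x[:n - tau], x[tau:]) / (n - tau) for tau in range(n)])

def estimate_mu(x, mu_grid):
    """Grid search for the mu whose exponential model best fits R_x.

    Assumption: the misfit is measured by the sum of squared differences.
    """
    R = unbiased_autocorr(x)
    taus = np.arange(len(x))
    costs = np.array([np.sum((R - R[0] * np.exp(-mu * taus)) ** 2) for mu in mu_grid])
    return mu_grid[int(np.argmin(costs))], costs

rng = np.random.default_rng(0)
# Hypothetical stand-in for a 200-sample speech frame: a first-order AR sequence.
frame = np.zeros(200)
for i in range(1, 200):
    frame[i] = 0.9 * frame[i - 1] + rng.standard_normal()

mu_grid = np.linspace(0.01, 2.0, 200)   # hypothetical search range for mu
mu_star, costs = estimate_mu(frame, mu_grid)
print("estimated mu*:", mu_star)
```

Only the scalar μ* needs to accompany each frame, since the sinusoidal structure of the atoms is fixed and known to both ends.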
Three types of speech frame (unvoiced, voiced, and transition sound) and their corresponding sparse vectors over the K–L incoherent dictionaries are shown in Figure 1. Here, k is set equal to the length of a speech frame.
3. Proposed encryption scheme
This section details the specification of the proposed scheme. Section 3.1 presents the derivation of the proposed encryption scheme. The design of the scrambling matrix is addressed in Section 3.2. Section 3.3 describes the decryption and recovery process.
3.1. CS-based scrambling
According to Candès and Wakin [17], the implication of sparsity is now clear: when a signal has a sparse expansion, one can discard the small coefficients without much perceptual loss. Hence, some minor but nonzero entries of the sparse vectors can be discarded before measurement to further reduce the compression rate. In addition, they prove that, for a K-sparse signal s∈R^{n} and a fixed basis Φ∈R^{m×n} with atoms selected uniformly at random, exact reconstruction of s from the measurements y=Φs∈R^{m} (m≪n) succeeds with overwhelming probability, as long as the number of observations obeys
for some real positive constant C. In this case, the original speech is compressed and the compression ratio is m:n. Here, we concentrate on the issue of speech encryption, and the quality of reconstructed speech will be given in Section 4.
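The measurement step can be sketched as follows. This is a minimal example under stated assumptions: the coefficient vector is synthetic, the Gaussian sensing matrix and the values of n, m, and K are hypothetical, and `keep_largest` is an illustrative helper rather than a routine from the paper.

```python
import numpy as np

def keep_largest(s, K):
    """Zero all but the K largest-magnitude entries of s."""
    out = np.zeros_like(s)
    idx = np.argsort(np.abs(s))[-K:]
    out[idx] = s[idx]
    return out

rng = np.random.default_rng(1)
n, m, K = 200, 20, 4                      # hypothetical sizes; compression ratio m:n = 1:10

# Hypothetical compressible coefficient vector (magnitudes decay with index).
s = rng.standard_normal(n) * np.exp(-0.2 * np.arange(n))
s_K = keep_largest(s, K)                  # discard minor coefficients before measuring

Phi = rng.standard_normal((m, n)) / np.sqrt(m)   # random sensing matrix
y = Phi @ s_K                             # low-dimensional measurement vector
print(y.shape, "compression rate:", m / n)
```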
Speech has proved to be a robust signal that can be perturbed in many different ways while remaining intelligible [24]. As depicted in Figure 2 (the compression ratio is set to 1:20), although the measurement vector exhibits a noise-like nature (Figure 2b), the envelopes (red, dashed line) of the original speech and the compressed signal are highly similar. This means that CS retains considerable information within the low-dimensional measurements.
Indeed, neurons in the auditory brainstem that are sensitive to the speech envelope have been found in mammalian physiological studies [25]. Studies of the envelope extracted using the Hilbert transform reveal that the envelope is most important for speech reception; that is, words are identified according to the envelope [20]. Research on the relationship between the speech envelope and auditory perceptual comprehension is still intensely ongoing [26]. More recently, Mehmet Cenk et al. [21] have investigated the perceptual features for automatic emotion recognition with temporal envelopes.
Given the strong connection between residual intelligibility and the speech envelope, our goal is to devise a new algorithm that exploits the sparsity of the speech signal while decreasing the residual intelligibility of the compressed data. In view of this consideration, the CS-based scrambling approach is employed for its straightforwardness.
Previous studies [5, 7, 8] have demonstrated the security of using a stochastic matrix as the key. Coincidentally, the sensing matrix in CS is also a stochastic matrix. It can therefore be used as the scrambling matrix to decrease the residual intelligibility of the dimension-reduced signal. Nevertheless, the dimension of the compressed signal does not match that of the sensing matrix. In other words, y∈R^{m} cannot be scrambled directly by Φ∈R^{m×n} without dimensional variation. One feasible way to solve this problem is to select a group of random atoms from Φ to form an m×m scrambling matrix, but the selection schedule brings about additional communication load. Consequently, based on compressed speech sensing, we have designed a new paradigm of scrambling matrix that aliases and scrambles two volumes of compressed data \begin{bmatrix}y_{1}\\ y_{2}\end{bmatrix} together.
3.2. Design of the scrambler
As mentioned above, one can hardly get any information from the result alone if two independent signals x_{1} and x_{2} are aliased and scrambled in the same space. As shown in Figure 3, the original speech is therefore separated into two parts. To keep the new speech segments independent, we set every four consecutive frames (800 samples at a sampling rate of 8 kHz) as a segmentation piece by considering two facts: (1) the quasiperiodic property of speech lasts about 50 ms; (2) the auditory tolerance to delay is about 200 ms. Each cube in Figure 3 represents such a piece.
Next, the new signals are sparsely represented over the corresponding K–L incoherent dictionaries, and then measured by different stochastic matrices to get the compressed signals individually. Since a single fixed matrix does not always satisfy the restricted isometry property [17] and may result in undesirable reconstructions, we randomly choose the matrices from the stochastic matrix dictionary B={Φ_{ j }: j=1, 2, …, L}.
Such a matrix dictionary is constructed in advance. During the construction process, each randomly generated stochastic matrix is tested with a group of different speech frames. If the correct reconstruction rate of this matrix is acceptable, it is chosen as a dictionary atom. In this study, we set the acceptance threshold of the correct reconstruction rate to 80%. As a matter of fact, almost all random matrices are CS matrices [15]; thus the number of atoms L in the dictionary can be set according to practical requirements.
The compressed data y_{ i } are pre-reconstructed until the finally selected sensing matrices Φ_{ i }, i=1, 2 ensure precise reconstruction. Let α_{ i }, i=1, 2 denote the normalized indices of these matrices in the dictionary and let D_{ i }, i=1, 2 stand for the sparsifying matrices of the two speech parts. Then, the encrypted signal y^{D} is obtained by aliasing and scrambling the two low-dimensional measurement vectors with the selected matrices Φ_{ i }.
Subsequently, one heuristic approach is to design a scrambling matrix schedule that is of high security together with its inverse operator for decryption. Owing to the independence of the two speech parts, their corresponding measurements y_{ i }=Φ_{ i }D_{ i }^{T}x_{ i }, i=1, 2 are also incoherent. We can remove either one of them by its orthocomplement without damaging the other. Without loss of generality, we take y_{1} for the following illustrations.
Assume there exists a vector z from (Φ_{1}^{T})^{⊥} that is orthogonal to Φ_{1}^{T}, i.e., z^{T}Φ_{1}=0. Multiplying z with the encrypted data y^{D} yields
thus the y_{1} part contained in y^{D} is eliminated. Then, we can reconstruct x_{1}, x_{2} by reconstruction algorithms and assemble them to obtain the recovered speech \widehat{x}.
Since the objects operated on in practice are matrices and vectors, the orthocomplement design problem becomes one of orthogonal vector design. Following related linear space theories [27], z can be represented by a linear combination of the vectors in the nontrivial NSB of Φ_{1}, denoted as Null(Φ_{1}). The rank of the stochastic matrix Φ_{1}∈R^{m×n} (m<n) is m. However, the dimensions of Φ_{1} and the null space of Φ_{1}^{H} have the following relationship
According to (12), if we choose Φ_{ i } to scramble y_{1} and y_{2} directly, the two measurement parts cannot be separated for decryption, since Null(Φ_{1}) contains only the zero vector. To ensure the inverse operation, we construct a matrix without full row rank, Φ_{1}^{D}=\begin{bmatrix}Φ_{1}\\ Φ_{1}\end{bmatrix}∈R^{2m×n}. The rank of Φ_{1}^{D} is m. By employing the conclusion drawn from [27], we have
Then,
This ensures the existence of a nontrivial null space of (Φ_{1}^{D})^{H}, which can be constructed by SVD: a matrix Φ_{1}^{D} ∈ R^{2m×n} with rank(Φ_{1}^{D})=m can be decomposed as
Φ_{1}^{D}=UΣV^{H} (15)
Then the m left singular vectors {u_{m+1}, u_{m+2}, …, u_{2m}} that correspond to the zero singular values form an orthonormal basis of the null space of the conjugate transpose matrix (Φ_{1}^{D})^{H}, that is
where U∈R^{2m×2m} and V∈R^{n×n} are unitary matrices. Here Σ=\begin{bmatrix}Σ_{1} & 0\\ 0 & 0\end{bmatrix}, where Σ_{1}=diag(σ_{1}, σ_{2}, …, σ_{ m }), and σ_{ i } are the square roots of the nonzero eigenvalues of (Φ_{1}^{D})^{H}Φ_{1}^{D}.
With (16), we obtain the NSB of (Φ_{1}^{D})^{H}: N∈R^{2m×m}, denoted as N=\begin{bmatrix}N_{1}\\ N_{2}\end{bmatrix} with the property N_{1}=−N_{2}. N_{1} and N_{2} are full rank matrices and therefore have inverse matrices, as proved in the Appendix. In this case, the matrix Φ_{1}^{D} provides an available NSB. In other words, if we use the NSB matrix to construct a scrambling matrix for the aligned measurements \begin{bmatrix}y_{1}\\ y_{2}\end{bmatrix}, the inverse operator for decryption is available.
With this in mind, the scrambling matrix is designed as S=\begin{bmatrix}α_{1}I & N_{1}\\ N_{2}^{−1} & α_{2}I\end{bmatrix}∈R^{2m×2m} to alias and disorder the measurements y_{1}, y_{2}. Here, α_{ i } are the corresponding normalized indices of Φ_{ i } in the dictionary B, so that every frame of the encrypted signal is aliased in a different proportion to enhance the encryption complexity. The identity matrix I∈R^{m×m} is used to match the dimensions of S and \begin{bmatrix}y_{1}\\ y_{2}\end{bmatrix}. The final encrypted signal y^{D} is given in (17). Every compressed part y_{ i } is scrambled and the scrambled data are aliased with each other. The whole process of encryption is shown in Figure 4.
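The construction of the NSB and of the scrambling matrix can be sketched numerically as follows. This is a minimal illustration under assumptions: the Gaussian Φ_{1}, the sizes m and n, and the values chosen for α_{1} and α_{2} are hypothetical, and the final recovery uses a generic linear solve merely to confirm that S is invertible; it is not the decryption operator S' of Section 3.3.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 10, 200                              # hypothetical sizes (5% compression rate)

Phi1 = rng.standard_normal((m, n))          # stand-in for the selected sensing matrix
PhiD = np.vstack([Phi1, Phi1])              # aligned matrix, 2m x n, rank m

# Full SVD: the last m left singular vectors belong to the zero singular values
# and span the null space of PhiD^H.
U, sing, Vh = np.linalg.svd(PhiD, full_matrices=True)
N = U[:, m:]                                # NSB, shape (2m, m)
N1, N2 = N[:m, :], N[m:, :]

alpha1, alpha2 = 0.3, 0.7                   # hypothetical normalized indices
I = np.eye(m)
S = np.block([[alpha1 * I, N1],
              [np.linalg.inv(N2), alpha2 * I]])

y1 = rng.standard_normal(m)                 # stand-ins for the two measurement vectors
y2 = rng.standard_normal(m)
yD = S @ np.concatenate([y1, y2])           # aliased and scrambled data

# Sanity check only: S is invertible, so the measurements are recoverable.
recovered = np.linalg.solve(S, yD)
print(np.allclose(recovered[:m], y1), np.allclose(recovered[m:], y2))
```

In this sketch the columns of N are orthonormal, so N_{1} is a scaled orthogonal matrix, which is consistent with the statement that N_{1} and N_{2} are invertible.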
For communication, the encrypted data y^{D}, the dictionary parameters μ_{ i }^{∗}, and the indices α_{ i } of the sensing matrices are transmitted to the decryption end. The decryption operator can then be constructed from Φ_{ i }. As such, the stochastic matrix dictionary B is also the key book shared by both the encryption and the decryption parts. This key book is never exposed in the channel. In practice, it is updated at irregular intervals, and the number of its atoms L can be set adaptively to meet the requirements of the system.
3.3. Decryption
Figure 5 depicts the decryption procedure. We get Φ_{1} with the index α_{1} from the transmitted data, and construct the matrix S'=[N_{2}^{−1} −α_{2}I_{1}] with Φ_{1}^{D} by SVD. The y_{1} part is removed as follows.
As mentioned above, N_{1} and N_{2} are full rank matrices. With the inverse matrix N_{2}^{−1}, we can get y_{2} by multiplying the y_{1}-removed data t with the matrix (N_{2}^{−1}N_{1} −α_{2}^{2}I)^{−1}. Analogously, y_{1} is decrypted as
When the measurement vectors y_{ i } and the dictionary parameters μ_{ i }^{∗} are derived, the two speech parts can be recovered using the OMP algorithm [22]. Finally, by assembling the two reconstructed parts, we obtain the recovered speech \widehat{x}.
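For reference, the recovery step can be illustrated with a basic form of OMP. The sketch below is a textbook variant written for this illustration, not necessarily the exact implementation used in the experiments; the random Gaussian matrix stands in for the sensing matrix, and the sizes, support, and amplitudes are hypothetical.

```python
import numpy as np

def omp(A, y, K):
    """Basic orthogonal matching pursuit: greedily select K columns of A to explain y."""
    n = A.shape[1]
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(K):
        corr = np.abs(A.T @ residual)
        if support:
            corr[support] = 0.0            # never pick the same atom twice
        support.append(int(np.argmax(corr)))
        # Least-squares fit on the current support, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    s_hat = np.zeros(n)
    s_hat[support] = coef
    return s_hat

rng = np.random.default_rng(3)
m, n, K = 80, 128, 3                       # hypothetical sizes for a small demonstration
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)             # unit-norm columns

s = np.zeros(n)
s[[7, 50, 101]] = [1.5, -2.0, 1.0]         # hypothetical 3-sparse vector
y = A @ s                                  # noiseless measurements

s_hat = omp(A, y, K)
print("max abs error:", np.max(np.abs(s_hat - s)))
```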
In theory, without the key book B, it is very hard for the eavesdroppers to decipher the encrypted signal even though they have cryptanalyzed the encryption mechanism. The strength of security is discussed in the following section.
4. Experimental results and discussions
In this section, the performance of the proposed encryption scheme is evaluated from three perspectives: (1) the residual intelligibility of the encrypted signal; (2) the strength of security; (3) resistance to hostile attacks. We test over 20,000 frames of speech from several speakers with different characteristics (gender, age, pitch, regional accent). These test signals are taken randomly from the TIMIT database and are sampled at 8 kHz with a frame length of 25 ms, that is, 200 samples per frame. In Section 4.1, residual intelligibility test results and discussions are presented. Section 4.2 analyzes the strength of security and considers a possible deciphering technique. Section 4.3 verifies the robustness of the proposed scheme under two conditions: the presence of noise and low-pass filtering (LPF).
4.1. Residual intelligibility
The amount of intelligibility left over in the encrypted signal is measured by the envelope relevance between the original speech and the processed signal, given as
where E_{ o } and E_{ p } denote the envelopes of the original speech and the processed signal, respectively. Because (20) involves an inner product, we interpolate the vector E_{ p } to the same dimension as E_{ o }. We test two kinds of processed signal: the compressed signal (CoS) and the encrypted signal (ES).
According to the experimental statistics, when the compression rate (m/n) is above 5%, the salient information of speech can be captured, and acceptable reconstruction quality is obtained with the K–L incoherent dictionary. On the other hand, although reconstruction quality improves as the compression rate increases, a high compression rate defeats the purpose of signal compression. Therefore, the average residual intelligibility is evaluated at compression rates ranging from 5 to 10%.
As seen in Table 1, despite their noise-like nature, the low-dimensional measurements still retain considerable information about the original speech. The envelope relevance between the aliased, scrambled signal and the original speech exhibits a dramatic decrease in terms of residual intelligibility. Meanwhile, the compression rate and the residual intelligibility are not remarkably related, which means one can choose the compression ratio adaptively without increasing the residual intelligibility. In addition, as a measure of the perceptual quality of narrowband speech, the mean opinion score (MOS) recommended by ITU-T P.862 [28] is adopted to evaluate the recovered speech, and the results are also presented in Table 1 to illustrate the relationship between speech quality and compression rate.
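A possible computation of the envelope relevance is sketched below. It assumes that the envelope is the magnitude of the analytic signal (Hilbert envelope, here computed with the FFT) and that the relevance is the normalized inner product of the two envelopes; the exact normalization in (20) may differ, and the function names are hypothetical.

```python
import numpy as np

def envelope(x):
    """Hilbert envelope: magnitude of the analytic signal, built via the FFT."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(X * h))

def envelope_relevance(original, processed):
    """Normalized inner product of the envelopes (assumed form of the measure)."""
    E_o = envelope(original)
    E_p = envelope(processed)
    # Interpolate E_p onto the grid of E_o so that the inner product is defined.
    grid_o = np.linspace(0.0, 1.0, len(E_o))
    grid_p = np.linspace(0.0, 1.0, len(E_p))
    E_p = np.interp(grid_o, grid_p, E_p)
    return float(E_o @ E_p / (np.linalg.norm(E_o) * np.linalg.norm(E_p)))

t = np.arange(200)
tone = np.cos(2 * np.pi * 5 * t / 200)          # test tone with a flat envelope
rng = np.random.default_rng(4)
short = rng.standard_normal(20)                 # stand-in for a 20-sample processed signal

print(envelope_relevance(tone, tone))           # identical signals
print(envelope_relevance(tone, short))          # different lengths are handled by interpolation
```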
4.2. Strength of security
Following Shannon’s landmark article [14], most of the literature on key generation may roughly be categorized into four basic approaches: the information theory approach, the system theory approach, the complexity theory approach, and the stochastic approach. Considering the key schedule, our encryption scheme belongs to the stochastic approach, and its security is generally measured by the scale of the keyspace. Thus, the keyspace of the proposed scheme is analyzed in two cases to evaluate its strength of security.
First, we consider the optimal case, in which the scrambling mechanism is entirely unknown to the unauthorized listener, including the key schedule, the structure of the scrambling matrix, and the sparsifying dictionary. Consequently, the essential decryption operation is given by
In this case, all three of the above unknown matrices can be regarded as the key, since one cannot obtain the original information if any one of them is missing. Meanwhile, the extremely low residual intelligibility and the noise-like nature of the encrypted signal hamper statistical analysis methods [29]. Therefore, the most feasible way is to search for the key, and the complexity of the search is directly decided by the scale of the keyspace. Given the dimensions of S, Φ_{ i }, and D_{ i }, their keyspace sizes are O(10^{(2m)^{2}}), O(10^{mn}), and O(10^{mn}), respectively. These are also the computational complexities of the search process.
In the second case, we assume that an eavesdropper has complete knowledge of the system and has the necessary hardware to synchronize and isolate the frames. In other words, he knows that the scrambling matrix S can be constructed from Φ_{ i } and that the sparsifying matrix D_{ i } can simply be rebuilt from its characteristic structure. Hence, the security of the system is assumed to reside entirely in the selection of a key Φ_{ i }. For the eavesdropper, the only task is to find the key. Owing to the randomness of Φ_{ i }, the keyspace size is O(10^{mn}). In practical situations, speech frames are of length n=180–220, and if we choose a compression rate of 5%, the length of the compressed signal is m=9–11. Therefore, the order of magnitude of the keyspace is about 10^{2000}.
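The arithmetic behind these orders of magnitude can be checked with a few lines. The helper name below is hypothetical; it simply evaluates the exponents mn and (2m)^{2} for the frame lengths quoted above at a 5% compression rate.

```python
def keyspace_exponents(n, rate):
    """Return (m, exponent of the Phi keyspace, exponent of the S keyspace)."""
    m = round(n * rate)          # length of the compressed signal
    return m, m * n, (2 * m) ** 2

for n in (180, 200, 220):
    m, exp_phi, exp_S = keyspace_exponents(n, 0.05)
    print(f"n={n}, m={m}: Phi keyspace ~ 10^{exp_phi}, S keyspace ~ 10^{exp_S}")
```

For n=200 and m=10 the exponent mn equals 2000, which is the figure quoted for the keyspace of Φ_{ i }.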
Table 2 compares the keyspace sizes and the compression ratios of the proposed scheme and some prior scramblers, which employ representative key schedules, including Hadamard matrix [4], Latin Square [6], dimensionalvariable matrix [7], and chaos system [12]. As seen in Table 2, the proposed scheme provides larger keyspace and requires lower communication overhead.
Now let us consider a possible deciphering technique based on dictionary learning, regardless of deciphering delay. A cryptanalyst trying to break the system may possess large amounts of encrypted signal but not the key book, since the latter is held by both the encryption and decryption parts and is never transmitted through the channel. He knows the complete specification of the system (scrambling mechanism, structure of the sparsifying dictionary) and would like to deduce the key without considering the real-time requirement.
In mathematical language, the j th encryption operation is represented by
By wiretapping, the cryptanalyst has obtained enough encrypted signal y_{(j)}^{D}. He would like to learn \widehat{S}^{(j)} from Y^{D}, where Y^{D}=[y_{(1)}^{D} y_{(2)}^{D} ⋯ y_{(j)}^{D}]. Each \begin{bmatrix}y_{1}^{(j)}\\ y_{2}^{(j)}\end{bmatrix} is unknown. Unfortunately, this is an optimization problem with no constraint conditions and thus cannot be solved, all the more so because the scrambling matrix S varies from frame to frame rather than being fixed. Moreover, even if he were able to find some fixed \widehat{S}, he still could not rebuild the sensing matrix Φ. This can be verified through Equations (15) and (16).
In fact, there would not be enough data or delay tolerance for cryptanalysis. For instance, in secure communications of military information or intelligence on espionage activities, the key information is expected to be as brief as possible to ensure short durations. Therefore, dictionary learning may not be a feasible approach in real cases.
4.3. Robustness performance
Since readily decipherable unintelligible signals may also be generated with a large keyspace, other factors, including bandwidth expansion, delay times, and channel resistance (to noise, distortion, etc.), cannot be ignored in assessing security. Two types of attack are applied to the encrypted signal: (1) additive white Gaussian noise (AWGN); (2) LPF.
Representative speech scrambling schemes are chosen for comparison with the CS scheme. To be specific, time-domain scrambling (TDS) [5] is adopted to stand for non-compressional scramblers. In parallel, the approximate 13-line μ-law pulse code modulation (PCM) and MELP [9] are chosen to represent waveform coding and parametric coding, respectively, among compressional scramblers.
The MOS is chosen as the subjective criterion. In addition, the average segmental signal-to-noise ratio (SNRseg) [30] is adopted as the objective criterion to evaluate the quality of the recovered speech, given by (23).
where SNRseg_{j}=\sum_{i=1}^{n}\frac{x_{j}^{2}(i)}{[x_{j}(i)−\widehat{x}_{j}(i)]^{2}} and N_{ frame } denotes the total number of frames. The results are calculated and averaged for a test set of approximately 100 sentences randomly selected from the TIMIT database.
4.3.1. Noise resistance
AWGN is added to the encrypted signal of each scheme, and the performance of the proposed scheme is compared with that of the comparative schemes. The compression ratio of CS is set to 1:10.
As shown in Figure 6a, the CS scheme outperforms the comparative schemes for all degrees of contamination. As the signal-to-noise ratio (SNR) becomes higher, the superiority of the CS scheme becomes more obvious; the compressional schemes, including PCM and MELP, perform worse, which is also verified by the SNRseg decrements (Figure 6b).
The results are mainly due to the use of the stochastic matrix, which has extremely low column coherence. The studies [31, 32] have shown that for a noiseless signal y=Φs, if the K-sparse vector s satisfies ‖s‖_{0}<\frac{1}{2}spark(Φ), then the reconstruction \widehat{s} from the contaminated measurements (y+e) satisfies
where spark(Φ) stands for the minimum number of columns of Φ that are linearly dependent, and M is the “mutual coherence” of Φ [31], defined as
In short, one can stably reconstruct the sparse vector s with error proportional to the noise level, provided that (1) the columns of the sensing matrix Φ are weakly mutually correlated, and (2) the vector s is sufficiently sparse. The reason why the reconstruction error is bounded is geometric in nature: the reconstruction \widehat{s} is confined within a tiny tubular wedge that surrounds the original vector s, which ensures the stability of recovery (for further details please refer to [31], subsection 5.3). In particular, the method of convex relaxation can identify a sparse signal in AWGN [33], and more sophisticated stable recovery schemes and bounds have been investigated [34].
Technically speaking, the waveform coding scheme has no noise-resistance mechanism and thus is vulnerable to noise. In the parametric coding scheme, the noise introduces errors into feature parameters such as the pitch period and the voiced/unvoiced decision. Once such parameters are contaminated, undesirable reconstruction distortion occurs. For the reasons given above, the noise resistance of the proposed scheme is better than that of its counterparts.
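To make the notion of weakly correlated columns concrete, the sketch below computes the mutual coherence of a matrix, taken here in its usual form as the largest absolute inner product between distinct unit-normalized columns (assumed to agree with the definition used in [31]). The matrix sizes and the function name are hypothetical.

```python
import numpy as np

def mutual_coherence(Phi):
    """Largest absolute normalized inner product between two distinct columns."""
    cols = Phi / np.linalg.norm(Phi, axis=0, keepdims=True)
    G = np.abs(cols.T @ cols)        # absolute Gram matrix of the normalized columns
    np.fill_diagonal(G, 0.0)         # ignore each column's product with itself
    return float(G.max())

rng = np.random.default_rng(5)
Phi = rng.standard_normal((20, 200))         # hypothetical 1:10 sensing matrix
print("coherence of a random matrix:", mutual_coherence(Phi))

duplicated = np.hstack([Phi[:, :5], Phi[:, :1]])   # first column repeated
print("coherence with a repeated column:", mutual_coherence(duplicated))
```

A coherence of 0 corresponds to orthogonal columns and a coherence of 1 to a pair of parallel columns; a smaller value gives a more favorable stability bound.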
4.3.2. LPF
For LPF, we compare the decrements of MOS and SNRseg between speech reconstructed from the filtered encrypted data and speech reconstructed from the original encrypted data. The compression ratio of CS is also set to 1:10.
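The measurement described above can be sketched in Python as follows. This is a minimal illustration under assumptions that are not taken from this article: the low-pass attack is modeled as an ideal (brick-wall) filter applied in the FFT domain, SNRseg is computed with the usual segmental definition (the mean of per-frame SNR values in dB), and the 8 kHz sampling rate and 160-sample frame length are hypothetical defaults. The function names are likewise illustrative.

```python
import numpy as np

def lowpass_fft(signal, cutoff_hz, fs):
    """Ideal low-pass filter: zero every FFT bin above cutoff_hz."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

def snr_seg(reference, degraded, frame_len=160):
    """Segmental SNR in dB: mean of per-frame SNRs over full frames."""
    reference = np.asarray(reference, dtype=float)
    degraded = np.asarray(degraded, dtype=float)
    n_frames = min(len(reference), len(degraded)) // frame_len
    frame_snrs = []
    for i in range(n_frames):
        ref = reference[i * frame_len:(i + 1) * frame_len]
        err = ref - degraded[i * frame_len:(i + 1) * frame_len]
        sig_energy = np.sum(ref ** 2)
        err_energy = np.sum(err ** 2)
        if sig_energy == 0.0 or err_energy == 0.0:
            continue                      # skip silent or error-free frames
        frame_snrs.append(10.0 * np.log10(sig_energy / err_energy))
    # If every frame was skipped, report infinity (an assumed convention).
    return float(np.mean(frame_snrs)) if frame_snrs else float("inf")

def snr_seg_decrement(original, rec_clean, rec_attacked, frame_len=160):
    """SNRseg lost when reconstructing from attacked instead of clean data."""
    return (snr_seg(original, rec_clean, frame_len)
            - snr_seg(original, rec_attacked, frame_len))
```

Here `rec_clean` and `rec_attacked` stand for the speech reconstructed from the original and from the filtered encrypted data, respectively; a smaller decrement indicates better robustness to the attack.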
As seen in Figure 7, the CS scheme slightly outperforms the comparative schemes when the cutoff frequency ranges from 2400 to 2600 Hz. Its advantage grows as the cutoff frequency increases and becomes clearly visible. The TDS scheme ranks second, and the PCM scheme shows the worst performance.
The encrypted signal obtained from CS has had its speech characteristics largely removed by aliasing and scrambling, and it is therefore the most resistant to LPF. In contrast, the TDS scheme still retains some speech structure, which makes its performance inferior to that of the CS scheme. The TDS scheme outperforms the other two comparative schemes owing to its robustness to time-domain perturbations [24]. As for the PCM and MELP schemes, their encoded signals have no spectral structure. All parts of the signal are of similar importance and are therefore vulnerable to this type of attack. Any damage to the encrypted signal causes serious reconstruction errors and degrades the auditory quality. Consequently, these two schemes have the worst robustness to LPF.
5. Conclusions
This article presented a scrambling-based speech encryption algorithm via CS. A high degree of security is achieved owing to low residual intelligibility and a large keyspace. The immense complexity of finding the scrambling matrix ensures the effectiveness of the encryption. The scheme affords notable robustness to common hostile attacks, while requiring lower communication cost and introducing only a slight (about 200 ms) processing delay. Experimental results demonstrate the improved performance of the scheme compared with state-of-the-art speech scramblers. In future work, we plan to investigate more sophisticated sparse representations of speech and reconstruction algorithms to further reduce the compression ratio and improve the auditory quality of the recovered speech.
Appendix
The full-rank property mentioned in Section 3.2 is proved as follows.
For a full row rank matrix $P \in R^{m \times n}$ ($m < n$) with rank(P) = m, denote its SVD as $P = U \Sigma V^{H}$, where $\Sigma = [\Sigma_{m} \;\; O_{m \times (n-m)}] \in R^{m \times n}$ and $\Sigma_{m} = \mathrm{diag}(\sigma_{1}, \sigma_{2}, \ldots, \sigma_{m})$. Here U and V are m×m and n×n unitary matrices, respectively, and $\sigma_{i} > 0$ denote the square roots of the nonzero eigenvalues of $P^{H}P$. With the SVD of P, we have
Then for $Q = \begin{bmatrix} P \\ P \end{bmatrix} = U' \Sigma' V'^{H} \in R^{2m \times n}$ with rank(Q) = m, $Q^{H}Q$ is represented as
where $U' \in R^{2m \times 2m}$, $\Sigma' \in R^{2m \times n}$, and $V' \in R^{n \times n}$. Comparing (26) and (27), it is noticed that $V' = V$, $\Sigma' = \begin{bmatrix} \Sigma_{m} & O_{m \times (n-m)} \\ O_{m \times m} & O_{m \times (n-m)} \end{bmatrix}$, and $U'U'^{H} = 2I_{2m}$, where $I_{2m}$ denotes the 2m×2m identity matrix. In this case, if there is a proper U′, the SVD of Q can be obtained. It is verified that $U' = \begin{bmatrix} U & -U \\ U & U \end{bmatrix}$ satisfies $U'U'^{H} = 2I_{2m}$; then (27) is rewritten as
Thus, with the SVD $Q = U' \Sigma' V'^{H}$, the null space basis of $Q^{H}$ can be constructed.
where $u'_{m+i}$ denotes the (m+i)-th column vector of U′.
Owing to the randomness of P, dim[Null($Q^{H}$)] = m; that is, the NSB matrix $N = \begin{bmatrix} N_{1} \\ N_{2} \end{bmatrix}$ is a full column rank matrix.
By elementary row operations, N can be converted to the form $\begin{bmatrix} N_{1} \\ O \end{bmatrix}$ without changing its rank. Therefore, rank($N_{1}$) = m, which means that $N_{i}$, i = 1, 2, are full rank matrices and possess inverses. This completes the proof.
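The conclusion of the proof can be checked numerically. The Python sketch below is only an illustration with hypothetical sizes and names: it draws a random real P (so that the conjugate transpose reduces to the transpose), stacks it into Q = [P; P], takes the left singular vectors of Q associated with zero singular values as a basis N of Null(Q^H), and inspects the two m×m blocks N1 and N2.

```python
import numpy as np

def null_space_blocks(m=4, n=10, seed=0):
    """Build Q = [P; P] and return a basis N of Null(Q^H) with its two blocks."""
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((m, n))        # random P, full row rank with probability 1
    Q = np.vstack([P, P])                  # shape (2m, n), rank m
    U, sing, _ = np.linalg.svd(Q)          # full SVD, U has shape (2m, 2m)
    rank = int(np.sum(sing > 1e-10 * sing[0]))
    N = U[:, rank:]                        # left singular vectors with zero singular value
    return Q, N, N[:m, :], N[m:, :], rank

if __name__ == "__main__":
    Q, N, N1, N2, rank = null_space_blocks()
    print("rank(Q) =", rank)
    print("max |Q^T N| =", np.abs(Q.T @ N).max())
    print("rank(N1), rank(N2) =",
          np.linalg.matrix_rank(N1), np.linalg.matrix_rank(N2))
```

In this construction the two blocks also satisfy N2 = −N1, because Q^H [a; b] = P^H (a + b) and P^H has full column rank, so every null vector has b = −a. Both blocks are therefore invertible, in agreement with the proof.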
Abbreviations
AMR: Adaptive multi-rate
AWGN: Additive white Gaussian noise
BC: Before Christ
CoS: Compressed signal
CS: Compressed sensing
ES: Encrypted signal
K–L: Karhunen–Loève
LPF: Low-pass filtering
MELP: Mixed excitation linear prediction
MOS: Mean opinion score
NSB: Null space basis
OMP: Orthogonal matching pursuit
PCM: Pulse code modulation
SNR: Signal-to-noise ratio
SNRseg: Average-subsection (segmental) signal-to-noise ratio
SVD: Singular value decomposition
TDS: Time-domain scrambling
References
[1] Clark JA: Nature-inspired cryptography: Past, Present and Future. In Congress on Evolutionary Computation. 3rd edition. Newport Beach, USA; 2003:1647–1654.
[2] Nan L, Yanhong S, Jiancheng Z: An audio scrambling method based on Fibonacci transformation. J. North China Univ. Technol. 2004, 16(3):8–11.
[3] Senk V, Delic VD, Milosevic VS: A new speech scrambling concept based on Hadamard matrices. IEEE Signal Process. Lett. 1997, 4(6):161–163.
[4] Pal SK: Fast, reliable & secure digital communication using Hadamard matrices. In Proceedings of the International Conference on Computing: Theory and Applications. 1st edition. Kolkata, India; 2007:526–532.
[5] Li H, Qin Z, Zhang XP, Shao LP: An n-dimensional space audio scrambling algorithm based on random matrix. J. Xi’an Jiaotong Univ. 2010, 44(4):13–17.
[6] Satti M, Kak S: Multilevel indexed quasigroup encryption for data and speech. IEEE Trans. Broadcast. 2009, 55(2):270–281.
[7] Li H, Qin Z: Audio scrambling algorithm based on variable dimension spaces. In International Conference on Industrial and Information Systems. 1st edition. West Bengal, India; 2009:316–319.
[8] Honggang W, Michael H, Hamid S, Peng DM, Wang W, Hsiao-Hwa C: Index-based selective audio encryption for wireless multimedia sensor networks. IEEE Trans. Multimed. 2010, 12(3):215–223.
[9] Antonio S, Juan Carlos M: Perception-based partial encryption of compressed speech. IEEE Trans. Speech Audio Process. 2002, 10(8):637–643.
[10] Moulin P, O’Sullivan JA: Information-theoretic analysis of information hiding. IEEE Trans. Inf. Theory 2003, 49(3):563–593.
[11] LiLian H, Qitian Y: A chaos synchronization secure communication system based on output control. J. Electron. Inf. Technol. 2009, 31(10):2402–2405.
[12] Liangrui T, Lin Z, Xue Y: Chaos synchronization based on observer and its application in speech secure communication. In Proceedings of ICNIDC. 2nd edition. Beijing, China; 2010:773–777.
[13] Del Re E, Fantacci R, Maffucci D: A new speech signal scrambling method for secure communications: theory, implementation, and security evaluation. IEEE J. Sel. Areas Commun. 1989, 7(4):474–480.
[14] Shannon CE: Communication theory of secrecy systems. Bell Syst. Tech. J. 1949, 28(4):656–715.
[15] Donoho DL: Compressed sensing. IEEE Trans. Inf. Theory 2006, 52(4):1289–1306.
[16] Baraniuk RG: Lecture notes: compressive sensing. IEEE Signal Process. Mag. 2007, 24(4):118–121.
[17] Candès EJ, Wakin MB: An introduction to compressive sampling. IEEE Signal Process. Mag. 2008, 25(2):21–30.
[18] Tianjing W, Baoyu Z, Zhen Y: A speech signal sparse representation algorithm based on adaptive overcomplete dictionary. J. Electron. Inf. Technol. 2011, 33(10):2372–2377.
[19] Zhanghua C, Yuansheng T: Secure communication based on network coding. J. Commun. 2010, 31(8A):188–194.
[20] Smith ZM, Delgutte B, Oxenham AJ: Chimaeric sounds reveal dichotomies in auditory perception. Nature 2002, 416: 87–90.
[21] Mehmet Cenk S, Bilge G, Gunes Karabulut K: Perceptual audio features for emotional detection. EURASIP J. Audio Speech Music Process. 2012, 16: 1–21.
[22] Tropp JA, Gilbert AC: Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. Inf. Theory 2007, 53(12):4655–4666.
[23] Navarro-Moreno J, Ruiz-Molina JC: Nonlinear estimation using correlation information. IEEE Trans. Signal Process. 2006, 54(7):2822–2827.
[24] Saberi K, Perrott DR: Cognitive restoration of reversed speech. Nature 1999, 398: 760.
[25] Joris PX, Yin TC: Envelope coding in the lateral superior olive. I. Sensitivity to interaural time differences. J. Neurophysiol. 1995, 73: 1043–1062.
[26] Bendor D, Wang X: The neuronal representation of pitch in primate auditory cortex. Nature 2005, 436: 1161–1165.
[27] Xianda Z: Matrix Analysis and Applications. Tsinghua University Press, Beijing; 2004.
[28] ITU-T P.862: Perceptual Evaluation of Speech Quality (PESQ), an Objective Method for End-to-End Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs. ITU-T Recommendation, Geneva; 2001.
[29] Georgiev P, Theis F, Cichocki A: Sparse component analysis and blind source separation of underdetermined mixtures. IEEE Trans. Neural Netw. 2005, 16(4):992–996.
[30] Tingting X, Zhen Y, Xi S: Novel speech secure communication system based on information hiding and compressed sensing. In The 4th International Conference on System and Networks Communications. 4th edition. 2009:201–206.
[31] Donoho DL, Elad M, Temlyakov VN: Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inf. Theory 2006, 52(1):6–18.
[32] Babaie-Zadeh M, Jutten C: On the stable recovery of the sparsest overcomplete representations in presence of noise. IEEE Trans. Signal Process. 2010, 58(10):5396–5400.
[33] Tropp JA: Just relax: convex programming methods for identifying sparse signals in noise. IEEE Trans. Inf. Theory 2006, 52(3):1030–1051.
[34] Sun Q: Sparse approximation property and stable recovery of sparse signal from noisy measurements. IEEE Trans. Signal Process. 2011, 59(10):5086–5090.
Acknowledgments
This study was supported by the National Natural Science Foundation, China (61072042), the Natural Science Foundation of Jiangsu Province, China (BK2012510), and the Pre-research Foundations of PLA University of Science and Technology (20110211). The authors sincerely thank Professor Shousheng Liu and Professor Zuping Qian for their useful discussions and valuable suggestions. The authors would also like to thank the anonymous reviewers for their constructive comments and questions, which greatly improved the article.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Zeng, L., Zhang, X., Chen, L. et al. Scrambling-based speech encryption via compressed sensing. EURASIP J. Adv. Signal Process. 2012, 257 (2012). https://doi.org/10.1186/1687-6180-2012-257
DOI: https://doi.org/10.1186/1687-6180-2012-257