Robust Speech Watermarking Procedure in the Time-Frequency Domain

Abstract: An approach to speech watermarking based on time-frequency signal analysis is proposed. As a time-frequency representation suitable for speech analysis, the S-method is used. The time-frequency characteristics of the watermark are modeled by using speech components in the selected region. The modeling procedure is based on the concept of time-varying filtering. A detector form that includes cross-terms in the Wigner distribution is proposed. Theoretical considerations are illustrated by examples. The efficiency of the proposed procedure has been tested for several signals and under various attacks.


I. INTRODUCTION
Digital watermarking has been developed as an effective solution for multimedia data protection. Watermarking usually assumes embedding of a secret signal that should be robust and imperceptible within the host data. Also, reliable watermark detection must be provided. A number of proposed watermarking techniques refer to speech and audio signals [1]. Some of them are based on the spread-spectrum method [2]-[4], while others are related to the time-scale method [5], [6], or to fragile content features combined with robust watermarking [7].
The existing watermarking techniques are mainly based on either the time or the frequency domain. However, in both cases, the time-frequency characteristics of the watermark do not correspond to the time-frequency characteristics of the speech signal. This may cause watermark audibility, because the watermark will be present in time-frequency regions where speech components do not exist. In this paper, a time-frequency based approach for speech watermarking is proposed. The watermark in the time-frequency domain is modeled to follow specific speech components in the selected time-frequency regions. Additionally, in order to provide its imperceptibility, the energy of the watermark is adjusted to the energy of the speech components. In image watermarking, an approach based on the two-dimensional space/spatial frequency distribution has already been proposed in [8]. However, it is not appropriate in the case of speech signals.
Among all time-frequency representations, the spectrogram is the simplest one. However, it has a low time-frequency resolution. On the other hand, the Wigner distribution, as one of the commonly used representations, produces a large amount of cross-terms in the case of multicomponent signals. Thus, the S-method, as a cross-terms free time-frequency representation, can be used for speech analysis. The watermark is created by modeling the time-frequency characteristics of a pseudo-random sequence according to certain time-frequency speech components. The main problem in these applications is the inversion of the time-frequency distributions. A procedure based on time-varying filtering has been proposed in [9]. The Wigner distribution has been used to create a time-varying filter that identifies the support of a monocomponent chirp signal. However, it cannot be used in the case of multicomponent speech signals. Also, some interesting approaches to the extraction of signal components from the time-frequency plane have been proposed in [10], [11].
In this work, time-varying filtering, based on the cross-terms free time-frequency representation, is adapted for speech signals and for the watermarking purpose. Namely, this concept is used to identify the support of certain speech components in the time-frequency domain and to model the watermark according to these components. The basic idea of this approach has been introduced in [12]. Time-varying filtering is also used to overcome the problem of inverse mapping from the time-frequency domain. Additionally, a reliable procedure for blind watermark detection is provided by modifying the correlation detector in the time-frequency domain. It is based on the Wigner distribution, because the presence of cross-terms improves detection results [13]. Therefore, the main advantage of the proposed method is in providing efficient watermark detection with low probabilities of error for a set of strong attacks. The payload provided by this procedure is suitable for various applications [1].
The paper is organized as follows. Time-frequency representations and the concept of time-varying filtering are presented in Section II. A proposal for watermark embedding and detection is given in Section III. The evaluation of the proposed procedure is performed through various examples and tests in Section IV. Concluding remarks are given in Section V.

II. THEORETICAL BACKGROUND
Time-frequency representations of speech signal and the concept of time-varying filtering will be considered in this Section.

A. Time-frequency representation of speech signals
Time-frequency representations have been used for speech signal analysis. The Wigner distribution, as one of the commonly used time-frequency representations, in its pseudo form is defined as:

$$WD(n,k) = 2\sum_{m=-N/2}^{N/2} w(m)\,w^{*}(-m)\,f(n+m)\,f^{*}(n-m)\,e^{-j2\pi\,2mk/N}, \qquad (1)$$

where f represents a signal (* denotes the conjugated function), w is the window function, N is the window length, while n and k are discrete time and frequency variables, respectively. However, if we represent a multicomponent signal (such as speech) as a sum of M components $f_i(n)$, that is, $f(n)=\sum_{i=1}^{M} f_i(n)$, its Wigner distribution produces a large amount of cross-terms:

$$WD_f(n,k) = \sum_{i=1}^{M} WD_f^{i}(n,k) + 2\,\mathrm{Re}\Big\{\sum_{i=1}^{M}\sum_{j>i} WD_f^{ij}(n,k)\Big\}, \qquad (2)$$

where $WD_f^{i}(n,k)$ are the auto-terms, while $WD_f^{ij}(n,k)$, for $i \neq j$, represent the cross-terms. In order to preserve the auto-terms concentration as in the Wigner distribution, and to reduce the presence of cross-terms, the S-method (SM) has been introduced [14]:

$$SM(n,k) = \sum_{l=-L}^{L} P(l)\,STFT(n,k+l)\,STFT^{*}(n,k-l), \qquad (3)$$

where P(l) is a finite frequency domain window with length 2L+1, while STFT is the short-time Fourier transform defined as:

$$STFT(n,k) = \sum_{m=-N/2}^{N/2} w(m)\,f(n+m)\,e^{-j2\pi mk/N},$$

with window function w(m). Thus, the SM of a multicomponent signal whose components do not overlap in the time-frequency plane represents the cross-terms free Wigner distribution of the individual signal components. By taking the rectangular window P(l), the discrete form of the SM can be written as:

$$SM(n,k) = |STFT(n,k)|^{2} + 2\,\mathrm{Re}\Big\{\sum_{l=1}^{L} STFT(n,k+l)\,STFT^{*}(n,k-l)\Big\}. \qquad (4)$$

Note that the terms in the summation improve the quality of the spectrogram (squared modulus of the short-time Fourier transform) toward the quality of the Wigner distribution.
The window P(l) should be wide enough to enable the complete summation over the auto-terms. At the same time, to remove the cross-terms, it should be narrower than the distance between the auto-terms. The convergence within P(l) is very fast, so that high auto-terms concentration is obtained with only a few summation terms. Thus, in many applications L < 5 can be used [14]. Unlike the Wigner distribution, oversampling in the time domain is not necessary, since the aliasing components will be removed in the same way as the cross-terms. More details about the S-method can be found in [14], [15].
Compared to other quadratic time-frequency distributions, the S-method provides a significant saving in computation time. The number of complex multiplications for the S-method is $N(3+L)/2$, while the number of complex additions is $N(6+L)/2$ [14] (N is the number of samples within the window w(m)). In the case of the Wigner distribution, these numbers are significantly larger: $N(4+\log_2 N)/2$ for complex multiplications and $N\log_2 2N$ for complex additions. It is important to note that the S-method allows a simple and efficient hardware realization, which has already been implemented [16], [17].
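The role of the window width L can be illustrated numerically. The following Python/NumPy sketch evaluates the S-method from STFT values with a rectangular P(l). The helper names `stft` and `s_method`, the circular treatment of the frequency index, and the two-tone test signal are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def stft(signal, win_len=64):
    """Rectangular-window STFT; rows are time instants n, columns are bins k."""
    half = win_len // 2
    padded = np.pad(signal, (half, half))            # zeros outside the signal
    frames = np.stack([padded[n:n + win_len] for n in range(len(signal))])
    # ifftshift puts lag m = 0 at index 0, so the phase reference is the frame centre
    return np.fft.fft(np.fft.ifftshift(frames, axes=1), axis=1)

def s_method(S, L):
    """S-method with rectangular P(l) of width 2L+1 (frequency index is circular)."""
    sm = np.abs(S) ** 2                               # l = 0 term: the spectrogram
    for l in range(1, L + 1):
        plus = np.roll(S, -l, axis=1)                 # STFT(n, k + l)
        minus = np.roll(S, l, axis=1)                 # STFT(n, k - l)
        sm += 2.0 * np.real(plus * np.conj(minus))
    return sm

# Hypothetical test signal: two complex tones located exactly at bins 8 and 24.
N = 64
n = np.arange(256)
two_tones = np.exp(2j * np.pi * 8 * n / N) + np.exp(2j * np.pi * 24 * n / N)
S = stft(two_tones, N)

sm_narrow = s_method(S, L=3)   # window narrower than the auto-term distance
sm_wide = s_method(S, L=8)     # window wide enough to reach both auto-terms
print(sm_narrow[100, 8], sm_narrow[100, 16], sm_wide[100, 16])
```

With L=3 the value at bin 16, midway between the two tones, stays at zero up to rounding, while L=8 produces a cross-term there. This is the behaviour that is suppressed for analysis and deliberately exploited in the detection stage.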
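As a quick check of the operation counts quoted above, the short Python sketch below evaluates them for N = 256 and L = 5, the window length and S-method width used later in Example 1. The function names are hypothetical, and the last expression is read as N·log2(2N).

```python
import math

def s_method_ops(N, L):
    """(complex multiplications, complex additions) for the S-method."""
    return N * (3 + L) / 2, N * (6 + L) / 2

def wigner_ops(N):
    """(complex multiplications, complex additions) for the Wigner distribution."""
    return N * (4 + math.log2(N)) / 2, N * math.log2(2 * N)

sm_mults, sm_adds = s_method_ops(256, 5)
wd_mults, wd_adds = wigner_ops(256)
print(sm_mults, sm_adds, wd_mults, wd_adds)
```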

B. Time-varying filtering
Time-varying filtering is used in order to obtain a watermark with specific time-frequency properties, as well as to provide the inverse transform from the time-frequency domain. In the sequel, the general concept of time-varying filtering is presented.
For a given signal x, the pseudo form of time-varying filtering, suitable for numerical realizations, has been defined as [18]:

$$Hx(t) = \int_{-\infty}^{\infty} h\!\left(t+\tfrac{\tau}{2},\,t-\tfrac{\tau}{2}\right) w(\tau)\,x(t+\tau)\,d\tau, \qquad (5)$$

where w is a lag window, τ is a lag coordinate, while h represents the impulse response of the time-varying filter. The time-varying transfer function, that is, the support function, has been defined as the Weyl symbol mapping of the impulse response into the time-frequency domain [18]:

$$L_H(t,\omega) = \int_{-\infty}^{\infty} h\!\left(t+\tfrac{\tau}{2},\,t-\tfrac{\tau}{2}\right) e^{-j\omega\tau}\,d\tau, \qquad (6)$$

where t and ω are time and frequency variables, respectively. Thus, by using the support function (6), the filter output can be obtained as [18]:

$$Hx(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} L_H(t,\omega)\,STFT_x(t,\omega)\,d\omega. \qquad (7)$$

The discrete form of the above relation can be written as:

$$Hx(n) = \frac{1}{N}\sum_{k=-N/2}^{N/2} L_H(n,k)\,STFT_x(n,k), \qquad (8)$$

where $STFT_x$ is the STFT of an input signal x, while N is the length of the window w(m). According to (8), by using the STFT of a pseudo-random sequence and a suitable support function, a watermark with specific time-frequency characteristics will be obtained [12]. The support function will be defined in the form of a time-frequency mask that corresponds to certain speech components.
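The discrete filtering relation lends itself to a direct numerical sketch. The Python/NumPy example below weights the STFT of a signal by a binary time-frequency mask and sums over frequency, with 1/N normalization over the N FFT bins. The helper names, the rectangular window, and the two-cosine test signal are assumptions for illustration only.

```python
import numpy as np

def stft(x, win_len):
    """Rectangular-window STFT centred on each sample (rows: time, columns: bins)."""
    half = win_len // 2
    padded = np.pad(x, (half, half))
    frames = np.stack([padded[n:n + win_len] for n in range(len(x))])
    return np.fft.fft(np.fft.ifftshift(frames, axes=1), axis=1)

def time_varying_filter(x, support, win_len):
    """Filter output as the frequency sum of support(n, k) * STFT_x(n, k), divided by N."""
    return (support * stft(x, win_len)).sum(axis=1) / win_len

N = 64
n = np.arange(512)
low = np.cos(2 * np.pi * 8 * n / N)      # component to keep
high = np.cos(2 * np.pi * 24 * n / N)    # component to reject

# Mask that passes only bin 8 and its mirror image N - 8 at every time instant.
support = np.zeros((n.size, N))
support[:, [8, N - 8]] = 1.0

filtered = time_varying_filter(low + high, support, N).real
all_pass = time_varying_filter(low + high, np.ones((n.size, N)), N).real
```

An all-ones mask returns the input sample by sample, which is why the same relation can serve as the inverse mapping from the time-frequency domain. Away from the signal edges, the selective mask returns only the component at bin 8.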

III. WATERMARKING PROCEDURE USING TIME-FREQUENCY REPRESENTATION
A method for time-frequency based speech watermarking is proposed in this Section. The watermark is embedded in the components of a voiced speech part. It is modeled to follow the time-frequency characteristics of significant speech formants. Furthermore, the procedure for watermark detection in the time-frequency domain is proposed.

A. Watermark sequence generation
In order to select the speech components for watermarking, the region D in the time-frequency plane, that is, $D=\{(t,\omega): t\in(t_1,t_2),\ \omega\in(\omega_1,\omega_2)\}$, is considered (Fig. 1). The time instances $t_1$ and $t_2$ correspond to the start and the end of the voiced speech part. The voice activity detector, that is, the word end-points detector [19]-[21], is used to select the voiced part of the speech signal. The strongest formants are selected within the frequency interval $\omega\in(\omega_1,\omega_2)$.
The time-frequency characteristics of the watermark within the region D can be modeled by using the support function defined as:

$$L_M(t,\omega) = \begin{cases} 1, & (t,\omega)\in D,\\ 0, & (t,\omega)\notin D. \end{cases} \qquad (9)$$

Thus, the support function $L_M$ will be used to create a watermark with specific time-frequency characteristics. In order to use only the strongest formant components, the energy floor ξ is introduced. Thus, the function $L_M$ can be modified as:

$$L_M(t,\omega) = \begin{cases} 1, & (t,\omega)\in D \ \text{and}\ |SM_x(t,\omega)| > \xi,\\ 0, & \text{otherwise}, \end{cases} \qquad (10)$$

where $SM_x(t,\omega)$ represents the SM of the speech signal. Since the energy floor ξ is used to avoid watermarking of weak components, it is set relative to the maximal value of the signal's S-method in the region D, by means of a parameter λ with values between 0 and 1. A higher λ means that stronger components are taken. It is assumed that the significant components within the region are approximately of the same strength. This means that only a few closest formants should be considered within the region D. Therefore, if different time-frequency regions are used for watermarking, each energy floor should be adapted to the strength of the maximal component within the considered region. It is important to note that, generally, the value ξ is not necessary for the detection procedure, as will be explained later.
The pseudo-random sequence p is the input of the time-varying filter. According to (8), the watermark is obtained as:

$$w_{key}(n) = \frac{1}{N}\sum_{k=-N/2}^{N/2} L_M(n,k)\,STFT_p(n,k),$$

where $STFT_p(n,k)$ is the discrete STFT of the sequence p. Since the watermark is modeled by using the function $L_M$, it will be present only within the specified region where strong signal components exist.
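The construction of the support function and of the modeled watermark can be sketched as follows in Python/NumPy. Several elements are assumptions: the synthetic two-cosine stand-in for voiced speech, the helper names, the mirroring of the region D onto negative-frequency bins (so that the watermark is real-valued), and in particular the energy floor `xi = lam * max|SM|`, a simple placeholder for the expression used with the parameter λ.

```python
import numpy as np

def stft(x, win_len):
    """Rectangular-window STFT centred on each sample (rows: time, columns: bins)."""
    half = win_len // 2
    padded = np.pad(x, (half, half))
    frames = np.stack([padded[n:n + win_len] for n in range(len(x))])
    return np.fft.fft(np.fft.ifftshift(frames, axes=1), axis=1)

def s_method(S, L):
    """S-method with a rectangular frequency window of width 2L+1."""
    sm = np.abs(S) ** 2
    for l in range(1, L + 1):
        sm += 2.0 * np.real(np.roll(S, -l, axis=1) * np.conj(np.roll(S, l, axis=1)))
    return sm

def support_function(sm, t_range, k_range, lam):
    """Binary mask: inside region D and above the (assumed) energy floor."""
    n_bins = sm.shape[1]
    (t1, t2), (k1, k2) = t_range, k_range
    region = np.zeros(sm.shape, dtype=bool)
    region[t1:t2, k1:k2 + 1] = True
    region[t1:t2, n_bins - k2:n_bins - k1 + 1] = True   # mirrored negative frequencies
    xi = lam * np.abs(sm[region]).max()   # ASSUMPTION: placeholder energy-floor rule
    return (region & (np.abs(sm) > xi)).astype(float)

N = 64
n = np.arange(512)
speech_like = np.cos(2 * np.pi * 8 * n / N) + 0.8 * np.cos(2 * np.pi * 20 * n / N)
t1, t2 = 100, 400                         # hypothetical voiced segment

sm_x = s_method(stft(speech_like, N), L=3)
mask = support_function(sm_x, (t1, t2), (4, 24), lam=0.5)

rng = np.random.default_rng(0)
p = rng.choice([-1.0, 1.0], size=n.size)  # pseudo-random sequence (the key)
w_complex = (mask * stft(p, N)).sum(axis=1) / N
w_key = w_complex.real                    # imaginary part vanishes for a symmetric mask
```

In this toy setting the mask keeps only the two "formant" bins (and their mirror images) inside the voiced segment, so the modeled sequence is zero outside (t1, t2).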
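Finally, the watermark embedding is done according to:

$$x_w(n) = x(n) + w_{key}(n),$$

where x is the host speech signal and $x_w$ is the watermarked signal.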
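As a rough illustration of additive embedding with a controlled watermark energy, the Python/NumPy sketch below scales a watermark so that the watermarked signal reaches a prescribed SNR. Scaling to a target SNR, the function names, and the test signals are assumptions; the paper adjusts the watermark energy to the selected speech components rather than to a global SNR target.

```python
import numpy as np

def snr_db(host, watermarked):
    """SNR in dB, treating the embedded watermark as the noise."""
    noise = watermarked - host
    return 10.0 * np.log10(np.sum(host ** 2) / np.sum(noise ** 2))

def embed(host, watermark, target_snr_db):
    """Additive embedding with the watermark gain chosen for the target SNR."""
    gain = np.sqrt(np.sum(host ** 2)
                   / (np.sum(watermark ** 2) * 10.0 ** (target_snr_db / 10.0)))
    return host + gain * watermark

rng = np.random.default_rng(1)
n = np.arange(2000)
host = np.cos(2 * np.pi * 0.05 * n)           # stand-in for a voiced speech part
watermark = rng.standard_normal(n.size)       # stand-in for the modeled watermark
watermarked = embed(host, watermark, target_snr_db=24.0)
achieved = snr_db(host, watermarked)
print(achieved)
```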

B. Watermark detection
The watermark detection is performed in the time-frequency domain by using the correlation detector. The time instances $t_1$ and $t_2$ are determined by using the voice activity detector. It is not necessary that the detector contains the information about the frequency range $(\omega_1,\omega_2)$ of the region D. Namely, the correlation can be performed along the entire frequency range of the signal, but it is only effective within $(\omega_1,\omega_2)$ (region D), where watermark components exist. Moreover, the information about the range $(\omega_1,\omega_2)$ can be extracted from the watermark time-frequency representation. According to (13), the detector response, obtained by correlating the short-time Fourier transforms $STFT_{x_w}(t,\omega)$ and $STFT_{w_{key}}(t,\omega)$ of the watermarked signal and of the watermark, respectively, must exceed a threshold T. The detector response for any wrong trial (a sequence created in the same manner as the watermark) should not be greater than the threshold value.
The support function $L_M$ and the energy floor ξ are not required in the detection procedure. The function $L_M$ can be extracted from the watermark and used to model other sequences that will act as wrong trials, or it simply does not have to be used. Namely, detection can be performed even by using the STFT of the non-modeled pseudo-random sequence p (used to create the watermark). The watermark is included in the sequence p, and the correlation will take effect only at the time-frequency positions of the watermark. The remaining parts of the sequence p have the same influence on detection as in the case of wrong trials.
A significant improvement of watermark detection is obtained if the cross-terms in the time-frequency plane are included. Namely, for the calculation of the SM in the detection stage, a large window length L can be chosen. Cross-terms appear once the window length becomes greater than the distance between the auto-terms, where $L_{min}$ denotes the minimal distance between the auto-terms. Thus, by increasing L in (4), the SM approaches the Wigner distribution (for L = N/2 the Wigner distribution is obtained). An interesting approach to signal detection, based on the Wigner distribution, is proposed in [13], where the presence of cross-terms increases the number of components used in detection. Namely, apart from the auto-terms, the watermark is included in the cross-terms as well. Therefore, by using the time-frequency domain with the cross-terms included, watermark detection can be significantly relaxed and improved, since the watermark is spread over a large number of components within the considered region. If the cross-terms are considered, the correlation detector in the time-frequency domain can be written as:

$$D = \sum_{i} SM^{i}_{w_{key}}\,SM^{i}_{x_w} + \sum_{i \neq j} SM^{i,j}_{w_{key}}\,SM^{i,j}_{x_w}, \qquad (15)$$

where the first summation includes the auto-terms, while the second one includes the cross-terms of the S-method of the watermark ($w_{key}$) and of the watermarked signal ($x_w$).
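The detection experiment with one right key and a set of wrong trials can be outlined as in the Python/NumPy sketch below. It is only one possible reading of the detector (15): the S-methods of the received signal and of a candidate sequence are computed with a wide window L and correlated over all time-frequency points, so that whatever cross-terms L admits enter the sum implicitly. The helper names, the synthetic host, the non-modeled key sequences and the embedding gain are assumptions, and no claim about detection performance is implied.

```python
import numpy as np

def stft(x, win_len=64):
    """Rectangular-window STFT centred on each sample (rows: time, columns: bins)."""
    half = win_len // 2
    padded = np.pad(x, (half, half))
    frames = np.stack([padded[n:n + win_len] for n in range(len(x))])
    return np.fft.fft(np.fft.ifftshift(frames, axes=1), axis=1)

def s_method(S, L):
    """S-method with a rectangular frequency window of width 2L+1."""
    sm = np.abs(S) ** 2
    for l in range(1, L + 1):
        sm += 2.0 * np.real(np.roll(S, -l, axis=1) * np.conj(np.roll(S, l, axis=1)))
    return sm

def detector_response(received, candidate, L, win_len=64):
    """Correlation of the two S-methods over the whole time-frequency plane."""
    sm_received = s_method(stft(received, win_len), L)
    sm_candidate = s_method(stft(candidate, win_len), L)
    return float(np.sum(sm_received * sm_candidate))

rng = np.random.default_rng(0)
n = np.arange(1024)
host = np.cos(2 * np.pi * 8 * n / 64) + 0.8 * np.cos(2 * np.pi * 12 * n / 64)
right_key = rng.choice([-1.0, 1.0], size=n.size)
watermarked = host + 0.1 * right_key                  # hypothetical embedding gain

wrong_keys = [rng.choice([-1.0, 1.0], size=n.size) for _ in range(20)]
d_right = detector_response(watermarked, right_key, L=16)
d_wrong = np.array([detector_response(watermarked, key, L=16) for key in wrong_keys])

# Responses for wrong keys expressed relative to the right-key response.
normalized_wrong = d_wrong / d_right
```

The normalization of the wrong-trial responses by the right-key response mirrors the way the detector responses are presented in the examples.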
Since the cross-terms contribute to watermark detection, they should be included in other existing detector structures. For example, the locally optimal detector based on the generalized Gaussian distribution of the watermarked coefficients can also be written in the presence of cross-terms in the time-frequency domain (16). The performance of the proposed detector is tested by using the following measure of detection quality [22], [23]:

$$R = \frac{\bar{D}_{w_r} - \bar{D}_{w_w}}{\sqrt{\sigma^{2}_{w_r} + \sigma^{2}_{w_w}}}, \qquad (17)$$

where $\bar{D}$ and $\sigma$ represent the mean value and the standard deviation of the detector responses, respectively, while the indexes $w_r$ and $w_w$ indicate the right and wrong keys (trials). The watermarking procedure has been done for different right keys (watermarks). For each of the right keys, a certain number of wrong trials are generated in the same manner as the right keys.
The probability of error $P_{err}$ is calculated by using:

$$P_{err} = p_{D_{w_w}}\int_{T}^{\infty} P_{D_{w_w}}(x)\,dx + p_{D_{w_r}}\int_{-\infty}^{T} P_{D_{w_r}}(x)\,dx, \qquad (18)$$

where the indexes $w_r$ and $w_w$ have the same meaning as in the previous relation, T is a threshold, while equal priors $p_{D_{w_w}} = p_{D_{w_r}} = 1/2$ are assumed. By considering normal distributions for $P_{D_{w_w}}$ and $P_{D_{w_r}}$, and $\sigma^{2}_{w_r} = \sigma^{2}_{w_w}$, the minimization of $P_{err}$ leads to the following relation:

$$P_{err} = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{R}{2}\right). \qquad (19)$$

By increasing the value of R, the probability of error decreases.
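The link between the measure of detection quality and the probability of error can be checked numerically. The Python sketch below assumes that R is the difference of the mean detector responses divided by the square root of the summed variances, and that, for Gaussian responses with equal variances, equal priors and the threshold midway between the two means, the error probability reduces to erfc(R/2)/2. The function names and the use of population variances are assumptions.

```python
import math
import statistics

def detection_quality(right_responses, wrong_responses):
    """R: difference of the means over the square root of the summed variances."""
    mean_right = statistics.fmean(right_responses)
    mean_wrong = statistics.fmean(wrong_responses)
    var_right = statistics.pvariance(right_responses)   # population variance assumed
    var_wrong = statistics.pvariance(wrong_responses)
    return (mean_right - mean_wrong) / math.sqrt(var_right + var_wrong)

def error_probability(R):
    """P_err for Gaussian responses, equal variances, equal priors, midpoint threshold."""
    return 0.5 * math.erfc(R / 2.0)

# Hypothetical detector responses, used only to exercise the functions.
R_demo = detection_quality([10.0, 12.0], [0.0, 2.0])
p_demo = error_probability(R_demo)
p_for_7_5 = error_probability(7.5)
print(R_demo, p_demo, p_for_7_5)
```

For R = 7.5 the expression evaluates to a value between 10^-8 and 10^-7, and it decreases monotonically as R grows.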

IV. EXAMPLES
Efficiency of the proposed procedure is demonstrated on several examples, where signals with various maximal frequencies and signal to noise ratios (SNR) are used. The successful detection in the time-frequency domain is performed in the case without attack, as well as with a set of strong attacks.
Example 1: The speech signal with $f_{max}$ = 4 kHz is considered. This maximal frequency is used to provide an appropriate illustration of the proposed method. The STFT was calculated by using a rectangular window with 256 samples for time-varying filtering. Zero padding up to 1024 samples was carried out, and the parameter L = 5 is used in the SM calculation. The region D (Fig. 2.a) is selected to cover the first three low frequency formants of the voiced speech part. The corresponding support function $L_M$ (Fig. 2.b) is created by using the value ξ with parameter λ = 0.7.
Selection of the voiced speech part is done by using the word end-points detector based on the combined Teager energy and energy-entropy features [20], [21] (non-overlapping speech frames of length 8 ms are used). The original and watermarked signals are given in Fig. 3.a.
The obtained SNR is higher than 20 dB, which fulfills the constraint of watermark imperceptibility [24]. The watermark imperceptibility has also been verified by using the ABX listening test, where A, B and X are the original, the watermarked, and either the original or the watermarked signal, respectively. The listener listens to A and B. Then the listener listens to X and decides whether X is A or B. Since A, B and X are a few seconds long, the entire signals are listened to, not only isolated segments. Three female and seven male listeners with normal hearing participated in the listening test. The test was performed a few times, and from the obtained statistics it was concluded that the listeners cannot positively distinguish between the watermarked and the original signal.
In order to illustrate the efficiency of the proposed detector form, an isolated watermarked speech part is considered. However, the procedure is not limited to this particular speech part; depending on the required data payload, various voiced speech parts can be used to embed and detect the watermark. Detection is performed by using 100 trials with wrong keys. The responses of the standard correlation detector for STFT coefficients are given in Fig. 3.b, while the responses of the detector defined by (15) are shown in Fig. 3.c and Fig. 3.d (for window length L = 10 and L = 32, respectively). The detector response for the right key is normalized to the value 1, while the responses for wrong keys are presented proportionally.
Observe that for the same right key and the same set of wrong trials, the improvement of detection results is achieved by increasing the parameter L (Fig. 3). Thus, the detector performance increases with the number of cross-terms. In the following experiments L = 32 has been used to provide reliable detection. Further increasing of L does not improve the results significantly. Note that a window width N+1 (for L = N/2), as in the Wigner distribution, can cause the presence of cross-terms that do not contain the watermark, since they could result from two non-watermarked auto-terms. These cross-terms are not desirable in the watermark detection procedure.
Additionally, we have performed experiments with a few other speech signals. For each signal, the low frequency formants are used, and the watermark has been embedded with approximately the same SNR (around 24 dB). The detection is performed by using (15) with L = 32. We present the results for three of them in Fig. 4. Note that the obtained results are very similar to the ones in Fig. 3.d. Thus, the detection performance is insensitive to different signals tested under the same conditions.
Example 2: In the previous example, the low frequency formants have been considered. However, different frequency regions can be used. Thus, the procedure is also tested for a watermark modeled according to the middle frequency formants. The detection results are given in Fig. 5.a ($f_{max}$ = 4 kHz and L = 32). The ratio between the detector responses for the right key and the wrong trials is lower than in the previous example, with low frequency formants, but still satisfactory. The obtained SNR is 28 dB. In addition, the middle frequency formants of a signal with $f_{max}$ = 11.025 kHz have been considered. The results of watermark detection are given in Fig. 5.b (L = 32 and SNR = 32 dB). Extended frequency range enables more ...
In order to evaluate the efficiency of the proposed procedure by using the measure of detection quality defined by (17), we repeated the procedure for 50 trials (for 50 right keys, that is, watermarks). They are modeled corresponding to the low frequency formants. For each of the right keys, 60 wrong keys (trials) are generated in the same manner as the right keys. The average SNR is around 27 dB. The watermark imperceptibility has been verified by using the ABX listening test as in the first example. Again, the watermarked signal is perceptually similar to the original one. The detection is performed by using the correlation detector that includes cross-terms in the time-frequency domain (L = 32). The responses of the proposed detector for right and wrong keys are shown in Fig. 6. The threshold is set as $T = (\bar{D}_{w_r} + \bar{D}_{w_w})/2$, where $\bar{D}_{w_r}$ and $\bar{D}_{w_w}$ represent the mean values of the detector responses for right keys (watermarks) and wrong trials, respectively. The calculated measure of detection quality is R = 7.5, which means that the probability ...
In the sequel, the procedure is tested on various attacks, such as mp3 compression for different bit rates, time scaling, pitch scaling, echo, amplitude normalization, and so forth. The results of detection in terms of the quality measure R and the corresponding probabilities of detection error $P_{err}$ are given in Table I.
Most of the attacks are realized by using CoolEditPro v2.0, while the rest of the processing is done in Matlab 7.
Note that many of the considered attacks are strong and introduce a significant signal distortion. For example, in the existing audio watermarking procedures, the usually applied time scaling is up to 4%, wow and flutter up to 0.5% or 0.7%, and echo 50 ms or 100 ms [4], [25]. We have applied stronger attacks to show that, even in this case, the proposed method provides high robustness with very low probabilities of detection error (see Table I). Note that these results were obtained with a higher watermark bit rate (more details will be provided in the next subsection). The time-scale modification (TSM) is one of the challenging attacks in audio watermarking that has been especially considered in the recent literature [24]. Very few algorithms can resist these desynchronization attacks [24]. Here, we have applied TSM (time stretch) up to ±15% by using the software tool CoolEditPro v2.0. However, the low probability of detection error is still maintained. Only in the case of pitch scaling was the obtained probability of error higher (Table I), but still satisfactory. Apart from the very low probabilities of detection error, an additional advantage of the proposed detection is in providing more flexibility related to de-synchronization between the frequencies of the watermark sequence embedded in the signal and the watermark sequence used for detection. The correlation effects are enhanced since the detection is performed within the whole time-frequency region covered with a large number of cross-terms apart from the auto-terms.

TABLE I. Measure of detection quality R and probability of detection error P_err for the applied attacks

Attack | R | P_err
Amplitude - Normalize (100%) | 6.95 | $10^{-7}$
Wow (delay 10%) and bright flutter | 6.72 | $10^{-6}$
Pitch scaling ±5% | 5.6 | $10^{-5}$
Additive Gaussian noise (SNR = -35 dB) | 6.9 | $10^{-7}$
In the sequel, the achieved payload and some related applications are given.

A. Data payload
In this example we have used a single voiced part to embed a pseudo-random sequence that represents one bit of information. The approximate length of the watermark, obtained as a modeled pseudo-random sequence, is 1000 samples (125 ms for a signal sampled at 8000 Hz). The data payload varies between 4 bps and 8 bps, depending on the duration of the voiced speech regions. In the case of a speech signal sampled at 44100 Hz, the achievable data payload is 22 bps. In this way we have provided the required compromise between data payload and robustness. Thus, the proposed algorithm can be efficiently used for copyright and ownership protection, as well as copy and access control [1].
Note that the data payload can be increased by using shorter sequences. If we consider a watermark sequence with 500 samples (which corresponds to 62.5 ms of a signal sampled at 8000 Hz), the data payload is doubled (up to 16 bps). However, the probability of detection error increases to $10^{-4}$. On the other hand, the probability of detection error can decrease even below $10^{-8}$ by considering lower watermark bit rates.
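The arithmetic behind the quoted durations and the upper payload values at 8000 Hz can be reproduced with the short Python sketch below. It assumes one bit per watermark sequence and back-to-back voiced segments, so the result is an upper bound; the function names are hypothetical.

```python
def watermark_duration_ms(n_samples, sampling_rate_hz):
    """Duration of one watermark sequence in milliseconds."""
    return 1000.0 * n_samples / sampling_rate_hz

def max_payload_bps(n_samples, sampling_rate_hz):
    """Upper bound on bits per second: one bit per sequence, no gaps between sequences."""
    return sampling_rate_hz / n_samples

duration_1000 = watermark_duration_ms(1000, 8000)   # sequence of 1000 samples
payload_1000 = max_payload_bps(1000, 8000)
duration_500 = watermark_duration_ms(500, 8000)     # shorter sequence of 500 samples
payload_500 = max_payload_bps(500, 8000)
print(duration_1000, payload_1000, duration_500, payload_500)
```

In practice the payload is lower than this bound whenever voiced regions do not follow each other directly, which is consistent with the reported range of 4 bps to 8 bps.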

V. CONCLUSION
An efficient approach to watermarking of speech signals in the time-frequency domain is presented. It is based on the cross-terms free S-method and on time-varying filtering used for watermark modeling. The watermark imperceptibility is provided by adjusting the location and the strength of the watermark to the selected speech components within the time-frequency region. Also, efficient watermark detection based on the use of cross-terms in the time-frequency domain is provided. The number of cross-terms employed in the detection procedure is controlled by the window length used in the calculation of the S-method. The experimental results demonstrate that the procedure assures convenient and reliable watermark detection, providing a low probability of error. Successful watermark detection has been demonstrated in the case of various attacks.