
# Time-Frequency-Based Speech Regions Characterization and Eigenvalue Decomposition Applied to Speech Watermarking

- Irena Orović (corresponding author)
- Srdjan Stanković

**2010**:572748

https://doi.org/10.1155/2010/572748

© I. Orović and S. Stanković. 2010

**Received:** 13 February 2010 **Accepted:** 30 July 2010 **Published:** 15 August 2010

## Abstract

The eigenvalues decomposition based on the S-method is employed to extract the specific time-frequency characteristics of speech signals. This approach is used to create a flexible speech watermark, shaped according to the time-frequency characteristics of the host signal. Also, the Hermite projection method is applied for characterization of speech regions. Namely, time-frequency regions that contain voiced components are selected for watermarking. The watermark detection is performed in the time-frequency domain as well. The theory is tested on several examples.

## Keywords

- Speech Signal
- Wigner Distribution
- Eigenvalue Decomposition
- Hermite Function
- Speech Region

## 1. Introduction

Digital watermarking has been developed to provide efficient solutions for ownership protection, copyright protection, and authentication of digital multimedia data by embedding a secret signal, called the watermark, into the cover media. Depending on the application, two watermarking scenarios are available: robust and fragile. Robust watermarking requires that the watermark be resistant to various signal processing operations, called attacks. At the same time, the watermark should be imperceptible. In order to meet these requirements, a number of watermarking techniques have been proposed, many of which are related to speech and audio signals [1–11]. One of the earliest and simplest techniques is based on LSB coding [1–4]. The watermark is embedded by altering the individual audio samples, represented by 16 bits per sample. The human auditory system is sensitive to the noise introduced by LSB replacement, which limits the number of LSBs that can be imperceptibly modified. The main disadvantage of these methods is their low robustness [1]. In a number of watermarking algorithms, the spread-spectrum technique has been employed [5–7]. The spread-spectrum sequence can be embedded in the time domain, FFT coefficients, cepstral coefficients, and so forth. The embedding is performed so as to provide robustness to common attacks (noise, compression, etc.). Furthermore, several algorithms use the phase of the audio signal for watermarking, such as the phase coding and phase modulation approaches [8, 9], which assure good imperceptibility. Namely, imperceptible phase modifications are exploited through controlled phase alteration of the host signal. However, these are nonblind watermarking methods (the original signal is required for watermark detection), which limits the number of their applications.
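The LSB coding mentioned above can be illustrated with a minimal sketch. It assumes signed 16-bit samples stored as Python integers and one payload bit per sample; the function names `embed_lsb` and `extract_lsb` are hypothetical and are not taken from the cited works.

```python
def embed_lsb(samples, bits):
    """Replace the least significant bit of each 16-bit sample with a payload bit."""
    marked = list(samples)
    for i, bit in enumerate(bits):
        u = marked[i] & 0xFFFF            # two's-complement view of the sample
        u = (u & 0xFFFE) | (bit & 1)      # overwrite the LSB only
        marked[i] = u - 0x10000 if u >= 0x8000 else u  # back to signed range
    return marked


def extract_lsb(samples, count):
    """Read the payload bits back from the first `count` samples."""
    return [(s & 0xFFFF) & 1 for s in samples[:count]]


host = [1000, -3, 0, 32767]
payload = [1, 0, 1, 0]
marked = embed_lsb(host, payload)
recovered = extract_lsb(marked, len(payload))
```

Each sample changes by at most one quantization level, which is why the modification is hard to hear, and also why any requantization or lossy compression of the signal destroys the payload.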

Most existing watermarking techniques are based on either the time domain or the frequency domain. In both cases, the changes in the signal may decrease the subjective quality, since the time-frequency characteristics of the watermark do not correspond to the time-frequency characteristics of the host signal. The watermark may then become audible, because it is present in time-frequency regions where speech components do not exist. In order to adjust the location and the strength of the watermark to the time-varying spectral content of the host signal, a time-frequency domain-based approach is proposed in this paper. A watermark shaped in accordance with the formants in the time-frequency domain will be less perceptible and more robust at the same time.

Time-frequency distributions have been used to characterize the time-varying spectral content of nonstationary signals [12–16]. The most commonly used one, the Wigner distribution, can provide an ideal representation for linear frequency-modulated monocomponent signals [12, 15]. For multicomponent signals, the S-method, which can produce a cross-terms-free Wigner distribution, can be used [16]. The S-method can also be used to separate the signal components, which is of interest in many applications. In particular, in watermarking it allows the watermark to be shaped by an arbitrary combination of the signal components. The eigenvalues-based S-method decomposition is applied to separate the signal components [17, 18].

In order to provide suitable compromise between imperceptibility and robustness, the watermark should be shaped according to the time-frequency components of speech signal, as proposed in [19, 20]. Therein, the speech components selection is performed by using the time-frequency support function with a certain energy threshold. However, the threshold is chosen empirically and it does not provide sufficient flexibility. Namely, it includes all components with the energy between the maximum and the threshold level.

Therefore, in this paper, the eigenvalue decomposition method is employed to create a time-frequency mask as an arbitrary combination of speech components (formants). Only the components from voiced time-frequency regions are considered [19]. A procedure based on the Hermite projection method is applied for regions characterization [21, 22]. The speech regions are reconstructed within the time-frequency plane by using a certain number of Hermite expansion coefficients. The mean square error between the original and the reconstructed region is used to characterize the dynamics of the regions. It allows distinguishing between voiced, unvoiced, and noisy regions. Finally, the watermark embedding and detection are performed in the time-frequency domain. The robustness of the proposed procedure is tested under various common attacks.

The considered watermarking approach can be useful in numerous applications assuming speech signals. These applications include, but are not limited to, the intellectual property rights, such as proof of ownership, speaker verification systems, VoIP, and mobile applications such as cell-phone tracking. Recently, an interesting application of speech watermarking has appeared in air traffic control [11]. The air traffic control relies on voice communication between the aircraft pilot and air traffic control operators. Thus, the embedded digital information can be used for aircraft identification.

The paper is organized as follows. A theoretical background on the time-frequency analysis is given in Section 2. Section 3 describes the speech regions characterization procedure. In Section 4, the formants selection based on the eigenvalues decomposition is proposed. The time-frequency-based watermarking procedure is presented in Section 5. The performance of the proposed procedure is tested on examples in Section 6. Concluding remarks are given in Section 7.

## 2. Theoretical Background—Time-Frequency Analysis

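The short-time Fourier transform (STFT) is the simplest time-frequency representation:

$$STFT(t,\omega)=\int_{-\infty}^{\infty} x(t+\tau)\,w(\tau)\,e^{-j\omega\tau}\,d\tau,$$

where *x*(*t*) is a signal while *w*(*t*) is a window function.

The resolution of the spectrogram (the squared modulus of the STFT) depends on the window function *w*(*t*) (window shape and window width). Namely, if the signal phase is not linear, it cannot simultaneously provide a good time and frequency resolution. Various quadratic distributions have been introduced to improve the spectrogram resolution. Among them, the most commonly used [1, 14, 15] is the Wigner distribution, defined as follows:

$$WD(t,\omega)=\int_{-\infty}^{\infty} x\!\left(t+\frac{\tau}{2}\right)x^{*}\!\left(t-\frac{\tau}{2}\right)e^{-j\omega\tau}\,d\tau.$$

For multicomponent signals, the Wigner distribution contains cross-terms. The S-method [16] reduces or removes them while preserving the autoterms concentration. Its discrete form is

$$SM(n,k)=\sum_{l=-L}^{L} P(l)\,STFT(n,k+l)\,STFT^{*}(n,k-l),$$

where *n* and *k* are discrete time and frequency samples, and *P*(*l*) is a frequency-domain window of width 2*L*+1. If the minimal distance between autoterms is greater than the window width (2*L*+1), the cross-terms will be completely removed. Also, if the autoterms width is equal to 2*L*+1, the S-method produces the same autoterms concentration as the Wigner distribution. Moreover, since the convergence within *P*(*l*) is fast, in many practical applications a good concentration can be achieved with only a few terms in the sum.

The time-frequency representation can also be used for nonstationary (time-varying) filtering. If the signal components of interest lie within the time-frequency region *R* _{f}, the support function can be defined as follows:

$$L_H(t,\omega)=\begin{cases}1, & (t,\omega)\in R_f,\\ 0, & (t,\omega)\notin R_f.\end{cases}$$

Although it was initially introduced for signal denoising, the concept of nonstationary filtering can be used to retrieve the signal with specific characteristics from the time-frequency domain.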

Therefore, the time-frequency analysis can provide complete information about the time-varying spectral components, even when their number is significant as in the case of speech signals. Namely, these components appear in the time-frequency plane as recognizable time-varying structures that could be used to characterize different speech regions (voiced, unvoiced, noisy, etc.), as proposed in the sequel. Furthermore, the extraction of individual speech components from the time-frequency domain could be useful in many applications assuming speech signals. This is generally a highly demanding task due to the number of speech components. As an effective solution, a method based on the eigenvalues decomposition and the speech signal time-frequency representation is presented in Section 4.
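The effect of the S-method window width on cross-terms can be sketched for a single time instant. The example below is an illustration under stated assumptions: a rectangular analysis window, *P*(*l*) = 1, a frame that starts at sample `n` rather than being centred on it, and hypothetical helper names `stft_frame` and `s_method_frame`.

```python
import cmath


def stft_frame(x, n, N):
    """DFT of the N samples starting at index n (rectangular window)."""
    return [sum(x[n + m] * cmath.exp(-2j * cmath.pi * k * m / N) for m in range(N))
            for k in range(N)]


def s_method_frame(stft, L):
    """SM(k) = sum over l = -L..L of STFT(k+l) * conj(STFT(k-l)), with P(l) = 1."""
    N = len(stft)
    sm = []
    for k in range(N):
        value = abs(stft[k]) ** 2                     # l = 0 term (spectrogram)
        for l in range(1, L + 1):
            if k - l < 0 or k + l >= N:               # stay inside the frequency range
                break
            # the terms for +l and -l are complex conjugates of each other
            value += 2 * (stft[k + l] * stft[k - l].conjugate()).real
        sm.append(value)
    return sm


N = 16
# two complex sinusoids located exactly at frequency bins 4 and 8
x = [cmath.exp(2j * cmath.pi * 4 * m / N) + cmath.exp(2j * cmath.pi * 8 * m / N)
     for m in range(N)]
frame = stft_frame(x, 0, N)
sm_narrow = s_method_frame(frame, L=1)   # window narrower than the component distance
sm_wide = s_method_frame(frame, L=2)     # window reaches both components from bin 6
```

With `L=1` the value at the midpoint bin 6 stays at zero, while `L=2` lets the two components interact and a cross-term appears there, in line with the condition on the distance between autoterms stated above.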

## 3. Speech Regions Characterization by Using the Fast Hermite Projection Method of Time-Frequency Representation

### 3.1. Fast Hermite Projection Method

The fast Hermite projection method has been introduced for image expansion into a Fourier series by using an orthonormal system of Hermite functions [21, 22]. Namely, in comparison with the trigonometric functions, the Hermite functions provide better localization in both the spatial and the transform domain. The Hermite projection method has been mainly used in image processing applications, such as image filtering and texture analysis. Here, we provide a brief overview of the method.

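The expansion of a two-dimensional function *f*(*x*, *y*) can be defined as follows:

$$F(x,y)=\sum_{i}\sum_{j} c_{ij}\,\psi_{ij}(x,y), \tag{8}$$

where *ψ* _{ij}(*x*, *y*) are the two-dimensional Hermite functions while *c* _{ij} are the Hermite coefficients.

In our case, the two-dimensional function *f*(*x*, *y*) is a time-frequency representation of a speech region, which will be represented by a certain number of Hermite coefficients *c* _{ij}. Note that the number of coefficients *c* _{ij} depends on the number of the employed Hermite functions. The more functions are used, the smaller the error introduced in the reconstructed version *F*(*x*, *y*).

For simplicity, the expansion can be performed along one coordinate. For a fixed *y*, the expansion into *N* Hermite functions can be defined as follows:

$$F(x)=\sum_{i=0}^{N-1} c_{i}\,\psi_{i}(x). \tag{9}$$

Accordingly, the functions *f*(*x*) correspond to the rows of the time-frequency representation.

The Hermite coefficients can be calculated by using a quadrature approximation:

$$c_{i}\approx\sum_{m=1}^{N}\mu_{i}(x_{m})\,f(x_{m}),$$

where *x* _{m} are zeros of Hermite polynomials while *μ* _{i}(*x* _{m}) are associated weights.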
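A one-dimensional Hermite projection of a single row can be sketched as follows. This is a simplified illustration, not the fast method of [21, 22]: the coefficients are computed with a plain Riemann sum on a uniform grid instead of the quadrature at the zeros of Hermite polynomials, and the names `hermite_functions`, `hermite_coefficients`, `reconstruct`, and `reconstruction_mse`, as well as the test row, are assumptions made for the example.

```python
import math


def hermite_functions(x, count):
    """Values psi_0(x) ... psi_{count-1}(x) from the three-term recurrence."""
    psi = [math.pi ** -0.25 * math.exp(-x * x / 2.0)]
    if count > 1:
        psi.append(math.sqrt(2.0) * x * psi[0])
    for p in range(2, count):
        psi.append(math.sqrt(2.0 / p) * x * psi[p - 1]
                   - math.sqrt((p - 1.0) / p) * psi[p - 2])
    return psi


def hermite_coefficients(f, grid, step, count):
    """Approximate c_i = integral of f(x) * psi_i(x) dx by a Riemann sum."""
    coeffs = [0.0] * count
    for x in grid:
        fx = f(x)
        for i, value in enumerate(hermite_functions(x, count)):
            coeffs[i] += fx * value * step
    return coeffs


def reconstruct(coeffs, x):
    """Evaluate the truncated expansion sum_i c_i * psi_i(x)."""
    return sum(c * v for c, v in zip(coeffs, hermite_functions(x, len(coeffs))))


def reconstruction_mse(f, grid, step, count):
    """Mean square error between f and its expansion with `count` functions."""
    coeffs = hermite_coefficients(f, grid, step, count)
    return sum((f(x) - reconstruct(coeffs, x)) ** 2 for x in grid) / len(grid)


step = 0.05
grid = [-8.0 + step * i for i in range(321)]
row = lambda x: math.exp(-(x - 1.0) ** 2)      # hypothetical row of a representation
mse_few = reconstruction_mse(row, grid, step, 3)
mse_many = reconstruction_mse(row, grid, step, 12)
```

Because the Hermite functions are orthonormal, adding functions can only decrease the reconstruction error, which is the property later used to distinguish simple regions from complex ones.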

### 3.2. Speech Regions Characterization by Using the Concept of Hermite Projection Method

According to (8) or its simplified form (9), the time-frequency representation of a speech region, as a two-dimensional function, can be expanded into a certain number of Hermite functions. Thus, we may assume that *f*(*x*, *y*) = *D*(*t*, *ω*) and *F*(*x*, *y*) = *D* ^{r}(*t*, *ω*), where *D* denotes the original time-frequency region and *D* ^{r} is the region reconstructed from the Hermite expansion coefficients. The difference between *D* and *D* ^{r} will depend on the number of Hermite functions used for the expansion, as well as on the complexity of the considered region.

The S-method is used for time-frequency representation of speech signals. By observing time-frequency characteristics, a significant difference between noise, pauses, and speech can be noted. Moreover, the voiced and unvoiced speech parts are significantly different. The voiced parts are characterized by higher energy and complex structure.

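The mean square error (MSE) between the original and the reconstructed region is used to characterize the regions:

$$MSE_{i}=\frac{1}{d_{1}d_{2}}\sum_{t}\sum_{\omega}\left|D_{i}(t,\omega)-D_{i}^{r}(t,\omega)\right|^{2},$$

where *D* _{i}(*t*, *ω*) and *D* _{i} ^{r}(*t*, *ω*) denote the original and the reconstructed *i*th region, while *d* _{1} and *d* _{2} are the dimensions of the regions. Thus, a region containing either noise or unvoiced sounds will produce a significantly lower MSE than a region with complex voiced structures. The dimensions *d* _{1} and *d* _{2} are the same for all regions. They are chosen experimentally such that the region includes most of the sound components.

The MSEs for some of the tested regions are given in Table 1. The regions containing only noise have a very small MSE, of order 10^{-3} or lower, while the regions containing complex formant structures have a large MSE (generally, it is significantly above 10^{3}). The MSEs for the unvoiced regions are between the two cases.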

Table 1: MSEs for some of the tested speech regions.

No. | Region description | MSE |
---|---|---|
1 | Noise | 3*10^{…} |
2 | Noise | 3*10^{…} |
3 | Noise | 1*10^{…} |
4 | Noise | 1*10^{…} |
5 | Noise | 4*10^{…} |
6 | Noise | 6*10^{…} |
7 | Noise | 5*10^{…} |
8 | Voiced | 9971 |
9 | Voiced | 2265 |
10 | Voiced | 5917 |
11 | Voiced | 16587 |
12 | Voiced | 5245 |
13 | Unvoiced | 55 |
14 | Voiced | 4466 |
15 | Voiced | 3242 |
16 | Unvoiced | 606 |
17 | Voiced | 19016 |
18 | Voiced | 23733 |
19 | Voiced | 7398 |
20 | Unvoiced | 0.018 |
21 | Unvoiced | 1.25 |
22 | Unvoiced | 0.007 |
23 | Unvoiced | 0.049 |
24 | Unvoiced | 4.38 |

Therefore, based on numerous experiments, the voiced regions with emphatic formants are identified by their large MSE values. These regions have a rich formant structure and are appropriate for watermarking. A set of arbitrarily selected formants could be used to shape the watermark. This provides the flexibility to create a watermark with very specific time-frequency characteristics. The combination of time-frequency components could serve as an additional secret key to increase the robustness and security of the procedure.
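The MSE-based characterization can be sketched as a simple decision rule. The two levels used below (10^{-3} and 10^{3}) are illustrative assumptions taken from the orders of magnitude quoted in the text, not thresholds stated by the authors, and `region_mse` and `classify_region` are hypothetical names.

```python
def region_mse(original, reconstructed):
    """Mean square error between two regions given as lists of rows."""
    d1, d2 = len(original), len(original[0])
    total = sum((original[i][j] - reconstructed[i][j]) ** 2
                for i in range(d1) for j in range(d2))
    return total / (d1 * d2)


def classify_region(mse, noise_level=1e-3, voiced_level=1e3):
    """Label a region from its MSE; both levels are assumed, tunable values."""
    if mse > voiced_level:
        return "voiced"
    if mse <= noise_level:
        return "noise"
    return "unvoiced"


example_mse = region_mse([[1, 2], [3, 4]], [[1, 2], [3, 2]])
labels = [classify_region(m) for m in (9971, 606, 0.007)]
```

In practice the two levels would have to be set experimentally for the chosen region size and number of Hermite functions, since both affect the scale of the MSE.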

## 4. Eigenvalue Decomposition Based on the Time-Frequency Distribution

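The discrete Wigner distribution of a signal *x*(*n*) can be written as

$$WD(n,k)=\sum_{m} x(n+m)\,x^{*}(n-m)\,e^{-j\frac{2\pi}{N}2mk},$$

where *m* is a discrete lag coordinate. Consequently, the inverse of the Wigner distribution can be written as follows:

$$x(n_{1})\,x^{*}(n_{2})=\frac{1}{N}\sum_{k} WD\!\left(\frac{n_{1}+n_{2}}{2},k\right)e^{j\frac{2\pi}{N}k(n_{1}-n_{2})}.$$

For a multicomponent signal whose components *x* _{i}(*n*) do not overlap in the time-frequency plane, the Wigner distribution is replaced by the S-method, which gives the autocorrelation matrix

$$R_{SM}(n_{1},n_{2})=\frac{1}{N}\sum_{k} SM\!\left(\frac{n_{1}+n_{2}}{2},k\right)e^{j\frac{2\pi}{N}k(n_{1}-n_{2})}=\sum_{i} x_{i}(n_{1})\,x_{i}^{*}(n_{2}). \tag{21}$$

The eigenvalue decomposition of this matrix is

$$R_{SM}(n_{1},n_{2})=\sum_{i}\lambda_{i}\,u_{i}(n_{1})\,u_{i}^{*}(n_{2}), \tag{23}$$

where *λ* _{i} are the eigenvalues and *u* _{i}(*n*) are the eigenvectors of *R* _{SM}. Furthermore, *λ* _{i} = *E* _{i} (*E* _{i} is the energy of the *i*th component), and *u* _{i}(*n*) is orthogonal to *u* _{j}(*n*) for *i* ≠ *j*, that is,

$$\sum_{n} u_{i}(n)\,u_{j}^{*}(n)=\delta(i-j),$$

where *δ* denotes the Kronecker symbol.

As will be explained in the sequel, the autocorrelation matrix *R* _{SM} is calculated according to (21) for each time-frequency region (obtained by using the S-method). Then, the eigenvalue decomposition is applied to *R* _{SM} according to (23), resulting in eigenvalues and eigenvectors. Each of these components is characterized by a certain location in the time-frequency plane.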

Once separated, they could be further combined in various ways to provide an arbitrary time-frequency map used as a support function in watermark modelling.
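The link between eigenvalues and component energies can be checked on a toy case. The sketch below makes several simplifying assumptions: the components are real-valued and mutually orthogonal, the autocorrelation matrix is built directly from them rather than from the S-method, and power iteration with deflation stands in for a library eigensolver. All function names are hypothetical.

```python
import math


def autocorrelation(components):
    """R(a, b) = sum over components of x_i(a) * x_i(b), for real-valued x_i."""
    n = len(components[0])
    return [[sum(c[a] * c[b] for c in components) for b in range(n)]
            for a in range(n)]


def dominant_eigenpair(R, iterations=200):
    """Largest eigenvalue and its eigenvector of a symmetric matrix (power iteration)."""
    v = [float(i + 1) for i in range(len(R))]
    for _ in range(iterations):
        w = [sum(r * vi for r, vi in zip(row, v)) for row in R]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm == 0.0:
            break
        v = [wi / norm for wi in w]
    Rv = [sum(r * vi for r, vi in zip(row, v)) for row in R]
    return sum(vi * wi for vi, wi in zip(v, Rv)), v


def decompose(R, count):
    """First `count` eigenpairs, removing each found component by deflation."""
    R = [row[:] for row in R]
    pairs = []
    for _ in range(count):
        lam, u = dominant_eigenpair(R)
        pairs.append((lam, u))
        for a in range(len(R)):
            for b in range(len(R)):
                R[a][b] -= lam * u[a] * u[b]
    return pairs


x1 = [2.0, 2.0, 0.0, 0.0]    # energy 8
x2 = [0.0, 0.0, 1.0, -1.0]   # energy 2
pairs = decompose(autocorrelation([x1, x2]), 2)
```

The two eigenvalues come out as the component energies 8 and 2, and each eigenvector is the corresponding component normalized to unit energy (up to sign), so scaling it by the square root of its eigenvalue restores the component.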

### 4.1. Selection of Speech Formants Suitable for Watermarking

After the regions have been selected, the formants that will be used for watermark modeling need to be determined. This can be realized by considering the formants whose energy is above a certain floor value, as it is done in [19]. Namely, the energy floor was defined as a portion of the maximum energy value of the S-method within the selected region. Therein, it has been assumed that the significant components have approximately the same energy. However, this may not always be the case as the number of selected components could vary between different regions. Consequently, it may lead to a variable amount of watermark within different regions. Thus, in order to overcome these difficulties, the eigenvalue decomposition method is employed for speech formants selection.

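Suppose that the number of selected components is *K*. Each of these components can be reconstructed as *x* _{i}(*n*) = √*λ* _{i} *u* _{i}(*n*). Thus, a signal that contains *K* components of the original speech is obtained as:

$$x_{K}(n)=\sum_{i=1}^{K}\sqrt{\lambda_{i}}\,u_{i}(n).$$

The S-method of the signal *x* _{K}(*n*) will be denoted as *SM* _{K}(*n*, *k*). Note that it represents a time-frequency map that is used for watermark modelling.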

## 5. Time-Frequency-Based Speech Watermarking Procedure

### 5.1. Watermark Modelling and Embedding

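- (1) A voiced speech region is selected by using the Hermite projection-based characterization (Section 3).
- (2) The eigenvalue decomposition is applied to the selected region, and *K* components are combined into a signal (Section 4.1).
- (3) The S-method *SM* _{K} of this signal is used to form the support function:

$$L_{H}(t,\omega)=\begin{cases}1, & SM_{K}(t,\omega)>\lambda,\\ 0, & \text{otherwise},\end{cases}$$

  where *λ* could be set to zero or, for a sharper mask, to a small positive value.
- (4) Finally, the watermark is obtained at the output of the time-varying filter [19], which uses the support function *L* _{H} and the STFT of the host signal within the selected region.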
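The masking step can be sketched on small arrays. The example assumes that rows index time and columns index frequency, that the filter output at each instant is the masked STFT summed over frequency and divided by the number of bins, and, following the description in Example 2, that the STFT being shaped (`stft_p`) belongs to a pseudorandom sequence. The names and the numerical values are hypothetical.

```python
def support_function(sm_map, lam=0.0):
    """Binary mask: 1 where the time-frequency map exceeds lam, 0 elsewhere."""
    return [[1 if value > lam else 0 for value in row] for row in sm_map]


def time_varying_filter(stft, mask):
    """One output sample per time instant: masked STFT summed over frequency."""
    output = []
    for stft_row, mask_row in zip(stft, mask):
        n_freq = len(stft_row)
        output.append(sum(m * s for m, s in zip(mask_row, stft_row)) / n_freq)
    return output


sm_map = [[0.0, 5.0], [0.2, 0.0]]       # toy S-method values of the selected components
stft_p = [[1 + 1j, 2], [3, 4j]]         # toy STFT of the sequence to be shaped
mask = support_function(sm_map)
sharper_mask = support_function(sm_map, lam=0.5)
watermark = time_varying_filter(stft_p, mask)
```

Raising `lam` removes the weak entry from the mask, which is the sharpening effect described in step (3).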

### 5.2. Watermark Detection

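The watermark detection is performed in the time-frequency domain by using the correlation detector:

$$D=\sum_{n}\sum_{k} SM_{x_{w}}(n,k)\,SM_{w}(n,k),$$

where *SM* _{x_w} and *SM* _{w} are the S-method of the watermarked signal and watermark, respectively.

The detection is considered successful if the detector response for the right key is greater than the response for a wrong trial, and this holds for any wrong trial.

Note that, for larger values of *L* (in the S-method), the cross-terms appear and they are included in detection, as well [19]. Namely, the cross-terms also contain the watermark, and hence they contribute to the watermark detection. The detection performance is tested by using the following measure of detection quality [24, 25]:

$$R=\frac{\bar{D}_{r}-\bar{D}_{w}}{\sqrt{\sigma_{r}^{2}+\sigma_{w}^{2}}},$$

where the mean values and the variances of the detector responses are denoted by a bar and by *σ*^{2}, with the indices *r* and *w* referring to right keys and wrong trials, respectively.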
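A correlation detector and a detection quality measure of this kind can be sketched as follows. The form of the measure (difference of mean responses divided by the square root of the summed variances) and the error probability (Gaussian responses with equal variances and a threshold midway between the means) are assumptions made for the illustration; the function names and the sample responses are hypothetical.

```python
import math
from statistics import mean, pvariance


def detector_response(sm_watermarked, sm_key):
    """Correlate two time-frequency maps given as lists of rows."""
    return sum(a * b
               for row_a, row_b in zip(sm_watermarked, sm_key)
               for a, b in zip(row_a, row_b))


def detection_quality(right, wrong):
    """Separation of the response distributions for right keys and wrong trials."""
    return (mean(right) - mean(wrong)) / math.sqrt(pvariance(right) + pvariance(wrong))


def error_probability(quality):
    """Midpoint-threshold error for equal-variance Gaussian responses."""
    return 0.5 * math.erfc(quality / 2.0)


response = detector_response([[1.0, 2.0], [0.0, 1.0]], [[2.0, 0.5], [3.0, 1.0]])
quality = detection_quality([10, 12], [0, 2])
p_err = error_probability(7.0)
```

Under these assumptions a quality measure of about 7 corresponds to an error probability of order 10^{-7}.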

## 6. Examples

Example 1.

In this example, two approaches to the selection of speech formants are compared:

- (1) Formants whose energy is above a threshold *ξ* are selected for watermarking. The threshold is determined as a portion of the maximum energy value of the S-method within the observed region [19]. Thus, the threshold is adapted to the maximum energy within the region.
- (2) The eigenvalues-based decomposition is used to create an arbitrarily composed time-frequency map.

In the first case, the number of selected formants depends on the threshold value. An illustration of formants selected by using two different thresholds is given in Figure 5(a). Note that the higher threshold selects only the strongest low-frequency formants (Figure 5(a) left). On the other hand, the lower threshold yields more components (Figure 5(a) right). However, it is difficult to control their number. Also, the amount of signal energy varies across different time-frequency regions. Thus, an optimal threshold should be determined for each region. This is a demanding task and it could cause difficulties in practical applications. Namely, if the threshold selects too many components, the watermark may produce perceptual changes. Otherwise, if there are not enough components, it could be difficult to detect the watermark. An illustration of two different regions, obtained by using the same threshold *ξ*, is given in Figure 5(b). Although the threshold is calculated for both regions in the same way, the number of selected components is significantly different. The components in the first region (Figure 5(b) left) are approximately at the same energy level. Thus, a significant number of them will be selected with this threshold. However, in the second region (Figure 5(b) right), the energy varies for different components and the given threshold selects just a few of the strongest components.
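The difference between the two selection rules can be shown with made-up component energies. The numbers, the portion 0.5, and the function names are assumptions chosen only to illustrate the behaviour described above.

```python
def select_by_threshold(energies, portion):
    """Indices of components whose energy exceeds portion * (maximum energy)."""
    xi = portion * max(energies)
    return [i for i, e in enumerate(energies) if e > xi]


def select_strongest(energies, count):
    """Indices of the `count` strongest components, as in a fixed-size selection."""
    order = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    return sorted(order[:count])


region_a = [10, 9, 9, 8, 8]   # components at a similar energy level
region_b = [10, 3, 2, 1, 1]   # one dominant component
by_threshold = (select_by_threshold(region_a, 0.5), select_by_threshold(region_b, 0.5))
fixed_count = (select_strongest(region_a, 3), select_strongest(region_b, 3))
```

The same threshold rule keeps five components in the first region and only one in the second, whereas a selection by component count keeps three in both.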

Example 2.

The speech signal with maximal frequency 4 kHz is considered. A voiced time-frequency region is used for watermark modelling and embedding. The procedure is implemented in Matlab 7. The STFT is calculated by using the rectangular window with 1024 samples, and it is then used to obtain the S-method of the signal. Since the speech components are very close to each other in the time-frequency domain, the S-method parameter *L* is chosen so as to avoid the presence of cross-terms. After calculating the inverse transform (the IFFT routine is applied to the S-method), the eigenvalues and eigenvectors are obtained by using the Matlab built-in function (eigs). Twenty eigenvectors are selected, weighted by the corresponding eigenvalues, and merged into a signal with the desired components. Furthermore, the S-method is calculated for the obtained signal, providing the support function *L* _{H} for watermark shaping. Here, the Hanning window with 512 samples is used for the STFT calculation. The watermark is created as a pseudorandom sequence, whose length is determined by the length of the voiced speech region (approximately 1300 samples). The STFT of the watermark is also calculated by using the Hanning window with 512 samples. It is then multiplied by the function *L* _{H} to shape its time-frequency characteristics. For each of the right keys (watermarks), a set of 50 wrong trials is created following the same modelling procedure as for the right keys. The correlation detector based on the S-method coefficients is applied.

The proposed approach preserves favourable properties of the time-frequency-based watermarking procedure [19], which outperforms some existing techniques. An illustration of normalized detector responses for right keys (red line) and wrong trials (blue line) is shown in Figure 7. Furthermore, the robustness is tested against several types of attacks, all being commonly used in existing procedures [5, 8, 10]. Namely, in the existing algorithms, the usual amount of attacks is time scaling up to 4%, wow up to 0.5% or 0.7%, echo 50 ms or 100 ms [5], and so forth, providing the probability of error of order 10^{-6}. We have applied the same types of attacks, but with higher strength, showing that the proposed approach provides robustness even in this case. The proposed procedure is tested on: mp3 compression with constant bit rate (128 Kbps), mp3 compression with variable bit rate (40–50 Kbps), delay (180 ms), Echo (200 ms), pitch scaling (5%), wow (delay 20%), flutter, and amplitude normalization. The measures of detection quality and corresponding probabilities of error are calculated according to (32). The results are given in Table 2. Note that the proposed method provides very low probabilities of error, mostly of order 10^{-7}, even in the presence of stronger attacks. Also, the robustness to pitch scaling has been improved when compared to the results reported in [19].

As expected, the detection results are similar as in [19] where the threshold is well adapted to the energy within the considered speech region. However, in the previous example, it is shown that the optimal threshold selection for one region does not have to be optimal for the other ones. Thus, it can include only a few formants (Figure 5(b) right). Consequently, the detection performance decreases, due to the smaller number of components available for correlation in the time-frequency domain. The procedure performance can vary significantly for different regions, since it is not easy to adjust thresholds separately for each of them. In this example, a single threshold is used. The detection results obtained for the region where the threshold is not optimal are shown in Figure 8. The measures of detection quality have decreased, as shown in Table 3. From this point of view, the flexibility of components selection provided by the proposed approach assures more reliable results.

Table 2: The measures of detection quality for the proposed approach under various attacks.

Table 3: The measures of detection quality.

Attack | R |
---|---|
No attack | 4.3 |
Mp3 constant | 4.1 |
Mp3 variable | 3.9 |
Delay | 4 |
Echo | 4 |
Pitch scaling | 3.9 |
Wow | 1.8 |
Bright flutter | 3.8 |
Amplitude normalization | 4.1 |

## 7. Conclusion

The paper proposes an improved formants selection method for speech watermarking purposes. Namely, the eigenvalues decomposition based on the S-method is used to select different formants within the time-frequency regions of speech signal. Unlike the threshold-based selection, the proposed method allows for an arbitrary choice of components number and their positions in the time-frequency plane. This method results in better performance when compared to the method based on a single threshold. An additional improvement is achieved by adapting the Hermite projection method for characterization of speech regions. This has led to an efficient selection of voiced regions with formants suitable for watermarking. Finally, the watermarking procedure based on the proposed approach provides greater flexibility in implementation and it is characterised by reliable detection results.

## Declarations

### Acknowledgment

This work is supported by the Ministry of Education and Science of Montenegro.


## References

1. Pal SK, Saxena PK, Mutto SK: The future of audio steganography. *Proceedings of Pacific Rim Workshop on Digital Steganography, 2002*.
2. Cvejic N, Seppänen T: Increasing the capacity of LSB based audio steganography. *Proceedings of the 5th IEEE International Workshop on Multimedia Signal Processing, December 2002, St. Thomas, Virgin Islands, USA* 336-338.
3. Shieh C-S, Huang H-C, Wang F-H, Pan J-S: Genetic watermarking based on transform-domain techniques. *Pattern Recognition* 2004, 37(3):555-565. 10.1016/j.patcog.2003.07.003
4. Wang F-H, Jain LC, Pan J-S: VQ-based watermarking scheme with genetic codebook partition. *Journal of Network and Computer Applications* 2007, 30(1):4-23. 10.1016/j.jnca.2005.08.002
5. Kirovski D, Malvar HS: Spread-spectrum watermarking of audio signals. *IEEE Transactions on Signal Processing* 2003, 51(4):1020-1033. 10.1109/TSP.2003.809384
6. Malik H, Ansari R, Khokhar A: Robust audio watermarking using frequency-selective spread spectrum. *IET Information Security* 2008, 2(4):129-150. 10.1049/iet-ifs:20070145
7. Cvejic N, Keskinarkaus A, Seppanen T: Audio watermarking using m-sequences and temporal masking. *Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 2001, New York, NY, USA* 227-230.
8. Cvejic N: *Algorithms for audio watermarking and steganography, Academic dissertation*. University of Oulu, Oulu, Finland; 2004.
9. Kuo S-S, Johnston JD, Turin W, Quackenbush SR: Covert audio watermarking using perceptually tuned signal independent multiband phase modulation. *Proceedings of IEEE International Conference on Acoustic, Speech and Signal Processing, May 2002, Orlando, Fla, USA* 1753-1756.
10. Xiang S, Huang J: Histogram-based audio watermarking against time-scale modification and cropping attacks. *IEEE Transactions on Multimedia* 2007, 9(7):1357-1372.
11. Hofbauer K, Hering H, Kubin G: Speech watermarking for the VHF radio channel. *Proceedings of EUROCONTROL Innovative Research Workshop (INO '05), December 2005, Brétigny-sur-Orge, France* 215-220.
12. Cohen L: Time-frequency distributions - a review. *Proceedings of the IEEE* 1989, 77(7):941-981. 10.1109/5.30749
13. Loughlin PJ: Scanning the special issue on time-frequency analysis. *Proceedings of the IEEE* 1996, 84(9):1195. 10.1109/JPROC.1996.535239
14. Boashash B: *Time-Frequency Analysis and Processing*. Elsevier, Amsterdam, The Netherlands; 2003.
15. Hlawatsch F, Boudreaux-Bartels GF: Linear and quadratic time-frequency signal representations. *IEEE Signal Processing Magazine* 1992, 9(2):21-67. 10.1109/79.127284
16. Stankovic L: Method for time-frequency analysis. *IEEE Transactions on Signal Processing* 1994, 42(1):225-229. 10.1109/78.258146
17. Stanković L, Thayaparan T, Daković M: Signal decomposition by using the S-method with application to the analysis of HF radar signals in sea-clutter. *IEEE Transactions on Signal Processing* 2006, 54(11):4332-4342.
18. Thayaparan T, Stanković L, Daković M: Decomposition of time-varying multicomponent signals using time-frequency based method. *Proceedings of Canadian Conference on Electrical and Computer Engineering (CCECE '06), May 2006, Ottawa, Canada* 60-63.
19. Stanković S, Orović I, Žarić N: Robust speech watermarking procedure in the time-frequency domain. *EURASIP Journal on Advances in Signal Processing* 2008, 2008:-9.
20. Stanković S, Orović I, Žarić N, Ioana C: An approach to digital watermarking of speech signals in the time-frequency domain. *Proceedings of the 48th International Symposium focused on Multimedia Signal Processing and Communications (ELMAR '06), June 2006, Zadar, Croatia* 127-130.
21. Kortchagine D, Krylov A: Image database retrieval by fast Hermite projection method. *Proceedings of the 15th International Conference on Computer Graphics and Applications (GraphiCon '05), June 2005, Novosibirsk Akademgorodok, Russia* 308-311.
22. Kortchagine D, Krylov A: Projection filtering in image processing. *Proceedings of the 10th International Conference on Computer Graphics and Applications (GraphiCon '00), August-September 2000, Moscow, Russia* 42-45.
23. Stanković S: About time-variant filtering of speech signals with time-frequency distributions for hands-free telephone systems. *Signal Processing* 2000, 80(9):1777-1785. 10.1016/S0165-1684(00)00087-6
24. Heeger D: *Signal Detection Theory*. Department of Psychiatry, Stanford University, Stanford, Calif, USA; 1997.
25. Wickens TD: *Elementary Signal Detection Theory*. Oxford University Press, Oxford, UK; 2002.
26. Adelsbach A, Katzenbeisser S, Sadeghi A-R: Watermark detection with zero-knowledge disclosure. *Multimedia Systems* 2003, 9(3):266-278. 10.1007/s00530-003-0098-z

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.