Single channel speech separation in modulation frequency domain based on a novel pitch range estimation method
© Mahmoodzadeh et al; licensee Springer. 2012
Received: 7 May 2011
Accepted: 17 March 2012
Published: 17 March 2012
Computational Auditory Scene Analysis (CASA) has been the focus in recent literature for speech separation from monaural mixtures. The performance of current CASA systems on voiced speech separation strictly depends on the robustness of the algorithm used for pitch frequency estimation. We propose a new system that estimates pitch (frequency) range of a target utterance and separates voiced portions of target speech. The algorithm, first, estimates the pitch range of target speech in each frame of data in the modulation frequency domain, and then, uses the estimated pitch range for segregating the target speech. The method of pitch range estimation is based on an onset and offset algorithm. Speech separation is performed by filtering the mixture signal with a mask extracted from the modulation spectrogram. A systematic evaluation shows that the proposed system extracts the majority of target speech signal with minimal interference and outperforms previous systems in both pitch extraction and voiced speech separation.
Speech separation, as a solution to the cocktail party problem, is a well-known challenge with important applications. For example, telecommunication systems and Automatic Speech Recognition systems lose performance in the presence of interfering sounds [1, 2]. An effective system that segregates speech from interference in monaural (single-microphone) situations would therefore be valuable in such problems. Many methods have been proposed for monaural speech enhancement; for example, see [3–7]. These methods usually assume certain statistical properties for the interference and tend to lack the capacity to deal with a variety of interferences. While monaural speech separation systems still perform poorly, the human auditory system performs proficiently. This perceptual process is known as Auditory Scene Analysis (ASA). Psychoacoustic research in ASA has inspired considerable work in developing Computational Auditory Scene Analysis (CASA) systems for speech separation (see [6, 7] for a comprehensive review).
According to Bregman, the ASA procedure can be separated into two theoretical stages: segmentation and grouping. In the first stage, speech is transformed into a higher-dimensional space (such as a two-dimensional time-frequency representation), and similar time-frequency (T-F) units are then segmented to compose different regions. In the second stage, these regions are combined into different streams based on the relevant acoustic information. The major computational goal of CASA is to separate the target speech signal from the interference for different purposes by generating a binary or a soft T-F mask; see, e.g., [8–10].
Natural speech includes both voiced and unvoiced portions. Voiced portions of speech are described by periodicity (or harmonicity), which has been used as an important feature in many CASA systems for segregating voiced speech (see, e.g. [13, 14]). Despite considerable advances in voiced speech separation, the performance of current CASA systems is still limited by pitch frequency (F0) estimation errors and residual noise. Various methods have been proposed for robust pitch frequency estimation, see e.g., [15, 16]; however, robust pitch frequency estimation in low signal-to-noise ratio (SNR) situations still poses a significant challenge.
While mixed speech may have a great deal of overlap in the time domain, modulation frequency analysis provides an additional dimension that can present a greater degree of separation among sources. In other words, the original T-F representation obtained from transformations like the Short-Time Fourier Transform (STFT) can be augmented with a third dimension that represents modulation frequency. In previous work, modulation spectral analysis has been used as a tool for producing the mask for speech separation in this higher-dimensional space, under the assumption that the pitch frequency range is known and is constant in each filter channel.
Based on the above observations, we propose a new system for single channel separation of voiced speech based on modulation filtering. The idea is that, first, the target pitch (frequency) range is estimated in the modulation frequency domain, and then this range is used for producing the proper mask for speech separation. Modulation analysis and filtering are applied to the target speech separation problem for two reasons. First, there is a general belief that the human ASA system processes sounds in the modulation frequency domain. Second, the energy from two co-channel talkers is largely non-overlapping in the modulation frequency domain. Modulation analysis and filtering have been studied extensively by many researchers in the field of single channel speech separation, and general discussions of this subject are available in the literature.
The proposed system first performs a multipitch range estimation of the target and interference speech based on a segmentation of the modulation spectrogram domain. The segmentation is done using an onset and offset algorithm similar to that proposed by Hu and Wang. In the proposed method, the noisy signal is divided into 200 ms time frames, and the proposed speech separation algorithm is applied to each individual frame. The pitch range estimation method works in three stages. The first stage computes the modulation spectrogram. The second stage decomposes the modulation spectrogram into segments using an onset and offset algorithm. In this stage, the peaks and valleys of the derivative of the smoothed intensity of the modulation spectrogram are first detected and marked as onset and offset candidates. Any onset larger than a certain threshold is accepted, and for each accepted onset, the smallest offset lying between it and the next onset is selected. Then, onset and offset fronts are produced by connecting the common onsets and offsets. Finally, the segments are formed by matching the onset and offset fronts. The third stage determines the range of pitch frequency by selecting and grouping the desired segments.
The separation part of the proposed system aims at obtaining a soft mask in the modulation spectrogram domain. By extending a previously suggested soft mask, a soft mask is proposed whose value depends on the estimated pitch range in each filter channel. To determine the soft mask in each filter channel, we first compute and compare the modulation spectrogram energy of the target and the interference in their pitch ranges estimated in the previous stage. Then, we transform the soft mask to the time domain and filter the mixture signal in order to obtain the separated target signal. Thus, a strategy is suggested that estimates the target pitch range and subsequently segregates the target signal from the interference. Finally, the separated target signal is obtained by arranging the separated signals from all frames in time order.
This article is organized as follows. Section 2 describes the modulation frequency analysis. In Section 3, first, a brief description of the present system is given and then the details of each stage are presented. In Section 4, a quantitative measure is proposed for evaluating the performance of speech separation and it is used for systematic evaluation of pitch range estimation and speech separation. This article concludes with a discussion in Section 5.
2. Modulation frequency analysis
Here, I is the DFT length and i is the modulation frequency index. The modulation transform consists of a filter-bank that uses the DSTFT, followed by a subband envelope detector and then a frequency analyzer of the subband envelopes (the DFT).
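The three steps of the modulation transform (filter-bank, envelope detector, DFT of the envelopes) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `modulation_spectrogram`, the use of the STFT magnitude as the envelope detector, and the interpretation of M as the frame hop are assumptions made for illustration; the defaults K = 128, L = 64, and M = 16 follow the values quoted in the evaluation section.

```python
import numpy as np

def modulation_spectrogram(x, K=128, L=64, M=16, I=None):
    """Return |X(k, i)| with shape (K, I).

    K: number of filter channels (STFT size), L: window length,
    M: frame hop in samples (assumed meaning of M),
    I: modulation DFT length (defaults to the number of frames).
    """
    x = np.asarray(x, dtype=float)
    w = np.hanning(L)
    n_frames = 1 + (len(x) - L) // M
    # Filter-bank: windowed frames followed by a K-point DFT per frame.
    frames = np.stack([x[m * M : m * M + L] * w for m in range(n_frames)])
    stft = np.fft.fft(frames, n=K, axis=1)        # shape (n_frames, K)
    # Envelope detector: magnitude of each subband signal (an assumption).
    envelope = np.abs(stft)
    # Frequency analyzer: DFT of each subband envelope along the frame axis.
    I = I or n_frames
    mod = np.fft.fft(envelope, n=I, axis=0)       # shape (I, K)
    return np.abs(mod).T                          # shape (K, I)
```

With a 16-kHz input and a hop of 16 samples, the subband envelopes are sampled at 1 kHz, so modulation index i corresponds to i·1000/I Hz under these assumptions.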
3. System description
To determine the mentioned pitch ranges, our proposed method uses an onset and offset detection algorithm to find the distribution of modulation spectrogram energy in the modulation frequency domain, which is an important feature for determining the pitch range. Once the modulation spectrogram energy is found, the modulation spectrogram is segmented, as described in Section 3.2.2. Then, the resulting segments are grouped in order to estimate the pitch range of each speaker. A detailed description of the stages is as follows.
3.1. T-F decomposition and modulation transform
At the T-F stage, the STFT (as a uniform filter-bank) is used for decomposing a broadband signal into narrowband subband signals. The output of the T-F stage enters into the modulation transform stage in order to calculate the modulation spectrogram.
3.2. Pitch range estimation in modulation frequency domain
The pitch frequencies of the target and interference speakers are both time-varying. Occasionally, the pitch frequencies of the target and interference speakers are too close to each other, which causes errors in multipitch tracking algorithms and decreases the accuracy of speech separation methods. The algorithm of this article estimates the pitch ranges of the target and interference speakers of noisy speech in the modulation frequency domain. Estimating the pitch range in small time-intervals (for example, 200 ms) decreases the error of the pitch range estimation method.
In the pitch range estimation approach, at first, the intensity of the modulation spectrogram is smoothed over the modulation frequency, using a low-pass filter. Then, the partial derivative of the smoothed intensity over the modulation frequency is computed. By marking the peaks and valleys of the resulting signal, the onset and offset candidates are detected and the onset and offset fronts are formed. By matching the onset and offset fronts, the modulation spectrogram of speech signal is segmented. The detailed description of the stages for the pitch range estimation is as follows.
Here, g_s(i) is a low-pass FIR filter with a small number of coefficients and pass-band [0, s] in Hz, and "*" denotes the convolution operator (over the modulation frequency). The parameter s determines the degree of smoothing: the smaller s is, the smoother the resulting intensity.
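The smoothing and differentiation steps for one filter channel can be sketched as follows. This is only an illustration: the normalized Hann kernel stands in for the low-pass FIR filter g_s (whose coefficients are not given here), `numtaps` is a hypothetical parameter playing the role of the smoothing degree, and the derivative is approximated with finite differences.

```python
import numpy as np

def smooth_and_differentiate(intensity, numtaps=7):
    """Smooth the intensity over modulation frequency, then differentiate.

    intensity: 1-D array, modulation spectrogram intensity of one channel.
    numtaps:   length of the smoothing kernel (more taps = smoother).
    """
    intensity = np.asarray(intensity, dtype=float)
    # Short low-pass kernel (Hann window without its zero endpoints),
    # normalized to unit gain at DC. A stand-in for g_s.
    g = np.hanning(numtaps + 2)[1:-1]
    g /= g.sum()
    smoothed = np.convolve(intensity, g, mode="same")
    # Finite-difference approximation of the partial derivative
    # with respect to the modulation frequency index.
    derivative = np.gradient(smoothed)
    return smoothed, derivative
```

Peaks of the returned derivative mark places where the intensity rises fastest, and valleys mark where it falls fastest, which is what the candidate detection in the next step relies on.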
3.2.2. Onset/offset detection and matching
In every filter channel k, to determine the offset corresponding to each onset candidate, let fon[k, l] represent the modulation frequency of the l-th onset candidate in filter channel k. The corresponding offset, denoted by foff[k, l], is located between fon[k, l] and fon[k, l+1]. If there are multiple offset candidates in this interval, the one with the largest intensity decrease (i.e., the smallest value of the derivative) is chosen.
After finding the onsets and offsets, those with close modulation frequencies are connected into onset and offset fronts, because the frequency components of onsets and offsets with close modulation frequencies probably correspond to the same source. Onset and offset fronts are vertical contours across acoustic frequency in the modulation spectrogram domain. The proposed system connects an onset candidate in one filter channel to an onset candidate in the adjacent filter channel above it, provided that their distance in modulation frequency is less than a certain threshold defined relative to the latter filter channel. In each filter channel, this threshold is defined as the mean of the distances, in the modulation frequency direction, between pairs of adjacent onsets. This definition of the threshold was derived from experiments and proved to be a good choice on the data. The same applies to the offset candidates. Notice that a threshold that is too small may prevent onsets or offsets from the same event from joining, while a threshold that is too large may cause onsets from different events to be connected.
The next step is to form segments by matching individual onset and offset fronts. Consider (fon[k, l_k], fon[k+1, l_{k+1}], ..., fon[k+r-1, l_{k+r-1}]) as an onset front spanning r consecutive filter channels, in which l_k denotes the index of the onset selected as a member of the onset front in filter channel k, and consider (foff[k, l_k], foff[k+1, l_{k+1}], ..., foff[k+r-1, l_{k+r-1}]) as the corresponding offset modulation frequencies. For each offset modulation frequency, we first find all the offset fronts that cross this offset; then, the offset front with the most crossings (with the offset modulation frequencies) is chosen as the matching offset front. All the filter channels from k to k+r-1 occupied by the matching offset front (and their corresponding offset modulation frequencies on this matching offset front) are then labeled as "matched." If all the channels from k to k+r-1 are labeled as matched, the matching procedure finishes; otherwise, the matched channels are put aside and the procedure is repeated for the remaining unmatched channels.
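The candidate detection and the onset-to-offset pairing in a single filter channel can be sketched as below. The function name `detect_onsets_offsets` and the parameter `onset_threshold` are hypothetical, and local maxima and minima are found with a simple neighbor comparison; the source does not specify these details.

```python
import numpy as np

def detect_onsets_offsets(derivative, onset_threshold):
    """Pair each accepted onset with one offset in a single channel.

    derivative: derivative of the smoothed intensity over modulation frequency.
    Returns a list of (onset_index, offset_index) pairs.
    """
    d = np.asarray(derivative, dtype=float)
    idx = np.arange(1, len(d) - 1)
    # Peaks are onset candidates, valleys are offset candidates.
    peaks = idx[(d[idx] > d[idx - 1]) & (d[idx] >= d[idx + 1])]
    valleys = idx[(d[idx] < d[idx - 1]) & (d[idx] <= d[idx + 1])]
    # Keep only onsets whose derivative exceeds the threshold.
    onsets = [p for p in peaks if d[p] > onset_threshold]
    pairs = []
    for l, on in enumerate(onsets):
        # The offset lies between this onset and the next one.
        upper = onsets[l + 1] if l + 1 < len(onsets) else len(d)
        between = [v for v in valleys if on < v < upper]
        if between:
            # Largest intensity decrease = smallest derivative value.
            off = min(between, key=lambda v: d[v])
            pairs.append((int(on), int(off)))
    return pairs
```

An onset without any offset candidate before the next onset is simply dropped in this sketch, which is one possible design choice among several.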
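The connection rule between two adjacent filter channels can be illustrated with the following sketch. It assumes that each onset in channel k is compared only with its nearest onset in channel k+1, which the text does not state explicitly; the function name `link_onsets` is hypothetical. The threshold is the mean spacing between adjacent onsets in the upper channel, as described above.

```python
import numpy as np

def link_onsets(onsets_k, onsets_k1):
    """Link onsets in channel k to onsets in the adjacent channel k+1.

    onsets_k, onsets_k1: onset modulation frequencies of the two channels.
    Returns a list of (f_on in channel k, f_on in channel k+1) links.
    """
    upper = np.sort(np.asarray(onsets_k1, dtype=float))
    if len(upper) < 2:
        return []  # the mean spacing is undefined with fewer than two onsets
    # Threshold: mean distance between adjacent onsets in channel k+1.
    threshold = np.mean(np.diff(upper))
    links = []
    for f in onsets_k:
        j = int(np.argmin(np.abs(upper - f)))  # nearest onset (assumption)
        if abs(upper[j] - f) < threshold:
            links.append((float(f), float(upper[j])))
    return links
```

Chaining such links across consecutive channels yields the onset fronts; the same routine can be applied to offset candidates.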
3.2.3. Segment selection and decision-making
By detecting the onsets and offsets and forming the onset and offset fronts, the modulation spectrogram domain of speech signal is segmented. Since the speaker's pitch range is [60, 350] Hz (for men, women, and children), only the segments with modulation frequencies in this range are accepted. Now, we describe the grouping procedure for the segments.
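The pitch-range constraint on segments can be expressed in a few lines. The sketch assumes that a segment is represented by its onset and offset modulation frequencies in Hz and that it is accepted only if it lies entirely inside [60, 350] Hz; whether partially overlapping segments should also be kept is not specified in the text.

```python
def select_pitch_segments(segments_hz, f_min=60.0, f_max=350.0):
    """Keep segments (f_on, f_off) that lie inside the speaker pitch range."""
    return [(on, off) for on, off in segments_hz
            if f_min <= on and off <= f_max]
```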
3.3. Speech separation
A mask has previously been presented for speech separation in the modulation spectrogram domain, assuming that the pitch ranges of the target and interference are known and that these ranges are the same in each subband. Our system extends this idea by allowing the value of the mask in each filter channel to depend on the estimated pitch range of that filter channel.
Consider a given signal x(n) that is the sum of a target signal x_ts(n) and an interference signal x_is(n), sampled at f_s Hz, i.e., x(n) = x_ts(n) + x_is(n). A proper mask should be estimated for segregating the target signal from the interference signal. In each filter channel k, the pitch ranges of the target and interfering speakers are obtained from the previous stage. For a pitch range PF_k in filter channel k, the corresponding set of modulation frequency indices is also defined.
Finally, the separated target signal in the time domain is obtained by taking the inverse STFT of the filtered mixture.
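The idea of comparing target and interference energy within their estimated pitch ranges can be illustrated with a simple energy-ratio mask. This is a stand-in and not the mask of the proposed system, whose formula is not reproduced in this text: the ratio form, the single scalar value per channel, and the names `channel_soft_mask`, `target_idx`, and `interf_idx` are all assumptions.

```python
import numpy as np

def channel_soft_mask(mod_spec, target_idx, interf_idx, eps=1e-12):
    """Illustrative per-channel soft mask from modulation energies.

    mod_spec:   |X(k, i)|, shape (K, I).
    target_idx: per channel, modulation indices of the target pitch range.
    interf_idx: per channel, modulation indices of the interference range.
    Returns one mask value in [0, 1] per filter channel.
    """
    K = mod_spec.shape[0]
    mask = np.zeros(K)
    for k in range(K):
        e_t = np.sum(mod_spec[k, target_idx[k]] ** 2)   # target energy
        e_i = np.sum(mod_spec[k, interf_idx[k]] ** 2)   # interference energy
        mask[k] = e_t / (e_t + e_i + eps)               # assumed ratio form
    return mask
```

A value near 1 indicates a channel dominated by the target within its pitch range, and a value near 0 a channel dominated by the interference.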
As mentioned earlier, our system estimates the pitch range and uses this range for the speech separation. In this section, we evaluate the proposed system in the processes of pitch range estimation and speech separation.
4.1. Pitch range estimation
First, the proposed system is evaluated in the pitch range estimation process with utterances chosen from Lee's database and from a corpus of 100 mixtures of speech and interference that is commonly used for CASA research; see, e.g., [13, 25, 26]. The corpus contains utterances from both male and female speakers. These utterances are mixed with a set of intrusions at different SNR levels. These intrusions are N0: 1 kHz pure tone; N1: white noise; N2: noise bursts; N3: cocktail party noise; N4: rock music; N5: siren; N6: trill telephone; N7: female speech; N8: male speech; and N9: female speech. These intrusions have a considerable variety; for example, N3 is noise-like, while N5 contains strong harmonic sounds. They form a realistic corpus for evaluating the capacity of a CASA system when it deals with various types of interference.
The signal X(k, i) is the modulation spectrogram of an input signal that is digitized at a 16-kHz sampling rate. The parameters of the proposed system are set to M = 16 and K = 128. w(n) is a Hanning window with length L = 64 (refer to Section 2). The STFT filter-bank has 128 filter channels, for which the center frequency of the k th filter channel is ω k = 2πk/K, k = 0,..., K-1.
A reliable evaluation of the proposed system requires a reference range of the true pitch. However, such a reference is probably impossible to obtain from noisy speech. We therefore find the reference pitch range by framing the clean speech signal and calculating the pitch frequency in each frame.
The performance of the proposed method is compared with that of the Least Square Harmonic (LSH) technique, the Robust Algorithm for Pitch Tracking (RAPT), and the Maximum A Posteriori (MAP) estimator. RAPT and MAP are two standard pitch estimation algorithms. The LSH algorithm, derived for harmonic decomposition of a time-varying signal, estimates the harmonic amplitudes and phases by solving a set of linear equations that minimizes the mean square error. The RAPT algorithm estimates the pitch frequency by searching for local maxima in the autocorrelation function of the windowed speech signal and then applying a dynamic programming technique. The MAP approach considers a harmonic model for the voiced speech, so that each windowed signal is expressed with a generalized linear model whose basis functions depend on the fundamental frequency and the number of harmonic partials.
Figure 10 also provides a comparison between the results of the pitch estimation using the mentioned four methods, in which the proposed system performs consistently better than the three standard methods, at all SNR levels. Although the performance of the LSH model (as the best performing one among the mentioned standard algorithms) is good at SNR levels above 10 dB, it drops quickly as SNR decreases, which shows that the proposed system is more robust to interference compared with the LSH model.
As previously reported, MAP performs slightly better at low SNRs than at high SNRs. In addition, RAPT fails to estimate the desired pitch period at low SNRs, because it mistakenly chooses sub-harmonic and harmonic partials instead of the true pitch period. The current scheme performs almost consistently at both high and low SNRs.
4.2. Voiced speech separation
A corpus of 100 mixtures composed of 10 target utterances mixed with 10 intrusions is used for assessing the performance of the system on voiced speech separation; these data are described in Section 4.1. For comparison, the Hu and Wang system and the spectral subtraction method are employed. The performance of the voiced speech separation is evaluated using two measures commonly used for this purpose:
The percentage of energy loss, PEL, which measures the amount of the target speech excluded from the segregated speech.
The percentage of residual noise, PNR, which measures the amount of the intrusion included in the segregated speech.
In the SNR measure, s(n) is the target signal before being mixed with the intrusion, and it is compared with the estimated (segregated) signal.
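The SNR of the segregated target can be computed as in the sketch below. It assumes the conventional definition, namely the ratio of the clean target energy to the energy of the difference between the clean target and the estimate; the exact formula used in the evaluation is not reproduced in this text, and the function name `snr_db` is hypothetical.

```python
import numpy as np

def snr_db(target, estimate):
    """SNR in dB of an estimate against the clean target (assumed definition)."""
    target = np.asarray(target, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    noise = target - estimate            # everything that differs from s(n)
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))
```

The SNR gain of a system is then the difference between this value for the segregated output and the same measure applied to the unprocessed mixture.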
As shown in Figure 11, the proposed system segregates 78.9% of the voiced target energy at -5 dB SNR and 99% at 15 dB SNR. At the same time, at -5 dB, 15.9% of the segregated energy belongs to intrusion. This number drops to 0.7% at 15 dB SNR. Figure 11c shows the SNR of the segregated target. This system obtains an average 7.5 dB gain in SNR when the mixture SNR is -5 dB. This gain increases to 14.3 dB, when the mixture SNR is 15 dB. As shown in the figure, the segregated target loses more target energy (Figure 11a), but contains less interference as well (Figure 11b).
Figure 11 also shows the performance of the system proposed by Hu and Wang for voiced speech separation, which is representative of CASA systems. As shown in the figure, the Hu and Wang system yields a lower percentage of residual noise (Figure 11b), but has a much higher percentage of target energy loss (Figure 11a, c). Our system significantly improves the PEL (in Figure 11a, by around 11 and 10% at 0 and 15 dB, respectively), which leads to much less signal distortion. The price paid for this is a slight increase in PNR, as depicted in Figure 11b (by around 6 and 0.5% at 0 and 15 dB, respectively).
To help the reader recognize the real difference in the performance, a file is prepared including sample audio mixture signals (target speech signal + interference signal) and the results of the separation using the Spectral Subtraction, Hu and Wang, and the proposed systems. The file is available at http://ee.yazduni.ac.ir/sprl/ASP-AM-SampleWaves.ppt.
5. Discussions and conclusions
One of the major challenges in speech enhancement is the separation of a target speech from an interference signal of the same type. The accuracy of the CASA methods in single channel speech separation depends on the correctness of the pitch frequency estimation of two simultaneous speakers because the proper mask in the T-F domain for the speech separation is produced in association with the estimated pitch frequency.
In this article, a single channel speech separation system is proposed that estimates the pitch range of one or two speakers and segregates the target speech from the interference. The pitch range is estimated using the onset and offset algorithm, considering the distribution of speaker energy in the modulation spectrogram domain. When the target and interference speakers are of the same gender (both male or both female), methods for pitch frequency estimation encounter large errors because the pitch frequency values are close. Therefore, CASA methods that employ the pitch frequency as their main feature for speech separation face difficulties. In contrast, a main novelty of the present algorithm is the estimation of the pitch range based on short time-frames of the mixture signal. The constructed mask for speech separation depends on the pitch range estimated independently in each subband. As shown by the evaluation results, major portions of the voiced target speech are separated from the interfering speech using this mask. In addition, the proposed system can separate the unvoiced portions that are quasi-periodic because of their proximity to voiced portions.
The proposed algorithm is robust to interference and produces good estimates of both pitch range and voiced speech, even in the presence of strong interference. Systematic evaluation shows that the proposed algorithm performs significantly better than the mentioned CASA and speech enhancement systems.
Silent gaps and other interference-masked intervals are usually included in natural speech utterances. In practice, the utterance across such time-intervals should be grouped. This is a sequential grouping problem [5, 6] whose segments or masks can be obtained using speech recognition in a top-down manner (limited to non-speech interference) or using speaker recognition with trained speaker models. However, the proposed algorithm does not encounter this problem of sequential grouping because it operates in the modulation spectrogram domain.
In terms of computational complexity, the main cost of the proposed algorithm arises from determining segments in modulation spectrogram for pitch range estimation. The estimation of the mask and convolution for speech separation consumes a small fraction of the overall cost. Both tasks (pitch range estimation and speech separation) are implemented in the frequency domain, so the computational complexity is O(N logN), where N is the number of samples in the input signal. These operations should separately be performed for each subband. On the other hand, since feature extraction takes place independently in different subbands, substantial speedup can be achieved through parallel computing.
For future work, the proposed algorithm can be improved by iterative estimation of pitch range and speech separation. The algorithm can include a specific method to jump-start the iterative process, which gives an initial estimate of both pitch range and mask with reasonable quality. In general, the performance of the algorithm depends on the initial estimate of pitch range; better initial estimates would lead to better performance. Even with a poor estimate of pitch range, which is unavoidable in very low SNR conditions, the proposed algorithm improves the initial estimate during the iterative process.
- Lippmann RP: Speech recognition by machines and humans. Speech Commun 1997, 22: 1-16. 10.1016/S0167-6393(97)00021-6
- Sroka JJ, Braida LD: Human and machine consonant recognition. Speech Commun 2005, 45: 410-423.
- de Cheveigne A: Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Edited by: Wang DL, Brown GJ. Wiley & IEEE, Hoboken, NJ; 2006:45-79.
- Dubnov S, Tabrikian J, Arnon-Targan M: Speech source separation in convolutive environments using space-time-frequency analysis. EURASIP J Appl Signal Process 2006, 2006: 38412, 11.
- Bregman AS: Auditory Scene Analysis. MIT, Cambridge, MA; 1990.
- Wang DL, Brown GJ (Eds): Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley & IEEE, Hoboken, NJ; 2006.
- Buchler M, Allegro S, Launer S, Dillier N: Sound classification in hearing aids inspired by auditory scene analysis. EURASIP J Appl Signal Process 2005, 18: 2991-3002.
- Hu G, Wang D: A Tandem algorithm for pitch estimation and voiced speech segregation. IEEE Trans Audio Speech Lang Process 2007, 18(8):2067-2079.
- Shao Y, Srinivasan S, Jin Z, Wang D: A computational auditory scene analysis system for speech segregation and robust speech recognition. Comput Speech Lang 2010, 24: 77-93. 10.1016/j.csl.2008.03.004
- Radfar MH, Dansereau RM, Sayadiyan A: A maximum likelihood estimation of vocal-tract-related filter characteristics for single channel speech separation. EURASIP J Audio Speech Music Process 2007, 2007: 84186, 15.
- Barker J, Cooke M, Ellis D: Decoding speech in the presence of other sources. Speech Commun 2005, 45: 5-25. 10.1016/j.specom.2004.05.002
- Shao Y, Wang DL: Model-based sequential organization in cochannel speech. IEEE Trans Acoust Speech Signal Process 2005, 14: 289-298.
- Brown GJ, Cooke M: Computational auditory scene analysis. Comput Speech Lang 1994, 8: 297-336. 10.1006/csla.1994.1016
- Hu G, Wang DL: Monaural speech separation based on pitch tracking and amplitude modulation. IEEE Trans Neural Net 2004, 15: 1135-1150. 10.1109/TNN.2004.832812
- Wu M, Wang DL, Brown GJ: A multipitch tracking algorithm for noisy speech. IEEE Trans Speech Audio Process 2003, 11: 229-241. 10.1109/TSA.2003.811539
- Le Roux J, Kameoka H, Ono N, de Cheveigne A, Sagayama S: Single and multiple F0 contour estimation through parametric spectrogram modeling of speech in noisy environments. IEEE Trans Audio Speech Lang Process 2007, 15: 1135-1145.
- Schimmel SM, Atlas LE, Nie K: Feasibility of single channel speaker separation based on modulation frequency analysis. Proc IEEE International Conference on Acoustics, Speech and Signal Processing, Hawaii, USA 2007, 4: 605-608.
- Schimmel SM: Dissertation, University of Washington; 2007.
- Atlas L, Shamma SA: Joint acoustic and modulation frequency. EURASIP J Appl Signal Process 2003, 2003(7):668-675. 10.1155/S1110865703305013
- Hu G, Wang DL: Auditory segmentation based on onset and offset analysis. IEEE Trans Audio Speech Lang Process 2007, 15(2):396-405.
- Drullman R, Festen JM, Plomp R: Effect of temporal envelope smearing on speech reception. J Acoust Soc Am 1994, 95: 1053-1064. 10.1121/1.408467
- Schimmel SM, Atlas LE: Coherent envelope detection for modulation filtering of speech. Proc IEEE International Conference on Acoustics, Speech and Signal Processing, Pennsylvania, USA 2005, 221-224.
- Lee TW: Blind source separation: audio examples. 1998. http://www.snl.salk.edu/~tewon/Blind/blind_audio.html. Accessed 4 May 2011
- Cooke MP: Modeling Auditory Processing and Organization. Cambridge University Press, Cambridge; 1993.
- Drake LA: Dissertation, University of Northwestern; 2001.
- Wang DL, Brown GJ: Separation of speech from interfering sounds based on oscillatory correlation. IEEE Trans Neural Netw 1999, 10: 684-697. 10.1109/72.761727
- Li Q, Atlas L: Time-variant least-squares harmonic modeling. Proc IEEE International Conference on Acoustics, Speech and Signal Processing, Hong Kong 2003, 2: 41-44.
- Talkin D: A robust algorithm for pitch tracking (RAPT). In Speech Coding and Synthesis. Edited by: Klein WB, Paliwal KK. Elsevier, New York, NY; 1995:495-518.
- Tabrikian J, Dubnov S, Dickalov Y: Maximum a posterior probability pitch tracking in noisy environments using harmonic model. IEEE Trans Speech Audio Process 2004, 12: 76-87. 10.1109/TSA.2003.819950
- Huang X, Acero A, Hon HW: Spoken Language Processing: A Guide to Theory, Algorithms, and System Development. Prentice Hall PTR, Upper Saddle River, NJ; 2001.
- Shao Y: Dissertation, University of Ohio State; 2007.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.