Reliable likelihood ratios for statistical model-based voice activity detector with low false-alarm rate
EURASIP Journal on Advances in Signal Processing volume 2011, Article number: 31 (2011)
Abstract
The role of the statistical model-based voice activity detector (SMVAD) is to detect speech regions in input signals using statistical models of noise and noisy speech. The decision rule of SMVAD is based on the likelihood ratio test (LRT). The LRT-based decision rule may cause detection errors because of the statistical properties of noise and speech signals. In this article, we first analyze why these detection errors occur and then propose two modified decision rules using reliable likelihood ratios (LRs). We also propose an effective weighting scheme that considers the spectral characteristics of noise and speech signals. In the experiments reported in this study, the proposed methods show significant performance improvement in various noise conditions with almost no additional computation. Experimental results also show that the proposed weighting scheme provides additional performance improvement over the two proposed SMVADs.
1. Introduction
The purpose of a voice activity detector (VAD) is to discriminate between speech and non-speech regions of the input signal in various noisy conditions. VAD techniques have been widely used as a preprocessor in many speech application fields, such as speech recognition, speaker recognition, speech coding, and speech enhancement, because they help improve the performance of recognition systems and the channel efficiency of speech coding systems. In general, most conventional VAD systems assume that the statistical properties of noise are stationary over a longer period than those of speech, which makes it possible to estimate noise statistics in spite of the occasional presence of speech [1]. By comparing the estimated noise and speech statistics, we can detect speech regions in unknown input signals.
As the demand for more accurate VADs in noisy conditions increases, many efforts have been made to enhance VAD performance [2–14]. One successful approach is the statistical model-based VAD (SMVAD) proposed by Sohn et al. [2], which utilizes the complex Gaussian probability density function (PDF). More recently, various efforts have been made to optimize SMVAD by modifying the decision rule originally derived from the likelihood ratio test (LRT). To decrease detection errors at speech offset regions, Sohn et al. [3] proposed an effective hangover scheme based on the hidden Markov model (HMM), and Cho and Kondoz [4] proposed smoothed likelihood ratios (SLRs) in the decision rule. Other approaches have involved various statistical models for noise and noisy speech [5] and a discriminative weight training (DWT) scheme [6]. The DWT scheme is a good approach to optimizing frequency weights, but it does not consider temporal variations of the input signal statistics because the optimized weights are calculated only once over the whole training data. The DWT scheme is also not very practical, since the weights need to be optimized separately for each noise condition according to noise type and signal-to-noise ratio (SNR) level. Another technique is the moving-averaged decision rule over a certain number of neighboring frames, applied to improve the performance of SMVAD [7]. However, to our knowledge, there has been no study of the reliability of likelihood ratios (LRs).
In this article, we analyze the problems of the LRT-based decision rule and various properties of noise spectra in terms of signal power and the related SNRs. Based on this analysis, we propose modified decision rules that select reliable LRs and a weighting scheme that takes into account the difference between noise and noisy speech. The main advantage of these methods is that each shows significant performance improvement with almost no additional computational cost.
This article is organized as follows. In Section 2, we introduce the modeling of noise and noisy speech used to construct the decision rule of SMVAD. In addition, we describe estimation techniques for related parameters such as the a priori and a posteriori SNRs, the noise variance, and the speech absence probability (SAP). In Section 3, we analyze the LRT-based decision rule of SMVAD and explain overlooked phenomena produced by noise or noisy speech. In Section 4, we propose new decision rules that select reliable LRs and a weighting scheme applied to every LR to reduce detection errors. In Section 5, we describe the test environments and demonstrate the improved performance of the proposed methods compared with conventional SMVAD methods in various noise conditions.
2. Statistical model-based voice activity detector
In this section, we briefly review the overall process of SMVAD, which uses the complex Gaussian PDF to detect speech regions in adverse noise environments. The PDF used for SMVAD assumes that there is no correlation between the real and imaginary parts of the spectral components.
2.1. Noise and noisy speech modeling
The SMVAD is based on two hypotheses, H_{0} and H_{1}, which cover the only two cases considered, noise only and noisy speech, respectively:
H_{0} (speech absence): Y(n) = N(n)
H_{1} (speech presence): Y(n) = S(n) + N(n)
where Y(n) = [Y_{0}(n), Y_{1}(n),...,Y_{M-1}(n)], N(n) = [N_{0}(n), N_{1}(n),...,N_{M-1}(n)], and S(n) = [S_{0}(n), S_{1}(n),...,S_{M-1}(n)] represent the M-dimensional discrete Fourier transform (DFT) coefficient vectors of the input signal, noise, and clean speech at the nth frame, respectively. In the SMVAD, the following assumptions are made:

1. Noise is additive and its statistics are uncorrelated with speech.
2. All DFT coefficients are independent of each other.
3. The likelihood of Y_{k}(n) conditioned on each hypothesis can be modeled by the zero-mean complex Gaussian PDF.
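Under these assumptions, the PDFs of Y(n) conditioned on each hypothesis are given by
$$ p(Y(n)|H_{0}) = \prod_{k=0}^{M-1} \frac{1}{\pi \lambda_{N,k}} \exp\left( -\frac{|Y_{k}(n)|^{2}}{\lambda_{N,k}} \right) \tag{1} $$
$$ p(Y(n)|H_{1}) = \prod_{k=0}^{M-1} \frac{1}{\pi (\lambda_{N,k} + \lambda_{S,k})} \exp\left( -\frac{|Y_{k}(n)|^{2}}{\lambda_{N,k} + \lambda_{S,k}} \right) \tag{2} $$
where k is the frequency bin index, and λ_{N,k} and λ_{S,k} denote the variances of noise and speech, respectively.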
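As a minimal sketch of the model in this subsection, the following Python code evaluates the log of the zero-mean complex Gaussian PDF under each hypothesis, treating the DFT bins as independent. The function names and the use of NumPy are illustrative assumptions and are not part of the original method description.

```python
import numpy as np

def log_pdf_complex_gaussian(Y, var):
    """Log of the zero-mean complex Gaussian PDF, summed over independent bins.

    Y   : complex DFT coefficients of one frame (length M)
    var : per-bin variance (length M, strictly positive)
    """
    Y = np.asarray(Y, dtype=complex)
    var = np.asarray(var, dtype=float)
    return float(np.sum(-np.log(np.pi * var) - np.abs(Y) ** 2 / var))

def frame_log_likelihoods(Y, noise_var, speech_var):
    """Return (log p(Y|H0), log p(Y|H1)) for one frame.

    Under H0 the variance is the noise variance alone; under H1 it is the
    sum of the noise and speech variances (noise uncorrelated with speech).
    """
    noise_var = np.asarray(noise_var, dtype=float)
    speech_var = np.asarray(speech_var, dtype=float)
    log_p_h0 = log_pdf_complex_gaussian(Y, noise_var)
    log_p_h1 = log_pdf_complex_gaussian(Y, noise_var + speech_var)
    return log_p_h0, log_p_h1
```

Working in the log domain avoids numerical underflow when the product over many bins is formed.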
2.2. Decision rule based on LRT
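The decision rule of the SMVAD is derived from the log likelihood ratio (LLR) at each frequency bin, which is given by
$$ \Lambda_{k}(n) = \log \frac{p(Y_{k}(n)|H_{1})}{p(Y_{k}(n)|H_{0})} = \frac{\gamma_{k}(n)\,\xi_{k}(n)}{1 + \xi_{k}(n)} - \log\left(1 + \xi_{k}(n)\right) \tag{3} $$
where ξ_{k}(n) = λ_{S,k}/λ_{N,k} is the a priori signal-to-noise ratio (SNR) and γ_{k}(n) = |Y_{k}(n)|^{2}/λ_{N,k} is the a posteriori SNR. Since λ_{S,k} and λ_{N,k} must be estimated, the a priori SNR is commonly obtained with the well-known decision-directed (DD) method [15], which is given as
$$ \hat{\xi}_{k}(n) = \alpha \frac{|\hat{S}_{k}(n-1)|^{2}}{\hat{\lambda}_{N,k}(n-1)} + (1-\alpha)\, P[\hat{\gamma}_{k}(n) - 1] \tag{4} $$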
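where α is the weighting term (e.g., 0.98), |\hat{S}_{k}(n-1)|^{2} is an estimate of the short-time power spectrum of clean speech derived from the minimum mean square error short-time spectral amplitude (MMSE-STSA) estimator [15], P[x] = x if x ≥ 0 and P[x] = 0 otherwise, and \hat{λ}_{N,k}(n) is the estimated noise variance, which is given by [16] as
$$ \hat{\lambda}_{N,k}(n+1) = \zeta_{N}\, \hat{\lambda}_{N,k}(n) + (1-\zeta_{N})\, E[|N_{k}(n)|^{2} \,|\, Y_{k}(n)] \tag{5} $$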
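where 0 < ζ_{N} < 1 is the smoothing parameter, E[·] is the expectation operator, and |N_{k}(n)|^{2} is the noise power spectrum. In Equation 5, the expectation term is given by
$$ E[|N_{k}(n)|^{2} \,|\, Y_{k}(n)] = E[|N_{k}(n)|^{2} \,|\, Y_{k}(n), H_{0}]\, p(H_{0}|Y_{k}(n)) + E[|N_{k}(n)|^{2} \,|\, Y_{k}(n), H_{1}]\, p(H_{1}|Y_{k}(n)) \tag{6} $$
where
$$ E[|N_{k}(n)|^{2} \,|\, Y_{k}(n), H_{0}] = |Y_{k}(n)|^{2} \tag{7} $$
$$ E[|N_{k}(n)|^{2} \,|\, Y_{k}(n), H_{1}] = \frac{\hat{\xi}_{k}(n)}{1+\hat{\xi}_{k}(n)}\, \hat{\lambda}_{N,k}(n) + \left( \frac{1}{1+\hat{\xi}_{k}(n)} \right)^{2} |Y_{k}(n)|^{2} \tag{8} $$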
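In Equation 6, p(H_{0}|Y_{k}(n)) is the SAP at the kth frequency bin, derived from Bayes' rule such that [8]
$$ p(H_{0}|Y_{k}(n)) = \frac{p(Y_{k}(n)|H_{0})\, p(H_{0})}{p(Y_{k}(n)|H_{0})\, p(H_{0}) + p(Y_{k}(n)|H_{1})\, p(H_{1})} \tag{9} $$
with p(H_{0}) representing the a priori probability of speech absence, which is set to 0.2 in our case.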
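The parameter estimation described in this subsection can be sketched per frame as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the smoothing default `zeta=0.98` is an assumed value (the text only says it is close to 1), and the conditional noise-power expectations use the commonly cited soft-decision form, which is assumed here.

```python
import numpy as np

def dd_a_priori_snr(prev_speech_power, prev_noise_var, gamma, alpha=0.98):
    """Decision-directed a priori SNR estimate (per bin).

    prev_speech_power : clean-speech power estimate from the previous frame
    prev_noise_var    : noise variance estimate from the previous frame
    gamma             : a posteriori SNR of the current frame
    """
    ml_term = np.maximum(np.asarray(gamma, dtype=float) - 1.0, 0.0)  # P[x] = max(x, 0)
    return alpha * prev_speech_power / prev_noise_var + (1.0 - alpha) * ml_term

def speech_absence_prob(gamma, xi, p_h0=0.2):
    """Per-bin speech absence probability p(H0 | Y_k) from Bayes' rule."""
    # Likelihood ratio p(Y_k|H1) / p(Y_k|H0) for the complex Gaussian model.
    lr = np.exp(gamma * xi / (1.0 + xi)) / (1.0 + xi)
    return 1.0 / (1.0 + (1.0 - p_h0) / p_h0 * lr)

def update_noise_var(noise_var, power, xi, sap, zeta=0.98):
    """Recursive noise-variance update with a soft-decision expectation.

    power : |Y_k(n)|^2 of the current frame
    sap   : speech absence probability p(H0 | Y_k)
    The H1 expectation below is an assumed, commonly used form.
    """
    e_h0 = power                                   # noise only: all power is noise
    e_h1 = xi / (1.0 + xi) * noise_var + power / (1.0 + xi) ** 2
    expected_noise_power = sap * e_h0 + (1.0 - sap) * e_h1
    return zeta * noise_var + (1.0 - zeta) * expected_noise_power
```

For very large a posteriori SNRs the exponential in `speech_absence_prob` can overflow; a practical implementation would likely clip the exponent, which is omitted here for brevity.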
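With the estimated parameters, the decision rule of SMVAD is given by
$$ \frac{1}{M} \sum_{k=0}^{M-1} \hat{\Lambda}_{k}(n) \underset{H_{0}}{\overset{H_{1}}{\gtrless}} \eta \tag{10} $$
where \hat{Λ}_{k}(n) is the LLR computed with \hat{ξ}_{k}(n) and \hat{γ}_{k}(n), and η is the decision threshold.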
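A minimal sketch of the conventional decision rule, assuming the standard closed-form LLR of the complex Gaussian model (a posteriori SNR times ξ/(1+ξ), minus log(1+ξ)) and a frame score equal to the average LLR over all bins. The function names are hypothetical.

```python
import numpy as np

def llr(gamma, xi):
    """Per-bin log likelihood ratio for the complex Gaussian model."""
    gamma = np.asarray(gamma, dtype=float)
    xi = np.asarray(xi, dtype=float)
    return gamma * xi / (1.0 + xi) - np.log1p(xi)

def smvad_decision(power, noise_var, xi, eta):
    """Average the LLRs over all bins and compare with the threshold eta.

    power     : |Y_k(n)|^2 per bin
    noise_var : estimated noise variance per bin
    xi        : estimated a priori SNR per bin
    Returns (is_speech, score).
    """
    gamma = np.asarray(power, dtype=float) / np.asarray(noise_var, dtype=float)
    score = float(np.mean(llr(gamma, xi)))
    return score > eta, score
```

Returning the score alongside the binary decision makes it easy to sweep the threshold later when drawing ROC curves.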
3. Analysis of LRT-based decision rule
In general, the LRT-based decision rule of SMVAD overlooks two undesirable problems. The first is that LRs do not always take high values even if the input signal contains speech. Because of the basic assumption that noise is uncorrelated with speech, the complex Gaussian models of the input signal must satisfy the following condition:
$$ \lambda_{N,k} + \lambda_{S,k} \ge \lambda_{N,k} \tag{11} $$
With condition (11), the peak of p(Y(n)|H_{1}) is always lower than or equal to that of p(Y(n)|H_{0}). Thus, p(Y(n)|H_{1}) cannot always be larger than p(Y(n)|H_{0}), even in the case of speech presence. Therefore, an increased variance does not guarantee increased LLR values. Figure 1 shows an example of this case with three complex Gaussian PDFs having different variances. In Figure 1, the dotted line indicates the PDF with the noise variance only, and the dashed and the solid lines represent the PDFs for which the a priori SNRs are 10 and 5 dB with the given noise variance, respectively.
Figure 2 shows the two LLR curves related to Figure 1 as a function of the spectral amplitude of the input signal, where the dashed line indicates the LLR with the lower a priori SNR and the solid line represents the LLR with the higher a priori SNR. In Figure 2, the dashed circles show the difference between the two LLRs for low- and high-powered spectra at the given frequency bin. In the left circle, there is only a small difference between the solid and dashed lines, and the solid line may even be lower than the dashed line, even though the given a priori SNRs differ substantially. The right circle, on the other hand, shows a large difference between the two LLR lines.
Comparing the two cases shows that the spectral power of the input signal plays an important role in the discriminative property of the decision rule, because the conventional decision rule of the SMVAD is the average of all LLRs. In other words, the accuracy of the decision rule may be degraded by LLRs derived from low-powered input spectra.
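The crossing behavior described above can be checked numerically. The sketch below assumes the standard closed-form LLR of the complex Gaussian model and expresses input power through the a posteriori SNR rather than the spectral amplitude; the helper names are hypothetical.

```python
import math

def llr(gamma, xi):
    """LLR for a posteriori SNR gamma and a priori SNR xi (linear scale)."""
    return gamma * xi / (1.0 + xi) - math.log1p(xi)

def db_to_linear(db):
    """Convert a power ratio in dB to linear scale."""
    return 10.0 ** (db / 10.0)

def crossover_gamma(xi_a, xi_b):
    """A posteriori SNR at which the two LLR lines intersect."""
    slope_a = xi_a / (1.0 + xi_a)
    slope_b = xi_b / (1.0 + xi_b)
    return (math.log1p(xi_a) - math.log1p(xi_b)) / (slope_a - slope_b)

xi_high = db_to_linear(10.0)   # higher a priori SNR
xi_low = db_to_linear(5.0)     # lower a priori SNR
g_cross = crossover_gamma(xi_high, xi_low)

# Below the crossover the higher a priori SNR yields the LOWER LLR.
low_power_gap = llr(1.0, xi_high) - llr(1.0, xi_low)     # negative
high_power_gap = llr(10.0, xi_high) - llr(10.0, xi_low)  # positive
```

With a priori SNRs of 10 and 5 dB, the two lines cross at an a posteriori SNR of about 6.5 (linear scale), so bins whose power is only a few times the noise variance carry LLRs that do not reflect the higher a priori SNR.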
Figure 3 shows an example of undesirable LLRs in a speech frame caused by the combination of a high a priori SNR and low spectral power. In Figure 3a, the x-axis represents the frequency bin index, the solid line indicates the spectral power of noisy speech, and the dotted line represents the estimated noise variance. As shown in Figure 3b, even though the a priori SNR is estimated to be high in this speech frame, it causes low LLR values when the input signal power is lower than the estimated noise variance. If the LLRs shown in Figure 3c are used for the decision in SMVAD, this speech frame may not be detected as a speech frame. In Figure 3c, it is also observed that most of the LLRs are close to 0 and that the high LLRs are located in the highest-powered frequency region. From the investigation of the first problem, we conclude that the decision rule of SMVAD uses the LLR at every frequency bin, but the LLRs in the low-powered frequency region do not all contribute to a correct decision.
The second problem of SMVAD also occurs in the low-powered frequency region of the noise. As mentioned in Section 1, the basic assumption of SMVAD is that noise is stationary over a long period of time, but in practice the power of most real noise tends to change slightly from frame to frame. To accommodate this phenomenon, the estimated noise variance, \hat{λ}_{N,k}(n), needs to be reconsidered. Since the fixed smoothing parameter ζ_{N} is generally chosen to be very close to 1, the estimated noise variance changes very smoothly. Because of this effect, the estimated noise variance can keep the a priori SNR very low throughout the noise-only frames. As a result, the LLR with estimated SNRs can be simplified by
[Equation 12]
Because of the smoothing operation, the a priori SNR can be kept low or change very smoothly over the non-speech region. However, the a posteriori SNR is greatly influenced by the unstable property of noise when the noise is not stationary. The variations in the noise spectrum are not very serious in terms of the absolute value of the noise, which is favorable for estimating a reliable noise variance. However, since the a posteriori SNR is the ratio of the varying input spectrum |Y_{k}(n)|^{2} to the almost fixed noise variance \hat{λ}_{N,k}(n), the a posteriori SNRs in the low-powered frequency regions are less likely to settle at fixed low values. Figure 4 shows an example of the transition of the a posteriori SNR for two different noise variances, where the y-axis is the ratio (noise variance + varying range)/noise variance, which corresponds to the a posteriori SNR.
As shown in Figure 4, when the noise variance is very small, the a posteriori SNR increases much more rapidly over the same varying range than it does for a high noise variance. In the noise-only frames, the average of the LLRs has to be close to 0. However, the a posteriori SNRs in the low-powered frequency region can be high and may make certain LLR values as high as those in a speech frame. When these LLRs are used in the decision rule, some of these noise frames can be detected as speech frames. Figure 5, where the x-axis represents the frequency bin, shows an actual case of the average car noise spectrum and the variance of the a posteriori SNR according to the noise variance estimated in each frame. As already mentioned, in most high-powered frequency regions the variances of the a posteriori SNR are very close to 0, which means that high LLR values are unlikely. In a low-powered region, on the contrary, higher variances may appear. In the case of (12), this effect brings about a high a posteriori SNR, which causes high LLRs.
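The sensitivity described above follows directly from the ratio (noise variance + varying range)/noise variance. The short sketch below uses hypothetical example values to compare the same absolute fluctuation against a small and a large noise variance.

```python
def a_posteriori_ratio(noise_var, fluctuation):
    """Ratio (noise variance + varying range) / noise variance."""
    return (noise_var + fluctuation) / noise_var

# Same absolute fluctuation applied to a low- and a high-powered bin
# (the numbers are illustrative, not taken from the experiments).
fluctuation = 0.5
ratio_low_power_bin = a_posteriori_ratio(0.1, fluctuation)    # small variance
ratio_high_power_bin = a_posteriori_ratio(10.0, fluctuation)  # large variance
```

Here the low-powered bin sees its a posteriori SNR jump to about 6, while the high-powered bin stays near 1.05, which illustrates why noise-only frames can still produce large LLRs in low-powered bins.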
In summary, if the input signal includes speech, the LLRs in the low-powered region are not reliable because there is no way to judge whether they are caused by the speech signal or by varying noise spectral components. Therefore, the LLRs in the high-powered region are more important for giving the decision rule an enhanced discriminative property. Figure 6 shows an actual case in which SMVAD can recover a speech frame when its decision rule uses only the LLRs in the high-powered region. In Figure 6, if we average all LLRs for the decision, the speech frame plotted in Figure 6a is not detected. On the contrary, based on our analysis, if we select or properly weight the LLRs for the decision rule, we can detect the speech frame, because all of the LLRs in the low-powered region in Figure 6b can be excluded from the decision or reduced by proper weights that attenuate the effects of unreliable LLRs.
4. Modified decision rules
4.1. Selection of reliable LRs
Considering the two undesirable phenomena analyzed in Section 3, we find that LLRs in the high-powered frequency region are more reliable than those in the low-powered region. However, the notion of the high-powered region is still ambiguous for time-varying input frames. In the case of additive noise, since the average noise spectrum is almost fixed, the high-powered region may also be fixed for every noise frame, but in speech frames the high-powered region can be moved by the speech signal. Therefore, we need to find the high-powered region independently of the neighboring frames and consider only the total power of the current frame. Here, three assumptions are used for the selection of reliable LLRs:

1. The property of noise is mainly dependent on high-powered but less varying frequency components, for which LLRs can be kept low.
2. Most of the noise spectral power is concentrated in the high-powered region, irrespective of the existence of a speech component.
3. When a speech component exists in the current frame, the LLRs in the high-powered frequency region produced by the speech component may show high values.
Therefore, we propose two modified decision rules that select the frequency bins with reliable LLRs on the basis of spectral power. First, we reorder the input signal vector in terms of spectral power as Y^{(R)}(n) = [Y^{(1)}(n), Y^{(2)}(n),...,Y^{(M)}(n)], where |Y^{(r)}(n)|^{2} ≥ |Y^{(s)}(n)|^{2} for r > s, and we define the corresponding LLR vector Λ^{(R)}(n) = [Λ^{(1)}(n), Λ^{(2)}(n),...,Λ^{(M)}(n)], where each element is related to its corresponding Y^{(r)}(n). With this vector, the first modified decision rule is defined as
$$ \frac{1}{N_{H}} \sum_{r=M-N_{H}+1}^{M} \Lambda^{(r)}(n) \underset{H_{0}}{\overset{H_{1}}{\gtrless}} \eta \tag{13} $$
where N_{H} denotes the number of LRs selected according to the spectral power of the frequency bins. With this decision rule, only the LLRs of the high-power frequency bins are considered; N_{H} is determined empirically.
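A minimal sketch of the first modified rule: sort the bins by spectral power, keep the LLRs of the N_H highest-powered bins, and compare their average with the threshold. The function name, the argument layout, and the tie-breaking behavior of the sort are assumptions made for illustration.

```python
import numpy as np

def decision_high_power(power, llrs, n_h, eta):
    """Decision using only the LLRs of the n_h highest-powered bins.

    power : |Y_k(n)|^2 per bin
    llrs  : LLR per bin (same ordering as power)
    n_h   : number of bins to keep (1 <= n_h <= number of bins)
    Returns (is_speech, score).
    """
    power = np.asarray(power, dtype=float)
    llrs = np.asarray(llrs, dtype=float)
    if not 1 <= n_h <= power.size:
        raise ValueError("n_h must be between 1 and the number of bins")
    order = np.argsort(power)      # ascending power, as in the reordered vector
    selected = order[-n_h:]        # indices of the n_h highest-powered bins
    score = float(np.mean(llrs[selected]))
    return score > eta, score
```

For example, with powers `[1, 9, 3, 7]`, LLRs `[5, 2, -1, 4]`, and `n_h=2`, only the bins with powers 9 and 7 are used, so the large LLR in the weakest bin no longer affects the score.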
The second method compares the power of each bin with the average power of the frame. Based on this idea, the second modified decision rule is given by
$$ \frac{1}{N_{A}} \sum_{r=1}^{M} f[\Lambda^{(r)}(n), Y_{avg}(n)] \underset{H_{0}}{\overset{H_{1}}{\gtrless}} \eta \tag{14} $$
where f[Λ^{(r)}(n), Y_{avg}(n)] = Λ^{(r)}(n) if |Y^{(r)}(n)|^{2} ≥ Y_{avg}(n), and f[Λ^{(r)}(n), Y_{avg}(n)] = 0 otherwise, Y_{avg}(n) is the average spectral power of the frame, and N_{A} is the number of spectral components greater than or equal to the average power of the frame. In this method, we assume that the spectral power in the high-powered region of noise is always greater than the frame average power.
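The second rule can be sketched in the same way, assuming that the average power is the mean of the per-bin spectral powers of the current frame. The function name is hypothetical.

```python
import numpy as np

def decision_above_average(power, llrs, eta):
    """Decision using only the LLRs of bins at or above the frame average power.

    power : |Y_k(n)|^2 per bin
    llrs  : LLR per bin (same ordering as power)
    Returns (is_speech, score, n_a), where n_a is the number of bins kept.
    """
    power = np.asarray(power, dtype=float)
    llrs = np.asarray(llrs, dtype=float)
    mask = power >= power.mean()   # the maximum bin always satisfies this
    n_a = int(mask.sum())
    score = float(llrs[mask].sum() / n_a)
    return score > eta, score, n_a
```

Unlike the first rule, the number of selected bins adapts to each frame, so no count has to be tuned in advance.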
4.2. Weighting scheme considering reliability of LRs
Based on the analysis of the LRT-based decision rule, we also propose a weighting scheme that reflects the reliability of each LLR. As mentioned in Section 3, the LLRs in the low-powered region of noise are not reliable because of the variation of the a posteriori SNR, so it is desirable to give more importance to the spectral powers of noisy speech that are much higher than the noise variance. In addition, the closer the noise variance of a bin is to the highest noise variance in the current frame, the more reliable the LLR derived from its a posteriori SNR. Thus, the weights applied to each LLR are defined by
[Equation 16]
Here, the reference variance in (16) is the highest variance in the vector of estimated noise variances \hat{λ}_{N,k}(n) over all frequency bins k. In (16), each w_{k}(n) reduces the effects of the unstable a posteriori SNRs and causes the LLRs in the high-powered region to keep their own values or to increase. Thus, the new decision rule with this weight is given by
[Equation 17]
5. Experiments
In the experiments, the test data were composed of 60-s-long speech data from the IEEE sentences listed in Table 1 and noise data from the AURORA database. The speech data were spoken by three male and three female speakers and sampled at 8 kHz. We used a frame size of 20 ms and a frame shift of 10 ms, and a VAD decision was made for every frame. The test material was hand-labeled and consisted of 67% speech frames and 33% silence frames. For these experiments, we used three types of noise (car, babble, and street) at SNRs of 5, 10, and 15 dB.
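At 8 kHz, a 20 ms frame corresponds to 160 samples and a 10 ms shift to 80 samples. The sketch below splits a signal into overlapping frames with these defaults; the function name is hypothetical, and windowing and the DFT are left out because the text does not specify them.

```python
import numpy as np

def frame_signal(x, fs=8000, frame_ms=20, shift_ms=10):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    x = np.asarray(x)
    frame_len = fs * frame_ms // 1000   # 160 samples at 8 kHz
    hop = fs * shift_ms // 1000         # 80 samples at 8 kHz
    n_frames = 1 + (len(x) - frame_len) // hop if len(x) >= frame_len else 0
    # Row i holds samples i*hop ... i*hop + frame_len - 1.
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    return x[idx]
```

Trailing samples that do not fill a complete frame are dropped in this sketch; padding them would be an equally reasonable choice.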
For comparison with the proposed methods, we used four conventional methods as baseline systems. The first is the typical SMVAD proposed by Sohn et al. [2], as described in Section 2, and the second is SMVAD with the HMM hangover scheme specified in [3]. The third is the DWT scheme proposed in [6]. For training and testing the DWT method, we used the same parameters as specified in [6]. The DWT experiment was performed as a round-robin test, with six sets of 10-s-long data used in turn for testing and the remaining data used for training. The fourth method is SMVAD using the multiple observation LRT (MO-LRT) proposed in [7]. For the fourth method, we used one frame before and one frame after the current frame to be classified as speech or non-speech; this number was chosen experimentally.
The proposed methods are all evaluated by receiver operating characteristic (ROC) curves, which show the discriminative properties of a VAD between noise-only and noisy speech frames in terms of the speech detection rate (SDR) and the false-alarm rate (FAR), such that
$$ \mathrm{SDR} = \frac{N_{CS}}{N_{TS}}, \qquad \mathrm{FAR} = \frac{N_{FS}}{N_{TN}} \tag{18} $$
where N_{CS}, N_{TS}, N_{FS}, and N_{TN} denote the number of correctly detected speech frames, total speech frames, falsely detected speech frames in silence regions, and total silence frames, respectively. In these experiments, we set N_{H} = 10 in (13).
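The two rates can be computed from frame-level decisions and hand labels as sketched below. The function name and the boolean encoding (True for speech) are assumptions for illustration.

```python
import numpy as np

def sdr_far(decisions, labels):
    """Speech detection rate and false-alarm rate from per-frame results.

    decisions : VAD output per frame (True = detected as speech)
    labels    : reference label per frame (True = speech, False = silence)
    """
    decisions = np.asarray(decisions, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    n_cs = np.sum(decisions & labels)     # correctly detected speech frames
    n_ts = np.sum(labels)                 # total speech frames
    n_fs = np.sum(decisions & ~labels)    # speech detected in silence frames
    n_tn = np.sum(~labels)                # total silence frames
    return float(n_cs / n_ts), float(n_fs / n_tn)
```

The sketch assumes that the labels contain at least one speech frame and one silence frame; otherwise one of the rates is undefined.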
In every ROC curve, HMM, DWT, MO, HP, AP, and Weight in parentheses denote the results of the HMM hangover scheme, the DWT scheme, the MO-LRT scheme, the first proposed decision rule in (13), the second proposed decision rule in (14), and the decision rule with the proposed weighting scheme, respectively. For a practical comparison of performance, we focus on the SDRs of the methods when the FAR is low; we consider that a VAD shows a reasonable discriminating property when FAR < 0.2. In the ROC curves, the red lines represent the results of the proposed methods and the blue lines indicate the results of the conventional methods.
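An ROC curve of this kind can be traced by sweeping the decision threshold over the frame scores. The following is a sketch under the assumption that each frame has a scalar score (for example, an averaged LLR) and a reference label; both helper names are hypothetical.

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """Return a list of (FAR, SDR) points, one per threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    points = []
    for eta in thresholds:
        detected = scores > eta
        sdr = np.sum(detected & labels) / np.sum(labels)
        far = np.sum(detected & ~labels) / np.sum(~labels)
        points.append((float(far), float(sdr)))
    return points

def sdr_at_far(points, max_far):
    """Best SDR among operating points whose FAR does not exceed max_far."""
    eligible = [sdr for far, sdr in points if far <= max_far]
    return max(eligible) if eligible else None
```

Calling `sdr_at_far(points, 0.2)` picks out the low-FAR operating region on which the comparison focuses.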
In the car noise environment of Figure 7, all of the proposed methods perform better than the conventional methods. In addition, the SDRs of SMVAD(HP) and SMVAD(Weight) are at least 0.1 higher than those of SMVAD when FAR < 0.05. In particular, SMVAD(Weight) keeps the highest SDR at extremely low FAR.
In the babble noise environment of Figure 8, every proposed method again performs better than the conventional methods, as in the car noise environment, but the performance improvements of SMVAD(HP) and SMVAD(AP) are not noticeable. However, the performance improvement of SMVAD(Weight) remains constant as the SNR becomes higher and can still be considered significant.
In the street noise environment of Figure 9, SMVAD(Weight) is effective in improving the performance of SMVAD and shows a significantly higher SDR than SMVAD(HMM) at FAR = 0.05 with 5 dB SNR. At 10 dB SNR, the performance of SMVAD(HMM) is almost the same as that of SMVAD(HP) or SMVAD(AP), but it is still not better than that of SMVAD(Weight).
From the experimental results, we observe that SMVAD(Weight) shows the highest and most consistent performance improvement in all noise conditions. In addition, SMVAD(DWT) did not perform better than SMVAD(Weight), although the complexity of SMVAD(Weight) is almost the same as that of SMVAD. From these results, we conclude that the variation of the input signal statistics has a large influence on the accuracy of VAD, and that if the specific noise type is not known in advance, it is better to use SMVAD(Weight) for stable performance improvement under unexpected noise environments.
6. Conclusion
In this article, we introduced SMVAD and analyzed the averaged LRT-based decision rule and the undesirable phenomena that can occur with it in various noise environments. To reduce these phenomena, we proposed two modified decision rules based on the selection of reliable LRs and a weighting scheme applied to the LLRs used in the decision rule. Compared with the conventional methods, the proposed methods were shown to be much more robust in various noise environments, without any training procedure and with almost no additional computational complexity. Among the proposed methods, SMVAD(Weight) showed the most reliable performance improvement under various conditions.
For further study, the properties of speech and noise that can be applied to the weights for LLRs need to be analyzed in more detail.
Abbreviations
DWT: discriminative weight training
FAR: false-alarm rate
HMM: hidden Markov model
LRs: likelihood ratios
LRT: likelihood ratio test
MMSE-STSA: minimum mean square error short-time spectral amplitude
MO-LRT: multiple observation LRT
PDF: probability density function
ROC: receiver operating characteristic
SAP: speech absence probability
SDR: speech detection rate
SLRs: smoothed likelihood ratios
SMVAD: statistical model-based voice activity detector
SNR: signal-to-noise ratio
VAD: voice activity detector.
References
1. Srinivasan K, Gersho A: Voice activity detection for cellular networks. Proc IEEE Speech Coding Workshop 1993, 85-86.
2. Sohn J, Sung W: A voice activity detector employing soft decision based noise spectrum adaptation. Proc Int Conf Acoustics, Speech, and Signal Processing 1998, 365-368.
3. Sohn J, Kim NS, Sung W: A statistical model-based voice activity detection. IEEE Signal Process Lett 1999, 6(1):1-3. 10.1109/97.736233
4. Cho YD, Kondoz A: Analysis and improvement of a statistical model-based voice activity detector. IEEE Signal Process Lett 2001, 8(10):276-278. 10.1109/97.957270
5. Chang JH, Kim NS, Mitra SK: Voice activity detection based on multiple statistical models. IEEE Trans Signal Process 2006, 54(6):1965-1976.
6. Kang SI, Jo QH, Chang JH: Discriminative weight training for a statistical model-based voice activity detection. IEEE Signal Process Lett 2008, 15:170-173.
7. Ramírez J, Segura JC, Benítez C, García L, Rubio A: Statistical voice activity detection using a multiple observation likelihood ratio test. IEEE Signal Process Lett 2005, 12(10):689-692.
8. Kim YG, Suh YJ, Kim HR: Selection of reliable likelihood ratios for statistical model-based voice activity detection. Proc of Asia Pacific Signal and Information Processing Association Annual Summit and Conf. 2009, CD-ROM 2009.
9. Ramírez J, Segura JC, Benítez C, Torre Á, Rubio AJ: A new Kullback-Leibler VAD for speech recognition in noise. IEEE Signal Process Lett 2004, 11(2):266-269. 10.1109/LSP.2003.821762
10. Garner PN, Fukada T, Komori Y: A differential spectral voice activity detector. Proc Int Conf Acoustics, Speech, and Signal Processing 2004, 597-600.
11. Davis A, Nordholm S, Togneri R: Statistical voice activity detection using low-variance spectrum estimation and an adaptive threshold. IEEE Trans Audio Speech Lang Process 2006, 14(2):412-424.
12. Shin JW, Kwon HJ, Jin SH, Kim NS: Voice activity detection based on conditional MAP criterion. IEEE Signal Process Lett 2008, 15:257-260.
13. Ramírez J, Górriz JM, Segura JC, Puntonet CG, Rubio AJ: Speech/non-speech discrimination based on contextual information integrated bispectrum LRT. IEEE Signal Process Lett 2006, 13(8):497-500.
14. Chang JH, Shin JW, Kim NS: Voice activity detector employing generalised Gaussian distribution. Electron Lett 2004, 40(24):1561-1563. 10.1049/el:20047090
15. Ephraim Y, Malah D: Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 1984, ASSP-32(6):1109-1121.
16. Kim NS, Chang JH: Spectral enhancement based on global soft decision. IEEE Signal Process Lett 2000, 7(5):108-110. 10.1109/97.841154
Acknowledgements
This study was supported by the NRF grant funded by the Korean government (MEST) (No. 20110017967).
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Kim, Y., Suh, Y. & Kim, H. Reliable likelihood ratios for statistical model-based voice activity detector with low false-alarm rate. EURASIP J. Adv. Signal Process. 2011, 31 (2011). https://doi.org/10.1186/1687-6180-2011-31