Musical-Noise Analysis in Methods of Integrating Microphone Array and Spectral Subtraction Based on Higher-Order Statistics
EURASIP Journal on Advances in Signal Processing volume 2010, Article number: 431347 (2010)
Abstract
We conduct an objective analysis of musical noise generated by two methods of integrating microphone array signal processing and spectral subtraction. To obtain better noise reduction, methods of integrating microphone array signal processing and nonlinear signal processing have been studied. However, nonlinear signal processing often generates musical noise. Since such musical noise causes discomfort to users, it is desirable to mitigate it. Moreover, it has recently been reported that higher-order statistics are strongly related to the amount of musical noise generated. This implies that the integration method can be optimized from the viewpoint of not only noise reduction performance but also the amount of musical noise generated. Thus, we analyze the simplest integration methods, which combine the delay-and-sum beamformer and spectral subtraction, and fully clarify the features of the musical noise generated by each method. As a result, it is clarified that a specific structure of integration is preferable from the viewpoint of the amount of musical noise generated. The validity of the analysis is shown via a computer simulation and a subjective evaluation.
1. Introduction
There have recently been various studies on microphone array signal processing [1]; in particular, the delay-and-sum (DS) array [2–4] and the adaptive beamformer [5–7] are the most conventionally used microphone arrays for speech enhancement. Moreover, many methods of integrating microphone array signal processing and nonlinear signal processing such as spectral subtraction (SS) [8] have been studied with the aim of achieving better noise reduction [9–15]. It has been well demonstrated that such integration methods can achieve higher noise reduction performance than that obtained using conventional adaptive microphone arrays [13] such as the Griffiths-Jim array [6]. However, a serious problem exists in such methods: artificial distortion (so-called musical noise [16]) due to nonlinear signal processing. Since this artificial distortion causes discomfort to users, it is desirable that musical noise be controlled through signal processing. However, in almost all nonlinear noise reduction methods, the strength parameter used to mitigate musical noise is determined heuristically. Although there have been some studies on reducing musical noise [16] and on nonlinear signal processing with less musical noise [17], evaluations have mainly depended on subjective tests by humans, and, to the best of our knowledge, no objective evaluations have been performed.
In our recent study, it was reported that the amount of musical noise generated is strongly related to the difference between higher-order statistics (HOS) before and after nonlinear signal processing [18]. This fact makes it possible to analyze the amount of musical noise arising through nonlinear signal processing. Therefore, on the basis of HOS, we can establish a mathematical metric for the amount of musical noise generated in an objective manner. One of the authors has analyzed single-channel nonlinear signal processing based on this objective metric and clarified the features of the amount of musical noise generated [18, 19]. In addition, this objective metric suggests the possibility that methods of integrating microphone array signal processing and nonlinear signal processing can be optimized from the viewpoint of not only noise reduction performance but also sound quality as perceived by human hearing. As a first step toward achieving this goal, in this study we analyze the simplest case of the integration of microphone array signal processing and nonlinear signal processing by considering the integration of DS and SS. As a result of the analysis, we clarify the musical-noise generation features of two types of methods for integrating microphone array signal processing and SS.
Figure 1 shows a typical architecture used for the integration of microphone array signal processing and SS, where SS is performed after beamforming. Thus, we call this type of architecture BF+SS. Such a structure has been adopted in many integration methods [11, 15]. On the other hand, the integration architecture illustrated in Figure 2 is an alternative architecture used when SS is performed before beamforming. Such a structure is less commonly used, but some integration methods use this structure [12, 14]. In this architecture, channelwise SS is performed before beamforming, and we call this type of architecture chSS+BF.
We have already tried to analyze such methods of integrating DS and SS from the viewpoint of musical-noise generation on the basis of HOS [20]. However, in the analysis, we did not consider the effect of flooring in SS and the noise reduction performance. On the other hand, in this study we perform an exact analysis considering the effect of flooring in SS and the noise reduction performance. We analyze these two architectures on the basis of HOS and obtain the following results.
(i) The amount of musical noise generated strongly depends on not only the oversubtraction parameter of SS but also the statistical characteristics of the input signal.
(ii) Except for the specific condition that the input signal is Gaussian, the noise reduction performances of the two methods are not equivalent even if we set the same SS parameters.
(iii) Under equivalent noise reduction performance conditions, chSS+BF generates less musical noise than BF+SS for almost all practical cases.
The most important contribution of this paper is that these findings are mathematically proved. In particular, the amount of musical noise generated and the noise reduction performance resulting from the integration of microphone array signal processing and SS are analytically formulated on the basis of HOS. Although there have been many studies on optimization methods based on HOS [21], this is the first time they have been used for musical-noise assessment. The validity of the analysis based on HOS is demonstrated via a computer simulation and a subjective evaluation by humans.
The rest of the paper is organized as follows. In Section 2, the two methods of integrating microphone array signal processing and SS are described in detail. In Section 3, the HOS-based metric used to quantify the amount of musical noise generated is described. Next, the musical-noise analysis of SS, microphone array signal processing, and their integration methods is presented in Section 4. In Section 5, the noise reduction performances of the two integration methods are discussed, and the two methods are compared under equivalent noise reduction performance conditions in Section 6. Moreover, the results of a computer simulation and experiments are given in Section 7. Following a discussion of the experimental results, we give our conclusions in Section 8.
2. Methods of Integrating Microphone Array Signal Processing and SS
In this section, the formulations of the two methods of integrating microphone array signal processing and SS are described. First, BF+SS, which is a typical method of integration, is formulated. Next, an alternative method of integration, chSS+BF, is introduced.
2.1. Sound-Mixing Model
In this study, a uniform linear microphone array is assumed, where the coordinates of the elements are denoted by (see Figure 3) and is the number of microphones. We consider one target speech signal and an additive interference signal. Multiple mixed signals are observed at each microphone element, and the short-time analysis of the observed signals is conducted by a frame-by-frame discrete Fourier transform (DFT). The observed signals are given by
where is the observed signal vector, is the transfer function vector, is the target speech signal, and is the noise signal vector.
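For reference, a standard formulation consistent with this description is sketched below; the symbols (x for the observed signal vector, h for the transfer function vector, s for the target speech, n for the noise vector, f for the frequency bin, and τ for the frame index) are illustrative choices rather than the paper's exact notation.

```latex
\mathbf{x}(f,\tau) = \mathbf{h}(f)\,s(f,\tau) + \mathbf{n}(f,\tau).
```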
2.2. SS after Beamforming
In BF+SS, the single-channel target-speech-enhanced signal is first obtained by beamforming, for example, by DS. Next, single-channel noise estimation is performed by a beamforming technique, for example, null beamformer [22] or adaptive beamforming [1]. Finally, we extract the resultant target-speech-enhanced signal via SS. The full details of signal processing are given below.
To enhance the target speech, DS is applied to the observed signal. This can be represented by
where is the coefficient vector of the DS array and is the specific fixed look direction known in advance. Also, is the sampling frequency, is the DFT size, and is the sound velocity. Finally, we obtain the target-speech-enhanced spectral amplitude based on SS. This procedure can be expressed as
where this procedure is a type of extended SS [23]. Here, is the target-speech-enhanced signal, is the oversubtraction parameter, is the flooring parameter, and is the estimated noise signal, which can generally be obtained by a beamforming technique such as fixed or adaptive beamforming. denotes the expectation operator with respect to the time-frame index. For example, can be expressed as [13]
where is the filter coefficient vector of the null beamformer [22] that steers the null directivity to the speech direction , and is the gain adjustment term, which is determined in a speech break period. Since the null beamformer can remove the speech signal by steering the null directivity to the speech direction, we can estimate the noise signal. Moreover, a method exists in which independent component analysis (ICA) is utilized as a noise estimator instead of the null beamformer [15].
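To make the BF+SS chain concrete, the following minimal sketch implements DS steering followed by extended SS with oversubtraction and flooring for a single frequency bin. The function names, the weight convention, and the parameter symbols beta and eta are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of BF+SS for one frequency bin (illustrative, not the authors' code).
import numpy as np

def ds_weights(num_mics, d, theta, freq, c=343.0):
    """Delay-and-sum weights steering a uniform linear array to angle theta (rad)."""
    m = np.arange(num_mics)
    delays = m * d * np.sin(theta) / c                 # relative time delay per element
    return np.exp(-2j * np.pi * freq * delays) / num_mics

def bf_plus_ss(X, noise_power_est, w, beta=2.0, eta=0.2):
    """BF+SS for one frequency bin.
    X: (num_mics, num_frames) complex observed spectra.
    noise_power_est: long-term-averaged noise power after beamforming (scalar).
    Returns the target-speech-enhanced amplitude spectrum."""
    y = w.conj() @ X                                   # DS beamformer output
    power = np.abs(y) ** 2
    subtracted = power - beta * noise_power_est        # oversubtraction
    floored = (eta * np.abs(y)) ** 2                   # flooring: scaled observation
    return np.sqrt(np.where(subtracted > 0.0, subtracted, floored))
```

In practice the noise power estimate fed into the subtraction would come from a null beamformer or an adaptive beamformer, as described above.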
2.3. Channelwise SS before Beamforming
In chSS+BF, we first perform SS independently in each input channel, yielding a multichannel target-speech-enhanced signal. This channelwise SS can be expressed as
where is the target-speech-enhanced signal obtained by SS at a specific channel and is the estimated noise signal in the th channel. For instance, the multichannel noise can be estimated by single-input multiple-output ICA (SIMO-ICA) [24] or a combination of ICA and the projection back method [25]. These techniques can provide the multichannel estimated noise signal, unlike traditional ICA. SIMO-ICA can separate mixed signals not into monaural source signals but into SIMO-model signals as observed at the microphones. Here, SIMO denotes the specific transmission system in which the input signal is a single source signal and the outputs are its transmitted signals observed at multiple microphones. Thus, the output signals of SIMO-ICA maintain the rich spatial qualities of the sound sources [24]. Also, the projection back method provides SIMO-model-separated signals using the inverse of an optimized ICA filter [25].
Finally, we extract the target-speech-enhanced signal by applying DS to the multichannel spectral-subtracted signal. This procedure can be expressed by
where is the final output of chSS+BF.
Such a chSS+BF structure performs DS after (multichannel) SS. Since DS is basically signal processing in which the summation of the multichannel signal is taken, it can be considered that interchannel smoothing is applied to the multichannel spectral-subtracted signal. On the other hand, the resultant output signal of BF+SS remains as it is after SS. That is to say, it is expected that the output signal of chSS+BF is more natural (contains less musical noise) than that of BF+SS. In the following sections, we reveal that chSS+BF can output a signal with less musical noise than BF+SS in almost all cases on the basis of HOS.
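For comparison with the BF+SS sketch in Section 2.2, the following sketch implements the alternative ordering (channelwise SS followed by DS) under the same assumptions; again the function names and the parameter symbols beta and eta are illustrative.

```python
# Sketch of chSS+BF for one frequency bin (illustrative, not the authors' code).
import numpy as np

def ss_power(obs_power, noise_power_est, beta, eta):
    """Power-domain SS with oversubtraction beta and flooring eta."""
    subtracted = obs_power - beta * noise_power_est
    return np.where(subtracted > 0.0, subtracted, eta**2 * obs_power)

def chss_bf(X, noise_power_est, w, beta=2.0, eta=0.0):
    """chSS+BF for one frequency bin.
    X: (num_mics, num_frames) complex observed spectra.
    noise_power_est: (num_mics,) long-term-averaged noise power per channel.
    w: (num_mics,) DS weights for the look direction."""
    amp = np.sqrt(ss_power(np.abs(X)**2, noise_power_est[:, None],
                           beta, eta))                 # channelwise SS
    X_ss = amp * np.exp(1j * np.angle(X))              # keep each channel's observed phase
    return w.conj() @ X_ss                             # DS: interchannel smoothing
```

The only difference from the BF+SS sketch is the ordering: here the nonlinear flooring operates independently in each channel, and the subsequent DS averages the channelwise artifacts, which is the smoothing effect discussed above.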
3. Kurtosis-Based Musical-Noise Generation Metric
3.1. Introduction
It has been reported by the authors that the amount of musical noise generated is strongly related to the difference between the kurtosis of a signal before and after signal processing. Thus, in this paper, we analyze the amount of musical noise generated through BF+SS and chSS+BF on the basis of the change in the measured kurtosis. Hereinafter, we give details of the kurtosis-based musical-noise metric.
3.2. Relation between Musical-Noise Generation and Kurtosis
In our previous works [18–20], we defined musical noise as the audible isolated spectral components generated through signal processing. Figure 4(b) shows an example of a spectrogram of musical noise in which many isolated components can be observed. We speculate that the amount of musical noise is strongly related to the number of such isolated components and their level of isolation.
Hence, we introduce kurtosis to quantify the isolated spectral components, and we focus on the changes in kurtosis. Since isolated spectral components stand out from the surrounding spectrum, they are heard as tonal sounds, which results in our perception of musical noise. Therefore, it is expected that counting the number of tonal components will enable us to quantify the amount of musical noise. However, such a measurement is extremely complicated, so instead we introduce a simple statistical estimate, namely kurtosis.
This strategy allows us to obtain the characteristics of tonal components. The adopted kurtosis can be used to evaluate the width of the probability density function (p.d.f.) and the weight of its tails; that is, kurtosis can be used to evaluate the percentage of tonal components among the total components. A larger value indicates a signal with a heavy tail in its p.d.f., meaning that it has a large number of tonal components. Also, kurtosis has the advantageous property that it can be easily calculated in a concise algebraic form.
3.3. Kurtosis
Kurtosis is one of the most commonly used HOS for the assessment of non-Gaussianity. Kurtosis is defined as
where is a random variable, is the kurtosis of , and is the th-order moment of . Here is defined as
where denotes the p.d.f. of . Note that this is not a central moment but a raw moment. Thus, (7) is not kurtosis according to the mathematically strict definition, but a modified version; however, we refer to (7) as kurtosis in this paper.
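For reference, a raw-moment kurtosis of the kind described here can be written as follows, with illustrative symbols (x for the random variable, P(x) for its p.d.f.):

```latex
\operatorname{kurt}_x = \frac{\mu_4}{\mu_2^{2}},
\qquad
\mu_n = \int_{-\infty}^{\infty} x^{n}\,P(x)\,\mathrm{d}x,
```

where the moments are taken about the origin rather than the mean, which is why the text calls this a modified kurtosis.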
3.4. Kurtosis Ratio
Although we can measure the number of tonal components by kurtosis, it is worth mentioning that kurtosis itself is not sufficient to measure musical noise. This is because the kurtosis of some unprocessed signals, such as speech signals, is also high, but we do not perceive speech as musical noise. Since we aim to count only the musical-noise components, we should not consider genuine tonal components. To achieve this aim, we focus on the fact that musical noise is generated only by artificial signal processing. Hence, we should consider the change in kurtosis during signal processing. Consequently, we introduce the following kurtosis ratio [18] to measure the change in kurtosis:
where is the kurtosis of the processed signal and is the kurtosis of the input signal. A larger kurtosis ratio (≫ 1) indicates a marked increase in kurtosis as a result of processing, implying that a larger amount of musical noise is generated. On the other hand, a smaller kurtosis ratio (close to 1) implies that less musical noise is generated. It has been confirmed that this kurtosis ratio closely matches the amount of musical noise perceived in a subjective evaluation based on human hearing [18].
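In the same illustrative notation as above, the kurtosis ratio is simply

```latex
\text{kurtosis ratio} = \frac{\operatorname{kurt}(y_{\mathrm{proc}})}{\operatorname{kurt}(x_{\mathrm{input}})},
```

where kurt(·) is the raw-moment kurtosis defined above, and y_proc and x_input denote the processed and input signals.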
4. Kurtosis-Based Musical-Noise Analysis for Microphone Array Signal Processing and SS
4.1. Analysis Flow
In the following sections, we carry out an analysis on musical-noise generation in BF+SS and chSS+BF based on kurtosis. The analysis is composed of the following three parts.
(i) First, an analysis on musical-noise generation in BF+SS and chSS+BF based on kurtosis that does not take noise reduction performance into account is performed in this section.
(ii) The noise reduction performance is analyzed in Section 5, and we reveal that the noise reduction performances of BF+SS and chSS+BF are not equivalent. Moreover, a flooring parameter designed to align the noise reduction performances of BF+SS and chSS+BF is also derived to ensure the fair comparison of BF+SS and chSS+BF.
(iii) The kurtosis-based comparison between BF+SS and chSS+BF under the same noise reduction performance conditions is carried out in Section 6.
In the analysis in this section, we first clarify how kurtosis is affected by SS. Next, the same analysis is applied to DS. Finally, we analyze how kurtosis is increased by BF+SS and chSS+BF. Note that our analysis contains no limiting assumptions on the statistical characteristics of noise; thus, all noises including Gaussian and super-Gaussian noise can be considered.
4.2. Signal Model Used for Analysis
Musical-noise components generated from the noise-only period are dominant in spectrograms (see Figure 4); hence, we mainly focus our attention on musical-noise components originating from input noise signals.
Moreover, to evaluate the resultant kurtosis of SS, we introduce a gamma distribution to model the noise in the power domain [26–28]. The p.d.f. of the gamma distribution for random variable is defined as
where , and . Here, denotes the shape parameter, is the scale parameter, and is the gamma function. The gamma distribution with corresponds to the chi-square distribution with two degrees of freedom. Moreover, it is well known that the mean of for a gamma distribution is , where is the expectation operator. Furthermore, the kurtosis of a gamma distribution, , can be expressed as [18]
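As a consistency check, with the raw-moment kurtosis defined in Section 3 and writing α for the shape parameter and θ for the scale parameter (illustrative symbols), the n-th raw moment of the gamma distribution gives

```latex
\mu_n = \theta^{\,n}\,\frac{\Gamma(\alpha+n)}{\Gamma(\alpha)}
\quad\Longrightarrow\quad
\operatorname{kurt}_{\mathrm{GM}}
 = \frac{\mu_4}{\mu_2^{2}}
 = \frac{(\alpha+2)(\alpha+3)}{\alpha(\alpha+1)},
```

which equals 6 for α = 1, consistent with the chi-square distribution with two degrees of freedom mentioned below.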
Moreover, let us consider the power-domain noise signal, , in the frequency domain, which is defined as
where is the real part of the complex-valued signal and is its imaginary part, which are independent and identically distributed (i.i.d.) with each other, and the superscript expresses complex conjugation. Thus, the power-domain signal is the sum of two squares of random variables with the same distribution.
Hereinafter, let and be the signals after DFT analysis of signal in a specific microphone , , and we suppose that the statistical properties of are equal to those of and . Moreover, we assume the following: is i.i.d. in each channel, the p.d.f. of is symmetrical, and its mean is zero. These assumptions mean that the odd-order cumulants and moments are zero except for the first order.
Note that if is a Gaussian signal, its kurtosis in the power spectral domain is 6. This is because a Gaussian signal in the time domain obeys the chi-square distribution with two degrees of freedom in the power spectral domain, and for such a chi-square distribution the kurtosis defined above equals 6.
4.3. Resultant Kurtosis after SS
In this section, we analyze the kurtosis after SS. In traditional SS, the long-term-averaged power spectrum of a noise signal is utilized as the estimated noise power spectrum. Then, the estimated noise power spectrum multiplied by the oversubtraction parameter is subtracted from the observed power spectrum. When a gamma distribution is used to model the noise signal, its mean is . Thus, the amount of subtraction is . The subtraction of the estimated noise power spectrum in each frequency band can be considered as a shift of the p.d.f. to the zero-power direction (see Figure 5). As a result, negative-power components with nonzero probability arise. To avoid this, such negative components are replaced by observations that are multiplied by a small positive value (the so-called flooring technique). This means that the region corresponding to the probability of the negative components, which forms a section cut from the original gamma distribution, is compressed by the effect of the flooring. Finally, the floored components are superimposed on the laterally shifted p.d.f. (see Figure 5). Thus, the resultant p.d.f. after SS, , can be written as
where is the random variable of the p.d.f. after SS. The derivation of is described in Appendix A.
From (13), the kurtosis after SS can be expressed as
where
Here, is the upper incomplete gamma function defined as
and is the lower incomplete gamma function defined as
The detailed derivation of (14) is given in Appendix B. Although Uemura et al. have given an approximated form (lower bound) of the kurtosis after SS in [18], (14) involves no approximation throughout its derivation. Furthermore, (14) takes into account the effect of the flooring technique unlike [18].
Figure 6(a) depicts the theoretical kurtosis ratio after SS, , for various values of oversubtraction parameter and flooring parameter . In the figure, the kurtosis of the input signal is fixed to 6.0, which corresponds to a Gaussian signal. From this figure, it is confirmed that the kurtosis ratio is basically proportional to the oversubtraction parameter . However, kurtosis does not monotonically increase when the flooring parameter is nonzero. For instance, the kurtosis ratio is smaller than the peak value when and . This phenomenon can be explained as follows. For a large oversubtraction parameter, almost all the spectral components become negative due to the larger lateral shift of the p.d.f. by SS. Since flooring is applied to avoid such negative components, almost all the components are reconstructed by flooring. Therefore, the statistical characteristics of the signal never change except for its amplitude if . Generally, kurtosis does not depend on the change in amplitude; consequently, it can be considered that kurtosis does not markedly increase when a larger oversubtraction parameter and a larger flooring parameter are set.
The relation between the theoretical kurtosis ratio and the kurtosis of the original input signal is shown in Figure 6(b). In the figure, is fixed to 0.0. It is revealed that the kurtosis ratio after SS rapidly decreases as the input kurtosis increases, even with the same oversubtraction parameter . Therefore, the kurtosis ratio after SS, which is related to the amount of musical noise, strongly depends on the statistical characteristics of the input signal. That is to say, SS generates a larger amount of musical noise for a Gaussian input signal than for a super-Gaussian input signal. This fact has been reported in [18].
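The trends in Figure 6 can also be checked numerically without the closed-form expression. The following sketch applies power-domain SS (subtraction of β times the mean, with floored components replaced by η²-scaled observations) to gamma-distributed power samples and reports the kurtosis ratio; the sample size, parameter grids, and symbols are illustrative.

```python
# Monte-Carlo check of the kurtosis ratio after SS on gamma-distributed noise power.
import numpy as np

rng = np.random.default_rng(0)

def power_kurtosis(p):
    """Raw-moment kurtosis of power-domain samples: E[p^4] / E[p^2]^2."""
    return np.mean(p**4) / np.mean(p**2)**2

alpha = 1.0                                     # shape parameter; alpha = 1 gives kurtosis 6 (Gaussian case)
p_in = rng.gamma(alpha, 1.0, size=500_000)      # power-domain noise samples

for beta in (1.0, 2.0, 4.0, 8.0):
    for eta in (0.0, 0.2, 0.4):
        subtracted = p_in - beta * np.mean(p_in)                       # lateral shift of the p.d.f.
        p_out = np.where(subtracted > 0.0, subtracted, eta**2 * p_in)  # flooring
        print(f"beta={beta}, eta={eta}: "
              f"kurtosis ratio = {power_kurtosis(p_out) / power_kurtosis(p_in):.2f}")
```

With η = 0 the ratio keeps growing with β, whereas with a nonzero flooring parameter it peaks and then falls for large β, matching the explanation given above.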
4.4. Resultant Kurtosis after DS
In this section, we analyze the kurtosis after DS, and we reveal that DS can reduce the kurtosis of input signals. Since we assume that the statistical properties of or are the same as that of , the effect of DS on the change in kurtosis can be derived from the cumulants and moments of .
For cumulants, when and are independent random variables it is well known that the following relation holds:
where denotes the th-order cumulant. The cumulants of the random variable , , are defined by a cumulant-generating function, which is the logarithm of the moment-generating function. The cumulant-generating function is defined as
where is an auxiliary variable and is the moment-generating function. Thus, the th-order cumulant is represented by
where is the th-order derivative of .
Now we consider the DS beamformer, which is steered to and whose array weights are . Using (18), the resultant th-order cumulant after DS, , can be expressed by
where is the th-order cumulant of . Therefore, using (21) and the well-known mathematical relation between cumulants and moments, the power-spectral-domain kurtosis after DS, can be expressed by
The detailed derivation of (22) is described in Appendix C.
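As an illustration of how the additivity relation and the cumulants after DS combine in the simplest case, suppose the DS weights are all equal to 1/J for J microphones and the channel signals are i.i.d.; by the additivity of cumulants for independent variables and their n-th order homogeneity, one then has (a special case written with illustrative symbols)

```latex
\operatorname{cum}_n\!\Biggl(\frac{1}{J}\sum_{j=1}^{J} x_j\Biggr)
 = \sum_{j=1}^{J}\frac{1}{J^{\,n}}\operatorname{cum}_n(x_j)
 = J^{\,1-n}\operatorname{cum}_n(x).
```

Higher-order cumulants thus decay faster than the variance as J grows, which is why DS pushes the signal toward Gaussianity and reduces the power-spectral-domain kurtosis.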
Regarding the power-spectral components obtained from a gamma distribution, we illustrate the relation between input kurtosis and output kurtosis after DS in Figure 7. In the figure, solid lines indicate simulation results and broken lines show the theoretical relations given by (22). The simulation results are derived as follows. First, multichannel signals with various values of kurtosis are generated artificially from a gamma distribution. Next, DS is applied to the generated signals. Finally, the kurtosis after DS is estimated from the signal resulting from DS. From this figure, it is confirmed that the theoretical plots closely fit the simulation results. The relation between input and output kurtosis behaves as follows: (i) the output kurtosis is very close to a linear function of the input kurtosis, and (ii) the output kurtosis is almost inversely proportional to the number of microphones. These behaviors result in the following simplified (but useful) approximation with an explicit function form:
where is the input kurtosis. The approximated plots also match the simulation results in Figure 7.
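A numerical sketch of the simulation just described: spectra with gamma-distributed power and independent random phases are generated per channel, DS is applied with equal weights, and the power-domain kurtosis is estimated before and after. The shape parameters and channel counts below are illustrative.

```python
# Monte-Carlo check of kurtosis reduction by DS for uncorrelated channels.
import numpy as np

rng = np.random.default_rng(1)

def power_kurtosis(p):
    """Raw-moment kurtosis of power-domain samples: E[p^4] / E[p^2]^2."""
    return np.mean(p**4) / np.mean(p**2)**2

n_frames = 200_000
for alpha in (0.2, 0.5, 1.0):                           # smaller alpha -> more super-Gaussian power
    for num_mics in (1, 2, 4, 8):
        power = rng.gamma(alpha, 1.0, size=(num_mics, n_frames))
        phase = rng.uniform(0.0, 2 * np.pi, size=(num_mics, n_frames))
        spectra = np.sqrt(power) * np.exp(1j * phase)   # channel spectra with gamma-distributed power
        y = spectra.mean(axis=0)                        # DS with equal weights
        print(f"alpha={alpha}, J={num_mics}: "
              f"kurt_in={power_kurtosis(power[0]):.1f}, "
              f"kurt_out={power_kurtosis(np.abs(y)**2):.1f}")
```

The output kurtosis falls toward 6 (the Gaussian value) as the number of microphones grows, reproducing the uncorrelated-channel behavior summarized by the approximation above.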
When input signals involve interchannel correlation, the relation between input kurtosis and output kurtosis after DS approaches that for only one microphone. If all input signals are identical, that is, the signals are completely correlated, the output after DS also becomes the same as the input signal. In such a case, the effect of DS on the change in kurtosis corresponds to that for only one microphone. However, the interchannel correlation is not equal to one in all frequency subbands for a diffuse noise field, which is the noise field typically considered. It is well known that the interchannel correlation is strong in lower-frequency subbands and weak in higher-frequency subbands for a diffuse noise field [1]. Therefore, in lower-frequency subbands, it can be expected that DS does not significantly reduce the kurtosis of the signal.
As it is well known that the interchannel correlation for a diffuse noise field between two measurement locations can be expressed by the sinc function [1], we can state how array signal processing is affected by the interchannel correlation. However, we cannot know exactly how cumulants are changed by the interchannel correlation because (18) only holds when signals are mutually independent. Therefore, we cannot formulate how kurtosis is changed via DS for signals with interchannel correlation. For this reason, we experimentally investigate the effect of interchannel correlation in the following.
Figures 8 and 9 show preliminary simulation results of DS. In this simulation, SS is first applied to a multichannel Gaussian signal with interchannel correlation in the diffuse noise field. Next, DS is applied to the signal after SS. In the preliminary simulation, the interelement distance between microphones is 2.15 cm. From the results shown in Figures 8(a) and 9, we can confirm that the effect of DS on kurtosis is weak in lower-frequency subbands, although it should be noted that the effect does not completely disappear. Also, the theoretical kurtosis curve is in good agreement with the actual results in higher-frequency subbands (see Figures 8(b) and 9). This is because the interchannel correlation is weak in higher-frequency subbands. Consequently, for a diffuse noise field, DS can reduce the kurtosis of the input signal even if interchannel correlation exists.
If input noise signals contain no interchannel correlation, the distance between microphones does not affect the results. That is to say, the kurtosis change via DS can be well fit to (23). Otherwise, in lower-frequency subbands, it is expected that the mitigation effect of kurtosis by DS degrades with decreasing distance between microphones. This is because the interchannel correlation in lower-frequency subbands increases with decreasing distance between microphones. In higher-frequency subbands, the effect of the distance between microphones is thought to be small.
4.5. Resultant Kurtosis: BF+SS versus chSS+BF
In the previous subsections, we discussed the resultant kurtosis after SS and DS. In this subsection, we analyze the resultant kurtosis for two types of composite systems, that is, BF+SS and chSS+BF, and compare their effect on musical-noise generation. As described in Section 3, it is expected that a smaller increase in kurtosis leads to a smaller amount of musical noise generated.
In BF+SS, DS is first applied to a multichannel input signal. At this point, the resultant kurtosis in the power spectral domain, , can be represented by (23). Using (11), we can derive a shape parameter for the gamma distribution corresponding to , , as
The derivation of (24) is shown in Appendix D. Consequently, using (14) and (24), the resultant kurtosis after BF+SS, , can be written as
In chSS+BF, SS is first applied to each input channel. Thus, the output kurtosis after channelwise SS, , is given by
Finally, DS is performed and the resultant kurtosis after chSS+BF, , can be written as
where we use (23).
We should compare and here. However, one problem still remains: comparison under equivalent noise reduction performance; the noise reduction performances of BF+SS and chSS+BF are not equivalent as described in the next section. Moreover, the design of a flooring parameter so that the noise reduction performances of both methods become equivalent will be discussed in the next section. Therefore, and will be compared in Section 6 under equivalent noise reduction performance conditions.
5. Noise Reduction Performance Analysis
In the previous section, we did not discuss the noise reduction performances of BF+SS and chSS+BF. In this section, a mathematical analysis of the noise reduction performances of BF+SS and chSS+BF is given. As a result of this analysis, it is revealed that the noise reduction performances of BF+SS and chSS+BF are not equivalent even if the same parameters are set in the SS part. We then derive a flooring-parameter design strategy for aligning the noise reduction performances of BF+SS and chSS+BF.
5.1. Noise Reduction Performance of SS
We utilize the following index to measure the noise reduction performance (NRP):
where is the power-domain (noise) signal of the input and is the power-domain (noise) signal of the output after processing.
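Written out with illustrative symbols x_in and x_out for the power-domain noise before and after processing, the index described here is

```latex
\mathrm{NRP} = 10\log_{10}\frac{E\!\left[x_{\mathrm{in}}\right]}{E\!\left[x_{\mathrm{out}}\right]}\quad[\mathrm{dB}].
```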
First, we derive the average power of the input signal. We assume that the input signal in the power domain can be modeled by a gamma distribution. Then, the average power of the input signal is given as
Here, let , then . Thus,
This corresponds to the mean of a random variable with a gamma distribution.
Next, the average power of the signal after SS is calculated. Here, let obey the p.d.f. of the signal after SS, , defined by (13); then the average power of the signal after SS can be expressed as
We now consider the first term of the right-hand side in (31). We let , then . As a result,
Also, we deal with the second term of the right-hand side in (31). We let then , resulting in
Using (30), (32), and (33), the noise reduction performance of SS, , can be expressed by
Figure 10(a) shows the theoretical value of for various values of oversubtraction parameter and flooring parameter , where the kurtosis of the input signal is fixed to 6.0, corresponding to a Gaussian signal. From this figure, it is confirmed that is proportional to . However, hits a peak when is nonzero even for a large value of . The relation between the theoretical value of and the kurtosis of the input signal is illustrated in Figure 10(b). In this figure, is fixed to 0.0. It is revealed that decreases as the input kurtosis increases. This is because the mean of a high-kurtosis signal tends to be small. Since the shape parameter of a high-kurtosis signal becomes small, the mean corresponding to the amount of subtraction also becomes small. As a result, is decreased as the input kurtosis increases. That is to say, the strongly depends on the statistical characteristics of the input signal as well as the values of the oversubtraction and flooring parameters.
5.2. Noise Reduction Performance of DS
It is well known that the noise reduction performance of DS () is proportional to the number of microphones. In particular, for spatially uncorrelated multichannel signals, is given as [1]
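For equal DS weights 1/J and spatially uncorrelated noise of equal power in each of the J channels, the output noise power is 1/J of the input noise power, so the expression reduces to (written with illustrative symbols)

```latex
\mathrm{NRP}_{\mathrm{DS}} = 10\log_{10} J\quad[\mathrm{dB}].
```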
5.3. Resultant Noise Reduction Performance: BF+SS versus chSS+BF
In the previous subsections, the noise reduction performances of SS and DS were discussed. In this subsection, we derive the resultant noise reduction performances of the composite systems of SS and DS, that is, BF+SS and chSS+BF.
The noise reduction performance of BF+SS is analyzed as follows. In BF+SS, DS is first applied to a multichannel input signal. If this input signal is spatially uncorrelated, its noise reduction performance can be represented by . After DS, SS is applied to the signal. Note that DS affects the kurtosis of the input signal. As described in Section 4.4, the resultant kurtosis after DS can be approximated as . Thus, SS is applied to the kurtosis-modified signal. Consequently, using (24), (34), and (35), the noise reduction performance of BF+SS, , is given as
where is defined by (24).
In chSS+BF, SS is first applied to a multichannel input signal; then DS is applied to the resulting signal. Thus, using (34) and (35), the noise reduction performance of chSS+BF, , can be represented by
Figure 11 depicts the values of and . From this result, we can see that the noise reduction performances of both methods are equivalent when the input signal is Gaussian. However, if the input signal is super-Gaussian, exceeds . This is due to the fact that DS is first applied to the input signal in BF+SS; thus, DS reduces the kurtosis of the signal. Since for a low-kurtosis signal is greater than that for a high-kurtosis signal (see Figure 10(b)), the noise reduction performance of BF+SS is superior to that of chSS+BF.
This discussion implies that and are not equivalent under some conditions. Thus the kurtosis-based analysis described in Section 4 is biased and requires some adjustment. In the following subsection, we will discuss how to align the noise reduction performances of BF+SS and chSS+BF.
5.4. Flooring-Parameter Design in BF+SS for Equivalent Noise Reduction Performance
In this section, we describe the flooring-parameter design in BF+SS so that and become equivalent.
Using (36) and (37), the flooring parameter that makes equal to , is
where
The detailed derivation of (38) is given in Appendix E. By replacing in (3) with this new flooring parameter , we can align and to ensure a fair comparison.
6. Output Kurtosis Comparison under Equivalent NRP Condition
In this section, using the new flooring parameter for BF+SS, , we compare the output kurtosis of BF+SS and chSS+BF.
Setting to (25), the output kurtosis of BF+SS is modified to
Here, we adopt the following index to compare the resultant kurtosis after BF+SS and chSS+BF:
where expresses the resultant kurtosis ratio between BF+SS and chSS+BF. Note that a positive indicates that chSS+BF reduces the kurtosis more than BF+SS, implying that less musical noise is generated in chSS+BF. The behavior of is depicted in Figures 12 and 13. Figure 12 illustrates theoretical values of for various values of input kurtosis. In this figure, is fixed to 2.0 and the flooring parameter in chSS+BF is set to , , , and . The flooring parameter for BF+SS is automatically determined by (38). From this figure, we can confirm that chSS+BF reduces the kurtosis more than BF+SS for almost all input signals with various values of input kurtosis. Theoretical values of for various oversubtraction parameters are depicted in Figure 13. Figure 13(a) shows that the output kurtosis after chSS+BF is always less than that after BF+SS for a Gaussian signal, even if is nonzero. On the other hand, Figure 13(b) implies that the output kurtosis after BF+SS becomes less than that after chSS+BF for some parameter settings. However, this only occurs for a large oversubtraction parameter, for example, , which is not often applied in practical use. Therefore, it can be considered that chSS+BF reduces the kurtosis and musical noise more than BF+SS in almost all cases.
7. Experiments and Results
7.1. Computer Simulations
First, we compare BF+SS and chSS+BF in terms of kurtosis ratio and noise reduction performance. We use 16-kHz-sampled signals as test data, in which the target speech is the original speech convolved with impulse responses recorded in a room with 200-millisecond reverberation (see Figure 14), to which an artificially generated, spatially uncorrelated white Gaussian or super-Gaussian signal is added. We use six speakers (six sentences) as sources of the original clean speech. The number of microphone elements in the simulation is varied from 2 to 16, and the interelement distance is 2.15 cm. The oversubtraction parameter is set to 2.0 and the flooring parameter for BF+SS, , is set to 0.0, 0.2, 0.4, or 0.8. Note that the flooring parameter in chSS+BF is set to 0.0. In the simulation, we assume that the long-term-averaged power spectrum of the noise is estimated perfectly in advance.
Here, we utilize the kurtosis ratio defined in Section 3.4 to measure the difference in kurtosis, which is related to the amount of musical noise generated. The kurtosis ratio is given by
where is the power spectrum of the residual noise signal after processing and is the power spectrum of the original noise signal before processing. This kurtosis ratio indicates the extent to which the kurtosis is increased by processing. Thus, a smaller kurtosis ratio is desirable. Moreover, the noise reduction performance is measured using (28).
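A minimal sketch of this evaluation metric, computing the ratio of raw-moment kurtoses from the residual-noise and original-noise power spectrograms; the array names are illustrative.

```python
# Kurtosis-ratio metric computed from noise-only power spectrograms.
import numpy as np

def power_kurtosis(p):
    """Raw-moment kurtosis of power-spectral values: E[p^4] / E[p^2]^2."""
    p = np.asarray(p, dtype=float).ravel()
    return np.mean(p**4) / np.mean(p**2)**2

def kurtosis_ratio(residual_noise_power, original_noise_power):
    """Values well above 1 indicate that processing increased kurtosis, i.e., generated musical noise."""
    return power_kurtosis(residual_noise_power) / power_kurtosis(original_noise_power)
```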
Figures 15–17 show the simulation results for a Gaussian input signal. From Figure 15(a), we can see that the kurtosis ratio of chSS+BF decreases almost monotonically with increasing number of microphones. On the other hand, the kurtosis ratio of BF+SS does not exhibit such a tendency regardless of the flooring parameter. Also, the kurtosis ratio of chSS+BF is lower than that of BF+SS for all cases except for . Moreover, we can confirm from Figure 15(b) that the values of noise reduction performance for BF+SS with flooring parameter and chSS+BF are almost the same. When the flooring parameter for BF+SS is nonzero, the kurtosis ratio of BF+SS becomes smaller but the noise reduction performance degrades. On the other hand, for Gaussian signals, chSS+BF can reduce the kurtosis ratio, that is, reduce the amount of musical noise generated, without degrading the noise reduction performance. Indeed BF+SS with reduces the kurtosis ratio more than chSS+BF, but the noise reduction performance of BF+SS is extremely degraded. Furthermore, we can confirm from Figures 16 and 17 that the theoretical kurtosis ratio and noise reduction performance closely fit the experimental results. These findings also support the validity of the analysis in Sections 4, 5, and 6.
Figures 18–20 illustrate the simulation results for a super-Gaussian input signal. It is confirmed from Figure 18(a) that the kurtosis ratio of chSS+BF also decreases monotonically with increasing number of microphones. Unlike the case of the Gaussian input signal, the kurtosis ratio of BF+SS with also decreases with increasing number of microphones. However, for a lower value of the flooring parameter, the kurtosis ratio of BF+SS is not degraded. Moreover, the kurtosis ratio of chSS+BF is lower than that of BF+SS for almost all cases. For the super-Gaussian input signal, in contrast to the case of the Gaussian input signal, the noise reduction performance of BF+SS with is greater than that of chSS+BF (see Figure 18(b)). That is to say, the noise reduction performance of BF+SS is superior to that of chSS+BF for the same flooring parameter. This result is consistent with the analysis in Section 5. The noise reduction performance of BF+SS with is comparable to that of chSS+BF. However, the kurtosis ratio of chSS+BF is still lower than that of BF+SS with . This result also coincides with the analysis in Section 6. On the other hand, the kurtosis ratio of BF+SS with is almost the same as that of chSS+BF. However, the noise reduction performance of BF+SS with is lower than that of chSS+BF. Thus, it can be confirmed that chSS+BF reduces the kurtosis ratio more than BF+SS for a super-Gaussian signal under the same noise reduction performance. Furthermore, the theoretical kurtosis ratio and noise reduction performance closely fit the experimental results in Figures 19 and 20.
We also compare speech distortion originating from chSS+BF and BF+SS on the basis of cepstral distortion (CD) [29] for the four-microphone case. The comparison is made under the condition that the noise reduction performances of both methods are almost the same. For the Gaussian input signal, the same parameters and are utilized for BF+SS and chSS+BF. On the other hand, and are utilized for BF+SS and and are utilized for chSS+BF for the super-Gaussian input signal. Table 1 shows the result of the comparison, from which we can see that the amount of speech distortion originating from BF+SS and chSS+BF is almost the same for the Gaussian input signal. For the super-Gaussian input signal, the speech distortion originating from BF+SS is less than that from chSS+BF. This is owing to the difference in the flooring parameter for each method.
In conclusion, all of these results are strong evidence for the validity of the analysis in Sections 4, 5, and 6. These results suggest the following.
(i) Although BF+SS can reduce the amount of musical noise by employing a larger flooring parameter, it leads to a deterioration of the noise reduction performance.
(ii) In contrast, chSS+BF can reduce the kurtosis ratio, which corresponds to the amount of musical noise generated, without degradation of the noise reduction performance.
(iii) Under the same level of noise reduction performance, the amount of musical noise generated via chSS+BF is less than that generated via BF+SS.
(iv) Thus, the chSS+BF structure is preferable from the viewpoint of musical-noise generation.
(v) However, the noise reduction performance of BF+SS is superior to that of chSS+BF for a super-Gaussian signal when the same parameters are set in the SS part for both methods.
(vi) These results imply a trade-off between the amount of musical noise generated and the noise reduction performance. Thus, we should use an appropriate structure depending on the application.
These results should be applicable under different SNR conditions because our analysis is independent of the noise level. In the case of more reverberation, the observed signal tends to become Gaussian because many reverberant components are mixed. Therefore, the behavior of both methods under more reverberant conditions should be similar to that in the case of a Gaussian signal.
7.2. Subjective Evaluation
Next, we conduct a subjective evaluation to confirm that chSS+BF can mitigate musical noise. In the evaluation, we presented two signals, one processed by BF+SS and one by chSS+BF, to seven male examinees in random order, who were asked to select the signal they considered to contain less musical noise (the so-called AB method). Moreover, we instructed the examinees to evaluate only the musical noise and not to consider the amplitude of the remaining noise. Here, the flooring parameter in BF+SS was automatically determined so that the output SNRs of BF+SS and chSS+BF were equivalent. We used the preference score, that is, the frequency with which each method's output was selected, as the index of the evaluation.
In the experiment, three types of noise, (a) artificial spatially uncorrelated white Gaussian noise, (b) recorded railway-station noise emitted from 36 loudspeakers, and (c) recorded human speech emitted from 36 loudspeakers, were used. Note that noises (b) and (c) were recorded in the actual room shown in Figure 14 and therefore include interchannel correlation because they were recordings of actual noise signals.
Each test sample is a 16-kHz-sampled signal in which the target speech is the original speech convolved with impulse responses recorded in a room with 200-millisecond reverberation (see Figure 14), to which the above-mentioned recorded noise signal is added. Ten pairs of signals per type of noise, that is, a total of 30 pairs of processed signals, were presented to each examinee.
Figure 21 shows the subjective evaluation results, which confirm that the output of chSS+BF is preferred to that of BF+SS, even for actual acoustic noises including non-Gaussianity and interchannel correlation properties.
8. Conclusion
In this paper, we analyze two methods of integrating microphone array signal processing and SS, that is, BF+SS and chSS+BF, on the basis of HOS. As a result of the analysis, it is revealed that the amount of musical noise generated via SS strongly depends on the statistical characteristics of the input signal. Moreover, it is also clarified that the noise reduction performances of BF+SS and chSS+BF are different except in the case of a Gaussian input signal. As a result of our analysis under equivalent noise reduction performance conditions, it is shown that chSS+BF reduces musical noise more than BF+SS in almost all practical cases. The results of a computer simulation also support the validity of our analysis. Moreover, by carrying out a subjective evaluation, it is confirmed that the output of chSS+BF is considered to contain less musical noise than that of BF+SS. These analytic and experimental results imply the considerable potential of optimization based on HOS to reduce musical noise.
As a future work, it remains necessary to carry out signal analysis based on more general distributions. For instance, analysis using a generalized gamma distribution [26, 27] can lead to more general results. Moreover, an exact formulation of how kurtosis is changed through DS under a coherent condition is still an open problem. Furthermore, the robustness of BF+SS and chSS+BF against low-SNR or more reverberant conditions is not discussed in this paper. In the future, the discussion should involve not only noise reduction performance and musical-noise generation but also such robustness.
Appendices
A. Derivation of (13)
When we assume that the input signal of the power domain can be modeled by a gamma distribution, the amount of subtraction is . The subtraction of the estimated noise power spectrum in each frequency subband can be considered as a lateral shift of the p.d.f. to the zero-power direction (see Figure 5). As a result of this subtraction, the random variable is replaced with and the gamma distribution becomes
Since the domain of the original gamma distribution is , the domain of the resultant p.d.f. is . Thus, negative-power components with nonzero probability arise, which can be represented by
where is part of . To remove the negative-power components, the signals corresponding to are replaced by observations multiplied by a small positive value . The observations corresponding to (A.2), , are given by
Since a small positive flooring parameter is applied to (A.3), the scale parameter becomes and the range is changed from to . Then, (A.3) is modified to
where is the probability of the floored components. This is superimposed on the p.d.f. given by (A.1) within the range . By considering the positive range of (A.1) and , the resultant p.d.f. of SS can be formulated as
where the variable is replaced with for convenience.
B. Derivation of (14)
To derive the kurtosis after SS, the 2nd- and 4th-order moments of are required. For , the 2nd-order moment is given by
We now expand the first term of the right-hand side of (B.1). Here, let ; then and . Consequently,
Next we consider the second term of the right-hand side of (B.1). Here, let ; then . Thus,
As a result, the 2nd-order moment after SS, , is a composite of (B.2) and (B.3) and is given as
In the same manner, the 4th-order moment after SS, , can be represented by
Consequently, using (B.4) and (B.5), the kurtosis after SS is given as
where
C. Derivation of (22)
As described in (12), the power-domain signal is the sum of two squares of random variables with the same distribution. Using (18), the power-domain cumulants can be written as
where is the th square-domain moment. Here, the p.d.f. of such a square-domain signal is not symmetrical and its mean is not zero. Thus, we utilize the following relations between the moments and cumulants around the origin:
where is the th-order raw moment and is the th-order cumulant. Moreover, the square-domain moments can be expressed by
Using (C.1)–(C.3), the power-domain moments can be expressed in terms of the 4th- and 8th-order moments in the time domain. Therefore, to obtain the kurtosis after DS in the power domain, the moments and cumulants after DS up to the 8th order are needed.
The 3rd-, 5th-, and 7th-order cumulants are zero because we assume that the p.d.f. of is symmetrical and that its mean is zero. If these conditions are satisfied, the following relations between moments and cumulants hold:
Using (21) and (C.4), the time-domain moments after DS are expressed as
where is the th-order raw moment after DS in the time domain.
Using (C.2), (C.3), and (C.5), the square-domain cumulants can be written as
where is the th-order cumulant in the square domain.
Moreover, using (C.1), (C.2), and (C.6), the 2nd- and 4th-order power-domain moments can be written as
As a result, the power-domain kurtosis after DS, , is given as
D. Derivation of (24)
According to (11), the shape parameter corresponding to the kurtosis after DS, , is given by the solution of the quadratic equation:
This can be expanded as
Using the quadratic formula,
whose denominator is larger than zero because . Here, since , we must select the appropriate numerator of (D.3). First, suppose that
This inequality clearly holds when because and . Thus,
When , the following relation also holds:
Since (D.6) is true when , (D.4) holds. In summary, (D.4) always holds for and . Thus,
Overall,
On the other hand, let
This inequality is not satisfied when because and . Now (D.9) can be modified as
then the following relation also holds for :
This is not true for . Thus, (D.9) is not appropriate for . Therefore, corresponding to is given by
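For reference, combining the gamma-kurtosis expression from (11) with a target kurtosis value K after DS, the quadratic and the admissible root take the following form (a reconstruction with illustrative symbols; K denotes the kurtosis after DS and α the shape parameter):

```latex
\frac{(\alpha+2)(\alpha+3)}{\alpha(\alpha+1)} = K
\;\Longleftrightarrow\;
(K-1)\,\alpha^{2} + (K-5)\,\alpha - 6 = 0,
\qquad
\alpha = \frac{-(K-5)+\sqrt{(K-5)^{2}+24(K-1)}}{2(K-1)},
```

whose denominator is positive for K > 1 and which yields α = 1 when K = 6, the Gaussian case; the other root of the quadratic is negative and is therefore rejected, in line with the argument above.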
E. Derivation of (38)
For , which corresponds to a Gaussian or super-Gaussian input signal, it is revealed that the noise reduction performance of BF+SS is superior to that of chSS+BF from the numerical simulation in Section 5.3. Thus, the following relation holds:
This inequality corresponds to
Then, the new flooring parameter in BF+SS, which makes the noise reduction performance of BF+SS equal to that of chSS+BF, satisfies because
Moreover, the following relation for also holds:
This can be rewritten as
and consequently
where is defined by (39) and is given by (40). Using (E.3) and (E.4), the right-hand side of (E.5) is clearly greater than or equal to zero. Moreover, since , , , and , the right-hand side of (E.6) is also greater than or equal to zero. Therefore,
References
Brandstein M, Ward D (Eds): Microphone Arrays: Signal Processing Techniques and Applications. Springer, Berlin, Germany; 2001.
Flanagan JL, Johnston JD, Zahn R, Elko GW: Computer-steered microphone arrays for sound transduction in large rooms. Journal of the Acoustical Society of America 1985, 78(5):1508-1518. 10.1121/1.392786
Omologo M, Matassoni M, Svaizer P, Giuliani D: Microphone array based speech recognition with different talker-array positions. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), September 1997, Munich, Germany 227-230.
Silverman HF, Patterson WR: Visualizing the performance of large-aperture microphone arrays. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '99), 1999 962-972.
Frost O: An algorithm for linearly constrained adaptive array processing. Proceedings of the IEEE 1972, 60: 926-935.
Griffiths LJ, Jim CW: An alternative approach to linearly constrained adaptive beamforming. IEEE Transactions on Antennas and Propagation 1982, 30(1):27-34. 10.1109/TAP.1982.1142739
Kaneda Y, Ohga J: Adaptive microphone-array system for noise reduction. IEEE Transactions on Acoustics, Speech, and Signal Processing 1986, 34(6):1391-1400. 10.1109/TASSP.1986.1164975
Boll S: Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech and Signal Processing 1979, 27(2):113-120. 10.1109/TASSP.1979.1163209
Meyer J, Simmer K: Multi-channel speech enhancement in a car environment using Wiener filtering and spectral subtraction. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), 1997 1167-1170.
Fischer S, Kammeyer KD: Broadband beamforming with adaptive post filtering for speech acquisition in noisy environment. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), 1997 359-362.
Mukai R, Araki S, Sawada H, Makino S: Removal of residual cross-talk components in blind source separation using time-delayed spectral subtraction. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), May 2002, Orlando, Fla, USA 1789-1792.
Cho J, Krishnamurthy A: Speech enhancement using microphone array in moving vehicle environment. Proceedings of the IEEE Intelligent Vehicles Symposium, April 2003, Graz, Austria 366-371.
Ohashi Y, Nishikawa T, Saruwatari H, Lee A, Shikano K: Noise robust speech recognition based on spatial subtraction array. Proceedings of the International Workshop on Nonlinear Signal and Image Processing, 2005 324-327.
Even J, Saruwatari H, Shikano K: New architecture combining blind signal extraction and modified spectral subtraction for suppression of background noise. Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC '08), 2008, Seattle, Wash, USA
Takahashi Y, Takatani T, Osako K, Saruwatari H, Shikano K: Blind spatial subtraction array for speech enhancement in noisy environment. IEEE Transactions on Audio, Speech and Language Processing 2009, 17(4):650-664.
Jebara SB: A perceptual approach to reduce musical noise phenomenon with Wiener denoising technique. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '06), 2006 3: 49-52.
Ephraim Y, Malah D: Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 1984, 32(6):1109-1121. 10.1109/TASSP.1984.1164453
Uemura Y, Takahashi Y, Saruwatari H, Shikano K, Kondo K: Automatic optimization scheme of spectral subtraction based on musical noise assessment via higher-order statistics. Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC '08), 2008, Seattle, Wash, USA
Uemura Y, Takahashi Y, Saruwatari H, Shikano K, Kondo K: Musical noise generation analysis for noise reduction methods based on spectral subtraction and MMSE STSA estimation. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '09), 2009 4433-4436.
Takahashi Y, Uemura Y, Saruwatari H, Shikano K, Kondo K: Musical noise analysis based on higher order statistics for microphone array and nonlinear signal processing. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '09), 2009 229-232.
Comon P: Independent component analysis, a new concept? Signal Processing 1994, 36: 287-314. 10.1016/0165-1684(94)90029-9
Saruwatari H, Kurita S, Takeda K, Itakura F, Nishikawa T, Shikano K: Blind source separation combining independent component analysis and beamforming. EURASIP Journal on Applied Signal Processing 2003, 2003(11):1135-1146. 10.1155/S1110865703305104
Mizumachi M, Akagi M: Noise reduction by paired-microphone using spectral subtraction. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '98), 1998 2: 1001-1004.
Takatani T, Nishikawa T, Saruwatari H, Shikano K: High-fidelity blind separation of acoustic signals using SIMO-model-based independent component analysis. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 2004, E87-A(8):2063-2072.
Ikeda S, Murata N: A method of ICA in the frequency domain. Proceedings of the International Workshop on Independent Component Analysis and Blind Signal Separation, 1999 365-371.
Stacy EW: A generalization of the gamma distribution. The Annals of Mathematical Statistics 1962, 1187-1192.
Kokkinakis K, Nandi AK: Generalized gamma density-based score functions for fast and flexible ICA. Signal Processing 2007, 87(5):1156-1162. 10.1016/j.sigpro.2006.09.012
Shin JW, Chang J-H, Kim NS: Statistical modeling of speech signals based on generalized gamma distribution. IEEE Signal Processing Letters 2005, 12(3):258-261.
Rabiner L, Juang B: Fundamentals of Speech Recognition. Prentice-Hall PTR; 1993.
Acknowledgment
This work was partly supported by MIC Strategic Information and Communications R&D Promotion Programme in Japan.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Cite this article
Takahashi, Y., Saruwatari, H., Shikano, K. et al. Musical-Noise Analysis in Methods of Integrating Microphone Array and Spectral Subtraction Based on Higher-Order Statistics. EURASIP J. Adv. Signal Process. 2010, 431347 (2010). https://doi.org/10.1155/2010/431347