Rate-constrained source separation for speech enhancement in wireless-communicated binaural hearing aids
- David Ayllón^{1},
- Roberto Gil-Pita^{1} and
- Manuel Rosa-Zurera^{1}
https://doi.org/10.1186/1687-6180-2013-187
© Ayllón et al.; licensee Springer. 2013
Received: 29 June 2013
Accepted: 5 December 2013
Published: 20 December 2013
Abstract
A recent trend in hearing aids is the connection of the left and right devices to collaborate between them. Binaural systems can provide natural binaural hearing and support the improvement of speech intelligibility in noise, but they require data transmission between both devices, which increases the power consumption. This paper presents a novel sound source separation algorithm for binaural speech enhancement based on supervised machine learning and time-frequency masking. The system is designed considering the power restrictions in hearing aids, constraining both the computational cost of the algorithm and the transmission bit rate. The transmission schema is optimized using a tailored evolutionary algorithm that assigns a different number of bits to each frequency band. The proposed algorithm requires less than 10% of the available computational resources for signal processing and obtains good separation performance using bit rates lower than 64 kbps.
1 Introduction
Most people with impaired hearing who wear hearing aids experience reduced speech intelligibility in noisy environments. Modern devices include mechanisms to increase the hearing comfort of the user, including advanced features such as acoustic feedback cancellation [1, 2], automatic environment classification [3, 4], and speech enhancement [5, 6]. One of the most challenging problems in the design of hearing aids is the reduction of undesired noise and interference signals to increase speech intelligibility without introducing audible distortions in the target speech. The implementation of signal processing algorithms in hearing aids presents additional challenges: the limited battery life, which restricts the computational capability of the device; the requirement of real-time processing, which limits the processing delay to a few milliseconds and reduces the number of frequency bands used for the analysis; and the small size of the device, which limits the number of microphones that can be mounted.
A common approach to remove undesired sound sources is to provide the device with directivity, assuming that the undesired sources and the target source are spatially separated. Directional microphones have been widely used in hearing aids for over 25 years and have proved to significantly increase speech intelligibility in various noisy environments [7]. However, they are usually not applicable to small ear canal devices because of their size, their higher internal noise compared to omnidirectional microphones, and their fixed directivity pattern, which does not allow adapting the directivity to changing acoustic environments [8]. In recent years, microphone arrays composed of omnidirectional microphones have drawn the attention of hearing aid designers [9, 10]. The use of multiple channels allows implementing speech enhancement algorithms based on spatial filtering (beamforming) and source separation. Both fixed and adaptive beamforming techniques have been successfully implemented in modern hearing aids [11–14], due to their reduced complexity in comparison to traditional multichannel source separation algorithms based on ICA or clustering. Originating in computational auditory scene analysis (CASA) [15], the time-frequency masking approach for source separation is a potential solution for speech enhancement in hearing aids [16], as long as the estimation of the time-frequency mask involves low computational complexity. The ideal binary mask (IBM) is defined in [17] as the one that takes values of zero or one by comparing the local signal-to-noise ratio (SNR) in each time-frequency bin against a threshold, which is typically chosen to be 0 dB. Several studies [18–20] have demonstrated that the application of the IBM to separate speech in noisy conditions entails an improvement in speech intelligibility.
Unfortunately, the computation of the IBM needs to have access to the target speech source and noise signals, information that is not available in practice. Hence, the IBM should be estimated somehow from the corrupted signal, obtaining a binary mask that is just an approximation of the IBM.
Many hearing-impaired people have bilateral hearing loss and need to wear two devices. When hearing aids are worn at both ears, these devices usually operate independently. However, there is a new trend of binaural hearing aids that connect both devices in order to exchange information between them. Binaural hearing provides considerable benefits over using a single ear, due to the ability to preserve spatial cues, which are necessary to localize and separate sounds. Unfortunately, for aesthetic reasons, the communication between both hearing devices has to be implemented with a wireless link, which unavoidably increases the power consumption and, consequently, reduces the battery life. This fact opens a new problem: how to reduce the bit rate transmitted between both devices without decreasing the performance of the speech enhancement algorithm.
In recent years, several works have proposed microphone array-based binaural spatial filtering techniques, using both fixed beamformers [21, 22] and adaptive beamformers [23, 24]. The work in [25] analyzes the robustness of binaural fixed and adaptive beamformers by means of objective perceptual quality measures. In [26], three different strategies are proposed: two of them are based on the estimation of the short-time spectral amplitude of the original signal, and the third one is based on spatial filtering. These solutions have demonstrated their ability to reduce noise and to improve speech quality. However, they assume that the original signals received at the right and left devices are available at both sides, which involves a high-bandwidth communication. In practice, the signals are not completely transmitted, and the transmission rate (and the power consumption) depends on the amount of exchanged information. This problem is approached in [27], which evaluates the array gain provided by collaborating hearing aids as a function of the communication rate, using an information theoretic approach. In [28], the authors evaluate the loss of noise reduction suffered by a binaural multichannel Wiener filter when the bandwidth of the transmission link is reduced. The work in [29] proposes two approaches to reduce data transmission: the first approach is to transmit only an estimation of the undesired signal at a determined bit rate and the second approach is to transmit the complete received signal at the determined bit rate. Unfortunately, the performance of the algorithms in [27, 29] is notably reduced when the transmission rate decreases (e.g. lower than 16 kbps). An additional problem associated with the use of beamforming techniques for wireless-communicated binaural hearing is the following. The output of the beamformer is obtained by combining a weighted version of the input channels from both devices.
If one or several speech signals have been quantized and transmitted to the other device, the beamforming output is directly affected by quantization noise.
The goal of this paper is the design of an energy-efficient speech enhancement algorithm with low computational cost for wireless-communicated binaural hearing aids. The binaural speech enhancement problem is approached from a different perspective, using time-frequency masking rather than spatial filtering. In this context, there are two problems to solve. First, a low-cost speech enhancement algorithm that uses binaural information is designed. The proposed algorithm estimates the IBM, which has been proven to correlate with intelligibility [18–20], using a generalized version of the least-squares linear discriminant analysis (LS-LDA) [30]. The classifier uses a set of features extracted from the short-time Fourier transform (STFT) of the signals received at both ears, assuming that all information has been exchanged between both devices. The second problem to solve is the reduction of the amount of information exchanged between both devices while minimizing the effect on the performance of the speech enhancement algorithm. The signal of one of the devices is quantized before being transmitted to the other device, which calculates the binary mask. The quantization of each frequency band can be performed with a different number of bits. An optimization algorithm based on evolutionary computation is proposed to distribute a limited number of bits among the different frequency bands. The algorithm is allowed to assign 0 bits to a band, which avoids transmitting unnecessary information. In the proposed schema, the transmitted signals are only used to estimate the mask, and quantization noise does not directly affect the quality of the output speech signal, although it may affect the mask estimation.
Our previous work in [31] addressed the same problem described in this paper, designing a low-cost speech separation system based on the computation of the time-frequency binary mask that maximizes the W-disjoint orthogonality (WDO) factor and increases the energy efficiency of the wireless-communicated binaural hearing aids. There are three main differences between that work and the one described in this paper. First, the goal of the current design is to obtain a system that estimates the IBM rather than one that maximizes the WDO. Second, unlike the previous work, which only considered the time and level differences between both ears as input features, this work proposes and studies different combinations of features, with the novelty that they are calculated not only from the current time-frequency point but also from the neighboring time-frequency points. Third, the algorithm proposed in this paper optimizes the weights of the classifier and the bit distribution at the same time, and for all frequencies at once.
2 Time-frequency masking source separation
Speech signals are sparse in the time-frequency domain, that is, most of the sample values of a signal are zero or close to zero in this domain. This property is very useful for speech source separation due to the fact that the probability of two or more sources being simultaneously active is low in a sparse representation. Two signals are considered to be W-disjoint orthogonal (WDO) if their STFT representations do not overlap [32]. If this property is strictly met, the original signals can be perfectly demixed by identifying the active time-frequency regions of each source, which leads to a time-frequency binary mask. Usually, speech signals only show an approximate WDO behavior in the sense that the probability of two sources having high energy in the same time-frequency point is low [33]. This fact allows separating sources by time-frequency masking with a good performance.
2.1 The ideal binary mask (IBM)
In the IBM of expression (1), each time-frequency bin is associated with the source that has more energy than its interfering sources. It has been demonstrated in [18–20] that the application of the IBM to separate speech from noise entails an improvement in the intelligibility of the target speech signal. Unfortunately, the clean and noise signals are not available in practice, and the IBM must be estimated from the mixtures, which decreases the performance. The study in [19] evaluates the impact of the IBM estimation errors on the intelligibility of the separated signals.
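As an illustration of the definition above, the following minimal sketch computes a binary mask from known target and interference powers by thresholding the local SNR. The function name, the nested-list layout (one inner list per frame), and the handling of ties and empty bins are assumptions made for this example and are not taken from the paper.

```python
import math


def ideal_binary_mask(target_power, noise_power, threshold_db=0.0):
    """Return a 0/1 mask that is 1 where the local SNR exceeds threshold_db.

    target_power and noise_power are nested lists of non-negative powers,
    indexed as [frame][band]. Ties (SNR equal to the threshold) map to 0,
    which is an assumption of this sketch.
    """
    mask = []
    for s_row, n_row in zip(target_power, noise_power):
        row = []
        for s, n in zip(s_row, n_row):
            if s == 0.0:
                row.append(0)          # no target energy in this bin
            elif n == 0.0:
                row.append(1)          # target only, SNR is unbounded
            else:
                snr_db = 10.0 * math.log10(s / n)
                row.append(1 if snr_db > threshold_db else 0)
        mask.append(row)
    return mask


# Toy powers for two frames and two bands (hypothetical values).
target = [[4.0, 0.1], [1.0, 0.0]]
noise = [[1.0, 2.0], [1.0, 0.5]]
ibm = ideal_binary_mask(target, noise)
```

With the default 0 dB threshold, a bin is kept only when the target carries strictly more power than the interference.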
2.2 The W-disjoint orthogonality (WDO) quality factor
Here, M(k, m) is the time-frequency mask computed for the separation of the target source S(k, m). If the sources were strictly WDO, the IBM defined in (1) would preserve all the energy of the desired signal, obtaining the maximum value PSR = 1.
WDO sources perfectly separated with the IBM defined in (1) have a value of WDO = 1, which is the maximum value. However, this value is only achievable by perfectly WDO sources, and it decreases (i.e. WDO ≤ 1) with approximately WDO sources: a small part of the source signals overlaps, so the mask can neither preserve all the energy of the desired signal nor reject all the energy of the interfering signals. Therefore, the WDO factor is a good indicator of the quality of the separation achieved by a time-frequency mask for approximately WDO sources.
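The quality factors discussed in this section can be sketched in a few lines. The expressions used below are the usual ones from the W-disjoint orthogonality literature (preserved-signal ratio, signal-to-interference ratio, and WDO = PSR − PSR/SIR); it is an assumption of this sketch that they match the expressions of the paper, which are not reproduced in this text. Function and variable names are illustrative.

```python
def wdo_metrics(mask, target_power, interference_power):
    """Return (PSR, SIR, WDO) for a mask applied to known source powers.

    All arguments are nested lists indexed as [frame][band].
    PSR = masked target energy / total target energy
    SIR = masked target energy / masked interference energy
    WDO = (masked target - masked interference) / total target
    """
    kept_target = 0.0
    kept_interference = 0.0
    total_target = 0.0
    for m_row, s_row, y_row in zip(mask, target_power, interference_power):
        for m, s, y in zip(m_row, s_row, y_row):
            kept_target += m * s
            kept_interference += m * y
            total_target += s
    psr = kept_target / total_target
    sir = kept_target / kept_interference if kept_interference > 0 else float("inf")
    wdo = (kept_target - kept_interference) / total_target
    return psr, sir, wdo


# Strictly disjoint toy sources: the mask keeps all target energy and no interference.
disjoint = wdo_metrics([[1, 0], [1, 0]], [[4.0, 0.0], [2.0, 0.0]], [[0.0, 3.0], [0.0, 1.0]])

# Overlapping toy sources: some target energy is lost and some interference leaks.
overlap = wdo_metrics([[1, 0], [1, 0]], [[4.0, 1.0], [2.0, 0.0]], [[1.0, 3.0], [0.0, 1.0]])
```

In the disjoint case the three values are 1, infinity, and 1; in the overlapping case WDO falls to 5/7, reflecting both the lost target energy and the leaked interference.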
3 Proposed binaural speech enhancement system
3.1 System overview
We use the logarithmic transformation of the squared amplitude because it provides more meaningful information from the human hearing point of view. The phase of the STFT is represented by ϕ _{ L }(k, m) and ϕ _{ R }(k, m) for the left and right hearing aids, respectively.
The speech enhancement system is based on the estimation of the IBM defined in (1) from the two binaural mixtures. The IBM is not necessarily the same for the left and the right devices. However, in order to preserve the binaural cues, we assume that the same mask is applied in the right and the left devices. The IBM is calculated here using the energy of the signals of both devices. The mask is calculated only in one of the devices and transmitted to the other one, thus reducing the computational load in one of the devices. In the schema shown in Figure 1, the right device transmits the amplitude and phase of the STFT of its received signal, A _{ R }(k, m) and ϕ _{ R }(k, m), to the left device, which calculates the binary mask M(k, m) and transmits it to the right device. Once both devices have the mask, they apply it to the STFT of their received signals and compute the inverse STFT (ISTFT) to obtain a clean version of the original target source, which is directly played in the loudspeaker of the hearing device. The number of bits transmitted can be reduced by transmitting a quantized low-bit version of A _{ R }(k, m) and ϕ _{ R }(k, m), instead of their values themselves. The transmitted quantized version of A _{ R }(k, m) and ϕ _{ R }(k, m) are labeled as ${A}_{R}^{{B}_{\mathit{\text{Ak}}}}(k,m)$ and ${\varphi}_{R}^{{B}_{\mathit{\text{Pk}}}}(k,m)$, where B _{ Ak } and B _{ Pk } are the number of bits used to quantize the k th frequency band of the amplitude and phase, respectively. The quantized values from the right device and those directly computed by the left device, A _{ L }(k, m) and ϕ _{ L }(k, m), are used by the left device to calculate the binary mask M(k, m). Due to the fact that the binary mask, which is transmitted from the left to the right device, only contains values of 0 and 1, it is coded with only 1 bit, hence K bits are transmitted for each frame. 
It is worth mentioning here that the use of a soft mask may improve on the performance of a binary mask, but the transmission of continuous values would imply an increase in the transmission rate. The key point of the system proposed in this paper is that the values A _{ R }(k, m) and ϕ _{ R }(k, m) of each frequency band are quantized with a different number of bits B _{ Ak } and B _{ Pk }, limiting the total number of bits transmitted for each frame. The assignment of the number of bits to the different frequency bands is carried out by optimizing the performance of the speech enhancement system, which avoids transmitting unnecessary information.
The proposed transmission schema only makes sense when the latency of the system allows a delay higher than the transmission time plus the processing time. The system can also be implemented symmetrically, for instance, transmitting the information of half of the frequency bands from the left to the right device and the other half from the right to the left device. In this case, each device calculates half of the mask and transmits it to the other device. For the sake of simplicity, the schema in Figure 1 is adopted in this paper, considering that the proposed algorithms are also valid for the symmetric schema. Additionally, it is worth clarifying that the data transmission is not continuous: first, the amplitude and phase information is transmitted from the right to the left device, and after the processing time, the mask is transmitted from the left to the right device. This fact allows transmitting at the maximum bit rate available in the device (around 300 kbps in commercial devices) but only during a part of the processing time of each frame.
Finally, it is worth mentioning that all the design methods described in this paper are carried out offline on a computer. Only when the design has been completed is the optimum solution implemented on the digital hearing aid.
3.2 Estimation of the IBM with a least squares generalized discriminant analysis (LS-GDA)
The computational cost associated to the estimation of the IBM must be relatively low, according to the low computational power available in hearing aids. In this work, we propose the use of a low-cost classifier to decide whether a time-frequency point belongs to speech or noise, thus generating a time-frequency binary mask. The classifier uses a set of features extracted from the STFT of the left and right mixtures (A _{ L }(k, m), ϕ _{ L }(k, m), ${A}_{R}^{{B}_{\mathit{\text{Ak}}}}(k,m)$, and ${\varphi}_{R}^{{B}_{\mathit{\text{Pk}}}}(k,m)$) and it is trained with the IBM as target output.
Here, y _{ p } is the output of the LDA for the p th pattern and y _{0} is a threshold value. The output values of the classifier range from 0 to 1, so the threshold value is set to y _{0} = 0.5.
Here, ${f}_{1},\dots ,{f}_{{N}_{T}}$ are N _{ T } transformations performed over the original input features contained in P. The weight vector is then defined as $\mathbf{v}={\left[{v}_{0},{v}_{1},\dots ,{v}_{{N}_{T}\cdot L}\right]}^{T}$, and it can also be obtained using expression (12). Henceforth, this is referred to as generalized discriminant analysis (GDA); it is the classification schema used in this paper, labeled LS-GDA.
The implementation of the proposed classifier is relatively simple, its computational cost being directly related to the number of features included in Q. Considering that the selected data is consecutively stored in memory, and the processor performs the multiply-accumulate (MAC) operation in a single instruction, the number of instructions necessary to process each frequency band by the LS-GDA is approximately L+1, where L is the number of input features (we drop here the constant number of instructions necessary to generate the mask, which is a simple comparison). Hence, limiting the computational cost of the classifier is equivalent to limiting the number of features used for classification. The selection procedure to determine the best set of features to solve the classification problem at hand is included in section 4.
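The cost argument above can be made concrete with a small sketch of a linear discriminant trained by least squares and applied with one multiply-accumulate per feature. Expression (12) is not reproduced in this text, so the training step below is a generic least-squares solve of the normal equations, which is assumed to be equivalent in spirit; all function names and the toy data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_least_squares(features, targets):
    """Return weights [v0, v1, ..., vL] minimizing the MSE against targets."""
    rows = [[1.0] + list(f) for f in features]      # constant input for the bias v0
    dim = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]
    rhs = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(dim)]
    return solve_linear_system(gram, rhs)


def classify(weights, feature_vector, threshold=0.5):
    """Apply the discriminant: L multiply-accumulates plus the bias term."""
    y = weights[0]
    for v, q in zip(weights[1:], feature_vector):
        y += v * q                                   # one MAC per feature
    return 1 if y > threshold else 0


# Toy one-feature problem with 0/1 targets standing in for IBM values.
train_x = [[0.0], [1.0], [2.0], [3.0]]
train_t = [0.0, 0.0, 1.0, 1.0]
weights = fit_least_squares(train_x, train_t)
predictions = [classify(weights, x) for x in train_x]
```

Training is done offline, so only `classify` would run on the device; its loop length is what ties the computational cost to the number of features.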
3.3 Evolutionary algorithm to reduce the transmission bit rate
The low-cost classifier proposed in the previous section provides an estimation of the IBM minimizing the MSE. The classifier uses a set of features calculated from the signals received at both ears, which implies that all the information is transmitted from the right device to the left one. Unfortunately, this is not an energy-efficient system. The second step in the design of the binaural speech enhancement system proposed in this paper is the reduction of the transmission bit rate, which implies a reduction in the power consumption, while minimizing the effect that quantization has on the enhanced speech. In this work, we propose to optimize the transmission rate by assigning a different number of bits B _{ Ak } and B _{ Pk } to quantize the values A _{ R }(k, m) and ϕ _{ R }(k, m) of each frequency band. The number of bits may also differ between both values of the same frequency. This transmission schema allows assigning more bits to the frequencies and values that provide more information to the classifier.
In order to optimize the bit distribution, a tailored evolutionary algorithm is proposed, considering that the number of bits available to transmit the data of each time frame (i.e., the bit rate) is constrained. The algorithm searches for the assignment of bits among frequency bands that minimizes the MSE obtained by the LS-GDA classifier (the MSE is thus the fitness function). The matrix Q is created including the selected set of features calculated with the values ${A}_{R}^{{B}_{\mathit{\text{Ak}}}}(k,m)$ and ${\varphi}_{R}^{{B}_{\mathit{\text{Pk}}}}(k,m)$ quantized with different numbers of bits B _{ Ak } and B _{ Pk }, considering all integer values from 0 to 8. The values B _{ Ak } = 0 and B _{ Pk } = 0 mean that the corresponding value of the k-th frequency band is not transmitted. Hence, the rows of Q contain the features quantized with different numbers of bits. The values ${A}_{R}^{{B}_{\mathit{\text{Ak}}}}(k,m)$ and ${\varphi}_{R}^{{B}_{\mathit{\text{Pk}}}}(k,m)$ received by the left device are simulated by uniformly quantizing the values using ${2}^{{B}_{\mathit{\text{Ak}}}}$ and ${2}^{{B}_{\mathit{\text{Pk}}}}$ quantization steps, respectively. The dynamic range has been limited to 90 dB for the amplitude values (A _{ L } and A _{ R } are logarithmic values) and 2π for the phase values.
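The quantization described above can be sketched as a uniform quantizer with 2^B levels over a fixed dynamic range. The placement of the 90 dB amplitude range at [-90, 0] dB, the phase interval [-π, π], and the mid-point reconstruction are assumptions of this sketch; the text only specifies the width of each range and the number of steps.

```python
import math


def quantize_uniform(value, bits, lo, hi):
    """Quantize value to one of 2**bits uniform levels on [lo, hi].

    Returns None when bits == 0, meaning that nothing is transmitted.
    Values outside the range are clipped; the reconstruction point is the
    middle of the selected step (an assumption of this sketch).
    """
    if bits == 0:
        return None
    levels = 2 ** bits
    step = (hi - lo) / levels
    clipped = min(max(value, lo), hi)
    index = min(int((clipped - lo) / step), levels - 1)
    return lo + (index + 0.5) * step


# Log-amplitude in dB over an assumed [-90, 0] dB range, 2 bits (4 steps of 22.5 dB).
amp_q = quantize_uniform(-30.0, 2, -90.0, 0.0)

# Phase in radians over [-pi, pi], 8 bits (256 steps).
phase_q = quantize_uniform(0.3, 8, -math.pi, math.pi)

# A band that was assigned 0 bits is simply not transmitted.
not_sent = quantize_uniform(-30.0, 0, -90.0, 0.0)
```

With this scheme the reconstruction error is at most half a step, so each additional bit halves the worst-case error for values inside the range.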
1. The matrix Q is created containing the selected set of features calculated using the values ${A}_{R}^{{B}_{\mathit{\text{Ak}}}}(k,m)$ and ${\varphi}_{R}^{{B}_{\mathit{\text{Pk}}}}(k,m)$, quantized with different numbers of bits, from 0 to 8.
2. An initial population of 100 candidate solutions is generated. Each solution contains 2 · K values between 0 and 8 bits, which correspond to the number of bits assigned to A _{ R }(k, m) and ϕ _{ R }(k, m) in each frequency band.
3. The candidates of the population are checked against the constraint on the total number of bits. If a candidate solution exceeds the maximum number of bits allowed by N _{ D }, the number of bits at N _{ D } random positions of the candidate solution is decreased by one. If the number of bits of an element falls below 0, it is set to 0. The procedure iterates until the candidate solution fulfills the requirement.
4. The fitness function (MSE) of the classifier is then evaluated for each candidate solution and frequency band, following these steps:
   (a) The quantized version of the features is extracted from Q according to the current candidate solution.
   (b) The weight values v are calculated for each frequency band, using expression (12).
   (c) The MSE of each solution and frequency band is calculated according to expression (11).
   (d) The MSE associated with a candidate solution is the average of the MSE obtained in all frequency bands.
5. A selection process is applied, using the MSE of each solution as ranking. It consists of selecting the best 10% of the solutions of the population and removing the remaining solutions.
6. The remaining 90% of the solutions of the new generation are then generated by uniform crossover of the best candidates.
7. Mutations are applied to 1% of the new population, excluding the best solution obtained, which is preserved. Mutations consist of increasing or decreasing by one the number of bits at random positions of the mutated candidate solution.
8. Steps 3 to 7 are repeated until 100 generations have been evaluated. Since the best solution of each iteration is not modified, the best solution obtained in the last iteration is considered the best solution.
The values of the parameters of the evolutionary algorithm (population size, crossover rate, mutation scheme, and number of generations) have been found to obtain a quite good tradeoff between design time and performance for the experiments carried out in this paper.
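The steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the fitness passed in is a stand-in for the classifier MSE (the real one requires the matrix Q and expressions (11) and (12)), random positions in the repair step may repeat, each mutation changes a single position, and all function and parameter names are hypothetical.

```python
import random

MAX_BITS = 8


def repair(candidate, budget, rng):
    """Decrease random positions by one until the bit budget is met (step 3)."""
    candidate = list(candidate)
    excess = sum(candidate) - budget
    while excess > 0:
        for _ in range(excess):
            pos = rng.randrange(len(candidate))
            candidate[pos] = max(candidate[pos] - 1, 0)
        excess = sum(candidate) - budget
    return candidate


def uniform_crossover(a, b, rng):
    """Take each gene from either parent with equal probability (step 6)."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(candidate, rng):
    """Increase or decrease one random position by one bit (step 7)."""
    candidate = list(candidate)
    pos = rng.randrange(len(candidate))
    candidate[pos] = min(max(candidate[pos] + rng.choice((-1, 1)), 0), MAX_BITS)
    return candidate


def optimize_bits(fitness, num_genes, budget, pop_size=100, generations=100,
                  elite_frac=0.10, mutation_frac=0.01, seed=0):
    """Search for a bit assignment with low fitness under a total-bit budget."""
    rng = random.Random(seed)
    population = [[rng.randint(0, MAX_BITS) for _ in range(num_genes)]
                  for _ in range(pop_size)]
    n_elite = max(2, int(pop_size * elite_frac))
    n_mut = max(1, int(pop_size * mutation_frac))
    for _ in range(generations):
        population = [repair(c, budget, rng) for c in population]
        population.sort(key=fitness)                  # lower MSE ranks first
        elite = population[:n_elite]                  # step 5: keep the best 10%
        children = [uniform_crossover(*rng.sample(elite, 2), rng)
                    for _ in range(pop_size - n_elite)]
        population = elite + children
        for idx in rng.sample(range(1, pop_size), n_mut):  # index 0 (best) is kept
            population[idx] = mutate(population[idx], rng)
    population = [repair(c, budget, rng) for c in population]
    return min(population, key=fitness)


# Stand-in fitness: a made-up error that shrinks as important values get more bits.
importance = [5.0, 1.0, 0.5, 3.0, 0.1, 0.1]


def toy_mse(bits):
    return sum(w / (1.0 + b) for w, b in zip(importance, bits))


best = optimize_bits(toy_mse, num_genes=6, budget=12, pop_size=20, generations=30)
```

In the setting of the paper, `num_genes` would be 2 · K (amplitude and phase bits for each band) and `budget` the number of bits allowed per frame.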
4 Experimental work and results
4.1 Database generation
The design of a suitable database plays a vital role in any problem based on supervised machine learning. In order to validate the algorithms proposed in this work, a database of binaural speech and noise mixtures has been generated to design and test the classifier. In the case of speech, the TIMIT database described in [37] has been used. It contains a total of 626 recordings of male and female speech sampled at 16 kHz with a duration of 4 s. Another 626 noise sources have been selected from an extensive database which contains both stationary and non-stationary noise. Stationary noise refers to monotonous noisy environments, for instance, aircraft cabin noise. Non-stationary noise refers to other, non-monotonous noises, for example, children shouting in a kindergarten. We have taken into account a variety of noise sources, including those from the following diverse environments: aircraft, bus, cafe, car, kindergarten, living room, nature, school, shops, sports, traffic, train, train station, etc. All the speech and noise signals have been initially normalized to a power level of 0 dB.
Finally, it is worth clarifying that the number of data samples P contained in the matrix Q is given by M × Nmixtures, where M is the number of time frames of each mixture and Nmixtures is the number of mixtures of the database.
4.2 Selection of the input feature space
Table 1 Proposed combinations of features
SET | Number of features (NFtSet) | Features |
---|---|---|
SET1 | 3 | A _{ L }, (A _{ L }-A _{ R })^{2}, (ϕ _{ R }-ϕ _{ L })^{2} |
SET2 | 3 | A _{ L }, abs(A _{ L }-A _{ R }), abs(ϕ _{ R }-ϕ _{ L }) |
SET3 | 4 | A _{ L }, ${A}_{L}^{2}$, (A _{ L }-A _{ R })^{2}, (ϕ _{ R }-ϕ _{ L })^{2} |
SET4 | 2 | (A _{ L }-A _{ R })^{2}, (ϕ _{ R }-ϕ _{ L })^{2} |
SET5 | 7 | A _{ L }, A _{ R }, ${A}_{L}^{2}$, ${A}_{R}^{2}$, A _{ L }·A _{ R }, abs(A _{ L }-A _{ R }), abs(ϕ _{ R }-ϕ _{ L }) |
SET6 | 6 | A _{ L }, ${A}_{L}^{2}$, abs(A _{ L }-A _{ R }), (A _{ L }-A _{ R })^{2}, abs(ϕ _{ R }-ϕ _{ L }), (ϕ _{ R }-ϕ _{ L })^{2} |
Note that in the case of a symmetric implementation of the proposed system, an extra number of Nfreqs neighbor channels should be transmitted.
The experiments carried out in this section have two objectives: first, the selection of the best set of features among the six proposed (Table 1) and second, the selection of the optimum time-frequency footprint, finding the best values for Nfreqs and Nframes. The two problems are solved separately in two different experiments described below.
4.2.1 Selection of the best set of features
1. Create the matrix Q by calculating the features corresponding to the evaluated set and time-frequency footprint, using the data from the design set.
2. Calculate the weights of the LS-GDA classifier using Equation (12).
3. Create the matrix Q by calculating the features corresponding to the evaluated set and time-frequency footprint, now using the data from the test set.
4. Generate the binary mask for each mixture of the test database, using the weights calculated in step 2, according to (9).
5. Compute the WDO value for all the mixtures of the test database using the binary mask and the power of the original signals.
6. Repeat steps 1 to 5 for each set of features, time-frequency footprint, and SNR.
The conclusion of this analysis is that the combination of features labeled as SET2 is the best solution among those evaluated. From here onwards, all the experiments are carried out with this set of features.
4.2.2 Selection of the best time-frequency footprint
1. Create the matrix Q with the features of SET2 and the time-frequency footprint evaluated, using the data from the design set.
2. Calculate the weights of the LS-GDA classifier using Equation (12).
3. Create the matrix Q with the features of SET2 and the time-frequency footprint evaluated, now using the data from the test set.
4. Generate the binary mask for each mixture of the test database, using the weights calculated in step 2, according to (9).
5. Compute the WDO value for all the mixtures of the test database using the binary mask and the power of the original signals.
6. Repeat steps 1 to 5 for each value of Nfreqs and Nframes and each SNR.
From the analysis of the results obtained with this experiment, we propose that a time-frequency footprint with Nfreqs = 3 and Nframes = 2 represents a good tradeoff between speech separation and computational cost. The proposed solution obtains a WDO value of 0.79 for mixtures at 0 dB, using only 27 features to classify each time-frequency point. Finally, it is worth mentioning that a square-shaped time-frequency footprint has also been considered. However, it does not outperform the results of the T-shaped footprint, and it requires a notably higher number of features.
4.2.3 Evaluation of the computational cost associated to the proposed solution
In the special case of a processor with a clock speed of 5.12 MHz (5 MIPS), and working with a sampling rate of 16 kHz, analysis window of 128 samples and 65 frequency bands (i.e., our case), the number of instructions available to process each frequency band of a frame is 308. These instructions are shared between the different signal processing algorithms included in the device: the multi-band compression-expansion algorithm, feedback cancellation, automatic acoustic environment classification, and speech enhancement. Hence, the proposed speech enhancement algorithm should only use a part of the total number of available instructions.
The solution selected after the study carried out in this section uses 27 features. The number of instructions necessary to process each frequency band by the LS-GDA is approximately L + 1. Therefore, the number of instructions associated to the proposed solution represents less than 9% of the available number of instructions. This result supports the feasibility of implementing the proposed speech enhancement algorithm in real hearing aids.
4.3 Optimizing the transmission rate
The proposed evolutionary algorithm to optimize the bit distribution has been executed several times, varying the transmitted bit rate from 0 to 512 kbps. When the bit rate is 512 kbps, all the quantized data is transmitted with the maximum number of bits, B _{ Ak } = 8 and B _{ Pk } = 8 (i.e., 16 bits per frequency band, K = 64, and 500 frames per second); hence, the optimization is not required. In order to assess the effectiveness of the proposed algorithm, we have also evaluated the performance obtained by a uniform distribution of bits, assigning a constant number of bits to the amplitude and phase values of each frequency band. The values assigned in this case are 1, 2, 4, and 8, which correspond to transmission rates of 64, 128, 256, and 512 kbps, respectively.
In the case of transmitting the quantized values with the maximum number of bits (512 kbps), the WDO values obtained by the proposed algorithm practically match the WDO values obtained without quantization (i.e., using A _{ R }(k, m) and ϕ _{ R }(k, m)). The performance is nearly unaffected when the transmission rate is decreased down to 128 kbps, but the decrease begins to be noticeable for lower bit rates. Nevertheless, in the case of SNR = 0 dB (worst case), the performance is only reduced by 4% in the case of transmitting 64 kbps, 12% in the case of transmitting 16 kbps, 17% in the case of transmitting 8 kbps, and 25% in the case of transmitting 2 kbps, which are acceptable transmission rates for hearing aids. Additionally, the figure also shows the case in which no information is transmitted from the right to the left device (0 kbps). In such a case, the features are calculated only using the information available in the left ear (i.e., monaural system), and the performance clearly drops to WDO values around 0.5 for SNR = 0 dB, which supports the use of binaural separation. Moreover, the results obtained by the optimized distribution outperform those obtained by the uniform distribution, and the difference increases as the number of bits decreases. Furthermore, a uniform distribution does not allow reducing the transmission rate below 64 kbps.
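The relation between bits per band and transmission rate quoted above follows from simple arithmetic, sketched below with the figures given in the text (K = 64 bands, 500 frames per second). The function names are illustrative, and only the right-to-left amplitude and phase data are counted; the K-bit mask sent back in the other direction is left out of this sketch.

```python
def bit_rate_kbps(amp_bits, phase_bits, frames_per_second=500):
    """Bit rate in kbps for per-band amplitude and phase bit assignments."""
    bits_per_frame = sum(amp_bits) + sum(phase_bits)
    return bits_per_frame * frames_per_second / 1000.0


def bits_per_frame_budget(rate_kbps, frames_per_second=500):
    """Total number of bits that may be spent on one frame at a given rate."""
    return int(rate_kbps * 1000 // frames_per_second)


K = 64  # number of frequency bands, as stated in the text

# Uniform distributions: the same number of bits for every amplitude and phase value.
uniform_rates = {b: bit_rate_kbps([b] * K, [b] * K) for b in (1, 2, 4, 8)}

# Budgets seen by the optimizer at some of the low rates evaluated in the paper.
budgets = {rate: bits_per_frame_budget(rate) for rate in (64, 16, 8, 2)}
```

At 8 kbps, for example, only 16 bits are available per frame for 128 candidate values, so most values necessarily receive 0 bits; this is where a non-uniform assignment becomes necessary.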
5 Conclusions
This paper presents a novel energy-efficient sound separation algorithm with very low computational cost for speech enhancement in wireless-communicated binaural hearing aids. The source separation algorithm is based on supervised machine learning and time-frequency masking, and the system has been designed considering the power and computational limitations of state-of-the-art hearing aids. First, the computational cost of the algorithm has been constrained, obtaining good separation performance in terms of WDO even for low SNRs while using less than 10% of the available computational resources for signal processing. The selected combination of features represents a tradeoff between separation performance and computational cost. The results also show that including information from neighboring time-frequency points improves the decision on whether a time-frequency point belongs to speech or noise. Second, the transmission bit rate associated with the information exchange between both devices has also been constrained, optimizing the distribution of the number of bits among the different frequency bands with an evolutionary algorithm. The performance of the algorithm in terms of WDO is only reduced by 4% when transmitting 64 kbps, 17% when transmitting 8 kbps, and 25% when transmitting only 2 kbps, which are feasible bit rates for hearing aids. The optimization algorithm thus distributes the bits efficiently. Finally, the results show the advantage of binaural source separation over the monaural case.
The proposed algorithm has been tested in a scenario where the desired speech source is contaminated with two directional noises, in low SNR conditions. In order to generalize the results to a typical hearing aid application, the proposed algorithm should also be tested with diffuse background noise and reverberation. Additionally, other metrics related to speech quality or intelligibility should be used to evaluate the performance of the algorithm. Finally, it is worth noting that the tradeoff between transmission bit rate and separation performance can be further studied in an information-theoretic framework.
Declarations
Acknowledgements
This work has been funded by the Spanish Ministry of Economy and Competitiveness, under project TEC2012-38142-C04-02 and the scholarship AP2009-3932.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.