EURASIP Journal on Applied Signal Processing 2005:18, 2938–2953 © 2005 Hindawi Publishing Corporation

An Auditory-Masking-Threshold-Based Noise Suppression Algorithm GMMSE-AMT[ERB] for Listeners with Sensorineural Hearing Loss

This study describes a new noise suppression scheme for hearing aid applications based on the auditory masking threshold (AMT) in conjunction with a modified generalized minimum mean square error estimator (GMMSE) for individual subjects with hearing loss. The representation of cochlear frequency resolution is achieved in terms of auditory filter equivalent rectangular bandwidths (ERBs). Estimation of AMT and spreading functions for masking are implemented in two ways: with normal auditory thresholds and normal auditory filter bandwidths (GMMSE-AMT[ERB]-NH) and with elevated thresholds and broader auditory filters characteristic of cochlear hearing loss (GMMSE-AMT[ERB]-HI). Evaluation is performed using speech corpora with objective quality measures (segmental SNR, Itakura-Saito), along with formal listener evaluations of speech quality rating and intelligibility. While no measurable changes in intelligibility occurred, evaluations showed quality improvement with both algorithm implementations. However, the customized formulation based on individual hearing losses was similar in performance to the formulation based on the normal auditory system.


INTRODUCTION
Individuals with sensorineural hearing loss have more difficulty understanding speech compared to those with normal hearing. This effect is compounded in diverse environments that may contain time varying cues/signals or multiple competing speakers. This increased difficulty in understanding speech in noise is due to (a) reduced audibility of speech sounds in listeners with elevated auditory thresholds, and (b) suprathreshold processing deficits characteristic of sensorineural hearing loss. Hearing aids incorporate different strategies to compensate for reduced audibility and for suprathreshold processing deficits. These strategies include frequency-dependent amplification, compression, and directional microphones. Hearing aids based on digital signal processing may also include algorithms for feedback cancellation and active noise reduction. Spectral subtraction is one possible noise reduction algorithm for hearing aid applications because of its simplicity and low computational requirements. In general, noise reduction circuits employing spectral subtraction use mathematical criteria based on the estimated speech-to-noise ratio. One of the primary objectives in speech enhancement is to achieve a balance between pure noise suppression and the musical noise-like artifacts that may be introduced by the processing techniques. Most noise suppression methods are based on a signal-plus-noise model, and mathematical criteria (such as signal-to-noise ratio) are used to evaluate their performance. In an effort to achieve a better balance between audible musical artifacts and noise suppression, a number of previous studies in speech enhancement have considered incorporating aspects of the human auditory system including masking [1,2,3,4,5,6]. In an earlier study, Tsoukalas et al. [1] used a spectral subtraction technique based on aspects of the auditory process. 
Their method considers an enhancement approach that uses the auditory masking threshold (AMT) [7] in conjunction with a version of spectral subtraction. The AMT in their implementation was calculated in four steps: (1) obtain energies in speech critical band (CB) frequency analysis, (2) convolve a spreading function [8] with the CB spectrum to obtain a masking spread threshold, (3) compute an offset term for masking spread thresholds that takes into account signal tonality, and (4) normalize/compare and account for absolute auditory thresholds. This speech enhancement method is referred to as the TMK algorithm in the present study.
Based on the work in [1], Arehart et al. [9] implemented a version of the TMK algorithm and evaluated its effectiveness in improving speech perception in noise for both normal-hearing and hearing-impaired listeners. This implementation is referred to as the auditory masking threshold-noise suppression (AMT-NS) scheme in the present study. The AMT-NS algorithm yielded better quality ratings and better intelligibility scores in both normal-hearing and hearing-impaired listeners in some but not all of the test conditions. Their implementation of the TMK scheme employed speech and noise sampled at 8 kHz, while the original TMK [1] used 16 kHz samples of speech and noise. Also, the level of intelligibility improvement reported in [1] was significantly higher than that demonstrated in [9] with the 8 kHz sample rate version of the enhancement method.
The TMK and the AMT-NS algorithms are based on masking properties of the normal auditory system, with their theoretical underpinnings based on MPEG-4 audio coding [7]. Alternate processing strategies that specifically consider hearing aid applications and the effects of sensorineural hearing loss may optimize the AMT-NS approach to speech enhancement for hearing-impaired listeners. The present study describes a new noise suppression scheme. Referred to here as GMMSE-AMT[ERB], this new scheme includes two primary modifications of previous formulations.
The first change is that the new algorithm includes a modification of the suppression structure. Specifically, it is implemented using the modified generalized minimum mean square error (GMMSE) estimators, which provide improvement over traditional spectral subtraction estimators [10,11]. The suppression structure has also been modified so that tonality is not included. Preliminary evaluations in our laboratory indicated that listeners preferred algorithm formulations with tonality disabled. Furthermore, inclusion of tonality would introduce additional complexity to the algorithm formulation, which would hinder real-time implementation in digital hearing aid applications. Finally, the assumptions of the tonality offset, originally formulated for use in MPEG-4 audio coding applications, are primarily related to the harmonic structure of music or audio. While there is some justification for using a tonality offset with voiced signals due to the harmonic structure present in formant regions, some assumptions regarding tonality may not be appropriate for hearing aid applications. Therefore, we do not include a tonality offset in the formulation presented here.
The second primary modification is that the new algorithm establishes a framework for customization of the AMT estimation to individual subjects with hearing loss. To accommodate this framework, the algorithm requires estimation of normal frequency resolution as well as the degraded frequency resolution characteristic of cochlear hearing loss. Therefore, the frequency resolution of the cochlea is represented in the algorithm with an auditory filter bank using equivalent rectangular bandwidths (ERBs) [8]. While related to the critical band scale, the ERB scale is used in the algorithm formulation because present-day experimental studies estimating degraded frequency resolution in listeners with sensorineural hearing loss have used the ERB scale and not the critical band scale (e.g., [12,13,14]). The estimation of the AMT and of the spreading functions for masking are implemented in two ways: with normal auditory thresholds and normal auditory filter bandwidths (GMMSE-AMT[ERB-NH]) and with the elevated thresholds and broader auditory filters characteristic of cochlear hearing loss (GMMSE-AMT[ERB-HI]). Section 2 of this paper presents details of the algorithm derivation including the modified structure and framework for customization of the AMT based on individual listener profiles. Section 3 presents evaluation of both GMMSE-AMT[ERB-NH] and GMMSE-AMT[ERB-HI] implementations. GMMSE-AMT[ERB-NH] is evaluated over several speech corpora, using detailed objective quality tests based on segmental SNR and the Itakura-Saito objective quality measures. Formal listener evaluations with normal and hearing impaired subjects of speech quality rating and intelligibility are also used to test performance for both the NH and HI formulations.

GMMSE-AMT[ERB] ALGORITHM FORMULATION
The flowchart of the proposed algorithm is presented in Figure 1. The algorithm can be partitioned into three phases: (1) enrollment (GMMSE spectral estimation), (2) AMT estimation, and (3) audible noise suppression. Both algorithm versions are implemented and customized for individual hearing-impaired listeners by including frequency-dependent amplification approximating the linear gain prescribed by the NAL-R hearing aid fitting procedure [15]. GMMSE-AMT[ERB-HI] is further customized for each individual hearing-impaired listener by considering individual hearing losses in the AMT estimation (i.e., broader auditory filters and elevated thresholds).

Enrollment: GMMSE spectral estimation
The first processing step is to obtain an estimate of the clean speech power spectrum, which is needed to calculate the AMT, through a modified generalized minimum mean square error estimation algorithm. The original speech signal x(n) is assumed to be degraded by an additive uncorrelated noise source d(n), resulting in the noisy speech signal $y(n) = x(n) + d(n)$. Under this assumed model, one can obtain the generalized family of MMSE speech spectral estimators (1) of [10,11], expressed in terms of $X_p$, the power spectrum of the clean speech, and $Y_p$, the power spectrum of the noisy speech (both of which are real quantities). This MMSE estimator attempts to strike a balance between the a priori information and the noisy data information (in this case the a posteriori SNR γ − 1). One of the main advantages of the MMSE amplitude estimator is that it results in colorless residual noise in the enhanced speech [16]. We note that substitution of α = 0.5 into (1) gives the traditional Ephraim-Malah [17] amplitude estimator, and α = 1 gives the MMSE power spectral estimator. For MMSE, if the real and imaginary parts of the Fourier coefficients of the clean speech and noise are modeled as independent zero-mean Gaussian random variables with variances $\sigma_x^2(\omega, i)/2$ and $\sigma_d^2(\omega, i)/2$, respectively, and α = 0.5, the MMSE estimate of X(ω, i) is given in [17]. That estimate is expressed through the Gamma function Γ[·] and the confluent hypergeometric series Φ(a, b; z) (see (4)) defined in [18], and it depends on the a priori SNR and the a posteriori SNR. Here ξ(ω, i) is the a priori SNR and γ(ω, i) − 1 is the a posteriori SNR, both as functions of frequency ω and frame index i. The definitions in (6) suggest a general representation of these terms: ξ(ω, i) is the SNR computed using the clean speech X, and γ(ω, i) is the ratio of the noisy speech spectrum Y(ω, i) to the background noise spectrum, assuming that the noise is statistically white.
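As an illustration of the α = 0.5 case, the following minimal sketch evaluates the amplitude gain in the form commonly quoted for the Ephraim-Malah estimator in the literature, G = Γ(1.5)·(√v/γ)·Φ(−0.5, 1; −v) with v = ξγ/(1 + ξ). The function names, the series truncation settings, and the use of Kummer's transformation for numerical stability are assumptions of this sketch and are not taken from the implementation used in this study.

```python
import math


def kummer_phi(a, b, z, tol=1e-15, max_terms=5000):
    """Confluent hypergeometric series Phi(a, b; z) by direct summation."""
    term = 1.0
    total = 1.0
    for n in range(max_terms):
        # Ratio of consecutive series terms: (a+n)/(b+n) * z/(n+1)
        term *= (a + n) / (b + n) * z / (n + 1)
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total


def ephraim_malah_gain(xi, gamma):
    """Amplitude gain for the alpha = 0.5 case (literature form, assumed here).

    xi    : a priori SNR
    gamma : ratio of noisy speech power to noise power
    Intended for moderate v; exp(v) overflows for v above roughly 700.
    """
    v = xi / (1.0 + xi) * gamma
    # Kummer's transformation avoids an alternating series:
    # Phi(-0.5, 1; -v) = exp(-v) * Phi(1.5, 1; v)
    phi = math.exp(-v) * kummer_phi(1.5, 1.0, v)
    return math.gamma(1.5) * math.sqrt(v) / gamma * phi


if __name__ == "__main__":
    for xi_db in (-10, 0, 10, 20):
        xi = 10.0 ** (xi_db / 10.0)
        print(xi_db, round(ephraim_malah_gain(xi, 1.0 + xi), 4))
```

At high SNR this gain approaches the Wiener gain ξ/(1 + ξ), which gives a convenient sanity check on any implementation.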
While γ(ω, i) can be obtained from an accurate estimate of the background noise, a decision-directed approach is used to estimate ξ(ω, i). The estimate for ξ(ω, i) follows the decision-directed rule of [17], in which β is chosen to be between 0 and 1, and P[x] = x for x ≥ 0 and P[x] = 0 for x < 0. It can be shown that a small value of α (i.e., α → 0) is suitable for noise suppression that improves the segmental SNR [11]. A larger value of α (i.e., α → 1) reduces the amount of musical processing artifacts and speech distortion (note that this balance is illustrated in the Enrollment phase in Figure 1). This suggests a benefit from a method that dynamically changes the value of α, rather than restricting the processing to a single value. Using a speech/pause detection algorithm, one can dynamically change the value of α. In the noisy signal, if a pause is encountered, the value of α is dynamically reduced (i.e., α → 0), and in regions where speech is present, the value of α is set to 1.
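The decision-directed update and the speech/pause switching of α can be sketched as follows. The update is written in the form commonly quoted from [17], a β-weighted combination of the previous frame's estimated SNR and the half-wave rectified term P[γ − 1]. The default β = 0.98 and the pause value `alpha_pause` are illustrative assumptions, since the values used in this study are not stated in this section.

```python
def half_wave(x):
    """P[x] = x for x >= 0 and 0 otherwise."""
    return x if x >= 0.0 else 0.0


def decision_directed_xi(prev_clean_power, noise_power, gamma, beta=0.98):
    """Decision-directed estimate of the a priori SNR for one frequency bin.

    prev_clean_power : estimated clean speech power in the previous frame
    noise_power      : noise power estimate for this bin
    gamma            : noisy speech power divided by noise power (current frame)
    beta             : weighting factor in (0, 1); 0.98 is an assumed default
    """
    return beta * prev_clean_power / noise_power + (1.0 - beta) * half_wave(gamma - 1.0)


def select_alpha(speech_present, alpha_pause=0.05):
    """alpha = 1 in speech regions; a small assumed value during pauses."""
    return 1.0 if speech_present else alpha_pause


if __name__ == "__main__":
    print(decision_directed_xi(4.0, 1.0, 3.0))
    print(select_alpha(True), select_alpha(False))
```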
The voice activity detector (VAD) algorithm [19] used to dynamically adjust α is described below. Let $P_{dk}$ be the power spectrum of the distortion/noise for the kth ERB frequency subband, and $P_{xk}$ be the estimated power spectrum of the clean speech signal for the kth ERB frequency subband. The values of $P_{dk}$ and $P_{xk}$ are obtained from relations that use the constants µ = 0.7, κ = 0.998, and η = 0.45. These values are used for our implementation with an analysis (FFT) frame size of 128 samples (16 milliseconds) and a skip rate of 64 samples (i.e., an overlap of 50% between adjacent analysis windows) at an 8 kHz sample rate. These values were determined to be reasonable for the noise types considered through a pilot experiment, and were kept fixed for all processing in the present study. The speech pause detector operates on the ratio $NX_k[n] = P_{dk}[n]/P_{xk}[n]$. The term $NX_k^{\mathrm{rel}}[n]$ is the relative ratio of the noise energy to the signal-plus-noise energy for each subband [19]. The values of $NX_k^{\min}[n]$ and $NX_k^{\max}[n]$ represent the minimum and maximum ratios, and are calculated by looking back across the previous 400 milliseconds of the speech signal. The value of the power spectrum of the distortion in subband k, $P_{dk}$, is modified if $NX_k[n]$ is less than a predetermined threshold. We then apply a nonlinear gain term, based on the value of α from the GMMSE algorithm, the a priori SNR, and the a posteriori SNR, to the noisy power spectrum to obtain the estimate of the clean power spectrum.
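The framing arithmetic and the 400-millisecond look-back can be made concrete with a short sketch. The frame and hop durations follow directly from the stated frame size, skip rate, and sample rate. The min-max normalization used for the relative ratio, and the choice to include the current frame in the look-back window, are assumptions made for illustration; the exact relations of [19] are not reproduced here, and the class name is hypothetical.

```python
from collections import deque

SAMPLE_RATE_HZ = 8000
FRAME_LEN = 128      # samples per analysis frame
HOP = 64             # skip rate in samples (50% overlap)
LOOKBACK_MS = 400    # look-back span for the minimum and maximum ratios

frame_ms = 1000.0 * FRAME_LEN / SAMPLE_RATE_HZ
hop_ms = 1000.0 * HOP / SAMPLE_RATE_HZ
lookback_frames = int(LOOKBACK_MS / hop_ms)


class RelativeRatioTracker:
    """Tracks NX for one subband and normalizes it by its recent min and max."""

    def __init__(self, n_frames=lookback_frames):
        # deque with maxlen discards the oldest ratio automatically
        self.history = deque(maxlen=n_frames)

    def update(self, nx):
        self.history.append(nx)
        lo, hi = min(self.history), max(self.history)
        if hi == lo:
            return 0.0
        # Assumed normalization: position of nx between recent min and max
        return (nx - lo) / (hi - lo)


if __name__ == "__main__":
    print(frame_ms, hop_ms, lookback_frames)
    tracker = RelativeRatioTracker()
    for nx in (0.2, 0.8, 0.5):
        print(tracker.update(nx))
```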

AMT threshold estimation
Having presented the GMMSE enhancement scheme and voice activity detector, we now shift to the auditory masking threshold estimation scheme. It is important to note that the use of an AMT is not by itself a speech enhancement process, since it essentially allows the enhancement method to balance noise suppression versus potential processing artifacts. The use of the AMT is of particular interest for hearing-impaired individuals since, in theory, one would expect that the AMT would be shifted for such individuals and allow for a different level of either background noise or processing artifacts in the processed signal. The steps for calculating the AMT (as shown in Figure 1) in the present algorithm are as follows. The auditory filters are represented using their equivalent rectangular bandwidth [12]. For normal-hearing (NH) individuals, the hearing thresholds across all frequencies are assumed to be 0 dB HL. The hearing thresholds in quiet for hearing-impaired (HI) individuals are obtained from audiometric testing. The ERB values for a normal-hearing individual over the whole frequency range are described by the following equation [12]:

$$\mathrm{ERB} = 24.7\,(4.37F + 1),$$

where ERB is in Hz, and F is the center frequency in kHz. For the hearing-impaired individual, the ERB is equal to $24.7\,(4.37F + 1) \cdot B$, where B (B > 1) is the frequency broadening term which is described below. The total threshold for HI listeners is a combination of threshold loss due to outer and inner hair cell damage. The broadening of the auditory filters due to hearing loss can be described by [13,14]

$$B = 10^{\,0.01757\,(\mathrm{HL}_{\mathrm{ohc}} - 22)\,\cdot\,\left(\left[1 - (f_c - 1)^2\right]/3.09\right)} \qquad (11)$$

up to a frequency of 1 kHz, with a separate expression applying at higher frequencies, where $f_c$ is the center frequency in kHz, and $\mathrm{HL}_{\mathrm{ohc}}$ is the amount of hearing loss due to outer hair cell damage.
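A brief numerical sketch of the bandwidth expressions follows. It evaluates the normal and broadened ERB from the formula above and checks the power-of-ten term 10^(0.01757(HL_ohc − 22)) at HL_ohc = 55 dB. Only that level-dependent term is evaluated; the frequency-dependent factor in the broadening expression is deliberately left out, and the function name is hypothetical.

```python
def erb_hz(f_khz, broadening=1.0):
    """ERB in Hz at center frequency f_khz (kHz).

    broadening = 1 gives the normal-hearing value; values above 1 model the
    broader auditory filters of an impaired ear.
    """
    return 24.7 * (4.37 * f_khz + 1.0) * broadening


# Level-dependent term only, evaluated at an outer hair cell loss of 55 dB
level_term_55 = 10.0 ** (0.01757 * (55.0 - 22.0))

if __name__ == "__main__":
    for f in (0.25, 0.5, 1.0, 2.0, 4.0):
        print(f, round(erb_hz(f), 1), round(erb_hz(f, level_term_55), 1))
    print(round(level_term_55, 2))
```

The normal-hearing ERB at 1 kHz evaluates to about 132.6 Hz, and the level-dependent term at 55 dB evaluates to approximately 3.8.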
Eighty percent of the total threshold loss is assumed to be due to loss of outer hair cell function, with the auditory filter bandwidth at 2000 Hz corresponding to filters that are approximately 2.7 times the bandwidth of normal auditory filters (Moore and Glasberg [14]). The constant 0.01757 is chosen so that B has a value of 3.8 when $\mathrm{HL}_{\mathrm{ohc}}$ = 55 dB, which the model assumes is the maximum value of broadening due to outer hair cell loss below 2000 Hz. For NH individuals, the value of B is set to 1. Thus, the total number of estimated ERB filters in the frequency partition will be smaller for impaired ears. Once the filter shapes are defined, the signal power in each critical subband is calculated as $X_{\mathrm{ERB}}$. The excitation pattern is derived from the output of the auditory filters as a function of their center frequency. Specifically, the excitation pattern is calculated by summing the power of each signal component weighted by the filter weighting function given by the ROEX(p) model described in [8], in which W denotes the filter shape. We note that the signal power for calculating the excitation pattern must be recalculated to match the audiometric testing results. The correction thresholds for this recalculation are obtained from the TDH-39 headphones for both the normal and impaired ear. The filter shape is a function of the normalized distance of the signal component from the center frequency $f_c$ of the filter involved. The parameter p in (13) describes both the bandwidth and slope of the skirts of the auditory filter and can be used to derive $p_l$ and $p_u$, which, respectively, describe the sharpness of the lower and upper sides of the ERB-based bandpass filters. The lower frequency skirt $p_l$ of the auditory filter becomes less sharp with increasing level. Here, $p_l$ varies with broadening and level, referenced to $p_l(51)$, the value of the skirt p for an equivalent noise level of 51 dB/ERB, and $p_l(51, 1\mathrm{k})$, the value of $p_l(x)$ at 1 kHz for a noise level of 51 dB/ERB.
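The excitation computation can be sketched with the rounded-exponential weighting in the form usually given for the roex(p) filter, W(g) = (1 + pg)·exp(−pg), with normalized distance g = |f − f_c|/f_c. These two expressions are taken from the general auditory-filter literature and are assumed, not confirmed, to match (13) and the normalized-distance equation of this paper. The slope values passed in the example are arbitrary.

```python
import math


def roex_weight(g, p):
    """Rounded-exponential weighting W(g) = (1 + p*g) * exp(-p*g)."""
    return (1.0 + p * g) * math.exp(-p * g)


def excitation(fc_hz, components, p_lower, p_upper):
    """Excitation at the filter centered on fc_hz.

    components : iterable of (frequency_hz, power) pairs
    p_lower    : slope parameter applied to components below fc_hz
    p_upper    : slope parameter applied to components at or above fc_hz
    """
    total = 0.0
    for f_hz, power in components:
        g = abs(f_hz - fc_hz) / fc_hz          # normalized distance from fc
        p = p_lower if f_hz < fc_hz else p_upper
        total += power * roex_weight(g, p)
    return total


if __name__ == "__main__":
    tones = [(900.0, 1.0), (1000.0, 2.0), (1100.0, 1.0)]
    # A smaller p (broader filter) lets more off-frequency power through
    print(excitation(1000.0, tones, 25.0, 25.0))
    print(excitation(1000.0, tones, 10.0, 10.0))
```

Lowering p broadens the filter, so the same off-frequency components contribute more excitation, which is the mechanism by which broader impaired-ear filters raise the masking threshold.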
$X_{\mathrm{ERB}}$ is the signal power in each critical subband, which can also be stated as the equivalent input power in dB/ERB. The upper frequency skirt, $p_u$, of the auditory filter does not vary greatly with level. Figure 2 compares the excitation pattern based on Schroeder's spreading function and the masking in the ROEX (rounded exponential) model [12]. The excitation pattern does not vary with the level for the critical bands (CB) in the Schroeder model [20]. The excitation pattern for the impaired ear is consistent with broader filter shapes characteristic of sensorineural hearing loss. The excitation pattern is compared with the absolute threshold of hearing, and the AMT is set as the greater of the two.
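The final comparison step can be written in a few lines. This sketch assumes that the excitation pattern and the absolute threshold are both available per ERB band on the same dB scale; the function name and the example values are hypothetical.

```python
def auditory_masking_threshold(excitation_db, absolute_threshold_db):
    """Per-band AMT: the greater of the excitation level and the absolute threshold."""
    if len(excitation_db) != len(absolute_threshold_db):
        raise ValueError("both inputs must have one value per ERB band")
    return [max(e, t) for e, t in zip(excitation_db, absolute_threshold_db)]


if __name__ == "__main__":
    excitation_levels = [30.0, 10.0, 55.0]   # illustrative values in dB
    elevated_thresholds = [20.0, 25.0, 60.0]  # illustrative impaired-ear thresholds
    print(auditory_masking_threshold(excitation_levels, elevated_thresholds))
```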

Scaling issues
Auditory filter shape is dependent on stimulus level [12,13]. Therefore, it is necessary to scale the signal appropriately to represent the actual playback level in dB SPL. This is achieved in the following way. (a) The output level of the speech waveform is set to 60 dB (SPL) for normal-hearing subjects and 90 dB (SPL) for individuals with hearing loss.
(b) The maximum dB value of the signal is identified after performing a frame-based FFT analysis of the signal.
(c) A scaling factor is chosen to convert the power spectrum of the signal in dB to a dB (SPL) scale such that the maximum dB (SPL) is limited to 60 dB (SPL) for normal-hearing individuals and 90 dB (SPL) for hearing-impaired individuals.
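Steps (a) through (c) can be sketched as a single additive offset in the dB domain. The sketch assumes that the frame-based power spectra are already available as nested lists; the function names and the small floor used to avoid taking the logarithm of zero are illustrative choices.

```python
import math


def power_db(power, floor=1e-12):
    """Power in dB, with a small floor to avoid log of zero."""
    return 10.0 * math.log10(max(power, floor))


def spl_offset_db(frame_power_spectra, target_peak_db_spl):
    """Offset that maps the largest spectral value to the target dB SPL."""
    peak_db = max(power_db(p) for frame in frame_power_spectra for p in frame)
    return target_peak_db_spl - peak_db


def to_db_spl(power, offset_db):
    """Convert one power value to the dB SPL scale using the offset."""
    return power_db(power) + offset_db


if __name__ == "__main__":
    frames = [[1.0, 10.0], [100.0, 0.1]]     # illustrative power spectra
    offset_nh = spl_offset_db(frames, 60.0)  # normal-hearing target
    offset_hi = spl_offset_db(frames, 90.0)  # hearing-impaired target
    print(offset_nh, offset_hi)
    print(to_db_spl(100.0, offset_nh), to_db_spl(1.0, offset_nh))
```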

Audible noise suppression
In our formulation, we use a window frame of the noisy speech $Y_w(i, k)$ and clean speech $X_w(i, k)$ frequency responses in power spectral representations (in a manner similar to [1]), as given in (17). The noisy speech spectrum is compared with the AMT as calculated in the previous section. The clean speech spectrum is estimated using the nonlinear gain function (18), which is derived using a nonlinear filtering operation for the ith frame and kth subband [1]. In the parameter $a_b(i)$ of that gain function, given by (19), $D_{pb}$ is the mean power spectrum of the noise in ERB subband b, and $T_b$ is the masking threshold in the same subband. We can see from (19) that if the noise level approaches the masked threshold $T_b(i, k)$, then the value of $a_b(i)$ approaches $2D_{pb}$, and therefore the suppression in (18) is always greater than the traditional Wiener filter solution (i.e., the Wiener filter solution would have $a_b(i) = D_{pb}$, so $a_b(i) = 2D_{pb}$ will produce a greater suppression value as a function of frequency). If the noise spectrum is below this threshold, no further enhancement processing is performed (as illustrated in Figure 1). The enhanced signal is renormalized and converted back to the time domain.
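The limiting behavior described above can be illustrated with a sketch that assumes the parametric form commonly associated with the TMK approach of [1]: a gain of Y^ν/(a^ν + Y^ν) with a = (D + T)(D/T)^(1/ν). This form is an assumption chosen because it reproduces the stated limit (a approaches 2D as the noise level approaches the threshold); it is not confirmed to be identical to (18) and (19). The exponent ν = 1 and the function names are likewise illustrative.

```python
def suppression_parameter(noise_power, mask_threshold, nu=1.0):
    """Assumed form a = (D + T) * (D / T)**(1/nu); equals 2D when D == T."""
    return (noise_power + mask_threshold) * (noise_power / mask_threshold) ** (1.0 / nu)


def suppression_gain(noisy_power, noise_power, mask_threshold, nu=1.0):
    """Gain applied to the noisy power spectrum in one subband.

    Noise below the masking threshold is treated as inaudible and left alone.
    """
    if noise_power < mask_threshold:
        return 1.0
    a = suppression_parameter(noise_power, mask_threshold, nu)
    return noisy_power ** nu / (a ** nu + noisy_power ** nu)


def wiener_like_gain(noisy_power, noise_power):
    """Reference case with a = D, as in the comparison made in the text."""
    return noisy_power / (noise_power + noisy_power)


if __name__ == "__main__":
    print(suppression_parameter(3.0, 3.0))   # noise at threshold
    print(suppression_gain(6.0, 3.0, 3.0))   # stronger suppression
    print(wiener_like_gain(6.0, 3.0))        # reference gain with a = D
    print(suppression_gain(10.0, 1.0, 5.0))  # noise masked, no processing
```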

EVALUATION
In this section, a detailed performance evaluation is presented for the formulated GMMSE-AMT[ERB] algorithm in the form of objective speech quality results as well as results from subjective speech quality and intelligibility tests. The objective quality of the enhanced speech is assessed in terms of segmental SNRs (SegSNR) as well as the Itakura-Saito (IS) objective speech quality measure [21] for the GMMSE-AMT[ERB-NH] implementation. These measures are explained below in detail. Finally, detailed subjective speech quality tests using a quality rating scale and intelligibility tests using the nonsense syllable test (NST) are presented for individuals with and without hearing loss to assess the performance of the GMMSE-AMT[ERB-NH] and GMMSE-AMT[ERB-HI] algorithm implementations. For our evaluation, we considered two types of noise with different frequency and temporal structure: (i) stationary flat communications channel noise (FLN), and (ii) large crowd noise from within an open room (LCR). These noise sources have previously been used for speech enhancement and robust speech recognition evaluations [22]. The FLN noise represents a broadband noise source that is quite stationary. The LCR noise is slowly varying and primarily low frequency, where high-frequency (4 kHz) content is approximately 10 dB lower than that seen in the low-frequency region. Figure 3 shows, among its panels, (iii) speech enhanced using the present algorithm (GMMSE-AMT[ERB-NH]) for a single sentence to illustrate detailed processing performance. The processed sentence, "In wage negotiations the industry bargains as a unit with the single union," is taken from the TIMIT speech corpus, is approximately 5.5 seconds in duration, and is sampled at 8 kHz. Figure 3 also shows the IS objective speech quality measures for the same sentence, (iv) degraded with the FLN noise at 5 dB SNR, and (v) enhanced with GMMSE-AMT[ERB-NH].
From this figure, one can observe noticeable noise suppression performed by the GMMSE-AMT[ERB-NH] scheme. The cumulative area under the IS curves in the bottom two panels represents the total amount of distortion as estimated with the IS measure. The enhanced sentence IS plot (v) shows noticeably less distortion than the degraded sentence across the phoneme sequence. This single-sentence result therefore illustrates that the proposed enhancement method provides noise suppression and quality improvement in proportion to the level and type of distortion. We consider a more extensive set of speech enhancement evaluations using objective speech quality measures (overall and within each phoneme) and subjective speech quality measures in the next section. Before considering this, we briefly consider an example comparison of the AMT used in the GMMSE-AMT[ERB] enhancement scheme. Figures 4 and 5 show the spectral plots of (1) the noisy speech power spectrum, (2) the clean speech power spectrum, and (3) the auditory masking threshold (AMT) for the vowel /EY/ for the GMMSE-AMT[ERB-NH] and GMMSE-AMT[ERB-HI] implementations. Any portion of the power spectrum of the noisy speech that falls below the AMT is assumed to be inaudible and therefore will not be suppressed. Comparing the AMT for the voiced speech in the NH and HI schemes, one can see that for the HI scheme (Figure 5) there would be far less suppression than in the NH scheme (Figure 4). Because of the pronounced effect of masking in HI individuals, more signal components are masked. On average, noise suppression is performed approximately 80% of the time for the NH scheme and about 40% of the time for the HI scheme if we consider each ERB-based filter band and time-based analysis frame. Next, we consider objective measures of processed speech quality over a larger speech corpus.

Objective quality measures
The performance of an enhancement algorithm can be assessed in two ways: (a) employing objective speech quality measures and/or (b) subjective listener tests, which have as their goal to quantify the improvement/distortion that a human listener would perceive. Two of the most widely used objective quality measures are the segmental SNR (SegSNR) and the Itakura-Saito (IS) distance measure [21,22]. In normal-hearing listeners, the SegSNR and IS measures have been benchmarked against subjective speech quality measures such as the diagnostic acceptability measure (DAM). The correlation between DAM and IS is 0.59 and between DAM and SegSNR is 0.77. These values are based on a variety of distortions including additive noise, communication distortions, nonlinear distortions, and vocoder distortions [21].
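For reference, segmental SNR is conventionally computed as the average of per-frame SNR values in dB. The sketch below follows that conventional definition. The frame length and hop mirror the analysis settings used elsewhere in this paper, while the clamping limits of −10 dB and 35 dB are a common convention assumed here and are not taken from this study.

```python
import math


def segmental_snr_db(clean, processed, frame_len=128, hop=64,
                     floor_db=-10.0, ceil_db=35.0):
    """Average of per-frame SNR values (dB), each clamped to [floor_db, ceil_db]."""
    values = []
    for start in range(0, len(clean) - frame_len + 1, hop):
        ref = clean[start:start + frame_len]
        est = processed[start:start + frame_len]
        signal_energy = sum(r * r for r in ref)
        error_energy = sum((r - e) ** 2 for r, e in zip(ref, est))
        if signal_energy == 0.0:
            continue  # skip silent reference frames
        if error_energy == 0.0:
            snr = ceil_db
        else:
            snr = 10.0 * math.log10(signal_energy / error_energy)
        values.append(min(max(snr, floor_db), ceil_db))
    return sum(values) / len(values) if values else float("nan")


if __name__ == "__main__":
    clean = [1.0] * 256
    processed = [0.9] * 256  # constant error of 0.1 per sample
    print(segmental_snr_db(clean, processed))
```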
We note that the research performed on objective speech quality measures has focused almost exclusively on measures for predicting speech quality for voice coding applications ([21], [23, Chapter 9]). However, these objective measures have been used extensively to assess the performance of speech enhancement and noise suppression schemes as well. An important issue to note is that for the present study, we employ an AMT. In many objective measures, such as SegSNR, overall speech signal energy and noise signal energy are used on a frame-by-frame basis. Since the purpose of the AMT is to balance noise suppression versus processing artifacts, the AMT in effect disables the noise suppression scheme in regions where further noise suppression would only introduce audible processing artifacts. Therefore, for measures such as SegSNR, methods which do not employ an AMT would, in theory, always be selected over those with an AMT, since the AMT-based methods leave more noise power behind (even if that noise is not audible). As such, it would be appropriate to consider a direct comparison of speech enhancement methods that either (i) process noisy speech without an AMT or (ii) employ an AMT, but not to compare between methods that have the AMT engaged and disabled. For this reason, we do not report objective measures within our enhancement methods for engaged/disabled AMT processing.
For a broad objective quality evaluation, the 192-sentence core test set in the TIMIT database, with both male and female speakers, was degraded with both stationary (FLN) and nonstationary (LCR) additive noise sources. The noise levels were set at 0 dB and 5 dB SNR. Overall average SegSNR and IS measures are summarized in Table 1 (note that each entry represents an average over 192 TIMIT sentences). There is a measurable improvement in SegSNR for both noise types at all SNR levels. There is also a corresponding level of improvement in the IS measure for the enhanced speech over the degraded speech for all conditions (this is especially true for noise types at 5 dB SNR).
Next, we consider performance of the proposed enhancement method with respect to TMK. In Table 2, we present the average SegSNR and IS objective speech quality measures for the 192-sentence TIMIT test set for FLN and LCR noise distortions at 0 dB SNR. Both noise level (SegSNR) and speech quality (IS) are significantly impacted by both noise sources. Using the TMK algorithm, we performed enhancement for all 192 sentences, and measurable improvement is seen. Since FLN noise is closer to white Gaussian noise, the level of improvement in IS is slightly larger than for the LCR noise, which has multiple speakers in a crowd setting and is more time varying. The results from Table 2 confirm a similar level of noise suppression, as represented in the SegSNR measure, between the GMMSE-AMT[ERB-NH] and TMK algorithms. For quality improvement, the performance is comparable for FLN, and GMMSE-AMT[ERB-NH] is slightly better than TMK for LCR. Having considered overall performance, we now wish to examine where in the acoustic phoneme space TMK versus GMMSE-AMT[ERB-NH] shows improvement. In Table 3, we summarize individual IS objective measure performance for each phoneme from the 192-sentence TIMIT test set. The original degraded speech at an SNR of 5 dB with FLN noise is shown under "DEG," alongside the corresponding IS measures for TMK and the proposed enhancement method (labeled as ERB). There are 76,876 frames of speech processed in each case. From this table, we see that GMMSE-AMT[ERB-NH] provides a consistently higher level of quality for nasals, vowels, diphthongs, and semi-vowels. Fricatives and stops resulted in a similar level of performance for both enhancement methods. The only class which showed a slight loss for GMMSE-AMT[ERB-NH] was the silence class (a reduction in IS of 0.15 when going to GMMSE-AMT[ERB-NH] from TMK).

Listener evaluations
In this section, we describe the procedures used to evaluate the effectiveness of the GMMSE-AMT[ERB-NH] scheme in normal-hearing listeners and the GMMSE-AMT[ERB-NH] and GMMSE-AMT[ERB-HI] schemes in hearing-impaired listeners. Our current evaluation uses a sampling rate of 8000 Hz, which was motivated by our earlier studies on speech enhancement for telephone/telecommunication applications [24], as well as limited computational resources for hearing aid systems.

Listeners
Six listeners with normal hearing and ten listeners with hearing loss participated in this study. Listeners with normal hearing had thresholds of 20 dB HL (ANSI, 1989) or better at octave frequencies from 250 to 8000 Hz, inclusive. Listeners with hearing loss demonstrated test results consistent with sensorineural pathology: normal tympanometry, absence of otoacoustic emissions in regions of threshold loss, and absence of an air-bone gap exceeding 10 dB at any frequency. Listeners with hearing loss had a mild-to-severe hearing loss. All listeners were tested monaurally. Table 4 provides a summary of the characteristics of the listeners with hearing loss, including the audiometric thresholds of the test ear. The test ear of each hearing-impaired listener was the ear whose threshold configuration allowed the best digital filter design for linear amplification (see below). Listeners were tested individually in a double-walled sound booth. Daily test sessions typically lasted one hour but did not extend beyond two hours. Listeners were compensated 8 USD/hour for their participation.

Stimuli
Speech materials. Two different sets of speech stimuli were used in this study. Speech quality was assessed using 256 sentences from the hearing-in-noise test [25]. Speech intelligibility was assessed using 102 syllables from the CUNY nonsense syllable test [26]. The speech stimuli were digitized at an 8 kHz sampling rate and stored on a Pentium IV computer.
Noise conditions. Speech stimuli were degraded with large crowd room noise (LCR) and flat channel noise (FLN) at overall SNRs of 0 dB and +5 dB.
Signal processing. Digitized speech was degraded with sample noise files with appropriate scaling to generate each SNR. This set of "degraded" signals was then processed by the GMMSE-AMT[ERB] scheme to generate the set of "enhanced" speech signals. In all enhancement processing, the noise spectrum was estimated during an initial portion of silence/noise prior to speech activity, and this estimate was kept constant across the syllable (NST material) or sentence (HINT). The GMMSE-AMT[ERB] scheme was applied in two ways. The first approach, GMMSE-AMT[ERB-NH], used thresholds and auditory filter bandwidths characteristic of a normally functioning auditory system. Both listener groups were evaluated with the GMMSE-AMT[ERB-NH] approach. Implemented only for the hearing-impaired listener group, the second approach, GMMSE-AMT[ERB-HI], used thresholds and auditory filter bandwidths characteristic of sensorineural hearing loss. Customized for each individual hearing-impaired listener, the GMMSE-AMT[ERB-HI] implementation adjusted the spread-of-masking functions based on individual thresholds and auditory filter bandwidths [14]. Table 5 provides a summary of the stimulus conditions. Quality and intelligibility were measured in a total of eight conditions for the normal-hearing group (2 noise types with 2 SNRs for 2 processing conditions) and a total of 12 conditions for the hearing-impaired group (2 noise types with 2 SNRs for 3 processing conditions).
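The noise scaling used to reach a target overall SNR can be sketched as follows. This sketch assumes that the SNR is defined from the mean power of the whole speech and noise segments; the function names are hypothetical, and the study's own level-setting procedure may differ in detail.

```python
import math


def mean_power(x):
    """Mean squared value of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def noise_gain_for_snr(speech, noise, snr_db):
    """Gain applied to the noise so that the speech-to-noise ratio equals snr_db."""
    return math.sqrt(mean_power(speech) / (mean_power(noise) * 10.0 ** (snr_db / 10.0)))


def mix_at_snr(speech, noise, snr_db):
    """Degraded signal: speech plus scaled noise, sample by sample."""
    g = noise_gain_for_snr(speech, noise, snr_db)
    return [s + g * n for s, n in zip(speech, noise)]


if __name__ == "__main__":
    speech = [1.0, -1.0, 1.0, -1.0]   # illustrative samples
    noise = [2.0, -2.0, 2.0, -2.0]
    for snr_db in (0.0, 5.0):
        g = noise_gain_for_snr(speech, noise, snr_db)
        realized = 10.0 * math.log10(mean_power(speech) / (g * g * mean_power(noise)))
        print(snr_db, round(g, 4), round(realized, 4))
```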

Equipment
For listener presentation, the digitally stored stimuli went through a digital-to-analog converter (TDT AP2, DD1), a 4000 Hz anti-aliasing filter (TDT FT3), an attenuator (TDT PA4), and a headphone buffer (TDT HB6). Finally, the stimuli were presented monaurally to the test ear of each listener through a TDH-49 earphone.

Presentation level
All stimuli were presented to normal-hearing listeners at an equalized RMS level of 60 dB SPL. Because listeners with hearing loss were not wearing hearing aids, the preprocessed stimuli were frequency-shaped through digital filtering to simulate amplification. Thus, the stimuli presented to the hearing-impaired subjects through headphones were an amplified version of the signal presented to the normal-hearing subjects, with the amplification approximating the linear gain prescribed by the NAL-R fitting procedure [15].

Speech quality ratings
The categorical rating scales used for the quality ratings are the same as those used by Neuman et al. [27] and are similar to those developed by Gabrielson et al. [28]. A 10-point rating scale was used to obtain ratings on five different stimulus attributes: clarity, pleasantness, background noise, loudness, and overall impression, with a rating of "0" being worst and a rating of "10" being best. Listeners used a written response form containing the five quality scales to record their ratings. For each condition, participants listened to a block of 30 of the 256 HINT sentences and then used the 10-point scales to rate the quality of the speech for each of the five attributes. The starting sentence for each block of 30 sentences was randomly selected such that on one block of trials the subject would listen to sentences 45 through 75, on the next block sentences 125 through 155 and so forth. A set of quality ratings consisted of ratings on each of the five attributes in each of the eight conditions. The order of the conditions in each set was randomized. Three sets of quality ratings were obtained. Each set took about 40 minutes to complete.

Intelligibility
Nonsense syllable test. The nonsense syllable test (NST) [26,29] is a closed-set test in which a listener hears a nonsense syllable and then chooses from a set of seven to nine response alternatives. The test consists of 102 syllables contained in 11 subtests, each of which contains between seven and nine syllables. The subtests differ in terms of voicing and position of consonants as well as the vowel. The order of presentation of the 102 nonsense syllables was randomized on each block of trials. The intelligibility session for each listener included one 102-syllable list in each condition, with the order of the conditions randomized within the set. The overall measure of performance is the percentage of correctly identified nonsense syllables.

Speech quality ratings
Speech quality ratings for each attribute were first averaged for the three trials for each listener. Ratings were then averaged across listeners in each group. Average ratings for the five attributes of quality for the normal-hearing listeners and hearing-impaired listeners are shown in Figure 6. A separate repeated measures analysis of variance (ANOVA) was done for each quality attribute for each of the listener groups. Listener groups were considered separately because the number of processing conditions differed between the two groups. The results of these statistical analyses are shown in Table 6. Enhancement with the GMMSE-AMT[ERB] technique resulted in significant benefit in quality ratings on several attributes in both subject groups. In normal-hearing listeners, enhancement resulted in significantly less noisy ratings, better clarity ratings, and better overall quality ratings. In hearing-impaired listeners, enhancement resulted in significantly better clarity ratings, significantly less noisy ratings, and significantly better overall quality ratings. In the hearing-impaired group, loudness ratings increased slightly (albeit significantly) in the enhanced conditions. Increasing SNR had a significant effect on four of the five rating scales in each listener group (NH: ratings of clarity, pleasantness, loudness, and overall quality; HI: ratings of clarity, background noise, loudness, and overall quality). Overall variability was greater in the HI group than in the NH group. In the normal-hearing group, noise type was a significant factor in quality ratings: LCR was consistently rated more favorably than FLN.
Figure 7: Intelligibility percent-correct scores on the nonsense syllable test for normal-hearing listeners (left) and for hearing-impaired listeners (right) for degraded (DEG) and enhanced (AMT(NH) and AMT(HI)) speech conditions.
In both listener groups, the (processing × SNR) interaction was significant for the background noise scale: stimuli enhanced with GMMSE-AMT[ERB] showed significantly larger changes (decreases) in ratings of noisiness in the 5 dB SNR condition.
Figure 7 shows NST scores (in percent correct) for degraded and enhanced conditions for both normal-hearing listeners (left) and hearing-impaired listeners (right). The NST percent-correct scores were first subjected to an arcsine transform [30] and then submitted to repeated measures ANOVAs. The ANOVA results are shown in Table 7. NST scores were better (20% on average) and less variable in the normal-hearing listeners than in the hearing-impaired listeners. In the normal-hearing group, the main effects of noise type and SNR were significant: intelligibility scores were better in the +5 dB SNR condition and for the LCR noise. In the hearing-impaired group, the only significant main effect was SNR. Enhancement did not significantly affect intelligibility scores in either group.
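An arcsine transform is commonly applied to percent-correct scores before an ANOVA because the variance of a proportion shrinks near 0% and 100%. The sketch below uses the plain variance-stabilizing form 2·arcsin(√p); this is an assumption for illustration, since the exact variant taken from [30] is not specified in this section, and the function name is hypothetical.

```python
import numpy as np

def arcsine_transform(percent_correct):
    """Map percent-correct scores (0-100) to 2*arcsin(sqrt(p)), in radians."""
    p = np.asarray(percent_correct, dtype=float) / 100.0
    p = np.clip(p, 0.0, 1.0)  # guard against rounding slightly outside [0, 1]
    return 2.0 * np.arcsin(np.sqrt(p))

# Illustrative scores only, not data from the study.
example_scores = np.array([0.0, 50.0, 100.0])
transformed = arcsine_transform(example_scores)
```

The transformed values span 0 to π, and equal steps in percent correct near the ends of the scale map to larger steps than the same change near 50%.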

DISCUSSION AND CONCLUSIONS
In this study, we have considered the problem of speech enhancement in diverse environmental conditions using a speech enhancement scheme that employs an auditory masking threshold (AMT) to balance the degree of noise suppression versus perceived processing artifacts. The goals of this study have been to (i) modify the suppression structure to incorporate the modified generalized minimum mean square error (GMMSE) estimators, and (ii) establish a working framework for speech enhancement which directly incorporates the hearing response of individual hearing-impaired listeners. This approach was motivated by the earlier study that resulted in the TMK algorithm [1], which showed a substantial level of intelligibility improvement as measured by the DRT (diagnostic rhyme test) for individuals with normal hearing. Motivated by this first demonstration of intelligibility improvement in the speech enhancement literature, we previously developed an approach which improved on the estimation of the AMT and evaluated the improved procedure using quality measures and formal DRT testing [9]. We saw that an approach that improves on the estimation of the AMT and integrates this into a generalized MMSE noise suppression algorithm [10,11] does improve quality, but the level of intelligibility improvement was only modest for normal-hearing individuals [9]. Even so, we feel that these prior studies served as an important foundation for developing improved noise suppression schemes for hearing-impaired persons and, in theory, should offer the potential to develop more effective automatic speech processing algorithms for digital hearing aids, which could improve both quality and intelligibility.
Table 6: Summary of the main effects (processing, noise, SNR) from the analysis of variance carried out for the five attributes of quality using HINT sentences for each listener group: * p < 0.05; ** p < 0.01; *** p < 0.001. F-values are also reported for significant interactions.
The present study has considered a revised formulation that is more suitable for hearing aid applications and incorporated the following processing phases: (i) a modified generalized minimum mean square error estimator (GMMSE) was employed, (ii) the frequency resolution of the cochlea was represented using the auditory filter equivalent rectangular bandwidths (ERBs) rather than the critical band scale, (iii) estimation of the auditory masking threshold and spreading functions for masking were adjusted to address the elevated thresholds and broader auditory filters that result from sensorineural hearing loss, and (iv) the current algorithm did not include the tonality offset developed for use in MPEG-4 audio coding applications, since it is based more on the harmonic structure of sounds associated with music. After developing the GMMSE-AMT[ERB] noise suppression scheme, we specialized the approach to normal-hearing and hearing-impaired listeners (i.e., NH and HI algorithm versions). The output level of the speech waveform was set to different levels for normal-hearing and hearing-impaired individuals. The algorithm was evaluated using large crowd room noise and flat communications channel noise at two separate SNRs. Using objective speech quality measures, the output SegSNR performance improved from 2.44 to 3.32 dB over the original degraded corpus. Using the Itakura-Saito objective quality measure, the level of distortion was measurably reduced from an initial degraded level of 2.38 to 4.23 down to 1.63 to 2.45, improvements ranging from 0.75 to 1.78.
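For readers unfamiliar with the segmental SNR (SegSNR) measure cited above, the following is a minimal sketch of one common formulation: the SNR is computed frame by frame against the clean reference, each frame value is limited to a fixed range, and the frame values are averaged. The 256-sample frame length, the [-10, 35] dB limits, and the function name are assumptions for illustration and may differ from the settings used in the study.

```python
import numpy as np

def segmental_snr(clean, processed, frame_len=256, floor_db=-10.0, ceil_db=35.0):
    """Mean of per-frame SNRs (dB) between a clean reference and a processed signal."""
    clean = np.asarray(clean, dtype=float)
    processed = np.asarray(processed, dtype=float)
    n_frames = min(len(clean), len(processed)) // frame_len
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")
    eps = 1e-12  # avoids division by zero and log of zero in silent frames
    frame_snrs = []
    for i in range(n_frames):
        start, stop = i * frame_len, (i + 1) * frame_len
        s = clean[start:stop]
        err = s - processed[start:stop]
        snr_db = 10.0 * np.log10((np.sum(s ** 2) + eps) / (np.sum(err ** 2) + eps))
        # Limit each frame so that silent or near-perfect frames do not dominate the mean.
        frame_snrs.append(min(max(snr_db, floor_db), ceil_db))
    return float(np.mean(frame_snrs))

# Synthetic check: an error equal to 10% of the signal amplitude gives 20 dB in every frame.
clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000.0)
processed = 1.1 * clean
seg_snr = segmental_snr(clean, processed)
```

Because the per-frame values are clipped before averaging, SegSNR weights low-energy frames more evenly than an overall SNR computed across the whole utterance.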
Within the acoustic phoneme space, this improvement came primarily in nasals, vowels, diphthongs, and semi-vowels, with unchanged performance for stops and fricatives. Next, formal listener evaluations using 6 normal-hearing and 10 hearing-impaired individuals were performed for quality using HINT sentences and for intelligibility using the CUNY nonsense syllable test. For subjective quality tests, a measurable level of speech quality improvement and background noise reduction was obtained with GMMSE-AMT[ERB-NH] for NH and HI listeners. The GMMSE-AMT[ERB-HI] version of the enhancement algorithm also showed quality improvement over the original degraded materials. However, results with GMMSE-AMT[ERB-HI] and GMMSE-AMT[ERB-NH] were similar. Customization of the AMT did not show significant advantages over the uncustomized (default NH version) method in listener ratings of quality.
Formal intelligibility evaluations using NST materials showed either a slight improvement, no change, or a slight reduction across the four noise conditions for the GMMSE-AMT[ERB-HI] and GMMSE-AMT[ERB-NH] algorithm configurations. This is in stark contrast to the level of intelligibility improvement reported in [1] for normal-hearing individuals. As addressed in [9], possible reasons for the discrepancies between [1] and our work include (i) differences in sampling rate/bandwidth, (ii) use of a voice activity detector with noise spectral update in [1] versus a single initial noise estimate for our studies, (iii) differences in linguistic backgrounds (Greek versus English) of listeners, and (iv) procedures used for listener evaluations. Finally, while the present study established a framework for customization, the customized implementation was not significantly better for hearing-impaired listeners. In the present formulation, two steps are crucial for speech enhancement: first, the particular method for estimating the AMT, and second, the particular method used to perform the noise suppression given the AMT. Given the results from the present study, it is natural to ask (i) whether the noise suppression was not capable of taking full advantage of the customization for individual hearing responses, (ii) whether there remains an error in how the AMT estimation is performed for HI listeners, and (iii) whether there is additional knowledge or information, either separate from or in addition to the AMT, needed to perform effective customized noise suppression for HI listeners.
In future studies, it would be useful to consider these three issues. In addition, we maintained a single noise spectral estimate across each speech sentence; engaging a voice activity detector to update the noise estimate, as well as α in the GMMSE enhancement scheme, could improve performance. We believe that it would be possible to incorporate a codebook-based AMT scheme such as that in [31] for individuals with cochlear hearing loss. Such an approach would require extensive modeling of the particular type of hearing loss for each listener, and incorporation of this bias into the AMT codebook entry selection process.