EURASIP Journal on Applied Signal Processing 2005:9, 1305–1322 © 2005 Jeroen Breebaart et al.

Parametric Coding of Stereo Audio

Parametric-stereo coding is a technique to efficiently code a stereo audio signal as a monaural signal plus a small amount of parametric overhead to describe the stereo image. The stereo properties are analyzed, encoded, and reinstated in a decoder according to spatial psychoacoustical principles. The monaural signal can be encoded using any (conventional) audio coder. Experiments show that the parameterized description of spatial properties enables a highly efficient, high-quality stereo audio representation.


INTRODUCTION
Efficient coding of wideband audio has gained considerable interest during the last decades. With the increasing popularity of mobile applications, the Internet, and wireless communication protocols, the demand for more efficient coding systems continues to grow. A large variety of coding strategies and algorithms has been proposed, and several of them have been incorporated in international standards [1,2]. These coding strategies reduce the required bit rate by exploiting two main principles. The first principle is the fact that signals may exhibit redundant information. A signal may be partly predictable from its past, or the signal can be described more efficiently using a suitable set of signal functions. For example, a single sinusoid can be described by its successive time-domain samples, but a more efficient description would be to transmit its amplitude, frequency, and starting phase. This source of bit-rate reduction is often referred to as "signal redundancy." The second principle (or source) for bit-rate reduction is the exploitation of "perceptual irrelevancy." Signal properties that are irrelevant from a perceptual point of view can be discarded without a loss in perceptual quality. In particular, a significant amount of bit-rate reduction in current state-of-the-art audio coders is obtained by exploiting auditory masking.
Basically, two different coding approaches can be distinguished that aim at bit-rate reduction. The first approach, often referred to as "waveform coding," describes the actual waveform (in frequency subbands or transform-based) with a limited (sample) accuracy. By ensuring that the quantization noise that is inherently introduced is kept below the masking curve (both across time and frequency), the concept of auditory masking (e.g., perceptual intrachannel irrelevancy) is effectively exploited.
The second coding approach relies on parametric descriptions of the audio signal. Such methods decompose the audio signal into several "objects," such as transients, sinusoids, and noise (cf. [3,4]). Each object is subsequently parameterized and its parameters are transmitted. The decoder at the receiving end resynthesizes the objects according to the transmitted parameters. Although it is difficult to obtain transparent audio quality using such coding methods, parametric coders often perform better than waveform or transform coders (i.e., with a higher perceptual quality) at extremely low bit rates (typically up to about 32 kbps).
Recently, hybrid forms of waveform coders and parametric coders have been developed. For example, spectral band replication (SBR) techniques are proposed as a parametric coding extension for high-frequency content combined with a waveform or transform coder operating at a limited bandwidth [5,6]. These techniques reduce the bit rate of waveform or transform coders by reducing the signal bandwidth that is sent to the encoder, combined with a small amount of parametric overhead. This parametric overhead describes how the high-frequency part, which is not encoded by the waveform coder, can be resynthesized from the low-frequency part.
The techniques described up to this point aim at encoding a single audio channel. In the case of a multichannel signal, these methods have to be performed for each channel individually. Therefore, adding more independent audio channels will result in a linear increase of the total required bit rate. It is often suggested that for multichannel material, cross-channel redundancies can be exploited to increase the coding efficiency. A technique referred to as "mid-side coding" exploits the common part of a stereophonic input signal by encoding the sum and difference signals of the two input signals rather than the input signals themselves [7]. If the two input signals are sufficiently correlated, sum/difference coding requires fewer bits than dual-mono coding. However, some investigations have suggested that the amount of mutual information in the signals for such a transform is rather low [8].
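To make the principle concrete, the following minimal Python sketch shows the (trivially invertible) mid-side transform; the function names are illustrative and not taken from any particular codec.

```python
import numpy as np

def ms_encode(left, right):
    """Mid-side transform: code the sum (mid) and difference (side) signals.
    For highly correlated channels the side signal has low power and is
    therefore cheap to code; the transform is trivially invertible."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side

def ms_decode(mid, side):
    # Exact inverse: (mid + side, mid - side) recovers (left, right).
    return mid + side, mid - side
```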
One possible explanation for this finding is related to the (limited) signal model. To be more specific, the crosscorrelation coefficient (or the value of the cross-correlation function at lag zero) of the two input signals must be significantly different from zero in order to obtain a bit-rate reduction. If the two input signals are (nearly) identical but have a relative time delay, the cross-correlation coefficient will (in general) be very low, despite the fact that there exists significant signal redundancy between the input signals. Such a relative time delay may result from the usage of a stereo microphone setup during the recording stage or may result from effect processors that apply (relative) delays to the input signals. In this case, the cross-correlation function shows a clear maximum at a certain nonzero delay. The maximum value of the cross-correlation as a function of the relative delay is also known as "coherence." Coherent signals can in principle be modeled using more advanced signal models, for example, using cross-channel prediction schemes. However, studies indicate only limited success in exploiting coherence using such techniques [9,10]. These results indicate that exploiting cross-channel redundancies, even if the signal model is able to capture relative time delays, does not lead to a large coding gain.
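The distinction between the lag-zero cross-correlation coefficient and the coherence can be illustrated with a small sketch that searches the normalized cross-correlation function over candidate lags; the function name and search range are illustrative assumptions.

```python
import numpy as np

def coherence(x1, x2, max_lag):
    """Maximum of the normalized cross-correlation function over lags.

    For two identical signals with a relative time delay, the lag-zero
    coefficient is near zero, while this maximum (the coherence) is near 1.
    """
    norm = np.sqrt(np.dot(x1, x1) * np.dot(x2, x2))
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            r = np.dot(x1[lag:], x2[:len(x2) - lag])  # x1 delayed vs x2
        else:
            r = np.dot(x1[:lag], x2[-lag:])           # x2 delayed vs x1
        best = max(best, abs(r) / norm)
    return best
```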
The second source for bit-rate reduction in multichannel audio relates to cross-channel perceptual irrelevancies. For example, it is well known that for high frequencies (typically above 2 kHz), the human auditory system is not sensitive to fine-structure phase differences between the left and right signals in a stereo recording [11,12]. This phenomenon is exploited by a technique referred to as "intensity stereo" [13,14]. Using this technique, a single audio signal is transmitted for the high-frequency range, combined with time- and frequency-dependent scale factors to encode level differences. More recently, so-called binaural-cue coding (BCC) schemes have been described that initially aimed at modeling the most relevant sound-source localization cues [15,16,17], while discarding other spatial attributes such as the ambiance level and room size. BCC schemes can be seen as an extension of intensity stereo in terms of bandwidth and parameters. For the full frequency range, only a single audio channel is transmitted, combined with time- and frequency-dependent differences in level and arrival time between the input channels. Although the BCC schemes are able to capture the majority of the sound localization cues, they suffer from narrowing of the stereo image and spatial instabilities [18,19], suggesting that these techniques are mostly advantageous at low bit rates [20]. A solution that was suggested to reduce the narrowing-stereo-image artifact is to transmit the interchannel coherence as a third parameter [4]. Informal listening results in [21,22] claim improvements in spatial image width and stability.
In this paper, a parametric description of the spatial sound field will be presented which is based on the three spatial properties described above (i.e., level differences, time differences, and the coherence). The analysis, encoding, and synthesis of these parameters are largely based on binaural psychoacoustics. The amount of spatial information is extracted and parameterized in a scalable fashion. At low parameter rates (typically in the order of 1 to 3 kbps), the coder is able to represent the spatial sound field in an extremely compact way. It will be shown that this configuration is very suitable for low-bit-rate audio coding applications. It will also be demonstrated that, in contrast to statements on BCC schemes [20,21], if the spatial-parameter bit rate is increased to about 8 kbps, the underlying spatial model is able to encode and recreate a spatial image with a subjective quality equivalent to that of current high-quality stereo audio coders (such as MPEG-1 layer 3 at a bit rate of 128 kbps). A comparison of the coding scheme proposed here with BCC schemes reveals (at least) three important differences that all contribute to quality improvements: (1) dynamic window switching (see Section 5.1); (2) different methods of decorrelation synthesis (see Section 6); (3) the necessity of encoding interchannel time or phase differences, even for loudspeaker playback conditions (see Section 3.1).
Finally, the bit-rate scalability options and the fact that a high-quality stereo image can be obtained enable integration of parametric stereo in state-of-the-art transform-based [23,24] and parametric [4] mono audio coders for a wide quality/bit-rate range.
The paper outline is as follows. First, the psychoacoustic background of the parametric-stereo coder is discussed. Section 4 discusses the general structure of the coder. In Section 5, an FFT-based encoder is described. In Section 6, an FFT-based decoder is outlined. In Section 7, an alternative decoder based on a filter bank is given. In Section 8, results from listening tests are discussed, followed by a concluding section.

PSYCHOACOUSTIC BACKGROUND
In 1907, Lord Rayleigh formulated the duplex theory [25], which states that sound-source localization is facilitated by interaural intensity differences (IIDs) at high frequencies and by interaural time differences (ITDs) at low frequencies. This theory was (in part) based on the observation that at low frequencies, IIDs between the eardrums do not occur due to the fact that the signal wavelength is much larger than the size of the head, and hence the acoustical shadow of the head is virtually absent. According to Lord Rayleigh, this had the consequence that human listeners can only use ITD cues for sound-source localization at low frequencies. Since then, a large amount of research has been conducted to investigate the human sensitivity to both IIDs and ITDs as a function of various stimulus parameters. One of the striking findings is that although it seems that IID cues are virtually absent at low frequencies for free-field listening conditions, humans are nevertheless very sensitive to IID and ITD cues at low and high frequencies. Stimuli with specified, frequency-independent values of the ITD and IID can be presented over headphones, resulting in a lateralization of the sound source which depends on the magnitude of the ITD as well as the IID [26,27,28]. The usual result of such laboratory headphone-based experiments is that the source images are located inside the head and are lateralized along the axis connecting the left and the right ears. The reason that these stimuli are not perceived as externalized is that the single frequency-independent IID or ITD is a poor representation of the acoustic signals at the listener's eardrums in free-field listening conditions. The waveforms of sounds are filtered by the acoustical transmission path between the source and the listener's eardrums, which includes room reflections and pinna filtering, resulting in an intricate frequency dependence of the ITD and IID [29]. Moreover, if multiple sound sources with different spectral properties exist at different spatial locations, the spatial cues of the signals arriving at the eardrums will show a frequency dependence which is even more complex because they are constituted by (weighted) combinations of the spatial cues of the individual sound sources.
To be more specific, there is considerable evidence that the binaural auditory system renders its binaural cues in a set of frequency bands, without having the possibility to acquire these properties at a finer frequency resolution. This spectral resolution of the binaural auditory system can be described by a filter bank with filter bandwidths that follow the ERB (equivalent rectangular bandwidth) scale [38,39,40].
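As an illustration, the sketch below derives ERB-spaced band edges using the common Glasberg-Moore approximation of the ERB-rate scale; the paper's exact band grouping is not specified at this point, so the function and its defaults (34 bands, which matches the coder configuration described later) are assumptions.

```python
import numpy as np

def erb_band_edges(f_max=22050.0, n_bands=34):
    """Frequency band edges spaced uniformly on the ERB-rate scale.

    Uses the Glasberg & Moore approximation
    ERB-rate(f) = 21.4 * log10(0.00437 * f + 1), with f in Hz.
    """
    def erb_rate(f):
        return 21.4 * np.log10(0.00437 * f + 1.0)

    def inv_erb_rate(e):
        return (10.0 ** (e / 21.4) - 1.0) / 0.00437

    # Uniform steps on the ERB-rate axis map to nonuniform edges in Hz.
    edges = np.linspace(0.0, erb_rate(f_max), n_bands + 1)
    return inv_erb_rate(edges)
```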
The limited temporal resolution at which the auditory system can track binaural localization cues is often referred to as "binaural sluggishness," and the associated time constants are between 30 and 100 milliseconds [32,41]. Although the auditory system is not able to follow IIDs and ITDs that vary quickly over time, this does not mean that listeners are not able to detect the presence of quickly varying cues. Slowly varying IIDs and/or ITDs result in a movement of the perceived sound-source location, while fast changes in binaural cues lead to a percept of "spatial diffuseness," or a reduced "compactness" [42]. Despite the fact that the perceived "quality" of the presented stimulus depends on the movement speed of the binaural cues, it has been shown that the detectability of IIDs and ITDs is practically independent of the variation speed [43]. The sensitivity of human listeners to time-varying changes in binaural cues can be described by sensitivity to changes in the maximum of the cross-correlation function (e.g., the coherence) of the incoming waveforms [44,45,46,47]. There is considerable evidence that the sensitivity to changes in the coherence is the basis of the phenomenon of the binaural masking level difference (BMLD) [48,49]. Moreover, the sensitivity to quasistatic ITDs can also be described by (changes in) the cross-correlation function [35,36,50].
Recently, it has been demonstrated that the concept of "spatial diffuseness" mostly depends on the coherence value itself and is relatively unaffected by the temporal finestructure details of the coherence within the temporal integration time of the binaural auditory system. For example, van de Par et al. [51] measured the detectability and discriminability of interaurally out-of-phase test signals presented in an interaurally in-phase masker. The subjects were perfectly able to detect the presence of the out-of-phase test signal, but they had great difficulty in discriminating different test signal types (i.e., noise versus harmonic tone complexes).
Besides the limited spectral and temporal resolution that seems to underlie the extraction of spatial sound-field properties, it has also been shown that the auditory system exhibits a limited spatial resolution. The spatial parameters have to change by a certain minimum amount before subjects are able to detect the change. For IIDs, the resolution is between 0.5 and 1 dB for a reference IID of 0 dB and is relatively independent of frequency and stimulus level [52,53,54,55]. If the reference IID increases, IID thresholds increase also. For reference IIDs of 9 dB, the IID threshold is about 1.2 dB, and for a reference IID of 15 dB, the IID threshold amounts to between 1.5 and 2 dB [56,57,58].
The sensitivity to changes in ITDs strongly depends on frequency. For frequencies below 1000 Hz, this sensitivity can be described as a constant interaural phase difference (IPD) sensitivity of about 0.05 rad [11,53,59,60]. The reference ITD has some effect on the ITD thresholds: large ITDs in the reference condition tend to decrease sensitivity to changes in the ITDs [52,61]. There is almost no effect of stimulus level on ITD sensitivity [12]. At higher frequencies, the binaural auditory system is not able to detect time differences in the fine-structure waveforms. However, time differences in the envelopes can be detected quite accurately [62,63]. Despite this high-frequency sensitivity, ITD-based sound-source localization is dominated by low-frequency cues [64,65].
The sensitivity to changes in the coherence strongly depends on the reference coherence. For a reference coherence of +1, changes of about 0.002 can be perceived, while for a reference coherence around 0, the change in coherence must be about 100 times larger to be perceptible [66,67,68,69]. The sensitivity to interaural coherence is practically independent of stimulus level, as long as the stimulus is sufficiently above the absolute threshold [70]. At high frequencies, the envelope coherence seems to be the relevant descriptor of the spatial diffuseness [47,71].
The threshold values described above are typical for spatial properties that exist during a prolonged time (i.e., 300 to 400 milliseconds). If the duration is smaller, thresholds generally increase. For example, if the duration of the IID and ITD in a stimulus is decreased from 310 to 17 milliseconds, the thresholds may increase by up to a factor of 4 [72]. Interaural coherence sensitivity also strongly depends on the duration [73,74,75]. It is often assumed that the increased sensitivity for longer durations results from temporal integration properties of the auditory system. There is, however, one important exception in which the auditory system does not seem to integrate spatial information across time. In reverberant rooms, the perceived location of a sound source is dominated by the first 2 milliseconds of the onset of the sound source, while the remaining signal is largely discarded in terms of spatial cues. This phenomenon is referred to as "the law of the first wavefront" or "precedence effect" [76,77,78,79].
In summary, it seems that the auditory system performs a frequency separation and temporal averaging process in its determination of IIDs, ITDs, and the coherence. This estimation process leads to the concept of a certain soundsource location as a function of frequency and time, while the variability of the localization cues leads to a certain degree of "diffuseness," or spatial "widening," with hardly any interaction between diffuseness and location [72]. Furthermore, these cues are rendered with a limited (spatial) resolution. These observations form the basis of the parametric stereo coder as described in the following sections. The general idea is to encode all (monaurally) relevant sound sources using a single audio channel, combined with a parameterization of the spatial sound stage. The parameterized sound stage consists of IID, ITD, and coherence parameters as a function of frequency and time. The update rate, frequency resolution, and quantization of these parameters is determined by the human sensitivity to (changes in) these parameters.

Headphones versus loudspeaker rendering
The psychoacoustic background as discussed in Section 2 is based on spatial cues at the level of the listener's eardrums. In the case of headphone rendering, the spatial cues which are presented to the human hearing system (i.e., the interaural cues ILD, ITD, and coherence) are virtually the same as the spatial cues in the original stereo signal (interchannel cues). For loudspeaker playback, however, the complex acoustical transmission paths between loudspeakers and eardrums (as described in Section 2) may cause significant changes in the spatial cues. It is therefore highly unlikely that the spatial cues of the original stereo signal (e.g., the interchannel cues) and the spatial cues at the level of the listener's eardrums (interaural cues) are even comparable in the case of loudspeaker playback. In fact, it has been suggested that the acoustical transmission path effectively converts certain spatial cues (for example, interchannel intensity differences) to other cues at the level of the eardrums (e.g., interaural time differences) [80,81]. However, this effect of the transmission path is not necessarily problematic for parametric-stereo coding. As long as the interaural cues are the same for original material and material which has been processed by a parametric-stereo coder, the listener should have a similar percept of the spatial sound field. Although a detailed analysis of this problem is beyond the scope of this paper, we state that given certain restrictions on the acoustical transmission path, it can be shown that the interaural spatial cues are indeed comparable for original and decoded signal, provided that all three interchannel parameters are encoded and reconstructed correctly. Moreover, well-known algorithms that aim at widening of the perceived sound stage for loudspeaker playback (so-called crosstalk-cancellation algorithms, which are used frequently in commercial recordings) heavily rely on correct interchannel phase relationships (cf. [82]). These observations are in contrast to statements by others (cf. [18,21,22]) that interchannel time or phase differences are irrelevant for loudspeaker playback.
Supported by the observations given above, we will refer to ILD, ITD, and coherence as interchannel parameters. If all three interchannel parameters are reconstructed correctly, we assume that the interaural parameters of original and decoded signals are very similar as well (but different from the interchannel parameters).

Mono coding effects
As discussed in Section 1, bit-rate reduction in conventional lossy audio coders is obtained predominantly by exploiting the phenomenon of masking. Therefore, lossy audio coders rely on accurate and reliable masking models, which are often applied to individual channel signals in the case of a stereo or multichannel signal. For a parametric-stereo extended audio coder, however, the masking model is applied only once on a certain combination of the two input signals. This scheme has two implications with respect to masking phenomena.
The first implication relates to spatial unmasking of quantization noise.

Figure 1: Structure of the parametric-stereo encoder. The two input signals are first processed by a parameter extraction and downmix stage. The parameters are subsequently quantized and encoded, while the mono downmix can be encoded using an arbitrary mono audio coder. The mono bit stream and spatial parameters are subsequently combined into a single output bit stream.

In stereo waveform or transform coders, individual quantizers are applied to the two input signals or to linear combinations of the input signals. As a consequence, the injected quantization noise may exhibit different spatial properties than the audio signal itself. Due to binaural unmasking, the quantization noise may thus become audible, even if it is inaudible when presented monaurally. For tonal material, this unmasking effect (or BMLD, quantified as the threshold difference between a binaural condition and a monaural reference condition) has been shown to be relatively small (about 3 dB, see [83,84]). However, we expect that for broadband maskers, the unmasking effect is much more prominent. If one assumes an interaurally in-phase noise as a masker, and a quantization noise which is either interaurally in-phase or interaurally uncorrelated, BMLDs of 6 dB are reported [85]. More recent data revealed BMLDs of 13 dB for this condition, based on a sensitivity to changes in the correlation of 0.045 [86]. To prevent these spatial unmasking effects of quantization noise, conventional stereo coders often apply some sort of spatial unmasking protection algorithm. For a parametric-stereo coder, on the other hand, there is only one waveform or transform quantizer, working on the mono (downmix) signal. In the stereo reconstruction phase, both the quantization noise and the audio signal present in each frequency band will obey the same spatial properties. Since a difference in spatial characteristics of quantization noise and audio signal is a prerequisite for spatial unmasking, this effect is less likely to occur for parametric-stereo enhanced coders than for conventional stereo coders.

CODER IMPLEMENTATION
The generic structure of the parametric-stereo encoder is shown in Figure 1. The two input channels are fed to a stage that extracts spatial parameters and generates a mono downmix of the two input channels. The spatial parameters are subsequently quantized and encoded, while the mono downmix is encoded using an arbitrary mono audio coder. The resulting mono bit stream is combined with the encoded spatial parameters to form the output bit stream.
The parametric-stereo decoder basically performs the reverse process, as shown in Figure 2. The spatial parameters are separated from the incoming bit stream and decoded.

Figure 2: Structure of the parametric-stereo decoder. The demultiplexer splits mono and spatial parameter information. The mono audio signal is decoded and fed into the spatial synthesis stage, which reinstates the spatial cues based on the decoded spatial parameters.
The mono bit stream is decoded using a mono audio decoder. The decoded audio signal is fed into the spatial synthesis stage, which reinstates the spatial image, resulting in a two-channel output.
Since the spatial parameters are estimated (at the encoder side) and applied (at the decoder side) as a function of time and frequency, both the encoder and decoder require a transform or filter bank that generates individual time/frequency tiles. The frequency resolution of this stage should be nonuniform, according to the frequency resolution of the human auditory system. Furthermore, the temporal resolution should generally be fairly low (in the order of tens of milliseconds), reflecting the concept of binaural sluggishness, except in the case of transients, where the precedence effect dictates a time resolution of only a few milliseconds. Furthermore, the transform or filter bank should be oversampled, since time- and frequency-dependent changes will be made to the signals, which would lead to audible aliasing distortion in a critically-sampled system. Finally, a complex-valued transform or filter bank is preferred to enable easy estimation and modification of (cross-channel) phase- or time-difference information. A process that meets these requirements is a variable segmentation process with temporally overlapping segments, followed by forward and inverse FFTs. Complex-modulated filter banks can be employed as a low-complexity alternative [23,24].
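A minimal Python sketch of such an oversampled, complex-valued analysis/synthesis chain is given below, using fixed square-root-Hann windows with 50% overlap; the actual coder additionally uses dynamic window switching and a nonuniform frequency grouping, which are omitted here.

```python
import numpy as np

def stft_frames(x, n=2048, hop=1024):
    """Split x into 50%-overlapping, sqrt-Hann-windowed FFT frames.

    The 2x oversampling (hop = n/2) leaves headroom for the per-band
    modifications applied later without audible aliasing.
    """
    w = np.sqrt(np.hanning(n + 1)[:n])   # periodic sqrt-Hann window
    pads = np.concatenate([x, np.zeros(n)])
    frames = [np.fft.fft(w * pads[i:i + n]) for i in range(0, len(x), hop)]
    return frames, w

def overlap_add(frames, w, hop=1024):
    """Inverse FFT each frame, apply the synthesis window, overlap-add.

    Analysis and synthesis sqrt-Hann windows multiply to a Hann window,
    which sums to unity at 50% overlap (perfect reconstruction).
    """
    n = len(w)
    y = np.zeros(hop * (len(frames) - 1) + n)
    for q, F in enumerate(frames):
        y[q * hop:q * hop + n] += w * np.fft.ifft(F).real
    return y
```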

FFT-BASED ENCODER
The spatial analysis and downmix stage of the encoder is shown in more detail in Figure 3. The two input signals are first segmented by an analysis windowing process. Subsequently, each windowed segment is transformed to the frequency domain using a fast Fourier transform (FFT). The transformed segments are used to extract spatial parameters and to generate a mono downmix signal. The mono signal is transformed to the time domain using an inverse FFT, followed by synthesis windowing and overlap-add (OLA).

Segmentation
The encoder receives a stereo input signal pair x1[n], x2[n], which is processed using overlapping frames of total length N with a (fixed) hop size of N_h samples. If no transients are detected, the analysis window length and the window hop size (or parameter update rate) should match the lower bound of the measured time constants of the binaural auditory system. In the following, a parameter update interval of approximately 23 milliseconds is used. Each segment is windowed using overlapping analysis windows and subsequently transformed to the frequency domain using an FFT. Dynamic window switching is used in the case of transients. The purpose of window switching is twofold: firstly, to account for the precedence effect, which dictates that only the first 2 milliseconds of a transient in a reverberant environment determine its perceived location; secondly, to prevent pre-echoes resulting from the frequency-dependent processing which is applied in otherwise relatively long segments. The window switching procedure, the essence of which is demonstrated in Figure 4, is controlled by a transient detector. If a transient is detected at a certain temporal position, a stop window of variable length is applied which ends just before the transient. The transient itself is captured using a very short window (in the order of a few milliseconds). A start window of variable length is subsequently applied to ensure segmentation at the same temporal grid as before the transient. One plausible realization of this segmentation logic is sketched below.
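In the sketch referenced above, all lengths and the grid logic are illustrative assumptions, and degenerate cases (e.g., transients closer together than one frame) are ignored for brevity.

```python
def segment_boundaries(n_total, hop, transients, short=128):
    """Illustrative transient-driven segmentation (all lengths assumed).

    Regular frames of length `hop` are emitted until a transient at
    sample t: a variable-length stop window ends just before it, a very
    short window covers the transient, and a variable-length start
    window re-joins the regular temporal grid.
    """
    cuts, pos = [], 0
    for t in sorted(transients):
        while pos + hop <= t - short:
            cuts.append((pos, pos + hop))     # regular frames
            pos += hop
        cuts.append((pos, t))                 # stop window, ends at transient
        cuts.append((t, t + short))           # short window over the transient
        nxt = ((t + short) // hop + 1) * hop  # next point on the regular grid
        cuts.append((t + short, nxt))         # start window back to the grid
        pos = nxt
    while pos + hop <= n_total:
        cuts.append((pos, pos + hop))
        pos += hop
    return cuts
```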

Parameter extraction
For each frequency band b, three spatial parameters are computed. The first parameter is the interchannel intensity difference (IID[b]), defined as the logarithm of the power ratio of corresponding subbands from the input signals:

$$\mathrm{IID}[b] = 10\log_{10}\frac{\sum_{k=k_b}^{k_{b+1}-1} X_1[k]X_1^*[k]}{\sum_{k=k_b}^{k_{b+1}-1} X_2[k]X_2^*[k]},$$

where * denotes complex conjugation and k_b the first FFT bin of analysis band b. The second parameter is the relative phase rotation. The phase rotation aims at optimal (in terms of correlation) phase alignment between the two signals. This parameter is denoted by the interchannel phase difference (IPD[b]) and is obtained from the cross-spectrum as follows:

$$\mathrm{IPD}[b] = \angle\left(\sum_{k=k_b}^{k_{b+1}-1} X_1[k]X_2^*[k]\right).$$

Using the IPD as specified above, (relative) delays between the input signals are represented as a constant phase difference in each analysis frequency band, and hence result in a fractional delay. Thus, within each analysis band, the constant slope of phase with frequency is modeled by a constant phase difference per band, which is a somewhat limited model for the delay. On the other hand, constant phase differences across the input signals are described accurately, which is in turn not possible if an ITD parameter (i.e., a parameterized slope of phase with frequency) were used. An advantage of using IPDs over ITDs is that the estimation of ITDs requires accurate unwrapping of bin-by-bin phase differences within each analysis frequency band, which can be prone to errors. Thus, usage of IPDs circumvents this potential problem at the cost of a possibly limited model for ITDs. The third parameter is the interchannel coherence (IC[b]), which is, in our context, defined as the normalized cross-correlation coefficient after phase alignment according to the IPD. The coherence is derived from the cross-spectrum in the following way:

$$\mathrm{IC}[b] = \frac{\left|\sum_{k=k_b}^{k_{b+1}-1} X_1[k]X_2^*[k]\right|}{\sqrt{\left(\sum_{k=k_b}^{k_{b+1}-1} X_1[k]X_1^*[k]\right)\left(\sum_{k=k_b}^{k_{b+1}-1} X_2[k]X_2^*[k]\right)}}.$$
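A compact numpy sketch of this per-band parameter extraction follows; `band_edges` is a hypothetical list of FFT-bin boundaries k_b (e.g., an ERB-spaced grouping), and no care is taken for empty or silent bands.

```python
import numpy as np

def spatial_params(X1, X2, band_edges):
    """Estimate IID, IPD, and IC per analysis band from FFT spectra X1, X2.

    Implements the three per-band definitions given above: log power
    ratio, angle of the summed cross-spectrum, and magnitude of the
    normalized cross-spectrum (coherence after phase alignment).
    """
    iid, ipd, ic = [], [], []
    for k0, k1 in zip(band_edges[:-1], band_edges[1:]):
        p1 = np.sum(np.abs(X1[k0:k1]) ** 2)             # band power, channel 1
        p2 = np.sum(np.abs(X2[k0:k1]) ** 2)             # band power, channel 2
        cross = np.sum(X1[k0:k1] * np.conj(X2[k0:k1]))  # band cross-spectrum
        iid.append(10.0 * np.log10(p1 / p2))
        ipd.append(np.angle(cross))
        ic.append(np.abs(cross) / np.sqrt(p1 * p2))
    return iid, ipd, ic
```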

Downmix
A suitable mono signal S[k] is obtained by a linear combination of the input signals X_1[k] and X_2[k]:

$$S[k] = w_1 X_1[k] + w_2 X_2[k],$$

where w_1 and w_2 are weights that determine the relative amount of X_1 and X_2 in the mono output signal. For example, if w_1 = w_2 = 0.5, the output will consist of the average of the two input signals. A downmix that is created using fixed weights, however, bears the risk that the power of the downmix signal strongly depends on the cross-correlation of the two input signals. To circumvent signal loss and signal coloration due to time- and frequency-dependent cross-correlations, the weights w_1 and w_2 are (1) complex-valued, to prevent phase cancellation, and (2) varying in magnitude, to ensure overall power preservation. Specific details of the downmix procedure are, however, beyond the scope of this paper.
After the mono signal is generated, the last parameter that has to be extracted is computed. The IPD parameter as described above specifies the relative phase difference between the two channels of the stereo input signal (at the encoder) and hence of the stereo output signals (at the decoder). Hence the IPD does not indicate how the decoder should distribute these phase differences across the output channels. In other words, an IPD parameter alone does not indicate whether the first signal is lagging the second signal, or vice versa. Thus, it is generally impossible to reconstruct the absolute phase for the stereo signal pair using only the relative phase difference. Absolute phase reconstruction is required to prevent signal cancellation in the applied overlap-add procedure in both the encoder as well as the decoder (see below). To signal the actual distribution of phase modifications, an overall phase difference (OPD) is computed and transmitted. To be more specific, the decoder applies a phase modification equal to the OPD to compute the first output signal, and a phase modification of the OPD minus the IPD to obtain the second output signal. Given this specification, the OPD is computed as the average phase difference between X_1[k] and S[k], following

$$\mathrm{OPD}[b] = \angle\left(\sum_{k=k_b}^{k_{b+1}-1} X_1[k]S^*[k]\right).$$

Subsequently, the mono signal S[k] is transformed to the time domain using an inverse FFT. Finally, a synthesis window is applied to each segment, followed by overlap-add, resulting in the desired mono output signal.
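The sketch below combines a downmix and the OPD estimate. Since the exact downmix weights are beyond the scope of the paper, the per-bin power-matching gain shown here is only one plausible choice, and the OPD is computed over the whole spectrum for brevity rather than per analysis band.

```python
import numpy as np

def downmix_and_opd(X1, X2):
    """Sketch of a power-preserving downmix plus OPD estimation.

    The channels are averaged, then rescaled per bin so the downmix
    power equals the mean input power, which counteracts signal loss
    when the channels (partially) cancel. This weighting is an
    assumption, not the paper's exact procedure.
    """
    s = 0.5 * (X1 + X2)                          # fixed-weight downmix
    target = 0.5 * (np.abs(X1) ** 2 + np.abs(X2) ** 2)
    gain = np.sqrt(target / np.maximum(np.abs(s) ** 2, 1e-12))
    S = gain * s                                 # power-preserving mono signal
    opd = np.angle(np.sum(X1 * np.conj(S)))      # average phase of X1 vs S
    return S, opd
```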

Parameter quantization and coding
The IID, IPD, OPD, and IC parameters are quantized according to perceptual criteria. The quantization process aims at introducing quantization errors which are just inaudible.
The IID index for subband b, IDX_IID[b], is then equal to the index of the nearest entry in the repertoire IIDs of quantized IID values:

$$\mathrm{IDX}_{\mathrm{IID}}[b] = \arg\min_i\bigl|\mathrm{IID}[b] - \mathrm{IIDs}(i)\bigr|.$$

For the IPD parameter, the vector IPDs represents the available quantized IPD values, which are uniformly distributed on the phase circle:

$$\mathrm{IPDs} = \left\{0, \frac{\pi}{4}, \frac{\pi}{2}, \ldots, \frac{7\pi}{4}\right\}.$$

This repertoire is in line with the finding that the human sensitivity to changes in timing differences at low frequencies can be described by a constant phase-difference sensitivity. The IPD index for subband b, IDX_IPD[b], is given by

$$\mathrm{IDX}_{\mathrm{IPD}}[b] = \mathrm{mod}\left(\left\lfloor \frac{\mathrm{IPD}[b]\,\Lambda_{\mathrm{IPDs}}}{2\pi} + \frac{1}{2}\right\rfloor,\; \Lambda_{\mathrm{IPDs}}\right),$$

where mod(·) denotes the modulo operator, ⌊·⌋ the floor function, and Λ_IPDs the cardinality of the set of possible quantized IPD values (i.e., the number of elements in IPDs). The OPD is quantized using the same quantizer, resulting in IDX_OPD[b] according to

$$\mathrm{IDX}_{\mathrm{OPD}}[b] = \mathrm{mod}\left(\left\lfloor \frac{\mathrm{OPD}[b]\,\Lambda_{\mathrm{IPDs}}}{2\pi} + \frac{1}{2}\right\rfloor,\; \Lambda_{\mathrm{IPDs}}\right).$$

Finally, the repertoire for IC, represented in the vector ICs, is given by

$$\mathrm{ICs} = \{1,\; 0.937,\; 0.84118,\; 0.60092,\; 0.36764,\; 0,\; -0.589,\; -1\}.$$

This repertoire is based on just-noticeable differences in correlation reported by [69]. The coherence index IDX_IC[b] for subband b is determined by the nearest entry of ICs:

$$\mathrm{IDX}_{\mathrm{IC}}[b] = \arg\min_i\bigl|\mathrm{IC}[b] - \mathrm{ICs}(i)\bigr|.$$

The IPD and OPD indices are not transmitted for subbands b > 17 (approximately 2 kHz), given the fact that the human auditory system is insensitive to fine-structure phase differences at high frequencies. ITDs present in the high-frequency envelopes are supposed to be represented by the time-varying nature of the IID parameters (hence discarding ITDs present in envelopes that fluctuate faster than the parameter update rate). Thus, for each frame, 34 indices for the IID and IC have to be transmitted, and 17 indices for the IPD and OPD.

All parameters are transmitted differentially across time. In principle, differential coding of Λ indices (λ = {0, ..., Λ − 1}) requires 2Λ − 1 codewords λ_d = {−Λ + 1, ..., 0, ..., Λ − 1}. Assuming that each differential index λ_d has a probability of occurrence p(λ_d), the entropy H(p) (in bits/symbol) of this distribution is given by

$$H(p) = -\sum_{\lambda_d=-\Lambda+1}^{\Lambda-1} p(\lambda_d)\log_2 p(\lambda_d).$$

Given the fact that the cardinality Λ of each parameter is known by the decoder, each differential index λ_d can also be modulo-encoded by λ_mod, which is given by

$$\lambda_{\mathrm{mod}} = \mathrm{mod}(\lambda_d, \Lambda),\qquad \lambda_{\mathrm{mod}} \in \{0, \ldots, \Lambda - 1\}.$$

The decoder can simply retain the transmitted index recursively following

$$\lambda[q] = \mathrm{mod}\bigl(\lambda[q-1] + \lambda_{\mathrm{mod}}[q],\; \Lambda\bigr),$$

with q the frame number of the current frame. The entropy for λ_mod, H(p_mod), is given by

$$H(p_{\mathrm{mod}}) = -\sum_{\lambda_{\mathrm{mod}}=0}^{\Lambda-1} p_{\mathrm{mod}}(\lambda_{\mathrm{mod}})\log_2 p_{\mathrm{mod}}(\lambda_{\mathrm{mod}}),$$

where

$$p_{\mathrm{mod}}(\lambda_{\mathrm{mod}}) = p(\lambda_{\mathrm{mod}}) + p(\lambda_{\mathrm{mod}} - \Lambda).$$

Given that, for nonnegative a and b,

$$(a+b)\log_2(a+b) \ge a\log_2 a + b\log_2 b,$$

it follows that the difference in entropy between differential and modulo-differential coding satisfies

$$H(p) - H(p_{\mathrm{mod}}) \ge 0.$$

In other words, modulo-differential coding results in an entropy which is equal to or smaller than the entropy obtained for non-modulo-differential coding. (A code sketch of this scheme is given at the end of this subsection.) However, the bit-rate gains for modulo time-differential coding compared to time-differential coding are relatively small: about 15% for the IPD and OPD parameters, and virtually no gain for the IID and IC parameters. The entropy per symbol, using modulo-differential coding, and the resulting contribution to the overall bit rate are given in Table 1. These numbers were obtained by analysis of 80 different audio recordings representing a large variety of material. The total estimated parameter bit rate for the configuration as described above, excluding bit-stream overhead, and averaged across a large amount of representative stereo material, amounts to 7.7 kbps. If further parameter bit-rate reduction is required, the following changes can be made.

(i) Decreasing the number of frequency bands used for the analysis, quantization, and transmission of the spatial parameters (e.g., from 34 to 20 bands). Informal listening experiments showed that lowering the number of frequency bands below 10 results in severe degradation of the perceived spatial quality.
(ii) No transmission of IPD and OPD parameters. As described above, the coherence is a measure of the difference between the input signals which cannot be accounted for by (subband) phase and level differences. A lower bit rate is obtained if the applied signal model does not incorporate phase differences. In that case, the normalized cross-correlation is the relevant measure of differences between the input signals that cannot be accounted for by level differences. In other words, phase or time differences between the input signals are modeled as (additional) changes in the coherence. The estimated coherence value (which is in fact the normalized cross-correlation) is then derived from the cross-spectrum following

$$\mathrm{IC}[b] = \frac{\Re\left\{\sum_{k=k_b}^{k_{b+1}-1} X_1[k]X_2^*[k]\right\}}{\sqrt{\left(\sum_{k=k_b}^{k_{b+1}-1} X_1[k]X_1^*[k]\right)\left(\sum_{k=k_b}^{k_{b+1}-1} X_2[k]X_2^*[k]\right)}}.$$

The associated bit-rate reduction amounts to approximately 27% compared to parameter sets which do include the IPD and OPD values.
(iii) Increasing the quantization errors of the parameters. The bit-rate reduction is only marginal, given the fact that the distribution of time-differential parameters is very peaky.
(iv) Decreasing the parameter update rate. The bit rate scales approximately linearly with the update rate.
In summary, the parameter bit rate can be scaled from approximately 8 kbps for maximum quality (using 34 analysis bands, an update rate of 23 milliseconds, and transmitting all relevant parameters) down to about 1.5 kbps (using 20 analysis frequency bands, an update rate of 46 milliseconds, and no transmission of IPD and OPD parameters).
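As referenced above, here is a minimal sketch of the modulo-differential index coding; the handling of the first frame (differential to an assumed initial index of 0) is an assumption, as the bit-stream details are not specified in the text.

```python
def encode_mod_diff(indices, cardinality):
    """Modulo time-differential encoding of a parameter-index sequence.

    indices[q] is the quantizer index in frame q; all values lie in
    {0, ..., cardinality - 1}. Each transmitted codeword is the index
    difference reduced modulo the (decoder-known) cardinality.
    """
    out, prev = [], 0
    for lam in indices:
        out.append((lam - prev) % cardinality)  # lambda_mod in {0, ..., L-1}
        prev = lam
    return out

def decode_mod_diff(codes, cardinality):
    """The decoder retains each index recursively from the previous frame."""
    out, prev = [], 0
    for c in codes:
        prev = (prev + c) % cardinality
        out.append(prev)
    return out

# Round trip: decode_mod_diff(encode_mod_diff(idx, L), L) == idx.
```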

FFT-BASED DECODER
The spatial synthesis part of the decoder receives a mono input signal s[n] and has to generate two output signals y 1 [n] and y 2 [n]. These two output signals should obey the transmitted spatial parameters. A more detailed overview of the spatial synthesis stage is shown in Figure 5.
In order to generate two output signals with a variable (i.e., parameter-dependent) coherence, a second signal has to be generated which has a spectral-temporal envelope similar to that of the mono input signal, but is incoherent from a fine-structure waveform point of view. This incoherent (or orthogonal) signal, s_d[n], is obtained by convolving the mono input signal s[n] with an allpass decorrelation filter h_d[n]. A very cost-effective decorrelation allpass filter is obtained by a simple delay. The combination of a delay and a (fixed) mixing matrix to produce two signals with a certain spatial diffuseness is known as a Lauridsen decorrelator [87]. The decorrelation is produced by complementary comb-filter peaks and troughs in the two output signals. This approach works well provided that the delay is sufficiently long to result in multiple comb-filter peaks and troughs in each auditory filter. Due to the fact that the auditory filter bandwidth is larger at higher frequencies, the delay is preferably frequency dependent, being shorter at higher frequencies. A frequency-dependent delay has the additional advantage that it does not result in harmonic comb-filter effects in the output. A suitable decorrelation filter consists of a single period of a positive Schroeder-phase complex [88] of length N_s = 640 (i.e., with a fundamental frequency of f_s/N_s): a sum of all harmonics of f_s/N_s with equal amplitudes and phases following Schroeder's quadratic phase rule, yielding a flat-envelope, chirp-like impulse response whose group delay varies approximately linearly with frequency. The Schroeder-phase complex exhibits low autocorrelation at nonzero lags.

Subsequently, the segmentation, windowing, and transform operations that are performed are equal to those performed in the encoder, resulting in the frequency-domain representations S[k] and S_d[k] for the mono input signal s[n] and its decorrelated version s_d[n], respectively. The next step consists of computing linear combinations of the two signals to arrive at the two frequency-domain output signals Y_1[k] and Y_2[k]. The dynamic mixing process, which is performed on a subband basis, is described by the matrix multiplication R_b. For each subband b (i.e., k_b ≤ k < k_{b+1}), we have

$$\begin{bmatrix} Y_1[k] \\ Y_2[k] \end{bmatrix} = R_b \begin{bmatrix} S[k] \\ S_d[k] \end{bmatrix},$$

with

$$R_b = P\,A\,V.$$

The diagonal matrix V enables real-valued (relative) scaling of the two orthogonal signals S[k] and S_d[k]. The matrix A is a real-valued rotation in the two-dimensional signal space, that is, A^{-1} = A^T, and the diagonal matrix P enables modification of the complex-phase relationships between the output signals, hence |p_ij| = 1 for i = j and p_ij = 0 otherwise. The nonzero entries in the matrices P, A, and V are determined by the following constraints.
(1) The power ratio of the two output signals must obey the transmitted IID parameter. (2) The coherence of the two output signals must obey the transmitted IC parameter. (3) The phase difference between the first output signal and the mono input signal should be equal to the OPD value, while the phase difference between the two output signals should equal the IPD value.
The solution for the matrix P is given by

$$P = \begin{bmatrix} e^{j\,\mathrm{OPD}[b]} & 0 \\ 0 & e^{j(\mathrm{OPD}[b]-\mathrm{IPD}[b])} \end{bmatrix}.$$

The matrices A and V can be interpreted as the eigenvector/eigenvalue decomposition of the covariance matrix of the (desired) output signals, assuming (optimum) phase alignment (P) prior to correlation. The solution for the eigenvectors and eigenvalues (maximizing the first eigenvalue v_11) results from a singular value decomposition (SVD) of the covariance matrix. The matrices A and V are given by (see [89] for more details)

$$A = \begin{bmatrix} \cos\alpha[b] & -\sin\alpha[b] \\ \sin\alpha[b] & \cos\alpha[b] \end{bmatrix}, \qquad V = \begin{bmatrix} v_{11}[b] & 0 \\ 0 & v_{22}[b] \end{bmatrix},$$

with α[b] a rotation angle in the two-dimensional signal space defined by S and S_d, which satisfies

$$\tan\bigl(2\alpha[b]\bigr) = \frac{2\,c[b]\,\mathrm{IC}[b]}{c^2[b]-1},$$

γ[b] a parameter for the relative scaling of S and S_d (i.e., the relation between the eigenvalues of the desired covariance matrix, which determines the ratio v_22[b]/v_11[b]), and c[b] the square root of the power ratio of the two subband output signals:

$$c[b] = 10^{\mathrm{IID}[b]/20}.$$

It should be noted that a two-dimensional eigenvector problem has in principle four possible solutions: each eigenvector, represented as a column of the matrix A, may be multiplied by a factor −1. A modulo operation in the computation of α[b] ensures that the first eigenvector is always positioned in the first quadrant. However, this technique only works under the constraint IC > 0, which is guaranteed if phase alignment is applied. If no IPD/OPD parameters are transmitted, however, the IC parameters may become negative, which requires a different solution for the matrix R. A convenient solution is obtained if we maximize the contribution of S[k] to the sum of the output signals (i.e., Y_1[k] + Y_2[k]); the resulting real-valued mixing matrix then follows directly from the IID and IC constraints. Finally, the frames are transformed to the time domain, windowed (using synthesis windows equal to those in the encoder), and combined using overlap-add.
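The numpy sketch below assembles a mixing matrix R_b = P A V for one parameter quartet via an eigendecomposition of the desired (phase-aligned) output covariance, assuming unit-power mono input and an uncorrelated decorrelated signal S_d; the exact normalization and sign conventions of the paper may differ.

```python
import numpy as np

def mixing_matrix(iid_db, ipd, opd, ic):
    """Sketch of the decoder mixing matrix R_b = P @ A @ V for one band.

    Builds the desired (phase-aligned) output covariance from IID and IC,
    derives A and V from its eigendecomposition, and lets P distribute
    the phases (OPD for output 1, OPD - IPD for output 2).
    """
    c = 10.0 ** (iid_db / 20.0)              # amplitude ratio of the outputs
    p1 = c ** 2 / (1.0 + c ** 2)             # relative power, output 1
    p2 = 1.0 / (1.0 + c ** 2)                # relative power, output 2
    rho = ic * np.sqrt(p1 * p2)              # phase-aligned cross term
    cov = np.array([[p1, rho],
                    [rho, p2]])              # desired output covariance
    w, A = np.linalg.eigh(cov)               # ascending eigenvalues
    order = np.argsort(w)[::-1]              # largest eigenvalue first
    w, A = w[order], A[:, order]
    A = A * np.where(A[0, :] < 0.0, -1.0, 1.0)   # resolve sign ambiguity
    V = np.diag(np.sqrt(np.maximum(w, 0.0)))     # real-valued scaling
    P = np.diag([np.exp(1j * opd),
                 np.exp(1j * (opd - ipd))])      # phase distribution
    return P @ A @ V

# Usage: [Y1, Y2] = R @ [S, S_d] per subband, with S_d the decorrelated mono.
```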

QMF-BASED DECODER
The FFT-based decoder as described in the previous section requires a relatively long FFT length to provide sufficient frequency resolution at low frequencies. As a result, the resolution at high frequencies is unnecessarily high, and consequently the memory requirements of an FFT-based decoder are larger than necessary. To reduce the frequency resolution at high frequencies while still maintaining the required resolution at low frequencies, a hybrid complex filter bank is used. To be more specific, a hybrid complex-modulated quadrature mirror filter bank (QMF) is used which is an extension to the filter bank as used in spectral band replication (SBR) techniques [5,6,90]. The outline of the QMF-based parametric-stereo decoder is shown in Figure 6.

Figure 6: Structure of the QMF-based decoder. The signal is first fed through a hybrid QMF analysis filter bank. The filter-bank output and a decorrelated version of each filter-bank signal are subsequently fed into the mixing and phase-adjustment stage. Finally, two hybrid QMF banks generate the two output signals.

The input signal is first processed by the hybrid QMF analysis filter bank. A copy of each filter-bank output is processed by a decorrelation filter. This filter has the same purpose as the decorrelation filter in the FFT-based decoder; it generates a decorrelated version of the input signal in the QMF domain. Subsequently, both the QMF output and its decorrelated version are fed into the mixing and phase-adjustment stage. This stage generates two hybrid QMF-domain output signals with spatial parameters that match the transmitted parameters. Finally, the output signals are fed through a pair of hybrid QMF synthesis filter banks to result in the final output signals.
The hybrid QMF analysis filter bank consists of a cascade of two filter banks. The structure is shown in Figure 7.
The first filter bank is compatible with the filter bank used in SBR algorithms. The subband signals generated by this filter bank are obtained by convolving the input signal with a set of analysis filter impulse responses h_k[n] obtained by complex modulation of a prototype lowpass filter p_0[n]:

$$h_k[n] = p_0[n]\,\exp\!\left(j\,\frac{\pi}{64}\Bigl(k+\tfrac12\Bigr)(2n-1)\right),$$

with k the subband index of the 64-band bank. The magnitude responses of the first 4 frequency bands (k = 0, ..., 3) of the QMF analysis bank are illustrated in Figure 8.
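A generic sketch of such a complex-modulated analysis bank is given below; the prototype `p0` is not reproduced here, and the modulation term mirrors the form given above, which should be treated as an assumption rather than the exact SBR specification.

```python
import numpy as np

def qmf_analysis(x, p0, bands=64):
    """Sketch of a complex-modulated (exponentially modulated) filter bank.

    Each subband filter is the real prototype lowpass p0 modulated to the
    center of band k; the subband signals are then critically downsampled
    by the number of bands.
    """
    n = np.arange(len(p0))
    subbands = []
    for k in range(bands):
        h_k = p0 * np.exp(1j * np.pi / bands * (k + 0.5) * (2 * n - 1))
        subbands.append(np.convolve(x, h_k)[::bands])  # filter + downsample
    return np.array(subbands)
```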
The down-sampled subband signals S_k[q] of the lowest QMF subbands are subsequently fed through a second complex-modulated filter bank (sub-filter bank) to further enhance the frequency resolution; the remaining subband signals are delayed to compensate for the delay which is introduced by the sub-filter bank. The output of the hybrid (i.e., combined) filter bank is denoted by S_{k,m}[q], with k the subband index of the initial QMF bank and m the filter index of the sub-filter bank. To allow easy identification of the two filter banks and their outputs, the index k of the first filter bank will be denoted "subband index," and the index m of the sub-filter bank is denoted "sub-subband index." The sub-filter bank has a filter order of N_s = 12 and impulse responses G_{k,m}[q] given by

$$G_{k,m}[q] = g_k[q]\,\exp\!\left(j\,\frac{2\pi}{M_k}\Bigl(m+\tfrac12\Bigr)\Bigl(q-\tfrac{N_s}{2}\Bigr)\right),$$

with g_k[q] the prototype window associated with QMF band k, q the sample index, and M_k the number of sub-subbands in QMF subband k (m = 0, ..., M_k − 1). Table 2 gives the number of sub-subbands M_k as a function of the QMF band k, for both the 34- and 20-analysis-band configurations. As an example, the magnitude response of the 4-band sub-filter bank (M_k = 4) is given in Figure 9. Obviously, due to the limited prototype length (N_s = 12), the stop-band attenuation is only in the order of 20 dB.
The mixing matrix R_{k,m} is determined as follows. Each quartet of the parameters IID, IPD, OPD, and IC for a single parameter subband b represents a certain frequency range and a certain moment in time. The frequency range depends on the specification of the encoder analysis frequency bands (i.e., the grouping of FFT bins), while the position in time depends on the encoder time-domain segmentation. If the encoder is designed properly, the time/frequency localization of each parameter quartet coincides with a certain sample index in a sub-subband or set of sub-subbands in the QMF domain. For that particular QMF sample index, the mixing matrices are exactly the same as their FFT-based counterparts (as specified in Section 6). For QMF sample indices in between, the mixing matrices are interpolated linearly (i.e., their real and imaginary parts are interpolated individually).
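A small sketch of this interpolation; since the real and imaginary parts are interpolated individually, plain complex-valued linear interpolation suffices. The argument names are illustrative.

```python
import numpy as np

def interpolate_mixing(R_prev, R_next, q, q_prev, q_next):
    """Linearly interpolate two complex mixing matrices between the QMF
    sample indices q_prev and q_next at which parameter quartets apply.
    Complex linear interpolation interpolates the real and imaginary
    parts individually, as required."""
    t = (q - q_prev) / (q_next - q_prev)
    return (1.0 - t) * R_prev + t * R_next
```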
The mixing process is followed by a pair of hybrid QMF synthesis filter banks (one for each output channel), which also consist of two stages. The first stage comprises summation of the sub-subbands m which stem from the same subband k:

$$Y_k[q] = \sum_{m=0}^{M_k-1} Y_{k,m}[q].$$

Finally, upsampling and convolution with synthesis filters (which are similar to the QMF analysis filters specified above) results in the final stereo output signal.
The fact that the same filter-bank structure is used for both PS and SBR enables an easy and low-cost integration of SBR and parametric stereo in a single decoder structure (cf. [23,24,91,92]). This combination is known as enhanced aacPlus and is under consideration for standardization in MPEG-4 as the HE-AAC/PS profile [93]. The structure of the decoder is shown in Figure 10. The incoming bit stream is demultiplexed into a band-limited AAC bit stream, SBR parameters, and parametric-stereo parameters. The AAC bit stream is decoded by an AAC decoder and fed into a 32-band QMF analysis bank. The output of this filter bank is processed by the SBR stage and by the sub-filter bank as described in Section 7. The resulting full-bandwidth mono signal is converted to stereo by the PS stage, which performs decorrelation and mixing. Finally, two hybrid QMF synthesis banks result in the final output signals. More details on enhanced aacPlus can be found in [23,92].

PERCEPTUAL EVALUATION
To evaluate the parametric-stereo coder, two listening tests were conducted. The first test aims at establishing the maximum perceptual quality that can be obtained given the underlying spatial model. Other authors have argued that parametric-stereo coding techniques are only advantageous in the low-bit-rate range, since near transparency could not be achieved [20,21,22]. Therefore, this experiment is useful for two reasons: firstly, to verify statements by others on the maximum quality that can be obtained using parametric stereo; secondly, if parametric stereo is included in an audio coder, the maximum overall bit rate at which parametric stereo still leads to a coding gain compared to conventional stereo techniques depends in part on the quality limitations induced by the parametric-stereo algorithm alone. To exclude quality limitations induced by other coding processes besides parametric stereo, this experiment was performed without a mono coder. The second listening test was performed to derive the actual coding gain of parametric stereo in a complete coder. For this purpose, a comparison was made between a state-of-the-art stereo coder (i.e., aacPlus) and the same coder extended with parametric stereo (i.e., enhanced aacPlus) as described in Section 7.

Listening test I
Nine listeners participated in this experiment. All listeners had experience in evaluating audio codecs and were specifically instructed to evaluate both the spatial audio quality as well as other noticeable artifacts. In a double-blind MUSHRA test [94], the listeners had to rate the perceived quality of several processed items against the original (i.e., unprocessed) excerpts on a 100-point scale with 5 anchors. All excerpts were presented over Stax Lambda Pro headphones. The processed items included (1) encoding and decoding using a state-of-the-art MPEG-1 layer 3 (MP3) coder at a bit rate of 128 kbps stereo and using its highest possible quality settings; (2) encoding and decoding using the FFT-based parametric-stereo coder as described above without mono coder (i.e., assuming transparent mono coding) operating at 8 kbps; (3) encoding and decoding using the FFT-based parametric-stereo coder without mono coder operating at a bit rate of 5 kbps (using 20 analysis frequency bands instead of 34); (4) the original as hidden reference.
The 13 test excerpts are listed in Table 3. All items are stereo, 16-bit resolution per sample, at a sampling frequency of 44.1 kHz.
The subjects could listen to each excerpt as often as they liked and could switch in real time between the four versions of each item. The 13 selected items proved to be the most critical items from an 80-item test set for either parametric stereo or MP3 during development and in-between evaluations of the algorithms described in this paper. The items had a duration of about 10 seconds and contained a large variety of audio classes. The average scores of all subjects are shown in Figure 11. The top panel shows mean MUSHRA scores for 8 kbps parametric stereo (black bars) and MP3 at 128 kbps (white bars) as a function of the test item. The rightmost bars indicate the mean across all test excerpts. Most excerpts show very similar scores, except for excerpts 4, 8, 10, and 13. Excerpts 4 ("Harpsichord") and 8 ("Plucked string") show a significantly higher quality for parametric stereo. These items contain many tonal components, a property that is typically problematic for waveform coders due to the large audibility of quantization noise for such material. On the other hand, excerpts 10 ("Man in the long black coat") and 13 ("Two voices") have higher scores for MP3. Item 13 exhibits an (unnaturally) large amount of channel separation, which is partially lost after parametric-stereo decoding. On average, both coders have equal scores.
The middle panel shows results for the parametric-stereo coder working at 5 kbps (black bars) and 8 kbps (white bars). In most cases, the 8 kbps coder has a higher quality than the 5 kbps coder, except for excerpts 5 ("Castanets") and 7 ("Glockenspiel"). On average, the quality of the 5 kbps coder is only marginally lower than for 8 kbps, which demonstrates the shallow bit-rate/quality slope for the parametric-stereo coder.
The bottom panel shows 128 kbps MP3 (white bars) against the hidden reference (black bars). As expected, the hidden reference scores are close to 100. For fragments 7 ("Glockenspiel") and 10 ("Man in the long black coat"), the hidden reference scores lower than MP3 at 128 kbps, which indicates that MP3 coding is transparent for these excerpts.
It is important to note that the results described here were obtained for headphone listening conditions. We have found that headphone listening conditions are much more critical for parametric stereo than playback using loudspeakers. In fact, a listening test has shown that on average, the difference in MUSHRA scores between headphones and loudspeaker playback amounts to 17 points in favor of loudspeaker playback for an 8 kbps FFT-based encoder/decoder. This means that the perceptual quality for loudspeaker playback has an average MOS of over 90, indicating excellent perceptual quality. The difference between these playback conditions is most probably the result of the combination of an unnaturally large channel separation which is obtained using headphones on the one hand, and crosstalk resulting from the downmix procedure on the other hand. It seems that the amount of interchannel crosstalk that is inherently introduced by transmission of a single audio channel only is less than the amount of interaural crosstalk that occurs in free-field listening conditions. A consequence of this observation is that a comparison of the present coder with BCC schemes is rather difficult, since the BCC algorithms were all tested under subcritical conditions using loudspeaker playback (cf. [16,17,18,19,20]).

Listening test II
This test also employed the MUSHRA [94] methodology and included 10 items which were selected for the MPEG-4 HE-AAC stereo verification test [95]. The following versions of each item were included in the test: (1) the original as hidden reference; (2) a first lowpass filtered anchor (3.5 kHz bandwidth); (3) a second lowpass filtered anchor (7 kHz bandwidth); (4) aacPlus (HE-AAC) encoded at a bit rate of 24 kbps; (5) aacPlus (HE-AAC) encoded at a bit rate of 32 kbps; (6) enhanced aacPlus (HE-AAC/PS) encoded at a total bit rate of 24 kbps. Twenty analysis bands were used, and no IPD or OPD parameters were transmitted. The average parameter update rate amounted to 46 milliseconds. For each frame, the required number of bits for the stereo parameters was calculated. The remaining number of bits was available for the mono coder (HE-AAC).
Two different test sites participated in the test, with 8 and 10 experienced subjects per site, respectively. All excerpts were presented over headphones. The results per site, averaged across excerpts, are given in Figure 12.
At both test sites, it was found that aacPlus with parametric stereo (enhanced aacPlus) at 24 kbps achieves a respectable average subjective quality of around 70 on the MUSHRA scale. Moreover, at 24 kbps, the subjective quality of enhanced aacPlus is equal to that of aacPlus at 32 kbps and significantly better than that of aacPlus at 24 kbps. These results indicate a coding gain of 25% for enhanced aacPlus over stereo aacPlus (equal quality at 24 kbps versus 32 kbps, i.e., a saving of 8 kbps out of 32 kbps).

CONCLUSIONS
We have described a parametric-stereo coder which enables stereo coding using a mono audio channel and spatial parameters. Depending on the desired spatial quality, the spatial parameters require between 1 and 8 kbps. It has been demonstrated that for headphone playback, a spatial parameter bit stream of 5 to 8 kbps is sufficient to reach a quality level that is comparable to popular coding techniques currently on the market (i.e., MPEG-1 layer 3). Furthermore, it has been shown that a state-of-the-art coder such as aacPlus benefits from a significant reduction in bit rate without subjective quality loss if enhanced with parametric stereo.