EURASIP Journal on Applied Signal Processing 2002:11, 1260–1273
© 2002 Hindawi Publishing Corporation

Noise Adaptive Stream Weighting in Audio-Visual Speech Recognition

It has been shown that the integration of acoustic and visual information, especially in noisy conditions, yields improved speech recognition results. This raises the question of how to weight the two modalities in different noise conditions. Throughout this paper we develop a weighting process adaptive to various background noise situations. In the presented recognition system, audio and video data are combined following a Separate Integration (SI) architecture. A hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) system is used for the experiments. The neural networks were in all cases trained on clean data. First, we evaluate the performance of different weighting schemes in a manually controlled recognition task with different types of noise. Next, we compare different criteria to estimate the reliability of the audio stream. Based on this, a mapping between the measurements and the free parameter of the fusion process is derived and its applicability is demonstrated. Finally, the possibilities and limitations of adaptive weighting are compared and discussed.


INTRODUCTION
The limited performance of Automatic Speech Recognition (ASR) systems in the presence of background noise still restricts their usability in many scenarios. Different attempts have been made to increase the robustness of ASR systems, but all fall short in comparison to human performance.
It is well known that the movement of the lips plays an important role in speech perception [1,2]. The contribution of the lips is especially high in noisy speech [3,4]. This is due to the fact that visual speech mainly conveys information about the place of articulation, which is most easily confused in the audio modality when noise is present [5]. Motivated by these findings, many researchers have tried to integrate the information transmitted by lip movement into ASR systems (see [6,7,8,9,10,11,12] for a review). Already the first systems showed noticeable improvements of the recognition scores in noise when the audio and video signals were jointly evaluated. Since then significant progress has been made, and currently a recognition system using both audio and video data can outperform human listeners who have access to the audio signal alone at low Signal to Noise Ratio (SNR) [13]. Despite this high performance of audio-visual ASR systems, there is still a long way to go before these systems achieve performance comparable to humans on an identical task.
Throughout this paper, we mainly want to focus on the adaptive fusion of audio and video data under different noise conditions. We start with a quick look at different possible fusion architectures and point out why we have chosen a Separate Integration architecture, where fusion takes place on the decision level. Next, we present four different schemes for the fusion of audio and video decisions. A comparison of these fusion schemes in a wide range of noise conditions allows us to identify the best scheme. In order to be adaptive to changing noise conditions, a criterion is needed to evaluate the reliability of the audio channel. We present three different reliability criteria and compare them in different noise conditions. We conclude this paper with a discussion of the results of our comparisons. Throughout this discussion, special attention is paid to the question of whether adaptive weights on the audio and video stream are necessary, or if it is sufficient to simply use one fixed weight for all situations.

FUSION OF AUDIO AND VIDEO DATA
When looking at the fusion of audio and video data for audio-visual speech recognition, the first question to be addressed is where the fusion of the data takes place. Several different architectures for the fusion process have been proposed [5,14]. The first is integration on the feature level. In this case, audio and video features are directly combined into a larger feature vector, which is then used to identify the corresponding phoneme. This is also referred to as Direct Integration (DI) (see also Figure 1). In contrast to this, fusion can also take place after independent identification of each stream; the fusion is then rather a fusion of identification results. This is called Separate Integration (SI). Between these two extremes lies the so-called Motor Recoding (MR), in which the input features are first transformed into a common representation, and the classification is then based upon the combined features in this representation. The articulatory gesture parameters are chosen as the common representation, to which both audio and video features are mapped. A problematic point when using Motor Recoding is the choice of the representation of the articulatory gestures. In the fourth fusion architecture, one stream is dominant. In this case, the decision is based on the dominant stream, and the second stream is only used to rescore the identification results of the dominant stream. This is called Dominant Recoding (DR). Since it conveys much more information than the video stream, the audio stream is naturally chosen as the dominant stream.
When comparing the different fusion architectures, Separate Integration exhibits some characteristics that make it the best choice for our task. An important property is that the fusion of the two input streams can be controlled by weighting the streams. The code elements in Figure 1 are the phonemes H_i, to which we can assign a posteriori probabilities P(H_i | x_A, x_V) for their occurrence given the acoustic feature vector x_A and the video feature vector x_V (see Figure 2). These a posteriori probabilities, or to be more precise their estimates P̂, are generated by an Artificial Neural Network (ANN) [15] in each time frame. Therefore the SI architecture, in combination with an ANN, allows an adaptive weighting of the input streams depending on their reliability. Adaptation of the weights can be done once per scenario as well as for each single frame. Furthermore, comparisons of SI with other architectures showed superior performance of SI [16,17,18]; regarding the comparison of DI and SI, these results were also confirmed by our own experiments, which are not reported here. For these reasons, we decided to use an SI architecture for our recognition experiments.

Figure 2: Weighting of audio and video a posteriori probabilities in a Separate Integration architecture to take into account the changing reliability of the input streams.

Once we have chosen the SI architecture, the next question to tackle is how the fusion of the identification results takes place.
The quality of the estimate of the a posteriori probabilities is related to the match of the training and test conditions. As training was in all cases performed on clean data, the reliability of these estimates, particularly in the case of the audio path, strongly depends on the noise present in the test condition. In order to cope with the changing reliability, a weighting of the audio and video probabilities is desirable. Even though the quality of the video stream was kept constant in all following tests, an adaptive weighting of the video stream depending on the quality of the audio stream does improve the performance.

Unweighted Bayesian Product
The simplest way to combine audio and video data is to follow Bayes' rule and multiply the audio and video a posteriori probabilities to derive the combined probabilities. This approach is valid in a probabilistic sense if the audio and video data are independent. Perceptive studies showed that in human speech perception audio and video data are treated as class conditionally independent [19,20],
\[
P\left(x_A, x_V \mid H_i\right) = P\left(x_A \mid H_i\right) P\left(x_V \mid H_i\right). \tag{1}
\]
Under this hypothesis, when applying Bayes' rule, we can write the desired a posteriori probability of the phoneme H_i as
\[
P\left(H_i \mid x_A, x_V\right) = \frac{P\left(H_i \mid x_A\right) P\left(H_i \mid x_V\right)}{P\left(H_i\right)} \cdot \frac{P\left(x_A\right) P\left(x_V\right)}{P\left(x_A, x_V\right)}. \tag{2}
\]
Replacement of the probabilities P by estimates P̂ leads to the representation of the, as we want to call it, Unweighted Bayesian Product (UBP)
\[
\hat{P}\left(H_i \mid x_A, x_V\right) = \eta \, \frac{\hat{P}\left(H_i \mid x_A\right) \hat{P}\left(H_i \mid x_V\right)}{\hat{P}\left(H_i\right)}, \tag{3}
\]
where the terms independent of the actual phoneme are replaced by the normalization factor
\[
\eta = \left[ \sum_{j=1}^{N} \frac{\hat{P}\left(H_j \mid x_A\right) \hat{P}\left(H_j \mid x_V\right)}{\hat{P}\left(H_j\right)} \right]^{-1}, \tag{4}
\]
with N being the number of phonemes. This fusion scheme is also the core of the Fuzzy Logical Model of Perception (FLMP) [21], which is used to model human perception.
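As a concrete illustration, the UBP fusion of two posterior vectors can be sketched in a few lines of NumPy; the function name and the toy probabilities below are our own and not part of the recognition system described here:

```python
import numpy as np

def unweighted_bayesian_product(p_audio, p_video, priors):
    """Unweighted Bayesian Product: fuse audio and video phoneme
    posteriors under the class-conditional independence assumption."""
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    priors = np.asarray(priors, dtype=float)
    # product of the stream posteriors divided by the prior;
    # the normalization factor eta makes the result sum to one
    unnorm = p_audio * p_video / priors
    return unnorm / unnorm.sum()
```

With uniform priors and both networks agreeing on the most likely phoneme, the fused posterior is sharper than either stream alone.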

Standard Weighted Product
In order to deal with varying reliability levels of the input streams, different authors introduced a weighted fusion, where different weights are applied to the audio and video channels. The weighting of the a posteriori probabilities proposed in [17,18] follows (we want to refer to this as Standard Weighted Product)
\[
\hat{P}\left(H_i \mid x_A, x_V\right) = \eta' \, \hat{P}\left(H_i \mid x_A\right)^{\lambda} \hat{P}\left(H_i \mid x_V\right)^{1-\lambda}, \qquad 0 \le \lambda \le 1, \tag{5}
\]
with a normalization factor η' analogous to (4). The assumption of conditional independence is approached for equal a-priori probabilities of the phonemes or words, respectively, depending on the place of fusion. It is not exactly fulfilled, since equal weights on both streams correspond to weights of 0.5 instead of 1.
In addition to the intermediate setting, where the audio and video streams contribute equally to the recognition, two more distinct settings of the weights exist. When the SNR is very low, the estimation in the audio path completely fails. Therefore, the final a posteriori probability should only depend on the video features,
\[
\hat{P}\left(H_i \mid x_A, x_V\right) = \hat{P}\left(H_i \mid x_V\right), \tag{6}
\]
which is achieved with λ = 0. Similarly, for very high SNR, the estimation in the audio path is in general much better than the one in the video path, and consequently
\[
\hat{P}\left(H_i \mid x_A, x_V\right) = \hat{P}\left(H_i \mid x_A\right) \tag{7}
\]
with λ = 1. The most common recognition systems are based on Gaussian Mixture Hidden Markov Models (GM/HMM). These produce likelihoods instead of a posteriori probabilities. Weighting of these likelihoods corresponds to a weighting of (1), that is, $P(x_A \mid H_i)^{\lambda} P(x_V \mid H_i)^{1-\lambda}$ [9,10,16]. This approximates the assumption of conditional independence, independent of the a-priori probabilities.
Equal weights of 0.5 instead of 1 entail that not the product of the probabilities but the square root of the product is evaluated when both the audio and the video stream have the same weight. To resolve this problem, we modify the parameterization of the Standard Weighted Product. We introduce the parameters α and β, which both depend on a third parameter c according to
\[
\alpha = \min(1, \max(0, 1+c)), \qquad \beta = \min(1, \max(0, 1-c)), \tag{8}
\]
yielding
\[
\hat{P}\left(H_i \mid x_A, x_V\right) = \eta' \, \hat{P}\left(H_i \mid x_A\right)^{\alpha} \hat{P}\left(H_i \mid x_V\right)^{\beta}. \tag{9}
\]
Similarly to λ in the previous parameterization, the parameter c varies with the SNR, and it determines the contribution of the audio and video streams to the final probability.
When c = 0, the a posteriori probabilities from the audio and the video path both have the same weight, as α = 1 and β = 1 (see also Figure 3). For c ≤ −1 (at very low SNR), α = 0 and β = 1, and for c ≥ 1 (when only the audio signal carries information), α = 1 and β = 0. Hence this takes into account the situations where we only want to rely on one of the two streams. In contrast to the original parameterization of the Standard Weighted Product, to which we want to refer as SWP_λ, this implementation will be referred to as SWP_αβ.
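The piecewise-linear behavior of α and β can be written compactly in code; this is a sketch reflecting the limiting cases described above (equal weights at c = 0, video only for c ≤ −1, audio only for c ≥ 1), with a hypothetical function name:

```python
def stream_weights(c):
    """Map the fusion parameter c in [-1, 1] to the audio exponent
    alpha and the video exponent beta."""
    alpha = min(1.0, max(0.0, 1.0 + c))  # ramps from 0 to 1 over c in [-1, 0]
    beta = min(1.0, max(0.0, 1.0 - c))   # ramps from 1 to 0 over c in [0, 1]
    return alpha, beta
```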

Geometric Weighting
A concept integrating the class conditional independence of audio and video data expressed in (1) and the idea of noise dependent stream weighting expressed in (5) is the Geometric Weighting [22]
\[
\hat{P}\left(H_i \mid x_A, x_V\right) = \varepsilon \, \frac{\hat{P}\left(H_i \mid x_A\right)^{\alpha} \hat{P}\left(H_i \mid x_V\right)^{\beta}}{\hat{P}\left(H_i\right)^{\alpha+\beta-1}}. \tag{10}
\]
The normalization factor is determined by evaluating the condition
\[
\sum_{j=1}^{N} \hat{P}\left(H_j \mid x_A, x_V\right) = 1, \tag{11}
\]
which yields
\[
\varepsilon = \left[ \sum_{j=1}^{N} \frac{\hat{P}\left(H_j \mid x_A\right)^{\alpha} \hat{P}\left(H_j \mid x_V\right)^{\beta}}{\hat{P}\left(H_j\right)^{\alpha+\beta-1}} \right]^{-1}. \tag{12}
\]
Factors only dependent on x_A and x_V are eliminated by the normalization. The result of the sum in (11) is independent of H_i, and hence ε only depends on the fusion weights α and β.
For the Geometric Weighting we solely employed the parameterization with α and β as defined in (8). Consequently, for c = 0 the assumption of conditional independence as stated in (1) is fulfilled, with equal weight put on the audio and the video stream. Similar to the description in the previous section, for c = −1 the final probability only depends on the a posteriori probability of the video stream, and for c = 1 it only depends on the audio stream (see also Figure 3).
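A minimal NumPy sketch of the Geometric Weighting, reusing the α and β exponents from the parameterization above (function and variable names are ours):

```python
import numpy as np

def geometric_weighting(p_audio, p_video, priors, alpha, beta):
    """Geometric Weighting: exponent-weighted product of the stream
    posteriors with a prior correction, renormalized over the phonemes
    so that the fused posteriors sum to one."""
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    priors = np.asarray(priors, dtype=float)
    unnorm = p_audio**alpha * p_video**beta / priors**(alpha + beta - 1.0)
    return unnorm / unnorm.sum()
```

For alpha = beta = 1 the expression reduces to the Unweighted Bayesian Product, and for (alpha, beta) = (1, 0) or (0, 1) it returns the audio or video posteriors unchanged.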

Full Combination
Findings in human speech perception showed that the error rate for phoneme recognition using the full frequency range is approximately equal to the product of the error rates using only nonoverlapping frequency sub-bands [23,24]. This is known as the Product of Errors (POE) rule. Motivated by this rule, multistream recognition systems were built which decompose the speech signal into multiple sub-bands, perform an identification of the phoneme for each sub-band, and then combine the results [25]. In general, the performance gain of this approach in noise was not very high and was countered by a loss of performance on clean speech. The loss on clean speech is alleviated by the so-called Full Combination (FC) approach [26]. Here phoneme identification is performed for all combinations of sub-bands, including the full frequency range, and the identification results are then combined linearly.
When applying this concept to audio-visual recognition, we have to consider two input streams. Taking all combinations of the input streams plus the empty stream, which contains only the a-priori probabilities, into account, we have a total of four streams: the audio, the video, the combined audio-visual, and the empty stream. Hence three ANNs have to be trained to generate the corresponding probabilities. The weighting of the streams is performed by a linear combination of the a-priori and a posteriori probabilities according to
\[
\hat{P}\left(H_i \mid x_A, x_V\right) = \sum_{k} a_k \hat{P}\left(H_i \mid x_k\right), \tag{13}
\]
where k runs over the four streams and the a posteriori probability of the empty stream equals the a-priori probability P̂(H_i). In order to reduce the number of neural networks to be trained on each independent stream (which grows exponentially with the number of streams), the so-called Full Combination Approximation (FCA) was introduced [26]. Here class conditional independence is assumed between the streams, and hence the identification result for a combination of streams can be derived from the identification results of the individual streams (compare to (2)). Then the a posteriori probability of the combined audio-visual stream is evaluated according to
\[
\hat{P}\left(H_i \mid x_A, x_V\right) = a_{AV}\, \eta\, \frac{\hat{P}\left(H_i \mid x_A\right) \hat{P}\left(H_i \mid x_V\right)}{\hat{P}\left(H_i\right)} + a_A \hat{P}\left(H_i \mid x_A\right) + a_V \hat{P}\left(H_i \mid x_V\right) + a_{\emptyset} \hat{P}\left(H_i\right), \tag{14}
\]
with η as defined in (4). The first term in (14) results from the postulation of class conditional independence, and the other terms ensure the same behavior as the Geometric Weighting when only one of the streams is reliable. The a_k are the weights with which the individual streams contribute to the final probability. They are set to a_AV = αβ, a_A = α(1−β), a_V = (1−α)β, and a_∅ = (1−α)(1−β), with α and β as given in (8). When the estimation process for the different probabilities is not consistent, and hence the sum over all probabilities does not equal one, an independent normalization for each stream is necessary. At c = 0 the assumption of conditional independence is fulfilled. Similarly, for c = 1 and c = −1 all the weight is assigned to the audio or the video stream, respectively.

Figure 4: Implementation of the SI audio-visual speech recognition system.

In our implementation, the degrees of freedom of the FCA and the Geometric Fusion are limited to one. This might not be optimal, but a multidimensional optimization with multiple degrees of freedom would be much more costly to perform.
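The FCA combination can be sketched as follows; note that the bilinear weights a_k used here are a hypothetical choice that reproduces the limiting cases described above (all weight on the combined stream at c = 0, on the audio stream at c = 1, and on the video stream at c = −1):

```python
import numpy as np

def full_combination_approx(p_audio, p_video, priors, alpha, beta):
    """Full Combination Approximation: linear combination of the four
    streams (combined audio-visual, audio, video, empty). The combined
    stream is approximated from the single-stream posteriors via
    class-conditional independence."""
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    priors = np.asarray(priors, dtype=float)
    p_av = p_audio * p_video / priors      # approximated combined stream
    p_av = p_av / p_av.sum()               # normalization (eta)
    a_av = alpha * beta                    # hypothetical weight choice
    a_a = alpha * (1.0 - beta)
    a_v = (1.0 - alpha) * beta
    a_0 = (1.0 - alpha) * (1.0 - beta)     # weight of the empty stream
    fused = a_av * p_av + a_a * p_audio + a_v * p_video + a_0 * priors
    return fused / fused.sum()
```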

THE RECOGNITION TASK
As a common task to evaluate the presented fusion schemes, we have chosen the recognition of continuously uttered English numbers. This task comprises many of the problems of continuous speech recognition, while not being too costly to implement. One of the distinct features of a continuous recognition task is the necessity to discriminate between speech segments and silence passages, which is especially problematic in noisy speech. Due to the very limited availability of audio-visual speech data, we had to record a new database to train our system.

The audio-visual database
For the recording of the database, selected utterances from NUMBERS95 [27] were chosen and repeated by a single native English-speaking male subject. The database contains 1712 sentences or 6432 words. It was subdivided into two subsets of similar size, one for training and one for the final recognition tests. Synchronous recordings of the speech signal and of video images of the head and mouth region at 50 frames per second were taken. Recordings were made on BETACAM video and standard audio tapes and A/D converted off-line at a sampling rate of 8 kHz.

The recognition system
Our audio-visual speech recognition system is based on a hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) structure. ANN/HMM hybrid systems represent an alternative concept for continuous speech recognition to pure HMM systems, giving competitive recognition results [28]. As already mentioned in the previous section, our system follows an SI architecture (see Figure 4). The implementation of our system was carried out using the tool STRUT from TCTS lab Mons, Belgium [29]. The emphasis of our research lies on the fusion of the audio and video data during the recognition process, which requires large amounts of data to obtain meaningful results. Therefore, following [16], we rely on geometric lip features and simplify the extraction of the features significantly by a chroma key process. The chroma key process requires coloring the speaker's lips with blue lipstick. Due to the coloring, the lips can then be located easily, and their movement parameters can be extracted in real time. As lip parameters, geometric measurements of the lip contour were chosen. In Figure 5 the results of the feature extraction are visualized; the detected lip boundaries and the corresponding numerical values in pixels are given. The extracted video parameters were linearly interpolated from the original 50 Hz to 8 kHz in order to be synchronous with the audio data. Following the interpolation, each lip parameter was low-pass filtered to remove high frequency noise introduced by the parameter extraction and to further smooth the results of the interpolation. Audio feature extraction was performed using RASTA-PLP [30].
To take temporal information into account, several successive time frames of the audio and video feature vectors are presented simultaneously to the input of the corresponding ANNs. The concept of visemes was not used; each acoustical articulation is assumed to have a synchronously generated corresponding visual articulation. Hence the recognition process is based on phonemes. Individual phonemes are modeled via left-to-right HMMs, and the number of states used to represent the different phonemes was adapted to the mean length of the corresponding phoneme. Word models were generated by the concatenation of the corresponding phoneme models. Recognition is based on a dictionary with the phonetic transcription of 30 English numbers. Complete sentences containing a sequence of numbers were presented to the system during the recognition process. The sentences consist of free format numbers, making a grammar model unnecessary.
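Presenting several successive time frames to the ANN amounts to stacking each feature vector with its temporal context; a small sketch follows (the window sizes and the padding strategy are illustrative, not the values used in the paper):

```python
import numpy as np

def stack_context(features, left=4, right=4):
    """Stack each frame of a (T, D) feature matrix with `left` past and
    `right` future frames; edge frames are padded by repetition."""
    T, D = features.shape
    padded = np.concatenate([np.repeat(features[:1], left, axis=0),
                             features,
                             np.repeat(features[-1:], right, axis=0)])
    # each output row is the flattened context window around frame t
    return np.stack([padded[t:t + left + 1 + right].ravel()
                     for t in range(T)])
```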
Training of the ANNs was in all cases performed on clean data. During our recognition tests we added 5 different types of environmental noise at 12 different SNR levels to the audio signal, resulting in 60 different test conditions. Adding noise to the recorded signal instead of adding it during the recordings does not take into account the changes in articulation speakers produce when background noise is present [31] and therefore generates somewhat unrealistic scenarios. On the other hand, it opens the possibility to test exactly the same utterances in different noise conditions and tremendously facilitates the recording of the data. As additive noise we have chosen white noise, noise recorded in a car at 120 km/h, babble noise, and two types of factory noise taken from the NOISEX database [32]. Noise was only added to the audio signal. We considered the video stream to be of constant quality and did not alter the video signal throughout the tests.

Table 1: Average of the relative error in percent for audio alone recognition and the fusion schemes Standard Weighted Product (SWP) parameterized with λ (SWP_λ) and with α and β (SWP_αβ), Geometric Weighting (GW), Full Combination (FC), Full Combination Approximation (FCA), and Unweighted Bayesian Product (UBP) over all noise types and SNR levels. Additionally, the 95% confidence interval for the relative error is given.

EVALUATION OF THE FUSION SCHEMES
The first step in the evaluation is to compare the fusion schemes under identical conditions using a manual setting of the optimal weights.

Manual weight adaptation
Throughout this first stage of evaluation, the fusion parameter c in the Standard Weighted Product with α and β parameterization (SWP_αβ), the FCA, and the Geometric Fusion was adapted manually at each SNR level in order to get the best possible recognition score. During a test in a particular noise condition, the fusion parameter was held constant over all frames. Tests in that particular noise condition with different settings of the fusion parameter were repeated until the minimum Word Error Rate (WER) was reached. For the Standard Weighted Product with its original parameterization, the parameter λ instead of c was adapted to each noise scenario. In the following evaluation of the different fusion schemes, we will use the Relative Word Error Rate (RWER) instead of the WER. The reference point of the RWER is the WER resulting from a fusion according to the Standard Weighted Product with the original λ parameterization (SWP_λ) for the corresponding noise scenario. The RWER(SNR, n) at a given noise type n and SNR level is defined as
\[
\mathrm{RWER}(\mathrm{SNR}, n) = \frac{\mathrm{WER}_{\mathrm{scheme}}(\mathrm{SNR}, n) - \mathrm{WER}_{\mathrm{SWP}_{\lambda}}(\mathrm{SNR}, n)}{\mathrm{WER}_{\mathrm{SWP}_{\lambda}}(\mathrm{SNR}, n)}. \tag{15}
\]
To take all noise conditions into account, the mean relative error for a particular fusion scheme over all noise conditions was calculated as
\[
\overline{\mathrm{RWER}} = \frac{1}{N_n N_{\mathrm{SNR}}} \sum_{n} \sum_{\mathrm{SNR}} \mathrm{RWER}(\mathrm{SNR}, n), \tag{16}
\]
where N_n is the number of noise types and N_SNR the number of SNR levels. An improvement compared to the Standard Weighted Product results in a negative RWER. Table 1 compares the different mean relative errors. Both the FC and the FCA were implemented, but due to the very poor performance in noise of the identification network trained on the combined clean audio and video features, the performance of the FC was significantly worse than that of the FCA. For the Standard Weighted Product, the parameterization with α and β is compared to the original parameterization with λ, which serves as the reference point for the evaluation of the relative error. The parameterization with α and β, which results in equal weights of 1 at c = 0 instead of 0.5 at λ = 0.5, leads to a small but consistent improvement over all noise types.
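The averaging of the relative word error over noise types and SNR levels is straightforward to express in code; the array shapes and names below are our own:

```python
import numpy as np

def mean_relative_wer(wer_scheme, wer_reference):
    """Mean relative word error rate of a fusion scheme against the
    SWP_lambda reference. Both inputs have shape
    (num_noise_types, num_snr_levels); negative means improvement."""
    wer_scheme = np.asarray(wer_scheme, dtype=float)
    wer_reference = np.asarray(wer_reference, dtype=float)
    rwer = (wer_scheme - wer_reference) / wer_reference
    return float(rwer.mean())
```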
The results are given in detail in Figure 6 and Table 2, which show the graphical and numerical results when car noise was added to the audio signal. For comparison, the scores for the audio and video streams alone are also given. Due to its poor performance, the FC is not included in this comparison. The SWP_λ is included to serve as a reference point. From Figure 6 and Tables 1 and 2, it follows that all weighted fusion schemes are able to fulfill the basic postulation of audio-visual recognition. This postulation states that the audio-visual score should always be better than or equal to the audio or video score alone [18]. From a useful fusion scheme we further expect that it is able to generate synergy effects from the joint use of audio and video data, such that the resulting error rates are significantly lower than the error rates from either stream alone. The Standard Weighted Product yields rather poor performance and shows only little gain from the joint use of audio and video data. Geometric Weighting and FCA give very similar results, which are much better at medium SNR than audio or video recognition alone. For low SNRs the Geometric Weighting performs slightly, though not significantly, better than the FCA, but gives identical results for medium and high SNR. The Unweighted Bayesian Product is the only fusion scheme which does not fulfill the basic postulation. At very low SNR values, its recognition scores drop below those of the video channel alone, whereas at medium and high SNR values the scores are very similar, or identical, to those of the Geometric Weighting or the FCA.
Due to its superior performance, we only employed the Geometric Weighting in the following tests.

Automatic weight adaptation
For a real-time scenario, the setting of the weights has to be performed automatically depending on the noise level. A prerequisite to this is the estimation of the reliability of the audio stream during the fusion. The reliability estimation can follow two different approaches, either relying on the statistics of the a posteriori probabilities or directly on the speech signal. We will first present two measures based on the distribution of the a posteriori probabilities and will then also present a measure based on the speech signal.

Entropy of a posteriori probabilities
The distribution of the a posteriori probabilities at the output of the ANN carries information on the reliability of the input stream to the ANN. If one distinct phoneme class shows a very high probability and all other classes have a low probability, this signifies a reliable input, whereas, when all classes have almost equal probability, the input is very unreliable. This information is captured in the entropy of the estimated a posteriori probabilities P̂(H_{i,k} | x_{A,k}) for the occurrence of the phoneme H_i, given the acoustic feature vector x_{A,k} at time frame k [16,33,34]. The average entropy of the a posteriori probabilities over all frames is
\[
\bar{H} = -\frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{N} \hat{P}\left(H_{i,k} \mid x_{A,k}\right) \log \hat{P}\left(H_{i,k} \mid x_{A,k}\right), \tag{17}
\]
where N is the number of phonemes and K the number of frames. We want to control the fusion process based on the entropy. Therefore a mapping between the value of the entropy and the fusion parameter c has to be established. Experiments showed that for this mapping it is necessary to exclude segments where the pause is the most likely state, due to many false identifications of pauses at low SNR levels. Therefore only those frames where the silence state is not amongst the 4 most probable phonemes are taken into account for the calculation of the entropy.
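A sketch of the average-entropy computation, including the exclusion of silence-dominated frames described above (the index of the silence class and the use of base-2 logarithms are our assumptions):

```python
import numpy as np

def average_entropy(posteriors, silence_idx=0, top=4):
    """Average per-frame entropy of the phoneme posteriors (K frames x
    N phonemes), skipping frames where the silence class is among the
    `top` most probable phonemes."""
    posteriors = np.asarray(posteriors, dtype=float)
    entropies = []
    for frame in posteriors:
        # skip frames dominated by (possibly misrecognized) silence
        if silence_idx in np.argsort(frame)[::-1][:top]:
            continue
        p = np.clip(frame, 1e-12, 1.0)      # avoid log(0)
        entropies.append(-np.sum(p * np.log2(p)))
    return float(np.mean(entropies)) if entropies else 0.0
```

A sharply peaked posterior distribution yields a low entropy (reliable input), while a near-uniform one yields an entropy close to the maximum log2(N).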

Dispersion of a posteriori probabilities
A measure similar to the entropy is the dispersion of the a posteriori probabilities [16,34]
\[
\bar{D} = \frac{1}{K} \sum_{k=1}^{K} \frac{2}{M(M-1)} \sum_{m=1}^{M-1} \sum_{l=m+1}^{M} \left( \hat{P}\left(H_{m,k} \mid x_{A,k}\right) - \hat{P}\left(H_{l,k} \mid x_{A,k}\right) \right), \tag{18}
\]
where the probabilities P̂(H_{m,k} | x_{A,k}) are sorted in descending order, beginning with the highest one. Hence the differences between the M most likely phonemes are calculated and summed up. In our setup the best results were obtained for M = 3. As for the entropy, only frames where silence is not among the 4 most likely phonemes are taken into account.
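In code, the dispersion amounts to averaging the pairwise differences of the M largest posteriors per frame; note that the normalization by the number of pairs is our reconstruction and may differ in detail from the original definition:

```python
import numpy as np

def average_dispersion(posteriors, M=3, silence_idx=0, top=4):
    """Average dispersion: mean pairwise difference between the M
    largest posteriors per frame, skipping silence-dominated frames."""
    posteriors = np.asarray(posteriors, dtype=float)
    values = []
    for frame in posteriors:
        order = np.argsort(frame)[::-1]
        if silence_idx in order[:top]:
            continue
        best = frame[order[:M]]             # M largest, descending
        diffs = [best[m] - best[n]
                 for m in range(M - 1) for n in range(m + 1, M)]
        values.append(2.0 / (M * (M - 1)) * sum(diffs))
    return float(np.mean(values)) if values else 0.0
```

In contrast to the entropy, a peaked distribution yields a high dispersion (reliable input) and a flat one a dispersion near zero.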

Voicing index as audio reliability measure
It is known that speech contains many harmonic components, whereas in many everyday life situations background noise is nonharmonic. Thus, the lower the ratio of the energy of the harmonic to the nonharmonic components, the more noise is present in the signal. A measure to assess this relation is the so-called voicing index [35]. The voicing index gives the conditional probability that a speech segment is clean enough to be recognized when the harmonicity index R of this speech segment is known.
For the calculation of the harmonicity index, the speech signal is segmented into overlapping frames of 1024 samples length, and each speech segment is pre-emphasized and demodulated. The demodulation is performed by a rectification followed by a filtering with a trapezoidal band-pass filter. The cut-off frequencies of the band-pass filter are chosen such that the range of the fundamental frequency is covered. The maximum of the autocorrelation function of the demodulated signal in this range is then determined, and we derive the value of the harmonicity index R by normalizing this maximum value by the zero time-lag value of the autocorrelation function, which represents the mean energy of the demodulated speech frame. When setting the threshold for the signal to be "clean enough to be recognized" at an SNR level of 0 dB, the corresponding conditional probability, and hence the voicing index, can be formulated as P(SNR > 0 dB | R). For the evaluation of this conditional probability we use an estimate of the conditional probability density function. We added white noise at 0 dB SNR to 288 sentences of the database and compiled a two-dimensional histogram of the relationship between the local SNR value (in each 1024-sample time frame) and the harmonicity index. A sigmoidal mapping function between the harmonicity index R and the voicing index is derived from this histogram and used to estimate the conditional probability. Similar to the previous criteria, the voicing index was evaluated only in those segments where the pause was not amongst the 4 most probable phonemes. First tests of the use of the voicing index for the fusion in audio-visual speech recognition are reported in [36].
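The core of the harmonicity computation can be sketched as follows; for brevity the trapezoidal band-pass filter is replaced by simple mean removal after rectification, and the pitch-lag bounds fmin/fmax are assumptions, so this is only an approximation of the measure described above:

```python
import numpy as np

def harmonicity_index(frame, fs=8000, fmin=60.0, fmax=400.0):
    """Harmonicity index R of one speech frame: rectify (demodulation),
    remove the mean, and normalize the maximum of the autocorrelation
    in the expected pitch-lag range by its zero-lag value (the mean
    energy of the demodulated frame)."""
    x = np.abs(np.asarray(frame, dtype=float))   # rectification
    x = x - x.mean()                             # crude band limiting
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    if acf[0] <= 0.0:
        return 0.0
    lag_lo, lag_hi = int(fs / fmax), int(fs / fmin)
    return float(acf[lag_lo:lag_hi + 1].max() / acf[0])
```

A strongly voiced frame yields R close to one, while broadband noise yields a value near zero, which is what makes R usable as an input to the voicing index mapping.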

Evaluation of global audio stream weights
After the definition of the various measures to be used in the estimation of the reliability of the audio stream, the questions at hand are: how sensitive are the recognition results to variations of the fusion parameter c, and how consistent are the reliability measures over different noise types and SNR levels?
To answer the first question, we can have a look at Figure 7. Here the recognition results are plotted for c varying in the range −1 ≤ c ≤ 1. As additive noise, car noise at 12 SNR levels was used. The points of minimum WER used for the manual weight adaptation in Section 4.1 are connected by a dotted line in Figure 7. The goal of the automatic adaptation is now to find the mapping between the reliability estimation measure and the fusion parameter c which results in the same minimum WERs in all noise conditions. As can be seen in the figure, there are large regions where the WER does not increase significantly over a wide range of values of the fusion parameter c. On the other hand, there are also regions at low SNR where small variations of the fusion parameter have a strong impact on the WER. In general, Figure 7 demonstrates that the fusion is not very sensitive to the setting of c for SNR > 0 dB, and hence an automatic choice of c should at least give reasonable results for these SNR values.
The next question is the sensitivity of the audio reliability estimation measures to different noise types. To test this sensitivity, we used all 5 noise types at 12 SNR levels each and calculated the average value of the corresponding reliability measure (entropy, dispersion, voicing index) over the whole test set for a given noise scenario. In Figure 8 we plotted the value of the reliability measure over the different optimal settings of the fusion parameter c (i.e., the settings that yield the minimum WER). Each point of the curves corresponds to one of the 12 SNR values, and each of the first five curves corresponds to one noise type. If the criteria were independent of the noise type, all points would lie on one continuously decreasing (for the entropy) or increasing (for the dispersion and the voicing index) curve. This is obviously not the case. Nevertheless, the curves lie more or less close together, which indicates that the variation of the criteria with the noise type is rather small. Exceptions are the babble noise in the case of the dispersion and the white noise for the voicing index.
If we want to have a reliability measure that does not depend on the noise type, we have to search for a mapping between the reliability measure and the fusion parameter c which is optimal in a minimum error sense. Our optimization criterion for the mapping c(·) is the minimization of the squared relative word error over all noise types n and all SNR levels [37],
\[
E = \sum_{n} \sum_{\mathrm{SNR}} \mathrm{RWER}^2(\mathrm{SNR}, n), \tag{19}
\]
with the relative word error
\[
\mathrm{RWER}(\mathrm{SNR}, n) = \frac{\mathrm{WER}_{\min}(\mathrm{SNR}, n) - \mathrm{WER}_{\mathrm{measure}}(\mathrm{SNR}, n)}{\mathrm{WER}_{\min}(\mathrm{SNR}, n)}. \tag{20}
\]
WER_measure is the error rate obtained when using one of the reliability measures to control the fusion process, and WER_min is the minimum error rate when setting the fusion parameter manually (as defined in Section 4.1). The mapping between the reliability measure x and the fusion parameter c is approximated by a sigmoidal function
\[
c(x) = \frac{h}{1 + e^{-g(x-d)}}, \tag{21}
\]
where h, g, and d are the parameters which define the shape (amplitude, slope, and position) of the sigmoidal function and are subject to the optimization. For the minimization, a gradient descent procedure was used in which the unknown derivative is approximated by the difference quotient. The optimization procedure was repeated with different initial values. The results of the optimization for each criterion can be seen in Figure 8. Note that the distance between the sigmoidal fit and the measured curves in Figure 8 is not a direct measure for the quality of the fit. As a consequence of the minimization of the word errors, the optimal sigmoidal fit is the one for which deviations of the fusion parameter c from its optimal value induce the smallest increase in word error. Hence, in regions where variations of c cause only a small increase of the word error, the distance between the sigmoidal curve and the curves resulting from the reliability criteria can be significant, whereas the resulting word error rates are still very close to optimal.
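The fit of a three-parameter sigmoid by gradient descent with difference-quotient gradients can be sketched as follows; a plain squared error between c(x) and the optimal fusion parameters stands in for the word-error-based criterion of the paper, and the learning-rate and step-count values are illustrative:

```python
import numpy as np

def fit_sigmoid(x, c_opt, h=1.0, g=1.0, d=0.0, lr=0.01, steps=8000):
    """Fit c(x) = h / (1 + exp(-g (x - d))) to pairs of reliability
    measure x and optimal fusion parameter c_opt by gradient descent;
    the unknown derivative is approximated by a difference quotient."""
    x = np.asarray(x, dtype=float)
    c_opt = np.asarray(c_opt, dtype=float)

    def err(p):
        h_, g_, d_ = p
        return np.sum((h_ / (1.0 + np.exp(-g_ * (x - d_))) - c_opt) ** 2)

    p = np.array([h, g, d], dtype=float)
    eps = 1e-5
    for _ in range(steps):
        grad = np.zeros(3)
        for i in range(3):                  # central difference quotient
            dp = np.zeros(3)
            dp[i] = eps
            grad[i] = (err(p + dp) - err(p - dp)) / (2.0 * eps)
        p -= lr * grad
    return p
```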

Evaluation of adaptive audio stream weights
So far, word error rates were calculated with the fusion parameter c held constant within one noise condition: the average value of the reliability measure was calculated for that noise condition, and a global value of c was selected accordingly. This assumes that the whole test set is known at recognition time, which of course is unrealistic in a real-life recognition system. Rather, it is necessary to calculate the correct setting of the fusion parameter instantaneously for each frame. This also opens the possibility to cope with nonstationary noise and variations of the SNR of the speech signal. We therefore repeated the tests of the previous section with audio stream weights adapted on a frame-by-frame basis. To reduce the influence of estimation errors, the values of the fusion parameter were smoothed over time with a first-order recursive filter with a cutoff frequency of 0.6 Hz. Table 3 compares the results of the optimization for the different criteria when the value of the fusion parameter is fixed over the whole test set (Global) and when it is varied (Frame Dependent). As for the previous recognition results, the average RWER is based on the results obtained with SWPλ and hence evaluated according to (15) and (16). In Figure 9, the results of the automatic fusion, the manual setting of the fusion parameter, and the fusion using the Unweighted Bayesian Product are compared. For the automatic fusion, the voicing index was chosen as the reliability measure and evaluated on a frame-by-frame basis. The curve corresponding to the global evaluation of the voicing index is almost identical to the frame-wise evaluation and is therefore not included in the plot.
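The temporal smoothing described here can be sketched with a one-pole recursive low-pass filter. The 100 Hz frame rate below is an assumption for illustration, not a value taken from the paper:

```python
import math

def smooth_stream_weights(c_frames, frame_rate_hz=100.0, f_cutoff_hz=0.6):
    """First-order recursive (one-pole IIR) low-pass smoothing of the
    frame-wise fusion parameter, reducing the influence of estimation errors."""
    # Pole location for the given cutoff frequency
    a = math.exp(-2.0 * math.pi * f_cutoff_hz / frame_rate_hz)
    smoothed, state = [], c_frames[0]
    for c in c_frames:
        state = a * state + (1.0 - a) * c
        smoothed.append(state)
    return smoothed
```

The low cutoff frequency suppresses frame-to-frame estimation noise, at the price (noted in the Discussion) of slower adaptation to rapid intensity changes.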

DISCUSSION
In the previous sections, we presented different weight combination and estimation schemes of audio and video a posteriori probabilities in an audio-visual recognition task. Different tests were carried out to assess the performance of the different weighting schemes. In all tests, we used 5 different types of noise at 12 SNR levels each to obtain results not limited to one special scenario.

Performance of weight combination schemes
In the first test, the free parameters of the weighting schemes were adapted manually to each noise condition. Three of the presented weighting schemes, namely the Unweighted Bayesian Product, the FCA, and the Geometric Weighting, are based on the assumption of class conditional independence of audio and video features. The fourth, the Standard Weighted Product, only approximates this assumption for equal a priori probabilities of the phonemes, which was not the case in our tests. Furthermore, the parameterization of the Standard Weighted Product constrains the weights to sum to one. So, when both streams have equal weights of 0.5, the square root of the product of the two a posteriori probabilities is taken, instead of the product as for the other methods. In order to have weights equal to 1 on both streams in the equal-weight condition (as for Geometric Weighting), we changed the parameterization of the Standard Weighted Product from λ and (1 − λ) to two independent weights α and β. This led to a small, but consistent, improvement in comparison to the original form.
Yet the main result of this first comparison is the clearly superior performance of the weighting schemes following the assumption of class conditional independence over the Standard Weighted Product, thanks to the introduction of the a priori probabilities. The FCA and the Geometric Weighting in particular showed very similar results; one reason is the similarity of the two algorithms (the FCA is based on arithmetic weighting). Both attain the pure a posteriori probabilities when all weight is put on either channel and produce the a posteriori probability following class conditional independence for equal weights. Apart from these three special cases, they differ only in the way the probabilities are weighted. The results indicate a small, but not very significant, advantage of the Geometric Weighting for low SNR values and equal performance for the other values. Therefore, we only took the Geometric Weighting into consideration in the succeeding experiments.
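The combination schemes discussed above can be sketched as follows. The form given for the Geometric Weighting is a plausible parameterization consistent with the properties described in the text (it reduces to the Unweighted Bayesian Product for equal weights of 1 and to the pure audio or video posteriors when all weight is on one channel), not necessarily the paper's exact equation; all probability vectors are toy values:

```python
import numpy as np

def normalize(p):
    return p / p.sum()

def ubp(pa, pv, prior):
    # Unweighted Bayesian Product under class-conditional independence
    return normalize(pa * pv / prior)

def swp(pa, pv, lam):
    # Standard Weighted Product with weights lam and (1 - lam)
    return normalize(pa**lam * pv**(1.0 - lam))

def gw(pa, pv, prior, alpha, beta):
    # Assumed form of Geometric Weighting: alpha = beta = 1 reduces to the
    # UBP; alpha = 1, beta = 0 yields the pure audio a posteriori probabilities.
    return normalize(prior * (pa / prior)**alpha * (pv / prior)**beta)

prior = np.array([0.5, 0.3, 0.2])   # hypothetical phoneme priors
pa = np.array([0.7, 0.2, 0.1])      # audio a posteriori probabilities
pv = np.array([0.4, 0.4, 0.2])      # video a posteriori probabilities
```

Note that the SWP, unlike the UBP and GW, nowhere reintroduces the a priori probabilities, which is exactly the shortcoming discussed above.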

Performance of audio stream reliability measures
The next test was designed to reveal the performance of the weighting scheme found best in the previous test in a more realistic scenario, where the adaptation of the weights is done automatically and not by hand. In the first step of the comparison, we investigated a static case: we first evaluated the reliability measure over the whole dataset and then performed the fusion with the setting of the fusion parameter corresponding to the measure. The mapping of the reliability measure to the fusion parameter took a wide range of noise conditions into account. For the mapping, a fit in the minimum error sense was established between the value of the measure in a particular noise condition and the corresponding optimal fusion parameter. The results showed that large improvements over audio-only recognition can be achieved under all noise conditions investigated; however, for low SNR values, the WERs are still too high for useful recognition. An open question is how the optimized mapping generalizes to new, previously unseen noise conditions. The consistency of the results (see Figure 8) suggests that such generalization is possible, even though a final answer can only be found by tests in noise conditions not present during the design of the mapping.
In the last step of the comparison, we made the transition from the unrealistic static case, where the whole test set has to be known before the fusion parameter can be determined, to an evaluation of the measure on a frame-by-frame basis. In general, we expected an increase in performance, since frame-level fusion can take into account variations of the SNR during one utterance and from one utterance to the next and is capable of coping with nonstationary noise (like babble or factory noise). On the other hand, limiting the estimation interval of the reliability criteria to one frame has a high impact on the quality of the estimation. This effect was alleviated by smoothing the values with a first-order recursive filter, although this reduces the ability to adapt quickly to intensity variations. The results of the frame-wise adaptation showed that the two effects, the larger flexibility and the lower precision, seem to trade off against one another: the results of the frame-dependent evaluation are very similar to those evaluated on the whole test set (see Table 3). Even though there was no performance gain from the frame-wise evaluation, the results show that the reliability estimation criteria are applicable to a realistic system. Both the entropy and the voicing index showed only small deviations from the optimum values. In the global evaluation, the results of the entropy criterion were better than those of the voicing index, but in the more realistic frame-wise adaptation, the entropy criterion deteriorated more than the voicing index. The voicing index, however, gave less consistent results than the entropy for babble noise (which contains many harmonic components) and white noise (which has no harmonic components).
The dispersion, especially in the frame-wise adaptation, was not competitive with the other criteria. To summarize, the entropy and the voicing index criteria can be used efficiently to control the adaptive fusion process.
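For illustration, the posterior-based reliability criteria can be sketched as below. The entropy criterion follows the standard Shannon definition; the dispersion shown here is one common formulation (mean log-ratio between the top posterior and the next k − 1 candidates) and is an assumption, not necessarily the paper's exact definition:

```python
import numpy as np

def entropy(posteriors):
    # Shannon entropy of a frame's a posteriori distribution:
    # flat (unreliable) frames score high, peaked (reliable) frames score low.
    p = np.clip(posteriors, 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def dispersion(posteriors, k=4):
    # Assumed definition: mean log-ratio between the largest posterior and
    # the next k-1 largest; high for peaked, low for flat distributions.
    top = np.sort(posteriors)[::-1][:k]
    return float(np.mean(np.log(top[0] / np.clip(top[1:], 1e-12, 1.0))))

peaked = np.array([0.9, 0.05, 0.03, 0.02])  # reliable-looking frame
flat = np.array([0.25, 0.25, 0.25, 0.25])   # unreliable-looking frame
```

The voicing index, in contrast, is computed on the audio signal itself (ratio of harmonic to nonharmonic energy) rather than on the classifier output.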

Unweighted Bayesian Product versus adaptive weights
One interesting result of our comparison is the good performance at medium to high SNR of the Unweighted Bayesian Product, which does not require any weighting and hence no reliability estimation either. As can be seen in Figure 6, the performance of the Unweighted Bayesian Product is almost identical to that of the Geometric Weighting for medium and high SNR values (e.g., SNR ≥ 0 dB), whereas for low SNR values (e.g., SNR < 0 dB), the performance sharply decreases.
For SNR > 0 dB, the audio and video channels carry complementary phonetic information, which is well fused by Bayes' rule [20]. For SNR < 0 dB, the weighting principle yields a gain, and Bayes' rule seems to start producing wrong results. A priori, decreasing audio stream reliability should flatten the distribution of the probability values and correspondingly increase its entropy, which is exactly what the entropy-based stream reliability criterion exploits. In the extreme case, where the stream under consideration does not contribute any information, its output distribution becomes uniform. During fusion, the uniform distribution does not interfere with the distribution of the reliable input stream, as multiplying by a uniform distribution does not alter the shape of the second distribution. Consequently, the phonetic identification is not impaired by the unreliable stream. If this is true, why can we then observe a sharp decrease of performance at low SNR?
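The claim that a uniform distribution is transparent to product fusion can be verified in a few lines; for simplicity the sketch assumes uniform phoneme priors, so the Bayesian product reduces to a normalized elementwise product:

```python
import numpy as np

def fuse(pa, pv):
    # Product fusion of two a posteriori distributions (uniform priors assumed)
    p = pa * pv
    return p / p.sum()

pv = np.array([0.6, 0.3, 0.1])          # reliable video stream
uniform_audio = np.full(3, 1.0 / 3.0)   # completely uninformative audio stream
```

Multiplying by the uniform audio distribution and renormalizing returns the video distribution unchanged, so an ideally "flat" unreliable stream would indeed leave the fused decision intact.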
To answer this question, we should have a look at the confusion matrix of the phoneme identification. In a confusion matrix, each element gives the percentage of the stimuli on the y-axis identified as the output class on the x-axis. For the confusion matrix in Figure 10, car noise at −6 dB was added to the audio signal. Already a first quick look at the main diagonal of the confusion matrix reveals that the distribution of errors is clearly nonuniform. There are phonemes which are identified very well, and others which show only poor identification scores. The silence state "sil" obviously plays a special role: with increasing noise level, more and more phonemes are confused with the silence state. This is partly taken into account in the mapping between the fusion parameter c and the stream reliability measures by the fact that only segments where silence is not among the 4 most likely states are used for the evaluation of the criteria. Both the entropy and the dispersion criteria are improved by this modification. Furthermore, the phonemes "s," "n," and "i" also attract many other phonemes. On the other hand, there are phonemes which are very poorly recognized and with which hardly any other phoneme is confused (e.g., "z" and "e:"). It follows from this analysis that the distribution of the a posteriori probabilities does not flatten, but rather builds peaks at some attractor phonemes and dips at phonemes which are hardly identified or confused.
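A row-normalized confusion matrix as described above can be computed with a short sketch; the frame labels here are toy data, with class 0 standing in for the silence state:

```python
import numpy as np

def confusion_matrix(ref, hyp, n_classes):
    # Rows: stimulus (reference) class; columns: identified (output) class.
    # Entries are row-normalized percentages, as in Figures 10 and 11.
    counts = np.zeros((n_classes, n_classes))
    for r, h in zip(ref, hyp):
        counts[r, h] += 1
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)

# Toy frame labels (reference vs. hypothesis)
ref = [0, 0, 1, 1, 2, 2, 2, 1]
hyp = [0, 0, 1, 0, 2, 0, 2, 1]
cm = confusion_matrix(ref, hyp, 3)
```

Summing column 0 of the off-diagonal rows then directly gives the fraction of non-silence frames misrecognized as silence, the quantity analyzed in the following paragraphs.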
However, to impair the audio-visual recognition, it is not enough for the errors in the audio stream to increase; these errors also have to be correlated with those committed in the video stream. The combination according to Bayes' rule is able to compensate for uncorrelated errors to a certain degree. Therefore, to judge the consequences of the deformation of the audio a posteriori probability distribution, it is indispensable to also look at the video stream confusion matrix, visualized in Figure 11. Comparing the two confusion matrices demonstrates that the phonemes confused in the audio stream ("s," "n," and "i") also lead to confusions in the video stream. In the video stream, the silence state is likewise the origin of many confusions. Hence, the errors of phonetic identification in the audio and video streams are correlated, and in this case Bayes' rule is not able to perform a compensation.
It appears that in both confusion matrices the dominant cause of confusions is the silence state, but not to the same extent. At −6 dB, 75% of the phonemes in the audio stream are confused with the silence state, whereas in the video stream only 22% are misleadingly recognized as pauses. When fusing audio and video following the Unweighted Bayesian Product, this strong preference for pauses in noisy audio leads to a confusion of 43% of the phonemes with pauses, whereas when Geometric Weighting is used to weight the audio and video probabilities, this confusion drops to 24%. The weighting thus tends to select the modality having less confusion with the silence state. Nevertheless, at medium SNR levels, the performance of the Unweighted Bayesian Product is very close to that of the Geometric Weighting. To further quantify this, Table 4 gives, in addition to the mean RWER of the Geometric Weighting and the Unweighted Bayesian Product over all noise conditions, the mean RWER of the Unweighted Bayesian Product for SNR levels above and below 0 dB separately (all three evaluated according to (15) and (16)). This evaluation shows that the performance difference between the UBP and the GW increases considerably for SNR < 0 dB. Regardless of the remaining performance difference, there are applications where the SNR is typically higher than 0 dB, and there a loss of performance is counterbalanced by a simple and intrinsically stable implementation.

Table 4: Mean relative error for GW with the voicing index evaluated on a frame-by-frame basis and for the Unweighted Bayesian Product. The errors are calculated over all noise types and SNR levels, and separately for SNR ≥ 0 dB and SNR < 0 dB. Additionally, the 95% confidence interval for the relative error is given.

CONCLUSION
Our objective was to compare a number of schemes for the adaptive combination of audio and video a posteriori probabilities estimated by an ANN for an audio-visual recognition task under different noise conditions. In a first test, we looked at the effectiveness of different weight combination schemes for audio and video data. The results demonstrated that a multiplicative combination respecting class conditional independence of the streams gives the best results. Next, we compared different criteria for an adaptive estimation of the audio stream reliability using the Geometric Weighting method. The performance of both the criterion based on the entropy of the a posteriori probabilities and the one based on the ratio of the harmonic to the nonharmonic components of the speech signal was very close to the best achievable performance determined by manual adjustment. We showed that an adaptive weighting scheme based on the entropy and the voicing index can be built that yields consistent performance in various noise conditions. Finally, we investigated whether a constant weight on the audio and video streams in all noise conditions would give performance comparable to the adaptive weighting. Our test showed that when the SNR is higher than 0 dB, the Unweighted Bayesian Product performs as well as the Geometric Weighting, so weighting, fixed or adaptive, is unnecessary there; for SNR values below −3 dB, however, the performance losses are severe if no weighting is performed. An analysis of the confusion matrices showed that the confusion of phonemes with the silence state is the main cause of the failure of the Unweighted Bayesian Product for SNR < 0 dB. We remark that this is related to the continuous speech recognition task and the problem of speech detection in noise. Therefore, an algorithm (such as the FCA or GW) that incorporates Bayes' rule, which performs well for SNR ≥ 0 dB, together with a weighting principle, which is dominant for SNR < 0 dB, seems to be optimal.
Globally, the weighting acts as a switch between the two modalities, favoring the one having fewer confusions with the silence state. This complements Bayes' rule when this type of confusion occurs.
All tests are based on a database with a single male speaker whose lips were colored blue to facilitate the lip feature extraction. Most of the tests were repeated on a database with a single female speaker without additional coloring of the lips [38]. The results of these tests are comparable to those reported here.