EURASIP Journal on Applied Signal Processing 2003:2, 115–127
© 2003 Hindawi Publishing Corporation

Probabilistic Aspects in Spoken Document Retrieval

Accessing information in multimedia databases encompasses a wide range of applications in which spoken document retrieval (SDR) plays an important role. In SDR, a set of automatically transcribed speech documents constitutes the files for retrieval, to which a user may address a request in natural language. This article deals with two probabilistic aspects of SDR. The first part investigates the effect of recognition errors on retrieval performance and asks why recognition errors have only a small effect on it. In the second part, we present a new probabilistic approach to SDR that is based on interpolations between document representations. Experiments performed on the TREC-7 and TREC-8 SDR tasks show that the proposed method achieves results comparable to or even better than those of other advanced heuristic and probabilistic retrieval metrics.


INTRODUCTION
Retrieving information from large, unstructured databases is one of the most important tasks computers are used for today. While information retrieval focused on searching written texts only in the past, the field of applications has since extended to multimedia data such as audio and video documents, whose volume in broadcasting and media grows every day. Nowadays, radio and TV stations hold huge archives containing countless documents that were produced and collected over the years. However, since these documents are usually neither indexed nor catalogued, the respective document collections are effectively unusable and the data stocks lie idle. Therefore, efficient methods enabling content-based access to poorly structured or even unstructured multimedia archives are of eminent importance.

Spoken document retrieval
A particular application in the domain of information retrieval is the content-based access to audio data in which spoken document retrieval (SDR) plays an important role. SDR extends the techniques developed in text retrieval to audio documents containing speech. To this purpose, the audio documents are automatically segmented and transcribed by a speech recognizer in advance. The resulting transcriptions are indexed and subsequently stored in large databases, thus constituting the files for retrieval, to which a user may address a request in natural language.
Over the past years, research has shifted from pure text retrieval to SDR. However, since even state-of-the-art speech recognizers are still error-prone and far from perfect recognition, automatically generated transcriptions are often flawed, and word accuracies of less than 80% are not uncommon, for example, on broadcast news transcription tasks [1].
Speech recognizers may insert new words into the original sequence of spoken words and may substitute or delete others that might be essential for filtering out the relevant portion of a document collection. Unlike text retrieval, SDR thus requires retrieval metrics that are robust towards recognition errors. In the recent past, several research groups investigated retrieval metrics that are suitable for SDR tasks [2,3]. Surprisingly, the development of robust metrics turned out to be less difficult than expected at the beginning of the research in this field, for recognition errors seem to hardly affect retrieval performance, and this result also holds for tasks where automatically generated transcriptions have word error rates of up to 40% (see the experimental results in Section 3.1). Although this was the unanimous result of past TREC evaluations [2,3], the reasons have not yet been examined sufficiently. In this paper, we conduct a probabilistic analysis of errors in SDR. For this purpose, we propose two new error criteria that are better suited to quantifying the appropriateness of automatically generated transcriptions for retrieval applications. The second part of this paper addresses probabilistic retrieval metrics for SDR. Although probabilistic retrieval metrics are usually better motivated in terms of a mathematically well-founded theory than their heuristic counterparts, they often suffer from lower performance. In order to compensate for this shortcoming, we propose a new statistical approach to information retrieval based on a measure for document similarities. Experimental results for both the error analysis and the new statistical approach are presented on the TREC-7 and TREC-8 SDR tasks.
The structure of this paper is as follows. In Section 2, we start with a brief introduction to heuristic retrieval metrics. In order to improve the baseline performance, we propose a new method for query expansion. Section 3 is about the effect of recognition errors on retrieval performance. It includes a detailed error analysis and presents the datasets used for the experiments. In Section 4, we propose the new statistical approach to information retrieval and give detailed results of the experiments conducted. We conclude the paper with a summary in Section 5.

HEURISTIC RETRIEVAL METRICS IN SDR
Among the proposed heuristic approaches to information retrieval, the term-frequency/inverse-document-frequency (tf-idf) metric is one of the best investigated retrieval metrics. Due to its simple structure in combination with a fairly good initial performance, tf-idf forms the basis for several advanced retrieval metrics. In the following section, we give a brief introduction to tf-idf in order to introduce the terminology used in this paper and to form the basis for all further considerations.

Baseline methods
Let $\mathcal{D} := \{d_1, \ldots, d_K\}$ be a set of $K$ documents and let $w = w_1, \ldots, w_s$ denote a request given as a sequence of $s$ words. A retrieval system transforms $w$ into a set of query terms $q_1, \ldots, q_m$ ($m \le s$) which are used to retrieve those documents that best meet the user's information need. For this purpose, all words that are of "low semantic worth" for the actual retrieval process are eliminated (stopping), while the remaining words are reduced to their morphological stem (stemming) using, for example, Porter's stemming algorithm [4]. Documents are preprocessed in the same manner as queries. The remaining words, also referred to as index terms, constitute the features that describe a document or query. In the following, index terms are denoted by $d$ or $q$ if they are associated with a certain document $d$ or query $q$; otherwise, we use the symbol $t$. Let $\mathcal{T} := \{t_1, \ldots, t_T\}$ be a set of index terms and let $\mathcal{Q} := \{q_1, \ldots, q_L\}$ denote a set of queries. Then both documents and queries are given as sequences of index terms,

$$d_k = d_{k,1}, \ldots, d_{k,I_k}, \qquad q_l = q_{l,1}, \ldots, q_{l,J_l}. \tag{1}$$

Each query $q \in \mathcal{Q}$ partitions the document set $\mathcal{D}$ into a subset $\mathcal{D}_{rel}(q)$ containing all documents that are relevant with respect to $q$, and the complementary set $\mathcal{D}_{irr}(q)$ containing the remaining, that is, all irrelevant documents. The number of occurrences of an index term $t$ in a document $d_k$ and a query $q_l$, respectively, is denoted by

$$n(t, d_k) := \sum_{i=1}^{I_k} \delta(t, d_{k,i}), \qquad n(t, q_l) := \sum_{j=1}^{J_l} \delta(t, q_{l,j}), \tag{2}$$

with $\delta(\cdot, \cdot)$ as the Kronecker function. The counts $n(t, d_k)$ in (2) are also referred to as term frequencies of document $d_k$. Using $n(t, d_k)$ from (2), we define the document frequency $n(t)$ as the number of documents containing the index term $t$,

$$n(t) := \bigl|\{d_k \in \mathcal{D} : n(t, d_k) > 0\}\bigr|. \tag{3}$$

With the definition of the inverse document frequency

$$\mathrm{idf}(t) := \log \frac{K}{n(t)}, \tag{4}$$

a document-specific weight $\omega(t, d)$ and a query-specific weight $\omega(t, q)$ are assigned to each index term $t$. These weights are defined as the product of the term frequencies $n(t, d)$ and $n(t, q)$, respectively, and the inverse document frequencies,

$$\omega(t, d) := n(t, d) \cdot \mathrm{idf}(t), \qquad \omega(t, q) := n(t, q) \cdot \mathrm{idf}(t). \tag{5}$$

Given a query $q$, a retrieval system rates each document in the database as to whether or not it may meet the request. The result is a ranking list including all documents that are supposed to be relevant with respect to $q$. For this purpose, we define a retrieval function $f$ that, in the case of the tf-idf metric, is the sum over the products of the weights of all index terms occurring in $q$ as well as in $d$, normalized by the lengths of the query $q$ and the document $d$,

$$f(q, d) := \frac{\sum_{t \in \mathcal{T}} \omega(t, q) \cdot \omega(t, d)}{\sqrt{\sum_{t \in \mathcal{T}} n^2(t, q)} \cdot \sqrt{\sum_{t \in \mathcal{T}} n^2(t, d)}}. \tag{6}$$

The value of $f(q, d)$ is called the retrieval status value (RSV). The evaluation of $f(q, d)$ for all documents $d \in \mathcal{D}$ induces a ranking according to which the documents are compiled into a list that is sorted in descending order. The higher the RSV of a document, the better it may meet the query and the more important it may be for the user.
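The tf-idf retrieval function described above can be illustrated with a short sketch. It assumes the plain form idf(t) = log(K / n(t)) and a normalization by the Euclidean lengths of the raw term-frequency vectors; the function name `tfidf_rsv` and the toy documents are hypothetical and only serve as an illustration.

```python
import math
from collections import Counter

def tfidf_rsv(query, doc, docs):
    """Retrieval status value f(q, d) of a tf-idf metric (illustrative sketch)."""
    K = len(docs)
    n_q, n_d = Counter(query), Counter(doc)            # term frequencies n(t, q), n(t, d)
    score = 0.0
    for t in n_q.keys() & n_d.keys():                  # only shared terms contribute
        n_t = sum(1 for other in docs if t in other)   # document frequency n(t)
        idf = math.log(K / n_t)                        # assumed form: log(K / n(t))
        score += (n_q[t] * idf) * (n_d[t] * idf)       # omega(t, q) * omega(t, d)
    norm_q = math.sqrt(sum(c * c for c in n_q.values()))
    norm_d = math.sqrt(sum(c * c for c in n_d.values()))
    return score / (norm_q * norm_d) if norm_q and norm_d else 0.0

# Hypothetical, already stopped and stemmed toy documents.
docs = [
    ["stock", "market", "crash"],
    ["market", "report", "weather"],
    ["weather", "storm", "warning", "storm"],
]
query = ["storm", "weather"]

# Sorting by descending RSV yields the ranking list.
ranking = sorted(range(len(docs)),
                 key=lambda k: tfidf_rsv(query, docs[k], docs),
                 reverse=True)
```

In this toy collection, the third document shares both query terms and is ranked first, while the first document shares none and receives an RSV of zero.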

Advanced retrieval metrics
Based on the tf-idf metric, several modifications have been proposed in the literature leading, for example, to the Okapi metrics [5] as well as the SMART-1 and the SMART-2 metrics [6]. The baseline results produced for this paper use the following version of the SMART-2 metric. Here, the inverse document frequencies are given by

$$\mathrm{idf}(t) := \left\lfloor \log \frac{K}{n(t)} \right\rfloor. \tag{7}$$

Note that due to the floor operation in (7), a term weight will be zero if the term occurs in more than half of the documents. According to [7], each index term $t$ in a document $d$ is associated with a weight $g(t, d)$ that depends on the ratio of the logarithm of the term frequency $n(t, d)$ to the logarithm of the average term frequency $\bar{n}(d)$, with the convention $\log 0 := 0$. The logarithms in (8) prevent documents with high term frequencies from dominating those with low term frequencies.

In order to obtain the final term weights, $g(t, d)$ is divided by a linear combination of a pivot element $c$ and the number of singletons $n_1(d)$ in document $d$,

$$\omega(t, d) := \frac{g(t, d)}{(1 - \lambda) \cdot c + \lambda \cdot n_1(d)}, \tag{10}$$

with $\lambda = 0.2$. Unlike tf-idf, only query terms are weighted with the inverse document frequency $\mathrm{idf}(t)$. Now, we can define the SMART-2 retrieval function as the sum over the products of the document- and query-specific index term weights,

$$f(q, d) := \sum_{t \in \mathcal{T}} \omega(t, q) \cdot \omega(t, d). \tag{13}$$

Figure 1: Principle of query expansion: using the difference vector $\rho_q$, the original query vector $e_q$ is shifted towards the subset of relevant documents.
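The two ingredients of the SMART-2 variant described above, the floored inverse document frequency and the pivoted document weights, can be sketched as follows. Several details are assumptions rather than the definitions used in the paper: the base-2 logarithm, the "1 + log" form of g(t, d) known from pivoted normalization schemes, and the way the pivot is passed in as a plain number. The function names are hypothetical, and documents are assumed to be non-empty.

```python
import math
from collections import Counter

def smart2_idf(K, n_t):
    """Floored inverse document frequency (base-2 logarithm is an assumption)."""
    return math.floor(math.log2(K / n_t))

def smart2_doc_weights(doc, pivot, lam=0.2):
    """Pivoted document term weights; the '1 + log' form of g(t, d) is an assumption."""
    tf = Counter(doc)
    avg_tf = sum(tf.values()) / len(tf)                  # average term frequency of d
    singletons = sum(1 for c in tf.values() if c == 1)   # n_1(d): terms occurring once
    denom = (1.0 - lam) * pivot + lam * singletons       # linear combination with lambda
    return {
        t: (1.0 + math.log(c)) / (1.0 + math.log(avg_tf)) / denom
        for t, c in tf.items()
    }

# A term occurring in more than half of K = 100 documents gets zero weight.
idf_frequent = smart2_idf(100, 51)
idf_rare = smart2_idf(100, 10)

weights = smart2_doc_weights(["storm", "storm", "warning"], pivot=1.0)
```

The logarithm dampens the influence of repeated terms: in the toy document, "storm" occurs twice but its weight is less than twice that of "warning".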

Improving retrieval performance
Often, the retrieval effectiveness can be improved using interactive search techniques such as relevance feedback methods. Retrieval systems providing relevance feedback conduct a preliminary search and present the top-ranked documents to the user, who has to rate whether or not each document meets his or her information need. Based on this relevance judgment, the original query vector is modified in the following way. Let $\mathcal{D}_{rel}(q)$ be the subset of top-ranked documents rated as relevant, and let $\mathcal{D}_{irr}(q)$ denote the subset of irrelevant retrieved documents. Further, let $e_d = (n(t_1, d), \ldots, n(t_T, d))^{\top}$ denote the document $d$ embedded into a $T$-dimensional vector, and let $e_q = (n(t_1, q), \ldots, n(t_T, q))^{\top}$ denote the vector embedding of the query $q$. Then, the difference vector $\rho_q$ defined by

$$\rho_q := \frac{1}{|\mathcal{D}_{rel}(q)|} \sum_{d \in \mathcal{D}_{rel}(q)} e_d \; - \; \frac{1}{|\mathcal{D}_{irr}(q)|} \sum_{d \in \mathcal{D}_{irr}(q)} e_d \tag{14}$$

connects the centroids of both document subsets. Therefore, it can be used to shift the original query vector $e_q$ towards the cluster of relevant documents, resulting in a new query vector $\tilde{e}_q$ (see Figure 1). This method is also known as query expansion, and the Rocchio algorithm [8] is among the best-known implementations of this idea, although there are many others as well [9,10,11]. Assuming that the $r$ top-ranked documents of the preliminary search are (most likely) relevant, interactive search techniques can be automated by setting $\mathcal{D}_{rel}(q)$ to the first $r$ retrieved documents, whereas $\mathcal{D}_{irr}(q)$ is set to $\emptyset$. However, since the effectiveness of a preliminary retrieval process may decrease due to recognition errors, query expansion is often performed on secondary document collections, for example, newspaper articles that are kept apart from the actual retrieval corpus. This technique is very effective, but at the same time, it requires significantly more resources due to the additional indexing and storage costs of the supplementary database.

Therefore, we focus on a new method for query expansion that solely uses the actual retrieval corpus while preserving robustness towards recognition errors. The approach comprises the following three steps:
(1) perform a preliminary retrieval using SMART-2 with $\pi : \{1, \ldots, K\} \to \{1, \ldots, K\}$ induced by the ranking list so that $f(q, d_{\pi(1)}) \ge \cdots \ge f(q, d_{\pi(K)})$ holds;
(2) determine the query expansion vector $\bar{e}_q$ defined as the sum over the expansion vectors $v_q(d)$ of the $r$ top-ranked documents $d_{\pi(1)}, \ldots, d_{\pi(r)}$ ($r \le K$);
(3) define the new query vector $\tilde{e}_q$ by combining the original query vector $e_q$ with the expansion vector $\bar{e}_q$.
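The automated variant of relevance feedback described above, in which the r top-ranked documents are treated as relevant and the set of irrelevant documents is empty, can be sketched as follows. This is a generic centroid shift and not the expansion vectors v_q(d) of the proposed method; the step size `gamma` and the function name `expand_query` are hypothetical.

```python
from collections import Counter

def expand_query(query, ranked_docs, r=2, gamma=0.5):
    """Shift the query vector towards the centroid of the r top-ranked documents.

    gamma is a hypothetical step size. With an empty set of irrelevant
    documents, the difference vector reduces to the centroid of the
    top-ranked documents.
    """
    top = ranked_docs[:r]
    centroid = Counter()
    for doc in top:
        for t, c in Counter(doc).items():
            centroid[t] += c / len(top)
    expanded = {t: float(c) for t, c in Counter(query).items()}
    for t, c in centroid.items():
        expanded[t] = expanded.get(t, 0.0) + gamma * c
    return expanded

# Hypothetical ranking list from a preliminary retrieval.
ranked = [["storm", "warning"], ["storm", "flood"], ["market", "report"]]
new_query = expand_query(["storm"], ranked, r=2, gamma=0.5)
```

The expanded query keeps the original term with an increased weight and gains the terms "warning" and "flood" from the top-ranked documents, while terms from lower-ranked documents are ignored.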

ANALYSIS OF RECOGNITION ERRORS AND RETRIEVAL PERFORMANCE
Switching from manual to recognized transcriptions raises the question of how robust retrieval metrics are towards recognition errors. Automatic speech recognition (ASR) systems may insert new words into the original sequence of spoken words while substituting or deleting others that might be essential for filtering out the relevant portion of a document collection. In ASR, the performance is usually measured in terms of the word error rate (WER). The WER is based on the Levenshtein or edit distance, which is the minimal number of insertions (ins), deletions (del), and substitutions (sub) of words necessary to transform the spoken sentence into the recognized sentence. The relative WER is defined by

$$\mathrm{WER} := \frac{\mathrm{ins} + \mathrm{del} + \mathrm{sub}}{N}.$$

Here, $N$ is the total number of words in the reference transcriptions of the document collection $\mathcal{D}$. The computation of the WER requires an alignment of the spoken sentence with the recognized sentence. Thus, the order of words is explicitly taken into account.
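The WER computation can be sketched with the standard dynamic-programming recursion for the Levenshtein distance. The sketch uses uniform costs for insertions, deletions, and substitutions and assumes a non-empty reference; the function name `word_error_rate` is chosen for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance with uniform costs, divided by the reference length."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j]: minimal number of edits turning reference[:i] into hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                      # i deletions
    for j in range(m + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + sub,   # match or substitution
            )
    return dist[n][m] / n

spoken = "the cat sat on the mat".split()
recognized = "the cat sat on mat".split()
wer = word_error_rate(spoken, recognized)   # one deletion in six words
```

Because the recursion aligns the two word sequences position by position, reordering the words of a sentence generally changes the result, which is exactly the dependence on word order mentioned above.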

Tasks and experimental results
Experiments for the investigation on the effect of recognition errors on retrieval performance were carried out on the TREC-7 and the TREC-8 SDR task using manually segmented stories [3]. The TREC-7 task comprises 2866 documents and 23 test queries. The TREC-8 task comprises 21745 spoken documents and 50 test queries. Table 1 summarizes some corpus statistics.
Recognition results on the TREC-7 SDR task were produced using the RWTH large vocabulary continuous-speech recognizer (LVCSR) [12]. The recognizer uses a time-synchronous beam search algorithm based on the concept of word-dependent tree copies and integrates the trigram language-model constraints in a single pass. Besides acoustic and histogram pruning, a look-ahead technique for the language-model probabilities is utilized [13]. Recognition results were produced using gender-independent models. Neither speaker adaptation nor any normalization methods were applied. Every nine consecutive feature vectors, each consisting of 16 cepstral coefficients, are spliced and mapped onto a 45-dimensional feature vector using a linear discriminant analysis (LDA). The segmentation of the audio stream into speech and nonspeech segments is based on a Gaussian mixture distribution model. Table 2 shows the effect of recognition errors on retrieval performance, measured in terms of mean average precision (MAP) [14] for different retrieval metrics on the TREC-7 SDR task. Even though the WER of the recognized transcriptions is 32.5%, the retrieval performance decreases by only 9.9% relative using the SMART-2 metric in comparison with the original, that is, the manually generated transcriptions. The relative loss is even smaller (approx. 5% relative) if the new query expansion method is used.
Similar results could be observed on the TREC-8 corpus. Unlike the experiments conducted on the TREC-7 SDR task, we made use of the recognition outputs of the Byblos "Rough 'N Ready" LVCSR system [15] and the Dragon LVCSR system [16]. Here, the retrieval performance decreases by only 13.1% relative using the SMART-2 metric in combination with the recognition outputs of the Byblos speech recognizer and by 15.1% relative using the Dragon speech recognition outputs. Note that in both cases, the WER is approximately 40%, that is, almost every second word was misrecognized. Using the new query expansion method, the relative performance loss is nearly constant, that is, the transcriptions as produced by the Byblos speech recognizer cause a performance loss of 13.0% relative, whereas the transcriptions generated by the Dragon system cause a degradation of 13.4% relative.

Alternative error measures
Since most retrieval metrics disregard word order, the WER is not well suited to quantifying the quality of recognized transcriptions for retrieval applications. A more reasonable error measure is given by the term error rate (TER) as proposed in [17], which accumulates the absolute differences between the term frequencies of the reference and the recognized documents and relates them to the number of index terms in the reference. As before, $I_k$ denotes the number of index terms in the reference document $d_k$, $n(t, d_k)$ is the original term frequency, and $n(t, \tilde{d}_k)$ denotes the term frequency of the term $t$ in the recognized transcription $\tilde{d}_k$. Note that a substitution error according to the WER produces two errors in terms of the TER since it not only misses a correct word but also introduces a spurious one. Consequently, we have to count substitutions twice in order to compare both error measures. Nevertheless, the alignment on which the WER computation is based must still be determined using uniform costs, that is, substitutions are counted once. The TER can thus be rewritten in terms of term-wise insertions and deletions.

Since the contributions of term frequencies to term weights are often diminished by the application of logarithms (see (8)), the number of occurrences of an index term within a document $d$ is of less importance than the fact whether or not a term occurs in $d$ at all. Therefore, we propose the indicator error rate (IER), which is defined like the TER but replaces each term frequency by an indicator of whether or not the term occurs in the document. The IER thus discards term frequencies and measures the number of index terms that were missed or wrongly added during recognition. If we transfer the concepts of recall and precision to pairs of documents, we obtain a motivation for the IER. For this purpose, we regard the reference transcription $d$ and the recognized transcription $\tilde{d}$ as sets of index terms and define

$$\mathrm{recall}(d, \tilde{d}) := \frac{|d \cap \tilde{d}|}{|d|}, \qquad \mathrm{prec}(d, \tilde{d}) := \frac{|d \cap \tilde{d}|}{|\tilde{d}|}.$$

Note that a high recall means that the recognized transcription $\tilde{d}$ contains many index terms of the reference transcription $d$. A low precision means that the recognized transcription contains many index terms that do not occur in the reference transcription. Both the recall and precision errors are given by

$$1 - \mathrm{recall}(d, \tilde{d}) = \frac{|d \setminus \tilde{d}|}{|d|}, \qquad 1 - \mathrm{prec}(d, \tilde{d}) = \frac{|\tilde{d} \setminus d|}{|\tilde{d}|}.$$

If we assume both the reference and the recognized documents to be of the same size, that is, $|d| \approx |\tilde{d}|$, which can be justified by the fact that language model scaling factors are usually set to values ensuring balanced numbers of deletions and insertions, we obtain the following interpretation of the IER:

$$\mathrm{IER}(d, \tilde{d}) = \frac{|d \setminus \tilde{d}| + |\tilde{d} \setminus d|}{|d|} \approx \bigl(1 - \mathrm{recall}(d, \tilde{d})\bigr) + \bigl(1 - \mathrm{prec}(d, \tilde{d})\bigr).$$

Similar results were observed on the TREC-8 SDR task using the recognition outputs of the Byblos and the Dragon speech recognition system, respectively (see Tables 8 and 9). Table 4 summarizes the most important error rates of Tables 3, 8, and 9.
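The relation between the TER, the IER, and document-level recall and precision can be checked on a small example. The sketch below works on a single pair of documents and normalizes by the size of the reference; both choices are assumptions made for illustration, as are the function names and the toy transcriptions.

```python
from collections import Counter

def term_error_rate(reference, recognized):
    """Sum of absolute term-frequency differences relative to the reference length."""
    n_ref, n_rec = Counter(reference), Counter(recognized)
    diff = sum(abs(n_ref[t] - n_rec[t]) for t in n_ref.keys() | n_rec.keys())
    return diff / len(reference)

def indicator_error_rate(reference, recognized):
    """Number of missed plus wrongly added index terms relative to the reference terms."""
    ref_terms, rec_terms = set(reference), set(recognized)
    return (len(ref_terms - rec_terms) + len(rec_terms - ref_terms)) / len(ref_terms)

def recall_precision(reference, recognized):
    """Recall and precision of the recognized index terms w.r.t. the reference."""
    ref_terms, rec_terms = set(reference), set(recognized)
    common = ref_terms & rec_terms
    return len(common) / len(ref_terms), len(common) / len(rec_terms)

# One substitution: "warning" was recognized as "morning".
reference = ["storm", "warning", "storm", "coast"]
recognized = ["storm", "morning", "storm", "coast"]

ter = term_error_rate(reference, recognized)
ier = indicator_error_rate(reference, recognized)
recall, precision = recall_precision(reference, recognized)
```

The single substitution corresponds to a WER of 1/4 but a TER of 2/4, since the missed term and the spurious term are both counted. Because both documents contain the same number of distinct index terms, the IER coincides with the sum of the recall and precision errors.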
For each error measure, we can determine the accuracy rate, which is given by max(1 − ER, 0), where ER is the WER, the TER, or the IER, respectively. Assuming a linear dependency of the retrieval effectiveness on the accuracy rate, we can compute the squared empirical correlation between the MAP obtained on the recognized documents and the product of the accuracy rate and the MAP obtained on the reference documents. Table 5 shows the correlation coefficients thus computed. The computation of the accuracy rates refers to the ninth column of Tables 3, 8, and 9, that is, all documents were stopped and stemmed beforehand and reduced to query terms. Substitutions were counted only once in order to determine the word accuracies. Among the proposed error measures, the IER seems to correlate best with the retrieval effectiveness. However, the amount of data is still too small, and further experiments will be necessary to confirm this proposition.

Further discussion
In this section, we investigate the magnitude of the performance loss from a theoretical point of view. For this purpose, we consider the retrieval process in detail. When a user addresses a query to a retrieval system, each document in the database is rated according to its RSV. The induced ranking list determines a permutation $\pi$ of the documents that can be mapped onto a vector indicating whether or not the document at each position of the list is relevant with respect to $q$. Let $f$ be a retrieval function. Then, the application of $f$ to a document collection $\mathcal{D}$ given a query $q$ leads to the permutation $f_q(\mathcal{D}) = (d_{\pi(1)}, d_{\pi(2)}, \ldots, d_{\pi(K)})$ with $\pi$ induced by the following order:

$$f(q, d_{\pi(1)}) \ge f(q, d_{\pi(2)}) \ge \cdots \ge f(q, d_{\pi(K)}).$$

With the definition of the indicator function

$$\mathcal{I}_q(d) := \begin{cases} 1 & \text{if } d \in \mathcal{D}_{rel}(q), \\ 0 & \text{otherwise}, \end{cases}$$

the ranking list can be mapped onto a binary vector

$$\bigl(\mathcal{I}_q(d_{\pi(1)}), \ldots, \mathcal{I}_q(d_{\pi(K)})\bigr) \in \{0, 1\}^K.$$

Even though the deterioration of transcriptions caused by recognition errors may change the indicator vector, a performance loss will only occur if the RSVs of relevant documents fall below the RSVs of irrelevant documents. Note that among the four possible cases of local exchange operations between documents, that is, $\mathcal{I}_q(d_{\pi(i)}) \in \{0, 1\}$ changes its position with $\mathcal{I}_q(d_{\pi(j)}) \in \{0, 1\}$ ($i \ne j$), only one case can cause a performance loss. Interestingly, it is possible to specify an upper bound for the probability that two documents $d_i$ and $d_j$ with $f(q, d_i) > f(q, d_j)$ will change their relative order if they are deteriorated by recognition errors, that is, $f(q, \tilde{d}_i) < f(q, \tilde{d}_j)$ will hold for the recognized documents $\tilde{d}_i$ and $\tilde{d}_j$. According to [18], this upper bound depends on the following quantities: $p_c(t)$ denotes the probability that $t$ is correctly recognized, $p_e(t)$ is the probability that $t$ is recognized even though $\tau$ ($\tau \ne t$) was spoken, and $l(d)$ is a document-specific length normalization that depends on the retrieval metric used. The upper bound for the probability of changing the order of two documents vanishes for increasing document lengths [14, page 135]. In particular, this means that the relevant documents of the TREC-7 and the TREC-8 corpus are less affected by recognition errors than irrelevant documents since the average length of relevant documents is substantially larger than the average length of irrelevant documents (see Table 1).

Now, let $\pi_0 : \{1, \ldots, K\} \to \{1, \ldots, K\}$ denote a permutation of the documents so that $f(q, d_{\pi_0(1)}) > \cdots > f(q, d_{\pi_0(K)})$ holds for a query $q$. Then, we can define a $K \times K$ matrix $A = (a_{ij})$ whose elements refer to the pairs of documents $d_{\pi_0(i)}$ and $d_{\pi_0(j)}$. At the beginning, $A$ is an upper triangular matrix whose diagonal elements are zero. Since exchanges between relevant documents and exchanges between irrelevant documents do not affect the retrieval performance, each matrix element $a_{ij}$ will be set to 0 if $\{d_{\pi_0(i)}, d_{\pi_0(j)}\} \subseteq \mathcal{D}_{rel}(q)$ or $\{d_{\pi_0(i)}, d_{\pi_0(j)}\} \subseteq \mathcal{D}_{irr}(q)$. Then, the expectation of the ranking, that is, the permutation $\pi$ maximizing the MAP of the recognized documents, can be determined according to Algorithm 1 using a greedy policy. The sequence of permutations $\pi_K \circ \cdots \circ \pi_1 \circ \pi_0$ defines a sequence of reorderings that corresponds to the expectation of the new ranking. The expectation will maximize the likelihood if the documents in the database are pairwise stochastically independent.
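The observation that only one of the four possible exchange operations can cause a performance loss can be verified on a binary relevance vector. The sketch below uses the usual definition of average precision for a complete ranking list; the example vector and the helper names are hypothetical.

```python
def average_precision(relevance):
    """Average precision of a ranking given as a binary relevance vector."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank      # precision at the rank of each relevant document
    return total / hits if hits else 0.0

def swap(relevance, i, j):
    """Return a copy of the ranking with positions i and j exchanged."""
    swapped = list(relevance)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped

ranking = [1, 1, 0, 1, 0]                                 # 1 = relevant, 0 = irrelevant
base = average_precision(ranking)

rel_rel = average_precision(swap(ranking, 0, 1))          # two relevant documents
irr_irr = average_precision(swap(ranking, 2, 4))          # two irrelevant documents
rel_up = average_precision(swap(ranking, 2, 3))           # relevant moves above irrelevant
rel_down = average_precision(swap(ranking, 1, 2))         # relevant falls below irrelevant
```

Exchanges within the same relevance class leave the average precision unchanged, an exchange that moves a relevant document upwards improves it, and only the exchange in which a relevant document falls below an irrelevant one lowers it.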

PROBABILISTIC APPROACHES TO IR
Besides heuristically motivated retrieval metrics, several probabilistic approaches to information retrieval were proposed and investigated over the past years. The methods range from binary independence retrieval models [19] and language model-based approaches [20] to methods based on statistical machine translation [21]. The starting point of most probabilistic approaches to IR is the a posteriori probability $p(d|q)$ of a document $d$ given a query $q$. The posterior probability can be directly interpreted as RSV. In contrast to many heuristic retrieval models, RSVs of probabilistic approaches are thus always normalized and even comparable between different queries. Often, the posterior probability $p(d|q)$ is denoted by $p(d, b \in \{\mathrm{rel}, \mathrm{irr}\} \mid q)$, with the random variable $b$ indicating the relevance of $d$ with respect to $q$. However, since we consider noninteractive retrieval methods only, $b$ is not observable and therefore obsolete since it cannot affect the retrieval process. The a posteriori probability can be rewritten as

$$p(d|q) = \frac{p(q|d) \cdot p(d)}{p(q)}. \tag{34}$$

A document maximizing (34) is determined using Bayes' decision rule

$$q \mapsto \hat{d}(q) = \operatorname*{argmax}_{d} \bigl\{ p(q|d) \cdot p(d) \bigr\}. \tag{35}$$

This decision rule is known to be optimal with respect to the expected number of decision errors if the required distributions are known [22]. However, as neither $p(q|d)$ nor $p(d)$ is known in practical situations, it is necessary to choose models for the respective distributions and estimate their parameters using suitable training data. Note that (35) can be easily extended to a ranking by determining not only the document maximizing $p(d|q)$, but also by compiling a list that contains all documents sorted in descending order with respect to their posterior probability. In the recent past, several probabilistic approaches to information retrieval were proposed and evaluated. In [21], the authors describe a method based on statistical machine translation. A query is considered as a sequence of keywords extracted from an imaginary document that best meets the user's information need. 
Pairs of queries and documents are considered as bilingual annotated texts, where the objective of finding relevant documents is ascribed to a translation of a query (source language) into a document (target language). Experiments were carried out on various TREC tasks. Using the IBM-1 translation model [23] as well as a simplified version called IBM-0, the obtained retrieval effectiveness outperformed the tf-idf metric.
The approach presented in [24] makes use of multistate hidden Markov models (HMM) to interpolate document-specific language models with a background language model. The background language model that is estimated on the whole document collection is used in order to smooth the probabilities of unseen index terms in the document-specific language models. Experiments performed on the TREC-7 ad hoc retrieval task showed better results than tf-idf.
In [25], the authors investigate an advanced version of the Markovian approach as proposed by [24]. Experiments conducted on the TREC-7 and TREC-8 SDR tasks achieve a retrieval effectiveness that is comparable with the Okapi metric, but does not outperform the SMART-2 results.
Even though many probabilistic retrieval metrics are able to outperform basic retrieval metrics as, for example, tf-idf, they usually do not achieve the effectiveness of advanced heuristic retrieval metrics such as SMART-2 or Okapi. In particular, for SDR tasks, probabilistic metrics often turned out to be less robust towards recognition errors than their heuristic counterparts. To compensate for this, we propose a new statistical approach to information retrieval that is based on document similarities [26].

Probabilistic retrieval using document representations
A fundamental difficulty in statistical approaches to information retrieval is the fact that typically a rare index term is well suited to filter out a document. On the other hand, a reliable estimation of distribution parameters requires that the underlying events, that is, index terms, are observed as frequently as possible. Therefore, it is necessary to properly smooth the distributions. In our case, document-specific term probabilities p(t|d) are smoothed with term probabilities of documents that are similar to d. The similarity measure is based on document representations which in the simplest case can be document-specific histograms of the index terms.
The starting point of our approach is the joint probability $p(q, d)$ of a query $q$ and a document $d$, which is decomposed over the query terms $q_1, \ldots, q_{|q|}$. Here, $|q|$ denotes the number of index terms in $q$. The conditional probabilities $p(q_j, d \mid q_1^{j-1})$ of this decomposition are modeled by means of document representations $r$. Here, two model assumptions have been made: first, the conditional probabilities $p(q \mid d, r)$ are assumed to be independent of $d$ (see (39)) and secondly, $p(d_i \mid r, d_1^{i-1})$ will not depend on the predecessor terms $d_1^{i-1}$ (see (41)).

Variants of interpolation
It remains to specify models for the document representations $r \in R$ as well as the distributions $p(q|r)$, $p(d|r)$, and $p(r)$. Since we want to distinguish between the event that a query term $t$ is predicted by a representation $r$ and the event that the term to be predicted is part of a document, $p(q|r)$ and $p(d|r)$ are modeled differently. In our approach, we identify the set of document representations $R$ with the histograms over the index terms of the document collection $\mathcal{D}$. Thus, we can define the interpolations $p_q(t|r)$ and $p_d(t|r)$ as models for $p(q|r)$ and $p(d|r)$, controlled by the interpolation parameters $\alpha$ and $\beta$. Since we do not make any assumptions about the a priori relevance of a document representation, we set up a uniform distribution for $p(r)$. Note that (44) is an interpolation between the relative counts $n_r(t)/n_r(\cdot)$ and $n(t)/n(\cdot)$. Instead of interpolating between the relative frequencies as in (44), we can also interpolate between the absolute frequencies, as in (45). Both interpolation variants will be discussed in the following section.
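The difference between interpolating relative and absolute frequencies can be made concrete with a small sketch. It assumes that the parameter `beta` weights the global histogram pooled over all documents and that 1 − `beta` weights the document-specific histogram; this assignment, the function names, and the toy collection are assumptions for illustration and are not taken from the definitions of the paper.

```python
from collections import Counter

def smoothed_relative(t, doc, collection, beta):
    """Interpolate the relative frequencies n_r(t)/n_r(.) and n(t)/n(.)."""
    n_r, n_all = Counter(doc), Counter(collection)
    return (1.0 - beta) * n_r[t] / len(doc) + beta * n_all[t] / len(collection)

def smoothed_absolute(t, doc, collection, beta):
    """Interpolate the absolute counts first and normalize afterwards."""
    n_r, n_all = Counter(doc), Counter(collection)
    numerator = (1.0 - beta) * n_r[t] + beta * n_all[t]
    denominator = (1.0 - beta) * len(doc) + beta * len(collection)
    return numerator / denominator

# Hypothetical toy collection; the global histogram pools all documents.
docs = [["a", "b"], ["b", "c", "c", "d"]]
collection = [t for doc in docs for t in doc]
vocabulary = set(collection)

p_rel = {t: smoothed_relative(t, docs[0], collection, beta=0.5) for t in vocabulary}
p_abs = {t: smoothed_absolute(t, docs[0], collection, beta=0.5) for t in vocabulary}
```

Both variants yield proper distributions over the vocabulary and assign a nonzero probability to the term "c", which does not occur in the first document. In the absolute variant, the much larger global counts dominate for the same value of `beta`, so the two schemes call for different parameter ranges.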

Experimental results
Experiments were performed on the TREC-7 and the TREC-8 SDR task using both the manually generated transcriptions and the automatically generated transcriptions. As before, all speech recognition outputs were produced using the RWTH LVCSR system for the TREC-7 corpus or taken from the Byblos "Rough 'N Ready" and the Dragon LVCSR system for the TREC-8 corpus.
Due to the small number of test queries for both retrieval tasks, we made use of a leaving-one-out (L-1-O) approach [27, page 220] in order to estimate the interpolation parameters α and β. Additionally, we added results under unsupervised conditions, that is, we optimized the smoothing coefficients α and β on the TREC-8 queries and corpus and tested on the TREC-7 sets and vice versa. Finally, we carried out a cheating experiment by adjusting the parameters α and β to maximize the MAP on the complete set of test queries. This yields an optimistic upper bound of the possible retrieval effectiveness. All experiments conducted are based on the document representations according to (42), that is, each document is smoothed with all other documents in the database.
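The difference between the leaving-one-out estimate and the cheating experiment can be sketched as follows. The table of average precision values is purely hypothetical and only illustrates the procedure; in a real experiment, each entry would come from a retrieval run with the given parameter value. The function names are chosen for illustration.

```python
def leave_one_out_map(ap_table):
    """MAP under leaving-one-out parameter selection.

    ap_table[alpha][i] is the average precision of query i for parameter alpha.
    """
    alphas = list(ap_table)
    num_queries = len(next(iter(ap_table.values())))
    held_out_scores = []
    for i in range(num_queries):
        # Choose the parameter on all queries except query i ...
        best = max(
            alphas,
            key=lambda a: sum(ap for j, ap in enumerate(ap_table[a]) if j != i),
        )
        # ... and evaluate it on the held-out query i.
        held_out_scores.append(ap_table[best][i])
    return sum(held_out_scores) / num_queries

def cheating_map(ap_table):
    """MAP of the single parameter value that is best on the complete query set."""
    return max(sum(aps) / len(aps) for aps in ap_table.values())

# Hypothetical average precision values for two parameter settings and three queries.
ap_table = {0.25: [1.0, 0.2, 0.3], 0.75: [0.5, 0.6, 0.7]}

loo = leave_one_out_map(ap_table)
cheat = cheating_map(ap_table)
```

In this toy table, the leaving-one-out estimate lies below the cheating result, which illustrates why the latter is an optimistic upper bound.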
In a first experiment, the interpolation parameter α was estimated. Figure 2 shows the MAP as a function of the interpolation parameter α with fixed β on the reference transcriptions of the TREC-7 corpus. Using the L-1-O estimation scheme, the best value for α was found to be 0.742, which has to be compared with a globally optimal value of 0.875, that is, the cheating experiment without L-1-O. The interpolation parameter β was adjusted in a similar way. Using the interpolation scheme according to (44), the retrieval effectiveness on both tasks is maximum for values of β that are very close to 1. This effect is caused by singletons, that is, index terms that occur only once in the whole document collection. Since the magnitude of the ratio of both denominators in (44) is approximately D, the number of documents in the collection, the optimal value for β should be found in the range of 1 − 1/D, assuming that singletons are the most important features for filtering out a relevant document. In fact, using β = 1 − 1/D exactly meets the optimal value of 0.99965 on the TREC-7 corpus and 0.99995 on the TREC-8 retrieval task. However, since the interpolation according to (44) runs the risk of becoming numerically unstable (especially for very large document collections), we investigated an alternative smoothing scheme that interpolates between absolute counts instead of relative counts (see (45)). Figure 3 depicts the MAP as a function of the interpolation parameter β for both interpolation methods on the reference transcriptions of the TREC-7 SDR task. Since the interpolation scheme according to (45) proved to be numerically stable and achieved slightly better results, it was used for all further experiments. Table 6 shows the obtained retrieval effectiveness for the new probabilistic approach on the TREC-7 SDR task.
Using L-1-O, the retrieval performance of the newly proposed method is of the same magnitude as that of the SMART-2 metric, that is, we obtained a MAP of 45.8% on manually transcribed data, which must be compared with 46.6% using the SMART-2 retrieval metric. Using automatically generated transcriptions, we achieved a MAP of 40.4%, which is close to the performance of the SMART-2 metric. A further performance gain could be obtained under unsupervised conditions. Using the optimal parameter setting of the TREC-8 corpus for the TREC-7 task achieved a MAP of 41.6%. Figure 4 shows the recall-precision graphs for both SMART-2 and the new probabilistic approach. The same applies to the results obtained on the TREC-8 SDR task (see Table 7). Here, the new probabilistic approach even outperformed the SMART-2 retrieval metric: we obtained a MAP of 51.3% on the manually transcribed data in comparison with 49.6% for the SMART-2 metric. This improvement over SMART-2 is also obtained on recognized transcriptions even though the improvement is smaller. We achieved a MAP of 44.4% on the automatically generated transcriptions produced with the Byblos speech recognizer, which is an improvement of 3% relative compared to the SMART-2 metric, and 44.1% MAP using the Dragon speech recognition outputs, which is an improvement of 5% relative. Similar to the results obtained on the TREC-7 corpus, the unsupervised experiments conducted on the automatically generated transcriptions of the TREC-8 task showed a further performance gain between 1% and 2% absolute. Figure 5 shows the recall-precision graphs for SMART-2 and the probabilistic approach.
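The relation β = 1 − 1/D reported above can be checked with a few lines of arithmetic, assuming that D is the number of documents of the respective task (2866 for TREC-7 and 21745 for TREC-8). The function name is hypothetical.

```python
def beta_from_collection_size(num_docs):
    """Smoothing weight 1 - 1/D for a collection of D documents."""
    return 1.0 - 1.0 / num_docs

for name, num_docs in (("TREC-7", 2866), ("TREC-8", 21745)):
    print(f"{name}: beta = {beta_from_collection_size(num_docs):.5f}")
# TREC-7: beta = 0.99965
# TREC-8: beta = 0.99995
```

Both values agree with the optimal values of β stated for the interpolation of relative frequencies.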

CONCLUSION
In this paper, we presented a detailed analysis of the effect of recognition errors on retrieval performance. Since retrieval performance is only slightly affected by recognition errors, we investigated two alternative error measures, namely, the TER and the IER, which turned out to be better suited to describing the quality of automatically generated transcriptions for retrieval applications. Experiments carried out on the TREC-7 and TREC-8 SDR tasks revealed a better correlation between the obtained retrieval effectiveness and the proposed error measures. Baseline results were produced using a new query expansion method.
In the second part of this paper, we presented a new probabilistic approach to SDR based on interpolations between document-specific term histograms and a global term histogram that is pooled over all documents. To this purpose, the set of documents was mapped onto a set of document representations. These document representations were identified with document-specific histograms and can be interpreted as a kind of nearest neighbor concept. Two smoothing schemes were discussed and investigated. Experiments performed on the TREC-7 and the TREC-8 SDR task showed comparable or even better results for the new probabilistic approach than an enhanced version of the SMART-2 retrieval metric. In addition, the new probabilistic approach turned out to be robust towards recognition errors.