Multi-modal highlight generation for sports videos using an information-theoretic excitability measure
© Hasan et al.; licensee Springer. 2013
Received: 8 November 2012
Accepted: 14 October 2013
Published: 18 November 2013
The ability to detect and organize ‘hot spots’ representing areas of excitement within video streams is a challenging research problem when techniques rely exclusively on video content. A generic method for sports video highlight selection is presented in this study which leverages both video/image structure as well as audio/speech properties. Processing begins where the video is partitioned into small segments and several multi-modal features are extracted from each segment. Excitability is computed based on the likelihood of the segmental features residing in certain regions of their joint probability density function space which are considered both exciting and rare. The proposed measure is used to rank order the partitioned segments to compress the overall video sequence and produce a contiguous set of highlights. Experiments are performed on baseball videos based on signal processing advancements for excitement assessment in the commentators’ speech, audio energy, slow motion replay, scene cut density, and motion activity as features. Detailed analysis on correlation between user excitability and various speech production parameters is conducted and an effective scheme is designed to estimate the excitement level of commentator’s speech from the sports videos. Subjective evaluation of excitability and ranking of video segments demonstrate a higher correlation with the proposed measure compared to well-established techniques indicating the effectiveness of the overall approach.
1 Introduction
Automatic video analysis and summarization has a wide range of applications in domains such as sports, movies, security, news and on-line video streaming. Hot-spot information can be utilized in technologies such as search, summarization, and mash-ups, in addition to navigation of multimedia content. For example, emotional ‘hot-spots’ within sports videos are usually more exciting than the overall game video, which motivates the formulation of a solution to automatically generate highlights from such videos. Various approaches towards automatic event detection and summarization in sports videos have been presented in the literature. Past methods utilize information from a single modality , or combine multiple modalities in different ways [2–7]. Many techniques depend on a specific sport type [2, 8, 9], video effects , or environments. Methods such as those used by Lien et al.  depend on annotating the full game automatically using sophisticated machine learning and domain knowledge, whereas other methods tend to be more generic [3, 10–12]. In simpler methods such as in  as applied to baseball games, the probability of a baseball hit and excited speech is combined to estimate the excitability of a video segment. In , a generic approach was presented to estimate expected variations in a user’s excitement from the temporal characteristics of selected audio-visual features and the editing scheme of a video. In general, generic highlight extraction schemes aim at constructing temporal features from audio/video streams that are proportional to or indicate user excitability . Later, some kind of fusion strategy is used to generate a single excitement curve  providing estimated affective state of the viewer at different points in time/video segment. In our initial study , we examined a simple audio/video feature fusion method for baseball highlight extraction.
In this paper, we extend the feature space and propose an information-theoretic measure of excitability for sports highlight selection in a generic framework . Our proposed measure is based on a simple but powerful principle of information theory: the less likely an event, the more information it contains. We assume that interesting parts in a video occur rarely  and therefore have high self-information (also known as the ‘surprisal’) . This can be intuitively understood as follows: if for a given sports video the ambient crowd noise is always high, then audio energy as an excitement indicator  would be unreliable, i.e., in this game, there is nothing ‘surprising’ in high audio energy. Our proposed method aims at estimating the user excitability directly from low-level features using their joint-PDF estimated over the game videos. Even when extended videos are not available for training these models, the proposed technique can still extract highlights from a given game video by estimating the feature PDFs from itself in an unsupervised fashion, provided that the features used are generally related to user excitement. An advantage of the proposed method is that it is less affected by extreme values of a single feature due to off-field distractions  since the joint behavior of the features is considered in a probabilistic framework. Using the proposed excitability measure, the video segments can be rank-ordered to automatically generate highlights. The technique can also be used to estimate an excitement-time curve  to demonstrate user-affective states over a time sequence of the video stream.
The multi-modal events/features used in the proposed highlights generation system are the following: slow motion replay, camera motion activity, scene cut density, and excitement in commentators’ speech. Past studies [5, 6] typically consider simplistic features such as energy, zero crossing rate, and others to estimate excitement from the audio modality. Inspired by studies on emotion assessment [18–20], in this study, we analyze the effects of excitement on the parameters of a linear model of speech production derived from commentators’ speech. As will be shown, some speech parameters are strongly correlated with the perceptual excitability and hence, are selected to form a feature vector for the audio-based excitement assessment.
This paper is organized as follows: section 2 proposes the probabilistic excitability measure and discusses methods of its implementation in highlight selection. In section 3, analysis of speech production features and their correlation with excitement in commentators’ speech in the context of sports videos is discussed. In section 4, the overall highlights extraction scheme is presented; section 5 details a subjective evaluation of the full system and discusses results, and section 6 concludes the study.
2 Proposed excitability measure
At first, the video is divided into small segments for feature extraction. Next, several features (scalar parameters) are extracted from each segment that are modeled to be generally proportional to the user’s excitement of the given segment. These features represent long/short term (cumulative) characteristics from different modalities, such as duration of excited speech, average motion activity, and others.
2.1 Basic formulation
The advantage of using the proposed measure is that it not only considers the value of the observation x i (k) in the k th segment, but also takes into account how likely it is that this feature yields a higher value than x i (k). Thus, ζ(k) can be used to rank video segments from a high to low excitement level.
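The displayed equations of this subsection are not reproduced in the text above. As a reading aid only, the following is one form that is consistent with the verbal description (self-information of the event that each feature takes a value at least as high as the one observed). The random variables X_i, the per-feature PDFs p_i, and the independence step in the second line are assumptions made for this sketch, not a quotation of the original equations.

```latex
\begin{align}
\zeta(k) &= -\log P\bigl(X_1 \ge x_1(k), \ldots, X_D \ge x_D(k)\bigr) \\
         &\approx -\sum_{i=1}^{D} \log \int_{x_i(k)}^{\infty} p_i(x)\,dx
         \qquad \text{(features assumed independent)}
\end{align}
```

Under this reading, a feature value that is exceeded in half of the segments contributes -ln 0.5 ≈ 0.69 to ζ(k), whereas a value exceeded in only 1% of the segments contributes -ln 0.01 ≈ 4.61, so rare high values dominate the ranking.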
2.2 Incorporating feature reliability
where Σ i η i =1 and ∀i: 0≤η i <1. If it is known a priori that some features are more reliable than others, then appropriate weights can be set. Alternatively, the correlation of the individual feature parameters to the subjective excitability can be obtained on a development dataset. These correlation values indicate which features are more reliable, i.e., more related to user excitement, and should thus be weighted higher. We will discuss this further in the experiments section.
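One way the development-set idea could be realized is sketched below. The mapping from correlation to weight (clip negative correlations to zero, then normalize so the weights sum to one) is an assumption chosen for illustration; the text only states that more correlated features should receive higher weights. The function name `weights_from_correlation` is hypothetical.

```python
import numpy as np

def weights_from_correlation(features, subjective):
    """Derive feature weights eta_i from a development set (hypothetical mapping).

    features:   (K, D) array of segmental parameters x_i(k)
    subjective: (K,) array of subjective excitability scores
    Returns D non-negative weights that sum to 1.
    """
    features = np.asarray(features, dtype=float)
    subjective = np.asarray(subjective, dtype=float)
    # Pearson correlation of each feature with the subjective scores.
    rho = np.array([np.corrcoef(features[:, i], subjective)[0, 1]
                    for i in range(features.shape[1])])
    rho = np.nan_to_num(rho)            # constant features get zero correlation
    rho = np.clip(rho, 0.0, None)       # assumption: ignore negative correlations
    if rho.sum() == 0.0:
        return np.full(rho.shape, 1.0 / rho.size)   # fall back to equal weights
    return rho / rho.sum()
```

Note that if only one feature has a positive correlation, this mapping assigns it a weight of exactly 1, which sits on the boundary of the constraint stated above; a small floor on every weight would avoid that case.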
3 Excitement measurement in speech
Most current highlight extraction methods utilizing game audio tend to focus on simplistic features such as audio energy or short-time zero crossing [5, 6] to estimate the excitement level. Past literature on emotions and stress suggests that a number of speech production parameters can be affected by varying speech modalities [18–21]. In this section, we extract a set of speech parameters derived from the linear model of speech production and evaluate their correlation with the perceptual excitement level in the commentators’ audio. The subset of parameters displaying a strong correlation with the excitement level is identified and used as features in the design of an audio-based excitement classifier.
which represents a frequency interval of one standard deviation in the distribution of energy spectrum with the mean equal to SCG. We have observed that SES, when combined with SCG, constitutes a more noise-robust spectral descriptor for emotion and stress classification than spectral slope .
Table 1 suggests that mean game F 0, SCG, and F 1−2 exhibit a relatively high linear relationship with subjective excitement labels, while F 3 and SES have just a moderate relationship (also note increased MSE values), and F 4 is almost unaffected by the perceived excitement. This corresponds well with the observations made in the past literature. Variations of vocal effort, typical for excited speech, are carried out by varying both sub-glottal pressure and tension in the laryngeal musculature . Pitch (in log frequency) changes almost linearly with vocal intensity . In the spectral domain, the energy in increased vocal effort speech migrates to higher frequencies, causing an upward shift of SCG , and flattening of the spectral slope of short-time speech spectra [18, 26]. F 1 is inversely proportional to the vertical position of the tongue and F 2 rises with tongue advancement . The increased vocal effort in excited speech is likely to be accompanied by a wider mouth opening, which is realized by lowering the jaw and the tongue. As a result, F 1 will increase in frequency [23, 28]. F 2 rises in some phones  while it may decrease in others . On the other hand, the locations of higher formants are determined mainly by the vocal tract length  and as such are not as sensitive to the vocal effort variations.
Based on the results in Table 1, F 0, SCG, and F 1−3 are chosen as features for the automatic excitement-level assessment. The excitement level classification is conducted using a Gaussian mixture model (GMM)-based classifier and will be discussed in more detail in section 5.1.
4 Highlights extraction system
The notation used from this point forward is as follows: t, k, and i denote video frame, video segment, and feature index, respectively. For the i th feature, Φ i (t), x i (k), and G i (t) indicate feature value at time t, feature parameter extracted from segment k, and viewer arousal curve at time t estimated as in , respectively. The multi-modal events/features used for excitability measure include: (1) slow motion replay, (2) camera motion activity, (3) scene cut density, (4) commentators’ speech in high and (5) low excitement levels, and (6) audio energy. For comparison, we also implemented the highlight selection method presented in . The details of the system are discussed below.
4.1 Video processing
4.1.1 Slow motion detection
Next, the output of the Z c () function is considered for each window; if it is greater than a predefined threshold λ, the window is assumed to contain slow motion frames. We use λ=15.
4.1.2 Camera motion estimation
In sports videos, high camera motion usually indicates exciting events . For detecting camera motion, we use a block-matching algorithm  to estimate the motion vector between successive video frames. A large (64 by 64 pixels) block size is used in order to reduce the motion estimation sensitivity to movement of small objects within the frame. The raw motion values are normalized and stored in Φ 2(t), then smoothed using Equation 8 to obtain G 2(t).
We observe that the amplitude of the resultant motion vector calculated in each frame gives a good indication of camera movement, such as pan and zoom. Thus, segmental feature x 2(k) is computed by averaging G 2(t) across the k th segment.
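The block-matching step can be illustrated with a minimal exhaustive-search sketch. The search range of ±4 pixels, the sum-of-absolute-differences cost, and the averaging of per-block magnitudes into one value per frame are assumptions made here for illustration; only the 64 by 64 block size comes from the text. The function name `block_motion` is hypothetical.

```python
import numpy as np

def block_motion(prev, curr, block=64, search=4):
    """Mean motion-vector magnitude between two grayscale frames.

    For every block of the previous frame, an exhaustive search finds the
    displacement (within +/- search pixels) that minimizes the sum of
    absolute differences (SAD) in the current frame.
    """
    prev = np.asarray(prev, dtype=float)
    curr = np.asarray(curr, dtype=float)
    h, w = prev.shape
    mags = []
    # Keep a margin of `search` pixels so every candidate stays inside the frame.
    for y in range(search, h - block - search + 1, block):
        for x in range(search, w - block - search + 1, block):
            ref = prev[y:y + block, x:x + block]
            best_sad, best_dv = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    cand = curr[y + dy:y + dy + block, x + dx:x + dx + block]
                    sad = np.abs(ref - cand).sum()
                    if best_sad is None or sad < best_sad:
                        best_sad, best_dv = sad, (dy, dx)
            mags.append(np.hypot(*best_dv))
    return float(np.mean(mags)) if mags else 0.0
```

Calling `block_motion` on each pair of successive frames would give a raw motion value per frame, which could then be normalized to form Φ 2(t) as described above.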
4.1.3 Scene-cut density
where n(t) and p(t) are frame indices of the two nearest scene-cuts to the left and right of the frame t, respectively. The parameter δ is set to 500. Again, we use (8) to obtain G 3(t) from Φ 3(t) and average G 3(t) over the k th segment to compute x 3(k).
4.1.4 Pitching scene detection
- i) Area ratio condition ( C A ): The area ratio, R a , is computed from the binary image I f (·,·). If 25%≤R a ≤45%, then C A =1; else 0.
- ii) Layout condition ( C L ): In pitching scenes, the lower half of the image usually contains more baseball field pixels. Thus, if the lower half has more than twice the number of field pixels compared to the upper half, we set C L =1; otherwise C L =0.
- iii) Vertical profile condition ( C V ): The vertical distribution of the field pixels V(i) is given by Equation 11. In pitching scenes, a strong valley is usually found in the left hand side of this distribution, as shown in Figure 12c, due to the presence of the pitcher. If such a valley is found, C V =1; else 0.
- iv) Player location condition ( C P ): From the video frame, a binary edge image is calculated using the ‘sobel’ method and image dilation is performed. The resulting image is then divided into 16 equal blocks as shown in Figure 12d. In pitching scenes, a higher intensity in the edge-dilated image will be observed in blocks 7, 10, 11, and 14 due to the presence of the pitcher and the batter. If the image intensity of these regions is greater than the average intensity of the image, C P =1; else 0.
Here, + and · indicate the boolean OR and AND operations, respectively. In , we assumed that all pitching scenes satisfy C A and are used as an AND condition in C pitch. Using the proposed logic, we successfully detect the pitching shots with an 80.6% accuracy for the baseball games under consideration.
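The first two conditions are simple enough to sketch directly from their descriptions. The code below assumes that the area ratio is the fraction of frame pixels marked as field in the binary image, and that the field mask is already available as a boolean array; how the mask is obtained, the remaining two conditions, and the final boolean combination are not covered. The function name is hypothetical.

```python
import numpy as np

def area_and_layout_conditions(field_mask):
    """Evaluate the area ratio (C_A) and layout (C_L) conditions.

    field_mask: 2-D boolean array, True where a pixel is classified as field.
    Returns (C_A, C_L) as booleans.
    """
    mask = np.asarray(field_mask, dtype=bool)
    ratio = mask.mean()                      # assumed meaning of R_a: field fraction
    c_a = 0.25 <= ratio <= 0.45
    half = mask.shape[0] // 2
    upper = mask[:half].sum()                # field pixels in the upper half
    lower = mask[half:].sum()                # field pixels in the lower half
    c_l = lower > 2 * upper
    return bool(c_a), bool(c_l)
```

For a 100 by 100 frame whose bottom 40 rows are field, the ratio is 0.40 and all field pixels lie in the lower half, so both conditions hold.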
4.2 Audio processing
4.2.1 Speech/non-speech classification
where μ g and σ g ² are the mean and variance of the g th Gaussian (g=1,2). Using the P g (k) values, each segment is assigned to the more likely Gaussian (i.e., the one with the higher posterior probability). Since the non-stationarity of speech is typically higher than that of the background, the Gaussian with the larger mean is assumed to be speech. Using this intuition, speech and background Gaussians within the GMM are identified and every k th segment is assigned to either speech or background. Furthermore, Viterbi smoothing is applied to these decisions using a high self-transition probability (0.9∼0.99). This segmentation is used for our speech excitement level detection scheme. Using the above technique on audio data from the six baseball games (about 15 h of audio), an overall accuracy of 80.1% is obtained with a low miss rate of 2.6% (a miss is speech detected as background) and a false alarm rate of 17.3% (a false alarm here is background detected as speech).
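The smoothing step can be sketched as a two-state Viterbi decoder. As a simplification, the sketch uses the per-segment speech posterior directly as the observation score and a uniform initial state distribution; both choices, and the default self-transition probability of 0.95 (picked from inside the stated 0.9 to 0.99 range), are assumptions. The function name `viterbi_smooth` is hypothetical.

```python
import numpy as np

def viterbi_smooth(p_speech, self_prob=0.95):
    """Smooth speech/background decisions with a two-state Viterbi pass.

    p_speech: per-segment posterior probability of the speech Gaussian.
    Returns an integer array with 1 for speech and 0 for background.
    """
    p = np.clip(np.asarray(p_speech, dtype=float), 1e-6, 1 - 1e-6)
    log_obs = np.log(np.stack([1.0 - p, p], axis=1))        # shape (K, 2)
    log_trans = np.log(np.array([[self_prob, 1 - self_prob],
                                 [1 - self_prob, self_prob]]))
    K = len(p)
    score = np.empty((K, 2))
    back = np.zeros((K, 2), dtype=int)
    score[0] = np.log(0.5) + log_obs[0]
    for k in range(1, K):
        cand = score[k - 1][:, None] + log_trans             # rows: from, cols: to
        back[k] = cand.argmax(axis=0)
        score[k] = cand.max(axis=0) + log_obs[k]
    # Trace the best path backwards from the final segment.
    path = np.empty(K, dtype=int)
    path[-1] = score[-1].argmax()
    for k in range(K - 1, 0, -1):
        path[k - 1] = back[k, path[k]]
    return path
```

With a high self-transition probability, a single low-posterior segment surrounded by confident speech segments stays labeled as speech, while a sustained change in the posteriors still produces a state switch.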
4.2.2 Excitement measurement in speech
Using the features discussed in section 3, a GMM-based classifier is designed to detect high and moderate excitement levels from the game audio. Details of the evaluation of this scheme are presented in section 5.1, alongside the other evaluations from this study. This section describes how the excitement classification output is used in the construction of the feature parameters for highlights extraction.
To estimate the required G i (t) functions for high and moderate/low excitement in speech, we use the same principle used for slow motion feature. First, we form the function Φ 4(t), such that Φ 4(t)=1 if the high excitement class was detected at time instant t, and 0, otherwise. Similarly, we form Φ 5(t) for the detected moderate/low excitement class. Time domain examples of these functions are shown in Figure 13d,e. Next, the corresponding G i (t) functions are computed following Equation 8. The only difference here is that the function Φ 5(t) is inverted before filtering, following the fact that low excitement in speech is inversely proportional to the viewer arousal.
4.2.3 Audio energy measure
To compute the audio energy, a fixed audio segment size of 267 samples is chosen to match our video frame rate of 29.97 frames/s. For each frame t, the audio energy Φ 6(t) is extracted and later filtered using Equation 8 to obtain G 6(t). Finally, the averaged value of G 6(t) in the k th segment is used as the segmental audio energy feature x 6(k).
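A frame-synchronous energy track can be sketched as follows. The 267-sample frame length is taken from the text; at 29.97 frames per second this corresponds to a sampling rate of roughly 8 kHz (267 × 29.97 ≈ 8,002), which is an inference rather than a stated fact. Defining energy as the mean squared amplitude of non-overlapping frames is likewise an assumption, and the function name is hypothetical.

```python
import numpy as np

def frame_energy(samples, frame_len=267):
    """Mean squared amplitude of each non-overlapping audio frame.

    samples:   1-D array of audio samples
    frame_len: samples per video frame (267 in the text)
    Trailing samples that do not fill a whole frame are dropped.
    """
    x = np.asarray(samples, dtype=float)
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames ** 2).mean(axis=1)
```

The returned array has one value per video frame and would play the role of the raw energy track before the filtering step mentioned above.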
4.3 Feature fusion and highlights generation
To estimate excitability in segment k, we use the averaged values of H M (t) in that segment to obtain .
For the proposed excitability measure, first, the multi-modal feature vector x(k)=(x 1(k),x 2(k)⋯x D (k)) is computed for each segment k. All features are normalized to zero mean with unit variance before further processing. Next, the histogram of the x i (k) values across all segments of the video is used to estimate the PDFs . The excitability measure ζ(k) from each segment can then be computed using Equation 3. Note that in this implementation, ζ(k) measure is calculated without a need of any prior knowledge about the features x i (k). In order to generate a highlight video, segmentation is performed using the detected pitching shot locations. The proposed measure is then used to rank order and combine the video segments according to the user-defined overall highlights duration. Sample highlights generated using the proposed technique can be found in http://sites.google.com/site/baseballhighlights/.
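The unsupervised procedure in this paragraph (normalize, estimate per-feature PDFs from histograms, score, rank) can be sketched as below. Because Equation 3 is not reproduced in the text, the score is assumed here to be a weighted sum of negative log tail probabilities, with equal weights by default. The 20-bin histogram and the greedy duration-budget selection are also illustrative assumptions, and segmentation at pitching shots is taken as given. Both function names are hypothetical.

```python
import numpy as np

def excitability(features, weights=None, bins=20):
    """Score every segment from histogram-based tail probabilities.

    features: (K, D) array of segmental parameters x_i(k)
    weights:  optional length-D weights eta_i (default: 1/D each)
    Returns a length-K array of scores zeta(k) (assumed form, see lead-in).
    """
    x = np.asarray(features, dtype=float)
    K, D = x.shape
    std = x.std(axis=0)
    x = (x - x.mean(axis=0)) / np.where(std > 0, std, 1.0)   # zero mean, unit variance
    eta = np.full(D, 1.0 / D) if weights is None else np.asarray(weights, dtype=float)
    zeta = np.zeros(K)
    for i in range(D):
        counts, edges = np.histogram(x[:, i], bins=bins)
        pmf = counts / counts.sum()
        # tail[j] = probability mass of bin j and all higher bins.
        tail = np.cumsum(pmf[::-1])[::-1]
        idx = np.clip(np.searchsorted(edges, x[:, i], side="right") - 1, 0, bins - 1)
        zeta += -eta[i] * np.log(tail[idx])
    return zeta

def select_highlights(zeta, durations, budget):
    """Greedily keep the highest-scoring segments that fit the duration budget.

    Returns the chosen segment indices in their original temporal order.
    """
    chosen, total = [], 0.0
    for k in np.argsort(-np.asarray(zeta, dtype=float)):
        if total + durations[k] <= budget:
            chosen.append(int(k))
            total += durations[k]
    return sorted(chosen)
```

A segment whose features all fall in the top, sparsely populated histogram bins receives the largest score, so it is the first to enter the highlight reel regardless of the user-defined duration.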
4.4 Generation of excitement time curve
4.5 Real-time implementation
In real-time highlight generation, important highlight events (e.g., a home run or a goal, etc.) are detected and played back immediately or sent to the users through online media. These methods usually work on explicitly detecting the important events using game dependent cues. Since the proposed scheme functions on low-level features and estimates excitability using a probabilistic measure, real-time event detection and broadcast is not feasible in the current framework. However, the method can be extended to operate in real-time by processing video segments in chunks and updating the highlights periodically. Though on-line highlight generation is not the focus of this study, we describe here how this can be achieved using the proposed framework.
Here, the PDF estimate of the i th feature at segment k is obtained from the past M segments. The PDFs may also be estimated from all the video segments starting from the beginning of the game up to segment k. For finding a highlight event within the last M segments, a predefined threshold τ on ζ(k) can be used. This threshold can be estimated from the ζ(k) values from the past M segments. This means that if the excitability measure found in the current segment k is significantly high compared to what was observed in the past M segments, segment k is used as a highlight. This approach is summarized in Algorithm 1. It should be noted that, after the game is over, the highlights generated from the full video may not contain all the local ‘short-time’ highlight segments that were detected in real-time.
Algorithm 1 Real-time highlight generation using the proposed framework
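The steps of the algorithm are not reproduced in the text, so the following is only a sketch of the sliding-window procedure described in the preceding paragraph. Several details are assumptions: the tail probability is estimated empirically from the past M segments with add-one smoothing, features are weighted equally, and the threshold τ is set to the mean plus c standard deviations of the past ζ values. The names `tail_surprisal` and `online_highlights` and the parameter `c` are hypothetical.

```python
import numpy as np
from collections import deque

def tail_surprisal(history, x):
    """-log of the smoothed empirical probability that a past value is >= x."""
    history = np.asarray(history, dtype=float)
    return -np.log((np.sum(history >= x) + 1.0) / (len(history) + 1.0))

def online_highlights(feature_stream, M=50, c=2.0):
    """Yield (k, zeta) for segments flagged as highlights while streaming.

    feature_stream: iterable of length-D feature vectors, one per segment
    M:              number of past segments used for the PDF and threshold
    c:              threshold offset in standard deviations (assumed rule)
    """
    past_x = deque(maxlen=M)       # feature vectors of the last M segments
    past_zeta = deque(maxlen=M)    # scores of the last M scored segments
    for k, x in enumerate(feature_stream):
        x = np.asarray(x, dtype=float)
        if len(past_x) == M:
            hist = np.array(past_x)                          # shape (M, D)
            zeta = float(np.mean([tail_surprisal(hist[:, i], x[i])
                                  for i in range(len(x))]))  # equal weights 1/D
            if len(past_zeta) > 1:
                tau = np.mean(past_zeta) + c * np.std(past_zeta)
            else:
                tau = np.inf                                 # not enough history yet
            if zeta > tau:
                yield k, zeta
            past_zeta.append(zeta)
        past_x.append(x)
```

Because the statistics are local to the last M segments, a segment flagged here need not survive in the highlights computed from the full game, which matches the remark above.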
5 Evaluation of proposed system
In this section, we first evaluate the speech production-based classifier in terms of the excitability measure for sports commentators speech. Next, we perform a subjective evaluation of the complete audio-video highlights extraction system combining the multi-modal features.
5.1 Speech production-based classifier
As discussed in section 3, fundamental frequency F 0, spectral center of gravity SCG, and the first three formants F 1−3 were found to provide good correlation with the perceptual excitement labels and hence are used in this section as input parameters for automatic excitement-level assessment.
The speech-based classifier utilized in the currently proposed highlights generation scheme replicates the one presented in our preliminary study . Two GMMs are trained to represent ‘moderate’ excitement (referring to excitement levels 1 and 2) and ‘high’ excitement (levels 3 and 4). The 32-mixture GMMs are trained on samples extracted from 4 baseball games, where the subjective excitement-level labels serve as target transcriptions of the data, and evaluated on the 2 remaining games representing the open test set. The binary classification results are evaluated by means of equal error rate (EER). In a round robin experiment, the classifier utilizing features normalized by the game level statistics provides an EER of 21.4 to 22.4 %, see  for details.
The feature normalization utilized in the previous paragraph assumes that the whole game is available to estimate the global statistics of the commentator’s speech (see section 3). Some applications may require generation of highlight previews while the game is still in progress. In such a case, the statistics of commentator’s speech features may be estimated on-the-fly. One possible approach is to estimate the feature means and standard deviations from the starting segment of the game and cumulatively update their values with every newly observed speech island. It is expected that the cumulative statistic estimates will converge over time towards the global game statistics.
To evaluate how the on-the-fly estimation might impact the accuracy of the excitement level classification, we implement a simple scheme where the statistics estimated from the initial segment of the game are continuously updated based on the newly observed speech islands using a weighted mean. The currently available estimates are used to normalize features extracted from the current voiced island. Clearly, there would be an option to go back in time and re-normalize all past segments with the newest estimates; however, we do not utilize this option in our current study.
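The on-the-fly normalization can be sketched as a running estimate that is updated after each island is normalized. The text specifies only that a weighted mean is used; weighting each island by its number of frames, so that the estimates equal the cumulative statistics of all frames seen so far, is an assumption. The class name `RunningStats` is hypothetical.

```python
import numpy as np

class RunningStats:
    """Cumulative feature mean and standard deviation, updated per speech island."""

    def __init__(self, init_frames):
        # Statistics of the starting segment of the game, shape (N0, D).
        init = np.asarray(init_frames, dtype=float)
        self.n = len(init)
        self.mean = init.mean(axis=0)
        self.sq_mean = (init ** 2).mean(axis=0)

    def std(self):
        return np.sqrt(np.maximum(self.sq_mean - self.mean ** 2, 1e-12))

    def normalize_then_update(self, island):
        """Normalize an island with the current estimates, then fold it in."""
        island = np.asarray(island, dtype=float)
        normalized = (island - self.mean) / self.std()
        m = len(island)
        w = m / (self.n + m)                     # assumed weight: share of frames
        self.mean = (1 - w) * self.mean + w * island.mean(axis=0)
        self.sq_mean = (1 - w) * self.sq_mean + w * (island ** 2).mean(axis=0)
        self.n += m
        return normalized
```

Past islands are not re-normalized when the estimates change, in line with the choice stated above.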
Figure 16 details the averaged round robin EERs as functions of the elapsed game time. A 1-min-long window was applied to smooth the temporal EERs. It can be seen that the classifier utilizing game level statistics performs more accurately through a major part of the game; however, besides the first 4 min of the game, where the EER difference for both 50- and 3,000-s setups versus the game level setup drops by nearly 9 % EER, the EER difference for the global and on-the-fly systems does not exceed more than approximately 3 %, which seems quite affordable.
5.2 Subjective evaluation of the highlight generation system
In order to evaluate the effectiveness of the proposed measure of excitability, we conducted an independent subjective evaluation involving five viewers familiar with the game of baseball. We note that subjective evaluation of video summarization is quite challenging since it requires each subject to watch the entire game (typically approximately 3-h long), memorize all exciting moments and then select the most exciting parts to assess if the algorithm properly selected the highlights. In addition, it is natural to have a strong disagreement between viewers in deciding which scene should be selected/rejected. To make the evaluation feasible, we selected 100 pitching scenes (semantic segments) of about 15 s in duration from a baseball game. The subjects were asked to watch the videos and rank the excitability of the scene on a scale from 0 to 30. The rubrics used are boring (0 to 10), moderately exciting (10 to 20), and very exciting (20 to 30). Since the dynamic range of the ranking scores may vary across subjects, the scores obtained from each subject are normalized to zero mean and unit variance, and then averaged to obtain the final subjective measure. Note that the normalization here does not affect the ranking order of the segments decided by the subject (i.e., order from lowest ranking to highest ranking scenes will be preserved and also segments eligible to form highlights remain the same). Next, we extract excitement estimation in the selected 100 video segments obtained from (i) computed using , (ii) γ(k) computed from a simple addition and normalization of the feature parameters x i (k) , and (iii) the proposed measure ζ(k). The measures are evaluated by comparing their correlation and ranking agreement with the subjective measure.
5.2.1 Correlation between individual features and subjective scores
Comparison of correlation coefficient ( ρ ) between different feature parameters and subjective evaluation scores
Segmental feature parameter
Slow motion duration, x 1(k)
Average motion activity, x 2(k)
Average scene cut density, x 3(k)
Duration of speech in high class, x 4(k)
(-ve) Duration of speech in low class, x 5(k)
Audio energy, x 6(k)
Looking closely at Figure 17, we conclude that for our particular case, the most effective feature parameter is the average camera motion, with the highest individual correlation coefficient (ρ=0.6531). This is an indication that camera movement is more prominent during exciting moments of a baseball game, which is probably not the case for tennis, where the camera is more or less stationary. The excitement level detected in commentator speech is another good indicator of excitability (ρ=0.5146). This is more or less true for any sport. The slow motion feature is also highly correlated with excitability (ρ=0.5066).
Equal weights: η i =1/D=1/6. Weight vector: w 0= [ 1 1 1 1 1 1]/6
Higher weights to features x 1(k) & x 2(k): w 1= [ 2 2 1 1 1 1]/8
Higher weights to features x 2(k) & x 4(k): w 2= [ 1 2 1 2 1 1]/8
These weight settings are motivated by the higher correlation these individual features demonstrate (slow motion, high audio excitement and camera motion activity). In practical applications, we will not be able to estimate the correlation values since the subjective scores will not be available. However, using a development dataset, these weights can be trained and later applied during evaluation.
5.2.2 Correlation between subjective and objective scores
When equal weights are used for the parameters η i in (3), the proposed excitability measure ζ(k) is seen to have the highest correlation coefficient of 0.77473, which outperforms the other two excitability measures γ(k) and , yielding correlations of 0.67579 and 0.59287, respectively. This indicates the effectiveness of the proposed measure in fusing multiple feature parameters. These values are also shown in Figure 17, indicating that the subjective scores are better correlated with the proposed excitability measure, as opposed to the individual features.
Performance evaluation by comparing top N highlight segments selected using subjective and objective methods
%TPR in top N segments
Using equal feature weights
Using feature weights: η=w 1
Using feature weights: η=w 2
5.2.3 Comparison between subjective and objective highlight rank
As an example, for N=20, the proposed technique selects 70% of the top 20 subjectively ranked segments with an FPR of 7.5%, while for , the TPR and FPR values are 55% and 11.2%, respectively. With proper design of the feature weights, the highlight selection performance can improve further as shown in Table 2.
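The top-N comparison can be sketched as below. The definitions used (TPR as the share of the subjective top N that the objective measure also places in its top N, FPR as the share of the remaining segments that it selects) are assumptions, but they are consistent with the figures quoted above: with 100 segments and N=20, 14 agreeing segments give a TPR of 70% and 6/80 = 7.5% FPR, while 11 agreeing segments give 55% and 9/80 ≈ 11.2%. The function name is hypothetical.

```python
def top_n_agreement(subjective, objective, n):
    """Return (TPR, FPR) in percent for the top-n segments of two score lists."""
    k = len(subjective)

    def top(scores):
        # Indices of the n highest-scoring segments.
        return set(sorted(range(k), key=lambda j: -scores[j])[:n])

    ref, sel = top(subjective), top(objective)
    hits = len(ref & sel)
    tpr = 100.0 * hits / n                 # selected among the true top n
    fpr = 100.0 * (n - hits) / (k - n)     # selected among the remaining k - n
    return tpr, fpr
```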
6 Conclusions
A generic video highlights generation scheme based on an information theoretic measure of user excitability was presented. The scheme utilizes audio excitement and low-level video features. Based on the analysis of the sports commentator’s speech, production parameters most correlated with the perceptual excitability were selected to form an acoustic feature vector for excitement assessment. Along with this, audio energy, slow motion duration, camera motion, and scene-cut density were used as features in the highlights extraction system. The proposed integrated excitability measure was shown to effectively combine the multi-modal features in video segments and found to be highly correlated with a perceptual assessment of excitability. The proposed scheme was also shown to outperform state-of-the-art generic excitability ranking methods.
This project was funded by the AFRL under contract FA8750-12-1-0188 (Approved for public release, distribution unlimited), and partially by the University of Texas at Dallas from the Distinguished University Chair in Telecommunications Engineering held by J.H.L. Hansen.
- Pan H, Van Beek P, Sezan M: Detection of slow-motion replay segments in sports video for highlights generation. Proc. IEEE ICASSP 7–11 May 2001 Google Scholar
- Delakis M, Gravier G, Gros P: Audiovisual integration with Segment Models for tennis video parsing. Comput. Vis. Image Underst 2008, 111(2):142-154. 10.1016/j.cviu.2007.09.002View ArticleGoogle Scholar
- Fleischman M, Roy B, Roy D: Temporal feature induction for Baseball highlight classification. In Proceedings of the 15th Int. Conf. on Multimedia, Augsburg, Germany, 24–29 Sept.. New York: ACM; 2007:333-336.Google Scholar
- Xiong Z, Radhakrishnan R, Divakaran A: Generation of sports highlights using motion activity in combination with a common audio feature extraction framework. Proc. IEEE ICIP, Barcelona, Catalonia, 14-17 Sept. 2003.Google Scholar
- Kolekar M, Sengupta S: Semantic concept mining in cricket videos for automated highlight generation. Multimedia Tools and Appl 2010, 47(3):545-579. 10.1007/s11042-009-0337-1View ArticleGoogle Scholar
- Tjondronegoro D, Tao X, Sasongko J, Lau C: Multi-modal summarization of key events and top players in sports tournament videos. In 2011 IEEE Workshop on Applications of Computer Vision (WACV), Kona, HI, 5–7 January. Piscataway: IEEE; 2011:471-478.View ArticleGoogle Scholar
- Cheng C, Hsu C: Fusion of audio and motion information on HMM-based highlight extraction for baseball games. IEEE Trans. Multimedia 2006, 8(3):585-599.View ArticleGoogle Scholar
- Lien C, Chiang C, Lee C: Scene-based event detection for baseball videos. J. of Visual Comm. and Image Representation 2007, 18: 1-14. 10.1016/j.jvcir.2006.09.002View ArticleGoogle Scholar
- Assfalg J, Bertini M, Bimbo AD, Nunziati W, Pala P: Soccer highlights detection and recognition using HMMs. In Proc. ICME ’02, Lausanne, Switzerland, 26-29 Aug. 2002 Volume 1. Piscataway: IEEE; 2002:825-828.Google Scholar
- Hanjalic A: Adaptive extraction of highlights from a sport video based on excitement modeling. IEEE Trans. Multimedia 2005, 7(6):1114-1122.View ArticleGoogle Scholar
- Xiong Z, Radhakrishnan R, Divakaran A, Huang T: Audio events detection based highlights extraction from baseball, golf and soccer games in a unified framework. In Proc. IEEE ICASSP, Hong Kong, China, 6–10 April 2003 Volume 5. Washington, DC: IEEE Computer Society; 2003:632-635.Google Scholar
- Chang P, Han M, Gong Y: Extract highlights from baseball game video with hidden Markov models. In Proc. Image Process., Rochester, New York, 22–25, Sept. Volume 1. Piscataway: IEEE; 2002:609-612.Google Scholar
- Rui Y, Gupta A, Acero A: Automatically extracting highlights for TV baseball programs. In Proceedings of the Eighth ACM International Conference on Multimedia, Los Angeles, CA, October 30 – November 03. New York: ACM; 2000:115-115.Google Scholar
- Ma Y, Lu L, Zhang H, Li M: A user attention model for video summarization. In Proceedings of the tenth ACM international conference on Multimedia, Juan les Pins, France, 1–6 December, 2002. New York: ACM; 2002:542-542.Google Scholar
- Bořil H, Sangwan A, Hasan T, Hansen JHL: Automatic excitement-level detection for sports highlights generation. Proc. InterSpeech, Makuhari, Chiba, Japan 26–30 September 2010 2202-2205.Google Scholar
- Hasan T, Bořil H, Sangwan A, Hansen JHL: A multi-modal highlight extraction scheme for sports videos using an information-theoretic excitability measure. IEEE ICASSP, Kyoto, Japan 25–30, March 2012 2381-2384.Google Scholar
- Cover TM, Thomas JA: Elements of Information Theory. New York: Wiley-Interscience; 1991.
- Hansen JHL: Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition. Speech Comm 1996, 20(1-2):151-173. 10.1016/S0167-6393(96)00050-7
- Cowie R, Douglas-Cowie E, Tsapatsoulis N, Votsis G, Kollias S, Fellenz W, Taylor JG: Emotion recognition in human-computer interaction. IEEE Signal Process. Mag 2001, 18: 32-80. 10.1109/79.911197
- Bořil H, Kleinschmidt T, Boyraz P, Hansen JHL: Impact of cognitive load and frustration on drivers’ speech. The J. Acoust. Soc. Am 2010, 127(3):1996-1996.
- Lee CM, Narayanan SS: Toward detecting emotions in spoken dialogs. IEEE Trans. on Speech & Audio Process 2005, 13(2):293-303.
- Sjolander K, Beskow J: WaveSurfer - an open source speech tool. Proc. of ICSLP ’00, Beijing, China, 16–20 Oct. 2000, Volume 4.
- Schulman R: Dynamic and perceptual constraints of loud speech. The J. Acoust. Soc. Am 1985, 78(S1):S37-S37.
- Gramming P, Sundberg S, Ternström S, Perkins W: Relationship between changes in voice pitch and loudness. STL-QPSR 1987, 28: 39-55.
- Lu Y, Cooke M: Speech production modifications produced by competing talkers, babble, and stationary noise. The J. Acoust. Soc. Am 2008, 124(5):3261-3275. 10.1121/1.2990705
- Pisoni D, Bernacki R, Nusbaum H, Yuchtman M: Some acoustic-phonetic correlates of speech produced in noise. In Proc. of ICASSP, Tampa, Florida, 26–29 March, Volume 10. Piscataway: IEEE; 1985:1581-1584.
- Kent RD, Read C: The Acoustic Analysis of Speech. San Diego: Whurr Publishers; 1992.
- Bond Z, Moore T: A note on loud and Lombard speech. Proc. of ICSLP ’90, Kobe, Japan, 18–22 November 1990.
- Bořil H: Robust speech recognition: analysis and equalization of Lombard effect in Czech Corpora, PhD thesis. Czech Republic: Czech Technical University in Prague; 2008. http://www.utdallas.edu/~hxb076000
- Junqua JC: The Lombard reflex and its role on human listeners and automatic speech recognizers. The J. Acoust. Soc. Am 1993, 93: 510-524. 10.1121/1.405631
- Wakita H: Normalization of vowels by vocal-tract length and its application to vowel identification. IEEE Trans. Acoust. Speech and Signal Processing 1977, 25(2):183-192. 10.1109/TASSP.1977.1162929
- Oppenheim A, Schafer R: Discrete-Time Signal Processing. Upper Saddle River, NJ: Prentice Hall; 1999.
- Liu C, Huang Q, Jiang S, Xing L, Ye Q, Gao W: A framework for flexible summarization of racquet sports video using multiple modalities. Comput. Vis. Image Underst 2009, 113: 415-424. 10.1016/j.cviu.2008.08.002
- Liu H, Zhang Wj, Cai J: A fast block-matching algorithm based on variable shape search. J. Zhejiang University - Science A 2006, 7. 10.1631/jzus.2006.A0194
- Truong B, Dorai C, Venkatesh S: New enhancements to cut, fade, and dissolve detection processes in video segmentation. In Proc. of the 8th ACM Inter. Conf. on Multimedia, Los Angeles, CA, October 30 – November 03, 2000. New York: ACM; 2000:227-227.
- Chu W, Wu J: Explicit semantic events detection and development of realistic applications for broadcasting baseball videos. Multimedia Tools and Appl 2008, 38: 27-50. 10.1007/s11042-007-0145-4
- Van Den Boomgaard R, Van Balen R: Methods for fast morphological image transforms using bitmapped binary images. Graphical Models and Image Process 1992, 54(3):252-258.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.