A Content-Adaptive Analysis and Representation Framework for Audio Event Discovery from "Unscripted" Multimedia

Radhakrishnan, Regunathan; Divakaran, Ajay; Xiong, Ziyou; Otsuka, Isao

doi:10.1155/ASP/2006/89013

Research Article
Open access
Published: 01 December 2006

A Content-Adaptive Analysis and Representation Framework for Audio Event Discovery from "Unscripted" Multimedia

Regunathan Radhakrishnan¹,
Ajay Divakaran¹,
Ziyou Xiong¹ &
Isao Otsuka²

EURASIP Journal on Advances in Signal Processing volume 2006, Article number: 089013 (2006)

Abstract

We propose a content-adaptive analysis and representation framework to discover events using audio features from "unscripted" multimedia such as sports and surveillance for summarization. The proposed analysis framework performs an inlier/outlier-based temporal segmentation of the content. It is motivated by the observation that "interesting" events in unscripted multimedia occur sparsely in a background of usual or "uninteresting" events. We treat the sequence of low/mid-level features extracted from the audio as a time series and identify subsequences that are outliers. The outlier detection is based on eigenvector analysis of the affinity matrix constructed from statistical models estimated from the subsequences of the time series. We define the confidence measure on each of the detected outliers as the probability that it is an outlier. Then, we establish a relationship between the parameters of the proposed framework and the confidence measure. Furthermore, we use the confidence measure to rank the detected outliers in terms of their departures from the background process. Our experimental results with sequences of low- and mid-level audio features extracted from sports video show that "highlight" events can be extracted effectively as outliers from a background process using the proposed framework. We proceed to show the effectiveness of the proposed framework in bringing out suspicious events from surveillance videos without any a priori knowledge. We show that such temporal segmentation into background and outliers, along with the ranking based on the departure from the background, can be used to generate content summaries of any desired length. Finally, we also show that the proposed framework can be used to systematically select "key audio classes" that are indicative of events of interest in the chosen domain.
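
The inlier/outlier segmentation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the statistical models, the distance between them, or how the eigenvector is thresholded, so every such choice here is an assumption made for illustration. The sketch fits a diagonal Gaussian to each window, uses the symmetric Kullback-Leibler divergence as the model distance, sets the kernel width to the mean pairwise divergence, takes the second eigenvector of the normalized graph Laplacian (a normalized-cut style relaxation), and splits it at its largest gap. The names `window_models`, `sym_kl`, and `detect_outliers` are hypothetical.

```python
import numpy as np

def window_models(x, win, hop):
    """Fit a diagonal Gaussian (mean, variance) to each window of the series."""
    starts = range(0, len(x) - win + 1, hop)
    segs = np.stack([x[s:s + win] for s in starts])    # (n_windows, win, dim)
    return segs.mean(axis=1), segs.var(axis=1) + 1e-6  # small variance floor

def sym_kl(mu, var):
    """Pairwise symmetric KL divergence between diagonal Gaussians."""
    m1, m2 = mu[:, None, :], mu[None, :, :]
    v1, v2 = var[:, None, :], var[None, :, :]
    d2 = (m1 - m2) ** 2
    kl12 = 0.5 * (np.log(v2 / v1) + (v1 + d2) / v2 - 1.0)
    kl21 = 0.5 * (np.log(v1 / v2) + (v2 + d2) / v1 - 1.0)
    return (kl12 + kl21).sum(axis=-1)

def detect_outliers(x, win=50, hop=50):
    """Return one boolean per window; True marks a window judged an outlier."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    mu, var = window_models(x, win, hop)
    d = sym_kl(mu, var)
    n = len(d)
    # Assumed kernel width: mean off-diagonal divergence (diagonal is zero).
    sigma = d.sum() / (n * (n - 1))
    a = np.exp(-d / (2.0 * sigma))                     # affinity matrix
    deg = a.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalized Laplacian I - D^{-1/2} A D^{-1/2}.
    lap = np.eye(n) - d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)                      # ascending eigenvalues
    y = d_inv_sqrt * vecs[:, 1]                        # second generalized eigenvector
    # Assumed split rule: threshold at the largest gap in the sorted values.
    order = np.sort(y)
    k = int(np.argmax(np.diff(order)))
    labels = y > 0.5 * (order[k] + order[k + 1])
    # Treat the smaller group as the outliers, the larger as background.
    if labels.sum() > n / 2:
        labels = ~labels
    return labels

# Synthetic check: Gaussian background with one short, shifted burst.
rng = np.random.default_rng(0)
series = rng.normal(0.0, 1.0, size=(1000, 2))
series[400:500] += 8.0
outlier_windows = detect_outliers(series, win=50, hop=50)
```

With non-overlapping 50-sample windows, the burst covers the windows starting at samples 400 and 450. Treating the smaller group as the outliers follows the abstract's observation that interesting events occur sparsely against the background; ranking the detected outliers by a confidence measure, as the abstract describes, is not attempted here.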

Author information

Authors and Affiliations

Mitsubishi Electric Research Laboratory, Cambridge, MA, 02139, USA
Regunathan Radhakrishnan, Ajay Divakaran & Ziyou Xiong
Advanced Technology R&D Center, Mitsubishi Electric Corporation, Hyogo, Kyoto, 661-8661, Japan
Isao Otsuka

Corresponding author

Correspondence to Regunathan Radhakrishnan.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Radhakrishnan, R., Divakaran, A., Xiong, Z. et al. A Content-Adaptive Analysis and Representation Framework for Audio Event Discovery from "Unscripted" Multimedia. EURASIP J. Adv. Signal Process. 2006, 089013 (2006). https://doi.org/10.1155/ASP/2006/89013

Received: 01 September 2004
Revised: 21 April 2005
Accepted: 04 May 2005
Published: 01 December 2006
DOI: https://doi.org/10.1155/ASP/2006/89013