Low-order auditory Zernike moment: a novel approach for robust music identification in the compressed domain
© Li et al.; licensee Springer. 2013
Received: 11 March 2013
Accepted: 11 July 2013
Published: 1 August 2013
Audio identification via fingerprinting has been an active research field for years. However, most previously reported methods work on the raw audio format, despite the fact that compressed audio, especially MP3 music, has become the dominant way to store music on personal computers and to transmit it over the Internet. It would be valuable if an unknown compressed audio fragment could be recognized directly from the database without first decompressing it into wave format. So far, very few algorithms run directly in the compressed domain for music information retrieval, and most of them rely on the modified discrete cosine transform coefficients or derived cepstrum and energy types of features. As a first attempt, we propose in this paper utilizing the compressed domain auditory Zernike moment, adapted from image processing techniques, as the key feature of a novel robust audio identification algorithm. Owing to its statistically stable nature, such a fingerprint exhibits strong robustness against various audio signal distortions such as recompression, noise contamination, echo adding, equalization, band-pass filtering, pitch shifting, and slight time scale modification. Experimental results show that in a music database composed of 21,185 MP3 songs, a 10-s music segment is able to identify its original near-duplicate recording with an average top-5 hit rate of 90% or above, even under severe audio signal distortions.
As an emerging form of entertainment, online music services such as listening, downloading, identification, and searching have been among the hottest applications on the World Wide Web for several years. According to the statistical report in , online music ranks third among all network applications, and 75.2% of Internet users have used the above services.
Among various online applications, music identification based on the audio fingerprinting technique has attracted much attention from both the research community and the industry. By comparing the fingerprint of an unknown music segment, which is usually transmitted from mobile phones over the wireless telecom network or from personal computers over the Internet, with those previously calculated and stored in a fingerprint database, related metadata such as lyrics and the singer's name are returned. The fingerprint must characterize the nature of the music content so that pieces can be differentiated from one another, possess strong robustness to various kinds of severe audio signal degradations, and typically use only a several-second music fragment for identification in the database. To date, a number of algorithms with rather high retrieval performance have been published, most of which operate on the PCM wave format, and commercially deployed software systems have also appeared .
However, with the maturation of CD-quality audio compression at lower bit rates and the rapid growth of the Internet, compressed audio content is increasingly ubiquitous and has become the dominant form of storage and transmission in both music libraries and personal electronic equipment. It would therefore be interesting and practically meaningful if audio features were extracted directly from the compressed domain and used for music identification in the database.
So far, only a few algorithms that perform music information retrieval (MIR) directly in the compressed domain have been proposed. Liu and Tsai  calculated the compressed domain energy distribution from the output of the polyphase filters as a feature to index songs. For each song in the dataset, they used its refrain as the query example to retrieve all similar repeating phrases, obtaining an average 78% recall and 32% precision. They claim that, to their knowledge, this is the first compressed domain MIR algorithm. Lie and Su  directly used selected modified discrete cosine transform (MDCT) spectral coefficients and derived sub-band energy and its variation to represent the tonic characteristic of a short-term sound and to match between two audio segments. The retrieval probability reaches up to 76% among the top-5 matches. Tsai and Hung  described a query-by-example algorithm using 176 MP3 songs of the same singer as the database. They calculate spectrum energy from sub-band coefficients (SBC) to simulate the melody contour and use it to measure the similarity between the query example and the database items. By summing up the sub-band coefficients in every 12 frames (about one tone duration) to obtain tone energy lines, the melody contour is represented by a string sequence over two letters (U, D), where 'D' means the current tone energy is smaller than its preceding one, and 'U' means greater. With 40 frames assembled as a query example, the accuracy reaches 74% within the top-4 and 90% within the top-5. In Tsai and Wang's paper , scale factors (SCF) and sub-band coefficients in an MP3 bit stream frame are used as features to characterize and index the object. All SCF and SBC values are divided into 26 bins using a tree-structured quantizer; values in the same bin are accumulated to form a histogram as the final indexing pattern. Due to its statistical nature, this approach can tolerate a certain length variance between the query example and database items.
When the length variance falls in [0%, 10%), [10%, 15%), or [15%, 100%), the query item can be retrieved within the top-5, top-10, or top-15 results, respectively. Pye  designed a new parameterization, referred to as the MP3 cepstrum, based on a partial decompression of MPEG-1 Layer III audio to facilitate the management of a typical digital music library. It is approximately six times faster than the conventional Mel frequency cepstrum coefficient (MFCC) for music retrieval, while the average hit rate is only 59%. Jiao et al.  designed a robust compressed domain audio fingerprinting algorithm, taking the ratio between the sub-band energy and the full-band energy of a segment as the intra-segment feature and the difference between continuous intra-segment features as the inter-segment feature. Experiments show that such fingerprints are robust against transcoding, down sampling, echo adding, and equalization. However, the authors do not report any results on the retrieval hit rate. Zhou and Zhu  designed an MP3 compressed domain audio fingerprinting algorithm. By exploiting long-term time variation information based on modulation frequency analysis, it is reported to be especially robust against time scale modification (TSM), at the cost of higher computational complexity. However, in their experiments, the defined detection rate and accuracy rate differ from the top-n style measures of other algorithms; thus, it is difficult to judge whether it really outperforms other methods as stated. In , Liu and Chang calculated four kinds of compressed domain features, i.e., MDCT, MFCC, MPEG-7, and chroma vectors, from the compressed MP3 bit stream. PCA is applied to reduce the high-dimensional feature space, and a QUC-tree and inverted lists of MP3 signatures are constructed to perform more efficient search. However, the experiments are performed only on MP3 fragments, which is not enough to reflect song-level performance in real application environments.
The above methods achieve reasonable retrieval results, yet they either do not address, or fail to obtain convincing results on, the most central problem in audio fingerprinting, i.e., robustness. In practical application scenarios, for example, transmitting an unknown music clip through a cell phone and the wireless telecom network, the audio may often be contaminated by various distortions such as lossy compression, environmental noise, echo adding, time stretching, and pitch shifting. Moreover, previously used features principally follow the line of MDCT coefficients and their derived spectral energy. Can we then develop a new type of compressed domain feature to achieve high robustness in audio fingerprinting? It is well known that the Zernike moment has been widely used in many image-related research fields such as image recognition , image watermarking , human face recognition , and image analysis  due to its prominent properties of strong robustness and rotation, scale, and translation (RST) invariance. So far, various compressed domain audio features, including scale factors [15, 16], the MP3 window-switching pattern [17, 18], basic MDCT coefficients and derived spectral energy, energy variation, duration of energy peaks, amplitude envelope, spectrum centroid, spectrum spread, spectrum flux, roll-off, RMS, and rhythmic content like the beat histogram [19–24], have been used in different applications such as retrieval, segmentation, genre classification, speech/music discrimination, summarization, singer identification, watermarking, and beat tracking/tempo induction. However, in spite of its extensive use in various image-related research fields for years, to the authors' knowledge, the Zernike moment has not yet been applied to music information retrieval. This motivated our initial idea of developing compressed domain Zernike moments for the audio fingerprinting technique.
Two important properties of Zernike moment, i.e., strong robustness and translation invariance, are utilized to respectively resolve the problems of noise interference and desynchronization to some extent. Note that in the one-dimensional (1D) audio circumstance, properties of rotation and scale invariance are of no use.
In this paper, we first group 90 granules, the basic processing unit in decoding the MP3 bit stream, into a relatively big block for the statistical purpose, then calculate low-order Zernike moments from extracted MDCT coefficients located in selected low to middle sub-bands, and finally obtain the fingerprint sequence by modeling the relative relationship of Zernike moments between consecutive blocks. Experimental results show that this low-order Zernike moment-based audio feature achieves high robustness against common audio signal degradations like recompression, noise contamination, echo adding, equalization, band-pass filtering, pitch shifting, and slight TSM. A 10-s music fragment, which is possibly distorted, is able to retrieve its original recording with an average top-5 hit rate of 90% or beyond in our test dataset composed of 21,185 popular songs.
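The final step above, modeling the relative relationship of Zernike moments between consecutive blocks, can be sketched in code. The paper's exact modeling rule appears in Section 3.3; the sign-of-difference comparison below (in the style of Haitsma-Kalker fingerprints) is only an illustrative assumption, and the function name and sample values are hypothetical:

```python
# Hedged sketch: derive fingerprint bits from a per-block feature sequence
# by comparing consecutive blocks. The concrete rule used by the paper is
# defined in Section 3.3; sign-of-difference is an assumed stand-in.

def relative_bits(moments):
    """Emit 1 if the moment grows from one block to the next, else 0."""
    return [1 if b > a else 0 for a, b in zip(moments, moments[1:])]

bits = relative_bits([0.12, 0.15, 0.11, 0.11, 0.19])
print(bits)  # [1, 0, 0, 1]
```

A sequence of k block moments thus yields k − 1 fingerprint bits, which is why consecutive-block overlap directly controls the fingerprint's temporal resolution.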
The remainder of this paper is organized as follows. Section 2 introduces the basic principles of MPEG-1 Layer III, bit stream data format, the concept of Zernike moment, and its effectiveness as a robust audio compressed domain feature. Section 3 details the steps of deriving MDCT low-order Zernike moment-based audio fingerprint and the searching strategy. Experimental results on retrieval hit rate under various audio signal distortions are given in Section 4. Finally, Section 5 concludes this paper and points out some possible ways for future work.
2 Compressed domain auditory Zernike moment
2.1 Principles of MP3 compression and decoding
MDCT transform on a sub-band produces 18 frequency lines when a long window is used, or three groups of six frequency lines at different time intervals when three consecutive short windows are used. A 50% overlap between adjacent windows is adopted in both cases. Therefore, MDCT transform on a granule always produces 576 frequency lines, which are organized in different ways in the long-windowing and short-windowing cases.
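The line counts above can be checked arithmetically. The figure of 32 polyphase sub-bands per granule is a standard MPEG-1 Layer III constant not restated in the text:

```python
# Both windowing modes of an MP3 granule yield the same 576 frequency lines.
SUBBANDS = 32       # polyphase sub-bands per granule (MPEG-1 Layer III)
LONG_LINES = 18     # MDCT lines per sub-band with a long window
SHORT_WINDOWS = 3   # consecutive short windows
SHORT_LINES = 6     # MDCT lines per short window per sub-band

long_total = SUBBANDS * LONG_LINES
short_total = SUBBANDS * SHORT_WINDOWS * SHORT_LINES
print(long_total, short_total)  # 576 576
```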
2.2 A brief introduction of the Zernike moment
Zernike moment was originally designed as a powerful tool for image processing applications due to its robustness and RST invariant property. It has been demonstrated to outperform other image moments such as geometric moments, Legendre moments, and complex moments in terms of sensitivity to image noise, information redundancy, and capability for image representation .
2.3 Compressed domain auditory Zernike moment
The obstacle to directly applying the Zernike moment to audio is that audio is inherently time-variant 1D data, while Zernike moments are only applicable to 2D data. Therefore, we must map the audio signal to a 2D form before it is suitable for calculating the moment. In this paper, we construct a series of consecutive granule-MDCT 2D images to calculate the Zernike moment sequence directly in the MP3 compressed domain. In light of the frame format of the MP3 bit stream, one granule corresponds to about 13 ms, which means that it is indeed an alternative representation of time. On the other hand, MDCT coefficients can be roughly mapped to actual frequencies . Therefore, the granule-MDCT images we construct virtually live on the time-frequency plane. Human audition can be viewed in parallel with human vision if the sound is converted from a one-dimensional wave to a two-dimensional pattern distributed over time along a frequency axis; this two-dimensional pattern (frequency vs. time) constitutes a 2D auditory image . In this way, we may explore alternative approaches to audio identification by drawing on mature techniques from computer vision. Although the link between computer vision and music identification has been made in several published algorithms, which all take the short-time Fourier transform of time-domain audio slices to create spectrograms for the time-frequency representation using only the magnitude components [32–35], methods based on the visualization of compressed domain time-MDCT images have not yet been demonstrated for music identification. We argue that mature computer vision techniques such as the Zernike moment may in fact be useful for computational audition; the detailed calculation procedure of the proposed method is described in the next section.
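The granule-MDCT image construction can be sketched as a simple transpose: per-granule MDCT vectors become the columns (time axis) and frequency lines the rows. The function name is illustrative; 576 lines come from the MP3 frame structure, and the 90-granule block length is the choice made in Section 3.1:

```python
# Hedged sketch: arrange one block of per-granule MDCT vectors as a 2-D
# "auditory image" (rows: frequency lines, columns: granules, i.e., time).
N_LINES, N_GRANULES = 576, 90

def auditory_image(granule_mdct):
    """granule_mdct: N_GRANULES vectors of N_LINES MDCT values each."""
    assert len(granule_mdct) == N_GRANULES
    # transpose so that image[f][t] is frequency line f of granule t
    return [[granule_mdct[t][f] for t in range(N_GRANULES)]
            for f in range(N_LINES)]

img = auditory_image([[0.0] * N_LINES for _ in range(N_GRANULES)])
print(len(img), len(img[0]))  # 576 90
```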
3 Algorithm description
As described above, the main difficulty of applying the Zernike moment to audio is the dimension mismatch. We therefore first describe how to create a 2D auditory image from the 1D compressed domain MP3 bit stream. The detailed procedure of the proposed algorithm is as follows.
3.1 MDCT-granule auditory image construction
3.1.1 Y-axis construction: frequency alignment
where s(i, n) is the original MDCT coefficient at the i-th granule and n-th frequency line for the long-window case; s_m(i, j) is the original MDCT coefficient at the i-th granule, j-th frequency line, and m-th window for the short-window case; and sn_l(i, j) and sn_s(i, j) are the new MDCT values at the i-th granule and j-th frequency line for the long- and short-window cases, respectively.
3.1.2 X-axis construction: granule grouping
After the above Y-direction operations, we proceed to set up the X-axis to form the final auditory images. In the proposed method, N continuous granules (N = 90 in the experiment) are partitioned into a block and act as the X-axis of one auditory image. The block slides forward with M granules (M = 1 in the experiment) as the hop size to form the X-axis of the following images.
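The sliding-block partition can be sketched as follows. The granule count for a 30-s song is only a rough estimate from the ~13 ms granule duration, not a figure stated in the paper:

```python
def blocks(num_granules, n=90, m=1):
    """Start indices of N-granule blocks advanced by a hop size of M granules."""
    return list(range(0, num_granules - n + 1, m))

# A 30-s MP3 has roughly 30 s / 13 ms ≈ 2,307 granules (estimate);
# with N = 90 and M = 1 this yields num_granules - N + 1 blocks.
print(len(blocks(2307)))  # 2218
```

With M = 1, consecutive blocks overlap by 89 of 90 granules (about 98%), which is the large overlap the paper later relies on to resist desynchronization.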
3.1.3 Auditory image construction
Map between MDCT coefficients and actual frequencies for long and short windows sampled at 44.1 kHz

  Long window                                  Short window
  Index of MDCT coefficient  Frequency (Hz)    Index of MDCT coefficient  Frequency (Hz)
  0 to 11                    0 to 459          0 to 3                     0 to 459
  12 to 23                   460 to 918        4 to 7                     460 to 918
  24 to 35                   919 to 1,337      8 to 11                    919 to 1,337
  36 to 89                   1,338 to 3,404    12 to 29                   1,338 to 3,404
  90 to 195                  3,405 to 7,462    30 to 65                   3,405 to 7,462
  196 to 575                 7,463 to 22,050   66 to 191                  7,463 to 22,050
where k denotes the k-th auditory image, and N_block is the total number of blocks of the query clip or the original music piece, which is variable and determined by the audio length.
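The index-to-frequency mapping in the table above can be expressed as a small lookup, here for the long-window column and assuming the listed coefficient ranges are inclusive:

```python
# Long-window rows of the mapping table (44.1 kHz sampling):
# (last inclusive MDCT index of the band, upper frequency of the band in Hz)
LONG_WINDOW_BANDS = [
    (11, 459), (23, 918), (35, 1337), (89, 3404), (195, 7462), (575, 22050),
]

def long_index_to_freq_ceiling(idx):
    """Upper frequency (Hz) of the band containing MDCT line idx (long window)."""
    for last_idx, freq_hz in LONG_WINDOW_BANDS:
        if idx <= last_idx:
            return freq_hz
    raise ValueError("MDCT index out of range (0..575)")

print(long_index_to_freq_ceiling(100))  # 7462
```

Such a lookup makes it easy to restrict moment calculation to the low-to-middle sub-bands selected in Section 3.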
3.2 Compressed domain audio features: MDCT Zernike moments
Fragment input and robustness are known to be two crucial constraints on audio fingerprinting schemes. Modeled in terms of audio signal operations, this amounts to imposing random cropping plus other types of audio signal processing on the input query example. Random cropping causes serious desynchronization between the input fingerprint sequence and the stored ones, posing a great threat to the retrieval hit rate. Usually, there are two effective mechanisms to resist time-domain misalignment: one is invariant features, and the other is implicit synchronization, which may be more powerful than the former . However, in the MPEG compressed domain, due to the compressed bit stream data and fixed frame structure, it is almost impossible to extract meaningful salient points to serve as anchors as in the uncompressed domain . Therefore, designing a statistically stable audio feature becomes the main way to fulfill the task of fragment retrieval and to resist time-domain desynchronization in audio fingerprinting.
where n is the moment order, and m must be subject to the condition that (n − |m|) is non-negative and even.
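The order/repetition constraint above is the standard admissibility condition for Zernike moments: |m| ≤ n with (n − |m|) non-negative and even. Enumerating the valid (n, m) pairs for low orders makes the condition concrete:

```python
# Enumerate valid (order n, repetition m) pairs of Zernike moments:
# |m| <= n and (n - |m|) must be even.
def valid_pairs(max_order):
    return [(n, m) for n in range(max_order + 1)
            for m in range(-n, n + 1) if (n - abs(m)) % 2 == 0]

print(valid_pairs(2))  # [(0, 0), (1, -1), (1, 1), (2, -2), (2, 0), (2, 2)]
```

Low orders are attractive for fingerprinting because they capture coarse, statistically stable structure of the auditory image rather than fine detail that distortions easily perturb.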
3.2.1 Effect of moment orders
3.3 Fingerprint modeling
3.4 Fingerprint matching
The total number of comparisons within the database is (N − n + 1) × N_song.
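The comparison count and the bit error rate (BER) used for matching can be sketched as follows. Here N is the number of fingerprint positions per song and n the number in the query; the concrete values below (672 positions per song, 224 per query) are illustrative only, the first taken from the song fingerprint length reported in Section 4 and the second an assumed 10-s query:

```python
def total_comparisons(db_positions, query_positions, num_songs):
    """(N - n + 1) sliding alignments per song, summed over N_song songs."""
    return (db_positions - query_positions + 1) * num_songs

def ber(a, b):
    """Bit error rate between two equal-length binary fingerprint strings."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

print(total_comparisons(672, 224, 21185))  # 9512065
print(ber("1010", "1000"))                 # 0.25
```

A query is declared a match at a given alignment when its BER against the stored fingerprint falls below the threshold tuned in Section 4.1.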
The experiments include a training stage and a testing stage. In the training stage, three parameters (i.e., hop size, block size, and BER threshold) that affect the algorithm's performance are experimentally tuned to obtain the best retrieval results. To this end, a small training music database composed of 100 distinct MP3 songs is set up. In the testing stage, the algorithm with the parameters obtained from training is tested on a large dataset composed of 21,185 different MP3 songs to thoroughly evaluate the retrieval performance and robustness. All songs in the two databases are mono, 30 s long, originally sampled at 44.1 kHz, and compressed to 64 kbps, each with a fingerprint sequence of 672 bits. In both stages, audio queries are prepared as follows. For each song in the training (testing) database, a 10-s query segment is randomly cut and distorted by 13 kinds of common audio signal manipulations to model the real-world environment; hence, 1,400 (296,590) query segments (including the original segments) are obtained, respectively.
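The query-set sizes follow directly from the setup: each song contributes one clean segment plus 13 distorted versions, i.e., 14 queries per song.

```python
# Query counts for the training (100 songs) and testing (21,185 songs) stages.
DISTORTIONS = 13

def queries(num_songs):
    """One original segment plus 13 distorted copies per song."""
    return num_songs * (DISTORTIONS + 1)

print(queries(100), queries(21185))  # 1400 296590
```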
4.1 Parameter tuning
First, we describe the parameter tuning procedure. Note that when the combination of parameters varies, the associated fingerprint database is named according to the following rule: FPDB_<hop-size>_<block-size>_<order-number>.
4.1.1 Effect of hop size
4.1.2 Effect of block size
4.1.3 BER thresholding
4.2 Retrieval results under distortions
It can be seen that the proposed MDCT Zernike moment-based fingerprint shows satisfying identification results, even under severe audio signal processing such as heavy lossy recompression, volume modulation, echo adding, noise interference, and various frequency warpings such as band-pass filtering, equalization, and pitch shifting (±10%). To be more specific, when the queries are original or distorted only by echo adding, band-pass filtering, or volume modulation, the top-5 hit rates (green bars) are almost unaffected and all come close to 100%. Under other more severe signal manipulations such as equalization, pitch shifting, noise addition, and MP3 compression, the top-5 hit rates remain good, still above 90%. The only deficiency is that under pitch-preserving time scale modifications, which can be modeled as a kind of cropping/pasting of relatively smooth local parts between music edges , the identification results drop quickly as the scaling factor increases and become unacceptable when ±3% time scale modifications are performed.
This weakness is essentially caused by the fixed data structure of the MP3 compressed bit stream. In this case, implicit synchronization methods based on salient local regions cannot be applied. The only way to resist serious time-domain desynchronization is to increase the overlap between consecutive blocks and to design more stable fingerprints; however, the overlap has an upper limit of 100% (98% is already used in this method), and discovering more powerful features is not easy.
At present, it is difficult to compare quantitatively with other algorithms because different datasets or even different evaluation measures are used. It is also unrealistic to precisely reimplement these methods due to the lack of adequate details. In fact, one important task of our work was to collect enough songs (21,185 in the experiment) and queries (296,590 distorted queries) under various distortions (13 kinds of audio signal distortions) so that the experimental results are more convincing. Since this dataset is much larger and more comprehensive than those used in the cited references, the source code will be published to the research community and serve as a baseline system.
4.3 False analysis
False statistics of identification results
In this paper, a novel music identification algorithm is proposed, which works directly on the MP3-encoded bit stream by constructing MDCT-granule auditory images and then calculating the auditory Zernike moments. By virtue of the short-time stationary characteristics of this feature and the large overlap, 10-s query excerpts are shown to achieve promising retrieval hit rates from a large-scale database containing intact MP3 songs and copies distorted by various audio signal operations, including the challenging pitch shifting and time scale modification. For future work, combining the MDCT Zernike moments with other powerful compressed domain features through information fusion will be our main approach to improving retrieval performance and robustness against large time-domain misalignment and stretching. Cover song identification performed directly in the compressed domain is our ultimate goal.
WL received his PhD degree in Computer Science from Fudan University, Shanghai, China in 2004. He is now a professor in the School of Computer Science and Technology, Fudan University, leading the multimedia security and audio information processing laboratory. He has published 40 refereed papers so far in leading international journals and key conferences, such as IEEE Transactions on Audio, Speech, and Language Processing, IEEE Transactions on Multimedia, Computer Music Journal, IWDW, ACM SIGIR, and ACM Multimedia. He is a reviewer for international journals such as IEEE Transactions on Signal Processing, IEEE Transactions on Multimedia, IEEE Transactions on Audio, Speech & Language Processing, IEEE Transactions on Information Forensics and Security, and Signal Processing, and for conferences such as ICME, ACM MM, and IEEE GLOBECOM.
This work is supported by NSFC (61171128), 973 Program (2010CB327900), and 985 Project (EZH2301600/026).
- China Internet Network Information Center: The 29th statistical report on internet development in China, 2012 (CNNIC, 2013). http://www1.cnnic.cn/IDR/ReportDownloads/. Accessed January 2012
- Cano P, Batlle E, Kalker T, Haitsma J: A review of audio fingerprinting. J. VLSI Signal Process. 2005, 41(3):271-284. 10.1007/s11265-005-4151-3
- Liu CC, Tsai PJ: Content-based retrieval of MP3 music objects. Paper presented at the ACM international conference on information and knowledge management (CIKM 2001). Atlanta, Georgia, USA; 2001.
- Lie WN, Su CK: Content-based retrieval of MP3 songs based on query by singing. Paper presented at the IEEE international conference on acoustics, speech, and signal processing (ICASSP 2004). Montreal, Quebec, Canada; 2004.
- Tsai TH, Hung JH: Content-based retrieval of MP3 songs for one singer using quantization tree indexing and melody-line tracking method. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2006). IEEE, New York; 2006:505-508.
- Tsai TH, Wang YT: Content-based retrieval of audio example on MP3 compression domain. Paper presented at the IEEE workshop on multimedia signal processing (MMSP 2004). Siena, Italy; 2004.
- Pye D: Content-based methods for the management of digital music. Paper presented at the IEEE international conference on acoustics, speech and signal processing (ICASSP 2000). Istanbul, Turkey; 2000.
- Jiao YH, Yang B, Li MY, Niu XM: MDCT-based perceptual hashing for compressed audio content identification. Paper presented at the IEEE workshop on multimedia signal processing (MMSP 2007). Chania, Crete, Greece; 2007.
- Zhou R, Zhu Y: A robust audio fingerprinting algorithm in MP3 compressed domain. WASET Journal 2011, 55:715-719.
- Liu CC, Chang PF: An efficient audio fingerprint design for MP3 music. Paper presented at the ACM international conference on advances in mobile computing and multimedia (MoMM 2011). Ho Chi Minh City, Vietnam; 2011.
- Khotanzad A, Hong YH: Invariant image recognition by Zernike moments. IEEE Trans. Pattern Anal. Mach. Intell. 1990, 12(5):489-497. 10.1109/34.55109
- Kim HS, Lee HK: Invariant image watermark using Zernike moments. IEEE Trans. Circuits Syst. Video Technol. 2003, 13(8):766-775. 10.1109/TCSVT.2003.815955
- Haddadnia J, Ahmadi M, Faez K: An efficient feature extraction method with pseudo-Zernike moment in RBF neural network-based human face recognition system. EURASIP J. Adv. Signal Process. 2003, 9:890-901.
- Kamila NK, Mahapatra S, Nanda S: RETRACTED: Invariance image analysis using modified Zernike moments. Pattern Recogn. Lett. 2005, 26(6):747-753. 10.1016/j.patrec.2004.09.026
- Jarina R, O'Connor N, Marlow S, Murphy N: Rhythm detection for speech-music discrimination in compressed-domain. Paper presented at the IEEE international conference on digital signal processing (DSP 2002). Pine Mountain, Georgia, USA; 2002.
- Takagi K, Sakazawa S: Light weight MP3 watermarking method for mobile terminals. Paper presented at the ACM international conference on multimedia (ACM Multimedia 2005). Hilton, Singapore; 2005.
- Wang Y, Vilermo M: A compressed domain beat detector using MP3 audio bit streams. Paper presented at the ACM international conference on multimedia (ACM Multimedia 2001). Ottawa, Ontario, Canada; 2001.
- D'Aguanno A, Vercellesi G: Tempo induction algorithm in MP3 compressed-domain. Paper presented at the ACM international conference on multimedia information retrieval (ACM MIR 2007). University of Augsburg, Germany; 2007.
- Tzanetakis G, Cook P: Sound analysis using MPEG compressed audio. Paper presented at the IEEE international conference on acoustics, speech, and signal processing (ICASSP 2000). Istanbul, Turkey; 2000.
- Liu CC, Huang CS: A singer identification technique for content-based classification of MP3 music objects. Paper presented at the ACM international conference on information and knowledge management (CIKM 2002). McLean, Virginia, USA; 2002.
- Liu CC, Yao PC: Automatic summarization of MP3 music objects. Paper presented at the IEEE international conference on acoustics, speech, and signal processing (ICASSP 2004). Montreal, Quebec, Canada; 2004.
- Shao X, Xu CS, Wang Y, Kankanhalli M: Automatic music summarization in compressed-domain. Paper presented at the IEEE international conference on acoustics, speech, and signal processing (ICASSP 2004). Montreal, Quebec, Canada; 2004.
- Jarina R, O'Connor N, Murphy N, Marlow S: An experiment in audio classification from compressed data. Paper presented at the international workshop on systems, signals and image processing (IWSSIP 2004). Poznan, Poland; 2004.
- Rizzi A, Buccino NM, Panella M, Uncini A: Genre classification of compressed audio data. Paper presented at the IEEE workshop on multimedia signal processing (MMSP 2008). Cairns, Queensland, Australia; 2008.
- Painter T, Spanias A: Perceptual coding of digital audio. Proc. IEEE 2000, 88(4):451-513.
- Salomonsen K, Søgaard S, Larsen EP: Design and implementation of an MPEG/audio layer III bit stream processor. Thesis, Department of Communication Technology, Aalborg University, Denmark; 1997.
- Pfeiffer S, Vincent T: Formalisation of MPEG-1 compressed-domain audio features. Technical report 01/196, CSIRO Mathematical and Information Sciences, Australia; 2001.
- Lagerström K: Design and implementation of an MPEG-1 layer III audio decoder. Thesis, Chalmers University of Technology; 2001.
- Prokop RJ, Reeves AP: A survey of moment-based techniques for unoccluded object representation and recognition. CVGIP: Graphical Models and Image Processing 1992, 54(5):438-460. 10.1016/1049-9652(92)90027-U
- Chang TY: Research and implementation of MP3 encoding algorithm. Thesis, National Chiao Tung University, Hsinchu, Taiwan, ROC; 2002.
- Rifkin R, Bouvrie J, Schutte K, Chikkerur S, Kouh M, Ezzat T, Poggio T: Phonetic classification using hierarchical, feed-forward, spectro-temporal patch-based architectures. MIT Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2007-019; 2007.
- Ke Y, Hoiem D, Sukthankar R: Computer vision for music identification. Paper presented at the IEEE computer society conference on computer vision and pattern recognition (CVPR 2005). San Diego, CA, USA; 2005.
- Sukthankar R, Ke Y, Hoiem D: Semantic learning for audio applications: a computer vision approach. Paper presented at the IEEE computer society conference on computer vision and pattern recognition workshop (CVPRW 2006); 2006.
- Baluja S, Covell M: Audio fingerprinting: combining computer vision & data stream processing. Paper presented at the IEEE international conference on acoustics, speech and signal processing (ICASSP 2007). Honolulu, Hawaii, USA; 2007.
- Baluja S, Covell M: Waveprint: efficient wavelet-based audio fingerprinting. Pattern Recogn. 2008, 41(11):3467-3480. 10.1016/j.patcog.2008.05.006
- Cox I, Miller M, Bloom J, Fridrich J, Kalker T: Digital Watermarking and Steganography. 2nd edition. Morgan Kaufmann, Burlington, MA; 2007.
- Li W, Xue XY, Lu PZ: Localized audio watermarking technique robust against time-scale modification. IEEE Trans. Multimedia 2006, 8(1):60-69.
- Haitsma J, Kalker T: A highly robust audio fingerprinting system. Paper presented at the international conference on music information retrieval (ISMIR 2002). Paris, France; 2002.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.