Robust feature representation for classification of bird song syllables
- Maria Sandsten^{1},
- Mareile Große Ruse^{2} and
- Martin Jönsson^{2}
https://doi.org/10.1186/s13634-016-0365-8
© The Author(s) 2016
Received: 14 August 2015
Accepted: 17 May 2016
Published: 31 May 2016
Abstract
A novel feature set for low-dimensional signal representation, designed for classification or clustering of non-stationary signals with complex variation in time and frequency, is presented. The feature representation of a signal is given by the first left and right singular vectors of its ambiguity spectrum matrix. If the ambiguity matrix is of low rank, most signal information in time direction is captured by the first right singular vector while the signal’s key frequency information is encoded by the first left singular vector. The resemblance of two signals is investigated by means of a suitable similarity assessment of the signals’ respective singular vector pair. Application of multitapers for the calculation of the ambiguity spectrum gives an increased robustness to jitter and background noise and a consequent improvement in performance, as compared to estimation based on the ordinary single Hanning window spectrogram. The suggested feature-based signal compression is applied to a syllable-based analysis of a song from the bird species Great Reed Warbler and evaluated by comparison to manual auditive and/or visual signal classification. The results show that the proposed approach outperforms well-known approaches based on mel-frequency cepstral coefficients and spectrogram cross-correlation.
Keywords
Time-frequency analysis, Ambiguity spectrum, Singular value decomposition, Multitaper, Bird song
1 Introduction
In biology, bird song analysis has been a large field for several decades, and for many years, methods based on spectrograms (sonograms) have been considered well-suited for the comparison of bird sounds. Generally, song analysis tools are especially challenged when recordings have been conducted on birds under natural outdoor conditions. In these environments, disturbing background noises, such as wind and interference from other birds, are typically present and often distort the recorded signal substantially. The extent to which such background noise effectively impairs the analysis depends on the type/structure of the underlying signal and on the particular research question. Essentially two principal topics have been considered in the literature in the context of classification or clustering of bird song units (e.g., syllables). The hitherto most common research aims at the song-based identification of bird species. Characteristic patterns of songs from different bird species are often sufficiently distinct, so that rather straightforward features such as time and frequency moments, time duration, and frequency bandwidth often yield satisfactory results. Somewhat more sophisticated is song analysis by means of pairwise cross-correlation of spectrograms (SPCC), [1, 2] or dynamic time warping (DTW), [3–5]. Besides, methods that have become popular in speech analysis, such as approaches based on pitch frequency or mel-frequency cepstrum coefficients (MFCC), have been successfully applied to bird species classification, [6]. More recently, bird sound analysis of especially noisy signals has been approached using wavelets, [7].
The other main question guiding bird-song research is the within-species classification and clustering. This task often constitutes a much more involved problem, especially when the songs of the species under consideration have a complex structure. Such problems thus require sufficiently sophisticated methods that are able to not only capture subtle characteristic details within a song, but also to compare them with each other. More simplistic methods, which may be well-tailored for species identification, smooth out the differences that should be detected and will fail in the within-species analysis. The Great Reed Warbler (GRW) is one example of a species with songs of pronounced complexity. However, due to the lack of sufficiently sophisticated methods, song analysis for the GRW has so far mainly been conducted manually, by listening and visually studying the syllable sonograms, [5, 8, 9].
Bird song analysis is only one of several applications of time-frequency (TF) analysis of non-stationary signals, and a significant number of approaches in this important field have been suggested. Since its introduction to bird song analysis in the 1950s, the sonogram (or TF spectrum) has become one of the most established tools in the context of bird song investigation, and computationally efficient and robust algorithms for spectrogram-based TF analysis can be found using, e.g., multitapers (MTs). Originally, MTs were introduced by Thomson, [10], who proposed the discrete Prolate Spheroidal Sequences for estimation of a low-variance spectrum with pre-specified resolution. Nowadays, the noise-robust Thomson MTs are well-established in the context of stationary spectrum analysis and have found various application areas. Recently, the Hermite MTs, [11], have gained popularity, especially in applications with particular interest in estimation of the TF spectrum of non-stationary signals, [12–16]. As the MT spectrogram is known to reduce variation in amplitude and to limit resolution in time and frequency, it is tailored for the analysis of multi-component signals with jitter, or variance, in both location of the components and their amplitudes, [17]. In contrast to TF distributions that aim at optimal resolution of signal components and cross-term suppression, [18], MT spectrograms are more suitable for the type of data considered in this paper, as MTs are expected to smooth out small differences in time and frequency locations and therefore lower the in-class variance.
Extracting features from the TF spectrum for classification or clustering is a non-trivial task, and several approaches have been proposed in literature. In different application areas, there has recently been an increased interest in decomposition methods, such as approaches related to principal component analysis (PCA), independent component analysis (ICA), singular value decomposition (SVD), and non-negative matrix factorization (NMF). In [18], the authors achieve increased noise-insensitivity by combining image processing techniques with wavelets and SVD. Barry et al., [19] improved classification performance of event-related signals by the application of PCA to TF spectra of the electroencephalogram data. The SVD of TF spectra was also used to classify multi-component bird song syllables in [17], and multi-component frequency modulated (FM) signals in [20]. Approaches employing these techniques are promising for two key reasons. On the one hand, they create a decoupling of the time and frequency domains and therefore facilitate the separate inspection of corresponding features, and on the other hand, they achieve a noise reduction as noise is spread across the collection of all singular vectors and the signal part usually has low rank [19, 20]. The NMF, [21], where a matrix with positive values is decomposed into positive basis functions, has been applied for classification of TF spectrum of audio signals [22, 23]. In [24], it is shown that the NMF decomposition method of a TF spectrum is superior to PCA and ICA for classification of audio data. However, the NMF method is computationally more demanding than, e.g., the SVD technique, as it requires an iterative solution. Different algorithms for better convergence have been proposed, and recently, an approach using the SVD basis functions as initialization of the NMF algorithm was suggested, [25].
A less intuitive, but for certain topics—particularly in bird song analysis—highly suitable tool for signal representation and the ground for feature extraction is the so-called ambiguity spectrum (AS) [26]. A characteristic for this two-dimensional Fourier transform of the TF spectrum is its invariance to time and frequency shifts of the signal. More specifically, the absolute values of the AS of a signal and its time and frequency shifted version are identical. Therefore, signal analysis based on the AS focuses on differences between time and frequency components rather than on their actual location in the TF plane. Moreover, for many applications—as also for representation of syllables from bird songs—the AS will be a matrix of low rank, and can hence be well-approximated by only a few (e.g., the first pair) of its singular vectors. The AS and its first pair of singular vectors will constitute the essence of our methodology.
As the MT spectrogram is robust to jitters in the amplitudes and locations of a signal’s components [17], selecting the first two singular vectors of the MT spectrogram is more intuitive than the AS-based representation. However, as shown in [17], using singular vectors of the spectrogram for classification of multi-component signals requires several singular vector pairs and more advanced algorithms to combine the singular vectors in an appropriate way. To ensure better comparability of ambiguity- and TF-domain analysis, we, however, base our investigations in both cases on only the first pair of singular vectors.
The main contribution of this paper is the introduction of a feature set based on SVD of the AS on the basis of which, e.g., classification and clustering tasks of non-stationary signals can be performed. The latter may be conducted in terms of a similarity measure. The method has recently been applied to clustering of a whole song of the GRW [27], whereas in this work, a collection of possible similarity measures is presented and their respective performances are evaluated and compared. Additionally, the optimal parameters of the TF methods are found, and the robustness of the methods to additional noise disturbances is investigated.
The suggested algorithm consists of four steps. The detection step aims at detecting and subdividing a birdsong strophe into individual syllables (∼ 50–300 ms). In the second step, the syllable-specific ambiguity spectra are estimated, and the corresponding SVDs are calculated in the third step. Each syllable will then be represented by the first two singular vectors of its AS. As the ambiguity matrix for these kinds of signals is typically of low rank, this representation captures the signal’s key information, both in time and frequency direction. In the fourth step, the alikeness of two syllables, represented by their respective pair of singular vectors, is assessed by means of a collection of candidate similarity measures, which are evaluated and compared to each other.
The remainder of the article is structured as follows. In the subsequent section, we give a short treatment of the TF representation of a signal along with the quadratic class of smoothed spectra. The ambiguity spectrum, which will play an important role in our methodology, is introduced and its utilization is motivated. We give a first application of some spectral methods on a bird-song signal, the latter being the main object for our analysis. In section 3, we introduce our feature set which is based on the SVD of a signal’s AS and provide a few examples. Next, we present two raw similarity measures in section 4 and use them to derive three combined measures, all of which subsequently are to be assessed and compared. In section 5, we describe a two-step method for detection of syllables from a bird song strophe. The data used for our examples is described in section 6, and a baseline truth for the classification is defined. In section 7, we evaluate the suggested similarity measures as well as different approaches for estimation of the AS. Moreover, this section provides a comparison of the proposed approach to other well-known methods. Section 8 contains a major application of our methodology to a larger set of syllables in a more complex classification study. Finally, section 9 concludes.
2 Time-frequency analysis and multitapers
where each \(S_{x}^{(k)}\) is a spectrogram with window function \(q_{k}\). Thus, the corresponding filtered AS can be calculated as \(A_{x}^{Q}(\nu,\tau)=\mathcal{F}_{t \rightarrow \nu}\mathcal{F}_{f \rightarrow \tau} S_{x}(t,f)\).
As Hermite functions are more localized in the TF plane than Thomson MTs [11, 30], they are the method of choice in this paper, i.e., we choose \(q_{k}(t)=h_{k}(t)\), with the corresponding weights \(\lambda_{k}=1\), \(k=1,\ldots,K\).
The main merit of MT spectrograms is their reduced variance as compared to a single-window spectrogram. The variance of the latter is roughly of order \(V[S_{x}(t,f)]\approx S_{x}(t,f)^{2}\). Multitapering using K tapers, however, can lead to a substantial variance reduction. The reason for such improvement in terms of robustness is that the spectrograms from different tapers are uncorrelated, provided the signal satisfies certain properties. Therefore, their average reduces the variance by up to a factor of K, i.e., \(V[S_{x}(t,f)]\approx \frac{1}{K}S_{x}(t,f)^{2}\).
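The variance argument can be checked numerically. The following is a minimal sketch, assuming white Gaussian noise as input and equally weighted averaging of the K spectrograms; the function names, the window length of 128 samples, the scale of 8 samples, and the hop size are illustrative choices and are not taken from the paper.

```python
import numpy as np


def hermite_windows(K, length, scale):
    """First K Hermite functions sampled on `length` points, each with unit energy."""
    t = (np.arange(length) - (length - 1) / 2.0) / scale
    h = np.zeros((K, length))
    h[0] = np.pi ** -0.25 * np.exp(-t ** 2 / 2.0)
    if K > 1:
        h[1] = np.sqrt(2.0) * t * h[0]
    # Three-term recurrence for the orthonormal Hermite functions.
    for k in range(1, K - 1):
        h[k + 1] = (np.sqrt(2.0 / (k + 1)) * t * h[k]
                    - np.sqrt(k / (k + 1.0)) * h[k - 1])
    return h / np.linalg.norm(h, axis=1, keepdims=True)


def multitaper_spectrogram(x, windows, hop):
    """Average of the K single-window spectrograms (rows: frequency, columns: time)."""
    K, length = windows.shape
    starts = np.arange(0, len(x) - length + 1, hop)
    frames = np.stack([x[s:s + length] for s in starts])
    S = np.zeros((length // 2 + 1, len(starts)))
    for w in windows:
        S += np.abs(np.fft.rfft(frames * w, axis=1)).T ** 2
    return S / K


def relative_variance(x, K, length=128, scale=8.0, hop=64):
    """Empirical V[S] / S^2 for the input x, with the DC and Nyquist rows excluded."""
    S = multitaper_spectrogram(x, hermite_windows(K, length, scale), hop)[1:-1]
    return S.var() / S.mean() ** 2


rng = np.random.default_rng(0)
noise = rng.standard_normal(20000)  # assumption: white noise as a test signal
rv1 = relative_variance(noise, K=1)
rv8 = relative_variance(noise, K=8)
print(f"relative variance, K=1: {rv1:.3f}; K=8: {rv8:.3f}")
```

For this idealized input, the ratio for K=1 should be close to one and the ratio for K=8 markedly smaller. For real recordings the reduction depends on how well the signal meets the uncorrelatedness condition mentioned above.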
In a later section, we will compare methods which rest upon the AS as derived from a single window Hanning spectrogram to those which employ the AS calculated from Hermite MT spectrograms. To facilitate a reasonable comparison between methods based on the Hanning window and those based on Hermite windows, their respective time and frequency concentrations should be related. However, whereas the window length of a Hanning window is well-defined, the Hermite functions are of infinite length and therefore a “window length” is not a reasonable quantity for connecting those windows to each other in a suitable way. Hence, we need to define a measure that relates the two window types. In this paper, we therefore define the time concentration of a window as that time interval in which 99 % of the power is located. A corresponding definition of frequency concentration is to use the frequency interval which contains 99 % of the window’s spectral power. For the MT methods, the corresponding concentration values from the window h _{ K }(t) are used, as this is the window with lowest time and frequency concentration in the set of K MTs h _{1},…,h _{ K }. Note that with a larger value of K, i.e., with more tapers, the time and frequency resolution of the corresponding final estimate will decrease.
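A minimal sketch of such a concentration measure is given here. It assumes that the 99 % interval is taken with equal tails (0.5 % of the power cut off on each side), which is one possible reading of the definition; the function names, the 64-sample Hanning window, and the FFT length are hypothetical choices for illustration.

```python
import numpy as np


def concentration_width(samples, step, fraction=0.99):
    """Width of the equal-tailed interval holding `fraction` of the total power."""
    power = np.abs(np.asarray(samples)) ** 2
    cdf = np.cumsum(power) / power.sum()
    tail = (1.0 - fraction) / 2.0
    lo = np.searchsorted(cdf, tail)
    hi = np.searchsorted(cdf, 1.0 - tail)
    return (hi - lo) * step


def frequency_concentration(window, fs, nfft=8192, fraction=0.99):
    """Same measure applied to the zero-padded, centered spectrum of the window."""
    spectrum = np.fft.fftshift(np.fft.fft(window, nfft))
    return concentration_width(spectrum, fs / nfft, fraction)


fs = 44100 / 4                 # sampling rate after decimation by 4
hann = np.hanning(64)          # hypothetical window length
dt = concentration_width(hann, 1.0 / fs)
df = frequency_concentration(hann, fs)
print(f"time concentration: {1e3 * dt:.2f} ms, frequency concentration: {df:.0f} Hz")
```

The same two functions can be applied to a sampled Hermite function, which makes windows of finite and infinite support comparable on one scale.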
Key ingredients to our approach are on the one hand the usage of MTs for spectrogram estimation and on the other hand the transformation to the AS for subsequent feature extraction. The main property of the AS (which also holds for the filtered AS), as already mentioned in the introduction, is its invariance to frequency modulation and time shifts. In fact, for a modulated and time-shifted signal \(z(t) = x(t-t_{0})e^{i2\pi f_{0} t}\) one has \(|A_{z}(\nu,\tau)|=|A_{x}(\nu,\tau)|\), [28]. This property is desirable in many applications related to acoustic signals. As an example, if the comparison of two identical syllables starting at different time points is based on their respective ambiguity spectra, they will indeed be classified as being the same. Analogously, a frequency shift of a syllable, which can be thought of as pronouncing the same syllable at a different pitch, will not affect the identification either.
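The invariance can be illustrated on a discrete grid. The sketch below assumes that the AS magnitude is obtained as the absolute value of a two-dimensional discrete Fourier transform of a spectrogram matrix and that shifts are circular; the random matrix merely stands in for a real spectrogram.

```python
import numpy as np


def ambiguity_magnitude(S):
    """Magnitude of the 2-D DFT of a TF matrix (rows: frequency, columns: time)."""
    return np.abs(np.fft.fft2(S))


rng = np.random.default_rng(1)
S = rng.random((32, 48))                             # stand-in for a spectrogram
S_shifted = np.roll(S, shift=(5, 11), axis=(0, 1))   # circular shift in frequency and time

max_diff = np.max(np.abs(ambiguity_magnitude(S) - ambiguity_magnitude(S_shifted)))
print(f"largest magnitude difference after shifting: {max_diff:.2e}")
```

A shift only multiplies each DFT coefficient by a unit-modulus phase factor, so the magnitudes agree up to rounding error even though the two matrices differ.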
3 Feature extraction—singular value decomposition
The singular value decomposition is a low-rank matrix approximation and a popular noise-reduction technique for a data matrix. The decomposition of a matrix A results in the representation \(\mathbf{A} = \sum_{j=1}^{r} \sigma_{j} \mathbf{u}_{j} \mathbf{v}^{H}_{j}\), where \(\mathbf{u}_{j}, \mathbf{v}_{j}\) are the singular vectors of unit length and \(\sigma_{1}\geq \ldots \geq \sigma_{r}\geq 0\) the singular values. The unit-length vector \(\mathbf{v}_{1}\), i.e., the first right singular vector, maximizes the Euclidean norm \(\|\mathbf{A}\mathbf{v}\|_{2}\) and can hence be seen as the vector with unit length which undergoes the maximum amplification under A. Thus, \(\mathbf{v}_{1}\) serves as a crude approximation of the directions of the columns of A. Similarly, \(\mathbf{u}_{1}\) maximizes \(\|\mathbf{A}^{T}\mathbf{u}\|_{2}\) and serves as an approximation of the row-directions. Hence, if the matrix A is of low rank, the vectors \(\mathbf{u}_{1}, \mathbf{v}_{1}\) comprise the major information in A: \(\mathbf{u}_{1}\) captures the frequency-related information while \(\mathbf{v}_{1}\) captures the time-related information, and the matrix \(\tilde{\mathbf{A}}_{1} = \sigma_{1}\mathbf{u}_{1}\mathbf{v}_{1}^{H}\), with \(\sigma_{1}\) satisfying \(\sigma_{1}=\|\mathbf{A}^{T}\mathbf{u}_{1}\|_{2}=\|\mathbf{A}\mathbf{v}_{1}\|_{2}\), gives a good rank-1 approximation of A.
The ambiguity matrix derived from a song syllable is typically of low rank and therefore predestined for approximation by a small collection of singular vectors. Thus, little information is lost when replacing the AS-based similarity assessment of two syllables with a comparison of the corresponding first left and right singular vectors. More specifically, if \(\hat {\mathbf {A}}^{(A)}\) denotes the estimated AS of syllable s _{ A } and \(\hat {\mathbf {u}}_{1}^{(A)}, \hat {\mathbf {v}}_{1}^{(A)}\) the first pair of singular vectors (with corresponding notation for syllable s _{ B }), similarity between s _{ A } and s _{ B } can be captured by confining the investigation to comparison of \(\hat {\mathbf {u}}_{1}^{(A)}, \hat {\mathbf {v}}_{1}^{(A)}\) and \(\hat {\mathbf {u}}_{1}^{(B)},\hat {\mathbf {v}}_{1}^{(B)}\). Closeness of \(\hat {\mathbf {u}}_{1}^{(A)},\hat {\mathbf {u}}_{1}^{(B)}\) then suggests similarity of syllable structures in the frequency domain while closeness of \(\hat {\mathbf {v}}_{1}^{(A)}, \hat {\mathbf {v}}_{1}^{(B)}\) corresponds to structural resemblance in the time dimension.
A study of the singular values of the ambiguity spectrum shows that the first pair of singular vectors captures 80 % of the energy in most of the signals. Increasing the number of singular vector pairs to, e.g., 10, would explain approximately 90 % of the variations in the signal. Such gain in captured energy comes, however, at the expense of increased noise and unwanted jitter effects from small differences in similar signals. Restricting signal representation to the first pair is therefore a reasonable choice if the main structure of the signal should be captured. Our investigations including more pairs of singular vectors did not increase the performance.
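How much of a matrix the first singular vector pair captures can be computed directly from the singular values. The sketch below assumes that "energy" means the squared first singular value divided by the sum of all squared singular values, which is a common convention but is not spelled out in the text; the synthetic matrix is a hypothetical near-rank-1 example, not bird-song data.

```python
import numpy as np


def first_singular_pair(A):
    """Return u1, v1, sigma1 and the energy fraction sigma1^2 / sum_j sigma_j^2."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    u1 = U[:, 0]
    v1 = Vh[0].conj()          # rows of Vh are the conjugated right singular vectors
    energy_fraction = s[0] ** 2 / np.sum(s ** 2)
    return u1, v1, s[0], energy_fraction


rng = np.random.default_rng(2)
# Hypothetical test matrix: rank-1 structure plus weak noise.
A = np.outer(np.hanning(40), np.hanning(60)) + 0.01 * rng.standard_normal((40, 60))
u1, v1, sigma1, energy = first_singular_pair(A)

print(f"sigma_1 = {sigma1:.3f}, energy captured by the first pair = {energy:.4f}")
print(f"||A v1|| = {np.linalg.norm(A @ v1):.3f}, ||A^H u1|| = {np.linalg.norm(A.conj().T @ u1):.3f}")
```

Both printed norms coincide with the first singular value, which is the identity used for the rank-1 approximation in this section.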
4 Similarity measures
The normality of singular vectors implies β(s_{A},s_{A})=1. Note, however, that β(s_{A},s_{B})=1 does not suggest equality of syllable s_{A} and syllable s_{B} but rather a strong alikeness. In clustering applications, the key issue is to decide whether two syllables are realizations of the same syllable type (and thus should be allocated to the same cluster) or if they arise from distinct syllable types (and hence should be grouped in different clusters). Due to background noise in recordings and within-individual variability, β(s_{A},s_{B}) will rarely be equal to 1 even if s_{A},s_{B} represent the same type of syllable. Thus, the decision on assigning two syllables to the same or to distinct groups will be made based on whether or not β(s_{A},s_{B}) exceeds a certain threshold ρ.
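The threshold decision can be sketched as follows. The definitions used here are assumptions: the raw measure is taken to be the absolute inner product of two unit-length singular vectors of equal length, and the combined measure is the plain average of the left-vector and right-vector scores, in line with the verbal description in the conclusions. The function names and the value of ρ are hypothetical.

```python
import numpy as np


def beta(a, b):
    """Assumed raw similarity: absolute inner product of two unit-length vectors."""
    return float(np.abs(np.vdot(a, b)))


def beta_mean(u_a, v_a, u_b, v_b):
    """Assumed combined measure: average of the left- and right-vector similarities."""
    return 0.5 * (beta(u_a, u_b) + beta(v_a, v_b))


def same_type(similarity, rho):
    """Assign two syllables to the same group if the similarity exceeds rho."""
    return similarity > rho


u = np.array([3.0, 4.0]) / 5.0
w = np.array([-4.0, 3.0]) / 5.0    # orthogonal to u

print(beta(u, u), beta(u, -u), beta(u, w))
print(same_type(beta_mean(u, u, u, w), rho=0.8))   # rho = 0.8 is an arbitrary example value
```

Taking the absolute value makes the score insensitive to the sign (or phase) ambiguity of singular vectors, so that a vector and its negative are treated as identical.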
5 Syllable detection
The syllable detection approach is divided into two steps. A set of filters is applied in the first step to filter out background noise while the syllables are defined and extracted in the second step based on time distances between amplitude peaks.
where \(l_{sens}\) is the sensitivity of the detector in percent. After a particular signal section has been declared as a syllable, its start and end time points are extended backwards and forwards (by a default value of ±60 ms) to include the syllable’s weaker start and end.
6 Data presentation and baseline classification
The data under consideration is a 7-min bird-song signal recorded from the Great Reed Warbler under natural outdoor conditions. The bird song has been recorded in analog form with a Telinga parabola and microphone and a SONY cassette tape recorder (SONY TC-D5M). The recording is of average quality (with respect to noise and disturbance) and the signal is digitized at a sampling frequency of 44.1 kHz and subsequently decimated by a factor of 4 for the further analysis.
Before the main analysis, the output from the automatic syllable detection step (as described in the previous section) is manually checked for detection errors. In four of the strophes, initial notes were falsely declared as syllables and were therefore removed manually (in total 4×3 syllables). In one strophe a burst of noise was erroneously detected as a syllable and removed as well.
The resulting data to be used for our analysis consists of 362 detected syllables in 28 strophes. A typical strophe section in a GRW song consists of 2–8 repeats of the same syllable type followed by a change to realizations of another syllable structure, see Fig. 3. This characteristic pattern makes it fairly easy to assess whether two subsequent syllables belong to the same or to different types, since the change to another type of syllable is generally quite pronounced. This facilitates a rather straightforward visual (based on the spectrogram and/or the time domain representation of the syllables) and auditive classification of subsequent syllables as being similar (S), i.e., realizations of the same syllable type, or non-similar (N). As an example, the first two syllables in Fig. 4b are labeled as similar, the pair given by the second and third syllables is marked as non-similar, and the pair of the third and fourth syllables is again declared similar. This pairwise classification was conducted for all 28 strophes and the resulting labeled data contains 217 subsequent syllable pairs classified as similar and 117 classified as non-similar. This labeling is used as the baseline “truth” for the evaluation of different methods and parameter choices.
The computed SNR of the strophe in Fig. 3 is 20 dB.
Note that due to the special pattern of the GRW song (repeats of a particular syllable type are followed by repetitions of another syllable structure), comparing subsequent syllables poses a simpler problem than the general approach where all syllables are compared with each other. Defining a baseline truth for method evaluation in the general problem is a much more involved task, as it is often difficult to decide (based on listening and visual inspection of spectrograms) whether two syllables chosen at random from a song are similar, and different experts might well come to different conclusions.
7 Evaluation
With a feature set based on the first singular vectors of the estimated filtered ambiguity spectrum and our proposed similarity measures at hand, we proceed to evaluate our methodology. The target quantities for evaluation are (1) the similarity rate p_{S}(α), i.e., the proportion of correctly classified pairs of similar syllables while accepting α·100 % false positives (α·100 % “non-similar” pairs are misclassified as “similar”) and (2) the non-similarity rate p_{N}(α), i.e., the proportion of correctly classified non-similar syllables while allowing for α·100 % false negatives (pairs of “similar” syllables falsely classified as “non-similar”). Here, α is fixed to the value of 0.05.
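The two target quantities can be computed from labeled similarity scores as sketched below. The sketch assumes that a pair is classified as similar when its score exceeds a threshold and that the threshold is placed at the corresponding empirical quantile; the function names and the toy scores are hypothetical and not taken from the study.

```python
import numpy as np


def similarity_rate(sim_scores, nonsim_scores, alpha=0.05):
    """p_S(alpha): share of similar pairs above a threshold that at most a
    fraction alpha of the non-similar pairs exceed."""
    nonsim = np.sort(np.asarray(nonsim_scores, dtype=float))
    n_allowed = int(np.floor(alpha * len(nonsim)))   # tolerated false positives
    threshold = nonsim[len(nonsim) - n_allowed - 1]
    return float(np.mean(np.asarray(sim_scores) > threshold))


def non_similarity_rate(sim_scores, nonsim_scores, alpha=0.05):
    """p_N(alpha): share of non-similar pairs below a cut that at most a
    fraction alpha of the similar pairs fall under."""
    sim = np.sort(np.asarray(sim_scores, dtype=float))
    n_allowed = int(np.floor(alpha * len(sim)))      # tolerated false negatives
    cut = sim[n_allowed]
    return float(np.mean(np.asarray(nonsim_scores) < cut))


# Toy scores for illustration only.
sim = np.array([0.80, 0.90, 0.95, 0.99])
nonsim = np.linspace(0.10, 0.85, 20)
p_s = similarity_rate(sim, nonsim)
p_n = non_similarity_rate(sim, nonsim)
print(p_s, p_n)
```

With the baseline labeling of 217 similar and 117 non-similar pairs and α = 0.05, this quantile rule would tolerate at most 5 misclassified non-similar pairs for p_{S} and at most 10 misclassified similar pairs for p_{N}.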
In the first part, we evaluate the performance of the raw similarity measures β _{ u } and β _{ v } individually, based on different settings for the MT windows. In the second part, we assess the performance of the suggested combined measures for a selection of MT and single window settings.
7.1 Evaluation of raw measures under different window settings
All different windows and their time and frequency bandwidths
Window | Δ t (ms) | Δ f (kHz) | Window | Δ t (ms) | Δ f (kHz) |
---|---|---|---|---|---|
MT8(53,0.21) | 53.3 | 0.215 | MT2(15,0.24) | 14.9 | 0.237 |
MT8(27,0.43) | 26.8 | 0.431 | MT2(7.6,0.50) | 7.62 | 0.495 |
MT8(13,0.88) | 13.4 | 0.883 | MT2(3.8,0.97) | 3.81 | 0.969 |
MT8(6.9,1.8) | 6.89 | 1.77 | MT2(2.0,2.0) | 1.99 | 1.96 |
MT4(39,0.15) | 39.4 | 0.151 | H1(32,0.065) | 31.56 | 0.065 |
MT4(20,0.32) | 19.8 | 0.323 | H1(16,0.11) | 15.96 | 0.108 |
MT4(10,0.65) | 9.98 | 0.646 | H1(8.0,0.24) | 7.98 | 0.237 |
MT4(5.1,1.3) | 5.08 | 1.31 | H1(4.2,0.47) | 4.17 | 0.474 |
MT4(2.5,2.6) | 2.54 | 2.61 | H1(2.2,0.95) | 2.18 | 0.947 |
MT2(30,0.13) | 29.7 | 0.129 | H1(1.1,1.8) | 1.09 | 1.83 |
In further investigations, the analysis will be restricted to one Hermite MT and a single Hanning window spectrogram. It is, however, not too obvious which constellations are most qualified. Clearly, best results are given by K=8 MTs, but it is ambiguous which window length should be used for them. In most cases, the window length corresponding to method MT8(13,0.88) appears to give the best results for MTs, while for the Hanning window spectrogram the choice H1(2.2,0.95) is considered as most suitable for further analysis. These constellations will in future considerations be referred to as MT8_{ AU } when similarity is assessed by filtered ambiguity spectrum and β _{ u } and as MT8_{ AV } when instead the measure β _{ v } is employed. Similarly, investigation based on the chosen Hanning window ambiguity spectrum will be referred to as H1_{ AU } and H1_{ AV }, depending on whether similarity is assessed by β _{ u } or β _{ v }.
To summarize, further investigations will be based on MT8_{ AU },MT8_{ AV },H1_{ AU },H1_{ AV },MT8_{ SU }, and H1_{ SU }.
7.2 Evaluation of combined measures
7.3 Comparison with established approaches
Here, we compare more thoroughly the performance of the best method based on the multitaper AS, MT8_{Amean}, to the corresponding Hanning-window approach H1_{Amean}, to the spectrogram-based methods which have been selected in section 7.1 (i.e., MT8_{SU} and H1_{SU}) and moreover to established approaches based on MFCCs and the SPCC. For the MFCC method, the often used implementation by Malcolm Slaney^{1} is chosen with eight cepstral coefficients, a 25 ms Hamming window and 90 % overlap between frames. For the SPCC method, we use the single window Hanning spectrogram with time and frequency resolutions 2.18 ms and 947 Hz as defined above.
8 Classification of four predefined syllable classes
The classification of subsequent syllable pairs into “similar” or “non-similar” constitutes the first performance assessment of our methodology. This classification task is, however, beneficial for all algorithms as two subsequent similar syllables are often very similar and therefore alikeness is not too difficult to detect. In the same way, two non-similar syllables are generally sufficiently different such that switches from one syllable-type to another are easily detected by auditive or visual inspection.
For the considered syllable subset, the average of the mean powers of all 51 syllables resulted in an SNR of 15 dB.
Rates for correct classification of two syllables belonging to the same class accepting 5 % false positives
Class comp. | MT8_{Amean} | H1_{Amean} | MT8_{SU} | H1_{SU} | MFCC | SPCC |
---|---|---|---|---|---|---|
1−2 | 1.0 | 0.994 | 0.402 | 0.325 | 0.905 | 0.846 |
1−3 | 1.0 | 1.0 | 1.0 | 1.0 | 0.987 | 1.0 |
1−4 | 1.0 | 1.0 | 1.0 | 1.0 | 0.932 | 1.0 |
2−3 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
2−4 | 1.0 | 1.0 | 0.842 | 0.849 | 0.931 | 0.842 |
3−4 | 1.0 | 1.0 | 1.0 | 1.0 | 0.421 | 1.0 |
The best results for all class comparisons are given by MT8_{Amean}, closely followed by H1_{Amean}. The results of the spectrogram-based methods MT8_{SU} and H1_{SU} are, however, convincing only for some of the comparisons. For the comparison of class 1 and 2, these methods fail with an achieved similarity rate of only 0.402 and 0.325, respectively. This drop in performance is not surprising as these methods are based on the measure u_{1} and therefore entirely disregard the time information contained in the syllables. For the comparison of class 2 and 4 the performances of MT8_{SU} and H1_{SU} are better but still quite unreliable (0.842 and 0.849). The MFCC method performs convincingly for all pairwise class comparisons, apart from the last one (class 3 versus class 4). These two classes include short single syllables, which do not have the typical structure for which the MFCC method is designed, i.e., signals, such as speech, with repeating structures suitable for the cepstral decomposition. The results of the SPCC method are promising in all cases, except for the comparisons of class 1 to 2 and class 2 to 4. The insufficient performance for these two class comparisons is connected to the different number of repeats of main components of a syllable in class 2, as already discussed and exemplified in the introduction (see also Fig. 1). There it was noted that cross-correlation of spectrograms will be unreliable if the number of components in the syllables varies.
9 Conclusions
In this work, a novel feature set for low-dimensional signal representation is suggested that is designed for the analysis of non-stationary signals with complex variation in time and frequency. The features for signal representation are given by the first pair of singular vectors from the MT ambiguity spectrum, which ensures robustness to noise and to shifts in time, frequency, and amplitude. For classification or clustering of a signal (e.g., a bird song), a collection of similarity measures is proposed. These are compared and evaluated on the basis of an outdoor recording of a wild male Great Reed Warbler, a bird species with complex song structure. Moreover, it is shown that the suggested signal representation along with a specific combined similarity measure (which uses an average of the inner product of right singular vectors and of the left singular vectors) clearly outperforms other well-known methods (SPCC and MFCC) in the example analysis of the bird-song data.
Our methodology is also compared to a similar approach where the AS is replaced by the spectrogram for feature extraction, and it is observed that switching to the spectrogram comes along with a marked decrease in performance. Furthermore, we compared spectrograms based on (Hermite) MTs to spectrograms based on a single Hanning window and concluded that MTs increase the performance in a classification task, both for AS-based and for spectrogram-based feature representation.
10 Endnote
Declarations
Acknowledgements
Thanks to the eSSENCE academy for funding and to the Department of Biology, Lund University, for data collection.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
- ERA Cramer, Measuring consistency: spectrogram cross-correlation versus targeted acoustic parameters. Bioacoustics: Int. J. Anim. Sound Recording. 22(3), 247–257 (2012).View ArticleGoogle Scholar
- S Keen, JC Ross, ET Griffiths, M Lanzone, A Farnsworth, A comparison of similarity-based approaches in the classification of flight calls of four species of north american wood-warblers (parulidae). Ecol. Informatics. 21:, 25–33 (2014).View ArticleGoogle Scholar
- S Fagerlund, UK Laine, New parametric representations of bird sounds for automatic classification, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8247–8251 (2014). http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6855209&isnumber=6853544.
- CD Meliza, SC Keen, DR Rubenstein, Pitch- and spectral-based dynamic time warping methods for comparing field recordings of harmonic avian vocalizations. J. Acoust. Soc. Am. 134(2), 1407–1415 (2013).
- O Tchernichovski, TJ Lints, S Deregnaucourt, A Cimenser, PP Mitra, Studying the song development process: rationale and methods. Ann. NY Acad. Sci. 1016, 348–363 (2004).
- P Somervuo, A Härmä, S Fagerlund, Parametric representations of bird sounds for automatic species recognition. IEEE Trans. Audio Speech Lang. Process. 14(6), 2252–2263 (2006).
- X Zhang, Y Li, Adaptive energy detection for bird sound detection in complex environments. Neurocomputing. 155, 108–116 (2015).
- D Hasselquist, S Bensch, T von Schantz, Correlation between male song repertoire, extra-pair paternity and offspring survival in the great reed warbler. Nature. 381, 229–232 (1996).
- E Wȩgrzyn, K Leniowski, Syllable sharing and changes in syllable repertoire size and composition within and between years in the great reed warbler, Acrocephalus arundinaceus. J. Ornithol. 151, 255–267 (2010). doi:10.1007/s10336-009-0451-x.
- DJ Thomson, Spectrum estimation and harmonic analysis. Proc. IEEE. 70(9), 1055–1096 (1982).
- I Daubechies, Time-frequency localization operators: a geometric phase space approach. IEEE Trans. Information Theory. 34(4), 605–612 (1988).
- B Jokanovic, MG Amin, YD Zhang, F Ahmad, Multi-window time-frequency signature reconstruction from undersampled continuous-wave radar measurements for fall detection. IET Radar, Sonar Navigation. 9(2), 173–183 (2015).
- M Hansson-Sandsten, Optimal estimation of the time-varying spectrum of a class of locally stationary processes using Hermite functions. EURASIP J. Adv. Signal Process. (2011). Article ID 980805.
- I Orović, S Stanković, M Amin, A new approach for classification of human gait based on time-frequency feature representations. Signal Process. 91(6), 1448–1456 (2011).
- P Wahlberg, M Hansson, Kernels and multiple windows for estimation of the Wigner-Ville spectrum of Gaussian locally stationary processes. IEEE Trans. Signal Process. 55(1), 73–87 (2007).
- M Hansson-Sandsten, M Tarka, J Caissy-Martineau, B Hansson, D Hasselquist, SVD-based classification of bird singing in different time-frequency domains using multitapers, Signal Processing Conference, 2011 19th European, 966–970 (2011). http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7073944&isnumber=7069645.
- M Hansson-Sandsten, Classification of bird song syllables using singular vectors of the multitaper spectrogram, Signal Processing Conference, 2015 23rd European, 554–558 (2015). http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7362444&isnumber=7362087.
- B Boashash, NA Khan, T Ben-Jabeur, Time-frequency features for pattern recognition using high-resolution TFDs: A tutorial review. Digital Signal Process. 40, 1–30 (2015). doi:10.1016/j.dsp.2014.12.015.
- RJ Barry, FM de Blasio, EM Bernat, GZ Steiner, Event-related EEG time-frequency PCA and the orienting reflex to auditory stimuli. Psychophysiology. 52, 555–561 (2015).
- Z Yu, Y Sun, W Jin, A novel generalized demodulation approach for multi-component signals. Signal Process. 118, 188–202 (2016).
- DD Lee, HS Seung, Learning the parts of objects by non-negative matrix factorization. Nature. 401(6755), 788–791 (1999).
- R Hennequin, R Badeau, B David, NMF with time-frequency activations to model non-stationary audio events. IEEE Trans. Audio Speech Lang. Process. 19(4), 744–753 (2011).
- B Ghoraani, S Krishnan, Time-frequency matrix feature extraction and classification of environmental audio signals. IEEE Trans. Audio Speech Lang. Process. 19(7), 1071–1083 (2011).
- B Ghoraani, Selected topics on time-frequency matrix decomposition analysis. J. Pattern Recognit. Intell. Syst. 1(3), 64–78 (2013).
- H Qiao, New SVD based initialization strategy for non-negative matrix factorization. Pattern Recognit. Lett. 63, 71–77 (2015).
- D Groutage, D Bennink, Feature sets for nonstationary signals derived from moments of the singular value decomposition of Cohen-Posch (positive time-frequency) distributions. IEEE Trans. Signal Process. 48(5), 1498–1503 (2000).
- M Große Ruse, D Hasselquist, B Hansson, M Tarka, M Sandsten, Automated analysis of song structure in complex birdsongs. Animal Behav. 112, 39–51 (2015). doi:10.1016/j.anbehav.2015.11.013.
- B Boashash, Time Frequency Signal Analysis and Processing; A Comprehensive Reference, 1st edn. (Elsevier Ltd, The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK, 2003).
- L Cohen, Time-Frequency Analysis, 1st edn. (Prentice-Hall Inc., Upper Saddle River, NJ, USA, 1995).
- M Bayram, RG Baraniuk, Multiple window time-frequency analysis, Time-Frequency and Time-Scale Analysis, 1996., Proceedings of the IEEE-SP International Symposium on, 173–176 (1996). http://ieeexplore.ieee.org.ludwig.lub.lu.se/stamp/stamp.jsp?tp=&arnumber=547209&isnumber=11466.
- K Leniowski, E Wȩgrzyn, Organization, variation in time, and impacting factors in the song strophe repertoire in the great reed warbler (Acrocephalus arundinaceus). Ornis Fennica. 90, 129–141 (2013).