Open Access

Music recommendation according to human motion based on kernel CCA-based relationship

EURASIP Journal on Advances in Signal Processing 2011, 2011:121

https://doi.org/10.1186/1687-6180-2011-121

Received: 15 April 2011

Accepted: 5 December 2011

Published: 5 December 2011

Abstract

In this article, a method for recommendation of music pieces according to human motions based on their kernel canonical correlation analysis (CCA)-based relationship is proposed. In order to perform the recommendation between different types of multimedia data, i.e., recommendation of music pieces from human motions, the proposed method tries to estimate their relationship. Specifically, the correlation based on kernel CCA is calculated as the relationship in our method. Since human motions and music pieces have various time lengths, it is necessary to calculate the correlation between time series having different lengths. Therefore, new kernel functions for human motions and music pieces, which can provide similarities between data that have different time lengths, are introduced into the calculation of the kernel CCA-based correlation. This approach effectively provides a solution to the conventional problem of not being able to calculate the correlation from multimedia data that have various time lengths. Therefore, the proposed method can perform accurate recommendation of best matched music pieces according to a target human motion from the obtained correlation. Experimental results are shown to verify the performance of the proposed method.

Keywords

content-based multimedia recommendation; kernel canonical correlation analysis; longest common subsequence; p-spectrum

1 Introduction

With the popularization of online digital media stores, users can obtain various kinds of multimedia data. Therefore, technologies for retrieving and recommending desired contents are necessary to satisfy the various demands of users. A number of methods for content-based multimedia retrieval and recommendation have been proposed. Image recommendation [1-3], music recommendation [4-6], and video recommendation [7, 8] have been intensively studied in several fields. It should be noted that most of these previous works had the constraint that the query examples and the returned results to be recommended must be of the same type. However, due to diversification of users' demands, there is a need for a new type of multimedia recommendation in which the media types of query examples and the returned results can be different. Thus, several recommendation methods [9-12] for realizing these recommendation schemes have been proposed. Generally, they are called cross-media recommendation. In the conventional methods of cross-media recommendation, the query examples and recommended results need not be of the same media type. For example, users can search for music pieces by submitting either an image example or a music example.

Among the conventional methods of cross-media recommendation, Li et al. proposed a method for recommendation between images and music pieces by comparing their features directly using a dynamic time warping algorithm [9]. Furthermore, Zhang et al. proposed a method for cross-media recommendation between multimedia documents based on a semantic graph [11, 12]. A multimedia document (MMD) is a collection of co-existing heterogeneous multimedia objects that have the same semantics. For example, an educational web page with instructive text, images and audio is an MMD. By these conventional methods, users can search for their desired contents more flexibly and effectively.

It should be noted that the above conventional methods concentrate on recommendation between different types of multimedia data. Thus, in this scheme, users are forced to provide query multimedia data, although the media types are not restricted. This means that users must make some decisions to provide queries, which makes it difficult to reflect their demands. If recommendation of multimedia data from features directly obtained from users is realized, it provides one feasible solution to overcome this limitation. Specifically, we show the following two example applications: (i) background music selection from humans' dance motions for non-edited video contents and (ii) presentation of music information from features of target music pieces or dance motions. In the first example, using the relationship obtained between dance motions and music pieces in a database, we can find matched music pieces from human motions in video contents, and vice versa. This should be useful for creating a new dance program with background music and a music promotional video with dance motions. For example, given human motions of a classic ballet program, we can assign music pieces matched to the target human motions, and this example will be shown in the verification in the experiment section. In the second example, the application can present to users information on the music that they are listening to, i.e., song title, composer, etc. Users can use sounds of music pieces or their own dance motions associated with the music as the query for obtaining information on the music. As described above, this application can also use the relationship between human motions and music pieces, and it can be a more flexible information presentation system than the conventional ones. In this way, information directly obtained from users, i.e., users' motions, has the potential to provide various benefits. These schemes are cross-media recommendation schemes, and they remove barriers between users and multimedia contents.

In this article, we deal with recommendation of music pieces from features obtained from users. Among the features, human motions have high-level semantics, and their use is effective for realizing accurate recommendation. Therefore, we try to estimate suitable music pieces from human motions. This is because we consider that correlation extraction between human motions and music pieces becomes feasible using some specific video contents such as dance and music promotional videos. This benefit is also useful in performance verification. Then, we assume that the meaning of "suitable" is emotionally similar. Specifically, in our purpose, the recommendation of suitable music pieces according to human motions is that the recommended music pieces are emotionally similar to the query human motions.

In this article, we propose a new method for cross-media recommendation of music pieces according to human motions based on kernel canonical correlation analysis (CCA) [13]. We use video contents in which video sequences and audio signals contain human motions and music pieces, respectively, as training data for calculating their correlation. Then, using the obtained correlation, estimation of the best matched music piece from a target human motion becomes feasible. It should be noted that several methods of cross-media recommendation have previously been proposed. However, there have been no methods focused on handling data that have various time lengths, i.e., human motions and music pieces. Thus, we propose a cross-media recommendation method that can effectively use characteristics of time series, and we assume that this can be realized using kernel CCA and our defined kernel functions. From the above discussion, the main contribution of the proposed method is handling data that have various time lengths for cross-media recommendation.

In this approach, we have to consider the differences in time lengths. In the proposed method, new kernel functions of human motions and music pieces are introduced into the CCA-based correlation calculation. Specifically, we newly adopt two types of kernel functions, which can represent similarities by effectively using human motions or music pieces having various time lengths, for the kernel CCA-based correlation calculation. First, we define a longest common subsequence (LCSS) kernel for using data having different time lengths. Since the LCSS [14] is commonly used for motion comparison, the LCSS kernel should be suitable for our purpose. It should be noted that kernel functions must satisfy Mercer's theorem [15], but our newly defined kernel function does not necessarily satisfy this theorem. Therefore, we also adopt another type of kernel function, spectrum intersection kernel, that satisfies Mercer's theorem. This function introduces the p-spectrum [16] and is based on the histogram intersection kernel [17]. Since the histogram intersection kernel is known as a function that satisfies Mercer's theorem, the spectrum intersection kernel also satisfies this theorem.

Actually, there have been kernel functions that do not satisfy Mercer's theorem, and there have also been several proposed methods that use such kernel functions. The effectiveness of the above-described methods has also been verified. Thus, we should also verify the effectiveness of our defined kernel function, which does not satisfy Mercer's theorem, i.e., the LCSS kernel. In addition, we should also compare our two newly defined kernel functions experimentally. Therefore, in this article, we introduce two types of kernel functions. Using these two types of kernel functions, the proposed method can directly compare multimedia data that have various time lengths, and this is the main advantage of our method. Thus, the use of these kernel functions effectively provides a solution to the problem of not being able to simply apply sequential data such as human motions and music pieces to cross-media recommendation. Consequently, effective modeling of the relationship using music and human motion data that have various time lengths is realized, and successful music recommendation can be expected.

This article is organized as follows. First, in Section 2, we briefly explain the kernel CCA used for calculating the correlation between human motions and music pieces. Next, in Section 3, we describe our two newly defined kernel functions. Kernel CCA-based music recommendation according to human motion is proposed in Section 4. Experimental results that verify the performance of the proposed method are shown in Section 5. Finally, conclusions are given in Section 6.

2 Kernel canonical correlation analysis

In this section, we explain kernel CCA. First, two variables $x$ and $y$ are transformed into Hilbert spaces $H_x$ and $H_y$ via non-linear maps $\phi_x$ and $\phi_y$. From the mapped results $\phi_x(x) \in H_x$ and $\phi_y(y) \in H_y$, the kernel CCA seeks to maximize the correlation
$$\rho = \frac{E[uv]}{\sqrt{E[u^2]\,E[v^2]}} \tag{1}$$
between
$$u = \langle a, \phi_x(x) \rangle \tag{2}$$
and
$$v = \langle b, \phi_y(y) \rangle \tag{3}$$

over the projection directions $a$ and $b$. This means that kernel CCA finds the directions $a$ and $b$ that maximize the correlation $E[uv]$ of corresponding projections subject to $E[u^2] = 1$ and $E[v^2] = 1$.
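As a small illustration of Equation 1, the following Python sketch estimates $\rho$ from paired samples of the projections $u$ and $v$ by replacing the expectations with sample means. The function name `empirical_correlation` and the toy values are assumptions made for illustration only; as in Equation 1, the projections are taken to be zero-mean.

```python
import math


def empirical_correlation(u, v):
    """Sample estimate of rho = E[uv] / sqrt(E[u^2] E[v^2]) (Equation 1)."""
    if len(u) != len(v) or not u:
        raise ValueError("u and v must be non-empty and of equal length")
    n = len(u)
    e_uv = sum(ui * vi for ui, vi in zip(u, v)) / n
    e_uu = sum(ui * ui for ui in u) / n
    e_vv = sum(vi * vi for vi in v) / n
    return e_uv / math.sqrt(e_uu * e_vv)


# Toy projections (hypothetical values): v is an exact multiple of u,
# so the estimated correlation is at its maximum.
u = [-1.0, 0.0, 1.0]
v = [-2.0, 0.0, 2.0]
rho = empirical_correlation(u, v)
```

Rescaling $u$ or $v$ leaves this ratio unchanged, which is why the constraints $E[u^2] = 1$ and $E[v^2] = 1$ can be imposed without loss of generality.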

The optimal directions a and b can be found by solving the Lagrangian
$$L = E[uv] - \frac{\lambda_1}{2}\left(E[u^2] - 1\right) - \frac{\lambda_2}{2}\left(E[v^2] - 1\right) + \frac{\eta}{2}\left(\|a\|^2 + \|b\|^2\right), \tag{4}$$

where $\eta$ is a regularization parameter. The above computation scheme is called regularized kernel CCA [13]. By taking the derivatives of Equation 4 with respect to $a$ and $b$, $\lambda_1 = \lambda_2\ (= \lambda)$ is derived, and the directions $a$ and $b$ maximizing the correlation $\rho\ (= \lambda)$ can be calculated.

3 Kernel function construction

Construction of new kernel functions is described in this section. The proposed method constructs two types of kernel functions for human motions and music pieces, respectively. First, we introduce an LCSS kernel as a kernel function that does not satisfy Mercer's theorem. This function is based on the LCSS algorithm [18], which is commonly used for motion or temporal music signal comparison since the LCSS algorithm can compare two temporal signals even if they have different time lengths. Therefore, it seems that this kernel function is suitable for our recommendation scheme. On the other hand, we also introduce a spectrum intersection kernel that satisfies Mercer's theorem. This function is based on the p-spectrum [16], which is generally used for text comparison. The p-spectrum uses the continuity of words. This property is also useful for analyzing the structure of temporal sequential data, i.e., human motions. Thus, the spectrum intersection kernel is also suitable for our recommendation scheme.

For the following explanation, we prepare pairs of human motions and music pieces extracted from the same video contents and denote each pair as a segment. The segments are defined as short terms of video contents that have various time lengths. From the obtained segments, we extract human motion features and music features of the $j$th ($j = 1, 2, \ldots, N$) segment as $V_j = [v_j(1), v_j(2), \ldots, v_j(N_{V_j})]$ and $M_j = [m_j(1), m_j(2), \ldots, m_j(N_{M_j})]$, where $N_{V_j}$ and $N_{M_j}$ are the numbers of components of $V_j$ and $M_j$, respectively, and $N$ is the number of segments. In $V_j$ and $M_j$, $v_j(l_V)$ ($l_V = 1, 2, \ldots, N_{V_j}$) and $m_j(l_M)$ ($l_M = 1, 2, \ldots, N_{M_j}$) correspond to optical flows [19] and chroma vectors [20], respectively. The optical flow is a simple and representative feature that represents motion characteristics between two successive frames in video sequences and is commonly used for motion comparison. Thus, we adopt the optical flow as temporal components of human motion features. Furthermore, the chroma vector represents tone distribution of music signals at each time. The chroma vector can represent the characteristics of a music signal robustly if it is extracted in a short time. In addition, due to the simplicity of the implementation, we adopted these features in our method. More details of these features are given in Appendices A.1 and A.2.

3.1 Kernel function for human motions

3.1.1 LCSS kernel

In order to define kernel functions for human motions having various time lengths, we first explain the LCSS kernel for human motions, which uses the LCSS-based similarity in [14]. The LCSS algorithm computes the longest common part of two sequences and its length (the LCSS length).

Figure 1 shows an example of the table produced when computing the LCSS length of two sequences X = 〈B, D, C, A, B〉 and Y = 〈A, B, C, B, A, B〉. In this figure, the highlighted components represent the common components of the two sequences, and the LCSS length between X and Y is four.
Figure 1

An example of a table based on the LCSS length of the sequences X = 〈B, D, C, A, B〉 and Y = 〈A, B, C, B, A, B〉.
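The table in Figure 1 can be reproduced with the standard dynamic-programming formulation of the LCSS length. The following Python sketch is illustrative only; the function name `lcss_table` is an assumption and is not taken from the article.

```python
def lcss_table(x, y):
    """Return the (len(x)+1) x (len(y)+1) table of LCSS lengths.

    table[i][j] holds the LCSS length of the prefixes x[:i] and y[:j].
    """
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Matching components extend the common subsequence.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise keep the better of the two shorter prefixes.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


X = ["B", "D", "C", "A", "B"]
Y = ["A", "B", "C", "B", "A", "B"]
table = lcss_table(X, Y)
lcss_length = table[len(X)][len(Y)]  # bottom-right cell of the table
```

For the two sequences of Figure 1, the bottom-right cell equals four, corresponding to the common subsequence 〈B, C, A, B〉.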

Here, we show the definition of similarity between human motion features. For the following explanations, we denote two human motion features as $V_a = [v_a(1), v_a(2), \ldots, v_a(N_{V_a})]$ and $V_b = [v_b(1), v_b(2), \ldots, v_b(N_{V_b})]$, where $v_a(l_a)$ ($l_a = 1, 2, \ldots, N_{V_a}$) and $v_b(l_b)$ ($l_b = 1, 2, \ldots, N_{V_b}$) are components of $V_a$ and $V_b$, respectively, and $N_{V_a}$ and $N_{V_b}$ are the numbers of components in $V_a$ and $V_b$, respectively. In addition, $v_a(l_a)$ and $v_b(l_b)$ correspond to optical flows extracted in each frame in each video sequence. Note that $N_{V_a}$ and $N_{V_b}$ depend on the time lengths of their segments; that is, they depend on the number of frames of their video sequences. The similarity between $V_a$ and $V_b$ is defined as follows:
$$\mathrm{Sim}_V(V_a, V_b) = \frac{\mathrm{LCSS}(V_a, V_b)}{\min(N_{V_a}, N_{V_b})}, \tag{5}$$
where $\mathrm{LCSS}(V_a, V_b)$ is the LCSS length of $V_a$ and $V_b$, and it is recursively defined as
$$\mathrm{LCSS}(V_a, V_b) = R_{V_a V_b}(l_a, l_b)\Big|_{l_a = N_{V_a},\ l_b = N_{V_b}}, \tag{6}$$
$$R_{V_a V_b}(l_a, l_b) = \begin{cases} 0 & \text{if } l_a = 0 \text{ or } l_b = 0, \\ 1 + R_{V_a V_b}(l_a - 1, l_b - 1) & \text{if } c(v_a(l_a)) = c(v_b(l_b)), \\ \max\{R_{V_a V_b}(l_a - 1, l_b),\ R_{V_a V_b}(l_a, l_b - 1)\} & \text{otherwise}, \end{cases} \tag{7}$$

where c(·) is a cluster number of optical flow. In the proposed method, we apply a k-means algorithm [21] to all optical flows obtained from all segments, and the cluster number c(·) assigned to each optical flow is used for easy comparison of two different optical flows. For this purpose, other kinds of quantization or labeling of the temporal variation of the time series also seem to be applicable. In the proposed method, we adopt k-means clustering for its simplicity.
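The quantization step described above can be sketched as follows. This is a minimal, plain-Python version of Lloyd's k-means iteration, not the implementation used in the article; the function name `quantize_flows`, the initialization from the first `k` vectors, and the fixed iteration count are all assumptions made to keep the example deterministic.

```python
def quantize_flows(flows, k, n_iter=20):
    """Assign each feature vector a cluster number c(.) with k-means."""
    if not 0 < k <= len(flows):
        raise ValueError("k must be between 1 and the number of vectors")
    # Assumption: initialize the centers with the first k vectors.
    centers = [list(f) for f in flows[:k]]
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest center in squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(f, centers[c])))
            for f in flows
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(flows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy two-dimensional "optical flows" (hypothetical values).
flows = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.0, 5.1), (0.0, 0.2)]
labels = quantize_flows(flows, k=2)
```

Once every optical flow carries a cluster number, a segment becomes a sequence of integers, which is the form that the recursion in Equation 7 compares.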

We then define this similarity measure as the LCSS kernel for human motions $\kappa_V^{LCSS}(\cdot, \cdot)$ as follows:
$$\kappa_V^{LCSS}(V_a, V_b) = \mathrm{Sim}_V(V_a, V_b). \tag{8}$$
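Equations 5 to 8 can be sketched in a few lines once the motion features have been converted to sequences of cluster numbers. The names `lcss_length` and `lcss_kernel` are assumptions for illustration; the rolling-row formulation is only a memory-saving variant of the recursion in Equation 7.

```python
def lcss_length(a, b):
    """LCSS length of two label sequences, keeping one table row at a time."""
    prev = [0] * (len(b) + 1)
    for item in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if item == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcss_kernel(labels_a, labels_b):
    """LCSS length normalized by the shorter sequence length (Equations 5 and 8)."""
    if not labels_a or not labels_b:
        raise ValueError("both sequences must be non-empty")
    return lcss_length(labels_a, labels_b) / min(len(labels_a), len(labels_b))


# Hypothetical cluster-number sequences of two segments with different lengths.
k_ab = lcss_kernel([0, 1, 1, 2, 3], [1, 2, 3])
```

Because of the normalization by the shorter length, a short segment whose labels all appear in order inside a longer segment receives the maximum similarity of one.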

The above kernel function can be used for time series having various time lengths. However, our LCSS kernel, like several other known kernel functions, is not guaranteed to be positive semi-definite, and therefore it does not strictly satisfy Mercer's theorem [15]. Fortunately, kernel functions that do not satisfy Mercer's theorem have been verified to be effective for classification of sequential data in [18].

Furthermore, several methods using kernel functions that do not satisfy the theorem have been proposed in [22, 23]. Also, the sigmoid kernel is commonly used and is well known as a kernel function that does not satisfy Mercer's theorem. We therefore briefly discuss implications and problems that might emerge when using a kernel function that does not satisfy the theorem. In order to satisfy Mercer's theorem, the Gram matrix, whose elements correspond to values of a kernel function, is required to be a symmetric positive semi-definite matrix. Our defined kernel function, like other kernel functions that do not satisfy Mercer's theorem, has Gram matrices that are symmetric but not necessarily positive semi-definite. As a remedy for such kernel functions, several methods modify the eigenvalues of the Gram matrices so that they are greater than or equal to zero. It should be noted that we used our defined kernel functions directly in the proposed method.
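To make the discussion concrete, the following NumPy sketch checks whether a symmetric Gram matrix is positive semi-definite and shows the kind of eigenvalue modification mentioned above. It is an illustration only: the proposed method uses its kernels directly, and the function names `min_eigenvalue` and `clip_to_psd` as well as the example matrix are assumptions.

```python
import numpy as np


def min_eigenvalue(gram):
    """Smallest eigenvalue of the symmetrized Gram matrix."""
    gram = np.asarray(gram, dtype=float)
    sym = (gram + gram.T) / 2.0
    return float(np.linalg.eigvalsh(sym)[0])  # eigvalsh sorts ascending


def clip_to_psd(gram):
    """Replace negative eigenvalues by zero and rebuild the matrix."""
    gram = np.asarray(gram, dtype=float)
    sym = (gram + gram.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(sym)
    clipped = np.clip(eigvals, 0.0, None)
    return (eigvecs * clipped) @ eigvecs.T


# Hypothetical similarity matrix: symmetric with unit diagonal,
# but not positive semi-definite.
K = np.array([[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.9],
              [0.0, 0.9, 1.0]])
K_psd = clip_to_psd(K)
```

A negative value returned by `min_eigenvalue` indicates that the kernel values cannot be written as inner products in any feature space, which is the situation the LCSS kernel may produce.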

3.1.2 Spectrum intersection kernel

Next, we explain the spectrum intersection kernel for human motions. In order to define the spectrum intersection kernel for human motions, we first calculate p-spectrum-based features. The p-spectrum [16] of a sequence is the set of all (contiguous) subsequences of length $p$ that it contains. The p-spectrum-based features on string $X$ are indexed by all possible subsequences $X_s$ of length $p$ and defined as follows:
$$r_p(X) = \left(r_{X_s}(X)\right)_{X_s \in A^p}, \tag{9}$$
where
$$r_{X_s}(X) = \text{number of times } X_s \text{ occurs in } X, \tag{10}$$
and $A$ is the set of characters in strings. For human motion features, we cannot apply the p-spectrum directly since human motion features are defined as sequences of vectors. Therefore, we apply the p-spectrum to sequences of cluster numbers of optical flows, as is done for the LCSS kernel. We use the histogram intersection kernel [17] for constructing the spectrum intersection kernel. The histogram intersection kernel $\kappa_{HI}(\cdot, \cdot)$ is a useful kernel function for classification of histogram-shaped features and is defined as follows:
$$\kappa_{HI}(h_a, h_b) = \sum_{i_h = 1}^{N_h} \min\{h_a(i_h), h_b(i_h)\}, \tag{11}$$
where $h_a$ and $h_b$ are histogram-shaped features, $h_a(i_h)$ and $h_b(i_h)$ are the $i_h$th element (bin) values of $h_a$ and $h_b$, respectively, and $N_h$ is the number of bins of the histogram-shaped features. Furthermore, $\sum_{i_h = 1}^{N_h} h_a(i_h) = 1$ and $\sum_{i_h = 1}^{N_h} h_b(i_h) = 1$ are required to apply the histogram intersection kernel to $h_a$ and $h_b$. The p-spectrum-based features also have histogram shapes, and the histogram intersection kernel can be applied to them. Note that the sums of their elements have to be normalized to one in the same way as for histogram-shaped features. After that, we define this kernel function as the spectrum intersection kernel for human motions $\kappa_V^{SI}(\cdot, \cdot)$ as follows:
$$\kappa_V^{SI}(V_a, V_b) = \kappa_{HI}(r_p(V_a), r_p(V_b)). \tag{12}$$

The above kernel function can consider statistical characteristics of human motion features. Since the histogram intersection kernel is positive semi-definite [17], the spectrum intersection kernel can satisfy Mercer's theorem [15]. Note that the above kernel function is equivalent to the spectrum kernel defined in [16] if we use the simple inner product of p-spectrum-based features instead of the histogram intersection in Equation 12.
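Equations 9 to 12 can be sketched for sequences of cluster numbers as follows. The function names `p_spectrum` and `spectrum_intersection_kernel` and the default `p=2` are assumptions for illustration; the article does not state these names or a default value here. Only subsequences that actually occur are stored, since all other entries of $r_p$ are zero and do not contribute to the intersection.

```python
from collections import Counter


def p_spectrum(labels, p):
    """Normalized counts of all contiguous length-p subsequences (Equations 9, 10)."""
    if p < 1:
        raise ValueError("p must be a positive integer")
    grams = [tuple(labels[i:i + p]) for i in range(len(labels) - p + 1)]
    if not grams:
        raise ValueError("sequence is shorter than p")
    counts = Counter(grams)
    # Normalize so that the feature sums to one, as Equation 11 requires.
    return {gram: count / len(grams) for gram, count in counts.items()}


def spectrum_intersection_kernel(labels_a, labels_b, p=2):
    """Histogram intersection of two normalized p-spectra (Equations 11, 12)."""
    r_a = p_spectrum(labels_a, p)
    r_b = p_spectrum(labels_b, p)
    return sum(min(value, r_b[gram]) for gram, value in r_a.items() if gram in r_b)


# Hypothetical cluster-number sequences of different lengths.
k_ab = spectrum_intersection_kernel([0, 1, 0, 1, 2], [0, 1, 2, 2], p=2)
```

In this toy case the two sequences share the subsequences (0, 1) and (1, 2), so the kernel value is the sum of the smaller normalized counts of those two entries.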

3.2 Kernel function for music pieces

3.2.1 LCSS kernel

The kernel functions for music pieces are defined in the same way as those of human motions. First, we show the definition of the LCSS kernel for music pieces. For the following explanations, we denote two music features as $M_a = [m_a(1), m_a(2), \ldots, m_a(N_{M_a})]$ and $M_b = [m_b(1), m_b(2), \ldots, m_b(N_{M_b})]$, where $M_a$ and $M_b$ are chromagrams [24] extracted from segments, $m_a(l_a)$ ($l_a = 1, 2, \ldots, N_{M_a}$) and $m_b(l_b)$ ($l_b = 1, 2, \ldots, N_{M_b}$) are components of $M_a$ and $M_b$, and $N_{M_a}$ and $N_{M_b}$ are the numbers of components of $M_a$ and $M_b$, respectively. In addition, $m_a(l_a)$ and $m_b(l_b)$ are chroma vectors [20] that have 12 dimensions. Since $N_{M_a}$ and $N_{M_b}$ depend on the time lengths of their segments, the similarity between music features is also defined on the basis of the LCSS algorithm. Note that it is desirable that the similarity between an original music piece and its modulated version becomes high since they have similar melodies, bass lines, or harmonics. Therefore, we define similarity considering the modulation of music. In the proposed method, we use temporal sequences of chroma vectors, i.e., chromagrams defined in [24], as music features. One of the advantages of the use of 12-dimensional chroma vectors in the chromagrams is that the transposition amount of modulation can be naturally represented only by the amount $\zeta$ by which its 12 elements are shifted (rotated). Therefore, the proposed method effectively uses the above characteristic for measuring similarities between chromagrams. For the following explanation, we define the modulated chromagram $M_b^{\zeta} = [m_b^{\zeta}(1), m_b^{\zeta}(2), \ldots, m_b^{\zeta}(N_{M_b})]$. Note that $m_b^{\zeta}(l_b)$ ($l_b = 1, 2, \ldots, N_{M_b}$) represents a modulated chroma vector whose elements are shifted by amount $\zeta$.

The similarity between $M_a$ and $M_b$ is defined as follows:
$$\mathrm{Sim}_M(M_a, M_b) = \frac{\max_{\zeta} \mathrm{LCSS}(M_a, M_b^{\zeta})}{\min(N_{M_a}, N_{M_b})}, \tag{13}$$
where $\mathrm{LCSS}(M_a, M_b^{\zeta})$ is recursively defined as
$$\mathrm{LCSS}(M_a, M_b^{\zeta}) = R_{M_a M_b^{\zeta}}(l_a, l_b)\Big|_{l_a = N_{M_a},\ l_b = N_{M_b}}, \tag{14}$$
$$R_{M_a M_b^{\zeta}}(l_a, l_b) = \begin{cases} 0 & \text{if } l_a = 0 \text{ or } l_b = 0, \\ 1 + R_{M_a M_b^{\zeta}}(l_a - 1, l_b - 1) & \text{if } \mathrm{Sim}_{\tau}\{m_a(l_a), m_b^{\zeta}(l_b)\} > Th, \\ \max\{R_{M_a M_b^{\zeta}}(l_a - 1, l_b),\ R_{M_a M_b^{\zeta}}(l_a, l_b - 1)\} & \text{otherwise}, \end{cases} \tag{15}$$
$$\mathrm{Sim}_{\tau}\{m_a(l_a), m_b^{\zeta}(l_b)\} = 1 - \frac{\left|\tilde{m}_a(l_a) - \tilde{m}_b^{\zeta}(l_b)\right|}{\sqrt{12}}, \tag{16}$$
$$\tilde{m}_a(l_a) = \frac{m_a(l_a)}{\max_{\tau} m_{a,\tau}(l_a)}, \tag{17}$$
$$\tilde{m}_b^{\zeta}(l_b) = \frac{m_b^{\zeta}(l_b)}{\max_{\tau} m_{b,\tau}^{\zeta}(l_b)}, \tag{18}$$
where $Th$ (= 0.8) is a positive constant for determining the fitness between two different chroma vectors, $\mathrm{Sim}_{\tau}\{\cdot, \cdot\}$ is a similarity between chroma vectors defined in [20], $\tilde{m}_a(l_a)$ and $\tilde{m}_b^{\zeta}(l_b)$ are normalized chroma vectors, $m_{a,\tau}(l_a)$ and $m_{b,\tau}^{\zeta}(l_b)$ are elements of the chroma vectors, and $\tau$ corresponds to tone, i.e., "C", "D#", "G#", etc. Note that the effectiveness of $\mathrm{Sim}_{\tau}\{\cdot, \cdot\}$ is verified in [20]. We then define this similarity as the LCSS kernel for music pieces $\kappa_M^{LCSS}(\cdot, \cdot)$ described as follows:
$$\kappa_M^{LCSS}(M_a, M_b) = \mathrm{Sim}_M(M_a, M_b). \tag{19}$$
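The modulation-aware LCSS kernel of Equations 13 to 19 can be sketched as follows. Several points are assumptions rather than statements of the article: the distance between normalized chroma vectors is taken to be the Euclidean norm divided by $\sqrt{12}$, the rotation direction in `rotate` is arbitrary, all 12 shifts are searched exhaustively, and all function names are hypothetical.

```python
import math

TH = 0.8  # threshold Th given in the text


def rotate(chroma, zeta):
    """Circularly shift the 12 chroma elements by zeta positions."""
    zeta %= 12
    return chroma[-zeta:] + chroma[:-zeta]


def normalize(chroma):
    """Divide by the largest element (Equations 17 and 18)."""
    peak = max(chroma)
    return [c / peak for c in chroma] if peak > 0 else list(chroma)


def chroma_similarity(m_a, m_b):
    """Similarity between two chroma vectors (Equation 16, Euclidean norm assumed)."""
    n_a, n_b = normalize(m_a), normalize(m_b)
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(n_a, n_b)))
    return 1.0 - dist / math.sqrt(12)


def lcss_music(M_a, M_b, th=TH):
    """LCSS length where two chroma vectors match if their similarity exceeds th."""
    prev = [0] * (len(M_b) + 1)
    for m_a in M_a:
        curr = [0]
        for j, m_b in enumerate(M_b, start=1):
            if chroma_similarity(m_a, m_b) > th:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcss_kernel_music(M_a, M_b, th=TH):
    """Maximum over all 12 shifts of the normalized LCSS length (Equations 13, 19)."""
    best = max(lcss_music(M_a, [rotate(m, z) for m in M_b], th) for z in range(12))
    return best / min(len(M_a), len(M_b))


def one_hot(index):
    """Toy chroma vector with all energy in a single pitch class."""
    vec = [0.0] * 12
    vec[index] = 1.0
    return vec


# A toy three-note pattern and the same pattern transposed by two semitones.
M_a = [one_hot(0), one_hot(4), one_hot(7)]
M_b = [one_hot(2), one_hot(6), one_hot(9)]
k_ab = lcss_kernel_music(M_a, M_b)
```

Without the search over shifts, the two toy chromagrams share no matching chroma vectors, whereas the kernel of Equation 13 assigns them the maximum similarity because one is a transposition of the other.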

3.2.2 Spectrum intersection kernel

Next, we explain the spectrum intersection kernel for music pieces. In order to define the spectrum intersection kernel for music pieces, we firstly calculate p-spectrum-based features in the same way as those of human motions. It should be noted that the proposed method cannot calculate the p-spectrum from music features directly since the music features are defined as sequences of vectors. Therefore, we transform all of the vector components of music features into characters, such as alphabetic letters or numbers, based on hierarchical clustering algorithms, where the characters correspond to cluster numbers. For clustering the vector components, the modulation of music should also be considered in the same way as the LCSS kernel for music pieces. Therefore, clustering considering modulation is necessary. The procedures of this scheme are shown as follows.

Step 1: Calculation of optimal modulation amounts between music features
First, the proposed method calculates the optimal modulation amount $\zeta_{ab}$ between two music features $M_a$ and $M_b$. This scheme is based on the LCSS-based similarity and is defined as follows:
$$\zeta_{ab} = \arg\max_{\zeta} \frac{\mathrm{LCSS}(M_a, M_b^{\zeta})}{\min(N_{M_a}, N_{M_b})}. \tag{20}$$

The optimal modulation amount $\zeta_{ab}$ is calculated for all pairs.

Step 2: Similarity measurement between chroma vectors using the obtained optimal modulation amounts
Similarity between vector components, i.e., between chroma vectors, is calculated using the obtained optimal modulation amounts. For example, the similarity between chroma vectors $m_a(l_a)$ and $m_b(l_b)$, which are the $l_a$th and $l_b$th components of two arbitrary music features $M_a$ and $M_b$, respectively, is calculated using the obtained optimal modulation amount $\zeta_{ab}$ and Equation 16 as follows:
$$\mathrm{Sim}_c\{m_a(l_a), m_b(l_b)\} = 1 - \frac{\left|\tilde{m}_a(l_a) - \tilde{m}_b^{\zeta_{ab}}(l_b)\right|}{\sqrt{12}}. \tag{21}$$

The above similarity is calculated between every pair of different chroma vectors across all music features.

Step 3: Clustering chroma vectors based on the obtained similarities

Using the obtained similarities, the two most similar chroma vectors are assigned to the same cluster for clustering chroma vectors. This scheme is based on the single linkage method [25]. The merging scheme is recursively performed until the number of clusters becomes less than K M .
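Step 3 can be sketched as single-linkage agglomerative clustering on a precomputed similarity matrix, such as one filled with the values of Equation 21. The function name `single_linkage` and the stopping parameter `n_clusters` are assumptions; in particular, how `n_clusters` maps onto the constant $K_M$ in the text is left open here.

```python
def single_linkage(sim, n_clusters):
    """Merge the most similar items first until n_clusters clusters remain.

    sim is a symmetric matrix (list of lists) of pairwise similarities.
    Returns one cluster number per item, numbered by first appearance.
    """
    n = len(sim)
    labels = list(range(n))  # every item starts in its own cluster
    pairs = sorted(
        ((sim[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda t: -t[0],
    )
    count = n
    for _, i, j in pairs:
        if count <= n_clusters:
            break
        li, lj = labels[i], labels[j]
        if li != lj:
            # Single linkage: the most similar cross-cluster pair
            # decides which two clusters are merged next.
            labels = [li if lab == lj else lab for lab in labels]
            count -= 1
    remap = {}
    return [remap.setdefault(lab, len(remap)) for lab in labels]


# Hypothetical similarities between four chroma vectors.
sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.3, 0.1],
       [0.1, 0.3, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
cluster_numbers = single_linkage(sim, n_clusters=2)
```

Processing pairs in order of decreasing similarity and merging only pairs that lie in different clusters is equivalent to repeatedly merging the two clusters whose closest members are most similar.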

Using the clustering results, the proposed method calculates transformed music features $m_j^{*} = [m_j^{*}(1), m_j^{*}(2), \ldots, m_j^{*}(N_{M_j})]'$ ($j = 1, 2, \ldots, N$), where $m_j^{*}(l_M)$ ($l_M = 1, 2, \ldots, N_{M_j}$) is a cluster number assigned to a corresponding chroma vector. Note that vector/matrix transpose is denoted by the superscript ' in this article. The proposed method then calculates p-spectrum-based features from $m_j^{*}$. For the following explanations, we denote two transformed music features as $m_a^{*} = [m_a^{*}(1), m_a^{*}(2), \ldots, m_a^{*}(N_{M_a})]'$ and $m_b^{*} = [m_b^{*}(1), m_b^{*}(2), \ldots, m_b^{*}(N_{M_b})]'$, where $m_a^{*}$ and $m_b^{*}$ are vectors transformed from $M_a$ and $M_b$, respectively, and $m_a^{*}(l_a)$ ($l_a = 1, 2, \ldots, N_{M_a}$) and $m_b^{*}(l_b)$ ($l_b = 1, 2, \ldots, N_{M_b}$) are the cluster numbers assigned to $m_a(l_a)$ and $m_b(l_b)$, respectively. Then, the spectrum intersection kernel for music pieces is calculated in the same way as that for human motions and is defined as follows:
$$\kappa_M^{SI}(M_a, M_b) = \kappa_{HI}(r_p(m_a^{*}), r_p(m_b^{*})). \tag{22}$$

4 Kernel CCA-based music recommendation according to human motion

A method for recommending music pieces suitable for human motions is presented in this section. An overview of the proposed method is shown in Figure 2. In our cross-media recommendation method, pairs of human motions and music pieces that have a close relationship are necessary for effective correlation calculation. Therefore, we prepare these pairs extracted from the same video contents as segments. From the obtained segments, we extract human motion features and music features. More details of these features are given in Appendices A.1 and A.2. By applying kernel CCA to the features of human motions and music pieces, the proposed method calculates their correlation. In this approach, we define new kernel functions that can be used for data having various time lengths and introduce them into the kernel CCA.
Figure 2

Overview of the proposed method. The left and right parts in this figure represent the correlation calculation phase and the recommendation phase, respectively, in the proposed method.

Therefore, the proposed method can calculate the correlations by considering their sequential characteristics. Then, effective modeling of the relationship using human motions and music pieces having various time lengths is realized, and successful music recommendation can be expected.

First, we define the features of $V_j$ and $M_j$ ($j = 1, 2, \ldots, N$) in the Hilbert space as $\phi_V(\mathrm{vec}[V_j])$ and $\phi_M(\mathrm{vec}[M_j])$, where $\mathrm{vec}[\cdot]$ is the vectorization operator that turns a matrix into a vector. Next, we find features
$$s_j = A'\left(\phi_V(\mathrm{vec}[V_j]) - \bar{\phi}_V\right), \tag{23}$$
$$t_j = B'\left(\phi_M(\mathrm{vec}[M_j]) - \bar{\phi}_M\right), \tag{24}$$
$$A = [a_1, a_2, \ldots, a_D], \tag{25}$$
$$B = [b_1, b_2, \ldots, b_D], \tag{26}$$

where $\bar{\phi}_V$ and $\bar{\phi}_M$ are mean vectors of $\phi_V(\mathrm{vec}[V_j])$ and $\phi_M(\mathrm{vec}[M_j])$ ($j = 1, 2, \ldots, N$), respectively. The matrices $A$ and $B$ are coefficient matrices whose columns $a_d$ and $b_d$ ($d = 1, 2, \ldots, D$), respectively, correspond to the projection directions in Equations 2 and 3, where the value $D$ is the dimension of $A$ and $B$. Then, we define a correlation matrix $\Lambda$ whose diagonal elements are the correlation coefficients $\lambda_d$ ($d = 1, 2, \ldots, D$). The details of the calculation of $A$, $B$, and $\Lambda$ are shown as follows.

In order to obtain $A$, $B$, and $\Lambda$, we use the regularized kernel CCA shown in the previous section. Note that the optimal matrices $A$ and $B$ are given by
$$A = \Xi_V H E_V, \tag{27}$$
$$B = \Xi_M H E_M, \tag{28}$$
$$\Xi_V = [\phi_V(\mathrm{vec}[V_1]), \phi_V(\mathrm{vec}[V_2]), \ldots, \phi_V(\mathrm{vec}[V_N])], \tag{29}$$
$$\Xi_M = [\phi_M(\mathrm{vec}[M_1]), \phi_M(\mathrm{vec}[M_2]), \ldots, \phi_M(\mathrm{vec}[M_N])], \tag{30}$$
where $E_V = [e_V^1, e_V^2, \ldots, e_V^D]$ and $E_M = [e_M^1, e_M^2, \ldots, e_M^D]$ are $N \times D$ matrices. Furthermore,
$$H = I - \frac{1}{N} \mathbf{1}\mathbf{1}' \tag{31}$$
is a centering matrix, where $I$ is the $N \times N$ identity matrix, and $\mathbf{1} = [1, \ldots, 1]'$ is an $N \times 1$ vector. From Equations 27 and 28, the following equations are satisfied:
$$a_d = \Xi_V H e_V^d, \tag{32}$$
$$b_d = \Xi_M H e_M^d. \tag{33}$$
Then, by calculating the optimal solution $e_V^d$ and $e_M^d$ ($d = 1, 2, \ldots, D$), $A$ and $B$ are obtained. In the same way as Equation 4, we calculate the optimal solution $e_V^d$ and $e_M^d$ that maximizes
$$L = e_V' \mathbf{L} e_M - \frac{\lambda}{2}\left(e_V' \mathbf{M} e_V - 1\right) - \frac{\lambda}{2}\left(e_M' \mathbf{P} e_M - 1\right), \tag{34}$$
where $e_V$, $e_M$, and $\lambda$ correspond to $e_V^d$, $e_M^d$, and $\lambda_d$, respectively. In the above equation, the matrices $\mathbf{L}$, $\mathbf{M}$, and $\mathbf{P}$ are calculated as follows:
$$\mathbf{L} = \frac{1}{N} H K_V H H K_M H, \tag{35}$$
$$\mathbf{M} = \frac{1}{N} H K_V H H K_V H + \eta_1 H K_V H, \tag{36}$$
$$\mathbf{P} = \frac{1}{N} H K_M H H K_M H + \eta_2 H K_M H. \tag{37}$$
Furthermore, $\eta_1$ and $\eta_2$ are regularization parameters, and $K_V\ (= \Xi_V' \Xi_V)$ and $K_M\ (= \Xi_M' \Xi_M)$ are matrices whose elements are defined as values of the corresponding kernel functions defined in Section 3. By taking derivatives of Equation 34 with respect to $e_V$ and $e_M$, optimal $e_V$, $e_M$, and $\lambda$ can be obtained as solutions of the following eigenvalue problems:
$$\mathbf{M}^{-1} \mathbf{L} \mathbf{P}^{-1} \mathbf{L}' e_V = \lambda^2 e_V, \tag{38}$$
$$\mathbf{P}^{-1} \mathbf{L}' \mathbf{M}^{-1} \mathbf{L} e_M = \lambda^2 e_M, \tag{39}$$

where $\lambda$ is obtained as an eigenvalue, and the vectors $e_V$ and $e_M$ are, respectively, obtained as eigenvectors. Then, the $d$th ($d = 1, 2, \ldots, D$) eigenvalue of $\lambda$ becomes $\lambda_d$, where $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_D$. Note that the dimension $D$ is set to a value for which the cumulative proportion obtained from $\lambda_d$ ($d = 1, 2, \ldots, D$) becomes larger than a threshold. Furthermore, the eigenvectors $e_V$ and $e_M$ corresponding to $\lambda_d$ become $e_V^d$ and $e_M^d$, respectively.
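The computation in Equations 31 and 35 to 39 can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `regularized_kernel_cca`, the default parameter values, the fixed number of dimensions `n_dims` (instead of the cumulative-proportion rule), and the linear-kernel toy data are all hypothetical. In addition, a small ridge `eps` is added before inversion because the centered matrices are singular; this ridge is not part of Equations 36 and 37.

```python
import numpy as np


def regularized_kernel_cca(K_v, K_m, eta1=0.1, eta2=0.1, n_dims=2, eps=1e-6):
    """Return E_V, E_M (N x n_dims) and the correlations lambda_d."""
    n = K_v.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix, Equation 31
    Kv = H @ K_v @ H
    Km = H @ K_m @ H
    L = Kv @ Km / n                           # Equation 35
    M = Kv @ Kv / n + eta1 * Kv               # Equation 36
    P = Km @ Km / n + eta2 * Km               # Equation 37
    # Assumption: ridge term so that the linear systems below are solvable.
    M_r = M + eps * np.eye(n)
    P_r = P + eps * np.eye(n)

    # Equation 38: M^{-1} L P^{-1} L' e_V = lambda^2 e_V.
    target = np.linalg.solve(M_r, L @ np.linalg.solve(P_r, L.T))
    eigvals, eigvecs = np.linalg.eig(target)
    order = np.argsort(-eigvals.real)[:n_dims]
    lam = np.sqrt(np.clip(eigvals.real[order], 0.0, None))

    # Scale each e_V so that e_V' M e_V = 1 (constraint in Equation 34).
    E_v = eigvecs.real[:, order]
    E_v = E_v / np.sqrt(np.einsum("id,ij,jd->d", E_v, M_r, E_v))
    # Stationarity of Equation 34 gives L' e_V = lambda P e_M.
    # n_dims should not exceed the number of non-zero correlations.
    E_m = np.linalg.solve(P_r, L.T @ E_v) / lam
    return E_v, E_m, lam


# Toy data with linear kernels standing in for the kernels of Section 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Y = X @ rng.normal(size=(3, 3)) + 0.1 * rng.normal(size=(20, 3))
E_v, E_m, lam = regularized_kernel_cca(X @ X.T, Y @ Y.T)
```

With positive semi-definite kernels, the regularization terms keep every $\lambda_d$ at or below one; with kernels that do not satisfy Mercer's theorem, such as the LCSS kernel, this bound is not guaranteed by the sketch.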

From the obtained matrices $\mathbf{A}$, $\mathbf{B}$, and $\boldsymbol{\Lambda}$, we can estimate the optimal music features from given human motion features, i.e., we can select the best matched music pieces according to human motions. An overview of music recommendation is shown in Figure 3. When a human motion feature $V_{in}$ is given, we select a predetermined number of music pieces that minimize the following distance from the query human motion:
$$d = \left\| \mathbf{t}_{in} - \hat{\mathbf{t}}_i \right\|^2 \quad (i = 1, 2, \ldots, M_t), \tag{40}$$
where $\mathbf{t}_{in}$ and $\hat{\mathbf{t}}_i$ are, respectively, the query human motion feature and the music features in the database $\hat{M}_i$ ($i = 1, 2, \ldots, M_t$) transformed into the same feature space as follows:
$$\hat{\mathbf{t}}_i = \mathbf{B}' \left( \phi_M(\mathrm{vec}[\hat{M}_i]) - \bar{\phi}_M \right) = \mathbf{E}_M' \left( \boldsymbol{\kappa}_{\hat{M}_i} - \frac{1}{N} \mathbf{K}_M \mathbf{1} \right), \tag{41}$$
$$\mathbf{t}_{in} = \boldsymbol{\Lambda} \mathbf{A}' \left( \phi_V(\mathrm{vec}[V_{in}]) - \bar{\phi}_V \right) = \boldsymbol{\Lambda} \mathbf{E}_V' \left( \boldsymbol{\kappa}_{V_{in}} - \frac{1}{N} \mathbf{K}_V \mathbf{1} \right), \tag{42}$$

and $M_t$ is the number of music pieces in the database. Note that $\boldsymbol{\kappa}_{V_{in}}$ is an $N \times 1$ vector whose $q$th element is $\kappa_V^{LCSS}(V_{in}, V_q)$ or $\kappa_V^{SI}(V_{in}, V_q)$, and $\boldsymbol{\kappa}_{\hat{M}_i}$ is an $N \times 1$ vector whose $q$th element is $\kappa_M^{LCSS}(\hat{M}_i, M_q)$ or $\kappa_M^{SI}(\hat{M}_i, M_q)$.

Figure 3

Overview of music recommendation according to human motion.
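The projection and ranking in Equations 40 to 42 can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the kernel vectors are assumed to be precomputed with one of the kernels of Section 3, `lam` holds the diagonal of Λ, and the function name `recommend` and the argument layout (one row of `kappa_M_db` per database music piece) are hypothetical.

```python
import numpy as np


def recommend(kappa_V_in, kappa_M_db, K_V, K_M, E_V, E_M, lam, top_k=3):
    """Rank database music pieces for one query motion (Equations 40-42).

    kappa_V_in : (N,) kernel values between the query motion and the
                 N training motions.
    kappa_M_db : (M_t, N) kernel values between each database music piece
                 and the N training music pieces.
    lam        : (D,) diagonal of Lambda.
    Returns the indices of the top_k closest pieces and all distances.
    """
    N = K_V.shape[0]
    one = np.ones(N)
    # Equation 42: project the centered query kernel vector, scale by Lambda.
    t_in = lam * (E_V.T @ (kappa_V_in - K_V @ one / N))
    # Equation 41: project every centered music kernel vector (rows).
    T_db = (kappa_M_db - K_M @ one / N) @ E_M
    # Equation 40: squared Euclidean distance in the shared space.
    d = np.sum((T_db - t_in) ** 2, axis=1)
    return np.argsort(d)[:top_k], d
```

Returning all distances in addition to the top-ranked indices makes it easy to vary the number of recommended pieces afterwards without recomputing the projections.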

As described above, we can estimate the best matched music pieces according to the human motions. The proposed method calculates the correlation between human motions and music pieces based on the kernel CCA. Then, the proposed method introduces the kernel functions that can be used for time series having various time lengths based on the LCSS or p-spectrum. Therefore, the proposed method enables calculation of the correlation between human motions and music pieces that have various time lengths. Furthermore, effective correlation calculation and successful music recommendation according to human motion based on the obtained correlation are realized.

5 Experimental results

The performance of the proposed method is verified in this section. In the experiments, we used video contents of three classic ballet programs, from which 170 segments were manually extracted: 44 from Nutcracker, 54 from Swan Lake, and 72 from Sleeping Beauty. Each segment consisted of only one human motion, and neither the background music nor the camera changed within a segment. The audio signal in each segment was mono, 16 bits per sample, and sampled at 44.1 kHz. Human motion features and music features were extracted from the obtained segments.

For evaluation of the performance of our method, we used videos of classic ballet programs. Motions extracted from classic ballet programs differ somewhat from those observed in daily life. Moreover, in cross-media recommendation, we have to consider whether or not we should recommend contents that have the same meanings as those of the queries. For example, when we recommend music pieces from a user's information, recommending sad music pieces is not always suitable even if the user seems to be sad. Our approach also has to consider this point. In this article, however, we focus on extracting the relationship between human motions and music pieces and perform the recommendation based on the extracted relationship. In addition, ground truths have to be prepared for evaluating the proposed method. Therefore, we used videos of classic ballet programs, since the human motions and music pieces extracted from the same videos of classic ballet programs had strong and direct relationships.

In order to evaluate the performance of our method, we prepared five datasets, #1 to #5, each of which was a pair of 100 segments for training (training segments) and 70 segments for testing (testing segments), i.e., a simple cross-validation scheme. For each dataset, the 170 segments prepared above were randomly divided into training segments and testing segments. Using several datasets allowed us to perform the verification with different combinations of training and testing segments; the number of datasets (five) was simply determined. For the experiments, 12 kinds of tags representing expression marks in music, shown in Table 1, were used. To obtain these 12 tags, we examined whether each candidate tag could be used for labeling both human motions and music pieces, and removed tags that seemed difficult to use for these two media types. One suitable tag was then manually selected and annotated to each segment for performance verification. In the experiments, one person with musical experience annotated the label that best matched each segment. Generally, annotation should be performed by several people. However, since the labels, i.e., expression marks in music, were used in the experiment, it was necessary to have the ground truths made by a person who had knowledge of music. Thus, in the experiment, only one person annotated the labels.
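The random division described above can be sketched in a few lines. This is only an illustrative sketch assuming that each dataset is an independent random split of the 170 segment indices into 100 training and 70 testing indices; the function name `make_datasets` and the fixed seed are hypothetical and are used here only to make the example reproducible.

```python
import random


def make_datasets(n_segments=170, n_train=100, n_datasets=5, seed=0):
    """Return a list of (training indices, testing indices) pairs."""
    rng = random.Random(seed)            # hypothetical seed for reproducibility
    datasets = []
    for _ in range(n_datasets):
        indices = list(range(n_segments))
        rng.shuffle(indices)             # random division of the segments
        train = sorted(indices[:n_train])
        test = sorted(indices[n_train:])
        datasets.append((train, test))
    return datasets


datasets = make_datasets()
```

Each pair returned by `make_datasets()` covers all 170 segments exactly once, so a segment is never used for both training and testing within the same dataset.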
Table 1

Description of expression marks

| Name | Definition |
| --- | --- |
| agitato | Agitated |
| amabile | Amiable, pleasant |
| appassionato | Passionately |
| capriccioso | Unpredictable, volatile |
| grazioso | Gracefully |
| lamentoso | Lamenting, mournfully |
| leggiero | Lightly, delicately |
| maestoso | Majestically |
| pesante | Heavy, ponderous |
| soave | Softly |
| spiritoso | Spiritedly |
| tranquillo | Calmly, peacefully |

First, we show the recommended results (see Additional file 1). In this file, we show original video contents and recommended video contents. The background music pieces of recommended video contents are not original but are music pieces recommended by our method. These results show that our method can recommend a suitable music piece for a human motion.

Additional file 1: Recommended results. Additional file 1.mov; Description of data: This video content shows our recommendation results. In this video content, original video contents and recommended results, whose background music pieces are those recommended by our method, are shown. (MOV 6 MB)

Next, we quantitatively verify the performance of the proposed method. In this simulation, we verify the effectiveness of our kernel functions. In the proposed method, we define two types of kernel functions, LCSS kernel and spectrum intersection kernel, for human motions and music pieces. Thus, we experimentally compare our two newly defined kernel functions. Using combinations of the kernel functions, we prepared four simulations "Simulation 1"-"Simulation 4", as follows:

  • Simulation 1 used the LCSS kernel for both human motions and music pieces.

  • Simulation 2 used the spectrum intersection kernel for both human motions and music pieces.

  • Simulation 3 used the spectrum intersection kernel for human motions and the LCSS kernel for music pieces.

  • Simulation 4 used the LCSS kernel for human motions and the spectrum intersection kernel for music pieces.

These simulations were performed to verify the effectiveness of our two newly defined kernel functions for human motions and music pieces. For the following explanations, we denote the LCSS kernel as "LCSS-K" and the spectrum intersection kernel as "SI-K". In addition, for the experiments, we used the following criterion:
$$\text{Accuracy Score} = \frac{\sum_{i_1 = 1}^{70} Q^1_{i_1}}{70}, \tag{43}$$

where the denominator corresponds to the number of testing segments. Furthermore, $Q^1_{i_1}$ ($i_1 = 1, 2, \ldots, 70$) is one if the tags of the three recommended music pieces include the tag of the human motion query; otherwise, $Q^1_{i_1}$ is zero. It should be noted that the number of recommended music pieces (three) was simply determined.

We next explain how the number of recommended music pieces affects the performance of our method. For the following explanation, we define the terms "over-recommendation" and "mis-recommendation". Over-recommendation means that the recommended results tend to contain music pieces that are not matched to the target human motions as well as matched music pieces, and mis-recommendation means that music pieces that are matched to the target human motions tend not to be correctly selected as the recommendation results. There is a tradeoff between over-recommendation and mis-recommendation. That is, if we increase the number of recommended results, over-recommendation increases and mis-recommendation decreases. On the other hand, if we decrease the number of recommended results, over-recommendation decreases and mis-recommendation increases.

We evaluated the recommendation accuracy according to the above criterion. Figure 4 shows that the accuracy score of Simulation 1 was higher than the accuracy scores of the other simulations. This is because the LCSS kernel can effectively compare human motions and music pieces having different time lengths. Note that in these simulations, we used bigrams ($p = 2$) for calculating the p-spectrum-based features shown in Equation 9, the number of clusters for chroma vectors was set to $K_M = 500$, and the other parameters in our method are shown in Tables 2, 3, 4 and 5. All of these parameters were empirically determined and set to the values that provided the highest accuracy. More details of parameter determination are given in Appendix B.
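The criterion of Equation 43 and the effect of the number of recommended pieces can be illustrated with a short sketch. The function name `accuracy_score`, the argument `k`, and the toy tags below are hypothetical; the sketch only assumes that, for every query, the tags of the recommended music pieces are available in ranked order.

```python
def accuracy_score(query_tags, ranked_music_tags, k=3):
    """Fraction of queries whose tag appears among the top-k recommended tags.

    query_tags        : list with one tag per query human motion.
    ranked_music_tags : list with, per query, the tags of the recommended
                        music pieces in ranked order.
    """
    hits = sum(
        1 for tag, ranked in zip(query_tags, ranked_music_tags)
        if tag in ranked[:k]             # Q = 1 if the query tag is included
    )
    return hits / len(query_tags)


# Toy illustration with made-up rankings (not experimental data).
queries = ["agitato", "soave"]
ranked = [["soave", "agitato", "pesante"],
          ["pesante", "grazioso", "soave"]]
scores = [accuracy_score(queries, ranked, k) for k in (1, 2, 3)]
```

In this toy case the score can only grow with `k`, which mirrors the tradeoff described above: recommending more pieces reduces mis-recommendation at the cost of more over-recommendation.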
Figure 4

Accuracy scores in each simulation. #1 to #5 are dataset numbers and "AVERAGE" is the average value of the accuracy scores for the datasets.

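Table 2

Description of parameters used in Simulation 1

| Dataset | $\eta_1$ | $\eta_2$ | $K_c$ |
| --- | --- | --- | --- |
| #1 | $1.0 \times 10^{-14}$ | $8.0 \times 10^{-3}$ | 1300 |
| #2 | $6.0 \times 10^{-3}$ | $6.0 \times 10^{-7}$ | 1000 |
| #3 | $6.0 \times 10^{-13}$ | $8.0 \times 10^{-3}$ | 1200 |
| #4 | $2.0 \times 10^{-3}$ | $8.0 \times 10^{-13}$ | 1000 |
| #5 | $6.0 \times 10^{-11}$ | $8.0 \times 10^{-3}$ | 1200 |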

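Table 3

Description of parameters used in Simulation 2

| Dataset | $\eta_1$ | $\eta_2$ | $K_c$ |
| --- | --- | --- | --- |
| #1 | $8.0 \times 10^{-13}$ | $8.0 \times 10^{-3}$ | 1500 |
| #2 | $4.0 \times 10^{-6}$ | $6.0 \times 10^{-11}$ | 1000 |
| #3 | $2.0 \times 10^{-11}$ | $8.0 \times 10^{-13}$ | 1000 |
| #4 | $4.0 \times 10^{-13}$ | $8.0 \times 10^{-13}$ | 1300 |
| #5 | $1.0 \times 10^{-16}$ | $8.0 \times 10^{-3}$ | 1500 |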

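Table 4

Description of parameters used in Simulation 3

| Dataset | $\eta_1$ | $\eta_2$ | $K_c$ |
| --- | --- | --- | --- |
| #1 | $8.0 \times 10^{-3}$ | $6.0 \times 10^{-11}$ | 1000 |
| #2 | $4.0 \times 10^{-3}$ | $8.0 \times 10^{-7}$ | 1200 |
| #3 | $1.0 \times 10^{-14}$ | $8.0 \times 10^{-13}$ | 1000 |
| #4 | $6.0 \times 10^{-7}$ | $1.0 \times 10^{-2}$ | 1300 |
| #5 | $1.0 \times 10^{-6}$ | $8.0 \times 10^{-3}$ | 1000 |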

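Table 5

Description of parameters used in Simulation 4

| Dataset | $\eta_1$ | $\eta_2$ | $K_c$ |
| --- | --- | --- | --- |
| #1 | $4.0 \times 10^{-6}$ | $8.0 \times 10^{-13}$ | 1000 |
| #2 | $2.0 \times 10^{-3}$ | $8.0 \times 10^{-13}$ | 1000 |
| #3 | $1.0 \times 10^{-13}$ | $8.0 \times 10^{-13}$ | 1200 |
| #4 | $8.0 \times 10^{-7}$ | $8.0 \times 10^{-3}$ | 1000 |
| #5 | $1.0 \times 10^{-6}$ | $6.0 \times 10^{-11}$ | 1300 |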

In the following, we discuss the results obtained. First, we discuss the influence of our human motion features. The features used in our method are based on optical flow and are extracted between the regions containing a human in two successive frames. These features can represent movements of arms, legs, hands, etc. However, they cannot represent global human movements, which are an important factor for representing the motion characteristics of classic ballet. For accurate extraction of the relationship between human motions and music pieces, it is necessary to improve the human motion features so that they can also represent global human movement. This could be complemented using information obtained by much more accurate sensors such as Kinect (see endnote d).

Next, we discuss the experimental conditions. In the experiments with the proposed method, we used tags, i.e., expression marks in music, as ground truths, and one tag was annotated to each segment. However, this annotation scheme does not consider the relationship between tags. For example, in Table 1, "agitato" and "appassionato" have similar meanings. Thus, the choice of the 12 kinds of tags might not be suitable, and it might be necessary to reconsider the choice of tags. We also found that it is important to introduce the relationship between tags into our accuracy criterion. However, it is difficult to quantify the relationship between them. Thus, we used only one tag for each segment. This point is also supported by the results of the subjective evaluation in the next experiment.

We also used comparative methods to verify the performance of the proposed method. For the comparative methods, we replaced our kernel functions with the Gaussian kernel $\kappa_{G\text{-}K}(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\right)$ (G-K), the sigmoid kernel $\kappa_{S\text{-}K}(\mathbf{x}, \mathbf{y}) = \tanh(\alpha \mathbf{x}'\mathbf{y} + \beta)$ (S-K), and the linear kernel $\kappa_{L\text{-}K}(\mathbf{x}, \mathbf{y}) = \mathbf{x}'\mathbf{y}$ (L-K). In this experiment, we set the parameters to $\sigma = 5.0$, $\alpha = 5.0$, and $\beta = 3.0$. It should be noted that these kernel functions cannot be applied to our human motion features and music features directly since the features have various dimensions. Therefore, we simply used the time average of the optical flow-based vectors, $\mathbf{v}_j^{avg}$, for human motion features and the time average of the chroma vectors, $\mathbf{m}_j^{avg}$, for music features. Then, we applied the above three types of kernel functions to the obtained features. Figure 5 shows the results of the comparison for each kernel function. These results show that our kernel functions are more effective than the other kernel functions. The results also show that it is important to consider the temporal characteristics of the data, and that our kernel functions can successfully consider these characteristics. Note that in this comparison, we used the parameters that provided the highest accuracy. The parameters are shown in Tables 6, 7 and 8.
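The three comparative kernels and the time averaging that makes them applicable to sequences of different lengths can be sketched as follows. The kernel formulas and the parameter values σ = 5.0, α = 5.0, and β = 3.0 are taken from the text above; the function names and the helper `gram_matrix` are hypothetical names introduced only for this illustration.

```python
import numpy as np


def gaussian_kernel(x, y, sigma=5.0):
    """G-K: exp(-||x - y||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.sum(diff ** 2) / (2.0 * sigma ** 2))


def sigmoid_kernel(x, y, alpha=5.0, beta=3.0):
    """S-K: tanh(alpha x'y + beta)."""
    return np.tanh(alpha * np.dot(x, y) + beta)


def linear_kernel(x, y):
    """L-K: x'y."""
    return np.dot(x, y)


def time_average(sequence):
    """Average a (length, dimension) sequence over time into one vector."""
    return np.asarray(sequence, dtype=float).mean(axis=0)


def gram_matrix(sequences, kernel):
    """Gram matrix of a kernel applied to the time-averaged sequences."""
    feats = [time_average(s) for s in sequences]
    return np.array([[kernel(a, b) for b in feats] for a in feats])
```

Because `time_average` collapses every sequence to a single vector, sequences of different lengths become comparable, but any information about temporal order is lost, which is consistent with the comparison reported in Figure 5.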
Figure 5

Accuracy comparison in each kernel. #1 to #5 are dataset numbers and "AVERAGE" is the average value of the accuracy scores for the datasets.

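Table 6

Description of parameters used in Gaussian kernel

| Dataset | $\eta_1$ | $\eta_2$ |
| --- | --- | --- |
| #1 | $8.0 \times 10^{-13}$ | $8.0 \times 10^{-3}$ |
| #2 | $4.0 \times 10^{-7}$ | $8.0 \times 10^{-13}$ |
| #3 | $8.0 \times 10^{-7}$ | $8.0 \times 10^{-13}$ |
| #4 | $6.0 \times 10^{-13}$ | $2.0 \times 10^{-7}$ |
| #5 | $8.0 \times 10^{-7}$ | $8.0 \times 10^{-13}$ |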

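Table 7

Description of parameters used in sigmoid kernel

| Dataset | $\eta_1$ | $\eta_2$ |
| --- | --- | --- |
| #1 | $8.0 \times 10^{-7}$ | $8.0 \times 10^{-3}$ |
| #2 | $6.0 \times 10^{-3}$ | $1.0 \times 10^{-2}$ |
| #3 | $1.0 \times 10^{-6}$ | $2.0 \times 10^{-7}$ |
| #4 | $4.0 \times 10^{-3}$ | $1.0 \times 10^{-14}$ |
| #5 | $1.0 \times 10^{-6}$ | $4.0 \times 10^{-11}$ |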

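Table 8

Description of parameters used in linear kernel

| Dataset | $\eta_1$ | $\eta_2$ |
| --- | --- | --- |
| #1 | $4.0 \times 10^{-11}$ | $2.0 \times 10^{-7}$ |
| #2 | $1.0 \times 10^{-16}$ | $1.0 \times 10^{-16}$ |
| #3 | $8.0 \times 10^{-11}$ | $2.0 \times 10^{-3}$ |
| #4 | $1.0 \times 10^{-10}$ | $8.0 \times 10^{-13}$ |
| #5 | $1.0 \times 10^{-14}$ | $8.0 \times 10^{-13}$ |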

Finally, we show the results of a subjective evaluation of our recommendation method. We performed the subjective evaluation with 15 subjects (User1-User15). Table 9 shows the profiles of the subjects. In the evaluation, we used video contents that consisted of video sequences and music pieces. In the video contents, each video sequence included one human motion, and each music piece was a result recommended by the proposed method according to the human motion. The tasks of the subjective evaluation were as follows:

  1. Subjects watched each video content, whose video sequence was a target classic ballet scene and whose music was recommended by the proposed method.
  2. Subjects determined whether or not the target classic ballet scene and the recommended music piece were matched. Specifically, they answered yes or no.
  3. Procedures 1 and 2 were repeated for 210 video contents.

Table 9

Profiles of the subjects

| Profile | Value |
| --- | --- |
| Number of subjects (male/female) | 15 (14/1) |
| Nationality (number) | Australia (1), Syria (1), China (3), Japan (10) |
| Age (years) | 22-30 |

In the subjective evaluation, we used the recommended results obtained by Simulation 1 in the above-described experiment. We also used datasets #1 and #2 for the subjective evaluation. In the evaluation, we showed the top three recommended results for each query human motion (query segment). Then, 70 query segments were examined and 210 recommended results were obtained for each dataset.

Furthermore, we used two criteria, "Accuracy Score 2" and "Accuracy Score 3", for verifying the performance. Accuracy Score 2 is defined as follows:
$$\text{Accuracy Score 2} = \frac{\sum_{i_2 = 1}^{70} Q^2_{i_2}}{70}, \tag{44}$$
where the denominator corresponds to the number of query segments. $Q^2_{i_2}$ ($i_2 = 1, 2, \ldots, 70$) is one if the subject determined that at least one of the three recommended music pieces matched the query human motion. Otherwise, $Q^2_{i_2}$ is zero. In addition, Accuracy Score 3 is the ratio of positive assessments over the 210 music pieces and is defined as follows:
$$\text{Accuracy Score 3} = \frac{\sum_{i_3 = 1}^{210} Q^3_{i_3}}{210}, \tag{45}$$
where the denominator corresponds to the number of recommended music pieces. Furthermore, $Q^3_{i_3}$ ($i_3 = 1, 2, \ldots, 210$) is one if the subject determined that the query human motion and the music piece matched. Otherwise, $Q^3_{i_3}$ is zero. Table 10 shows the results of each score in the subjective evaluation. From the results, both scores show higher recommendation accuracy than that of the quantitative evaluation. Therefore, the results of the subjective evaluation confirmed the effectiveness of our method.
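The two subjective criteria of Equations 44 and 45 can be computed from the yes/no answers of one subject as in the following sketch. The data layout (one row of three Boolean answers per query segment), the function name, and the toy answers are hypothetical and serve only as an illustration.

```python
def accuracy_scores_2_and_3(answers):
    """Compute Accuracy Score 2 and Accuracy Score 3 for one subject.

    answers : list of rows, one per query segment; each row holds the
              yes (True) / no (False) answers for the recommended pieces.
    """
    n_queries = len(answers)
    n_pieces = sum(len(row) for row in answers)
    # Score 2: a query counts if at least one recommended piece matched.
    score2 = sum(1 for row in answers if any(row)) / n_queries
    # Score 3: every recommended piece is assessed individually.
    score3 = sum(1 for row in answers for a in row if a) / n_pieces
    return score2, score3


# Toy answers for four queries with three recommendations each.
toy_answers = [[True, False, False],
               [False, False, False],
               [True, True, True],
               [False, True, False]]
score2, score3 = accuracy_scores_2_and_3(toy_answers)
```

By construction, Accuracy Score 2 can never be smaller than Accuracy Score 3 when every query has the same number of recommendations, which agrees with the pattern visible in Table 10.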
Table 10

Accuracy of subjective evaluation of each user in Dataset #1 and Dataset #2

| User | Accuracy Score 2 (#1) | Accuracy Score 2 (#2) | Accuracy Score 3 (#1) | Accuracy Score 3 (#2) |
| --- | --- | --- | --- | --- |
| User1 | 0.91 | 0.93 | 0.53 | 0.60 |
| User2 | 0.99 | 0.97 | 0.71 | 0.79 |
| User3 | 1.00 | 0.97 | 0.65 | 0.47 |
| User4 | 0.96 | 0.80 | 0.40 | 0.36 |
| User5 | 0.67 | 0.51 | 0.31 | 0.19 |
| User6 | 0.93 | 0.93 | 0.38 | 0.33 |
| User7 | 0.97 | 0.96 | 0.55 | 0.47 |
| User8 | 0.99 | 0.99 | 0.51 | 0.60 |
| User9 | 0.56 | 0.66 | 0.23 | 0.29 |
| User10 | 0.99 | 0.99 | 0.46 | 0.50 |
| User11 | 0.91 | 0.91 | 0.50 | 0.50 |
| User12 | 0.93 | 0.97 | 0.45 | 0.43 |
| User13 | 0.90 | 0.97 | 0.45 | 0.43 |
| User14 | 0.94 | 0.99 | 0.54 | 0.63 |
| User15 | 0.90 | 1.00 | 0.50 | 0.50 |
| Average | 0.90 | 0.90 | 0.48 | 0.48 |

6 Conclusions

In this article, we have presented a method for music recommendation according to human motion based on the kernel CCA-based relationship. In the proposed method, we newly defined two types of kernel functions. One is a sequential similarity-based kernel function that uses the LCSS algorithm, and the other is a statistical characteristic-based kernel function that uses the p-spectrum. Using these kernel functions, the proposed method enables calculation of the correlation that can consider their sequential characteristics. Furthermore, based on the obtained correlation, the proposed method enables accurate music recommendation according to human motion.

In the experiments, recommendation accuracy was sensitive to the parameters. It is desirable that these parameters be adaptively determined from the datasets. Thus, we need to develop an algorithm for this determination. Feature selection for the human motions and music pieces is also needed for more accurate extraction of the relationship between human motions and music pieces. These topics will be the subjects of subsequent studies.

Endnotes

a In this article, we simply denote "retrieval and recommendation" as recommendation hereafter.

b In this article, video sequences are defined as data that contain only visual signals, and video contents are defined as data that contain both visual signals and audio signals.

c In this section, we assume that $E[\phi_x(x)] = 0$ and $E[\phi_y(y)] = 0$ for brief explanation, where $E[\cdot]$ denotes the sample average of the random variates.

d http://www.xbox.com/en-US/Kinect.

Appendix A: Feature extraction

In this article, we use human motion features and music features. Here, each feature extraction is explained in detail. Segments are extracted from video contents, i.e., video contents are separated into segments $S_j$ ($j = 1, 2, \ldots, N$). Then, human motion features and music features are extracted from each segment. In this appendix, we explain the methods for extraction of human motion features and music features in A.1 and A.2, respectively.

A.1 Extraction of human motion features

First, the proposed method separates each segment $S_j$ into frames $f_j^k$ ($k = 1, 2, \ldots, N_j$), where $N_j$ is the number of frames in segment $S_j$. Furthermore, a rectangular region including one human is clipped from each frame, and the clipped regions are normalized to the same size. In this article, we assume that this rectangular region has previously been obtained. Deciding the rectangular regions including humans might be difficult. However, there are several methods for extracting human regions from video sequences [26, 27]. These methods achieved accurate human region detection by combining visual information and sensor information such as that from Kinect (see endnote d), by using a stereo camera, or by using a camera whose position is calibrated. Although we extracted the rectangular regions manually for simplicity, we consider that a certain precision can be guaranteed using these methods.

Next, we show the calculation of the optical flow-based vectors. To calculate optical flows from the segments, we first divide the region of frame $f_j^k$ into blocks $B_j^b$ ($b = 1, 2, \ldots, N_{B_j}$), where $N_{B_j}$ (= 1600) is the number of blocks in each frame. Then, based on the Lucas-Kanade algorithm [19], the optical flow in each block $B_j^b$ is calculated between the regions of two successive frames $f_j^k$ and $f_j^{k+1}$ for all segments $S_j$. We thereby obtain optical flow-based vectors $\mathbf{v}_j(k)$ ($k = 1, 2, \ldots, N_{V_j}$) containing the vertical and horizontal optical flow values for all blocks, where $N_{V_j}$ corresponds to $N_j - 1$.

In this article, the human motion feature $V_j$ of segment $S_j$ is obtained as the sequence of the optical flow-based vectors $\mathbf{v}_j(k)$. The features obtained by the above procedure represent the temporal characteristics of human motions.
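The block-wise optical flow extraction can be sketched with a basic least-squares form of the Lucas-Kanade method. This is a simplified illustration, not the implementation used in the experiments: it assumes grayscale regions already clipped and normalized to the same size, a regular grid of 40 x 40 = 1600 blocks, a single (non-iterative, non-pyramidal) Lucas-Kanade step per block, and an interleaved (horizontal, vertical) ordering of the flow values. The function names are hypothetical.

```python
import numpy as np


def block_optical_flow(prev, curr, blocks_per_side=40):
    """One optical flow-based vector between two successive regions.

    prev, curr : 2-D grayscale arrays of identical shape.
    Returns a vector of length 2 * blocks_per_side**2 holding the
    (horizontal, vertical) flow of every block.
    """
    prev = np.asarray(prev, dtype=float)
    curr = np.asarray(curr, dtype=float)
    Iy, Ix = np.gradient(prev)           # spatial gradients (rows, columns)
    It = curr - prev                     # temporal gradient
    bh = prev.shape[0] // blocks_per_side
    bw = prev.shape[1] // blocks_per_side

    flow = []
    for by in range(blocks_per_side):
        for bx in range(blocks_per_side):
            rows = slice(by * bh, (by + 1) * bh)
            cols = slice(bx * bw, (bx + 1) * bw)
            A = np.stack([Ix[rows, cols].ravel(),
                          Iy[rows, cols].ravel()], axis=1)
            b = -It[rows, cols].ravel()
            # Least-squares solution of Ix*u + Iy*v = -It within the block.
            uv, *_ = np.linalg.lstsq(A, b, rcond=None)
            flow.extend(uv)
    return np.array(flow)


def human_motion_feature(frames, blocks_per_side=40):
    """Sequence of optical flow-based vectors for one segment (N_j - 1 rows)."""
    return np.stack([
        block_optical_flow(frames[k], frames[k + 1], blocks_per_side)
        for k in range(len(frames) - 1)
    ])
```

A segment with N_j frames yields N_j - 1 vectors, so the resulting feature has a different number of rows for segments of different lengths, which is why kernels that tolerate different time lengths are required.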

A.2 Extraction of music features

The proposed method uses chromagrams [24]. A chromagram represents the temporal sequence of chroma vectors and is calculated from each segment. Furthermore, the chroma vector represents the magnitude distribution on the chroma that is assigned to 12 pitch classes within an octave, and thus the chroma vector has 12 dimensions. The 12-dimensional chroma vector $\mathbf{m}(t)$ is extracted from the magnitude spectrum $\Psi(f_{Hz}, t)$, which is calculated using the short-time Fourier transform (STFT), where $f_{Hz}$ is the frequency and $t$ is the time in an audio signal. The $\tau$th ($\tau = 1, 2, \ldots, 12$) element of $\mathbf{m}(t)$ corresponds to a pitch class of equal temperament and is defined as follows:
$$m_\tau(t) = \sum_{h = \mathrm{Oct}_L}^{\mathrm{Oct}_H} \int_{-\infty}^{\infty} \mathrm{BPF}_{\tau,h}(f_{Hz}) \, \Psi(f_{Hz}, t) \, df_{Hz},$$
where $\mathrm{Oct}_H$ and $\mathrm{Oct}_L$ represent the highest and lowest octave positions, respectively. Furthermore, $\mathrm{BPF}_{\tau,h}(f_{Hz})$ is a bandpass filter that passes the signal at the log-scale frequency $F_{\tau,h}$ (in cents) of pitch class $\tau$ (chroma) in octave position $h$ (height) as shown in the following equation:
$$F_{\tau,h} = 1200h + 100(\tau - 1).$$

We define a chromagram that represents a temporal sequence of 12-dimensional chroma vectors extracted by the above procedure in segment $S_j$ as the music features $M_j = [\mathbf{m}_j(1), \mathbf{m}_j(2), \ldots, \mathbf{m}_j(N_{M_j})]$, where $N_{M_j}$ is the number of components of $M_j$. Details of the chroma vector and the chromagram are shown in [20].
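The chroma vector computation can be sketched as follows. This is a simplified illustration under several assumptions that are not stated in this appendix: the 0-cent reference frequency is taken as 440 x 2^(3/12 - 5) Hz (about 16.35 Hz), the octave range defaults to Oct_L = 3 and Oct_H = 8, a rectangular band of ±50 cents around F_{τ,h} stands in for the bandpass filter BPF_{τ,h}, and the STFT uses a Hann window with hypothetical frame and hop lengths. The function names are also hypothetical.

```python
import numpy as np

# Assumed 0-cent reference frequency (about 16.35 Hz); not given in the text.
REF_HZ = 440.0 * 2.0 ** (3.0 / 12.0 - 5.0)


def chroma_vector(frame, sample_rate, oct_low=3, oct_high=8):
    """12-dimensional chroma vector of one audio frame."""
    frame = np.asarray(frame, dtype=float)
    magnitude = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    magnitude, freqs = magnitude[1:], freqs[1:]      # drop the DC bin
    cents = 1200.0 * np.log2(freqs / REF_HZ)         # log-scale frequency

    chroma = np.zeros(12)
    for h in range(oct_low, oct_high + 1):
        for tau in range(1, 13):
            center = 1200.0 * h + 100.0 * (tau - 1)  # F_{tau,h} in cents
            # Rectangular band of +/-50 cents used in place of BPF_{tau,h}.
            band = np.abs(cents - center) < 50.0
            chroma[tau - 1] += magnitude[band].sum()
    return chroma


def chromagram(signal, sample_rate, frame_length=4096, hop=2048):
    """Temporal sequence of chroma vectors, one row per frame."""
    signal = np.asarray(signal, dtype=float)
    starts = range(0, len(signal) - frame_length + 1, hop)
    return np.array([
        chroma_vector(signal[s:s + frame_length], sample_rate)
        for s in starts
    ])
```

Under the assumed reference frequency, a 440 Hz tone falls at 5700 cents, i.e., octave position 4 and pitch class τ = 10, so its energy accumulates in the tenth element of the chroma vector.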

Appendix B: Parameter determination

In this appendix, we explain the parameter determination. First, to determine the parameters, we performed experiments to examine the relationship between the accuracy score and the parameters. Figure 6 shows the relationships between the accuracy score and the parameters in Simulation 1. From the obtained results, it can be seen that the kernel CCA-based approach tends to be sensitive to the parameters. It should be noted that the dataset used for the experiments contains quite different types of pairs of human motions and music pieces. For similar pairs of human motions and music pieces, we expect to be able to use fixed parameters and obtain accurate results. Therefore, it appears that stable recommendation accuracy scores can be achieved using parameters that are determined from datasets that have similar characteristics. This means that, for stable recommendation, schemes that perform clustering and classification of the contents become necessary as preprocessing. The other simulations and other databases showed the same sensitivity as the results shown here. For the above reasons, we used the parameters that provided the highest accuracy; thus, the parameters were not determined by cross-validation. However, we recognize that such parameters should be determined by cross-validation. This is our future work.
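The one-parameter-at-a-time examination described above can be sketched as a simple sweep. This is only an illustration: `evaluate` is a hypothetical callable that is assumed to return the accuracy score for a given parameter setting (e.g., by running the recommendation on one dataset), and the function name `sweep_parameter` is likewise hypothetical.

```python
def sweep_parameter(evaluate, base_params, name, candidates):
    """Vary one parameter while keeping the other parameters fixed.

    evaluate    : hypothetical callable, evaluate(**params) -> accuracy score.
    base_params : dict with the fixed values of all parameters.
    name        : name of the parameter to vary.
    candidates  : values to try for that parameter.
    Returns the best value, its score, and all (value, score) pairs.
    """
    results = []
    for value in candidates:
        params = dict(base_params, **{name: value})
        results.append((value, evaluate(**params)))
    best_value, best_score = max(results, key=lambda pair: pair[1])
    return best_value, best_score, results
```

Selecting the best value on the testing segments, as in this sketch, gives an optimistic estimate; wrapping `evaluate` in a cross-validation loop over the training segments would correspond to the future work mentioned above.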
Figure 6

Relationships between $\eta_1$, $\eta_2$, $K_c$, and Accuracy Score in Simulation 1. (a) Relationship between $\eta_1$ and Accuracy Score, (b) Relationship between $\eta_2$ and Accuracy Score, and (c) Relationship between $K_c$ and Accuracy Score. For examining each parameter, the other parameters are fixed. In (b), datasets #3, #4 and #5 have almost the same characteristics.

Abbreviations

CCA: canonical correlation analysis

MMD: multimedia documents

LCSS: longest common subsequence

LCSS-K: LCSS kernel

SI-K: spectrum intersection kernel

G-K: Gaussian kernel

S-K: sigmoid kernel

L-K: linear kernel.

Declarations

Acknowledgements

This study was partly supported by the Grant-in-Aid for Scientific Research (B) 21300030, Japan Society for the Promotion of Science (JSPS).

Authors’ Affiliations

(1)
Graduate School of Information Science and Technology, Hokkaido University

References

  1. Kim I, Lee J, Kwon Y, Par S: Content-based image retrieval method using color and shape features. Proceedings of the 1997 International Conference on Information, Communication and Signal Processing 1997, 948-952.
  2. Zhang R, Zhang Z: Effective image retrieval based on hidden concept discovery in image database. IEEE Trans Image Process 2006, 16(2):562-572.
  3. He X, Ma W, Zhang H: Learning an image manifold for retrieval. Proceedings of the ACM Multimedia Conference 2004.
  4. Guo G, Li S: Content-based audio classification and retrieval by support vector machines. IEEE Trans Neural Networks 2003, 14(1):209-215. doi:10.1109/TNN.2002.806626
  5. Typke R, Wiering F, Veltkamp R: A survey of music information retrieval systems. Proceedings of the ISMIR 2005.
  6. Shen J, Shepherd J, Ngu A: Towards effective content-based music retrieval with multiple acoustic feature combination. IEEE Trans Multimedia 2006, 8: 1179-1189.
  7. Greenspan H, Goldberger J, Mayer A: Probabilistic space-time video modeling via piecewise GMM. IEEE Trans Pattern Anal Mach Intell 2004, 26(3):384-396. doi:10.1109/TPAMI.2004.1262334
  8. Fan J, Elmagarmid A, Zhu X, Aref W, Wu L: ClassView: hierarchical video shot classification, indexing, and accessing. IEEE Trans Multimedia 2004, 6(1):70-86. doi:10.1109/TMM.2003.819583
  9. Li X, Dacheng T, Maybank S: Visual music and musical vision. Neurocomputing 2008, 71: 2023-2028. doi:10.1016/j.neucom.2008.01.025
  10. Fujii A, Itou K, Akiba T, Ishikawa T: A cross-media retrieval system for lecture videos. Proceedings of the Eighth European Conference on Speech Communication and Technology (Eurospeech 2003) 2003, 1149-1152.
  11. Zhuang Y, Yang Y, Wu F: Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval. IEEE Trans Multimedia 2008, 10(2):221-229.
  12. Yang Y, Zhuang Y, Wu F, Pan Y: Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval. IEEE Trans Multimedia 2008, 10(3):437-446.
  13. Akaho S: A kernel method for canonical correlation analysis. International Meeting of Psychometric Society 2001, 1.
  14. Jun S, Han B, Hwang E: A similar music retrieval scheme based on musical mood variation. First Asian Conference on Intelligent Information and Database Systems 2009, 1: 167-172.
  15. Mercer J: Functions of positive and negative type, and their connection with the theory of integral equations. Trans London Philos Soc (A) 1909, 209: 415-446. doi:10.1098/rsta.1909.0016
  16. Leslie C, Eskin E, Noble W: The spectrum kernel: a string kernel for SVM protein classification. Proceedings of the Pacific Biocomputing Symposium 2002, 566-575.
  17. Barla A, Odone F, Verri A: Histogram intersection kernel for image classification. ICIP(3) 2006, 513-516.
  18. Gruber C, Gruber T, Sick B: Online signature verification with new time series kernels for support vector machines. Advances in Biometrics 2005, 3832: 500-508. doi:10.1007/11608288_67
  19. Lucas B, Kanade T: An iterative image registration technique with an application to stereo vision. Proceedings of the DARPA IU Workshop 1984, 121-130.
  20. Goto M: A chorus-section detection method for musical audio signals and its application to a music listening station. IEEE Trans Audio Speech Language Process 2006, 14(5):1783-1794.
  21. MacQueen J: Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Math. Statistics and Probability 1967, 1: 281-297.
  22. Mariethoz J, Bengio S: A kernel trick for sequences applied to text-independent speaker verification systems. Pattern Recognition 2007, 40(8):2315-2324. doi:10.1016/j.patcog.2007.01.011
  23. Camps-Valls G, Martin-Guerrero J, Rojo-Alvarez J, Soria-Olivas E: Fuzzy sigmoid kernel for support vector classifier. Neurocomputing 2004, 62: 501-506.
  24. Wakefield GH: Mathematical representation of joint time-chroma distributions. SPIE 1999.
  25. Xu R, Dunsch W II: Survey of clustering algorithms. IEEE Trans Neural Networks 2005, 16(3):645-678. doi:10.1109/TNN.2005.845141
  26. Navneet D, Bill T, Cordelia S: Human detection using oriented histograms of flow and appearance. Comput Vision ECCV 2006 2006, 3952: 428-441. doi:10.1007/11744047_33
  27. Mikolajczyk K, Schmid C, Zisserman A: Human detection based on a probabilistic assembly of robust part detectors. In Proceedings of the Eighth European Conference on Computer Vision. Volume 1. Prague, Czech Republic; 2004:69-81.

Copyright

© Ohkushi et al; licensee Springer. 2011

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.