Music recommendation according to human motion based on kernel CCA-based relationship

Ohkushi, Hiroyuki; Ogawa, Takahiro; Haseyama, Miki

doi:10.1186/1687-6180-2011-121

Research
Open access
Published: 05 December 2011

EURASIP Journal on Advances in Signal Processing volume 2011, Article number: 121 (2011)

Abstract

In this article, a method for recommendation of music pieces according to human motions based on their kernel canonical correlation analysis (CCA)-based relationship is proposed. In order to perform the recommendation between different types of multimedia data, i.e., recommendation of music pieces from human motions, the proposed method tries to estimate their relationship. Specifically, the correlation based on kernel CCA is calculated as the relationship in our method. Since human motions and music pieces have various time lengths, it is necessary to calculate the correlation between time series having different lengths. Therefore, new kernel functions for human motions and music pieces, which can provide similarities between data that have different time lengths, are introduced into the calculation of the kernel CCA-based correlation. This approach effectively provides a solution to the conventional problem of not being able to calculate the correlation from multimedia data that have various time lengths. Therefore, the proposed method can perform accurate recommendation of best matched music pieces according to a target human motion from the obtained correlation. Experimental results are shown to verify the performance of the proposed method.

1 Introduction

With the popularization of online digital media stores, users can obtain various kinds of multimedia data. Therefore, technologies for retrieving and recommending desired contents are necessary to satisfy the various demands of users. A number of methods for content-based multimedia retrieval and recommendation^a have been proposed. Image recommendation [1–3], music recommendation [4–6], and video recommendation [7, 8] have been intensively studied in several fields. It should be noted that most of these previous works required the query examples and the recommended results to be of the same media type. However, due to the diversification of users' demands, there is a need for a new type of multimedia recommendation in which the media types of the query examples and the returned results can differ. Several recommendation methods [9–12] realizing such schemes have therefore been proposed; they are generally called cross-media recommendation. In the conventional methods of cross-media recommendation, the query examples and recommended results need not be of the same media type. For example, users can search for music pieces by submitting either an image example or a music example.

Among the conventional methods of cross-media recommendation, Li et al. proposed a method for recommendation between images and music pieces by comparing their features directly using a dynamic time warping algorithm [9]. Furthermore, Zhang et al. proposed a method for cross-media recommendation between multimedia documents based on a semantic graph [11, 12]. A multimedia document (MMD) is a collection of co-existing heterogeneous multimedia objects that have the same semantics. For example, an educational web page with instructive text, images and audio is an MMD. By these conventional methods, users can search for their desired contents more flexibly and effectively.

It should be noted that the above conventional methods concentrate on recommendation between different types of multimedia data. Thus, in this scheme, users are forced to provide query multimedia data, although the media types are not restricted. This means that users must make some decisions in order to provide queries, which makes it difficult to reflect their demands. If multimedia data can be recommended from features obtained directly from users, one feasible solution to this limitation is provided. Specifically, we show the following two example applications: (i) background music selection from humans' dance motions for non-edited video contents^b and (ii) presentation of music information from features of target music pieces or dance motions. In the first example, using the relationship obtained between dance motions and music pieces in a database, we can find matched music pieces from human motions in video contents, and vice versa. This should be useful for creating a new dance program with background music and a music promotional video with dance motions. For example, given human motions of a classic ballet program, we can assign music pieces matched to the target human motions; this example is shown in the verification in the experiment section. In the second example, the application can present to users information about the music that they are listening to, i.e., song title, composer, etc. Users can use the sounds of music pieces or their own dance motions associated with the music as the query for obtaining information on the music. As described above, this application can also use the relationship between human motions and music pieces, and it can be a more flexible information presentation system than the conventional ones. In this way, information directly obtained from users, i.e., users' motions, has the potential to provide various benefits. These schemes are cross-media recommendation schemes, and they remove barriers between users and multimedia contents.

In this article, we deal with recommendation of music pieces from features obtained from users. Among the features, human motions have high-level semantics, and their use is effective for realizing accurate recommendation. Therefore, we try to estimate suitable music pieces from human motions. This is because we consider that correlation extraction between human motions and music pieces becomes feasible using some specific video contents such as dance and music promotional videos. This benefit is also useful in performance verification. Then, we assume that the meaning of "suitable" is emotionally similar. Specifically, in our purpose, the recommendation of suitable music pieces according to human motions is that the recommended music pieces are emotionally similar to the query human motions.

In this article, we propose a new method for cross-media recommendation of music pieces according to human motions based on kernel canonical correlation analysis (CCA) [13]. We use video contents in which video sequences and audio signals contain human motions and music pieces, respectively, as training data for calculating their correlation. Then, using the obtained correlation, estimation of the best matched music piece from a target human motion becomes feasible. It should be noted that several methods of cross-media recommendation have previously been proposed. However, there have been no methods focused on handling data that have various time lengths, i.e., human motions and music pieces. Thus, we propose a cross-media recommendation method that can effectively use characteristics of time series, and we assume that this can be realized using kernel CCA and our defined kernel functions. From the above discussion, the main contribution of the proposed method is handling data that have various time lengths for cross-media recommendation.

In this approach, we have to consider the differences in time lengths. In the proposed method, new kernel functions of human motions and music pieces are introduced into the CCA-based correlation calculation. Specifically, we newly adopt two types of kernel functions, which can represent similarities by effectively using human motions or music pieces having various time lengths, for the kernel CCA-based correlation calculation. First, we define a longest common subsequence (LCSS) kernel for using data having different time lengths. Since the LCSS [14] is commonly used for motion comparison, the LCSS kernel should be suitable for our purpose. It should be noted that kernel functions must satisfy Mercer's theorem [15], but our newly defined kernel function does not necessarily satisfy this theorem. Therefore, we also adopt another type of kernel function, spectrum intersection kernel, that satisfies Mercer's theorem. This function introduces the p-spectrum [16] and is based on the histogram intersection kernel [17]. Since the histogram intersection kernel is known as a function that satisfies Mercer's theorem, the spectrum intersection kernel also satisfies this theorem.

Actually, there have been kernel functions that do not satisfy Mercer's theorem, and there have also been several proposed methods that use such kernel functions. The effectiveness of the above-described methods has also been verified. Thus, we should also verify the effectiveness of our defined kernel function, which does not satisfy Mercer's theorem, i.e., the LCSS kernel. In addition, we should also compare our two newly defined kernel functions experimentally. Therefore, in this article, we introduce two types of kernel functions. Using these two types of kernel functions, the proposed method can directly compare multimedia data that have various time lengths, and this is the main advantage of our method. Thus, the use of these kernel functions effectively provides a solution to the problem of not being able to simply apply sequential data such as human motions and music pieces to cross-media recommendation. Consequently, effective modeling of the relationship using music and human motion data that have various time lengths is realized, and successful music recommendation can be expected.

This article is organized as follows. First, in Section 2, we briefly explain the kernel CCA used for calculating the correlation between human motions and music pieces. Next, in Section 3, we describe our two newly defined kernel functions. Kernel CCA-based music recommendation according to human motion is proposed in Section 4. Experimental results that verify the performance of the proposed method are shown in Section 5. Finally, conclusions are given in Section 6.

2 Kernel canonical correlation analysis

In this section, we explain kernel CCA. First, two variables x and y are transformed into Hilbert spaces H_x and H_y via non-linear maps ϕ_x and ϕ_y. From the mapped results ϕ_x(x) ∈ H_x and ϕ_y(y) ∈ H_y,^c the kernel CCA seeks to maximize the correlation

ρ = \frac{E [u v]}{\sqrt{E [u^{2}] E [v^{2}]}}

(1)

between

u = ⟨a, ϕ_{x} (x)⟩

(2)

and

v = ⟨b, ϕ_{y} (y)⟩

(3)

over the projection directions a and b. This means that kernel CCA finds the directions a and b that maximize the correlation $E [u v]$ of corresponding projections subject to $E [u^{2}] = 1$ and $E [v^{2}] = 1$ .

The optimal directions a and b can be found by solving the Lagrangian

L = E [u v] - \frac{λ_{1}}{2} (E [u^{2}] - 1) - \frac{λ_{2}}{2} (E [v^{2}] - 1) + \frac{η}{2} (\| a \|^{2} + \| b \|^{2}),

(4)

where η is a regularization parameter. The above computation scheme is called regularized kernel CCA [13]. By taking the derivatives of Equation 4 with respect to a and b, λ₁ = λ₂ (= λ) is derived, and the directions a and b maximizing the correlation ρ (= λ) can be calculated.

3 Kernel function construction

Construction of new kernel functions is described in this section. The proposed method constructs two types of kernel functions for human motions and music pieces, respectively. First, we introduce an LCSS kernel as a kernel function that does not satisfy Mercer's theorem. This function is based on the LCSS algorithm [18], which is commonly used for motion or temporal music signal comparison since the LCSS algorithm can compare two temporal signals even if they have different time lengths. Therefore, it seems that this kernel function is suitable for our recommendation scheme. On the other hand, we also introduce a spectrum intersection kernel that satisfies Mercer's theorem. This function is based on the p-spectrum [16], which is generally used for text comparison. The p-spectrum uses the continuity of words. This property is also useful for analyzing the structure of temporal sequential data, i.e., human motions. Thus, the spectrum intersection kernel is also suitable for our recommendation scheme.

For the following explanation, we prepare pairs of human motions and music pieces extracted from the same video contents and denote each pair as a segment. The segments are defined as short terms of video contents that have various time lengths. From the obtained segments, we extract human motion features and music features of the jth (j = 1, 2,..., N) segment as $V_{j} = [v_{j} (1), v_{j} (2), \dots, v_{j} (N_{v_{j}})]$ and $M_{j} = [m_{j} (1), m_{j} (2), \dots, m_{j} (N_{M_{j}})]$, where $N_{v_{j}}$ and $N_{M_{j}}$ are the numbers of components of V_j and M_j, respectively, and N is the number of segments. In V_j and M_j, $v_{j} (l_{v}) (l_{v} = 1, 2, \dots, N_{v_{j}})$ and $m_{j} (l_{m}) (l_{m} = 1, 2, \dots, N_{M_{j}})$ correspond to optical flows [19] and chroma vectors [20], respectively. The optical flow is a simple and representative feature that represents motion characteristics between two successive frames in video sequences and is commonly used for motion comparison. Thus, we adopt the optical flow as temporal components of human motion features. Furthermore, the chroma vector represents tone distribution of music signals at each time. The chroma vector can represent the characteristics of a music signal robustly if it is extracted in a short time. In addition, due to the simplicity of the implementation, we adopted these features in our method. More details of these features are given in Appendices A.1 and A.2.

3.1 Kernel function for human motions

3.1.1 LCSS kernel

In order to define kernel functions for human motions having various time lengths, we first explain the LCSS kernel for human motions, which uses the LCSS-based similarity in [14]. The LCSS algorithm calculates the longest common part of two sequences and its length (the LCSS length).

Figure 1 shows an example of a table produced by LCSS length of two sequences X = 〈B, D, C, A, B〉 and Y = 〈A, B, C, B, A, B〉. In this figure, the highlighted components represent the common components in two different sequences and LCSS length between X and Y becomes four.
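The table-based computation described above can be sketched in a few lines of Python. This is a minimal illustration of the standard dynamic-programming recursion for the LCSS length, not the authors' implementation; the function name `lcss_length` is chosen here for illustration only.

```python
def lcss_length(x, y):
    """Length of the longest common subsequence of two sequences."""
    n, m = len(x), len(y)
    # table[i][j] holds the LCSS length of the prefixes x[:i] and y[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # matching components extend the common subsequence by one
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # otherwise keep the better result of dropping one component
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


# The two sequences used in the Figure 1 example
X = ["B", "D", "C", "A", "B"]
Y = ["A", "B", "C", "B", "A", "B"]
print(lcss_length(X, Y))  # 4
```

One common subsequence of length four here is 〈B, C, A, B〉.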

Here, we show the definition of similarity between human motion features. For the following explanations, we denote two human motion features as $V_{a} = [v_{a} (1), v_{a} (2), \dots, v_{a} (N_{v_{a}})]$ and $V_{b} = [v_{b} (1), v_{b} (2), \dots, v_{b} (N_{v_{b}})]$, where $v_{a} (l_{a}) (l_{a} = 1, 2, \dots, N_{v_{a}})$ and $v_{b} (l_{b}) (l_{b} = 1, 2, \dots, N_{v_{b}})$ are components of V_a and V_b, respectively, and $N_{v_{a}}$ and $N_{v_{b}}$ are the numbers of components in V_a and V_b, respectively. In addition, v_a(l_a) and v_b(l_b) correspond to optical flows extracted in each frame in each video sequence. Note that $N_{v_{a}}$ and $N_{v_{b}}$ depend on the time lengths of their segments; that is, they depend on the number of frames of their video sequences. The similarity between V_a and V_b is defined as follows:

Sim_{V} (V_{a}, V_{b}) = \frac{LCSS (V_{a}, V_{b})}{\min (N_{v_{a}}, N_{v_{b}})},

(5)

where LCSS(V_a, V_b) is the LCSS length of V_a and V_b, and it is recursively defined as

LCSS (V_{a}, V_{b}) = R_{V_{a} V_{b}} (l_{a}, l_{b}) |_{l_{a} = N_{v_{a}}, l_{b} = N_{v_{b}}},

(6)

R_{V_{a} V_{b}} (l_{a}, l_{b}) = \begin{cases} 0 & \text{if } l_{a} = 0 \text{ or } l_{b} = 0, \\ 1 + R_{V_{a} V_{b}} (l_{a} - 1, l_{b} - 1) & \text{if } c (v_{a} (l_{a})) = c (v_{b} (l_{b})), \\ \max \{R_{V_{a} V_{b}} (l_{a} - 1, l_{b}), R_{V_{a} V_{b}} (l_{a}, l_{b} - 1)\} & \text{otherwise,} \end{cases}

(7)

where c(·) is the cluster number of an optical flow. In the proposed method, we apply the k-means algorithm [21] to all optical flows obtained from all segments, and the cluster number c(·) assigned to each optical flow is used so that two different optical flows can be compared easily. Other kinds of quantization or labeling of the temporal variation of the time series could also serve this purpose; in the proposed method, we adopt k-means clustering for its simplicity.

We then define this similarity measure as the LCSS kernel for human motions $κ_{V}^{LCSS} (\cdot, \cdot)$ as follows:

κ_{V}^{LCSS} (V_{a}, V_{b}) = Sim_{V} (V_{a}, V_{b}) .

(8)
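As a rough illustration of Equations 5 to 8, the following Python sketch evaluates the LCSS kernel on sequences of cluster labels and assembles the corresponding Gram matrix. It assumes that the optical flows have already been quantized to integer cluster numbers (the k-means step is not shown), the label sequences are made up for the example, and returning 0 for an empty sequence is an assumption that the text does not address.

```python
def lcss_length(a, b):
    """LCSS length via the recursion of Equation 7, keeping one table row."""
    prev = [0] * (len(b) + 1)
    for item_a in a:
        curr = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(b)]


def lcss_kernel(labels_a, labels_b):
    """LCSS length normalised by the shorter sequence length (Equations 5, 8)."""
    shorter = min(len(labels_a), len(labels_b))
    if shorter == 0:
        return 0.0  # assumption: an empty sequence has similarity 0
    return lcss_length(labels_a, labels_b) / shorter


def gram_matrix(sequences):
    """Matrix of kernel values between all pairs of sequences."""
    n = len(sequences)
    return [[lcss_kernel(sequences[i], sequences[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical cluster-label sequences of different lengths
motions = [[0, 1, 1, 2, 3], [1, 1, 2], [3, 0, 0, 2, 1, 1, 2]]
K = gram_matrix(motions)
for row in K:
    print(row)
```

Because the normalization uses the shorter length, a short sequence that is fully contained in a longer one (the first two sequences above) receives the maximum similarity of 1.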

The above kernel function can be used for time series having various time lengths. However, the LCSS kernel, like several other known kernel functions, is not positive semi-definite and therefore does not strictly satisfy Mercer's theorem [15]. Fortunately, kernel functions that do not satisfy Mercer's theorem have been verified to be effective for classification of sequential data in [18].

Furthermore, several methods using kernel functions that do not satisfy the theorem have been proposed in [22, 23]. Also, the sigmoid kernel has been commonly used and is well known as a kernel function that does not satisfy Mercer's theorem. We therefore briefly discuss implications and problems that might emerge when using a kernel function that does not satisfy the theorem. In order to satisfy Mercer's theorem, a Gram matrix whose elements correspond to values of a kernel function is required to be symmetric and positive semi-definite. Our defined kernel function and other kernel functions that do not satisfy Mercer's theorem have Gram matrices that are symmetric but not positive semi-definite. Thus, as a solution for such kernel functions, several methods have modified the eigenvalues of the Gram matrices to be greater than or equal to zero. It should be noted that we used our defined kernel functions directly in the proposed method.
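The eigenvalue modification mentioned above can be illustrated with a short NumPy sketch. It shows one common variant, clipping negative eigenvalues to zero, and is included only to make the idea concrete: the proposed method uses its kernel functions directly and does not apply this step. The matrix below is a made-up symmetric similarity matrix, not data from the article, and the function names are illustrative.

```python
import numpy as np


def min_eigenvalue(gram):
    """Smallest eigenvalue of the symmetrised Gram matrix."""
    sym = (gram + gram.T) / 2.0
    return float(np.linalg.eigvalsh(sym).min())


def clip_to_psd(gram):
    """Replace negative eigenvalues by zero and rebuild the matrix."""
    sym = (gram + gram.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(sym)
    clipped = np.clip(eigvals, 0.0, None)
    # V diag(clipped) V^T
    return (eigvecs * clipped) @ eigvecs.T


# Hypothetical symmetric similarity matrix that is not positive semi-definite
K = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.9],
              [0.1, 0.9, 1.0]])

print(min_eigenvalue(K) < 0)          # True: K violates Mercer's condition
K_psd = clip_to_psd(K)
print(min_eigenvalue(K_psd) > -1e-10)  # True: clipped matrix is PSD up to rounding
```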

3.1.2 Spectrum intersection kernel

Next, we explain the spectrum intersection kernel for human motions. In order to define it, we first calculate p-spectrum-based features. The p-spectrum [16] of a sequence is the set of all contiguous subsequences of length p that it contains. The p-spectrum-based features of a string $X$ are indexed by all possible subsequences $X_{s}$ of length p and defined as follows:

r_{p} (X) = {(r_{X_{s}} (X))}_{X_{s} \in A^{p}},

(9)

where

r_{X_{s}} (X) = \text{number of times } X_{s} \text{ occurs in } X,

(10)

and $A$ is the set of characters in strings. For human motion features, we cannot apply the p-spectrum directly since human motion features are defined as sequences of vectors. Therefore, we apply the p-spectrum to sequences of cluster numbers of optical flows, as is done for the LCSS kernel. We use the histogram intersection kernel [17] for constructing the spectrum intersection kernel. The histogram intersection kernel κ^HI(·, ·) is a useful kernel function for classification of histogram-shaped features and is defined as follows:

κ^{HI} (h_{a}, h_{b}) = \sum_{i_{h} = 1}^{N^{h}} \min \{h_{a} (i_{h}), h_{b} (i_{h})\},

(11)

where h_a and h_b are histogram-shaped features, h_a(i_h) and h_b(i_h) are the i_h-th element (bin) values of h_a and h_b, respectively, and N^h is the number of bins of the histogram-shaped features. Furthermore, $\sum_{i_{h} = 1}^{N^{h}} h_{a} (i_{h}) = 1$ and $\sum_{i_{h} = 1}^{N^{h}} h_{b} (i_{h}) = 1$ are required to apply the histogram intersection kernel to h_a and h_b. The p-spectrum-based features also have histogram shapes, so the histogram intersection kernel can be applied to them. Note that the sums of their elements have to be normalized in the same way as for histogram-shaped features. We then define this kernel function as the spectrum intersection kernel for human motions $κ_{V}^{SI} (\cdot, \cdot)$ as follows:

κ_{V}^{SI} (V_{a}, V_{b}) = κ^{HI} (r_{p} (V_{a}), r_{p} (V_{b})) .

(12)

The above kernel function can take statistical characteristics of human motion features into account. Since the histogram intersection kernel is positive semi-definite [17], the spectrum intersection kernel satisfies Mercer's theorem [15]. Note that the above kernel function is equivalent to the spectrum kernel defined in [16] if we use the simple inner product of p-spectrum-based features instead of the histogram intersection in Equation 12.
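A minimal Python sketch of Equations 9 to 12 is given below, assuming the sequences have already been converted to cluster numbers. The function names and the label sequences are illustrative only, and the default p = 2 is an arbitrary choice for the example rather than a value taken from the article.

```python
from collections import Counter


def p_spectrum(labels, p):
    """Normalised counts of all contiguous length-p subsequences."""
    grams = [tuple(labels[i:i + p]) for i in range(len(labels) - p + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    # divide by the total so that the histogram sums to one
    return {g: c / total for g, c in counts.items()} if total else {}


def spectrum_intersection_kernel(labels_a, labels_b, p=2):
    """Histogram intersection of the two normalised p-spectra."""
    hist_a = p_spectrum(labels_a, p)
    hist_b = p_spectrum(labels_b, p)
    # bins missing from either histogram contribute min(., 0) = 0
    return sum(min(v, hist_b[g]) for g, v in hist_a.items() if g in hist_b)


# Hypothetical cluster-label sequences of different lengths
seq_a = [0, 1, 1, 2]
seq_b = [1, 1, 2, 1, 1]
print(spectrum_intersection_kernel(seq_a, seq_b))
print(spectrum_intersection_kernel(seq_a, seq_a))
```

With p = 2, `seq_a` has the subsequences (0, 1), (1, 1), (1, 2) with weight 1/3 each, and `seq_b` has (1, 1) with weight 1/2 and (1, 2), (2, 1) with weight 1/4 each, so the kernel value is 1/3 + 1/4. Storing the histograms as dictionaries avoids enumerating all possible subsequences in $A^{p}$.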

3.2 Kernel function for music pieces

3.2.1 LCSS kernel

The kernel functions for music pieces are defined in the same way as those for human motions. First, we show the definition of the LCSS kernel for music pieces. For the following explanations, we denote two music features as $M_{a} = [m_{a} (1), m_{a} (2), \dots, m_{a} (N_{M_{a}})]$ and $M_{b} = [m_{b} (1), m_{b} (2), \dots, m_{b} (N_{M_{b}})]$, where M_a and M_b are chromagrams [24] extracted from segments, $m_{a} (l_{a}) (l_{a} = 1, 2, \dots, N_{M_{a}})$ and $m_{b} (l_{b}) (l_{b} = 1, 2, \dots, N_{M_{b}})$ are components of M_a and M_b, and $N_{M_{a}}$ and $N_{M_{b}}$ are the numbers of components of M_a and M_b, respectively. In addition, m_a(l_a) and m_b(l_b) are chroma vectors [20] that have 12 dimensions. Since $N_{M_{a}}$ and $N_{M_{b}}$ depend on the time lengths of their segments, the similarity between music features is also defined on the basis of the LCSS algorithm. Note that it is desirable that the similarity between an original music piece and its modulated version be high since they have similar melodies, bass lines, or harmonics. Therefore, we define a similarity that takes the modulation of music into account. In the proposed method, we use temporal sequences of chroma vectors, i.e., chromagrams defined in [24], as music features. One of the advantages of using 12-dimensional chroma vectors in the chromagrams is that the transposition amount of a modulation can be naturally represented by the amount ζ by which the 12 elements are shifted (rotated). The proposed method uses this characteristic for measuring similarities between chromagrams. For the following explanation, we define the modulated chromagram $M_{b}^{ζ} = [m_{b}^{ζ} (1), m_{b}^{ζ} (2), \dots, m_{b}^{ζ} (N_{M_{b}})]$. Note that $m_{b}^{ζ} (l_{b}) (l_{b} = 1, 2, \dots, N_{M_{b}})$ represents a modulated chroma vector whose elements are shifted by the amount ζ.

The similarity between M_a and M_b is defined as follows:

Sim_{M} (M_{a}, M_{b}) = \max_{ζ} \left\{\frac{LCSS (M_{a}, M_{b}^{ζ})}{\min (N_{M_{a}}, N_{M_{b}})}\right\},

(13)

where $LCSS (M_{a}, M_{b}^{ζ})$ is recursively defined as

LCSS (M_{a}, M_{b}^{ζ}) = R_{M_{a} M_{b}^{ζ}} (l_{a}, l_{b}) |_{l_{a} = N_{M_{a}}, l_{b} = N_{M_{b}}},

(14)

R_{M_{a} M_{b}^{ζ}} (l_{a}, l_{b}) = \begin{cases} 0 & \text{if } l_{a} = 0 \text{ or } l_{b} = 0, \\ 1 + R_{M_{a} M_{b}^{ζ}} (l_{a} - 1, l_{b} - 1) & \text{if } Sim_{τ} \{m_{a} (l_{a}), m_{b}^{ζ} (l_{b})\} > T_{h}, \\ \max \{R_{M_{a} M_{b}^{ζ}} (l_{a} - 1, l_{b}), R_{M_{a} M_{b}^{ζ}} (l_{a}, l_{b} - 1)\} & \text{otherwise,} \end{cases}

(15)

Sim_{τ} \{m_{a} (l_{a}), m_{b}^{ζ} (l_{b})\} = 1 - \frac{| {\tilde{m}}_{a} (l_{a}) - {\tilde{m}}_{b}^{ζ} (l_{b}) |}{\sqrt{12}},

(16)

{\tilde{m}}_{a} (l_{a}) = \frac{m_{a} (l_{a})}{max_{τ} m_{a, τ} (l_{a})},

(17)

{\tilde{m}}_{b}^{ζ} (l_{b}) = \frac{m_{b}^{ζ} (l_{b})}{max_{τ} m_{b, τ}^{ζ} (l_{b})},

(18)

where T_h (= 0.8) is a positive constant for determining the fitness between two different chroma vectors, Sim_τ{·, ·} is a similarity between chroma vectors defined in [20], ${\tilde{m}}_{a} (l_{a})$ and ${\tilde{m}}_{b}^{ζ} (l_{b})$ are normalized chroma vectors, $m_{a, τ} (l_{a})$ and $m_{b, τ}^{ζ} (l_{b})$ are elements of the chroma vectors, and τ corresponds to a tone, e.g., "C", "D#", "G#", etc. Note that the effectiveness of Sim_τ{·, ·} is verified in [20]. We then define this similarity as the LCSS kernel for music pieces $κ_{M}^{LCSS} (\cdot, \cdot)$ as follows:

κ_{M}^{LCSS} (M_{a}, M_{b}) = Sim_{M} (M_{a}, M_{b}) .

(19)
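The modulation-aware similarity of Equations 13 to 19 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: |·| in Equation 16 is interpreted as the Euclidean norm, the modulation ζ is applied as a circular rotation of the 12 chroma bins and searched over all 12 shifts, and the one-hot chroma vectors are toy data.

```python
import math

T_H = 0.8  # threshold T_h given in the text


def normalise(chroma):
    """Divide a chroma vector by its largest element (Equations 17, 18)."""
    peak = max(chroma)
    return [c / peak for c in chroma] if peak > 0 else list(chroma)


def rotate(chroma, shift):
    """Circularly shift the 12 chroma bins by `shift` positions."""
    shift %= len(chroma)
    return chroma[-shift:] + chroma[:-shift] if shift else list(chroma)


def chroma_similarity(m_a, m_b):
    """Equation 16, with |.| taken as the Euclidean norm (assumption)."""
    n_a, n_b = normalise(m_a), normalise(m_b)
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(n_a, n_b)))
    return 1.0 - dist / math.sqrt(12)


def lcss_thresholded(seq_a, seq_b, threshold=T_H):
    """Equation 15: two chroma vectors match if their similarity exceeds T_h."""
    prev = [0] * (len(seq_b) + 1)
    for m_a in seq_a:
        curr = [0]
        for j, m_b in enumerate(seq_b, start=1):
            if chroma_similarity(m_a, m_b) > threshold:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcss_kernel_music(chromagram_a, chromagram_b):
    """Equations 13 and 19: best normalised LCSS length over all 12 shifts."""
    shorter = min(len(chromagram_a), len(chromagram_b))
    if shorter == 0:
        return 0.0  # assumption: an empty chromagram has similarity 0
    best = 0
    for shift in range(12):
        shifted = [rotate(m, shift) for m in chromagram_b]
        best = max(best, lcss_thresholded(chromagram_a, shifted))
    return best / shorter


def one_hot(pitch_class):
    """Toy chroma vector with all energy in a single pitch class."""
    return [1.0 if i == pitch_class else 0.0 for i in range(12)]


melody = [one_hot(0), one_hot(4), one_hot(7)]        # C, E, G
transposed = [one_hot(2), one_hot(6), one_hot(9)]    # same shape, two semitones up
print(lcss_thresholded(melody, transposed))          # 0: no match without modulation
print(lcss_kernel_music(melody, transposed))         # 1.0: matched after rotation
```

The toy example shows why the maximization over ζ matters: the transposed melody shares no chroma vector with the original, yet it receives the maximum similarity once the best rotation is found.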

3.2.2 Spectrum intersection kernel

Next, we explain the spectrum intersection kernel for music pieces. In order to define the spectrum intersection kernel for music pieces, we firstly calculate p-spectrum-based features in the same way as those of human motions. It should be noted that the proposed method cannot calculate the p-spectrum from music features directly since the music features are defined as sequences of vectors. Therefore, we transform all of the vector components of music features into characters, such as alphabetic letters or numbers, based on hierarchical clustering algorithms, where the characters correspond to cluster numbers. For clustering the vector components, the modulation of music should also be considered in the same way as the LCSS kernel for music pieces. Therefore, clustering considering modulation is necessary. The procedures of this scheme are shown as follows.

Step 1: Calculation of optimal modulation amounts between music features

First, the proposed method calculates the optimal modulation amount ζ^{ab} between two music features M_a and M_b. This scheme is based on the LCSS-based similarity and is defined as follows:

ζ^{a b} = \underset{ζ}{\arg\max} \left\{\frac{LCSS (M_{a}, M_{b}^{ζ})}{\min (N_{M_{a}}, N_{M_{b}})}\right\} .

(20)

The optimal modulation amount ζ^{ab} is calculated for all pairs.

Step 2: Similarity measurement between chroma vectors using the obtained optimal modulation amounts

Similarity between vector components, i.e., between chroma vectors, is calculated using the obtained optimal modulation amounts. For example, the similarity between chroma vectors m_a(l_a) and m_b(l_b), which are the l_a-th and l_b-th components of two arbitrary music features M_a and M_b, respectively, is calculated using the obtained optimal modulation amount ζ^{ab} and Equation 16 as follows:

Sim_{c} \{m_{a} (l_{a}), m_{b} (l_{b})\} = 1 - \frac{| {\tilde{m}}_{a} (l_{a}) - {\tilde{m}}_{b}^{ζ^{a b}} (l_{b}) |}{\sqrt{12}} .

(21)

The above similarity is calculated between two different chroma vectors for all music features.

Step 3: Clustering chroma vectors based on the obtained similarities

Using the obtained similarities, the two most similar chroma vectors are assigned to the same cluster for clustering chroma vectors. This scheme is based on the single linkage method [25]. The merging scheme is recursively performed until the number of clusters becomes less than K_M.
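Step 3 can be illustrated with a naive single-linkage procedure that works directly on a similarity matrix. The sketch below is not the authors' implementation: the similarity values are made up, and the stopping rule is simplified to "merge until `n_clusters` clusters remain", whereas the text stops once the number of clusters falls below K_M.

```python
def single_linkage(similarity, n_clusters):
    """Agglomerative clustering that repeatedly merges the most similar clusters.

    `similarity` is a symmetric matrix (list of lists); the similarity between
    two clusters is the largest similarity between any pair of their members.
    """
    n = len(similarity)
    clusters = [[i] for i in range(n)]
    while len(clusters) > max(n_clusters, 1):
        best, pair = -float("inf"), None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = max(similarity[i][j]
                           for i in clusters[a] for j in clusters[b])
                if link > best:
                    best, pair = link, (a, b)
        a, b = pair
        clusters[a].extend(clusters[b])
        del clusters[b]
    # convert the list of clusters to one cluster number per item
    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical similarities between four chroma vectors
S = [[1.0, 0.9, 0.2, 0.1],
     [0.9, 1.0, 0.3, 0.2],
     [0.2, 0.3, 1.0, 0.8],
     [0.1, 0.2, 0.8, 1.0]]
print(single_linkage(S, 2))  # [0, 0, 1, 1]
```

The resulting cluster numbers play the role of the characters to which the p-spectrum is applied. This naive version recomputes all linkages at every merge and is only meant for small examples.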

Using the clustering results, the proposed method calculates transformed music features $m_{j}^{*} = {[m_{j}^{*} (1), m_{j}^{*} (2), \dots, m_{j}^{*} (N_{M_{j}})]}^{'} (j = 1, 2, \dots, N)$, where $m_{j}^{*} (l_{M}) (l_{M} = 1, 2, \dots, N_{M_{j}})$ is the cluster number assigned to the corresponding chroma vector. Note that vector/matrix transpose is denoted by the superscript ' in this article. The proposed method then calculates p-spectrum-based features from $m_{j}^{*}$. For the following explanations, we denote two transformed music features as $m_{a}^{*} = {[m_{a}^{*} (1), m_{a}^{*} (2), \dots, m_{a}^{*} (N_{M_{a}})]}^{'}$ and $m_{b}^{*} = {[m_{b}^{*} (1), m_{b}^{*} (2), \dots, m_{b}^{*} (N_{M_{b}})]}^{'}$, where $m_{a}^{*}$ and $m_{b}^{*}$ are vectors transformed from M_a and M_b, respectively, and $m_{a}^{*} (l_{a}) (l_{a} = 1, 2, \dots, N_{M_{a}})$ and $m_{b}^{*} (l_{b}) (l_{b} = 1, 2, \dots, N_{M_{b}})$ are the cluster numbers assigned to m_a(l_a) and m_b(l_b), respectively. Then, the spectrum intersection kernel for music pieces is calculated in the same way as that for human motions and is defined as follows:

κ_{M}^{SI} (M_{a}, M_{b}) = κ^{HI} (r_{p} (m_{a}^{*}), r_{p} (m_{b}^{*})) .

(22)

4 Kernel CCA-based music recommendation according to human motion

A method for recommending music pieces suitable for human motions is presented in this section. An overview of the proposed method is shown in Figure 2. In our cross-media recommendation method, pairs of human motions and music pieces that have a close relationship are necessary for effective correlation calculation. Therefore, we prepare these pairs extracted from the same video contents as segments. From the obtained segments, we extract human motion features and music features. More details of these features are given in Appendices A.1 and A.2. By applying kernel CCA to the features of human motions and music pieces, the proposed method calculates their correlation. In this approach, we define new kernel functions that can be used for data having various time lengths and introduce them into the kernel CCA.

Therefore, the proposed method can calculate the correlations by considering their sequential characteristics. Then, effective modeling of the relationship using human motions and music pieces having various time lengths is realized, and successful music recommendation can be expected.

First, we define the features of V_j and M_j (j = 1, 2,..., N) in the Hilbert space as ϕ_V(vec[V_j]) and ϕ_M(vec[M_j]), where vec[·] is the vectorization operator that turns a matrix into a vector. Next, we find features

s_{j} = A^{'} (ϕ_{V} (vec [V_{j}]) - {\bar{ϕ}}_{V}),

(23)

t_{j} = B^{'} (ϕ_{M} (vec [M_{j}]) - {\bar{ϕ}}_{M}),

(24)

A = [a_{1}, a_{2}, \dots, a_{D}],

(25)

B = [b_{1}, b_{2}, \dots, b_{D}],

(26)

where ${\bar{ϕ}}_{V}$ and ${\bar{ϕ}}_{M}$ are mean vectors of ϕ_V(vec[V_j]) and ϕ_M(vec[M_j]) (j = 1, 2,..., N), respectively. The matrices A and B are coefficient matrices whose columns a_d and b_d (d = 1, 2,..., D), respectively, correspond to the projection directions in Equations 2 and 3, where the value D is the dimension of A and B. Then, we define a correlation matrix Λ whose diagonal elements are the correlation coefficients λ_d (d = 1, 2,..., D). The details of the calculation of A, B, and Λ are shown as follows.

In order to obtain A, B, and Λ, we use the regularized kernel CCA shown in the previous section. Note that the optimal matrices A and B are given by

A = Ξ_{V} H E_{V},

(27)

B = Ξ_{M} H E_{M},

(28)

Ξ_{V} = [ϕ_{V} (vec [V_{1}]), ϕ_{V} (vec [V_{2}]), \dots, ϕ_{V} (vec [V_{N}])],

(29)

Ξ_{M} = [ϕ_{M} (vec [M_{1}]), ϕ_{M} (vec[M_{2}]), \dots, ϕ_{M} (vec[M_{N}])],

(30)

where $E_{V} = [e_{V_{1}}, e_{V_{2}}, \dots, e_{V_{D}}]$ and $E_{M} = [e_{M_{1}}, e_{M_{2}}, \dots, e_{M_{D}}]$ are N × D matrices. Furthermore,

H = I - \frac{1}{N} 1 1^{'}

(31)

is a centering matrix, where I is the N × N identity matrix, and 1 = [1,..., 1]' is an N × 1 vector. From Equations 27 and 28, the following equations are satisfied:

a_{d} = Ξ_{V} H e_{V_{d}},

(32)

b_{d} = Ξ_{M} H e_{M_{d}} .

(33)

Then, by calculating the optimal solutions $e_{V_{d}}$ and $e_{M_{d}} (d = 1, 2, \dots, D)$, A and B are obtained. In the same way as Equation 4, we calculate the optimal solutions $e_{V_{d}}$ and $e_{M_{d}}$ that maximize

L = e_{V}^{'} L e_{M} - \frac{λ}{2} (e_{V}^{'} M e_{V} - 1) - \frac{λ}{2} (e_{M}^{'} P e_{M} - 1),

(34)

where e_V, e_M, and λ correspond to $e_{V_{d}}$, $e_{M_{d}}$, and λ_d, respectively. In the above equation, L, M, and P are calculated as follows:

L = \frac{1}{N} H K_{V} H H K_{M} H,

(35)

M = \frac{1}{N} H K_{V} H H K_{V} H + η_{1} H K_{V} H,

(36)

P = \frac{1}{N} H K_{M} H H K_{M} H + η_{2} H K_{M} H .

(37)

Furthermore, η₁ and η₂ are regularization parameters, and $K_{V} (= Ξ_{V}^{'} Ξ_{V})$ and $K_{M} (= Ξ_{M}^{'} Ξ_{M})$ are matrices whose elements are defined as values of the corresponding kernel functions defined in Section 3. By taking derivatives of Equation 34 with respect to e_V and e_M, the optimal e_V, e_M, and λ can be obtained as solutions of the following eigenvalue problems:

M^{- 1} L P^{- 1} L^{'} e_{V} = λ^{2} e_{V},

(38)

P^{- 1} L^{'} M^{- 1} L e_{M} = λ^{2} e_{M},

(39)

where λ² is obtained as an eigenvalue, and the vectors e_V and e_M are, respectively, obtained as eigenvectors. Then, the dth (d = 1, 2,..., D) largest value of λ becomes λ_d, where λ₁ ≥ λ₂ ≥ ... ≥ λ_D. Note that the dimension D is set to a value for which the cumulative proportion obtained from λ_d (d = 1, 2,..., D) becomes larger than a threshold. Furthermore, the eigenvectors e_V and e_M corresponding to λ_d become $e_{V_{d}}$ and $e_{M_{d}}$, respectively.
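The computation in Equations 31 and 35 to 39 can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation. Several choices are assumptions made for the example: a small `jitter` term is added to M and P because the centered Gram matrices are singular, e_M is recovered from e_V through the stationarity condition of Equation 34 instead of solving Equation 39 separately, the number of components is fixed by hand rather than by a cumulative proportion threshold, and the Gram matrices come from linear kernels on synthetic data rather than from the kernels of Section 3.

```python
import numpy as np


def kernel_cca(K_v, K_m, eta1=0.1, eta2=0.1, n_components=2, jitter=1e-8):
    """Regularised kernel CCA from two N x N Gram matrices."""
    n = K_v.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n               # centering matrix, Eq. 31
    Kv, Km = H @ K_v @ H, H @ K_m @ H
    L = Kv @ Km / n                                    # Eq. 35
    M = Kv @ Kv / n + eta1 * Kv + jitter * np.eye(n)   # Eq. 36 plus jitter (assumption)
    P = Km @ Km / n + eta2 * Km + jitter * np.eye(n)   # Eq. 37 plus jitter (assumption)

    # Eq. 38: M^{-1} L P^{-1} L' e_V = lambda^2 e_V
    target = np.linalg.solve(M, L @ np.linalg.solve(P, L.T))
    eigvals, eigvecs = np.linalg.eig(target)
    order = np.argsort(-eigvals.real)[:n_components]
    lambdas = np.sqrt(np.clip(eigvals.real[order], 0.0, None))

    E_v = eigvecs.real[:, order]
    # rescale every column so that e_V' M e_V = 1 (constraint in Eq. 34)
    E_v = E_v / np.sqrt(np.einsum("id,ij,jd->d", E_v, M, E_v))
    # setting the derivative of Eq. 34 w.r.t. e_M to zero gives L' e_V = lambda P e_M
    E_m = np.linalg.solve(P, L.T @ E_v) / lambdas
    return E_v, E_m, lambdas


# Synthetic stand-in data: the two views share one latent variable z
rng = np.random.default_rng(0)
z = rng.normal(size=(30, 1))
X = np.hstack([z, rng.normal(size=(30, 2))])
Y = np.hstack([-z + 0.05 * rng.normal(size=(30, 1)), rng.normal(size=(30, 2))])

E_v, E_m, lambdas = kernel_cca(X @ X.T, Y @ Y.T, n_components=2)
print(lambdas)  # correlation coefficients in descending order
```

In practice `K_v` and `K_m` would hold the values of the LCSS or spectrum intersection kernels between all training segments. For a kernel that is not positive semi-definite, M and P may be indefinite, in which case this sketch offers no guarantee that the eigenvalues are real or non-negative.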

From the obtained matrices A, B, and Λ, we can estimate the optimal music features from given human motion features, i.e., we can select the best matched music pieces according to human motions. An overview of music recommendation is shown in Figure 3. When a human motion feature V_in is given as a query, we select a predetermined number of music pieces that minimize the following distance:

d = ∥ t_{i n} - {\hat{t}}_{i} ∥^{2} (i = 1, 2, \dots, M_{t}),

(40)

where t_in and $\hat{t}_{i}$ are, respectively, the query human motion feature and the features of the music pieces in the database $\hat{M}_{i}$ $(i = 1, 2, \dots, M_{t})$, transformed into the same feature space as follows:

\begin{align} \hat{t}_{i} & = B^{'} (ϕ_{M} (vec [\hat{M}_{i}]) - \bar{ϕ}_{M}) \\ & = E_{M}^{'} (κ_{\hat{M}_{i}} - \frac{1}{N} K_{M} 1), \end{align}

(41)

\begin{align} t_{in} & = Λ A^{'} (ϕ_{V} (vec [V_{in}]) - \bar{ϕ}_{V}) \\ & = Λ E_{V}^{'} (κ_{V_{in}} - \frac{1}{N} K_{V} 1), \end{align}

(42)

and M_t is the number of music pieces in the database. Note that $κ_{V_{in}}$ is an N × 1 vector whose q-th element is $κ_{V}^{LCSS} (V_{in}, V_{q})$ or $κ_{V}^{SI} (V_{in}, V_{q})$, and $κ_{\hat{M}_{i}}$ is an N × 1 vector whose q-th element is $κ_{M}^{LCSS} (\hat{M}_{i}, M_{q})$ or $κ_{M}^{SI} (\hat{M}_{i}, M_{q})$.
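The projection and ranking in Equations 40 to 42 can be sketched as follows. The function name `recommend` and the array layout (one row of kernel values per database music piece) are assumptions made for illustration, and the kernel vectors are assumed to have been computed beforehand with the LCSS or spectrum intersection kernel.

```python
import numpy as np

def recommend(kappa_v_in, kappa_m_db, K_V, K_M, E_V, E_M, lam, top_k=3):
    """Rank database music pieces for one query motion (sketch of Eqs. 40-42).

    kappa_v_in : (N,)     kernel values between the query motion and the N training motions
    kappa_m_db : (M_t, N) kernel values between each database piece and the N training pieces
    E_V, E_M   : (N, D)   eigenvector matrices; lam : (D,) eigenvalues
    """
    N = K_V.shape[0]
    ones = np.ones(N)
    # Eq. 42: t_in = Lambda E_V' (kappa_V_in - (1/N) K_V 1)
    t_in = lam * (E_V.T @ (kappa_v_in - K_V @ ones / N))
    # Eq. 41 for every database piece at once; row i holds t_hat_i
    T_hat = (kappa_m_db - K_M @ ones / N) @ E_M
    # Eq. 40: squared Euclidean distance in the shared feature space
    dist = np.sum((T_hat - t_in) ** 2, axis=1)
    return np.argsort(dist)[:top_k], dist
```

With `top_k=3`, the returned indices correspond to the three recommended music pieces used in the experiments.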

As described above, we can estimate the best matched music pieces according to the human motions. The proposed method calculates the correlation between human motions and music pieces based on the kernel CCA. Then, the proposed method introduces the kernel functions that can be used for time series having various time lengths based on the LCSS or p-spectrum. Therefore, the proposed method enables calculation of the correlation between human motions and music pieces that have various time lengths. Furthermore, effective correlation calculation and successful music recommendation according to human motion based on the obtained correlation are realized.

5 Experimental results

The performance of the proposed method is verified in this section. For the experiments, we used video contents of three classic ballet programs, from which 170 segments were manually extracted. Of the 170 segments, 44 were from Nutcracker, 54 were from Swan Lake, and 72 were from Sleeping Beauty. Each segment consisted of only one human motion, and the background music did not change within the segment. In addition, no camera change was included in a segment. The audio signals in each segment were mono channel, 16 bits per sample, and sampled at 44.1 kHz. Human motion features and music features were extracted from the obtained segments.

For evaluation of the performance of our method, we used videos of classic ballet programs. However, there were some differences between motions extracted from classic ballet programs and those extracted in our daily life. In cross-media recommendation, we have to consider whether or not we should recommend contents that have the same meanings as those of queries. For example, when we recommend music pieces from the user's information, recommendation of sad music pieces is not always suitable if the user seems to be sad. Our approach also has to consider the above point. In this article, we focus on extraction of the relationship between human motions and music pieces and perform the recommendation based on the extracted relationship. In addition, we have to prepare some ground truths for evaluation of the proposed method. Therefore, we used videos of classic ballet programs since the human motions and music pieces extracted from the same videos of classic ballet programs had strong and direct relationships.

In order to evaluate the performance of our method, we also prepared five datasets, #1 to #5, each consisting of 100 segments for training (training segments) and 70 segments for testing (testing segments), i.e., a simple cross-validation scheme. Each dataset was obtained by randomly dividing the 170 segments prepared above. The purpose of using several datasets was to perform various verifications by changing the combination of testing segments and training segments; the number of datasets (five) was chosen for simplicity. For the experiments, 12 kinds of tags representing expression marks in music, shown in Table 1, were used. We examined whether each candidate tag could be used for labeling both human motions and music pieces, and tags that seemed difficult to use for these two media types were removed. This process yielded the above 12 kinds of tags. One suitable tag was manually selected and annotated to each segment for performance verification. In the experiments, one person with musical experience annotated each segment with the best-matched label. Generally, annotation should be performed by several people. However, since the labels, i.e., expression marks in music, require knowledge of music, the ground truths had to be made by a person with such knowledge. Thus, in the experiment, only one person annotated the labels.
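The dataset preparation described above can be sketched as follows. The exact splitting procedure is not specified in the text, so this sketch assumes that each dataset is an independent random permutation of the 170 segments; the function name `make_datasets` and the seed are hypothetical.

```python
import numpy as np

def make_datasets(n_segments=170, n_train=100, n_datasets=5, seed=0):
    """Return a list of (train_indices, test_indices) pairs, one per dataset."""
    rng = np.random.default_rng(seed)
    datasets = []
    for _ in range(n_datasets):
        perm = rng.permutation(n_segments)        # random order of segment indices
        train = np.sort(perm[:n_train])           # 100 training segments
        test = np.sort(perm[n_train:])            # remaining 70 testing segments
        datasets.append((train, test))
    return datasets
```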

Table 1 Description of expression marks


First, we show the recommended results (see Additional file 1). In this file, we show original video contents and recommended video contents. The background music pieces of recommended video contents are not original but are music pieces recommended by our method. These results show that our method can recommend a suitable music piece for a human motion.

Additional file 1: Recommended results. This video content shows our recommendation results: the original video contents and the recommended results, whose background music consists of the music pieces recommended by our method. (MOV 6 MB)

Next, we quantitatively verify the performance of the proposed method. In this simulation, we verify the effectiveness of our kernel functions. In the proposed method, we define two types of kernel functions, LCSS kernel and spectrum intersection kernel, for human motions and music pieces. Thus, we experimentally compare our two newly defined kernel functions. Using combinations of the kernel functions, we prepared four simulations "Simulation 1"-"Simulation 4", as follows:

Simulation 1 used the LCSS kernel for both human motions and music pieces.
Simulation 2 used the spectrum intersection kernel for both human motions and music pieces.
Simulation 3 used the spectrum intersection kernel for human motions and the LCSS kernel for music pieces.
Simulation 4 used the LCSS kernel for human motions and the spectrum intersection kernel for music pieces.

These simulations were performed to verify the effectiveness of our two newly defined kernel functions for human motions and music pieces. For the following explanations, we denote the LCSS kernel as "LCSS-K" and the spectrum intersection kernel as "SI-K". In addition, for the experiments, we used the following criterion:

Accuracy Score = \frac{\sum_{i_{1} = 1}^{70} Q_{i_{1}}^{1}}{70},

(43)

where the denominator corresponds to the number of testing segments. Furthermore, $Q_{i_{1}}^{1}$ $(i_{1} = 1, 2, \dots, 70)$ is one if the tags of the three recommended music pieces include the tag of the human motion query; otherwise, $Q_{i_{1}}^{1}$ is zero. It should be noted that the number of recommended music pieces (three) was simply determined. We next explain how the number of recommended music pieces affects the performance of our method. For the following explanation, we define the terms "over-recommendation" and "mis-recommendation". Over-recommendation means that the recommended results tend to contain music pieces that are not matched to the target human motions in addition to matched music pieces, and mis-recommendation means that music pieces that are matched to the target human motions tend not to be selected as recommendation results. There is a tradeoff between over-recommendation and mis-recommendation. That is, if we increase the number of recommended results, over-recommendation increases and mis-recommendation decreases. On the other hand, if we decrease the number of recommended results, over-recommendation decreases and mis-recommendation increases.

We evaluate the recommendation accuracy according to the above criterion. Figure 4 shows that the accuracy score of Simulation 1 was higher than the accuracy scores of the other simulations. This is because the LCSS kernel can effectively compare human motions, and likewise music pieces, that have different time lengths. Note that in these simulations, we used bigrams (p = 2) for calculating the p-spectrum-based features shown in Equation 9, the number of clusters for chroma vectors was set to K_M = 500, and the parameters in our method are shown in Tables 2, 3, 4 and 5. All of these parameters were empirically determined and set to the values that provided the highest accuracy. More details of parameter determination are given in Appendix B.
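The criterion in Equation 43 can be sketched as follows. The function name `accuracy_score` and the list-based input format are assumptions made for illustration.

```python
def accuracy_score(query_tags, recommended_tags):
    """Fraction of queries whose tag appears among the tags of the recommended pieces.

    query_tags       : list of tags, one per testing segment
    recommended_tags : list of tag lists, e.g. three tags per testing segment
    """
    hits = sum(1 for tag, recs in zip(query_tags, recommended_tags) if tag in recs)
    return hits / len(query_tags)

# Example with two queries: only the first one is matched by a recommended piece.
score = accuracy_score(
    ["agitato", "appassionato"],
    [["agitato", "agitato", "appassionato"], ["agitato", "agitato", "agitato"]],
)
```

In the experiments, the denominator would be the 70 testing segments of a dataset.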

Table 2 Description of parameters used in Simulation 1

Table 3 Description of parameters used in Simulation 2

Table 4 Description of parameters used in Simulation 3

Table 5 Description of parameters used in Simulation 4

In the following, we discuss the results obtained. First, we discuss the influence of our human motion features. The features used in our method are based on optical flow and are extracted between two regions that contain a human in two successive frames. This feature can represent movements of arms, legs, hands, etc. However, it cannot represent global human movements, which are an important factor for representing the motion characteristics of classic ballet. For accurate relationship extraction between human motions and music pieces, it is necessary to improve the human motion features so that they can also represent global human movement. This can be complemented using information obtained by much more accurate sensors such as Kinect.^d

Next, we discuss the experimental conditions. In the experiments with the proposed method, we used tags, i.e., expression marks in music, as ground truths, and one tag was annotated to each segment. However, this annotation scheme does not consider the relationship between tags. For example, in Table 1, "agitato" and "appassionato" have similar meanings. Thus, the choice of the 12 kinds of tags might not be suitable, and it might be necessary to reconsider the choice of tags. We also found that it is important to introduce the relationship between tags into our accuracy criteria. However, it is difficult to quantify the relationship between them. Thus, we used only one tag for each segment. This limitation is also suggested by the results of the subjective evaluation in the next experiment.

We also used comparative methods for verifying the performance of the proposed method. For the comparative methods, we replaced our kernel functions with the Gaussian kernel (G-K) $κ^{G-K} (x, y) = \exp (- \frac{∥ x - y ∥^{2}}{2 σ^{2}})$, the sigmoid kernel (S-K) $κ^{S-K} (x, y) = \tanh (α x^{'} y + β)$, and the linear kernel (L-K) $κ^{L-K} (x, y) = x^{'} y$. In this experiment, we set the parameters to σ = 5.0, α = 5.0, and β = 3.0. It should be noted that these kernel functions cannot be applied to our human motion features and music features directly since the dimensions of the features vary from segment to segment. Therefore, we simply used the time average of the optical flow-based vectors, $v_{j}^{avg}$, for human motion features and the time average of the chroma vectors, $m_{j}^{avg}$, for music features. Then, we applied the above three types of kernel functions to the obtained features. Figure 5 shows the results of the comparison for each kernel function. These results show that our kernel functions are more effective than the other kernel functions. The results also show that it is important to consider the temporal characteristics of the data, and that our kernel functions can successfully consider these characteristics. Note that in this comparison, we used the parameters that provided the highest accuracy. The parameters are shown in Tables 6, 7 and 8.
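The three comparative kernels and the time averaging can be sketched as follows. The function names are hypothetical; the default parameter values are those stated in the text (σ = 5.0, α = 5.0, β = 3.0).

```python
import numpy as np

def time_average(sequence):
    """Average a (T, dim) sequence of feature vectors over time into one dim-vector."""
    return np.asarray(sequence, dtype=float).mean(axis=0)

def gaussian_kernel(x, y, sigma=5.0):
    """G-K: exp(-||x - y||^2 / (2 sigma^2))."""
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

def sigmoid_kernel(x, y, alpha=5.0, beta=3.0):
    """S-K: tanh(alpha x'y + beta)."""
    return float(np.tanh(alpha * np.dot(x, y) + beta))

def linear_kernel(x, y):
    """L-K: x'y."""
    return float(np.dot(x, y))
```

Because `time_average` collapses each sequence to a single vector, sequences of different lengths become comparable, at the cost of discarding their temporal structure.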

Table 6 Description of parameters used in gaussian kernel

Table 7 Description of parameters used in sigmoid kernel

Table 8 Description of parameters used in linear kernel

Finally, we show results of subjective evaluation for our recommendation method. We performed subjective evaluation using 15 subjects (User1-User15). Table 9 shows the profiles of the subjects. In the evaluation, we used video contents which consisted of video sequences and music pieces. In the video contents, each video sequence included one human motion, and each music piece was a recommended result by the proposed method according to the human motion. The tasks of the subjective evaluation were as follows:

Table 9 Profiles of the subjects


1. Subjects watched each video content, whose video sequence was a target classic ballet scene and whose music was recommended by the proposed method.
2. Subjects determined whether the target classic ballet scene and the recommended music piece were matched or not. Specifically, they answered yes or no.
3. Procedures 1 and 2 were repeated for 210 video contents.

In the subjective evaluation, we used the recommended results obtained by Simulation 1 in the above-described experiment. We also used datasets #1 and #2 for the subjective evaluation. In the evaluation, we showed the top three recommended results for each query human motion (query segment). Then, 70 query segments were examined and 210 recommended results were obtained for each dataset.

Furthermore, we used two criteria, "Accuracy Score 2" and "Accuracy Score 3", for verifying the performance. Accuracy Score 2 is defined as follows:

Accuracy Score 2 = \frac{\sum_{i_{2} = 1}^{70} Q_{i_{2}}^{2}}{70},

(44)

where the denominator corresponds to the number of query segments. $Q_{i_{2}}^{2}$ $(i_{2} = 1, 2, \dots, 70)$ is one if the subject judged at least one of the three recommended music pieces to be matched to the query human motion. Otherwise, $Q_{i_{2}}^{2}$ is zero. In addition, Accuracy Score 3 is the ratio of positive assessments over the 210 recommended music pieces and is defined as follows:

Accuracy Score 3 = \frac{\sum_{i_{3} = 1}^{210} Q_{i_{3}}^{3}}{210},

(45)

where the denominator corresponds to the number of recommended music pieces. Furthermore, $Q_{i_{3}}^{3} (i_{3} = 1, 2, \dots, 210)$ is one if subjects determined the query human motion and its music piece matched. Otherwise, $Q_{i_{3}}^{3}$ is zero. Table 10 shows the results of each score in the subjective evaluation. From the results, both scores show higher recommendation accuracy than that of the quantitative evaluation. Therefore, the results of the subjective evaluation confirmed the effectiveness of our method.
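Equations 44 and 45 can be sketched for one subject as follows. The function name `subjective_scores` and the 70 × 3 yes/no answer matrix are assumptions about how the answers might be stored.

```python
import numpy as np

def subjective_scores(answers):
    """Compute Accuracy Score 2 and Accuracy Score 3 for one subject.

    answers : (n_queries, 3) array of yes/no answers (True = matched),
              one row per query segment, one column per recommended piece.
    """
    a = np.asarray(answers, dtype=bool)
    score2 = a.any(axis=1).mean()   # Eq. 44: at least one of the three pieces matched
    score3 = a.mean()               # Eq. 45: ratio of matched answers over all pieces
    return float(score2), float(score3)
```

Since a query counts as a hit for Accuracy Score 2 whenever any of its three pieces is accepted, Accuracy Score 2 is never smaller than Accuracy Score 3.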

Table 10 Accuracy of subjective evaluation of each user in Dataset #1 and Dataset #2


6 Conclusions

In this article, we have presented a method for music recommendation according to human motion based on the kernel CCA-based relationship. In the proposed method, we newly defined two types of kernel functions. One is a sequential similarity-based kernel function that uses the LCSS algorithm, and the other is a statistical characteristic-based kernel function that uses the p-spectrum. Using these kernel functions, the proposed method enables calculation of the correlation that can consider their sequential characteristics. Furthermore, based on the obtained correlation, the proposed method enables accurate music recommendation according to human motion.

In the experiments, recommendation accuracy was sensitive to the parameters. It is desirable that these parameters be adaptively determined from the datasets. Thus, we need to complement this determination algorithm. Feature selection of the human motions and music pieces is also needed for more accurate extraction of the relationship between human motions and music pieces. These topics will be the subjects of subsequent studies.

Endnotes

^a In this article, we simply denote "retrieval and recommendation" as recommendation hereafter.
^b In this article, video sequences are defined as data that contain only visual signals, and video contents are defined as data that contain both visual signals and audio signals.
^c In this section, we assume that $E [ϕ_{x} (x)] = 0$ and $E [ϕ_{y} (y)] = 0$ for brief explanation, where $E [\cdot]$ denotes the sample average of the random variates.
^d http://www.xbox.com/en-US/Kinect.

Appendix A: Feature extraction

In this article, we use human motion features and music features. Here, each feature extraction is explained in detail. Segments are extracted from video contents, i.e., video contents are separated into segments S_j (j = 1, 2, ..., N). Then, human motion features and music features are extracted from each segment. In this appendix, we explain the methods for extraction of human motion features and music features in A.1 and A.2, respectively.

A.1 Extraction of human motion features

First, the proposed method separates segments S_j into frames $f_{j}^{k}$ $(k = 1, 2, \dots, N_{j})$, where N_j is the number of frames in segment S_j. Furthermore, a rectangular region including one human is clipped from each frame, and the clipped regions are normalized to the same size. In this article, we assume that this rectangular region has previously been obtained. Deciding the rectangular regions including humans might be difficult. However, there are several methods for extracting human regions from video sequences [26, 27]. Accurate human region detection can also be achieved by combining visual information with sensor information such as Kinect,^d by using a stereo camera, or by using a camera whose position is calibrated. Although we extract the rectangular region manually for simplicity, we consider that a certain precision can be guaranteed using these methods.

Next, we show the calculation of optical flow-based vectors. For calculating optical flows from segments, we first divide the region of frame $f_{j}^{k}$ into blocks $B_{j}^{b}$ $(b = 1, 2, \dots, N^{B_{j}})$, where $N^{B_{j}}$ (= 1600) is the number of blocks in each frame. Then, based on the Lucas-Kanade algorithm [19], the optical flow in each block $B_{j}^{b}$ is calculated between two successive regions from $f_{j}^{k + 1}$ to $f_{j}^{k}$ for all segments S_j. We thereby obtain optical flow-based vectors $v_{j} (k)$ $(k = 1, 2, \dots, N_{V_{j}})$ containing the vertical and horizontal optical flow values for all blocks, where $N_{V_{j}}$ corresponds to N_j - 1.
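A block-wise Lucas-Kanade estimate can be sketched in NumPy as follows. This is a simplified single-scale least-squares version, not necessarily the variant used in the experiments. The function name is hypothetical, and the 40 × 40 block grid is an assumption: the text only states that there are 1600 blocks per frame.

```python
import numpy as np

def block_lucas_kanade(region_a, region_b, blocks_per_side=40):
    """Estimate one (horizontal, vertical) flow per block between two grayscale regions.

    Returns a vector of length 2 * blocks_per_side**2.
    """
    a = np.asarray(region_a, dtype=float)
    b = np.asarray(region_b, dtype=float)
    Iy, Ix = np.gradient(a)            # spatial derivatives (rows = vertical, cols = horizontal)
    It = b - a                         # temporal derivative
    h, w = a.shape
    bh, bw = h // blocks_per_side, w // blocks_per_side
    flow = np.zeros((blocks_per_side, blocks_per_side, 2))
    for r in range(blocks_per_side):
        for c in range(blocks_per_side):
            sl = (slice(r * bh, (r + 1) * bh), slice(c * bw, (c + 1) * bw))
            # Brightness constancy: Ix * u + Iy * v = -It, solved by least squares per block
            A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)
            rhs = -It[sl].ravel()
            flow[r, c] = np.linalg.lstsq(A, rhs, rcond=None)[0]
    return flow.reshape(-1)
```

Stacking the vectors returned for every pair of successive regions in a segment would give a sequence of N_j - 1 optical flow-based vectors.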

In this article, the human motion feature V_j of segment S_j is obtained as the sequence of the optical flow-based vectors v_j(k). The features obtained by the above procedure represent the temporal characteristics of human motions.

A.2 Extraction of music features

The proposed method uses chromagrams [24]. A chromagram is the temporal sequence of chroma vectors and is calculated from each segment. The chroma vector represents the magnitude distribution over the chroma, which is divided into 12 pitch classes within an octave, and thus the chroma vector has 12 dimensions. The 12-dimensional chroma vector m(t) is extracted from the magnitude spectrum Ψ(f_Hz, t), which is calculated using the short-time Fourier transform (STFT), where f_Hz is the frequency and t is the time in an audio signal. The τ-th (τ = 1, 2, ..., 12) element of m(t) corresponds to a pitch class of equal temperament and is defined as follows:

m_{τ} (t) = \sum_{h = O c t_{L}}^{O c t_{H}} \int_{- \infty}^{\infty} B P F_{τ, h} (f_{H z}) Ψ (f_{H z}, t) d f_{H z},

where Oct_H and Oct_L represent the highest and lowest octave positions, respectively. Furthermore, BPF_τ,h(f_Hz) is a bandpass filter that passes the signal at the log-scale frequency F_τ,h (in cents) of pitch class τ (chroma) in octave position h (height) as shown in the following equation:

F_{τ, h} = 1200 h + 100 (τ - 1) .

We define a chromagram that represents a temporal sequence of 12-dimensional chroma vectors extracted by the above procedure in segment S_j as the music features $M_{j} = [m_{j} (1), m_{j} (2), \dots, m_{j} (N_{M_{j}})]$, where $N_{M_{j}}$ is the number of components of M_j. Details of the chroma vector and the chromagram are shown in [20].
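The chroma vector computation can be sketched for one STFT frame as follows. Several details are assumptions not stated in this appendix: the Hz-to-cents reference (taken here as 440 × 2^(3/12 − 5) Hz, so that 440 Hz maps to 5700 cents), the octave range (3 to 8), and a rectangular ±50 cent band used in place of the bandpass filter BPF_τ,h. The function names are hypothetical.

```python
import numpy as np

# Assumed reference frequency for 0 cents (places 440 Hz at 5700 cents).
REF_HZ = 440.0 * 2.0 ** (3.0 / 12.0 - 5.0)

def hz_to_cents(f_hz):
    """Convert frequency in Hz to log-scale frequency in cents."""
    return 1200.0 * np.log2(np.asarray(f_hz, dtype=float) / REF_HZ)

def chroma_vector(magnitudes, freqs_hz, oct_low=3, oct_high=8):
    """Sum spectral magnitude near F = 1200 h + 100 (tau - 1) for each pitch class tau."""
    magnitudes = np.asarray(magnitudes, dtype=float)
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    valid = freqs_hz > 0                      # log2 is undefined at 0 Hz
    cents = hz_to_cents(freqs_hz[valid])
    mags = magnitudes[valid]
    chroma = np.zeros(12)
    for h in range(oct_low, oct_high + 1):
        for tau in range(1, 13):
            center = 1200.0 * h + 100.0 * (tau - 1)
            # Simplification: rectangular band of +/- 50 cents instead of a shaped filter
            in_band = np.abs(cents - center) < 50.0
            chroma[tau - 1] += mags[in_band].sum()
    return chroma
```

Applying `chroma_vector` to every STFT frame of a segment and stacking the results column by column would give a chromagram in the sense of M_j.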

Appendix B: Parameter determination

In this section, we explain the parameter determination. First, we performed experiments to show the relationship between the accuracy score and the parameters. Figure 6 shows this relationship for Simulation 1. From the obtained results, it can be seen that the kernel CCA-based approach tends to be sensitive to the parameters. It should be noted that the dataset used for the experiments contains quite different types of pairs of human motions and music pieces. For similar pairs of human motions and music pieces, we expect to be able to use fixed parameters and obtain accurate results. Therefore, stable recommendation accuracy scores are likely to be achieved using parameters determined from datasets that have similar characteristics. This means that, for stable recommendation, schemes for clustering and classifying the contents become necessary as preprocessing. The other simulations and other databases showed the same sensitivity as the results shown here. For the above reasons, we used the parameters that provided the highest accuracy; the parameters were therefore not determined by cross-validation. However, we recognize that such parameters should be determined by cross-validation. This is our future work.
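Selecting the parameters that give the highest accuracy can be sketched as an exhaustive grid search. The function name `grid_search`, the `evaluate` callback, and the candidate values are hypothetical; the actual parameter ranges are those reported in the tables and Figure 6.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Return the parameter combination with the highest score.

    evaluate : callable taking the parameters as keyword arguments and returning a score
    grid     : dict mapping each parameter name to a list of candidate values
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy example: a stand-in for the accuracy score that peaks at eta1 = 0.1, eta2 = 1.0.
best, score = grid_search(
    lambda eta1, eta2: -(eta1 - 0.1) ** 2 - (eta2 - 1.0) ** 2,
    {"eta1": [0.01, 0.1, 1.0], "eta2": [0.1, 1.0]},
)
```

If `evaluate` measures accuracy on the testing segments, as in the experiments, the selected values are optimistic; evaluating on held-out folds of the training segments instead would correspond to the cross-validation mentioned above as future work.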

Abbreviations

CCA: canonical correlation analysis
MMD: multimedia documents
LCSS: longest common subsequence
LCSS-K: LCSS kernel
SI-K: spectrum intersection kernel
G-K: gaussian kernel
S-K: sigmoid kernel
L-K: linear kernel.

References

1. Kim I, Lee J, Kwon Y, Par S: Content-based image retrieval method using color and shape features. Proceedings of the 1997 International Conference on Information, Communication and Signal Processing 1997, 948-952.
2. Zhang R, Zhang Z: Effective image retrieval based on hidden concept discovery in image database. IEEE Trans Image Process 2006, 16(2):562-572.
3. He X, Ma W, Zhang H: Learning an image manifold for retrieval. Proceedings of the ACM Multimedia Conference 2004.
4. Guo G, Li S: Content-based audio classification and retrieval by support vector machines. IEEE Trans Neural Networks 2003, 14(1):209-215. 10.1109/TNN.2002.806626
5. Typke R, Wiering F, Veltkamp R: A survey of music information retrieval systems. Proceedings of the ISMIR 2005.
6. Shen J, Shepherd J, Ngu A: Towards effective content-based music retrieval with multiple acoustic feature combination. IEEE Trans Multimedia 2006, 8:1179-1189.
7. Greenspan H, Goldberger J, Mayer A: Probabilistic space-time video modeling via piecewise GMM. IEEE Trans Pattern Anal Mach Intell 2004, 26(3):384-396. 10.1109/TPAMI.2004.1262334
8. Fan J, Elmagarmid A, Zhu X, Aref W, Wu L: ClassView: hierarchical video shot classification, indexing, and accessing. IEEE Trans Multimedia 2004, 6(1):70-86. 10.1109/TMM.2003.819583
9. Li X, Tao D, Maybank S: Visual music and musical vision. Neurocomputing 2008, 71:2023-2028. 10.1016/j.neucom.2008.01.025
10. Fujii A, Itou K, Akiba T, Ishikawa T: A cross-media retrieval system for lecture videos. Proceedings of the Eighth European Conference on Speech Communication and Technology (Eurospeech 2003) 2003, 1149-1152.
11. Zhuang Y, Yang Y, Wu F: Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval. IEEE Trans Multimedia 2008, 10(2):221-229.
12. Yang Y, Zhuang Y, Wu F, Pan Y: Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval. IEEE Trans Multimedia 2008, 10(3):437-446.
13. Akaho S: A kernel method for canonical correlation analysis. International Meeting of Psychometric Society 2001, 1.
14. Jun S, Han B, Hwang E: A similar music retrieval scheme based on musical mood variation. First Asian Conference on Intelligent Information and Database Systems 2009, 1:167-172.
15. Mercer J: Functions of positive and negative type, and their connection with the theory of integral equations. Trans London Philos Soc (A) 1909, 209:415-446. 10.1098/rsta.1909.0016
16. Leslie C, Eskin E, Noble W: The spectrum kernel: a string kernel for SVM protein classification. Proceedings of the Pacific Biocomputing Symposium 2002, 566-575.
17. Barla A, Odone F, Verri A: Histogram intersection kernel for image classification. ICIP(3) 2006, 513-516.
18. Gruber C, Gruber T, Sick B: Online signature verification with new time series kernels for support vector machines. Advances in Biometrics 2005, 3832:500-508. 10.1007/11608288_67
19. Lucas B, Kanade T: An iterative image registration technique with an application to stereo vision. Proceedings of the DARPA IU Workshop 1984, 121-130.
20. Goto M: A chorus-section detection method for musical audio signals and its application to a music listening station. IEEE Trans Audio Speech Language Process 2006, 14(5):1783-1794.
21. MacQueen J: Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Math. Statistics and Probability 1967, 1:281-297.
22. Mariethoz J, Bengio S: A kernel trick for sequences applied to text-independent speaker verification systems. Pattern Recognition 2007, 40(8):2315-2324. 10.1016/j.patcog.2007.01.011
23. Camps-Valls G, Martin-Guerrero J, Rojo-Alvarez J, Soria-Olivas E: Fuzzy sigmoid kernel for support vector classifier. Neurocomputing 2004, 62:501-506.
24. Wakefield GH: Mathematical representation of joint time-chroma distributions. SPIE 1999.
25. Xu R, Wunsch D II: Survey of clustering algorithms. IEEE Trans Neural Networks 2005, 16(3):645-678. 10.1109/TNN.2005.845141
26. Dalal N, Triggs B, Schmid C: Human detection using oriented histograms of flow and appearance. Comput Vision ECCV 2006 2006, 3952:428-441. 10.1007/11744047_33
27. Mikolajczyk K, Schmid C, Zisserman A: Human detection based on a probabilistic assembly of robust part detectors. In Proceedings of the Eighth European Conference on Computer Vision. Volume 1. Prague, Czech Republic; 2004:69-81.


Acknowledgements

This study was partly supported by the Grant-in-Aid for Scientific Research (B) 21300030, Japan Society for the Promotion of Science (JSPS).

Author information

Authors and Affiliations

Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan
Hiroyuki Ohkushi, Takahiro Ogawa & Miki Haseyama

Corresponding author

Correspondence to Hiroyuki Ohkushi.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Ohkushi, H., Ogawa, T. & Haseyama, M. Music recommendation according to human motion based on kernel CCA-based relationship. EURASIP J. Adv. Signal Process. 2011, 121 (2011). https://doi.org/10.1186/1687-6180-2011-121

Received: 15 April 2011
Accepted: 05 December 2011
Published: 05 December 2011
DOI: https://doi.org/10.1186/1687-6180-2011-121
