Music recommendation according to human motion based on kernel CCA-based relationship
EURASIP Journal on Advances in Signal Processing volume 2011, Article number: 121 (2011)
Abstract
In this article, a method for recommending music pieces according to human motions, based on their kernel canonical correlation analysis (CCA)-based relationship, is proposed. In order to perform recommendation between different types of multimedia data, i.e., recommendation of music pieces from human motions, the proposed method estimates their relationship. Specifically, the correlation based on kernel CCA is calculated as the relationship in our method. Since human motions and music pieces have various time lengths, it is necessary to calculate the correlation between time series having different lengths. Therefore, new kernel functions for human motions and music pieces, which can provide similarities between data that have different time lengths, are introduced into the calculation of the kernel CCA-based correlation. This approach addresses the conventional problem of not being able to calculate the correlation from multimedia data that have various time lengths. Therefore, the proposed method can recommend the best matched music pieces for a target human motion from the obtained correlation. Experimental results are shown to verify the performance of the proposed method.
1 Introduction
With the popularization of online digital media stores, users can obtain various kinds of multimedia data. Therefore, technologies for retrieving and recommending desired contents are necessary to satisfy the various demands of users. A number of methods for content-based multimedia retrieval and recommendation^{a} have been proposed. Image recommendation [1–3], music recommendation [4–6], and video recommendation [7, 8] have been intensively studied in several fields. It should be noted that most of these previous works were constrained in that the query examples and the returned results had to be of the same media type. However, due to diversification of users' demands, there is a need for a new type of multimedia recommendation in which the media types of the query examples and the returned results can be different. Thus, several recommendation methods [9–12] realizing such schemes have been proposed. Generally, they are called cross-media recommendation. In the conventional methods of cross-media recommendation, the query examples and recommended results need not be of the same media type. For example, users can search for music pieces by submitting either an image example or a music example.
Among the conventional methods of cross-media recommendation, Li et al. proposed a method for recommendation between images and music pieces by comparing their features directly using a dynamic time warping algorithm [9]. Furthermore, Zhang et al. proposed a method for cross-media recommendation between multimedia documents based on a semantic graph [11, 12]. A multimedia document (MMD) is a collection of co-existing heterogeneous multimedia objects that have the same semantics. For example, an educational web page with instructive text, images and audio is an MMD. By these conventional methods, users can search for their desired contents more flexibly and effectively.
It should be noted that the above conventional methods concentrate on recommendation between different types of multimedia data. Thus, in this scheme, users are forced to provide query multimedia data, although there is no limitation on media types. This means that users must make some decisions to provide queries, which makes it difficult to reflect their demands. If recommendation of multimedia data from features directly obtained from users is realized, one feasible solution can be provided to overcome this limitation. Specifically, we show the following two example applications: (i) background music selection from humans' dance motions for non-edited video contents^{b} and (ii) presentation of music information from features of target music pieces or dance motions. In the first example, using the relationship obtained between dance motions and music pieces in a database, we can find matched music pieces from human motions in video contents, and vice versa. This should be useful for creating a new dance program with background music or a music promotional video with dance motions. For example, given human motions of a classic ballet program, we can assign music pieces matched to the target human motions; this example is used for verification in the experiment section. In the second example, the application can present to users information about the music that they are listening to, i.e., song title, composer, etc. Users can use sounds of music pieces or their own dance motions associated with the music as the query for obtaining information on the music. As described above, this application can also use the relationship between human motions and music pieces, and it can be a more flexible information presentation system than the conventional ones. In this way, information directly obtained from users, i.e., users' motions, has the potential to provide various benefits. These schemes are cross-media recommendation schemes, and they remove barriers between users and multimedia contents.
In this article, we deal with recommendation of music pieces from features obtained from users. Among these features, human motions have high-level semantics, and their use is effective for realizing accurate recommendation. Therefore, we try to estimate suitable music pieces from human motions. This is because we consider that correlation extraction between human motions and music pieces becomes feasible using specific video contents such as dance and music promotional videos. This benefit is also useful in performance verification. We assume that "suitable" means emotionally similar. Specifically, for our purpose, recommending suitable music pieces according to human motions means that the recommended music pieces are emotionally similar to the query human motions.
In this article, we propose a new method for cross-media recommendation of music pieces according to human motions based on kernel canonical correlation analysis (CCA) [13]. We use video contents in which video sequences and audio signals contain human motions and music pieces, respectively, as training data for calculating their correlation. Then, using the obtained correlation, estimation of the best matched music piece from a target human motion becomes feasible. It should be noted that several methods of cross-media recommendation have previously been proposed. However, there have been no methods focused on handling data that have various time lengths, i.e., human motions and music pieces. Thus, we propose a cross-media recommendation method that can effectively use characteristics of time series, and we assume that this can be realized using kernel CCA and our defined kernel functions. From the above discussion, the main contribution of the proposed method is handling data that have various time lengths for cross-media recommendation.
In this approach, we have to consider the differences in time lengths. In the proposed method, new kernel functions of human motions and music pieces are introduced into the CCA-based correlation calculation. Specifically, we newly adopt two types of kernel functions, which can represent similarities by effectively using human motions or music pieces having various time lengths, for the kernel CCA-based correlation calculation. First, we define a longest common subsequence (LCSS) kernel for using data having different time lengths. Since the LCSS [14] is commonly used for motion comparison, the LCSS kernel should be suitable for our purpose. It should be noted that kernel functions must satisfy Mercer's theorem [15], but our newly defined kernel function does not necessarily satisfy this theorem. Therefore, we also adopt another type of kernel function, the spectrum intersection kernel, that satisfies Mercer's theorem. This function introduces the p-spectrum [16] and is based on the histogram intersection kernel [17]. Since the histogram intersection kernel is known as a function that satisfies Mercer's theorem, the spectrum intersection kernel also satisfies this theorem.
In practice, there are kernel functions that do not satisfy Mercer's theorem, and several methods that use such kernel functions have been proposed. The effectiveness of these methods has also been verified. Thus, we should also verify the effectiveness of our defined kernel function that does not satisfy Mercer's theorem, i.e., the LCSS kernel. In addition, we should compare our two newly defined kernel functions experimentally. Therefore, in this article, we introduce two types of kernel functions. Using these two types of kernel functions, the proposed method can directly compare multimedia data that have various time lengths, and this is the main advantage of our method. Thus, the use of these kernel functions provides a solution to the problem of not being able to simply apply sequential data such as human motions and music pieces to cross-media recommendation. Consequently, effective modeling of the relationship using music and human motion data that have various time lengths is realized, and successful music recommendation can be expected.
This article is organized as follows. First, in Section 2, we briefly explain the kernel CCA used for calculating the correlation between human motions and music pieces. Next, in Section 3, we describe our two newly defined kernel functions. Kernel CCA-based music recommendation according to human motion is proposed in Section 4. Experimental results that verify the performance of the proposed method are shown in Section 5. Finally, conclusions are given in Section 6.
2 Kernel canonical correlation analysis
In this section, we explain kernel CCA. First, two variables x and y are transformed into Hilbert spaces H_{x} and H_{y} via nonlinear maps ϕ_{x} and ϕ_{y}. From the mapped results ϕ_{x}(x) ∈ H_{x} and ϕ_{y}(y) ∈ H_{y},^{c} the kernel CCA seeks to maximize the correlation
\rho =\frac{E\left[uv\right]}{\sqrt{E\left[{u}^{2}\right]E\left[{v}^{2}\right]}} (1)
between
u=\langle a,{\varphi}_{x}\left(x\right)\rangle (2)
and
v=\langle b,{\varphi}_{y}\left(y\right)\rangle (3)
over the projection directions a and b. This means that kernel CCA finds the directions a and b that maximize the correlation E\left[uv\right] of the corresponding projections subject to E\left[{u}^{2}\right]=1 and E\left[{v}^{2}\right]=1.
The optimal directions a and b can be found by solving the Lagrangian
where η is a regularization parameter. The above computation scheme is called regularized kernel CCA [13]. By taking the derivatives of Equation 4 with respect to a and b, λ_{1} = λ_{2} (= λ) is derived, and the directions a and b maximizing the correlation ρ (= λ) can be calculated.
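For reference, a Lagrangian of the kind described here is often written as sketched below, with u and v denoting the projections of ϕ_{x}(x) and ϕ_{y}(y) onto a and b. The symbol L_{0} and the exact form and sign of the regularization term are assumptions based on the usual regularized kernel CCA formulation; conventions differ between references.

```latex
L_{0} = E[uv]
      - \frac{\lambda_{1}}{2}\left(E[u^{2}] - 1\right)
      - \frac{\lambda_{2}}{2}\left(E[v^{2}] - 1\right)
      + \frac{\eta}{2}\left(\lVert a \rVert^{2} + \lVert b \rVert^{2}\right)
```

In this form, λ_{1} and λ_{2} are the multipliers of the two unit-variance constraints, and η controls how strongly large-norm directions are regularized.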
3 Kernel function construction
Construction of new kernel functions is described in this section. The proposed method constructs two types of kernel functions for human motions and music pieces, respectively. First, we introduce an LCSS kernel as a kernel function that does not satisfy Mercer's theorem. This function is based on the LCSS algorithm [18], which is commonly used for motion or temporal music signal comparison since the LCSS algorithm can compare two temporal signals even if they have different time lengths. Therefore, this kernel function seems suitable for our recommendation scheme. On the other hand, we also introduce a spectrum intersection kernel that satisfies Mercer's theorem. This function is based on the p-spectrum [16], which is generally used for text comparison. The p-spectrum uses the continuity of words. This property is also useful for analyzing the structure of temporal sequential data, i.e., human motions. Thus, the spectrum intersection kernel is also suitable for our recommendation scheme.
For the following explanation, we prepare pairs of human motions and music pieces extracted from the same video contents and denote each pair as a segment. The segments are defined as short portions of video contents that have various time lengths. From the obtained segments, we extract human motion features and music features of the j-th (j = 1, 2, ..., N) segment as V_{j} = [v_{j}(1), v_{j}(2), ..., v_{j}(N_{v_{j}})] and M_{j} = [m_{j}(1), m_{j}(2), ..., m_{j}(N_{M_{j}})], where N_{v_{j}} and N_{M_{j}} are the numbers of components of V_{j} and M_{j}, respectively, and N is the number of segments. In V_{j} and M_{j}, v_{j}(l_{v}) (l_{v} = 1, 2, ..., N_{v_{j}}) and m_{j}(l_{m}) (l_{m} = 1, 2, ..., N_{M_{j}}) correspond to optical flows [19] and chroma vectors [20], respectively. The optical flow is a simple and representative feature that represents motion characteristics between two successive frames in video sequences and is commonly used for motion comparison. Thus, we adopt the optical flow as temporal components of human motion features. Furthermore, the chroma vector represents the tone distribution of music signals at each time. The chroma vector can represent the characteristics of a music signal robustly if it is extracted over a short time. In addition, due to the simplicity of the implementation, we adopted these features in our method. More details of these features are given in Appendices A.1 and A.2.
3.1 Kernel function for human motions
3.1.1 LCSS kernel
In order to define kernel functions for human motions having various time lengths, we first explain the LCSS kernel for human motions, which uses the LCSS-based similarity in [14]. The LCSS algorithm calculates the longest common part of two sequences and its length (LCSS length).
Figure 1 shows an example of the table produced when calculating the LCSS length of two sequences X = 〈B, D, C, A, B〉 and Y = 〈A, B, C, B, A, B〉. In this figure, the highlighted components represent the common components of the two sequences, and the LCSS length between X and Y is four.
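The table-based calculation of the LCSS length can be sketched with standard dynamic programming. This is a minimal illustration on symbol sequences, not the authors' implementation; the function name `lcss_length` is hypothetical.

```python
def lcss_length(x, y):
    """Length of the longest common subsequence of sequences x and y."""
    # table[i][j] holds the LCSS length of the prefixes x[:i] and y[:j].
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, start=1):
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                # Matching components extend the common subsequence.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise keep the best result obtained by dropping one component.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]


X = ["B", "D", "C", "A", "B"]
Y = ["A", "B", "C", "B", "A", "B"]
print(lcss_length(X, Y))  # 4, as in the example above
```

One common subsequence of length four here is 〈B, C, A, B〉. The table has (|X| + 1) × (|Y| + 1) entries, so sequences of different lengths are handled without any resampling.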
Here, we show the definition of similarity between human motion features. For the following explanations, we denote two human motion features as V_{a} = [v_{a}(1), v_{a}(2), ..., v_{a}(N_{v_{a}})] and V_{b} = [v_{b}(1), v_{b}(2), ..., v_{b}(N_{v_{b}})], where v_{a}(l_{a}) (l_{a} = 1, 2, ..., N_{v_{a}}) and v_{b}(l_{b}) (l_{b} = 1, 2, ..., N_{v_{b}}) are components of V_{a} and V_{b}, respectively, and N_{v_{a}} and N_{v_{b}} are the numbers of components in V_{a} and V_{b}, respectively. In addition, v_{a}(l_{a}) and v_{b}(l_{b}) correspond to optical flows extracted in each frame of each video sequence. Note that N_{v_{a}} and N_{v_{b}} depend on the time lengths of their segments; that is, they depend on the number of frames of their video sequences. The similarity between V_{a} and V_{b} is defined as follows:
where LCSS(V_{ a },V_{ b }) is the LCSS length of V_{ a }and V_{ b }, and it is recursively defined as
where c(·) is the cluster number of an optical flow. In the proposed method, we apply a k-means algorithm [21] to all optical flows obtained from all segments, and the cluster number c(·) assigned to each optical flow is used for easy comparison of two different optical flows. For this purpose, other kinds of quantization or labeling of the temporal variation of the time series could also be used. In the proposed method, we adopt k-means clustering for its simplicity.
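The labeling step can be sketched as follows: a basic Lloyd-style k-means assigns a cluster number to every feature vector, and these numbers play the role of c(·). The function name `kmeans_labels`, the toy two-dimensional vectors, and the deterministic initialization from the first k vectors are assumptions made for illustration only.

```python
def kmeans_labels(vectors, k, iterations=20):
    """Assign each vector a cluster number with a basic Lloyd-style k-means."""
    # Assumption: the first k vectors serve as initial centroids (deterministic).
    centroids = [list(v) for v in vectors[:k]]

    def nearest(v):
        # Index of the centroid with the smallest squared Euclidean distance.
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
        return dists.index(min(dists))

    labels = [nearest(v) for v in vectors]
    for _ in range(iterations):
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = [nearest(v) for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels


# Toy stand-ins for optical-flow vectors (hypothetical values).
flows = [(0.0, 0.1), (5.0, 5.1), (0.1, 0.0), (5.1, 4.9), (0.2, 0.1)]
labels = kmeans_labels(flows, k=2)
print(labels)  # [0, 1, 0, 1, 0]
```

Once every optical flow carries a cluster number, two motion sequences of different lengths become two label sequences, which can be compared component by component with a simple equality test.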
We then define this similarity measure as the LCSS kernel for human motions {\kappa}_{v}^{LCSS}\left(\cdot ,\cdot \right) as follows:
The above kernel function can be used for time series having various time lengths. Our LCSS kernel, like several other kernel functions, is not necessarily positive semidefinite and therefore does not strictly satisfy Mercer's theorem [15]. Fortunately, kernel functions that do not satisfy Mercer's theorem have been verified to be effective for classification of sequential data in [18].
Furthermore, several methods using kernel functions that do not satisfy the theorem have been proposed in [22, 23]. Also, the sigmoid kernel is commonly used and is well known as a kernel function that does not satisfy Mercer's theorem. We therefore briefly discuss implications and problems that might emerge when using a kernel function that does not satisfy the theorem. In order to satisfy Mercer's theorem, the Gram matrix, whose elements are the values of the kernel function, is required to be symmetric and positive semidefinite. Our defined kernel function, like other kernel functions that do not satisfy Mercer's theorem, yields Gram matrices that are symmetric but not necessarily positive semidefinite. Thus, several methods based on such kernel functions have modified the eigenvalues of the Gram matrices to be greater than or equal to zero. It should be noted that we used our defined kernel functions directly in the proposed method.
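The eigenvalue modification mentioned above can be illustrated with a short NumPy sketch that sets negative eigenvalues of a symmetric Gram matrix to zero. This is one possible variant of such a correction, shown only for illustration; it is not a step of the proposed method, which uses the kernel values directly. The function name `clip_to_psd` and the example matrix are hypothetical.

```python
import numpy as np


def clip_to_psd(gram):
    """Zero out the negative eigenvalues of a symmetric Gram matrix."""
    sym = (gram + gram.T) / 2.0             # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(sym)  # eigendecomposition for symmetric input
    eigvals = np.clip(eigvals, 0.0, None)   # negative eigenvalues -> 0
    # Rebuild the matrix: column j of eigvecs is scaled by eigvals[j].
    return (eigvecs * eigvals) @ eigvecs.T


# A symmetric similarity matrix that is not positive semidefinite.
K = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.9],
              [0.1, 0.9, 1.0]])
K_psd = clip_to_psd(K)
print(np.linalg.eigvalsh(K).min() < 0)          # True: K has a negative eigenvalue
print(np.linalg.eigvalsh(K_psd).min() > -1e-10) # True: the corrected matrix does not
```

The correction changes the similarity values themselves, which is one reason a method may prefer to use an indefinite kernel as it is and verify its effectiveness experimentally.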
3.1.2 Spectrum intersection kernel
Next, we explain the spectrum intersection kernel for human motions. In order to define the spectrum intersection kernel for human motions, we first calculate p-spectrum-based features. The p-spectrum [16] of a string is the set of all length-p (contiguous) subsequences that it contains. The p-spectrum-based features on a string \mathcal{X} are indexed by all possible subsequences {\mathcal{X}}_{s} of length p and defined as follows:
where
and \mathcal{A} is the set of characters in strings. For human motion features, we cannot apply the p-spectrum directly since human motion features are defined as sequences of vectors. Therefore, we apply the p-spectrum to sequences of cluster numbers of optical flows, as is done for the LCSS kernel. We use the histogram intersection kernel [17] for constructing the spectrum intersection kernel. The histogram intersection kernel κ^{HI}(·, ·) is a useful kernel function for classification of histogram-shaped features and is defined as follows:
where h_{a} and h_{b} are histogram-shaped features, h_{a}(i_{h}) and h_{b}(i_{h}) are the i_{h}-th element (bin) values of h_{a} and h_{b}, respectively, and N^{h} is the number of bins of the histogram-shaped features. Furthermore, \sum_{i_{h}=1}^{N^{h}} h_{a}(i_{h}) = 1 and \sum_{i_{h}=1}^{N^{h}} h_{b}(i_{h}) = 1 are required to apply the histogram intersection kernel to h_{a} and h_{b}. The p-spectrum-based features also have histogram shapes, and the histogram intersection kernel can be applied to them. Note that the sums of elements have to be normalized in the same way as is done for histogram-shaped features. After that, we define this kernel function as the spectrum intersection kernel for human motions \kappa_{v}^{SI}(·, ·) shown as follows:
The above kernel function can consider statistical characteristics of human motion features. Since the histogram intersection kernel is positive semidefinite [17], the spectrum intersection kernel can satisfy Mercer's theorem [15]. Note that the above kernel function is equivalent to the spectrum kernel defined in [16] if we use the simple inner product of p-spectrum-based features instead of the histogram intersection in Equation 12.
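The construction can be sketched as follows: count all contiguous length-p subsequences of a cluster-number sequence, normalize the counts so that they sum to one, and take the sum of bin-wise minima of two such histograms. The function names and the choice p = 2 are assumptions made for illustration.

```python
from collections import Counter


def p_spectrum_histogram(labels, p):
    """Normalized counts of all contiguous length-p subsequences of labels."""
    grams = [tuple(labels[i:i + p]) for i in range(len(labels) - p + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    # Normalize so that the bin values sum to one (empty if the sequence is too short).
    return {g: c / total for g, c in counts.items()} if total else {}


def spectrum_intersection_kernel(labels_a, labels_b, p=2):
    """Histogram intersection of two normalized p-spectrum histograms."""
    h_a = p_spectrum_histogram(labels_a, p)
    h_b = p_spectrum_histogram(labels_b, p)
    # Bins absent from either histogram contribute min(., 0) = 0 and are skipped.
    return sum(min(h_a[g], h_b[g]) for g in h_a.keys() & h_b.keys())


a = [0, 1, 0, 1, 2]   # hypothetical cluster-number sequences
b = [0, 1, 2, 2]      # of different lengths
k_ab = spectrum_intersection_kernel(a, b)
print(round(k_ab, 3))  # 0.583, i.e. 1/3 + 1/4
```

Because the histograms are normalized, the kernel value lies between 0 and 1 regardless of the sequence lengths, and a sequence compared with itself yields 1.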
3.2 Kernel function for music pieces
3.2.1 LCSS kernel
The kernel functions for music pieces are defined in the same way as those for human motions. First, we show the definition of the LCSS kernel for music pieces. For the following explanations, we denote two music features as M_{a} = [m_{a}(1), m_{a}(2), ..., m_{a}(N_{M_{a}})] and M_{b} = [m_{b}(1), m_{b}(2), ..., m_{b}(N_{M_{b}})], where M_{a} and M_{b} are chromagrams [24] extracted from segments, m_{a}(l_{a}) (l_{a} = 1, 2, ..., N_{M_{a}}) and m_{b}(l_{b}) (l_{b} = 1, 2, ..., N_{M_{b}}) are components of M_{a} and M_{b}, and N_{M_{a}} and N_{M_{b}} are the numbers of components of M_{a} and M_{b}, respectively. In addition, m_{a}(l_{a}) and m_{b}(l_{b}) are chroma vectors [20] that have 12 dimensions. Since N_{M_{a}} and N_{M_{b}} depend on the time lengths of their segments, the similarity between music features is also defined on the basis of the LCSS algorithm. Note that it is desirable that the similarity between an original music piece and its modulated version becomes high since they have similar melodies, bass lines, or harmonics. Therefore, we define a similarity that considers the modulation of music. In the proposed method, we use temporal sequences of chroma vectors, i.e., chromagrams defined in [24], as music features. One of the advantages of using 12-dimensional chroma vectors in the chromagrams is that the transposition amount of a modulation can be naturally represented simply by the amount ζ by which the 12 elements are shifted (rotated). Therefore, the proposed method uses this characteristic for measuring similarities between chromagrams.
For the following explanation, we define the modulated chromagram M_{b}^{ζ} = [m_{b}^{ζ}(1), m_{b}^{ζ}(2), ..., m_{b}^{ζ}(N_{M_{b}})]. Note that m_{b}^{ζ}(l_{b}) (l_{b} = 1, 2, ..., N_{M_{b}}) represents a modulated chroma vector whose elements are shifted by the amount ζ.
The similarity between M_{ a }and M_{ b }is defined as follows:
where LCSS(M_{a}, M_{b}^{ζ}) is recursively defined as
where T_{h} (= 0.8) is a positive constant for determining the fitness between two different chroma vectors, Sim_{τ}{·, ·} is a similarity between chroma vectors defined in [20], \tilde{m}_{a}(l_{a}) and \tilde{m}_{b}^{ζ}(l_{b}) are normalized chroma vectors, m_{a,τ}(l_{a}) and m_{b,τ}^{ζ}(l_{b}) are elements of the chroma vectors, and τ corresponds to tone, i.e., "C", "D#", "G#", etc. Note that the effectiveness of Sim_{τ}{·, ·} is verified in [20]. We then define this similarity as the LCSS kernel for music pieces \kappa_{M}^{LCSS}(·, ·) described as follows:
3.2.2 Spectrum intersection kernel
Next, we explain the spectrum intersection kernel for music pieces. In order to define the spectrum intersection kernel for music pieces, we first calculate p-spectrum-based features in the same way as for human motions. It should be noted that the proposed method cannot calculate the p-spectrum from music features directly since the music features are defined as sequences of vectors. Therefore, we transform all of the vector components of music features into characters, such as alphabetic letters or numbers, based on a hierarchical clustering algorithm, where the characters correspond to cluster numbers. For clustering the vector components, the modulation of music should also be considered in the same way as for the LCSS kernel for music pieces. Therefore, clustering that considers modulation is necessary. The procedure of this scheme is as follows.
Step 1: Calculation of optimal modulation amounts between music features
First, the proposed method calculates the optimal modulation amount ζ^{ab} between two music features M_{a} and M_{b}. This scheme is based on the LCSS-based similarity and is defined as follows:
The optimal modulation amount ζ^{ab} is calculated for all pairs.
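Step 1 can be sketched as a search over the 12 possible rotations of the chroma vectors, keeping the rotation that gives the largest LCSS length. Since the sequence lengths do not depend on ζ, maximizing the LCSS length also maximizes any length-normalized LCSS similarity. In this sketch the cosine similarity is only a stand-in for Sim_{τ}{·, ·} of [20], the rotation direction is a convention chosen for illustration, and all function names are hypothetical.

```python
import math


def rotate(chroma, zeta):
    """Cyclically shift a chroma vector (list) so that element i moves to i + zeta."""
    zeta %= len(chroma)
    return chroma[-zeta:] + chroma[:-zeta] if zeta else list(chroma)


def cosine(u, v):
    # Stand-in similarity (assumption); the article uses Sim_tau from [20].
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0


def lcss_thresholded(seq_a, seq_b, threshold=0.8):
    """LCSS length where two chroma vectors match if their similarity exceeds threshold."""
    table = [[0] * (len(seq_b) + 1) for _ in range(len(seq_a) + 1)]
    for i, u in enumerate(seq_a, start=1):
        for j, v in enumerate(seq_b, start=1):
            if cosine(u, v) > threshold:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def optimal_modulation(chroma_a, chroma_b):
    """Shift in 0..11 applied to chroma_b that maximizes the LCSS length with chroma_a."""
    scores = [lcss_thresholded(chroma_a, [rotate(v, z) for v in chroma_b])
              for z in range(12)]
    return scores.index(max(scores))  # the smallest shift wins in case of ties


def one_hot(tone):
    return [1.0 if i == tone else 0.0 for i in range(12)]


melody_a = [one_hot(0), one_hot(4), one_hot(7)]   # C, E, G
melody_b = [one_hot(2), one_hot(6), one_hot(9)]   # the same pattern two semitones higher
print(optimal_modulation(melody_a, melody_b))     # 10, i.e. a shift of -2 modulo 12
```

The threshold default of 0.8 follows the value T_{h} given in the text; everything else about the similarity is illustrative.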
Step 2: Similarity measurement between chroma vectors using the obtained optimal modulation amounts
The similarity between vector components, i.e., between chroma vectors, is calculated using the obtained optimal modulation amounts. For example, the similarity between chroma vectors m_{a}(l_{a}) and m_{b}(l_{b}), which are the l_{a}-th and l_{b}-th components of two arbitrary music features M_{a} and M_{b}, respectively, is calculated using the obtained optimal modulation amount ζ^{ab} and Equation 16 as follows:
The above similarity is calculated between two different chroma vectors for all music features.
Step 3: Clustering chroma vectors based on the obtained similarities
Using the obtained similarities, the two most similar chroma vectors are assigned to the same cluster. This scheme is based on the single linkage method [25]. The merging is performed repeatedly until the number of clusters becomes less than K_{M}.
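Step 3 can be sketched as agglomerative clustering on a precomputed similarity matrix, where the similarity of two clusters is that of their most similar pair of members (single linkage). The stopping rule follows the description above (merge until fewer than K_{M} clusters remain); the function name and the toy similarity matrix are hypothetical.

```python
def single_linkage(similarity, max_clusters):
    """Merge the most similar clusters until fewer than max_clusters remain.

    similarity: symmetric N x N list of lists; returns one cluster number per item.
    """
    clusters = [[i] for i in range(len(similarity))]
    while len(clusters) >= max_clusters and len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster similarity = most similar member pair.
                link = max(similarity[i][j]
                           for i in clusters[a] for j in clusters[b])
                if best is None or link > best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    labels = [0] * len(similarity)
    for number, members in enumerate(clusters):
        for i in members:
            labels[i] = number
    return labels


# Toy similarities between four chroma vectors (hypothetical values).
S = [[1.0, 0.9, 0.1, 0.1],
     [0.9, 1.0, 0.1, 0.1],
     [0.1, 0.1, 1.0, 0.8],
     [0.1, 0.1, 0.8, 1.0]]
print(single_linkage(S, max_clusters=3))  # [0, 0, 1, 1]
```

This direct implementation re-scans all cluster pairs at every merge, which is adequate for a small illustration but slow for large numbers of chroma vectors.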
Using the clustering results, the proposed method calculates transformed music features m_{j}^{*} = [m_{j}^{*}(1), m_{j}^{*}(2), ..., m_{j}^{*}(N_{M_{j}})]' (j = 1, 2, ..., N), where m_{j}^{*}(l_{M}) (l_{M} = 1, 2, ..., N_{M_{j}}) is the cluster number assigned to the corresponding chroma vector. Note that vector/matrix transpose is denoted by the superscript ' in this article. The proposed method then calculates p-spectrum-based features from m_{j}^{*}. For the following explanations, we denote two transformed music features as m_{a}^{*} = [m_{a}^{*}(1), m_{a}^{*}(2), ..., m_{a}^{*}(N_{M_{a}})]' and m_{b}^{*} = [m_{b}^{*}(1), m_{b}^{*}(2), ..., m_{b}^{*}(N_{M_{b}})]', where m_{a}^{*} and m_{b}^{*} are vectors transformed from M_{a} and M_{b}, respectively, and m_{a}^{*}(l_{a}) (l_{a} = 1, 2, ..., N_{M_{a}}) and m_{b}^{*}(l_{b}) (l_{b} = 1, 2, ..., N_{M_{b}}) are the cluster numbers assigned to m_{a}(l_{a}) and m_{b}(l_{b}), respectively. Then, the spectrum intersection kernel for music pieces is calculated in the same way as that for human motions and is defined as follows:
4 Kernel CCA-based music recommendation according to human motion
A method for recommending music pieces suitable for human motions is presented in this section. An overview of the proposed method is shown in Figure 2. In our cross-media recommendation method, pairs of human motions and music pieces that have a close relationship are necessary for effective correlation calculation. Therefore, we prepare these pairs extracted from the same video contents as segments. From the obtained segments, we extract human motion features and music features. More details of these features are given in Appendices A.1 and A.2. By applying kernel CCA to the features of human motions and music pieces, the proposed method calculates their correlation. In this approach, we define new kernel functions that can be used for data having various time lengths and introduce them into the kernel CCA.
Therefore, the proposed method can calculate the correlations by considering their sequential characteristics. Then, effective modeling of the relationship using human motions and music pieces having various time lengths is realized, and successful music recommendation can be expected.
First, we define the features of V_{ j }and M_{ j }(j = 1, 2,..., N) in the Hilbert space as ϕ_{ v }(vec[V_{ j }]) and ϕ_{ M }(vec[M_{ j }]), where vec[·] is the vectorization operator that turns a matrix into a vector. Next, we find features
where \bar{\phi}_{V} and \bar{\phi}_{M} are the mean vectors of ϕ_{v}(vec[V_{j}]) and ϕ_{M}(vec[M_{j}]) (j = 1, 2, ..., N), respectively. The matrices A and B are coefficient matrices whose columns a_{d} and b_{d} (d = 1, 2, ..., D), respectively, correspond to the projection directions in Equations 2 and 3, where the value D is the dimension of A and B. Then, we define a correlation matrix Λ whose diagonal elements are the correlation coefficients λ_{d} (d = 1, 2, ..., D). The details of the calculation of A, B, and Λ are shown as follows.
In order to obtain A, B, and Λ, we use the regularized kernel CCA shown in the previous section. Note that the optimal matrices A and B are given by
where E_{V} = [e_{V_{1}}, e_{V_{2}}, ..., e_{V_{D}}] and E_{M} = [e_{M_{1}}, e_{M_{2}}, ..., e_{M_{D}}] are N × D matrices. Furthermore,
is a centering matrix, where I is the N × N identity matrix, and 1 = [1,..., 1]' is an N × 1 vector. From Equations 27 and 28, the following equations are satisfied:
Then, by calculating the optimal solutions e_{V_{d}} and e_{M_{d}} (d = 1, 2, ..., D), A and B are obtained. In the same way as Equation 4, we calculate the optimal solutions e_{V_{d}} and e_{M_{d}} that maximize
where e_{V}, e_{M}, and λ correspond to e_{V_{d}}, e_{M_{d}}, and λ_{d}, respectively. In the above equation, L, M, and P are calculated as follows:
Furthermore, η_{1} and η_{2} are regularization parameters, and K_{V} (= Ξ_{V}' Ξ_{V}) and K_{M} (= Ξ_{M}' Ξ_{M}) are matrices whose elements are the values of the corresponding kernel functions defined in Section 3. By taking derivatives of Equation 34 with respect to e_{V} and e_{M}, the optimal e_{V}, e_{M}, and λ can be obtained as solutions of the following eigenvalue problems:
where λ is obtained as an eigenvalue, and the vectors e_{V} and e_{M} are obtained as the corresponding eigenvectors. Then, the d-th (d = 1, 2, ..., D) eigenvalue becomes λ_{d}, where λ_{1} ≥ λ_{2} ≥ ... ≥ λ_{D}. Note that the dimension D is set to a value for which the cumulative proportion obtained from λ_{d} (d = 1, 2, ..., D) becomes larger than a threshold. Furthermore, the eigenvectors e_{V} and e_{M} corresponding to λ_{d} become e_{V_{d}} and e_{M_{d}}, respectively.
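To make the computation more concrete, the following NumPy sketch solves one commonly used form of regularized kernel CCA on two Gram matrices. It is not the exact formulation of this article: it uses a single regularization parameter `eta` instead of η_{1} and η_{2}, a centering matrix of the form I − (1/N)11', and an eigenproblem in the style of standard regularized kernel CCA, and it assumes positive semidefinite Gram matrices. The function name `regularized_kcca` is hypothetical.

```python
import numpy as np


def regularized_kcca(K_v, K_m, eta=0.1):
    """Solve one common form of regularized kernel CCA on two N x N Gram matrices.

    Returns the correlations in descending order and coefficient matrices E_v, E_m.
    """
    n = K_v.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix I - (1/N) 1 1'
    Kv, Km = H @ K_v @ H, H @ K_m @ H     # centered Gram matrices
    reg = eta * np.eye(n)
    # Eigenproblem (Kv + eta I)^-1 Km (Km + eta I)^-1 Kv e_v = lambda^2 e_v
    M = np.linalg.solve(Kv + reg, Km) @ np.linalg.solve(Km + reg, Kv)
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(-eigvals.real)     # sort by decreasing eigenvalue
    lam = np.sqrt(np.clip(eigvals.real[order], 0.0, None))
    E_v = eigvecs.real[:, order]
    # Matching coefficients: e_m = (Km + eta I)^-1 Kv e_v / lambda (column-wise)
    E_m = np.linalg.solve(Km + reg, Kv @ E_v) / np.where(lam > 1e-12, lam, 1.0)
    return lam, E_v, E_m


# Toy check: two identical linear-kernel Gram matrices give a correlation close to 1.
x = np.linspace(-1.0, 1.0, 20)
K = np.outer(x, x)
lam, E_v, E_m = regularized_kcca(K, K, eta=0.1)
print(lam[0] > 0.9)  # True
```

In a setting like the one described here, `K_v` and `K_m` would hold the kernel values between all pairs of training segments, and the number of retained columns D would be chosen from the sorted `lam` values, for example by a cumulative-proportion threshold.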
From the obtained matrices A, B, and Λ, we can estimate the optimal music features from given human motion features, i.e., we can select the best matched music pieces according to human motions. An overview of music recommendation is shown in Figure 3. When a human motion feature V_{in} is given, we select a predetermined number of music pieces according to the query human motion that minimize the following distances:
where t_{in} and \hat{t}_{i} are, respectively, the query human motion feature and the music features in the database \hat{M}_{i} (i = 1, 2, ..., M_{t}) transformed into the same feature space as follows:
and M_{t} is the number of music pieces in the database. Note that \kappa_{V_{in}} is an N × 1 vector whose q-th element is \kappa_{V}^{LCSS}(V_{in}, V_{q}) or \kappa_{V}^{SI}(V_{in}, V_{q}), and \kappa_{\hat{M}_{i}} is an N × 1 vector whose q-th element is \kappa_{M}^{LCSS}(\hat{M}_{i}, M_{q}) or \kappa_{M}^{SI}(\hat{M}_{i}, M_{q}).
As described above, we can estimate the best matched music pieces according to the human motions. The proposed method calculates the correlation between human motions and music pieces based on the kernel CCA. Then, the proposed method introduces the kernel functions that can be used for time series having various time lengths based on the LCSS or p-spectrum. Therefore, the proposed method enables calculation of the correlation between human motions and music pieces that have various time lengths. Furthermore, effective correlation calculation and successful music recommendation according to human motion based on the obtained correlation are realized.
5 Experimental results
The performance of the proposed method is verified in this section. In the experiments, we used video contents of three classic ballet programs, from which 170 segments were manually extracted. Of the 170 segments, 44 were from Nutcracker, 54 were from Swan Lake, and 72 were from Sleeping Beauty. Each segment consisted of only one human motion, the background music did not change within the segment, and no camera change was included in the segment. The audio signals in each segment were mono channel, 16 bits per sample, and sampled at 44.1 kHz. Human motion features and music features were extracted from the obtained segments.
For evaluation of the performance of our method, we used videos of classic ballet programs, although motions extracted from classic ballet programs differ somewhat from those in daily life. In cross-media recommendation, we have to consider whether or not we should recommend contents that have the same meanings as those of the queries. For example, when we recommend music pieces from the user's information, recommending sad music pieces is not always suitable even if the user seems to be sad. Our approach also has to consider this point. In this article, however, we focus on extracting the relationship between human motions and music pieces and perform the recommendation based on the extracted relationship. In addition, ground truths are needed for evaluation of the proposed method. Therefore, we used videos of classic ballet programs, since the human motions and music pieces extracted from the same videos had strong and direct relationships.
In order to evaluate the performance of our method, we also prepared five datasets, #1 to #5, each of which was a pair of 100 segments for training (training segments) and 70 segments for testing (testing segments), i.e., a simple cross-validation scheme. The training segments and testing segments of each dataset were obtained by randomly dividing the 170 segments described above. The reason for preparing five datasets was to perform various verifications by changing the combination of testing segments and training segments; the number of datasets (five) was simply determined. For the experiments, 12 kinds of tags representing expression marks in music, shown in Table 1, were used. We examined whether each candidate tag could be used for labeling both human motions and music pieces, and tags that seemed difficult to use for these two media types were removed, which left the 12 kinds of tags. One suitable tag was manually selected and annotated to each segment for performance verification. In the experiments, one person with musical experience annotated the label that best matched each segment. Generally, annotation should be performed by several people. However, since the labels were expression marks in music, the ground truths had to be made by a person who had knowledge of music. Thus, in the experiment, only one person annotated the labels.
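The dataset preparation can be sketched as repeated random splits. This sketch assumes that each of the five datasets is an independent random 100/70 split of the 170 segment indices; the function name `make_datasets` and the fixed seed are assumptions added for reproducibility, not details from the article.

```python
import random


def make_datasets(num_segments=170, num_train=100, num_datasets=5, seed=0):
    """Return a list of (training_indices, testing_indices) pairs.

    Each pair is an independent random split (an assumption about the
    article's procedure); `seed` is hypothetical.
    """
    rng = random.Random(seed)
    datasets = []
    for _ in range(num_datasets):
        indices = list(range(num_segments))
        rng.shuffle(indices)
        train = sorted(indices[:num_train])
        test = sorted(indices[num_train:])
        datasets.append((train, test))
    return datasets


datasets = make_datasets()
```

Every pair covers all 170 segments exactly once, with 100 training and 70 testing segments.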
First, we show the recommended results (see Additional file 1). In this file, we show original video contents and recommended video contents. The background music pieces of recommended video contents are not original but are music pieces recommended by our method. These results show that our method can recommend a suitable music piece for a human motion.
Additional file 1:Recommended results. Additional file 1.mov; Description of data: This video content shows our recommendation results. In this video content, original video contents and recommended results, whose video contents' background music are music pieces recommended by our method, are shown. (MOV 6 MB)
Next, we quantitatively verify the performance of the proposed method. In this simulation, we verify the effectiveness of our kernel functions. In the proposed method, we define two types of kernel functions, the LCSS kernel and the spectrum intersection kernel, for human motions and music pieces. Thus, we experimentally compare our two newly defined kernel functions. Using combinations of the kernel functions, we prepared four simulations, "Simulation 1" to "Simulation 4", as follows:

Simulation 1 used the LCSS kernel for both human motions and music pieces.

Simulation 2 used the spectrum intersection kernel for both human motions and music pieces.

Simulation 3 used the spectrum intersection kernel for human motions and the LCSS kernel for music pieces.

Simulation 4 used the LCSS kernel for human motions and the spectrum intersection kernel for music pieces.
These simulations were performed to verify the effectiveness of our two newly defined kernel functions for human motions and music pieces. For the following explanations, we denote the LCSS kernel as "LCSSK" and the spectrum intersection kernel as "SIK". In addition, for the experiments, we used the following criterion:
where the denominator corresponds to the number of testing segments. Furthermore, Q^{1}_{i_{1}} (i_{1} = 1, 2, ..., 70) is one if the tags of the three recommended music pieces include the tag of the human motion query. Otherwise, Q^{1}_{i_{1}} is zero. It should be noted that the number of recommended music pieces (three) was simply determined. We next explain how the number of recommended music pieces affects the performance of our method. For the following explanation, we define the terms "overrecommendation" and "misrecommendation". Overrecommendation means that the recommended results tend to contain music pieces that are not matched to the target human motions as well as matched music pieces, and misrecommendation means that music pieces that are matched to the target human motions tend not to be selected as the recommendation results. There is a trade-off between overrecommendation and misrecommendation. That is, if we increase the number of recommended results, overrecommendation increases and misrecommendation decreases. On the other hand, if we decrease the number of recommended results, overrecommendation decreases and misrecommendation increases. We evaluate the recommendation accuracy according to the above criterion. Figure 4 shows that the accuracy score of Simulation 1 was higher than the accuracy scores of the other simulations. This is because the LCSS kernel can effectively compare human motions, and likewise music pieces, that have different time lengths. Note that in these simulations, we used bigrams (p = 2) for calculating the p-spectrum-based features shown in Equation 9, the number of clusters for chroma vectors was set to K_{M} = 500, and the parameters in our method are shown in Tables 2, 3, 4 and 5. All of these parameters were empirically determined, and they were set to values that provide the highest accuracy. More details of parameter determination are given in Appendix B.
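Read from the definitions above, the criterion appears to be the fraction of testing segments whose tag occurs among the tags of the three recommended music pieces. A minimal sketch under that reading follows; the function name and the example tags other than "agitato" and "appassionato" are placeholders, not entries confirmed from Table 1.

```python
def accuracy_score_1(query_tags, recommended_tags):
    """Fraction of queries whose tag appears among the recommended tags.

    query_tags: one tag per testing segment.
    recommended_tags: for each testing segment, the tags of the
        recommended music pieces (three in the article).
    """
    hits = sum(1 for tag, recs in zip(query_tags, recommended_tags)
               if tag in recs)
    return hits / len(query_tags)


# Toy example with four queries ("tag_c" and "tag_d" are placeholders).
query_tags = ["agitato", "appassionato", "tag_c", "agitato"]
recommended_tags = [
    ["agitato", "tag_c", "tag_d"],        # hit
    ["tag_c", "tag_d", "agitato"],        # miss
    ["tag_c", "tag_c", "appassionato"],   # hit
    ["tag_d", "agitato", "agitato"],      # hit
]
score = accuracy_score_1(query_tags, recommended_tags)
```

Raising the number of recommended pieces per query can only keep or increase this score, which is the trade-off with overrecommendation described above.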
In the following, we discuss the results obtained. First, we discuss the influence of our human motion features. The features used in our method are based on optical flow and are extracted between two regions that contain a human in two successive frames. These features can represent movements of arms, legs, hands, etc. However, they cannot represent global human movements, which are an important factor for representing motion characteristics of classic ballet. For accurate relationship extraction between human motions and music pieces, it is necessary to improve the human motion features so that they can also represent global human movement. This can be complemented using information obtained by much more accurate sensors such as Kinect.^{d}
Next, we discuss the experimental conditions. In the experiments with the proposed method, we used tags, i.e., expression marks in music, as ground truths, and one tag was annotated to each segment. However, this annotation scheme does not consider the relationship between tags. For example, in Table 1, "agitato" and "appassionato" have similar meanings. Thus, the choice of the 12 kinds of tags might not be suitable, and it might be necessary to reconsider the choice of tags. We also found that it is important to introduce the relationship between tags into our accuracy criteria. However, it is difficult to quantify the relationship between them. Thus, we used only one tag for each segment. The effect of this limitation is also suggested by the results of the subjective evaluation in the next experiment.
We also used comparative methods for verifying the performance of the proposed method. For the comparative methods, we replaced the kernel functions with the Gaussian kernel \kappa^{GK}(x, y) = \exp\left(-\frac{\|x - y\|^{2}}{2\sigma^{2}}\right) (GK), the sigmoid kernel κ^{SK}(x, y) = tanh(α x'y + β) (SK), and the linear kernel κ^{LK}(x, y) = x'y (LK). In this experiment, we set the parameters σ (= 5.0), α (= 5.0), and β (= 3.0). It should be noted that these kernel functions cannot be applied to our human motion features and music features directly since the features have various dimensions, i.e., they are sequences of different lengths. Therefore, we simply used the time average of the optical flow-based vectors, v_{j}^{avg}, for human motion features and the time average of the chroma vectors, m_{j}^{avg}, for music features. Then, we applied the above three types of kernel functions to the obtained features. Figure 5 shows the results of the comparison for each kernel function. These results show that our kernel functions are more effective than the other kernel functions. The results also show that it is important to consider the temporal characteristics of the data, and that our kernel functions can successfully consider these characteristics. Note that in this comparison, we used parameters that provide the highest accuracy. The parameters are shown in Tables 6, 7 and 8.
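The three comparative kernels and the time averaging can be sketched as follows. The default parameter values mirror those quoted above (σ = 5.0, α = 5.0, β = 3.0); the helper names and the toy sequences are assumptions for illustration.

```python
import math


def time_average(sequence):
    """Average a sequence of equal-dimension vectors over time."""
    n = len(sequence)
    dim = len(sequence[0])
    return [sum(frame[d] for frame in sequence) / n for d in range(dim)]


def dot(x, y):
    """Inner product x'y of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, sigma=5.0):
    """exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sigmoid_kernel(x, y, alpha=5.0, beta=3.0):
    """tanh(alpha * x'y + beta)."""
    return math.tanh(alpha * dot(x, y) + beta)


def linear_kernel(x, y):
    """x'y."""
    return dot(x, y)


# Two toy sequences of different lengths but equal frame dimension.
seq_a = [[1.0, 2.0], [3.0, 4.0]]
seq_b = [[2.0, 3.0], [2.0, 3.0], [2.0, 3.0]]
avg_a, avg_b = time_average(seq_a), time_average(seq_b)
k_gk = gaussian_kernel(avg_a, avg_b)
```

Averaging over time gives every segment a vector of the same dimension, at the cost of discarding the temporal order that the LCSS and spectrum intersection kernels retain.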
Finally, we show the results of a subjective evaluation of our recommendation method. We performed the subjective evaluation using 15 subjects (User1 to User15). Table 9 shows the profiles of the subjects. In the evaluation, we used video contents that consisted of video sequences and music pieces. In the video contents, each video sequence included one human motion, and each music piece was a result recommended by the proposed method according to the human motion. The tasks of the subjective evaluation were as follows:

1.
Subjects watched each video content, whose video sequence was a target classic ballet scene and whose music was recommended by the proposed method.

2.
Subjects determined whether the target classic ballet scene and the recommended music pieces were matched or not. Specifically, they answered yes or no.

3.
Procedures 1 and 2 were repeated for 210 video contents.
In the subjective evaluation, we used the recommended results obtained by Simulation 1 in the above-described experiment. We also used datasets #1 and #2 for the subjective evaluation. In the evaluation, we showed the top three recommended results for each query human motion (query segment). Thus, 70 query segments were examined and 210 recommended results were obtained for each dataset.
Furthermore, we used two criteria, "Accuracy Score 2" and "Accuracy Score 3", for verifying the performance. Accuracy Score 2 is defined as follows:
where the denominator corresponds to the number of query segments. Q^{2}_{i_{2}} (i_{2} = 1, 2, ..., 70) is one if subjects determined that at least one of the three recommended music pieces matched the query human motion. Otherwise, Q^{2}_{i_{2}} is zero. In addition, Accuracy Score 3 is the ratio of positive assessment results over the 210 music pieces and is defined as follows:
where the denominator corresponds to the number of recommended music pieces. Furthermore, Q^{3}_{i_{3}} (i_{3} = 1, 2, ..., 210) is one if subjects determined that the query human motion and its music piece matched. Otherwise, Q^{3}_{i_{3}} is zero. Table 10 shows the results of each score in the subjective evaluation. Both scores show higher recommendation accuracy than that of the quantitative evaluation. Therefore, the results of the subjective evaluation confirmed the effectiveness of our method.
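The two subjective criteria can be sketched for a single subject's yes/no answers. How answers from several subjects were aggregated is not stated here, so the sketch handles one subject only; the function names and the toy judgments are assumptions.

```python
def accuracy_score_2(judgments):
    """Fraction of queries with at least one recommended piece judged a match.

    judgments: one list per query segment, holding a boolean per
        recommended music piece (three per query in the article).
    """
    return sum(1 for per_query in judgments if any(per_query)) / len(judgments)


def accuracy_score_3(judgments):
    """Fraction of all recommended pieces judged a match."""
    flat = [answer for per_query in judgments for answer in per_query]
    return sum(flat) / len(flat)


# Toy answers for four queries with three recommendations each.
judgments = [
    [True, False, False],
    [False, False, False],
    [True, True, False],
    [False, True, True],
]
score_2 = accuracy_score_2(judgments)
score_3 = accuracy_score_3(judgments)
```

Accuracy Score 2 is never lower than Accuracy Score 3, since a query counts as a success as soon as one of its three recommendations is accepted.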
6 Conclusions
In this article, we have presented a method for music recommendation according to human motion based on the kernel CCA-based relationship. In the proposed method, we newly defined two types of kernel functions. One is a sequential similarity-based kernel function that uses the LCSS algorithm, and the other is a statistical characteristic-based kernel function that uses the p-spectrum. Using these kernel functions, the proposed method enables calculation of the correlation that can consider their sequential characteristics. Furthermore, based on the obtained correlation, the proposed method enables accurate music recommendation according to human motion.
In the experiments, recommendation accuracy was sensitive to the parameters. It is desirable that these parameters be adaptively determined from the datasets. Thus, we need to complement this determination algorithm. Feature selection of the human motions and music pieces is also needed for more accurate extraction of the relationship between human motions and music pieces. These topics will be the subjects of subsequent studies.
Endnotes
^{a}In this article, we simply denote "retrieval and recommendation" as recommendation hereafter.
^{b}In this article, video sequences are defined as data that contain only visual signals, and video contents are defined as data that contain both visual signals and audio signals.
^{c}In this section, we assume that E[\varphi_{x}(x)] = 0 and E[\varphi_{y}(y)] = 0 for brief explanation, where E[\cdot] denotes the sample average of the random variates.
^{d}http://www.xbox.com/enUS/Kinect.
Appendix A: Feature extraction
In this article, we use human motion features and music features. Here, each feature extraction is explained in detail. Segments are extracted from video contents, i.e., video contents are separated into some segments S_{ j }(j = 1,2,..., N). Then, human motion features and music features are extracted from each segment. In this appendix, we explain methods for extraction of human motion features and music features in A.1 and A.2, respectively.
A.1 Extraction of human motion features
First, the proposed method separates segments S_{j} into frames f_{j}^{k} (k = 1, 2, ..., N_{j}), where N_{j} is the number of frames in segment S_{j}. Furthermore, a rectangular region including one human is clipped from each frame, and the regions are regularized to the same size. In this article, we assume that this rectangular region has previously been obtained. Deciding the rectangular regions including humans might be difficult. However, there are several methods for extracting or deciding human regions from video sequences [26, 27]. These methods achieved accurate human region detection by combining visual information and sensor information such as Kinect,^{d} using a stereo camera, or using a camera whose position is calibrated. Although we extract the rectangular region manually for simplicity, we consider that a certain precision can be guaranteed using these methods.
Next, we show the calculation of optical flow-based vectors. For calculating optical flows from segments, we first divide the region of frame f_{j}^{k} into blocks \mathcal{B}_{j}^{b} (b = 1, 2, ..., N^{\mathcal{B}_{j}}), where N^{\mathcal{B}_{j}} (= 1600) is the number of blocks in each frame. Then, based on the Lucas-Kanade algorithm [19], the optical flow in each block \mathcal{B}_{j}^{b} is calculated between the two successive regions of f_{j}^{k} and f_{j}^{k+1} for all segments S_{j}. We thereby obtain optical flow-based vectors v_{j}(k) (k = 1, 2, ..., N_{V_{j}}) containing the vertical and horizontal optical flow values for all blocks. Here, N_{V_{j}} corresponds to N_{j} - 1.
In this article, the human motion feature V_{j} of segment S_{j} is obtained as the sequence of the optical flow-based vectors v_{j}(k). The features obtained by the above procedure represent the temporal characteristics of human motions.
A.2 Extraction of music features
The proposed method uses chromagrams [24]. A chromagram represents the temporal sequence of chroma vectors and is calculated from each segment. Furthermore, the chroma vector represents the magnitude distribution on the chroma that is assigned into 12 pitch classes within an octave, and thus the chroma vector has 12 dimensions. The 12-dimensional chroma vector m(t) is extracted from the magnitude spectrum Ψ_{τ}(f_{Hz}, t), which is calculated using the short-time Fourier transform (STFT), where f_{Hz} is frequency and t is time in an audio signal. The τ-th (τ = 1, 2, ..., 12) element of m(t) corresponds to a pitch class of equal temperament and is defined as follows:
where Oct_{H} and Oct_{L} represent the highest and lowest octave positions, respectively. Furthermore, BPF_{τ,h}(f_{Hz}) is a bandpass filter that passes the signal at the log-scale frequency F_{τ,h} (in cents) of pitch class τ (chroma) in octave position h (height) as shown in the following equation:
We define a chromagram that represents a temporal sequence of 12-dimensional chroma vectors extracted by the above procedure in segment S_{j} as the music features M_{j} = [m_{j}(1), m_{j}(2), ..., m_{j}(N_{M_{j}})], where N_{M_{j}} is the number of components of M_{j}. Details of the chroma vector and the chromagram are shown in [20].
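The idea of folding a magnitude spectrum onto 12 pitch classes can be illustrated with a simplified sketch. It assigns each frequency bin to its nearest equal-tempered semitone instead of applying the bandpass filters BPF_{τ,h} described above, so it is not the formulation of [20]. The reference pitch of 440 Hz for A4, the zero-based index with 0 for pitch class C, and the function name are assumptions.

```python
import math


def chroma_vector(spectrum, ref_hz=440.0):
    """Fold one frame of a magnitude spectrum into 12 pitch classes.

    spectrum: list of (frequency_hz, magnitude) pairs for one STFT frame.
    Returns a 12-element list, index 0 = C, ..., index 9 = A (assumed order).
    Uses hard assignment to the nearest semitone (a simplification).
    """
    chroma = [0.0] * 12
    for freq, mag in spectrum:
        if freq <= 0:
            continue  # skip the DC bin; log2 is undefined there
        # MIDI-style note number: 69 is A4 at ref_hz, 12 semitones per octave.
        note = 69 + 12 * math.log2(freq / ref_hz)
        chroma[round(note) % 12] += mag
    return chroma


# Toy frame: A4, A5 (same pitch class one octave up), and roughly C4.
frame = [(440.0, 1.0), (880.0, 0.5), (261.63, 2.0)]
m_t = chroma_vector(frame)
```

Stacking one such vector per STFT frame yields a 12 × N_{M_{j}} sequence, which corresponds to the chromagram M_{j} used as the music feature.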
Appendix B: Parameter determination
In this appendix, we explain the parameter determination. First, we performed experiments to examine the relationship between the accuracy score and the parameters. Figure 6 shows the relationships between the accuracy score and the parameters in Simulation 1. From the obtained results, it can be seen that the kernel CCA-based approach tends to be sensitive to the parameters. It should be noted that the dataset used for the experiments contains quite different types of pairs of human motions and music pieces. For similar pairs of human motions and music pieces, we expect to be able to use fixed parameters and obtain accurate results. Therefore, it seems that stable recommendation accuracy scores can be achieved using parameters that are determined from datasets that have similar characteristics. This means that, for stable recommendation, schemes that perform clustering and classification of the contents become necessary as preprocessing. The other simulations and other databases showed the same sensitivity as the results shown here. For the above reasons, we used the parameters that provided the highest accuracy; thus, the parameters were not determined by cross-validation. However, we recognize that such parameters should be determined by cross-validation. This is our future work.
Abbreviations
CCA: canonical correlation analysis
MMD: multimedia documents
LCSS: longest common subsequence
LCSSK: LCSS kernel
SIK: spectrum intersection kernel
GK: Gaussian kernel
SK: sigmoid kernel
LK: linear kernel
References
Kim I, Lee J, Kwon Y, Par S: Content-based image retrieval method using color and shape features. Proceedings of the 1997 International Conference on Information, Communication and Signal Processing 1997, 948-952.
Zhang R, Zhang Z: Effective image retrieval based on hidden concept discovery in image database. IEEE Trans Image Process 2006, 16(2):562-572.
He X, Ma W, Zhang H: Learning an image manifold for retrieval. Proceedings of the ACM Multimedia Conference 2004.
Guo G, Li S: Content-based audio classification and retrieval by support vector machines. IEEE Trans Neural Networks 2003, 14(1):209-215. 10.1109/TNN.2002.806626
Typke R, Wiering F, Veltkamp R: A survey of music information retrieval systems. Proceedings of the ISMIR 2005.
Shen J, Shepherd J, Ngu A: Towards effective content-based music retrieval with multiple acoustic feature combination. IEEE Trans Multimedia 2006, 8: 1179-1189.
Greenspan H, Goldberger J, Mayer A: Probabilistic space-time video modeling via piecewise GMM. IEEE Trans Pattern Anal Mach Intell 2004, 26(3):384-396. 10.1109/TPAMI.2004.1262334
Fan J, Elmagarmid A, Zhu X, Aref W, Wu L: ClassView: hierarchical video shot classification, indexing, and accessing. IEEE Trans Multimedia 2004, 6(1):70-86. 10.1109/TMM.2003.819583
Li X, Dacheng T, Maybank S: Visual music and musical vision. Neurocomputing 2008, 71: 2023-2028. 10.1016/j.neucom.2008.01.025
Fujii A, Itou K, Akiba T, Ishikawa T: A cross-media retrieval system for lecture videos. Proceedings of the Eighth European Conference on Speech Communication and Technology (Eurospeech 2003) 2003, 1149-1152.
Zhuang Y, Yang Y, Wu F: Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval. IEEE Trans Multimedia 2008, 10(2):221-229.
Yang Y, Zhuang Y, Wu F, Pan Y: Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval. IEEE Trans Multimedia 2008, 10(3):437-446.
Akaho S: A kernel method for canonical correlation analysis. International Meeting of Psychometric Society 2001, 1.
Jun S, Han B, Hwang E: A similar music retrieval scheme based on musical mood variation. First Asian Conference on Intelligent Information and Database Systems 2009, 1: 167-172.
Mercer J: Functions of positive and negative type, and their connection with the theory of integral equations. Trans London Philos Soc (A) 1909, 209: 415-446. 10.1098/rsta.1909.0016
Leslie C, Eskin E, Noble W: The spectrum kernel: a string kernel for SVM protein classification. Proceedings of the Pacific Biocomputing Symposium 2002, 566-575.
Barla A, Odone F, Verri A: Histogram intersection kernel for image classification. ICIP(3) 2006, 513-516.
Gruber C, Gruber T, Sick B: Online signature verification with new time series kernels for support vector machines. Advances in Biometrics 2005, 3832: 500-508. 10.1007/11608288_67
Lucas B, Kanade T: An iterative image registration technique with an application to stereo vision. Proceedings of the DARPA IU Workshop 1984, 121-130.
Goto M: A chorus-section detection method for musical audio signals and its application to a music listening station. IEEE Trans Audio Speech Language Process 2006, 14(5):1783-1794.
MacQueen J: Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Math. Statistics and Probability 1967, 1: 281-297.
Mariethoz J, Bengio S: A kernel trick for sequences applied to text-independent speaker verification systems. Pattern Recognition 2007, 40(8):2315-2324. 10.1016/j.patcog.2007.01.011
Camps-Valls G, Martin-Guerrero J, Rojo-Alvarez J, Soria-Olivas E: Fuzzy sigmoid kernel for support vector classifier. Neurocomputing 2004, 62: 501-506.
Wakefield GH: Mathematical representation of joint time-chroma distributions. SPIE 1999.
Xu R, Dunsch W II: Survey of clustering algorithms. IEEE Trans Neural Networks 2005, 16(3):645-678. 10.1109/TNN.2005.845141
Navneet D, Bill T, Cordelia S: Human detection using oriented histograms of flow and appearance. Comput Vision ECCV 2006 2006, 3952: 428-441. 10.1007/11744047_33
Mikolajczyk K, Schmid C, Zisserman A: Human detection based on a probabilistic assembly of robust part detectors. In Proceedings of the Eighth European Conference on Computer Vision. Volume 1. Prague, Czech Republic; 2004:69-81.
Acknowledgements
This study was partly supported by the Grant-in-Aid for Scientific Research (B) 21300030, Japan Society for the Promotion of Science (JSPS).
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Ohkushi, H., Ogawa, T. & Haseyama, M. Music recommendation according to human motion based on kernel CCAbased relationship. EURASIP J. Adv. Signal Process. 2011, 121 (2011). https://doi.org/10.1186/168761802011121
DOI: https://doi.org/10.1186/168761802011121
Keywords
 content-based multimedia recommendation
 kernel canonical correlation analysis
 longest common subsequence
 p-spectrum