- Research
- Open Access
Modeling individual HRTF tensor using high-order partial least squares
- Qinghua Huang^{1} (corresponding author) and
- Lin Li^{1}
https://doi.org/10.1186/1687-6180-2014-58
© Huang and Li; licensee Springer. 2014
- Received: 31 August 2013
- Accepted: 3 April 2014
- Published: 3 May 2014
Abstract
A tensor is used to describe head-related transfer functions (HRTFs) that depend on frequencies, sound directions, and anthropometric parameters, which preserves the multi-dimensional structure of the measured HRTFs. To construct a multi-linear HRTF personalization model, an individual core tensor is extracted from the original HRTFs using high-order singular value decomposition (HOSVD). The individual core tensor, which lies in a lower-dimensional space, acts as the output of the multi-linear model. Key anthropometric parameters, the inputs of the model, are selected by Laplacian scores and by correlation analyses between all the measured parameters and the individual core tensor. The multi-linear regression model is then constructed by high-order partial least squares (HOPLS), which seeks a joint subspace approximation for both the selected parameters and the individual core tensor. The numbers of latent variables and loadings control the complexity of the model and help prevent overfitting. Compared with the partial least squares regression (PLSR) method, objective simulations demonstrate better performance in predicting individual HRTFs, especially for sound directions ipsilateral to the concerned ear. Subjective listening tests show that the predicted individual HRTFs are close to the measured HRTFs in terms of sound localization.
Keywords
- Head-related transfer function
- Tensor
- Multi-linear model
- Laplacian score
- High-order partial least squares
1 Introduction
The generation of virtual three-dimensional (3D) audio based on head-related transfer functions (HRTFs) has become important in many applications, such as PC entertainment, hearing aids, multimedia, and virtual reality, where it provides spatial and immersive impressions in 3D auditory space. Its key technique is to recover the location information of a sound source by HRTFs. HRTFs describe the spectral changes of sound waves on their way from the sound position to a listener's ear canal, which are caused by the diffraction and reflection properties of the anthropometric structures. The corresponding representations in the time domain are head-related impulse responses (HRIRs). HRTFs not only vary with sound source locations (elevations and azimuths) and frequencies but also depend on external physiological structures that differ from one listener to another. Small differences in anthropometric structures can have a significant influence on HRTFs for sound localization, and perceptual distortions may occur in spatial sound localization with non-individual HRTFs. Unfortunately, individual HRIR measurements for each listener are very time consuming and require specialized instruments, so they are neither practical nor economical for most applications. It is therefore essential to obtain individual HRTFs quickly and effectively.
Some theoretical calculation methods have been used to generate individual HRTFs based on a snowman model [1] and the boundary element method (BEM) [2], but they are unsuitable for individual HRTF estimation at high frequencies. They also require a large amount of computation and place high demands on computing resources [3]. To customize HRTFs more easily and effectively, some researchers explored how differences in anthropometric structures affect HRTFs [4–6] and further studied the relationship between HRTFs and anthropometric parameters [7, 8]. Zotkin et al. measured the pinna size of a new subject and used the similarity of pinna structures between the new subject and the listeners in a known HRTF database to select the best matching HRTFs as the individual HRTFs of that subject [9]. Zeng et al. applied a hybrid algorithm for selecting the individual HRTFs based on the similarity of anthropometric structures [10]. These HRTF personalization methods based on database matching are limited by the sizes of the existing databases and by the matching criteria. Consequently, individual modeling based on anthropometric measurements is a major step forward in HRTF customization. Such HRTF estimation has three indispensable parts: the dimension reduction of the original HRTFs, a reasonable selection of anthropometric measurements, and the mapping between the compacted HRTFs and the selected anthropometric measurements.
HRTFs depend on frequencies, sound directions (azimuths and elevations), and listeners. The collection of high-spatial-resolution HRTFs of each listener makes up a large dataset with a multi-dimensional structure and complex characteristics. It is difficult to use the original HRTFs directly for learning and storage. Because of the high dimensionality, it is necessary to extract lower-dimensional individual factors from the original HRTFs and to discard non-individual features. Principal component analysis (PCA) has been widely applied to obtain individual weight coefficients and basis vectors before HRTF customization [11–15]. Sodnik et al. found a suitable representation for the weight variations of the HRTF amplitudes by PCA [16]. Wang et al. applied PCA to HRTF compression [17] and interpolation [18]. Xie proposed recovering high-spatial-resolution HRTFs from individual weight coefficients using a small set of measurements [19]. Kistler and Wightman modeled the HRTF matrix by PCA [20]. PCA successfully reduces the dimensionality of HRTF datasets, but it is based on the so-called vector space model. Under this model, the HRTFs of a subject at different source locations are modeled as a vector and the collection of individual HRTFs is modeled as a matrix. Such a model cannot capture the variations among different sound source directions or the interaction of multiple variables in HRTFs. To overcome the weakness of the vector model, Grindlay and Vasilescu modeled HRTFs using a multi-factor (tensor) representation [21]. The tensor framework learns the interaction of multiple variables in the HRTF representation and can reduce the dimensionality of the HRTF dataset along each variable separately. Rothbucher et al. used multi-way array analysis to customize HRTFs [22]. However, these works make no specific selection of anthropometric parameters in combination with the HRTF tensor, although the selection of key parameters is important for the accuracy of the HRTF model estimation.
A number of anthropometric parameters (head, torso, pinna, etc.) affect the immersion of a listener's spatial hearing and influence HRTFs to different degrees. It is undesirable to use all the anthropometric measurements for modeling the individual HRTFs, so a reasonable selection of the anthropometric measurements is necessary for individual HRTF customization. Xu et al. [5] found that ear parameters were significantly correlated with the magnitudes of HRTFs at high frequencies. Chen et al. studied the influence of the neck and torso parameters on the near-field HRTFs [23]. Rothbucher et al. [7] developed a measurement system capable of scanning a human body for the anthropometric measurements. Hu et al. applied correlation coefficients twice for the parameter selection and retained eight parameters for the HRTF estimation [12, 24, 25]. Zeng et al. [10] selected 13 reference physiological parameters by correlation analyses between the measured HRTFs and all the anthropometric parameters in order to match the best HRTFs. Xu et al. [26] used a weighted correlation method and selected eight significant parameters. Hugeng and Gunawan [27] analyzed the correlation among anthropometric parameters and three physical quantities (interaural time difference, interaural level difference, and pinna notch frequency). However, correlation analyses only describe the linear relationship among anthropometric parameters and cannot evaluate the significance of a single parameter. Moreover, these existing methods for parameter selection do not consider the relation between the anthropometric parameters and the multi-dimensional HRTFs. To avoid HRTF vector modeling, it is necessary to keep the multi-dimensional structure of the original HRTFs and to select the key parameters on this basis.
Once the lower-dimensional individual HRTF factors and a few key parameters have been obtained, a regression method is needed to construct the relationship between them. The anthropometric parameters are treated as the inputs and the lower-dimensional HRTFs as the outputs. Many researchers constructed the HRTF prediction model under the assumption of a linear relation between the HRTF vectors and the anthropometric parameters [11–13, 24, 28, 29]. In [12, 24, 28, 29], the relation between the HRTFs and the physical sizes of the head and ear was investigated by multiple regression analysis and optimized by the least squares method. The performance of the estimated HRTFs was evaluated both objectively and subjectively. The results indicated good performance, with no significant perceptual difference between the measured and estimated HRTFs when the bandwidth ranged from 0 to 8 kHz. To discard trivial anthropometric measurements and improve the performance of the HRTF estimation, Hu et al. used partial least squares regression (PLSR) to model the linear relation [24]. Subsequently, to better describe the scattering of the incident sound by the physical structures, many researchers explored non-linear multivariable statistical estimation methods to improve the performance of HRTF customization [25, 30–32]. In [25], a three-layer back-propagation artificial neural network (ANN) was used for HRTF personalization. Huang et al. applied support vector regression (SVR) to model personalized HRIRs [30, 31]. All these linear or non-linear HRTF personalization methods are based on the vector model. They cannot establish a mapping from the anthropometric parameters to an HRTF tensor, so predicting the individual HRTFs of different sound directions requires a separate regression model for each sound direction. In contrast, a high-spatial-resolution HRTF prediction based on the tensor model requires only one regression. Therefore, it is necessary to treat the HRTF data involving multiple variables as multi-way data structures [33] and to develop a multi-linear extension for modeling an HRTF tensor from a small set of anthropometric measurements.
The above considerations motivate us to construct a multi-linear model for predicting an individual HRTF tensor. To learn the HRTFs of any listener from a measured database, we present an HRTF customization in three steps. First, to keep the inherent interaction of different variables in HRTFs, a tensor is used to describe the measured HRTFs, and the lower-dimensional individual core tensor (ICT) in the tensor subspace is extracted by high-order singular value decomposition (HOSVD). Then, in combination with the lower-dimensional ICT, a few anthropometric parameters are selected in consideration of the local geometric structure and the global information in the parameter data space. Last, we use a multi-linear subspace regression to model the multi-linear relationship between the ICT and the selected key parameters. Section 2 presents the data processing, including the dimension reduction of the HRTF tensor and the selection of key anthropometric parameters. The multi-linear subspace regression between the compacted HRTF tensor and the selected key parameters is developed in Section 3. The proposed method realizes a direct multi-linear mapping from the anthropometric parameters to the HRTF tensor. In Section 4, we give the results and discussions of the proposed HRTF personalization method. The conclusions are given in the last section.
2 Data processing
2.1 Notations and basic multi-linear algebra
To distinguish scalars, vectors, matrices, and tensors, the following notations are used. Scalars are denoted by italic letters, e.g., a; vectors by lowercase boldface letters, e.g., a; matrices by uppercase boldface letters, e.g., A; and tensors by boldface calligraphic letters, e.g., A. The ith entry of the vector a is denoted by a_{i}, the (i, j)-element of the matrix A by a_{ij}, the nth column vector of the matrix A by a_{n}, and the element (i_{1}, i_{2}, …, i_{N}) of an Nth-order tensor $\mathit{A}\in {\mathrm{\mathbb{R}}}^{{\mathit{I}}_{1}\times \cdots \times {\mathit{I}}_{\mathit{n}}\times \cdots \times {\mathit{I}}_{\mathit{N}}}$ by ${\mathit{a}}_{{\mathit{i}}_{1}{\mathit{i}}_{2}\cdots {\mathit{i}}_{\mathit{N}}}$. The indices range from 1 to their corresponding capital versions, e.g., i_{N} = 1, 2, …, I_{N}. The n-mode unfolding of the tensor A is denoted by ${\mathbf{A}}_{\left(\mathit{n}\right)}\in {\mathrm{\mathbb{R}}}^{{\mathit{I}}_{\mathit{n}}\times \left({\mathit{I}}_{1}\cdots {\mathit{I}}_{\mathit{n}-1}{\mathit{I}}_{\mathit{n}+1}\cdots {\mathit{I}}_{\mathit{N}}\right)}$. The nth factor matrix is denoted by ${\mathbf{A}}^{\left(\mathit{n}\right)}\in {\mathrm{\mathbb{R}}}^{{\mathit{I}}_{\mathit{n}}\times {\mathit{I}}_{\mathit{n}}}$. The I_{n} × I_{n} identity matrix is denoted by ${\mathbf{I}}_{{\mathit{I}}_{\mathit{n}}}$.
where $\mathit{S}\in {\mathrm{\mathbb{R}}}^{{\mathit{I}}_{1}\times {\mathit{I}}_{2}\times {\mathit{I}}_{3}}$ is the core tensor [34]. ${\mathbf{U}}^{\left(\mathit{n}\right)}\in {\mathrm{\mathbb{R}}}^{{\mathit{I}}_{\mathit{n}}\times {\mathit{I}}_{\mathit{n}}}$ is a unitary matrix that can be calculated by performing a matrix singular value decomposition (SVD) on A_{(n)} [35]. The last term is the simplified notation [36].
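The decomposition that this sentence describes is not preserved in the text above. For orientation, the standard HOSVD of a third-order tensor, written with the symbols defined here, takes the following form. This is a reconstruction from the usual textbook definition, not a copy of the authors' equation, and the double-bracket shorthand is assumed to be the "simplified notation" mentioned.

```latex
\mathcal{A}
  = \mathcal{S} \times_{1} \mathbf{U}^{(1)} \times_{2} \mathbf{U}^{(2)} \times_{3} \mathbf{U}^{(3)}
  = [\![\, \mathcal{S};\ \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \mathbf{U}^{(3)} \,]\!],
\qquad
\mathcal{S}
  = \mathcal{A} \times_{1} \mathbf{U}^{(1)T} \times_{2} \mathbf{U}^{(2)T} \times_{3} \mathbf{U}^{(3)T}.
```

Here $\times_{n}$ is the n-mode product, and each $\mathbf{U}^{(n)}$ holds the left singular vectors of the unfolding $\mathbf{A}_{(n)}$.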
2.2 Data processing
2.2.1 Dimension reduction of the HRTF tensor
The HRTF of subject p is defined as the ratio of two sound pressures, $H(p, r, \theta, \phi, f) = P(p, r, \theta, \phi, f)/P_{0}(r, f)$, where f is the frequency and r is the distance from the sound source to the center of the listener's head. The spatial direction of the sound source is given by the azimuth θ and the elevation ϕ. The individual factors are embodied in the first variable p, which indexes the subjects. P(p, r, θ, ϕ, f) represents the sound pressure at the left or right ear, and P_{0}(r, f) is the sound pressure at the position of the center of the listener's head with the listener absent. In the following, r is omitted because HRTFs are asymptotically independent of distance in the far field (r > 1 m) [37].
Even in the far field, HRTFs are functions of frequencies and sound directions uniquely from one person to another. To keep the interaction of different variables, a third-order tensor is used to describe the HRTFs of different subjects. Due to the high dimension of each mode, a core tensor in a lower-dimensional subspace is extracted from the original HRTF tensor by HOSVD. It still keeps the multi-dimensional structure and contains the individual characteristic of the measured HRTFs.
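As an illustration of this compaction step, the following NumPy sketch extracts a core tensor from a subjects × directions × frequencies array by truncating only the direction and frequency modes. It is a minimal sketch under assumptions: the function names (`unfold`, `mode_product`, `rank_for_energy`, `extract_ict`, `reconstruct`) are hypothetical, and the truncation rule (keep the smallest rank that retains a given fraction of the squared singular values) is an assumed reading of the energy criterion, with the default 0.98 taken from the energy ratio reported in the experiments.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """n-mode product of `tensor` with `matrix` (shape J x I_n) along `mode`."""
    moved = np.moveaxis(tensor, mode, 0)
    out = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def rank_for_energy(singular_values, ratio):
    """Smallest rank whose squared singular values retain `ratio` of the energy."""
    energy = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
    rank = int(np.searchsorted(energy, ratio)) + 1
    return min(rank, len(singular_values))

def extract_ict(H, ratio=0.98):
    """Compress modes 2 and 3 (directions, frequencies); keep the subject mode."""
    core, bases = H, []
    for mode in (1, 2):
        U, s, _ = np.linalg.svd(unfold(H, mode), full_matrices=False)
        U = U[:, :rank_for_energy(s, ratio)]
        bases.append(U)
        core = mode_product(core, U.T, mode)  # project onto the retained basis
    return core, bases

def reconstruct(core, bases):
    """Map a core tensor back to the full direction and frequency modes."""
    out = core
    for mode, U in zip((1, 2), bases):
        out = mode_product(out, U, mode)
    return out
```

Because the subject mode is left untouched, each slice `core[p]` is the compact representation of one subject, which is what a regression from anthropometric parameters would predict before `reconstruct` maps it back to HRTFs.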
2.2.2 Selection of key anthropometric parameters combining with the ICT
HRTFs describe the responses resulting from diffraction and reflection by a listener's anthropometric structures and are related to the anatomy of the head, torso, and pinna. Each listener has a specific anthropometric shape and size. The parameter data space can be obtained by anthropometric measurements of the physiological structures of each subject [7, 39]. The different anthropometric parameters are correlated with one another, but the correlation coefficients are not equal or close to one. The anthropometric parameters therefore cannot replace each other, and it is better to select a group of necessary anthropometric measurements that approximately reflects the fundamental properties of HRTFs. How to select such a group of anthropometric parameters is a key task. We use the following three procedures to select the key anthropometric parameters for the subsequent multi-linear personalization modeling.
First, in order to measure the influence of parameters on HRTFs, the correlation analyses are performed between all the measured parameters and the individual core tensor. The parameters with larger correlation coefficients are reserved as the results of the first selection procedure.
Then, we use a Laplacian score to further select appropriate parameters. It measures the importance of each anthropometric parameter. To avoid an unbalanced selection of similar parameters, the reserved parameters are divided into three classes before the Laplacian score is calculated. We examine the intrinsic properties of the parameter space to evaluate each parameter after the correlation analysis. For each parameter, its Laplacian score is computed to reflect its locality preserving power. The Laplacian score is based on local observation and on the assumption that two parameters are probably related to the same topic if they are close to each other [40].
where b_{ k } = [a_{1k}, a_{2k}, ⋯, a_{ Pk }]^{ T } consists of the k th parameter of all P subjects and var(⋅) is the variance computation. L_{ k } is the Laplacian score of the k th parameter, which concerns two aspects of the reserved anthropometric parameter. One is the variance of the parameter that reflects its representative power. The other relates to the local geometric structure of the parameter data space. It seeks the anthropometric parameters which best reflect the underlying manifold structure and are probably better for predicting HRTFs. Thus, we select these parameters with lower Laplacian scores, which have significant influence on HRTFs at the same time.
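The score formula itself is not preserved above, so the sketch below uses the standard Laplacian score of He et al. as a stand-in, which may differ in normalization from the authors' equation. It builds a symmetric nearest-neighbor graph over the P subjects with heat-kernel weights and returns one score per parameter column. The function name `laplacian_scores` and the defaults `n_neighbors=5` and `t=1.0` are assumptions; in practice the columns would likely need standardizing first so that no single unit dominates the distances.

```python
import numpy as np

def laplacian_scores(A, n_neighbors=5, t=1.0):
    """Laplacian score of each column of A (subjects x parameters); lower is better."""
    A = np.asarray(A, dtype=float)
    P = A.shape[0]
    # Pairwise squared Euclidean distances between subjects.
    sq = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=2)
    masked = sq.copy()
    np.fill_diagonal(masked, np.inf)  # a subject is not its own neighbor
    nn = np.argsort(masked, axis=1)[:, :n_neighbors]
    # Symmetric k-nearest-neighbor adjacency.
    adj = np.zeros((P, P), dtype=bool)
    adj[np.repeat(np.arange(P), nn.shape[1]), nn.ravel()] = True
    adj = adj | adj.T
    S = np.where(adj, np.exp(-sq / t), 0.0)  # heat-kernel edge weights
    d = S.sum(axis=1)                        # node degrees
    L = np.diag(d) - S                       # graph Laplacian
    scores = np.empty(A.shape[1])
    for k in range(A.shape[1]):
        f = A[:, k] - (A[:, k] @ d) / d.sum()  # remove the degree-weighted mean
        scores[k] = (f @ L @ f) / np.sum(d * f * f)
    return scores
```

A parameter that varies smoothly between neighboring subjects while having large overall variance receives a low score, which matches the selection rule described above.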
Last, correlation analysis is applied to delete some of the above selected parameters that have strong correlation with the others. Through the above selection process, K′ key parameters for P subjects are selected as the inputs of the multi-linear HRTF model. It is denoted by a matrix $\mathbf{A}\in {\mathrm{\mathbb{R}}}^{\mathit{P}\times {\mathit{K}}^{\prime}}$.
3 Multi-linear personalization modeling by HOPLS
where T = [t_{1}, t_{2}, …, t_{ R }] ∈ℝ^{P × R} is the latent matrix, $\mathit{Y}\in {\mathrm{\mathbb{R}}}^{\mathit{R}\times \mathit{R}{\mathit{J}}_{2}\times \mathit{R}{\mathit{J}}_{3}}$ is a block-diagonal tensor containing the tensor Y_{ r } (r = 1, 2, …, R) on the diagonal line; similarly, the core d_{ r } (r = 1, 2, …, R) is contained in a block-diagonal matrix D ∈ ℝ^{R × RI}, direction loading matrix ${\overline{\mathbf{Q}}}^{\left(2\right)}=\left[{\mathbf{Q}}_{1}^{\left(2\right)},{\mathbf{Q}}_{2}^{\left(2\right)},\cdots ,{\mathbf{Q}}_{\mathit{R}}^{\left(2\right)}\right]\in {\mathrm{\mathbb{R}}}^{{\mathit{D}}^{\prime}\times \mathit{R}{\mathit{J}}_{2}}$, frequency loading matrix ${\overline{\mathbf{Q}}}^{\left(3\right)}=\left[{\mathbf{Q}}_{1}^{\left(3\right)},{\mathbf{Q}}_{2}^{\left(3\right)},\cdots ,{\mathbf{Q}}_{\mathit{R}}^{\left(3\right)}\right]\in {\mathrm{\mathbb{R}}}^{{\mathit{F}}^{\prime}\times \mathit{R}{\mathit{J}}_{3}}$, and the anthropometric parameter loading matrix $\mathbf{V}=\left[{\mathbf{v}}_{1},{\mathbf{v}}_{2},\cdots ,{\mathbf{v}}_{\mathit{R}}\right]\in {\mathrm{\mathbb{R}}}^{{\mathit{K}}^{\prime}\times \mathit{RI}}$.
When the latent vector t is fixed, the cores Y and d are obtained by (12).
4 Experimental results
In this section, the performance of the proposed method is measured by objective evaluation and subjective sound localization based on a large number of HRTF measurements. The Center for Image Processing and Integrated Computing (CIPIC) database provides high-spatial-resolution HRIR measurements of 45 different subjects. It contains measured HRIRs for both left and right ears at 1,250 sound directions (25 azimuths and 50 elevations) [44]. The azimuths vary from −80° to 80°, and the elevations range from −45° to +230.625°. Moreover, 27 anthropometric parameters of the 45 subjects are measured in the CIPIC database, including 17 for the head and the torso, from x_{1} to x_{17}, and 10 for the pinna, expressed by d_{1} − d_{8}, θ_{1}, and θ_{2} [44]. The CIPIC database is used to evaluate the performance of our proposed regression model based on HOPLS.
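The measurement grid described here can be written out explicitly. The sketch below assumes the commonly documented CIPIC sampling (azimuths at −80°, −65°, −55°, then every 5° from −45° to 45°, then 55°, 65°, 80°; elevations every 5.625° from −45°). The text above states only the ranges and counts, so the exact spacing is an assumption to check against the database documentation.

```python
# Azimuths: three coarse steps on each side, 5-degree steps in between.
azimuths = [-80, -65, -55] + list(range(-45, 50, 5)) + [55, 65, 80]

# Elevations: 50 values spaced 360/64 = 5.625 degrees apart, starting at -45.
elevations = [-45 + 5.625 * i for i in range(50)]

# Every (azimuth, elevation) pair is one measured sound direction.
directions = [(az, el) for az in azimuths for el in elevations]

print(len(azimuths), len(elevations), len(directions))
```

Treating each azimuth separately, as the experiments do, gives 25 tensors whose direction mode has 50 elevations.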
4.1 Simulations of the data processing
4.1.1 HRTF tensor compaction
- 1. The selections of D′ and F′ depend on the energy loss of the original HRTF tensor in each mode, respectively. The energy contained in the HRTF tensor is calculated by the squared Frobenius norm of H. It also equals the sum of the squared n-mode singular values [34], expressed by $\Vert \mathit{H}\Vert_{F}^{2}=\sum_{i=1}^{P}\left(\sigma_{i}^{(1)}\right)^{2}=\sum_{i=1}^{D}\left(\sigma_{i}^{(2)}\right)^{2}=\sum_{i=1}^{F}\left(\sigma_{i}^{(3)}\right)^{2}$ (16)
Table 1 The selections of D′ and F′ for three azimuths
Azimuth | D′ | F′ | CR | $E_{\mathrm{ratio}}^{(2)}$ | $E_{\mathrm{ratio}}^{(3)}$ |
---|---|---|---|---|---|
−80° | 5 | 20 | 28.6 | | |
0° | 26 | 37 | 4.4 | 0.98 | 0.98 |
80° | 15 | 37 | 7.1 | | |
- 2. To measure the quantitative error of the reconstruction using the basis functions and the individual core tensor, the signal-to-distortion ratio (SDR) is defined in decibels as $\mathrm{SDR}\left(p,\theta,\varphi\right)=10\lg\frac{\sum_{f=1}^{F}\left|H\left(p,\theta,\varphi,f\right)\right|^{2}}{\sum_{f=1}^{F}\left|H\left(p,\theta,\varphi,f\right)-\widehat{H}\left(p,\theta,\varphi,f\right)\right|^{2}}$ (18)
where H(p, θ, ϕ, f) and $\widehat{\mathit{H}}\left(\mathit{p},\mathit{\theta},\mathit{\varphi},\mathit{f}\right)$ represent the original and the reconstructed or predicted HRTF, respectively. The average of SDR (ASDR) defined as $\mathrm{ASDR}\left(\mathit{\theta},\mathit{\varphi}\right)=\frac{1}{\mathit{P}}{\displaystyle \sum _{\mathit{p}=1}^{\mathit{P}}\mathrm{SDR}}\left(\mathit{p},\mathit{\theta},\mathit{\varphi}\right)$ is used to measure the mean performance for the reconstruction or prediction in the following discussions.
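The two error measures above translate directly into code. The following is a small pure-Python sketch; the function names `sdr_db` and `asdr_db` and the list-based input layout are illustrative choices rather than anything specified in the text. Inputs may be real or complex frequency responses.

```python
import math

def sdr_db(h_true, h_est):
    """SDR in dB for one subject and direction; inputs are sequences over frequency."""
    signal = sum(abs(h) ** 2 for h in h_true)
    distortion = sum(abs(h - g) ** 2 for h, g in zip(h_true, h_est))
    # Undefined (division by zero) when the estimate is exact.
    return 10.0 * math.log10(signal / distortion)

def asdr_db(h_true_by_subject, h_est_by_subject):
    """Average SDR over subjects for one direction."""
    values = [sdr_db(h, g) for h, g in zip(h_true_by_subject, h_est_by_subject)]
    return sum(values) / len(values)
```

For example, an estimate that is uniformly 10% too small in magnitude gives an SDR of 20 dB, while an all-zero estimate gives 0 dB.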
4.1.2 Selecting anthropometric parameters
- 1. To reduce the amount of computation and make the correlation analyses more efficient, it is desirable to sample the upper left corner areas of the ICT for the correlation analyses. In this procedure, the compacted ICT, denoted by $\tilde{\mathit{W}}_{c}\in \mathbb{R}^{P\times D''\times F''}$ ($D''<D'$ and $F''<F'$), is reshaped to a matrix $\tilde{\mathbf{W}}_{c}\in \mathbb{R}^{P\times \left(D''F''\right)}$. Then, the absolute values of the Pearson correlation coefficients are calculated and stored in a matrix $\mathbf{R}_{\tilde{\mathbf{W}}_{c}\mathbf{A}}\in \mathbb{R}^{27\times \left(D''F''\right)}$. There are 25 compacted ICTs corresponding to the 25 azimuths, so 25 correlation analyses are carried out, one per azimuth. The significance of the anthropometric parameters for the HRTFs is shown by the elements of the correlation coefficient matrices larger than 0.35, which are plotted in Figure 7. The results in Figure 7 show that all the anthropometric measurements affect the HRTFs, but to different degrees, so it is necessary to delete unimportant parameters. After the 25 correlation analyses, the 22 parameters that are more important to the HRTFs are reserved for the next selection step. The parameters x_{2}, x_{4}, x_{5}, x_{7}, and d_{2} are only weakly correlated with the HRTFs and are deleted in this step.
- 2. After the correlation analyses, we model the intrinsic geometric structure of the reserved parameter space by the nearest neighbor graph. The reserved parameters (x_{1}, x_{3}, x_{6}, x_{8} − x_{17}, d_{1}, d_{3} − d_{8}, θ_{1}, θ_{2}) are arranged into three different classes, shown in Table 2. Combining (8) and the graph, each parameter is evaluated by a Laplacian score. The parameters of each class are arranged by their corresponding scores in ascending order. In this way, 17 parameters are reserved as the result of the Laplacian score procedure. They are x_{3}, x_{6}, x_{9} − x_{15}, x_{17}, d_{1}, d_{3}, d_{4}, d_{6} − d_{8}, and θ_{1}, each with a Laplacian score less than 0.4.
Table 2 Reserved parameters arranged by Laplacian scores in ascending order
Class | Reserved anthropometric parameter |
---|---|
Class 1 (head and torso) | x_{17}, x_{15}, x_{12}, x_{6}, x_{14}, x_{9}, x_{13}, x_{11}, x_{3}, x_{10}, x_{8}, x_{16}, x_{1} |
Class 2 (pinna) | d_{3}, d_{8}, d_{4}, d_{7}, d_{1}, d_{6}, d_{5} |
Class 3 (pinna angle) | θ_{1}, θ_{2} |
- 3. The last step performs the correlation analysis among the reserved parameters. To show the dependence among those parameters, the gray image in Figure 8 presents the correlation coefficients of the reserved parameters that are larger than 0.5. From Figure 8, x_{6}, x_{9}, x_{12}, x_{14}, and x_{17} are strongly correlated with the others and are deleted. Thus, the parameters x_{3}, x_{10}, x_{11}, x_{13}, x_{15}, d_{1}, d_{3}, d_{4}, d_{6}, d_{7}, d_{8}, and θ_{1} are selected as the final necessary measurements for the individual HRTF prediction. All the final reserved parameters are thus selected by the correlation analyses and the Laplacian score; we take these 12 parameters as the key parameters and feed them into the training of the individual HRTF model by HOPLS. However, the significance of each selected parameter for the HRTFs is still not clear. Since measuring the anthropometric parameters requires special instruments, we cannot carry out our own anthropometric measurements at present.
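The first and third selection steps can be sketched with plain Pearson correlations. The thresholds 0.35 and 0.5 come from the text; everything else is an assumption made for illustration. In particular, the function names are hypothetical, step 1 is read as "keep a parameter if its largest absolute correlation with any ICT column exceeds the threshold", and step 3 uses a greedy rule (walk the parameters in a given preference order and keep one only if it is not strongly correlated with those already kept), since the text does not state how the member of a correlated pair to delete is chosen.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_relevant(params, targets, threshold=0.35):
    """Step 1: indices of parameter columns whose largest |r| against any
    target column (a column of the reshaped ICT) exceeds `threshold`."""
    return [k for k, p in enumerate(params)
            if max(abs(pearson(p, t)) for t in targets) > threshold]

def drop_redundant(params, ranked, threshold=0.5):
    """Step 3 (assumed greedy rule): visit indices in `ranked` order and keep
    one only if |r| with every already-kept parameter is at most `threshold`."""
    kept = []
    for k in ranked:
        if all(abs(pearson(params[k], params[j])) <= threshold for j in kept):
            kept.append(k)
    return kept
```

Here `params` and `targets` are lists of columns, each column holding one value per subject; `ranked` could, for instance, list the surviving parameters in ascending order of Laplacian score.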
4.2 Objective evaluation and subjective localization experiment
Through the simulations in Section 4.1, we can obtain the individual core tensor and key anthropometric parameters. The goal of our proposed HRTF personalization is to model the multi-linear relation between the key parameters and the individual core tensor. In this section, the experiments are implemented to evaluate the feasibility of the proposed individual HRTF customization by objective evaluation and subjective perception. It is important to select the appropriate hyperparameters for preventing overfitting and controlling the complexity of the HRTF estimation model.
4.2.1 Selecting the numbers of loadings and latent vectors for the individual HRTF prediction
The selected numbers of latent vectors and loadings of HOPLS
Azimuth | Subject | R (HOPLS) | λ (HOPLS) |
---|---|---|---|
−80° | 003 | 1 | 1 |
−80° | 033 | 4 | 1 |
−80° | 124 | 5 | 4 |
−80° | 134 | 4 | 2 |
−80° | 153 | 3 | 5 |
0° | 003 | 1 | 3 |
0° | 033 | 3 | 1 |
0° | 124 | 10 | 3 |
0° | 134 | 2 | 1 |
0° | 153 | 1 | 7 |
80° | 003 | 1 | 3 |
80° | 033 | 7 | 5 |
80° | 124 | 10 | 7 |
80° | 134 | 4 | 3 |
80° | 153 | 1 | 9 |
The optimal R and λ differ among the five predicted subjects and the three azimuths. With these optimal values, good prediction performance is obtained: the ASDR is larger than 12.46 dB, whereas other selections of R and λ give a lower ASDR. This implies that the performance of the individual HRTF prediction model can be adjusted by these two hyperparameters.
4.2.2 Subjective localization experiment
Localization impairment scale in the subjective tests
Grade | Localization similarity |
---|---|
1 | Very different |
2 | Slightly different |
3 | Slightly similar |
4 | Very similar |
5 | No difference |
5 Conclusions
High-dimensional HRTFs and redundant anthropometric parameters greatly affect individual HRTF customization. We construct a multi-linear regression model between the HRTFs and the anthropometric parameters. The individual core tensor, the output variable of the regression model, is first extracted from the measured HRTFs. Then, the key parameters are selected as the input variables of the multi-linear model on the basis of the individual core tensor. An appropriate hyperparameter selection achieves good prediction performance for the multi-linear model. Experimental results demonstrate better performance in predicting the individual HRTFs than the PLSR method, especially for sound directions ipsilateral to the concerned ear. The listening tests show that the predicted HRTFs are close to the original ones in terms of sound localization. The performance of the individual HRTF prediction is relatively poor at high elevations on the side contralateral to the concerned ear. In our future work, we will carry out our own anthropometric measurements to predict the individual HRTFs and focus on improving the prediction performance of the contralateral HRTF personalization. Non-linear methods for HRTF tensor estimation, built on the current work, are a further topic for future study.
Declarations
Acknowledgements
The authors would like to thank the editor and anonymous reviewers for their valuable comments. This work was supported by the National Natural Science Foundation (61001160) and Innovation Program of Shanghai Municipal Education Commission (12YZ023) of China.
Authors’ Affiliations
References
- Gumerov NA, Duraiswami R, Tang ZH: Numerical study of the influence of the torso on the HRTF. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2. Orlando; 2002:II1965-1968.Google Scholar
- Gumerov NA, O'Donovan AE, Duraiswami R, Zotkin DN: Computation of head-related transfer function via the fast multipole accelerated boundary element method and its spherical harmonic representation. J. Acoust. Soc. Am 2010, 127(1):370-386. 10.1121/1.3257598View ArticleGoogle Scholar
- Kahana Y, Nelson PA: Boundary element simulations of the transfer function of human heads and baffled pinnae using accurate geometric models. J. Sound Vib 2007, 119(5):552-579.View ArticleGoogle Scholar
- Algazi V, Duda R, Duraiswami R, Gumerov N, Tang Z: Approximating the head-related transfer function using simple geometric models of the head and torso. J. Acoust. Soc. Am 2002, 112(5):2053-2064. 10.1121/1.1508780View ArticleGoogle Scholar
- Xu S, Li Z, Zeng L, Salvendy G: A study of morphological influence on head-related transfer functions. IEEE International Conference on Industrial Engineering and Engineering Management, Singapore 2007, 472-476.Google Scholar
- Fels J, Vorlander M: Anthropometric parameters influencing head-related transfer functions. Acta Acustica united with Acustica 2009, 95(2):331-342. 10.3813/AAA.918156View ArticleGoogle Scholar
- Rothbucher M, Habigt T, Habigt JL, Riedmaier T, Diepold K: Measuring anthropometric data for HRTF personalization. Processing of the 6th International Conference on Signal-Image Technology and Internet-Based Systems, Kuala Lumpur 2010, 102-106.Google Scholar
- Zhang M, Kennedy RA, Abhayapala TD, Zhang W: Statistical method to identify key anthropometric parameters in HRTF individualization. In 2011 Joint Workshop on Hands-Free Speech Communication and Microphone Arrays, HSCMA'11. Edinburgh; 2011:213-218.View ArticleGoogle Scholar
- Zotkin DN, Hwang J, Duraswami R, Davis LS: HRTF personalization using anthropometric measurements. IEEE ASSP WASPAA'2003, New Paltz 2003, 157-160.Google Scholar
- Zeng XY, Wang SG, Gao LP: A hybrid algorithm for selecting head-related transfer function based on similarity of anthropometric structures. J. Sound Vib 2010, 329(19):4093-4105. 10.1016/j.jsv.2010.03.031View ArticleGoogle Scholar
- Inoue N, Kimura T, Nishino T, Itou K, Takeda K: Evaluation of HRTFs estimated using physical features. Acoust. Sci. Technol 2005, 26(5):453-455. 10.1250/ast.26.453View ArticleGoogle Scholar
- Hu HM, Zhou L, Zhang J, Ma H, Wu ZY: Head related transfer function personalization based on multiple regression analysis. In IEEE International Conference on Computational Intelligence and Security. Guangzhou; 2006, 2:1829-1832.
- Xu S, Li ZZ, Salvendy G: Improved method to individualize head-related transfer function using anthropometric measurements. Acoust. Sci. Technol 2008, 29(6):388-390. 10.1250/ast.29.388
- Matsui K, Ando A: Estimation of individualized head-related transfer function based on principal component analysis. Acoust. Sci. Technol 2009, 30(5):338-347. 10.1250/ast.30.338
- Hwang S, Park YJ, Park YS: Modeling and customization of head-related transfer functions using principal component analysis. In IEEE International Conference on Control, Automation and Systems (ICCAS). Seoul; 2008:227-231.
- Sodnik J, Umek A, Susnik R, Bobojevic G: Representation of head related transfer functions with principal component analysis. Proceedings of the Annual Conference of the Australian Acoustical Society, NSW 2004, 603-607.
- Wang L, Yin FL, Chen Z: HRTF compression via principal components analysis and vector quantization. IEICE Electron Express 2008, 5(9):321-325. 10.1587/elex.5.321
- Wang L, Yin FL, Chen Z: Head-related transfer function interpolation through multivariate polynomial fitting of principal component weights. Acoust. Sci. Technol 2009, 30(6):395-403. 10.1250/ast.30.395
- Xie BS: Recovery of individual head-related transfer functions from a small set of measurements. J. Acoust. Soc. Am 2012, 132(1):282-294. 10.1121/1.4728168
- Kistler DJ, Wightman FL: A model of head-related transfer functions based on principal components analysis and minimum-phase reconstruction. J. Acoust. Soc. Am 1992, 91(3):1637-1647. 10.1121/1.402444
- Grindlay G, Vasilescu MAO: A multilinear (tensor) framework for HRTF analysis and synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Honolulu; 2007:I161-164.
- Rothbucher M, Durkovic M, Shen H, Diepold K: HRTF customization using multiway array analysis. In EUSIPCO'2010. Denmark; 2010:229-233.
- Chen ZW, Yu GZ, Xie BS, Guan SQ: Calculation and analysis of near-field head-related transfer functions from a simplified head-neck-torso model. Chin. Phys. Lett 2012, 29(3):034302. 10.1088/0256-307X/29/3/034302
- Hu HM, Zhou L, Ma H, Wu ZY: Head-related transfer function personalization based on partial least square regression. J. Electron. Inform. Technol 2008, 30(1):154-158.
- Hu HM, Zhou L, Ma H, Wu ZY: HRTF personalization based on artificial neural network in individual virtual auditory space. J. Appl. Acoust 2008, 69(2):163-172. 10.1016/j.apacoust.2007.05.007
- Xu S, Li ZZ, Salvendy G: Individual head-related transfer functions based on population grouping. J. Acoust. Soc. Am 2008, 124(5):2708-2710. 10.1121/1.2982398
- Hugeng W, Gunawan D: Improved method for individualization of head-related transfer functions on horizontal plane using reduced number of anthropometric measurements. J. Telecommun 2010, 2(2):31-41.
- Nishino T, Inoue N, Takeda K, Itakura F: Estimation of HRTFs on the horizontal plane using physical features. Appl. Acoust 2007, 68(8):897-908. 10.1016/j.apacoust.2006.12.010
- Nishino T, Nakai Y, Takeda K, Itakura F: Estimating head related transfer function using multiple regression analysis. IEICE Trans. A 2001, 84:260-268.
- Huang QH, Fang Y: Modeling personalized head-related impulse response using support vector regression. J Shanghai Univ (English edition) 2009, 13:428-432. 10.1007/s11741-009-0602-2
- Huang QH, Zhuang QL: HRIR personalisation using support vector regression in independent feature space. Electron. Lett 2009, 45(19):1002-1003. 10.1049/el.2009.1865
- Li L, Huang QH: HRTF personalization modeling based on RBF neural network. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver; 2013:3707-3710.
- Rothbucher M, Shen H, Diepold K: Dimensionality reduction in HRTF by using multiway array analysis. In Human Centered Robot Systems. Berlin: Springer; 2009:103-110.
- De Lathauwer L, De Moor B, Vandewalle J: A multilinear singular value decomposition. SIAM J Matrix Anal Appl 2000, 21(4):1253-1278. 10.1137/S0895479896305696
- Bergqvist G, Larsson EG: The higher-order singular value decomposition: theory and an application [lecture notes]. IEEE Signal Process. Mag 2010, 27(3):151-154.
- Kolda TG, Bader BW: Tensor decompositions and applications. SIAM Rev 2009, 51(3):455-500. 10.1137/07070111X
- Xie BS, Zhong XL, Rao D, Liang ZQ: Head-related transfer function database and its analyses. Sci. China, Ser. G 2007, 50:267-280. 10.1007/s11433-007-0018-x
- De Lathauwer L, De Moor B, Vandewalle J: On the best rank-1 and rank-(R1, R2,…, RN) approximation of higher-order tensors. SIAM J Matrix Anal Appl 2000, 21(4):1324-1342. 10.1137/S0895479898346995
- Gupta N, Barreto A, Joshi M, Agudelo JC: HRTF database at FIU DSP Lab. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Dallas; 2010:169-172.
- He X, Cai D, Niyogi P: Laplacian score for feature selection. In Proceedings of Advances in Neural Information Processing Systems. Vancouver; 2005:507-514.
- Zhao QB, Caiafa CF, Mandic DP, Zhang L, Ball T, Schulze-Bonhage A, Cichocki A: Multilinear subspace regression: an orthogonal tensor decomposition approach. In Advances in Neural Information Processing Systems 24 (NIPS). Granada; 2011:1269-1277.
- Zhao QB, Caiafa CF, Mandic DP, Chao ZC, Nagasaka Y, Fujii N, Zhang L, Cichocki A: Higher-order partial least squares (HOPLS): a generalized multi-linear regression method. IEEE Trans Pattern Anal Mach Intell 2013, 35(7):1660-1673.
- Wang HW: Partial Least Square Regression-Method and Application. Beijing: National Defense Industry Press; 2009:150-170.
- CIPIC HRTF database files, release 1.0. Accessed 28 July 2012. http://interface.cipic.ucdavis.edu/
- Chanda PS, Park S, Kang TI: A binaural synthesis with multiple sound sources based on spatial features of head-related transfer functions. In IEEE IJCNN'06. Vancouver; 2006:1726-1730.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.