4.1 Choice of divergence
As mentioned in Section 2.1, three settings of the β-divergence are commonly used for NMF and NTF: Euclidean distance (β=2), KL divergence (β=1), and Itakura-Saito (IS) divergence (β=0). Their differences were investigated by C. Févotte [2]. One important characteristic of the IS divergence that the other two do not share is scale invariance: the absolute scale of the given audio does not affect the cost. That is, spectrogram bins that are too small to be noticed are approximated as accurately as the dominant bins. We therefore assume that IS-NTF is more appropriate when a relatively weak signal comes from a direction close to that of the spatial cue. However, this assumption probably holds only when there is little ambient noise [30]. Thus, we selected IS-NTF for our initial experiments and used noise-free input signals, such as commercial music. Another motivation for employing the IS divergence comes from a statistical perspective. It has been shown that ML estimation of a sum of complex Gaussian components in each spectral bin is equivalent to minimizing the IS divergence between the ideal and estimated power spectrograms [31]. Despite these clear advantages, IS-NTF has the drawback that it is more often caught in local minima.
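The scale property described above can be checked numerically. The following is a minimal sketch, assuming the standard element-wise definition of the β-divergence; the function name `beta_divergence` and the random test data are illustrative and not taken from the paper. It rescales both the data and its approximation by the same factor and compares the cost before and after.

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Sum of element-wise beta-divergences d_beta(x | y) for positive arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if beta == 0:  # Itakura-Saito divergence
        r = x / y
        return float(np.sum(r - np.log(r) - 1.0))
    if beta == 1:  # generalized KL divergence
        return float(np.sum(x * np.log(x / y) - x + y))
    # General case; beta = 2 gives half the squared Euclidean distance.
    num = x**beta + (beta - 1) * y**beta - beta * x * y**(beta - 1)
    return float(np.sum(num / (beta * (beta - 1))))

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=100)  # stand-in for power-spectrogram bins
y = rng.uniform(0.1, 1.0, size=100)  # stand-in for their approximation
scale = 1e-3                          # make every bin 1000 times smaller

# Ratio of the cost after rescaling to the cost before rescaling.
ratios = {
    beta: beta_divergence(scale * x, scale * y, beta) / beta_divergence(x, y, beta)
    for beta in (0, 1, 2)
}
```

For β=0 the ratio is 1, whereas it equals `scale` for β=1 and `scale**2` for β=2, so quiet bins contribute almost nothing to the KL and Euclidean costs but keep their full contribution under the IS divergence.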
The simplest solution to this problem might be to perform several training runs and then select the best result among them. Another approach to mitigating this effect is to temper NTF by changing the type of divergence during the iterations [32]. For example, training could start with EUC-NTF (NTF based on the Euclidean distance), which is relatively robust to local minima, and finish with IS-NTF, which produces better results. This would require careful control of β and would probably need more iterations than usual.
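Both remedies can be sketched together. The code below is an illustration under simplifying assumptions, not the method of [32]: it uses two-dimensional NMF as a stand-in for NTF, the standard heuristic multiplicative updates for the β-divergence, and a hypothetical linear schedule that moves β from 2 to 0 and then holds it. All function names and parameters (`beta_schedule`, `hold`, `tempered_nmf`, `best_of_restarts`) are invented for this example.

```python
import numpy as np

def beta_schedule(n_iter, beta_start=2.0, beta_end=0.0, hold=0.5):
    """Move beta linearly from beta_start to beta_end, then hold beta_end.

    `hold` is the fraction of iterations spent at beta_end (an assumed choice).
    """
    n_ramp = min(n_iter, max(1, int(round(n_iter * (1.0 - hold)))))
    ramp = np.linspace(beta_start, beta_end, n_ramp)
    return np.concatenate([ramp, np.full(n_iter - n_ramp, beta_end)])

def is_cost(V, V_hat):
    """Itakura-Saito divergence between V and its approximation V_hat."""
    r = V / V_hat
    return float(np.sum(r - np.log(r) - 1.0))

def tempered_nmf(V, n_bases, n_iter=200, seed=0, eps=1e-12):
    """NMF with multiplicative updates whose beta follows beta_schedule."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.1, 1.0, size=(V.shape[0], n_bases))
    H = rng.uniform(0.1, 1.0, size=(n_bases, V.shape[1]))
    for beta in beta_schedule(n_iter):
        V_hat = W @ H + eps
        H *= (W.T @ (V * V_hat**(beta - 2))) / (W.T @ V_hat**(beta - 1) + eps)
        V_hat = W @ H + eps
        W *= ((V * V_hat**(beta - 2)) @ H.T) / (V_hat**(beta - 1) @ H.T + eps)
    return W, H, is_cost(V, W @ H + eps)

def best_of_restarts(V, n_bases, n_restarts=5, **kwargs):
    """Run several random initializations and keep the lowest final IS cost."""
    runs = [tempered_nmf(V, n_bases, seed=s, **kwargs) for s in range(n_restarts)]
    return min(runs, key=lambda run: run[2])

# Synthetic positive low-rank data as a stand-in for a power spectrogram.
rng = np.random.default_rng(1)
V = rng.uniform(0.1, 1.0, (20, 3)) @ rng.uniform(0.1, 1.0, (3, 30)) + 0.01
W, H, cost = best_of_restarts(V, n_bases=3, n_restarts=3, n_iter=100)
```

Every run is scored with the IS cost at the end, since that is the divergence the tempered schedule finishes on; the length and shape of the ramp are tuning choices that would need to be controlled with care.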
4.2 Initialization of channel matrix
The initialization of the channel matrix, Q, is based on the interpretation of Q explained in Section 3.
(7)
where $v_{0,kl}$ and $v_{1,kl}$ are the spectrogram bins for the left and right channels, respectively. This feature relates the arrows in Figure 1 to each spectrogram bin. The locations of the sources can be determined by searching for peaks in the histogram of this feature (Figure 3); the peaks indicate where sources are dominantly present. The basis elements are allocated preferentially by initializing the channel matrix, Q, based on this histogram: more elements are allocated to directions where sources are likely to exist, although some are allocated so that all directions are covered. Figure 3 shows the histogram calculated from a mixture of three instruments placed at different positions. The length of each arrow corresponds to the frequency of the corresponding histogram bin. Because arrows appear in three directions in the histogram, it is highly likely that sources exist at 50° to 60°, 90° to 100°, and 110° to 120°. However, basis elements need not be allocated to the left of the leftmost peak or to the right of the rightmost peak, since the superposition of the sources never appears outside these peaks. It is therefore preferable to allocate basis elements inside the range spanned by the leftmost and rightmost peaks of the measured histogram, as can be seen in the right image of Figure 1.
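The allocation idea can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the paper: the per-bin direction feature is taken as a given array of angles in degrees (Equation 7 is not recomputed here), the peak rule and its `peak_ratio` threshold are hypothetical, and the proportional sharing by largest remainder is one possible choice. The function name `allocate_bases` is invented for this example.

```python
import numpy as np

def allocate_bases(angles_deg, n_bases, n_dirs=18, peak_ratio=0.3):
    """Return (counts, alloc): histogram counts and bases per direction bin.

    angles_deg: per-bin direction feature in degrees, assumed to lie in [0, 180).
    """
    counts, _ = np.histogram(angles_deg, bins=n_dirs, range=(0.0, 180.0))
    # Hypothetical peak rule: local maxima reaching peak_ratio of the tallest bin.
    padded = np.concatenate([[0], counts, [0]])
    is_peak = ((counts >= padded[:-2]) & (counts >= padded[2:])
               & (counts >= peak_ratio * counts.max()))
    peaks = np.flatnonzero(is_peak)
    lo, hi = peaks[0], peaks[-1]
    n_active = hi - lo + 1
    if n_bases < n_active:
        raise ValueError("need at least one basis element per active direction")
    alloc = np.zeros(n_dirs, dtype=int)
    alloc[lo:hi + 1] = 1  # cover every direction between the outermost peaks
    # Share the remaining elements in proportion to the histogram counts.
    remaining = n_bases - n_active
    active = counts[lo:hi + 1].astype(float)
    share = remaining * active / active.sum()
    extra = np.floor(share).astype(int)
    leftover = remaining - extra.sum()
    order = np.argsort(-(share - extra), kind="stable")
    extra[order[:leftover]] += 1  # largest fractional parts get the leftovers
    alloc[lo:hi + 1] += extra
    return counts, alloc

# Synthetic angles: three sources near 55, 95, and 115 degrees plus weak clutter.
rng = np.random.default_rng(0)
angles = np.concatenate([
    rng.normal(55.0, 2.0, 400),
    rng.normal(95.0, 2.0, 400),
    rng.normal(115.0, 2.0, 400),
    rng.uniform(0.0, 180.0, 100),
])
counts, alloc = allocate_bases(angles, n_bases=20)
```

With this synthetic input, every 10° bin from 50° to 120° receives at least one basis element, the three peak bins receive most of them, and no element is placed outside the outermost peaks.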
4.3 Initialization of frequency matrix and time matrix
The frequency matrix, W, and the time matrix, H, are initialized simply by using the information in the histogram such that
(8)
(9)
where $N_{Grp(d)}$ denotes the frequency per bin of the histogram, and $Grp(d)$ denotes the collection of bases allocated to the direction with index d. The matrices are then normalized in order to concentrate the energy of the input tensor V into H.
(10)
(11)
where $|\cdot|_1$ denotes the L1-norm, and $e_d$ denotes the directional energy associated with the direction index d.
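The effect of such a normalization can be sketched in a few lines. This is a sketch under assumptions, not a reconstruction of Equations 10 and 11: it assumes a PARAFAC model with Q of shape (channels × bases), W of shape (frequencies × bases), and H of shape (bases × frames), scales each column of Q and W to unit L1-norm, and moves the removed scale into the matching row of H. The directional energy is then taken as the sum of the rows of H in each group. The function names and the `groups` list are illustrative.

```python
import numpy as np

def normalize_into_h(Q, W, H):
    """Scale columns of Q and W to unit L1-norm; move the scale into rows of H."""
    q_norm = np.abs(Q).sum(axis=0)  # L1-norm of each column of Q
    w_norm = np.abs(W).sum(axis=0)  # L1-norm of each column of W
    return Q / q_norm, W / w_norm, H * (q_norm * w_norm)[:, None]

def directional_energy(H, groups):
    """groups[d] lists the basis indices allocated to direction d, i.e. Grp(d)."""
    return np.array([H[idx].sum() for idx in groups])

def reconstruct(Q, W, H):
    """PARAFAC model: V_hat[c, f, t] = sum_k Q[c, k] * W[f, k] * H[k, t]."""
    return np.einsum("ck,fk,kt->cft", Q, W, H)

rng = np.random.default_rng(0)
Q = rng.uniform(0.1, 1.0, (2, 4))  # 2 channels, 4 basis elements
W = rng.uniform(0.1, 1.0, (6, 4))  # 6 frequency bins
H = rng.uniform(0.1, 1.0, (4, 5))  # 5 time frames
groups = [[0, 1], [2], [3]]        # hypothetical grouping of bases by direction

Qn, Wn, Hn = normalize_into_h(Q, W, H)
e = directional_energy(Hn, groups)
```

The model tensor is unchanged by the rescaling, and because every column of `Qn` and `Wn` sums to one, the total of the model tensor equals the total of `Hn`. In that sense all of the energy sits in H, and `e` splits it by direction.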
4.4 Weighting function
Since the spatial cue indicates which direction should be given preference, and since the histogram of the feature in Equation 7 indicates which source is dominant for a given direction, the spectrogram bins associated with the spatial cue can be approximated more precisely than other bins. This is easy to accomplish by using a suitable weighting tensor, G, in the cost function:
(12)
where ψ determines the shape of the exponential function. Figure 4 shows how changing the weighting parameter ψ produces different shapes of the weighting function while the spatial cue is fixed to point toward 100°. The weighting values for the different ψ are accentuated with markers. When ψ equals 0, all the weights for the bin-wise cost functions become 1, which reduces the update rules of sc-NTF described in Section 2.1 to those used for PARAFAC-NTF.
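To illustrate the role of ψ, the sketch below uses a hypothetical weighting function, `exp(-psi * |angle - cue| / 180)`. This form is an assumption chosen only because it is exponential and gives a weight of 1 everywhere when ψ equals 0; it is not the weighting of Equation 12. The names `exponential_weights` and `weighted_is_cost` are likewise invented for this example.

```python
import numpy as np

def exponential_weights(angles_deg, cue_deg, psi):
    """Hypothetical weight that decays with angular distance from the cue."""
    dist = np.abs(np.asarray(angles_deg, dtype=float) - cue_deg) / 180.0
    return np.exp(-psi * dist)

def weighted_is_cost(V, V_hat, G):
    """Bin-wise Itakura-Saito costs, each multiplied by its weight in G."""
    r = V / V_hat
    return float(np.sum(G * (r - np.log(r) - 1.0)))

angles = np.arange(0, 181, 10)  # directions from 0 to 180 degrees
cue = 100.0                     # spatial cue pointing toward 100 degrees
weights = {psi: exponential_weights(angles, cue, psi) for psi in (0.0, 2.0, 8.0)}

rng = np.random.default_rng(0)
V = rng.uniform(0.1, 1.0, angles.size)      # stand-in spectrogram bins
V_hat = rng.uniform(0.1, 1.0, angles.size)  # stand-in approximation
flat_cost = weighted_is_cost(V, V_hat, weights[0.0])
cued_cost = weighted_is_cost(V, V_hat, weights[8.0])
```

With ψ=0 the weighted cost is identical to the plain IS cost. Larger ψ concentrates the cost on bins near the cue, so errors far from the cue direction are penalized less.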
4.5 Constraints
The energy for each direction should be estimated by summing all the basis elements in matrix H over time, in the same way as in Equation 11. The estimated energy is fixed so that it can serve as a reference for constraining the energy distribution of the estimated tensor. This should reduce the likelihood of being trapped in local minima. Here, we again use the IS divergence to measure distance. The constraint on the energy in a given direction is
(13)
Taking into account the normalization procedure in Equation 10, the constraint reduces to
(14)
For IS-NTF, the following should hold for the derivative of the constraint:
(15)
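As background for the derivative of an IS-based constraint, the following gives the standard scalar IS divergence and its derivative with respect to the estimate. It is a general identity and is not a reconstruction of Equations 13 to 15; the symbols $x$ (reference value) and $y$ (estimate) are generic placeholders.

```latex
d_{IS}(x \mid y) = \frac{x}{y} - \log\frac{x}{y} - 1,
\qquad
\frac{\partial}{\partial y}\, d_{IS}(x \mid y)
  = -\frac{x}{y^{2}} + \frac{1}{y}
  = \frac{y - x}{y^{2}} .
```

The derivative vanishes only at $y = x$, is negative for $y < x$, and is positive for $y > x$. A constraint of this type therefore pulls the estimated energy toward the fixed reference energy from either side.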