An experimental study of neural estimators of the mutual information between random vectors modeling power spectrum features

Mutual information (MI) quantifies the statistical dependency between a pair of random variables and plays a central role in signal processing and data analysis. Recent advances in machine learning have enabled the estimation of MI from a dataset using the expressive power of neural networks. In this study, we conducted a comparative experimental analysis of several existing neural estimators of the MI between random vectors that model power spectrum features. We explored alternative models of the power spectrum features by leveraging the information-theoretic data processing inequality and bijective transformations. The empirical results demonstrate that each neural estimator of MI covered in this study has its own limitations. For practical applications, we recommend using the existing neural estimators collectively, in a complementary manner, when estimating the MI between power spectrum features.


Introduction
Mutual information (MI) is a measure of the amount of information that one random variable X contains about another random variable Y. Formally, the MI between X and Y, denoted I(X, Y), is the Kullback-Leibler (KL) divergence between the joint probability density p(x, y) and the product p(x)p(y) of the marginal densities [1]; with the base of the logarithm e, the information is measured in nats.
A processing stage that maximizes the mutual information between its output and its input mixed with noise is a way to extract useful input features and provides a model for perceptual functions [2-5]. This infomax principle forms the basis for applications such as speech recognition [6] and blind source separation [5]. In each application, learning rules are devised to maximize the MI.
(1) I(X, Y) = E_{p(x,y)}[ log( p(X, Y) / (p(X) p(Y)) ) ]

Recent advances in MI estimation and optimization using neural networks have enabled unsupervised learning of representations based on the infomax principle [7, 8] and contrastive predictive coding [9]. The latter combines predictive coding with a probabilistic contrastive loss dependent on MI [9]. These applications involve neural networks that estimate MI from a finite dataset without prior knowledge of the underlying distribution.
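To make definition (1) concrete, the following sketch (NumPy only; the helper names are ours) estimates the MI of a correlated bivariate Gaussian by averaging the log density ratio over joint samples, and compares the result against the closed-form value -0.5 log(1 - rho^2).

```python
import numpy as np

def gaussian_mi_closed_form(rho):
    """Ground-truth MI (in nats) between two unit-variance Gaussians
    with correlation coefficient rho: -0.5 * log(1 - rho^2)."""
    return -0.5 * np.log(1.0 - rho ** 2)

def mi_monte_carlo(rho, n=200_000, seed=0):
    """Monte Carlo estimate of (1): average the log density ratio
    log p(x, y) - log p(x) - log p(y) over samples from the joint."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    x, y = xy[:, 0], xy[:, 1]
    det = 1.0 - rho ** 2
    # log joint density of the correlated bivariate Gaussian
    log_joint = (-0.5 * (x**2 - 2*rho*x*y + y**2) / det
                 - np.log(2*np.pi) - 0.5*np.log(det))
    # log marginal densities (standard normal)
    log_px = -0.5 * x**2 - 0.5*np.log(2*np.pi)
    log_py = -0.5 * y**2 - 0.5*np.log(2*np.pi)
    return np.mean(log_joint - log_px - log_py)
```

The same sample-average construction underlies the neural estimators discussed below, with the crucial difference that there the density ratio is unknown and must be learned from data.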
However, measuring MI from a finite dataset poses challenges due to inherent statistical limitations. It is impractical to obtain a high-confidence variational lower bound for MI larger than O(log N), where N is the number of data samples [10]. This exponential sample complexity also applies to specific estimators, for example, the conventional k-nearest-neighbor estimator [11, 12] and more recent variational estimators based on Donsker and Varadhan's representation of the KL divergence [13, 14].
In this paper, we experimentally compare the performance of various MI estimators, including variational estimators subject to the formal limitations derived in [10]. The experiments also incorporate an alternative estimator based on a flow-based generative model [14, 15], which has the potential to complement the variational estimators when the true MI is large [10, 13, 14]. Collectively, we quantitatively evaluate the performance of the MI estimators to assess the applicability of recent advances in MI estimation.
While quantitative performance evaluations of MI estimators are present in the literature, they are often confined to the estimation of MI between Gaussian random vectors, where the ground-truth MI is available in closed form [13, 14, 22]. This work extends the previous results to MI estimation between exponential random vectors that model power spectrum features in the frequency domain, where consecutive samples are likely to be asymptotically uncorrelated. Additionally, we include experiments in the high-MI region, where the flow-based estimator shows a relative advantage over the variational estimators.
To quantitatively evaluate the performance of the estimators against the ground-truth MI, we employ a simplified model of the power spectrum features with a tractable expression for the MI. Additionally, we explore more realistic alternative models of the power spectrum features by leveraging the information-theoretic data processing inequality. Potential applications of the MI between power spectrum features can be found in the speech processing literature, e.g., [17-19].

Estimators based on variational bounds
Beginning with the classic Barber and Agakov lower bound [20] and following the exposition in [13], we formulate the variational lower bounds of Nguyen [21], van den Oord [9], Belghazi [22], and Song [14]. These formulations, which utilize a critic function f_θ(x, y) parameterized by a neural network to approximate the ratio of probability density functions p(x|y)/p(x) = p(y|x)/p(y) [13, 14], serve as the candidates for the experimental study in this paper.
First, the lower bound of Barber and Agakov is obtained by replacing the intractable conditional probability density function p(x|y) with a variational distribution q(x|y) [13, 20],

(2) I(X, Y) = h(X) - h(X|Y) = h(X) + E_{p(x,y)}[log p(X|Y)] ≥ h(X) + E_{p(x,y)}[log q(X|Y)],

where h(X) denotes the differential entropy of the random variable X.
To circumvent the unknown differential entropy h(X), the variational distribution q(x|y) is defined using the critic function f_θ(x, y) as [13]

(3) q(x|y) = p(x) e^{f_θ(x,y)} / g(y),

where the partition function g(y) is given by

(4) g(y) = E_{p(x)}[e^{f_θ(X,y)}].

By substituting (3) into the last line of (2) and applying Jensen's inequality, the following lower bound of Donsker and Varadhan is obtained [13],

(5) I_DV = E_{p(x,y)}[f_θ(X, Y)] - log E_{p(x)p(y)}[e^{f_θ(X,Y)}].

Applying the inequality log(h) ≤ h/e with h = E_{p(x)p(y)}[e^{f_θ(X,Y)}] in (5), the NWJ estimate of Nguyen, Wainwright, and Jordan is derived as [13, 21]

(6) I_NWJ = E_{p(x,y)}[f_θ(X, Y)] - e^{-1} E_{p(x)p(y)}[e^{f_θ(X,Y)}].

The NWJ bound in (6) is tight with the optimal critic f_opt(x, y) = 1 + log(p(x|y)/p(x)) [13]. However, the variance of the NWJ bound is large, because it relies on an estimate of the partition function whose variance increases exponentially with the MI [13, 14].

The variance of the NWJ bound can be reduced by using multiple independent samples. The Noise Contrastive Estimation (NCE) lower bound is obtained by averaging the bound over N replicates [13],

(7) I_NCE = E[ (1/N) Σ_{i=1}^{N} log( e^{f_θ(x_i, y_i)} / ((1/N) Σ_{j=1}^{N} e^{f_θ(x_i, y_j)}) ) ].

However, unlike the NWJ bound, the NCE bound is loose, since it is upper bounded by log(N) [13].

Another neural estimator, the Mutual Information Neural Estimator (MINE), is also derived from the lower bound of Donsker and Varadhan in (5). The gradient of the Donsker and Varadhan lower bound is given by

(8) ∇_θ I_DV = E_{p(x,y)}[∇_θ f_θ(X, Y)] - E_{p(x)p(y)}[e^{f_θ(X,Y)} ∇_θ f_θ(X, Y)] / E_{p(x)p(y)}[e^{f_θ(X,Y)}],

where taking the expectations over a mini-batch leads to a biased gradient estimate [22]. The bias is reduced by the MINE gradient estimator, which replaces the estimate in the denominator of the second term in (8) with an exponential moving average across mini-batches [22]. Since the MINE estimate also includes the partition function, its variance increases rapidly with the MI [10, 14]. To alleviate this, the Smoothed Mutual Information Lower-bound Estimator (SMILE) limits the variance of the partition function using the clipping function clip(a, τ) = max(min(e^a, e^τ), e^{-τ}) for some τ ≥ 0 [14],

(9) I_SMILE = E_{p(x,y)}[f_θ(X, Y)] - log E_{p(x)p(y)}[clip(f_θ(X, Y), τ)].

Specifically, I_DV and I_NWJ are appealing because they become tight with the optimal critic. However, they exhibit high variance due to their reliance on the high-variance partition function estimator [13]. I_NCE and I_SMILE reduce the variance at the cost of increased bias. The hyperparameter τ of I_SMILE can be adjusted for the bias-variance trade-off [14]; for this work, we fixed τ = 1.
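To illustrate how these bounds are computed in practice, the sketch below (a simplified NumPy version; the function names are ours) evaluates each bound from an N x N matrix of critic scores f(x_i, y_j), where the diagonal holds joint samples and the off-diagonal entries serve as samples from the product of marginals. Plugging in the exact density log-ratio as an oracle critic, the Donsker-Varadhan bound recovers the true Gaussian MI up to sampling noise.

```python
import numpy as np

def logmeanexp(a):
    """Numerically stable log of the mean of exp(a)."""
    m = np.max(a)
    return m + np.log(np.mean(np.exp(a - m)))

def dv_bound(scores):
    """Donsker-Varadhan bound: joint term minus log partition estimate."""
    n = scores.shape[0]
    off = scores[~np.eye(n, dtype=bool)]
    return np.mean(np.diag(scores)) - logmeanexp(off)

def nwj_bound(scores):
    """NWJ bound: replaces log Z by Z/e, giving a looser but unbiased bound."""
    n = scores.shape[0]
    off = scores[~np.eye(n, dtype=bool)]
    return np.mean(np.diag(scores)) - np.exp(logmeanexp(off) - 1.0)

def nce_bound(scores):
    """InfoNCE bound: per-row contrastive terms; upper bounded by log N."""
    m = scores.max(axis=1, keepdims=True)
    row_lme = (m + np.log(np.mean(np.exp(scores - m), axis=1, keepdims=True))).ravel()
    return np.mean(np.diag(scores) - row_lme)

def smile_bound(scores, tau=1.0):
    """SMILE bound: partition estimate with e^f clipped to [e^-tau, e^tau]."""
    n = scores.shape[0]
    off = scores[~np.eye(n, dtype=bool)]
    clipped = np.clip(np.exp(off), np.exp(-tau), np.exp(tau))
    return np.mean(np.diag(scores)) - np.log(np.mean(clipped))
```

With a learned critic, the score matrix is produced by the neural network instead of an oracle, but the bound computations are identical.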
In summary, the variational estimators aim to maximize a lower bound of the mutual information with respect to the critic function implemented as a neural network. The loss function to minimize is defined as the negative of the lower bounds in (5), (6), (7), and (9), where the expectations are implemented by sample means over mini-batches. Gradient descent is employed to fit the parameters of the neural networks, and the estimates of the MI lower bound are smoothed across mini-batches in view of the small sample size of each mini-batch. The loss function evaluated during training is directly used as the estimator.

Estimators based on flow-based generative model
Given a set of data instances without labels, generative models capture the data distribution by fitting parameters to maximize the data likelihood. Neural networks are used to fit the parameters, and the data log-likelihoods log p_φ(x, y), log p_ψ(x), and log p_ξ(y) of samples from a mini-batch are evaluated to estimate the MI as follows [14],

(10) I_FLOW = E_{p(x,y)}[log p_φ(X, Y)] - E_{p(x)}[log p_ψ(X)] - E_{p(y)}[log p_ξ(Y)],

where the expectations are implemented by sample means. Again, exponential smoothing of the MI estimates across mini-batches is performed to obtain a more reliable estimator.
A restricted Boltzmann machine (RBM), described by a probabilistic undirected graph, is a maximum likelihood model that learns a probability distribution over its data [23]. Although greedy multi-layer training theoretically tightens the variational bounds on the log-likelihood, achieving the theoretical bounds may necessitate prolonged Gibbs sampling, leading to approximations such as contrastive divergence [23]. Additionally, the computation of the log-likelihood involves an intractable partition function [24].
In a directed graphical model, efficient approximate inference can be achieved by sharing variational parameters across data instances, a strategy called amortized inference [25]. In particular, the variational autoencoder (VAE) leverages the efficiency of amortized inference while counteracting sampling noise via the reparametrization trick. However, evaluating the log-likelihood of a data point requires Monte Carlo estimation using random samples of the latent variables drawn from the inference model [26].
Such approximations can be avoided by adopting a flow-based generative model instead, which leverages the change-of-variable law of probabilities to transform a simple distribution into a complex one [27]. A flow-based model using real-valued non-volume preserving transformations (real NVP) leads to an unsupervised learning algorithm that directly evaluates and minimizes the negative log-likelihood function [15]. This is advantageous for MI estimation using (10), which requires the computation of the log-likelihood for the data samples in a mini-batch.
Given observed data x, a latent variable z with a simple prior distribution p_Z, and a bijection f: X → Z, the log-likelihood is given by the change of variable formula [15],

(11) log p_X(x) = log p_Z(f(x)) + log |det(∂f(x)/∂x)|.

A flexible bijective model can be built by composing simple bijections,

(12) f = f_1 ∘ f_2 ∘ ... ∘ f_K.

The real NVP transformations are composed of such simple bijective functions, where each function, referred to as an affine coupling layer, is modeled as [15]

x_1 = z_1,
x_2 = z_2 ⊙ exp(s_{φ1}(z_1)) + t_{φ2}(z_1),

where ⊙ is the Hadamard product, and s_{φ1} and t_{φ2} stand for scale and translation functions, which can be made arbitrarily complex by employing deep neural networks. The inverse of the bijective function is z_1 = x_1, z_2 = (x_2 - t_{φ2}(x_1)) ⊙ exp(-s_{φ1}(x_1)), and the Jacobian determinant is exp(1^T s_{φ1}(z_1)), where 1 ∈ R^{n×1} is a vector of ones. Both the inverse and the Jacobian remain numerically efficient. Since each layer keeps the first half of the vector, z_1, unchanged, a permutation is performed after every layer [15].
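A minimal NumPy sketch of one affine coupling layer may clarify the mechanics; the scale and translation maps below are fixed random stand-ins for the two-layer perceptrons used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy feature dimension, split in half by the coupling layer

# stand-ins for the learned scale/translation networks s_phi1, t_phi2;
# tanh keeps the scale bounded for numerical stability
W_s = 0.1 * rng.standard_normal((D // 2, D // 2))
W_t = rng.standard_normal((D // 2, D // 2))
s = lambda z1: np.tanh(z1 @ W_s)
t = lambda z1: z1 @ W_t

def coupling_forward(z):
    """Affine coupling: keep z1, affinely transform z2 conditioned on z1."""
    z1, z2 = z[: D // 2], z[D // 2 :]
    x2 = z2 * np.exp(s(z1)) + t(z1)
    log_det = np.sum(s(z1))  # log |det J| = 1^T s(z1)
    return np.concatenate([z1, x2]), log_det

def coupling_inverse(x):
    """Exact inverse, no iterative solve required."""
    x1, x2 = x[: D // 2], x[D // 2 :]
    z2 = (x2 - t(x1)) * np.exp(-s(x1))
    return np.concatenate([x1, z2])
```

Because both directions and the log-determinant are cheap closed-form operations, stacking such layers (with permutations in between) yields a model whose exact log-likelihood can be evaluated for every sample in a mini-batch.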
While the variational approaches in Sect. 2.1 optimize a single critic f_θ(x, y), which gives rise to a partition function, the flow-based generative model optimizes the parameters of log p_φ(x, y), log p_ψ(x), and log p_ξ(y) separately. In particular, the flow-based estimator requires samples from the joint data distribution p(x, y) only, while the variational estimators, except I_NCE, further require samples from the product of marginals p(x)p(y) to evaluate the second terms on the right-hand sides of (5), (6), and (9) [14].

Experiments
In this section, we present experimental results quantifying, by means of the MI estimators, the degree of statistical dependence between a pair of random signal vectors. Due to the equivalence of the time and frequency domain representations of a signal, the MI can be equivalently estimated from spectral features.
The advantage of using spectral features is that the spectral coefficients tend to become mutually uncorrelated with increasing analysis window length for a stationary signal. Additionally, they tend to be asymptotically complex Gaussian due to the central limit theorem [28]. Thus, utilizing spectral features alleviates the burden of capturing the complicated inter-dependency within the signal vector.

Based on this asymptotic theory, the signal vectors are assumed to consist of elements that are exponentially distributed and element-wise correlated. The joint distribution between the signal vectors then reduces to a product of bivariate distributions,

(13) p(x, y) = Π_i p(x_i, y_i),

with

(14) p(x_i, y_i) = 1/(4σ^4(1 - r)) exp(-(x_i + y_i)/(2σ^2(1 - r))) I_0(√(r x_i y_i)/(σ^2(1 - r))),

in which I_0 denotes the modified Bessel function of the first kind, r denotes the correlation coefficient between the pair of exponential random variables (X_i, Y_i), and 2σ^2 is equal to the mean of the marginal exponential distribution [29].
The bivariate exponential distribution (14) is derived by transforming a pair of two-dimensional zero-mean Gaussian random vectors, [Z_{i1}, Z_{i2}]^T and [Z_{i3}, Z_{i4}]^T, with element-wise correlation coefficient ρ and zero correlation elsewhere (see Eq. (119) in [29]), through the transformations X_i = Z_{i1}^2 + Z_{i2}^2 and Y_i = Z_{i3}^2 + Z_{i4}^2, which give r = ρ^2. We calculate the true MI by two-dimensional numerical integration, substituting (14) into (1). Figure 1 shows the resulting MI between bivariate exponential variables versus the correlation coefficient r, compared with the MI between bivariate Gaussian variables versus the correlation coefficient ρ, evaluated from the closed-form expression -0.5 log(1 - ρ^2) [16].
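The Gaussian-to-exponential construction can be sketched as follows (NumPy; the function name is ours). The relation r = ρ^2 between the Gaussian and exponential correlation coefficients, which follows from this construction, is checked empirically.

```python
import numpy as np

def sample_bivariate_exponential(rho, sigma2=0.5, n=100_000, seed=1):
    """Draw element pairs (X_i, Y_i) as sums of squares of correlated
    zero-mean Gaussians. The marginals are exponential with mean
    2*sigma2, and the exponential correlation is r = rho**2."""
    rng = np.random.default_rng(seed)
    cov = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
    z13 = rng.multivariate_normal([0.0, 0.0], cov, size=n)  # (Z_i1, Z_i3)
    z24 = rng.multivariate_normal([0.0, 0.0], cov, size=n)  # (Z_i2, Z_i4)
    x = z13[:, 0] ** 2 + z24[:, 0] ** 2
    y = z13[:, 1] ** 2 + z24[:, 1] ** 2
    return x, y
```

Sampling in the Gaussian domain and squaring afterwards is what allows a target MI to be dialed in exactly through the choice of ρ, without ever sampling from (14) directly.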
The element-wise bivariate exponential model may be considered an oversimplification compared to more realistic models with an underlying super-Gaussian distribution [30-32]. However, drawing samples with a prescribed MI from a bivariate Gamma or Laplacian distribution based on the underlying super-Gaussian models in [30-32] is more complicated to implement. Instead, a more practical approach is to leverage the data processing inequality [1],

(15) I(g_1(X), g_2(Y)) ≤ I(X, Y), with equality for bijective g_1 and g_2,

where the transformed data g_1(X) and g_2(Y) create an alternative model of the power spectrum features. For instance, a simple element-wise transformation of the exponential model, g(X) = X^ν with ν > 1, results in a Weibull distribution derived from an underlying super-Gaussian model [33].

Tasks
In the literature, the signal vectors are assumed to be drawn from a 20-dimensional Gaussian distribution with element-wise correlation [13, 14, 22]. To assess performance with higher-dimensional data, we experimented with a signal dimension of 40 and compared the results with those for a signal dimension of 20. Unlike previous works, each pair of elements is sampled from the bivariate exponential distribution of (14).
The true MI between the elements is varied from 0.2 to 1 nats, in steps of 0.2 nats, with 5k iterations conducted at each step. Pairs of correlated complex Gaussian random vectors are drawn, with a correlation coefficient ρ corresponding to the true MI. The vectors obtained by taking the squared modulus of the complex Gaussian samples are fed into the neural networks. Then, the bias and variance of the estimators are analyzed and compared.
Furthermore, to investigate the validity of the estimators under data processing, we apply transformations preceding the input layer, such as the element-wise logarithm, squaring, scaling, and shift rotation. The input data (x, y) are replaced with (x, g(y)), where g denotes the applied transformation. Taking logarithms results in the commonly used log-spectral features. Squaring results in a heavy-tailed distribution, which corresponds to a super-Gaussian model of the power spectrum. Scaling implements a frequency-dependent gain of a system; for the scaling, we multiply the data vector by a diagonal matrix whose condition number is 10^3. Shift rotation is a frequency-shifting operation, which is related to Doppler shifts or harmonics; for the shift rotation, we circularly rotate the data by 10 bins.
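The four transformations can be sketched as element-wise operations on the feature vector y. The geometric gain profile below is our own illustrative way of realizing a diagonal scaling matrix with condition number 10^3; the experiments may have used a different profile.

```python
import numpy as np

def make_transforms(dim=20, shift_bins=10, cond=1e3):
    """Element-wise transformations g applied to y before the input layer.
    `cond` is the condition number (max/min gain) of the diagonal scaling."""
    gains = np.geomspace(1.0, cond, dim)  # one assumed gain profile
    return {
        "log": np.log,                              # log-spectral features
        "square": lambda y: y ** 2,                 # heavy-tailed spectrum
        "scale": lambda y: y * gains,               # frequency-dependent gain
        "shift": lambda y: np.roll(y, shift_bins),  # circular frequency shift
    }
```

All four maps are bijective on positive spectra, so by the data processing inequality with equality they leave the true MI unchanged while altering the input distribution seen by the estimators.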

Neural network architectures
For the variational methods, the critic function f_θ(x, y) is parameterized by a neural network. Assuming a separable critic f_θ(x, y) = h_{θ1}(x)^T g_{θ2}(y), where h and g are learned by two separate networks, only 2N forward passes are required for a batch size of N [13]. In a joint critic, x and y are concatenated and fed into a single network. For an equal batch size of N, the combination of the input vectors leads to N^2 forward passes for the joint critic. In general, joint critics tend to perform better than separable critics [13]. In this work, we consider the joint critic architecture only.
For all neural networks, we follow the setup in [13, 14]. We use the Adam optimizer with a learning rate of 5 × 10^{-4} [34] and a batch size of 128. Rectified linear unit (ReLU) activations are used for each neuron. For the variational estimators, we use two-layer perceptrons with 256 neurons per layer in the case of 20-dimensional inputs. For the flow-based generative method, each of the log-likelihoods is estimated from 6 coupling layers, with each coupling layer implemented by a two-layer perceptron with 100 neurons per layer in the case of 20-dimensional inputs [15]. The usual Gaussian priors are used for the flow-based method. For 40-dimensional inputs, we experimented with various network parameters to examine the reliability of the architectures as the dimension increases.

Experimental results
In Fig. 2, MI estimates between 20-dimensional vectors are presented, with the true MI stepping up every 5k iterations. The variational MI estimators exhibit large estimation errors at high MI, due to the high variance of the partition function estimators or due to bias [10, 13, 14]. In particular, I_NWJ and I_MINE show high variance, while I_NCE shows high bias due to the upper bound implied by (7). I_SMILE benefits from the bias-variance trade-off but also degrades at high MI.
On the other hand, the performance of the flow-based generative estimator, I_FLOW, is maintained even at high MI. In the experiment, the marginal distributions obtained from (14) are exponential with mean 2σ^2, irrespective of the MI, whereas the joint distribution changes according to the MI. Empirically, the sample variance of the joint entropy estimator, -E_{p(x,y)}[log p_φ(X, Y)] in (10), is not significantly influenced by increasing MI. However, the flow-based estimator is generally more biased than the variational estimators at low MI.
Figure 3 shows the MI estimates obtained by doubling the dimension of the vectors to D = 40. The results of Fig. 2 for D = 20 are duplicated in the leftmost column of Fig. 3 for comparison. The combined estimator in the last row, a heuristic algorithm combining the variational and flow-based estimators, is explained later. The four columns other than the leftmost one depict the MI estimates for D = 40. Each column is obtained by changing the neural network parameters, where the number of neurons per layer and the batch size are either unchanged or doubled in line with the increased dimension.
The variational MI estimators for D = 40 in Fig. 3 exhibit more severe deviations from the true value than for D = 20, regardless of the changes in the network parameters. Particularly in case 2 of Fig. 3, I_NWJ tends toward negative values at high MI. In general, the architecture of the variational estimators is difficult to scale up with the data dimension. Under the current architecture, it may be necessary to divide the frequency bins into disjoint subbands and estimate the MI separately, resorting to the asymptotic statistical independence between subbands for a stationary signal.
In contrast, the architecture of the flow-based estimator appears to be more reliable for D = 40. At high MI, the performance of the estimator for D = 40 is comparable to that for D = 20 when both the number of neurons per layer and the batch size are doubled. However, in the intermediate MI region, the convergence of the estimator tends to be less stable when the dimension increases to D = 40.
Figure 4 shows MI estimates between 20-dimensional input vectors transformed by the element-wise logarithm, squaring, scaling, and shift rotation. Overall, the performance of I_SMILE at low MI and of I_FLOW at high MI appears to be more robust against the transformations, exhibiting smaller performance degradations relative to the other estimators. Except for I_NCE, which exhibits high bias, the bias or variance of the estimators is more negatively affected by the squaring and scaling transformations than by the logarithm or shift rotation. Expanding the range of the input data by squaring or scaling appears to have a detrimental effect on the estimators.
Since bijective transformations preserve MI, shrinking the range of the input data by applying a bijective transformation may help improve the estimator. For example, we can apply a linear transformation normalizing each frequency bin to unit variance [35], which is the inverse of the scaling transformation, thus mitigating the bias induced by scaling shown in the fourth column of Fig. 4. Similarly, the heavy-tailed spectrum shown in the third column of Fig. 4 can be inverse transformed by the square-root function, resulting in a nonlinear transformation for input regularization. However, for this heavy-tailed white spectrum, a linear transformation such as normalization to unit variance does not shrink the data range and is thus not adequate for input regularization.

A heuristically combined estimator
In practice, it is worthwhile to evaluate both I_SMILE and I_FLOW and choose one by inspecting the sample mean and variance of the estimators. When the sample mean of I_FLOW is smaller than that of I_SMILE, we expect more bias from I_FLOW, as the variational I_SMILE tends to exhibit negative bias resulting from estimating a lower bound. Conversely, a large sample variance of I_SMILE indicates that I_SMILE is not a reliable estimator. We observe that the sample variance of I_FLOW is relatively insensitive to the true MI and is thus not a useful indicator of the confidence of I_FLOW, as shown in Fig. 3.
Consequently, when the sample variance of I_SMILE is small while the sample mean of I_FLOW is smaller than that of I_SMILE, we presume that I_SMILE is more reliable. On the other hand, when the sample mean of I_FLOW is larger than that of I_SMILE while the sample variance of I_SMILE is large, we choose I_FLOW as the better alternative. Otherwise, when it is not evident that one estimator is preferable to the other and we lack further evidence, a simple average of the estimators may be used. The heuristically combined estimator, taking advantage of the individual estimators, exhibits relatively smaller bias and is not significantly affected by the transformations. The estimated standard deviation of the combined estimator is also relatively small, following from the sample variances of the individual estimators.
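The selection rule described above can be sketched as follows; the variance threshold is a hypothetical tuning parameter, not a value from the experiments, and should be adapted to the data at hand.

```python
import numpy as np

def combine_mi_estimates(smile_vals, flow_vals, var_thresh=0.5):
    """Heuristic combination of SMILE and FLOW estimates collected over a
    window of mini-batches. var_thresh is an assumed tuning parameter."""
    m_smile, v_smile = np.mean(smile_vals), np.var(smile_vals)
    m_flow = np.mean(flow_vals)
    if v_smile < var_thresh and m_flow < m_smile:
        return m_smile           # SMILE stable, FLOW likely downward biased
    if v_smile >= var_thresh and m_flow > m_smile:
        return m_flow            # SMILE unreliable (high MI); trust FLOW
    return 0.5 * (m_smile + m_flow)  # no clear winner: average
```

The rule is intentionally conservative: it only overrides the lower-bound estimator when its own sample variance signals unreliability.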

Conclusion
In this study, we evaluated several neural estimators of the mutual information (MI) between random vectors that model power spectrum features. Firstly, the variational estimators exhibited significant bias or variance at high MI, leading to substantial estimation errors. However, these estimators demonstrated relatively greater reliability at low MI. Conversely, the flow-based generative estimator showed reduced bias and variance at high MI but suffered from slow convergence at low MI. These empirical results indicate that each MI estimator possesses inherent limitations, emphasizing the necessity of a complementary use of the estimators to mitigate their respective drawbacks. In response, we proposed a heuristic algorithm that combines the strengths of the individual estimators. However, the selection of the parameters of the algorithm remains somewhat arbitrary and should be tailored to the specific characteristics of the data. Additionally, the algorithm still suffers from bias when both of the combined estimators are biased. In particular, the variational estimators tend to display more negative bias at high MI, underscoring bias as a significant concern.

Moreover, although the flow-based generative estimator demonstrates relatively strong performance at high MI, a theoretical explanation of its statistical characteristics is needed to justify its application in that regime. Beyond this, it is necessary to explore alternative generative models for MI estimation, aside from the flow-based estimator discussed in this study. Lastly, to validate the practical applicability of the neural estimators, future work should include experiments with real-world data.

Fig. 2
Fig. 2 MI estimators between 20-dimensional input vectors. (Light color: estimates in each iteration using a single mini-batch, dark color: estimates exponentially moving averaged across mini-batches, black color: true MI)

Fig. 3
Fig. 3 MI estimators obtained by increasing the dimension D of the input vectors. (Light color: estimates in each iteration from a mini-batch, dark color: estimates exponentially moving averaged across mini-batches, black color: true MI) Left column: D = 20. Second column (case 1): D = 40, with the architecture of the network unchanged. Third column (case 2): D = 40, with the number of neurons per layer doubled. Fourth column (case 3): D = 40, with the batch size doubled. Last column (case 4): D = 40, with both the number of neurons per layer and the batch size doubled

Fig. 4
Fig. 4 MI estimators between 20-dimensional input vectors transformed by the element-wise logarithm, squaring, scaling, and shift rotation. (Light color: estimates in each iteration from a mini-batch, dark color: estimates exponentially moving averaged across mini-batches, black color: true MI)

Table 1
Bias estimates (in nats) with respect to data processing

Table 2
Standard deviation estimates (in nats) with respect to data processing. Best results are indicated in bold, except for the standard deviation of the NCE estimator (indicated in italics), which restricts the deviation at the cost of upper bounding the estimator