Perceptually controlled doping for audio source separation

The separation of an underdetermined audio mixture can be performed through sparse component analysis (SCA), which however relies on the strong hypothesis that source signals are sparse in some domain. To overcome this difficulty in the case where the original sources are available before the mixing process, informed source separation (ISS) embeds in the mixture a watermark whose information can help a subsequent separation. Though powerful, this technique is generally specific to a particular mixing setup and may be compromised by an additional bitrate compression stage. Thus, instead of watermarking, we propose a ‘doping’ method that makes the time-frequency representation of each source sparser while preserving its audio quality. This method is based on an iterative decrease of the distance between the distribution of the signal and a target sparse distribution, under a perceptual constraint. We aim to show that the proposed approach is robust to audio coding and that the use of the sparsified signals improves the source separation, in comparison with the original sources. In this work, the analysis is restricted to instantaneous mixtures and focuses on voice sources.


Introduction
Blind source separation (BSS) methods have been increasingly present in the signal processing literature since the first efforts in the area in the mid-1980s. The BSS approach based on independent component analysis (ICA) is now well established as a fundamental unsupervised method [1], being employed especially in scenarios where the number of sources to be recovered is not greater than the number of sensors.
When dealing with the underdetermined case, i.e., scenarios with more sources than sensors, methods usually associated with the idea of sparse component analysis (SCA) [1], which assume that the sources are sparse in some domain, are able to identify the mixing model or even, in some cases, perfectly separate the underlying sources [2].
Several SCA approaches exploit the fact that source signals are disjoint in the time-frequency domain [3,4], which means that there are regions in the time-frequency domain in which there is at most one source active. These methods operate in a similar fashion, performing the following steps: (1) identify time-frequency regions in which at most one of the sources is active, (2) estimate the mixing parameters (or the direction of arrival (DOA)) associated with the active source, (3) gather all results in a histogram of estimates, and (4) process the histogram in order to obtain the mixing parameters (or the DOAs) and/or the number of sources [4][5][6][7][8].
*Correspondence: gael.mahe@parisdescartes.fr. 1 LIPADE, Université Paris Descartes, Sorbonne Paris Cité, Paris 75006, France. Full list of author information is available at the end of the article.
If more than one source is active in each time-frequency region, but this number is smaller than the number of sensors, some methods try to identify the subspaces containing the sources and afterwards estimate the mixing parameters based on the information about these subspaces [9][10][11]. Another interesting approach is followed in [12] and [13], which combine SCA and ICA methods, proposing that ICA be performed in time-frequency regions in which the number of active sources does not exceed the number of sensors.
It is important to mention that the performance of these methods, however, is strongly dependent on the key assumption that the source signals have a sparse representation in some given basis. In this sense, a different approach, the 'informed source separation' (ISS), was proposed [14]. In some particular audio applications, it is possible to have access to the sources before the mixing process; for example, in a professional studio, the source signals are usually recorded separately and then mixed together to compose the final recording. Thus, one can embed at this stage additional information about the mixing process within the signals in an inaudible manner. This extra information can later be employed by the receiver to help recover the sources and let the listener manipulate them separately.
http://asp.eurasipjournals.com/content/2014/1/27
For example, in [15], the time-frequency plane is divided into 'molecules' and the watermark information is either the energy contribution of each source to each molecule of the mixture or a coarse description of each molecule of each source. This watermark helps the separation of a linear instantaneous monophonic mixture of four or five sources. In the stereophonic case, [16] proposed to embed the information about the mixture matrix and, for each molecule, the index of the zero, one, or two dominating sources in the molecule. At the receiver's end, thanks to this information, each molecule undergoes the separation process as a (over)determined mixture. Other methods [17][18][19], described and evaluated in [14], generally require the transmission of a compressed representation of the source spectrograms and the mixing filters.
These methods achieve very good performance compared to BSS but require a considerable bit overhead (at least 5 kbit/s per source according to [14]). Making ISS compatible with current standardized formats requires transmitting this information through watermarking. Although high-capacity watermarking was recently proposed [20] for this purpose, it is dedicated to uncompressed formats (16-bit PCM) and would not be robust to bitrate compression. This difficulty is overcome by the coding-based ISS approach [21], where the mixture and the sources are jointly coded. But in the context of audio broadcasting using standard compressed stereo formats, the watermarking approach should be chosen and the watermark should be robust to bitrate compression.
In an attempt to avoid the overhead inherent to ISS and the limitation regarding an additional coding step, we explored the concept of doping watermarking [22]. The principle is to imperceptibly change the properties of an audio signal in order to improve a particular processing task. For example, in [23], this idea was employed to 'stationarize' audio signals, aiming to enhance acoustic echo cancelation; in [24], the authors proposed a 'gaussianization' procedure for non-linear system identification and [25] proposed a method for reducing the spectral support of the probability density function (PDF) of an audio signal in order to match the conditions of the quantization theorem.
The method initially proposed in [22] aims at increasing the sparsity of the source signals without compromising the perceptual audio quality, in order to enhance the performance of sparsity-based source separation methods [1]. Some issues remain, however:
• Although it was experimentally shown that, for given parameters, this method efficiently sparsifies audio signals without audible distortion, the trade-off between sparsification and audio quality was not explored. In other terms, how sparse can we make the sources without audible distortion?
• The robustness of the sparsification against audio coding must be assessed.
• The improvement of source separation in [22] was studied only with regard to source counting and source direction estimation. The impact of the sparsification on the source separation itself should be studied.
In this paper, we present an extension of this method that deals with these issues. The studied scheme is represented in Figure 1. We will focus on stereo mixtures of speech signals, which are a more homogeneous material than music and thus more easily provide reliable mean results from a corpus of reasonable size.
As in [22], our goal is to imperceptibly sparsify the whole signal, although it would be possible to focus on the time-frequency bins where separation fails, which could distort the signal less for the same result in the separation process. This approach would however restrict the sparsification of a signal to a given mixing scenario, which is another limitation of the ISS that we want to overcome. Our purpose is to facilitate the separation for any mixing scenario, i.e., without knowing in which time-frequency bins the separation will fail.
In order to present our new methodology, the paper is organized as follows. In Section 2, we present a perceptually controlled sparsification method, which tries to increase the sparsity of the signals in the time-frequency domain while maintaining the same level of perceptual audio quality. Section 3 is dedicated to the impact of bitrate compression on sparsity and vice versa: how sparse do signals remain after the coding-decoding stages, and how does sparsification modify the quality of coded-decoded signals? Finally, we study in Section 4 how the proposed sparsification improves source separation.

State of the art
A sparsification was first proposed in [26], whose principle is to set to zero a part of the source time-frequency (TF) coefficients found by a Gabor transform, without audible distortion. For this purpose, a simple simultaneous masking model was proposed, indicating, for each frequency bin, the masking threshold resulting from the other frequency components (which is quite different from a masking threshold computed for coding or watermarking purposes, i.e., for noise addition). Each frequency component falling below this masking threshold shifted by some decibels (typically -6.6 dB), called the 'irrelevance function', is simply removed. According to the experimental results presented in [26], this method removes around 36% of the Gabor coefficients, for sources sampled at 16 kHz. However, as indicated by Balazs et al. [26], the Gabor analysis-synthesis scheme implies overlapping synthesis windows with a high redundancy factor, which reduces the efficiency of the algorithm: 'components whose levels vary around the irrelevance threshold from one analysis interval to the next are not completely removed'. We ran a preliminary experiment of the irrelevance filter with the same masking model and an overlap-add analysis-synthesis scheme (the one chosen in this paper), on a 5-s sequence of speech sampled at 16 kHz. Although 32% of the TF coefficients can be removed (an amount similar to that found by Balazs et al. [26]), a time-frequency analysis of the filtered signal exhibits almost the same histogram of the TF coefficients as the original signal, with the same amount of coefficients near zero. To overcome this drawback, the principle of the irrelevance filter was revisited by [27], in the framework of modified discrete cosine transform (MDCT)-based analysis-synthesis. This scheme avoids the effects of overlapping in the temporal reconstruction. In other words, going back to the TF representation of a sparsified signal gives again exactly the MDCT resulting from the irrelevance filter, so that the amount of zeroed TF coefficients remains the same.
The algorithm of [27] sets ca. 75% of the coefficients to zero without audible distortion. Note that this result was obtained with audio signals sampled at 44.1 kHz, where a larger proportion of the frequency components is inaudible than in signals sampled at 16 kHz. This method was used as a pre-processing step in the ISS algorithm described in [16], which applies an ICA algorithm to each TF bin of a stereo mix, based on the assumption that there are at most two dominating sources in each bin. Since this sparsification increases the amount of TF bins without any active sources, which do not need to be separated, it reduces the computational complexity of the separation. Nevertheless, as pointed out by the authors, this sparsification procedure leads to only a small improvement in separation quality because the bins for which a perfect separation is possible (zero to two sources) represent only 10% of the energy of the mix in the presented experiments, with mixtures of five sources each (real music tracks).
Since our framework is a source separation based on a classic short-time Fourier transform, the method of [27] is not appropriate here, whereas the irrelevance filter of [26] does not provide satisfactory results. Instead of the binary sparsification proposed by the latter, a 'smooth' sparsification, robust to the inter-block effects of the temporal reconstruction, was proposed in [22].
The sparsification described in [22] is based on a parametric model of the distribution of the source TF coefficients. Denoting by $|S(m, f)|$ the TF coefficients (in modulus) of an audio signal $s$, their distribution can be approximately modeled by a generalized Gaussian distribution, with a form factor $\beta$ varying between 0.2 and 0.4 [28]. Thus, the idea is to design a time-varying filter that transforms the original source $s(n)$ into a new signal $\tilde{s}(n)$, such that its time-frequency coefficient moduli $|\tilde{S}(m, f)|$ are also distributed according to a generalized Gaussian distribution but with a smaller form factor $\beta'$. In this sense, the probability density function of the time-frequency coefficient moduli of the filtered signal should be equal to
$$f_{target}(x) = \frac{\beta'}{\alpha' \, \Gamma(1/\beta')} \, e^{-(x/\alpha')^{\beta'}}, \quad x \geq 0, \qquad \alpha' = \alpha \sqrt{\frac{\Gamma(1/\beta') \, \Gamma(3/\beta)}{\Gamma(1/\beta) \, \Gamma(3/\beta')}},$$
with $\alpha$ denoting the scale factor of the original distribution, in order to maintain the same variance as the original signal.
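As a numerical illustration of the variance constraint, the following sketch computes the scale factor of a target generalized Gaussian distribution that keeps the second moment of the original one. It assumes the usual parametrization, in which the density is proportional to exp(-(x/α)^β) and the second moment equals α² Γ(3/β)/Γ(1/β); the function names and the example values are illustrative and do not come from the original work.

```python
import math


def ggd_second_moment(alpha, beta):
    """Second moment of a generalized Gaussian with scale alpha and form factor beta."""
    # Log-gamma avoids overflow of Gamma(3/beta) for small form factors.
    return alpha ** 2 * math.exp(math.lgamma(3.0 / beta) - math.lgamma(1.0 / beta))


def matched_scale(alpha, beta, beta_target):
    """Scale factor of the target distribution that preserves the second moment."""
    log_ratio = (math.lgamma(3.0 / beta) - math.lgamma(1.0 / beta)
                 - math.lgamma(3.0 / beta_target) + math.lgamma(1.0 / beta_target))
    return alpha * math.exp(0.5 * log_ratio)


# Illustrative values: halve a form factor of 0.32 while keeping the variance.
alpha, beta = 1.0, 0.32
beta_target = beta / 2
alpha_target = matched_scale(alpha, beta, beta_target)
```

With these values the target scale factor is much smaller than the original one: a sparser distribution with the same variance concentrates more mass near zero and compensates with a heavier tail.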
The sparsifying method can be summarized as follows:
1. Compute the time-frequency representation $S(m, f)$, using non-overlapping windows of 32 ms.
2. Estimate the form factor $\beta$ of the distribution of $|S(m, f)|$, assuming a generalized Gaussian distribution.
3. For a fixed target form factor $\beta' < \beta$, obtain the target time-frequency representation as
$$|\tilde{S}(m, f)| = F_{target}^{-1}\left(F_{emp}(|S(m, f)|)\right),$$
where $F_{emp}(\cdot)$ denotes the empirical cumulative distribution of $|S(m, f)|$ and $F_{target}(\cdot)$ the target cumulative distribution.
4. For each frame, obtain the frequency response of the sparsifying filter as
$$H(m, f) = \frac{|\tilde{S}(m, f)|}{|S(m, f)|}$$
and apply it to each time frame.
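The mapping of steps 3 and 4 can be sketched as a rank-based quantile transformation. This is only an illustration under stated assumptions: the inverse target distribution is approximated by sorted random draws from the target law (a Monte Carlo stand-in, not the method of [22]), the moduli are passed as one flat list, and all names (`sample_ggd_modulus`, `sparsify_moduli`, `alpha_target`, `beta_target`) are hypothetical.

```python
import random


def sample_ggd_modulus(n, alpha, beta, rng):
    """Draw n values with density proportional to exp(-(x/alpha)**beta), x >= 0."""
    # If G ~ Gamma(1/beta, 1), then alpha * G**(1/beta) follows the desired law.
    return [alpha * rng.gammavariate(1.0 / beta, 1.0) ** (1.0 / beta) for _ in range(n)]


def sparsify_moduli(moduli, alpha_target, beta_target, seed=0):
    """Map each modulus to the target quantile of same rank; return (new moduli, gains)."""
    rng = random.Random(seed)
    n = len(moduli)
    # Empirical stand-in for the inverse target CDF: sorted draws from the target law.
    target_sorted = sorted(sample_ggd_modulus(n, alpha_target, beta_target, rng))
    order = sorted(range(n), key=lambda i: moduli[i])
    new_moduli = [0.0] * n
    for rank, i in enumerate(order):
        new_moduli[i] = target_sorted[rank]
    # Gain of the sparsifying filter for each coefficient (1 where the modulus is zero).
    gains = [new / old if old > 0 else 1.0 for new, old in zip(new_moduli, moduli)]
    return new_moduli, gains
```

Because the mapping is monotonic, the ordering of the coefficients is preserved: large coefficients stay large and small ones are pushed towards zero, which is what makes the transformed distribution sparser.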
It was shown experimentally that this method efficiently sparsifies speech signals, while preserving a good audio quality. This sparsification led to better results in SCA, concerning the estimation of the number of sources present in a mixture and the estimation of the mixing matrix. However, the method does not ensure in itself the preservation of the audio quality. Hence, the question of the tradeoff between the sparsification and the audio quality remains open. In other terms, how could we make an audio signal as sparse as possible, while keeping it perceptually unchanged?

Perceptually controlled sparsification
The perceptual cost of the previous algorithm could be reduced by processing each frequency bin independently, i.e., in each frequency bin reducing the form factor of the distribution while keeping the variance unchanged. Since the range of TF coefficients strongly depends on the frequency, this would avoid the risk of excessive modification of the variances due to processing the whole TF plane globally. However, in this framework, we consider instantaneous mixtures, for which the separation is performed using the whole signal. Consequently, the sparsity is required for the distribution of the whole TF plane, so that we chose to base the sparsification on the distribution of all TF coefficients of the whole signal, while ensuring the perceptual control by other means.
Following the same framework as in [22], our goal is to find a transformation of the spectrogram $|S(m, f)|$ into $|\tilde{S}(m, f)|$ so that the empirical distribution of the latter, $f_{|\tilde{S}|}$, is as close as possible to $f_{target}$, while the modified signal $\tilde{s}$ is perceptually equivalent to the original signal $s$. This may be expressed by the following optimization problem:
$$\min_{|\tilde{S}|} \; d\left(f_{|\tilde{S}|}, f_{target}\right) \quad \text{subject to} \quad d_{percept}(s, \tilde{s}) \leq d_{th},$$
where $d$ is a distance between distributions (for example the Kolmogorov-Smirnov distance, denoted by $d_{KS}$ in the following), $d_{percept}$ is a perceptual distance between audio signals, and $d_{th}$ is the audibility threshold for this distance.
We propose to solve this problem in an iterative way, i.e., by reducing $d_{KS}(f_{|\tilde{S}|}, f_{target})$ step by step. Since the Kolmogorov-Smirnov distance is the maximum of $|F_{|\tilde{S}|} - F_{target}|$ on $\mathbb{R}$, it decreases if this maximum belongs to the interval $I$ or remains constant otherwise. The proposed rule does not ensure a strict decrease of $d_{KS}(f_{|\tilde{S}|}, f_{target})$ at each step, but it reduces step by step the difference between $F_{|\tilde{S}|}$ and $F_{target}$, which contributes, in the long term, to a decrease of $d_{KS}(f_{|\tilde{S}|}, f_{target})$.
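The quantity monitored by the iterations can be computed as in the following sketch, which evaluates the Kolmogorov-Smirnov distance between the empirical cumulative distribution of a set of coefficients and a target cumulative distribution supplied as a function. The uniform target used in the example is purely illustrative; in the setting of this paper the target would be the generalized Gaussian cumulative distribution.

```python
def ks_distance(samples, target_cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF of samples and target_cdf."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = target_cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x: check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Illustrative target: uniform law on [0, 1].
uniform_cdf = lambda x: min(max(x, 0.0), 1.0)
d_ks = ks_distance([0.1, 0.4, 0.7], uniform_cdf)
```

Since the distance is a maximum, modifying coefficients away from the point where the maximum is reached leaves it unchanged, which is why a step of the algorithm may reduce the gap between the two cumulative distributions without reducing the distance itself.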
The choice of $\Delta$ determines the convergence. We experimentally observed that higher values can speed up the decrease of the distance to minimize, but too large values make the condition for shifting more difficult to satisfy, which may stop the algorithm before its convergence or at least slow it down. Note that the algorithm is sensitive to the order in which the TF bins are processed. Choosing a smaller value for $\Delta$ reduces this sensitivity. Finally, the value of $\Delta$ influences the audio quality of the transformed signal since too high values may cause an audible spectral distortion.
Differences between neighboring bins of $H(m, f)$ also have an impact on the audio quality. We observed experimentally that letting each bin evolve independently from its neighbors leads to an audible distortion: the sound is perceived as 'robotic'. Thus, we fixed an additional condition for shifting $H(m, f)$ and $|\tilde{S}(m, f)|$: the difference between two neighboring bins of $H_{dB}(m, f)$ should not exceed an arbitrary threshold $\Delta^{freq}_{max}(f)$ in the frequency dimension and $\Delta^{time}_{max}(f)$ in the time dimension. These thresholds depend on the sensitivity of the ear, which itself depends on the frequency.
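The neighborhood condition can be sketched as a simple test applied before each shift of a filter gain. The data layout (a list of frames, each a list of gains in dB) and the names `shift_allowed`, `max_freq_step`, and `max_time_step` are assumptions made for this illustration; the two thresholds are those of the frequency bin being modified.

```python
def shift_allowed(h_db, m, f, new_value, max_freq_step, max_time_step):
    """Check that setting h_db[m][f] to new_value keeps neighbor differences bounded.

    h_db is a list of frames, each a list of gains in dB; max_freq_step and
    max_time_step are the thresholds for frequency bin f (hypothetical names).
    """
    n_frames, n_bins = len(h_db), len(h_db[0])
    # Neighbors along the frequency axis, within the same frame.
    for nf in (f - 1, f + 1):
        if 0 <= nf < n_bins and abs(new_value - h_db[m][nf]) > max_freq_step:
            return False
    # Neighbors along the time axis, at the same frequency.
    for nm in (m - 1, m + 1):
        if 0 <= nm < n_frames and abs(new_value - h_db[nm][f]) > max_time_step:
            return False
    return True
```

A shift that fails this test is simply skipped, so the gain surface stays smooth in both dimensions at the cost of a slower decrease of the distance between distributions.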
Many objective perceptual distances between audio signals were proposed in the literature [29], with various complexities and correlations with the real perception. In our case, i.e., a spectral distortion caused by filtering, the Bark spectral distortion (BSD) [30] was shown to be well correlated with the perceived distortion of speech signals [29] and its complexity is moderate. Thus, it is an adequate perceptual distance here.
For two signals $s$ and $\tilde{s}$ (distorted version of $s$), for each frame $m$, the power spectra are converted into loudness spectra, representing the perceived loudness, in sones, on a Bark frequency scale, using a basic psychoacoustic model. Hence, the spectrograms of $s$ and $\tilde{s}$ result in loudness spectrograms $S_s(m, b)$ and $S_{\tilde{s}}(m, b)$, respectively. The normalized local BSD for a frame $m$ is defined as:
$$BSD(m) = \frac{\sum_{b=1}^{N_b} \left(S_s(m, b) - S_{\tilde{s}}(m, b)\right)^2}{\sum_{b=1}^{N_b} S_s(m, b)^2},$$
where $N_b$ is the number of considered critical bands. The global BSD for the whole signal is the mean of the local BSDs.
In the proposed algorithm, we chose the BSD as perceptual distance and fixed two thresholds: one for the global BSD of the distorted signal, denoted by $\bar{d}_{th}$, and another for the local BSD of each frame, denoted by $d_{th}$, greater than $\bar{d}_{th}$.
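The two-threshold control can be sketched as follows, starting from loudness spectrograms that are assumed to be already computed (one list of Bark-band loudness values per frame). The normalization of the local BSD by the loudness energy of the reference frame is an assumption of this sketch, and the function names are hypothetical.

```python
def local_bsd(ref_frame, test_frame):
    """Normalized BSD of one frame from two loudness spectra (one value per Bark band)."""
    num = sum((r - t) ** 2 for r, t in zip(ref_frame, test_frame))
    den = sum(r ** 2 for r in ref_frame)
    return num / den if den > 0 else 0.0


def global_bsd(ref_spec, test_spec):
    """Mean of the local BSDs over all frames."""
    values = [local_bsd(r, t) for r, t in zip(ref_spec, test_spec)]
    return sum(values) / len(values)


def within_thresholds(ref_spec, test_spec, global_th, local_th):
    """True if both the global BSD and every local BSD stay below their thresholds."""
    local_ok = all(local_bsd(r, t) <= local_th for r, t in zip(ref_spec, test_spec))
    return local_ok and global_bsd(ref_spec, test_spec) <= global_th
```

The local threshold prevents a single frame from being heavily distorted even when the average distortion is low, while the global threshold bounds the overall perceptual cost.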

Implementation in the time domain
In the time domain, the sparsified signal is synthesized according to the overlap-add method. The overlapping in reconstruction avoids the clicks that can be noticed using the time-domain implementation of [22]. On the other hand, it increases the risk that the actual values of |S(m, k)| differ slightly from the expected values when coming back to the frequency domain. Thus, the robustness of the sparsity against the block overlap should be experimentally assessed.

Experimental results
In this experiment as well as in the following ones, the estimation of the distributions and the sparsification are performed only on the non-silent parts of the signal. The form factors are estimated by the method of moments [31].
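A moment-based estimate of the form factor can be sketched as below. The sketch assumes the common variant that matches the ratio (E|x|)² / E[x²], which for a generalized Gaussian equals Γ(2/β)² / (Γ(1/β) Γ(3/β)) and increases with β; whether [31] uses exactly this pair of moments is not stated in the text, and the search interval and function names are illustrative.

```python
import math
import random


def log_moment_ratio(beta):
    """log of (E|x|)^2 / E[x^2] for a generalized Gaussian with form factor beta."""
    return 2 * math.lgamma(2 / beta) - math.lgamma(1 / beta) - math.lgamma(3 / beta)


def form_factor_from_ratio(ratio, lo=0.05, hi=5.0, iterations=60):
    """Invert the (increasing) moment ratio by bisection on [lo, hi]."""
    target = math.log(ratio)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if log_moment_ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def estimate_form_factor(moduli):
    """Moment-matching estimate of the form factor from coefficient moduli."""
    n = len(moduli)
    m1 = sum(moduli) / n
    m2 = sum(x * x for x in moduli) / n
    return form_factor_from_ratio(m1 * m1 / m2)
```

As a sanity check, a ratio of 1/2 corresponds to a Laplacian distribution (form factor 1) and a ratio of 2/π to a Gaussian one (form factor 2).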
As in [22], we set the target form factor at half of the original one.
We fixed $\Delta$ = 0.2 dB and $\Delta^{freq}_{max}(f)$ according to the frequency sensitivity of the ear, which is constant below 500 Hz and decreases beyond 500 Hz. Hence, we set $\Delta^{freq}_{max}(f)$ proportional to the width of the critical bands, i.e., inversely proportional to the derivative $db/df$, where $b$ denotes the Bark frequency, for which a closed-form approximation is given in [29]. The resulting $\Delta^{freq}_{max}(f)$ is scaled by a constant $\Delta_0$, fixed to 3 dB. The value of $\Delta^{time}_{max}(f)$ is less critical, in particular because of the inter-frame smoothing in the temporal implementation of the filter. We fixed $\Delta^{time}_{max}(f) = 4 \, \Delta^{freq}_{max}(f)$.
Concerning the stop criteria, we fixed $\varepsilon = 10^{-4}$ and $d_{th} = 1$. Whereas we observed that the algorithm is not very sensitive to $d_{th}$, the final quality depends crucially on $\bar{d}_{th}$. In preliminary experiments, we output the sparsified signal $\tilde{s}$ at each iteration and estimated its mean opinion score (MOS) compared to the original signal $s$ by PESQ [32]. For any source, the MOS decreases as the global BSD increases, unsurprisingly. But the relationship between the MOS and the BSD is not the same for all sources, which makes it difficult to fix a BSD threshold corresponding to the inaudibility threshold for any source. Consequently, the optimal BSD threshold $\bar{d}_{th}$ has to be learned on a training corpus.
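To make the frequency dependence of the neighborhood thresholds concrete, the sketch below derives them from one common approximation of the Bark scale, b = 6 asinh(f/600). This choice is an assumption of the sketch: the text only states that an approximation is taken from [29]. The normalization, which makes the frequency threshold equal to the 3 dB constant at f = 0, is also an assumption.

```python
import math

DELTA_0_DB = 3.0  # value of the constant given in the text


def bark(f_hz):
    """Bark frequency, ASSUMING the approximation b = 6 * asinh(f / 600)."""
    return 6.0 * math.asinh(f_hz / 600.0)


def delta_freq_max(f_hz, delta_0=DELTA_0_DB):
    """Threshold between neighboring frequency bins, in dB (assumed normalization)."""
    # db/df is proportional to 1 / sqrt(1 + (f/600)^2); the threshold is its inverse,
    # scaled so that it equals delta_0 at f = 0.
    return delta_0 * math.sqrt(1.0 + (f_hz / 600.0) ** 2)


def delta_time_max(f_hz, delta_0=DELTA_0_DB):
    """Threshold between neighboring frames: four times the frequency threshold."""
    return 4.0 * delta_freq_max(f_hz, delta_0)
```

Under this assumption the frequency threshold stays close to 3 dB at low frequencies and grows roughly linearly with frequency above a few hundred hertz, following the widening of the critical bands.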

Training corpus
The training corpus was built from the TIMIT database [33] in the same manner as the test corpus used in [22] but with different speakers. The corpus is composed of 32 source signals, each consisting of three sentences pronounced by the same speaker (32 different speakers), sampled at 16 kHz, truncated to 5 s.
The algorithm was run on this training corpus, with the following stop criteria: $\bar{d}_{th} = \infty$, MOS = 3.5, MAX_IT = 200. As shown by Figure 2, the relationship between MOS and BSD is very variable. However, according to these results, fixing $\bar{d}_{th} = 0.12$ should provide a good MOS (≥ 4) for most of the sources (endnote a).

Test corpus
The test corpus is the same as that used in [22], with 96 different speech sources of 5 s. In this experiment, the MOS was not output at each iteration and we fixed the following stop criteria: $\varepsilon = 10^{-4}$, $\bar{d}_{th} = 0.12$, $d_{th} = 1$, and MAX_IT = 100.
As an example, Figure 3 displays the convergence of the algorithm in terms of Kolmogorov-Smirnov distance, in parallel with the Bark spectral distortion, for one of the source signals, whose original form factor is 0.32. At the end of the algorithm, the form factor of the spectrogram  Figure 4. Figure 3 also displays two other distances between the empirical and the target distributions, whose evolution shows the robustness of the algorithm to the choice of the distance. They were normalized so that their first value matches that of $d_{KS}$.
The Cramér-von Mises distance ($d_{CM}$) measures a Euclidean distance between the empirical and the target cumulative distributions, defined as:
$$d_{CM} = \frac{1}{12N} + \sum_{i=1}^{N} \left( \frac{2i - 1}{2N} - F_{target}(X_i) \right)^2,$$
where $N$ is the number of TF coefficients and $(X_i)_{1 \leq i \leq N}$ denotes the ordered sequence of the TF coefficients $|\tilde{S}(m, f)|$. Unsurprisingly, this distance decreases, since the algorithm is based on a decrease of $|F_{|\tilde{S}|} - F_{target}|$ on small intervals.
The chi-squared distance ($d_{\chi^2}$) is based on a comparison between the distributions themselves. Since the empirical distribution is discrete whereas the theoretical distribution is continuous, the distance is quantile-based. We define $r$ intervals $(I_i)_{1 \leq i \leq r}$, containing approximately the same number of coefficients $|\tilde{S}(m, f)|$. Denoting by $P_{|\tilde{S}|}(I_i)$ and $P_{target}(I_i)$, respectively, the empirical and the target probabilities of the $i$th interval, the chi-squared distance is defined as:
$$d_{\chi^2} = \sum_{i=1}^{r} \frac{\left(P_{|\tilde{S}|}(I_i) - P_{target}(I_i)\right)^2}{P_{target}(I_i)}.$$
We chose $r$ = 1,000. This distance decreases in the same manner as the others.
For the whole corpus, Figure 5 shows for both algorithms (this one, called perceptual, and [22], called reference) the pairs $(\beta, \beta_{\tilde{s}})$, where for each source signal $\beta$ denotes the original form factor and $\beta_{\tilde{s}}$ the form factor of the sparsified signal. The latter was computed from a time-frequency analysis of the sparsified signal after the reconstruction in the time domain. The time-frequency analysis was based on the same segmentation as for the original signal. The sources are slightly less sparsified with the proposed algorithm, but thanks to the perceptual control, the audio quality is ensured (see Figure 6): only 1 of the 96 sparsified speech sources has a MOS lower than 4 and 80% of the sparsified signals have a MOS greater than or equal to 4.4. The mean MOS on the corpus is 4.4 instead of 4.1 with the previous algorithm [22].
A chi squared test of similarity between the previous distribution of the MOS values and this distribution, with classes of width 0.1, provides a p value of 9.6 × 10 −3 , which indicates that the distributions are significantly different. Figure 7 illustrates the trade-off sparsity/quality, comparing the proposed algorithm to the reference algorithm [22].
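The two alternative distances between the empirical and the target distributions, discussed earlier in this subsection, can be sketched as follows. The sketch uses the textbook form of the Cramér-von Mises statistic and a quantile-based chi-squared distance whose intervals are delimited by order statistics of the sample; the treatment of the first and last intervals (extended to the ends of the support) and the function names are assumptions of this illustration.

```python
def cramer_von_mises(samples, target_cdf):
    """Cramér-von Mises statistic between the empirical CDF and target_cdf."""
    xs = sorted(samples)
    n = len(xs)
    total = sum(((2 * i - 1) / (2 * n) - target_cdf(x)) ** 2
                for i, x in enumerate(xs, start=1))
    return 1.0 / (12 * n) + total


def chi_squared_distance(samples, target_cdf, r):
    """Quantile-based chi-squared distance using r intervals of (nearly) equal counts."""
    xs = sorted(samples)
    n = len(xs)
    # Index of the first sample of each interval; interval k is xs[starts[k]:starts[k+1]].
    starts = [k * n // r for k in range(r + 1)]
    distance = 0.0
    for k in range(r):
        p_emp = (starts[k + 1] - starts[k]) / n
        # The first and last intervals extend to the ends of the support.
        cdf_lo = 0.0 if k == 0 else target_cdf(xs[starts[k]])
        cdf_hi = 1.0 if k == r - 1 else target_cdf(xs[starts[k + 1]])
        p_target = cdf_hi - cdf_lo
        if p_target > 0:
            distance += (p_emp - p_target) ** 2 / p_target
    return distance
```

Both functions take the target cumulative distribution as a callable, so the same code can be used with any target law; ties among the samples are ignored, which only matters when many coefficients are exactly equal.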

Robustness to time/frequency operations
One could wonder whether the obtained form factors are different from those computed directly after the sparsification in the frequency domain, before the time-domain reconstruction. In other terms, since this step was a critical issue in the sparsification method of [26], what is the effect of the synthesis through the overlap-add method?
The mean values of the form factors before and after the time-domain reconstruction are, respectively, 0.2315 and 0.2366. Assuming a normal distribution of the form factors, a Student's t-test yields a p value of 0.051. Hence, the difference between the mean values is small compared to the sparsity improvement and only weakly significant. We can conclude that our sparsification is robust to the overlap-add synthesis.
Another question is the robustness of the sparsity to the frame desynchronization between the transmitting and the receiving part of the communication chain. Since the system is more intended for file transfer than for broadcast, this question is not a critical issue: the beginning of the file remains the same and the frame length can be transmitted through the metadata of the file header. Consequently, we just tested this issue for one speaker of the corpus. For this speaker, the original form factor of the spectrogram computed with non-overlapping frames of 32 ms is 0.26 and the sparsification leads to a form factor of 0.22. Shifting the time-frequency analysis of the sparsified signal by 16 ms (half of a frame) increases the form factor by only 0.004. Choosing another frame length for the analysis (respectively, 16 and 64 ms) modifies the form factor of the original signal (respectively, 0.30 and 0.24) and the form factor of the sparsified signal (respectively, 0.24 and 0.21) in the same way, so that the sparsified signal is kept sparser than the original one.

Quality and sparsity of sparsified coded signals
We have proposed a sparsification algorithm that reduces the form factor of the generalized Gaussian model of the source distribution, while preserving the audio quality. Nevertheless, as indicated in Figure 1, a more realistic scenario should also consider the possible distortion introduced by a coding scheme. In this section, we will consider two codecs: GSM (endnote b) [34], which is intended for speech and allows testing the effect of a deep modification of the signal, and MP3 (endnote c) [35], since it natively includes a stereo mode (unlike GSM) and is more appropriate for the future extension of this work to music signals.
We propose to assess the robustness of doping to coding and its impact on the quality loss due to coding. We consider here two versions of the test corpus of Section 2: original and sparsified. Once the sources have been mixed, the obtained signals are coded, transmitted, and then decoded (see Figure 1). The transmission process is modeled as a simple delay in order to focus our attention on the effect of the codec.

Robustness of the sparsity against coding
To test how sparse signals remain after coding-decoding, we coded each source signal separately, as well as its sparsified version. Hence, the MP3 codec works in mono mode, with a bitrate of 96 kbps, known to provide a transparent quality for mono signals. Figure 8 shows the pairs $(\beta, \beta_{codec})$, for the original and the sparsified signals, in both coding cases, where $\beta$ and $\beta_{codec}$ denote the form factors of the uncoded and the coded-decoded signal, respectively. The coding-decoding process causes almost no variation of the form factors in the MP3 case and a very small variation in the GSM case, even negative. Hence, speech coding, even with a low bitrate, does not alter the sparsity of the signals.

Quality of the coded sparsified signals
As in Section 2, we used PESQ to estimate the perceived audio quality. In the practical scheme presented in Figure 1, the quality should be measured on various mixtures after decoding. But since PESQ is not validated for a mixture of speech signals, we only measured the quality for each source signal coded separately. In the GSM case, the MOS values were estimated using signals sampled at 8 kHz, since GSM works only at this sampling frequency.
For each source signal, taking as reference the original signal, we computed two values:
• The MOS of the coded-decoded version of the original signal
• The MOS of the coded-decoded version of the sparsified signal
As shown by Figure 9,
• In the MP3-coding case, the impairment due to the sparsification is small compared to that due to the coding.
• In the GSM-coding case, the sparsification slightly increases the impairment due to the coding.
Note that we discarded two outliers in the MP3 case, with coordinates (1.75, 4.23) and (2.51, 4.15). In both original signals, there was a slight whistling, which caused an artefact in the coded signals. The sparsification smoothed this artefact, so that the quality is good for the coded-decoded sparsified signal, whereas it is poor for the coded-decoded signal.

Methods
In SCA approaches, source separation techniques are usually divided into three steps: (i) identification of the number of sources in the mixtures, (ii) identification of the mixing system, and (iii) source separation itself. In this section, we verify the performance of each aforementioned step when the doping watermarking procedure is employed. For the first two steps, we will use the ICA-SCA-based approach proposed in [13,36] that was also used in [22]. In a stereo mixing situation, the algorithm can be summarized as follows:
1. Compute the FFT of the mixture signals using the same parameters as in the sparsification process.
2. Divide the FFT data into blocks and, for each block, apply ICA to the mixture signals, assuming that there are two or fewer sources active in the block. The ICA method will provide a 'local separation matrix' $W_{2 \times 2}$.
3. Compute and store all the angles $\theta_i$ (two for each block) obtained by:
$$\theta_i = \arctan\left(\frac{[W^{-1}]_{2i}}{[W^{-1}]_{1i}}\right), \quad i = 1, 2.$$
4. Apply K-means [37], or another clustering method, to $\theta$, to find the number of clusters that best fits the data. This number will be the number of sources present in the mixture.
5. The centroid of each cluster indicates a value of $\theta$ that will be related to the direction of one of the columns of the global mixing matrix.
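Steps 3 to 5 can be sketched as follows for a given number of clusters. The sketch assumes that the angles are taken from the columns of the inverse of the local separation matrix and folded into [0, π) to remove the sign ambiguity of ICA; the ICA step itself and the selection of the number of clusters are outside its scope, the 1-D K-means ignores the wrap-around between 0 and π, and the function names are hypothetical.

```python
import math


def block_angles(w):
    """Angles (in [0, pi)) of the columns of the inverse of a 2x2 matrix w."""
    (a, b), (c, d) = w
    det = a * d - b * c
    # Inverse of w: its columns estimate two columns of the mixing matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    # atan2 modulo pi removes the sign ambiguity of ICA.
    return [math.atan2(inv[1][j], inv[0][j]) % math.pi for j in range(2)]


def kmeans_1d(values, k, iterations=50):
    """Plain 1-D K-means with centroids initialized on evenly spaced order statistics."""
    xs = sorted(values)
    centroids = [xs[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sorted(centroids)
```

In use, the angles returned by `block_angles` for all blocks would be pooled into one list and passed to `kmeans_1d`; the resulting centroids are the estimated directions of the columns of the mixing matrix.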
Finally, after estimating the mixing matrix, we used the flexible audio source separation toolbox (FASST) [38] to separate the sources. This comprehensive toolbox contains some of the most common approaches to audio source separation. A set of prior constraints and a decomposition based on local Gaussian modeling of the sources are used to find which framework is more suitable to separate each set of sources. In this work,
• The mixture is stereo, instantaneous, and underdetermined in most of the cases.
• The STFT was used for signal representation, and the mixing parameters were estimated using the SCA/ICA approach presented in [13] (endnote d), giving the GEM algorithm a 'good initialization'. The parameter settings used in FASST correspond to the multichannel NMF method presented in [39] in the instantaneous case.
In the following experiments, each step is run with the perfect estimation of the previous step, in order to study separately the impact of the sparsification on each part of the source separation.

Experimental results
In order to verify the improvement provided by the proposed sparsification method in each of the three steps of the source separation, we defined several simulation scenarios. In each of them, the algorithms were run 100 times and the results were averaged over these runs. In each case, the sources were randomly chosen among the 96 speech signals of the test corpus described earlier. The number of sources in the mixtures varied from two to six, and only stereo mixtures were considered. The FFT window had 512 samples and an overlap of half a window. All the tests were made with 1- and 5-s source samples. The mixing matrix was the same in all the runs and its directions $\theta$ were chosen to be equally spaced.
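A stereo instantaneous mixture with equally spaced directions can be sketched as below. The range over which the directions are spaced and the unit-norm columns are assumptions of this sketch, since the text only states that the directions are equally spaced; the function names are hypothetical.

```python
import math


def mixing_matrix(n_sources):
    """2 x n_sources matrix whose columns have equally spaced directions in (0, pi)."""
    # ASSUMPTION: directions theta_i = (i + 0.5) * pi / n_sources, unit-norm columns.
    thetas = [(i + 0.5) * math.pi / n_sources for i in range(n_sources)]
    return [[math.cos(t) for t in thetas], [math.sin(t) for t in thetas]], thetas


def mix(sources, matrix):
    """Instantaneous stereo mixture: each channel is a weighted sum of the sources."""
    n_samples = len(sources[0])
    return [[sum(row[j] * sources[j][n] for j in range(len(sources)))
             for n in range(n_samples)] for row in matrix]
```

With more than two sources, the 2 x N matrix is not invertible, which is what makes the mixture underdetermined and motivates the use of sparsity for the separation.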

Estimation of the number of sources
In this first scenario, we applied the fourth step of the aforementioned algorithm in order to estimate the number of sources. With 5-s samples, all simulations found the correct number of sources, for both the original and the sparsified sources.
With 1-s samples, the sparsification procedure reduced the estimation errors when the number of sources is higher than 2. Using the original sources, the estimation errors are 0%, 2%, 2%, 8%, and 11% for two, three, four, five, and six sources, respectively. When the sparsification procedure is employed, the estimation errors are 1%, 0%, 0%, 5%, and 9% for two, three, four, five, and six sources, respectively.

Estimation of the mixing matrices
Assuming now that the number of sources was correctly found, we applied the fifth step of the aforementioned algorithm to estimate the direction of each column of the mixing matrix. We computed the angular mean error (AME) between the directions θ of the mixing matrix A and their estimates. The results presented in Figure 10 show that sparsification was able to reduce the AME, being even more effective as the number of sources increases.
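The exact AME formula is not restated here, so the following is only one plausible reading: the mean absolute difference between true and estimated directions, after pairing them in sorted order to resolve the permutation ambiguity. The function name `angular_mean_error` and the sorted pairing are assumptions for illustration.

```python
def angular_mean_error(true_thetas, est_thetas):
    """Mean absolute angular difference (same unit as the inputs).

    Assumption: both lists have the same length (number of sources already
    known) and directions are matched by sorting, which ignores wrap-around.
    """
    if len(true_thetas) != len(est_thetas):
        raise ValueError("both lists must contain one angle per source")
    pairs = zip(sorted(true_thetas), sorted(est_thetas))
    return sum(abs(t - e) for t, e in pairs) / len(true_thetas)


# Hypothetical estimates, given in a different order than the true directions.
ame = angular_mean_error([0.4, 1.2, 2.0], [2.02, 0.39, 1.2])
print(ame)
```

Sorted pairing is adequate when the estimation errors are small compared with the spacing between directions; otherwise an explicit assignment step would be needed.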

Source separation
With the same configuration, but now assuming that both the number of sources N and the mixing matrix A are known, the source separation was performed using the FASST algorithm. Tables 1 and 2 (for 1- and 5-s sources, respectively) show the resulting signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifact ratio (SAR), calculated as described in [40].
For the sparsified signals, these metrics were computed taking as references the sparsified signals. Since the goal of the proposed scheme is to objectively distort the original sources while maintaining them perceptually unchanged, taking the original sources as references would lead to a meaningless distortion of objective metrics like SDR, SIR, and SAR, masking the performance of the source separation algorithm. This choice has the further advantage of assessing the performance of each processing step separately.
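The effect of the reference choice can be seen in a toy example. The sketch below uses a simplified energy-ratio SDR, 10 log10(||s||² / ||s − ŝ||²), which is not the full decomposition of [40]; the signals, the hard-threshold 'sparsification', and the function name `sdr_db` are all hypothetical and serve only to show how the distortion deliberately introduced by sparsification is counted as separation error when the original source is the reference.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB (energy ratio only, not the metric of [40])."""
    num = sum(r * r for r in reference)
    den = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return math.inf if den == 0 else 10.0 * math.log10(num / den)


# Toy coefficients: the 'sparsified' source zeroes the small entries.
original = [1.0, 0.05, -0.8, 0.02, 0.6, -0.03]
sparsified = [x if abs(x) >= 0.1 else 0.0 for x in original]
# Hypothetical separator output: the sparsified source plus a small error.
estimate = [1.01, 0.0, -0.81, 0.0, 0.61, 0.0]

sdr_vs_sparsified = sdr_db(sparsified, estimate)
sdr_vs_original = sdr_db(original, estimate)
print(round(sdr_vs_sparsified, 1), round(sdr_vs_original, 1))
```

In this toy case the score against the original source is lower even though the separator error is unchanged, which is the masking effect described above.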
Observing the values obtained for the sparsified sources (corresponding to the values SDR sparse, SIR sparse, and SAR sparse in Tables 1 and 2), the gain found for the proposed scheme, except for the case with two sources, is around 1.5 dB for 5-s samples and around 1.2 dB for 1-s samples, for the three ratios. For two sources, we are operating in a condition in which it is possible, in theory, to perfectly recover the original signals, and therefore the use of the sparsification does not significantly change the separation results: the three ratios are around 50 dB, indicating a good source recovery.
We also tested the proposed methodology using the perceptual evaluation methods for audio source separation (PEASS) toolkit [41], which provides a set of four perceptual scores (PS): overall (OPS), target-related (TPS), interference-related (IPS), and artifacts-related (APS), generated through a nonlinear mapping of the PEMO-Q auditory model [42]. Figure 11 shows the results of OPS, TPS, IPS, and APS for the 5-s sources, for the separation of the signals without sparsification ('Normal'), the separation of the sparsified signals using the sparsified signals as reference signals ('Sparsified'), and using the original sources as reference signals ('Original as ref'). The use of the original signals as reference would, in theory, give us an overall subjective performance.
Using the sparsified signals as reference, one can observe that there is an improvement using the sparsified sources in some cases, but the difference between the scores obtained when no sparsification procedure is employed and with the proposed sparsification method is small.
When the original sources are used as reference signals, three of the scores show that the performance of the proposal does not meet the expectations, the only exception being the TPS results, for which the proposed method presents a significant improvement in scenarios with a large number of sources. These results can be explained by the fact that the processing steps performed until the sources have been estimated introduce two perceptual impairments: one due to the sparsification procedure (which is inaudible as a single step) and one due to the separation step, since we are operating, in most of the simulated cases, in an underdetermined mixture scenario. Nevertheless, it should be mentioned that PEASS is not intended to evaluate distortions like those introduced by the sparsification process, and therefore the evaluation of the cumulative effect of the perceptual impairments may not be completely reliable.

Robustness to coding
As explained before, one disadvantage of traditional ISS approaches is that the watermarked information can be corrupted by signal compression. The robustness of the proposed sparsification to compression (see Subsection 3.1) suggests that the separation should also be robust. To verify this, MP3 coding at a bitrate of 192 kbit/s was included in the simulations, following the block diagram depicted in Figure 1. The configuration of the simulations is the same as described before. When the number of sources is estimated, both original and sparsified sources generated exactly the same results. For the estimation of the mixing matrices, there was no significant difference for 1-s sources, and the better performance achieved for the sparsified 5-s sources was maintained, but with smaller differences among the results.
For the source separation, the results are very similar with and without MP3 coding. For example, Table 3 shows the results of SDR, SIR, and SAR using the 5-s samples, and Figure 12 shows the results for the perceptual evaluation.

Conclusions
We have proposed a doping process that makes audio signals more sparse while preserving their audio quality, thanks to a perceptually controlled algorithm based on a generalized Gaussian model of the time-frequency coefficients. This built sparsity is robust to compression and leads to an improvement of source separation.
Although the improvement of SDR, SIR, SAR, and even of the perceptual evaluation metrics is weak compared to usual results of ISS (1 to 2 dB instead of 5 to 20 dB in [14] for the objective metrics), this method has two advantages: it is robust to compression, and the sparsification of each source is valid for any mixture, whereas the information watermarked in ISS is specific to one particular mixture.
Relaxing this requirement of validity for any mixture could, however, make it possible to process only the time-frequency bins where the separation fails, and hence potentially improve the separation for the same quality of the sparsified sources.