An MLE-based blind signal separation method for time–frequency overlapped signals using a neural network

The blind signal separation (BSS) algorithm recovers each original/source signal from the observed signals collected by receiving antennas or sensors. The objective (loss/cost) function and the optimization method are the two key parts of a BSS algorithm. Modifying the objective function and the optimization from the perspective of a neural network (NN) is a novel concept in the BSS domain. As in Liu et al. (Sensors 21(3):973, 2021), L2 regularization is adopted as a term of the maximum likelihood estimation (MLE)-based objective function; however, we modified the probability density function (PDF) term of the objective function and used the kernel density estimation method for time–frequency overlapped digital communication signals. Multiple optimizers are studied in this paper, and we identify the right optimizer for our application scenario. A variety of comparison experiments, whose separation results are reported in the form of the correlation coefficient and the performance index, are carried out; they indicate that our method converges quickly and achieves satisfactory separation results, with a performance index (PI) lower than 0.02 when the signal-to-noise ratio (SNR) is no less than 10 dB. Additionally, the experiments demonstrate that the performance of our method is better than that of the typical separation method FastICA, especially in lower-SNR environments, and they show that our method is not sensitive to the frequency overlap level (FOL) of the source signals: even with an FOL as high as 100%, it still obtains high-precision separation results with PI < 0.02.


Introduction
In today's information era, the types of communication and radar electronic equipment for both military and civilian applications are increasing significantly, leaving various communication and radar signals overcrowded in the time domain, overlapped in the frequency domain and intertwined in the space domain, and shaping an ever more complicated electromagnetic environment [1]. As a result, the probability that communication reconnaissance equipment intercepts time-frequency overlapped signals has increased. In order to accurately capture a signal of interest or discover an interference signal, separating these time-frequency overlapped signals and extracting the information implied in the useful signal have become a research task of significance for the electromagnetic surveillance domain. Since the mid-1990s, the blind signal separation (BSS) [2] problem, which aims to separate the source signals from mixed observation signals without knowing the original signals or the transmission system, has been addressed by many researchers with expertise in various domains: electromagnetic surveillance and reconnaissance, biomedical signal processing, array signal processing, speech signal processing, image processing [3], wireless communication, neural networks, etc. Many classic BSS algorithm theories have been proposed, such as independent component analysis (ICA), sparse component analysis (SCA) and nonnegative matrix factorization (NMF).
Independent component analysis (ICA) is the most popular and widely used BSS algorithm. It is mainly used for over-determined and determined BSS, in which the number of mixed observation signals is greater than or equal to the number of original signals to be separated, and it requires the source signals to be statistically independent. Jutten et al. [4] first gave a rigorous mathematical description of the blind signal separation problem and proposed independent component analysis (ICA). Comon [5] gave a detailed explanation of ICA and proposed its mathematical model, basic assumptions and separability conditions. Bell et al. [6] used information-theoretic criteria to construct the cost function and, combined with a neural network learning algorithm, successfully completed the separation of ten speech signals. Since then, ICA has attracted the interest of many researchers, and many ICA-based BSS methods have been proposed, such as the second-order blind identification algorithm (SOBI) [7], the fourth-order blind identification algorithm (FOBI) [8], joint approximate diagonalization of eigenmatrices (JADE) [9] and fixed-point ICA [10]. Among them, the fixed-point ICA algorithm proposed by Hyvarinen [10], with fast convergence speed and good robustness, was the most popular one, well known as FastICA. The FastICA algorithm was then expanded and improved: Ollila et al. [11] provided a rigorous statistical analysis of the deflation-based FastICA estimator, Dermoune et al. [12] gave a rigorous analysis of the asymptotic errors of FastICA estimators, and Wei [13] derived a general and rigorous expression for the limiting distribution and the asymptotic statistics of the FastICA algorithm. Oja et al. [14] provided a rigorous convergence analysis for FastICA. Novey et al. [15] proposed a complex fast independent component analysis (c-FastICA) algorithm to solve ICA problems with complex-valued data.
The FastICA algorithm has been successfully applied in different fields, such as electroencephalography (EEG) processing [16,17], single-channel digital communication signal separation [18], modern power systems [19] and joint radar and communication signal separation [20]. Additionally, some researchers have addressed the implementation of the FastICA algorithm; for example, Shyu et al. [21] implemented FastICA in a field-programmable gate array (FPGA), where the proposed pipelined FastICA architecture enables real-time processing of sequential mixed signals.
Sparse component analysis (SCA) is a simple yet powerful framework for blind signal separation, especially for under-determined signal separation, in which the number of mixed observation signals is less than the number of original signals. SCA has been successfully applied to BSS for original signals that can be represented sparsely in a given basis, even when the independence assumption is dropped [22]. SCA has been applied to image mixture separation [23-25], speech signal separation [26,27], biological signal separation [28,29] and so on. References [23,24] separated a mixture of images using wavelet sparsification technology. Bofill et al. [26] proposed a clustering-algorithm-based under-determined signal separation method for speech and music signals, assuming the signal is sparse in the frequency domain. Yang et al. [30] proposed a new two-stage scheme combining density-based clustering and sparse reconstruction to estimate the mixing matrix and the sources for speech signal separation. Li et al. [28] proposed a separation method based on SCA, focusing on the applications of sparse representation in brain signal processing, including component extraction, BSS and EEG inverse imaging, feature selection and classification. Tsouri et al. [29] proposed and evaluated a method of 12-lead electrocardiogram (ECG) reconstruction from a three-lead set. Rahbar et al. [31] discussed a frequency-domain method based on SCA for blind identification of multiple-input multiple-output (MIMO) convolutive channels driven by white quasistationary sources.
Besides SCA, some under-determined BSS methods utilize nonnegative matrix factorization (NMF) to exploit the nonnegativity of signals such as speech/audio signals [32-34], images [35] and biological signals [36]. Gao et al. [32] proposed a new unsupervised single-channel source separation method for mixed audio signals, which employs a gammatone filterbank to replace the time-frequency representation. Nikunen et al. [33] addressed the problem of sound source separation from a multi-channel microphone array capture via estimation of the source spatial covariance matrix (SCM) of a short-time Fourier-transform mixture signal. Pezzoli et al. [34] proposed a ray-space-based multi-channel NMF method for audio source separation. Yang et al. [35] proposed an adaptive non-smooth NMF separation method for image signals. Gurve et al. [36] proposed a method for separating the fetal electrocardiogram (ECG) from the abdominal ECG using activation-scaled NMF. Gao et al. [37] proposed graph-based blind hyperspectral unmixing via NMF.
The BSS problem has three mainstream approaches, namely ICA, SCA and NMF, but it is not limited to them; take, for example, the BSS methods based on source signal characteristics proposed in references [38-40]. Szu et al. [38] proposed an effective single-channel BSS method based on the limited-character-set feature of digital communication signals. Warner et al. [39] presented a single-channel separation approach based on the differences between shaping filters. Pang et al. [40] proposed a novel BSS method for single-input multi-output (SIMO) systems based on the periodicity of the original signal, which can separate time-frequency overlapped multi-component signals effectively. Recently, a BSS method combining the maximum likelihood estimation (MLE) criterion and a neural network (NN) with a bias term was proposed in reference [41]. Based on this architecture, we employ a neural network to implement MLE-based separation of time-frequency overlapped communication signals. The main difference from reference [41] is the application field, and the main innovation is that, for our application area of time-frequency overlapped digital communication signal separation, we use the kernel density estimation method to estimate the probability density of the digital communication signals instead of, as in paper [41], using a fixed function expression based on the type of the source signals.

Our contributions
The main contributions and results are summarized as follows.
• To the best of our knowledge, we are the first to explicitly explore the applicability of a neural network for separating time-frequency overlapped digital signals based on maximum likelihood estimation. In contrast, the prior work [41] employed a fixed function to express the original signals' probability density based on the signal type (super-Gaussian, sub-Gaussian or Gaussian distribution); we instead use the kernel density estimation method to estimate the probability density of the original digital communication signals, and the estimation result is then used as a term of the cost function.
• We provide the cost function based on MLE (detailed in Sect. 3.2), and we further examine the convergence and the separation performance of different optimizers, such as Adam and RMSprop, in Sect. 4.
• We formulate critical performance metrics to evaluate the separation results, i.e., the correlation coefficient (ζ) and the performance index (PI), and perform an extensive evaluation of the separation methodology to validate the efficacy of the formulations. Additionally, we compare the separation performance of our method with the most widely used BSS algorithms, FastICA and JADE.

Paper organization
In Sect. 2, we provide the signal mixing model, the separation model and the indexes used to evaluate the separation results. In Sect. 3, we present our theoretical framework and describe each part in detail: signal preprocessing, the cost function, probability density function estimation, the optimizers and the NN structure used. In Sect. 4, using two BPSK and one QPSK time-frequency overlapped signals as a case study, we examine our separation method's performance, including a comparison between different optimizers, and compare it with FastICA and JADE in terms of the correlation coefficient (ζ) and the performance index (PI). We provide a discussion together with future work in Sect. 5 and conclude this work in Sect. 6.

Signal model
The aim of blind signal separation is to obtain each original signal from the mixed observation signals. The mixing model can be written as

x = As + v,

where s = [s_1, s_2, ..., s_D]^T is the source signal vector, x = [x_1, x_2, ..., x_M]^T is the observation signal vector, D and M represent the source signal number and the observation signal number, respectively, and T denotes the transpose operation. A ∈ R^{M×D} is the mixing matrix, which is of full rank, and v is additive white Gaussian noise with variance σ². The signal separation system is shown in Fig. 1, where W ∈ R^{D×M} stands for the unmixing or separation matrix. Our goal is to find an unmixing matrix W approximately equal to the inverse of the mixing matrix A, so that, as shown in Eq. (5), ŝ = Wx = WAs + Wv.
where Wv is the noise component. In the theoretical derivation we ignore this noise component, and Eq. (5) then simplifies to ŝ = WAs. However, the noise component is given full consideration in the simulation, where we add a bias term b to our cost function. The bias term b is exactly the component that represents the noise part and participates in the optimization process of the proposed separation algorithm. The bias term b is not only beneficial for reducing the static error of the separation system but also improves its flexibility.
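As a concrete illustration of the model above, the following NumPy sketch mixes two toy ±1 (BPSK-like) sources with a fixed full-rank matrix and recovers them with the ideal separator W = A⁻¹; all dimensions, the matrix A and the noise level are illustrative, not the paper's actual simulation setup:

```python
import numpy as np

rng = np.random.default_rng(0)

D, M, N = 2, 2, 1000                       # sources, observations, samples
s = np.sign(rng.standard_normal((D, N)))   # toy BPSK-like +/-1 sources
A = np.array([[1.0, 0.6],                  # full-rank mixing matrix
              [0.4, 1.0]])
v = 0.1 * rng.standard_normal((M, N))      # additive white Gaussian noise

x = A @ s + v                              # observation model: x = A s + v

W = np.linalg.inv(A)                       # ideal separator (unknown in practice)
s_hat = W @ x                              # = s + W v, only the noise term remains
```

In practice A is unknown, so W and the bias term b are learned by minimizing the cost function introduced in Sect. 3.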
In this work, the correlation coefficient ζ_{s_i ŝ_i} between s_i and its corresponding estimated signal ŝ_i (i = 1, 2, ..., D) and the performance index PI [42-44] are employed to measure the separation performance. The definitions of ζ_{s_i ŝ_i} and PI are shown in Eqs. (7) and (8), respectively:

ζ_{s_i ŝ_i} = |cov(s_i, ŝ_i)| / √( V(s_i) V(ŝ_i) ),
where cov(·), E(·) and V(·) represent the covariance, mean value and variance, respectively. 0 ≤ ζ_{s_i ŝ_i} ≤ 1, and the larger ζ_{s_i ŝ_i} is, the better the separation performance. p_ij is the element in the ith row and jth column of the global matrix P = WA:

PI = 1/(D(D−1)) { Σ_{i=1}^{D} [ Σ_{j=1}^{D} |p_ij| / max_k |p_ik| − 1 ] + Σ_{j=1}^{D} [ Σ_{i=1}^{D} |p_ij| / max_k |p_kj| − 1 ] }.

PI ≥ 0, and the lower PI is, the higher the separation accuracy, with PI < 0.1 typically indicating that the algorithm is performing adequately [44].
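Both metrics are straightforward to compute; the sketch below uses one common form of the cross-talk performance index for the global matrix P = WA (normalization conventions vary slightly between papers [42-44], so this is a representative form rather than the paper's exact formula):

```python
import numpy as np

def corr_coeff(s, s_hat):
    # zeta = |cov(s, s_hat)| / sqrt(V(s) V(s_hat)); closer to 1 is better
    c = np.cov(s, s_hat)[0, 1]
    return abs(c) / np.sqrt(np.var(s, ddof=1) * np.var(s_hat, ddof=1))

def performance_index(P):
    # cross-talk PI of the global matrix P = W A; 0 means perfect separation
    P = np.abs(P)
    D = P.shape[0]
    rows = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1
    cols = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1
    return (rows.sum() + cols.sum()) / (D * (D - 1))

print(performance_index(np.eye(3)))   # a perfect (permutation) matrix gives 0.0
```

A scaled permutation matrix also yields PI = 0, reflecting the usual scaling and ordering ambiguity of BSS.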

Separation model
The theoretical framework of blind signal separation can be divided into two parts: the objective function and the optimization algorithm. The objective function is usually called the cost function. Figure 2 provides the topological structure diagram of the separation methodology of this paper, expanded around those two core parts, the objective/cost function and the optimization algorithm.
As shown in Fig. 2, the topological structure of our separation methodology includes the observation signal model, the signal separation model and the estimated signals. The observation signal model has been introduced in detail in Sect. 2. The signal separation model, the core of this paper, contains the cost function and its optimization, introduced in detail in Sects. 3.2 and 3.4, respectively; we employ a neural network (NN), introduced in Sect. 3.5, to complete this task. The inputs of the NN comprise the preprocessed (as introduced in Sect. 3.1) mixed signals and the corresponding probability density function (PDF) estimate, one term of the cost function, as given in Sect. 3.3. The estimated signals obtained are evaluated by ζ and PI, defined by Eqs. (7) and (8), respectively.

Preprocessing
Preprocessing of the received mixed observation signals includes de-averaging and whitening; the corresponding mathematical expressions are shown in Eqs. (10) and (11).
where E(·) represents taking the mean value. The zero-mean signal form simplifies the separation process.
where V is the whitening matrix, V = G^{−1/2} E^H, in which G is a diagonal matrix whose diagonal elements g_i are the eigenvalues of the covariance matrix of x, and E = [e_1, ..., e_D] holds the corresponding eigenvectors, i = 1, 2, ..., D. H denotes the conjugate transpose operation.
It is worth mentioning that the mixing matrix A changes into A′ = VA after whitening. Therefore, we should take the whitening matrix into consideration when calculating PI.
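The two preprocessing steps can be sketched in NumPy as follows; the eigendecomposition of the sample covariance directly yields the whitening matrix V (the mixing matrix used to generate the test data is illustrative):

```python
import numpy as np

def whiten(x):
    """De-average and whiten: z = V (x - E(x)), with V = G^{-1/2} E^H, where
    G and E hold the eigenvalues/eigenvectors of the covariance of x."""
    x = x - x.mean(axis=1, keepdims=True)   # Eq. (10): remove the mean
    g, e = np.linalg.eigh(np.cov(x))        # eigendecomposition of covariance
    V = np.diag(g ** -0.5) @ e.T            # whitening matrix V = G^{-1/2} E^T
    return V @ x, V

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 2)) @ rng.standard_normal((2, 5000))  # toy mixture
z, V = whiten(x)
# the covariance of z is (numerically) the identity matrix
```

After whitening, the effective mixing matrix is VA, which is why V must be carried through when PI is computed.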

Cost function
The cost function of our separation method is built based on maximum likelihood estimation (MLE). First, the maximum likelihood (ML) estimation derivative process for blind signal separation will be illustrated. Then, the cost function of our separation method will be provided based on ML criterion. Additionally, the probability density function of original signal will be estimated through kernel density function estimation method.

Maximum likelihood criterion
After preprocessing, the observation signal can be expressed as x = VAs, and its joint probability density function is shown in Eq. (15).
where W is the unmixing/separation matrix, and p_s is the joint probability density function of the source components. We can assume that the source signals are statistically independent. Using w_i to represent the ith row vector of W, we have p_s(Wx) = Π_{i=1}^{D} p_i(w_i x). Using ŝ_i(n) (n = 1, 2, ..., N) to express the sample points of the estimated signal ŝ_i, where N is the total number of sampling points, the likelihood function can be written as Eq. (16) [44,45]:

L(W) = Π_{n=1}^{N} Π_{i=1}^{D} p_i(ŝ_i(n)) |det W|.

Performing the logarithmic operation and dividing by the number of samples on both sides of Eq. (17) gives:

L(W) = (1/N) Σ_{n=1}^{N} Σ_{i=1}^{D} log p_i(ŝ_i(n)) + log |det W|.

According to the maximum likelihood estimation criterion, we can obtain the optimal solution by maximizing L(W). Therefore, the function −L(W) is employed as part of our cost function.

MLE-based cost function
The MLE-based cost function of our method is composed of the log-likelihood function and a bias term (b) [41]. However, the bias term (b) in our method is quite different from that of reference [41]. We modify the second part of the log-likelihood function and use the kernel density estimation method to obtain the joint probability density function of the original signals. Additionally, we add a constant to the cost function in case illegal values appear. The cost function used in this paper is shown in Eq. (19):

(W, b) = argmin_{W, b} { −log|det W| − (1/N) Σ_{n=1}^{N} Σ_{i=1}^{D} log( p_i(ŝ_i(n)) + c ) + λ ( ‖W‖₂² + ‖b‖₂² ) },  with ŝ = Wx + b,
where 'argmin' means taking the arguments that minimize the expression. The first two parts are derived from MLE; we add a constant c in the second part to avoid illegal values in the estimation of the original signals' joint probability density function. The third part of the cost function is L2 regularization, which plays a key role in preventing over-fitting during optimization; a comparison between L2 and L1 regularization, together with the regularization parameter (λ), is given in Sect. 4.2. By minimizing the cost function in Eq. (19), the optimal unmixing matrix W and bias term b can be obtained.
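A minimal NumPy sketch of this cost might look as follows; here `pdf` is a stand-in for the KDE of Sect. 3.3 (a Gaussian density is used purely as a placeholder), and `lam` and `c` correspond to the regularization parameter λ and the guard constant c:

```python
import numpy as np

def mle_cost(W, b, x, pdf, lam=0.015, c=1e-6):
    """Negative log-likelihood of s_hat = W x + b plus L2 regularization.
    `c` guards the logarithm against illegal (zero) density values."""
    N = x.shape[1]
    s_hat = W @ x + b[:, None]
    cost = -np.log(abs(np.linalg.det(W)))            # -log|det W| term
    cost -= np.log(pdf(s_hat) + c).sum() / N         # source-PDF term
    cost += lam * ((W ** 2).sum() + (b ** 2).sum())  # L2 regularization of W, b
    return cost

gauss = lambda s: np.exp(-s ** 2 / 2) / np.sqrt(2 * np.pi)  # placeholder PDF
rng = np.random.default_rng(2)
x = rng.standard_normal((2, 1000))
base = mle_cost(np.eye(2), np.zeros(2), x, gauss)   # cost of the identity separator
```

In the actual method, the gradient of this scalar with respect to W and b is what the optimizers of Sect. 3.4 descend.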

Probability density function estimation
The probability density function (PDF) of the original/source signals is a necessary part of the MLE-based cost function, as shown in Eq. (19). Liu et al. [41] employed a simple PDF estimation method, which adopted three approximate functions to represent the probability density functions of super-Gaussian, sub-Gaussian and Gaussian signals, respectively, and then selected one approximate function as the PDF estimate of the source signal based on its distribution. In practical applications, super-Gaussian and sub-Gaussian signals cover a relatively wide range; therefore, using one approximate function to describe a whole class of signals (super-Gaussian/sub-Gaussian) inevitably introduces an absolute error. The histogram method is a traditional PDF estimation algorithm; compared with the histogram method, kernel density estimation provides a smoother PDF curve [46,47]. Therefore, in order to minimize the influence of PDF estimation on the separation accuracy, we employ kernel density estimation (KDE) [46-48] to estimate the probability density function of the source signals.
Let the series {x_1, x_2, ..., x_N} be an independent and identically distributed sample of the observation signal with an unknown probability density function p(x). The kernel density estimate p̂(x) of the original p(x) assigns each nth sample data point x_n a function K(x_n, t), called a kernel function, in the following way [46,47]:

p̂(t) = (1/N) Σ_{n=1}^{N} K(x_n, t).

Equation (21) ensures the required normalization of p̂(x):

∫ K(x_n, t) dt = 1.

That is to say, KDE transforms the location of x_n into a self-centered interval, symmetric or asymmetric. Many kernel functions, both symmetric and asymmetric, have been published, as shown in the "Appendix." In practical applications, however, symmetric kernel functions are more widely used than asymmetric ones. The symmetry property allows the kernel function to be written in the form used most frequently [46]:

K(x_n, t) = (1/h) K( (t − x_n)/h ),

where h is the smoothing parameter, which governs the amount of smoothing applied to the sample. Too small a value of h may cause the estimator to show insignificant details, while too large a value of h causes over-smoothing of the information contained in the sample, which may consequently mask some important characteristics of p(x), e.g., multimodality [46]. Therefore, a certain compromise is necessary in actual applications. Multivariate extensions of the kernel approach generally rely on the product kernel [49]; taking bivariate data x_n, y_n, n = 1, 2, ..., N, for example, the bivariate kernel estimator can be expressed as:

p̂(x, y) = (1/(N h_x h_y)) Σ_{n=1}^{N} K( (x − x_n)/h_x ) K( (y − y_n)/h_y ),   (24)

where x_n, y_n, n = 1, 2, ..., N is the sample, and h_x and h_y are smoothing parameters. Based on the Euclidean distance between an arbitrary point (x, y) and the sample points (x_n, y_n), n = 1, 2, ..., N, the bivariate kernel estimator in Eq. (24) can be changed into:

p̂(x, y) = (1/(N h_x h_y)) Σ_{n=1}^{N} K( √( ((x − x_n)/h_x)² + ((y − y_n)/h_y)² ) ),

where K(·) is the kernel function, and the "Appendix" gives several kernel functions, both symmetric and asymmetric. The effectiveness of KDE will be exhibited in Sect. 4 through the signal separation results.
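A univariate KDE following the symmetric form above can be sketched in a few lines; the Gaussian kernel and the bandwidth h = 0.3 are illustrative choices, not the paper's settings:

```python
import numpy as np

def kde(sample, t, h=0.3):
    """p_hat(t) = (1/(N h)) * sum_n K((t - x_n)/h) with a Gaussian kernel K."""
    u = (t[:, None] - sample[None, :]) / h           # (len(t), N) differences
    K = np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)     # Gaussian kernel values
    return K.sum(axis=1) / (len(sample) * h)

rng = np.random.default_rng(3)
sample = rng.standard_normal(5000)    # sample with "unknown" PDF (here N(0, 1))
t = np.linspace(-4, 4, 81)
p_hat = kde(sample, t)
area = p_hat.sum() * (t[1] - t[0])    # crude integral; should be close to 1
```

Too small an h makes `p_hat` spiky and too large an h over-smooths it, exactly the trade-off discussed above; the bivariate product-kernel estimator is the same computation with one kernel factor per dimension.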

Optimization algorithm
The optimization methods of traditional blind signal separation include the negative gradient descent algorithm [50], the Newton algorithm [51], the fixed-point algorithm [2] and so on. Recently, research has been done on adaptive gradient optimization algorithms and their variants for training deep neural networks, such as stochastic gradient descent (SGD) [52-54], Adagrad [55,56], RMSprop [41,56-58] and Adam [59]. The optimization process of those algorithms can be considered as the problem of minimizing a cost (or objective) function in the form of a summation:

J(w) = (1/N) Σ_{n=1}^{N} J_n(w),

where w is the parameter estimated by minimizing J(w), and each summand function J_n(w) is typically associated with the nth observation in the data set. One thing worth mentioning is that the parameter b to be estimated in Eq. (19) is omitted from Eq. (26), but it participates in the actual optimization. In the following, we briefly introduce each optimization algorithm.

Stochastic gradient descent
SGD is an iterative method for optimizing an objective function with smoothness properties (e.g., differentiable or sub-differentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). In the SGD algorithm, the true gradient of the objective function is approximated by the gradient at a single example:

w_{t+1} = w_t − η ∇J_n(w_t),

where η is a step size, called the learning rate in machine learning. Sutskever et al. [54] proposed that an SGD method with momentum remembers the update Δw at each iteration and determines the next update as a linear combination of the gradient and the previous update:

Δw_t = ρ Δw_{t−1} − η ∇J_n(w_t),
w_{t+1} = w_t + Δw_t,

where ρ is an exponential decay factor between 0 and 1, which determines the relative contribution of the current gradient and earlier gradients to the weight change. Combining Eqs. (28) and (29), we get the final update formula of SGD with momentum:

w_{t+1} = w_t + ρ Δw_{t−1} − η ∇J_n(w_t).

SGD with momentum (named SGDM) tends to keep convergence in the same direction, preventing oscillations.
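The momentum update can be sketched on a toy problem, minimizing J(w) = w² (so ∇J = 2w); the constants η and ρ here are illustrative:

```python
import numpy as np

def sgdm_step(w, dw, grad, eta=0.01, rho=0.9):
    """One SGD-with-momentum update: the new step is a linear combination of
    the previous update and the current (stochastic) gradient."""
    dw = rho * dw - eta * grad
    return w + dw, dw

w, dw = np.array([5.0]), np.array([0.0])
for _ in range(200):
    w, dw = sgdm_step(w, dw, 2 * w)   # gradient of the toy cost J(w) = w^2
# w is now close to the minimizer 0, approached without sustained oscillation
```

The accumulated term `dw` keeps successive steps pointing in a consistent direction, which is the oscillation-damping effect described above.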

Adagrad
Duchi et al. [55] proposed a modified stochastic gradient descent algorithm with a per-parameter learning rate, named the adaptive gradient algorithm (Adagrad), which improves the convergence performance of SGD in settings where data are sparse and sparse parameters are more informative. The update formula of Adagrad [55] is:

w_{t+1} = w_t − η G^{−1/2} ⊙ g_t,

or, written in the form of per-parameter updates:

w_{t+1,j} = w_{t,j} − ( η / √(G_{j,j}) ) g_{t,j},

where ⊙ means the element-wise product and {G_{j,j}} is the vector formed by the diagonal of the outer-product matrix G = Σ_{τ=1}^{t} g_τ g_τ^T, in which g_τ is the gradient at iteration τ; the diagonal of G is therefore G_{j,j} = Σ_{τ=1}^{t} g²_{τ,j}. As noted in references [55,56], Adagrad was designed for convex problems; however, it has been successfully applied to non-convex optimization [60].
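Per-parameter scaling by the accumulated squared gradients can be sketched as follows, again on the toy cost J(w) = w² with illustrative constants:

```python
import numpy as np

def adagrad_step(w, G_diag, grad, eta=0.5, eps=1e-8):
    """One Adagrad update: each parameter's step is divided by the square root
    of its accumulated squared gradients (the diagonal of G)."""
    G_diag = G_diag + grad ** 2
    return w - eta * grad / (np.sqrt(G_diag) + eps), G_diag

w, G_diag = np.array([5.0]), np.zeros(1)
for _ in range(500):
    w, G_diag = adagrad_step(w, G_diag, 2 * w)
# the effective step size decays as G_diag grows, and w approaches 0
```

Parameters that receive gradients rarely accumulate a small `G_diag` and therefore keep a large effective learning rate, which is exactly the property that helps in the sparse-data settings mentioned above.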

RMSprop
Root mean square propagation (RMSprop) is also a method in which the learning rate is adapted for each of the parameters. The idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight [58]. First, the running average is calculated in terms of the mean square:

r(w, t) = ρ r(w, t − 1) + (1 − ρ)(∇J_n(w))²,

where ρ is the forgetting factor and r(w, t) is the gradient-accumulating variable. Then, the parameters are updated as:

w_{t+1} = w_t − ( η / √(r(w, t)) ) ∇J_n(w).

RMSprop has shown good adaptation of the learning rate in different applications. RMSprop can be seen as a generalization of resilient back-propagation (BP) and is capable of working with mini-batches as well, as opposed to only full batches [58]. Reference [41] improved RMSprop by introducing an estimate of the first-order moment of the gradient, g(w, t), and modifying the original r(w, t) into the central second-order moment through the operation (r(w, t) − g(w, t)²), where ρ is the decay rate of the exponential moving average between 0 and 1, β is the momentum term of the modified update, and ε is a small scalar (e.g., 10⁻⁸) that avoids divide-by-zero errors in the update process.
The introduction of the first-order and second-order moments in RMSprop (named RMSpropM) stabilizes the exponentially weighted root mean square, and this operation flattens steep gradients in the parameter space [41]. In practice, the algorithm finds a smoother descent direction in the parameter space, increasing the training speed.
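The baseline RMSprop update can be sketched as follows (the RMSpropM variant of [41] additionally tracks the first moment g(w, t), which is omitted here; all constants are illustrative):

```python
import numpy as np

def rmsprop_step(w, r, grad, eta=0.01, rho=0.9, eps=1e-8):
    """One RMSprop update: the learning rate of each weight is divided by a
    running root-mean-square of its recent gradients."""
    r = rho * r + (1 - rho) * grad ** 2      # running mean square of the gradient
    return w - eta * grad / (np.sqrt(r) + eps), r

w, r = np.array([5.0]), np.zeros(1)
for _ in range(1000):
    w, r = rmsprop_step(w, r, 2 * w)         # gradient of the toy cost J(w) = w^2
# steps are roughly eta in size, so w walks to the minimum and then hovers near it
```

Because the step magnitude is normalized by the recent gradient scale, steep and shallow directions of the cost surface are traversed at comparable speed.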

Adam
Adaptive moment estimation (Adam) is an updated method of the RMSprop optimizer. In this optimization algorithm, running averages of both the gradients and the second moments of the gradients are used. Given parameters w(t) and a loss function J(t), where t indexes the current training iteration, Adam's parameter update is given by [59]:

m(w, t + 1) = β₁ m(w, t) + (1 − β₁)∇J_n(w),
v(w, t + 1) = β₂ v(w, t) + (1 − β₂)(∇J_n(w))²,
m̂ = m(w, t + 1) / (1 − β₁^{t+1}),  v̂ = v(w, t + 1) / (1 − β₂^{t+1}),
w_{t+1} = w_t − η m̂ / (√v̂ + ε),

where ε is a small scalar (e.g., 10⁻⁸), m(w, t) and v(w, t) are the first and second moments of the gradients, respectively, and β₁ and β₂ are their corresponding forgetting factors between 0 and 1 (e.g., β₁ = 0.9, β₂ = 0.999).
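One Adam step with bias correction might be sketched as follows, with t starting at 1 and the suggested defaults β₁ = 0.9, β₂ = 0.999:

```python
import numpy as np

def adam_step(w, m, v, grad, t, eta=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update using bias-corrected first (m) and second (v) moments."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)       # bias correction for the warm-up phase
    v_hat = v / (1 - b2 ** t)
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

w, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    w, m, v = adam_step(w, m, v, 2 * w, t)   # gradient of the toy cost J(w) = w^2
```

The bias-correction factors compensate for m and v being initialized at zero, which would otherwise shrink the first steps.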
Optimization algorithms for neural networks include SGD, Adagrad, RMSprop and Adam, but are not limited to them, e.g., Adadelta [?]; their detailed introductions are omitted here. We will show their performance in optimizing the signal separation cost function, Eq. (19), in Sect. 4.

Neural network
A neural network (NN), in the case of artificial neurons called an artificial neural network (ANN) or simulated neural network (SNN), is an interconnected group of natural or artificial neurons that uses a mathematical or computational model for information processing based on a connectionist approach to computation. In the artificial intelligence field, artificial neural networks have been applied successfully, through a learning process, to speech recognition [61], image analysis [62], pattern recognition [63] and data classification [64]. The input of the NN is the feature vector corresponding to the observation signals.
As shown in Fig. 3, the NN architecture has four layers: an input layer, a dense layer, a lambda layer and an output layer; the relation between the neural network and the separation process is shown in Table 1. The input layer corresponds to the observed signal x and the bias term b of the separation system. The neuron number of the input layer is (M + 1), where M is the observation signal number, and the extra neuron is used to input the initialization of the bias term b, as analyzed in Sect. 2. The dense layer is used to optimize the separation matrix W and the bias term b, and it has D neurons, where D is the number of original signals. The lambda layer is a self-defined layer with two neurons: one neuron handles the regularization of W and b, and the other stands for the second term of the cost function, as shown in Eq. (19). The output layer, with one neuron, provides the sum value of the cost function. The SNR is defined as the logarithmic form of the ratio of the observed signal power to the noise power. In the following, we will exhibit the separation performance of our method from different aspects.

Comparison between different optimizers
In this experiment, we inspect the convergence speed of different optimizers and the corresponding separation efficacy in the form of the correlation coefficient (ζ) and the performance index (PI); the simulation conditions of case 1 are shown in Table 2, with SNR = 10 dB and λ = 0.015 using L2 regularization. Table 3 shows, in its first two rows, the optimizer candidates participating in the comparison and their empirical parameter settings. Figure 4 gives the convergence speed of each optimizer, and we can see that all optimizers reach the convergence state within 50 epochs. To be precise, the convergence of the RMSprop, RMSpropM, Adam and Adadelta optimizers completes within 40 epochs, and their convergence value, smaller than −2.2, is lower than that of the other three optimizers (SGD, SGDM and Adagrad).
The separation results of the different optimizers in the form of ζ and PI at SNR = 10 dB, averaged over 200 Monte Carlo tests, are shown in Table 3. We can see that the separation accuracy of the RMSprop, RMSpropM, Adam and Adadelta optimizers, with PI < 0.08 and ζ > 0.85, is much better than that of SGD, SGDM and Adagrad, with PI > 0.8 and ζ < 0.75. As PI < 0.1 typically indicates that an algorithm is performing adequately [44], we can say that the RMSprop, RMSpropM, Adam and Adadelta optimizers are more suitable for our application scenario, time-frequency overlapped digital communication signal separation, than the other three optimizers (SGD, SGDM and Adagrad). Therefore, those four optimization algorithms are employed in the following simulation tests.

Comparative test for regularization term of cost function
A comparative test for the regularization term of the cost function is presented in this subsection with respect to the regularization parameter λ. The simulation conditions and the parameters of the optimizers RMSprop, RMSpropM, Adam and Adadelta are kept the same as in the experiment of Sect. 4.1, except for the regularization term, varying between L1 and L2, and its parameter λ, changing from 0 to 0.1; the simulation results, averaged over 200 Monte Carlo tests, are shown in Fig. 5. It is worth mentioning that the correlation coefficient ζ reported here is the average value over the original signals:

ζ = (1/D) Σ_{i=1}^{D} ζ_{s_i ŝ_i},

where D is the number of original signals, and s_i and ŝ_i are the ith (i = 1, 2, ..., D) original signal and its corresponding estimate, respectively.
From Fig. 5, we can see that when L2 regularization is employed in the cost function, the separation accuracy gradually improves as λ increases from 0 to 0.01 and then reaches a stable level (ζ ≈ 0.85 and PI ≈ 0.025) for λ ∈ [0.01, 0.1], except for Adadelta, whose separation accuracy gradually decreases as λ changes from 0.01 to 0.1. On the contrary, when L1 regularization is selected, the separation accuracy of our method decreases rapidly, with slight fluctuations for the RMSprop optimizer, as λ increases, and it reaches a stable level when λ ∈ [0.01, 0.1] for the RMSpropM and Adam optimizers and when λ ∈ [0.06, 0.1] for the RMSprop and Adadelta optimizers. Additionally, the best separation that can be achieved using L1 regularization is ζ ≈ 0.78 and PI ≈ 0.5, which is much worse than that of the L2 regularization method. Therefore, L2 regularization is the best choice for our cost function, and according to the above analysis, we set λ to 0.015.

Separation performance against noise and comparison with typical methods
The purpose of this experiment is to determine the performance of our separation method against noise and to analyze its computational complexity. In addition, a comparison with the typical separation methods FastICA and JADE is carried out. The simulation conditions are kept the same as in the first two experiments (case 1 in Table 2), except for the SNR. Based on the experimental analysis in Sect. 4.2, the L2 regularization parameter λ is set to 0.015. Figure 6 shows how the separation performance of our method (with the RMSprop, RMSpropM, Adam and Adadelta optimizers) changes with SNR, varying from 5 dB to 25 dB, compared with FastICA and JADE in the form of ζ and PI.
As shown in Fig. 6, the separation accuracy gradually improves with increasing SNR for both our method and FastICA/JADE. When SNR ≥ 14 dB, the improvement slows compared with SNR ≤ 14 dB, especially for the performance index PI. To be precise, when SNR ≥ 14 dB, the average correlation coefficient ζ of the source signals is higher than 0.95 and PI is lower than 0.01, regardless of whether the RMSprop, RMSpropM, Adam or Adadelta optimizer is used. Moreover, our method outperforms the classical FastICA algorithm, especially for SNR ≤ 14 dB: as shown in Fig. 6a, the ζ obtained by our method is larger than that of FastICA and on the same level as that of JADE. For SNR ≥ 14 dB, the separation performance of all methods reaches a similar stable high-accuracy level with ζ > 0.97. Meanwhile, the performance index PI is lower than 0.1 for SNR no less than 8 dB for all methods; as shown in Fig. 6b, the PI achieved by our method is much lower than that of FastICA and on the same level as that of JADE while SNR ≤ 14 dB, and they converge to a similar stable low level, PI no more than 0.01, while SNR ≥ 15 dB. In other words, in the low-SNR environment (SNR ≤ 14 dB), the separation performance of our method is much better than that of the classical FastICA method, and as the SNR increases, our method converges to the same level as the classical methods. Additionally, the computational complexity comparison between our proposed method and the typical ones is shown in Table 4 in the form of running time; the simulation conditions are the same as in the performance comparison test except that SNR = 10 dB. The separation results of the proposed method are similar to those of JADE but much better than those of FastICA, whose PI value is about twice that of the proposed method and the JADE algorithm.
The running time of the proposed method is 0.3-0.5 s, whereas the typical separation methods only need about 10 ms: our method improves the separation result at the cost of a longer running time. Specifically, FastICA needs 2 × (M + 1) × D × N flops (one multiplication plus one addition counted as one flop) per iteration loop [65], while the proposed method needs (M + 15) × D × N flops; therefore, optimization methods with low computational complexity will be studied in future work.
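The per-iteration flop counts quoted above are easy to compare directly; the helpers below simply encode the two formulas (M mixtures, D sources and N samples are the symbols used in the text, and the example values are illustrative).

```python
def fastica_flops(M, D, N):
    # 2 * (M + 1) * D * N flops per FastICA iteration, as cited from [65]
    return 2 * (M + 1) * D * N

def proposed_flops(M, D, N):
    # (M + 15) * D * N flops per iteration of the proposed method
    return (M + 15) * D * N
```

For a small M, e.g. M = 3, D = 3, N = 10000, FastICA needs 240,000 flops per iteration while the proposed method needs 540,000, consistent with the longer running time reported above.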

Comparative test for frequency overlapped level
This experiment evaluates the effect of the original signals' FOL on the separation results through three experiments, corresponding to the three cases in Table 2, with mixing matrix A = [1, 0.5, 0.5; 0.5, 1, 0.5; 0.5, 0.5, 1] and f s = 100 MHz. The other simulation conditions are set as follows: L 2 regularization with parameter 0.015, the four effective optimizers (RMSprop, RMSpropM, Adam and Adadelta), 200 Monte Carlo runs and 400 epochs. Figure 7 shows how the separation performance changes with SNR under the varied FOL conditions, evaluated in the form of ζ and PI.
From the simulation conditions shown in Table 2, we can see that the difference between case 1 and case 2 is the FOL of the original signals: the FOL of each signal is ψ s1 = 50%, ψ s2 = 100% and ψ s3 = 50% in case 1, while in case 2 ψ s1 and ψ s3 are both increased to 100% by changing the bit transmission rate (r b). Nevertheless, the separation results of these two cases are almost the same, as shown in Fig. 7, especially for SNR ≥ 8 dB. Comparing the simulation conditions of case 3 with those of case 2, the center frequency interval between the original signals in case 3 is half that of case 2; in other words, although the ψ si (i = 1, 2, 3), as defined in Eq. (45), are the same in the two cases, the signal density in the frequency domain of case 3 is twice that of case 2. Therefore, to a certain extent, the frequency-domain overlap complexity of case 3 is higher than that of case 2. The separation results of these two cases remain highly consistent, especially for SNR ≥ 8 dB. Through the comparative analysis of case 1 with case 2 and of case 2 with case 3, we can conclude that our separation method is not sensitive to the FOL: even when the FOL reaches 100% or the frequency-domain complexity is high, our method can still obtain high separation accuracy.
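Eq. (45) is not reproduced in this excerpt; purely as an illustration, a simplified band-overlap measure in the same spirit can be written as the fraction of one signal's frequency band covered by another:

```python
def overlap_level(band_a, band_b):
    """Fraction of band_a = (f_lo, f_hi) covered by band_b: a simplified
    stand-in for the FOL psi of Eq. (45); 1.0 corresponds to 100% overlap."""
    lo = max(band_a[0], band_b[0])
    hi = min(band_a[1], band_b[1])
    return max(0.0, hi - lo) / (band_a[1] - band_a[0])
```

Under such a measure, halving the center frequency interval (as in case 3) raises the overlap between neighbouring bands even when each signal's own ψ value is unchanged.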

Section summary
Firstly, we studied the optimizers (RMSprop, RMSpropM, Adam and Adadelta), the regularization term (L 2) and its parameter (0.015) that match our application scenario: time-frequency overlapped digital communication signal separation. Then, the performance of our method was evaluated against typical methods, and the simulation results show that our method is much better than FastICA, especially for SNR ≤ 14 dB. Finally, through three groups of comparative experiments, we showed that our method is not sensitive to the FOL or to the frequency-domain complexity: even for ψ = 100% and high frequency-domain complexity, it still provides satisfactory results.

Separation method
For the over-determined/determined blind signal separation problem, ICA, in particular FastICA [10], is the most widely used and most popular separation method. ICA and its variants only require that the source signals be independent of each other, and they have been successfully applied to all kinds of signal separation, such as speech signals [66] and biomedical signals, e.g., electroencephalographic (EEG) and magnetoencephalographic (MEG) recordings [67]. Moreover, ICA can also be applied to the under-determined separation problem under the condition that the under-determined observation matrix can be transformed into an observation matrix whose rank is no less than the number of source signals [18]. For signals with sparsity, sparse component analysis (SCA) is another popular method, which has successfully separated image mixtures [23-25], speech signals [26, 27], biological signals [28, 29] and so on. Additionally, SCA can handle under-determined separation problems apart from the over-determined and determined situations, e.g., under-determined music and speech separation in [26]. Besides SCA, NMF is another main under-determined separation method with successful applications in various signal separations [32-37].
What those three popular and successful separation methods have in common is that they are all traditional signal separation methods. Liu et al. [41] introduced a separation method using a neural network (NN), applying machine learning mechanisms and optimization methods to the signal separation domain. As an important term of the observation/cost function, the probability density function (PDF) term was expressed by a fixed function chosen according to the type of the source signal (super-Gaussian, sub-Gaussian or Gaussian distribution), which can hardly handle complex time-frequency overlapped digital communication signal separation. In this paper, we employed the kernel density estimation method to estimate the signal PDF instead of one fixed expression and achieved satisfactory separation results; specifically, the method provides separation accuracy similar to that of the most famous traditional separation methods, FastICA and JADE, for time-frequency overlapped digital communication signal separation.
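The kernel density estimation step can be sketched as a plain Gaussian KDE; the bandwidth rule (Silverman's rule of thumb) and the vectorization below are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def kde_pdf(samples, x, bandwidth=None):
    """Gaussian kernel density estimate of the PDF at the points x,
    given 1-D observed samples; a sketch of the KDE step only."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Silverman's rule of thumb for a Gaussian kernel
        bandwidth = 1.06 * samples.std() * n ** (-1 / 5)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # Evaluate every kernel at every query point and average
    u = (x[:, None] - samples[None, :]) / bandwidth
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return k.sum(axis=1) / (n * bandwidth)
```

Replacing a fixed super-/sub-Gaussian expression with such an estimate lets the PDF term adapt to whatever distribution the mixed digital signals actually have.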

Bias term and optimizer
A regularization term was added to the observation/cost/loss function, and the simulation tests show that L 2 regularization can improve signal separation accuracy, whereas introducing L 1 regularization has a negative effect. Additionally, the regularization term parameter is set to 0.015 based on the simulation tests. Meanwhile, we identified four optimizers (RMSprop [58], RMSpropM [41], Adam [59] and Adadelta [?]) that are well suited to our application background.
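As a reference for the optimizer choice, one RMSprop update can be written in a few lines; the hyper-parameter names and defaults are illustrative, and this is the plain RMSprop rule rather than the RMSpropM variant of [41].

```python
import numpy as np

def rmsprop_step(w, grad, state, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSprop update: keep a running average of squared gradients
    (state) and scale the step by its square root."""
    state = rho * state + (1 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(state) + eps)
    return w, state
```

Adam additionally keeps a running average of the gradient itself, and Adadelta replaces the fixed learning rate with an accumulated update statistic; all three share this squared-gradient normalization.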

Future work
Future work can proceed from the perspectives of both the objective/loss/cost function and the optimization of BSS, improving its performance from the viewpoint of neural networks (NNs), which is a new concept in the BSS domain [41]. We can combine the estimation criteria of conventional separation algorithms with the advantages of NNs or other machine learning frameworks to modify, or even derive novel, objective/loss/cost functions, and to improve the convergence, computational complexity and separation accuracy of BSS algorithms.

Conclusion
In this paper, we introduced a maximum likelihood estimation (MLE)-based blind time-frequency overlapped digital communication signal separation method using a neural network, in which L 2 regularization is employed as one term of the observation function and kernel density estimation is selected to estimate the PDF. Through theoretical introduction and experimental analysis, we identified the neural network optimizers suitable for our application background, namely RMSprop, RMSpropM, Adam and Adadelta, with ζ > 0.82 and PI < 0.1 (typically indicating the algorithm is performing adequately [44]) while SNR ≤ 8 dB; ζ increases to above 0.97 and PI decreases to 0.01 for SNR ≥ 15 dB.
The comparison between our method and the typical separation methods (FastICA/JADE) indicates that our method performs better than FastICA in the low-SNR environment and achieves the same stable high-precision level as FastICA/JADE while SNR > 15 dB, with ζ > 0.96 and PI < 0.004. Comparative tests for different FOL and frequency-complexity cases demonstrate that our method is not sensitive to the FOL or to the frequency-domain complexity: even for ψ = 100% and high frequency-domain complexity, it still provides satisfactory results, namely ζ > 0.90 and PI < 0.02 for SNR ≈ 10 dB.