Research on heart and lung sound separation method based on DAE–NMF–VMD

Auscultation is the most effective method for diagnosing cardiovascular and respiratory diseases. However, stethoscopes typically capture a mixture of heart and lung sounds, which can impair a doctor's auscultation. Therefore, the efficient separation of mixed heart and lung sound signals plays a crucial role in improving the diagnosis of cardiovascular and respiratory diseases. In this paper, we propose a blind source separation method for heart and lung sounds based on a deep autoencoder (DAE), nonnegative matrix factorization (NMF) and variational mode decomposition (VMD). First, the DAE is employed to extract highly informative features from the heart and lung sound signals. Subsequently, NMF clustering is applied to group the heart and lung sounds based on their distinct periodicities, achieving the separation of the mixed heart and lung sounds. Finally, VMD is used to denoise the separated signals. Experimental results demonstrate that the proposed method effectively separates heart and lung sound signals and exhibits significant advantages in terms of standardized evaluation metrics when compared with the comparison methods.


Introduction
Under normal circumstances, the frequency of heart sound signals falls within the range of 20 to 150 Hz [1], while lung sound signals fall within the range of 50 to 2500 Hz [2]. There exists a frequency overlap region between heart and lung sound signals, leading to mutual interference between them. When medical professionals use stethoscopes for auscultation, noise from the friction of the stethoscope with clothing, ambient environmental noise and the operation of the instrument are all collected along with heart and lung sounds by electronic stethoscopes. This significantly diminishes the effectiveness of auscultation and diagnosis. In recent years, research on classification algorithms for lung sounds has increased [3][4][5][6]. However, the primary challenge faced in current lung sound recognition research is that traditional classification methods struggle to extract crucial information from lung sound features, resulting in suboptimal recognition performance. Additionally, lung sound classification methods have a high dependence on data, and publicly available lung sound datasets on the internet often contain heart sound interference. Pure lung sound data is scarce and challenging to obtain, making recognition networks prone to overfitting and less capable of achieving precise and efficient classification and recognition. To better implement lung sound classification algorithms and diagnose medical conditions, it is essential to perform preprocessing through heart and lung sound separation.
To date, researchers worldwide have developed various heart and lung sound separation algorithms. These include methods based on wavelet transformations [7,8], but they suffer from poor adaptability and ineffective suppression of interference factors. Independent component analysis (ICA) and its extensions have also been explored [9,10], but they require at least two sensors and are therefore not suitable for single-channel devices. In recent years, nonnegative matrix factorization (NMF) has been used to separate different sound sources [11][12][13], with its ability to handle overlapping frequency bands recognized. Deep learning has also been employed in source separation, where deep learning models directly decompose mixed sources into target sources, and their effectiveness surpasses that of NMF [14][15][16]. Since it is challenging to acquire pure heart and lung sounds as training data due to the limitations of stethoscope data collection, this paper proposes an unsupervised learning approach using deep autoencoders (DAE) and variational mode decomposition (VMD) to separate mixed heart and lung sound signals.
The algorithm first utilizes a DAE model to extract highly informative representations of the mixed sounds. By applying the periodic clustering algorithm to the latent representation, the mixed cardiopulmonary sounds are separated. Compared with other classical methods, VMD has a clear mathematical framework and distinct advantages in noise robustness and in avoiding mode mixing [17]. Therefore, VMD is employed to denoise the separated heart and lung sound signals. In contrast to other deep learning-based methods, the advantage of this paper's approach is that it does not require labeled training data. By leveraging periodic structures, it provides better separation performance than traditional methods. The main contributions of this study are summarized below.

Model framework
The model framework adopted in this study is shown in Fig. 1. First, the feature representation is obtained by training the DAE model, and the periodic coding matrix is generated by applying the discrete Fourier transform to the feature representation. Then, the clustering results are obtained by sparse NMF clustering, and the separated heart sound and lung sound codes are decoded. Finally, the preliminarily separated heart sounds and lung sounds are denoised by VMD to obtain pure heart sounds and pure lung sounds.

Deep autoencoder (DAE)
The DAE includes an encoder and a decoder, and the framework of the DAE model is shown in Fig. 2. The encoder compresses the input data into a lower-dimensional feature representation, and the decoder decodes this low-dimensional representation into an output as similar as possible to the original input. Both the encoder and the decoder are composed of fully convolutional layers. During training, the purpose of the DAE is to minimize the reconstruction error between input and output. The internal structure of the encoder and decoder of the DAE is shown in Fig. 3, which is a convolution and deconvolution process from left to right. The input signal is sent to the encoder, composed of convolution layers and activation functions, to obtain the feature representation; the reconstructed signal is then obtained after passing through the decoder, composed of deconvolution layers and activation functions. The encoder in the DAE algorithm is composed of convolutional units that perform convolution functions, and the computation process is shown in Eq. (1).
In Eq. (1), C stands for the convolution process, f_j^(k) represents the j-th feature map in the k-th layer, and I denotes the total number of channels. Each coding layer has J convolution kernels, and the convolution kernel size is L × 1. W_ji represents the i-th channel of W_j. Each neuron f_j^(k+1) in the feature map of the (k+1)-th layer is calculated as the weighted sum of elements obtained by performing convolution operations with the receptive fields of all previous feature maps f^(k), using the weights from W_j, and b_j represents the bias. The corresponding convolution operation is illustrated in Fig. 4. The convolution kernel slides over the input data, and each local region is weighted and summed to extract the feature representation of that region. The decoder is composed of deconvolution units, and the calculation process is shown in Eq. (2).
Here, K_All = K_E + K_D represents the total number of layers in the DAE, where K_E and K_D are the numbers of layers in the encoder and decoder, respectively. D is the deconvolution process. Each decoding layer has J convolution kernels, and the convolution kernel size is L × 1. W_ji represents the i-th channel of W_j. In the feature map of the (k+1)-th layer, each neuron f_j^(k+1) is the weighted sum of the element-wise deconvolution of W_j with the receptive field from all previous feature maps f^(k), with b_j^(k) denoting the bias. The corresponding deconvolution operation is shown in Fig. 5. In the deconvolution process, all feature maps f^(k) of the k-th layer are zero-filled, and then the deconvolution is carried out to reconstruct data with the same size as the original signal.
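The layer computations described above can be illustrated with a minimal NumPy sketch. It assumes stride 1, zero padding that keeps the frame length unchanged, and ReLU activation; the function names `conv_layer` and `deconv_layer` and the array layouts are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def conv_layer(f, W, b):
    """One encoder layer (sketch).

    f: (I, T) input feature maps, W: (J, I, L) kernels, b: (J,) biases.
    Returns (J, T) feature maps; assumes T >= L.
    """
    J, I, _ = W.shape
    out = np.zeros((J, f.shape[1]))
    for j in range(J):
        for i in range(I):
            # Slide kernel W[j, i] over channel i (stride 1, zero padding).
            out[j] += np.correlate(f[i], W[j, i], mode="same")
        out[j] += b[j]
    return np.maximum(out, 0.0)  # ReLU

def deconv_layer(f, W, b):
    """One decoder layer (sketch) with stride 1.

    np.convolve zero-fills the borders and flips the kernel, which is what a
    stride-1 transposed convolution does; mode="same" keeps T samples.
    """
    J, I, _ = W.shape
    out = np.zeros((J, f.shape[1]))
    for j in range(J):
        for i in range(I):
            out[j] += np.convolve(f[i], W[j, i], mode="same")
        out[j] += b[j]
    return np.maximum(out, 0.0)  # ReLU
```

Stacking K_E calls to `conv_layer` followed by K_D calls to `deconv_layer` gives the encoder-decoder shape described in this section; a trainable version would normally be written in a deep learning framework.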
Initially, the mixed heart and lung sound signal is transformed into spectral magnitude and phase components using the short-time Fourier transform (STFT). Then, the spectral features are converted into logarithmic power spectra (LPS). X = [x_1, …, x_n, …, x_N] represents the input, where N is the number of frames in X. The DAE then encodes the mixed heart and lung sound LPS through the encoder, transforming X into a matrix of latent feature representations. The decoder reconstructs the matrix of feature representations back into the original spectral features. The parameters of the DAE are trained using the back-propagation algorithm to minimize the mean squared error (MSE). Because the input and the target output are the same, the DAE is trained in an unsupervised manner.
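As a rough illustration of this preprocessing step, the sketch below computes the LPS and phase of a signal with a plain NumPy STFT. The frame length, hop size, Hann window and the small constant added before the logarithm are assumptions made for the example; the paper does not state these settings here, and `stft_lps` and `lps_to_spectrum` are hypothetical helper names.

```python
import numpy as np

def stft_lps(x, frame_len=256, hop=128):
    """Return (lps, phase), each of shape (N frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[n * hop:n * hop + frame_len] * window
                       for n in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)        # complex STFT, one row per frame
    lps = np.log(np.abs(spec) ** 2 + 1e-10)   # log power spectrum (assumed floor)
    phase = np.angle(spec)                    # kept for the inverse STFT
    return lps, phase

def lps_to_spectrum(lps, phase):
    """Rebuild the complex spectrum from an (estimated) LPS and the stored phase."""
    magnitude = np.sqrt(np.exp(lps))
    return magnitude * np.exp(1j * phase)
```

The phase of the mixture is stored so that a separated LPS sequence can later be turned back into a waveform by the inverse STFT.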

NMF periodic clustering method
Because the heart sound signal and the lung sound signal have different periods, the mixed heart-lung sound is separated by exploiting this difference in periodicity. By training the DAE model, we obtain the latent features; the set of latent feature representations over the time series forms the matrix F. We transpose F to obtain S_mix = F^T. Based on S_mix, the entire set of neurons is divided into two groups: one corresponds to heart sounds, and the other corresponds to lung sounds. To analyze the periodicity of each submatrix s_j^mix, we apply the discrete Fourier transform (DFT) to s_j^mix [18], forming a periodic encoding matrix P. Sparse NMF clustering is then employed to cluster the vectors in P into two groups. Equation (4) describes the NMF clustering process, which minimizes the error function ∥P − W_P H_P∥_2^2 + λ∥H_P∥_1 over nonnegative W_P and H_P, where W_P represents the cluster centroids and H_P = [h_1, …, h_j, …, h_M] holds the cluster memberships. λ is the sparsity penalty factor; this study selects λ = 1. ∥·∥_1 denotes the L1 norm, and ∥·∥_2 denotes the Frobenius distance. The cluster assignment of S_mix is determined from the highest score in each column of H_P. The separation of heart sounds and lung sounds is realized by deactivating the submatrices that do not belong to the target source, and the separation results are S_c and S_r. After obtaining the coding matrix of each source, we decode it to get the separated heart sounds and lung sounds.
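The periodic clustering step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes S_mix is stored as latent units by frames, uses the DFT magnitude with the DC bin removed as the periodic code, and solves the sparse NMF with standard multiplicative updates. The helper names `periodic_coding`, `sparse_nmf_cluster` and `keep_cluster` are hypothetical.

```python
import numpy as np

def periodic_coding(S_mix):
    """S_mix: (M units, N frames). Returns P with one column per latent unit."""
    spectrum = np.abs(np.fft.rfft(S_mix, axis=1))  # DFT magnitude along time
    return spectrum[:, 1:].T                       # drop the DC bin (assumption)

def sparse_nmf_cluster(P, n_clusters=2, lam=1.0, n_iter=500, seed=0):
    """Minimize ||P - W H||^2 + lam * ||H||_1 with W, H >= 0 (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    n_bins, n_units = P.shape
    W = rng.random((n_bins, n_clusters)) + 0.1
    H = rng.random((n_clusters, n_units)) + 0.1
    eps = 1e-12
    for _ in range(n_iter):
        # lam / 2 because the squared-error term is not halved in the objective.
        H *= (W.T @ P) / (W.T @ W @ H + lam / 2 + eps)
        W *= (P @ H.T) / (W @ H @ H.T + eps)
        W /= np.linalg.norm(W, axis=0, keepdims=True) + eps  # unit-norm centroids
    labels = np.argmax(H, axis=0)  # highest score in each column of H
    return labels, W, H

def keep_cluster(S_mix, labels, target):
    """Deactivate (zero) the latent units that do not belong to the target cluster."""
    return S_mix * (labels == target)[:, None]
```

Which of the two clusters corresponds to the heart sound is not decided by this sketch; in practice it would have to be identified from the dominant period of each centroid.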
By training the DAE model, we obtain the latent feature matrix F, which is then transformed by the discrete Fourier transform (DFT) into a coding matrix P with pronounced periodicity. A sparse nonnegative matrix factorization (NMF) clustering method is employed to separate the encoding matrix P into representative encoding matrices corresponding to the heart and lung sound signals. Then, each source encoding matrix is reconstructed using the decoder. Finally, the obtained heart sound LPS (log power spectrum) sequence and lung sound LPS sequence are transformed into heart and lung sound signals using the inverse short-time Fourier transform (ISTFT).

Variational mode decomposition (VMD)
The VMD method decomposes the signal x into a series of intrinsic mode functions (IMFs) with limited bandwidth, adaptively updating the optimal center frequency and bandwidth of each IMF. The constrained variational problem formulated for u_1, u_2, …, u_K and the predetermined number of modes K is shown in Eq. (5), where {u_k} = {u_1, …, u_K} and {w_k} = {w_1, …, w_K} represent the decomposed IMF components and the center frequencies of each component, respectively. ∂_t represents the partial derivative with respect to time t, δ(t) is the unit impulse function, and * denotes the convolution operation.
The introduction of the augmented Lagrangian ζ allows the constrained variational problem to be transformed into an unconstrained one, as given in Eq. (6), where λ and α are the Lagrange multiplier and the second-order penalty factor, respectively. The solution to the original minimization problem is then found as a saddle point of the augmented Lagrangian through a series of alternating-direction optimizations, known as the alternating direction method of multipliers (ADMM). The update expressions for the IMF components u_k and the center frequencies w_k are then derived as shown in Eqs. (7) and (8). In these equations, ω represents frequency, and û_k^(n+1)(ω), ŝ(ω) and λ̂(ω) represent the Fourier transforms of u_k^(n+1)(t), s(t) and λ(t), respectively.
By using the algorithm described above, for a convergence tolerance e > 0, the decomposition stops when Eq. ( 9) is satisfied, and the final modal components and their center frequencies w k are obtained.
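For reference, Eqs. (5) to (9) correspond to the standard VMD formulation. A sketch in the notation of this section is given below, assuming the usual form, with the input signal written as x(t) (s(t) in the update equations), j the imaginary unit and ⟨·,·⟩ the inner product:

```latex
\begin{align}
&\min_{\{u_k\},\{w_k\}} \sum_{k=1}^{K}
  \left\| \partial_t\!\left[\left(\delta(t) + \frac{j}{\pi t}\right) * u_k(t)\right] e^{-j w_k t} \right\|_2^2
  \quad \text{s.t.} \quad \sum_{k=1}^{K} u_k(t) = x(t) \tag{5} \\
&\zeta(\{u_k\},\{w_k\},\lambda) =
  \alpha \sum_{k=1}^{K} \left\| \partial_t\!\left[\left(\delta(t) + \frac{j}{\pi t}\right) * u_k(t)\right] e^{-j w_k t} \right\|_2^2
  + \left\| x(t) - \sum_{k=1}^{K} u_k(t) \right\|_2^2
  + \left\langle \lambda(t),\, x(t) - \sum_{k=1}^{K} u_k(t) \right\rangle \tag{6} \\
&\hat{u}_k^{n+1}(\omega) =
  \frac{\hat{x}(\omega) - \sum_{i \neq k} \hat{u}_i(\omega) + \hat{\lambda}(\omega)/2}
       {1 + 2\alpha\,(\omega - w_k)^2} \tag{7} \\
&w_k^{n+1} =
  \frac{\int_0^{\infty} \omega \left|\hat{u}_k^{n+1}(\omega)\right|^2 d\omega}
       {\int_0^{\infty} \left|\hat{u}_k^{n+1}(\omega)\right|^2 d\omega} \tag{8} \\
&\sum_{k=1}^{K} \frac{\left\| \hat{u}_k^{n+1} - \hat{u}_k^{n} \right\|_2^2}{\left\| \hat{u}_k^{n} \right\|_2^2} < e \tag{9}
\end{align}
```

Equation (7) acts as a Wiener-type filter centered at w_k whose bandwidth shrinks as α grows, and Eq. (8) places each center frequency at the centroid of the power spectrum of its mode.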
The VMD algorithm is used to decompose the heart sound and lung sound signals separated by the DAE into a series of IMFs with finite bandwidth. The choice of the predetermined number of decomposition modes K and the penalty factor α directly affects the accuracy of the VMD decomposition results. Therefore, selecting suitable values for K and α is crucial for obtaining pure heart and lung sounds.
In the VMD algorithm, the value of K represents the number of IMF components into which the signal is decomposed. If the optimal value of K is obtained, the center frequencies of adjacent IMF components are reasonably distributed, and the decomposition contains no similar or mixed components. In this study, the value of K is determined using empirical mode decomposition (EMD) [19]. Based on experiments, this study selects K = 7.
α is another important parameter to be set during the VMD decomposition process, and it determines the bandwidth of the IMFs. A larger α value results in a smaller bandwidth for each IMF component obtained by VMD. The value of α should be neither too large nor too small. Additionally, it is found that, within a reasonable range, this parameter has a minimal impact on the results. For heart and lung sound signals, this study selects α = 1500.
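To make the roles of K and α concrete, the following is a simplified NumPy sketch of the VMD iteration. It is an illustration under stated assumptions rather than the implementation used in the paper: it works on the one-sided spectrum with normalized frequencies (cycles per sample), omits the mirror extension used by reference implementations, initializes the center frequencies uniformly, and leaves the multiplier update switched off (`tau=0`). The function names `vmd` and `denoise` and the `keep` argument are hypothetical.

```python
import numpy as np

def vmd(x, K=7, alpha=1500.0, tau=0.0, tol=1e-7, max_iter=500):
    """Simplified VMD. Returns (modes, omega): modes is (K, len(x)), omega in cycles/sample."""
    T = len(x)
    f_hat = np.fft.rfft(x)                 # one-sided spectrum of the input
    freqs = np.fft.rfftfreq(T)             # 0 ... 0.5 cycles per sample
    u_hat = np.zeros((K, len(freqs)), dtype=complex)
    omega = 0.5 * np.arange(K) / K         # uniformly spaced initial centers
    lam = np.zeros(len(freqs), dtype=complex)
    for _ in range(max_iter):
        u_prev = u_hat.copy()
        for k in range(K):
            others = u_hat.sum(axis=0) - u_hat[k]
            # Mode update: Wiener-type filter around the current center frequency.
            u_hat[k] = (f_hat - others + lam / 2) / (1 + 2 * alpha * (freqs - omega[k]) ** 2)
            # Center-frequency update: centroid of the mode's power spectrum.
            power = np.abs(u_hat[k]) ** 2
            omega[k] = (freqs @ power) / (power.sum() + 1e-12)
        lam = lam + tau * (f_hat - u_hat.sum(axis=0))   # dual ascent (off when tau = 0)
        change = sum(
            np.sum(np.abs(u_hat[k] - u_prev[k]) ** 2) / (np.sum(np.abs(u_prev[k]) ** 2) + 1e-12)
            for k in range(K)
        )
        if change < tol:                   # relative-change stopping rule
            break
    modes = np.fft.irfft(u_hat, n=T, axis=1)
    order = np.argsort(omega)              # sort modes from low to high frequency
    return modes[order], omega[order]

def denoise(x, keep, K=7, alpha=1500.0):
    """Sum the selected IMFs; `keep` is a list of mode indices chosen by the user."""
    modes, _ = vmd(x, K=K, alpha=alpha)
    return modes[list(keep)].sum(axis=0)
```

With a sampling rate fs, a center frequency in hertz is `omega * fs`, which is one way to decide which modes to keep for a heart or lung sound.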
Using the above algorithm, the heart and lung sound signals are decomposed into seven IMF components by the VMD algorithm, and the high-frequency IMF components are summed to obtain denoised heart or lung sound signals. The separation algorithm is as follows:
Algorithm 1: Cardiopulmonary sound separation

Experiment and discussion
In this section, the experiment and performance evaluation of the proposed heart and lung sound separation method are discussed. The evaluation metrics used to validate the effectiveness of the proposed method are the signal-to-distortion ratio (SDR), perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI).

Experimental parameters
The proposed method takes the spectrogram of the mixed signal as input and outputs separated heart and lung sound signals. The structure of the DAE model is listed in Table 1. The DAE model uses a stride of 1 for both the convolutional and deconvolutional units. The activation function is ReLU, and the optimizer is Adam. An unsupervised NMF method is employed as the baseline, with the L2 norm serving as the loss function.

Experimental data
The experimental data in this study were obtained from real heart and lung sounds. The dataset used in this research is sourced from publicly available datasets [20,21]. Heart and lung sounds were collected under conditions with relatively low noise and can be considered clean heart and lung sound signals. These clean heart and lung sounds were linearly mixed to create the mixed heart-lung sound signals. Assuming x_c represents the heart sound signal and x_r represents the lung sound signal, the mixed signal takes the form signal = x_c + a·x_r, where a is a coefficient. Based on the signal-to-noise ratio formula (Eq. 10), we have r = 10·log10(Σ x_c² / Σ (a·x_r)²), where r represents the logarithm of the ratio between the energy of the heart sound and that of the lung sound in the mixed signal, in dB. If the two energies are equal, r equals 0. The coefficient a is determined from this equation for a chosen r. Finally, the signal is normalized.
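The mixing step can be sketched in a few lines. Solving r = 10·log10(Σ x_c² / Σ (a·x_r)²) for a gives a = sqrt(Σ x_c² / (Σ x_r² · 10^(r/10))). The function name `mix_at_snr` is hypothetical, and peak normalization is an assumption, since the text only states that the signal is normalized.

```python
import numpy as np

def mix_at_snr(x_c, x_r, r_db=0.0):
    """Mix heart sound x_c and lung sound x_r at a heart-to-lung energy ratio of r_db (dB)."""
    e_c = np.sum(x_c ** 2)                       # heart sound energy
    e_r = np.sum(x_r ** 2)                       # lung sound energy
    a = np.sqrt(e_c / (e_r * 10 ** (r_db / 10))) # scale applied to the lung sound
    mixed = x_c + a * x_r
    mixed = mixed / np.max(np.abs(mixed))        # peak normalization (assumed)
    return mixed, a
```

For r = 0 dB, as used in the experiment, the scaled lung sound has the same energy as the heart sound.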

Evaluation indicators
The heart and lung sound separation method studied in this paper yields separated heart and lung sounds. We use pure heart and lung sounds as references and employ three standardized evaluation metrics to assess the separation performance: signal-to-distortion ratio (SDR), perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI). The formulas for calculating SDR, PESQ and STOI are as follows.
(1) Formula for calculating SDR: SDR was proposed by Vincent et al. in 2006 as an evaluation metric for blind source separation tasks [22]. In source separation tasks, there are three types of noise: interference due to missed separation (e_interf), artifacts due to the reconstruction process (e_artif) and residual noise (e_noise). Here, ŝ(t) represents the estimated result, and s_target(t) represents the target. SDR is calculated as shown in formula (12), where ∥·∥² represents the signal energy.
(2) PESQ calculation formula: PESQ was proposed by Rix et al. for evaluating the quality of sound signals and is defined in ITU-T Recommendation P.862 [23]. It is calculated as shown in formula (13), PESQ = 4.5 − 0.1·d_SYM − 0.0309·d_ASYM, where d_SYM and d_ASYM represent the symmetric and asymmetric disturbances, respectively; this combination provides a good balance between prediction accuracy and generalization capability.
The values of PESQ range from −0.5 to 4.5. In cases of severe distortion, the PESQ value may be below 1.0.
(3) STOI calculation formula: STOI was proposed by Taal et al. for predicting the intelligibility of noisy speech [24]. In formula (14), j = 1, 2, …, J is the index of the one-third octave bands, N is typically set to 30, and d_{j,n} is the correlation coefficient of the short-time spectral vectors between the test speech and the clean speech. For these three metrics, higher scores indicate better source separation results.
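As an illustration of the SDR metric, the sketch below decomposes the estimate as ŝ = s_target + e and computes 10·log10(∥s_target∥² / ∥e∥²). It is a simplification: s_target is obtained by projecting the estimate onto the reference, and the interference, noise and artifact terms are lumped into a single error e rather than separated as in the full BSS Eval procedure. The function name `sdr` and the small constant guarding against division by zero are assumptions.

```python
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    """Simplified signal-to-distortion ratio in dB (all error terms lumped together)."""
    # Target component: projection of the estimate onto the reference signal.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    s_target = scale * reference
    error = estimate - s_target   # stands in for e_interf + e_noise + e_artif
    return 10 * np.log10(np.sum(s_target ** 2) / (np.sum(error ** 2) + eps))
```

PESQ and STOI involve perceptual models and are normally computed with dedicated implementations rather than re-derived by hand.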

Experiment
A randomly selected heart sound (as the signal) and lung sound (as the noise) were mixed at a signal-to-noise ratio of 0 dB to create a mixed heart-lung sound signal. After encoding and decoding with the DAE, the waveform and spectrogram of the reconstructed signal are shown in Figs. 6 and 7, respectively. Comparing the mixed signal (a1) with the separated heart sound signal (b1) and lung sound signal (c1) in Fig. 8, it is evident that the heart and lung sounds have been effectively separated, demonstrating the effectiveness of the DAE algorithm in separating the signals. The waveform in Fig. 7a1 and the spectrogram in Fig. 7a2 of the mixed signal show that the heart and lung sounds overlap significantly in the low-frequency region. After separation using the DAE algorithm, the heart sound signal is concentrated primarily in the low-frequency part, while the lung sound signal is concentrated mainly in the high-frequency part. The spectrograms in Figs. 8b2 and 8c2 show some minor interference of high-frequency lung sounds in the separated heart sound signal, and vice versa in the separated lung sound signal. The separated waveforms obtained using the DAE-NMF algorithm likewise show a slight presence of lung sound noise in the heart sound signal, and vice versa. Therefore, denoising using the VMD algorithm was applied to the separated heart and lung sound signals. The waveform and spectrogram of the denoised heart and lung sound signals are shown in Fig. 9.
Fig. 9 VMD noise reduction processing of heart sounds and lung sounds
… 0.108837 higher than those of the NMF-K-means algorithm, respectively. Under the same feature extraction algorithm, the separation method using the NMF periodic clustering algorithm has higher SDR, PESQ and STOI values than the separation method using the K-means clustering algorithm. The SDR, PESQ and STOI values of the DAE-NMF periodic clustering algorithm are 1.500000, 0.186008 and 0.187283 higher than those of the DAE-K-means algorithm, respectively.
The comparison of the above algorithms demonstrates the effectiveness of the DAE-NMF algorithm in separating heart-lung sounds. After adding the VMD algorithm to DAE-NMF, the SDR, PESQ and STOI values of the separated heart sounds are 3.535230, 0.552162 and 0.171111 higher than the original ones, respectively. This shows that the DAE-NMF-VMD algorithm proposed in this paper not only separates cardiopulmonary sounds effectively but also yields separated heart sounds of good quality.
As with the heart sound metrics, for lung sounds we found that, under the same clustering method, the SDR, PESQ and STOI values of the separation method using the DAE to extract features are higher than those of the separation method using NMF to extract features. The SDR, PESQ and STOI values of the DAE-K-means algorithm are 0.931442, 0.220753 and 0.206424 higher than those of the NMF-K-means algorithm, respectively. Under the same feature extraction algorithm, the separation method using the NMF periodic clustering algorithm has higher SDR, PESQ and STOI values than the separation method using the K-means clustering algorithm. The SDR, PESQ and STOI values of the DAE-NMF algorithm are 0.673863, 0.096222 and 0.010937 higher than those of the DAE-K-means algorithm, respectively.
After adding the VMD algorithm to DAE-NMF, the SDR, PESQ and STOI values of the lung sounds are 2.972652, 1.9426 and 0.05758 higher than the original ones, respectively. This shows that the DAE-NMF-VMD algorithm proposed in this paper not only separates cardiopulmonary sounds effectively but also yields separated lung sounds of good quality.
By observing and comparing the values of SDR, PESQ and STOI, it can be concluded that compared to other methods for heart and lung sound separation, the proposed DAE-NMF-VMD algorithm has achieved improvements in all three evaluation metrics.This demonstrates that the quality of heart and lung sound separation using the DAE-NMF-VMD algorithm is significantly higher than other methods and indicates the effectiveness of using this approach for heart and lung sound separation.

Result verification
To further validate the effectiveness of the DAE-NMF-VMD model in separating heart and lung sounds, real mixed heart and lung sound signals were used for the separation experiments. Figures 10 and 11 show the waveforms and spectrograms of the separated heart and lung sound signals obtained by the DAE-NMF-VMD model from the audio files 113_1306244002866_A.wav and 101_1305030823364_B.wav, respectively.
Comparing the waveform of the mixed signal with those of the separated heart and lung sounds in Fig. 10a, it can be observed that the heart and lung sound signals are effectively separated. In Fig. 10b, the spectrograms of the separated heart and lung sound signals show that the frequencies are primarily concentrated in the ranges of 20 to 150 Hz and 50 to 1000 Hz, respectively, which aligns with the frequency ranges of heart and lung sound signals.
Similarly, the mixed heart-lung sound separation results shown in Fig. 11 also confirm this, demonstrating the effectiveness of the DAE-NMF-VMD algorithm for separating heart-lung sound signals.

Conclusion
This paper presents a heart-lung sound separation method based on DAE-NMF-VMD, which, in addition to separating mixed heart-lung sounds based on the DAE, applies VMD for denoising and enhancement of the separated signals. Unlike traditional heart-lung sound separation methods, DAE-NMF-VMD does not require supervised training data and leverages the periodic characteristics of heart and lung sound signals for separation. The research results indicate that this method yields satisfactory separation outcomes compared to other methods.

Fig. 5 Deconvolution operation
Comparing the original signal with the reconstructed signal, it can be observed that the spectrogram and cepstrogram of the signal reconstructed by the DAE model closely match the original signal, demonstrating the effectiveness of the DAE model training.

1. A blind source separation model for heart and lung sounds based on deep autoencoders, nonnegative matrix factorization and variational mode decomposition was established. The autoencoder extracts a high-level latent representation of the heart and lung sound signals, which is then passed to sparse nonnegative matrix factorization and clustered according to the different periods of the heart sound signal and the lung sound signal to achieve heart and lung sound separation. Finally, the obtained heart and lung sound signals are denoised and enhanced using variational mode decomposition to obtain clean heart and lung sound signals.

Table 1
DAE model structure