Disentangled generative adversarial network for low-dose CT

Generative adversarial networks (GANs) have been applied to low-dose CT images to predict normal-dose CT images. However, undesired artifacts and spurious details introduce uncertainty into clinical diagnosis. To improve visual quality while suppressing noise, this paper studies two key components of deep learning based low-dose CT (LDCT) restoration models, the network architecture and the adversarial loss, and proposes a disentangled noise suppression method based on GAN (DNSGAN) for LDCT. Specifically, a generator network containing noise suppression and structure recovery modules is proposed. Furthermore, a multi-scale relativistic adversarial loss is introduced to preserve the finer structures of the generated images. Experiments on simulated and real LDCT datasets show that the proposed method effectively removes noise while recovering finer details, and provides better visual perception than other state-of-the-art methods.

of noise in LDCT images, these methods cannot provide similar performance as that for natural images.
After the pioneering work of Chen et al. [16], deep neural network (DNN) approaches have developed rapidly in this field [17][18][19]. Various network architectures [20][21][22][23] have continuously improved LDCT denoising performance. However, most of these methods use the L2 norm as the objective function, which produces results with high PSNR and structural similarity (SSIM) [24] but higher Fréchet inception distance (FID) [25] scores because structural details are smoothed. Since the PSNR metric is not fully consistent with the subjective evaluation of human observers, this may have an uncertain, possibly negative, impact on clinical diagnosis. To circumvent this obstacle, generative adversarial networks (GANs) and different loss functions were introduced to restore as much fine structural detail as possible [26][27][28][29][30][31].
As the most representative example, WGAN-VGG [26], aided by stable Wasserstein GAN [32] training and a perceptual loss [33], was proposed to encourage the network to favor solutions that look more like realistic normal-dose CT (NDCT) images. Although considerable improvements have been obtained, a noticeable gap remains between WGAN-VGG results and NDCT images. One example is shown in Fig. 1. Although the result generated by WGAN-VGG has similar mottle-like noise, its distribution is quite different from that of the real NDCT image. The reason may be that most existing methods try to transform LDCT images directly into the corresponding NDCT images, which requires a very powerful model. In fact, an important problem has been ignored: in LDCT, noise always adheres to the high-frequency details. Therefore, DNN-based methods with the L2 norm tend to generate over-smoothed results, while GAN-based methods introduce extra noise into the generated images, which leads to better visualization but lower PSNR and higher FID scores.
To alleviate this contradiction, we propose a novel disentangled noise suppression method for LDCT. We disentangle LDCT denoising into two stages, noise removal and structural detail enhancement, instead of a one-step mapping. Specifically, we first transform the source distribution into an intermediate distribution by focusing on noise removal, which may lead to results that are over-smoothed to varying degrees. The intermediate distribution is then transformed into the final target distribution by recovering finer details from the denoised images of the previous step. In addition, the proposed disentangled noise suppression method is embedded into the GAN framework [34], termed DNSGAN, to further enhance the visual perception of the reconstructed images. The main contribution of this paper is a novel disentangled noise suppression method, DNSGAN. Instead of a one-step mapping, DNSGAN handles LDCT restoration with a divide-and-conquer strategy that decouples image denoising into two stages: noise removal and structure enhancement. The proposed method achieves higher-quality image reconstruction and better generalization across noise levels than other competitive methods, resulting in a better balance between detail retention and quantitative metrics.
The rest of this paper is organized as follows. In section 2, the proposed DNSGAN method is described in detail. The experimental results are demonstrated in section 3 and the final section concludes this paper.

Noise reduction model
The general image restoration problem can be considered from the perspective of domain transform [35]. A source domain $S$ and a target domain $T$ contain samples from two different given distributions $P_S$ and $P_T$, respectively. $x \in S$ denotes the LDCT image from the source domain and $y \in T$ denotes the corresponding NDCT image from the target domain, where $x \sim P_S$ and $y \sim P_T$. For the image restoration task, a generic degradation model for LDCT relates $x$ and $y$ through a nonlinear mapping $F$, which represents the degrading process by the noise, and a term $\varepsilon$, which stands for the additive part of the noise and other unmodeled factors. Current DNN-based methods focus on learning a nonlinear function $F^\dagger$ that maps $x$ directly into $y$, which can be expressed as:
$$\hat{y} = F^\dagger(x) \quad (2)$$
where $\hat{y}$ is the estimate of $y$. The general idea is to find the optimal $F^\dagger$ to minimize the distance between $P_S$ and $P_T$.
Unfortunately, since the noise in LDCT images does not obey any specific statistical distribution, the denoising operation inevitably smooths details to some degree, which makes it difficult to learn F † directly, even when a GAN is introduced to enforce a stronger constraint. As a result, the outcome may depend heavily on the specific form of the loss function.
To solve this problem, inspired by the idea of domain transform, the process is disentangled into two steps: noise reduction and structural enhancement. The first step follows the general idea of learning-based methods and learns an image-to-image denoising model; the second step recovers the details smoothed by the first step. This is similar to pre-upsampling image super-resolution models [36], which first upsample the original image and then recover the details on the upsampled image. The denoised result $y'$ of the first step can be viewed as an intermediate result that bridges the gap between the low-dose image $x$ and the normal-dose image $y$. Based on this consideration, Eq. 2 is reformulated as follows:
$$\hat{y} = R(S(x)) \quad (3)$$
where $S(\cdot): x \in S \to y' \in I$ represents the noise suppression process, which transforms the sample $x$ into the intermediate domain $I$, and $R(\cdot): y' \in I \to \hat{y} \in T$ denotes the detail recovery process, which aims to enhance the structures and recover finer details from the denoised (probably over-smoothed) intermediate image.
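The two-step decomposition can be illustrated with a minimal NumPy sketch. The two functions below are hypothetical stand-ins chosen only to make the composition runnable: a 3×3 mean filter plays the role of the noise suppression step S and unsharp masking plays the role of the detail recovery step R. They are not the DFN and RDN modules used in this paper.

```python
import numpy as np

def suppress_noise(x: np.ndarray) -> np.ndarray:
    """Stand-in for S: 3x3 mean filter (hypothetical; the paper uses a DFN)."""
    h, w = x.shape
    padded = np.pad(x, 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def recover_structure(y_prime: np.ndarray, amount: float = 0.5) -> np.ndarray:
    """Stand-in for R: unsharp masking (hypothetical; the paper uses an RDN)."""
    blurred = suppress_noise(y_prime)
    return y_prime + amount * (y_prime - blurred)

def restore(x: np.ndarray):
    """Two-stage mapping: intermediate y' = S(x), final estimate y_hat = R(y')."""
    y_prime = suppress_noise(x)
    y_hat = recover_structure(y_prime)
    return y_prime, y_hat

# Toy data: a smooth ramp corrupted by additive Gaussian noise.
rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))
noisy = clean + 0.1 * rng.standard_normal(clean.shape)
y_prime, y_hat = restore(noisy)
```

The point of the sketch is only the structure of the pipeline: the intermediate image is an explicit, inspectable output, and the second stage operates on it rather than on the noisy input.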

Network architecture
The proposed network model follows the classical architecture of GAN, which contains a disentangled generator network and a relativistic multi-scale discriminator network. The generator is composed of two modules, a dynamic filter module for noise suppression and a structure enhancement module for detail recovery. The network architecture is shown in Fig. 2 and the details of each module are elaborated in the following subsections.

Noise removal module
Because the noise distribution in the image domain is not deterministic, we adopt a dynamic filter network (DFN) [37], whose filters are learned adaptively from the input data. The proposed noise suppression model can be represented as $y' = f_\theta \odot x$, where $f_\theta = \mathrm{DFN}(x)$ denotes the output filter generated by the DFN, $\theta \in \mathbb{R}^{s \times s}$ is the parameter set of the filter $f$, $s$ is the filter size, and $\odot$ is the point-wise multiplication operator.
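One possible reading of the filtering step, following the dynamic local filtering idea of [37], is that every pixel receives its own s×s kernel, which is multiplied point-wise with the s×s neighborhood of that pixel and summed. The NumPy sketch below implements this reading. The array layout `(H, W, s, s)`, the edge padding, and the function name are assumptions made for illustration, and the filters are supplied by hand rather than predicted by a network.

```python
import numpy as np

def apply_dynamic_filters(x: np.ndarray, filters: np.ndarray) -> np.ndarray:
    """Apply one s x s filter per pixel.

    x:       image of shape (H, W)
    filters: array of shape (H, W, s, s), s odd (assumed layout)
    """
    h, w = x.shape
    s = filters.shape[-1]
    pad = s // 2
    padded = np.pad(x, pad, mode="edge")
    out = np.empty((h, w), dtype=np.float64)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + s, j:j + s]          # neighborhood of (i, j)
            out[i, j] = np.sum(filters[i, j] * patch)  # point-wise product, summed
    return out

rng = np.random.default_rng(0)
x = rng.random((8, 8))
s = 3

# Identity filters (center weight 1) must reproduce the input.
identity = np.zeros((8, 8, s, s))
identity[:, :, s // 2, s // 2] = 1.0
y_identity = apply_dynamic_filters(x, identity)

# Uniform filters act as a local mean and therefore smooth the input.
box = np.full((8, 8, s, s), 1.0 / (s * s))
y_box = apply_dynamic_filters(x, box)
```

In a DFN the filter tensor would vary from pixel to pixel according to the input content, which is what allows the smoothing strength to adapt locally.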
To reduce the complexity of the network structure while improving noise suppression performance, an LSTM unit is introduced into the DFN to generate dynamic filters progressively. Furthermore, an adaptive strategy guides the dynamic filter generation: at each time step, the last updated filter $f_\theta^{t-1}$ is concatenated with the current input to form the updated input.
Since the DFN focuses on noise suppression, a mean square error (MSE) loss function is used, which is formulated as:
$$\mathcal{L}_{mse} = \sum_{t=1}^{N} \lambda_t \, \| y'_t - y \|_2^2$$
where $y'_t$ is the updated image at time step $t$, generated from the input $x \oplus f_\theta^{t-1}$, and $\oplus$ denotes the channel-wise concatenation operation. $f_\theta^0$ is initialized from a Gaussian distribution. To balance training time and performance, we set $N = 3$ and $\lambda = [0.25, 0.5, 1]$ in our experiments.
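As a small worked example of the step weighting, the sketch below computes a weighted sum of per-step MSE values with N = 3 and λ = [0.25, 0.5, 1]. It assumes that the loss is a plain weighted sum over time steps and that each term is the mean squared error over pixels; the function name is hypothetical.

```python
import numpy as np

def progressive_mse(outputs, target, weights=(0.25, 0.5, 1.0)):
    """Weighted sum of per-step MSE between each intermediate output and the target."""
    if len(outputs) != len(weights):
        raise ValueError("one weight per time step is required")
    return float(sum(w * np.mean((y_t - target) ** 2)
                     for w, y_t in zip(weights, outputs)))

# Toy example: the output approaches the target (all zeros) step by step.
target = np.zeros((4, 4))
outputs = [np.full((4, 4), 0.4), np.full((4, 4), 0.2), np.full((4, 4), 0.1)]
loss = progressive_mse(outputs, target)
# 0.25 * 0.16 + 0.5 * 0.04 + 1.0 * 0.01 = 0.07
```

With these weights, later time steps contribute more to the loss, so the final filter is penalized most strongly while early steps are still supervised.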

Structure enhancement module
Inspired by deep learning based work on image super-resolution [38,39], our structure enhancement module uses a residual dense network (RDN) [40] to recover structural details, as shown in Fig. 2, similar to [39]. To further enhance the performance, we made the following improvements on [39]:

Richer input
The RDN aims to enhance the structural details of the denoised input. However, the DFN tends to generate results that are over-smoothed to varying degrees. To avoid excessive loss of detail, we concatenated the outputs $y'_t$ of the different time steps as the input of the RDN.

Lightweight backbone
Compared with other networks for the super-resolution task, our RDN module aims to recover structural details from over-smoothed inputs, which requires more attention to finer structures and details. Based on this consideration, we removed the up/down-sampling operations in the RDN and used five residual dense blocks as its backbone, which demonstrated strong detail recovery performance in our experiments.

Improved feature loss
Since feature loss has been widely used for detail recovery, we adopted the improved feature loss of [39] in the RDN. The features before the activation layer are used to improve detail recovery, which avoids the inconsistent details caused by the sparseness of activated features. A pretrained VGG-19 [41] model was used for the feature loss.
As a result, the total loss for the generator in DNSGAN is defined as a weighted combination of three terms: $\mathcal{L}_c = \mathbb{E}_x \| \mathrm{RDN}(y') - y \|_1$, the content loss that evaluates the differences between the generated images and the ground truth images; $\mathcal{L}_{fea}$, the feature loss; and $\mathcal{L}_G^{Ra}$, the GAN loss. $\lambda$, $\eta$, and $\gamma$ are the balancing coefficients.
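The combination of the three generator terms can be sketched as follows. Several things here are assumptions made for illustration: a random linear map stands in for the pretrained VGG-19 feature extractor, the feature distance is taken as a mean squared difference, and the weights are named `w_content`, `w_feature`, and `w_adv` with placeholder values because the assignment of λ, η, and γ to the individual terms is not recoverable from the text.

```python
import numpy as np

def content_loss(y_hat, y):
    """L1 content loss, averaged over pixels."""
    return float(np.mean(np.abs(y_hat - y)))

def make_toy_extractor(in_dim, out_dim, seed=0):
    """Random linear 'feature extractor' (hypothetical stand-in for VGG-19)."""
    rng = np.random.default_rng(seed)
    weight = rng.standard_normal((out_dim, in_dim)) / np.sqrt(in_dim)

    def extract(img, activated=False):
        z = weight @ img.ravel()                       # pre-activation features
        return np.maximum(z, 0.0) if activated else z  # optional ReLU
    return extract

def feature_loss(y_hat, y, extract):
    """Distance between pre-activation features (assumed squared error)."""
    return float(np.mean((extract(y_hat) - extract(y)) ** 2))

def generator_loss(y_hat, y, adv_loss, extract,
                   w_content=1.0, w_feature=1.0, w_adv=1.0):
    """Weighted sum of content, feature, and adversarial terms (placeholder weights)."""
    return (w_content * content_loss(y_hat, y)
            + w_feature * feature_loss(y_hat, y, extract)
            + w_adv * adv_loss)

extract = make_toy_extractor(in_dim=16 * 16, out_dim=32)
rng = np.random.default_rng(1)
y = rng.random((16, 16))
y_hat = y + 0.05 * rng.standard_normal((16, 16))
total = generator_loss(y_hat, y, adv_loss=0.7, extract=extract)

# Fraction of features zeroed by the activation: the sparseness that
# motivates comparing features before the activation layer.
sparsity = float(np.mean(extract(y, activated=True) == 0.0))
```

The `sparsity` value shows why pre-activation features are attractive in this toy setting: after the ReLU a share of the feature entries carries no signal at all, so differences in those entries cannot contribute to the loss.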

Relativistic PatchGAN
To reduce the complexity of the network and improve the visual quality of the generated image, we made two modifications to the traditional discriminator architecture to improve training efficiency: (a) a relativistic adversarial loss is introduced into the discriminator, which predicts the probability that a real input is relatively more realistic than a fake input instead of producing a binary output, and (b) a multi-scale PatchGAN [42,43] is used to simplify the network structure while enhancing the performance of the discriminator.
The traditional discriminator can be expressed as $D(x) = \sigma(C(x))$, where $\sigma(\cdot)$ is the sigmoid function and $C(x)$ is the non-transformed discriminator output. In our DNSGAN, a relativistic average discriminator [44] is used, referred to as $D_{Ra}$, which is formulated as $D_{Ra}(y_r, y_f) = \sigma(C(y_r) - \mathbb{E}_{y_f}[C(y_f)])$, where $y_r$ represents the NDCT image, $y_f$ represents the generated denoised CT image, and $\mathbb{E}_{y_f}[\cdot]$ represents the averaging operation over all fake data in the mini-batch, as shown in Fig. 3.
The discriminator loss is then defined as:
$$\mathcal{L}_D^{Ra} = -\mathbb{E}_{y_r}[\log(D_{Ra}(y_r, y_f))] - \mathbb{E}_{y_f}[\log(1 - D_{Ra}(y_f, y_r))]$$
and the adversarial loss for the generator takes a symmetrical form:
$$\mathcal{L}_G^{Ra} = -\mathbb{E}_{y_r}[\log(1 - D_{Ra}(y_r, y_f))] - \mathbb{E}_{y_f}[\log(D_{Ra}(y_f, y_r))]$$
PatchGAN (a Markovian discriminator) classifies each N×N image patch as real or fake. It is more suitable for tasks that focus on detail or texture preservation. We introduced the relativistic discriminator to further enhance the performance of the discriminator: compared with the standard PatchGAN, the relativistic PatchGAN in the proposed DNSGAN applies the relativistic loss to the patch-level scores. In our experiments, the relativistic discriminator contains five convolution layers and an average pooling layer, and we selected patches at two scales, from the last and penultimate layers, to obtain the scores of real and fake samples.
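The relativistic average losses can be sketched in NumPy on patch-level score maps. The sketch follows the standard relativistic average formulation of [44]. The averaging over all patch scores of the opposite class, the equal-weight mean over the two scales, and the score-map shapes are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_ra(c_a, c_b):
    """Relativistic average output: sigmoid(C(a) - mean(C(b)))."""
    return sigmoid(c_a - np.mean(c_b))

def discriminator_loss(c_real, c_fake, eps=1e-12):
    """Real scores should exceed the average fake score, and vice versa."""
    real_term = -np.mean(np.log(d_ra(c_real, c_fake) + eps))
    fake_term = -np.mean(np.log(1.0 - d_ra(c_fake, c_real) + eps))
    return float(real_term + fake_term)

def generator_adv_loss(c_real, c_fake, eps=1e-12):
    """Symmetrical form: the roles of real and fake are swapped."""
    real_term = -np.mean(np.log(1.0 - d_ra(c_real, c_fake) + eps))
    fake_term = -np.mean(np.log(d_ra(c_fake, c_real) + eps))
    return float(real_term + fake_term)

def multi_scale(loss_fn, real_maps, fake_maps):
    """Average a loss over patch score maps taken at several scales (assumed)."""
    return float(np.mean([loss_fn(r, f) for r, f in zip(real_maps, fake_maps)]))

# Hypothetical raw patch scores C(.) at two scales, shaped (batch, 1, h, w).
real_maps = [np.full((4, 1, 6, 6), 2.0), np.full((4, 1, 14, 14), 2.0)]
fake_maps = [np.full((4, 1, 6, 6), -2.0), np.full((4, 1, 14, 14), -2.0)]
d_loss = multi_scale(discriminator_loss, real_maps, fake_maps)
g_loss = multi_scale(generator_adv_loss, real_maps, fake_maps)

# When real and fake scores are indistinguishable, both losses equal 2 * ln(2).
tie = [np.zeros((4, 1, 6, 6))]
d_tie = multi_scale(discriminator_loss, tie, tie)
```

In the example the discriminator separates real from fake well, so its loss is small while the generator loss is large. Because the generator loss contains a term on the real scores as well, the generator receives gradients from both real and generated data.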
The proposed method builds on an end-to-end learning architecture that accepts inputs of arbitrary size. Therefore, our method is trained on image patches and applied to whole images. The details are provided in section 3 on experiments.

Experiments
This section presents the experimental setup and evaluates the performance of the proposed DNSGAN. Comprehensive experiments compare it with several competitive methods on two low-dose CT datasets, one with simulated noise and one with real noise. Peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and Fréchet inception distance (FID) are used to evaluate the results quantitatively. All these metrics were calculated on whole images.
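For reference, PSNR on images normalized to [0, 1] can be computed as in the short NumPy sketch below. The function name and the `data_range` parameter are illustrative; SSIM and FID need considerably more machinery and are not sketched here.

```python
import numpy as np

def psnr(reference: np.ndarray, estimate: np.ndarray, data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images with the given dynamic range."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))

# A constant error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. 20 dB.
reference = np.zeros((8, 8))
estimate = np.full((8, 8), 0.1)
value = psnr(reference, estimate)
```

Because PSNR depends only on the mean squared error, an over-smoothed image can score well even when fine structures are lost, which is why it is reported together with SSIM and FID.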

Low-dose CT dataset with simulated noise
The Mayo Clinic CT dataset [45] was used in our experiments. It was prepared for "the 2016 NIH-AAPM-Mayo Clinic Low Dose CT Grand Challenge" to evaluate competing LDCT image reconstruction algorithms. The dataset consists of 5936 normal-dose abdominal CT images of 512×512 pixels taken from 10 anonymous patients and the corresponding simulated quarter-dose images obtained by realistic noise insertion. The slice thickness and reconstruction interval in this dataset are 1.0 mm and 3.0 mm, respectively. The scanning tube potential and effective mAs used for this dataset were 120 kV and 200 mAs, respectively. All data were obtained on similar scanner models (Somatom Definition AS+, or Somatom Definition Flash operated in single-source mode, Siemens Healthcare, Forchheim, Germany). Please refer to [45] for more details.

Experiment setting
We randomly selected 4000 slices from the LDCT images and the corresponding NDCT images as the training set, and the remaining 1936 LDCT images were used as the testing set. We generated approximately 120,000 samples of 128×128 pixels randomly cropped from the training set and validated the proposed model on whole images in the testing set. The data in the experiments are normalized to [0, 1]. The batch size was set to 8. To speed up the training process, a PSNR-oriented model was trained first.
The learning rate is initialized to $2 \times 10^{-4}$. The GAN-based model is trained by fine-tuning with a learning rate of $1 \times 10^{-4}$. For optimization, we used the Adam algorithm [46] with $\beta_1 = 0.9$ and $\beta_2 = 0.99$. We implemented our model with the PyTorch framework [47] and trained it on an NVIDIA Titan V GPU.
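The patch sampling described above can be sketched as follows. Drawing 30 patches per slice is an inference from 120,000 samples over 4000 slices, not a stated setting, and the intensity window used for normalization (`lo`, `hi`) is hypothetical. The essential point is that the LDCT and NDCT patches are cropped at the same location.

```python
import numpy as np

def normalize(img: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Map intensities from [lo, hi] to [0, 1], clipping values outside the window."""
    return np.clip((img.astype(np.float64) - lo) / (hi - lo), 0.0, 1.0)

def random_paired_patches(ldct, ndct, n_patches, size=128, rng=None):
    """Crop n_patches aligned (LDCT, NDCT) patches of size x size pixels."""
    if ldct.shape != ndct.shape:
        raise ValueError("LDCT and NDCT slices must have the same shape")
    rng = np.random.default_rng() if rng is None else rng
    h, w = ldct.shape
    tops = rng.integers(0, h - size + 1, size=n_patches)   # upper bound is exclusive
    lefts = rng.integers(0, w - size + 1, size=n_patches)
    x = np.stack([ldct[t:t + size, l:l + size] for t, l in zip(tops, lefts)])
    y = np.stack([ndct[t:t + size, l:l + size] for t, l in zip(tops, lefts)])
    return x, y

# Synthetic 512 x 512 slice pair standing in for real CT data.
rng = np.random.default_rng(0)
ndct_hu = rng.uniform(-1000.0, 1000.0, size=(512, 512))
ldct_hu = ndct_hu + 30.0 * rng.standard_normal((512, 512))
lo, hi = -1000.0, 1000.0  # hypothetical intensity window
x_patches, y_patches = random_paired_patches(
    normalize(ldct_hu, lo, hi), normalize(ndct_hu, lo, hi), n_patches=30, rng=rng)
```

Since the network is fully convolutional, a model trained on such 128×128 patches can afterwards be applied to whole 512×512 slices.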

Components analysis
We first investigated the impact of different modules and loss function combinations on noise suppression and structure recovery in the proposed DNSGAN. For the PSNR-oriented generator network, we first studied the effect of the DFN module alone for noise removal, referred to as DNSN-DF. For the enhancement module RDN, we mainly analyzed the effect of richer inputs, referred to as DNSN and DNSGAN. For the discriminator, we mainly focused on the adversarial loss; a standard cross-entropy loss was used in our proposed method for comparison. In addition, the improved feature loss was also analyzed. Table 1 gives detailed descriptions of each variant combining different modules and loss functions.
A representative slice from the testing set was selected to show the performance of each method in Fig. 4. The methods with L2/L1 norms, e.g., DNSN-DF, DNSN-1, and DNSN, clearly produced smoother results. Compared with DNSN-DF, DNSN-1 and DNSN, which include the RDN module, obtained clearer structures but smoother details, which resulted in higher FID scores and revealed that the richer input is effective for structure enhancement. On the other hand, the methods with adversarial loss, such as DNSGAN-1, DNSGAN, DNSGAN-CS, and DNSGAN-NF, achieved better visual perception with lower FID scores. DNSGAN-1 and DNSGAN had finer structures. In addition, the improved feature loss promoted structure recovery and artifact removal compared with DNSGAN-NF. The quantitative results on the whole testing set are shown in Table 2. DNSN had the best PSNR and SSIM values, but DNSGAN achieved a better balance between visual perception and noise suppression.

Qualitative and quantitative results
In this section, DNSN, DNSGAN-CS, and DNSGAN were selected as our baselines for comparison with other state-of-the-art methods, including BM3D, RedCNN [21], and WGAN-VGG. A visual result is given in Fig. 5. The zoomed regions (indicated by red and blue arrows) are used to visualize structural differences. All the methods showed a strong capacity for noise removal, but BM3D, RedCNN, and DNSN had smoother local details. BM3D even introduced extra artifacts compared with the other methods. DNSN achieved the best PSNR scores with only the L1/L2 norm. The adversarial learning based methods brought better visual perception than the PSNR-oriented methods. However, WGAN-VGG generated some unpleasant artifacts. DNSGAN-CS achieved better qualitative results in noise reduction and structure restoration. The improved discriminator further enhanced the ability of the model to retain details.
In addition, we used the noise power spectrum (NPS) [48] to validate the performance of our method. We selected a structure-rich ROI from the LDCT image, indicated by an orange rectangle in Fig. 5, to calculate the 2D and 1D NPS metrics; the results for the different methods are shown in Fig. 6. All the methods removed noise to varying degrees. However, undesired waxy artifacts gave BM3D a higher peak. Although WGAN-VGG brought better visual perception than BM3D, unpleasant details led to a higher peak in the 1D NPS curve and lower metrics (i.e., PSNR and SSIM). In contrast, our method achieved a better trade-off between noise removal and visual perception than the other methods.
Figure 7 presents the results of the different methods in the coronal and sagittal planes. All the methods showed trends similar to those in the transverse plane, and our method showed the best balance between fine structure recovery and noise reduction.
The quantitative results on the whole testing set are given in Table 3. DNSN achieved the highest PSNR and SSIM scores, while the proposed DNSGAN achieved a better balance among these metrics.
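The NPS analysis mentioned above can be sketched with a single-ROI estimate: the ROI mean is subtracted, the squared magnitude of the 2D Fourier transform is scaled by the pixel area over the number of pixels, and the 1D curve is obtained by radial averaging. This is a generic textbook-style estimate under stated assumptions (mean-only detrending, a single ROI, a hypothetical `pixel_size`); the exact procedure of [48] used for Fig. 6 may differ.

```python
import numpy as np

def nps_2d(roi: np.ndarray, pixel_size: float = 1.0) -> np.ndarray:
    """Single-ROI 2D noise power spectrum with the zero frequency at the center."""
    roi = roi.astype(np.float64)
    detrended = roi - roi.mean()            # remove the DC component only
    ny, nx = roi.shape
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(detrended))) ** 2
    return spectrum * (pixel_size ** 2) / (nx * ny)

def radial_profile(nps: np.ndarray) -> np.ndarray:
    """1D NPS: average of the 2D NPS over rings of integer radius."""
    ny, nx = nps.shape
    yy, xx = np.indices((ny, nx))
    r = np.hypot(yy - ny // 2, xx - nx // 2).astype(int)
    sums = np.bincount(r.ravel(), weights=nps.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)     # guard against empty rings

# White-noise ROI as a toy input; its 1D NPS is roughly flat.
rng = np.random.default_rng(0)
roi = rng.standard_normal((64, 64))
nps = nps_2d(roi)
profile = radial_profile(nps)
```

With unit pixel size, the 2D NPS averaged over all frequencies equals the variance of the ROI (Parseval's theorem), which is a convenient sanity check for the scaling. A peak in the 1D curve indicates noise or artifact energy concentrated around a particular spatial frequency.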

Low-dose CT dataset with real noise
The proposed DNSGAN was also validated on a real low-dose CT dataset, the Dongpu General Hospital (DGH) dataset, which contains 4872 one-sixth-dose head CT scans of 512×512 pixels and the corresponding normal-dose CT images from 11 patients with representative protocols. All data were obtained on the same scanner (MinFound ScintCare CT16). The head CT data of each patient include three different scan thicknesses: 1.16 mm, 2.32 mm, and 4.64 mm. In addition, these CT scans were reconstructed with two different reconstruction kernels. Since the low-dose CT images and the corresponding normal-dose CT images are not perfectly registered, because of errors in patient table re-positioning and uncertainty in the source angle initialization, in this experiment we only validated the generalization performance of the proposed method on datasets with different noise levels. The data in the experiments are normalized to [0, 1].

Experiment setting
Since the DGH dataset contains varying scan thicknesses, which result in different noise levels in the LDCT images, the dataset is divided into three parts according to scan thickness, referred to as DGH-L, DGH-M, and DGH-H, which denote LDCT noise levels with thicknesses of 4.64 mm, 2.32 mm, and 1.16 mm, respectively. In this experiment, we did not retrain the models because of the non-ideal data. Instead, the model pre-trained on the Mayo dataset was used to produce the results on the DGH dataset, which is an effective way to evaluate the generalization ability of the proposed method across data sources. In addition, because no accurately registered reference images are available, PSNR and SSIM were not used on the DGH dataset. Instead, the gray-level histogram and FID were used to measure the model's capacity for noise removal and structure recovery.
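A gray-level histogram comparison does not require pixel-wise registration, which is why it remains usable here. The sketch below computes normalized histograms for images scaled to [0, 1]. The number of bins and the L1 distance used to summarize the difference between two histograms are assumptions added for illustration; the comparison in this paper is made by inspecting the histogram plots.

```python
import numpy as np

def gray_histogram(img: np.ndarray, bins: int = 64) -> np.ndarray:
    """Normalized gray-level histogram of an image scaled to [0, 1]."""
    counts, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum()

def histogram_l1(img_a: np.ndarray, img_b: np.ndarray, bins: int = 64) -> float:
    """L1 distance between two normalized histograms (0 = identical, 2 = disjoint)."""
    return float(np.abs(gray_histogram(img_a, bins) - gray_histogram(img_b, bins)).sum())

# Toy data: adding noise broadens the gray-level distribution.
rng = np.random.default_rng(0)
reference = np.clip(0.5 + 0.05 * rng.standard_normal((64, 64)), 0.0, 1.0)
noisy = np.clip(reference + 0.1 * rng.standard_normal((64, 64)), 0.0, 1.0)
d_same = histogram_l1(reference, reference)
d_noisy = histogram_l1(reference, noisy)
```

A restored image whose histogram lies close to that of the normal-dose image has a similar gray-level distribution, although the histogram alone says nothing about where structures are located.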

Results for blind image restoration
One slice is selected from the DGH-L dataset and shown in Fig. 8. BM3D clearly led to a smoother result than RedCNN and DNSN. WGAN-VGG, DNSGAN-CS, and DNSGAN generated better results, but WGAN-VGG introduced extra artifacts at the edges. The histogram of DGH-L is illustrated in Fig. 9. All the methods tended to produce distributions similar to the ground truth in Fig. 9b and c. To evaluate the robustness of the proposed method further, a slice with a higher noise level from DGH-H is shown in Fig. 10. DNSGAN still achieved a better metric than the others. In addition, Table 4 gives the statistical results of FID produced by the different methods on each dataset. All the methods showed a strong ability to remove noise, but BM3D led to the worst FID value because of over-smoothed structures. Although RedCNN and DNSN had better results than BM3D, they had lower FID values than LDCT due to lack of finer details. GAN-based models achieved a better balance between noise removal and detail preservation, but WGAN-VGG brought extra artifacts near the edges because of its poor generalization.
Furthermore, Table 4 shows that all the supervised learning based methods attained their best metric on DGH-M. However, the CNN-based models achieved better results on DGH-L, followed by DGH-H, whereas the GAN-based models, both WGAN-VGG and ours, show the opposite trend. Considering that all the methods were trained on the Mayo dataset with specific low-dose scans (e.g., quarter-dose), they tend to achieve better results on similar or lower noise levels; for GAN-based methods, however, the extra discriminative constraints provide better generalization to higher noise levels, which enables better results on DGH-H than on DGH-L. Even so, the proposed DNSGAN still had the best metrics on all the datasets.

Conclusion
In this paper, we propose a disentangled LDCT restoration model, DNSGAN, which explicitly decouples noise removal into two steps, noise suppression and structure recovery, and achieves a better balance between quantitative metrics and visual perception than other state-of-the-art methods. In addition, advanced techniques including a dynamic filter network and a residual dense network were introduced. A relativistic multi-scale PatchGAN was also used as the discriminator network to further recover finer structures. Experiments on datasets with simulated and real noise show that the proposed DNSGAN has competitive performance for LDCT restoration and strong generalization across different imaging protocols.