2.1 Noise reduction model
The general image restoration problem can be considered from the perspective of domain transform [35]. A source domain \( \mathcal{S} \) and a target domain \( \mathcal{T} \) contain samples drawn from two different distributions \( P_S \) and \( P_T \), respectively. \( x\in \mathcal{S} \) denotes the LDCT image from the source domain and \( y\in \mathcal{T} \) denotes the corresponding NDCT image from the target domain, where \( x\sim P_S \) and \( y\sim P_T \).
For the image restoration task, a generic denoising process for LDCT can be expressed as:
$$ x=F(y)+\varepsilon $$
(1)
where \( F:y\to x \) represents the nonlinear degradation process caused by noise, and ε stands for the additive part of the noise and other unmodeled factors. Current DNN-based methods focus on learning a nonlinear function F† that directly maps x to y, which can be expressed as:
$$ {F}^{\dagger }(x)=\hat{y}\approx y $$
(2)
The general idea is to find the optimal F† that minimizes the distance between \( P_S \) and \( P_T \).
Unfortunately, since the noise in LDCT images does not obey any specific statistical distribution, the denoising operation inevitably smooths details to some degree, which makes F† difficult to learn directly, even when a GAN is introduced to enforce stronger constraints. As a result, the output may depend heavily on the specific form of the loss function.
In order to solve this problem, inspired by the idea of domain transform, this process is disentangled into two steps: noise reduction and structural enhancement. The first step follows the general idea of learning-based methods to learn an image-to-image denoising model, and the second step recovers the details smoothed by the first step. This is similar to the pre-upsampling image super-resolution models [36], which first upsample the original image and then recover details on the upsampled image. The denoised result y′ after the first step can be viewed as an intermediate result that bridges the gap between the low-dose image x and the normal-dose image y. Based on this consideration, Eq. 2 is reformulated as follows,
$$ {F}^{\dagger }(x)=R\left({y}^{\prime}\right)=\hat{y}\approx y,\kern0.6em {y}^{\prime }=S(x) $$
(3)
where \( S\left(\cdot \right):x\in \mathcal{S}\to {y}^{\prime}\in \mathcal{I} \) represents the noise suppression process, which transforms the sample x into the intermediate domain \( \mathcal{I} \). \( R\left(\cdot \right):{y}^{\prime}\in \mathcal{I}\to \hat{y}\in \mathcal{T} \) denotes the detail recovery process, which aims to enhance the structures and recover finer details from the denoised (probably over-smoothed) intermediate image.
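The two-step decomposition \( F^{\dagger}(x)=R(S(x)) \) can be sketched with simple classical operators. In this minimal NumPy illustration, a box filter stands in for the learned noise-suppression module \( S(\cdot) \) and unsharp masking stands in for the detail-recovery module \( R(\cdot) \); both are illustrative assumptions, not the networks used in the paper.

```python
import numpy as np

def S(x, k=3):
    """Noise-suppression step S(.): a box filter standing in for the
    learned DFN module (an illustrative assumption)."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = xp[i:i + k, j:j + k].mean()
    return out

def R(y_prime, alpha=0.5):
    """Detail-recovery step R(.): unsharp masking standing in for the RDN,
    re-amplifying structure lost to the smoothing step."""
    return y_prime + alpha * (y_prime - S(y_prime))

rng = np.random.default_rng(0)
x = rng.random((8, 8))   # low-dose sample from the source domain S
y_prime = S(x)           # intermediate domain I: denoised, over-smoothed
y_hat = R(y_prime)       # F†(x) = R(S(x)), the final restored estimate
```

The key point of the decomposition survives even in this toy form: the intermediate image y′ is strictly smoother than the input, and the second stage operates only on y′, never on x directly.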
2.2 Network architecture
The proposed network model follows the classical architecture of GAN, which contains a disentangled generator network and a relativistic multi-scale discriminator network. The generator is composed of two modules, a dynamic filter module for noise suppression and a structure enhancement module for detail recovery. The network architecture is shown in Fig. 2 and the details of each module are elaborated in the following subsections.
2.2.1 Noise removal module
Due to the nondeterminacy of the noise distribution in the image domain, we propose to adopt a dynamic filter network (DFN) [37], whose filters are learned adaptively from the input data. The proposed noise suppression model can be represented as:
$$ {y}^{\prime }=S(x)={f}_{\theta}\odot x $$
(4)
where fθ = DFN(x) denotes the filter generated by the DFN, θ ∈ ℝ<sup>s×s</sup> is the parameter set of the filter f, and s is the filter size. fθ is applied to the input as y′ = fθ ⊙ x, where ⊙ is the point-wise multiplication operator.
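The defining property of a dynamic filter is that it is computed from the input sample itself rather than fixed after training. The sketch below, a hedged NumPy stand-in, replaces the learned DFN with a single affine map plus sigmoid (a hypothetical choice, not the paper's architecture) just to show the generate-then-apply pattern of Eq. (4):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dfn(x, w=2.0, b=-1.0):
    """Hypothetical stand-in for the filter-generating network: one affine
    map + sigmoid yields a sample-specific mask f_theta with the same
    shape as x (the real DFN is a learned convolutional network)."""
    return sigmoid(w * x + b)

rng = np.random.default_rng(1)
x = rng.random((8, 8))    # LDCT input patch, values in [0, 1)
f_theta = dfn(x)          # filter generated from the input itself
y_prime = f_theta * x     # y' = f_theta ⊙ x (point-wise multiplication)
```

Because the sigmoid keeps the mask in (0, 1), each output pixel is a sample-dependent attenuation of the corresponding input pixel, which is the point-wise form of filtering described above.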
In order to reduce the complexity of the network structure while improving noise suppression performance, an LSTM unit is introduced into the DFN to progressively generate dynamic filters. Furthermore, an adaptive strategy is used to guide the dynamic filter generation: at each time step, the last updated filter \( {f}_{\theta}^{t-1} \) is concatenated with the current input to form the updated input.
Considering that the DFN focuses on noise suppression, the mean square error (MSE) loss function is utilized, formulated as:
$$ {\mathrm{\mathcal{L}}}_{dfn}\left(\left\{{y}^{\prime}\right\},y\right)=\sum \limits_{t=1}^N{\lambda}_t{\mathrm{\mathcal{L}}}_{mse}\left({y}_{f_{\theta}^t}^{\prime },y\right) $$
(5)
where \( {y}_{f_{\theta}^t}^{\prime }= DFN\left(x\oplus {f}_{\theta}^{t-1}\right) \) is the updated image, and ⊕ denotes the channel-wise concatenation operation. \( {f}_{\theta}^0 \) is initialized with Gaussian distribution. To balance the training time and performance, we set N = 3 and λ = [0.25, 0.5, 1] in our experiments.
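The weighted multi-step loss of Eq. (5) can be written out directly. This NumPy sketch assumes the list of per-step outputs \( y'_t \) is already available (producing them requires the recurrent DFN itself) and applies the weights λ = [0.25, 0.5, 1] from above:

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def dfn_loss(intermediates, y, lambdas=(0.25, 0.5, 1.0)):
    """Eq. (5): weighted sum of per-time-step MSE terms; later steps,
    whose filters are closer to convergence, receive larger weights."""
    return sum(lam * mse(y_t, y) for lam, y_t in zip(lambdas, intermediates))

# Toy check with constant images: errors shrink over the N = 3 steps.
y = np.zeros((4, 4))
steps = [np.full((4, 4), 0.4), np.full((4, 4), 0.2), np.full((4, 4), 0.1)]
loss = dfn_loss(steps, y)   # 0.25*0.16 + 0.5*0.04 + 1.0*0.01 = 0.07
```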
2.2.2 Structure enhancement module
Inspired by deep learning based works on image super-resolution [38, 39], our structure enhancement module uses a residual dense network (RDN) [40] to recover structural details, as shown in Fig. 2, similar to [39]. To further enhance performance, we made the following improvements on [39]:
Richer input
The RDN aims to enhance the structural details of the denoised input. However, the DFN tends to generate over-smoothed results to varying degrees. In order to avoid excessive detail loss, we concatenated each \( {y}_t^{\prime } \) from the different time steps as the input of the RDN.
Lightweight backbone
Compared with other networks for the super-resolution task, our RDN module aims to recover structural details from over-smoothed inputs, which requires more attention to finer structures and details. Based on this consideration, we removed the up/down-sampling operations in the RDN and used five residual dense blocks as its backbone, which demonstrated powerful detail-recovery performance in our experiments.
Improved feature loss
Considering that feature loss has been widely used for detail recovery, we adopted the improved feature loss of [39] in the RDN. The features before the activation layer are utilized to enhance detail recovery, which avoids the inconsistent details caused by the sparseness of activated features. A pretrained VGG-19 [41] model was used for the feature loss.
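The "before activation" choice is the essential point. The sketch below replaces the pretrained VGG-19 with a single hypothetical linear map (so it stays self-contained); what it preserves is the structure of the loss: features are compared prior to any ReLU, since a ReLU would zero out negative responses and make the comparison sparse.

```python
import numpy as np

def features(img, w=1.5):
    """Stand-in feature extractor: one linear map returning pre-activation
    responses. The paper instead takes features from a pretrained VGG-19,
    *before* the activation layer, so the responses stay dense."""
    return w * img   # no ReLU applied here, by design

def feature_loss(pred, target):
    # L2 distance between pre-activation features; applying a ReLU first
    # would discard all negative responses and lose detail information.
    return float(np.mean((features(pred) - features(target)) ** 2))

a = np.ones((2, 2))
b = np.zeros((2, 2))
loss_ab = feature_loss(a, b)   # mean((1.5 - 0)^2) = 2.25
```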
As a result, the total loss for the generator in DNSGAN is defined as:
$$ {\mathrm{\mathcal{L}}}_{Gen}=\lambda {\mathrm{\mathcal{L}}}_{dfn}+{\mathrm{\mathcal{L}}}_c+\eta {\mathrm{\mathcal{L}}}_{fea}+\gamma {\mathrm{\mathcal{L}}}_{G^{Ra}} $$
(6)
where \( {\mathrm{\mathcal{L}}}_c={\mathbbm{E}}_x{\left\Vert RDN\left({y}^{\prime}\right)-y\right\Vert}_1 \) is the content loss that evaluates the difference between the generated images and the ground-truth images, \( {\mathrm{\mathcal{L}}}_{fea} \) is the feature loss, \( {\mathrm{\mathcal{L}}}_{G^{Ra}} \) is the GAN loss, and λ, η, and γ are the balancing coefficients.
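Assembling Eq. (6) is then a plain weighted sum. In this one-line sketch, the coefficient values are illustrative placeholders, not the paper's tuned settings:

```python
def generator_loss(l_dfn, l_content, l_feature, l_gan,
                   lam=0.5, eta=1.0, gamma=0.1):
    """Eq. (6): L_Gen = lam*L_dfn + L_c + eta*L_fea + gamma*L_GAN.
    The default coefficients are placeholders for illustration only."""
    return lam * l_dfn + l_content + eta * l_feature + gamma * l_gan

total = generator_loss(1.0, 2.0, 3.0, 4.0)   # 0.5 + 2.0 + 3.0 + 0.4 = 5.9
```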
2.2.3 Relativistic PatchGAN
In order to reduce network complexity and improve the visual quality of the generated images, we also made two modifications to the traditional discriminator architecture to enhance training efficiency: (a) introducing the relativistic adversarial loss into the discriminator, which predicts the probability that a real input is relatively more realistic than a fake input, instead of producing a binary output, and (b) using a multi-scale PatchGAN [42, 43] to simplify the network structure while enhancing the discriminator's performance.
The traditional discriminator can be expressed as D(x) = σ(C(x)), where σ(⋅) is the sigmoid function and C(x) is the non-transformed discriminator output. In our DNSGAN, a relativistic average discriminator [44] is used, referred to as DRa, which is formulated as \( {D}_{Ra}\left({y}_r,{y}_f\right)=\sigma \left(C\left({y}_r\right)-{\mathbbm{E}}_{y_f}\left[C\left({y}_f\right)\right]\right) \), where yr represents the NDCT image, yf represents the generated denoised CT image, and \( {\mathbbm{E}}_{y_f}\left[\cdotp \right] \) represents the averaging operation on all fake data in the mini-batch, as shown in Fig. 3.
The discriminator loss is then defined as:
$$ {\mathrm{\mathcal{L}}}_{D^{Ra}}=-{\mathbbm{E}}_{y_r}\left[\log \left({D}_{Ra}\left({y}_r,{y}_f\right)\right)\right]-{\mathbbm{E}}_{y_f}\left[\log \left(1-{D}_{Ra}\left({y}_f,{y}_r\right)\right)\right] $$
(7)
and the adversarial loss for generator is formulated with a symmetrical form:
$$ {\mathrm{\mathcal{L}}}_{G^{Ra}}=-{\mathbbm{E}}_{y_r}\left[\log \left(1-{D}_{Ra}\left({y}_r,{y}_f\right)\right)\right]-{\mathbbm{E}}_{y_f}\left[\log \left({D}_{Ra}\left({y}_f,{y}_r\right)\right)\right] $$
(8)
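Given the raw discriminator scores C(·) for a mini-batch of real and fake samples, Eqs. (7) and (8) reduce to a few lines. This NumPy sketch operates directly on hypothetical score vectors (the convolutional discriminator that produces them is not modeled here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_ra(c_a, c_b):
    """D_Ra(a, b) = sigmoid(C(a) - E_b[C(b)]), applied over a mini-batch
    of non-transformed discriminator outputs."""
    return sigmoid(c_a - np.mean(c_b))

def discriminator_loss(c_real, c_fake):
    # Eq. (7)
    return float(-np.mean(np.log(d_ra(c_real, c_fake)))
                 - np.mean(np.log(1.0 - d_ra(c_fake, c_real))))

def generator_adv_loss(c_real, c_fake):
    # Eq. (8): the symmetric counterpart of Eq. (7)
    return float(-np.mean(np.log(1.0 - d_ra(c_real, c_fake)))
                 - np.mean(np.log(d_ra(c_fake, c_real))))

c_real = np.array([3.0, 2.5, 3.5])     # scores for NDCT (real) samples
c_fake = np.array([-2.0, -1.5, -2.5])  # scores for generated samples
```

With real scores well above fake scores, as here, the discriminator loss is small and the generator's adversarial loss is large, matching the intended push-pull of the relativistic formulation.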
PatchGAN (Markovian discriminator) classifies each N×N image patch as real or fake, which makes it more suitable for tasks that focus on detail or texture preservation. We further introduced the relativistic discriminator to enhance the discriminator's performance. Compared with the standard PatchGAN, the relativistic PatchGAN loss in the proposed DNSGAN can be expressed as:
$$ \underset{G}{\min}\underset{D_k}{\max}\sum \limits_{k=1}^K{\mathrm{\mathcal{L}}}_{GAN}\left({G}^{Ra},{D}_k^{Ra}\right) $$
(9)
In our experiments, the relativistic discriminator contains five convolution layers and an average pooling layer, and we selected two scaled patches from the last and penultimate layers to obtain the scores for real and fake samples.
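The multi-scale patch scoring can be illustrated without the convolutional layers themselves: given a (hypothetical) score map produced by a discriminator layer, each N×N patch is reduced to one realism score, and two values of N give the two scales used above.

```python
import numpy as np

def patch_scores(score_map, n=4):
    """Average a discriminator score map over non-overlapping NxN patches,
    yielding one realism score per patch (the Markovian/PatchGAN idea)."""
    h, w = score_map.shape
    return np.array([[score_map[i:i + n, j:j + n].mean()
                      for j in range(0, w, n)]
                     for i in range(0, h, n)])

rng = np.random.default_rng(2)
score_map = rng.random((16, 16))       # stand-in for a conv-layer output
coarse = patch_scores(score_map, n=8)  # coarse scale: 2 x 2 patch scores
fine = patch_scores(score_map, n=4)    # finer scale: 4 x 4 patch scores
```

Because the patches tile the map exactly, the mean of the patch scores at either scale equals the mean of the full score map; the scales differ only in how locally realism is judged.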
The proposed method is built on an end-to-end learning architecture that accepts input images of arbitrary size. Therefore, our method is trained on image patches and applied to whole images. The details are provided in Section 3 on experiments.