4.1 Gauss-Seidel iterative method
The Gauss-Seidel (GS) method is one of the most commonly used iterative methods for solving systems of linear equations. To solve a system of linear equations \(\mathbf {Ax} ={\mathbf {b}}\), the system is written row by row as
$$\begin{aligned} \begin{aligned} a_{i1}x_1+a_{i2}x_2+\dots +a_{in}x_n=b_i\quad (i=1,2,\dots ,n) \end{aligned} \end{aligned}$$
(29)
The iterative formula of GS [36] is
$$\begin{aligned} \begin{aligned} x_i^{(k+1)}=\Big (b_i- {\textstyle \sum _{j=1}^{i-1}}a_{ij}x_j^{(k+1)}- {\textstyle \sum _{j=i+1}^{n}}a_{ij}x_j^{(k)}\Big )/a_{ii} \\ (i=1,2,\dots ,n;\ k=0,1,2,\dots ,t) \end{aligned} \end{aligned}$$
(30)
For each row \(i\), the already-updated terms \({\textstyle \sum _{j=1}^{i-1}a_{ij}x_j^{(k+1)}}\) and the not-yet-updated terms \({\textstyle \sum _{j=i+1}^{n}}a_{ij}x_j^{(k)}\) are subtracted from \(b_i\); in matrix form,
$$\begin{aligned} \begin{aligned} {\mathbf {x}}^\mathrm {(k+1)} ={\mathbf {D}}^{-1} (-\mathbf {Lx}^\mathrm {(k+1)} -\mathbf {Ux}^\mathrm {(k)} +{\mathbf {b}} ) \end{aligned} \end{aligned}$$
(31)
$$\begin{aligned} \begin{aligned} ({\mathbf {D}} +{\mathbf {L}} ){\mathbf {x}}^\mathrm {(k+1)}=-\mathbf {Ux}^\mathrm {(k)} +{\mathbf {b}} \end{aligned} \end{aligned}$$
(32)
$$\begin{aligned} \begin{aligned} {\mathbf {x}}^\mathrm {(k+1)} =({\mathbf {D}} +{\mathbf {L}} )^{-1}(-{\mathbf {U}} {\mathbf {x}}^\mathrm {(k)} +{\mathbf {b}} ) \end{aligned} \end{aligned}$$
(33)
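To make the update in (30) and its matrix form (31)-(33) concrete, the following minimal NumPy sketch sweeps over the rows, using the already-updated entries \(x_j^{(k+1)}\) for \(j<i\) and the old entries \(x_j^{(k)}\) for \(j>i\); the function name, the zero initial guess, and the fixed iteration count are illustrative choices, not part of the method itself.

```python
import numpy as np

def gauss_seidel(A, b, iterations=10):
    """Component-wise Gauss-Seidel update of Eq. (30) (a minimal sketch).

    Assumes A is square with non-zero diagonal; the zero initial guess
    and the fixed iteration count are illustrative choices.
    """
    n = A.shape[0]
    x = np.zeros(n)
    for _ in range(iterations):
        for i in range(n):
            s_new = A[i, :i] @ x[:i]          # already-updated terms (k+1)
            s_old = A[i, i + 1:] @ x[i + 1:]  # not-yet-updated terms (k)
            x[i] = (b[i] - s_new - s_old) / A[i, i]
    return x
```

Sweeping over \(i\) in order and overwriting \(x\) in place is exactly what the matrix form (33) expresses through the triangular solve with \({\mathbf {D}} +{\mathbf {L}}\).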
When GS is applied to communication detection, the Hermitian positive semi-definite matrix \({\mathbf {A}}\) is defined and decomposed into a strictly lower triangular part \({\mathbf {L}}\), a strictly upper triangular part \({\mathbf {U}}\), and a diagonal part \({\mathbf {D}}\):
$$\begin{aligned} \begin{aligned} {\mathbf {A}} ={\mathbf {H}} ^\mathrm {T} {\mathbf {H}} +\frac{\sigma ^2 }{2} {\mathbf {I}} \end{aligned} \end{aligned}$$
(34)
$$\begin{aligned} \begin{aligned} {\mathbf {A}} ={\mathbf {D}} +{\mathbf {L}} +{\mathbf {U}} \end{aligned} \end{aligned}$$
(35)
Then the problem we solve is
$$\begin{aligned} \begin{aligned} \mathbf {Ax} ={\mathbf {H}} ^\mathrm {T} {\mathbf {y}} \end{aligned} \end{aligned}$$
(36)
This set of linear equations is then solved by computing the following iterative update [37].
$$\begin{aligned} \begin{aligned} {\hat{\mathbf {x}}}^{(\mathrm {n} )} =({\mathbf {D}} +{\mathbf {L}} )^{-1}[{\hat{\mathbf {x}}}_{MF} -{\mathbf {U}}{\hat{\mathbf {x}}}^{\mathrm {(n-1)} } ] \end{aligned} \end{aligned}$$
(37)
where \({\hat{\mathbf {x}}}^\mathrm {(n)}\) is the estimated signal, refined in each iteration, and \({\hat{\mathbf {x}}}_{MF} ={\mathbf {H}} ^\mathrm {T}{\mathbf {y}}\) replaces \({\mathbf {b}}\) in (32). Here \({\hat{\mathbf {x}} }^{(0)}\) is initialised to \({\mathbf {D}} ^{-1}{\mathbf {H}} ^\mathrm {T} {\mathbf {y}}\). Gauss-Seidel converges well and is guaranteed to converge when \({\mathbf {A}}\) is diagonally dominant or symmetric positive definite. This matters because in MAUE systems \({\mathbf {A}}\) is not guaranteed to be diagonally dominant, but it is always symmetric positive definite. Below we prove the convergence of Gauss-Seidel for the diagonally dominant and symmetric positive definite cases, respectively [38].
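As an illustration of (34)-(37), the sketch below assembles \({\mathbf {A}}\) from the real-valued channel, extracts \({\mathbf {D}}\), \({\mathbf {L}}\), \({\mathbf {U}}\), initialises with \({\mathbf {D}}^{-1}{\mathbf {H}}^\mathrm {T}{\mathbf {y}}\), and iterates (37); the function name, the explicit matrix inverses, and the iteration count are assumptions made for readability, not the low-complexity implementation of Section 4.3.

```python
import numpy as np

def gs_detect(H, y, sigma2, iterations=4):
    """Gauss-Seidel linear detection following Eqs. (34)-(37) (a sketch).

    H: real-valued N x M channel, y: length-N received vector,
    sigma2: noise power; the iteration count is an illustrative choice.
    """
    M = H.shape[1]
    A = H.T @ H + sigma2 / 2 * np.eye(M)       # Eq. (34)
    D = np.diag(np.diag(A))
    L = np.tril(A, k=-1)                       # strictly lower triangular part
    U = np.triu(A, k=1)                        # strictly upper triangular part
    x_mf = H.T @ y                             # matched-filter output, replaces b
    x_hat = np.linalg.inv(D) @ x_mf            # initialisation x_hat^(0)
    DL_inv = np.linalg.inv(D + L)              # done directly here; see Sec. 4.3
    for _ in range(iterations):
        x_hat = DL_inv @ (x_mf - U @ x_hat)    # Eq. (37)
    return x_hat
```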
Theorem 1
When \({\mathbf {A}}\) is strictly diagonally dominant, the Gauss-Seidel iteration is guaranteed to converge.
Proof of Theorem 1. For a strictly diagonally dominant matrix \({\mathbf {A}}\), the diagonal elements satisfy \(a_{ii}\ne 0,\ i=1,2,\dots ,n\), so
$$\begin{aligned} \begin{aligned} \left| {\mathbf {D}} +{\mathbf {L}} \right| = {\textstyle \prod _{i=1}^{n}}a_{ii} \ne 0 \end{aligned} \end{aligned}$$
(38)
Suppose \(\mathbf {B_G} =-({\mathbf {D}} +{\mathbf {L}} )^{-1}{\mathbf {U}}\) and let \(\lambda\) be an eigenvalue of \(\mathbf {B_G}\); then the characteristic equation is
$$\begin{aligned} \begin{aligned} \left| \lambda {\mathbf {I}} -\mathbf {B_G} \right|&=\left| \lambda {\mathbf {I}} +({\mathbf {D}} +{\mathbf {L}} )^{-1}{\mathbf {U}} \right| =\left| ({\mathbf {D}} +{\mathbf {L}} )^{-1} \right| \left| \lambda ({\mathbf {D}} +{\mathbf {L}} )+{\mathbf {U}} \right| =0\\ {}&\Rightarrow \left| \lambda ({\mathbf {D}} +{\mathbf {L}} )+{\mathbf {U}} \right| =0 \end{aligned} \end{aligned}$$
(39)
Since this determinant is zero, the corresponding homogeneous system has a non-zero solution. We argue by contradiction: suppose \(\left| \lambda \right| \ge 1\). Then
$$\begin{aligned} \begin{aligned} \lambda ({\mathbf {D}} +{\mathbf {L}} )+{\mathbf {U}} =\begin{bmatrix} \lambda a_{11} &{} a_{12} &{} \dots &{} a_{1n}\\ \lambda a_{21} &{} \lambda a_{22} &{} \dots &{}a_{2n} \\ \vdots &{} \dots &{} \ddots &{}\vdots \\ \lambda a_{n1} &{} \lambda a_{n2} &{} \dots &{} \lambda a_{nn} \end{bmatrix} \end{aligned} \end{aligned}$$
(40)
Under this assumption, \(\lambda ({\mathbf {D}} +{\mathbf {L}} )+{\mathbf {U}}\) is a strictly diagonally dominant matrix and therefore non-singular, i.e. \(\left| \lambda ({\mathbf {D}} +{\mathbf {L}} )+{\mathbf {U}} \right| \ne 0\), which contradicts (39), where the eigenvalue \(\lambda\) satisfies \(\left| \lambda ({\mathbf {D}} +{\mathbf {L}} )+{\mathbf {U}} \right| = 0\). Hence \(\left| \lambda \right| < 1\), i.e. \(\rho (\mathbf {B_G} )< 1\), so Gauss-Seidel converges when \({\mathbf {A}}\) is strictly diagonally dominant.
Theorem 2
When \({\mathbf {A}}\) is symmetric positive definite, Gauss-Seidel can guarantee convergence.
Proof of Theorem 2. Let \(\mathbf {B_G} =-({\mathbf {D}} +{\mathbf {L}} )^{-1}{\mathbf {U}}\), let \(\lambda\) be an eigenvalue of \(\mathbf {B_G}\) and \({\mathbf {x}}\) the corresponding eigenvector. Then
$$\begin{aligned} \begin{aligned} -({\mathbf {D}} +{\mathbf {L}} )^{-1}\mathbf {Ux}&=\lambda {\mathbf {x}} \\ \Rightarrow -\mathbf {Ux}&=\lambda ({\mathbf {D}} +{\mathbf {L}} ){\mathbf {x}} \\ \Rightarrow -{\mathbf {x}} ^\mathrm {T} \mathbf {Ux}&=\lambda {\mathbf {x}} ^\mathrm {T} ({\mathbf {D}} +{\mathbf {L}} ){\mathbf {x}} \end{aligned} \end{aligned}$$
(41)
Because \({\mathbf {A}}\) is positive definite, its diagonal entries are positive, so \(p={\mathbf {x}} ^\mathrm {T} \mathbf {Dx} > 0\). Since \({\mathbf {A}}\) is symmetric, \({\mathbf {L}} ={\mathbf {U}} ^\mathrm {T}\); setting \(-{\mathbf {x}} ^\mathrm {T} \mathbf {Ux} =a\), we also have \({\mathbf {x}} ^\mathrm {T} \mathbf {Lx} ={\mathbf {x}} ^\mathrm {T} {\mathbf {U}} ^\mathrm {T}{\mathbf {x}} =-a\), and therefore
$$\begin{aligned} \begin{aligned} {\mathbf {x}} ^\mathrm {T} \mathbf {Ax} ={\mathbf {x}}^\mathrm {T} ({\mathbf {D}} +{\mathbf {L}} +{\mathbf {U}} ){\mathbf {x}} =p-2a>0 \end{aligned} \end{aligned}$$
(42)
$$\begin{aligned} \begin{aligned} \lambda =\frac{-{\mathbf {x}} ^\mathrm {T} \mathbf {Ux} }{{\mathbf {x}} ^\mathrm {T}({\mathbf {D}} +{\mathbf {L}} ){\mathbf {x}} }=\frac{a}{p-a} \end{aligned} \end{aligned}$$
(43)
$$\begin{aligned} \begin{aligned} \lambda ^2=\frac{a^2}{p^2-2pa+a^2}=\frac{a^2}{p(p-2a)+a^2} <1 \end{aligned} \end{aligned}$$
(44)
Hence \(\left| \lambda \right| < 1\), i.e. \(\rho (\mathbf {B_G} )< 1\), so Gauss-Seidel converges when \({\mathbf {A}}\) is symmetric positive definite. Dividing the numerator and denominator of (44) by \(a^2\) gives
$$\begin{aligned} \begin{aligned} \lambda ^2=\frac{1}{(\frac{p}{a}-1 )^2} \end{aligned} \end{aligned}$$
(45)
From (45) we can see that the more diagonally dominant \({\mathbf {A}}\) is (the larger \(p\) is relative to \(a\)), the smaller \(\lambda ^2\) becomes and the faster the convergence.
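A quick numerical check of Theorems 1 and 2 and of (45) can be obtained by forming a symmetric positive definite \({\mathbf {A}}\) and measuring \(\rho (\mathbf {B_G})\); the random construction and the diagonal shift below are arbitrary illustrative choices.

```python
import numpy as np

# Numerical check: for a symmetric positive definite A, the iteration
# matrix B_G = -(D+L)^{-1} U has spectral radius below 1 (Theorems 1-2).
# The random construction and the diagonal shift are illustrative only.
rng = np.random.default_rng(0)
G = rng.standard_normal((8, 8))
A = G @ G.T + 8 * np.eye(8)               # SPD and fairly diagonally dominant
D = np.diag(np.diag(A))
L = np.tril(A, k=-1)
U = np.triu(A, k=1)
B_G = -np.linalg.inv(D + L) @ U
print(max(abs(np.linalg.eigvals(B_G))))   # spectral radius, expected < 1
```

Making \({\mathbf {A}}\) less diagonally dominant (a smaller diagonal shift) typically increases the printed spectral radius, consistent with the observation after (45).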
4.2 BGS-Net architecture
In this section, a model-driven DL detector network (called BGS-Net) is proposed. The signal detector uses the Gauss-Seidel method and a nonlinear activation to improve detection performance. The only trainable parameter per layer is \(\mathbf {\Omega } =\mathbf {\gamma }_t\), with \(\mathbf {\gamma } _t\in \mathrm {R} ^{\mathrm {M} \times 1}\). In the algorithm, \(({\mathbf {D}} +{\mathbf {L}} )^{-1}\), \({\hat{\mathbf {x}} }_{MF}\), \({\mathbf {U}}\), \(tr({\mathbf {H}} ^\mathrm {T} {\mathbf {H}} )\), \(\frac{1}{\mathrm {M} } tr({\mathbf {C}} _t{\mathbf {C}} _t^\mathrm {T} )\), and \(\frac{\sigma ^2}{\mathrm {M} } tr({\mathbf {W}} _t{\mathbf {W}} _t^\mathrm {T} )\) need to be computed only once and are then reused in every layer. In contrast, \({\mathbf {W}}_t\) and \({\mathbf {A}}_t\) in OAMPNet and MMNet-iid must be recalculated in every layer because they contain trainable parameters. The structure of BGS-Net, which improves the GS algorithm by adding a learnable vector variable \(\mathbf {\gamma } _{t}\), is shown in Figure 6. The network consists of \(L_{layer}\) cascaded layers with identical structure, each comprising a nonlinear estimator, the error variance \(\mathbf {\tau } _{t}^{2}\), and tied weights. The input of BGS-Net is \(\hat{{\mathbf {x}} }_{MF}\) together with the initial value \(\hat{{\mathbf {x}} }_{0}\), and the output is the final signal estimate \(\hat{{\mathbf {x}} }_{L_{layer}}\). The corresponding deep learning structure is shown in Figure 7. We first compute \(\hat{{\mathbf {z}} } _{t}\) and the scalar \(\tau _{t}^{2}\) in the GS detection block; these, together with the constellation set \(S\), form the input of the nonlinear detection stage, in which the vector variable \(\mathbf {\gamma } _{t}\) is introduced and the output \(\hat{{\mathbf {x}} } _{t+1}\) is produced. The difference between a model-driven network and a DNN is that many parameters of the model-driven network are fixed values obtained from prior knowledge, whereas the parameters of a DNN are all trainable.
Algorithm 1: BGS-Net algorithm for MIMO detection
Input: Received signal \({\mathbf {y}}\), channel matrix \({\mathbf {H}}\), noise level \(\sigma ^2/2\)
Initialize: \({\hat{\mathbf {x}}}_0 \leftarrow {\mathbf {D}} ^{-1}{\mathbf {H}} ^\mathrm {T} {\mathbf {y}}\)
For each layer \(t=0,1,\dots ,L_{layer}-1\):
1. \({\hat{\mathbf {z}}}_t =({\mathbf {D}} +{\mathbf {L}} )^{-1}[{\hat{\mathbf {x}}} _{MF} -{\mathbf {U}}{\hat{\mathbf {x}}}_t ]\)
2. \(v_t^2=\frac{\left\| {\mathbf {y}} -{\mathbf {H}}{\hat{\mathbf {x}}}_t \right\| _2^2-\mathrm {N} \frac{\sigma ^2}{2} }{tr({\mathbf {H}} ^\mathrm {T} {\mathbf {H}} )}\)
3. \(v_t^2=\max (v_t^2,10^{-9} )\)
4. \(\tau _t^2 =\frac{1}{\mathrm {M} }tr({\mathbf {C}} _t{\mathbf {C}} _t^\mathrm {T} ) v_t^2 +\frac{\sigma ^2}{\mathrm {M} } tr({\mathbf {W}} _t{\mathbf {W}} _t^\mathrm {T} )\)
5. \(\mathbf {\tau }_t^2 =\frac{\tau _t^2 }{\mathbf {\gamma }_t }\)
6. \({\hat{\mathbf {x}}}_{t+1} =E\left\{ {\mathbf {x}} |{\hat{\mathbf {z}}}_t ,\mathbf {\tau }_t \right\}\)
Output: \({\hat{\mathbf {x}}}_{L_{layer}}\)
where
$$\begin{aligned} \begin{aligned} {\mathbf {W}}_t ={\mathbf {H}} ^\mathrm {T} \end{aligned} \end{aligned}$$
(46)
$$\begin{aligned} \begin{aligned} {\mathbf {C}}_t ={\mathbf {I}} -{\mathbf {W}} _t{\mathbf {H}} \end{aligned} \end{aligned}$$
(47)
$$\begin{aligned} \begin{aligned} E\left\{ x_{ti}|{\hat{z}}_{ti} ,\tau _{ti} \right\}&= {\textstyle \sum _{s_j\in S}s_j\times p(s_j|{\hat{z}}_{ti} ,\tau _{ti} )}\\ {}&= {\textstyle \sum _{s_j\in S}s_j\times softmax(\frac{-\left\| {\hat{z}}_{ti}-s_j \right\| ^2 }{\tau _{ti}^2} )} \end{aligned} \end{aligned}$$
(48)
where \(softmax(V_i)= \frac{e^{V_i}}{ {\textstyle \sum _{j}e^{V_j}} }\). It can be seen from Algorithm 1 that there is only one trainable parameter per layer, and the vector \(\mathbf {\gamma } _t\) is used to adjust the estimated variance \(\mathbf {\tau } _t ^2\). Because \(\frac{1}{\mathrm {M} } tr({\mathbf {C}} _t{\mathbf {C}} _t ^\mathrm {T})\) is a constant, it is computed once and simply multiplied by \(v_t^2\) in each layer when forming \(\mathbf {\tau }_t ^2\), which saves a large amount of computation. Note that in going from step 4 to step 5, the scalar \(\tau _t^2\) is expanded to a vector \(\mathbf {\tau }_t^2\).
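The following sketch of a single layer follows steps 1-6 of Algorithm 1 together with (46)-(48); in BGS-Net the quantities \(({\mathbf {D}}+{\mathbf {L}})^{-1}\), \(tr({\mathbf {H}}^\mathrm {T}{\mathbf {H}})\), and the two trace terms are precomputed once, whereas this standalone function recomputes them, and the argument names and shapes are assumptions for illustration only.

```python
import numpy as np

def bgs_layer(x_hat, x_mf, y, H, DL_inv, U, sigma2, gamma_t, S):
    """One BGS-Net layer following steps 1-6 of Algorithm 1 (a sketch).

    Argument names and shapes are assumptions; in BGS-Net the matrices
    DL_inv, U and the trace terms are precomputed once and reused.
    """
    N, M = H.shape
    z_hat = DL_inv @ (x_mf - U @ x_hat)                    # step 1, Eq. (37)
    v2 = (np.sum((y - H @ x_hat) ** 2) - N * sigma2 / 2) / np.trace(H.T @ H)  # step 2
    v2 = max(v2, 1e-9)                                     # step 3
    W = H.T                                                # Eq. (46)
    C = np.eye(M) - W @ H                                  # Eq. (47)
    tau2 = np.trace(C @ C.T) / M * v2 + sigma2 / M * np.trace(W @ W.T)  # step 4 (scalar)
    tau2 = tau2 / gamma_t                                  # step 5 (now a length-M vector)
    # Step 6, Eq. (48): softmax over the constellation, per component.
    logits = -(z_hat[:, None] - S[None, :]) ** 2 / tau2[:, None]
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return p @ S                                           # posterior-mean estimate
```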
4.3 Low-complexity algorithm for \(({\mathbf {D}} +{\mathbf {L}} )^{-1}\)
In this section, the complexity of computing \(({\mathbf {D}} +{\mathbf {L}} )^{-1}\) is reduced. The complexity of calculating \({\hat{\mathbf {x}}}_t =({\mathbf {D}} +{\mathbf {L}} )^{-1}[{\hat{\mathbf {x}}}_{MF} -{\mathbf {U}}{\hat{\mathbf {x}}} _{t-1} ]\) is concentrated mainly in solving \(({\mathbf {D}} +{\mathbf {L}} )^{-1}\). If the inverse is computed directly, the complexity reaches \(O(\mathrm {M} ^3)\), so a loop nesting method is proposed to reduce it, as described below:
The first approach: from Eq. (30), each row requires \(\mathrm {M}-1\) multiplications and one division; with \(\mathrm {M}\) rows in total, the complexity of one iteration is \(O(\mathrm {M^2})\). However, this method is not applicable to BGS-Net.
The second approach: invert the lower triangular matrix in parallel; the structure is shown in Figure 8. For the inverse of a block lower triangular matrix we have the following property:
$$\begin{aligned} \begin{aligned} {\mathbf {Y}} =\begin{bmatrix} {\mathbf {B}} &{} {\mathbf {0}} \\ {\mathbf {C}} &{}{\mathbf {F}} \end{bmatrix}\rightarrow {\mathbf {Y}}^{-1} =\begin{bmatrix} {\mathbf {B}} ^{-1}&{} {\mathbf {0}} \\ -{\mathbf {F}} ^{-1}{\mathbf {C}} {\mathbf {B}} ^{-1} &{}{\mathbf {F}} ^{-1} \end{bmatrix} \end{aligned} \end{aligned}$$
(49)
where \({\mathbf {B}}\), \({\mathbf {C}}\), and \({\mathbf {F}}\) have the same size. The main complexity of (37) lies in solving \(({\mathbf {D}} +{\mathbf {L}} )^{-1}\), which is a lower triangular matrix, so the above property can be used for its inverse. From (8), (10), and (14), when the system is SAUE or MAUE (with \({\mathbf {T}}\) only), the property in (50) holds; when the system is MAUE (with both \({\mathbf {T}}\) and \({\mathbf {R}}\)), (50) does not hold:
$$\begin{aligned} \begin{aligned} \begin{bmatrix} a_{1,1} &{} 0 &{}\dots &{} 0 &{}0 \\ a_{2,1}&{} a_{2,2}&{} \dots &{} 0&{}0 \\ \vdots &{} \dots &{} \ddots &{} \dots &{} \vdots \\ a_{\frac{M}{2}-1 ,1} &{} a_{\frac{M}{2}-1 ,2} &{} \dots &{} a_{\frac{M}{2}-1 ,\frac{M}{2}-1} &{} 0\\ a_{\frac{M}{2} ,1} &{} a_{\frac{M}{2} ,2} &{} \dots &{} a_{\frac{M}{2} ,\frac{M}{2}-1} &{}a_{\frac{M}{2} ,\frac{M}{2}} \end{bmatrix}\\=\begin{bmatrix} a_{\frac{M}{2}+1 ,\frac{M}{2}+1}&{} 0 &{}\dots &{} 0 &{}0 \\ a_{\frac{M}{2}+2 ,\frac{M}{2}+1}&{} a_{\frac{M}{2}+2 ,\frac{M}{2}+2}&{} \dots &{} 0&{}0 \\ \vdots &{} \dots &{} \ddots &{} \dots &{} \vdots \\ a_{M-1,\frac{M}{2}+1} &{} a_{M-1,\frac{M}{2}+2} &{} \dots &{} a_{M-1 ,M-1} &{} 0\\ a_{M,\frac{M}{2}+1} &{} a_{M,\frac{M}{2}+2} &{} \dots &{} a_{M,M-1} &{}a_{M,M} \end{bmatrix} \end{aligned} \end{aligned}$$
(50)
We can see from Figure 8 that the specific steps of the loop nesting method are as follows.
Step 1: Compute the reciprocals of \(a_{1,1},a_{2,2},a_{3,3},\dots ,a_{\frac{\mathrm {M}}{2} ,\frac{\mathrm {M}}{2}}\) and assign them to \({\mathbf {B}}_{1,t}^{-1}\) and \({\mathbf {F}}_{1,t}^{-1}\), \(t\in (1,\frac{\mathrm {M}}{4} )\), respectively.
Step 2: Substitute the resulting \({\mathbf {B}}_{i,t}^{-1}\) and \({\mathbf {F}}_{i,t}^{-1}\), together with the corresponding \({\mathbf {C}}_{i,t}\), \(t\in (1,\frac{\mathrm {M}}{2^{i+1}} )\), into (49) to obtain \({\mathbf {B}}_{i+1,t} ^{-1}\) and \({\mathbf {F}}_{i+1,t}^{-1}\), \(t\in (1,\frac{\mathrm {M}}{2^{i+2}} )\). If \({\mathbf {B}}_{i+1,t}^{-1}\) is an \(\mathrm {\frac{M}{2} \times \frac{M}{2}}\) matrix, proceed to the next step; otherwise, repeat Step 2.
Step 3: In the cases of Section 2.3 (a) and (b), assign \({\mathbf {F}}^{-1}={\mathbf {B}}^{-1}={\mathbf {B}}_{i+1,t}^{-1}\); otherwise, solve \({\mathbf {F}}^{-1}\) by the same method used for \({\mathbf {B}}^{-1}\). In this way, we obtain \(({\mathbf {D}} +{\mathbf {L}} )^{-1}=\begin{bmatrix} {\mathbf {B}}^{-1} &{}{\mathbf {0}} \\ -{\mathbf {F}} ^{-1}{\mathbf {C}} {\mathbf {B}} ^{-1} &{}{\mathbf {F}} ^{-1} \end{bmatrix}\).
Note that we do not need to explicitly form \(({\mathbf {D}} +{\mathbf {L}} )^{-1}\); the linear detection term can be computed directly with the following formula. (Because \({\mathbf {P}}_1 \in \mathrm {R}^{\mathrm {P}\times \mathrm {Q}}\), \({\mathbf {P}}_2 \in \mathrm {R} ^{\mathrm {Q}\times \mathrm {K}}\), and \({\mathbf {b}} \in \mathrm {R} ^{\mathrm {K}\times 1}\), we have \(({\mathbf {P}} _1{\mathbf {P}} _2 ){\mathbf {b}} ={\mathbf {P}}_1 ({\mathbf {P}} _2{\mathbf {b}} )\).) Let \({\mathbf {b}} ={\hat{\mathbf {x}}}_{MF} -{\mathbf {U}}{\hat{\mathbf {x}}}_t =\begin{bmatrix} {\mathbf {c}}_1 \\ {\mathbf {c}}_2 \end{bmatrix}\), where \({\mathbf {c}}_1 ,{\mathbf {c}}_2 \in \mathrm {R} ^{\mathrm {\frac{M}{2}} \times 1 }\). Then
$$\begin{aligned} \begin{aligned} ({\mathbf {D}} +{\mathbf {L}} )^{-1}[{\hat{\mathbf {x}}}_{MF} -{\mathbf {U}}{\hat{\mathbf {x}}}_t ]&=\begin{bmatrix} {\mathbf {B}} ^{-1} &{} {\mathbf {0}} \\ -{\mathbf {F}} ^{-1}\mathbf {CB} ^{-1} &{}{\mathbf {F}} ^{-1} \end{bmatrix}\begin{bmatrix} {\mathbf {c}}_1 \\ {\mathbf {c}}_2 \end{bmatrix}\\ {}&=\begin{bmatrix} {\mathbf {B}} ^{-1}{\mathbf {c}}_1 \\ -{\mathbf {F}} ^{-1}\mathbf {CB} ^{-1}{\mathbf {c}}_1 +{\mathbf {F}} ^{-1}{\mathbf {c}}_2 \end{bmatrix} \end{aligned} \end{aligned}$$
(51)
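A minimal sketch of applying (49) and (51) is given below, assuming one level of the block partition (i.e. \({\mathbf {B}}\), \({\mathbf {C}}\), \({\mathbf {F}}\) of size \(\frac{M}{2}\times \frac{M}{2}\)); triangular solves stand in for the explicit block inverses, and the function name is illustrative.

```python
import numpy as np

def block_lower_solve(DL, b):
    """Apply (D+L)^{-1} to b via the block form of Eqs. (49) and (51).

    DL is lower triangular with even size M; this is a one-level sketch
    of the loop nesting method, using triangular solves instead of
    explicit block inverses.
    """
    h = DL.shape[0] // 2
    B, C, F = DL[:h, :h], DL[h:, :h], DL[h:, h:]
    c1, c2 = b[:h], b[h:]
    top = np.linalg.solve(B, c1)                 # B^{-1} c_1
    bottom = np.linalg.solve(F, c2 - C @ top)    # -F^{-1} C B^{-1} c_1 + F^{-1} c_2
    return np.concatenate([top, bottom])
```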
4.4 Error analysis
In this section, we study why BGS-Net improves performance. The error \({\hat{\mathbf {x}}}_t -{\mathbf {x}}\) is analysed as follows:
Define the output error of the linear stage at iteration \(t\) as \({\mathbf {e}}_t^{lin} ={\hat{\mathbf {z}}}_t -{\mathbf {x}}\) and the output error of the denoising stage at the previous iteration \(t-1\) as \({\mathbf {e}}_{t-1}^{den} ={\hat{\mathbf {x}}} _t -{\mathbf {x}}\). We can rewrite the update equation of Algorithm 1 in terms of these two errors as:
$$\begin{aligned} \begin{aligned} {\mathbf {e}}_t^{lin}&=({\mathbf {D}} +{\mathbf {L}} )^{-1}{\mathbf {H}} ^\mathrm {T} {\mathbf {y}} -({\mathbf {D}} +{\mathbf {L}} )^{-1}{\mathbf {U}}{\hat{\mathbf {x}}}_t -{\mathbf {x}} \\ {}&=({\mathbf {D}} +{\mathbf {L}} )^{-1}{\mathbf {H}}^\mathrm {T} (\mathbf {Hx} +{\mathbf {n}} )-({\mathbf {D}} +{\mathbf {L}} )^{-1} {\mathbf {U}}{\hat{\mathbf {x}}}_t -{\mathbf {x}} \\ {}&=({\mathbf {D}} +{\mathbf {L}} )^{-1}{\mathbf {H}}^\mathrm {T} (\mathbf {Hx} +{\mathbf {n}} )-({\mathbf {D}} +{\mathbf {L}} )^{-1}({\mathbf {A}} -{\mathbf {D}} -{\mathbf {L}} ){\hat{\mathbf {x}}}_t -{\mathbf {x}}\\ {}&={\mathbf {e}}_{t-1}^{den} +({\mathbf {D}} +{\mathbf {L}} )^{-1}{\mathbf {H}}^\mathrm {T} (\mathbf {Hx} +{\mathbf {n}} )-({\mathbf {D}} +{\mathbf {L}} )^{-1}({\mathbf {H}} ^\mathrm {T}{\mathbf {H}} +\frac{\sigma ^2}{2}{\mathbf {I}} ){\hat{\mathbf {x}}}_t\\ {}&=({\mathbf {I}} -({\mathbf {D}} +{\mathbf {L}} )^{-1}{\mathbf {H}} ^\mathrm {T} {\mathbf {H}} ){\mathbf {e}}_{t-1}^{den} +({\mathbf {D}} +{\mathbf {L}} )^{-1}({\mathbf {H}} ^\mathrm {T} {\mathbf {n}} -\frac{\sigma ^2}{2}{\hat{\mathbf {x}} }_t )\\ {}&=({\mathbf {I}} -({\mathbf {D}} +{\mathbf {L}} )^{-1}({\mathbf {H}} ^\mathrm {T} {\mathbf {H}} +\frac{\sigma ^2}{2} {\mathbf {I}} )){\mathbf {e}}_{t-1}^{den} +({\mathbf {D}} +{\mathbf {L}} )^{-1}({\mathbf {H}} ^\mathrm {T} {\mathbf {n}} -\frac{\sigma ^2}{2} {\mathbf {x}} ) \end{aligned} \end{aligned}$$
(52)
and
$$\begin{aligned} \begin{aligned} {\mathbf {e}}_t^{den} =E\left\{ {\mathbf {x}} |{\hat{\mathbf {z}} }_t ,\mathbf {\tau }_t \right\} -{\mathbf {x}} \end{aligned} \end{aligned}$$
(53)
From Figure 2, we know that under channel-hardening conditions the first term of (52), \(({\mathbf {I}} -({\mathbf {D}} +{\mathbf {L}} )^{-1}({\mathbf {H}} ^\mathrm {T} {\mathbf {H}} +\frac{\sigma ^2}{2}{\mathbf {I}} ))\), tends to \({\mathbf {0}}\). The second term splits into the effect of \({\mathbf {n}}\) and the effect of \({\mathbf {x}}\). For the noise, \({\mathbf {n}} =\sqrt{\sigma ^2/2}\cdot N(0,1)\), so when the signal-to-noise ratio (SNR) is low, both \({\mathbf {H}} ^\mathrm {T} {\mathbf {n}}\) and \(\frac{\sigma ^2}{2} {\mathbf {x}}\) become larger; when the SNR is high, both become smaller. Moreover, \(({\mathbf {D}} +{\mathbf {L}} )^{-1}({\mathbf {H}} ^\mathrm {T} {\mathbf {n}} -\frac{\sigma ^2}{2}{\mathbf {x}} )\) can be rewritten as \(({\mathbf {D}} +{\mathbf {L}} )^{-1}{\mathbf {H}} ^\mathrm {T} {\mathbf {y}} -({\mathbf {D}} +{\mathbf {L}} )^{-1}({\mathbf {H}} ^\mathrm {T} \mathbf {Hx} +\frac{\sigma ^2}{2}{\mathbf {x}} )=({\mathbf {D}} +{\mathbf {L}} )^{-1}{\mathbf {H}} ^\mathrm {T} {\mathbf {y}} -({\mathbf {D}} +{\mathbf {L}} )^{-1}({\mathbf {H}} ^\mathrm {T} {\mathbf {H}} +\frac{\sigma ^2}{2}{\mathbf {I}} ){\mathbf {x}}\), which under good channel hardening is approximately \({\mathbf {x}}_{MMSE} -{\mathbf {x}}\); as far as we know, the gap between \({\mathbf {x}}_{MMSE}\) and \({\mathbf {x}}\) decreases as the channel hardens more. Under channel-hardening conditions the error \({\mathbf {e}}_{t-1}^{den}\) from the previous stage, suppressed by \(({\mathbf {I}} -({\mathbf {D}} +{\mathbf {L}} )^{-1}({\mathbf {H}} ^\mathrm {T} {\mathbf {H}} +\frac{\sigma ^2}{2} {\mathbf {I}} ))\), is significantly attenuated. These observations explain why BGS-Net performs well on i.i.d. Gaussian channels.

BGS-Net is also better on correlated channels than MMNet-iid, whose contraction term is \({\mathbf {I}} -\theta _t^{(1)}{\mathbf {H}} ^\mathrm {T} {\mathbf {H}}\): channel hardening disappears when the channel is correlated, and \({\mathbf {I}} -\theta _t^{(1)}{\mathbf {H}} ^\mathrm {T} {\mathbf {H}}\) has no way of converging to \({\mathbf {0}}\) as the number of antennas increases. In contrast, for \({\mathbf {I}} -({\mathbf {D}} +{\mathbf {L}} )^{-1}({\mathbf {H}} ^\mathrm {T} {\mathbf {H}} +\frac{\sigma ^2}{2} {\mathbf {I}} )\), since \({\mathbf {A}}\) is symmetric, \(\mathbf {D+L}\) itself contains all the information in \({\mathbf {A}}\); as the number of antennas increases, this term can be approximated as \({\mathbf {I}} -{\mathbf {A}} ^{-1}{\mathbf {A}}\), tending toward \({\mathbf {0}}\) even though it does not exactly reach \({\mathbf {0}}\).
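The channel-hardening argument can be illustrated numerically: for an i.i.d. Gaussian channel (here column-normalised by \(\sqrt{N}\) purely for illustration), the spectral norm of \({\mathbf {I}} -({\mathbf {D}} +{\mathbf {L}} )^{-1}({\mathbf {H}} ^\mathrm {T} {\mathbf {H}} +\frac{\sigma ^2}{2} {\mathbf {I}} )\) shrinks as the number of receive antennas grows; the dimensions and noise level below are arbitrary choices.

```python
import numpy as np

# Illustration of channel hardening: as the number of receive antennas N
# grows, the contraction matrix I - (D+L)^{-1}(H^T H + sigma^2/2 I) of
# Eq. (52) shrinks in spectral norm. The normalisation by sqrt(N), the
# dimensions and the noise level are arbitrary illustrative choices.
rng = np.random.default_rng(1)
M, sigma2 = 16, 0.1
for N in (32, 128, 512):
    H = rng.standard_normal((N, M)) / np.sqrt(N)
    A = H.T @ H + sigma2 / 2 * np.eye(M)
    DL = np.tril(A)                              # D + L
    E = np.eye(M) - np.linalg.solve(DL, A)       # I - (D+L)^{-1} A
    print(N, np.linalg.norm(E, 2))
```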
For the effect of the nonlinear activation function, \(E\left\{ {\mathbf {x}} |{\hat{\mathbf {z}}}_t ,\mathbf {\tau }_t \right\} -{\mathbf {x}}\) in (53) reduces the error \({\hat{\mathbf {x}}}_{t+1} -{\mathbf {x}}\). The argument is as follows:
$$\begin{aligned} \begin{aligned} E\left\{ x_{ti} |{\hat{z}}_{ti} ,\tau _{ti} \right\} -x_{ti} = {\textstyle \sum _{s_j\in S}s_j\times softmax(\frac{-\left\| {\hat{z}}_{ti}-s_j \right\| ^2 }{\tau _{ti}^2} )-x_{ti}} \end{aligned} \end{aligned}$$
(54)
Assuming the true value \(x_{ti}\) is \(s_1\), the above expression equals
$$\begin{aligned} \begin{aligned} {\textstyle \sum _{s_j\in S}s_j\times p(s_j| {\hat{z}}_{ti},\tau _{ti} )-s_1} \end{aligned} \end{aligned}$$
(55)
The softmax soft decision uses an exponential, which amplifies the larger probabilities and suppresses the smaller ones while keeping the total probability equal to 1. As the probability assigned to \(s_1\) increases, the first term of (55) gets closer and closer to \(s_1\), so using this activation function further reduces the error.
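A toy example of the soft decision in (48) makes this concrete: with an assumed binary set \(S=\{-1,+1\}\), true symbol \(s_1=+1\), and estimate \({\hat{z}}_{ti}=0.6\), shrinking \(\tau _{ti}^2\) concentrates the softmax on \(+1\) and pulls the posterior mean toward the true symbol; all numbers are illustrative.

```python
import numpy as np

# Toy illustration of Eq. (48): with an assumed set S = {-1, +1}, true
# symbol s_1 = +1 and estimate z_hat = 0.6, a smaller tau^2 concentrates
# the softmax on +1 and pulls the posterior mean toward the true symbol.
S = np.array([-1.0, 1.0])
z_hat = 0.6
for tau2 in (1.0, 0.25, 0.05):
    w = np.exp(-(z_hat - S) ** 2 / tau2)
    w /= w.sum()
    print(tau2, (w * S).sum())   # posterior mean approaches +1
```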