Conditionally optimal classification based on CFAR and invariance property for blind receivers

This paper proposes a new approach for finding the conditionally optimal solution (the classifier with minimum error probability within a restricted set of solutions) for the classification problem in which the observations are drawn from the multivariate normal distribution. For this problem, the optimal Bayes classifier does not exist when the covariance matrix is unknown. This paper therefore proposes a classifier based on the constant false alarm rate (CFAR) and invariance properties. The proposed classifier is conditionally optimal in the sense that it has the minimum error probability within a subset of solutions. This approach is analogous to hypothesis testing problems in which uniformly most powerful invariant (UMPI) and uniformly most powerful unbiased (UMPU) detectors are used in place of the non-existent optimal UMP detector. Furthermore, this paper investigates the use of the proposed classifier for modulation classification as a signal processing application.

Automatic modulation classification (AMC) is the main case study of this paper. Different features have been proposed in the literature, like instantaneous amplitude, phase, and frequency [20], fourth-order cumulants [21], constellation shape [22], cyclostationarity [23], and wavelet coefficients [24]. For the second step, i.e., non-parametric classification, different methods have been proposed in the literature, like the artificial neural network (ANN) [25] and the support vector machine (SVM) [26]. These approaches are not optimal, as the mathematical system model of the problem is not taken into account. However, they are computationally simpler and more robust to model parameter mismatches.
On the other hand, in the parametric approach, the exact mathematical system model is exploited in order to derive optimal classifiers; this approach is therefore the focus of this paper. The optimal solution for the classification problem is the Bayes classifier. In many practical signal processing applications, the Bayes classifier does not exist due to unknown parameters in the hypotheses under test. Therefore, some (suboptimal) alternatives are employed in the literature, like the averaged likelihood ratio (ALR) [18, 27-30], generalized likelihood ratio (GLR) [18, 31-33], hybrid likelihood ratio (HLR) [18, 31, 34-36], and quasi-HLR (QHLR) [37]. These techniques all share the fact that they are based on the likelihood functions of the different classes being considered; however, the way they treat unknown parameters differs. In the ALRT, the unknown parameters are considered random variables (RVs); therefore, the likelihood function of each class is calculated by integrating over the unknown parameters. The ALR is optimal in the sense that, provided the assumed pdf of the parameters is valid, it minimizes the error probability; however, this assumption cannot be guaranteed. Moreover, the ALR approach is computationally very complex. In the GLRT, by contrast, the unknown parameters are considered unknown but deterministic, and the maximum likelihood (ML) estimates of the unknown parameters are used instead. The GLRT is suboptimal; however, it can be proved that it is asymptotically optimal [17]. Furthermore, the GLR classifier cannot be used for nested problems [18]. The HLRT, however, removes the difficulty of the GLRT for nested constellations by averaging over the data symbols. The QHLRT is an HLRT in which the unknown parameters are estimated using low-complexity techniques.
In this paper, our focus is on finding a conditionally optimal solution for a specific classification problem in which the observations come from the multivariate normal distribution. This problem and its special cases have been used extensively in blind receiver applications, like AMC in frequency selective fading channels [1], AMC for the Alamouti space-time block code (STBC) scheme [2], blind identification of Alamouti versus spatial multiplexing (SM) [5], and so on. In the aforementioned references, first some sort of feature is extracted, and then suboptimal parametric classification methods are applied. In this paper, however, we propose a conditionally optimal classifier for such a problem. By conditionally optimal classification, we mean that the proposed classifier is optimal within a subgroup of solutions whose members all share a constant error floor in the high signal-to-noise ratio (SNR) regime. Interestingly, the adopted procedure leads to solving a corresponding hypothesis testing problem (a problem under the Neyman-Pearson (NP) criterion) instead of solving the original classification problem (a problem under the minimum error probability criterion). We have proved that the optimal uniformly most powerful (UMP) detector for the corresponding hypothesis testing problem is an optimal uniformly minimum error probability (UMEP) classifier within the group of classifiers (which have an error floor in the high SNR regime) for the original classification problem. This technique, i.e., restricting the domain of solutions of an arbitrary classification problem, which is one of the main novelties of this paper, is analogous to the approach adopted in hypothesis testing problems when the optimal UMP detector does not exist. In these cases, the hypothesis testing literature employs the alternative UMP invariant (UMPI) and UMP unbiased (UMPU) detectors instead [39-43].
UMPI and UMPU detectors are optimal in the sense that they satisfy the NP criterion within a subgroup of detectors, i.e., all invariant and all unbiased detectors, respectively. Similarly, our proposed classifier is optimal in the sense that it has the minimum error probability (uniformly over different values of the unknown parameters) in a "subgroup" of classifiers that all have a predefined error floor value in the high SNR regime. The novelties of this paper can be stated as follows:
- A new criterion for finding optimal/suboptimal classifiers is proposed, based on which the well-studied optimal/suboptimal solutions for hypothesis testing problems can now be applied to blind receiver applications. This approach is deeply analogous to the approach taken in finding UMPI and UMPU detectors for hypothesis testing problems for which the UMP detector does not exist.
- The proposed classifier is CFAR and invariant with respect to the group of transformations under which the studied classification problem remains invariant.
- The effect of the error floor value on the overall classifier performance is investigated. Interestingly, it is proved that under some mild conditions, a higher error floor in the high SNR regime leads to better performance in the low SNR regime. Therefore, selecting the error floor value is a trade-off between high and low SNR regime performance.
- The proposed approach is applied to binary classification problems; however, an example of a multiclass problem is also included.
This paper is organized as follows: Sec. 3 introduces the system model. In Sec. 4, the conditionally optimal classifier is investigated. The effect of error floor on the overall performance of the classifier is elaborated in Sec. 5. The simulation results for the modulation classification application are presented in Sec. 6. Section 7 concludes the paper.

Notations
Throughout this paper, bold-face upper case letters (e.g., X) denote matrices, bold-face lower case letters (e.g., x) represent vectors, light-face upper case letters (e.g., G_m) denote sets or groups, and light-face lower case letters (e.g., x and g_m) represent scalars or transformations (deterministic or random). The pdf of a random vector x is denoted by f(x; ρ), in which ρ denotes its unknown parameter. A statistic (whether a classifier or a detector) is represented by T(·). The error probability, the missed detection probability, the detection probability, and the false alarm probability at a given point ρ in the parameter space are denoted by P_e(ρ), P_MD(ρ), P_D(ρ), and P_FA(ρ), respectively. The normal distribution is denoted by N(μ, σ²), in which μ and σ² represent its mean and variance, respectively. Furthermore, the L_2 norm of (·) is represented by ||(·)||, and |·| represents the determinant operator.

System model
Consider the multivariate complex normal distribution

x_i ~ CN(μ, Σ), i = 1, · · · , M, (1)

where X = [x_1, · · · , x_M], in which the x_i ∈ C^{N_s} are i.i.d. observations for the following binary classification problem:

C_0 : μ = 0 vs. C_1 : μ ≠ 0. (2)

Here μ is the mean vector, and the covariance matrix Σ = LL^H (L is lower triangular) is assumed to be unknown. Moreover, we define ρ ≜ L^{-1}μ ∈ Θ ⊂ C^N, in which Θ represents the parameter space of the problem, and g_m ∈ G_m ≜ {g_m | g_m(X) = KX, K ∈ L}, where L is the set of all N × N positive definite lower triangular matrices; therefore, K ranges over all lower triangular N × N transformations of the observation matrix. In this problem, we want to find the classifier which has the minimum error probability, i.e., P_e = P(C_0)P(C_1|C_0) + P(C_1)P(C_0|C_1), in which P(C_i) represents the prior probability of the ith class and P(C_i|C_j) denotes the probability that the decision of the classifier is i when the true class is j.
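As a concrete illustration, the data model above can be simulated numerically. The sketch below (with illustrative values of μ and Σ that are not from the paper) draws i.i.d. observations x_i ~ CN(μ, Σ) with Σ = LL^H, forms the induced parameter ρ = L^{-1}μ, and checks that the sample moments recover (μ, Σ):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 3, 100_000

# Illustrative parameters (not from the paper): mean mu and Sigma = L L^H
mu = np.array([1.0 + 1.0j, -0.5j, 0.25 + 0j])
L = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.5, 0.0],
              [-0.2 + 0.3j, 0.1, 0.8]])
Sigma = L @ L.conj().T

# Draw M i.i.d. observations x_i ~ CN(mu, Sigma) as the columns of X
Z = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
X = mu[:, None] + L @ Z

rho = np.linalg.solve(L, mu)  # induced parameter rho = L^{-1} mu

# Sanity check: sample mean and sample covariance recover (mu, Sigma)
mu_hat = X.mean(axis=1)
Xc = X - mu_hat[:, None]
S = Xc @ Xc.conj().T / M
print(np.linalg.norm(mu_hat - mu), np.linalg.norm(S - Sigma))
```

With 10^5 samples, both residuals are small, confirming that the columns of X follow the stated model.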
For the sake of completeness and clarity in the rest of the paper, we define the following hypothesis testing problem and call it the "corresponding" hypothesis testing problem for the classification problem under study in (2):

H_0 : μ = 0 vs. H_1 : μ ≠ 0, (3)

where the observation vectors are distributed according to the multivariate normal distribution in (1). By a hypothesis testing problem, we mean that the optimality criterion is the NP criterion.

Proposed method for deriving a conditionally optimal classifier
In this section, a new type of conditionally optimal classifier for (2) is presented. It is called "conditionally optimal" because it is optimal within a subgroup of classifiers having a specific property. We name this subgroup C_α; it is fully elaborated in the following sections.

Definition 1
We call the set of all unbiased invariant 2 classifiers for (2) the C_α group if their error floor is α/2, i.e., P_e(||ρ|| → ∞) = α/2. All the classifiers in the C_α group have two properties in common. First, they are all invariant with respect to the group of transformations G_m. Second, they all have an error floor even as ||ρ|| → ∞; for example, this means that as long as Σ is kept constant and ||μ|| → ∞, the error probability retains a floor value. Although this forces some loss of information, as in the corresponding hypothesis testing problems, it makes finding the optimal solution possible. On the other hand, as will be proved in the following, the classifiers in C_α have the CFAR property. Furthermore, as will be shown in the subsequent sections, the group is rich enough to include well-known tests like the GLRT and the generalized Wald test (GWT) [44].
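The invariance property in Definition 1 can be checked numerically. The sketch below uses the candidate statistic T(X) = ||L̂^{-1}μ̂||, built from the sample mean and the Cholesky factor of the sample covariance (our own choice for illustration). Applying any lower triangular K with positive diagonal to the observations leaves T unchanged, because the Cholesky factor of KΣ̂K^H is exactly KL̂:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 50

# Hypothetical ground truth (illustrative values, not from the paper)
mu = rng.standard_normal(N) + 1j * rng.standard_normal(N)
A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Sigma = A @ A.conj().T + N * np.eye(N)          # Hermitian positive definite

# Draw M i.i.d. complex normal observations as the columns of X
L = np.linalg.cholesky(Sigma)
Z = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
X = mu[:, None] + L @ Z

def T(X):
    """Candidate invariant statistic T(X) = ||L_hat^{-1} mu_hat||."""
    mu_hat = X.mean(axis=1)
    Xc = X - mu_hat[:, None]
    S = Xc @ Xc.conj().T / X.shape[1]           # sample covariance
    L_hat = np.linalg.cholesky(S)
    return np.linalg.norm(np.linalg.solve(L_hat, mu_hat))

# Lower triangular K with positive diagonal: an element of the group G_m
K = np.tril(rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N)))
np.fill_diagonal(K, np.abs(np.diag(K).real) + 1.0)

print(T(X), T(K @ X))  # the two values coincide up to floating-point error
```

The equality holds deterministically (not just on average), since μ̂(KX) = Kμ̂ and chol(KΣ̂K^H) = KL̂ cancel inside the norm.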

Theorem 1
The optimal UMEP classifier for (2) in the C_α group is the UMP(U) detector for (3) (the corresponding hypothesis testing problem for (2)) with its P_FA set to α. Furthermore, if T_1(·) and T_2(·) are two different detectors for (3) for which ∀P_FA, ∀ρ : P_D1(ρ) > P_D2(ρ), then T_1(·) has uniformly lower error probability for (2) over the parameter space than T_2(·) in every C_α group.
Proof Refer to Appendix A.
Remark 1 Theorem 1 states a new perspective on binary classification for (2). It provides a new optimal classifier based on a new optimality criterion, i.e., the error floor concept and the optimal UMP(U) detector used in hypothesis testing. We will show in the subsequent sections that the UMPU detector for (3) exists in some special cases. Unfortunately, however, the UMP(U) detector does not exist in many other circumstances, or at least is not known to exist. Theorem 1 also states that selecting a better detector for (3) leads to a better classifier for (2) in terms of the new optimality criterion. In these cases, we can choose classic suboptimal detectors like the GLRT, Wald, and Rao tests for (3). However, deriving these suboptimal detectors, which are based on the pdf of (3), may not be straightforward, as the transformation group is applied to the main problem. Recently, a new family of detectors named the Separating Function Estimation Test (SFET) was proposed in [45]. 3 Based on this scheme, we can find an appropriate suboptimal detector (approximately optimal) directly from the FIM, without needing a closed-form expression for the pdf; this detector is named the GWT [44]. The systematic approach for deriving the GWT, i.e., deriving the FIM of (3) and solving the differential equation that yields the separating function, is conducted in the following section. 2 We call T(·) an invariant classifier with respect to g_m if T(g_m(·)) = T(·).

Deriving GWT classifier
1- Finding the FIM: To derive the GWT classifier, we should find the FIM of the corresponding hypothesis testing problem, i.e., (3).

Proposition 1
The FIM of (3) equals the identity matrix, I.

Proof
The FIM of (2) (before applying the group of transformations) in terms of θ = [μ^T, vec(Σ)^T]^T can be written as in [46]. If λ = g(θ), then the FIM with respect to θ can be written as I_θ = J^T I_λ J, in which J = ∂g/∂θ is the Jacobian matrix [46]. We introduce the auxiliary variable λ ≜ [ρ^T, vec(Σ)^T]^T. By substituting I_λ and the Jacobian J into the relation above, I_ρ = I is concluded.
2- Deriving the SF: According to [44], when the induced FIM of the problem is I, the GWT detector coincides with the well-known Wald detector, and its statistic can be written as

T_GWT(X) = ||ρ̂_ML||, (5)

where ρ̂_ML = L̂^{-1}μ̂ is the ML estimate of ρ; that is, the SF is g(ρ) = ||ρ||. Interestingly, the GWT also coincides with the GLRT derived in [3].
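A minimal Monte Carlo sketch illustrates the behavior of a Wald-type statistic T(X) = √M ||L̂^{-1}μ̂|| (our reconstruction, with an assumed √M scaling): under C_0 its distribution does not depend on Σ (the CFAR property), while under C_1 it concentrates at larger values, so a fixed threshold separates the classes:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 4, 64

def draw(mu, Sigma, rng):
    """Draw M i.i.d. columns x_i ~ CN(mu, Sigma)."""
    L = np.linalg.cholesky(Sigma)
    Z = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
    return mu[:, None] + L @ Z

def t_gwt(X):
    """Wald-type statistic sqrt(M) * ||rho_hat||, rho_hat = L_hat^{-1} mu_hat."""
    mu_hat = X.mean(axis=1)
    Xc = X - mu_hat[:, None]
    L_hat = np.linalg.cholesky(Xc @ Xc.conj().T / X.shape[1])
    return np.sqrt(X.shape[1]) * np.linalg.norm(np.linalg.solve(L_hat, mu_hat))

A = rng.standard_normal((N, N))
Sigma_a = A @ A.T + N * np.eye(N)   # one covariance under C0
Sigma_b = 4.0 * np.eye(N)           # a very different covariance under C0
mu_1 = np.ones(N, dtype=complex)    # nonzero mean under C1 (||rho|| = 1)

t0a = [t_gwt(draw(np.zeros(N, dtype=complex), Sigma_a, rng)) for _ in range(500)]
t0b = [t_gwt(draw(np.zeros(N, dtype=complex), Sigma_b, rng)) for _ in range(500)]
t1 = [t_gwt(draw(mu_1, Sigma_b, rng)) for _ in range(500)]

# The first two means agree (null distribution free of Sigma); the third is much larger.
print(np.mean(t0a), np.mean(t0b), np.mean(t1))
```

Because the null distribution of the statistic is invariant to Σ, a threshold chosen for one covariance gives the same false alarm rate for any other.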

Proposition 2
The GWT-based classifier for (2) is UMEP in the C_α group for the special case of Σ = σ²I.
Proof It can be proved that the GLR detector for (2) in the case of Σ = σ²I is the UMPU detector [47]. Since the GWT detector coincides with the GLR detector (refer to (5)), according to Theorem 1 it is a UMEP classifier in the corresponding C_α group. Based on the preceding reasoning, the GWT classifier for (3) is the optimal classifier in the C_α group at least for Σ = σ²I. On the other hand, when Σ has a different structure, the GWT can still be used as an asymptotically optimal classifier in the C_α group.

Error floor value evaluation
In this section, we evaluate the effect of the error floor on classifier performance; in other words, we discuss which C_α group should be chosen. It will be shown that under some circumstances the C_α group is just a design parameter, and the selection of α is a trade-off between high-SNR and low-SNR performance.
Example 1 Assume a special case of (2) in which x obeys the normal distribution, i.e., x ~ N(ρ, 1). Furthermore, the N-tuple observation vector x = [x_0, x_1, · · · , x_{N-1}]^T is available. Consider the following binary classification problem:

C_0 : ρ = 0 vs. C_1 : ρ > 0, (6)

in which ρ is unknown. For this simple problem, it can easily be seen that the corresponding hypothesis testing problem noted in Theorem 1 is the problem itself, i.e.,

H_0 : ρ = 0 vs. H_1 : ρ > 0. (7)

It is well known that the UMP detector for (7) is the one-sided sample mean test [17]. Therefore, based on Theorem 1, using T_UMP as a classifier for (6) is the UMEP classifier in the C_α group. The error probability of the UMP-based classifier for (2) vs. ρ (which is a measure of SNR) for different error floor values (different C_α groups) is depicted in Fig. 1. As illustrated, increasing P_FA results in a higher error floor in the large-SNR regime, but it also results in a lower error probability at low SNR values. We evaluate this property in detail below; specifically, we propose some sufficient conditions under which it holds.
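The trade-off in Example 1 can be reproduced in closed form. Assuming, as in the example, that the decision statistic is distributed as N(ρ, 1) and the priors are equal, the threshold test with false alarm rate α has P_e(ρ) = (α + Φ(Q^{-1}(α) − ρ))/2, which saturates at α/2 for large ρ:

```python
from scipy.stats import norm

def p_e(rho, alpha):
    """Error probability of the one-sided threshold test on T ~ N(rho, 1).

    Equal priors assumed: P_e = (P_FA + P_MD) / 2, with threshold Q^{-1}(alpha)."""
    gamma = norm.ppf(1.0 - alpha)      # threshold giving P_FA = alpha
    p_md = norm.cdf(gamma - rho)       # P(T < gamma | C1 with parameter rho)
    return 0.5 * (alpha + p_md)

for rho in (0.5, 2.0, 6.0):
    print(rho, p_e(rho, 0.01), p_e(rho, 0.10))
# The larger alpha (higher error floor alpha/2) wins at low SNR,
# while the smaller alpha wins at high SNR, as in Fig. 1.
```

Evaluating the two curves at a few points shows the crossover: α = 0.10 is better at ρ = 0.5, while α = 0.01 is better at ρ = 6.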
Proof Refer to Appendix B.
Remark 2 This lemma establishes some sufficient conditions guaranteeing that increasing the error floor value leads to lower error probability for some SNR values. In the following, we exploit the conditions stated in Lemma 1 to establish sufficient conditions for more general signal processing examples, i.e., when the distribution of T(·) is an element of the exponential family.

Theorem 2 Assume that in
Proof Refer to Appendix 7.
Example 2 Consider Example 1 again. It can easily be seen that T_UMP(x) is distributed as N(ρ, 1). The normal distribution is an element of the exponential family with natural parameter η(ρ) = ρ and log-partition function B(ρ) = ρ²/2. For this example, B(ρ) > B(ρ_0) leads to ρ > 0. On the other hand, applying (9) leads to

2η > ρ. (11)

For every η > 0, any ρ ∈ (0, η) satisfies (11). Therefore, for this example, the property holds, as illustrated in Fig. 1.

Corollary 1 Suppose g(ρ) is an SF for (3). Then, in the asymptotic regime, i.e., when the number of samples for ĝ(ρ)_ML approaches infinity, for any arbitrary P_FA2 < P_FA1 there exists a ρ* such that for any ρ ∈ (0, ρ*) we have P_e1(ρ) < P_e2(ρ), assuming g(ρ_0) = 0.
Proof In the asymptotic regime, the distribution of ĝ(ρ)_ML can be modeled by N(g(ρ), G I_ρ^{-1} G^T), where G = ∂g/∂ρ and I_ρ is the FIM of f(x; ρ) [46]. Any normal distribution is an element of the exponential family. Applying the conditions of Theorem 2 yields

2η > g(ρ). (13)

Because g(·) is continuous, for every P_FA we can find a ρ which satisfies (13).
Remark 3 Corollary 1 states that in the asymptotic regime, increasing the error floor results in a higher correct classification probability at lower SNR values for any SFET detector satisfying g(ρ_0) = 0. According to (5), g(ρ_0) = 0, and therefore, asymptotically speaking, increasing the error floor of the GWT-based classifier for (2) results in a lower error probability at some lower SNR values.

Results and discussion
Example 3 Consider the modulation classification problem in a multipath fading channel, y_i = Σ_{l=0}^{L-1} h_l x_{i-l} + n_i, in which h_l is the channel impulse response, the x_i are the transmitted symbols, n_i represents the white Gaussian noise, and L denotes the number of channel paths. Suppose that the classification is to be done from two different dictionaries, D_1 = {BPSK, QPSK} and D_2 = {QPSK, 8PSK}. Based on [1], we can use the features f_1 ≜ E{y_i y_{i+q}} and f_2 ≜ E{y_i² y_{i+q}²} for discriminating between the D_1 and D_2 dictionaries. It is straightforward to see that E{x_i²} is zero for QPSK but not for BPSK; furthermore, E{x_i⁴} is zero for 8PSK and non-zero for QPSK. For estimating f_1 and f_2, we can use the sample mean estimators z_{f_1}(q) and z_{f_2}(q). Modulation classification for the considered channel model based on the f_1 and f_2 features can then be modeled by the following binary classification problem (as the number of observations approaches infinity, according to the central limit theorem) [1]: C_0 : z_{f_i} ~ CN(0, σ²I) vs. C_1 : z_{f_i} ~ CN(β, σ²I), in which β is unknown, modeling the uncertainty due to the fading channel, and σ² is unknown, representing the ambiguity of the noise power and the accuracy of the sample mean estimator. This problem is a special case of (2); therefore, we can use the GWT classifier proposed for this problem. Assuming that N observation vectors y^i = [y^i_0, y^i_1, · · · ]^T are already available at the receiver, the simulation results are depicted in Figs. 2, 3, and 4. The simulation is conducted for different channel conditions and numbers of observation vectors. As the results imply, as the number of channel taps in the frequency selective fading channel increases, the error probability also increases; this is because the parameter estimation error grows as the dimension of the observation vectors increases. On the other hand, increasing the number of observation vectors increases the correct classification probability. In Fig. 3, P_FA is depicted over different SNR values (noise powers). Since a CFAR detector is used for classification, it is expected that changing the noise power leaves P_FA constant; this is verified through Monte Carlo simulation in Fig. 3. Furthermore, the effect of the error floor on the classifier performance is depicted in Fig. 4 for the C_0.01, C_0.05, and C_0.1 groups.
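The discriminating power of the f_1 and f_2 features is easy to verify on noiseless, single-tap data at lag q = 0 (a deliberate simplification of the channel model in Example 3):

```python
import numpy as np

rng = np.random.default_rng(2)
M = 20_000

def psk(order, rng, n):
    """Draw n uniform symbols from an order-PSK constellation on the unit circle."""
    k = rng.integers(order, size=n)
    return np.exp(2j * np.pi * k / order)

bpsk = psk(2, rng, M)
qpsk = psk(4, rng, M) * np.exp(1j * np.pi / 4)   # (±1 ± j)/sqrt(2) constellation
psk8 = psk(8, rng, M)

def f1(y):  # sample estimate of E{y_i y_{i+q}} at lag q = 0 (assumed)
    return np.mean(y * y)

def f2(y):  # sample estimate of E{y_i^2 y_{i+q}^2} at lag q = 0 (assumed)
    return np.mean(y**2 * y**2)

print(abs(f1(bpsk)), abs(f1(qpsk)))   # ~1 vs ~0: f1 separates D1 = {BPSK, QPSK}
print(abs(f2(qpsk)), abs(f2(psk8)))   # ~1 vs ~0: f2 separates D2 = {QPSK, 8PSK}
```

For BPSK, x² = 1 identically, while for QPSK x² takes the values ±j with equal probability and averages to zero; similarly, x⁴ = −1 for QPSK but averages to zero for 8PSK, matching the moment argument in the example.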
As expected, taking a larger error floor improves performance at lower SNR values. On the other hand, the simulation results for the D = {8PSK, QPSK} dictionary are depicted in Figs. 5 and 6. The classifier behaves as in the previous case: increasing the number of channel taps decreases the correct classification probability, and as the number of observation vectors increases, the error probability decreases. The effect of the error floor on the overall classifier performance is depicted in Fig. 6 for C_0.1, C_0.05, and C_0.01. In this case also, increasing the error floor results in better performance in the lower SNR region.
On the other hand, in order to better evaluate our proposed classifier against state-of-the-art solutions, its performance is compared against two convolutional neural network (CNN) classifiers. Adopting CNNs for the modulation classification application has already been proposed in the literature [48-51]. The first selected CNN layout is given in Table 1. This structure is proposed in [50] and is similar to the structure used in [48]. The adopted classifier uses a CNN that consists of six convolution layers and one fully connected layer. Each convolution layer except the last is followed by a batch normalization layer, a rectified linear unit (ReLU) activation layer, and a max-pooling layer. In the last convolution layer, the max-pooling layer is replaced with an average pooling layer.
The output layer has softmax activation. A stochastic gradient descent with momentum (SGDM) solver with a mini-batch size of 256 is used. The maximum number of epochs is set to 12, since a larger number of epochs provides no further training advantage. Furthermore, the initial learning rate is set to 0.02. On the other hand, the second CNN structure, shown in Table 2, is adopted from [51] (very similar but not exactly identical). This CNN consists of two convolution layers and four fully connected layers. Each convolution layer and fully connected layer (except the last one) is followed by a ReLU activation layer and a max-pooling layer. The output layer has softmax activation. An SGDM solver is used, the maximum number of epochs is set to 8, and the initial learning rate is set to 0.01. The simulation results of the trained networks for L = 4, D = {BPSK, QPSK} are depicted in Fig. 7. The samples-per-symbol parameter is set to 1, and the number of samples fed to the CNN classifiers is chosen such that the observation elements available to the compared classifiers (the CNNs and our proposed one) are almost the same. In the high SNR regime (20 dB), the first CNN reaches a correct classification probability of 98.2% for L = 4, while the second CNN has an error floor of around 0.15 in the high SNR regime. The first CNN performs better in the high SNR regime, while the second CNN performs better in the low SNR regime. Our proposed classifier outperforms the first CNN for L = 4, N = 2 (e.g., 99% accuracy at an SNR around 0 dB). In the low SNR regime, the second CNN performs better, but it has a much higher error floor than our proposed classifier. Furthermore, it should be noted that our proposed classifier has the optimal performance in the C_0.05 group.
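For orientation, a network with the layer types and counts described for the first CNN can be sketched as follows. The channel widths, kernel size, and input sequence length are our own assumptions, since the text fixes only the layer structure and the solver settings:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, pool):
    # Conv -> BatchNorm -> ReLU -> pooling, matching the layer types in the text
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm1d(c_out),
        nn.ReLU(),
        pool,
    )

# Hypothetical channel widths; 2 input channels for the I and Q components
widths = [2, 16, 24, 32, 48, 64, 96]
layers = [conv_block(widths[i], widths[i + 1], nn.MaxPool1d(2)) for i in range(5)]
# The sixth (last) convolution block uses average pooling instead of max pooling
layers.append(conv_block(widths[5], widths[6], nn.AdaptiveAvgPool1d(1)))
cnn = nn.Sequential(*layers, nn.Flatten(), nn.Linear(widths[6], 2), nn.Softmax(dim=1))

# SGDM solver with the stated initial learning rate
opt = torch.optim.SGD(cnn.parameters(), lr=0.02, momentum=0.9)

x = torch.randn(8, 2, 128)  # a mini-batch of 8 I/Q sequences of length 128
print(cnn(x).shape)         # each row of the output is a probability vector over 2 classes
```

This is a sketch under stated assumptions, not the exact network of Table 1; it only mirrors the described six-conv-plus-one-FC layout, pooling pattern, softmax output, and SGDM solver.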
Based on the proof of Theorem 1, there may be a classifier outside C_0.05 which performs better at some SNR points. However, it should be noted that selecting the number of CNN layers and the network structure requires optimization, which is definitely out of the scope of this paper; in other words, a better performance may be reached by changing the network structure. On the other hand, the CNN classifiers are much more complex than our proposed classifier. Furthermore, our proposed classifier can control its error floor in order to boost its performance in the low SNR regime by changing the C_α group.

Example 4
Consider Example 3 again. Now suppose that we want to classify within the D = {BPSK, QPSK, 8PSK} dictionary. In this approach, we convert the M-class classification problem into M binary classification problems. This procedure can be described by the decision tree in Fig. 8. At each stage, one of the candidates is tested and the corresponding decision is taken, until the final answer is reached. At each node, one of the detector-based classifiers introduced in the previous sections is used for the binary decision. For classification, the discrimination is first done between D_1 = {BPSK, QPSK} and then between D_2 = {QPSK, 8PSK}. All of the detector parameters, i.e., K, Q, and P_FA, are taken

Conclusion
In this paper, the conditionally optimal solution for the classification problem in which the observation vectors are drawn from the multivariate normal distribution was investigated. It was shown that finding the optimal classifier in the C_α group for such a problem is possible under some circumstances. Furthermore, when the optimal solution in the C_α group does not exist either, it was proved that taking a better detector for the corresponding hypothesis testing problem leads to a better classifier. The GWT classifier was derived and shown to be optimal in the C_α group when the covariance matrix is a scaled identity matrix. The GWT classifier was applied to the AMC problem in the multipath channel. The simulation results verified the analytical findings. Furthermore, the superiority of our approach was evaluated against its alternative solution, i.e., the CNN approach.

Appendix A: Proof of Theorem 1

Proof We first prove that the induced maximal invariant with respect to the considered group of transformations G is ρ, because of the following two properties:
1- ρ(ḡ_m((μ, Σ))) = ρ((L_1μ, L_1ΣL_1^H)) = (L_1L)^{-1}L_1μ = L^{-1}L_1^{-1}L_1μ = L^{-1}μ = ρ((μ, Σ)), ∀L_1 ∈ L.
2- If ρ(θ_1) = ρ(θ_2), then μ_1 = L_1L_2^{-1}μ_2 and L_1L_2^{-1}Σ_2L_2^{-H}L_1^H = Σ_1; therefore, ḡ_m = L_1L_2^{-1} maps θ_2 to θ_1.
To complete the proof, we show that all classifiers in the C_α group are CFAR; that is, P_FA(μ = 0, Σ = Σ_1) = P_FA(μ = 0, Σ = Σ_2) for all Σ_1, Σ_2:
P_FA(μ = 0, Σ_1) = P_{(μ=0,Σ_1)}{T(x) ∈ A} = P_{(μ=0,Σ_1)}{T(g_m(x)) ∈ A} = P_{(μ=0,Σ_1)}{g_m(x) ∈ T^{-1}(A)} = P_{g_m(μ=0,Σ_1)}{x ∈ T^{-1}(A)} = P_{(μ=0,Σ_2)}{T(x) ∈ A}.
It should be noted that for every Σ_1, Σ_2, such a g_m exists and equals L_1L_2^{-1}. Therefore, every classifier in the C_α group is CFAR, and since P_MD(||ρ|| → ∞) approaches 0, the error floor P_e(||ρ|| → ∞) forces P_FA to be α for all the classifiers in C_α. Therefore, the UMP detector for (3) is the optimal classifier for (2) in the C_α group, as it has the minimum missed detection probability for a fixed false alarm probability.
Furthermore, since all the classifiers in C_α are unbiased, the UMPU detector for (3) is also the optimal solution, because it can easily be proved that an unbiased detector for (3) is also an unbiased classifier for (2) (as long as the prior probabilities of the different classes are equal). On the other hand, if T_1(·) has a better detection probability for a given false alarm rate, it has a better error probability in the corresponding C_α group, because all the classifiers in C_α have the same false alarm probability.

Appendix B: Proof of Lemma 1
Proof We should find a ρ* such that P_e1(ρ*) < P_e2(ρ*). With equal priors, P_e(ρ) = (P_FA + 1 − P_D(ρ; P_FA))/2, so this is equivalent to

P_D(ρ*; P_FA1) − P_D(ρ*; P_FA2) > P_FA1 − P_FA2. (15)

In the next steps, we prove that, based on the assumptions, (15) is satisfied. We can find a ρ* for P_FA2 at which ∂P_D(ρ*)/∂P_FA > 1. Then, based on the mean value theorem [52], we can find a P_FA3 in the interval (P_FA2, P_FA1) for which

∂P_D(ρ*)/∂P_FA|_{P_FA3} = (P_D(ρ*; P_FA1) − P_D(ρ*; P_FA2)) / (P_FA1 − P_FA2). (16)

On the other hand, based on ∂²P_D/∂P_FA² < 0, we can write ∂P_D(ρ*)/∂P_FA|_{P_FA3} > 1. Therefore, based on (16), (15) holds.