Digital Communication Receivers Using Gaussian Processes for Machine Learning

We propose Gaussian processes (GPs) as a novel nonlinear receiver for digital communication systems. The GP framework can be used to solve both classification (GPC) and regression (GPR) problems. The optimal minimum mean squared error (MMSE) solution is the expectation of the transmitted symbol given the information at the receiver, which is a nonlinear function of the received symbols for discrete inputs. GPR can be presented as a nonlinear MMSE estimator and is thus capable of achieving optimal performance from the MMSE viewpoint. Also, the design of digital communication receivers can be viewed as a detection problem, for which GPC is especially suited, as it assigns posterior probabilities to each possible transmitted symbol. In this paper, we explore the suitability of GPC and GPR as nonlinear digital communication receivers. The major advantage of GPs is that they are Bayesian

perparameters (structure), since standard methods for searching for the optimal hyperparameters (e.g. cross-validation [15,2]) require immense computational resources, which are not available in most communication receivers, and their training time is highly variable. As a result, they use a suboptimal structure that requires longer training sequences to ensure optimal receiver performance. This also makes the required length of the training sequence hard to predict, as it depends on how well the chosen structure or hyperparameters fit the current problem.
For example, SVM with a Gaussian kernel needs to fit its width, which is proportional to the noise level [20,27,6]. If the width is too large, the SVM can be optimized with short training sequences, but its performance is poor. If it is too small, it requires a significantly longer training sequence to avoid overfitting.
For each instantiation of the problem there is an optimal width. This kernel width depends not only on the channel values and noise level, as we would expect, but also on the actual noise realization. Ideally, we would like to choose the kernel width every time we receive a new training sequence. But this would involve training a different SVM for each candidate width and then choosing the best receiver (validation). In addition, the width is not the SVM's only hyperparameter. We must also validate the soft-margin parameter, which trades off the minimization of the training errors against the maximization of the margin. Therefore, we would have to train a set of receivers with different width and soft-margin hyperparameters to find the optimal setting for each problem. However, typically, we can only solve a single optimization problem in the receiver. We thus prespecify the SVM hyperparameters, as is the case with the other nonlinear tools referenced earlier.
In previous work, we introduced Gaussian processes for machine learning as a novel nonlinear tool for designing digital communication receivers. Gaussian processes can be applied to regression and classification problems [31], and in this paper we use both settings for tuning digital communication receivers with short training sequences. We compare Gaussian processes for regression (GPR) and Gaussian processes for classification (GPC) to state-of-the-art linear and nonlinear receivers to show their strength in solving this relevant problem. We have presented some preliminary results for multi-user detection in CDMA systems [21,26] and for channel equalization [3]. In this paper we extend these results and include GPC in our comparisons.
Gaussian processes for machine learning are rooted in Bayesian statistics [31] and, consequently, build a likelihood function for their hyperparameters given the training examples. This likelihood can be optimized to set the hyperparameters.
This property makes GPs an attractive tool for designing nonlinear digital communication receivers, compared to other nonlinear machine learning tools, because the hyperparameters can be optimally set for each instantiation of our problem with a single optimization procedure.
For short training sequences hyperparameter mismatch significantly affects the performance of digital communication receivers, while for longer training sequences this performance is not sensitive to variations in the hyperparameters.
Most papers applying nonlinear machine learning for designing digital communication receivers propose fixed hyperparameters and sufficiently long training sequences. We focus on short training sequences and show that fixed hyperparameters underperform compared to GPR receivers with optimally trained hyperparameters.
Gaussian processes can be extended to solve classification problems. In this case the posterior is no longer tractable and we need to use approximations to compute the prediction for each class label [31]. A Gaussian distribution is typically used to approximate the GPC posterior, using either the Laplace [38] or the expectation propagation (EP) method [17]. However, the computational complexity of GPC is significantly higher than that of GPR, and hence GPC might not be as well suited to designing digital communication receivers as GPR is. Moreover, its performance is not as good as that of GPR receivers, as we show and explain in the experimental section.
The rest of the paper is organized as follows. We present the design of digital communication receivers as an optimization problem in Section 2 and show how different nonlinear machine learning tools fit in this framework. Section 3 is devoted to Gaussian processes for regression and how they can be understood as nonlinear MMSE estimation. The optimization of the GPR hyperparameters is proposed in Section 4. GPC is introduced briefly in Section 5. We present some computer simulations in Section 6 to illustrate the benefits of GPR for channel equalization and multi-user detection compared to other state-of-the-art nonlinear tools. We conclude with some final remarks and proposed further work in Section 7.
2 Nonlinear optimization for communication receivers

Channel model and MMSE
We consider throughout the paper the following deterministic channel model:
$$x = Hs + z, \tag{1}$$
where s is a random column vector representing the transmitted symbols, H contains the deterministic channel gains, unknown to both the transmitter and receiver, z is zero-mean Gaussian noise, and x represents the received symbols.
This model is general enough to capture most standard communication systems.
For example:
• Inter-Symbol Interference: Each element in s is a symbol transmitted at a different time instant. H is a Toeplitz matrix, in which each row contains the channel impulse response.
• Multiple-Input Multiple-Output: $(H)_{ij}$ represents the gain from the j-th transmitting antenna to the i-th receiving antenna and s represents the symbols transmitted by the antenna array.
• Fading: H is a diagonal matrix with the fading coefficients and s represents the symbols transmitted at each time instant.
The source s that achieves capacity (maximum information transmission rate) [8] is a zero-mean Gaussian distribution with a covariance matrix given by the right eigenvectors of the channel matrix [30]. Since s is then a continuous random variable, we can estimate the transmitted vector in the receiver using a minimum mean squared error (MMSE) detector:
$$f_{mmse}(x) = \arg\min_{f} E\left[\|s - f(x)\|^2\right]. \tag{2}$$
The function $f_{mmse}(x)$ is the mean value of s given the received vector x,
$$f_{mmse}(x) = E[s|x] = E[s x^\top]\left(E[x x^\top]\right)^{-1} x. \tag{3}$$
If H is unknown, we can replace the expectations by sample averages using a training sequence.
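As a rough illustration of replacing the expectations by sample averages, the following Python sketch simulates the channel model $x = Hs + z$ and estimates the linear receiver $W = E[s x^\top](E[x x^\top])^{-1}$ from a training sequence. The matrix H, the noise level, the BPSK source and the helper names are assumptions chosen for illustration and are not taken from the experiments in this paper. Because BPSK symbols are not Gaussian, W is here the best linear estimator rather than the conditional mean.

```python
import numpy as np

def sample_channel(H, n, sigma_z, rng):
    """Draw n independent uses of x = H s + z with unit-power BPSK symbols."""
    d_x, d_s = H.shape
    S = rng.choice([-1.0, 1.0], size=(n, d_s))    # rows are s_i^T
    Z = sigma_z * rng.standard_normal((n, d_x))   # zero-mean Gaussian noise
    X = S @ H.T + Z                               # rows are x_i^T
    return X, S

def linear_mmse_from_training(X, S):
    """Estimate W = E[s x^T] (E[x x^T])^{-1} with sample averages."""
    n = X.shape[0]
    R_sx = S.T @ X / n                            # sample estimate of E[s x^T]
    R_xx = X.T @ X / n                            # sample estimate of E[x x^T]
    # R_xx is symmetric, so (R_xx^{-1} R_sx^T)^T = R_sx R_xx^{-1}.
    return np.linalg.solve(R_xx, R_sx.T).T

rng = np.random.default_rng(0)
H = np.array([[1.0, 0.4], [0.3, 1.0], [0.2, -0.5]])   # hypothetical 3x2 channel
sigma_z = 0.5
X, S = sample_channel(H, 20000, sigma_z, rng)
W_hat = linear_mmse_from_training(X, S)

# Reference computed with H known (only possible in simulation):
# E[s x^T] = H^T and E[x x^T] = H H^T + sigma_z^2 I for unit-power symbols.
W_ref = H.T @ np.linalg.inv(H @ H.T + sigma_z**2 * np.eye(3))
print(np.max(np.abs(W_hat - W_ref)))
```

The symbol decisions would then be taken as the sign of `X @ W_hat.T`.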

Machine learning for digital communication receivers
The design of digital communication receivers can be readily understood as a supervised classification problem [10,23], in which the receiver constructs a classifier for deciding over the incoming symbols. Machine learning tools optimize the risk of misclassification:
$$R(f) = E\left[L(s, f(x))\right] = \int L(s, f(x))\, p(s, x)\, ds\, dx, \tag{4}$$
where L(·) is a loss function that measures the penalty for wrongly classifying a pattern and f(x) is the nonlinear model to predict s.
The joint density, p(s, x), is typically unknown and thus we use a training sequence $\{x_i, s_i\}_{i=1}^{n}$ and the empirical risk minimization (ERM) inductive principle [36] to obtain the optimal solution:
$$f^*(x) = \arg\min_{f} \frac{1}{n}\sum_{i=1}^{n} L(s_i, f(x_i)) + \lambda\Omega(\|f\|), \tag{5}$$
where we have included a regularization term, $\lambda\Omega(\|f\|)$, to avoid overfitting and to ensure that the minimum of the empirical risk converges to the minimum risk [36] as the number of training samples increases. The number of training patterns n determines the number of symbols in the preamble of each transmission needed to adjust the receiver. This number should be small to maximize the number of bits used to transmit information, as we need to retransmit the preamble in each burst of data.
The nonlinear machine learning approaches mentioned in the introduction can be cast as the optimization in (5) using an appropriate nonlinear model, loss function and regularizer. For example, $f(x) = w^\top\phi(x)$; $L(s, f(x)) = (1 - s f(x))_+$, the hinge loss, where $(y)_+ = \max(y, 0)$; and $\Omega(\|f\|) = \|w\|^2$, weight decay [2], give an SVM for a binary antipodal constellation, which constructs the nonlinear classifier using the 'kernel trick' for φ(·) [34].
The convexity of the optimization in (5) depends on f(·), L(·, ·) and Ω(·). In some cases, as in SVM or KA, it leads to a convex functional and in others, as in MLP or RBFN, it does not. But in any case, these machine learning approaches rely on an iterative optimization tool [2,34] for solving (5). In contrast, with $f(x) = w^\top\phi(x)$, the squared loss $L(s, f(x)) = (s - f(x))^2$ and $\Omega(f) = \|w\|^2$, we get a convex functional:
$$\hat{w} = \arg\min_{w} \left\{ \|s - \Phi^\top w\|^2 + \lambda\|w\|^2 \right\}, \tag{6}$$
that can be analytically optimized:
$$\hat{w} = \left(\Phi\Phi^\top + \lambda I\right)^{-1}\Phi s, \tag{7}$$
where $\Phi = [\phi(x_1), \ldots, \phi(x_n)]$ and $s = [s_1, \ldots, s_n]^\top$. We denote this solution as nonlinear MMSE, since it is a nonlinear extension of (3), in which we have substituted x by φ(x) and replaced the expectations by sample averages.
In the next section we show that (7) is equivalent to the mean solution provided by Gaussian processes for regression with a Gaussian likelihood function and that it can be solved using kernels [24]. Moreover, interpreting (7) as GPR allows optimizing its hyperparameters by maximum likelihood (Section 4). This optimization improves the performance of (7) with respect to other nonlinear machine learning procedures when the number of training samples is low, because for reduced training datasets the performance of nonlinear machine learning methods depends significantly on their hyperparameters.
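The analytical solution of the regularized least-squares problem can be sketched in a few lines of Python. The sketch assumes the closed form $w = (\Phi\Phi^\top + \lambda I)^{-1}\Phi s$ with $\Phi = [\phi(x_1), \ldots, \phi(x_n)]$; the cubic feature map `phi`, the toy data and the value of `lam` are hypothetical choices made only for illustration.

```python
import numpy as np

def phi(x):
    """Hypothetical feature map for scalar inputs: [1, x, x^2, x^3]."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x**2, x**3])   # shape (D, n)

def nonlinear_mmse_weights(Phi, s, lam):
    """Minimize ||s - Phi^T w||^2 + lam ||w||^2 in closed form."""
    D = Phi.shape[0]
    return np.linalg.solve(Phi @ Phi.T + lam * np.eye(D), Phi @ s)

rng = np.random.default_rng(1)
x_train = rng.uniform(-2.0, 2.0, size=50)
s_train = np.tanh(2.0 * x_train) + 0.1 * rng.standard_normal(50)

lam = 0.5
Phi = phi(x_train)
w = nonlinear_mmse_weights(Phi, s_train, lam)

# At the minimum, the gradient of the functional (divided by 2) vanishes.
grad = Phi @ (Phi.T @ w - s_train) + lam * w

# Prediction for new inputs: f(x) = w^T phi(x).
s_pred = phi(np.array([-1.0, 0.0, 1.0])).T @ w
```

No iterative optimizer is involved: the weights come from a single linear solve, which is the property the text contrasts with SVM, MLP or RBFN training.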

Gaussian Processes for Regression
In the past few years a new Bayesian machine learning tool based on Gaussian processes (GPs) has been developed for nonlinear regression estimation [39,31,37]. In a nutshell, Gaussian processes for regression (GPR) assume that a GP prior governs the set of possible regressors. Consequently, the joint distribution of training and test data is given by a multidimensional Gaussian density function and the predicted distribution for each test point is estimated by conditioning on the training data.
We present GPR from the Bayesian generalized linear regression viewpoint.
Although with this approach we lose the GP interpretation and can only work with Gaussian likelihood models, we believe it is a simpler way to understand GPR.
This approach mimics how most machine learning textbooks introduce nonlinear regression [2,34,14] and it helps in understanding GPR as nonlinear MMSE estimation. Therefore, practitioners in signal processing for digital communications can readily relate to this new tool for estimation and detection. Both interpretations are described in [37], where they are shown to be identical for Gaussian likelihood models. There is more to GPs than what we introduce in this summary; interested readers can find extensions of GPs in [31].
A generalized linear regressor expresses the input-output relation as
$$s = w^\top\phi(x) + \nu, \tag{8}$$
where φ(·) is a nonlinear transformation to a higher dimensional feature space and ν is a random variable that measures the deviation between s and its estimate.
Given a labeled training sequence ($D = \{x_i, s_i\}_{i=1}^{n}$, where the input $x_i \in \mathbb{R}^d$ and the output $s_i \in \mathbb{R}$) and a statistical model for ν, we can compute the regressor w by maximum likelihood (ML),
$$w_{ML} = \arg\max_{w} \prod_{i=1}^{n} p(s_i|x_i, w). \tag{9}$$
We use these ML weights to predict the outputs for future test points $x_*$:
$$f(x_*) = w_{ML}^\top\phi(x_*). \tag{10}$$
In Bayesian machine learning w is considered to be a random variable and, to predict the outcome of $x_*$, we use its conditional density given the training dataset, p(w|D). This conditional density, known as the posterior of w, can be computed through Bayes rule,
$$p(w|D) = \frac{p(w)\prod_{i=1}^{n} p(s_i|x_i, w)}{\int p(w)\prod_{i=1}^{n} p(s_i|x_i, w)\, dw}, \tag{11}$$
where $p(s_i|x_i, w)$ is the likelihood function of w, p(w) its prior distribution and the denominator is the marginal likelihood, which normalizes the posterior. To predict the output for a new test point $x_*$ we integrate out w:
$$p(s_*|x_*, D) = \int p(s_*|x_*, w)\, p(w|D)\, dw, \tag{12}$$
in which the conditional density of each $s_*$ (the likelihood of w) is weighted by the posterior of w and summed over all possible w. As a result, we get a full statistical description of $s_*$, given all the available information ($x_*$ and D). In this setting, we predict the value of $s_*$ using the full statistical model of w, not only its maximum likelihood estimate.
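This setting is quite general, as we can use any model for the likelihood and prior for solving the regression estimation problem. A Gaussian likelihood, $p(s_i|x_i, w) \sim N(w^\top\phi(x_i), \sigma_\nu^2)$, leads to the MMSE criterion, and a zero-mean Gaussian prior, $p(w) \sim N(0, \sigma_w^2 I)$, allocates probability mass to every possible w and allows solving (12) analytically. The posterior distribution in (11) is then a Gaussian density, $p(w|D) \sim N(\mu_w, \Sigma_w)$, with
$$\mu_w = \frac{1}{\sigma_\nu^2}\Sigma_w\Phi s, \tag{13}$$
$$\Sigma_w^{-1} = \frac{\Phi\Phi^\top}{\sigma_\nu^2} + \frac{I}{\sigma_w^2}. \tag{14}$$
Actually, the posterior mean in (13) is identical to the maximum a posteriori (MAP) estimate of (11):
$$w_{MAP} = \arg\min_{w} \left\{ \frac{1}{2\sigma_\nu^2}\|s - \Phi^\top w\|^2 + \frac{1}{2\sigma_w^2}\|w\|^2 \right\}, \tag{15}$$
which is identical to (6) for $\lambda = \sigma_\nu^2/\sigma_w^2$. We can also check that (13) is equal to (7). Therefore the GPR mean prediction can be regarded as a nonlinear MMSE estimation for the nonlinear mapping φ(·).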
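The prediction for $s_*$ in (12) is a Gaussian density function, $p(s_*|x_*, D) \sim N(\mu_{s_*}, \sigma_{s_*}^2)$:
$$\mu_{s_*} = \phi(x_*)^\top\mu_w, \tag{16}$$
$$\sigma_{s_*}^2 = \phi(x_*)^\top\Sigma_w\phi(x_*). \tag{17}$$
There is an alternative formulation for $\mu_{s_*}$ and $\sigma_{s_*}^2$, in which we do not need to know the nonlinear mapping φ(·) and we only need to work with its inner product or kernel, defined as:
$$k(x_i, x_j) = \sigma_w^2\,\phi(x_i)^\top\phi(x_j). \tag{18}$$
To obtain this alternative formulation, we first define the covariance matrix C as:
$$(C)_{ij} = k(x_i, x_j) + \sigma_\nu^2\delta_{ij}, \quad \text{i.e.} \quad C = \sigma_w^2\Phi^\top\Phi + \sigma_\nu^2 I, \tag{19}$$
which can be related to $\Sigma_w$ as follows:
$$\sigma_w^2\,\Sigma_w^{-1}\Phi = \frac{1}{\sigma_\nu^2}\Phi C. \tag{20}$$
Now if we pre-multiply (20) by $\Sigma_w$ and post-multiply it by $C^{-1}$, we obtain the following equivalency: $\Sigma_w\Phi/\sigma_\nu^2 = \sigma_w^2\Phi C^{-1}$, which can be used to simplify (16) and express the GPR prediction mean as:
$$\mu_{s_*} = k^\top C^{-1} s, \tag{21}$$
where
$$k = [k(x_1, x_*), \ldots, k(x_n, x_*)]^\top. \tag{22}$$
To compute the prediction for any vector $x_*$, we do not need to know the nonlinear mapping φ(·), only its kernel. The complexity of computing $\mu_{s_*}$ in (21) is linear in n, because we can pre-compute the vector $C^{-1}s$, which does not depend on $x_*$, and we only need to filter k with it for each new test pattern.
The prediction variance can likewise be expressed in terms of the kernel:
$$\sigma_{s_*}^2 = k(x_*, x_*) - k^\top C^{-1} k, \tag{23}$$
which is achieved after applying the matrix inversion lemma [33] to (14).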
Equations in (21) and (23) represent the predictions for x * given by the Gaussian processes view of GPR. The matrix C is the covariance matrix of a multidimensional Gaussian distribution, hence its name, that describes the training data, and the vector k represents the covariance vector between the training dataset and the test vector. Therefore, the function k(·, ·) has to be a positive-definite function to ensure that the Gaussian processes covariance matrix C is also positive-definite.
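The equivalence between the weight-space and the kernel formulations can be checked numerically. The Python sketch below assumes the kernel $k(x_i, x_j) = \sigma_w^2\phi(x_i)^\top\phi(x_j)$, the covariance matrix $C = \sigma_w^2\Phi^\top\Phi + \sigma_\nu^2 I$, and predictive variances of the noise-free function $w^\top\phi(x_*)$; the quadratic feature map, the toy data and the variance values are hypothetical.

```python
import numpy as np

def phi(x):
    """Hypothetical feature map for scalar inputs: [1, x, x^2]."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x**2])          # shape (D, n)

sigma_w2, sigma_nu2 = 2.0, 0.1                            # assumed variances
rng = np.random.default_rng(2)
x_train = rng.uniform(-1.0, 1.0, 20)
s_train = np.sin(3.0 * x_train) + np.sqrt(sigma_nu2) * rng.standard_normal(20)
x_test = np.array([-0.5, 0.0, 0.7])
Phi, Phi_t = phi(x_train), phi(x_test)

# Weight-space view: Gaussian posterior over w, then project onto phi(x*).
Sigma_w = np.linalg.inv(Phi @ Phi.T / sigma_nu2 + np.eye(3) / sigma_w2)
mu_w = Sigma_w @ Phi @ s_train / sigma_nu2
mean_w = Phi_t.T @ mu_w
var_w = np.sum(Phi_t * (Sigma_w @ Phi_t), axis=0)

# Kernel view: only inner products of the feature map are needed.
C = sigma_w2 * Phi.T @ Phi + sigma_nu2 * np.eye(20)
K_star = sigma_w2 * Phi.T @ Phi_t         # column j is the vector k for x_test[j]
alpha = np.linalg.solve(C, s_train)       # C^{-1} s, computed once
mean_k = K_star.T @ alpha                 # one inner product per test point
k_ss = sigma_w2 * np.sum(Phi_t**2, axis=0)
var_k = k_ss - np.sum(K_star * np.linalg.solve(C, K_star), axis=0)

print(np.max(np.abs(mean_w - mean_k)), np.max(np.abs(var_w - var_k)))
```

Both views agree up to rounding error, while the kernel view never needs φ(·) explicitly once `C` and `K_star` are computed from a kernel function.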

Hyperparameter optimization
If either φ(·) or k(·, ·) is known, we can analytically predict the output of any incoming sample using (21). But for most estimation problems the best nonlinear transformation (or its kernel) is unknown. As discussed in Section 2, the optimal setting of the hyperparameters could be obtained by cross-validation, as for any other nonlinear machine learning method. In this case the nonlinear MMSE would only be as good as any of the other methods, as it would require either trying different settings or relying on a prespecified one.
From the point of view of Bayesian machine learning, we can proceed as we did for the parameters w in Section 3. First, we compute the likelihood of the hyperparameters of the kernel given the training dataset:
$$p(s|X, \theta) = \frac{1}{\sqrt{(2\pi)^n|C_\theta|}}\exp\left(-\frac{1}{2}s^\top C_\theta^{-1}s\right), \tag{24}$$
where X denotes the training inputs and θ represents the hyperparameters of the covariance function or kernel. We have added θ to the covariance matrix, likelihood and posterior to explicitly indicate that they depend on the kernel's hyperparameters. This was omitted in the GPR presentation in Section 3 for clarity.
Second, we can define a prior for the hyperparameters, p(θ), that can be used to construct their posterior density:
$$p(\theta|D) = \frac{p(s|X, \theta)\, p(\theta)}{\int p(s|X, \theta)\, p(\theta)\, d\theta}. \tag{25}$$
Third, we can integrate out the hyperparameters to obtain the predictions:
$$p(s_*|x_*, D) = \int p(s_*|x_*, D, \theta)\, p(\theta|D)\, d\theta. \tag{26}$$
However, in this case, the hyperparameters' likelihood does not have a conjugate prior and the posterior is non-analytical. Hence the integration has to be done either by sampling or by approximations. Although this approach is well principled, it is computationally intensive and it is not feasible for digital communication receivers. For example, Markov-chain Monte Carlo (MCMC) methods require several hundred to several thousand samples from the posterior of θ to integrate it out in (26). For interested readers, further details can be found in [31].
Alternatively, we can use the likelihood function of the hyperparameters and compute its maximum to obtain their optimal setting [39], which is used to describe the kernel for the test samples. Although setting the hyperparameters by maximum likelihood is not a purely Bayesian solution, it is fairly standard in the community and it allows using Bayesian solutions in time-sensitive applications. The maximum likelihood hyperparameters are given by:
$$\theta_{ML} = \arg\max_{\theta} \log p(s|X, \theta) = \arg\max_{\theta}\left\{-\frac{1}{2}\log|C_\theta| - \frac{1}{2}s^\top C_\theta^{-1}s - \frac{n}{2}\log 2\pi\right\}. \tag{27}$$
This optimization is non-convex [18]. But as we increase the number of training samples the likelihood becomes a unimodal distribution around the maximum likelihood hyperparameters and the ML solution can be found using gradient ascent techniques. See [31] for further details.
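A minimal Python sketch of maximum likelihood hyperparameter fitting by gradient ascent is given below. It assumes a zero-mean Gaussian model for the training outputs with covariance $C_\theta$, and uses the standard gradient $\frac{1}{2}\mathrm{tr}\big((\alpha\alpha^\top - C_\theta^{-1})\,\partial C_\theta/\partial\theta_j\big)$ with $\alpha = C_\theta^{-1}s$. The three-parameter kernel, the backtracking step rule, the toy data and all function names are assumptions for illustration, not the optimizer used in the experiments.

```python
import numpy as np

def cov_matrix(theta, X):
    """C_theta for a hypothetical kernel a1*exp(-g*|xi-xj|^2) + a0*delta_ij,
    with theta = [log a0, log a1, log g], plus dC/dtheta_j for each j."""
    a0, a1, g = np.exp(theta)
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
    E = np.exp(-g * sq)
    C = a1 * E + a0 * np.eye(len(X))
    dC = [a0 * np.eye(len(X)), a1 * E, -g * sq * a1 * E]
    return C, dC

def log_marginal_likelihood(theta, X, s):
    """Return log p(s | X, theta) and its gradient with respect to theta."""
    C, dC = cov_matrix(theta, X)
    try:
        L = np.linalg.cholesky(C)
    except np.linalg.LinAlgError:          # numerically singular C: reject
        return -np.inf, np.zeros_like(theta)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, s))    # C^{-1} s
    ll = (-0.5 * s @ alpha - np.sum(np.log(np.diag(L)))
          - 0.5 * len(s) * np.log(2.0 * np.pi))
    A = np.outer(alpha, alpha) - np.linalg.inv(C)
    grad = np.array([0.5 * np.trace(A @ D) for D in dC])
    return ll, grad

def fit_ml(theta0, X, s, n_iter=50):
    """Gradient ascent with backtracking; each accepted step increases ll."""
    theta = np.array(theta0, dtype=float)
    ll, grad = log_marginal_likelihood(theta, X, s)
    for _ in range(n_iter):
        direction = grad / max(1.0, np.linalg.norm(grad))
        step = 1.0
        while step > 1e-8:
            cand = theta + step * direction
            ll_new, grad_new = log_marginal_likelihood(cand, X, s)
            if ll_new > ll:
                theta, ll, grad = cand, ll_new, grad_new
                break
            step *= 0.5
        else:
            break                           # no improving step was found
    return theta, ll

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(30, 1))
s = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(30)
theta0 = np.log([0.1, 1.0, 1.0])
ll0, grad0 = log_marginal_likelihood(theta0, X, s)
theta_ml, ll_ml = fit_ml(theta0, X, s)
print(ll0, ll_ml, np.exp(theta_ml))
```

Since the objective is non-convex, the result depends on the starting point `theta0`; in practice one might restart from a few initial values.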

Covariance matrix
To optimize the kernel hyperparameters in (27) we need to describe a kernel in a parametric form. Kernel design is one of the most challenging open problems in machine learning, as it is mainly driven by each particular application. We need to incorporate our prior knowledge into the kernel but, at the same time, we want the kernel to be flexible enough to explain previously unknown trends in the data. In [31], a list of flexible kernels (e.g. linear, Gaussian, neural networks, Matérn, among others) and their properties are described. The rules on how to combine them are also described (e.g. the sum or product of two kernel functions is also a valid kernel function).
For example, if we know the optimal solution to be linear, we could use the linear kernel, $k(x_i, x_j) = \sigma_w^2 x_i^\top x_j$. The only unknown hyperparameters in this case are $\sigma_\nu^2$ and $\sigma_w^2$, and we do not need to know these variances a priori. In the remainder of this text, we consider, without loss of generality, the last term in (19) to be part of the designed kernel, as $\delta_{ij}$ is a valid kernel and the weighted sum of kernel functions (with nonnegative weights) is also a kernel. In general, kernel functions are more complex and they incorporate several hyperparameters.
For example, the Gaussian kernel with automatic relevance determination (ARD) proposes one nonnegative weight, $\gamma_\ell$, per input dimension:
$$k(x_i, x_j) = \alpha_1\exp\left(-\sum_{\ell=1}^{d}\gamma_\ell\,(x_{i\ell} - x_{j\ell})^2\right) + \alpha_2\, x_i^\top x_j + \alpha_0\delta_{ij}, \tag{28}$$
where we have added a linear kernel to use this covariance function for designing digital communication receivers. For this kernel function we define the hyperparameters as $\theta = [\log\alpha_0, \log\alpha_1, \log\alpha_2, \log\gamma_1, \ldots, \log\gamma_d]$, because the weights need to be positive to ensure that k(·, ·) is a positive semi-definite function. Hence, we can apply unconstrained optimization tools if we work over θ.
The covariance function in (28) is a good kernel for designing digital communication receivers using GPR, because it contains a linear part and a universal nonlinear part, as the RBF kernel has an infinite VC dimension [36].
The linear part can mimic the best linear decision boundary and the nonlinear part modifies it where the linear explanation is not optimal for obtaining the expectation of s given x. If the channel is linear, then the ML solution sets $\alpha_1 = 0$ and there is no interference of the nonlinear term with the linear one in the solution. Also, using a radial basis kernel for the nonlinear part seems an appropriate choice to achieve nonlinear decisions for digital communication receivers, because the received symbols form a constellation of clouds of points with Gaussian spread around their centers.
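A covariance function with a linear part, an ARD Gaussian part and a noise term can be sketched in Python as follows. The sketch assumes the form $k(x_i, x_j) = \alpha_1\exp(-\sum_\ell\gamma_\ell(x_{i\ell} - x_{j\ell})^2) + \alpha_2 x_i^\top x_j + \alpha_0\delta_{ij}$ with $\theta$ holding the logarithms of $\alpha_0, \alpha_1, \alpha_2, \gamma_1, \ldots, \gamma_d$; the function name, the `add_noise` flag and the numerical values are hypothetical.

```python
import numpy as np

def receiver_covariance(theta, Xa, Xb, add_noise=False):
    """Linear + ARD Gaussian kernel between the rows of Xa and Xb.

    theta = [log a0, log a1, log a2, log g_1, ..., log g_d]; working with
    logarithms keeps every weight positive under unconstrained optimization.
    Set add_noise=True only when Xa and Xb are the same training inputs,
    so that the a0 * delta_ij term lands on the diagonal.
    """
    a0, a1, a2 = np.exp(theta[:3])
    gamma = np.exp(theta[3:])
    diff = Xa[:, None, :] - Xb[None, :, :]                # shape (na, nb, d)
    K = a1 * np.exp(-np.sum(gamma * diff**2, axis=2))     # nonlinear (ARD) part
    K = K + a2 * (Xa @ Xb.T)                              # linear part
    if add_noise:
        K = K + a0 * np.eye(Xa.shape[0])
    return K

rng = np.random.default_rng(4)
X = rng.standard_normal((8, 3))
theta = np.log([0.1, 1.0, 0.5, 1.0, 2.0, 0.5])
C = receiver_covariance(theta, X, X, add_noise=True)

# Driving a1 towards zero leaves only the linear part plus the noise term.
theta_lin = theta.copy()
theta_lin[1] = -50.0
C_lin = receiver_covariance(theta_lin, X, X, add_noise=True)
print(np.linalg.eigvalsh(C).min())
```

The last lines mirror the behavior described in the text: when the weight of the Gaussian part vanishes, the covariance reduces to that of a linear model.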

Discussion
Gaussian Processes for regression is a nonlinear regression tool that, given the covariance function, provides an analytical solution to any regression estimation problem. Moreover, it does not only give point estimates, but it also assigns confidence intervals for them. In GPR, we perform the optimization step to set the covariance function hyperparameters by maximum likelihood, unlike SVM or other nonlinear machine learning tools, in which the optimization is used to set the optimal parameters. In these methods, the hyperparameters have to be either prespecified or estimated by cross-validation [15].
Cross-validation optimizes several functionals (typically fewer than 10) for each possible setting of the hyperparameters [2]. The number of hyperparameters that can be tuned is quite limited (at most 2 or 3), as the computational complexity of cross-validation increases exponentially with the number of hyperparameters.
These remarkable drawbacks limit the application of these nonlinear tools to digital communications receivers, since we face complex nonlinear problems with reduced computational resources and short training sequences. By exploiting the GPs framework, as stated in this paper, we can avoid them.
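The exponential growth of the cross-validation cost can be made concrete with a short count of training runs. The grid below reuses the SVM values reported in the experimental section (five values of C and four kernel-width factors); the number of folds and the third hyperparameter are hypothetical.

```python
def cross_validation_trainings(grid, n_folds):
    """Number of optimization problems solved by exhaustive grid search."""
    n_settings = 1
    for values in grid.values():
        n_settings *= len(values)       # one factor per hyperparameter
    return n_settings * n_folds

svm_grid = {"C": [0.5, 1, 2, 5, 10], "width_factor": [1, 2, 5, 10]}
runs_2 = cross_validation_trainings(svm_grid, n_folds=5)

# Adding a third hyperparameter with 5 candidate values multiplies the cost.
svm_grid_3 = dict(svm_grid, extra=[1, 2, 3, 4, 5])
runs_3 = cross_validation_trainings(svm_grid_3, n_folds=5)
print(runs_2, runs_3)   # 100 500
```

By comparison, the maximum likelihood approach solves one optimization problem regardless of how many hyperparameters the kernel has, although each iteration becomes more expensive.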

Gaussian process for classification
Gaussian process classification is more involved than its regression counterpart, because we cannot rely on a Gaussian likelihood function to predict the labels of each class, as the outcomes come from a discrete set [31]. Therefore, to predict the class labels we need to resort to numerical integration or approximations to tractable density models. A generalized linear binary classifier predicts the class label for an input x as follows:
$$p(s = +1|f) = \sigma(f), \tag{29}$$
where $f = w^\top\phi(x)$ is an underlying continuous function, σ(·) is a sigmoid that squashes f between 0 and 1, and $p(s = -1|f) = 1 - p(s = +1|f)$.
Given a labeled training sequence ($D = \{x_i, s_i\}_{i=1}^{n}$, where the input $x_i \in \mathbb{R}^d$ and the output $s_i \in \{\pm 1\}$), we can compute the posterior over the underlying function $f = [f_1, \ldots, f_n]^\top$ using Bayes rule, as we did in Section 3 for GPR with w, and we can integrate out f to predict the class label for any new test point $x_*$.
We can compute the class label for the test samples as follows:
$$p(s_* = +1|x_*, D) = \int \sigma(f_*)\, p(f_*|x_*, D)\, df_*, \tag{30}$$
where
$$p(f_*|x_*, D) = \int p(f_*|x_*, f, D)\, p(f|D)\, df \tag{31}$$
and
$$p(f|D) = \frac{p(f)\prod_{i=1}^{n} p(s_i|f_i)}{\int p(f)\prod_{i=1}^{n} p(s_i|f_i)\, df}. \tag{32}$$
In (31) we compute the distribution of the underlying function at the test point and in (30) we integrate out the underlying function to predict the probability that the class label of that point is +1. Both integrals are intractable due to the likelihood model employed for f in (29). GPC typically relies on a Gaussian approximation for the posterior density p(f|D) to solve (31) analytically, and (30) is a one-dimensional integral that can be easily solved numerically. Further details on how to approximate the posterior and train the covariance function hyperparameters can be found in [31].
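Once the distribution of the underlying function at the test point has been approximated by a Gaussian with mean `mu` and variance `var`, the remaining one-dimensional integral can be evaluated numerically. The Python sketch below uses Gauss-Hermite quadrature and assumes a logistic sigmoid; the choice of sigmoid, the quadrature order and the function names are assumptions, and the Laplace or EP step that would produce `mu` and `var` is not shown.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def sigmoid(f):
    """Logistic sigmoid (an assumed choice for the squashing function)."""
    return 1.0 / (1.0 + np.exp(-f))

def gpc_predict_proba(mu, var, order=32):
    """Approximate p(s* = +1) = integral of sigmoid(f) N(f; mu, var) df."""
    t, w = hermgauss(order)                  # nodes/weights for exp(-t^2)
    f = mu + np.sqrt(2.0 * var) * t          # change of variables
    return float(np.sum(w * sigmoid(f)) / np.sqrt(np.pi))

p_uncertain = gpc_predict_proba(mu=2.0, var=4.0)
p_certain = gpc_predict_proba(mu=2.0, var=0.0)
print(p_uncertain, p_certain)
```

A larger variance pulls the predicted probability towards 0.5, which is how the uncertainty about the underlying function shows up in the posterior probability assigned to each symbol.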
We carry out two sets of experiments. First, we design a receiver for a CDMA system with strong near-far requirements and inter-symbol interference. In the second experiment, we deal with a channel equalization problem with a nonlinear amplifier in the receiver. The results of these experiments allow us to draw some general conclusions about the advantages of GPs for designing digital communication receivers. For both experiments the channel model is given by (33). For all these systems we train a linear MMSE receiver (denoted by 'MMSE' and a dashed line), a GPR ('GPR' and a solid line) and a GPC with an EP approximation to its posterior ('GPC' and a dash-dotted line). We approximate the GPC posterior using the EP algorithm, because it provides better performance than the Laplace approximation, as suggested in [17]. For the GP receivers we work with the covariance function in (28). We also report a linear SVM receiver ('SVMl' and a dotted line with circles) and a nonlinear SVM ('SVMnl' and a dotted line with bullets) with an RBF kernel [34]. For the SVMs we train a set of receivers with different hyperparameters and we report the best result. We use C = 0.5, 1, 2, 5 and 10 and $\sigma = k\sigma_z$ with k = 1, 2, 5 and 10. Thereby, the comparison is biased in favor of the SVM when compared to the GPR and GPC solutions. All the figures are obtained for 100 independently trained trials with $10^5$ test symbols.

Linear multi-user detection
In our first experiment we employ Gold spreading codes with 31 chips per user, because they have favorable cross-correlation properties that limit the interference from other users and their delayed replicas [11]. We report results for systems operating with 3 and 16 users and we assume the user of interest is 50dB below the other users. This is a fairly standard scenario when one of the users is close to the base station and is assigned little power. We use the received 31 chips to detect each transmitted symbol. We show the bit error rate (BER) versus the signal to noise ratio (SNR) for 3 users in Figure 1(a) and 16 users in Figure 1(b). The optimal solution is almost linear and all the proposed procedures perform equally well, once the training sequence is long enough. The training sequence of 512 symbols is not long enough for the nonlinear SVM with 16 users and it is unable to correctly tune its multi-user detector. If we had increased the training sequence to several thousand samples, the nonlinear SVM would converge and it would provide a solution close to the other algorithms. The differences in BER are not significant enough to decide which method is best, but the differences in training time might lead us to choose one over the others, as we discuss shortly.
We report the BER as a function of the number of training examples for 3 users in Figure 2(a) and 16 users in Figure 2(b). For this experiment, these results are more meaningful than the BER versus SNR reported in Figure 1, because there is a significant disparity between the performances of the different methods. For 16 users, the GPR receiver presents the fastest learning curve, closely followed by the linear MMSE and linear SVM solutions. We conjecture this is due to the optimal training of the GPR hyperparameters, because the GPR is able to adjust them for each training sequence, while the linear SVM uses a constant setting, which might be good for a long training sequence, but not as good for shorter ones.
In this example we can readily understand the advantages of using GPR for solving multi-user detection problems, as for very short training sequences we are able to obtain the best possible solution, and if it is linear, it even improves on the linear MMSE solution. The GPR and linear MMSE detectors provide the same solution as the number of samples increases, but for short training sequences the GPR detector is able to optimally set its hyperparameters to provide better performance than the linear MMSE. Also, as we see in the next example, if the solution is nonlinear, it is able to achieve nonlinear multi-user detectors, significantly improving on the linear MMSE solution.

Nonlinear multi-user detection
We repeat Experiment 2 in [6], in which 3 users transmit with an orthogonal 8-dimensional spreading code. The solution for user 2 is highly nonlinear and we report the BER versus the SNR in Figure 3. which is a waste of resources in wireless communication systems, as the preamble must be as short as possible. Also, an SVM cannot use a kernel as in (28), because it would need to cross-validate (or hand-pick) too many hyperparameters.

Nonlinear channel equalization
Now we turn to the channel equalization problem, in which the channel is represented by (33), and we add a memoryless nonlinearity to the receiver that transforms each received signal as follows: where $\tilde{x}_i = (Hs)_i$ is the noiseless channel output. This channel model is typically used to describe nonlinear amplifiers in wireless communication receivers, as explained in [20]. To construct the equalizers, we use 6 received samples to predict each transmitted symbol with a delay of 2 samples.
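The construction of the equalizer inputs from a window of received samples can be sketched as follows. The alignment convention (the row for time t holds $x_t, x_{t-1}, \ldots, x_{t-5}$ and the target is $s_{t-2}$) and the function name are assumptions; the text only specifies the window length of 6 and the delay of 2.

```python
import numpy as np

def equalizer_dataset(x, s, n_taps=6, delay=2):
    """Build (input, target) pairs for a windowed equalizer.

    Row t holds x[t], x[t-1], ..., x[t-n_taps+1]; its target is s[t-delay].
    """
    x = np.asarray(x)
    s = np.asarray(s)
    t_idx = np.arange(n_taps - 1, len(x))       # first time with a full window
    X = np.stack([x[t_idx - k] for k in range(n_taps)], axis=1)
    return X, s[t_idx - delay]

# Toy sequences that make the indexing easy to read.
x_rx = np.arange(10.0)            # stands in for received samples
s_tx = 100.0 + np.arange(10.0)    # stands in for transmitted symbols
X_eq, s_eq = equalizer_dataset(x_rx, s_tx)
print(X_eq[0], s_eq[0])           # [5. 4. 3. 2. 1. 0.] 103.0
```

Each row of `X_eq` plays the role of one training input and `s_eq` the corresponding training output for any of the receivers compared here.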
In Figure 4, we show the BER versus the SNR for all equalizers and n = 512.
For SNR less than 22dB the nonlinear GPR equalizer achieves the minimum BER with a gain larger than 3dB for BER around $10^{-3}$. For larger SNR the performance of this nonlinear equalizer degrades and the linear equalizers perform significantly better. The nonlinear SVM equalizer performs as the GPR equalizer for SNR lower than 17dB, but for larger SNR the training sequence is not long enough and its solution degrades (overfitting). For SNR larger than 20dB, the nonlinear SVM equalizer is not able to reduce the achieved BER. As the SNR increases, the nonlinear SVM and the GPR are not able to obtain optimal equalizers, because there is not enough diversity in the training sequence and they overfit to it. The GPR performance is better than that of the SVM for large SNR, because it uses the covariance function in (28), which incorporates a linear term. Although it overfits the nonlinear part, the linear component allows the GPR to reduce the BER for large SNR. If we had increased the training sequence, the SVM and GPR would perform better than the linear methods for larger values of the SNR.
The GPC shuts down the nonlinear part and performs as the linear SVM. This is the same effect that we saw for large SNR in Figure 3: the training set is not long enough to train the nonlinear part of its covariance function and the GPC consequently sets it to zero. In Figure 4, for SNR less than 10dB, although we can barely notice it, the GPC equalizer follows the nonlinear solutions, as the training sequence is long enough to train its nonlinear component in this case.
The linear SVM and GPC are able to perform significantly better than the linear MMSE, because the channel model is nonlinear. For a nonlinear channel the received constellation is no longer symmetric and penalizing the squared error is suboptimal, as it forces all the detected symbols to be equally far from their optimal values. The SVM and GPC equalizers only care whether the points are correctly classified and they only focus on those that might not be, which explains the BER gap between the linear MMSE equalizer and the GPC and linear SVM ones.
In any case, for the SNR range of interest, between 10 and 20dB, the GPR receivers (and nonlinear SVM) are significantly better than the linear methods and the GPC. For this range of SNR the BER is not low enough for most digital communication applications, but we can significantly reduce the BER using channel coding strategies [18] with high data rates, instead of increasing the SNR.

Discussion
In the experiments we show the behavior of GPR for designing digital communication receivers, and we show that it has many favorable properties for solving this task when used with the covariance function in (28):
• If the solution is linear, the GPR receiver shuts down the nonlinear part of the covariance function and performs as the linear MMSE detector for long training sequences. It converges faster than the MMSE detector to the optimal solution. Its performance does not degrade when canceling the nonlinear part of the kernel.
• If the solution is nonlinear, the GPR receiver is able to achieve very good performance, comparable to a nonlinear SVM receiver with optimal hyperparameters, and it needs shorter training sequences to achieve such solutions. The GPR receiver performs significantly better than the linear detectors.
• The GPR receiver performs a single optimization procedure. This is a highly desirable quality, as in one step we get the optimal hyperparameters without needing to try several solutions and check which one is best. The GPR decides whether it needs a linear or a nonlinear solution in that single optimization, without relying on a 'genie' or another procedure to check if the optimal solution is linear.
• The GPR can overfit if the training sequence is not sufficiently long, as we can see in Figure 4. But in this case the overfitting does not degrade the solution as much as it does for the nonlinear SVM. It only happens for very large SNR, at which we do not typically transmit.
• The GPR receiver uses a least-squares loss function, which is not ideal for solving classification problems when we are interested in minimizing the misclassification error. But for digital communication problems in which the noise is Gaussian, the use of this loss function is not critical and the GPR receiver performs as well as the receivers based on classification loss functions (GPC and SVM).
The GPC would initially seem like a better choice for designing digital communication receivers, because it minimizes the misclassification error and it can optimize the hyperparameters, just as the GPR does. But in our experiments we show that GPC receivers usually need longer training sequences before they can tune their nonlinear part, and they train a linear detector in cases where a nonlinear detector clearly performs better. We believe that in order for GPC receivers to perform better than (or as well as) GPR receivers, we need far longer training sequences, which might not be available in digital communication systems. We conjecture that this limitation of GPC for training digital communication receivers is due to the posterior approximation, because its loss function is more suitable than the one the GPR uses and we train the GPC receiver with the same covariance function.
The SVM performs as well as GPR for the proposed problem, but it needs longer training sequences to deal with its fixed hyperparameters or more computational resources to fine-tune them. We do not believe there is an intrinsic advantage for GPR in this problem, although we believe that the ability of GPR to tune its hyperparameters by maximum likelihood makes the problem easier to solve, as we build the receiver with a single optimization procedure.

Conclusions
We have proposed GPR and GPC for designing digital communication receivers.
GPR follows a wide range of machine learning tools that have been successfully applied to the design of digital communication receivers. But GPR presents several properties that we believe make it a much better candidate for designing these receivers. First, GPR can be viewed as a nonlinear MMSE. MMSE is the standard criterion used for designing digital communication receivers, as it trades off inverting the channel and not amplifying the noise. Second, its solution is analytical given the nonlinear function, while most machine learning methods need to solve an optimization problem to obtain their solution. Third, it can train its hyperparameters by maximum likelihood, while other machine learning algorithms need to cross-validate their hyperparameters or structure. Fourth, its computational complexity is not a limiting issue, as addressed in [29].
To highlight the advantages of GPs as digital communication receivers we compare their performance to that of the SVM. The SVM provides solutions as good as those of the GPR, but it needs more training samples. The GPR fits its covariance function by maximum likelihood, and hence it does not suffer from this problem.
The GPC could initially be thought of as a better candidate for designing digital communication receivers, since we are solving a classification problem. However, as we have shown in this paper, it needs significantly longer training sequences to provide the same accuracy level as GPR receivers. One possible advantage of GPC compared to GPR for digital communication receivers is that it provides posterior probability estimates for the received bits, which could subsequently be used by a channel decoder to improve the BER. Some preliminary results of this idea can be found in [25].