On a unified framework for linear nuisance parameters

Estimation problems in the presence of deterministic linear nuisance parameters arise in a variety of fields. To cope with these, three common methods are widely considered: (1) jointly estimating the parameters of interest and the nuisance parameters; (2) projecting out the nuisance parameters; (3) selecting a reference and then taking differences between the reference and the observations, which we will refer to as “differential signal processing.” A large body of literature has been devoted to these methods, yet they all follow separate paths. Based on a unified framework, we analytically explore the relations between these three methods, where we particularly focus on the third one and introduce a general differential approach to cope with multiple distinct nuisance parameters. After a proper whitening procedure, the corresponding best linear unbiased estimators (BLUEs) are shown to be all equivalent to each other. Accordingly, we unveil some surprising facts, which are in contrast to what is commonly considered in the literature, e.g., the reference choice is actually not important for the differencing process. Since this paper formulates the problem in a general manner, one may specialize our conclusions to any particular application. Some localization examples are also presented to verify our conclusions.


Introduction
The problem of estimating unknown parameters of interest x ∈ R^{L×1} observed through a linear transformation H ∈ R^{N×L} (N > L) and corrupted by additive noise n ∈ R^{N×1} has been well studied and considered in a wide variety of fields [1]. However, the observations y ∈ R^{N×1} are sometimes also influenced by unknown linear nuisance parameters, denoted by u ∈ R^{M×1}, which enter y through the linear transformation G ∈ R^{N×M} (N > M). The resulting data model is

y = H x + G u + n,  (1)

where n is assumed to be zero-mean white noise. For instance, these nuisance parameters could be some common offsets such as the transmit time, the clock bias, and the transmit power in time-of-arrival (TOA) or received signal strength (RSS) based localization [2], or they could represent some redundant signals like the undesired signatures in hyperspectral imaging [3]. In fact, estimation problems with linear nuisance parameters widely exist in many other fields such as communications [4][5][6], source separation [7], and machine learning [8,9]. To cope with such nuisance parameters, three common methods are widely considered: (1) the joint estimation approach estimates the parameters of interest x and the nuisance parameters u together; (2) the orthogonal subspace projection (OSP) approach projects out the nuisance term u such that the resulting observation vector is only subject to x (e.g., the extraction of the desired signature in [13]); (3) the differential signal processing approach firstly chooses a reference and then estimates x from the differences between the reference and the observations [14][15][16][17][18]. Note that these methods obviously result in three distinct observation sets with different signal-to-noise ratios (SNRs), which will greatly influence the estimation performance. Therefore, a vast amount of research has been conducted on these methods, though all of it follows separate paths. Admittedly, some early results have been reported bridging the first two methods. For instance, the famous OSP-based solution using a matched filter to maximize the output SNR proposed in [19] was later proven to be equivalent to the least squares (LS) approach based on the joint estimation [20,21].
However, the proposed differential approaches are still widely regarded as a common but distinct way to cope with linear nuisance parameters. One of the most famous applications is time-based localization (TOA or time-difference-of-arrival (TDOA)), where many papers exist on selecting an optimal reference [22][23][24], constructing an optimal observation subset [25][26][27], or just using the full observation set adopting each sample as a reference [28][29][30]. None of these issues occur in the first two methods, due to the fact that they are free of a reference. In a nutshell, there still seems to be a huge and inevitable gap between the differential approaches and the other two. This paper analytically investigates the relations between all three methods, where the corresponding best linear unbiased estimators (BLUEs) are presented and discussed. Since the general framework in (1) is used throughout this paper, all the conclusions apply to any kind of problem that can be written in this form, which is exactly the strength of this paper. We also present some localization examples to verify our conclusions. To summarize, the main contributions of this paper are listed below.
1. For the first time, we extend the differential signal processing approach to a more general framework, which can cope with multiple nuisance parameters, whereas most existing methods consider a single nuisance parameter.
2. Surprisingly, the BLUEs of the three considered methods are rigorously proven to be identical to each other if an appropriate preprocessing step is used. This might be expected or known w.r.t. the first two methods, but the equivalence with differential methods has never been reported before.
3. Compared with the joint estimation method, which directly utilizes all the original observations, neither of the other two methods suffers any information loss.
4. Although differential methods seem to rely on the selected reference, selecting the right reference is not important, since there is no actual trace of the selected reference in the corresponding BLUE. This is in sharp contrast to what is commonly considered in the literature.
5. As far as the differencing process is concerned, the differential observation set associated with a single reference already preserves the full data information.
The rest of this paper is organized as follows. Section 2 presents the relations between the three considered methods. Some examples of source localization are shown and numerically studied to support our conclusions in Section 3. Finally, Section 4 summarizes this paper.

Handling linear nuisance parameters
In this section, we study the relations between the joint estimation, the OSP-based estimation, and the differential estimation by investigating their corresponding BLUEs, where, for the first time, a general differential approach is introduced that copes with multiple nuisance parameters.

Joint estimation
The joint least squares (JLS) estimate of x and u, based on the model (1), is given by

[x̂_jls^T, û_jls^T]^T = ([H G]^T [H G])^{−1} [H G]^T y,  (2)

where we have used the fact that the augmented matrix [H G] has full column rank. Obviously, x̂_jls is the BLUE, since n is zero-mean white noise, according to the Gauss-Markov theorem [1].
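As a concrete sketch (with randomly generated H, G, and y, since the paper keeps the model abstract), the joint LS estimate in (2) amounts to one solve of the normal equations for the augmented matrix [H G]:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, M = 8, 2, 2
H = rng.standard_normal((N, L))
G = rng.standard_normal((N, M))
x_true = rng.standard_normal(L)
u_true = rng.standard_normal(M)
y = H @ x_true + G @ u_true + 0.01 * rng.standard_normal(N)

# Joint LS: stack [H G] and solve the normal equations in one shot.
A = np.hstack([H, G])                      # augmented matrix, full column rank
theta = np.linalg.solve(A.T @ A, A.T @ y)  # stacked estimate [x_jls; u_jls]
x_jls, u_jls = theta[:L], theta[L:]
```

Since the noise is white, the Gauss-Markov theorem makes this LS solution the BLUE.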

OSP-based estimation
If we prefer to project out the nuisance term u, an orthogonal subspace projector can be formulated [19] as

P⊥_G = I_N − G G†,  (3)

where (·)† indicates the pseudo-inverse, given by G† = (G^T G)^{−1} G^T since G is assumed to have full column rank. Applying P⊥_G to our original model in (1) results in a new model

P⊥_G y = P⊥_G H x + P⊥_G n,  (4)

where the impact of the nuisance term u is eliminated. Due to the symmetry and the idempotence of an orthogonal subspace projector, i.e., P⊥_G = P⊥_G^T and P⊥_G = P⊥_G², we obtain the covariance matrix of the model noise in (4) as σ² P⊥_G P⊥_G^T = σ² P⊥_G. Then, following the OSP-based model (4), the corresponding LS optimization problem can be formulated as

x̂_osp−1 = arg min_x ||P⊥_G y − P⊥_G H x||²_2,  (5)

which leads to the following OSP-based LS estimate Type I of x

x̂_osp−1 = (H^T P⊥_G H)^{−1} H^T P⊥_G y.  (6)

However, the model noise P⊥_G n in (4) is not white, i.e., its covariance matrix is not a (scaled) identity. Moreover, the orthogonal subspace projector P⊥_G is obviously singular, which implies that the covariance matrix σ² P⊥_G is not invertible and hence cannot be used to whiten the model (4). Therefore, it is difficult to decide at this point whether x̂_osp−1 is the BLUE or not.
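A minimal numerical sketch of the type-I estimate, again with randomly drawn matrices, is given below; the comparison with the joint LS estimate anticipates the equivalence discussed at the end of this section:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, M = 8, 2, 2
H = rng.standard_normal((N, L))
G = rng.standard_normal((N, M))
y = H @ rng.standard_normal(L) + G @ rng.standard_normal(M) + 0.01 * rng.standard_normal(N)

# Orthogonal projector onto the complement of range(G): P = I_N - G (G^T G)^{-1} G^T.
P_perp = np.eye(N) - G @ np.linalg.solve(G.T @ G, G.T)

# Type-I OSP estimate: LS on the projected model P y = P H x + P n.
x_osp1 = np.linalg.solve(H.T @ P_perp @ H, H.T @ P_perp @ y)

# Joint LS estimate of x for comparison.
x_jls = np.linalg.lstsq(np.hstack([H, G]), y, rcond=None)[0][:L]
```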
To cope with this, we need to introduce another type of OSP-based LS estimator for x. If this estimator can be shown to be the BLUE and can also be proven equivalent to x̂_osp−1, then we can conclude that both of them are the BLUE.
Assume that U_n ∈ R^{N×(N−M)} contains orthonormal basis vectors spanning the null space of G (i.e., the orthogonal complement of the column space of G). Then, the idea of this second OSP-based estimator is to adopt this null space to remove the impact of u. More specifically, pre-multiplying both sides of our original model by U_n^T leads to

U_n^T y = U_n^T H x + U_n^T n.  (7)

Note that (4) can be obtained from (7) by multiplying both sides with U_n since U_n U_n^T = P⊥_G [31], and hence these two models are basically equivalent. We can also see that, since U_n is an isometry, the model noise U_n^T n remains white, i.e., its covariance matrix is σ² U_n^T U_n = σ² I_{N−M}, which means that the LS estimate of this model is the BLUE.
Applying the LS criterion to the model (7) results in the optimization problem

x̂_osp−2 = arg min_x ||U_n^T y − U_n^T H x||²_2,  (8)

from which we can obtain the OSP-based LS estimate Type II of x as

x̂_osp−2 = (H^T U_n U_n^T H)^{−1} H^T U_n U_n^T y.  (9)

Due to the fact that U_n U_n^T = P⊥_G, we obtain the equivalence x̂_osp−1 ≡ x̂_osp−2 and hence both estimators represent the BLUE. In the later simulations, these two OSP-based BLUEs will be considered together for convenience.
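The type-II estimator can be sketched in a few lines of numpy; here U_n is obtained from the SVD of G (one of several ways to construct an orthonormal basis of the left null space):

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, M = 8, 2, 2
H = rng.standard_normal((N, L))
G = rng.standard_normal((N, M))
y = H @ rng.standard_normal(L) + G @ rng.standard_normal(M) + 0.01 * rng.standard_normal(N)

# U_n: orthonormal basis of the orthogonal complement of range(G), from the SVD of G.
U = np.linalg.svd(G)[0]
U_n = U[:, M:]                                  # N x (N - M) isometry with U_n^T G = 0

# Type-II OSP estimate: ordinary LS on the reduced (and still white) model (7).
x_osp2 = np.linalg.lstsq(U_n.T @ H, U_n.T @ y, rcond=None)[0]

# Type-I estimate for comparison, using U_n U_n^T = P_perp.
P_perp = np.eye(N) - G @ np.linalg.solve(G.T @ G, G.T)
x_osp1 = np.linalg.solve(H.T @ P_perp @ H, H.T @ P_perp @ y)
```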
Finally, to end this subsection, we would like to focus on the equivalence between the joint estimation and the OSP-based estimation approaches. In fact, the equivalence between x̂_jls and x̂_osp−1 is already known [20,21,32], but we found it useful to revisit this result from a different viewpoint. To be explicit, applying block-wise inversion to (2), we can easily rewrite the joint LS estimate of x and u as

x̂_jls = M_G H^T P⊥_G y,  û_jls = (G^T G)^{−1} G^T (y − H x̂_jls),  (10)

where M_G ≜ (H^T P⊥_G H)^{−1}. From (10), we can directly observe that x̂_jls = M_G H^T P⊥_G y = x̂_osp−1 and hence

x̂_jls ≡ x̂_osp−1 ≡ x̂_osp−2,

where the equivalence between x̂_jls and x̂_osp−2 is an interesting observation that, to the best of our knowledge, has never been directly reported before.

Differential signal processing
In this subsection, we would like to examine differential approaches. This method firstly selects a reference and then removes the impact of u by taking differences between the observations and the reference. To be specific, if the jth observation y_j is selected as the reference, a new differential observation set can be constructed as

d_j = [y_1 − y_j, · · · , y_{j−1} − y_j, y_{j+1} − y_j, · · · , y_N − y_j]^T,  (11)

where the size of the observation set is reduced to N − 1 since j is fixed for every element of d_j (throughout, 1 denotes the all-one matrix, with sizes mentioned in subscript if needed). This type of observation set is very popular and has wide applications in source localization and many other areas. Clearly, it can only be used to remove a single nuisance parameter in the case G = 1_{N×1}. One may also suggest selecting the average of the observations as the reference [16, eq. (28)], thus leading to another kind of differential observation set, given by

d_avg = y − (1/N) 1_{N×N} y = P⊥_{1_{N×1}} y,  (13)

where P⊥_{1_{N×1}} = I_N − (1/N) 1_{N×N}. Sometimes, the use of this type of observation set to eliminate the nuisance parameters can be implicit [4], i.e., taking the average of the observations is not clearly pointed out. However, this case can obviously be linked to the OSP-based estimation with a single nuisance parameter in the case G = 1_{N×1}. Therefore, we are more interested in the simple differencing process of (11), where the reference index j seems to play a significant role.
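For the single-nuisance case G = 1_{N×1}, both differencing operations are easy to write down explicitly; the sketch below (with arbitrary synthetic observations) builds the single-reference differencing matrix implementing (11) and the averaging projector of (13):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 6
y = rng.standard_normal(N) + 5.0          # observations sharing one common offset (G = 1_N)

# Single-reference differencing (reference j): drop row j, subtract y_j from the rest.
j = 2
Delta_j = np.delete(np.eye(N), j, axis=0)
Delta_j[:, j] = -1.0                      # rows are e_i^T - e_j^T, i != j
d_j = Delta_j @ y                         # N - 1 differences, all w.r.t. y_j

# Average-reference differencing: subtract the mean, i.e. apply P_perp of 1_N.
P1 = np.eye(N) - np.ones((N, N)) / N
d_avg = P1 @ y
```

Both operators annihilate the all-one vector, so the common offset drops out in either case.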
As already pointed out, (11) only eliminates one nuisance parameter. Nevertheless, we would like to extend this to tackle multiple nuisance parameters, i.e., we would like to relax the constraint G = 1 N×1 to rank(G) = M ≥ 1. The idea we will adopt here is based on eliminating the impact of the nuisance parameters one by one, which requires M differencing steps.
To achieve that, we write G = [g_1, · · · , g_M] with g_k the kth column vector of G related to the kth nuisance parameter u_k (1 ≤ k ≤ M). Thus, our original model in (1) can be rewritten as

y = H x + Σ_{k=1}^{M} g_k u_k + n.  (14)

We then eliminate the nuisance parameters recursively in the order u_1, · · · , u_M, although the explicit ordering is not important. At the kth iteration, when k − 1 nuisance parameters have already been canceled, the observation vector containing the remaining nuisance parameters can be written as

d^(k−1) = H^(k−1) x + Σ_{i=k}^{M} g_i^(k−1) u_i + n^(k−1),  (15)

where the superscript (·)^(k−1) indicates the variables after k − 1 differencing steps. We also assume that, for k = 1, d^(0) = y and similarly H^(0) = H, g_k^(0) = g_k, and n^(0) = n.
To cancel u_k, we first notice that some elements of g_k^(k−1) might be zero, i.e., u_k has no impact on the corresponding observations in d^(k−1), and hence these observations should not be involved in the differencing process at this iteration. Without loss of generality, we assume that the first K elements of g_k^(k−1) are zero, where 1 ≤ K ≤ N − k − 1 (there should be at least two non-zero elements for executing the differencing process). Then, among the remaining observations impacted by u_k, we select the jth element as the reference, K + 1 ≤ j ≤ N − k + 1, and perform the differencing step

d^(k) = Δ^(k) d^(k−1),  (16)

where the (N − k) × (N − k + 1) differencing matrix Δ^(k) keeps the first K entries untouched and subtracts a scaled reference from the others, i.e., its rows are of the form e_i^T − (g_{i,k}^(k−1) / g_{j,k}^(k−1)) e_j^T for i ≠ j, with e_i the ith canonical basis vector and g_{i,k}^(k−1) the ith element of g_k^(k−1), and obviously Δ^(k) g_k^(k−1) = 0. Accordingly, the new differential observation vector d^(k) can be formulated as

d^(k) = H^(k) x + Σ_{i=k+1}^{M} g_i^(k) u_i + n^(k),  (18)

where H^(k) = Δ^(k) H^(k−1), g_i^(k) = Δ^(k) g_i^(k−1), n^(k) = Δ^(k) n^(k−1), and u_k has been canceled.
We can see that (18) is similar to (15) with k −1 replaced by k. So it is clear that this recursive process can remove all nuisance parameters. Note that the number of zero values K as well as the reference index j could be different in every step, but for simplicity, we use the same notation in every step.
To understand the interaction of the successive differencing steps, let us introduce the total differencing operator Δ ≜ Δ^(M) · · · Δ^(1), where obviously rank(Δ^(k) Δ^(k−1) · · · Δ^(1)) = rank(Δ^(k)) = N − k and hence Δ has full row rank. Since it is clear that ΔG = 0, the final differential observation vector d^(M) can be expressed as

d^(M) = Δ y = Δ H x + Δ n,  (19)

where the covariance matrix of Δn is σ² Δ Δ^T. Observe that the model noise has become correlated ever since the first step of the differencing process. Therefore, we need to whiten the model in (19) as

P Δ y = P Δ H x + P Δ n,  (20)

where the unknown σ² is canceled out on both sides of the equation and P ≜ (Δ Δ^T)^{−1/2}, which exists since Δ has full row rank. Note that P, as well as Δ and d^(k), depends on the reference indices j that have been chosen in the successive differencing steps, although this has not been explicitly stated.
Applying the LS criterion, the corresponding optimization problem is now obtained as

x̂_d = arg min_x ||P Δ y − P Δ H x||²_2,  (21)

which leads to the following BLUE for model (19)

x̂_d = (H^T Δ^T (Δ Δ^T)^{−1} Δ H)^{−1} H^T Δ^T (Δ Δ^T)^{−1} Δ y.  (22)

Finally, to prove the equivalence of the estimate x̂_d to the previous estimates, i.e., to prove that

x̂_d ≡ x̂_osp−1 ≡ x̂_osp−2 ≡ x̂_jls,  (23)

we need to establish the relation (P Δ)^T (P Δ) = Δ^T (Δ Δ^T)^{−1} Δ = U_n U_n^T = P⊥_G. To do that, we first recall that ΔG = 0 and that Δ has full row rank. Hence, Δ can always be written as Δ = Q U_n^T, where Q is an (N − M) × (N − M) invertible matrix and U_n has already been defined before as a basis that spans the null space of G. The proof is completed by computing

Δ^T (Δ Δ^T)^{−1} Δ = U_n Q^T (Q Q^T)^{−1} Q U_n^T = U_n U_n^T = P⊥_G,

where we surprisingly notice that, even though P and Δ are subject to possibly different reference indices j, there is no trace of any j in (P Δ)^T (P Δ) and hence in x̂_d.
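The whole construction can be verified numerically. The sketch below (with random H, G, and y, and arbitrarily chosen reference indices) builds Δ step by step, forms the whitened BLUE of (22), and checks that two different reference choices give the same estimate, which also coincides with the OSP estimate:

```python
import numpy as np

rng = np.random.default_rng(4)
N, L, M = 8, 2, 2
H = rng.standard_normal((N, L))
G = rng.standard_normal((N, M))
y = H @ rng.standard_normal(L) + G @ rng.standard_normal(M) + 0.01 * rng.standard_normal(N)

def diff_step(g, j):
    """One differencing step: rows e_i^T - (g_i / g_j) e_j^T (i != j), so that D @ g = 0."""
    D = np.delete(np.eye(len(g)), j, axis=0)
    D[:, j] = -np.delete(g, j) / g[j]
    return D

def total_diff(G, refs):
    """Compose one differencing step per nuisance column, with the given reference indices."""
    Delta, Gk = np.eye(G.shape[0]), G.copy()
    for k, j in enumerate(refs):
        D = diff_step(Gk[:, k], j)
        Delta, Gk = D @ Delta, D @ Gk
    return Delta

def blue_diff(Delta, H, y):
    """BLUE for d = Delta H x + Delta n, weighted by (Delta Delta^T)^{-1} as in (22)."""
    W = np.linalg.inv(Delta @ Delta.T)
    AtW = H.T @ Delta.T @ W
    return np.linalg.solve(AtW @ Delta @ H, AtW @ Delta @ y)

Delta = total_diff(G, [0, 0])                  # one choice of references
x_d1 = blue_diff(Delta, H, y)
x_d2 = blue_diff(total_diff(G, [5, 3]), H, y)  # a different choice of references

# OSP estimate for comparison.
P_perp = np.eye(N) - G @ np.linalg.solve(G.T @ G, G.T)
x_osp = np.linalg.solve(H.T @ P_perp @ H, H.T @ P_perp @ y)
```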

A Simple Illustrative Case:
We would like to demonstrate these three different methods, particularly the differential signal processing, with a simple example. Given N = 3 samples, we only assume a single parameter of interest (L = 1), but with two linear nuisance parameters (M = 2). We also know that H = [3, 6, 7]^T and that G is a known 3 × 2 matrix with full column rank. The estimates in (6) and (9) can easily be carried out and proved to be equal to x̂_jls. We will not present more details for simplicity but particularly focus on the differential method. Since there exist two linear nuisance parameters, it takes two steps to eliminate all of them:
1. In the first step (k = 1), we arbitrarily select the third element of y as the reference (j = 3). Splitting G by columns as G = [g_1, g_2], the differencing matrix Δ^(1) is constructed such that Δ^(1) g_1 = 0, and, according to (16), the new differential observation vector is obtained as d^(1) = Δ^(1) y, where g_2^(1) = Δ^(1) g_2 corresponds to the remaining nuisance parameter u_2.
2. In the second step (k = 2), the first element of d^(1) is selected as the reference (j = 1). The differential observation becomes a scalar, d^(2) = Δ^(2) d^(1), with Δ^(2) g_2^(1) = 0.
With a known Δ = Δ^(2) Δ^(1), we can easily whiten the model in (19) and obtain the differential estimator in (22). Moreover, the equivalence of the differential estimation can also be proved by observing Δ^T (Δ Δ^T)^{−1} Δ = P⊥_G.
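The two differencing steps above can be reproduced numerically. Since the entries of the paper's 3 × 2 matrix G are not repeated here, the sketch below uses an illustrative full-column-rank G of the same size (the equivalences hold for any such choice), together with noiseless data so that the estimate recovers the true x exactly:

```python
import numpy as np

# NOTE: the entries of G below are illustrative placeholders; the paper fixes a
# specific 3 x 2 matrix that is not reproduced here.
H = np.array([[3.0], [6.0], [7.0]])
G = np.array([[1.0, 2.0],
              [1.0, 4.0],
              [1.0, 7.0]])
y = H @ np.array([2.0]) + G @ np.array([0.5, -1.0])   # noiseless, x = 2

# Step k = 1: reference j = 3 (index 2), cancel u_1 with rows e_i^T - (g_i/g_j) e_j^T.
g1 = G[:, 0]
D1 = np.delete(np.eye(3), 2, axis=0)
D1[:, 2] = -np.delete(g1, 2) / g1[2]
# Step k = 2: reference j = 1 (index 0), cancel the remaining column of D1 @ G.
g2 = (D1 @ G)[:, 1]
D2 = np.delete(np.eye(2), 0, axis=0)
D2[:, 0] = -np.delete(g2, 0) / g2[0]

Delta = D2 @ D1                     # total differencing operator, Delta @ G = 0
W = np.linalg.inv(Delta @ Delta.T)  # whitening weight (Delta Delta^T)^{-1}
AtW = H.T @ Delta.T @ W
x_d = np.linalg.solve(AtW @ Delta @ H, AtW @ Delta @ y)
```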

Discussion
We have studied estimation problems in the presence of deterministic linear nuisance parameters based on a general model. Therefore, all the conclusions drawn in this paper are applicable to any optimization problem with a data model that matches our general model (1). The equivalences between the BLUEs of the joint estimation, the OSP-based estimation, and the differential estimation are summarized in Table 1 and also in Fig. 1. Some interesting observations from these equivalences are listed below:

Table 1 Summary of the equivalent BLUEs
Joint model (1): joint estimator in (2)^a
OSP model (4): OSP estimator type I in (6)
OSP model (7): OSP estimator type II in (9), with U_n U_n^T = P⊥_G
Differential model in (19), or the whitened one in (20): differential estimator in (22), with (P Δ)^T (P Δ) = P⊥_G
^a [I_L 0_{L×M}] is used for extracting x̂_jls

1. The joint estimation has to estimate both x and the nuisance term u, while the other two estimation approaches remove the impact of u before estimating x.
2. For the OSP-based estimation, in order to remove the impact of u, using P⊥_G actually colors the noise, but using U_n^T keeps the model noise white. Interestingly though, the corresponding LS estimates for those two models are theoretically equivalent and hence they are both the BLUE.
3. In many applications, differential processing is commonly considered as a separate and independent approach. But, in this paper, we have generally proven its equivalence to the joint estimation and the OSP-based estimation. The differential approach removes the impact of the nuisance parameters by taking differences between the reference and the observations. If one of the observations is selected as a reference, the obtained differential observation set has to be properly whitened in order to obtain the BLUE for this model.
4. From an information theoretic perspective, the joint estimation, which directly utilizes the observations y, preserves the full data information, and any preprocessing on the observations might cause an information loss.
However, in this paper, all the other BLUEs have been proven to be equivalent to the BLUE of the joint estimation, which implies that neither the OSP-based estimation nor the differential estimation suffers any information loss by removing the impact of the nuisance parameters.
5. It is also worth noting that, for the differential approach, selecting which observation will function as a reference is not important, since the reference index j has no impact on the BLUE. This is in sharp contrast to what is commonly considered in the literature.
6. One might notice that, in the differencing process, N observations can generate a maximum of N(N − 1)/2 distinct observation differences. In contrast, we only study the estimation problem based on a subset, which is associated with a single reference and corresponds to N − 1 observation differences. However, from the above conclusions, it is clear that the considered subset already preserves all the information (independent of the reference), which implies that the full set obtains no more information than any subset does. This is also a novel observation.

Fig. 1 Diagram to illustrate the relations between the BLUEs related to the joint estimation, the OSP-based estimation, and the differential estimation. Note that the noise n is not necessarily Gaussian distributed and the operator [I_L 0_{L×M}] is used to extract the first L elements of a vector, i.e., x̂_jls

Localization examples
By studying the relations between the BLUEs of the joint estimation, the OSP-based estimation, and the differential estimation, the essence of this paper is to provide an in-depth understanding of how to cope with unknown linear nuisance parameters. Some important underlying equivalences have been unveiled, especially the one related to the differential method, since, in many applications, this approach is still considered as a separate optimization problem. Owing to the generality of this paper, one may easily apply our analyses and conclusions to particular applications, provided the data model can be (re)formulated to match our general model (1). Some specific localization examples are detailed next.

Time-based localization
Both TOA- and TDOA-based localization are called time-based localization [2], since they both rely on time measurements (either the global time or the local time). The essence of this kind of localization problem is how to accurately extract distance-related information (e.g., the time of flight (TOF)). Directly using TOA measurements requires not only perfect clock synchronization between the emitters and the receivers but also knowledge of the transmit time [33]. In cooperative networks, where clock synchronization is frequently carried out (because the inner clock might drift over time) and the transmit times are also piggybacked onto the transmitted signals, one can precisely calculate the TOFs from the TOA measurements and then localize the target node. However, it is often very expensive to meet those requirements, and most networks are constrained by limited resources and capabilities. Therefore, in most cases, sensors suffer from two linear nuisance parameters, i.e., the unknown clock biases w.r.t. the global time and the unknown transmit times.
In this example, we assume N anchor nodes that are perfectly synchronized with the global time, and there exists only a clock bias in the target node, which broadcasts beacon signals at unknown local transmit times. We denote x_t ∈ R^d as the target location and s_i ∈ R^d as the ith anchor location. For convenience, a single unknown global transmit time t_0 is considered for the target node, instead of the local transmit time plus the clock bias. Taking the speed of light c into account, we obtain the TOA measurements (scaled to distances) as

d = r(x_t) + r_0 1_{N×1} + n,  (24)

where the element d_i of d indicates the TOA measurement from the ith anchor, r(x_t) stacks the distances r_i ≜ ||x_t − s_i||_2, r_0 ≜ c t_0, and n is the vector of the measurement noise n_i with n ∼ N(0, σ² I_N). Note that, compared with more realistic scenarios, the model (24) is simplified for convenience but still adequate to make our point.

Taylor series expansion
Obviously, the non-linearity of (24) is a serious issue for localization, on top of the nuisance parameter. Many methods, especially those considering mobile scenarios, directly linearize (24) by a Taylor series expansion (TSE) [34]. Note that this kind of method is very similar to the Gauss-Newton (GN) method [35] and holds the maximum likelihood (ML) property. Since we obtain the estimate of x_t by iteratively updating the previous iteration, we first apply the TSE to (24) around the target location estimate x̂_t^(k−1) of the (k − 1)th iteration, i.e.,

r(x_t) ≈ r(x̂_t^(k−1)) + J^(k−1) (x_t − x̂_t^(k−1)),

where J^(k−1) is the Jacobian of r(·) evaluated at x̂_t^(k−1), with rows (x̂_t^(k−1) − s_i)^T / ||x̂_t^(k−1) − s_i||_2. Then, we rearrange the above equation and present the TSE model for iteration step k as

d − r(x̂_t^(k−1)) + J^(k−1) x̂_t^(k−1) = J^(k−1) x_t + r_0 1_{N×1} + n.  (25)

The localization problem at the kth iteration boils down to estimating x_t from (25) to update the location estimate from the (k − 1)th iteration. The relation between the TSE model and the general model (1) is presented in Table 2. Note that, since the discussed approaches can directly be applied to the TOA measurements with a single nuisance parameter (M = 1), the differential approach applied to the TOA measurements actually corresponds to working with the TDOA measurements, i.e., the entries d_i − d_j. However, to avoid any confusion with the TDOA methods we will discuss later on, we will refer to this method as the differential approach applied to the TSE model of the TOA measurements.

Table 2 Relations between the general model (1) and the considered time-based and RSS-based localization models^a
^a All the considered models must be white or whitened, i.e., the covariance of the model noise should be a (scaled) identity
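One TSE/GN iteration for the model above can be sketched as follows; the anchor positions are taken from the paper's simulation setup, while the target location, r_0, the initialization, and the noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
# Anchors as in the paper's simulation; target, r0, and noise level are assumptions.
S = np.array([[12.0, 43.0], [12.0, 33.0], [12.0, 16.0], [37.0, 33.0], [37.0, 16.0]])
x_t = np.array([20.0, 25.0])
r0 = 7.0                                      # r0 = c * t0, in meters
d = np.linalg.norm(x_t - S, axis=1) + r0 + 0.01 * rng.standard_normal(len(S))

x_hat = np.array([25.0, 30.0])                # rough initialization
for _ in range(10):
    r = np.linalg.norm(x_hat - S, axis=1)
    J = (x_hat - S) / r[:, None]              # Jacobian of r(x) at the current estimate
    # TSE model (25): d - r(x_hat) + J x_hat = J x + r0 * 1 + n; solve jointly in (x, r0).
    A = np.hstack([J, np.ones((len(S), 1))])
    b = d - r + J @ x_hat
    sol = np.linalg.lstsq(A, b, rcond=None)[0]
    x_hat, r0_hat = sol[:2], sol[2]
```

Here the nuisance parameter r_0 is handled by joint estimation; by the equivalences of Section 2, projecting it out or differencing it away would give the same iterate.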

Squared distance
The TSE method relies heavily on an appropriate initialization that is near the global solution; otherwise, it might converge to a local minimum. Thus, some closed-form solutions were proposed to solve this non-convex problem, which require squaring the distance norm (SD) for linearization [36]. Unlike the TSE method, the SD method depends on the type of measurements, since different modeling steps are carried out for TOA and TDOA measurements.
TOA: Let us first focus on the SD method based on the TOA measurements, which can be expressed element-wise as

d_i = r_i + r_0 + n_i.  (27)

Moving r_0 to the other side and squaring both sides of the equation, we obtain

(d_i − r_0)² = r_i² = ||x_t||²_2 − 2 s_i^T x_t + ||s_i||²_2,  (28)

where r_0² is viewed as a new nuisance parameter. As a result, a linear model with two nuisance parameters (M = 2) can be formulated as

b_1 = A_1 θ_1 + ε_1,  (29)

where b_1 stacks the known terms d_i² − ||s_i||²_2, A_1 is the corresponding known coefficient matrix, θ_1 contains x_t and the (nuisance) parameters, and ε_1 ≈ 2 D_1 n is the model noise after ignoring the second-order terms n_i². Here, we denote D_1 = diag([r_1, · · · , r_N]^T), with diag(·) a diagonal matrix with its argument on the diagonal, and hence the covariance matrix of ε_1 is 4σ² D_1². This SD-TOA model is widely considered [37][38][39][40][41]. Some researchers apply the differencing process to remove the nuisance parameters [24,33,42,43,44,45] while some others use the OSP method [16,46]. Note that the model noise ε_1 in (29) is still not white, and hence, an appropriate whitening procedure is required. Assuming D_1 is perfectly known, we can whiten the model (29) as

D_1^{−1} b_1 = D_1^{−1} A_1 θ_1 + D_1^{−1} ε_1,

where the covariance matrix of D_1^{−1} ε_1 is now a scaled identity, i.e., 4σ² I_N. In practice, a LS estimate based on the model (29) can first be used to construct an estimate of D_1 for carrying out the whitening. Then, the estimate of D_1 can be repeatedly updated to approach the true D_1 with a more accurate location estimate. In this paper though, we only want to evaluate its best performance and hence we directly use the true D_1. Finally, expressing A_1 = [Ā_1, Ã_1] with Ā_1 and Ã_1, respectively, containing the first d and the remaining columns, the relation between the whitened SD-TOA model and the general model (1) is presented in Table 2.
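A sketch of the whitened SD-TOA estimator is given below. For identifiability in this standalone snippet, ||x_t||²_2 and r_0² are merged into the single unknown w = ||x_t||²_2 − r_0² (their coefficient columns are collinear in a plain LS formulation), which is a slight reparameterization of θ_1; anchors follow the paper's simulation setup, while the target, r_0, and the noise level are assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
# Anchors as in the paper's simulation; target, r0, and noise level are assumptions.
S = np.array([[12.0, 43.0], [12.0, 33.0], [12.0, 16.0], [37.0, 33.0], [37.0, 16.0]])
x_t = np.array([20.0, 25.0])
r0 = 7.0
r = np.linalg.norm(x_t - S, axis=1)
d = r + r0 + 0.01 * rng.standard_normal(len(S))

# SD-TOA: d_i^2 - ||s_i||^2 = -2 s_i^T x + 2 d_i r0 + w + eps_i, eps_i ~ 2 r_i n_i,
# with w = ||x||^2 - r0^2 merging the two quadratic terms into one unknown.
b = d**2 - np.sum(S**2, axis=1)
A = np.hstack([-2 * S, (2 * d)[:, None], np.ones((len(S), 1))])

# Whitening with the (here, true) D1 = diag(r_i): scale every equation by 1 / r_i.
theta = np.linalg.lstsq(A / r[:, None], b / r, rcond=None)[0]
x_sd, r0_sd = theta[:2], theta[2]
```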
TDOA: Directly applying the differencing process on the TOA observations d removes the unknown nuisance parameter r_0, resulting in the TDOA measurements

d_{i,j} = d_i − d_j = r_i − r_j + n_{i,j}, i ≠ j,  (32)

where n_{i,j} = n_i − n_j. Introducing r_j = ||x_t − s_j||_2 as a new unknown parameter, we can linearize (32) using the following squaring operation

(d_{i,j} + r_j)² = r_i².  (33)

As a result, a linear model with a single unknown nuisance parameter r_j (M = 1) can be formulated as

b_2 = A_2 θ_2 + ε_2,  (34)

where b_2 stacks the known terms, A_2 is the corresponding known coefficient matrix, θ_2 contains x_t and r_j, and ε_2 ≈ 2 D_2 Δ_j n denotes the model noise after dropping the second-order terms n_{i,j}². Here, we denote D_2 = diag([· · · , r_i , · · ·]^T), i ≠ j, with Δ_j the single-reference differencing matrix associated with (32), and hence the covariance matrix of ε_2 is 4σ² D_2 Δ_j Δ_j^T D_2^T. Also, this SD-TDOA model has been commonly adopted in the literature [14,33,47,48,49,50,51]. Among the TDOA localization techniques based on this model, the famous Chan algorithm [14], from which many others stem, is actually equivalent to some earlier works [52][53][54], where the unknown r_j is simply removed by the OSP method. Again, note that the model noise ε_2 in (34) is not white. Assuming D_2 is perfectly known (as already explained for D_1, in practice, D_2 should be iteratively estimated), we can whiten the model (34) as

D̄_2 b_2 = D̄_2 A_2 θ_2 + D̄_2 ε_2,

where D̄_2 ≜ (D_2 Δ_j Δ_j^T D_2^T)^{−1/2} and the covariance matrix of D̄_2 ε_2 is now a scaled identity, i.e., 4σ² I_{N−1}. Finally, we split A_2 into A_2 = [Ā_2, Ã_2] with Ā_2 and Ã_2, respectively, containing the first d and the remaining columns. The relation between the whitened SD-TDOA model and the general model (1) is finally presented in Table 2.
Numerical results: We have conducted a Monte Carlo simulation with 1000 trials to verify our conclusions, where the BLUEs of the joint estimation, the OSP-based estimation, and the differential estimation are carried out for each one of the discussed time-based models. Some LS estimators without a proper whitening process are also presented for comparison. The acronyms of all estimators used in the simulations are summarized in Table 3. We also calculate the Cramér-Rao lower bound (CRLB) with an unknown r_0 based on the original model (24) [1, Chapter 3], since the TSE, SD-TOA, and SD-TDOA models all lose some information by ignoring some higher-order terms. The root mean square error (RMSE) of the location estimate, defined as √(E[||x̂_t − x_t||²_2]), is used as a performance measure in this paper. From the numerical results in Fig. 2, we can draw the following conclusions.
1. For each model, the corresponding BLUEs yield the same performance, as expected.
2. Without a proper whitening, the performance of the LS estimators deteriorates. The D-LS-TSE-TOA, J-LS-SD-TOA, and J-LS-SD-TDOA clearly perform worse than their corresponding BLUEs.
3. The TSE model ignores the higher-order terms of the expansion (beyond the first order in x_t − x̂_t^(k−1)) and accordingly suffers some information loss in modeling. However, the information loss can be reduced with a more accurate x̂_t^(k−1). Therefore, with more iterations, the BLUEs for the TSE model approach the CRLB, which is in fact the essence of the ML property.
4. The SD-TOA model ignores n_i², ∀i, while the SD-TDOA model ignores n_{i,j}², ∀i, i ≠ j. Ignoring these terms will cause an increasing information loss as the measurement noise gets larger.
5. Even though the BLUEs of the SD-TOA model outperform those of the SD-TDOA model in our simulation, we still cannot decide at this point which model is the best. This is because an optimal localization problem for the SD models should also include any dependence between the (nuisance) parameters, e.g., between x_t and ||x_t||²_2, between r_0 and r_0² in θ_1, or between x_t and r_j in θ_2, which explains the huge gap between the CRLB and the BLUEs for the SD models. By contrast, the TSE model obviously does not have this kind of issue. Nevertheless, including these dependencies is beyond the scope of this paper and we will not further consider it.
6. In practice, both the TSE and SD methods require iterations to obtain an accurate location estimate. However, note that, even after several iterations, the estimators based on the SD models still need to cope with the abovementioned dependency issue. Therefore, in real life, one often combines those two models, i.e., one uses the TSE model with the J-LS-SD-TDOA or the J-LS-SD-TOA as an initialization.
7. For the SD-TDOA model, ignoring the terms n_{i,j}², ∀i, i ≠ j, implies that the information loss depends on the reference choice of the differencing process in (32).
However, this is only because of the SD modeling thereafter, not because of the differencing process itself. Note that, for any other differencing process in this paper, the reference index is not important as long as the model is properly whitened.

Table 3 Acronyms of the estimators used in the simulations
J-BLUE-TSE-TOA, k = 1: joint estimation (2) applied to the TSE-TOA model
OSP-BLUE-TSE-TOA, k = 1: OSP-based estimation (6) or (9)
D-BLUE-TSE-TOA, k = 1: differential estimation (22)
D-LS-TSE-TOA, k = 1: LS estimator based on the unwhitened differential observations in (19)
J-LS-SD-TOA: LS estimator based on the unwhitened SD-TOA model (29)
Analogous acronyms are used for the remaining models.

Fig. 2 RMSE performance of the considered time-based estimators. The anchors are located at (12, 43), (12, 33), (12, 16), (37, 33), and (37, 16)

Received signal strength based localization
Due to the simplicity of utilizing received signal strength (RSS) measurements, wireless networks with very constrained resources preferably rely on RSS-based localization [2]. Therefore, it has gradually become very popular in recent years, and much effort has already been devoted to this topic [55][56][57][58]. RSS-based localization mainly suffers from the complicated radio propagation channel. As before, assume that the target node is located at x_t and the ith anchor at s_i. Based on a large-scale log-normal fading model [59], the RSS measurement can then be modeled as

P_i = P_0 − 10 γ log_10(||x_t − s_i||_2 / d_0) + n_i, i = 1, · · · , N,  (37)

where P_0 is the received power at the reference distance d_0, γ is the path-loss exponent (PLE), n_i ∼ N(0, σ²) is the shadowing effect, and N is the number of anchor nodes. RSS-based localization is aimed at estimating the target location x_t from the RSS measurements. However, in some military or hostile scenarios, the transmit power might be unknown. Therefore, without loss of generality, we assume the reference distance d_0 to be 1 m, such that the problem of the unknown transmit power can be equivalently converted into that of an unknown P_0. Note that (37) also has the non-linearity issue and, obviously, the iterative TSE model for RSS-based localization would be very similar to that developed for time-based localization. Therefore, to save space, we do not consider directly applying the TSE model but only focus on the SD method here.
To construct a linear data model, we rewrite (37) (with d_0 = 1 m) as

P̃_i = P̄_0 ||x_t − s_i||_2^{−2} η_i,  (38)

where P̃_i ≜ 10^{P_i/(5γ)}, P̄_0 ≜ 10^{P_0/(5γ)}, and η_i ≜ 10^{n_i/(5γ)}. Interestingly though, we still need to apply the TSE to η_i here, such that (38) can further be approximated as

P̃_i ||x_t − s_i||²_2 ≈ P̄_0 (1 + (ln 10 / (5γ)) n_i).  (39)

Then, a linear SD-RSS model for localization can be formulated from (39) as

h = F θ + ς,  (40)

where h stacks the known terms, F is the corresponding known coefficient matrix, θ contains x_t, ||x_t||²_2, and P̄_0, and ς denotes the model noise. This model was first presented in [57, eq. (18)] but in the absence of the shadowing effect. If we whiten the model (40) utilizing the covariance matrix of ς, i.e., by pre-multiplying with D = diag([P̃_1, · · · , P̃_N]^T), we obtain

D h = D F θ + D ς,

where the covariance matrix of Dς becomes a scaled identity matrix, i.e., (ln(10)² P̄_0² σ² / (25γ²)) I_N. Note that this whitening step simply corresponds to an appropriate scaling of every entry of (40).
The whitened model (43b) is found to match our general model (1), since DF can be split into two parts, where F̄ contains the first d + 1 columns of DF. The relation between this model and the general model (1) is presented in Table 2. Note that we only consider a single nuisance parameter P̄_0 in this model (M = 1). Although we could consider both ||x_t||²₂ and P̄_0 as nuisance parameters (M = 2), which would lead to the same performance after the proper preprocessing steps, we take M = 1 here to connect this model to the existing literature. For instance, after removing P̄_0 using a single differencing step, the resulting model is equal to the SD-DRSS model used in [57, eq. (22)]. However, without an appropriate whitening procedure, the LS estimators of the SD-RSS and SD-DRSS models yield a different performance, which is why they were treated and studied separately. We now realize that they are actually identical to each other as long as the model noise is properly whitened.
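As a quick numerical illustration of this whitening step (a toy sketch with made-up distances and constants, not the paper's simulation setup): the covariance of the linearized noise ς is diagonal with entries proportional to (P̄_0/P̄_i)², so scaling row i by P̄_i indeed turns it into a scaled identity.

```python
import numpy as np

gamma, sigma, P0 = 2.0, 1.0, 10.0            # PLE, shadowing std (dB), P_0 (dBm); toy values
d = np.array([8.0, 15.0, 23.0, 31.0, 42.0])  # assumed anchor-target distances (m)
Pbar0 = 10 ** (P0 / (5 * gamma))
Pbar = Pbar0 / d**2                          # noise-free transformed RSS measurements
c = np.log(10) / (5 * gamma)                 # first-order TSE constant

# Covariance of the linearized model noise: var(ς_i) = (c σ P̄_0 / P̄_i)²
Cov = (c * sigma * Pbar0 / Pbar) ** 2 * np.eye(len(d))
D = np.diag(Pbar)                            # whitening matrix: per-entry scaling
whitened = D @ Cov @ D
print(np.allclose(whitened, (c * sigma * Pbar0) ** 2 * np.eye(len(d))))  # → True
```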
Numerical results: A simulation has also been conducted to verify our conclusions for this example. As before, the BLUEs of the joint estimation, the OSP-based estimation, and the differential estimation for the SD-RSS model are evaluated and compared with some LS estimators without a proper whitening. Based on the original model in (37), the CRLB with an unknown P_0 is easy to calculate [1, Chapter 3]. From the numerical results in Fig. 3, the critical observation is that all the BLUEs yield exactly the same performance, as expected. Due to the colored model noise, the J-LS-SD-RSS and the D-LS-SD-RSS perform relatively worse. Finally, denoting R ≜ ||x_t||²₂, we again point out that neglecting the dependence between R and x_t results in the gap between the CRLB and the estimators presented here.
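The equivalence observed here can also be checked on a generic toy instance of the general model y = Hx + Gu + n (a minimal sketch with made-up values, assuming white noise and a single common-offset nuisance): joint LS, OSP-based LS, and the differential GLS, which accounts for the coloring introduced by differencing, return the same estimate of x.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 8, 2                       # number of observations / parameters of interest
H = rng.standard_normal((N, L))   # known model matrix for x
G = np.ones((N, 1))               # nuisance subspace: a single common offset u
x_true = np.array([1.0, -2.0])
y = H @ x_true + 5.0 * G.ravel() + 0.1 * rng.standard_normal(N)  # white noise

# (1) Joint LS over the stacked unknowns [x; u]
x_joint = np.linalg.lstsq(np.hstack([H, G]), y, rcond=None)[0][:L]

# (2) OSP: project both y and H onto the orthogonal complement of span(G)
P = np.eye(N) - G @ np.linalg.pinv(G)
x_osp = np.linalg.lstsq(P @ H, P @ y, rcond=None)[0]

# (3) Differential: subtract observation 0 as reference, then GLS,
#     since differencing colors the noise (its covariance becomes T T^T)
T = np.hstack([-np.ones((N - 1, 1)), np.eye(N - 1)])
W = np.linalg.inv(T @ T.T)
A, b = T @ H, T @ y
x_diff = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)

print(np.allclose(x_joint, x_osp), np.allclose(x_joint, x_diff))  # → True True
```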

Other examples
We believe that there are many other applications with linear nuisance parameters to which our results apply. However, due to the limited space, we only point out some of them. Besides the aforementioned localization examples, if anchors are separated into groups with different central clocks, multiple relative clock biases might exist in the TDOA measurements for localization, which can be removed by the OSP method [60, eq. (3)]. In cooperative localization, multidimensional scaling (MDS) also uses the OSP-based method to eliminate the unknown terms [61, eq. (3)]. An acoustic source localization model, which also matches our general model (1), was presented in [62, eq. (6)].
(Caption of Fig. 3: anchors located at (…, 43), (12, 33), (12, 16), (37, 33), and (37, 16); the transmit power is set to 10 dBm and the PLE is set to 2.)
In [4, eq. (2)], the transmission times and clock offsets are the unknown nuisance parameters of the considered clock synchronization problem. The authors claim that those unknown parameters are systematically ML-estimated before the synchronization; in fact, however, those nuisance parameters are equivalently removed, either by using the averaged observations d_avg in (13) or by the OSP procedure. In hyperspectral imaging, OSP is also a very common procedure to extract the desired signals [19]. When tracking mobile targets, frequency-difference-of-arrival measurements are often taken to cope with the Doppler effect [17,18,63,64]. Furthermore, multiple-input multiple-output (MIMO) receiver design might be affected by nuisance parameters like I-Q imbalance and DC offset [5, eq. (7)]. In machine learning, a well-designed OSP is desired for dimensionality reduction [8,9]. Extracting and working on the signal space is a strong need for signal separation [7] and underwater communication [6], which can be facilitated by OSP.
Finally, the famous differential global positioning system (DGPS) introduces a reference station on the ground and constructs a new differential observation set for positioning [65], where even a double differencing process is considered [66][67][68].

Conclusions
In this paper, we have introduced a general framework for estimation in the presence of unknown linear nuisance parameters. Three different kinds of methods to cope with the unknown nuisance parameters have been studied, i.e., the joint estimation, the OSP-based estimation, and the differential estimation. These approaches have been analyzed by investigating their corresponding BLUEs, where a new differential method has been introduced to cope with multiple nuisance parameters. We have discovered that, after a proper whitening procedure, all the BLUEs are equivalent to each other. From this interesting fact, one can draw some useful conclusions:
1. There exists only one unique BLUE for all the methods proposed to cope with unknown nuisance parameters.
2. Compared with the joint estimation, which directly utilizes all the original observations, neither of the other two methods suffers any information loss.
3. For the differential approach, which requires selecting some references, the choice of the references is unimportant, since no trace of the selected references remains in the corresponding BLUE.
4. In the differencing process, compared with the full differential observation set, any subset related to a single reference already preserves the full data information.
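Conclusions 3 and 4 can be illustrated with a short numerical sketch (toy data; diff_blue is our own illustrative helper): the single-reference differences for any reference span the same row space, so the resulting BLUE and its information matrix do not depend on which reference is chosen.

```python
import numpy as np

def diff_blue(y, H, ref):
    """BLUE of x from single-reference differences y_j - y_ref (white noise before differencing)."""
    N = len(y)
    T = np.delete(np.eye(N), ref, axis=0)  # rows e_j, j != ref
    T[:, ref] = -1.0                       # rows become e_j - e_ref
    W = np.linalg.inv(T @ T.T)             # inverse covariance of the differenced noise
    A, b = T @ H, T @ y
    info = A.T @ W @ A                     # information matrix (up to sigma^2)
    return np.linalg.solve(info, A.T @ W @ b), info

rng = np.random.default_rng(1)
N, L = 10, 3
H = rng.standard_normal((N, L))
y = H @ rng.standard_normal(L) + 4.2 + 0.05 * rng.standard_normal(N)  # 4.2: common nuisance offset

x0, I0 = diff_blue(y, H, ref=0)   # difference against observation 0
x7, I7 = diff_blue(y, H, ref=7)   # difference against observation 7
print(np.allclose(x0, x7), np.allclose(I0, I7))  # → True True
```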
The presented analyses of the general model can be specialized to many practical applications, e.g., hyperspectral imaging, source localization, and synchronization. Some localization examples have also been demonstrated, simulated, and discussed to verify our conclusions.