Distributed localization using Levenberg-Marquardt algorithm

In this paper, we propose a distributed algorithm for sensor network localization based on a maximum likelihood formulation. It relies on the Levenberg-Marquardt algorithm, where the computations are distributed among different computational agents using message passing, or equivalently dynamic programming. The resulting algorithm provides good localization accuracy, and it converges to the same solution as its centralized counterpart. Moreover, it requires fewer iterations and fewer communications between computational agents than first-order methods. The performance of the algorithm is demonstrated with extensive simulations in Julia, which show that our method outperforms distributed methods that are based on approximate maximum likelihood formulations.


Introduction
The problem we investigate in this paper is that of determining the locations of all sensors in a network, given noisy distance measurements between some sensors. It is usually assumed that the positions of some of the sensors, referred to as anchors, are known. This is called a localization problem. Specifically, we present a distributed algorithm that solves the maximum likelihood estimation problem for localization when the measurements are corrupted with Gaussian noise. Our algorithm is based on applying the Levenberg-Marquardt algorithm to the resulting nonlinear least-squares problem. It requires a good initialization, and we initialize it with an approximate estimate obtained from the algorithm proposed in [1], which is based on a convex relaxation of our nonlinear least-squares problem formulation.
Over the years there has been considerable interest in wireless sensor network localization using inter-sensor distance or range measurements [2][3][4]. Wireless sensor networks consist of small and inexpensive devices with low energy consumption and limited computing resources. Each sensor node comprises sensing, processing, transmission, and power units, some with mobilizers [5,6]. The applications are many, e.g., natural disaster relief, patient tracking, military targets, automated warehouses, weather monitoring, smart spaces, monitoring environmental information, detecting the source of pollutants, and mobile peer-to-peer computing, as well as underwater applications [7]. The information collected through a sensor network can be used more effectively if it is known where it is coming from and where it needs to be sent. Therefore, it is often very useful to know the positions of the sensor nodes in a network. Using the global positioning system for this is a very expensive solution [8]. Instead, techniques to estimate node positions are needed that rely only on measurements of distances and angles between neighboring nodes [2,9]. Deployment of a sensor network for these applications can be done in a random fashion, e.g., dropped from an airplane in a disaster management application, or manually, e.g., fire alarm sensors in a facility or sensors planted underground for precision agriculture [5].
Localization in this setting is mostly based on optimizing some cost function that depends on the model uncertainties. The most widely used technique is based on maximizing a likelihood function, known as maximum likelihood estimation, which in general is equivalent to a non-convex optimization problem of high dimensionality [10,11]. Both centralized and distributed algorithms have been used to solve the problem [12]. Centralized algorithms require that each sensor/agent sends its information to a central unit, where an estimate of the sensor positions can be computed using, for example, second-order optimization methods. The results are then sent back to the agents.
The disadvantage of these algorithms is that the processing in the central unit can be computationally heavy, especially when the number of sensors is large. Distributed algorithms overcome this obstacle. They enable us to solve the problem through collaboration and communication between several computational agents, which could correspond to the sensors, without the need for a centralized computational unit. The disadvantage of distributed algorithms is that they might not produce position estimates as accurate as those of centralized algorithms. Moreover, they might require excessive communication between the sensors, especially for large network sizes. The algorithm we propose in this paper lies between a centralized and a distributed algorithm in the sense that, instead of having one central unit as in centralized algorithms, the sensors are grouped together in a structured way, with one computational agent assigned to each group. The sensors send their measurements to their group's computational agent, and those agents in turn carry out the computations by communicating with one another. As a result, the computations in the proposed algorithm are not as heavy as in centralized algorithms, nor is the communication burden as intensive as in distributed algorithms in which all adjacent sensors communicate in order to find a solution to the localization problem.
Various techniques have been developed to distribute the computations. We will now survey different distributed methods for localization. First, we discuss approaches that are based on the original non-convex maximum likelihood formulation; they all solve the maximum likelihood problem exactly. The authors in [13] propose a distributed multidimensional scaling algorithm which minimizes multiple local nonlinear least-squares problems. Each local problem is solved using quadratic majorizing functions. In [14], the authors present two distributed optimization approaches, namely a distributed gradient method with Barzilai-Borwein step sizes, and a distributed Gauss-Newton approach. In [15], a decentralized algorithm is devised based on the incremental subgradient method. In [11], the authors propose a distributed alternating direction method of multipliers approach. To this end, an equivalent equality constrained version of the original nonlinear least-squares problem is considered by introducing duplicate variables in the optimization problem, which allows for a distributed solution. In [16], the authors reformulate the problem to obtain a gradient Lipschitz cost, which in turn enables them to propose a distributed algorithm based on a majorization-minimization approach. The main shortcoming of the surveyed approaches is that they take many iterations to converge, and hence are slow, since many communications are required to reach a solution. The reason is either that they are based on first-order optimization methods or, as is the case for the Gauss-Newton method, that a consensus algorithm is used in order to compute the search direction. Also, it is difficult to effectively initialize the algorithms, since the likelihood function might have several local maxima. The latter problem can be overcome by using some approximate algorithm for localization which is easy to initialize. The solution from this approximate method is then used to initialize the non-convex optimization problem solver. Good approximate problems can be obtained from convex relaxations of the maximum likelihood problem.
We will now continue the survey with methods based on convex relaxations of the maximum likelihood formulation. These are not only used for initialization of non-convex formulations, but are also of interest per se, assuming that the relaxation provides a good enough approximate localization. A good survey of semi-definite programming relaxation methods is given in [4]. The authors in [3] and [12] use the relaxation in [4] to devise distributed algorithms based on the alternating direction method of multipliers and second-order cone programming, respectively. A nice property of these algorithms is that they have convergence guarantees. This, however, comes at the cost of solving a semi-definite program at every iteration of the algorithm, which imposes a substantial computational burden. In [17], a distributed algorithm is proposed in which only linear systems of equations have to be solved at each iteration. The computations are distributed using message passing over a tree. Another way to decrease the computational cost at each iteration is to consider a disk relaxation of the localization problem instead of a semi-definite programming relaxation. Based on this idea, the authors in [1] and [18] devise distributed algorithms for solving the resulting problem which rely on projection based methods and Nesterov's optimal gradient method, respectively. In [19], the authors propose a hybrid approach based on the disk relaxation in [1] and a semi-definite programming relaxation, by fusing range and angular measurements. It should be stressed that the solutions from relaxation based methods are not global optima of the original problem. The quality of the solutions is highly dependent on how tight the relaxation is.
As mentioned above, solutions from relaxed formulations may be used to initialize the non-relaxed formulations. In [2], a semi-definite programming relaxation of the problem, combined with a regularization term, is used to initialize a gradient-descent method for solving the exact maximum likelihood problem. In [20], the authors propose a hybrid solution to the localization problem. To this end, they apply a distributed alternating direction method of multipliers approach in two stages. In the first stage, they use the disk relaxation formulation of the localization problem as in [1], and then there is a smooth transition to the second stage, where they use the original non-convex formulation as in [11].
Although the algorithm in [20] has a faster convergence rate than the one presented in [1], the number of communications per sensor is not significantly lower, and in addition there is an extra communication overhead because of the several duplicate variables which need to be passed among the sensors. Notice that the number of duplicate variables in each sensor is proportional to the number of sensors that a sensor can communicate with, which causes a considerable amount of computation and communication, especially if the network size is large. Also note that for the algorithms in both [1] and [20], the computations are distributed in such a way that each sensor has to carry out its own computations and exchange messages with adjacent sensors. The authors in [20] argue that this way of distributing the computations might require an excessive communication burden, especially for large network sizes. Because of this, they discuss the possibility of devising a regional reinterpretation of their algorithm, where the sensors are partitioned into regions and there is one computational agent per region which is responsible for carrying out the computations and exchanging messages with the adjacent computational agents. We will see later that in our proposed algorithm, we also distribute the computations in such a way that not every sensor has to be involved in the computations.
In this paper, we propose a distributed algorithm based on the Levenberg-Marquardt method [21], with a localization accuracy which is better than that of the algorithm in [1], but with much fewer communications per sensor/agent. The accuracy is better since we solve the maximum likelihood problem and not only an approximation of it. This shows that the claim in [19], that the algorithm in [1] has localization accuracy equal to that of the one in [20], is debatable. We use an approximate estimate obtained from the relaxation based algorithm presented in [1] as the initial starting point for our algorithm. Since the number of communications between agents in our algorithm is far smaller than for the algorithm presented in [1], our algorithm can be utilized on top of the algorithm in [1] in order to improve the estimate, achieving better accuracy with many fewer iterations than are used in [1]. Note that the algorithms in [1] and [20], which are based on Nesterov's gradient method and the alternating direction method of multipliers, respectively, are first-order methods, whereas the Levenberg-Marquardt algorithm is a pseudo-second-order method, as it uses approximate Hessian information. It is known that in general second-order methods require fewer iterations to converge than first-order methods. The reason is that second-order methods use both gradient and curvature information of the objective function, whereas first-order methods rely solely on the gradient. As a result, the number of communications between agents in the distributed Levenberg-Marquardt algorithm is expected to be lower than for the algorithms in [1] and [20].

Contributions
We propose a distributed algorithm for localization that solves the maximum likelihood localization problem to high accuracy using few communications and computations.

Example
In order to introduce the notation and to exemplify the results that will be derived, a simple one-dimensional example is considered. Relevant applications of this are, e.g., a metro line between two stations with anchors at the stations, or a mine tunnel with anchors at the intersections of tunnels. The anchors are positioned at $p_a^1$ and $p_a^2$, and the positions $p_s^j$, $j = 1, 2, 3, 4$, of the other sensors are such that $p_a^1 < p_s^1 < p_s^2 < p_s^3 < p_s^4 < p_a^2$. Moreover, we assume that each sensor can measure the distance to the adjacent sensors. We depict this in what is known as the inter-sensor measurement graph, shown to the left in Fig. 1. The nodes represent the sensors and anchors, and there is an edge between two nodes if they can measure the distance to one another.

Fig. 1 Inter-sensor measurement graph to the left and a corresponding clique tree to the right

Assume that we are given measurements $R_{ij}$ between sensors $i$ and $j$, and measurements $Y_{ij}$ between sensor $i$ and anchor $j$, with Gaussian measurement errors with zero mean and unit variance. Then the maximum likelihood problem of estimating the positions of the sensors is equivalent to minimizing
$$\big(Y_{11} - (p_s^1 - p_a^1)\big)^2 + \sum_{j=1}^{3} \big(R_{j,j+1} - (p_s^{j+1} - p_s^j)\big)^2 + \big(Y_{42} - (p_a^2 - p_s^4)\big)^2$$
with respect to $P = [p_s^1 \;\ldots\; p_s^4]$, which is a linear least-squares problem. The maximal subgraphs of the inter-sensor measurement graph in Fig. 1 which are complete, i.e., contain an edge from every node to every other node, are called cliques and are given by $C_1 = \{p_a^1, p_s^1\}$, $C_2 = \{p_s^1, p_s^2\}$, $C_3 = \{p_s^2, p_s^3\}$, $C_4 = \{p_s^3, p_s^4\}$, and $C_5 = \{p_s^4, p_a^2\}$. They can be arranged in a tree, as seen to the right in Fig. 1. This tree is called a clique tree. It is not unique, but it can always be arranged in such a way that any element in the intersection of two cliques is also an element of the cliques on the path between the two cliques. This is called the clique intersection property. It is not possible to derive a clique tree for a general inter-sensor measurement graph. For this to be possible, the graph has to be what is called chordal, i.e., any cycle of length four or more in the graph has a chord. We will discuss this in more detail later. Our example graph, however, is chordal.

It is now possible to solve the least-squares problem over this clique tree by using each of the cliques as computational agents. This is done by associating terms of the objective function with different cliques. A valid assignment is one where the variables of the terms assigned to a clique belong to the clique. Therefore, we assign $f_1(p_s^1) = (Y_{11} - (p_s^1 - p_a^1))^2$ to $C_1$, $f_k(p_s^{k-1}, p_s^k) = (R_{k-1,k} - (p_s^k - p_s^{k-1}))^2$ to $C_k$ for $k = 2, 3, 4$, and $f_5(p_s^4) = (Y_{42} - (p_a^2 - p_s^4))^2$ to $C_5$. Hence, the least-squares problem is equivalent to minimizing $f_1(p_s^1) + \sum_{k=2}^{4} f_k(p_s^{k-1}, p_s^k) + f_5(p_s^4)$. We then start with the leaf clique $C_5$ and its corresponding function $f_5(p_s^4)$ and minimize it with respect to the variables that are not shared with its parent. There is no such variable, and hence the minimization is not carried out. We then let $m_{54}(p_s^4) = f_5(p_s^4)$, which is called a message function or value function. This is added to the objective function corresponding to the parent of $C_5$, i.e., to $f_4(p_s^3, p_s^4)$. Notice that any quadratic function can be represented with a matrix and a vector, and hence this is the only information that has to be passed to the parent. We then again minimize the resulting function with respect to the variables that are not shared with its parent, i.e., $p_s^4$. Since the problem is convex and quadratic, this is equivalent to solving a linear equation, and after back substitution of the solution, the objective function value will be a quadratic function of $p_s^3$, which we denote by $m_{43}(p_s^3)$. We then add the message function to the objective function corresponding to the parent of $C_4$, i.e., to $f_3(p_s^2, p_s^3)$, and repeat the procedure until we reach the root clique. For the root clique $C_1$, we can now optimize $f_1(p_s^1) + m_{21}(p_s^1)$ with respect to the remaining variable $p_s^1$, where $m_{21}(p_s^1)$ is the message from the child clique. By passing this solution down the clique tree, the remaining optimal variables can be computed, assuming that the parametric solutions have been stored in the nodes of the clique tree.
The fact that the problem is convex and quadratic makes it easy to compute the messages. In general, this is not the case, but we will use this procedure not for the optimization problem itself but for computing the search directions in a nonlinear least-squares method, in particular the Levenberg-Marquardt method [21]. These equations are linear equations and correspond to a quadratic approximation of the problem at the current iterate of the Levenberg-Marquardt method. All other computations in the Levenberg-Marquardt method also distribute over the clique tree. What we are doing in this example is nothing but serial dynamic programming. In general, the clique tree will not be a chain, and then we will carry out dynamic programming or message passing over a tree; see [22] for details. The clique tree is not unique. For the example we can just as well make $C_5$ the root and $C_1$ a leaf. Moreover, we can take $C_3$ as the root and get two branches with $C_1$ and $C_5$ as leaves. This facilitates parallel computations.
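The message passing procedure above can be illustrated with a small numerical sketch. The following Python snippet (illustrative only; the paper's simulations are in Julia, and the data here are hypothetical noise-free measurements) solves the one-dimensional chain example by eliminating variables from the leaf toward the root and back-substituting, and compares the result with a centralized least-squares solve.

```python
import numpy as np

# Hypothetical noise-free data for the 1D chain example: anchors at 0 and 10,
# sensors at 2, 4, 6, 8. R[j] is the measurement between sensors j and j+1;
# Y1 and Y2 are the anchor-sensor measurements.
pa1, pa2 = 0.0, 10.0
R = [2.0, 2.0, 2.0]
Y1, Y2 = 2.0, 2.0

# Linear least-squares system A p = b for p = (p1, p2, p3, p4):
# p1 - pa1 = Y1,  p_{j+1} - p_j = R[j],  pa2 - p4 = Y2.
A = np.array([[ 1.,  0.,  0.,  0.],
              [-1.,  1.,  0.,  0.],
              [ 0., -1.,  1.,  0.],
              [ 0.,  0., -1.,  1.],
              [ 0.,  0.,  0., -1.]])
b = np.array([pa1 + Y1, R[0], R[1], R[2], Y2 - pa2])

p_central = np.linalg.lstsq(A, b, rcond=None)[0]   # centralized reference

# Message passing (serial dynamic programming) over the chain: the normal
# equations H p = g are tridiagonal; eliminate p4, p3, p2 from the leaf
# toward the root (each elimination is a scalar Schur complement, i.e. a
# quadratic "message" to the parent), then back-substitute down the tree.
H, g = A.T @ A, A.T @ b
n = len(g)
stored = []                                        # parametric solutions
for k in range(n - 1, 0, -1):
    c0, c1 = g[k] / H[k, k], -H[k, k - 1] / H[k, k]  # p_k = c0 + c1 * p_{k-1}
    stored.append((k, c0, c1))
    H[k - 1, k - 1] -= H[k - 1, k] ** 2 / H[k, k]    # message to the parent
    g[k - 1] -= H[k - 1, k] * g[k] / H[k, k]
p = np.zeros(n)
p[0] = g[0] / H[0, 0]                              # root solves a scalar problem
for k, c0, c1 in reversed(stored):                 # pass the solution back down
    p[k] = c0 + c1 * p[k - 1]
```

Both approaches recover the same sensor positions; only the coefficients of the quadratic messages, not raw measurements, need to travel between cliques.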

Outline
In Section 2, we review the maximum likelihood formulation of the localization problem. In Section 3, we discuss how to find the clique tree and how to assign subproblems, in order to distribute the computations for a general optimization method. In Section 4, we review the Levenberg-Marquardt algorithm for solving nonlinear least-squares problems. In Section 5, we discuss how to distribute the computations in the Levenberg-Marquardt algorithm using the clique tree. Numerical experiments are presented in Section 6, and we conclude the paper in Section 7.

Notations and definitions
We denote by $\mathbb{R}$ the set of real scalars and by $\mathbb{R}^{n \times m}$ the set of real $n \times m$ matrices. The transpose of a matrix $A$ is denoted by $A^T$. We denote the set of positive integers $\{1, 2, \ldots, p\}$ by $\mathbb{N}_p$. With $x_i$, we denote the $i$th component of the vector $x$. For a square matrix $X$, we denote by $\mathrm{diag}(X)$ the vector whose elements are the diagonal elements of $X$.
A graph is denoted by $G(V, E)$, where $V = \{1, \ldots, n\}$ is its set of vertices or nodes and $E \subseteq V \times V$ denotes its set of edges. Vertices $i, j \in V$ are adjacent if $(i, j) \in E$, and we denote the set of adjacent vertices of $i$ by $\mathrm{Ne}(i) = \{j \in V \mid (i, j) \in E\}$. A graph is said to be complete if all its vertices are adjacent. A clique of $G(V, E)$ is a maximal subset of $V$ that induces a complete subgraph, i.e., no clique is properly contained in another clique [23]. Assume that all cycles of length at least four of $G(V, E)$ have a chord, where a chord is an edge between two non-consecutive vertices in a cycle. Such a graph is called chordal [24, Ch. 4]. It is possible to make a graph chordal by adding edges to it; the resulting graph is then referred to as a chordal embedding. Let $C_G = \{C_1, \ldots, C_q\}$ denote the set of cliques of a chordal graph, where $q$ is the number of cliques. Then there exists a tree defined on $C_G$ such that for every $C_i, C_j \in C_G$ with $i \neq j$, $C_i \cap C_j$ is contained in all the cliques on the path connecting the two cliques in the tree. This property is called the clique intersection property [23]. Trees with this property are referred to as clique trees.
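The graph notions above can be made concrete with a small sketch. The following self-contained Python code (an illustration, not part of the paper) checks chordality via maximum cardinality search, extracts the cliques of a chordal graph from a perfect elimination ordering, and builds a clique tree as a maximum-weight spanning tree over clique intersections, which is a standard way to obtain the clique intersection property.

```python
def mcs_order(adj):
    """Maximum cardinality search: returns an ordering whose reverse is a
    perfect elimination ordering if and only if the graph is chordal."""
    weight = {v: 0 for v in adj}
    order = []
    while weight:
        v = max(weight, key=weight.get)
        order.append(v)
        del weight[v]
        for u in adj[v]:
            if u in weight:
                weight[u] += 1
    return order[::-1]

def is_chordal(adj):
    """Check that the MCS ordering is a perfect elimination ordering."""
    peo = mcs_order(adj)
    pos = {v: i for i, v in enumerate(peo)}
    for v in peo:
        later = [u for u in adj[v] if pos[u] > pos[v]]
        if later:
            u = min(later, key=pos.get)          # earliest later neighbor
            if any(w != u and w not in adj[u] for w in later):
                return False
    return True

def cliques(adj):
    """Maximal cliques of a chordal graph from a perfect elimination ordering."""
    peo = mcs_order(adj)
    pos = {v: i for i, v in enumerate(peo)}
    cand = [frozenset([v] + [u for u in adj[v] if pos[u] > pos[v]]) for v in peo]
    return [c for c in cand if not any(c < d for d in cand)]

def clique_tree(cliq):
    """Clique tree as a maximum-weight spanning tree (Prim) on intersections."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(cliq):
        i, j = max(((a, b) for a in in_tree
                    for b in range(len(cliq)) if b not in in_tree),
                   key=lambda e: len(cliq[e[0]] & cliq[e[1]]))
        edges.append((i, j))
        in_tree.add(j)
    return edges

# A 4-cycle is not chordal; adding the chord (1, 3) gives a chordal embedding.
cycle   = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
chordal = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
```

On the chordal embedding, `cliques` returns the two maximal cliques {1, 2, 3} and {1, 3, 4}, and `clique_tree` connects them through their intersection {1, 3}.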

Maximum likelihood localization
The localization problem that we consider in this paper can be formulated as a network of $n_s$ sensors with unknown positions $p_s^i \in \mathbb{R}^d$, $i \in \mathbb{N}_{n_s}$, and $n_a$ anchors with known positions $p_a^i \in \mathbb{R}^d$, where $d$ is the dimension of the localization problem. The goal is to find the positions of the sensors. We assume that the sensors are capable of performing computations and that they can also measure their distance to some of the adjacent sensors and/or anchors. However, we will see later that for the proposed distributed algorithm it is enough to assume that some of the sensors are capable of performing computations. Let us define the set of neighbors of each sensor $i$, $\mathrm{Ne}_r(i) \subseteq \mathbb{N}_{n_s}$, as the set of sensors to which this sensor has an available range measurement. In a similar fashion, let us denote the set of anchors to which sensor $i$ can measure its distance by $\mathrm{Ne}_a(i) \subseteq \{n_s + 1, \ldots, n_s + n_a\}$. Define the inter-sensor and anchor range measurements for each sensor $i \in \mathbb{N}_{n_s}$ as
$$R_{ij} = \|p_s^i - p_s^j\|_2 + E_{ij}, \quad j \in \mathrm{Ne}_r(i), \qquad Y_{ij} = \|p_s^i - p_a^j\|_2 + V_{ij}, \quad j \in \mathrm{Ne}_a(i),$$
respectively, where $\|p_s^i - p_s^j\|_2$ and $\|p_s^i - p_a^j\|_2$ are the noise-free inter-sensor distance and the noise-free anchor-sensor distance. The quantities $E_{ij} \sim \mathcal{N}(0, \sigma)$ and $V_{ij} \sim \mathcal{N}(0, \sigma)$ are the measurement noises. It is assumed that the inter-sensor and the anchor-sensor measurement noises are independent. With these definitions, the maximum likelihood problem for localization can be written as
$$\underset{P}{\text{minimize}} \;\; \sum_{i=1}^{n_s} \Big[ \sum_{j \in \mathrm{Ne}_r(i)} \big(R_{ij} - \|p_s^i - p_s^j\|_2\big)^2 + \sum_{j \in \mathrm{Ne}_a(i)} \big(Y_{ij} - \|p_s^i - p_a^j\|_2\big)^2 \Big], \tag{2}$$
where $P = [p_s^1, \ldots, p_s^{n_s}] \in \mathbb{R}^{dn_s}$, and where each term is associated with one edge $(i, j)$ of the measurement graph. The problem is a nonlinear least-squares problem and hence is non-convex. It is in general NP-hard [25], and although the problem is guaranteed to have a global minimum [1], it is difficult to find [3]. The goal, therefore, is to find a good local minimum of the problem.
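As an illustration of the formulation (hypothetical names and data, not from the paper), the stacked residual vector whose squared norm is minimized in (2) can be sketched as follows.

```python
import numpy as np

def residuals(P, anchors, R, Y):
    """Stacked residual vector of the maximum likelihood problem:
    one entry per inter-sensor range R[(i, j)] and one per sensor-anchor
    range Y[(i, j)]. P is (n_s, d), anchors is (n_a, d)."""
    r  = [R[i, j] - np.linalg.norm(P[i] - P[j]) for (i, j) in R]
    r += [Y[i, j] - np.linalg.norm(P[i] - anchors[j]) for (i, j) in Y]
    return np.asarray(r)

# Hypothetical 2D instance: two sensors, one anchor, noise-free ranges,
# so the residuals vanish at the true positions.
P_true  = np.array([[0., 0.], [1., 0.]])
anchors = np.array([[0., 1.]])
R = {(0, 1): 1.0}          # distance between sensors 0 and 1
Y = {(0, 0): 1.0}          # distance from sensor 0 to anchor 0
```

The negative log-likelihood is, up to scaling by the noise variance, half the squared norm of this vector, which is exactly the nonlinear least-squares objective.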
There is also work reported in which only sensor-to-anchor measurements are considered and not inter-sensor ones; see, e.g., [26][27][28], in which a range-free convex method, a convex relaxation based method using range measurements, and a sensor selection based method using range and angle measurements are proposed, respectively.

Clique tree and assignment strategy
In order to solve the problem in (2) in a distributed way, similar to the approach for the one-dimensional example in Section 1, we base our computations on a clique tree which will be used as the computational graph.
Let us assume that if sensor/anchor $i$ can measure its distance to sensor/anchor $j$, then $j$ can also measure its distance to $i$. This allows us to describe the available range measurements using an undirected graph $G(V, E)$ with vertex set $V = \{1, \ldots, n_s + n_a\}$ and edge set $E \subset V \times V$. An edge $(i, j) \in E$ if and only if there is a range measurement between $i$ and $j$. We assume that the graph is connected. Consider the network with 4 sensors and 4 anchors in Fig. 2.

Fig. 2 A sensor network with 4 sensors (red crosses) and 4 anchors (green circles). An edge between two nodes implies the existence of a range measurement between them

The sparsity graph for this network is shown in the top graph of Fig. 3. The graph is not chordal, and therefore, as mentioned in Section 1, we first find a chordal embedding of the graph before we are able to obtain a corresponding clique tree. One possibility is to add an edge between nodes 2 and 3 to obtain a chordal graph. A corresponding clique tree is shown in the bottom graph of Fig. 3. For more complicated networks, one may use general purpose algorithms to generate the chordal embedding and subsequently the clique tree. Although the problem of finding a chordal embedding of a graph by adding a minimal number of edges is NP-hard, suboptimal methods can be used [29]. One such method is given in [30]. A MATLAB code for chordal embedding and the corresponding clique tree based on the approach proposed in [30] is provided in [31]. Once the clique tree is found, we choose one of the cliques as the root of the tree. Once the root of the tree is specified, the terms of the problem in (2) are assigned to the cliques. We use the assignment strategy given in Algorithm 1. The purpose of the algorithm is to have a balanced distribution of the terms in the objective function. The resulting assignment relies on the ordering of the cliques. Consequently, different orderings of the cliques may result in different assignments of terms in the objective function.

The clique tree can be constructed in a distributed way. We start with the anchors, which begin by communicating with the sensors that they can communicate with. This gives us the initial knowledge of the sparsity graph. Then the sensors that the anchors can communicate with can do the same and tell the anchors about the neighbors they have found, and so on. This gives us the sparsity graph, and we can centrally at the anchors compute the clique tree and distribute that information. It is clear that if we lose a measurement, assuming that this does not make the sparsity graph disconnected, then we can still use the same clique tree, by just making an embedding of the missing measurement. However, if we get an extra measurement and want to include it, we have to recompute the clique tree. If a new sensor only provides measurements to sensors in the same clique, however, it can form a new clique with the sensors it can communicate with, and the clique tree can easily be augmented with this clique. If all the sensors of a clique fail, then the sensor network and the corresponding sparsity graph will be disconnected, see, e.g., Theorem 3.7 in [32], which violates our assumptions.
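The balanced assignment idea behind Algorithm 1 can be sketched as follows. This is an illustrative rule in the spirit of the description, not a reproduction of Algorithm 1: each measurement term, identified by its variable pair, is given to the least-loaded clique containing both of its variables, so the result depends on the ordering of the cliques.

```python
def assign_terms(cliques, terms):
    """Give each measurement term, identified by its variable pair (i, j),
    to the least-loaded clique that contains both variables. As with
    Algorithm 1, the outcome depends on the ordering of the cliques."""
    load = [0] * len(cliques)
    assignment = {}
    for (i, j) in terms:
        feasible = [k for k, C in enumerate(cliques) if i in C and j in C]
        k = min(feasible, key=lambda c: load[c])
        assignment[(i, j)] = k
        load[k] += 1
    return assignment

# Chain cliques from the 1D example (variable indices 0..3 for the sensors).
cliques = [frozenset({0}), frozenset({0, 1}), frozenset({1, 2}),
           frozenset({2, 3}), frozenset({3})]
terms = [(0, 1), (1, 2), (2, 3)]          # inter-sensor measurement terms
```

By construction, every term ends up in a clique that contains all of its variables, which is exactly the validity condition for an assignment.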

Remark 1 Note that each clique is a grouping of sensors. Adding artificial edges between sensors in order to obtain a chordal embedding is a way to virtually group them. In other words, the added edges correspond to saying that terms in the objective function are functions of variables on which they do not actually depend. Later, when we assign the different terms of the objective function to the clique tree, those added edges are of no relevance. The only purpose of adding edges was to be able to obtain a clique tree.
Remark 2 It should be noted that for the distributed Levenberg-Marquardt algorithm discussed later, which clique is chosen as the root does not affect the number of communications required for converging to a solution. However, it affects how computations can be carried out in parallel; see [33].

Levenberg-Marquardt algorithm
We will now discuss the Levenberg-Marquardt algorithm and how it can be used to solve the maximum likelihood formulation in (2). Consider the nonlinear least-squares problem
$$\underset{x}{\text{minimize}} \;\; F(x) = \frac{1}{2} \sum_{i=1}^{m} f_i(x)^2 = \frac{1}{2} \|f(x)\|_2^2, \tag{3}$$
where $f(x) = (f_1(x), \ldots, f_m(x))$ and $f_i : \mathbb{R}^n \to \mathbb{R}$. The problem in (2) can be written as in (3), where each term in (2) corresponds to an $f_i$ in (3) and $x = P$. We assume that the terms $f_i$ are differentiable. A necessary condition for a local minimum $x$ is that
$$\nabla F(x) = J(x)^T f(x) = 0,$$
where $J(x) \in \mathbb{R}^{m \times n}$ is the Jacobian of $f(x)$. One of the well-known and efficient methods to solve this problem is the Levenberg-Marquardt algorithm, which is a variation of the Gauss-Newton algorithm [21, Ch. 10]. The method has been very successful in practice [34, Ch. 10], with a convergence rate which is better than linear and sometimes even quadratic. Now let us linearize $f(x)$ in the neighborhood of $x$ as
$$f(x + \Delta x) \approx f(x) + J(x) \Delta x. \tag{4}$$
The Gauss-Newton algorithm solves the problem in (3) in an iterative fashion by minimizing the linearized problem at each iterate, i.e.,
$$\underset{\Delta x}{\text{minimize}} \;\; \frac{1}{2} \|f(x) + J(x) \Delta x\|_2^2. \tag{5}$$
A drawback with this approach, however, is that a nearly rank-deficient $J(x)$ may lead to ill-conditioning. One can circumvent this by using the Levenberg-Marquardt algorithm, in which the problem in (3) is solved in an iterative fashion by minimizing a damped version of (5) at each iterate, i.e.,
$$\underset{\Delta x}{\text{minimize}} \;\; \frac{1}{2} \Delta x^T \big(J(x)^T J(x) + \mu I\big) \Delta x + f(x)^T J(x) \Delta x, \tag{6}$$
where $\mu > 0$ is a damping parameter, and the current iterate $x$ is updated by adding $\Delta x$ to it. The size of the damping parameter $\mu$ determines the behavior of the algorithm: for large values of $\mu$, the algorithm behaves like the steepest descent method, which is suitable when the current iterate is far from the solution, whereas for small values of $\mu$ it behaves like the Gauss-Newton method, which is suitable when the current iterate is close to the solution. In addition, it should be pointed out that in the final iterations of the algorithm, if the value of $F(x)$ in (3) is very small, then the algorithm behaves like the Newton method. The reason follows from the fact that if the $f_i$ are close to zero, then $J(x)^T J(x)$ is a good approximation of the Hessian $\nabla^2 F(x)$, since
$$\nabla^2 F(x) = J(x)^T J(x) + \sum_{i=1}^{m} f_i(x) \nabla^2 f_i(x).$$
As a result, the obtained direction is a Newton direction. The strategy for updating $\mu$ is controlled by a parameter $Q$, defined as
$$Q = \frac{F(x) - F(x + \Delta x)}{L(0) - L(\Delta x)}, \tag{7}$$
where $\Delta x$ is the solution to the above optimization problem and $L$ denotes the quadratic model minimized in (6). For details, see [35]. It should be noted that this strategy is inherited from the well-known trust-region method, and the Levenberg-Marquardt algorithm is sometimes viewed as a trust-region method; see [21, Ch. 10] for details. A suitable choice for the initial value of $\mu$ is
$$\mu_0 = \tau \max_i \big\{ \mathrm{diag}\big(J(x_0)^T J(x_0)\big)_i \big\},$$
where the value of $\tau$, as suggested in [35], depends on how good an approximation $x_0$ is of the local minimizer $x^*$. If it is known to be a good approximation, then a small value can be chosen, e.g., $10^{-6}$; otherwise one can choose $10^{-3}$ or an even larger value.
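The centralized Levenberg-Marquardt iteration with the damping strategy described above can be sketched as follows (an illustration assuming the gain-ratio update of [35]; function and variable names are ours, and the trilateration instance at the end is hypothetical).

```python
import numpy as np

def levenberg_marquardt(f, J, x0, tau=1e-3, eps=1e-8, max_iter=100):
    """Centralized Levenberg-Marquardt with the gain-ratio damping update
    described in the text (cf. [35]); f returns the residual vector, J its
    Jacobian."""
    x = x0.astype(float).copy()
    mu = tau * np.max(np.diag(J(x).T @ J(x)))     # suggested initial damping
    nu = 2.0
    for _ in range(max_iter):
        Jx, fx = J(x), f(x)
        g = Jx.T @ fx                             # gradient of F = 0.5*||f||^2
        if np.linalg.norm(g) < eps:
            break
        dx = np.linalg.solve(Jx.T @ Jx + mu * np.eye(x.size), -g)
        pred = 0.5 * dx @ (mu * dx - g)           # model decrease L(0) - L(dx)
        Q = (0.5 * fx @ fx - 0.5 * f(x + dx) @ f(x + dx)) / pred
        if Q > 0:                                 # accept step, relax damping
            x = x + dx
            mu *= max(1.0 / 3.0, 1.0 - (2.0 * Q - 1.0) ** 3)
            nu = 2.0
        else:                                     # reject step, raise damping
            mu *= nu
            nu *= 2.0
    return x

# Usage: a small hypothetical 2D trilateration instance with noise-free ranges.
anchors = np.array([[0., 0.], [4., 0.], [0., 4.]])
p_true = np.array([1., 1.])
ranges = np.linalg.norm(anchors - p_true, axis=1)
f = lambda x: np.linalg.norm(x - anchors, axis=1) - ranges
J = lambda x: (x - anchors) / np.linalg.norm(x - anchors, axis=1)[:, None]
p_hat = levenberg_marquardt(f, J, np.array([2., 2.]))
```

Because the residuals vanish at the true position, the final iterations behave like Newton steps, as discussed above.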

Distributed computations
We will now discuss how the different computations in the Levenberg-Marquardt algorithm can be distributed over the clique tree. Since each $f_i$ corresponds to a term in (2), the $f_i$ have also been assigned to the different cliques of the clique tree. Let $\phi_k$ be the set of indices $i$ for which $f_i$ has been assigned to clique $C_k$, and let
$$F_k(x_{C_k}) = \frac{1}{2} \sum_{i \in \phi_k} f_i(x)^2,$$
where $x_{C_k}$ is the sub-vector of $x$ that contains those components of $x$ on which the $f_i(x)$ for $i \in \phi_k$ depend. We make this dependence somewhat clearer by defining the vector valued functions $\bar{f}_k(x_{C_k}) = (f_i(x))_{i \in \phi_k}$. Moreover,
$$F(x) = \sum_{k=1}^{q} F_k(x_{C_k}) = \frac{1}{2} \sum_{k=1}^{q} \|\bar{f}_k(x_{C_k})\|_2^2.$$
We see that we have obtained a sum of nonlinear least-squares objective functions that are coupled through common components of $x$. Now let $J_k(x_{C_k})$ be the Jacobian of $\bar{f}_k$, and let us define the matrices $E_k$ as the zero-one matrices such that $E_k x = x_{C_k}$ for all $x \in \mathbb{R}^n$, $k \in \mathbb{N}_q$. It then follows that
$$J(x)^T f(x) = \sum_{k=1}^{q} E_k^T J_k(x_{C_k})^T \bar{f}_k(x_{C_k}), \qquad J(x)^T J(x) = \sum_{k=1}^{q} E_k^T J_k(x_{C_k})^T J_k(x_{C_k}) E_k.$$
We see how the matrices $E_k$ distribute the gradients and the approximate Hessians of the individual functions for the different cliques over the gradient vector and the approximate Hessian matrix of the overall problem.
We will now discuss how the above structure can be used to solve the problem in (6) in a distributed way using the clique tree. The only remaining challenge is how to distribute $\mu I$ over the cliques. To this end, we introduce modifications of $E_k$ that we call $\bar{E}_k$. They are obtained by identifying the rows which $E_k$ has in common with $E_{\mathrm{par}(k)}$, where $\mathrm{par}(k)$ is the parent of the $k$th clique in the clique tree. Then $\bar{E}_k$ is defined to be equal to $E_k$, except for these rows, which are set to zero. Let us also define $\bar{x}_{C_k} = \bar{E}_k x$. Since each component of $x$ is unshared with the parent in exactly one clique, we have $\sum_{k=1}^{q} \bar{E}_k^T \bar{E}_k = I$, and it is then straightforward to conclude that the problem in (6) can be written as
$$\underset{\Delta x}{\text{minimize}} \;\; \sum_{k=1}^{q} \Big( \frac{1}{2} \Delta x_{C_k}^T J_k(x_{C_k})^T J_k(x_{C_k}) \Delta x_{C_k} + \bar{f}_k(x_{C_k})^T J_k(x_{C_k}) \Delta x_{C_k} + \frac{\mu}{2} \|\Delta \bar{x}_{C_k}\|_2^2 \Big), \tag{10}$$
where $\Delta x_{C_k} = E_k \Delta x$ and $\Delta \bar{x}_{C_k} = \bar{E}_k \Delta x$. Notice that this optimization problem has the same sparsity graph and corresponding clique tree as the original problem in (2). Therefore, $\Delta x$ can be obtained using message passing over the clique tree. See [22] for details regarding message passing.
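The role of the matrices $E_k$ and $\bar{E}_k$ can be checked numerically. The sketch below (illustrative only, for the one-dimensional example with a chain clique tree) builds the selection matrices and verifies that the damping term splits exactly over the cliques, i.e., that $\sum_k \bar{E}_k^T \bar{E}_k = I$.

```python
import numpy as np

n = 4                                          # stacked variables p1..p4 (1D example)
cliques = [[0], [0, 1], [1, 2], [2, 3], [3]]   # variable indices per clique
parent  = [None, 0, 1, 2, 3]                   # chain clique tree rooted at clique 0

def E(C):
    """Zero-one selection matrix with E(C) @ x = x_C."""
    M = np.zeros((len(C), n))
    for row, col in enumerate(C):
        M[row, col] = 1.0
    return M

def E_bar(k):
    """E_k with the rows for variables shared with the parent zeroed out."""
    M = E(cliques[k])
    if parent[k] is not None:
        shared = set(cliques[k]) & set(cliques[parent[k]])
        for row, col in enumerate(cliques[k]):
            if col in shared:
                M[row] = 0.0
    return M

# Every variable is unshared with the parent in exactly one clique, so the
# damping term mu*I splits exactly over the cliques.
S = sum(E_bar(k).T @ E_bar(k) for k in range(len(cliques)))
```

With this splitting, each clique only adds its local term $\frac{\mu}{2}\|\bar{E}_k \Delta x\|^2$ to its subproblem, and the sum over the tree reproduces the full damping term in (6).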
A distributed version of the Levenberg-Marquardt algorithm is presented in Algorithm 2, in which it is straightforward to show that $\|\nabla F(x)\|$ and $Q$ in (7) can be calculated distributedly as
$$\|\nabla F(x)\|_2^2 = \sum_{k=1}^{q} \|\bar{E}_k \nabla F(x)\|_2^2, \qquad \|\Delta x\|_2^2 = \sum_{k=1}^{q} \|\bar{E}_k \Delta x\|_2^2, \tag{11}$$
with the remaining quantities needed for $Q$ decomposing over the cliques in the same way. Here, $\bar{E}_k \Delta x$ contains a subset of the components in $\Delta x_{C_k}$ which is only available in clique $C_k$ and not in its parent clique $C_{\mathrm{par}(k)}$. The authors in [36] discuss why quasi-Newton methods are practical and efficient for optimizing non-smooth functions. In the distributed Gauss-Newton method of [14], which is a special case of the Levenberg-Marquardt algorithm with $\mu = 0$, the search directions are not computed as efficiently as with the message passing approach used here.

Remark 5
Concerning the computational complexity of the distributed method and its centralized counterpart, we first note that the centralized counterpart is exactly the same as Algorithm 2, except that the search direction in Line 3 and $Q$ in Line 4 are computed using Problem (6) and Eq. (7), respectively, and the variable update in Line 6 is done in a centralized manner. Given that at each iteration of both methods the resulting search directions (Problem (6) for the centralized and Problem (10) for the distributed method) and the $Q$ values (Eq. (7) for the centralized and Eq. (11) for the distributed method) are identical, the distributed method converges to the same solution as its centralized counterpart. Hence, in order to compare the complexity and computational cost of the methods, it is enough to compare them for one iteration of the algorithm. Next we compare the computational complexity of Lines 3 and 4 in Algorithm 2 for both methods. Line 3, i.e., the computation of the search direction, which is the major computational burden for both methods, amounts to solving the linear system of equations $(J(x)^T J(x) + \mu I)\,\Delta x = -J(x)^T f(x)$. In the centralized method, this is typically done by factorizing $J(x)^T J(x) + \mu I$, which is the dominant computation, followed by back/forward substitutions. Common factorizations for this purpose are the $LDL^T$, $LU$, and $QR$ factorizations, which lead to a computational complexity of $O(n^3)$, where $O(\cdot)$ is the so-called big-O notation; see [37] for details. In the distributed method, however, we solve the linear system of equations using message passing over a clique tree. The message-passing scheme can be viewed as a multifrontal $LDL^T$ factorization technique [22], leading to a computational complexity of at most $O(n^3)$. To be specific, conducting an upward pass from the leaves of the clique tree to the root at each iteration is equivalent to block-diagonalizing the matrix $J(x)^T J(x) + \mu I$, with the number of blocks equal to the number of cliques in the clique tree. In addition, conducting a downward pass from the root to the leaves can be viewed as the back-substitution part of solving the linear system of equations. The computational complexity of the downward pass is negligible compared to that of the upward pass. Finally, Line 4 in Algorithm 2, i.e., the computation of $Q$, has a computational complexity of $O(mn)$ for both the centralized and distributed methods, which is negligible compared to the cost associated with the factorization. To conclude, the proposed distributed method and its centralized counterpart have a similar computational complexity of $O(n^3)$.

Algorithm 2 Distributed Levenberg-Marquardt algorithm using the clique tree
1: Given the clique tree, $k = 0$, $\nu = 2$, $x = x_0$, and $\mu = \mu_0$
2: while $\|\nabla F(x)\| > \epsilon$ do
3: Solve (10) using message passing over the clique tree
4: Calculate $Q$ using (11)
5: $x_{C_{q_i}} = x_{C_{q_i}} + \Delta x_{C_{q_i}}$ for all $q_i = 1, \ldots, q$
...
else
...
10: end if
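The equivalence between the upward/downward passes and a block factorization can be mimicked on a toy two-clique system: eliminating the leaf clique's private variables first is exactly a block $LDL^T$ step, and the downward pass is back-substitution. A Python/NumPy sketch under these assumptions (not the paper's Julia implementation; the Jacobian and cliques are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, mu = 6, 10, 0.1

# Toy data: block-sparse Jacobian whose sparsity graph has two cliques,
# variables {0..3} (leaf-private 0..1, separator 2..3) and {2..5}.
J = np.zeros((m, n))
J[:5, :4] = rng.standard_normal((5, 4))
J[5:, 2:] = rng.standard_normal((5, 4))
f = rng.standard_normal(m)

H = J.T @ J + mu * np.eye(n)      # damped approximate Hessian
g = -J.T @ f                      # right-hand side

# Upward pass: eliminate the leaf's private variables p = {0,1},
# sending a message (Schur complement + vector) toward the root.
p, rest = [0, 1], [2, 3, 4, 5]
Hpp, Hpr = H[np.ix_(p, p)], H[np.ix_(p, rest)]
msg_mat = H[np.ix_(rest, rest)] - Hpr.T @ np.linalg.solve(Hpp, Hpr)
msg_vec = g[rest] - Hpr.T @ np.linalg.solve(Hpp, g[p])

# Root solve, then downward pass recovers the leaf's private variables.
dx = np.empty(n)
dx[rest] = np.linalg.solve(msg_mat, msg_vec)
dx[p] = np.linalg.solve(Hpp, g[p] - Hpr @ dx[rest])

# Matches the centralized solve of the full damped normal equations.
assert np.allclose(dx, np.linalg.solve(H, g))
```

The Schur complement `msg_mat` plays the role of the matrix message sent up the tree, and the final assert verifies that the distributed elimination yields the same search direction as the centralized factorization.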

Results and discussion
In this section we compare the performance of the proposed distributed Levenberg-Marquardt algorithm, referred to as the LV algorithm, which is implemented in Julia [38], with two other algorithms. The first is a convex-relaxation-based distributed algorithm presented in [1]. We refer to it as the Disk algorithm, since the approach is based on what is known as the disk relaxation. The second algorithm, presented in [16], is a distributed algorithm which directly optimizes the non-convex maximum likelihood problem. We refer to it as the StableML algorithm. We do not conduct a comparison with other algorithms, since a thorough comparison with Disk has been conducted in [1], which illustrated the superiority of that algorithm over the high-performance algorithms in [18] and [3], both in accuracy and in the number of communications among agents. Furthermore, in [3], the authors show the superiority of their algorithm over the one proposed in [12].

Simulation data
We conduct experiments for networks of sensors with connected inter-sensor measurement graphs in three simulation setups. In all setups we consider several sensors which are randomly distributed, and 9 anchors which are uniformly distributed in a two-dimensional area. We generate the noisy range measurements as $R_{ij} = \|p^*_{s_i} - p^*_{s_j}\| + E_{ij}$ and $Y_{ij} = \|p^*_{s_i} - a_j\| + V_{ij}$, where $p^*_{s_i}$ is the true location of the $i$th sensor, $a_j$ is the location of the $j$th anchor, and $E_{ij} \sim \mathcal N(0, \sigma^2)$ and $V_{ij} \sim \mathcal N(0, \sigma^2)$. We assume that all noises are Gaussian and mutually independent. We consider 10, 30 and 50 fixed sensors for the first, second and third setup, respectively, distributed in a 1 × 1 area. We consider four different measurement noise standard deviations σ, namely 0.01, 0.05, 0.1 and 0.3, and for each setup, 25 realizations of each noise level are generated across Monte Carlo runs. We assume there exists a measurement between two sensors, or between a sensor and an anchor, if the distance between them is less than the communication range, which is chosen between 0.3 and 0.4 depending on the number of sensors in the network, so as to ensure that the generated graph is connected but sparse. For instance, for the first network with 10 sensors, we choose a large communication range, close to 0.4. By doing so, the average number of edges connected to each sensor turned out to be 7.40, 8.63 and 11.92 for the first, second and third network, respectively. As an example, we depict the resulting sensor network for the third setup with 50 sensors and the corresponding clique tree in Figs. 4 and 5, respectively. As can be seen, the inter-sensor measurement graph is connected.
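The data-generation procedure above can be sketched as follows (a Python stand-in for the paper's Julia code; the regular 3x3 anchor grid and the seed are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n_s, n_a, sigma, comm_range = 10, 9, 0.05, 0.4

# Sensors uniformly at random, anchors on a regular 3x3 grid in [0,1]^2
# (one way to place 9 uniformly distributed anchors).
sensors = rng.random((n_s, 2))
g = np.linspace(0.1, 0.9, 3)
anchors = np.array([(gx, gy) for gx in g for gy in g])

# A noisy range measurement exists whenever the true distance is below
# the communication range: R (sensor-sensor) and Y (sensor-anchor).
R, Y = {}, {}
for i in range(n_s):
    for j in range(i + 1, n_s):
        d = np.linalg.norm(sensors[i] - sensors[j])
        if d < comm_range:
            R[(i, j)] = d + sigma * rng.standard_normal()
    for j in range(n_a):
        d = np.linalg.norm(sensors[i] - anchors[j])
        if d < comm_range:
            Y[(i, j)] = d + sigma * rng.standard_normal()

avg_degree = (2 * len(R) + len(Y)) / n_s
print(f"{len(R)} sensor-sensor and {len(Y)} sensor-anchor measurements")
```

With a different seed or sensor count the measurement graph changes, which is why the paper tunes the communication range per network size.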

Performance assessment
Before evaluating the performance of the aforementioned algorithms, we want to stress the importance of the number of measurements for the quality of the estimates. In our sensor network application, this means that the more measurements are available between sensor i and sensor and/or anchor j, the better the estimate will be.

Fig. 4 The sensor network with 50 sensors and 9 anchors considered for our experiment. The sensor nodes are marked with red crosses and the anchors with green circles. An edge between two nodes implies the existence of a range measurement between them.

Fig. 5 Clique tree for the network in Fig. 4.

Notice that in the maximum likelihood formulation (2), we have assumed that there is a single measurement between sensor i and sensor and/or anchor j, namely $R_{ij}$ and/or $Y_{ij}$. Let us now assume that there are N measurements available between sensor i and sensor and/or anchor j, namely $R^k_{ij}$ and/or $Y^k_{ij}$ for $k \in \mathbb N_N$, where the superscript $k$ denotes the index of the measurement. With these definitions, the maximum likelihood problem becomes the one in (12), in which each squared residual term is summed over the N measurements. Compared to the problem in (2), although the two problems have the same clique tree, this problem requires N times more computations in order to calculate the gradient of the objective function. It can however be shown that solving the problem in (12) is equivalent to solving (2) after replacing the single measurements with the averages over $k$, i.e. $\bar R_{ij} = \frac{1}{N}\sum_{k=1}^{N} R^k_{ij}$ and $\bar Y_{ij} = \frac{1}{N}\sum_{k=1}^{N} Y^k_{ij}$,
respectively. This follows from the fact that $\bar R_{ij}$ and $\bar Y_{ij}$ are sufficient statistics for the estimation of the $x_i$s, which in turn follows from the Neyman-Fisher factorization theorem. For details see [39, Ch. 5].
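The equivalence can be checked numerically: for Gaussian noise, the sum of N squared residuals differs from N times the squared residual of the average only by a constant, so both costs share the same minimizer. A small Python demonstration with a single scalar range (hypothetical values):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100
d_true = 1.3
R = d_true + 0.1 * rng.standard_normal(N)   # N repeated range measurements
R_bar = R.mean()

# For any model value d, the N-measurement cost differs from the averaged
# single-measurement cost only by a constant (the scatter around the mean):
#   sum_k (R_k - d)^2 = N*(R_bar - d)^2 + sum_k (R_k - R_bar)^2
d_grid = np.linspace(0.5, 2.0, 1001)
cost_full = ((R[:, None] - d_grid[None, :]) ** 2).sum(axis=0)
cost_avg = N * (R_bar - d_grid) ** 2
const = ((R - R_bar) ** 2).sum()
assert np.allclose(cost_full, cost_avg + const)

# Hence both costs are minimized at the same point.
assert np.isclose(d_grid[cost_full.argmin()], d_grid[cost_avg.argmin()])
```

The constant term does not depend on the unknown positions, which is exactly why the averaged measurements are a sufficient statistic here.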
In our simulations, we choose N to be 1, 10, and 100. In the experiments, we run our proposed algorithm based on the formulation in (2) using the averages of the measurements, $\bar R_{ij}$ and $\bar Y_{ij}$.
We refer to the obtained estimate as the LV estimate. The Disk and StableML algorithms are also formulated for a single measurement, and therefore we use the average of the measurements for these algorithms as well. We refer to these estimates as the Disk and StableML estimates, respectively.
The Disk and StableML algorithms are terminated when the norm of the gradient of their cost functions is below 10^-6. This threshold was chosen based on the experience of the authors in [1] and [16], so as to guarantee that Disk and StableML generate sufficiently accurate solutions. For the proposed distributed Levenberg-Marquardt algorithm, we choose τ = 10^-6 and ε = 10^-6. It should be noted that the LV and StableML algorithms are sensitive to the initial starting point, and therefore they should be initialized not too far from the optimum to ensure convergence to a good local minimum. Here we use an approximate solution obtained from the Disk algorithm as the initial starting point for both the LV and StableML algorithms. To this end, we terminate the Disk algorithm when the norm of the gradient of its cost function is below 10^-1. It should be noted that on average the number of iterations for convergence to this low accuracy is 2-3% of the iterations needed for terminating the Disk algorithm when requiring the norm of the gradient to be below 10^-6. There might also be other cheap ways of initializing the algorithm, but we have not investigated that in this paper. Moreover, the chordal embedding and the corresponding clique tree for all networks are generated using the MATLAB functions in the toolbox [31]. The generated clique trees for the networks with 10, 30, and 50 sensors have 8, 12 and 15 cliques, respectively.
Let $\hat P_s = (\hat p_{s_i})_{i=1}^{n_s}$ be the vector obtained by stacking the estimated sensor positions, and let $P^*_s = (p^*_{s_i})_{i=1}^{n_s}$ be the vector obtained by stacking the true sensor positions. We will compare the performance of the different estimates using the root mean squared error (RMSE) per sensor, defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{Q\, n_s} \sum_{q=1}^{Q} \big\|\hat P_s(q) - P^*_s\big\|^2},$$

where $n_s$ is the number of sensors and $Q$ is the number of problem instances. The argument $q$ refers to the $q$th experiment. Figures 6, 7 and 8 illustrate the RMSE results for different noise levels for the networks with 10, 30 and 50 sensors, respectively. Notice that the plots for the LV (red) and StableML (green) estimates are similar. It can be seen from the figures that the LV and StableML estimates perform equally well, and in general they outperform the Disk estimate for low noise levels; as the number of measurements N increases, they perform considerably better than the Disk estimate. Note that, as we will see later, the good performance of the StableML estimate comes at the price of requiring far more communications for convergence than the LV estimate. Note also that, although we have not included the figures for the objective function values of the nonlinear LS problem, in all simulations the LV and StableML estimates ended up with lower objective function values in (2) than the Disk estimate. Let us also evaluate the bias-variance trade-off of the different estimates for the lowest noise level (σ = 0.01) and the highest noise level (σ = 0.3). Recall the relation between the mean squared error (MSE), the variance, and the bias of an estimate, $\mathrm{MSE} = \mathrm{Var} + \|\mathrm{Bias}\|^2$. These values for the cases with σ = 0.01, σ = 0.3, and N = 100 are illustrated in Figs.
9, 10, and 11, for the networks with 10, 30 and 50 sensors, respectively. The observations from the results are the following. In general, the LV and StableML estimates have similar biases, variances and MSEs. In order to compare the number of communications between the algorithms, first note that the way the computations are distributed in the Disk and StableML algorithms is similar; it is, however, different from the way the computations are distributed in the LV algorithm. In the distributed LV algorithm, it is enough to assume that one sensor per clique is capable of carrying out the computations. Hence, we have as many computational agents as the number of cliques. This is related to how we distribute our computations, which is based on the clique tree. In the distributed Disk and StableML algorithms, however, it is assumed that each sensor is capable of carrying out computations, and therefore those algorithms have as many computational agents as the number of sensors. It should be noted that we can use even fewer computational agents than the number of cliques in the distributed LV algorithm. This is possible because we can always merge neighboring cliques in the tree. Hence, in our approach we have an upper limit on the number of computational agents that we may use, but no lower limit.
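The RMSE-per-sensor metric defined above can be sketched as a small helper (Python; the array layout with Q instances of n_s two-dimensional positions is an assumption for illustration):

```python
import numpy as np

def rmse_per_sensor(est, true):
    """RMSE per sensor over Q problem instances.

    est, true: arrays of shape (Q, n_s, 2) with estimated and true
    sensor positions (hypothetical layout; Q instances, n_s sensors).
    """
    Q, n_s, _ = est.shape
    sq_err = ((est - true) ** 2).sum(axis=2)      # (Q, n_s) squared distances
    return np.sqrt(sq_err.sum() / (Q * n_s))

# Tiny check: shifting every estimate by 0.1 in x gives an RMSE of 0.1.
true = np.zeros((5, 3, 2))
est = true.copy()
est[:, :, 0] += 0.1
print(rmse_per_sensor(est, true))   # prints approximately 0.1
```

This matches the definition above term by term: squared position errors, averaged over sensors and problem instances, then a square root.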
The advantage of having one computational agent per clique is that the number of communications is in general lower than in the case with one computational agent per sensor. This is a desirable property, especially when communication is costly. The computations in the former case are, however, heavier than in the latter case. Nevertheless, the effort spent on the computations is considerably less than the effort spent on sending and receiving messages between agents, which involves complex operations such as coding, decoding, synchronization, etc. [20]. See [40] for an estimate of the energy consumption of wireless communication between two sensors. Notice also that one important factor which affects the number of communications between agents is the number of iterations needed to reach a certain accuracy. As discussed in Section 1, the former case, which is based on a pseudo-second-order method, requires fewer iterations, and hence fewer communications, than the latter case, which is based on a first-order method. The former case also requires intra-clique communications when the sensor readings are sent to the sensor which is responsible for carrying out the computations. Notice, however, that as discussed before, not all readings but just their average needs to be sent, as the average is a sufficient statistic for the maximum likelihood estimate. The disadvantage of the former case is that if a computational agent fails, or if the communication with neighboring agents is lost, the clique tree has to be computed from scratch, whereas in the latter case the failed agent can be dismissed and the algorithm continues working.

Remark 6
It should be pointed out that, although for the distributed LV algorithm it is enough to have one sensor per clique which is capable of carrying out the computations, there are two advantages of having more than one such sensor per clique. One is that we can distribute the energy consumption of the sensors by letting them take turns carrying out the computations. The second advantage is that in case a sensor which is responsible for the computations fails, there is a backup sensor ready to take over. This can be triggered in the following way: if the parent/child clique does not get a message from its child/parent clique for a pre-specified period of time, this implies that a failure has happened, and another sensor is requested to take over the computations.
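The timeout rule in Remark 6 can be sketched as follows (a hypothetical Python API, not part of the paper's implementation; the class, method names, and timeout value are illustrative assumptions):

```python
import time

class CliqueAgent:
    """Sketch of the failure-detection rule described above: if no message
    arrives from the neighboring clique within the timeout, a backup
    sensor in this clique is promoted to carry out the computations."""

    def __init__(self, sensors, timeout_s=5.0):
        self.sensors = sensors          # sensors in this clique, in turn order
        self.active = 0                 # index of the computing sensor
        self.timeout_s = timeout_s
        self.last_msg = time.monotonic()

    def on_message(self):
        # Any message from the parent/child clique resets the timer.
        self.last_msg = time.monotonic()

    def rotate(self):
        # Also usable on its own to spread energy use across sensors.
        self.active = (self.active + 1) % len(self.sensors)

    def check_failure(self, now=None):
        # Neighbor silent for too long: assume the computing sensor
        # failed and promote the backup sensor.
        now = time.monotonic() if now is None else now
        if now - self.last_msg > self.timeout_s:
            self.rotate()
            return True
        return False

agent = CliqueAgent(sensors=["s1", "s2", "s3"], timeout_s=5.0)
agent.last_msg = 0.0
assert agent.check_failure(now=10.0)     # timeout elapsed: backup takes over
assert agent.sensors[agent.active] == "s2"
```

The same `rotate` method implements the first advantage (taking turns), while `check_failure` implements the second (automatic takeover on silence).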
In general we have two types of communication in the distributed LV algorithm. The first type relates to the fact that, within each clique, each sensor needs to send the information regarding the distances to its adjacent sensors to the sensor which is responsible for carrying out the computations. The second type of communication concerns exchanging messages between the sensors which are responsible for carrying out the computations. These messages can be expressed with symmetric matrices and vectors. In particular, we need three passes through the clique tree at each iteration. To be more specific, we require one upward and one downward pass through the clique tree in order to calculate the search direction in (10) (Line 3 in Algorithm 2), and one upward pass in order to calculate the different terms of ∇F(x) (Line 2 in Algorithm 2) and Q (Line 4 in Algorithm 2) using (11). For the considered networks with 10, 30 and 50 sensors, for the first upward pass, the communicated messages are a symmetric matrix and a vector with average sizes of (7 × 7, 7 × 1), (18 × 18, 18 × 1) and (43 × 43, 43 × 1), respectively. The maximum sizes of the matrix and the vector are (10 × 10, 10 × 1), (26 × 26, 26 × 1) and (72 × 72, 72 × 1), respectively. For the downward pass, the communicated message is a vector of the same size as the vector communicated in the first upward pass. Finally, for the last upward pass, the communicated messages are three scalars, which can be combined in a (3 × 1) vector. For the distributed Disk algorithm, however, according to Algorithm 1 in [1], at each iteration every sensor or agent needs to communicate a (2 × 1) vector, called ω_i, with its adjacent agents or sensors. It is obvious that for the largest network the amount of data to be sent in the distributed LV algorithm is roughly 400-500 times more than for the Disk algorithm. However, as discussed before, what is most costly in
many applications is the number of times contact is established, and not how much information is sent. We will now compare the total number of communications for the different algorithms. The total numbers of communications required for each algorithm to converge to a solution are depicted in Figs. 12 and 13 for all networks, with δ = 0.05 and δ = 0.1, respectively. Both figures correspond to the case with N = 100. It is seen that the LV algorithm requires roughly two orders of magnitude fewer communications for computing the solution than the Disk algorithm, and the StableML algorithm requires even more communications than the Disk algorithm. The shaded areas depict the maximum and minimum values out of the 25 problem instances.
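The per-iteration message volume described above can be tallied with a small helper (a rough Python sketch; the separator sizes and the symmetric-matrix packing are assumptions based on the message shapes reported above):

```python
def floats_per_iteration(sep_sizes):
    """Count floats communicated per LM iteration along each clique-tree edge.

    sep_sizes: list with the message size n_e for each tree edge.
    Per edge and iteration: one symmetric n_e x n_e matrix (only the
    upper triangle needs to be sent) plus an n_e vector upward, an n_e
    vector downward, and three scalars in the final upward pass.
    """
    total = 0
    for n_e in sep_sizes:
        sym = n_e * (n_e + 1) // 2      # symmetric matrix: upper triangle
        total += sym + n_e + n_e + 3
    return total

# Hypothetical small tree whose edges all carry size-7 messages
# (cf. the average sizes reported for the 10-sensor network):
print(floats_per_iteration([7] * 7))
```

Such a tally makes the trade-off concrete: the LV algorithm sends larger messages per contact, while Disk sends a tiny (2 × 1) vector but needs far more contacts overall.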
It is worth pointing out that for high noise levels, in order to get a better performance in terms of RMSE, one can proceed as follows. Given N measurements between sensor i and sensor and/or anchor j, we can run the LV algorithm for each of the measurements using the formulation in (2), which results in N estimates. We can then compute the final estimate as the average of these estimates. We can do the same with the Disk algorithm. Although for high noise levels the results are again improved compared to running the algorithm once using the average of the measurements, the estimate obtained from the LV algorithm outperforms the estimate obtained from the Disk algorithm in terms of RMSE. We verified this in our simulations; for the sake of brevity, however, we do not present the results in the paper.
Notice that the number of communications for this approach is much more than the case where we run the algorithm once by using the average of the measurements, since we have to run the algorithm N times.
Fig. 13 Total number of communications required for each algorithm to converge to a solution for δ = 0.1 and N = 100

Conclusion
In this paper we proposed a distributed algorithm for maximum likelihood estimation for the localization problem, which relies on the Levenberg-Marquardt algorithm and message passing over a tree. We discussed how the tree can be generated in a distributed way, and also how one should proceed if a measurement between sensors is lost or if a new sensor is added to the network. The resulting algorithm requires far fewer iterations than first-order distributed methods in order to converge to an accurate solution, which in turn leads to fewer communications among computational agents. The algorithm also outperforms other distributed algorithms in terms of accuracy when estimates of small bias with a limited number of communications are important. Only with a larger bias and a smaller variance can other methods beat our method in terms of RMSE.
The size of the messages communicated between cliques depends on the number of sensors and anchors two adjacent cliques share. In terms of parallel computation, the more branches the clique tree has, the faster the computations can be carried out. Therefore, we intend to investigate different approaches for finding clique trees in future research. In addition, different ways of initializing the algorithm cheaply will be investigated.

Algorithm 1 A simple assignment strategy
Fig. 3 Sparsity graph (top) and clique tree (bottom) for the network in Fig. 2

Fig. 6 The RMSE per sensor results for different noise levels, namely 0.01, 0.05, 0.1, and 0.3, for the network with 10 sensors. The top left, top right and bottom left plots correspond to N = 1, N = 10, and N = 100, respectively

Fig. 7 The RMSE per sensor results for different noise levels, namely 0.01, 0.05, 0.1, and 0.3, for the network with 30 sensors. The top left, top right and bottom left plots correspond to N = 1, N = 10, and N = 100, respectively

Fig. 8 The RMSE per sensor results for different noise levels, namely 0.01, 0.1 and 0.3, for the network with 50 sensors. The top left, top right and bottom left plots correspond to N = 1, N = 10 and N = 100, respectively

Fig. 12 Total number of communications required for each algorithm to converge to a solution for δ = 0.05 and N = 100