Distributed Gram-Schmidt orthogonalization with simultaneous elements refinement
EURASIP Journal on Advances in Signal Processing volume 2016, Article number: 25 (2016)
Abstract
We present a novel distributed QR factorization algorithm for orthogonalizing a set of vectors in a decentralized wireless sensor network. The algorithm is based on the classical Gram-Schmidt orthogonalization with all projections and inner products reformulated in a recursive manner. In contrast to existing distributed orthogonalization algorithms, all elements of the resulting matrices Q and R are computed simultaneously and refined iteratively after each transmission. Thus, the algorithm allows a trade-off between run time and accuracy. Moreover, the number of transmitted messages is considerably smaller in comparison to state-of-the-art algorithms. We thoroughly study its numerical properties and performance from various aspects. We also investigate the algorithm’s robustness to link failures and provide a comparison with existing distributed QR factorization algorithms in terms of communication cost and memory requirements.
Introduction
Orthogonalizing a set of vectors is a well-known problem in linear algebra. Representing the set of vectors by a matrix \({\mathbf {A}}\in \mathbb {R}^{n\times m}\), with n≥m, several orthogonalization methods are possible. One example is the so-called reduced QR factorization (matrix decomposition), A=QR, with a matrix \({\mathbf {Q}}\in \mathbb {R}^{n\times m}\) having orthonormal columns, and an upper triangular matrix \({\mathbf {R}}\in \mathbb {R}^{m\times m}\) containing the coefficients of the basis transformation [1]. In the signal processing area, QR factorization is used widely in many applications, e.g., when solving linear least squares problems or decorrelation [2–4]. In adaptive filtering, a decorrelation method is typically used as a pre-step for increasing the learning rate of the adaptive algorithm [5], ([6], p. 351), ([7], p. 700).
From an algorithmic point of view, there are many methods for computing QR factorization with different numerical properties. A standard approach is the Gram-Schmidt orthogonalization algorithm, which computes a set of orthonormal vectors spanning the same space as the given set of vectors. Other methods include Householder reflections or Givens rotations, which are not considered in this paper.
Optimization of QR factorization algorithms for a specific target hardware has been addressed in the literature several times (e.g., [8, 9]). Parallel algorithms for computing QR factorization, which are applicable for reliable systems with fixed, regular, and globally known topology, have been investigated extensively (e.g., [10–13]).
Besides parallel algorithms, there are two other potential approaches for computation across a distributed network. In the standard (centralized) approach, the data are collected from all nodes and the computation is performed at a fusion center. Another approach is to consider distributed algorithms for fully decentralized networks without any fusion center where all nodes have the same functionality and each of them communicates only with its neighbors. Such an approach is typical for sensor-actuator networks or autonomous swarms of robotic networks [14]. Nevertheless, the investigation of distributed QR factorization algorithms designed for loosely coupled distributed systems with independently operating distributed memory nodes and with possibly unreliable communication links has only started recently [3, 15, 16]. In the following, we focus on algorithms for such decentralized networks.
Motivation
The main goal of this paper is to present a novel distributed QR factorization algorithm, DSCGS, which is based on the classical Gram-Schmidt orthogonalization. The algorithm does not require any fusion center and assumes only local communication between neighboring nodes without any global knowledge about the topology. In contrast to existing distributed approaches, the DSCGS algorithm computes the approximations of all elements of the new orthonormal basis simultaneously, and as the algorithm proceeds, the values at all nodes are refined iteratively, approximating the exact values of Q and R. Therefore, it can deliver an estimate of the full matrix result at any moment of the computation. As we will show, this approach is superior to existing methods in several respects, notably in the number of transmitted messages in the network.
In Section 2, we briefly recall the concept of a consensus algorithm which we use later in the distributed orthogonalization algorithm. In Section 3, we review the basics of the QR decomposition and existing distributed methods. In Section 4, we describe the proposed distributed Gram-Schmidt orthogonalization algorithm with simultaneous refinements of all elements (DSCGS). We experimentally compare DSCGS with other distributed approaches in Section 5 where we also investigate the properties of DSCGS from many different viewpoints. Section 6 concludes the paper.
Notation and terminology
In what follows, we use k as the node index, \(\mathcal {N}_{k}\) denotes the set of neighbors of node k, N denotes the (known) number of nodes in the network, \(\mathcal {E}\) the set of edges (links) of the network, d _{ k } the kth node degree (\(d_{k}=|\mathcal {N}_{k}|\)), \(\bar {d}\) the average node degree of the network, and t a discrete time (iteration) index.
We will describe the behavior of the distributed algorithm from a network (global) point of view with the corresponding vector/matrix notation. For example, the (column) vector of all ones, denoted by 1, corresponds to all nodes having value 1. In general, we denote the number of rows of a matrix by n and the number of columns by m. Element-wise division of two vectors is denoted as \({\mathbf {z}} = \frac {{\mathbf {x}}}{{\mathbf {y}}} \equiv \frac {x_{i}}{y_{i}}, \forall i\), element-wise multiplication of two vectors as z=x∘y≡x _{ i } y _{ i },∀i and of two matrices as Z=X∘Y. The operation \({\mathbf {X}}\circledast {\mathbf {Y}}\) is defined as follows: Having two matrices X=(x _{1},x _{2},…,x _{ m }) and Y=(y _{1},y _{2},…,y _{ m }), the resulting matrix \({\mathbf {Z}}={\mathbf {X}}\circledast {\mathbf {Y}}\) is a stacked matrix of all matrices Z _{ i } such that \({\mathbf {Z}}_{i}=({\mathbf {x}}_{1},{\mathbf {x}}_{2},\dots,{\mathbf {x}}_{i})\circ ((\underbrace {1,1,\dots,1}_{i})\otimes {\mathbf {y}}_{i+1})\) (⊗ denotes the Kronecker product; i = 1,2,…,m−1), i.e., a large matrix containing combinations of column vectors: \({\mathbf {Z}}\in \mathbb {R}^{n\times \frac {m^{2}-m}{2}}\). In our algorithm, this later corresponds to the off-diagonal elements of the matrix R. Also note that all variables with the “hat” symbol, e.g., \(\hat {{\mathbf {u}}}(t)\), represent variables that are computed locally at nodes, while variables with the “tilde” symbol, e.g., \(\tilde {{\mathbf {u}}}(t)\), are updated based on the information from neighbors.
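The stacking operation can be illustrated with a small sketch. It assumes matrices are stored as plain Python lists of columns, and the function name `circledast` is a hypothetical choice made here for illustration. The column for the pair (x _{ j }, y _{ i }) with j<i ends up at the 1-based position j+(i−1)(i−2)/2, and summing that column yields the inner product 〈x _{ j },y _{ i }〉.

```python
def circledast(X, Y):
    """Stack the element-wise products x_j * y_i for all j < i.

    X and Y are lists of m columns, each column a list of n numbers.
    Returns a list of m(m-1)/2 columns (hypothetical helper, for illustration).
    """
    m = len(X)
    Z = []
    for i in range(1, m):      # Y[i] is y_{i+1} in the 1-based notation of the text
        for j in range(i):     # X[0..i-1] are x_1, ..., x_i
            Z.append([xj * yi for xj, yi in zip(X[j], Y[i])])
    return Z


# Toy example with n = 2 rows and m = 3 columns.
X = [[1, 2], [3, 4], [5, 6]]
Y = [[1, 1], [10, 100], [2, 3]]
Z = circledast(X, Y)

# Column sums are the inner products <x_j, y_i>, j < i.
inner_products = [sum(col) for col in Z]
```

For m = 3 the result has (3²−3)/2 = 3 columns, ordered as (x _{1}∘y _{2}), (x _{1}∘y _{3}), (x _{2}∘y _{3}).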
Average consensus algorithm
We model a wireless sensor network (WSN) by synchronously working nodes which broadcast their data into their neighborhood within a radius ρ (so-called geometric topology). The WSN is considered to be static, connected, and with error-free transmissions (except for Section 5.4 ahead). Although the practicality of synchronicity can be argued [17, 18], we note that it is not an unrealizable assumption [19].
In the following, we briefly review the classical consensus algorithm for computing the average of values distributed in a network. Note that the algorithm can be easily adapted to computing a sum by multiplying the final average value (arithmetic mean) by the total number of nodes N.
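The distributed average consensus algorithm computes an estimate of the global average of distributed initial data x(0) at each node k of a WSN. In every iteration t, each node updates its estimate using the weighted data received from its neighbors, i.e.,
\[x_{k}(t) = [{\mathbf {W}}]_{kk}\,x_{k}(t-1) + \sum _{k'\in \mathcal {N}_{k}}[{\mathbf {W}}]_{kk'}\,x_{k'}(t-1),\]
or from a global (network) point of view
\[{\mathbf {x}}(t) = {\mathbf {W}}{\mathbf {x}}(t-1). \tag{1}\]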
The selection of the weight matrix W, representing the connections in a strongly connected network, crucially influences the convergence of the average consensus algorithm [20–22]. The main condition for the algorithm to converge is that the largest eigenvalue of W is equal to 1, i.e., λ _{max} = 1, with multiplicity one, and that each row of W sums up to 1. It can then be directly shown [20] that the value x _{ k }(t) at each node converges to a common global value, e.g., average of the initial values.
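If not stated otherwise, we use the so-called Metropolis weights [22] for matrix W, i.e.,
\[[{\mathbf {W}}]_{ij} = \begin{cases} \dfrac {1}{1+\max \{d_{i},d_{j}\}} & \text {if } (i,j)\in \mathcal {E},\\ 1-\sum _{i'\in \mathcal {N}_{i}}[{\mathbf {W}}]_{ii'} & \text {if } i=j,\\ 0 & \text {otherwise.} \end{cases} \tag{2}\]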
These weights guarantee that the consensus algorithm converges to the average of the initial values.
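As a minimal sketch of the consensus iteration with Metropolis weights, the following example runs the update on a small hypothetical five-node graph. The graph, the initial values, the number of iterations, and the helper names are assumptions chosen for illustration only. Multiplying the converged average by N gives the sum, as noted above.

```python
def metropolis_weights(neighbors):
    """Build Metropolis weights from a dict: node -> set of neighboring nodes."""
    W = {}
    for k, nbrs in neighbors.items():
        # Off-diagonal weight 1 / (1 + max(d_k, d_j)) for every neighbor j.
        row = {j: 1.0 / (1 + max(len(nbrs), len(neighbors[j]))) for j in nbrs}
        # Diagonal weight makes the row sum to one.
        row[k] = 1.0 - sum(row.values())
        W[k] = row
    return W


def consensus_step(W, x):
    """One synchronous update x(t) = W x(t-1), using only neighbor values."""
    return {k: sum(w * x[j] for j, w in row.items()) for k, row in W.items()}


# Hypothetical connected, irregular topology with N = 5 nodes.
neighbors = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
x = {0: 4.0, 1: 8.0, 2: 15.0, 3: 16.0, 4: 23.0}
average = sum(x.values()) / len(x)

W = metropolis_weights(neighbors)
for _ in range(500):
    x = consensus_step(W, x)

# Each node recovers the sum by scaling its local estimate with N.
sum_estimates = {k: len(x) * v for k, v in x.items()}
```

Each node only needs its own degree and the degrees of its neighbors to form its row of W, which is why these weights suit networks without global topology knowledge.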
QR factorization
As mentioned in Section 1, there exist many algorithms for computing the QR factorization with different properties [1, 23]. In this paper we utilize the QR decomposition based on the classical Gram-Schmidt orthogonalization method (in ℓ ^{2} space).
Centralized classical Gram-Schmidt orthogonalization
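Given matrix \({\mathbf {A}} = ({\mathbf {a}}_{1},{\mathbf {a}}_{2},\dots,{\mathbf {a}}_{m}) \in \mathbb {R}^{n\times m}\), n≥m, classical Gram-Schmidt orthogonalization (CGS) computes a matrix \({\mathbf {Q}}\in \mathbb {R}^{n\times m}\) with orthonormal columns and an upper-triangular matrix \({\mathbf {R}}\in \mathbb {R}^{m\times m}\), such that A=QR. Denoting
\[{\mathbf {Q}} = ({\mathbf {q}}_{1},{\mathbf {q}}_{2},\dots,{\mathbf {q}}_{m}),\qquad {\mathbf {R}} = \begin{pmatrix} \langle {\mathbf {q}}_{1},{\mathbf {a}}_{1}\rangle & \langle {\mathbf {q}}_{1},{\mathbf {a}}_{2}\rangle & \cdots & \langle {\mathbf {q}}_{1},{\mathbf {a}}_{m}\rangle \\ 0 & \langle {\mathbf {q}}_{2},{\mathbf {a}}_{2}\rangle & \cdots & \langle {\mathbf {q}}_{2},{\mathbf {a}}_{m}\rangle \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \langle {\mathbf {q}}_{m},{\mathbf {a}}_{m}\rangle \end{pmatrix}, \tag{3}\]
we have
\[{\mathbf {q}}_{i} = \frac {{\mathbf {u}}_{i}}{\left \|{\mathbf {u}}_{i}\right \|_{2}} \tag{4}\]
and
\[{\mathbf {u}}_{i} = {\mathbf {a}}_{i} - \sum _{j=1}^{i-1}\frac {\langle {\mathbf {q}}_{j},{\mathbf {a}}_{i}\rangle }{\langle {\mathbf {q}}_{j},{\mathbf {q}}_{j}\rangle }{\mathbf {q}}_{j}, \tag{5}\]
where \(\left \|{\mathbf {u}}\right \|_{2} = \sqrt {\sum _{i=1}^{n}{{u_{i}^{2}}}}\) and \(\langle {\mathbf {q}}, {\mathbf {a}}\rangle = \sum _{i=1}^{n}q_{i}a_{i}\).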
It is known that the algorithm is numerically sensitive to the singular values (condition number) of matrix A and that it can produce vectors q _{ i } that are far from orthogonal when A is close to rank deficient, even in floating-point precision [23]. Numerical stability can be improved by other methods, e.g., the modified Gram-Schmidt method, Householder transformations, or Givens rotations [1, 23].
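The following sketch shows a plain centralized CGS in pure Python and reproduces the loss of orthogonality on a nearly rank-deficient input. The Läuchli-type test matrix with ε = 10^{−8} is an illustrative choice made here, not taken from the paper, and its condition number is approximately √3/ε. Because every q _{ j } is normalized, the factor 1/〈q _{ j },q _{ j }〉 is omitted in the projection.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cgs(cols):
    """Classical Gram-Schmidt on a list of column vectors.

    Returns Q as a list of orthonormal columns and R as an m x m nested list.
    """
    m = len(cols)
    Q = []
    R = [[0.0] * m for _ in range(m)]
    for i, a in enumerate(cols):
        u = list(a)
        for j, q in enumerate(Q):
            R[j][i] = dot(q, a)  # classical variant: project the original a_i
            u = [ue - R[j][i] * qe for ue, qe in zip(u, q)]
        R[i][i] = math.sqrt(dot(u, u))
        Q.append([ue / R[i][i] for ue in u])
    return Q, R


# Well-conditioned input: the two computed columns are orthogonal.
Q_good, R_good = cgs([[3.0, 4.0, 0.0], [1.0, 2.0, 2.0]])
ortho_good = abs(dot(Q_good[0], Q_good[1]))

# Nearly rank-deficient (Lauchli-type) input: eps**2 is lost when added to 1.
eps = 1e-8
Q_bad, R_bad = cgs([[1.0, eps, 0.0, 0.0],
                    [1.0, 0.0, eps, 0.0],
                    [1.0, 0.0, 0.0, eps]])
ortho_bad = abs(dot(Q_bad[1], Q_bad[2]))  # close to 0.5 instead of 0
```

In the second case the inner product of the second and third computed columns is about 0.5, i.e., the vectors are far from orthogonal although every operation was carried out in double precision.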
Existing distributed methods
Assuming that each node k stores its local values \({u_{k}^{2}}\) and q _{ k } a _{ k }, it is then straightforward to redefine the CGS in a distributed way, suitable for a WSN, by following the definition of the ℓ ^{2} norm, i.e., \(\left \|{\mathbf {u}}\right \|^{2}_{2} ={u_{1}^{2}}+{u_{2}^{2}}+\dots +{u_{n}^{2}}\) (cf. (4)), and inner products, 〈q,a〉=q _{1} a _{1}+q _{2} a _{2}+⋯+q _{ n } a _{ n } (cf. (5)). The summations can then be computed using any distributed aggregation algorithm, e.g., average consensus [20]^{1} (see Section 2), and asynchronous gossiping algorithms [24], using only communication with the neighbors.
Nevertheless, to our knowledge, all existing distributed algorithms for orthogonalizing a set of vectors are based on the gossip-based push-sum algorithm [16, 24]. Specifically, in [3], the authors used a distributed CGS based on gossiping for solving a distributed least squares problem, and in [15], a gossip-based distributed algorithm for modified Gram-Schmidt orthogonalization (MGS) was designed and analyzed. The authors also provided a quantitative comparison to existing parallel algorithms for QR factorization. A slight modification of the latter algorithm was introduced in [25], which we use for comparison in this paper. We denote the two gossip-based distributed Gram-Schmidt orthogonalization algorithms as GCGS [3] and GMGS [25], respectively.
Since the classical Gram-Schmidt orthogonalization computes each column of the matrix Q from the previous column recursively, i.e., to know vector q _{2}, we need to compute the norm of u _{2} which depends on vector q _{1}, the existing distributed algorithms always need to wait for convergence of one column before proceeding with the next column. This may be a big disadvantage in WSNs as it requires a lot of transmissions. Also, if the algorithm fails at some moment, e.g., due to transmission errors, the matrices Q and R are incomplete and unusable for further application.
In contrast, the distributed algorithm proposed in this paper overcomes these disadvantages and computes approximations of all elements of the matrices Q and R simultaneously. All the norms and inner products are refined iteratively, which leads to a significant decrease in the number of transmitted messages, and the algorithm provides an intermediate approximation of the whole matrices Q and R at any time instance.
Distributed classical Gram-Schmidt with simultaneous elements refinement
As mentioned in Section 3.2, the Gram-Schmidt orthogonalization method can be computed in a distributed way using any distributed aggregation algorithm. We refer to CGS based on the average consensus (see Section 2) as ACCGS. ACCGS as well as GCGS [3] and GMGS [25] have the following substantial drawback.
In all Gram-Schmidt orthogonalization methods, the computation of the norms ∥u _{ i }∥ and the inner products 〈q _{ j },a _{ i }〉,〈q _{ j },q _{ j }〉, occurring in the matrices Q and R, depends on the norms and inner products computed from the previous columns of the input matrix A. Therefore, each node k must wait until the estimates of the previous norms ∥u _{ j }∥ (j < i) have achieved an acceptable accuracy before processing the next norm ∥u _{ i }∥ (a “cascading” approach; see [15]). The same holds also for computing the inner products. We here present a novel approach overcoming this drawback.
Rewriting Eqs. (4) and (5) by a recursion, we obtain
\[\hat {{\mathbf {q}}}_{i}(t) = \frac {\hat {{\mathbf {u}}}_{i}(t)}{\sqrt {N\tilde {{\mathbf {u}}}_{i}(t-1)}}, \tag{6}\]
where \(\tilde {{\mathbf {u}}}_{i}(t)\) is the approximation of \(1/N\left \|{\mathbf {u}}_{i}\right \|_{2}^{2}{\mathbf {1}}\) at time t and
\[\hat {{\mathbf {u}}}_{i}(t) = {\mathbf {a}}_{i} - \sum _{j=1}^{i-1}\frac {\tilde {{\mathbf {p}}}^{(2)}_{j+(i-1)(i-2)/2}(t-1)\circ \hat {{\mathbf {q}}}_{j}(t-1)}{\tilde {{\mathbf {q}}}_{j}(t-1)}, \tag{7}\]
with \(\tilde {{\mathbf {p}}}^{(2)}_{j+(i-1)(i-2)/2}(t)\) being an approximation of the off-diagonal inner products 1/N〈q _{ j },a _{ i }〉1 (∀j<i) of matrix R (cf. (3)) and \(\tilde {{\mathbf {q}}}_{j}(t)\) an approximation of 1/N〈q _{ j },q _{ j }〉1 at time t. Similarly, we define \(\tilde {{\mathbf {p}}}^{(1)}_{i}(t)\) to be an approximation of 1/N〈q _{ i },a _{ i }〉1. As we show later, \(\tilde {{\mathbf {u}}}_{i}(t)\), \(\tilde {{\mathbf {q}}}_{j}(t)\), \(\tilde {{\mathbf {p}}}^{(1)}_{i}(t)\), and \(\tilde {{\mathbf {p}}}^{(2)}_{j+(i-1)(i-2)/2}(t)\) converge to \(1/N\left \|{\mathbf {u}}_{i}\right \|_{2}^{2}{\mathbf {1}}\), 1/N〈q _{ j },q _{ j }〉1, 1/N〈q _{ i },a _{ i }〉1, and 1/N〈q _{ j },a _{ i }〉1, respectively.
Similarly to the state-of-the-art methods (see Section 3.2), we further assume that the matrices \({\mathbf {A}}\in ~\mathbb {R}^{n\times m}\) and \({\mathbf {Q}}\in ~\mathbb {R}^{n\times m}\) are distributed over the network row-wise, meaning that each node stores at least one row of the matrix A and the corresponding rows of the matrix Q, and each node stores the whole matrix R. In case n>N, some nodes must store more than one row, and each such node must sum its data locally before broadcasting to its neighbors. Obviously, the data distribution over the network influences the speed of convergence of the algorithm, as can be seen also in the simulations ahead (see Section 5).
Notation A _{ k },Q _{ k }(t) here represent the rows of the matrices A and Q at a given node k at time t. If more rows are stored in one node, A _{ k } and Q _{ k }(t) are matrices, otherwise they are row vectors. Matrix R ^{(k)}(t) represents the whole matrix R at node k at time t.
From a global (network) point of view, the algorithm is defined in Algorithm 1.
Proof of convergence of DSCGS.
For the first column, i=1, \(\hat {{\mathbf {u}}}_{1}(t) = {\mathbf {a}}_{1}\), and thus the convergence results of the average consensus (see Section 2) apply, i.e., as t→∞, the nodes will monotonically reach the common values, i.e., \(\tilde {{\mathbf {u}}}_{1}(t)=1/N\left \|{\mathbf {a}}_{1}\right \|^{2}_{2}{\mathbf {1}}\) and thus also \(\hat {{\mathbf {q}}}_{1}(t)=\frac {{\mathbf {a}}_{1}}{\left \|{\mathbf {a}}_{1}\right \|^{2}_{2}}\), \(\tilde {{\mathbf {q}}}_{1}(t)=1/N{\mathbf {1}}\), \(\tilde {{\mathbf {p}}}_{1}^{(1)}(t)=1/N\left \|{\mathbf {a}}_{1}\right \|^{2}_{2}{\mathbf {1}}\), and \(\tilde {{\mathbf {p}}}^{(2)}_{1}(t)=1/N\langle {\mathbf {a}}_{1}, {\mathbf {a}}_{2}\rangle {\mathbf {1}}\).
Furthermore, for all columns i>1, all the elements ultimately depend only on the first column (i=1), e.g., by Eq. (7), \(\hat {{\mathbf {u}}}_{2}(t)={\mathbf {a}}_{2}-\frac {\tilde {{\mathbf {p}}}^{(2)}_{1}(t-1)\circ \hat {{\mathbf {q}}}_{1}(t-1)}{\tilde {{\mathbf {q}}}_{1}(t-1)}\) (where, from Eq. (6), \(\hat {{\mathbf {q}}}_{1}(t)=\frac {\hat {{\mathbf {u}}}_{1}(t)}{\sqrt {N\tilde {{\mathbf {u}}}_{1}(t-1)}}\)). Thus, eventually, \(\hat {{\mathbf {u}}}_{2}(t)\) will converge to u _{2} (Eq. (5)), and likewise all norms and inner products (Eqs. (4) and (5)) of the matrices Q and R.
Intuitively, we can see that as \(\tilde {{\mathbf {u}}}_{1}(t)\) converges to its steady state, all other variables converge, with some “delay,” to their steady states as well. We may say that as the first column converges, it “drags” other elements to their steady states. In the worst case, a subsequent column starts to converge only once the previous column has fully converged. This behavior differs from the known methods where we have to wait for \(\tilde {{\mathbf {u}}}_{1}(t)\) to converge before computing other terms.
Note that instead of knowing the number of nodes N and using it as a normalization constant, we could transmit an additional weight vector \(\boldsymbol {\omega }(t)\in \mathbb {R}^{N\times 1}\), i.e., Ψ ^{(0)}(t)=ω(t) and Ψ(t)=(Ψ ^{(0)}(t),Ψ ^{(1)}(t),Ψ ^{(2)}(t),Ψ ^{(3)}(t),Ψ ^{(4)}(t)), such that ω(0)=(1,0,…,0)^{⊤} and Eq. (6) would change only slightly^{2}, i.e.,
We note that the normalization constant N (or ω(t), respectively) affects only^{3} the orthonormality of the columns of the matrix Q(t) (without it, the columns remain orthogonal but are not normalized), and in case only orthogonality is sufficient, as in [26], we can omit this constant. We can, thus, overcome the necessity of knowing the number of nodes or reduce the amount of data transmitted in the network, respectively.
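The role of the weight vector can be illustrated in isolation: running average consensus on ω(0)=(1,0,…,0)^{⊤} drives every entry to 1/N, so each node can replace the unknown N by 1/ω _{ k }(t). The sketch below assumes a hypothetical ring of six nodes with Metropolis weights (1/3 for each link and for the node itself) and hypothetical initial data. It does not reproduce how ω(t) enters the modified Eq. (6).

```python
def ring_step(x):
    """One consensus step on a ring; Metropolis weights are 1/3 everywhere."""
    n = len(x)
    return [(x[k - 1] + x[k] + x[(k + 1) % n]) / 3.0 for k in range(n)]


N = 6                                  # unknown to the nodes in this scenario
omega = [1.0] + [0.0] * (N - 1)        # omega(0) = (1, 0, ..., 0)^T
x = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0]     # hypothetical local values, sum = 28

for _ in range(200):
    omega = ring_step(omega)
    x = ring_step(x)

# Local estimates available at every node without knowing N in advance.
size_estimates = [1.0 / w for w in omega]
sum_estimates = [xk / wk for xk, wk in zip(x, omega)]
```

The price is one additional transmitted value per message, which matches the trade-off between knowing N and transmitting more data described above.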
Relation to dynamic consensus algorithm
The dynamic consensus algorithm is a distributed algorithm which is able to track the average of a time-varying input signal. There exist many variations of the algorithm, e.g., [27–33]. Comparing the proposed DSCGS algorithm with a dynamic consensus algorithm from [30, 32], we observe an interesting resemblance.
Formulating DSCGS from a global point of view, i.e.,
we observe that it is a variant of the dynamic consensus algorithm with an “input signal” S(t). However, the “input signal” S(t) in our case is very complicated as it depends on X(t−1) and S(t−1) and cannot be considered as an independent signal as it is usually considered in dynamic consensus algorithms. Therefore, it is difficult to analyze the properties of this input signal and convergence conditions of DSCGS based on the dynamic consensus algorithm. It is also beyond the scope and focus of this paper to analyze this algorithm in general. Nevertheless, some analysis of this type of dynamic consensus algorithm, for a general input signal, together with the bounds on convergence speed, has been conducted in [34].
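To make the notion of dynamic consensus concrete, the following sketch tracks the average of an independent, time-varying input. It assumes one common form of the update, X(t)=W[X(t−1)+S(t)−S(t−1)] with X(0)=S(0), which is not necessarily the exact variant of [30, 32] nor the global formulation of DSCGS. The ring topology and the input signal are hypothetical.

```python
def ring_mix(x):
    """Multiply by W for a ring with weights 1/3 (self and both neighbors)."""
    n = len(x)
    return [(x[k - 1] + x[k] + x[(k + 1) % n]) / 3.0 for k in range(n)]


def signal(t):
    """Hypothetical local inputs s_k(t) that settle to fixed targets."""
    targets = [1.0, 4.0, 9.0, 16.0, 25.0, 5.0]   # mean of the targets is 10
    return [v * (1.0 - 0.5 ** t) for v in targets]


s_prev = signal(0)
x = list(s_prev)                       # X(0) = S(0)
mean_gap = 0.0
for t in range(1, 201):
    s = signal(t)
    # Inject the input increment, then mix with the neighbors.
    x = ring_mix([xk + sk - spk for xk, sk, spk in zip(x, s, s_prev)])
    s_prev = s
    # The network mean of X(t) equals the mean of S(t) at every step.
    mean_gap = max(mean_gap, abs(sum(x) / len(x) - sum(s) / len(s)))
```

In DSCGS the input is not independent as it is here, because it is computed from the previous consensus state, which is what makes the analysis difficult.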
Performance of DSCGS
In our simulations, we consider a connected WSN with N = 30 nodes. We explore the behavior of DSCGS for various topologies: fully connected (each node is connected to every other node), regular (each node has the same degree d), and geometric (each (randomly deployed) node is connected to all nodes within some radius ρ—a WSN model). If not stated otherwise, the randomly generated input matrix \({\mathbf {A}}\in ~ \mathbb {R}^{300\times 100}\) has uniformly distributed elements from the interval [0,1] and a low condition number κ(A)=35.7. In Section 5.3.2, we, however, investigate the influence of various input matrices with different condition numbers on the algorithm’s performance.
Also, except for Sections 5.3.1 and 5.4, we use the Metropolis weights (Eq. (2)) for the consensus weight matrix.
The confidence intervals were computed from several instantiations using a bootstrap method [35].
Orthogonality and factorization error
As performance metrics in the simulations, we use the following:

Relative factorization error, \(\frac {\left \|{\mathbf {A}}-{\mathbf {Q}}(t){\mathbf {R}}^{(k)}(t)\right \|_{2}}{\left \|{\mathbf {A}}\right \|_{2}}\), which measures the accuracy of the QR factorization at node k,

Orthogonality error, ∥I−Q(t) ^{⊤} Q(t)∥_{2}, which measures the orthogonality of the matrix Q(t) (see step 2 of the algorithm).
Note that both errors are calculated from the network (global) perspective and, as defined, are not known locally at the nodes, since only R ^{(k)}(t) is local at each node, whereas Q(t) is distributed row-wise across the nodes (Q _{ k }(t)). From now on, we simplify the notation by dropping the index t in Q(t) and R ^{(k)}(t). The simulation results for a geometric topology with an average node degree \(\bar {d}=~8.533\) are depicted in Fig. 1. Since both errors behave almost identically (compare Fig. 1 a, b) and since each node k can compute a local factorization error ∥A _{ k }−Q _{ k } R ^{(k)}∥_{2}/∥A _{ k }∥_{2} from its local data, we conjecture that such local error evaluation can be used also as a local stopping criterion in practice. Note that this fact was used in [26] for estimating a network size.
Note that the error in the initial stage in Fig. 1 is caused by disagreement among the nodes and by norms and inner products that have not yet converged, i.e., the values of \(\tilde {{\mathbf {u}}}(t)\), \(\tilde {{\mathbf {Q}}}(t)\), \(\tilde {{\mathbf {P}}}^{(1)}(t)\), and \(\tilde {{\mathbf {P}}}^{(2)}(t)\). We also observe that the error floor^{4} is highly influenced by the network topology, weights of matrix W, and condition number of input matrix A. We investigate these properties in Section 5.3.
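The conjectured local stopping criterion can be sketched for a node that stores a single row, in which case the 2-norm reduces to the Euclidean norm of a row vector. The numbers below are a hypothetical 3×2 example, and `R_early` stands for an estimate of R that has not yet converged. The tolerance `tol` is an assumed parameter, not a value from the paper.

```python
import math


def local_factorization_error(a_row, q_row, R):
    """Relative error ||a_k - q_k R||_2 / ||a_k||_2 for one locally stored row."""
    m = len(a_row)
    recon = [sum(q_row[j] * R[j][i] for j in range(m)) for i in range(m)]
    num = math.sqrt(sum((a - r) ** 2 for a, r in zip(a_row, recon)))
    den = math.sqrt(sum(a * a for a in a_row))
    return num / den


s = math.sqrt(4.16)
a_k = [3.0, 1.0]                      # local row of A
q_k = [0.6, -0.32 / s]                # local row of Q
R_exact = [[5.0, 2.2], [0.0, s]]      # converged local copy of R
R_early = [[5.0, 2.0], [0.0, 2.0]]    # hypothetical early estimate of R

err_exact = local_factorization_error(a_k, q_k, R_exact)
err_early = local_factorization_error(a_k, q_k, R_early)

tol = 1e-6                            # assumed stopping tolerance
stop_now = err_exact < tol
keep_iterating = err_early >= tol
```

Whether a small local error implies a small global error is only conjectured in the text, so a criterion of this kind should be read as a heuristic.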
Initial data distribution
If n>N, some nodes store more than one row of A. Thus, before doing distributed summation (broadcasting to neighbors), every node has to locally sum the values of its local rows.
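The local pre-summation can be sketched as follows for the squared norm of a vector u with n = 6 entries spread unevenly over N = 3 nodes. The row assignment is hypothetical, and the exact average used here stands in for the value that the consensus iterations would converge to.

```python
def local_partial_sums(u, rows_at_node):
    """Each node sums the squares of the entries of u that it stores locally."""
    return [sum(u[r] ** 2 for r in rows) for rows in rows_at_node]


u = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]        # n = 6 entries
rows_at_node = [[0, 1, 2, 3], [4], [5]]   # N = 3 nodes, node 0 stores 4 rows

partial = local_partial_sums(u, rows_at_node)

# Stand-in for the consensus limit: the network average of the partial sums.
average = sum(partial) / len(partial)
norm_squared = len(partial) * average     # N times the average gives ||u||^2
```

Only one value per node and per quantity enters the consensus, regardless of how many rows the node stores.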
Simulations show that the convergence behavior of DSCGS strongly depends on the distribution of the rows across the network (see Fig. 2). We investigate the following cases: (1) each node stores ten rows of A (“uniform”); (2) 271 rows are stored in the node with the lowest degree, the other 29 rows in the remaining 29 nodes; and (3) 271 rows are stored in the node with the highest degree, the rest in the remaining 29 nodes.
We observe that not only the initial distribution of the data influences the convergence behavior but also the topology of the underlying network. In the case of a regular topology (Fig. 2 a), the influence of the distribution is small and relatively weak in terms of convergence time but stronger in terms of the final accuracy achieved. We recognize that the difference between the nodes comes only from the variance of the values in input matrix A. On the other hand, in case of a highly irregular geometric topology (see Fig. 2 b), where the node with most neighbors stores most of the data, the algorithm converges much faster than in the case when most of the data are stored in a node with only few neighbors.
We further observe that in the “uniform” case, the algorithm behaves slightly differently for different distributions of the rows (although still having ten rows in each node). In Fig. 3, we show results for six different placements of the data across the nodes for three different topologies, where we depict the mean value and the corresponding confidence intervals of the simulated orthogonality error. As we can observe, in case of the fully connected topology, the data distribution is of no importance, since all the nodes exchange data in every step with all other nodes. In case of the geometric topology, however, the convergence of the algorithm is influenced by the distribution of data, even if every node contains the same number of rows (ten rows in each node). This can be recognized by the wider confidence intervals of the orthogonality error. Nevertheless, convergence in all these cases is faster than in the case when most data are stored in the “sparsest” node (cf. Fig. 2 b). In case of the regular topology, the difference is small only due to numerical accuracy of the mixing parameters.
Numerical sensitivity
As mentioned in Section 3.1, the classical Gram-Schmidt orthogonalization possesses some undesirable numerical properties [1, 23]. In comparison to centralized algorithms, numerical stability of DSCGS is furthermore influenced by the precision of the mixing weight matrix W, the network topology, and properties of input matrix A, i.e., its condition number (see Fig. 5 ahead) and the distribution of the numbers in the rows of the matrix (see Figs. 2 and 3). In this section, we provide simulation results showing these dependencies.
Weights
As mentioned in Section 2, matrix W can be selected in many ways. Mainly, the selection of the weights influences the speed of convergence. Unlike previous simulations, where we used the Metropolis weights (see Eq. (2)), here we selected constant weights for matrix W [20], i.e.,
where c∈(0,1]. Such weights, in general, lead to slower convergence. However, we can also see in Fig. 4 that the weights influence not only the speed of convergence but also the numerical accuracy of the algorithm (different error floors).
Condition numbers
It is well known that the classical Gram-Schmidt orthogonalization is numerically unstable [23]. In cases when input matrix A is ill-conditioned (high condition number) or rank-deficient (the matrix contains linearly dependent columns), the computed vectors Q can be far from orthogonal even when computed with high precision.
In this section, we study the influence of the condition number of input matrix A on the accuracy of the orthogonality. The condition number is defined with respect to inversion as the ratio of the largest and smallest singular value. In comparison to classical (centralized) Gram-Schmidt orthogonalization, we observe (Fig. 5 a) that the DSCGS algorithm behaves similarly, although it reaches neither the accuracy of ACCGS nor of the centralized algorithm (even in the fully connected network). We observe in all of the simulations that the orthogonality error in the first phase can reach very high values (due to divisions by numbers close to zero), which may influence the numerical accuracy in the final phase.
We further observe that the algorithm requires matrix A to be very well-conditioned even for the fully connected network. Unlike other methods, the factorization error in case of DSCGS has the same characteristics as the orthogonality error and is also influenced by the condition number of the input matrix, see Fig. 5 b. Although, as we noted in Section 5.1, orthogonality and factorization error of DSCGS behave almost identically, the dependence of the factorization error on the condition number κ(A) would need further investigation.
Figure 5 also shows that GMGS is the most robust method in comparison to the others. This is caused by the usage of the modified Gram-Schmidt orthogonalization instead of the classical one.
Mixing precision
Another factor influencing the algorithm’s performance is the numerical precision of the mixing weights W. Here, we simulate the case of a geometric topology with the Metropolis weights model, where the weights are of given precision—characterized by the number of variable decimal digits (4, 8, 16, 32, “Infinite”).^{5}
If we compare Fig. 6 with Fig. 7, we find that the numerical precision of the mixing weights has a bigger influence when the input matrix is worse conditioned. In Figs. 8 and 9, we can see the difference between orthogonality errors for various precisions. We observe that for a matrix A with a higher condition number, the mixing precision has a bigger impact on the result.
As we find in Fig. 6, the error floor moves with the mixing precision. However, we must note that even for the “infinite” mixing precision the orthogonality error stalls at an accuracy (∼10^{−12}) worse than the machine precision used, even taking into account the conversion to double precision. From the simulations, we conclude that this is caused by the high numerical dynamic range in the first phases of the algorithm as well as by the errors created by the disagreement among the nodes during the transient phase of the algorithm.
Robustness to link failures
In case of distributed algorithms, it is of great importance that the algorithm is robust against network failures. Typical failures in a WSN are message losses or link failures, which occur for many reasons, e.g., channel fading, congestion, message collisions, moving nodes, or dynamic topology.
We model link failures as a temporary dropout of a bidirectional connection between two nodes, meaning that no message can be transmitted between the nodes. In every time step, we randomly remove some percentage of links in the network. As a weight model, we picked the constant weights model, Eq. (8), due to its property that every node can compute at each time step the weights locally based only on the number of received messages (d _{ i }). Thus, no global knowledge is required. However, the nodes must still work synchronously.^{6}
From Fig. 10, we conclude that the algorithm is very robust: even if we drop a large percentage (up to 60 %) of the links in every time step, the algorithm still achieves some accuracy (at least 10^{−2}; Fig. 10 c).
It is worth noting that moving nodes and dynamic network topology can be modeled in the same way. We therefore argue that the algorithm is robust also to such scenarios (assuming that synchronicity is guaranteed).
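The link-failure model can be sketched for plain average consensus (not for the full DSCGS recursion). The example assumes a constant-weight rule of the form c/N on every link that is active in the current time step, with the diagonal weight 1−c·d _{ i }(t)/N following from the number of messages actually received. This form may differ from the paper's Eq. (8), and the ring topology, the seed, and the parameter values are hypothetical.

```python
import random


def step_with_failures(x, edges, drop_prob, c, rng):
    """One synchronous consensus step with random bidirectional link dropouts."""
    n = len(x)
    # A dropped link carries no message in either direction during this step.
    alive = [e for e in edges if rng.random() >= drop_prob]
    new_x = list(x)
    for i, j in alive:
        # Weight c/n on each active link; the diagonal weight follows implicitly.
        new_x[i] += (c / n) * (x[j] - x[i])
        new_x[j] += (c / n) * (x[i] - x[j])
    return new_x


rng = random.Random(0)
n = 8
edges = [(k, (k + 1) % n) for k in range(n)]    # hypothetical ring topology
x0 = [float(k) for k in range(n)]               # initial values, mean 3.5
x = list(x0)
for _ in range(2000):
    x = step_with_failures(x, edges, drop_prob=0.6, c=1.0, rng=rng)

mean_drift = abs(sum(x) / n - sum(x0) / n)
spread = max(x) - min(x)
```

Because each dropout removes the link in both directions, the effective weight matrix stays symmetric in every step and the network average is preserved, so the failures slow the convergence down without biasing the result.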
Performance comparison with existing algorithms
We compare our new DSCGS algorithm with ACCGS, GCGS, and GMGS introduced in Section 3.2. Although all approaches have iterative aspects, the cost per iteration strongly differs for each algorithm. Thus, instead of providing a comparison in terms of number of iterations to converge, we compare the communication cost needed for achieving a certain accuracy of the result. We investigate the total number of messages sent as well as the total amount of data (real numbers) exchanged.
Simulation results for various topologies are shown in Figs. 11 and 12. The gossip-based approaches exchange, in general, less data (Fig. 12), but since their message size is much smaller than in DSCGS, the total number of messages sent is higher (Fig. 11).
Because the message size of ACCGS is even smaller than in the gossip-based approaches, it sends the highest number of messages. Since the energy consumption in a WSN is mostly influenced by the number of transmissions [36, 37], it is better to transmit as few messages as possible (with any payload size); therefore, DSCGS is the most suitable method for a WSN scenario. However, we notice that in many cases, DSCGS does not achieve the same final accuracy of the result as the other methods.
Note that in fully connected networks, ACCGS delivers a highly accurate result from the beginning, because within the first iterations, all nodes exchange the required information with all other nodes.
In Table 1, we summarize the total communication cost and local memory requirements of the algorithms. However, due to different parameters, it is difficult to rank the approaches in a general case. The requirements depend especially on the topology of the underlying network, the number of iterations I ^{(s)} and I ^{(d)} required for convergence in “static” and “dynamic” consensus-based algorithms or the number of rounds R needed for convergence of push-sum in the gossip-based approaches. For example, in a fully connected network R=O(log N) [24], I ^{(s)}=1. Thus, ACCGS requires O(m ^{2} N) messages sent as well as data exchanged, whereas gossip-based approaches need O(mN log N) messages and O(m ^{2} N log N) data. Note that GCGS and GMGS have theoretically identical communication cost; however, GMGS is numerically more stable (see Fig. 5) and achieves a higher final accuracy (see Figs. 11 and 12). In case of DSCGS and a fully connected network, we can interpret DSCGS in the worst case as m consecutive static consensus algorithms (one for each column); thus, I ^{(d)}=O(m), and the number of transmitted messages is O(mN) and data O(m ^{3} N). Nevertheless, theoretical convergence bounds of DSCGS (on I ^{(d)}) remain an open research question.
Conclusions
We presented a novel distributed algorithm for computing the QR decomposition and provided an analysis of its properties. In contrast to existing methods, which compute the columns of the resulting matrix Q consecutively, our method iteratively refines all elements at once. Thus, at any moment, the algorithm can deliver an estimate of both matrices Q and R. The algorithm requires considerably fewer transmitted messages than known distributed orthogonalization algorithms, which makes it suitable for energy-constrained WSNs. Based on our empirical observations, we argue that evaluating the local factorization error at each node might lead to a suitable stopping criterion for the algorithm. We also provided a thorough study of its numerical properties, analyzing the influence of the precision of the mixing weights and of the condition number of the input matrix. Furthermore, we analyzed the robustness of the algorithm to link failures and showed that the algorithm is capable of reaching a certain accuracy even for a high percentage of link failures.
The biggest drawback of the algorithm is that it requires synchronously operating nodes, which leads to poor robustness when messages are sent (or lost) asynchronously. As we showed, because the algorithm originates from the classical Gram-Schmidt orthogonalization, its numerical sensitivity is also a significant issue that needs to be addressed in the future. Optimizing the weights and designing the algorithm such that it avoids a large dynamic numerical range, especially in the first phases, is also of interest.
An alternative approach, not considered here but possibly worth future research, would be to formulate the distributed algorithm as an optimization problem, e.g., \(\min \|\mathbf{A}-\mathbf{Q}\mathbf{R}\|\) s.t. \(\mathbf{Q}^{\top}\mathbf{Q}=\mathbf{I}\). In the literature, there exist many distributed optimization methods, e.g., [38, 39], which could lead to superior algorithms with faster convergence and smaller error floors.
Last but not least, theoretical bounds on the convergence time and rate of DSCGS remain an open issue. A first application of the algorithm has already been proposed in [26]. Also, since the proposed algorithm is not restricted to wireless sensor networks, transferring it onto so-called network-on-chip platforms [40] could lead to further interesting and practical applications.
Endnotes
^{1}Knowing n, \(\left\|\mathbf{u}\right\|^{2}_{2} = n\lim_{t\to\infty}\mathbf{W}^{t}(\mathbf{u}\circ\mathbf{u})=\sum_{i=1}^{n}u_{i}^{2}\).
^{2} \(\lim_{t\to\infty}\boldsymbol{\omega}(t)=\frac{1}{N}\mathbf{1}\).
^{3}Not considering numerical properties.
^{4}Error level at which the algorithm stalls at given computational precision.
^{5}The simulations were performed in Matlab R2011b 64-bit using the Symbolic Math Toolbox with variable precision arithmetic. “Infinite” precision denotes weights represented as an exact ratio of two numbers. The depicted result after “infinite” precision multiplication was converted to double precision.
^{6}If there is a link, nodes see each other and immediately exchange messages. From a mathematical point of view, this implies that weight matrix W will be doubly stochastic [1] in every time step.
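The relation in endnote 1 can be checked numerically with a small sketch: every node starts from the square of its own entry, repeatedly averages with its neighbours through a doubly stochastic weight matrix, and finally scales by the known n. The ring topology, the weight value 1/3, and the names `ring_weights` and `consensus_sq_norm` are assumptions chosen for illustration and are not taken from the article.

```python
def ring_weights(n, w=1 / 3):
    """Symmetric, doubly stochastic weights for a ring of n >= 3 nodes."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][i] = 1.0 - 2.0 * w
        W[i][(i - 1) % n] = w
        W[i][(i + 1) % n] = w
    return W


def consensus_sq_norm(u, W, iterations):
    """Per-node estimates of ||u||_2^2 obtained by average consensus."""
    n = len(u)
    x = [ui * ui for ui in u]  # element-wise square, u o u
    for _ in range(iterations):
        # one synchronous consensus step, x <- W x
        x = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
    return [n * xi for xi in x]  # scale the average by the known n


u = [1.0, 2.0, 3.0, 4.0, 5.0]
estimates = consensus_sq_norm(u, ring_weights(len(u)), iterations=200)
exact = sum(ui * ui for ui in u)
```

After enough iterations, every entry of `estimates` agrees with `exact` up to rounding, so each node holds the squared norm without any node having seen the whole vector.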
Appendix: local algorithm
For better clarity, we here reformulate the DSCGS algorithm from the point of view of an individual node i (local point of view). Note that the input matrix A is stored row-wise in the nodes, and for simplicity, we show here the case when the number of rows of matrix \({\mathbf {A}}\in \mathbb {R}^{n\times m}\) is equal to the number of nodes in the network. For a formulation from the network (global) point of view and an arbitrary size of matrix A, see Section 4.
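As a rough illustration of the row-wise data layout only, the sketch below runs textbook classical Gram-Schmidt with node i holding row i of A and of Q. It is not the DSCGS algorithm: it processes the columns one after another instead of refining all elements simultaneously, and the hypothetical helper `network_sum` computes exactly the sums that a real network would have to estimate by consensus. The names `network_sum` and `rowwise_cgs` are assumptions introduced here.

```python
import math


def network_sum(local_values):
    """Stand-in for a network-wide sum (estimated by consensus in a real WSN)."""
    return sum(local_values)


def rowwise_cgs(rows):
    """Classical Gram-Schmidt with one row of A (and of Q) per node.

    rows: list of N rows of length m; full column rank is assumed.
    Returns (q_rows, R) such that A = Q R and Q has orthonormal columns.
    """
    N, m = len(rows), len(rows[0])
    q_rows = [[0.0] * m for _ in range(N)]
    R = [[0.0] * m for _ in range(m)]
    for j in range(m):
        # r_kj = <q_k, a_j>: node i contributes the local product q_ik * a_ij
        for k in range(j):
            R[k][j] = network_sum(q_rows[i][k] * rows[i][j] for i in range(N))
        # purely local step at node i: remove the projections from a_ij
        v = [rows[i][j] - sum(R[k][j] * q_rows[i][k] for k in range(j))
             for i in range(N)]
        # r_jj is the column norm: node i contributes v_i^2
        R[j][j] = math.sqrt(network_sum(vi * vi for vi in v))
        for i in range(N):
            q_rows[i][j] = v[i] / R[j][j]
    return q_rows, R
```

Every quantity that needs information from other nodes passes through `network_sum`; everything else uses only the node's own row, which is the property the local formulation relies on.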
References
 1
GH Golub, CF Van Loan, Matrix Computations, 3rd Ed. (Johns Hopkins Univ. Press, Baltimore, USA, 1996).
 2
JM Lees, RS Crosson, in Spatial Statistics and Imaging, 20, ed. by A Possolo. Bayesian ART versus conjugate gradient methods in tomographic seismic imaging: an application at Mount St. Helens, Washington (IMS Lecture Notes-Monograph Series, Hayward, CA, 1991), pp. 186–208.
 3
C Dumard, E Riegler, in Int. Conf. on Telecom. ICT ’09. Distributed sphere decoding (IEEE, Marrakech, 2009), pp. 172–177.
 4
G Tauböck, M Hampejs, P Svac, G Matz, F Hlawatsch, K Gröchenig, Low-complexity ICI/ISI equalization in doubly dispersive multicarrier systems using a decision-feedback LSQR algorithm. IEEE Trans. Signal Process. 59(5), 2432–2436 (2011).
 5
E Hänsler, G Schmidt, Acoustic Echo and Noise Control (Wiley, Chichester, New York, Brisbane, Toronto, Singapore, 2004).
 6
PSR Diniz, Adaptive Filtering—Algorithms and Practical Implementation (Springer, US, 2008).
 7
AH Sayed, Adaptation, Learning, and Optimization over Networks, vol. 7 (Foundations and Trends in Machine Learning, Boston-Delft, 2014).
 8
KJ Cho, YN Xu, JG Chung, in IEEE Workshop on Signal Processing Systems. Hardware efficient QR decomposition for GDFE (IEEE, Shanghai, China, 2007), pp. 412–417.
 9
X Wang, M Leeser, A truly two-dimensional systolic array FPGA implementation of QR decomposition. ACM Trans. Embed. Comput. Syst. 9(1), 3:1–3:17 (2009).
 10
A Buttari, J Langou, J Kurzak, J Dongarra, in Proc. of the 7th International Conference on Parallel Processing and Applied Mathematics. Parallel tiled QR factorization for multicore architectures (Springer, Berlin, Heidelberg, 2008), pp. 639–648.
 11
J Demmel, L Grigori, MF Hoemmen, J Langou, Communication-optimal parallel and sequential QR and LU factorizations (2008). Technical report, no. UCB/EECS-2008-89, EECS Department, University of California, Berkeley.
 12
F Song, H Ltaief, B Hadri, J Dongarra, in International Conference for High Performance Computing, Networking, Storage and Analysis. Scalable tile communication-avoiding QR factorization on multicore cluster systems (IEEE Computer Society, Washington, DC, USA, 2010), pp. 1–11.
 13
M Shabany, D Patel, PG Gulak, A low-latency low-power QR-decomposition ASIC implementation in 0.13 μm CMOS. IEEE Trans. Circ. Syst. I. 60(2), 327–340 (2013).
 14
A Nayak, I Stojmenović, Wireless Sensor and Actuator Networks: Algorithms and Protocols for Scalable Coordination and Data Communication (Wiley, Hoboken, NJ, 2010).
 15
H Straková, WN Gansterer, T Zemen, in Proc. of the 9th International Conference on Parallel Processing and Applied Mathematics, Part I. Lecture Notes in Computer Science, 7203. Distributed QR factorization based on randomized algorithms (Springer, Berlin, Heidelberg, 2012), pp. 235–244.
 16
H Straková, Truly distributed approaches to orthogonalization and orthogonal iteration on the basis of gossip algorithms (2013). PhD thesis, University of Vienna.
 17
O Slučiak, M Rupp, in Proc. of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Reaching consensus in asynchronous WSNs: algebraic approach (Prague, 2011), pp. 3300–3303.
 18
O Slučiak, M Rupp, in Proc. of Statistical Sig. Proc. Workshop (SSP). Almost sure convergence of consensus algorithms by relaxed projection mappings (IEEE, Ann Arbor, MI, USA, 2012), pp. 632–635.
 19
F Sivrikaya, B Yener, Time synchronization in sensor networks: a survey. IEEE Netw. Mag. Special Issues Ad Hoc Netw. Data Commun. Topol. Control. 18(4), 45–50 (2004).
 20
R Olfati-Saber, JA Fax, RM Murray, Consensus and cooperation in networked multi-agent systems. Proc. IEEE. 95(1), 215–233 (2007).
 21
L Xiao, S Boyd, Fast linear iterations for distributed averaging. Syst. Control Lett. 53, 65–78 (2004).
 22
L Xiao, S Boyd, S Lall, in Proc. ACM/IEEE IPSN–05. A scheme for robust distributed sensor fusion based on average consensus (IEEE, Los Angeles, USA, 2005), pp. 63–70.
 23
LN Trefethen, D Bau III, Numerical Linear Algebra (SIAM: Society for Industrial and Applied Mathematics, Philadelphia, 1997).
 24
D Kempe, A Dobra, J Gehrke, in Foundations of Computer Science, 2003. Proceedings. 44th Annual IEEE Symposium on. Gossip-based computation of aggregate information (2003), pp. 482–491. ISSN: 0272-5428, doi:10.1109/SFCS.2003.1238221.
 25
H Straková, WN Gansterer, in 21st Euromicro Int. Conf. on Parallel, Distributed, and Network-Based Processing (PDP). A distributed eigensolver for loosely coupled networks (IEEE, Belfast, UK, 2013), pp. 51–57.
 26
O Slučiak, M Rupp, Network size estimation using distributed orthogonalization. IEEE Sig. Proc. Lett. 20(4), 347–350 (2013).
 27
P Braca, S Marano, V Matta, in Proc. Int. Conf. Inf. Fusion (FUSION 2008). Running consensus in wireless sensor networks (IEEE, Cologne, Germany, 2008), pp. 152–157.
 28
W Ren, in Proc. of the 2007 American Control Conference. Consensus seeking in multi-vehicle systems with a time-varying reference state (IEEE, New York, NY, 2007), pp. 717–722.
 29
V Schwarz, C Novak, G Matz, in Proc. 43rd Asilomar Conf. on Sig., Syst., Comp. Broadcast-based dynamic consensus propagation in wireless sensor networks (IEEE, Pacific Grove, CA, 2009), pp. 255–259.
 30
M Zhu, S Martínez, Discrete-time dynamic average consensus. Automatica. 46(2), 322–329 (2010).
 31
O Slučiak, O Hlinka, M Rupp, F Hlawatsch, PM Djurić, in Rec. of the 45th Asilomar Conf. on Signals, Systems, and Computers. Sequential likelihood consensus and its application to distributed particle filtering with reduced communications and latency (IEEE, Pacific Grove, CA, 2011), pp. 1766–1770.
 32
O Slučiak, H Straková, M Rupp, WN Gansterer, in Rec. of the 46th Asilomar Conf. on Signals, Systems, and Computers. Distributed Gram-Schmidt orthogonalization based on dynamic consensus (IEEE, Pacific Grove, CA, 2012), pp. 1207–1211.
 33
P Braca, S Marano, V Matta, AH Sayed, in Proc. of the 39th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Large deviations analysis of adaptive distributed detection (IEEE, Florence, Italy, 2014), pp. 6153–6157.
 34
O Slučiak, Convergence analysis of distributed consensus algorithms (2013). PhD thesis, TU Vienna.
 35
B Efron, RJ Tibshirani, An Introduction to the Bootstrap (Chapman & Hall/CRC Monographs on Statistics & Applied Probability 57, London, UK, 1994).
 36
P Rost, G Fettweis, in GLOBECOM Workshops, 2010 IEEE. On the transmission-computation-energy tradeoff in wireless and fixed networks (IEEE, Miami, FL, 2010), pp. 1394–1399.
 37
R Shorey, A Ananda, MC Chan, WT Ooi, Mobile, Wireless, and Sensor Networks: Technology, Applications, and Future Directions (Wiley, Hoboken, NJ, 2006).
 38
B Johansson, On distributed optimization in networked systems (2008). PhD thesis, KTH, Stockholm.
 39
I Matei, JS Baras, Performance evaluation of the consensus-based distributed subgradient method under random communication topologies. IEEE J. Sel. Top. Signal Process. 5(4), 754–771 (2011).
 40
L Benini, GD Micheli, Networks on chips: a new SoC paradigm. IEEE Comput. 35(1), 70–78 (2002).
Acknowledgements
This work was supported by the Austrian Science Fund (FWF) under project grants S10608-N13 and S10611-N13 within the National Research Network SISE. Preliminary parts of this work were previously published at the 46th Asilomar Conf. Sig., Syst., Comp., Pacific Grove, CA, USA, Nov. 2012 [32].
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Slučiak, O., Straková, H., Rupp, M. et al. Distributed Gram-Schmidt orthogonalization with simultaneous elements refinement. EURASIP J. Adv. Signal Process. 2016, 25 (2016). https://doi.org/10.1186/s13634-016-0322-6
Keywords
 Distributed processing
 Gram-Schmidt orthogonalization
 QR factorization