Urban localization using robust filtering at multiple linearization points

We propose a robust Bayesian filtering framework for state and multi-modal uncertainty estimation in urban settings by fusing diverse sensor measurements. Our framework addresses multi-modal uncertainty from various error sources by tracking a separate probability distribution for linearization points corresponding to dynamics, measurements, and cost functions. Multiple parallel robust Extended Kalman filters (R-EKF) leverage these linearization points to characterize the state probability distribution. Employing Rao–Blackwellization, we combine the linearization point distribution with the state distribution, resulting in a unified, efficient, and outlier-resistant Bayesian filter that captures multi-modal uncertainty. Furthermore, we introduce a gradient descent-based optimization method to refine the filter parameters using available data. Evaluating our filter on real-world data from a multi-sensor setup comprising camera, Global Navigation Satellite System (GNSS), and Attitude and Heading Reference System (AHRS) demonstrates improved performance in bounding position errors based on uncertainty, while maintaining competitive accuracy and comparable computation to existing methods. Our results suggest that our framework is a promising direction for safe and reliable localization in urban environments.

sensors, thereby mitigating individual sensor limitations and leading to a more reliable estimate of the state.However, integrating multi-sensor data presents its own set of challenges, such as the need to mitigate large errors in sensor measurements and accommodate diverse noise sources [5,6].These diverse noise sources contribute to multiple modes of uncertainty that may yield multiple plausible solutions for the state.Hence, algorithms for multi-sensor state estimation must have the dual capability of being robust to large errors while accurately capturing the underlying multi-modal uncertainty.
Various techniques have been proposed to account for large errors and incorporate measurements from multiple sensors, such as optimization-based methods and robust Bayesian filters.Optimization-based methods, such as factor graphs with robust cost functions [7][8][9][10], are capable of handling nonlinearities and uncertainties by optimizing over a large state space consisting of multiple states and uncertainty parameters.However, these approaches are computationally demanding due to the size of the optimization space, and require significant parameter tuning to achieve good estimation performance [11].Moreover, state uncertainty is often not assessed in optimizationbased approaches due to the increased computational demands, which limits their utility in scenarios where reliable localization is crucial.
On the other hand, Bayesian filters-such as Extended Kalman filters (EKFs) and Unscented Kalman filters (UKFs)-estimate the probability distribution of the state based on the history of measurements [12].These filters utilize Bayesian statistics to track both the state estimate and its associated uncertainty by combining prior state information, vehicle dynamics, and sensor measurements across time.For example, EKFs have been used to provide localization estimates by integrating stereo camera measurements and Real-Time Kinematic GPS (RTK-GPS) [13].Another work [14] proposed a similar method by fusing GPS, INS, a monovision camera, and a 3D cartographical model.An adaptive EKF framework in [15] used Google Street View images with GPS for state estimation.Outlier-robust versions of these filters use strategies such as robust cost functions, statistical tests, minimax optimization, or heavy-tailed priors to handle outliers and model complex noise distributions.For example, robust UKFs were explored in [16] for tight integration of multiple GPS receivers with a monocular camera and an IMU, and in [17] for vehicle geo-localization with a GPS receiver, a video camera, and a 3D city model.While these filters are computationally efficient and easy to implement, they require careful parameter tuning based on the application to achieve good state and uncertainty estimation performance.This can become challenging in complex systems that involve multiple sensor measurements and their associated parameters [18].Moreover, these filters rely on a Gaussian distribution to model the state probability distribution, which restricts their capacity to adequately account for the multi-modal uncertainty that can arise due to sensor measurements in complex urban environments.
Particle filter, another type of Bayesian filter, captures multi-modal uncertainty over the state by tracking the distribution as a weighted collection of points in the state space [19].In a recent work [20], a particle filtering framework was proposed for fusing GNSS with camera images and for characterizing the uncertainty in localization from sensor fusion.Similarly, in [21], images from a monocular camera were fused with low-cost GPS sensors and a map to provide high-accuracy localization.Variants of particle filters have been explored for multi-modal sensor fusion such as decentralized filters [22] and differentiable filters [23].However, particle filters are often challenged by the "curse of dimensionality"-or exponentially increasing computational complexity with higher-dimensional state spaces-leading to prohibitive computational costs [24].
Recent literature has proposed two other promising approaches to efficiently capture multi-modal uncertainty over the state: employing a combination of filters, also known as Interacting Multiple Model(IMM) filters [25], and modeling the distribution as a mixture of Gaussian distributions [26].For example, IMM filters have been explored for providing positioning by integrating low-cost GPS and in-vehicle sensors while adapting the vehicle model to various driving conditions [27].An IMM filter was proposed in [28] to provide fault-tolerant positioning by integrating IMU with wheel encoders in GPS-degraded environments for mobile robots.Self-adaptive Gaussian mixture models were used in [29] to model the effects of non-Gaussian GNSS outliers.These approaches have shown an improved ability in capturing real-world uncertainty in a variety of robotics and navigation applications.However, these techniques also have limitations, such as model complexity and difficulty in parameter selection.
In addition to conventional approaches, various studies have introduced data-driven adaptations of filtering algorithms by incorporating machine learning and deep learning techniques.To enhance localization accuracy during GNSS outages, an adaptive Kalman filter that incorporates neural network-derived position and velocity measurements was proposed in [30].A fully differentiable pipeline to train an Extended Kalman Filter (EKF) for visual-inertial odometry was established in [31].Similarly, other researchers have explored neural network-based methodologies for modeling diverse parameters within filtering techniques [23,[32][33][34].These methods use existing data to automatically infer good parameters for improving the accuracy of localization.However, the lack of transparency within these methods poses a challenge, as relying on parameters fine-tuned for accuracy on a limited dataset may not adequately account for uncertainty in the general case.
The limitations of existing robust state estimation methods in effectively capturing uncertainty highlight the need for developing techniques to address these challenges.In this work, we propose a novel robust Bayesian filtering framework that uses Rao-Blackwellization to effectively capture multi-modal uncertainty while maintaining the accuracy and error resilience of existing robust Bayesian filtering approaches.Our approach enhances the robust EKF by tracking multiple points where the EKF and its robust cost function are linearized.By employing a particle filter to effectively track these linearization points, our framework can overcome the limitations of the linearization and Gaussian assumptions in the EKF, capturing the multi-modal uncertainty in a computationally efficient manner.
We use Rao-Blackwellization [35] to integrate the robust EKF and the particle filter into a single robust Bayesian filter that estimates a multi-modal probability distribution of the state.This facilitates the reuse of the standard filtering components in the EKF for efficient implementation [36].We then apply this filter to a multi-sensor setup comprising of camera, GNSS, and attitude and heading reference system (AHRS), where the sensor modules are developed based on our previous work [37].However, tuning the parameters for our complex multi-sensor setup is challenging.To address this, we devise a gradient descent-based optimization strategy for efficient parameter tuning.
Next, we use the estimated multi-modal uncertainty to develop an approach for computing position error bounds along the lateral, longitudinal, and vertical directions of motion.These bounds enable us to assess the positioning performance for safe operation in urban environments [38].Our approach is implemented in PyTorch [39] to leverage parallelization for fast execution and automatic differentiation capabilities.
Our strategy for using Rao-Blackwellization to improve the estimation of state and uncertainty shares similarities with [40,41] and distinguishes itself from the conventional Rao-Blackwellized particle filter in several key aspects.In traditional applications of Rao-Blackwellization, the overall state space of interest is partitioned into multiple substates, where each substate is tracked with a different filter [26,42,43].Our approach, on the other hand, introduces an additional term-the linearization point-and employs Rao-Blackwellization to factor it separately from the primary state of interest.Moreover, our approach emphasizes multi-sensor and robust filtering settings and focuses on characterizing multi-modal uncertainty to compute position error bounds, whereas the previous approaches do not address these aspects.
The contributions of this work are summarized as follows: • We develop a novel Bayesian filtering framework that captures multi-modal uncertainty while accommodating diverse sensor measurements and possible outliers.Our approach tracks a probability distribution of the points for linearizing the dynamics, observation, and robust cost models, effectively accounting for errors from approximation and diverse noise sources.• Drawing on recent advancements in differentiable filter design research [23,44], we present a gradient descent-based optimization strategy for tuning the parameters in our filter.The optimization objective is expressed as a total loss function that includes both measurement and position loss terms.• We present a method for estimating position error bounds along the lateral, longitudinal, and vertical directions of the vehicle's motion.Our proposed approach uses the multi-modal probability distribution obtained from our hybrid filter and explicitly accounts for uncertainties in both position and heading estimation.• We validate our approach using real-world data from Hong Kong [45], a dense urban environment, and demonstrate its effectiveness in achieving improved reliability in the position error bounds, while maintaining competitive state estimation performance compared to existing methods.Moreover, we demonstrate that the computational requirements of our approach are comparable to existing methods, making it a practical choice for state estimation in urban environments.
The rest of the paper is structured as follows: in Sect.2, we present background on robust Bayesian filters.Next, we derive our proposed robust filter for multi-modal uncertainty quantification in Sect.3. Section 4 outlines the dynamics and the sensor modules in our multi-sensor setup.We describe the estimation of position error bounds from the captured multi-modal uncertainty in Sect. 5.The loss functions and optimization strategy for tuning the filter parameters are presented in Sect.6.
Experimental results and discussions for the performance evaluation of our approach are provided in Sect.7. Finally, we conclude the paper with Sect.8.

Background on robust Bayesian filters
We consider a nonlinear discrete-time system with a time-invariant transition function f and a time-varying measurement function h t given as follows: where x t is the system state and m t is the measurement at time t.The process noise w t and the measurement noise ν t are zero mean with covariance matrix Q t and R t , respec- tively.The time-varying nature of h t accommodates an asynchronous measurement set- ting, where measurements from different sensors are obtained at distinct time instances due to varying sampling rates.The objective of Bayesian filtering is to estimate the probability distribution of the state vector x t given the sequence of measurement vectors m 1:t = {m 1 , . . ., m t } and the prior probability estimate p(x t−1 |m t−1 ) .Using Bayes' rule and the law of total probability, the posterior probability of the state conditioned on the measurements is given as where p(m t |x t ) is the likelihood function representing the probability of observing measurement m t from state x t .The term p(x t |x t−1 ) represents the transition probability from state x t−1 to x t .
Conventional approaches to Bayesian filtering approximate the probability distribution p(x t |m t ) as a Gaussian distribution for efficiency and tractability.For example, the extended Kalman filter (EKF) uses a linearized version of the transition and measurement functions, and the unscented Kalman filter (UKF) uses nonlinear function evaluations at preselected sigma points to model p(x t |m t ) [46].Using these approximations, p(x t |m t ) can be efficiently estimated from p(x t−1 |m t−1 ) and the measurements through Eq. 3 and matrix operations.This process is commonly broken down into two steps, namely the predict and the update step [47].However, the estimation accuracy of these approaches can substantially degrade when the process and measurement noise significantly deviates from the modeled Gaussian distribution, which is a common occurrence when using sensors in real-world settings [48][49][50].
To improve the accuracy in non-Gaussian settings, we can replace the Gaussian likelihood used in conventional Bayesian filtering approaches with a robust cost functionbased likelihood where r t = m t − h t (x t ) is the residual, and ρ(r t ) is the robust cost function, such as Huber loss or Tukey biweight loss [51].The robust cost function reweights the influence (1) of measurements based on their residuals, thereby reducing the impact of outliers and unmodeled errors.
The robust cost-based likelihood can be integrated into the EKF and UKF frameworks by modifying the update step [52][53][54][55].A common method to do this involves reweighting the measurement covariance matrices R t based on the first-order approx- imation of the cost function ψ(r t ) , also known as the influence function, The measurement covariance matrix R t is then reweighted using ψ(r t ) to obtain a robust covariance matrix Rt : The robust measurement covariance matrix Rt is then used in the update step of the EKF and UKF to compute the Kalman gain and update the estimate of the state and covariance.
The performance of robust Bayesian filtering depends on the careful selection of a robust cost function and its associated parameters, which can be challenging to identify as they depend on the sensor configuration and environmental characteristics [56].Furthermore, the approximations employed by these filters to enhance computational efficiency-such as linearizing in transition, measurements, and robust cost-can also introduce inaccuracies in both the estimated state and the quantification of uncertainty.Therefore, it is important to account for the impacts of these error sources to effectively represent the uncertainty within the filtering framework.

Proposed robust Bayesian filter with multi-modal uncertainty
In the context of multi-sensor navigation in urban environments, the presence of diverse and dynamic noise can cause significant deviation from the typically assumed Gaussian behavior in the sensor measurements.Instead, the measurements may exhibit multi-modal behavior, which then propagates to a multi-modal uncertainty in the state estimation process [29,49].While the robust Bayesian filters discussed in the previous section can improve state estimation accuracy by relying on a single Gaussian distribution to represent the state probability distribution, they are illequipped to effectively capture the underlying multi-modal uncertainty.
To address this challenge, we propose a robust Bayesian filter that incorporates multi-modal uncertainty modeling into the estimation process.Our approach extends the standard Bayesian filtering framework by explicitly incorporating a linearization point, denoted as x l t , at each timestep t.The linearization point serves as a reference for linearizing the transition and measurement models, as well as for computing the influence function.The overall approach is illustrated in Fig. 1 We first rewrite Eq. 3 to estimate the posterior probability of the state x t given the linearization point x l t and the measurements m 1:t (5) ψ(r t ) = dρ(r) dr r=r t . ( In this equation, the probability terms are computed in a manner analogous to the robust EKF approach described earlier in Sec. 2, with the key distinction that the linearization point x l t is used for linearizing the equations.In order to incorporate the uncertainty in the selection of an appropriate linearization point x l t , we model a separate probability distribution.The corresponding filtering equation is expressed as This equation has a similar form to Eq. 3 and is used to update the linearization point x l t based on available measurements m t .To effectively model the multi-modal uncertainty, we employ a particle filter to estimate p(x l t |m 1:t ) .The filter uses a set of weighted parti- cles to represent the probability distribution, where each particle is a linearization point.First, we use the prior distribution p(x l t−1 |m 1:t−1 ) to generate a set of prediction parti- cles given by the state transition model where w l t is the process noise for the linearization point at time t with covariance matrix Q l t .For simplicity, we set the covariance Q l t = Q t in this paper.However, we note that the choice of Q t can have a significant impact on the estimation accuracy and robustness.Future work can consider exploring the different choices of Q l t as a separate term-such as by taking advantage of the well-established methods in particle filter literature and the use of proposal distributions [57,58]-to improve the estimation performance.Based on the linearization point x l t corresponding to each particle, we can estimate the posterior state probability distribution p(x t |x l t , m 1:t ) using the Bayesian filtering expres- sion described in Eq. 8, while keeping the linearization point x l t unchanged.However, due to the randomness in the particle filter, tracking the linearization point in this manner can result in precision errors, particularly when the number of particles is low.To address this issue and improve the accuracy of the estimation process, we propose a twostep approach.First, we estimate p(x t |x l t , m 1:t ) using the linearization point x l t .Then, we refine the value of x l t by setting it to the mean of the posterior distribution p(x t |x l t , m 1:t ) .This refinement step ensures that the linearization point is more closely aligned with the state probability distribution, improving the accuracy of the estimation.
In the second step of the particle filter, we update the weight of each prediction particle based on the available measurements m t and likelihood p(m t |x l t ) .We model the likelihood as a normal distribution with mean based on the nonlinear measurement model h t (•) and covariance R l t To improve estimation performance, appropriate values of R l t can be assigned based on the application.By modeling R l t separately from R t , we can incentivize tracking distinct linearization points in high uncertainty scenarios without compromising on the estimation accuracy.
Finally, we perform a resampling step on the set of linearization points based on the updated weights to improve our estimate of the posterior distribution p(x l t |m 1:t ) .We use the soft resampling strategy proposed in [59] to maintain differentiability in the parameter optimization.

Algorithm 1: Proposed robust Bayesian filter
Input: Initial set of linearization points x l 0,i , i = 1, . . ., N , state estimates Output: Estimated posterior probability distribution p(x t |m 1:T ) 1 Initialize particle weights w 0,i = 1/N , i = 1, . . ., N; state covariance matrices Σ 0,i , i = 1, . . ., N Generate prediction particles x l t,i based on Eq. 9; To estimate the overall probability distribution of the state p(x t |m t ) , we combine the estimates from the particle filter and the robust EKF using the law of total probability and the Rao-Blackwell theorem which allows us to estimate p(x t |m 1:t ) by conditioning on the linearization point x l t .This approach has a smaller estimation variance than Eq. 3 based on the Rao-Blackwell theorem, allowing us to better capture the uncertainty in the estimate of the state x t .The overall algorithm is outlined in Algorithm 1.
The use of a particle filter allows us to capture multiple modes in the distribution of the linearization point, which in turn enables us to better represent the multi-modal uncertainty in state estimation.However, particle filters are often challenged by the "curse of dimensionality" when the state space is large.To address this challenge, we restrict the domain of the linearization point to the position domain, reducing the dimensionality of the particle filter state space.This enables us to maintain computational efficiency while modeling the uncertainty inherent in the system.

Multi-sensor state estimation
In this section, we describe how we apply the approach described in the previous sections for multi-sensor state and uncertainty estimation.Figure 2 illustrates the multi-sensor setup and the individual sensor modules.The camera module captures an image at each timestep, and utilizes robust features and depth estimated using deep neural networks to assess the motion.The GNSS module combines pseudorange measurements from multiple satellite constellations and a base station.The attitude and heading reference system (AHRS) provides the orientation and angular velocity measurements.Our proposed filter integrates all of these measurements to estimate the state and uncertainty.Further details regarding the filter and each of the sensor modules are provided in the subsequent sections.(11) p(x t |m 1:t ) = p(x l t |m 1:t )p(x t |x l t , m 1:t )dx l t , Fig. 2 Sensor modules in our multi-sensor setup.The setup includes a camera module with depth estimation and feature detection component, b GNSS module that combines pseudorange measurements from multiple satellite constellations and a nearby base station, and c AHRS module that provides orientation and angular velocity measurements.The camera module outputs velocity and landmark measurements, which are fused with the GNSS and AHRS measurements by our filter for state estimation

State space and dynamics model
The state vector x t = p t v t q t ωt consists of position p t , velocity v t , orientation quater- nion q t and the bias in angular velocity measurements ωt at time t.We model the vehi- cle motion as a constant velocity model using the angular velocity ω t obtained from the AHRS: where t is the time step between consecutive measurements, and q(ω) is the quater- nion derivative corresponding to the angular velocity ω .To avoid numerical stability issues, we re-normalize the quaternion q t to a unit quaternion each time we update it based on the dynamics.The process noise covariance matrix Q t is given as where Q p t , Q v t , and Q ω t denote the diagonal process noise covariance matrices for position, velocity, and angular velocity bias, respectively.We construct the process noise covariance for the quaternion Q q t by linearizing the Euler-to-quaternion transformation about q t , which enables us to incorporate correlation across terms and obtain better uncer- tainty characterization in the dynamics.To compute Q q t , we first obtain the Jacobian J q t of the Euler-to-quaternion transformation.We then use the diagonal 3 × 3 process noise covariance matrix Qq t in the Euler space, and apply it to J q t as To account for the varied acquisition rates of sensors, we utilize sequential or asynchronous updates within our filter [60,61].Specifically, the state probability distribution is updated incrementally as the measurements arrive, with different update steps executed for the measurements.This approach both improves computational efficiency and enables modularity, allowing easy addition or removal of the different sensor components.
In the following sections, we provide a detailed description of each sensor module and the corresponding filter observation model.

Camera module
Our visual odometry pipeline utilizes consecutive image frames {I t−1 , I t } from an onboard monocular camera to estimate the vehicle motion.The choice of a monocular camera is motivated by its advantages in terms of affordability, compactness, and versatility over other visual sensors [62].While traditional techniques based on feature-matching from monocular images can estimate motion up to an unknown scale factor [63,64], we leverage recent research on visual odometry to develop a more accurate approach.Our pipeline incorporates neural networks to both detect features and estimate their associated depth from camera images, which allows us to obtain (12) scale-consistent and robust visual odometry.This strategy, which combines monocular depth estimates with keypoints detected on the image, has been previously explored in [65,66] and has been shown to outperform both geometry-based and learning-based methods.Throughout the paper, we refer to this pipeline as Monocular-Depth aided Visual Odometry (MD-VO).MD-VO relies on keypoint detection and depth estimation to construct the observation model.
Keypoint detection In the first step, we detect a set of keypoints k t = {k 1 t , . . ., k M t } and descriptors d t = {d 1 t , . . ., d M t } from the input image I t where M is the total num- ber of detected features.Keypoints are the distinctive local regions in the image, while descriptors are vectors that represent these regions with invariance to motion and lighting deformations.However, real-world camera images are often subject to significant variations over time, leading to spurious feature matching and erroneous motion estimation [2].To mitigate the effects of spurious feature-matching, we use the neural network-based SuperPoint detector [67], which has been optimized using real-world images.This detector is designed to identify keypoints and descriptors that are robust to real-world deformations and suitable for multiple-view geometry problems.By utilizing these features, we improve our feature extraction step and improve positioning performance in urban environments.
Next, we match the descriptor set d t with the descriptor sets {d t−1 , . . ., d t−T SP } from the previous T SP = 10 frames.We rank the matches based on the sum of closest matching scores η i SP calculated as: We apply a threshold τ SP to the scores η i SP to select the 50 best-performing keypoints for subsequent motion estimation.This strategy ensures that the selected features persist in multiple frames, leading to more robust feature detection.To further enhance the robustness of subsequent motion estimation against outliers, we employ Random Sample Consensus (RANSAC) algorithm [68] to identify and remove large outliers.
For the remainder of this section, we will use the notation k t to refer to the M ′ best keypoints identified using the aforementioned method.Additionally, we will use k t−1 to denote the matching keypoints (i.e., the minimizers of η i SP (t − 1) ) from the previ- ous frame.
Depth Estimation In the depth estimation step, we estimate the 3D depth I D t−1 (k) associated with every pixel k in frame I t−1 from the previous time t − 1 .Projecting 2D image coordinates to 3D is an ill-posed and inherently ambiguous task.However, previous research has shown that this task can be effectively accomplished by leveraging patterns in the appearance of objects [69,70].(15) In this work, we utilize pre-trained models from MonoVO [70] to estimate the depth.To reduce computation requirements, we initiate depth calculation from the previous image acquisition timestep, as we only require the 3D locations at the previous time.
Next, we estimate the 3D coordinates kt−1 that correspond to the 2D keypoints k t−1 as where K cam is the matrix containing camera intrinsic parameters.Using these estimated 3D coordinates, we construct observation models based on the current frame.
To construct the observation model for the camera module, we consider the sensor measurement m t based on the current frame 2D keypoints k t and an estimated body frame velocity ṽt .The measurement vector m t is given as here k t is rearranged in a column-major fashion to obtain a single vector, which makes it easier to use within the filter.We obtain the velocity ṽt from the matched 3D-2D key- points using perspective-n-point algorithms [71].We enforce ṽt to be nonzero only along the axis that aligns with the forward direction of the vehicle's motion.Our empirical evaluations indicate that adding this velocity measurement to the overall measurement vector helps regularize the filter's estimate of velocity and improves the overall stability of the filter.Based on the estimated 3D coordinates kt−1 of the previous frame keypoints k t−1 , we model the measurement m t as a function of both the vehicle's state x t and an estimate of the keyframe's position pt−1 and orientation qt−1 where R(q t ) and R( qt−1 ) are rotation matrices corresponding to the orientations q t and qt−1 , respectively, and ζ is the scaling parameter.The measurement noise matrix R t cor- responding to m t is set as a diagonal matrix with identical values for each keypoint.In addition, we assign small values to the diagonal terms corresponding to the zero-valued entries in ṽt .This is because the vehicle can be assumed to have negligible or zero motion along non-forward directions with a high level of certainty.By assigning smaller values to these terms, the filter is incentivized to model the vehicle's motion primarily along the forward direction.
We estimate the keyframe's position pt−1 and orientation qt−1 separately from the vehicle's state x t by running a concurrent EKF with identical dynamics and sensor mod- ules, except for a modification to the camera module.In the camera module, we set the (17 observation model h t (x t ) to the expected body frame velocity obtained from Eq. 22, and measurement m t to the velocity ṽt derived through 3D-2D matching.Based on our experi- ments, using a separate filter to estimate the keyframe location provides more stable estimates compared to using estimates based on the same filter.

GNSS module
Our GNSS pipeline utilizes pseudorange measurements ρ t = {ρ 1 t , . . ., ρ M t t } from mul- tiple constellations (GPS and Beidou) and a nearby base station, where M t is the total number of visible GNSS satellites at time t.These measurements are obtained by calculating the time delay between the transmission of a signal from a satellite and its reception at the receiver onboard the vehicle.The kth pseudorange measurement ρ k t is related to the distance between the receiver and the satellite by the following equation [72]: where p k t denotes the position of the satellite corresponding to the kth measurement at time t, b const denotes the bias error in the receiver's clock that also depends on the GNSS constellation, b k denotes the bias error in the satellite's clock, ǫ ion , ǫ trop denote the errors due to ionospheric and tropospheric effects, ǫ mp denotes the error due to multipath effects and ǫ denotes the random error.We use multiple GNSS constellations to com- pensate for limited satellite visibility in urban environments.
To utilize the pseudorange measurements ρ t for positioning, it is necessary to account for the various error sources specified in Eq. 23.A common approach is to employ double-differenced (DD) measurements [73].To construct DD measurements, measurements from the base station ρt are first subtracted from the received measurements ρ t , thereby canceling out the effects of ionospheric and tropospheric disturbances, as well as satellite clock bias.To account for the receiver clock bias, a reference satellite (index denoted as ref ) is selected and its measurements are subtracted from all other measure- ments.For the kth satellite, the DD measurement ∇�ρ k t is expressed as follows: The above approach successfully removes most error sources based on clock differences between the satellite and the receiver.However, DD measurements that form differences between measurements from different constellations still have a remaining receiver clock bias error due to lack of synchronization between the constellations [72].This error-known as the Inter-System Bias (ISB)-does not change significantly over long periods of time.Therefore, we pre-estimate this error from initial measurement residuals and ground truth data.The overall measurement vector m t for the GNSS module is constructed using DD measurements as (23) The corresponding sensor observation model h t (x t ) is as follows: where the double differences are constructed in a similar way to the DD measurements in Eq. 25.
We set the measurement noise matrix R t corresponding to m t as a diagonal matrix with identical entries for all measurements for simplicity, which is common practice in filtering literature.However, it is worth noting that alternative methods of assigning covariance exist that take into account additional properties such as signal strength and satellite elevation [74].Investigating the impact of such techniques on filter performance could be a direction for future work.

AHRS module
In the AHRS module, we use the orientation measurements o t to update the orientation quaternion q t tracked by the filter.The sensor observation model h t (x t ) is the estimated quaternion q t .The measurement m t is set as the quaternion q(o t ) corresponding to the measured orientation o t , where q(•) is the Euler-to-quaternion transformation function.The measurement noise covariance matrix R t is calculated using the Jacobian J q t of the Euler-to-quaternion transformation and a 3 × 3 diagonal covariance matrix Rt for Euler angles, similar to Eq. 14 To ensure numerical stability, we set the off-diagonal terms in R t to zero and only keep the diagonal terms.

Estimating position error bounds
To ensure the safety of our proposed filtering method for location estimation, we use the tracked probability distribution p(x t |m 1:t ) to estimate probabilistic bounds on the estimation error.These error bounds can then be compared against carefully designed safety limits, such as the ones described in [38], to detect situations when the filter's location output is unsafe for navigation.Incorporating these checks adds a layer of protection to the localization system against uncorrectable estimation errors, improving the system's reliability.
To estimate the position error bounds, we first utilize the vehicle's estimated orientation q t to determine the transformation from the global coordinate frame to the coordi- nate frame aligned with the lateral (side-to-side), longitudinal (forward-backward), and vertical directions of the vehicle's motion.This transformation allows us to project the probability distribution p(x t |m 1:t ) onto the motion axes of the vehicle, enabling us to estimate the error along directions that are more relevant from a safety perspective.
The position part of the state probability distribution is represented as a weighted sum of Gaussian distributions, where the ith component's mean and covariance are denoted by p t,i and p t,i , respectively.To estimate the probability distribution corresponding to the axes-aligned error vector e t , we project each Gaussian distribution component along the axes specified by q t .(27) To compute the mean µ e t,i parameter of the ith projected Gaussian distribution component, we apply the rotation matrix R(q t ) to compute the difference between p t,i and the filter position estimate p t To estimate the covariance parameter e t,i , we also need to account for the uncertainty in estimating the orientation.We first generate M samples of orientation {q 1 , . . ., q M } as where q t denotes the filter orientation covariance.We then project the covariance p t,i by applying the rotation matrix R(q j ) and use Monte Carlo integration to estimate e t,i : To obtain a probabilistic error bound for a given direction d and confidence level α , we seek to compute the quantile τ α d from the error distribution p(e t ) that satisfies the fol- lowing equation: where 1 d is a 3-dimensional unit vector with the entry corresponding to direction d set to one and the rest set to zero.However, as the error distribution is represented as a mixture of Gaussian distributions, we cannot compute this quantile analytically.Instead, we over-approximate the bound by maximizing across the component-wise quantiles.The position error bound τ α d for direction d is estimated as where τ α/2 d,i denotes the quantile of the Gaussian distribution N (|µ e t,i 1 d |, 1 ⊤ i � e t,i 1 d ) for confidence level α/2 .The quantile τ α/2 d,i is computed using the absolute value of the mean to provide a one-sided bound on the position error, which is the maximum possible deviation in the positive or negative direction along the axis.Similarly, we use the confidence level of α/2 for these bounds since it includes both sides of the probability distribution.
The position error bounds estimated from our approach are designed to account for the uncertainty resulting from the various sources of noise and linear approximations inherent in the filtering process.In instances where the noise sources are unbiased and the approximations hold true, the quantiles derived from the different components would exhibit similar behavior, resulting in a performance that is similar to an EKF.However, in cases where these assumptions do not hold, our methodology enables us to capture and quantify the uncertainty present across the different components, thereby offering an advantage over the Gaussian approximations utilized in the EKF.(29) µ e t,i = R(q t )(p t,i − p t ). (30)

Optimizing filter parameters
Designing effective filtering methods involves selecting appropriate values for all the filter parameters.For our filter, these parameters include standard deviation terms for the dynamics, sensor observation models and the scaling factor in the camera module.However, choosing suitable values for these parameters can be challenging and timeconsuming due to the filter's complexity.
In settings with available prior ground truth data and measurements, we can address this challenge by leveraging ideas from recent research on differentiable filter design [23,44].We present a gradient-based optimization strategy that uses available measurements and ground truth position data to quickly find values for the filter parameters that result in good filter performance.Specifically, we minimize a total loss function that comprises measurement loss terms and position loss terms.
We employ a window-based strategy for optimization, wherein we randomly select windows of length T = 5 s from the available data.This approach allows us to incre- mentally tune the filter parameters on subsets of data, which is more computationally efficient than optimizing over the entire dataset.At the beginning of each window, we initialize the filter by sampling from a Gaussian distribution centered at the ground truth position with a standard deviation of 5 ms, which helps to regularize the optimization and to prevent overfitting to specific data instances.
The measurement loss L m (θ) with respect to the current parameters θ is the negative log probability of observing the measurements m 1:t from the filter given the initial state.
where T m denotes the total measurement instances available within the window, and x t is estimated using the filter.
The terms in this loss are available at the acquisition rate of the sensor measurements.This loss is similar to the unsupervised optimization objectives used for tuning Bayesian filter parameters in existing research [75].
The position loss L p (θ) is the mean squared error between the filter estimated posi- tions p 1:T and the corresponding ground truth positions p * 1:T , which are usually available at a much slower rate than the sensor measurements.The position loss L p (θ) is expressed as where T p denotes the total ground truth instances available within the window.This loss provides a stronger signal for tuning the filter parameters where explicit supervision is available [76].The total loss L(θ) is expressed as the weighted sum of the position loss and each of the measurement losses (34) where L cam , L gnss , L ahrs are calculated according to Eq. 35.The parameters cam , gnss , ahrs denote the weighting factors for the losses and are set to 0.2 in our case.We optimize L(θ) using gradient descent to update the parameters θ .To ensure numerical stability and prevent non-positive values during optimization, we use the softplus function-which is a smooth approximation of the rectified linear unit (ReLU) function-to enforce positivity for all parameters.
We note that the overall optimization problem consisting of several parameters is complex, and finding the global minimizer is not necessarily possible using this strategy since it depends on the initialization.Nevertheless, we find this strategy useful in refining the parameters starting from heuristically set initial values and reducing the overall tuning effort considerably.

UrbanNav dataset
We evaluate our proposed filter on the Hong Kong UrbanNav dataset [45] which includes measurements from diverse sensors like GNSS, LiDAR, camera, and IMU, gathered in an urban environment with distinct regions.These regions comprise a wide street, one-sided buildings, and medium-height buildings, creating non-line-of-sight and multipath errors in GNSS measurements.The environment also contains dynamic objects that contribute to errors in the camera module as shown in the dataset images in Fig. 3b.The dataset spans a total duration of 785 s, covering a path length of 3.64 (37)  km with two loops of the same trajectory.The second loop was used to tune the filter's parameters, while the first loop was used to evaluate its performance.
In our evaluation, we utilize GNSS (GPS and Beidou), a monocular camera, and AHRS measurements from the UrbanNav dataset.The GNSS measurements were captured at 1 Hz using a commercial u-blox ZED-F9P receiver, with GPS and Beidou visibility ranging from 4 to 20 satellites at each time instant.AHRS data was obtained from an Xsens Mti 10, collected at 400 Hz.Images were captured at 15 Hz from a ZED2 camera with a resolution of 1920 × 1080 .Ground truth was obtained from a post-processed RTK GNSS- INS integrated system, available at 1 Hz.In our experiments, the filter and the baselines were executed at 4 Hz.We focus on positioning performance and its uncertainty in this work.

Baselines
We compare the performance of our algorithm with respect to five filtering baselines, namely: a) Robust Naive RBPF (R-RBPF), b) Robust EKF with tight VO integration (R-EKF), c) Robust UKF with tight VO integration (R-UKF), d) Robust EKF with MD-VO only (R-EKF-VO), and e) Robust EKF with GNSS only (R-EKF-GNSS).Some notes about these baselines are as follows: • R-RBPF partitions the state space into position terms and tracks them using a standard particle filter with 20 particles to match our approach.The particle weights are computed using the robust version of the measurement likelihood function, as described in Eq 4.An EKF tracks the remaining terms, including orientation, velocity, and angular velocity bias.The position error bounds are determined by calculating the maximum deviation across the tracked particles along the lateral, longitudinal, and vertical directions.• R-EKF and R-UKF use the robust cost function in their update step as described in Sect. 2. During the execution of R-UKF, it was observed that the covariance matrix becomes non-positive definite at certain time instances due to numerical errors associated with the UKF, a well-known challenge [77] attributed to the choice of filter parameters.To address this issue, we reinitialized the covariance matrix and continued executing the filter from the last estimated state.• We use R-EKF-VO and R-EKF-GNSS as simple baselines to gain insights into the filter performance when only a single sensor is used.
We evaluate the filter's performance on the entire dataset.However, for ease of visualization, we present the results categorized into three distinct regions, as illustrated in Fig. 4.This approach enables a comprehensive evaluation of the filter performance within each typical category of regions present in urban environments.

Experimental setup
The filter and all baselines were tuned and executed on the Stanford Research and Computing Center's HPC cluster using an AMD 7502P CPU with 256 GB RAM and an NVIDIA Geforce RTX 2080Ti GPU.PyTorch is used for automatic differentiation and tensor operations.The implementation of the filter and baselines is built upon the multimodal sensor fusion codebase from [23].
To tune the filter parameters, we use R-EKF as a reference due to its fast execution time.We employ the same set of parameters for all filters to ensure a fair comparison.The parameters are tuned individually while the remaining parameters are held fixed.We optimize the parameters using the Adam optimizer [78] with a learning rate of 0.01 for 100,000 iterations.The final parameter values, along with the key manually set parameters, are listed in Table 1.
The Tukey biweight function [79] is used as the cost function in the filters.However, if the input residuals exceed a set threshold, this function becomes constant, causing the influence function to become non-invertible.To address this issue, we approximate the inverse by assigning a high value to the entries associated with residuals above the threshold.These threshold values are manually set and remain unchanged during the optimization of filter parameters using gradient descent.

Positioning performance
We first compare the positioning performance of our approach with respect to the baselines.Figure 4 shows the trajectory tracking plots of each approach and Fig. 5 shows the positioning performance with respect to time for qualitative analysis.The plots demonstrate that our proposed method produces position estimates with higher accuracy than the baselines.Specifically, in the region characterized by medium-height buildings, our approach exhibits clear improvement compared to R-EKF.Similarly, our approach outperforms R-UKF in a few regions where numerical errors occurred during covariance estimation, necessitating reinitialization.Notably, both R-EKF and R-UKF demonstrate smoother position estimates than our approach.Despite using the same number of particles as our approach, R-RBPF performs substantially worse.This observation is consistent with the widely recognized fact that particle filter methods alone require a large number of particles-that scales exponentially with the size of the state space-for good estimation performance [80].However, for real-world applications, the computational costs associated with increasing the number of particles may become prohibitively high.We also observe that, in general, baselines with multi-sensor measurements provide better positioning performance compared to those relying on single-sensor measurements.
To quantitatively evaluate the performance of our approach, we computed the mean, median, and 95th percentile of positioning error for all three sections of the trajectory.The results, presented in Table 2, demonstrate that our method produces positioning errors comparable to those of R-EKF and R-EKF-GNSS in the wide street and one-sided building regions.Additionally, our method exhibits smaller errors than all other baselines in these regions.In the medium urban region, our approach achieves smaller errors than all baselines except R-UKF.These results highlight the usability of our approach in real-world sensor fusion settings, where accurate state estimation is important.
Fig. 4 Trajectory tracking plots of each approach (blue) compared to the ground truth positions (dashed red).Purple lines depict the boundaries of wide street, one-sided buildings, and medium urban regions shown in Fig. 3a.The trajectory estimated from our approach is visually closer to the ground truth trajectory in a majority of the regions compared to the baselines Fig. 5 Estimated positions over time along the east, north, and up directions compared to the ground truth positions over the entire trajectory of the dataset.The estimated positions from our approach show improved tracking performance and closely follow the ground truth positions compared to the baselines

Position error bound evaluation
For qualitative evaluation of the position error bounding performance, we show plots that depict the estimated error bounds versus position error for each approach along the lateral, longitudinal, and vertical directions, as shown in Figs. 6, 7 and 8.The position error bounds lying in the upper half of the plot-separated by the dotted red line-successfully enclose the position error.
Our approach generates position error bounds that are more clustered in the upper half of the plots compared to the baselines in all regions, except for R-UKF in the medium urban region.For clarity purposes, we compare our approach qualitatively against R-EKF and R-UKF only, while other baselines are included in the quantitative comparisons.
For quantitative comparison, we employ the failure rate metric to evaluate the effectiveness of each method in bounding the position error.The failure rate metric specifies the fraction of instances where the estimated error bound fails to capture the actual position error, where a smaller value indicates better performance.The results are included in Tables 3, 4, and 5 for the failure rates along the lateral, longitudinal, and vertical directions, respectively.
Based on the presented results, our proposed approach outperforms all baseline filters, except for R-UKF in the medium urban region and R-EKF-GNSS in all three regions.R-EKF-GNSS shows promising performance, despite its comparatively lower positioning accuracy, due to its sole reliance on GNSS measurements.The EKF's Gaussian uncertainty modeling assumptions, which are used in R-EKF-GNSS, hold reasonably true for GNSS measurements.Similarly, R-UKF demonstrates good bounding performance in the medium urban region where it also has good positioning performance.However, its performance deteriorates in the one-sided building and wide street regions, where it exhibits worse positioning accuracy.
The results indicate that both R-UKF and R-EKF exhibit a higher failure rate along the vertical direction, most likely due to overconfident uncertainty estimates.In contrast, our approach uses the uncertainty across multiple particles to generate more conservative error bounds in the vertical direction with considerably lower failure rates.These observations emphasize the significance of accurate uncertainty parameters in characterizing the position error bounds, particularly in challenging urban environments.
Overall, the qualitative and quantitative evaluations demonstrate that our proposed approach produces more reliable error bounds compared to the baselines.Our approach's improved position error bounding can be attributed to its capability to capture multi-modal uncertainty through multiple linearization points, which effectively captures uncertainty in situations where multiple plausible solutions exist based on the measurements.This reduces the proposed method's reliance on the choice of the uncertainty parameters, highlighting the potential of our approach for real-world applications in developing robust and reliable systems for autonomous navigation.

Computational statistics
We analyzed the average runtime per step for our proposed approach and the baselines, tabulated in Table 6.Our method and R-RBPF-both using 20 particles-exhibit an average step time that is less than twice that of the EKF and UKF-based baselines.This efficiency is attributed to the utilization of tensor operations to parallelize the EKF operations associated with the linearization points.From a theoretical standpoint, this aligns with the expected computational complexity of O(nm 3 + nm 2 + n) -contributed by the robust filtering, weighting, and resampling steps-where n is the number of particles and m is the size of the measurement vector.It is worth noting that upon assuming optimal parallelization across all particles, the complexity reduces to O(m 3 + n) .This is comparable to the EKF's theoretical complexity of O(m 3 ).
We also investigate the effect of the number of landmark features used in the measurement vector and the number of particles on the computational time, as shown in Fig. 9. Our findings indicate that the computational time stays under 100 ms till 200 features in the measurement vector and then increases at a sublinear rate till 1000 features.Furthermore, the computational time grows linearly with the particles at a slope smaller than 1, demonstrating the computational benefits of efficient tensor operations in our approach.These observations show that our approach can be employed in real-world applications without imposing a significant computational burden.

Discussion
Our experiments on positioning performance demonstrate that the proposed robust Bayesian filtering framework for positioning produces state estimates that are    competitive with the robust EKF and UKF baselines.Moreover, this performance can be achieved with minimal manual tuning through the proposed gradient descent-based optimization strategy.Additionally, our approach outperforms particle filter baselines with the same number of particles when applied to a real-world setting.However, the performance of our method is dependent on the underlying particle filter for tracking linearization points, which results in position estimates that are less smooth compared to the EKF and UKF-based methods.Therefore, it may be beneficial to explore improvements in particle filtering literature to further enhance the performance of our method.Incorporating improved methods for robust state estimation, such as factor graphs, in the framework is another direction for future research to enhance the overall performance.
Our position error bounding experiments indicate that the error bounds estimated using our approach are better correlated with the actual positioning error along each direction than the bounds from most baselines.It is important to note that a limitation of our approach is the difficulty in establishing theoretical guarantees on its performance  due to the need for knowing the individual error distributions of each measurement at each time instant.Nevertheless, our empirical results suggest that our approach can identify situations where the positioning error grows large, making it useful in preventing the unsafe use of the estimated location.Hence, our approach has practical utility in safety-critical applications.Finally, our experiments on computational statistics illustrate that our proposed framework has comparable computational requirements to the underlying robust state estimation method, which is achieved using parallelization with tensor operations.Therefore, our approach can be integrated into real-time localization systems without imposing excessive computational overhead.Future work can further explore adaptively varying the number of tracked linearization points in our framework to improve computational efficiency.

Conclusions
In this paper, we introduced a robust Bayesian filtering framework that effectively captures multi-modal uncertainty in positioning using diverse sensor measurements while being robust to outliers.Our framework consists of two key components: the robust filter component, which uses techniques such as Extended Kalman Filters for efficient and robust state estimation, and the multi-modal uncertainty component, which tracks a probability distribution over points for linearizing the dynamics, measurement models, and robust cost in the robust filter component.We combined these two components using the Rao-Blackwell theorem to create a robust Bayesian filter that is resilient to measurement outliers and can capture multi-modal state uncertainty.
We validated our proposed filter on real-world data from a multi-sensor setup comprising a camera, GNSS, and AHRS.To tune the filter parameters, we utilized a gradient descent-based optimization strategy that leverages available measurements and ground truth position data.The results demonstrate our filter's competitive state estimation performance compared to existing filter-based robust state estimation methods and improved performance in bounding the position errors based on uncertainty.Furthermore, these improvements are achieved with less than twice the computational time of existing Kalman filter-based methods.
These results suggest that our proposed framework is suitable for real-world localization applications where the reliability of the estimated location is critical.Our approach enables more robust and accurate localization in complex environments, thus proving to be a valuable tool for autonomous navigation systems in urban environments.

Fig. 1
Fig.1Architecture of the proposed robust Bayesian filter.The filter takes a prior probability distribution of the linearization points and state and uses a system transition model and measurement model to estimate the posterior probability distribution of the state.The linearization point probability distribution is represented as weighted samples, and the state probability distribution is represented by a Gaussian distribution tracked by the underlying robust Kalman filter.The filter first applies the prediction step to each linearization point in the prior distribution.Using this predicted point for linearizing the robust filter system transition, measurement and cost models, we then apply the predict and update steps in parallel to update the state probability distribution for each linearization point.Based on the updated state probability distribution, the linearization point samples are refined and their weights are updated using the available measurements.Finally, the weighted linearization point samples are resampled and combined with the updated state probability distribution to represent the posterior state probability distribution

Fig. 3
Fig. 3 Description of the UrbanNav Hong Kong dataset and sample images from the dataset collected in diverse environments

Fig. 6
Fig.6 Comparison of position error bounds and position error along the lateral direction for the wide street, one-sided building, and medium urban regions.The dashed red line separates the upper region where the bounds exceed position error and the lower region where the bounds are smaller.Our proposed approach demonstrates improved bounding performance as indicated by the majority of error bounds lying in the upper half

Fig. 7 Fig. 8
Fig. 7 Comparison of position error bounds and position error along the longitudinal direction for the wide street, one-sided building, and medium urban regions.The dashed red line separates the upper region where the bounds exceed position error and the lower region where the bounds are smaller.Our proposed approach demonstrates improved bounding performance as indicated by the majority of error bounds lying in the upper half

Fig. 9
Fig.9 Runtime in milliseconds of our algorithm as a function of the number of features in the measurement vector and as a function of the number of particles in the filter.The blue vertical lines denote standard deviation error bars.Our approach leverages parallelization through tensor operations to achieve low computational overhead and efficiency comparable to an EKF, as shown by the figures

Table 1
Experimental parameters

Table 2
Positioning error statistics

Table 3
Lateral failure rate for different sections of the trajectory

Table 4
Longitudinal failure rate for different sections of the trajectory

Table 5
Vertical failure rate for different sections of the trajectory

Table 6
Average runtime per filter step for baselines and our algorithm