Large-scale monocular FastSLAM2.0 acceleration on an embedded heterogeneous architecture
 Mohamed Abouzahir^{1},
 Abdelhafid Elouardi^{1},
 Samir Bouaziz^{1},
 Rachid Latif^{2} and
 Abdelouahed Tajer^{3}
https://doi.org/10.1186/s1363401603863
© The Author(s) 2016
Received: 2 July 2015
Accepted: 29 July 2016
Published: 17 August 2016
Abstract
Simultaneous localization and mapping (SLAM) is widely used in many robotic applications and autonomous navigation. This paper presents a study of FastSLAM2.0 computational complexity based on a monocular vision system. The algorithm is intended to operate with many particles in a large-scale environment. FastSLAM2.0 was partitioned into functional blocks allowing a hardware software matching on a CPU-GPGPU-based SoC architecture. Performances in terms of processing time and localization accuracy were evaluated using a real indoor dataset. Results demonstrate that an optimized and efficient CPU-GPGPU partitioning yields accurate localization and high-speed execution of a monocular FastSLAM2.0-based embedded system operating under real-time constraints.
Keywords
Monocular FastSLAM2.0, CPU, GPGPU, Heterogeneous architecture, Hardware software matching
1 Introduction
Simultaneous localization and mapping (SLAM) algorithms are computationally intensive. Embedded systems therefore need an architecture that supports software optimization for an efficient and scalable implementation. Early computer systems contained a single kind of processor designed to run general computing tasks. Performance improvement of such computers relied on Moore’s law, which predicts a doubling of transistor density every 18 months. However, this trend has reached a certain maturity. It is no longer possible to gain performance simply by increasing transistor density, because adding more transistors also adds complexity, heat, and memory issues. To overcome these issues and reach high performance, the new trend is to include other processing elements on a single chip. These systems gain performance not only by adding more processing units of the same type but also by integrating dissimilar processors with specialized capabilities dedicated to specific tasks. They are referred to as heterogeneous system architectures (HSAs). Such systems allow the development of applications that seamlessly integrate CPUs with the most prevalent processing elements: GPUs. Heterogeneous architectures can be exploited to accelerate SLAM algorithms and make them operate under real-time constraints.
This article presents an algorithm architecture matching of the FastSLAM2.0 algorithm on a heterogeneous architecture integrating a CPU and a GPGPU. Only a few works deal with the implementation of FastSLAM2.0 on such embedded systems. The authors of [1] presented a hardware software codesign approach for the importance weight calculation and particle update on a NIOS II processor. Moyers et al. [2] presented a fixed-point version of the FastSLAM2.0 algorithm and described its implementation on a configurable and extensible very long instruction word (VLIW) processor. Chau et al. [3] presented a heterogeneous implementation of adaptive particle filters based on a field-programmable gate array (FPGA) and a CPU for mobile robot localization and people tracking applications. To date, no full implementation of FastSLAM2.0 on an embedded CPU-GPGPU has been published. In the state of the art, we found that only the particle filter, as a standalone filtering algorithm, has recently been studied. Tosun et al. [4] presented a parallelization of particle filter-based localization and mapping on a multicore architecture. Maskell et al. [5] presented a parallel resampling on an FPGA, and [6] presented a GPU implementation of a general particle filter algorithm.
The work presented in [7] proposed a method to accelerate robot localization and mapping with the FastSLAM1.0 algorithm. The authors exploit general-purpose parallel computing on NVIDIA GPUs. However, this work accelerates only one task of the entire FastSLAM1.0 algorithm on the GPU: the particle weight computation, which is part of the resampling step. The other tasks of FastSLAM1.0 are executed on the CPU. They also optimized memory access using textures. Furthermore, they evaluated their implementation on a desktop machine using a high-end GPU (NVIDIA GeForce GTX 660) and a CPU (Intel Core i5-3570K). Their tests showed significant performance improvements compared with the naive CPU implementation.

The algorithm implemented in our work differs slightly from the literature. Eade and Drummond [8] used only a single camera; the particle poses are predicted from images via visual odometry using a specific camera motion model. They tested the algorithm on a short sequence with few images for small environments. They stated that the implemented algorithm is not yet ready for use in large-scale environments and that significant challenges must still be addressed. In our work, we used odometry to predict the particle poses and a single monocular camera for the vision task. Some improvements are adopted in order to test the algorithm in large-scale environments. More details are given in Section 2.

Our work proposes a full parallel implementation of FastSLAM2.0 on a GPU. Such an implementation has not been reported in any previous work. All main steps of the algorithm (functional blocks) have been accelerated on a GPU using the GPGPU concept, whereas Zhang and Martin [7] accelerated only the particle weight calculation. Mapping the entire algorithm on a GPU was a challenge.

Zhang and Martin [7] used a high-end NVIDIA GPU in a desktop machine. Our work investigated a parallel GPGPU implementation of a computationally heavy algorithm (monocular FastSLAM2.0) on an embedded architecture. This raises a new challenge: to investigate whether the optimization strategies used for high-end GPUs are suitable for designing embedded SLAM applications. This challenge stems from the constraints of the system on chip used (NVIDIA Tegra K1 SoC), such as the CPU and GPU sharing the same physical memory. This required special attention and imposed new demands on our work.
This paper is organized as follows. Section 2 presents a state of the art of SLAM algorithms, with a description of the image processing used for feature extraction and matching, and then presents the monocular FastSLAM2.0 algorithm. Section 3 presents the methodology adopted in this work to evaluate the algorithm on a heterogeneous embedded architecture. Section 4 describes the target embedded architecture, with a brief background on the hardware and tools used for embedded GPGPU programming, followed by the GPGPU implementation, the adopted partitioning model, and the optimizations performed. Section 5 is devoted to experimental results and provides a detailed discussion of the algorithm complexity and a performance comparison. Section 6 summarizes the results and concludes this work.
2 Localization and mapping
SLAM algorithms allow autonomous navigation of robots in unknown environments. Localization and mapping represent a concurrent problem that cannot be solved independently. Indeed, if a mobile robot follows an unknown trajectory in an unknown environment, the estimation of the robot’s pose and the explored map becomes more complicated. In such situation, no information is previously known by the mobile robot which is supposed to create a map and to localize itself according to this map. Before the robot can estimate the position of a given landmark, it needs to know from which location this landmark was observed. At the same time, it is difficult to estimate the actual position of the robot without a map. A good map is necessary for localization while an accurate pose estimate is needed for map reconstruction.
A SLAM algorithm relies on sensor data to concurrently estimate both map and robot pose. Two sensors are usually used: proprioceptive and exteroceptive sensors. Throughout this work, we used a monocular camera as an exteroceptive sensor to observe environment and odometers as proprioceptive sensors to estimate robot pose.
2.1 Image processing
2.2 FastSLAM2.0
FastSLAM2.0 [10] is based on the particle filter (Algorithm 3). Uncertainty of the robot pose is modeled by a number of different particles. Each particle has its own map. It has been proved that FastSLAM2.0 runs successfully in a very large environment and can surpass many problems that decrease consistency of localization and mapping. The main steps of the algorithm are described below:
2.2.1 Prediction
The prediction task also predicts the uncertainty of the robot pose. We recall that, as stated in [10], FastSLAM2.0 differs from the previous version in how the particle poses are sampled. As we will see in Section 2.2.2, the particle poses are updated and sampled from a modified proposal distribution constructed incrementally using the last observed landmarks. The problem is that this proposal distribution is constructed starting from an estimate of the robot uncertainty, \(P_{m}\). A random initialization of this uncertainty can cause the filter to diverge.
Eade and Drummond [8] and Montemerlo et al. [10] did not give any information about how this initial covariance matrix is computed. Initializing \(P_{m}\) with an empirical value is not suitable when using a camera and a partial initialization method, since an accurate estimate of uncertainty is needed. This is shown in our previous work [12]: “Using small number of particles with \(P_{m}\) randomly initialized, the algorithm can only map a small portion of the trajectory and is not able to close the loop.” In this work, we computed \(P_{m}\) incrementally whenever a new particle pose is predicted, so that uncertainty accumulates between images. This better reflects the uncertainty in the robot pose and gives a good partial initialization. \(P_{m}\) must be initialized after each image acquisition.
Our work targets an embedded platform. The method we suggested is time-consuming when implemented naively on a single embedded core, since \(P_{m}\) must be computed for every particle and at every odometry data acquisition. An efficient optimization will be proposed in Section 4. Algorithm 1 describes the procedure for computing the particle poses and their initial covariance matrix \(P_{m}\) for one odometric data acquisition.
The density of particles must represent the real trajectory well (after integrating odometric data, the particle density must cover the real robot pose). Therefore, in the prediction task, the motion model must be applied with randomization to reflect the system random error and the sensor noise (Monte Carlo localization) [13]. In our implementation, the prediction task is parallelized on the GPU. However, random number generation is difficult to implement in parallel on a GPU. According to [14], a random number generator should first perform well on a single processor; providing high-precision random numbers becomes more difficult on parallel architectures. Some studies [15, 16] proposed well-known techniques to spread random streams through a parallel implementation using different strategies, but they are not yet suitable for use in Monte Carlo localization and remain an ongoing research topic [14, 17, 18]. A solution for this problem will be proposed in Section 4.2.2.
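As a rough illustration of the randomized prediction described above, the following Python sketch propagates a set of particles with a simple velocity-based odometry model and adds Gaussian noise to each control input. The motion model, the noise levels, and all names (`predict_particles`, `sigma_v`, `sigma_w`) are assumptions chosen for illustration; this is not the model of Algorithm 1. The noise is drawn in a single batch before the vectorized update, which mirrors the idea of generating random numbers sequentially and consuming them in a data-parallel step.

```python
import numpy as np

def predict_particles(poses, v, w, dt, noise):
    """Propagate particle poses (x, y, phi) with per-particle noisy controls.

    poses: (M, 3) array; noise: (M, 2) array of perturbations for (v, w).
    Hypothetical unicycle-style model, used only for illustration.
    """
    v_n = v + noise[:, 0]          # noisy linear velocity, one per particle
    w_n = w + noise[:, 1]          # noisy angular velocity, one per particle
    x, y, phi = poses[:, 0], poses[:, 1], poses[:, 2]
    new = np.empty_like(poses)
    new[:, 0] = x + v_n * dt * np.cos(phi)
    new[:, 1] = y + v_n * dt * np.sin(phi)
    # Wrap the heading to [-pi, pi)
    new[:, 2] = (phi + w_n * dt + np.pi) % (2.0 * np.pi) - np.pi
    return new

rng = np.random.default_rng(0)
M = 1000                           # number of particles (assumed)
poses = np.zeros((M, 3))
sigma_v, sigma_w = 0.05, 0.01      # assumed noise standard deviations
# Noise generated once, sequentially, then consumed by the vectorized step
noise = rng.normal(0.0, [sigma_v, sigma_w], size=(M, 2))
predicted = predict_particles(poses, v=1.0, w=0.0, dt=0.1, noise=noise)
```

With zero noise every particle follows the same deterministic path; the spread of `predicted` comes entirely from the pre-generated `noise` array.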
2.2.2 Sampling a pose: particle update
Particle diversity is an important factor that determines the estimation accuracy. The algorithm uses a technique to deal with sample impoverishment. A new proposal distribution (set of particles generated after integrating the control data) is computed using the most recent measurement [10]. Then, a new particle pose is sampled; this is illustrated in Algorithm 2. The proposal distribution construction procedure can be parallelized on GPU pipelines, since each operation can be done independently for each particle.
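The incremental construction of the proposal distribution can be sketched as follows, assuming the commonly given FastSLAM2.0 form in which each matched landmark refines a Gaussian over the pose in information form. This is a minimal sketch, not Algorithm 2: the function names, the toy Jacobian `H_s`, and the innovation covariance `Q` are placeholders, and a real system would derive them from the camera observation model.

```python
import numpy as np

def refine_proposal(mu, sigma, H_s, Q, innovation):
    """One incremental update of a Gaussian pose proposal N(mu, sigma).

    H_s: Jacobian of the observation with respect to the pose (k x 3).
    Q: innovation covariance of the matched landmark (k x k).
    innovation: measured minus predicted observation, shape (k,).
    """
    Q_inv = np.linalg.inv(Q)
    sigma_new = np.linalg.inv(H_s.T @ Q_inv @ H_s + np.linalg.inv(sigma))
    mu_new = mu + sigma_new @ H_s.T @ Q_inv @ innovation
    return mu_new, sigma_new

def sample_pose(mu, sigma, rng):
    """Draw the new particle pose from the refined proposal."""
    return rng.multivariate_normal(mu, sigma)

rng = np.random.default_rng(1)
mu = np.array([1.0, 2.0, 0.1])     # predicted pose of one particle
sigma = np.eye(3)                  # predicted pose covariance
# Toy list of matched landmarks: (H_s, Q, innovation), all assumed values
matches = [(np.eye(3), np.eye(3), np.array([0.2, -0.2, 0.0]))]
for H_s, Q, nu in matches:
    mu, sigma = refine_proposal(mu, sigma, H_s, Q, nu)
pose = sample_pose(mu, sigma, rng)
```

Because each particle runs the same loop on its own `mu` and `sigma`, the update has no cross-particle dependency, which is what makes it a candidate for per-particle parallel execution.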
2.2.3 Estimation
2.2.4 Initialization
During mapping, SLAM algorithms need to know initial landmark positions and the covariance matrix. This seems easy with SLAM algorithms based on LASER sensors since the observation model is invertible [13]. However, in the case of a monocular vision algorithm, the landmark initialization is not obvious. A monocular camera is a projective sensor which cannot provide depth of a landmark in a scene. In order to estimate depth of a landmark, and so its position in the scene, the landmark must be tracked in more than one frame. In our implementation, we used the inverse depth initialization method [19].
2.2.5 Resampling
The resampling task deletes very improbable trajectories while maintaining a constant number of particles, and prevents particle depletion. It depends on the weight of each particle computed in the estimation step [20]. In our implementation, the most time-consuming parts of the resampling step are parallelized on GPU pipelines. First, the total weight update task starts before the resampling task and updates the total particle weights based on the likelihoods computed in the estimation phase. The resulting weight is the product of the likelihoods derived from each matched landmark \(\left ({w_{t}^{M}} = {w_{t}^{M}}* \prod _{i=1}^{N_{m}}\omega _{i}\right)\). The weight summation and weight normalization \(\left (w_{s} = \sum _{i=1}^{M} w_{i}, w_{i_{n}} = \frac {w_{i}}{w_{s}}\right)\) are also parallelized on GPU. The computation of the number of effective particles before resampling can also be implemented in parallel \(\left (N_{\text {eff}} = \frac {1}{\sum _{i=1}^{M} {w_{i}^{2}}}\right)\).
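The weight update, normalization, and effective-particle computation above can be sketched in a few lines of Python. The text does not name the resampling scheme or the threshold on \(N_{\text{eff}}\), so the systematic resampling routine and the 0.75 threshold below are assumptions made for illustration, as are all function names and the toy likelihood values.

```python
import numpy as np

def normalize_weights(w):
    """Divide every weight by the sum of weights w_s."""
    w_s = w.sum()
    return w / w_s

def effective_particles(w_norm):
    """N_eff = 1 / sum(w_i^2), computed on normalized weights."""
    return 1.0 / np.sum(w_norm ** 2)

def systematic_resample(w_norm, rng):
    """Return indices of surviving particles (systematic scheme, assumed)."""
    M = len(w_norm)
    positions = (rng.random() + np.arange(M)) / M
    cumulative = np.cumsum(w_norm)
    cumulative[-1] = 1.0           # guard against floating-point round-off
    return np.searchsorted(cumulative, positions)

rng = np.random.default_rng(2)
w_old = np.full(4, 0.25)
# One row per particle, one likelihood per matched landmark (toy values)
likelihoods = np.array([[0.9, 0.8], [0.1, 0.2], [0.5, 0.5], [0.4, 0.1]])
w = w_old * likelihoods.prod(axis=1)
w_norm = normalize_weights(w)
n_eff = effective_particles(w_norm)
if n_eff < 0.75 * len(w):          # assumed resampling threshold
    idx = systematic_resample(w_norm, rng)
```

The product, the sum, and the sum of squares are all element-wise operations followed by a reduction, which is why they map well onto a GPU.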
3 Evaluation methodology
Our evaluation methodology consists of analyzing the algorithm and its dependencies. We identify the processing tasks that require a considerable amount of time by evaluating their processing times. The algorithm is then partitioned into functional blocks (FBs) performing specific tasks. In order to bound the processing time, a threshold is fixed for each parameter. The functional blocks with the largest processing times are then optimized.
Such an evaluation methodology is widely adopted for the implementation of compute-intensive algorithms such as SLAM. In [21, 22], Dine et al. adopted the same evaluation methodology to study the embeddability of graph-based SLAM. In [23], the same methodology is used for an efficient implementation of EKF-based SLAM on a low-power multiprocessor architecture.
3.1 Real dataset-based evaluation
SLAM algorithms are of particular interest for exploring real indoor and outdoor environments. Evaluating SLAM algorithms requires a set of different sensor data for benchmarking. Sensor data can be obtained either from a real instrumented robot or from an available dataset. In our evaluation, we used a real indoor dataset [24]. This dataset provides a set of different sensor data; we used the encoder and monocular camera data.
3.2 Functional block partitioning
3.3 Algorithm dependencies and threshold definition
FB1 depends on the number of integrated odometric data at each iteration. FB2 depends on the number of landmarks in the particle map, the number of extracted features and the size of both images and corner descriptor. FB3 depends on the number of matched landmarks. FB4 depends on the number of matched landmarks to be corrected. FB5 depends on the number of newly observed landmarks to be initialized. FB6 depends on the number of likelihoods computed in FB4 and the number of used particles.

Number of odometric data is unbounded. It is related to encoder frequency used in experiment.

Image size is fixed by dataset used in experiment: 320 × 240 pixels.

Size of descriptor: 16 × 16 pixels.

Maximum number of landmarks for each particle is set to 500. Note that this value is decreased for even larger number of particles to avoid exhausting memory.

Maximum number of extracted features: 60.

Maximum number of matched landmarks: 40.

Maximum number of landmarks being initialized: 40.
3.4 Running times
A GPGPU execution consists of three phases: data transfer to the GPU, kernel execution, and data transfer back from the GPU to the CPU after kernel execution. Note that we deliberately combined the data transfer and GPU execution phases in the time measurement, in order to account for the overhead of GPU-CPU data transfer.
4 Hardware software matching
The signal processing community has always been interested in implementing the algorithms it develops. The evolution of heterogeneous architectures and tools has made it possible to design complex systems that would not even have been considered a few years ago. We have gradually moved from a separate study of algorithms and architectures to a more formalized approach that takes the algorithm and the system architecture into account simultaneously for an efficient matching. Algorithm architecture matching consists of a simultaneous study of the algorithm and the architecture in order to obtain an optimized implementation of the algorithm under different constraints (real time and embeddability).
Throughout this work, we aim to implement the FastSLAM2.0 algorithm on a heterogeneous embedded architecture. This requires a process to map the algorithm on the target architecture. A homogeneous implementation of the FastSLAM2.0 algorithm was previously performed on a low-power embedded architecture [26]. Nevertheless, such an architecture does not have a GPU suitable for general-purpose computing. New solutions have appeared that use GPUs for general-purpose computing, as seen in modern systems implementing heterogeneous architectures (HSA) that allow GPUs and CPUs to be used together. Sequential tasks run on the CPU while computationally intensive tasks are handled by the GPU. In this article, a heterogeneous implementation of the FastSLAM2.0 algorithm is performed using a modern embedded system on chip: the NVIDIA Jetson Tegra K1.
4.1 Hardware description
NVIDIA Tegra K1 specifications
GPU | CPU
192 NVIDIA CUDA cores | Quad-core ARM Cortex-A15
Clock speed: 852 MHz | Clock speed: 2.3 GHz
OpenGL version: 4.4 | OS: Linux for Tegra
4.1.1 Embedded GPGPU programming
Embedded GPU resources can be accessed by a programmer using various programming languages. CUDA provides a C-based language for direct application development on NVIDIA GPUs, which restricts code portability to NVIDIA hardware.
The Tegra K1 is selected to evaluate our implementation. OpenCL is not available on this architecture. Moreover, at the time of writing this paper, OpenCL drivers for other boards with embedded GPUs are not publicly available. Embedded boards that do have software support for OpenCL GPGPU programming contain GPUs with only a small number of cores, which are not really dedicated to high-performance computing. In [27], the authors used OpenCL on a Vivante GC2000 GPU with 4 SIMD cores on the i.MX6 Sabre Lite development board. In [28], Maghazeh et al. used OpenCL on a Mali-T628 MP6 with 8 cores on the ODROID board and on a Mali-T604 MP4 GPU with 4 cores on the Arndale board. Such boards, with less powerful embedded GPUs, may not be a good choice for a computationally intensive algorithm; therefore, OpenCL is not a good choice for our work. As and when boards with a powerful embedded GPU and software drivers supporting OpenCL become available, it will be worthwhile to consider OpenCL as a programming language.
This work adopts OpenGL for development. The majority of current embedded GPUs support OpenGL, which is independent of any specific architecture. Adopting OpenGL for general-purpose computing requires prior experience in graphics programming. However, it provides code portability between different embedded GPUs, which makes the resulting implementation hardware independent and allows a comparative study between different embedded GPU architectures. Recent works adopted OpenGL for GPGPU programming. Hendeby et al. [6] used OpenGL for a general-purpose GPU particle filter. Weinlich et al. [29] and Oliveira et al. [30] presented a comparison between different GPGPU programming languages: OpenCL, OpenGL, and NVIDIA CUDA. They showed that, when the same optimization strategy is adopted for each language, the acceleration differs only slightly among them.
To use OpenGL for GPGPU, a typical GPGPU application program is based on three phases: (i) upload a suitable shader; (ii) allocate appropriate processing units for vertex and fragment shader; (iii) draw a suitable quad to trigger computation and download results.
4.1.2 Unified-shading architecture
Tegra K1 contains a recent GPGPU that adopts a unified shader architecture (Fig. 4). Unified shading architecture hardware is composed of an array of computing units (192 CUDA cores) which are capable of handling any type of shading tasks instead of dedicated vertex and fragment processor as in old GPUs. All computing units have the same characteristics. They can run either a fragment shader or a vertex shader. With a heavy vertex workload, we could allocate most computing units to run a vertex shader. In the case of a low vertex workload and a heavy fragment load, more computing units could be allocated to run fragment shader. In our work, we allocate more processing units to run the fragment shader to perform the desired parallel processing.
4.2 CPU-GPGPU partitioning
The HSA studied in our work allows the GPU and CPU to be used together to improve the global processing time. Using FB partitioning, we propose a distributed implementation of the monocular FastSLAM2.0. Functional blocks that require significant processing time are parallelized on the GPU while sequential blocks are implemented on the CPU. Algorithm 4 describes the CPU-GPGPU partitioning of FastSLAM2.0. The implementation of each FB is described in the following sections.
4.2.1 CPU implementation of image processing task (FB2)
In the image processing task, landmark detection is done using the FAST detector. We used an instance of the algorithm that is already optimized using machine learning [9]. The matching task is performed once using the highest weighted particle [25]. It could be parallelized over the detected observations; however, only a few observations are detected at each time stamp. Therefore, the matching task is well suited to the ARM quad-core CPU (more processing units are not needed). Implementing this task on the GPU would only reduce performance, since the transfer time would be larger than the execution time. Furthermore, the particle maps cannot all be transferred to GPU memory.
4.2.2 GPGPU implementation of prediction task (FB1)
The particle pose \({s_{t}^{m}}=\left (s_{x},s_{y},s_{\phi }\right)\) is transferred to a texture in global GPU memory. Each texel in texture memory holds one particle state. In our implementation, we chose to generate the random numbers for each particle on the CPU and transfer them to a separate texture in GPU memory. This yields accurate particle poses and overcomes the problem discussed in Section 2.2.1. We could employ techniques for parallel random number generation such as those described in [15], but this would be at the expense of localization accuracy. A poor particle distribution greatly affects localization results, especially in large-scale environments.
Many encoder data are acquired at each time stamp (one iteration of the while loop in Algorithm 3). Therefore, the particle poses must be updated for each received encoder datum to reconstruct the trajectory properly. Implementing this process requires multiple rendering passes. A texture is used as the render target to store the output of one prediction operation, and it is then used directly as the input texture for the next operation. Since textures are either read-only or write-only, three textures are needed for FB1. One unchanged read-only texture is used for the encoder data u. Two other textures are attached to a frame buffer object (FBO): a read-only texture \(s_{\text {old}}^{t}\) for the input particle pose and a write-only texture \(s_{\text {new}}^{t}\) to store the predicted pose. The roles of \(s_{\text {new}}^{t}\) and \(s_{\text {old}}^{t}\) are swapped after each pass, since the value in \(s_{\text {old}}^{t}\) is no longer needed once the new values have been computed.
This is illustrated in Algorithm 5. s is a table that holds two ping-pong texture identifiers (\(s_{\text{old}}\), \(s_{\text{new}}\)). The roles of these textures are exchanged within the loop by the swap() function. glDrawBuffer sets the writable texture. The two routines glActiveTexture and glBindTexture set the readable textures. N is the number of odometric data, and drawQuad() is a function that launches the computation by drawing a suitable quad. The initial covariance matrix \(P_{m}\) is computed in the same way using multiple rendering passes.
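The ping-pong scheme can be mimicked on the CPU to make the data flow explicit. The Python sketch below is only an analogy under stated assumptions: two NumPy arrays stand in for the two textures attached to the FBO, `render_pass` stands in for one draw call, and the toy motion update inside it is hypothetical. No OpenGL calls are involved.

```python
import numpy as np

def render_pass(src, dst, u):
    """Stand-in for one draw call: read only from src, write only to dst."""
    dst[:, 0] = src[:, 0] + u[0] * np.cos(src[:, 2])
    dst[:, 1] = src[:, 1] + u[0] * np.sin(src[:, 2])
    dst[:, 2] = src[:, 2] + u[1]

def predict_multipass(initial, odometry):
    """Apply one pass per odometric datum, swapping buffer roles each time."""
    buffers = [initial.copy(), np.empty_like(initial)]
    read, write = 0, 1
    for u in odometry:
        render_pass(buffers[read], buffers[write], u)
        read, write = write, read  # plays the role of swap()
    return buffers[read]           # the buffer written by the last pass

particles = np.zeros((8, 3))       # (x, y, phi) per particle
odometry = [(0.1, 0.0), (0.1, 0.0), (0.1, np.pi / 2)]  # (distance, turn), toy data
final = predict_multipass(particles, odometry)
```

Only two pose buffers are allocated regardless of the number of odometric data, which is the point of swapping roles instead of allocating a new render target per pass.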
4.2.3 GPGPU implementation of particle update task (FB3)

four unchanged read-only textures are used for the matched landmark parameters \((u,v,x_{0},y_{0},\rho,\theta,\varphi,C_{t})\).

eight textures are attached to the color attachments of the FBO, where

four of them are read-only and hold the initial Gaussian \(\left (\mu _{\text {old}}^{m,t-1},\Sigma _{\text {old}}^{m,t-1}\right)\),

four write-only textures contain the result of one render pass of the updated proposal distribution \(\left (\mu _{\text {new}}^{m,t},\Sigma _{\text {new}}^{m,t}\right)\).

This is depicted in Algorithm 6. First, eight ping-pong textures pingpongTexID are attached to the FBO; then we set the four readable textures held in the LdmkTexID array, which contains the parameters of a single landmark. The first loop transfers \(\left (\mu _{\text {old}}^{m,t-1},\Sigma _{\text {old}}^{m,t-1}\right)\) to four textures already attached to the FBO. The second loop is the main loop, which computes the new proposal distribution using multiple render passes for the N matched landmarks. At each iteration of the “For” loop, we transfer the matched landmark parameters to the textures in the LdmkTexID array. The glDrawBuffers routine sets the writable textures where the newly computed Gaussian \(\left (\mu _{\text {new}}^{m,t},\Sigma _{\text {new}}^{m,t}\right)\) will be stored. The following loop sets the readable textures from which the old Gaussian mean and covariance will be read. For the next render pass, the textures where the newly computed Gaussian was written become read-only, whereas the textures that held the old Gaussian \(\left (\mu _{\text {old}}^{m,t-1},\Sigma _{\text {old}}^{m,t-1}\right)\) become write-only. This is done by the swap() function.
It is important to note that we could allocate enough textures to hold all matched landmark parameters, since up to 32 input textures are allowed. This would allow the Gaussian construction to be done in one render pass and would remove the data transfer at each render pass, which would be a significant improvement. However, such an implementation would increase memory requirements: the available on-chip memory would be almost entirely occupied by these textures. Unnecessary memory allocation should be avoided.
4.2.4 GPGPU implementation of Estimation task (FB4)
Each render pass corrects one matched landmark in the map for each particle in parallel. Twelve textures are needed for FB4. Eight read-only textures are used as inputs to the fragment shader to hold the old matched landmark state \((u,v,x_{0},y_{0},\rho,\theta,\varphi,C_{t})\) and the current state of the particles \(\left ({s_{t}^{m}},{P_{t}^{m}}\right)\). Four write-only textures are attached to the FBO to store the updated landmark state for each input particle.
The fragment shader implements the general extended Kalman filter equations to update the landmark state and compute the corresponding likelihood. This is described in Algorithm 7. Each loop iteration corrects one matched landmark in the map for each particle in parallel. If more than one landmark is matched, several iterations are needed to update them. The results (updated landmark states) are transferred from the textures to the CPU.
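For readers unfamiliar with the update performed in each pass, the following Python sketch shows a generic extended Kalman filter correction of one landmark together with the Gaussian likelihood of the innovation. It is a minimal sketch, not Algorithm 7: the function name, the two-dimensional toy state, and the identity Jacobian `H` are assumptions, whereas the real shader works on the inverse depth landmark state and a camera projection model.

```python
import numpy as np

def ekf_landmark_update(mu, sigma, z, z_hat, H, R):
    """Correct one landmark estimate N(mu, sigma) with observation z.

    z_hat: predicted observation, H: observation Jacobian,
    R: measurement noise covariance. Returns the new mean, the new
    covariance, and the likelihood of the innovation.
    """
    nu = z - z_hat                               # innovation
    S = H @ sigma @ H.T + R                      # innovation covariance
    K = sigma @ H.T @ np.linalg.inv(S)           # Kalman gain
    mu_new = mu + K @ nu
    sigma_new = (np.eye(len(mu)) - K @ H) @ sigma
    # Gaussian density of the innovation under N(0, S)
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * S))
    likelihood = np.exp(-0.5 * nu @ np.linalg.solve(S, nu)) / norm
    return mu_new, sigma_new, likelihood

# Toy example with assumed values
mu0 = np.array([0.0, 0.0])
sigma0 = np.eye(2)
mu1, sigma1, w = ekf_landmark_update(
    mu0, sigma0,
    z=np.array([1.0, 0.0]), z_hat=np.array([0.0, 0.0]),
    H=np.eye(2), R=np.eye(2),
)
```

The likelihood returned here is the quantity that would be multiplied into the particle weight for each matched landmark.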
4.2.5 GPGPU implementation of inverse depth initialization task (FB5)
Five initial landmark parameters are calculated for all particles \((x_{i},y_{i},\theta_{i},\varphi_{i},\rho_{i})\). The inverse depth parameter ρ has a constant value ρ=0.25 [19]; hence, it is initialized on the CPU, while \((x_{i},y_{i},\theta_{i},\varphi_{i})\) are computed on the GPU. The landmark position in the current frame and the particle poses are transferred to the GPU via textures. Only two textures are needed for FB5: one read-only texture with RGBA internal format that holds the particle pose (one particle per texel: \(R=x_{p}\), \(G=y_{p}\), \(B=\theta_{p}\), \(A=0\)) and one write-only texture attached to the frame buffer object to store the initialized landmark parameters (one landmark per texel: \(R=x_{i}\), \(G=y_{i}\), \(B=\theta_{i}\), \(A=\varphi_{i}\)). The landmark position in the image is loaded to the shader as a uniform value. To exemplify GLSL source code, Algorithm 8 describes the fragment shader code needed for inverse depth initialization.
sampler2DRect is a specific pointer to the active texture unit; hence, partPose points to the input particle poses. The first line of the algorithm performs a texture lookup and retrieves the particle pose, stored in a four-dimensional vector pos_particle. The next two lines compute, in turn, the camera pose and the landmark position in the environment. The first-view camera pose \((x_{i},y_{i})\), the azimuth \(\theta_{i}\), and the elevation \(\varphi_{i}\) are stored in a specific variable referred to as gl_FragColor.
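To make the role of the five parameters concrete, the sketch below initializes an inverse depth landmark from one particle pose and converts it back to a Euclidean point. It is an illustration under several assumptions and does not reproduce Algorithm 8: the camera is taken to coincide with the robot pose, the bearing and elevation of the feature in the camera frame are assumed to be already known, and the direction-vector convention (azimuth in the ground plane, elevation above it) is chosen for simplicity. Only the initial value ρ=0.25 comes from the text.

```python
import numpy as np

RHO_INIT = 0.25  # initial inverse depth quoted in the text

def direction(theta, phi):
    """Unit ray for azimuth theta and elevation phi (assumed convention)."""
    return np.array([np.cos(phi) * np.cos(theta),
                     np.cos(phi) * np.sin(theta),
                     np.sin(phi)])

def init_landmark(particle_pose, bearing_cam, elevation_cam):
    """Return (x_i, y_i, theta_i, phi_i, rho_i) for one particle.

    Assumes the camera sits at the robot position and looks along the
    robot heading, so the world azimuth is heading plus camera bearing.
    """
    x_p, y_p, theta_p = particle_pose
    return np.array([x_p, y_p, theta_p + bearing_cam, elevation_cam, RHO_INIT])

def to_euclidean(landmark, cam_height=0.0):
    """First-view camera position plus the ray scaled by depth 1 / rho."""
    x_i, y_i, theta_i, phi_i, rho_i = landmark
    origin = np.array([x_i, y_i, cam_height])
    return origin + direction(theta_i, phi_i) / rho_i

lm = init_landmark((1.0, 2.0, 0.0), bearing_cam=0.0, elevation_cam=0.0)
point = to_euclidean(lm)           # 4 units ahead of the camera for rho = 0.25
```

Since `init_landmark` depends only on one particle pose and one image observation, it can be evaluated independently for every particle, which matches the one-texel-per-particle layout described above.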
4.2.6 GPGPU implementation of resampling task (FB6)
The implementation of FB6 is similar to that of FB1. One read-only texture contains the input likelihood score, and two textures attached to the frame buffer contain \(w_{\text{old}}\) and \(w_{\text{new}}\) (in this case, \(w_{\text{new}} = \text{score} * w_{\text{old}}\)). We switch the roles of the textures between read-only and write-only. The normalized weight is computed in one render pass: the input texture that stores the updated weights is loaded to the shader, and the sum of weights is loaded as a uniform value. Weight summation is a reduction-type operation that can also be implemented in parallel on the GPGPU by mapping an M×M texture to a 1×1 texture. The algorithm recursively reduces the output region size by computing the local sum of each 2×2 group of elements in one shader and writing it to the corresponding output location.
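The 2×2 reduction can be emulated with array slicing to show how many passes are needed. The Python sketch below is an analogy only: a square NumPy array stands in for the weight texture, each call to `reduce_pass` stands in for one render pass, and the padding of unused texels with zeros is an assumption about how a weight vector would be laid out in a power-of-two texture.

```python
import numpy as np

def reduce_pass(tex):
    """Sum each 2x2 block: an (n, n) array becomes (n/2, n/2)."""
    return (tex[0::2, 0::2] + tex[0::2, 1::2]
            + tex[1::2, 0::2] + tex[1::2, 1::2])

def texture_sum(weights, side):
    """Sum weights stored in a side x side texture (side a power of two)."""
    tex = np.zeros(side * side)
    tex[:len(weights)] = weights   # unused texels stay at zero
    tex = tex.reshape(side, side)
    passes = 0
    while tex.shape[0] > 1:
        tex = reduce_pass(tex)
        passes += 1
    return tex[0, 0], passes

weights = np.arange(1, 1001, dtype=float)   # 1000 toy particle weights
total, passes = texture_sum(weights, side=32)
```

A side×side texture is reduced in log2(side) passes, so 1000 weights stored in a 32×32 texture need five passes instead of 999 sequential additions.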
4.3 Partitioning model
4.3.1 Data transfer management
4.3.2 Optimizing memory access
GPU memory access is optimized using texture memory, which fully leverages the parallel processing power. In our implementation, texture memory access patterns exhibit spatial locality during kernel computation. In other words, a processing unit running a fragment shader is likely to read from an address near the addresses read by nearby processing units. Texture memory is cached on chip and has a specialized caching scheme optimized for spatial locality, which provides an effective bandwidth advantage and reduces memory accesses.
5 Experimental results
In this section, we analyze the processing times of the proposed heterogeneous implementation discussed in Section 4. The experimental tests as well as the evaluation of processing times were based on our methodology discussed in Section 3. First, all processing times reported in our work as well as the localization results are obtained using a real dataset (Section 3.1). Second, we separately evaluate each functional block to synthesize the full FastSLAM2.0 implementation results (Section 3.2). Third, since each FB has parameter dependencies, the evaluation was conducted based on a set of defined thresholds to bound the computation time (Section 3.3). Finally, an FB may occur one or more times in a single iteration depending on its parameters. We report the mean processing times \(t'_{\text {FB}_{x}}\) and \(t_{\text {FB}_{x}}\), computed respectively for the MOTS and for a single occurrence (Section 3.4).
5.1 Algorithm evaluation
\((s_x, s_y)\) is the estimated robot pose and \((x_{\text{GT}}, y_{\text{GT}})\) is the reference pose.
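As an illustration of how these two quantities can be compared, the sketch below assumes that the localization error is the Euclidean distance between the estimated and the reference position. That choice of metric, and the function names, are assumptions made for this example only.

```python
import math


def localization_error(estimated, reference):
    """Euclidean distance between estimated (s_x, s_y) and reference (x_GT, y_GT)."""
    s_x, s_y = estimated
    x_gt, y_gt = reference
    return math.hypot(s_x - x_gt, s_y - y_gt)


def mean_localization_error(estimates, references):
    """Average the per-pose error over a trajectory of paired poses."""
    errors = [localization_error(e, r) for e, r in zip(estimates, references)]
    return sum(errors) / len(errors)
```

Averaging over the trajectory gives a single figure that can be compared across different particle counts.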
5.2 Particlewise GPU parallelization
Montemerlo et al. [10] implemented FastSLAM2.0 with a laser range finder, which provides the range and bearing of a landmark in a scene. Such an algorithm can converge with few particles. The SLAM algorithm implemented in our work uses a monocular camera (a bearing-only sensor) to observe the environment. In such a system, a sufficient number of particles is necessary to maintain reasonable estimates of pose and landmark uncertainty, as stated in [8]. In addition, Eade and Drummond [8] conducted experiments with 50, 250, and 1000 particles to evaluate the impact of the number of particles on landmark and pose estimation. They showed that 50 particles are sufficient for a very short sequence with few images. However, a more detailed and rigorous analysis is necessary for long trajectories and large environments. Significant challenges remain for a FastSLAM2.0-based bearing-only system intended to operate at large geographic scales. The exact number of particles required is not yet defined and may increase with environment complexity.
In our evaluation (Section 5.1), we used a very long indoor sequence with 5000 images. To close the loop over this long trajectory, the number of particles must be increased so that uncertainties are estimated accurately. Our tests show that the monocular system can close the loop over this trajectory (Fig. 12) using more than 500 particles. As the number of particles increases, the landmark estimates become more accurate and the localization error decreases (Fig. 13).
5.3 Processing time evaluation
The one-core, quad-core, and CPU-GPGPU implementations were run and timed for 500 iterations using the Rawseeds dataset discussed in Section 5.1. Computing performance on the one-core and quad-core CPU is used as a baseline to analyze the GPGPU acceleration results. As discussed in Section 5.2, a large number of particles is needed to achieve accurate localization results. Therefore, we conduct the experiments with the number of particles that gives the most accurate results.
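Mean of processing times of functional blocks on Rawseeds dataset

Columns 2 to 5 give the mean processing time per single occurrence, \(t_{\text{FB}_x}\) (ms). Columns 7 to 10 give the mean processing time \(t'_{\text{FB}_x}\) (ms). In the Total row, the CPU-GPU value covers the quad-core CPU and GPU sub-columns together.

| Functional block (FB) | \(t_{\text{FB}_x}\) one-core | \(t_{\text{FB}_x}\) quad-core | \(t_{\text{FB}_x}\) CPU-GPU (quad CPU) | \(t_{\text{FB}_x}\) CPU-GPU (GPU) | MOTS | \(t'_{\text{FB}_x}\) one-core | \(t'_{\text{FB}_x}\) quad-core | \(t'_{\text{FB}_x}\) CPU-GPU (quad CPU) | \(t'_{\text{FB}_x}\) CPU-GPU (GPU) | Speedup |
|---|---|---|---|---|---|---|---|---|---|---|
| FB2 | 6.38 | 4.83 | 4.83 | – | 1 | 6.38 | 4.83 | 4.83 | – | 1.32 |
| FB1 | 272.03 | 150.19 | – | 3.42 | 15.1 | 4107.8 | 2267.9 | – | 51.66 | 43.9 |
| FB3 | 11.78 | 5.72 | – | 2.28 | 14.1 | 166.09 | 80.71 | – | 32.17 | 2.50 |
| FB4 | 7.12 | 5.04 | – | 1.37 | 14.1 | 100.50 | 71.09 | – | 19.45 | 3.65 |
| FB5 | 5.45 | 2.95 | – | 0.96 | 10 | 55.45 | 29.58 | – | 9.65 | 3.06 |
| FB6 | 4.7 | 1.88 | – | 0.51 | 5.2 | 24.87 | 9.8 | – | 2.7 | 3.6 |
| Total | 307.46 | 170.61 | 13.37 (CPU + GPU) | | – | 4461.09 | 2463.91 | 120.46 (CPU + GPU) | | 20.38 |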
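The short sketch below shows how the columns of this table appear to relate to each other. Two relations are inferred from the reported numbers rather than stated in the text, so they should be read as assumptions: the mean time \(t'\) is approximately the single-occurrence time multiplied by the MOTS value, and the speedup of a GPU block is the quad-core \(t'\) divided by the GPU \(t'\). The function names are hypothetical.

```python
# Values copied from the FB1 and FB3 rows of the table (times in ms).
rows = {
    "FB1": {"t_one": 272.03, "mots": 15.1, "tp_one": 4107.8,
            "tp_quad": 2267.9, "tp_gpu": 51.66, "speedup": 43.9},
    "FB3": {"t_one": 11.78, "mots": 14.1, "tp_one": 166.09,
            "tp_quad": 80.71, "tp_gpu": 32.17, "speedup": 2.50},
}


def scaled_time(t_single, mots):
    """Assumed relation: time per iteration = time per occurrence * MOTS."""
    return t_single * mots


def speedup(t_baseline, t_accelerated):
    """Assumed relation: ratio of quad-core time to CPU-GPU time."""
    return t_baseline / t_accelerated


for name, row in rows.items():
    print(name,
          round(scaled_time(row["t_one"], row["mots"]), 2),
          round(speedup(row["tp_quad"], row["tp_gpu"]), 2))
```

The recomputed values agree with the table to within rounding, which supports the assumed relations but does not replace the definitions given in Section 3.4.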
We note here that the prediction step (FB1) achieves a 44× speedup using the GPGPU. This is because the parallel implementation of this block on the GPU requires much less data transfer between CPU and GPU. For M particles, we transfer only M random numbers in single-precision floating-point format from CPU to GPU, and nothing is transferred back from GPU to CPU. The other functional blocks (FB3, FB4, FB5, and FB6) require data transfer from CPU to GPU and back from GPU to CPU. In FB3, we transfer the particle map (matched landmarks and their covariance matrices) to the GPU at each iteration to update the particle pose. In FB4, we transfer the matched landmarks and their covariance matrices to the GPU at each iteration to correct their states, and we transfer the updated landmarks back to the CPU. In FB5, we transfer the unmatched landmarks to the GPU and, once they are initialized, transfer them back to the CPU. In FB6, we transfer the computed likelihoods to the GPU and then transfer the resampled particles back to the CPU. These transfers reduce the gain that can be obtained from parallelization.
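To give a sense of scale for the FB1 transfer, the following sketch computes the upload size for M single-precision random numbers. It only restates the arithmetic implied above (4 bytes per value, one value per particle); the function name and the optional `floats_per_particle` parameter are hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # size of one single-precision value


def fb1_upload_bytes(num_particles, floats_per_particle=1):
    """Bytes sent from CPU to GPU for FB1; nothing is read back."""
    return num_particles * floats_per_particle * BYTES_PER_FLOAT32


for exponent in (4, 8, 12, 16):
    m = 2 ** exponent
    print(f"M = 2^{exponent}: {fb1_upload_bytes(m)} bytes")
```

Even at 2^16 particles the upload is 256 KiB per transfer, whereas the other blocks move landmark states and covariance matrices in both directions.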
As seen before, each FB of the algorithm depends on many parameters. The complexity of FB3 and FB4 increases with the number of matched landmarks to process. The decision whether a landmark is matched depends strongly on the uncertainty in the robot pose [31]. Because this uncertainty varies with the number of particles used, owing to the randomization applied to the motion model, the number of matched landmarks also varies between implementations. In other words, the number of matched landmarks is not necessarily the same across the one-core, quad-core, and GPU implementations, nor across different numbers of particles. This is confirmed in Fig. 15. For the GPGPU implementation of FB3 and FB4, the number of matched landmarks with \(2^{12}\) particles is higher than with \(2^{14}\) particles. Therefore, the GPU can process FB3 and FB4 with \(2^{14}\) particles even faster than with \(2^{12}\) particles. This is because the complexity still rises with the number of matched landmarks, since FB3 and FB4 are executed sequentially within each computing kernel. For the same reason, the one-core and quad-core CPU implementations can process FB3 and FB4 slightly faster with \(2^{12}\) than with \(2^{10}\) particles.
The complexity of FB5 increases with the number of new landmarks to initialize and add to the map. The decision whether to add a new landmark also depends on the robot pose uncertainty and on the number of particles used. Therefore, the number of newly initialized landmarks also varies from one implementation to another. This is seen in Fig. 15. The GPU and quad-core CPU implementations can process FB5 slightly faster with \(2^{12}\) than with \(2^{10}\) particles, because the number of new landmarks decreases when the algorithm runs with \(2^{12}\) particles. When the number of new landmarks increases steadily, the processing time of FB5 scales linearly with the number of particles, as is the case for the one-core implementation of FB5.
Since the number of matched landmarks varies, the number of computed likelihoods also varies. Moreover, if a likelihood is an outlier, it is discarded and not used in FB6 to resample particles. This strict policy can decrease the number of computed likelihoods, and hence the processing time of FB6 also decreases. Figure 15 shows these dependencies. The processing time of the FB6 GPU implementation decreases with \(2^{8}\) and \(2^{12}\) particles. Likewise, for the FB6 quad-core implementation, the processing time decreases with \(2^{12}\) particles.
Unlike the other functional blocks, FB1 depends only on the number of odometric data to process at each iteration. The number of odometric data provided by the sensor at each iteration is the same and is independent of the implementation type that is running. Therefore, the complexity of FB1 scales linearly with the number of particles (Fig. 15).
Such dependencies are inevitable, especially when dealing with SLAM based on a probabilistic approach. One solution is to bound the processing time by defining a threshold value for each parameter.
For few particles (\(2^{4}\)), the one-core CPU implementation performs better than the quad-core and GPU implementations. The quad-core implementation does not perform well for small numbers of particles (\(2^{4}\) and \(2^{8}\)). This is because many data are shared between threads and memory access is sequential among different threads, in particular at synchronization barriers among the running threads. This issue is common and well known in OpenMP implementations, and it results in performance degradation. In this case, the one-core implementation achieves the best performance. Parallelizing with more threads on the quad-core CPU is only worthwhile when the number of particles increases (\(2^{10}\) to \(2^{16}\)); the quad-core implementation then outperforms the one-core implementation. With \(2^{4}\) particles, the GPU implementation also performs worse. This is because the data transfer time exceeds the processing time. However, for larger numbers of particles, the degree of parallelization becomes significant, and hence the GPGPU implementation is faster than the one-core and quad-core implementations.
M is the number of particles, f is the clock frequency expressed in Hz, and t is the processing time in seconds. CPP serves as a baseline for comparing the performance of implementations on different processing units [32].
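For reference, a form of this metric that is consistent with the symbols defined here can be written as below. It assumes that CPP denotes the number of clock cycles spent per particle; this reading is an assumption made for illustration, inferred from the units of f, t, and M.

```latex
\[
  \mathrm{CPP} = \frac{f \cdot t}{M}
\]
```

Under this reading, the product \(f \cdot t\) is the total number of clock cycles consumed during the processing time, and dividing by M normalizes it per particle, so a lower CPP indicates a more efficient implementation for a given number of particles.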
For the one-core and quad-core implementations, the CPP value reaches its minimum with \(2^{10}\) particles. This is because the global processing time t for both implementations increases significantly with \(2^{12}\) particles (see Fig. 16). The CPU needs more cycles per particle to process \(2^{12}\) particles than to process \(2^{10}\) particles. This is related to the memory copy in the resampling step when a particle is duplicated. The memory copy is a time-consuming operation when there are many landmarks in the map, and it significantly increases the global processing time, especially when many particles are used. This explains the significant increase in global processing time with 4096 particles. For even larger numbers of particles (\(2^{14}\) and \(2^{15}\)), we decrease the maximum number of landmarks in the map of each particle, as mentioned in Section 3.3, to avoid exhausting memory. This reduces the memory copy operations in the resampling step, and hence the global processing time increases steadily with the number of particles in the filter.
In contrast, for the GPGPU implementation, the global time increases only slightly with \(2^{12}\) particles. In general, the CPP of the GPGPU implementation remains lower than that of the one-core and multicore implementations as the number of particles increases.
6 Conclusions
This article proposed an efficient matching of the monocular FastSLAM2.0 algorithm to a heterogeneous embedded architecture. It describes the first complete parallel FastSLAM2.0 implementation in the literature on an embedded GPGPU. Using a real dataset, the parallel CPU-GPU implementation is shown to outperform a quad-core CPU implementation for many particles while maintaining the same localization accuracy.
The absolute performance of FastSLAM2.0 on an embedded GPGPU depends on the number of embedded cores (i.e., processing elements). As the number of processing elements steadily increases and can be expected to match the number of particles needed for an accurate monocular FastSLAM2.0 system intended to operate in a large-scale environment, the GPGPU is an interesting alternative architecture for monocular FastSLAM2.0 implementations. For a fixed number of particles and a sufficiently large number of embedded cores, the parallel implementation will always be more efficient. Our FastSLAM2.0 GPGPU implementation achieved up to a 20× speedup with 192 GPGPU cores. It leads to a system that runs in real time and processes images at the frame rate at which they were acquired (30 FPS). This meets the performance requirement for a robot operating in real time.
It is important to note that, at the time of writing this paper, the Tegra K1 is the only embedded board that has a powerful GPGPU with 192 cores. Other boards with embedded GPUs contain only limited numbers of cores. In conclusion, the monocular FastSLAM2.0 is shown to perform well with a high number of particles in a large-scale environment. A naive implementation of FastSLAM2.0 with a high number of particles would increase the localization accuracy, but at the expense of the robot's ability to operate in real time. Our accelerated implementation on an embedded GPGPU achieves a compromise between accurate localization and real-time performance. Our results demonstrate that an optimized monocular FastSLAM2.0 partitioned on a heterogeneous embedded architecture can provide high-speed execution and accurate results under real-time constraints in large-scale environments.
7 Appendix
Parameter definitions

| Parameter | Definition |
|---|---|
| zmssd | Zero-mean sum of squared differences |
| \(m_d\) | Pixel mean of the landmark descriptor |
| \(I_{\text{lmk}}\) | Pixel intensity of the landmark descriptor |
| \(m_p\) | Pixel mean of the corner descriptor |
| \(I_p\) | Pixel intensity of the corner descriptor |
| \(x, y\) | Location in the image of the detected corner |
| \(i, j\) | Indexes to surrounding pixels in the descriptor |
| \(f\) | Motion model |
| \(h\) | Pinhole model |
| \(u_t(n_l, n_r)\) | Odometry data |
| \(s_t(s_x, s_y, s_\theta)\) | Particle pose |
| \(\delta_s, \delta_\theta\) | Longitudinal and angular displacement |
| \(b\) | Wheel base |
| \(P_m\) | Initial covariance matrix of the particles |
| \(G_u\) | Jacobian matrix of the motion model \(f\) with respect to \(s_t\) |
| \(G_p\) | Jacobian matrix of the motion model \(f\) with respect to \(\delta_s, \delta_\theta\) |
| \(Q\) | Motion model noise |
| \(M\) | Number of particles |
| \(N\) | Number of landmarks |
| \(\mu\) | Mean of the new proposal distribution |
| \(\Sigma\) | Covariance matrix of the new proposal distribution |
| \(H_p\) | Jacobian matrix of the observation model (pinhole) |
| \(Z_n\) | Innovation covariance |
| \(z_t\) | Measurement |
| \(\hat{z}_t\) | Measurement prediction |
| \((\hat{u}, \hat{v})\) | Predicted landmark position in the image |
| \((u, v)\) | Landmark position in the image |
| \((x, y, \rho, \phi, \theta)\) | Landmark inverse depth parametrization |
| \((c_u, c_v, f_{k_u})\) | Standard camera calibration |
| \(X(x_{\text{cam}}, y_{\text{cam}}, z_{\text{cam}})\) | Landmark 3D world coordinate |
| \(C\) | Landmark covariance matrix |
| \(\omega\) | Particle weight |
| \(N_{\text{eff}}\) | Number of effective particles |
Declarations
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
[1] SA Li, CC Hsu, WL Lin, JP Wang, in IEEE International Conference on System Science and Engineering (ICSSE). Hardware/software codesign of particle filter and its application in object tracking (IEEE, New York, 2011), pp. 87–91.
[2] M Moyers, D Stevens, V Chouliaras, D Mulvaney, in IEEE International Conference on Electronics Circuits and Systems (ICECS), Tunisia. Implementation of a fixed-point FastSLAM 2.0 algorithm on a configurable and extensible VLIW processor (IEEE, Sfax, 2009).
[3] TC Chau, X Niu, A Eele, W Luk, PY Cheung, J Maciejowski, in Reconfigurable Computing: Architectures, Tools and Applications. Heterogeneous reconfigurable system for adaptive particle filters in real-time applications (Springer, Los Angeles, 2013), pp. 1–12.
[4] O Tosun, et al, in Intelligent Vehicles Symposium (IV), 2011 IEEE. Parallelization of particle filter based localization and map matching algorithms on multicore/manycore architectures (IEEE, Baden-Baden, 2011), pp. 820–826.
[5] S Maskell, B Alun-Jones, M Macleod, in IEEE Nonlinear Statistical Signal Processing Workshop. A single instruction multiple data particle filter (IEEE, Cambridge, 2006), pp. 51–54. doi:10.1109/NSSPW.2006.4378818.
[6] G Hendeby, R Karlsson, F Gustafsson, Particle filtering: the need for speed. EURASIP J. Adv. Signal Process. 2010:, 22–1229 (2010). doi:10.1155/2010/181403.
[7] H Zhang, F Martin, in IEEE International Conference on Technologies for Practical Robot Applications (TePRA). CUDA accelerated robot localization and mapping (IEEE, Woburn, 2013), pp. 1–6.
[8] E Eade, T Drummond, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1. Scalable monocular SLAM (IEEE, New York, 2006), pp. 469–476. doi:10.1109/CVPR.2006.263.
[9] E Rosten, R Porter, T Drummond, Faster and better: a machine learning approach to corner detection. IEEE Trans. Pattern Anal. Mach. Intell. 32(1), 105–119 (2010).
[10] M Montemerlo, S Thrun, D Koller, B Wegbreit, in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI). FastSLAM2.0 an improved particle filtering algorithm for simultaneous localization and mapping that provably converges (IJCAI, Acapulco, 2003).
[11] E Seignez, M Kieffer, A Lambert, E Walter, T Maurin, Real-time bounded-error state estimation for vehicle tracking. IEEE Int. J. Robot. Res. 28:, 34–48 (2009).
[12] M Abouzahir, A Elouardi, S Bouaziz, R Latif, T Abdelouahed, in 2014 Second World Conference On Complex Systems (WCCS). An improved Rao-Blackwellized particle filter based SLAM running on an OMAP embedded architecture (IEEE, Agadir, 2014), pp. 716–721. doi:10.1109/ICoCS.2014.7061001.
[13] S Thrun, Probabilistic robotics. Commun. ACM. 45(3), 52–57 (2002).
[14] P Hellekalek, Good random number generators are (not so) easy to find. Math. Comput. Simul. 46(5), 485–505 (1998).
[15] A De Matteis, S Pagnutti, Parallelization of random number generators and long-range correlations. Numerische Mathematik. 53(5), 595–608 (1988).
[16] CJK Tan, The PLFG parallel pseudorandom number generator. Futur. Gener. Comput. Syst. 18(5), 693–698 (2002).
[17] P Hellekalek, in ACM SIGSIM Simulation Digest, 28. Don’t trust parallel Monte Carlo! (IEEE Computer Society, Banff, Alberta, 1998), pp. 82–89.
[18] M Sussman, W Crutchfield, M Papakipos, in Proceedings of the 21st ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware. Pseudorandom number generation on the GPU (ACM, New York, 2006), pp. 87–94.
[19] J Civera, AJ Davison, JMM Montiel, Inverse depth parametrization for monocular SLAM. IEEE Transactions on Robotics. 24(5), 932–945 (2008).
[20] WG Madow, LH Madow, On the theory of systematic sampling, I. The Annals of Mathematical Statistics. 15(1), 1–24 (1944). doi:10.1214/aoms/1177731312.
[21] A Dine, A Elouardi, B Vincke, S Bouaziz, in 2015 IEEE International Conference On Robotics and Automation (ICRA). Graph-based SLAM embedded implementation on low-cost architectures: a practical approach (2015), pp. 4612–4619. doi:10.1109/ICRA.2015.7139838.
[22] A Dine, A Elouardi, B Vincke, S Bouaziz, in 2015 IEEE 26th International Conference On Application-specific Systems, Architectures and Processors (ASAP). Speeding up graph-based SLAM algorithm: a GPU-based heterogeneous architecture study (IEEE, Toronto, 2015), pp. 72–73.
[23] B Vincke, A Elouardi, A Lambert, Real time simultaneous localization and mapping: towards low-cost multiprocessor embedded systems. EURASIP J. Embedded Syst. 2012(1), 1–14 (2012).
[24] A Bonarini, W Burgard, JD Tardos, in Proceedings of IROS’06 Workshop on Benchmarks in Robotics Research. Rawseeds: Robotics advancement through web-publishing of sensorial and elaborated extensive data sets (Proceedings of IROS’06, Beijing, 2006).
[25] E Eade, Monocular simultaneous localization and mapping. PhD thesis (2008).
[26] M Abouzahir, A Elouardi, S Bouaziz, R Latif, A Tajer, in IEEE. The 13th International Conference on Control, Automation, Robotics and Vision, ICARCV. FastSLAM2.0 running on a low-cost embedded architecture (Marina Bay Sands, Singapore, 2014).
[27] A Maghazeh, UD Bordoloi, P Eles, Z Peng, in 2013 International Conference On Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII). General purpose computing on low-power embedded GPUs: has it come of age? (IEEE, Agios Konstantinos, 2013), pp. 1–10.
[28] L Nardi, B Bodin, MZ Zia, J Mawer, A Nisbet, PH Kelly, AJ Davison, M Luján, MF O’Boyle, G Riley, et al, in 2015 IEEE International Conference On Robotics and Automation (ICRA). Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM (IEEE, Seattle, 2015), pp. 5783–5790.
[29] A Weinlich, B Keck, H Scherl, M Kowarschik, J Hornegger, in Proceedings of the First International Workshop on New Frontiers in High-performance and Hardware-aware Computing, 1. Comparison of high-speed ray casting on GPU using CUDA and OpenGL (Proceedings of HipHaC’08, Lake Como, 2008), pp. 25–30.
[30] RS Oliveira, BM Rocha, RM Amorim, FO Campos, W Meira Jr, EM Toledo, RW dos Santos, in Parallel Processing and Applied Mathematics. Comparing CUDA, OpenCL and OpenGL implementations of the cardiac monodomain equations (Springer, Torun, 2011), pp. 111–120.
[31] M Montemerlo, FastSLAM: a factored solution to the simultaneous localization and mapping problem with unknown data association. PhD thesis (2003).
[32] M Njiki, A Elouardi, S Bouaziz, O Casula, O Roy, A multi-FPGA architecture-based real-time TFM ultrasound imaging. J. Real-Time Image Process. (2016). doi:10.1007/s1155401605635.
[33] NVIDIA Tegra K1 Embedded Platform Design Guide. http://developer.download.nvidia.com/embedded/jetson/TK1/docs/3_HWDesignDev/TegraK1_Embedded_DG_v03.pdf. Accessed Jun 2016.