Proposed hardware architectures of particle filter for object tracking
 Howida A Abd ElHalym^{1},
 Imbaby Ismail Mahmoud^{1} and
 SED Habib^{2}
https://doi.org/10.1186/16876180201217
© Abd ElHalym et al; licensee Springer. 2012
Received: 23 April 2011
Accepted: 23 January 2012
Published: 23 January 2012
Abstract
In this article, efficient hardware architectures for the particle filter (PF) are presented. We propose three different architectures for implementing the Sequential Importance Resampling Filter (SIRF). The first architecture is a two-step sequential PF machine, in which particle sampling, weight calculation, and output calculation are carried out in parallel during the first step, followed by sequential resampling in the second step. For the weight computation step, a piecewise linear function is used instead of the classical exponential function. This decreases the complexity of the architecture without degrading the results. The second architecture speeds up the resampling step via a parallel, rather than a serial, architecture, and targets a balance between hardware resources and speed of operation. The third architecture implements the SIRF as a distributed PF composed of several processing elements and a central unit. All the proposed architectures are captured in VHDL, synthesized using the Xilinx environment, and verified using the ModelSim simulator. Synthesis results confirm the resource reduction and speed-up advantages of our architectures.
1. Introduction
The adoption of particle filters (PFs) in real-time systems is hampered by their computational complexity. The PF typically involves several complex mathematical operations, in addition to the large memory required to store and handle the particles. Parallel processing offers a possible solution to the real-time requirement. However, full parallelization is obstructed by the resampling step, which is serial in nature. Several efforts have been made to construct distributed resampling algorithms [1, 2]. Implementing PF applications on hardware or hardware/software co-design platforms is further challenging because of the resource constraints of such platforms. Designing and implementing a generic yet highly optimized architecture for all PF-based systems is not easy because of the wide range of applications to which particle filtering techniques are applied [3–5].
In the following, we review the main approaches to the hardware implementation of PFs, especially for object tracking.
Bolić [6] proposed an architecture for distributed resampling with proportional allocation. The main idea is to store the particles to be routed among the processing elements (PEs) in dedicated memories in the central unit (CU), and to provide a very fast interface capable of reading particles from the CU and routing them to the PEs. The overall memory requirement of this architecture equals KM, where K is the number of PEs and M is the total number of particles.
Athalye et al. [7] presented generic architectures for the implementation of the Sequential Importance Resampling Filter (SIRF). The proposed architecture is based on a dual-port memory. The memory stores the addresses of the particles in its upper half, while the sampled particles are stored in its lower half. The idea is that the resampling unit returns the set of indexes (pointers) of the replicated particles instead of the particles themselves. Index addressing alone does not ensure that the single-memory scheme works correctly, so they used additional memories to store the indexes of the replicated particles and of the discarded particles. The overall memory size is 4M: a dual-port memory of depth 2M to store the addresses and the particle state vectors, a memory of depth M to store the replicated particle indexes, and a FIFO of depth M to store the discarded particle indexes. They proposed two architectures to implement the SIRF, one using systematic resampling (SR) and one using residual systematic resampling (RSR).
Hong et al. [2] proposed a parallel PF consisting of up to four processing elements (PEs). The PEs are connected to a single CU responsible for resampling. The CU is designed to support distributed resampling with both proportional allocation (RPA) and non-proportional allocation (RNA). The proposed resampling architecture reduces the overall resampling time by a factor equal to the number of PEs. The communication between the PEs is via a two-level interconnect. The first level is used for interactions between the PEs and the CU, and the second level is used for interactions inside the CU. The CU contains buffers RB_{i} (i = 1:4) to store excess particles that will be transmitted to the different PEs. The size of these buffers is 2M. Additional memory space is needed to store the tagged particles in tag buffers (TB). M/5 memory words are needed for each TB_{i} (i = 1:4).
Alarcon and Lopez [8] applied PFs for tracking the lines of a road in real time. For each image, the presence of the lane lines is detected and their center of mass is calculated. Three consecutive images frame^{0}, frame^{1}, and frame^{2} are used for prediction and full tracking of the lane lines. The architecture is designed such that each particle is evaluated and appropriately updated independent of the rest.
Uk Cho et al. [9, 10] proposed a PF algorithm specifically designed for object tracking. The architecture consists of five blocks: the particle initiator (particle sampler), coordinates comparator, particle normalizer, data output, and particle selector. The new position of each particle is sampled from the resampled particles and Gaussian white noise, which are supplied by the particle selector and the Gaussian distribution lookup table (LUT), respectively. They eliminated the division operation of the particle resampling by using a variable resampling range. The particle selector block performs the resampling step of the PF algorithm. It selects new particles among the current particles according to the uniform random distribution obtained from the random distribution LUT and the cumulative function from the cumulative LUT in the particle normalizer block.
Velmurugan [11] developed an FPGA implementation of the bearings-only tracker, similar to that in [7]. He used the Xilinx System Generator tool, which reduced the development time and provided estimates of the FPGA resources consumed. The implementation lacks a higher-level module to recursively propagate the particles and update the state estimates over time, because this would require developing a top-level VHDL module to control the memory reads and writes. The System Generator setup is not well suited to designing this control operation, so it was not pursued. The implementation performs the computations for M particles in a single iteration of the PF only.
Medeiros et al. [12] focused on the parallel computation of the particle weights of the color-based PF. The architecture is composed of a linear array of PEs, each consisting of an arithmetic logic unit and a small amount of memory, a digital input processor, and a digital output processor. A global control processor is responsible for controlling the operation of the PEs and is able to carry out global DSP operations.
Saha et al. [13] introduced a parameterized design framework for PFs. This general framework allows the system features (e.g., number of particles) to be defined as parameters according to the application considered. The memory banks are parameterized. The proposed architecture consists of an array of processor elements and a resampling unit with a set of parameterized interfaces. The processor element consists of a weight calculation unit, a noise generator, and a processor element core. This approach reduces the redesign effort.
Recently, Hendeby et al. [14] used general-purpose computing on graphics processing units (GPGPU) techniques to build a parallel recursive Bayesian estimation implementation using PFs.
The main objective of most of the published studies on the hardware implementation of PFs is to parallelize the resampling step, or to simplify it via the use of LUTs. It may be noted, however, that for a moderate number of particles, resampling itself is not computationally expensive. A PF hardware implementation has to consider all four main steps: particle sampling, weighting, resampling, and the output step. An efficient implementation should address all these steps from the perspective of execution time, hardware resources, and robust performance.
Although the distributed implementation proposed in [6] reduces the execution time from M to M/K clock cycles, it uses MK memory locations instead of the 2M of the straightforward implementation. Similarly, the parallel resampling architecture proposed in [2] reduces the overall resampling time by a factor of K, but it is limited to four PEs. In addition, this architecture uses 3M memory locations to store the particles: one memory to store the M particles to be resampled, one to store the M replicated particles, and one to store the M particles that are used as inputs in the next sampling step. Uk Cho et al. [9, 10] simplified the implementation of PFs by using LUTs, at the cost of increased implementation area and execution time.
In this article, three novel, efficient hardware architectures of the SIRF for object tracking are introduced and implemented. The first architecture is a two-step architecture with sequential resampling, in which particle sampling, weight calculation, and output calculation are carried out in parallel during the first step, followed by sequential resampling in the second step. This first architecture serves as a core unit for the next two architectures. The second architecture speeds up the resampling step (second step) via a parallel, rather than a serial, architecture. The third architecture implements the SIRF as a distributed PF composed of several PEs and a CU. Each PE is in fact the two-step core of the first architecture. These architectures aim at enhancing the speed of operation while maintaining efficient utilization of logic and memory resources. All the proposed architectures are implemented on an FPGA platform. A preliminary report covering the first two architectures was presented at a related conference [15].
2. Particle filters
The PF recursively approximates the sequence of posterior probability measures associated with a state-space dynamic model using a finite set of weighted samples. The key idea is to represent the required posterior density function by a set of random samples with associated weights and to compute estimates based on these samples and weights. The PF consists of two phases: prediction and update. The system is represented by state-space and observation equations. Consider an object that has a state X_{t} and observation Z_{t} at discrete time t. The previous states at times t - 1, t - 2,..., 2, 1 are denoted as X_{t-1}, X_{t-2},..., X_{2} and X_{1}, respectively. p(X_{t} | X_{t-1}) describes the transition of the state vector X_{t} (the dynamic or motion model). Let all available observations be Z_{1:t-1} = {Z_{t-1},..., Z_{1}}. The prediction stage uses the probabilistic system transition model to predict the posterior probability distribution at time t. We therefore need to construct the probability density function (pdf) p(x_{t} | z_{t}), assuming that p(x_{0} | z_{0}) ≡ p(x_{0}) is the initial pdf of the state vector, which is known as the prior (z_{0} denotes the empty set of measurements). The posterior density p(X_{1:t} | Z_{1:t}) may be obtained recursively in two stages: prediction and update.
Equation 1 assumes that p(x_{t} | x_{t-1}, Z_{1:t-1}) = p(x_{t} | x_{t-1}, Z_{t-1}). This approximation is particularly useful in the common case when only a filtered estimate of p(x_{t} | Z_{1:t}) is required at each time step [17]. In such scenarios, only x_{t} needs to be stored, so the path (X_{1:t-1}) and the history of the observations (Z_{1:t-1}) can be discarded. All practical software or hardware implementations of PFs adopt this approximation to avoid intractable computational complexity. Throughout this article, we also adopt this assumption.
The likelihood function p(z_{t} | x_{t}) is defined by the measurement model. In the update stage (Equation 2), the measurement z_{t} is used to modify the prior density to obtain the required posterior density of the current state. The recurrence relations (Equations 1 and 2) form the basis for the optimal Bayesian solution. This recursive propagation of the posterior density is only a conceptual solution in that, in general, it cannot be determined analytically. Extended Kalman filters (EKFs), Gaussian Sum Filters (GSFs) [18], and PFs approximate the optimal Bayesian solution.
Therefore, we have a discrete weighted approximation to the true posterior p(x_{t} | z_{t}) by using the discrete random measures. In addition, the estimate is calculated as a weighted mean.
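For reference, the prediction and update stages of the standard recursive Bayesian formulation can be written as below. This is a sketch in the notation of this section, under the assumption that Equations 1 and 2 take their usual textbook form; it is not a transcription of the equations themselves.

```latex
% Prediction (assumed form of Equation 1)
p(x_t \mid Z_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid Z_{1:t-1})\, dx_{t-1}

% Update (assumed form of Equation 2)
p(x_t \mid Z_{1:t}) = \frac{p(z_t \mid x_t)\, p(x_t \mid Z_{1:t-1})}{p(z_t \mid Z_{1:t-1})}
```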
2.1. SIRFs overview

The state dynamics and measurement functions are known.

The likelihood function is available.
SIRFs choose the importance density to be the transition prior and perform the resampling step at every time index [17].
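The SIRF iteration described above can be sketched in software as follows. This is a minimal illustration for a one-dimensional state, not the hardware design: the function name `sir_step`, the callables `transition` and `likelihood`, and the use of multinomial resampling via `random.Random.choices` are all assumptions made for brevity.

```python
import random


def sir_step(particles, observation, transition, likelihood, rng):
    """One SIR iteration: sample from the transition prior, weight, estimate, resample."""
    m = len(particles)
    # Sampling: the importance density is the transition prior.
    proposed = [transition(x, rng) for x in particles]
    # Weighting: the previous iteration resampled, so all particles enter with
    # equal weights and each new weight is proportional to the likelihood.
    weights = [likelihood(observation, x) for x in proposed]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Output: weighted mean of the sampled particles.
    estimate = sum(w * x for w, x in zip(weights, proposed))
    # Resampling at every time index (multinomial here, as an assumption).
    resampled = rng.choices(proposed, weights=weights, k=m)
    return resampled, estimate
```

A caller would supply a motion model such as `lambda x, r: x + r.gauss(0.0, 1.0)` and a measurement model such as a Gaussian in `z - x`, then feed the returned particles into the next iteration.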
2.2. PF for tracking
PFs provide robust tracking for moving objects, especially in the case of nonlinear and non-Gaussian problems. PFs must be designed in a way that avoids the loss of tracking. For our image tracking application, the moving object is expected to remain within a region of interest (ROI) of 32 × 32 pixels between two consecutive frames. So, the number of particles needed to represent the state space corresponding to this area is of moderate value.
Bolic et al. [19] provide a performance and complexity analysis of the PF as applied to real-time object tracking. They address the effect of the number of particles and of the sample rate. They found that the performance of particle filters ceases to improve when the number of particles exceeds a certain number N. This number depends on the problem (the object's trajectory, the dynamics of the object, etc.).
Previous related studies in the field of object tracking [5, 20, 21] used 64, 50, and 100 particles, respectively. In our previous studies [22, 23], we used 100 particles.
For hundreds of particles, the ratio between the latency cycles and the number of particles affects the total execution time. So, to build an efficient hardware implementation of the SIRF, it is important to consider all four main steps: particle sampling, weighting, resampling, and the output step. An efficient implementation should address all these steps from the perspective of execution time, hardware resources, and robust performance.
2.3. Architectures
This section is devoted to a full and detailed presentation of our first two SIRF architectures. The main goal of our implementation is to minimize the execution time and the resource usage without affecting the performance.
2.3.1. The proposed two-step architecture with sequential resampling
In the following, we describe the function of the different sub-blocks of Figure 2. The registers Particle R and NR store the current particle vector and its replication factor, respectively. The contents of these registers are loaded into the Particle S and NS registers, respectively, if NR is nonzero. The WM register file contains the weights of the particles, and the RFM register file contains the corresponding replication factors.
 1. The particles are sampled (generated) at the Sample engine, using the Random generator [24] and the known initial state vector of the object, and are stored in the FIFO. As soon as a particle is sampled, it can be used for weight calculation.
 2. During the weight calculation, the total sum of the weights is accumulated in the weight calculation engine.
 3. Once all the particles have been sampled, the resampling engine starts the resampling process. The replication factor of each particle is calculated according to its weight, as described in Section 2.3.1.2.
 4. When the replication factors of the particles have been calculated, the output engine starts the output calculation process as follows:
  a. Read one particle from the FIFO into the Particle R register and the corresponding replication factor into the NR register; continue reading until a nonzero value of NR is found. At that time, the particle is transferred to the Particle S register and the content of register NR is transferred to register NS. NS is decremented at each clock cycle.
  b. The contents of register Particle S are sampled by adding random values from the Random Generator, and the resulting particle is written to the FIFO. At the same time, the generated particle is passed to the weight calculation unit. In addition, the contents of register Particle S are used in the output calculation: the particle x and y values are multiplied by 1/M and accumulated to obtain the mean output values (the X and Y positions of the tracked object).
  c. While register NS is being decremented to zero, the FIFO reading does not stop until a nonzero value of the NR register is found. Figure 3 shows the state diagram of the read and write operations from and to the FIFO.
 5. As the M particles are read and repeated according to their replication factors, the resampling engine sequentially calculates the replication factor of each particle for the next time instant. The resampling engine calculates one replication factor every clock cycle.
 6. Finally, the FIFO again contains M particles and the RFM contains the M corresponding replication factors. For each incoming observation, steps 4 and 5 are repeated.
Thus, exploiting the data independence between the particles, our architecture carries out the three steps of particle generation, weight calculation, and output calculation concurrently. The resampling step cannot be overlapped in time with the weight calculation step because it requires the overall sum of the weights. So, we select the modified RSR algorithm [25]. In the following, we describe the main function blocks of our proposed structure.
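The register-level behavior of step 4 can be modeled in software as below. This is only a sequential behavioral sketch under stated assumptions: the hardware performs these operations concurrently, and the function name `replicate_and_sample`, the `noise` callable, and the use of `deque` objects for the FIFO and the RFM are illustrative choices.

```python
from collections import deque


def replicate_and_sample(fifo, rfm, noise, rng):
    """Model of steps 4a-4c: replicate each particle NR times, perturb it, refill the FIFO."""
    m = len(fifo)
    new_fifo = deque()
    x_mean = y_mean = 0.0
    for _ in range(m):
        particle_r = fifo.popleft()       # Particle R register
        nr = rfm.popleft()                # NR register
        if nr == 0:
            continue                      # discarded particle: keep reading the FIFO
        particle_s, ns = particle_r, nr   # load Particle S and NS registers
        while ns > 0:
            x, y = particle_s
            # Output calculation: accumulate the particle scaled by 1/M.
            x_mean += x / m
            y_mean += y / m
            # Sample step: add a random value to each coordinate and write back.
            new_fifo.append((x + noise(rng), y + noise(rng)))
            ns -= 1                       # NS is decremented once per replica
    return new_fifo, (x_mean, y_mean)
```

When the replication factors sum to M, the refilled FIFO again holds M particles, which matches step 6.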
2.3.1.1. The weight calculation engine
where (i, j) is the particle position and (x, y) is the previous object position. σ_{ P } is the variance of the position. ${I}_{t}^{\left(i,j\right)}$ is the mean gray intensity level estimated at particle position (i, j) at time t, and ${Z}_{t}^{\left(i,j\right)}$ is the measured mean gray level intensity value at position (i, j) at time t. σ_{ I } is the variance of the gray level intensity. The overall likelihood is taken as L = LP * LI.
The weighting engine stores the index of the maximum weight value for use in the resampling engine. In addition, the weighting engine accumulates the weight sum for use in the resampling unit.
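The following sketch illustrates how a piecewise linear function can stand in for an exponential likelihood while the engine tracks the weight sum and the index of the maximum weight. The exact likelihood expressions are not reproduced here, so both kernels are assumptions: `gaussian` is the usual exponential form, `piecewise_linear` is a hypothetical triangular approximation with an assumed support of `width` times sigma, and the Euclidean position distance is likewise an assumption.

```python
import math


def gaussian(d, sigma):
    """Classical exponential likelihood kernel (assumed form)."""
    return math.exp(-(d * d) / (2.0 * sigma * sigma))


def piecewise_linear(d, sigma, width=3.0):
    """Hypothetical triangular stand-in: 1 at d = 0, falling linearly to 0 at width * sigma."""
    return max(0.0, 1.0 - abs(d) / (width * sigma))


def weight_engine(particles, prev_pos, sigma_p, sigma_i, kernel):
    """particles: list of (i, j, I, Z) tuples. Returns weights, their sum, and the argmax index."""
    x, y = prev_pos
    weights, total, best = [], 0.0, 0
    for idx, (i, j, est_i, meas_z) in enumerate(particles):
        lp = kernel(math.hypot(i - x, j - y), sigma_p)  # position likelihood LP
        li = kernel(est_i - meas_z, sigma_i)            # intensity likelihood LI
        w = lp * li                                     # overall likelihood L = LP * LI
        weights.append(w)
        total += w                                      # running weight sum for resampling
        if w > weights[best]:
            best = idx                                  # index of the maximum weight
    return weights, total, best
```

Passing `kernel=gaussian` or `kernel=piecewise_linear` allows the two weightings to be compared on the same particle set.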
2.3.1.2. The resampling engine
For hardware simplification, we simply use ΔU(i) = 0, i = 1,...,M, which eliminates the need for the random generator. For the RR method [25], the sum of the replication factors of all the particles ($N=\sum _{m=1}^{M}{r}^{m}$) is less than M, except for special cases. The remaining particles are obtained using other mechanisms. We add the difference (M - N) to the replication factor of the particle having the largest weight. The index of the highest-weight particle is stored during the calculation of the weight values in the weight calculation engine.
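The simplified resampling rule described above can be sketched as follows, assuming that each replication factor is the truncated value of the weight scaled by D = M/W (with ΔU = 0) and that the shortfall M - N is added to the particle with the largest weight. The function name `modified_rsr` is illustrative.

```python
def modified_rsr(weights):
    """Replication factors with Delta-U fixed to 0 and the remainder given to the largest weight."""
    m = len(weights)
    d = m / sum(weights)                      # D = M / W, W being the total weight
    factors = [int(w * d) for w in weights]   # truncated replication factors
    best = max(range(m), key=weights.__getitem__)
    factors[best] += m - sum(factors)         # add the difference (M - N)
    return factors
```

For example, the weights 0.5, 0.25, 0.125, 0.125 give truncated factors 2, 1, 0, 0 (N = 3), and the missing particle is assigned to the first particle, yielding 3, 1, 0, 0.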
2.3.1.3. The execution time
2.3.1.4. Main features of the two-step architecture with sequential resampling
 i. FIFOs are used instead of memories whenever possible. This saves address decoding logic and enhances the speed of operation.
 ii. Linearized likelihood functions are used to speed up the weight calculations while preserving the localization features of the Gaussian likelihood functions.
2.3.2. The two-step architecture with parallel resampling
2.3.2.1. The parallel resampling engine
Pseudocode of splitting the RSR algorithm into L loops:
Generate a random number ΔU^{1} ~ u[0, 1]
D = M/W // where W is the total sum of weights
For l = 1 to L
  r(l) = int((W(l) - ΔU(l))*D);
  ΔU(l+1) = ΔU(l) + r(l) - W(l)*D;
Loop l
  For m = (l-1)*M/L to l*M/L-1
    Perform RSR
  End For
End For
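The idea of splitting the resampling into L independent loops can be checked in software with the sketch below. It uses the standard RSR recursion (with the usual "+1" term and a starting offset in (0, 1/M]) rather than the simplified variant with ΔU = 0, and exact `Fraction` arithmetic on normalized weights so that the split and sequential results can be compared exactly; the names `rsr` and `split_rsr` are assumptions.

```python
from fractions import Fraction
from math import floor


def rsr(weights, u, m):
    """Standard RSR over normalized weights; returns the factors and the updated offset."""
    factors = []
    for w in weights:
        r = floor((w - u) * m) + 1       # replication factor of this particle
        u = u + Fraction(r, m) - w       # offset carried to the next particle
        factors.append(r)
    return factors, u


def split_rsr(weights, u, loops):
    """Split RSR into `loops` blocks that no longer depend on each other."""
    m = len(weights)
    size = m // loops
    blocks = [weights[l * size:(l + 1) * size] for l in range(loops)]
    # First pass: treat each block's weight sum as one particle to get its start offset.
    starts = []
    for block in blocks:
        starts.append(u)
        _, u = rsr([sum(block)], u, m)
    # Second pass: each block runs RSR from its own offset (parallel in hardware).
    factors = []
    for block, start in zip(blocks, starts):
        factors.extend(rsr(block, start, m)[0])
    return factors
```

Because the starting offset of each block depends only on the block weight sums, the second pass has no data dependence between blocks, which is what makes a parallel resampling engine possible.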
Comparison between the resources for different cases of the two-step sequential architecture with parallel resampling (Xilinx FPGA xc5vlx50t3ff1136)
| Resource | Sequential, L = 1, M = 8 | Parallel RSR, L = M = 8 | Parallel RSR, L = M = 64 |
| --- | --- | --- | --- |
| Slice registers | 11 | 11 | 131 |
| Slice LUTs | 77 | 78 | 1692 |
| Number of slices used as logic | 77 | 78 | 1692 |
| Number of DSP48Es | 1 | 8 | 48 |
2.3.2.2. The execution time
3. The distributed implementation of PF
In the following sections, we propose a distributed architecture for the efficient implementation of PFs. The main design goal of this architecture is to minimize the execution time. This goal is achieved by using multiple PEs with a CU. First, we review the published studies on distributed PFs. We also review the different networks used to connect the PEs. Next, we introduce our proposed distributed SIRF.
3.1. Algorithms and architectures for distributed PFs
In this section, we discuss the different resampling algorithms for distributed PFs.
3.1.1. Centralized resampling [27]
3.1.2. Distributed RPA [27]
Let the number of particles be M and the number of PEs be K, so that each PE is assigned N = M/K particles. After the weight calculation step, each PE calculates the sum of the weights of its particles, i.e., $W^{k}=\sum_{i=1}^{N} w^{i,k},\ k=1,2,\ldots,K$. Each PE then sends only this sum to the CU. Next, the CU treats the individual PEs as particles and carries out a resampling step among these K PEs according to their total weight sums. This resampling step is termed "inter-resampling". The CU sends to each PE its share of the replicated particles. Each PE subsequently carries out "intra-resampling" among the particle share assigned to it by the CU. The sequence of operations performed by the PEs and the CU is shown in Figure 10b. Particles must be redistributed among the PEs after each cycle of inter- and intra-resampling. Thus, each such cycle is followed by a PE communication phase to rebalance the particle distribution among the PEs. To speed up this process, the PEs are divided into groups, and particles are locally redistributed, in parallel, within each group. The PEs are then regrouped via several alternative schemes until the goal of equal particle distribution among the PEs is achieved.
The same resampling results are obtained via RPA or via sequential resampling. RPA is superior to centralized resampling because of the time savings in the parallelized intra-resampling step as well as the reduced PE-CU communications. The CU design is also simpler. The time for the resampling procedure in the distributed RPA is reduced by a factor of $\frac{M}{M/K+K}$, where M/K corresponds to the intra-resampling time and K is the time for inter-resampling.
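As a quick numerical illustration of this reduction factor, the sketch below evaluates M/(M/K + K) for several values of K. The function name is an assumption, and the example values M = 64 and K = 8 are those of the design example used later in the article.

```python
def rpa_resampling_speedup(m, k):
    """Reduction factor M / (M/K + K) of the distributed RPA resampling time."""
    return m / (m / k + k)


# Evaluate the factor for M = 64 over the power-of-two PE counts.
speedups = {k: rpa_resampling_speedup(64, k) for k in (1, 2, 4, 8, 16, 32, 64)}
best_k = max(speedups, key=speedups.get)
```

For M = 64, the factor is 4 at K = 8 and falls to 3.2 at both K = 4 and K = 16, since the denominator M/K + K is smallest when K equals the square root of M.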
3.1.3. Distributed resampling with RNA [27]
The problems of particle routing and of the delay introduced by the global preprocessing step (inter-resampling) can be solved by using the RNA algorithm. In the RNA algorithm, the number of particles within a group (one or more PEs) is fixed and equal to the number of particles per group (N^{k} = N). So, fully independent resampling is performed by each group. The inter-resampling step is eliminated completely. Again, intra-resampling leads to an imbalance in the particle distribution among the PE groups. Thus, group communication is again needed after intra-resampling. Bolic et al. [27] proposed several schemes for speeding up this communication, which they termed "regrouping". The main advantage of RNA is that the routing of particles is deterministic and is planned by the designer. The other characteristic of RNA is that the weights after resampling are not equal to 1/M; instead, they are equal within each group.
3.2. The proposed distributed architecture for SIRF
The particles that are replicated as a result of the resampling in PE^{k} are stored in the local FIFO^{k}, which holds N (N = M/K) particles. When there is a surplus of particles, N^{k} > N, the surplus particles are routed to the neighboring processor PE^{k+1} through the local interconnections.
If there is a shortage of particles in PE^{k}, N^{k} < N, then PE^{k} reads particles from the neighboring processor PE^{k-1} through the local interconnections.
 1. Each PE^{k} performs the sample step and the importance step on N particles (N = 8 in our design example) and accumulates the PE total weight $W^{k}=\sum_{i=1}^{N} w^{i,k},\ k=1,2,\ldots,K$.
 2. The CU receives the partial weight sums from the PEs, performs the inter-resampling, and sends back the replication number of particles N^{k} to PE^{k} for k = 1, 2,..., K.
 3. The CU also calculates the inverse of W^{k} (1/W^{k}) for k = 1, 2,..., K and sends each value to the corresponding PE^{k}, to be used in the intra-resampling in each PE.
 4. The PEs perform the intra-resampling in parallel. The particles are allocated to the corresponding local FIFO of N particles.
 5. All the PEs with N^{k} > N send the surplus (N^{k} - N) particles to the local network in parallel.
 6. All the PEs with N^{k} < N take the remaining (N - N^{k}) particles from the local network in parallel.
 7. The CU calculates the output Xmean, Ymean position of the object as well as the Imean gray level intensity of the object as cumulative sums.
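Steps 1, 2, 5, and 6 above can be sketched in software as follows. The sketch assumes that the CU applies the simplified resampling rule of Section 2.3.1.2 (truncated scaled weights, with the remainder given to the PE with the largest weight sum); the function names `inter_resample` and `distributed_allocation` are illustrative.

```python
def inter_resample(pe_weight_sums, m):
    """CU step: allocate M particles among the K PEs according to their weight sums."""
    k = len(pe_weight_sums)
    d = m / sum(pe_weight_sums)
    counts = [int(w * d) for w in pe_weight_sums]
    best = max(range(k), key=pe_weight_sums.__getitem__)
    counts[best] += m - sum(counts)          # remaining particles go to the heaviest PE
    return counts


def distributed_allocation(pe_weights, m):
    """pe_weights: one list of particle weights per PE. Returns N^k, surplus, and shortage."""
    sums = [sum(ws) for ws in pe_weights]         # step 1: per-PE weight sums W^k
    counts = inter_resample(sums, m)              # step 2: replication numbers N^k
    n = m // len(pe_weights)                      # N = M / K particles per PE
    surplus = [max(c - n, 0) for c in counts]     # step 5: particles sent to the network
    shortage = [max(n - c, 0) for c in counts]    # step 6: particles taken from the network
    return counts, surplus, shortage
```

Since the replication numbers sum to M, the total surplus always equals the total shortage, so every particle placed on the local network is eventually consumed by some PE.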
3.2.1. The resampling step
We use the modified RSR algorithm described in Section 2.3.1.2 for both the inter-resampling in the CU and the intra-resampling in the PEs.
3.2.1.1. The inter-resampling
The CU performs the partial resampling using the partial weights $W^{k}=\sum_{i=1}^{N} w^{i,k},\ k=1,2,\ldots,K$ to produce N^{k}, k = 1, 2,..., K using the modified RSR algorithm [25]. Since the resampling produces replication factors whose sum is less than or equal to M ($M\ge \sum_{k=1}^{K} N^{k}$), we opted to add the remaining particles to the PE with the largest weight. The index of this PE is stored during the calculation of the replication factors.
Using the LUTs, we calculate the inverse of the total weight WT and of the individual weights W^{k} of each PE. For our example with M = 64 and K = 8, simple reflection reveals that the same LUT can be used to calculate the inverses of both WT and W^{k}, provided that a simple selectable shift operation is added at the output of this LUT.
3.2.1.2. The intra-resampling
We implement the RSR resampling described in Section 2.3.1.2. After the CU calculates the number of particles that each PE replicates (N^{k}, k = 1,..., K), the intra-resampling is performed inside each PE. Each PE calculates one replication factor per clock cycle for its M/K particles, so the K PEs together calculate K replication factors per clock cycle.
3.2.2. The local interconnect network
It is important to notice that N^{k} is a random number, which depends on the overall distribution of the weights. The PEs with N^{k} > N have a surplus of particles and need to exchange particles with the other PEs that have a shortage of particles (N^{k} < N). The N^{k} values change after each sampling period, so it is necessary to connect different PEs in order to perform particle routing. In summary, the communication pattern is nondeterministic, and the connections among the PEs change after each sampling period.
 1. State "00", where N^{k} = N. Any incoming particle from PE^{k-1} is sent to PE^{k+1}.
 2. State "01", where N^{k} > N. The processor is a source element. As soon as PE^{k} replicates the first N particles, it sends the remaining N^{k} - N particles to the neighbor PE^{k+1}. After PE^{k} completes the replication of its N^{k} particles and forwards any particles stored in the local net memory, the state changes to "00". If a particle arrives from PE^{k-1} during the replication, PE^{k} stores it in the local net memory until it completes the replication of all its internally generated particles.
 3. State "10", where N^{k} < N. PE^{k} is a sink element. First, it replicates N^{k} particles and reads any incoming particles from the network to fill its FIFO to N particles. As soon as the remaining N - N^{k} particles are read from the network, the state of PE^{k} changes to "00".
The local net memory size is designed to handle the worst case, in which two consecutive processor elements PE^{k} and PE^{k+1} each hold half of the replicated particles, i.e., N^{k} = N^{k+1} = M/2.
So, the local net memory should be large enough to store M/2 - N (= M(K - 2)/2K) particles. In our case, the local net memory has 24 particle locations. If the number of PEs K is one or two, then the local net memory size is zero. This is logical, since the net memory is used inside each PE to store the particles passing through that PE. With one or two PEs, there are no pass-through particles, and hence there is no need for the net memory.
The distributed RPA PF is captured in VHDL and verified using the ModelSim simulator.
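The state encoding and the worst-case memory size above can be expressed compactly as follows. This is a small sketch with illustrative function names, using the design example M = 64 and K = 8.

```python
def pe_state(n_k, n):
    """Local network state of a PE: "00" balanced, "01" source, "10" sink."""
    if n_k == n:
        return "00"
    return "01" if n_k > n else "10"


def local_net_memory(m, k):
    """Worst-case local net memory M/2 - N = M(K - 2)/(2K), in particle locations."""
    return max(0, m * (k - 2) // (2 * k))
```

For M = 64 and K = 8, `local_net_memory(64, 8)` evaluates to 24 locations, and the value is zero for K = 1 or K = 2.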
3.2.3. The execution time of the distributed SIRF
4. Comparison of the execution time between the three different implementations for SIRF

The total time of the two-step architecture with sequential resampling: (aM + M + 1 + TL) clock cycles

The total time of the two-step architecture with parallel resampling: (aM + 2 + TL) clock cycles

The total time of the distributed PF: (aM/K + K + M/K + 2 + b(K-1)M/K + TL) clock cycles
The timing comparison in the worst case and the best case
| HW implementation | Worst case | Best case |
| --- | --- | --- |
| The two-step architecture with sequential resampling | a = 2; 3M + 1 + TL; 196 cycles | a = 1; 2M + 1 + TL; 132 cycles |
| The two-step architecture with parallel resampling | a = 2; 2M + 2 + TL; 132 cycles | a = 1; M + 2 + TL; 69 cycles |
| The distributed PF | a = 2 and b = 1; 2M/K + K + M + 2 + TL; 93 cycles | a = 1 and b = 0; 2M/K + K + 2 + TL; 29 cycles |
Worst-case throughputs of the proposed architectures (M = 64)
| HW implementation | Worst case | Maximum clock frequency (MHz) | Throughput |
| --- | --- | --- | --- |
| The two-step architecture with parallel resampling | 132 cycles | 36 | 270 × 10^{3} iterations/s |
| The distributed PF | 93 cycles | 74 | 795 × 10^{3} iterations/s |
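The cycle counts and the throughput of the distributed PF can be reproduced from the timing expression given above. The sketch below assumes M = 64, K = 8, and TL = 3 cycles, takes the parameters a and b exactly as they appear in that expression, and uses illustrative function names.

```python
def distributed_cycles(m, k, a, b, tl):
    """Total time of the distributed PF: aM/K + K + M/K + 2 + b(K-1)M/K + TL."""
    return a * m // k + k + m // k + 2 + b * (k - 1) * m // k + tl


def throughput(clock_hz, cycles):
    """PF iterations per second for a given clock frequency and cycle count."""
    return clock_hz / cycles


worst = distributed_cycles(64, 8, a=2, b=1, tl=3)   # worst case
best = distributed_cycles(64, 8, a=1, b=0, tl=3)    # best case
rate = throughput(74e6, worst)                      # at the 74 MHz maximum clock
```

With these inputs, the worst and best cases evaluate to 93 and 29 cycles, and the worst-case rate at 74 MHz is about 795 × 10^{3} iterations/s.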
5. Synthesis comparison between the three different SIRF implementations
The resource utilization of the three proposed architectures on the Xilinx Virtex 5 FPGA xc5vlx50t3ff1136
Device utilization summary
| Resource | Single PE with sequential resampling | Single PE with parallel resampling | Distributed PF, 8 PEs | FPGA available resources |
| --- | --- | --- | --- | --- |
| Slice registers | 891 | 967 | 9049 | 28800 |
| Slice LUTs | 1203 | 4430 | 15017 | 28800 |
| Fully used Bit Slices | 364 | 746 | 2517 | 21549 |
| Bonded IOBs | 28 | 28 | 148 | 480 |
| Block RAM/FIFO | 1 | 1 | 4 | 60 |
| BUFG/BUFGCTRLs | 2 | 2 | 2 | 32 |
| DSP48Es | 5 | 48 | 17 | 48 |
The third implementation, the distributed RPA with a local network, uses FIFO storage of 2M/K locations in each processor element, plus (K-2)M/2K memory locations as the local net memory in each processor element. In total, each processor element requires (M/2 + M/K) memory locations to store the particles, so the overall required memory is (M + KM/2). In comparison, the architecture presented in [6] needs M memory locations per processor element (M/K inside the processor element plus (K-1)M/K memory locations at the CU). Thus, the total number of memory locations required by the design in [6] is MK. Our implementation therefore saves M(K - 2)/2 memory locations, in addition to its speed advantage, without compromising the SIRF performance.
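The memory figures in this comparison can be verified with a few lines of arithmetic. The sketch below uses the design example M = 64 and K = 8; the function names are illustrative.

```python
def memory_proposed(m, k):
    """Overall particle memory of the distributed RPA with local network: M + KM/2."""
    per_pe = 2 * m // k + (k - 2) * m // (2 * k)   # FIFO storage plus local net memory
    return k * per_pe


def memory_ref6(m, k):
    """Overall particle memory of the architecture in [6]: MK."""
    return m * k


saving = memory_ref6(64, 8) - memory_proposed(64, 8)   # M(K - 2)/2 locations
```

For M = 64 and K = 8, each PE needs 40 locations, the total is 320 locations compared with 512 for the design in [6], and the saving of 192 locations agrees with M(K - 2)/2.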
For several hundred particles, the ratio between the latency cycles and the number of particles affects the total execution time. For our object tracking application, the object moves within an area of 32 × 32 pixels between two consecutive frames. So, the number of particles needed to represent this state space is of moderate value. That is why it is important to consider the latency cycles. For example, in the implementation of Athalye et al. [7], the latencies of the sampling and importance computation units are 8 and 53 cycles, respectively, giving a total of TL = 61 cycles. If M = 10000 (as in reference [7]), this TL value is insignificant. However, if M = 64 (as in our case), this TL value is close to 100% of the total sampling time. In our implementations, TL is reduced to 3 cycles, leading to a latency overhead of 3/64, i.e., less than 5%.
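The latency overheads quoted above follow directly from the ratio TL/M, as the short sketch below shows; the function name is an assumption.

```python
def latency_overhead(tl, m):
    """Latency cycles TL expressed as a fraction of the M sampling cycles."""
    return tl / m


ref7_large = latency_overhead(61, 10000)   # TL = 61 with M = 10000: under 1%
ref7_small = latency_overhead(61, 64)      # TL = 61 with M = 64: about 95%
proposed = latency_overhead(3, 64)         # TL = 3 with M = 64: under 5%
```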
6. Conclusion
In this article, three novel architectures for the SIRF are proposed. The first is a two-step architecture with sequential resampling, where particle sampling, weight calculation, and output calculation are carried out in parallel during the first step, followed by sequential resampling in the second step. This first architecture serves as the core unit for the other two. The second architecture speeds up the resampling step (the second step) via a parallel, rather than a serial, architecture. The third architecture implements the SIRF as a distributed PF composed of several PEs and a CU, where each PE is the two-step core of the first architecture. All the proposed architectures are captured in VHDL, synthesized using the Xilinx environment, and verified using the ModelSim simulator.
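As a software analogue of the two-step core, the following Python sketch runs one SIRF iteration on a scalar state. It is a sketch under stated assumptions, not the VHDL design: the uniform random-walk motion model, the triangular shape of the piecewise linear weight, the use of systematic resampling, and all names are illustrative choices that this section does not specify.

```python
import random


def linear_weight(distance: float, width: float) -> float:
    """Assumed triangular weight: 1 at distance 0, falling linearly to 0 at |distance| = width."""
    return max(0.0, 1.0 - abs(distance) / width)


def sir_step(particles, observation, noise, width, rng):
    """One two-step iteration: (1) sample, weight, and output; (2) sequential resampling."""
    # Step 1: sampling, weight calculation, and output accumulation share one pass.
    sampled, weights = [], []
    for x in particles:
        x_new = x + rng.uniform(-noise, noise)   # assumed random-walk motion model
        sampled.append(x_new)
        weights.append(linear_weight(observation - x_new, width))
    total = sum(weights)
    if total == 0.0:
        return sampled, None                     # no particle near the observation
    estimate = sum(w * x for w, x in zip(weights, sampled)) / total

    # Step 2: sequential resampling (systematic variant) in one pass over the weights.
    M = len(sampled)
    step = total / M
    u = rng.uniform(0.0, step)
    resampled, cumulative, i = [], weights[0], 0
    for _ in range(M):
        while u > cumulative and i < M - 1:
            i += 1
            cumulative += weights[i]
        resampled.append(sampled[i])
        u += step
    return resampled, estimate


if __name__ == "__main__":
    rng = random.Random(0)
    particles = [float(i) for i in range(-8, 8)]
    particles, estimate = sir_step(particles, observation=0.0, noise=0.5, width=4.0, rng=rng)
    print(len(particles), estimate)
```

In hardware the three first-step computations run concurrently; in this sketch they simply share a single loop over the particles.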
The main features of the proposed architectures are

Memory addressing is eliminated via the efficient use of dual-port FIFOs to store the particles' state vectors. Consequently, a significant speed-up of the PF process is achieved. There is no need to add memories to store the particle addresses or indexes as in [2, 7]. The overall memory size is 2M for the sequential PF. In comparison, the required memory sizes for the PFs of references [2, 7] are 3M and 4M, respectively. The FIFO eliminates the need to read, write, or store the state addresses/indexes in a separate memory and hence reduces the execution cycle time.

Linearized (piecewise linear) weight likelihood functions are used instead of exponential functions. This feature captures the localization of the likelihood function while reducing the hardware resources needed to evaluate it.

The second architecture speeds up the resampling by a factor of M via a parallel, rather than a serial, architecture.

The third architecture, the distributed PF, is implemented by several PEs with a simple but efficient ring interconnection network. The local routing of the particles among the PEs is executed in parallel with the pipelined particle generation, weight, and output calculation step. The delay of the proposed ring interconnection network ranges from zero cycles in the best case to (K - 1)M/K cycles in the worst case, when only one particle has a nonzero weight. The proposed distributed PF achieves a total execution time of (2M/K + K + 2 + TL) cycles in the best case and (2M/K + K + M + 2 + TL) cycles in the worst case. In comparison, the PF designs of references [2, 6] require 3M/4 + 2 cycles and (2M/K + K + Mr + TL) cycles, respectively, where Mr is the delay due to particle routing and the same latency TL is assumed. In addition, our distributed PF reduces the memory size by M(K - 2)/2 locations when compared with the parallel PF proposed in [6].
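The cycle counts quoted for the distributed PF can be tabulated with a short Python sketch. It only evaluates the expressions stated above; the function name and dictionary keys are illustrative, and M is assumed to be divisible by K.

```python
def distributed_pf_cycles(M: int, K: int, TL: int) -> dict:
    """Cycle counts for M particles on K PEs with pipeline latency TL."""
    if M % K != 0:
        raise ValueError("this sketch assumes M is a multiple of K")
    base = 2 * M // K + K + 2 + TL
    return {
        "ring_delay_worst": (K - 1) * M // K,   # only one particle has nonzero weight
        "total_best": base,                     # 2M/K + K + 2 + TL
        "total_worst": base + M,                # 2M/K + K + M + 2 + TL
    }


if __name__ == "__main__":
    for name, cycles in distributed_pf_cycles(M=64, K=8, TL=3).items():
        print(f"{name}: {cycles} cycles")
```

With M = 64, K = 8, and TL = 3, the sketch gives a worst-case ring delay of 56 cycles and total execution times of 29 cycles (best case) and 93 cycles (worst case).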
References
1. Miguez J: Analysis of parallelizable resampling algorithms for particle filtering. Signal Process J 2007, 87: 3155-3174. 10.1016/j.sigpro.2007.06.011
2. Hong S, Chin S, Djuric PM, Bolic M: Design and implementation of flexible resampling mechanism for high-speed parallel particle filters. J VLSI Signal Process 2006, 44: 47-62. 10.1007/s11265-006-5919-9
3. Marimon D, Maret Y, Abdeljaoued Y, Ebrahimi T: Particle filter-based camera tracker fusing marker and feature point cues. Proceedings of IS&T/SPIE Conference on Visual Communication and Image Processing 2007, 6508.
4. de Freitas N: Rao-Blackwellised particle filtering for fault diagnosis. IEEE Aerospace Conference Proceedings 2002, 4: 493.
5. Zhang B, Tian W, Jin Z: Robust appearance-guided particle filter for object tracking with occlusion analysis. Int J Electron Commun 2008, 62: 24-32. 10.1016/j.aeue.2007.01.006
6. Bolic M: Architectures for efficient implementation of particle filters. PhD thesis, Stony Brook University 2004.
7. Athalye A, Bolic M, Hong S, Djuric PM: Generic hardware architectures for sampling and resampling in particle filter. EURASIP J Appl Signal Process 2005, 17: 2888-2902.
8. Alarcon J, Lopez I: A new real-time hardware architecture for road line tracking using a particle filter. IEEE Industrial Electronics IEC Annual Conference, Paris 2006, 736-741.
9. Uk Cho J, Byun JE, Kang H: A real-time object tracking system using a particle filter. Proceedings of Intelligent Robots and Systems Conf. IEEE/RSJ, Beijing, China 2006, 2822-2827.
10. Uk Cho J, Jin SH, Pham XD, Jeon JW: Object tracking circuit using particle filter with multiple features. Proceedings of SICE-ICASE International Joint Conf., Bexco, Busan, Korea 2006, 1431-1436.
11. Velmurugan R: Implementation strategies for particle filter based target tracking. PhD thesis, Georgia Institute of Technology 2007.
12. Medeiros H, Gao X, Park J: A parallel implementation of the color based particle filter for object tracking. Computer Vision and Pattern Recognition Workshops, CVPRW '08, IEEE Comput Soc Conf 2008, 1-8.
13. Saha S, Bambha NK, Bhattacharyya SS: Parameterized design framework for hardware implementation of particle filters. Proceedings of the International Conf on Acoustics, Speech and Signal Processing, Las Vegas, Nevada 2008, 1449-1452.
14. Hendeby G, Karlsson R, Gustafsson F: Particle filtering: the need for speed. EURASIP J Adv Signal Process 2010, 2010: Article ID 181403, 9.
15. Abd ElHalym HA, Mahmoud II, Habib SED: Efficient hardware architecture for particle filter based object tracking. Proceedings of IEEE International Conference on Image Processing (ICIP), Hong Kong 2010, 4497-4500.
16. Arulampalam MS, Maskell S, Gordon N, Clapp T: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans Signal Process 2002, 50: 174-188. 10.1109/78.978374
17. Gordon NJ, Salmond DJ, Smith AFM: Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc F 1993, 140(2): 107-113.
18. Sorenson HW, Alspach DL: Recursive Bayesian estimation using Gaussian sums. Automatica 1971, 7: 465-479. 10.1016/0005-1098(71)90097-5
19. Bolic M, Hong S, Djuric PM: Performance and complexity analysis of adaptive particle filtering. Proceedings of the Conference on Signals, Systems and Computers, CA 2002, 853-857.
20. Cho J, Byun IE, Hang H: A real-time object tracking system using particle filter. Proceedings of Intelligent Robots and Systems Conf IEEE/RSJ, Beijing, China 2006, 2822-2827.
21. Bai K, Liu W: Improved object tracking with particle filter and mean shift. Proceedings of the IEEE International Conf on Automation and Logistics, Jinan, China 2007, 431-435.
22. Abd ElHalym HA, Mahmoud II, AbdelTawab A, Habib SED: Appraisal of an enhanced particle filter for object tracking. Proceedings of IEEE International Conference on Image Processing (ICIP), Cairo, Egypt 2009, 4105-4107.
23. Abd ElHalym HA, Mahmoud II, AbdelTawab A, Habib SED: Particle filter versus particle swarm optimization for object tracking. Proceedings of ASAT Conf, Cairo, Egypt 2009.
24. Mahmoud II, Abd ElTawab A: Hardware genetic algorithm for robot motion path planning. Proceedings of 4th STCEX, Riyadh, KSA 2006, 2: 359-366.
25. Bolic M, Athalye A, Djuric PM, Hong S: Algorithmic modification of particle filters for hardware implementation. Eur Signal Process Conf, Vienna, Austria 2004, 1641-1644.
26. [http://www.mathworks.com/matlabcentral/fileexchange/17960]
27. Bolic M, Djuric PM, Hong S: Resampling algorithms and architectures for distributed particle filters. IEEE Trans Signal Process 2005, 53: 2442-2450.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.