Covariance tracking: architecture optimizations for embedded systems
EURASIP Journal on Advances in Signal Processing volume 2014, Article number: 175 (2014)
Abstract
Covariance matching techniques have recently grown in interest due to their good performances for object retrieval, detection, and tracking. By mixing color and texture information in a compact representation, the covariance descriptor can be applied to various kinds of objects (textured or not, rigid or not). Unfortunately, the original version requires heavy computations and is difficult to execute in real time on embedded systems. This article presents a review of different versions of the algorithm and its various applications; our aim is to describe the most crucial challenges and particularities that appeared when implementing and optimizing the covariance matching algorithm on a variety of desktop processors and on low-power processors suitable for embedded systems. An application of texture classification is used to compare different versions of the region descriptor. Then a comprehensive study is made to reach a higher level of performance on multi-core CPU architectures by comparing different ways to structure the information, using single instruction, multiple data (SIMD) instructions and advanced loop transformations. The execution time is reduced significantly on two dual-core CPU architectures for embedded computing: ARM (Cortex-A9 and Cortex-A15) and Intel (Penryn-M U9300 and Haswell-M 4650U). According to our experiments on covariance tracking, it is possible to reach a speedup greater than ×2 on both ARM and Intel architectures, when compared to the original algorithm, leading to real-time execution.
1 Introduction
Tracking consists in estimating the evolution in state (e.g., location, size, orientation) of a moving target over time. This process is often subdivided into two subproblems: detection and matching. Detection deals with the difficulties of generic object recognition, i.e., finding instances from a particular object class or semantic category (e.g., humans, faces, vehicles) in digital images and videos. On the other hand, matching methods provide the location which maximizes the similarity with the objects previously detected in the sequence. Generic object recognition requires models that cope with the diversity of instances’ appearances and shapes; this is generally done by learning techniques and classification. Conversely, matching algorithms analyze particular information and construct discriminative models that make it possible to disambiguate different instances from the same category and avoid confusions.
The main difficulty of tracking is to trace target trajectories and adapt to changes of appearance, pose, orientation, scale, and shape. Since the beginnings of computer vision, a diversity of tracking methods has been proposed: some of them construct path and state evolution estimations using a Bayesian framework (e.g., particle filters, hidden Markov models), while others measure the perceived optical flow in order to determine object displacements and scale changes (median flow) [1]. Exhaustive appearance-based methods compare a dense set of overlapping candidate locations to detect the one that best fits some kind of template or model. When a priori information about the target location and its dynamics (e.g., speed and acceleration) is available, the number of comparisons can be reduced enormously by giving preference to the more likely target regions. Other accelerations can be achieved using local searches based on gradient-descent algorithms able to handle small target displacements and geometrical changes. Among these approaches, feature-point tracking techniques are very popular [2] since points can be extracted in most scenes, contrary to lines or other geometric features. Because they represent very local patterns, their motion models can be assumed to be rigid and estimated in a very efficient way. These techniques, like block matching, are raw-pixel methods, since the target is directly represented by its pixel matrix.
In order to deal with non-rigid motion, kernel-based methods such as mean-shift (MS) [3, 4] use a representation based on color or texture distribution.
Covariance tracking (CT) [5] is a very interesting and elegant alternative which offers a compact target representation based on the spatial correlation of the different features computed at each pixel in the target bounding box. Very satisfying tracking performances have been observed for diverse kinds of objects (e.g., with rigid motion or not, with texture or not). CT has been studied extensively, and many feature configurations and arrays of covariance descriptors have been proposed to improve its discrimination power [6–11]. Smoother trajectories can be obtained by considering target dynamics, which increases tracking accuracy and reduces the search space [12, 13]. Genetic algorithms [14] can also be used to accelerate the convergence towards the best candidate position when searching in a large image. But, to our knowledge, little work has been done to analyze the computational demands of CT and its portability to embedded systems [15]. The goal of this article is to fill this gap, analyze the algorithm’s computational behavior for different implementations, and measure their load on embedded architectures. A study is also made to compare different sizes and configurations of the descriptors in terms of discrimination power through a texture classification application.
The article is structured as follows. The first section introduces some of the basic principles of the CT algorithm and provides a brief description of the different searching and matching methods that can be associated with CT. Then various configurations of the covariance matrix are evaluated. Finally, we provide an in-depth description of implementation details and suitable acceleration techniques proposed to achieve a higher level of performance. Experiments and details about the algorithm implementation are presented in the final section, followed by our conclusions.
2 Covariance matrices as image region descriptors
Let I represent a luminance (grayscale) or a color image with three channels and consider a rectangular region of size n = W × H (it can be the bounding box of the target to be tracked, for example). Let F be the W × H × n_F-dimensional feature image extracted from I:

$F(x, y) = \phi(I, x, y)$

where ϕ is any n_F-dimensional mapping forming a feature vector for each pixel of the bounding box. The features can be spatial coordinates p_{uv}, intensity, color (in any color space), gradients, filter responses, or any possible set of images obtained from I. Now, let {z_k}_{k = 1⋯n} be the set of n_F-dimensional feature vectors inside the rectangular region R ⊂ F of n pixels. Concerning notations, p_{uv} stands for the pixel at the u-th row and v-th column.
The region R is represented with the n_F × n_F covariance matrix

$\mathbf{C}_R = \frac{1}{n-1} \sum_{k=1}^{n} (\mathbf{z}_k - \boldsymbol{\mu})(\mathbf{z}_k - \boldsymbol{\mu})^{T}$

where μ is the mean feature vector computed on the n points.
The covariance matrix is an n_F × n_F matrix which naturally fuses multiple features by measuring their correlations. The diagonal terms represent the variance of each feature, while the off-diagonal elements are the correlations. Thanks to the averaging in the covariance computation, noisy pixels are largely filtered out, which is an interesting advantage when compared to raw-pixel methods. Covariance matrices are more compact than most classical object descriptors. Indeed, due to symmetry, C_R has only $(n_F^2 + n_F)/2$ different values whatever the size of the target. To some extent, it is robust against scale changes, because all values are normalized by the size of the object, and against rotation when the location coordinates p_{uv} are replaced by the distance to the center of the bounding box.
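To make the descriptor concrete, the sketch below (a hypothetical helper, not the article's code; the row-major layout of z and the cap on n_F are our assumptions) computes C_R directly from the n feature vectors of a region, exploiting the symmetry noted above so that only the upper triangle is accumulated:

```c
#include <stddef.h>

/* Direct computation of the covariance descriptor of a region given as
 * n feature vectors of dimension nF, stored row by row in z.
 * Illustrative only: the integral-image scheme avoids rescanning the
 * region for every candidate position. Assumes nF <= 16. */
void covariance_descriptor(const double *z, size_t n, size_t nF, double *C)
{
    double mu[16] = {0};                       /* mean feature vector */
    for (size_t k = 0; k < n; k++)
        for (size_t i = 0; i < nF; i++)
            mu[i] += z[k * nF + i];
    for (size_t i = 0; i < nF; i++)
        mu[i] /= (double)n;

    /* C(i,j) = 1/(n-1) * sum_k (z_k(i)-mu(i))(z_k(j)-mu(j)); symmetric,
     * so only the upper triangle is computed and then mirrored. */
    for (size_t i = 0; i < nF; i++)
        for (size_t j = i; j < nF; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += (z[k * nF + i] - mu[i]) * (z[k * nF + j] - mu[j]);
            C[i * nF + j] = C[j * nF + i] = s / (double)(n - 1);
        }
}
```

This naive version costs O(n · n_F²) per region; the integral-image formulation described next removes the dependence on the region size n.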
The covariance descriptor ceases to be rotationally invariant when orientation information is introduced in the feature vector, such as the gradient norms with respect to the x and y directions. The set of features considered by the covariance descriptor should be adapted to the problem at hand, because the most informative features depend on the application, as described in the next paragraph.
2.1 Covariance descriptor feature spaces
Covariance descriptors have been used in computer vision for object detection [16], re-identification [10, 11], and tracking [5]. The recommended set of features depends significantly on the application and the nature of the object: tracking faces is different from tracking pedestrians, because faces are somehow more rigid than pedestrians, which have more articulations. Color is an important hint for pedestrian or vehicle tracking/re-identification because of their clothes or bodywork color, but it is less significant for face re-identification or tracking because the set of colors faces exhibit is relatively limited.
Table 1 displays a summary of the most common feature combinations used by covariance descriptors in computer vision. The most obvious ones are the components of different color spaces such as RGB and HSV. Other common features are the pixel brightness in the grayscale image I and its local directional gradients as absolute values I_x and I_y, the gradient magnitude $\sqrt{I_x^2 + I_y^2}$, and its angle calculated as $\arctan\frac{|I_x|}{|I_y|}$. Foreground images G resulting from background subtraction methods and their gradients G_x and G_y are also used. Features g_{00}(x,y) to g_{74}(x,y) represent the 2D Gabor kernel as a product of an elliptical Gaussian and a complex plane wave [9].
Some texture analysis and tracking methods use local binary patterns (LBP) in place of Gabor filters because LBP operators are much simpler and more economical. The values Var_{LBP}, ${\text{LBP}}_{{\theta}_{0}}$, and ${\text{LBP}}_{{\theta}_{1}}$ in Table 1 represent the local binary pattern variance (a classical property of the LBP operator [19]) and the angles defined by the patterns, as detailed in [18]. This version of the feature vector has shown very good performances for tracking, both in terms of robustness and computation times, and requires a far shorter vector (n_F = 7) than Gabor filters (n_F = 43). In the rest of the paper, a vector of five to nine features is considered for the algorithmic optimization, but note that the proposed optimizations can be applied to any matrix size.
Now, let us detail the computation of the covariance descriptor.
2.2 Covariance descriptor computation
After some term expansions and rearrangements on Equation 2, the (i,j)-th element of the covariance matrix can be expressed as

$\mathbf{C}_R(i,j) = \frac{1}{n-1} \left[ \sum_{k=1}^{n} z_k(i)\, z_k(j) - \frac{1}{n} \sum_{k=1}^{n} z_k(i) \sum_{k=1}^{n} z_k(j) \right]$

Therefore, the covariance in a given region depends on the sum of each feature dimension z(i), i = 1⋯n_F, as well as the sum of the products of any pair of features z(i)z(j), i,j = 1⋯n_F, requiring in total $n_F + (n_F^2 + n_F)/2$ integral images: one for each feature dimension z(i) and one for the product of each pair of feature dimensions z(i)z(j) (the covariance matrix is symmetric).
Let A be a W × H × n_{ F } tensor of the integral images of each feature dimension
where R(11, uv) is the region bounded by the top-left image corner p_{11} = (1,1) and any other point in the image p_{uv} = (x_u, y_v). In a general way, let R(uv, u′v′) be the rectangular region defined by the top-left point p_{uv} and the bottom-right point p_{u′v′}. Similarly, the tensor containing the feature product-pair integral images is denoted as
Now, for any point p_{uv}, let A_{uv} be an n_F-dimensional vector and B_{uv} an n_F × n_F matrix such that
The covariance of the region bounded by (1,1) and p_{uv} is
where n is the number of pixels in the region under investigation. Similarly, and after some algebraic manipulations, the covariance of the region R(uv, u′v′), as presented in [20], is
Once the integral images have been calculated, the covariance of any rectangular region can be computed in $O(n_F^2)$ time regardless of the size of the region R(uv, u′v′). The complete process is represented graphically in Figure 1, where different image-processing operators are applied to the initial image (top left) to calculate the set of feature images (top right). Each feature component i is used to generate the integral image A_{uv}(i) (bottom left), and the cross product between features i and j is used to calculate the second-order integral images B_{uv}(i,j).
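The O(n_F²) query described above can be illustrated with a small self-contained sketch (our own toy code, not the article's implementation; the zero-border layout and the helper names `integral` and `rect` are assumptions). The caller builds one integral image per feature and per feature product, and the covariance of any rectangle then costs four corner lookups per term:

```c
#include <stddef.h>

enum { NF = 2 };   /* feature count for this toy sketch */

/* Integral image with a zero border: output size (h+1) x (w+1), so no
 * boundary tests are needed in the rectangle lookups. */
void integral(const double *src, size_t h, size_t w, double *ii)
{
    size_t w1 = w + 1;
    for (size_t j = 0; j <= w; j++) ii[j] = 0.0;
    for (size_t i = 1; i <= h; i++) {
        double row = 0.0;                       /* running sum of line i */
        ii[i * w1] = 0.0;
        for (size_t j = 1; j <= w; j++) {
            row += src[(i - 1) * w + (j - 1)];
            ii[i * w1 + j] = ii[(i - 1) * w1 + j] + row;
        }
    }
}

/* 4-corner lookup: sum of src over rows [r0, r1) and columns [c0, c1). */
static double rect(const double *ii, size_t w1,
                   size_t r0, size_t c0, size_t r1, size_t c1)
{
    return ii[r1 * w1 + c1] - ii[r0 * w1 + c1]
         - ii[r1 * w1 + c0] + ii[r0 * w1 + c0];
}

/* Covariance of the rectangle [r0,r1) x [c0,c1) in O(NF^2):
 * C(i,j) = ( B(i,j) - A(i)A(j)/n ) / (n - 1).
 * Aii: NF integral-image planes; Bii: NF*NF planes (row-major pairs). */
void cov_from_integrals(const double *Aii, const double *Bii,
                        size_t h, size_t w,
                        size_t r0, size_t c0, size_t r1, size_t c1,
                        double *C)
{
    size_t w1 = w + 1, plane = (h + 1) * w1;
    double n = (double)((r1 - r0) * (c1 - c0));
    for (size_t i = 0; i < NF; i++)
        for (size_t j = 0; j < NF; j++) {
            double ai  = rect(Aii + i * plane, w1, r0, c0, r1, c1);
            double aj  = rect(Aii + j * plane, w1, r0, c0, r1, c1);
            double bij = rect(Bii + (i * NF + j) * plane, w1, r0, c0, r1, c1);
            C[i * NF + j] = (bij - ai * aj / n) / (n - 1.0);
        }
}
```

Once `Aii` and `Bii` have been filled (once per frame), every candidate region of the search costs the same small, region-size-independent amount of work, which is what makes the exhaustive search of Section 3 tractable.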
The next section explains the covariance matching process.
3 Searching and matching
Covariance models and instances can be compared and matched using a simple nearest-neighbor approach, i.e., by finding the covariance descriptors that best resemble a model. The problem with covariance matrices, and symmetric positive definite (SPD) matrices in general, is that they do not lie in a Euclidean space, so many common and widely known operations of Euclidean spaces are not applicable or must be adapted (e.g., an SPD matrix multiplied by a negative scalar is no longer a valid SPD matrix). An n_F × n_F SPD matrix only has n_F × (n_F + 1)/2 different elements; while it is possible to vectorize the matrices and perform element-by-element subtraction, this approach provides very poor results as it fails to analyze the correlations between variables and the patterns stored in them. A solution to this problem is proposed in [21], where a dissimilarity measure between two covariance matrices is given as

$\rho(\mathbf{C}_1, \mathbf{C}_2) = \sqrt{\sum_{i=1}^{n_F} \ln^2 \lambda_i(\mathbf{C}_1, \mathbf{C}_2)}$
where $\{\lambda_i(\mathbf{C}_1, \mathbf{C}_2)\}_{i=1,\cdots,n_F}$ are the generalized eigenvalues of C_1 and C_2, computed from

$\lambda_i \mathbf{C}_1 \mathbf{x}_i - \mathbf{C}_2 \mathbf{x}_i = 0, \quad i = 1, \cdots, n_F$
The tracking starts in the first frame of the sequence, by computing the covariance matrix C_1 of the bounding box of the target under consideration (i.e., the model). The initial detection is not detailed in this paper since it can be made in various ways, for example by object recognition or background subtraction. The tracking procedure consists in determining the new target positions in the successive frames by comparing each candidate covariance matrix C_2 (i.e., a candidate position) with the model and minimizing the Riemannian distance (9).
Figure 2 depicts two possible searching strategies: the exhaustive search approach (left) and a gradient-based local search or steepest-descent approach (right). Exhaustive search methods uniformly sample a large number of candidate positions, scanning the whole image (or the region surrounding the previous target position). Steepest-descent methods look for the position which maximizes the appearance similarity with the target model. Gradient-based methods do not require a large number of matrix comparisons; however, they must run iteratively until convergence, making their computation time very unpredictable. Another limitation (and probably the most important one) is that, contrary to exhaustive search approaches, local search may fail for targets undergoing abrupt motion or occlusions, because local search methods tend to fall into local minima. Due to these limitations, the exhaustive search method was preferred for the tracking application implemented for this research.
4 Feature vector evaluation
The objective of this section is to determine the most discriminative feature combination and the ideal number of features (n_F) to use. Multiple feature combinations were tested using a texture classification method. The KTH-TIPS dataset [22] is composed of ten different texture classes, each one represented by 81 different samples of size 200 × 200 taken at different scales, illuminations, and poses.
There are different approaches for texture classification with covariance matrices. Most methods subdivide the image into small overlapping subregions and compute a descriptor associated to each one. The drawback of this approach is that it increases the number of matrix comparisons and the storage required. To avoid this problem, the local log-Euclidean covariance matrix (L2ECM) [6] computes a single covariance matrix from the log-Euclidean transformations of other (simpler) covariance matrices calculated at every pixel neighborhood (typical sizes are 15 × 15 or 30 × 30). While L2ECM provides high texture re-identification scores, its main drawback is that it considerably increases the number of computations and the memory space required during the computation phase. Therefore, L2ECM is clearly not appropriate for embedded platforms; hence, for the sake of simplicity, we preferred a very simple approach and computed a single covariance descriptor for every sample and feature combination.
Ten random images were selected for the training of each texture class (from the set of 81 samples that represents each one); the remaining samples were used during the classification evaluation. The descriptor obtained from each test image is compared against all the covariance matrices inside the different training sets (ten samples for each texture class) using the Riemannian metric proposed in [20]. A label is assigned to each test image using the KNN^{a} algorithm, counting the number of votes of each texture class among the closest k = 5 samples. The same procedure was repeated ten times to summarize the results and avoid unstable or misleading conclusions.
To evaluate the quality of our classification results, we counted the number of true/false positives and negatives and calculated their associated F_1 score (the weighted average of the precision and recall) defined as

$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$

where

$\text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \quad \text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$
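As a minimal sketch (our own helper, not the article's evaluation code), the two definitions combine into a single function of the true/false positive and negative counts:

```c
/* F1 score: harmonic mean of precision TP/(TP+FP) and recall TP/(TP+FN).
 * Assumes tp > 0 so that both denominators are non-zero. */
double f1_score(int tp, int fp, int fn)
{
    double precision = (double)tp / (double)(tp + fp);
    double recall    = (double)tp / (double)(tp + fn);
    return 2.0 * precision * recall / (precision + recall);
}
```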
Multiple feature combinations were evaluated based on the spatial coordinates (x and y), the luminance (I) and color channels (R, G, and B), the first- and second-order gradient magnitudes (I_x, I_y, I_xx, I_xy, and I_yy), and the enhanced local binary covariance matrices (ELBCM) features proposed in [18].
Figure 3 depicts the set of feature combinations that were evaluated and their associated F_1 scores. Each combination has a set of points representing the score associated to each one of the different texture classes. Boxplots are used to highlight their concentration and their median (depicted by the horizontal bars in pink inside the boxes). The figure is divided into two rows for grayscale and color-based configurations. Each row is further divided into two parts: the feature combinations including gradient components only (on the left) and all the feature combinations based on the ELBCM descriptor (on the right). Within each cell, the different feature combinations are sorted by their number of features (n_F) in increasing order and by the median of their F_1 scores.
Three observations can be made from Figure 3: (1) ELBCM-based combinations tend to have slightly higher scores than gradient-based ones (i.e., their F_1 scores are higher and more concentrated), (2) color plays a crucial role in improving the discriminative power (purely gradient- and ELBCM-based configurations both improve their scores when color information is included), and (3) the relevance of the spatial coordinates (x and y) seems to be small for the texture recognition problem.
According to Figure 3, the ideal number of features (among the set of feature combinations evaluated here) is between n_F = 7 and n_F = 9, given that most of the smaller configurations produce less accurate results. However, for such a size of covariance matrix, the real-time execution required for visual tracking (40 ms per frame for 25 frames per second) is impossible without optimizations.
5 Covariance tracking algorithm analysis and optimizations
Three strategies are studied to optimize CT on multi-core CPUs. The first one is the structure of arrays (SoA) to array of structures (AoS) transformation: SoA → AoS. The second one consists in architectural optimizations: either multithreading the SoA version with the open multiprocessing (OpenMP) middleware or using single instruction, multiple data (SIMD) instructions (SSE and AVX for Intel, Neon for ARM) for the AoS version. The third and final strategy consists of loop-fusion transformations. In-depth information about the transformations employed in this article can be found in [23].
Let us introduce a set of notations for describing the algorithms and their optimizations (Table 2).
In the baseline version of the algorithm, the complete set of feature images F is stored separately using a cube data structure (referred to as fmat[k][i][j]), which can be regarded as an instance of an SoA data structure. The index k is used here to select one of the n_F feature images, while the pair (i,j) selects the spatial coordinates. Image cubes are straightforward to implement: the arithmetic required to compute a memory address using a table of 3D pointers only demands three integer additions. Still, the latency of a memory access is extremely dependent on the data access pattern.
5.1 SoA to AoS transform
The SoA → AoS transform turns a set of independent arrays into one array, where each cell is a structure combining the elements of each independent array. The benefit of such a transform is to leverage the cache performance by enforcing spatial and temporal cache locality.
The first aspect we want to optimize is the locality of the features for a given point of coordinates (i,j). In the SoA version, we have two cubes: one that stores all the pixel features, F_SoA (fmat), whose size is n_F × h × w, and another cube, P_SoA (prmat), of size n_P × h × w, that stores the feature cross-products. In the AoS data layout, these cubes are transformed into two 2D arrays, F_AoS and P_AoS, of sizes h × (w · n_F) and h × (w · n_P).
The SoA → AoS transform swaps the loop nests and changes the addressing computations from a 3D cube form, cube[k][i][j], into a 2D matrix form, matrix[i][j × n + k], where n is the structure cardinal (here n_F or n_P). The lack of spatial locality within the features in the SoA representation is illustrated in Figure 4: the SoA layout (on the left) stores pixel features in discontiguous 2D arrays (distant in memory), while in the AoS representation (on the right) the features belonging to the same pixel are gathered together in contiguous memory addresses.
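The index change can be made concrete with a small conversion routine (our own illustration, not the article's code; the names loosely follow the fmat notation):

```c
#include <stddef.h>

/* Copies an SoA feature cube into the corresponding AoS matrix:
 * soa[k][i][j] (plane k, row i, column j) maps to aos[i][j*nF + k],
 * so the nF features of a pixel end up in contiguous addresses. */
void soa_to_aos(const float *soa, float *aos,
                size_t nF, size_t h, size_t w)
{
    for (size_t k = 0; k < nF; k++)
        for (size_t i = 0; i < h; i++)
            for (size_t j = 0; j < w; j++)
                aos[i * (w * nF) + j * nF + k] =
                    soa[k * (h * w) + i * w + j];
}
```

In the actual optimized code the AoS layout is of course produced directly by the feature extraction, not by a copy; this routine only makes the address mapping explicit.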
The covariance tracking algorithm is composed of three stages:

1. point-to-point product computation of all features,

2. the integral image computation of features,

3. the integral image computation of products.
The product of features and its transformation are described in Algorithms 1 and 2. Thanks to the commutativity of the multiplication, only half of the products have to be computed (the loop on k_2 starts at k_1, line 3). As the last two stages are similar, we only present a generic version of the integral image computation (Algorithm 3) and its transformation (Algorithm 4).
Concerning the index k of Algorithms 1 and 2, the increment k ← k + 1 can be replaced by k = k_1 n_F − k_1(k_1 + 1)/2 + k_2 for direct access to the product of feature k_1 by feature k_2.
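This direct-access index can be checked against the incremental enumeration of Algorithms 1 and 2. In the sketch below (our own helper; the k_1(k_1 + 1)/2 term is our reading of the packed upper-triangular layout), the closed form reproduces the k ← k + 1 ordering:

```c
/* Direct-access index into the packed array of nP = nF(nF+1)/2 products:
 * the product of feature k1 by feature k2 is stored at
 * k = k1*nF - k1*(k1+1)/2 + k2. Only the upper triangle k2 >= k1
 * is stored, thanks to the commutativity of the multiplication. */
int packed_index(int k1, int k2, int nF)
{
    return k1 * nF - k1 * (k1 + 1) / 2 + k2;
}

/* Returns 1 if packed_index matches the incremental k <- k+1 enumeration
 * of the (k1, k2 >= k1) loop nest, 0 otherwise. */
int packed_index_matches(int nF)
{
    int k = 0;
    for (int k1 = 0; k1 < nF; k1++)
        for (int k2 = k1; k2 < nF; k2++, k++)
            if (packed_index(k1, k2, nF) != k)
                return 0;
    return 1;
}
```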
5.2 SIMD or OpenMP parallelization?
Once this transform is done, one can also apply SIMD to the different parts of the algorithm. For the product part, the two internal loops on k_1 and k_2 are fully unrolled in order to expose the list of all multiplications and the list of vectors to construct through permutation instructions (e.g., _mm_shuffle_ps in streaming SIMD extensions (SSE)). For example, for a typical value of n_F = 7, there are n_P = 28 products. The associated vectors (the numbers are the feature indexes) are as follows:
In that case, the 7th vector is 100% filled, but the last vector becomes suboptimal when n_P is not divisible by the cardinal of the SIMD register (4 with SSE and Neon). In SSE, some permutations can be achieved using only one _mm_shuffle_ps instruction, while others need a maximum of two. Because some permutations can be reused to perform other permutations, it is possible to achieve a factorization over all the required permutations. For example, with n_F = 7, 15 shuffles are required.
In advanced vector extensions 2 (AVX2), there is a new instruction (compared to AVX) that greatly simplifies permutations: _mm256_permutevar8x32_ps. This instruction implements a full crossbar, so exactly one AVX2 permutation per register is needed, that is, a total of 8 (for n_F = 7).
In Neon, it is more complex. While some permutations can be done in 128-bit registers, that is, with a parallelism of 4, other permutations require instructions only available with 64-bit registers, like the lookup-table instruction named vtbl. So in Neon, 128-bit float registers should be: 1) split into 64-bit registers with the vget_low_f32 and vget_high_f32 instructions, 2) typecast into 64-bit integer registers with vreinterpret_u8_f32 (no latency, just for the compiler), 3) permuted with the vtbl1_u8 and vtbl2_u8 instructions, 4) typecast into 64-bit float registers with vreinterpret_f32_u8, and 5) combined into 128-bit float registers with vcombine_f32. In total, 31 SIMD Neon instructions are required to create the seven pairs of products (and 17 extra instructions for the castings). Table 3 gives the values of vn_F and vn_P depending on n_F ∈ {7,8} and card, the number of blocks within an SIMD register. For the same values of vn_F, Table 4 provides the number of permutations for SSE, AVX, and Neon.
The first part of Table 5 provides the algorithmic complexity and the amount of memory accesses for the scalar version; replacing n_F and n_P with vn_F and vn_P from Table 3 gives the SIMD values. This table also provides the arithmetic intensity (AI), popularized by Nvidia, which is the ratio between the number of operations and the number of memory accesses. Table 6 provides the numerical results of Table 5 for the scalar, SSE, AVX, and Neon versions, for the 3-loop version, and for the 1-loop version with the loop-fusion transform.
Concerning OpenMP, the point is to evaluate SoA + OpenMP versus AoS + SIMD. Indeed, for a common 4-core general-purpose processor (GPP), the degree of parallelism of a multithreaded version and of a SIMDized version is the same, i.e., four. Results are provided in cycles per point (cpp) versus the data amount (image size). The cpp is a normalized metric that helps to detect cache overflow (when data do not fit in the cache): the cpp curve then increases significantly.
The three versions (SoA + OpenMP, AoS, AoS + SIMD) have been benchmarked on three generations of Intel processors: Penryn, Nehalem, and Sandy Bridge, for image sizes varying from 128 × 128 up to 1024 × 1024. It appears (Figure 5) that a 4-threaded version is always slower than a 1-threaded SIMD version; eight threads are required on the Nehalem to be faster. The reason is the low AI, which induces a high stress on the architecture’s buses, and also the fact that manipulating the SoA requires n_P = 28 active references in the cache, more than the usual L2 or L3 associativity (24 on the Intel processor). In the next steps of this article, SIMDization is the only architectural optimization considered realistic.
5.3 Loop fusion
We have tested three versions with loop-fusion in order to increase the AI ratio by reducing the amount of memory accesses. But for that, we first have to rewrite the integral image computation. As the integral image computation is known to be memory bound, and is also a very simple algorithm (3 LOADs, 1 STORE, and 3 ADDs), it is nearly impossible to reduce its complexity. Nevertheless, one can remove 2 LOADs by using a register that holds the accumulation along a line. Algorithm 5 implements this optimization for the basic integral image computation.
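The optimization can be sketched as follows (our own code in the spirit of Algorithm 5, not the article's exact version): a register holding the running sum of the current line replaces two of the three LOADs, so the inner loop performs one LOAD of the input, one LOAD of the value above, and one STORE:

```c
#include <stddef.h>

/* Integral image with a line accumulator: ii[i][j] = ii[i-1][j] + row,
 * where row is the running sum of line i kept in a register. Only the
 * value above, ii[i-1][j], is still read back from memory. */
void integral_image_acc(const float *src, size_t h, size_t w, float *ii)
{
    for (size_t j = 0; j < w; j++)                    /* first line */
        ii[j] = (j ? ii[j - 1] : 0.0f) + src[j];
    for (size_t i = 1; i < h; i++) {
        float row = 0.0f;                             /* line accumulator */
        for (size_t j = 0; j < w; j++) {
            row += src[i * w + j];                    /* 1 LOAD */
            ii[i * w + j] = ii[(i - 1) * w + j] + row;/* 1 LOAD, 1 STORE */
        }
    }
}
```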
The first version is a scalar parametric version (with n_F) that fuses the external i-loop and keeps the three j-loops unchanged. The second one is a specialized version with n_F = 7 where the three internal loops are fused together. The third one is the SIMDized version of the second one. The internal loop fusion saves the LOAD/STORE instructions otherwise needed to write a product of features into memory and read it back afterwards to compute the integral image of products. The loop-fusion has been done by hand, but tools like PIPS [24] can perform this kind of transformation automatically [25]. The complexities of the scalar and SIMD versions are provided in Table 5, and the numerical values of these expressions are given in Table 6.
To be efficient, loop-fusion is combined with full loop-unwinding (on k_1 and k_2) and scalarization (storing temporary results within a register instead of a memory cell of an array). The behavior of the code is the following, for a given pixel (i,j):

- all the features associated to point (i,j) are loaded into n_F registers: f_0, f_1, ⋯, f_{n_F−1};

- the integral image computation of the features is done on the fly and in place with Algorithm 5, with n_F accumulators sf_0, sf_1, ⋯, sf_{n_F−1}; the point fmat(i,j,k), which previously held the n_F features, is replaced by the sums stored in the n_F accumulators;

- the n_P products are then calculated in n_P registers: p_{00}, p_{01}, …, p_{k_1 k_2}, ⋯, p_{(n_F−1)(n_F−1)};

- the integral image computation of the products of features is done in the same way, with n_P accumulators; the point prmat(i,j,k) is filled with the n_P accumulators of products.
The code is quite big (as the internal loops are unwound) but very efficient (see the next section). It can easily be generated automatically by a C program, as it is very systematic: load the features, accumulate the features, store the accumulations, then compute the products, accumulate the products, and store the accumulations. Automatic generation is relevant for bigger values of n_F to avoid bugs. The complexity of these new versions is given in the second part of Tables 5 and 6.
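For a small case, the generated code can be sketched by hand. The version below (our own illustration for n_F = 2, hence n_P = 3; the article's generated code targets n_F = 7 and its SIMD variant) follows the four steps listed above, writing no intermediate product array to memory:

```c
#include <stddef.h>

/* Fused single pass for nF = 2: loads the features into registers,
 * accumulates them on the fly into the feature integral images (fii),
 * forms the nP = 3 products in registers, and accumulates them into the
 * product integral images (pii). AoS layout: feat is h*w*2,
 * fii is h*w*2, pii is h*w*3. */
void fused_pass(const float *feat, size_t h, size_t w,
                float *fii, float *pii)
{
    for (size_t i = 0; i < h; i++) {
        float sf0 = 0, sf1 = 0;               /* feature line accumulators */
        float sp00 = 0, sp01 = 0, sp11 = 0;   /* product line accumulators */
        for (size_t j = 0; j < w; j++) {
            float f0 = feat[(i * w + j) * 2 + 0];   /* load features */
            float f1 = feat[(i * w + j) * 2 + 1];
            sf0 += f0; sf1 += f1;                   /* accumulate features */
            sp00 += f0 * f0;                        /* products in registers */
            sp01 += f0 * f1;
            sp11 += f1 * f1;
            float up0 = i ? fii[((i - 1) * w + j) * 2 + 0] : 0;
            float up1 = i ? fii[((i - 1) * w + j) * 2 + 1] : 0;
            fii[(i * w + j) * 2 + 0] = up0 + sf0;   /* store feature sums */
            fii[(i * w + j) * 2 + 1] = up1 + sf1;
            float q0 = i ? pii[((i - 1) * w + j) * 3 + 0] : 0;
            float q1 = i ? pii[((i - 1) * w + j) * 3 + 1] : 0;
            float q2 = i ? pii[((i - 1) * w + j) * 3 + 2] : 0;
            pii[(i * w + j) * 3 + 0] = q0 + sp00;   /* store product sums */
            pii[(i * w + j) * 3 + 1] = q1 + sp01;
            pii[(i * w + j) * 3 + 2] = q2 + sp11;
        }
    }
}
```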
We can observe that the version without loop-fusion has the lowest AI, 0.5. We can also notice that, for a given version, loop-fusion divides the complexity by a factor of 1.2 (by rewriting the integral image steps) and the memory accesses by a factor of 2.5 (by avoiding the LOADs and STOREs of temporary results).
5.4 Embedded systems
Let us now focus on more embedded processors like the Intel ULV (ultra low voltage) family and ARM processors. In order to observe the performance evolution of each family, four processors were benchmarked: Penryn-M U9300 (1.2 GHz, 10 W, SSSE3), Haswell-M 4650U (1.7 GHz, 15 W, AVX2), ARM Cortex-A9 (1.2 GHz, 1.2 W, Neon), and ARM Cortex-A15 (1.7 GHz, 1.7 W, Neon). For the Penryn-M and Haswell-M, the power consumption is the thermal design power (TDP) provided by Intel. The ARM processors are part of SoCs (Pandaboard with a Texas Instruments OMAP4, and Samsung Chromebook with a Samsung Exynos 5), and it is very difficult to find official figures from ARM or Samsung, so these figures were collected on the internet and cross-validated on trustworthy websites.
Figures 6 and 7 provide the cpp of the processors for image sizes varying from 64 to 1024 for the Intel processors and from 64 to 512 for the ARM processors.
Firstly, for all processors, the SoA version is very inefficient compared to the best one (AoS + T + SIMD). SIMDization alone is also inefficient: around ×1.5 instead of ×4, the ideal speedup for 128-bit SIMD, and ×2.5 instead of ×8 for 256-bit SIMD. The reason is that SIMD is really efficient (with a speedup close to the SIMD register cardinal) only when data fit in the cache [23]. Here, the cache overflow appears for image sizes around 150 × 150 for ARM and 200 × 200 for Intel. As a matter of fact, a 512 × 512 image requires a cache of 36 MB, while a 640 × 480 one needs 43 MB. While the biggest server processors are just starting to offer such large caches (IBM Power7+, Intel Xeon Ivy Bridge), such an amount of cache is far beyond the embedded ARM and Intel laptop processors (from 1 to 4 MB). The important fact, also common to the four processors, is that the cpp of the AoS + T version remains constant, unlike the SoA and AoS versions, so the execution time can be predicted.
Secondly, there is one big difference between them: the cpp values. The Intel cpp values are up to ×4.5 smaller than the ARM ones, which comes from the higher latency of the ARM instructions.
The Cortex-A15 is faster than the A9 for two reasons: a faster cache combined with a faster external memory (likewise for Haswell-M versus Penryn-M), and the ability to execute one Neon instruction every cycle instead of one every two cycles: the SIMD throughput has been multiplied by two.
Regarding the impact of loop-fusion, Table 7 shows that the speedup over the AoS version ranges from ×1.6 up to ×1.7 for the ARM and Intel processors, for the scalar and SIMD versions, respectively. Loop-fusion is thus as efficient as SIMDization. The total speedup ranges from ×3.8 for ARM up to ×4.9 for the Penryn-M, and reaches ×7.9 for the Haswell-M (with SSE instructions).
Concerning execution time, the Penryn-M and the Haswell-M are, respectively, ×4.3 and ×2.2 faster than the A9 and the A15. If we compare the estimated energy consumption (based on the approximate TDP figures, as noted above), the A9 and the A15 are, respectively, ×3.8 and ×2.1 more efficient than the Penryn-M and the Haswell-M. ARM embedded processors thus remain more energy efficient than Intel ones.
5.5 Impact of other parameters: SIMD width and n_F value
The impact of a twice-wider SIMD (256 bits for AVX2 instead of 128 bits for SSE) has been evaluated on a Haswell-M processor. It appears that there is almost no difference in performance between SSE and AVX2. AVX (and AVX2) processors can pair two SSE instructions within one AVX instruction, thanks to their out-of-order capabilities. Once the SIMD instructions are fetched and decoded into the pipeline, they are put in the 'ready instructions' window before being dispatched to an execution unit (named port in Intel vocabulary). If the processor can find two data-independent SSE instructions that are ready to be executed, it pairs them together and sends the resulting instruction to an execution unit.
The impact of n_F has also been evaluated for the four processors. The two specialized scalar and SIMD versions, AoS+T and AoS+T+SIMD, have been instantiated for n_F = 8 with SSE, AVX, and Neon. This makes sense for the AVX architecture, as eight features fill 100% of one AVX register (see Table 3). The cpp ratio cpp(n_F = 8)/cpp(n_F = 7) varies from 1.11 up to 1.35 for ARM processors and from 1.21 up to 1.27 for Intel processors. These values are very close to the theoretical ratios (1.27 and 1.25) of the complexity and memory access amounts of Table 5. It means that the execution time of this part of the global algorithm is predictable, until we run out of registers and generate spill code.
6 Algorithm implementation
Two sequences have been used to evaluate the global performance on the four processors: Panda and Pedxing, for which the robustness of the algorithm has been evaluated in [11] and [18]. For both of them, the execution times are given in cpp for each version of the algorithm: SoA is the basic version, and AoS++ stands for AoS transform + SIMDization + loop-fusion transform.
Two counterintuitive results can be noticed. The first one is the features computation cpp: it is lower for SoA. The reason is the memory layout of SoA (versus AoS) when computing the features and storing them into a cube or a matrix. The second counterintuitive result is even more interesting: it concerns the tracking part of the algorithm, which is based on the computation of a similarity criterion requiring generalized eigenvalues, inversions, and matrix logarithms (Eq. 9). In order to have the same behavior on both platforms, we use the GNU Scientific Library to perform these computations, but the Intel MKL or Eigen libraries could also be used. The future position is chosen by evaluating 40 (in our case, but it is parameterizable) random positions in a search window, so matrix operations represent a high percentage of the tracking part. It appears that the features used for tracking lead to a more ill-conditioned matrix, requiring more computations for Panda than for the Pedxing-3 sequence.
Concerning the acceleration, Tables 8 and 9 show that the optimization of the kernel provides a speedup of ×2.8 to ×2.9 for Intel processors and ×2.0 to ×2.6 for ARM ones, which confirms the need for all the optimizations.
As both processors have two cores, all the processing parts can be done either on one core (the execution time is then the sum of all parts) or on two cores (the biggest part on one core and the two other parts on the second core). With such a coarse-grain thread distribution, the Penryn-M and the Haswell-M can track targets in real time for 640 × 480 images; the Haswell-M is even real time with only one core. The Cortex-A9 can do it for image sizes up to 320 × 240, and the Cortex-A15 is close to real time for 640 × 480 images. Once the kernel computation has been optimized, the biggest processing part becomes the features computation. With the optimization of this part, the Cortex-A15 should be able to reach real-time execution.
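The coarse-grain distribution described above can be sketched with two POSIX threads; the stage payloads below are illustrative stand-ins, not the paper's actual pipeline stages:

```c
#include <pthread.h>

/* Coarse-grain two-core split: core 0 runs the biggest stage, core 1
   runs the two smaller stages back-to-back, and both threads are
   joined once per frame. */
typedef struct { const int *in; int n; long out; } Stage;

static void stage_run(Stage *s) {          /* stand-in workload */
    long acc = 0;
    for (int i = 0; i < s->n; i++) acc += s->in[i];
    s->out = acc;
}
static void *core0(void *p) { stage_run((Stage *)p); return NULL; }
static void *core1(void *p) {              /* two smaller stages */
    Stage *s = (Stage *)p;
    stage_run(&s[0]);
    stage_run(&s[1]);
    return NULL;
}

static long process_frame(Stage *big, Stage small2[2]) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, core0, big);
    pthread_create(&t1, NULL, core1, small2);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return big->out + small2[0].out + small2[1].out;
}
```

The per-frame join keeps the scheme simple; its efficiency depends on how well the biggest stage balances against the other two combined.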
The performance ratio of the whole algorithm is close to that of the kernel: the Penryn-M and the Haswell-M are, respectively, ×4.0 and ×3.3 faster than the Cortex-A9 and the Cortex-A15. We can also observe that the image size has almost no impact on the performance ratio. From an energy point of view, the Cortex-A9 and the Cortex-A15 are, respectively, ×2.1 and ×2.7 more energy efficient than the Penryn-M and the Haswell-M.
7 Conclusions
We have presented the implementation of a robust covariance tracking algorithm with a parameterizable complexity that can be adapted to trade off robustness against execution time. A study has been made to qualitatively compare different covariance matrices in terms of the number and nature of visual features. Classical software and hardware optimizations have been applied: SIMDization and a loop-fusion transform combined with an AoS-SoA transform to accelerate the kernel of the algorithm. These optimizations allow a real-time execution (25 fps, or about 40 ms per image) for 320 × 240 images on the ARM Cortex-A9 and for 640 × 480 on the Intel Penryn-M and Haswell-M. The ARM Cortex-A15 should also reach real-time execution for such image sizes, once the other parts of the algorithm are optimized.
Our future work will focus on (1) the optimization of the features computation and (2) the multithreading of the tracking in order to perform multi-target tracking with load balancing on the available cores. A more thorough study should also be made concerning the impact of the ill-conditioning of the matrices on the execution time.
To the best of our knowledge, our implementation of the covariance tracking algorithm is the first real-time implementation for embedded systems that fully maintains the quality of the tracking.
Endnote
^a K-Nearest Neighbours.
References
1. Kalal Z, Mikolajczyk K, Matas J: Tracking-learning-detection. Pattern Anal. Mach. Intell. IEEE Trans 2012, 34(7):1409-1422.
2. Lucas BD, Kanade T: An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Francisco; 1981:674-679.
3. Comaniciu D, Ramesh V, Meer P: Kernel-based object tracking. Pattern Anal. Mach. Intell. IEEE Trans 2003, 25(5):564-577. 10.1109/TPAMI.2003.1195991
4. Gouiffès M, Laguzet F, Lacassagne L: Color connectedness degree for mean-shift tracking. In Pattern Recognition (ICPR), 2010 20th International Conference On. IEEE; 2010:4561-4564.
5. Porikli F, Tuzel O, Meer P: Covariance tracking using model update based on Lie algebra. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference On, vol. 1. IEEE; 2006:728-735.
6. Li P, Wang Q: Local Log-Euclidean Covariance Matrix (L2ECM) for Image Representation and Its Applications. Springer; 2012.
7. Zhang Y, Li S: Gabor-LBP based region covariance descriptor for person re-identification. Image and Graphics (ICIG), 2011 Sixth International Conference on 2011, 368-371.
8. Guo S, Ruan Q: Facial expression recognition using local binary covariance matrices. Wireless, Mobile & Multimedia Networks (ICWMMN 2011), 4th IET International Conference on 2011, 237-242.
9. Pang Y, Yuan Y, Li X: Gabor-based region covariance matrices for face recognition. Circuits Syst. Video Technol. IEEE Trans 2008, 18(7):989-993.
10. Bak S, Corvee E, Bremond F, Thonnat M: Multiple-shot human re-identification by mean Riemannian covariance grid. In Advanced Video and Signal-Based Surveillance (AVSS), 2011 8th IEEE International Conference On. IEEE; 2011:179-184.
11. Romero A, Gouiffès M, Lacassagne L: Covariance descriptor multiple object tracking and re-identification with colorspace evaluation. Asian Conference on Computer Vision (ACCV) 2012.
12. Wu Y, Wu B, Liu J, Lu H: Probabilistic tracking on Riemannian manifolds. Pattern Recognition (ICPR), 2008 19th International Conference on 2008, 1-4.
13. Tyagi A, Davis JW, Potamianos G: Steepest descent for efficient covariance tracking. Motion and Video Computing (WMVC), 2008 IEEE Workshop on 2008, 1-6.
14. Zhang X, Dai G, Xu N: Genetic algorithms: a new optimization and search algorithm. Control Theory Appl 1995, 3.
15. Romero A, Lacassagne L, Gouiffès M: Real-time covariance tracking algorithm for embedded systems. IEEE International Conference on Design and Architectures for Signal and Image Processing (DASIP) 2013.
16. Tuzel O, Porikli F, Meer P: Pedestrian detection via classification on Riemannian manifolds. Pattern Anal. Mach. Intell. IEEE Trans 2008, 30(10):1713-1727.
17. Yao J, Odobez JM: Fast human detection from videos using covariance features. The Eighth International Workshop on Visual Surveillance (VS) 2008.
18. Romero A, Gouiffès M, Lacassagne L: Enhanced local binary covariance matrices (ELBCM) for texture analysis and object tracking. In ACM International Conference Proceedings Series. Association for Computing Machinery; 2013.
19. Pietikäinen M, Hadid A, Zhao G, Ahonen T: Computer Vision Using Local Binary Patterns. Springer; 2011. http://books.google.fr/books?id=wBrZz9FiERsC
20. Tuzel O, Porikli F, Meer P: Region covariance: a fast descriptor for detection and classification. Computer Vision - ECCV 2006, 3952:589-600. 10.1007/11744047_45
21. Förstner W, Moonen B: A metric for covariance matrices. Quo Vadis Geodesia 1999, 113-128.
22. Hayman E, Caputo B, Fritz M, Eklundh JO: On the significance of real-world conditions for material classification. Computer Vision - ECCV 2004, 3024:253-266. 10.1007/978-3-540-24673-2_21
23. Lacassagne L, Etiemble D, Hassan-Zahraee A, Dominguez A, Vezolle P: High level transforms for SIMD and low-level computer vision algorithms. ACM Workshop on Programming Models for SIMD/Vector Processing (PPoPP) 2014, 49-56.
24. MINES ParisTech: PIPS. http://pips4u.org. Open source, under GPLv3 (1989-2009).
25. Irigoin F, Jouvelot P, Triolet R: Semantical interprocedural parallelization: an overview of the PIPS project. In ICS '91 Proceedings of the 5th International Conference on Supercomputing. ACM, New York; 1991:244-251.
Competing interests
The authors declare that they have no competing interests.
Andrés Romero, Lionel Lacassagne contributed equally to this work.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Romero, A., Lacassagne, L., Gouiffès, M. et al. Covariance tracking: architecture optimizations for embedded systems. EURASIP J. Adv. Signal Process. 2014, 175 (2014). https://doi.org/10.1186/168761802014175
Keywords
 Covariance tracking
 SIMD
 Multicore
 Embedded systems