Multi-GPU based on multicriteria optimization for motion estimation system
© Garcia et al.; licensee Springer. 2013
Received: 31 October 2012
Accepted: 14 December 2012
Published: 19 February 2013
Graphics processing units (GPUs) offer high performance and power efficiency for a large number of data-parallel applications. Previous research has shown that a GPU-based version of a neuromorphic motion estimation algorithm can achieve a ×32 speedup using these devices. However, memory consumption creates a bottleneck due to the expansive tree of signal processing operations performed, which limits the viability of the accelerator. In the present contribution, the memory footprint of the algorithm is reduced. An evolutionary algorithm was used to find the best configuration, which represents a trade-off between resource consumption, parallel efficiency, and accuracy. A multilevel parallel scheme was exploited: a coarse-grain level by means of multi-GPU systems, and a finer level by data parallelism. To make the analysis more relevant, standard optical flow benchmarks were used to validate this study. The satisfactory results open the possibility of building an intelligent motion estimation system that adapts itself according to real-time, resource consumption, and accuracy requirements.
Motion estimation and compensation are crucial for multimedia coding, which is characterized by high memory requirements and computational complexity. In MPEG processing, motion estimation is acknowledged as the most time-consuming task, accounting for up to 90% of the total execution time [2, 3]. Additionally, motion estimation has several applications in the multimedia field, such as segmentation, extraction of 3D structure, pattern tracking, filtering, compression, and de-blurring. The motion estimation models and algorithms developed so far can be classified into three main categories: matching domain approximations, energy models, and gradient models.
Block matching algorithms have the advantages of robustness, low-cost VLSI implementation (because of their regular parallel procedure), and low overhead (since they produce one vector per block). Nevertheless, they have many disadvantages: a block may contain several moving objects, and the algorithms fail for zoom, rotational motion, and local deformation, and produce blocking artifacts. In addition, they usually estimate the motion by minimizing an error metric, which does not necessarily reveal the true movement. Energy models are probabilistic, delivering a population of solutions that do not indicate motion itself, and are not usually used for multimedia purposes.
The gradient-based family can estimate the motion vector of every single pixel, giving a dense representation of the processed frame. There are several examples of video compression using gradient-based algorithms. Recursive algorithms belonging to this family do not have to transmit motion information. Nevertheless, this algorithm family has difficulties with large motion vectors (severe motion), noisy images, and changes in illumination. The present approach is based on the Multichannel Gradient Model (McGM) [8–10], a neuromorphic algorithm suited to the construction of viable, highly robust, front-end processors for image recognition systems.
The increased computing capabilities of graphics processing units (GPUs) in recent years have extended their use as accelerators in many areas such as scientific simulations, computer vision, bioinformatics, cryptography, and finance, among others. This increase is largely due to impressive performance rates. For example, one of the latest GPUs from Nvidia, the GTX 680, achieves about three teraflops in single precision with 1536 cores clocked at 1006 MHz and incorporates the newer Kepler architecture. Current trends seem to indicate that this capacity will grow even more with the incorporation of 22 and 28 nm technologies. Recently, for example, AMD announced its Radeon 8000 Series, branded as Sea Islands, and Intel is manufacturing Knights Corner products. However, key points that dramatically affect performance rates include the efficient use of the memory hierarchy and the exploitation of parallelism capabilities.
The increased demand for information to be processed also plays a role, because the use of these devices as accelerators is limited by DDR memory restrictions. To solve this problem, research has often proposed data reuse with the aim of minimizing the memory traffic between GPU and CPU. Another approach, in the field of mesh rendering, uses algorithms that are more efficient in terms of memory consumption alongside other techniques based on simplification or information compression. The GPU memory reduction proposed here is addressed in a motion estimation scenario, for which, to the best of our knowledge, no solution exists in the current literature.
In a previous study, we developed a GPU-based McGM implementation. In the present article, we address an efficient solution for dense and robust per-pixel motion estimation with respect to GPU memory consumption, which limits GPU viability.
This article is organized as follows: Section 2 describes the neuromorphic model; Section 3 presents the motivation of this study, where multi-objective optimization is used; and Section 4 shows performance and visual results. Finally, Section 5 concludes with the main contributions of this study.
2 Multichannel gradient model (McGM)
2.1 Stage I. temporal filtering
2.2 Stage II. spatial filtering
In the space domain, the shape of the receptive fields in the primary visual cortex can be modeled either by using Gabor functions (where the impulse responses are defined by harmonic functions multiplied by a Gaussian) or by using a set of Gaussian derivatives. The Gaussian is a unique function in many ways and is of particular importance to biology.
The nth Gaussian derivative can be expressed as a Hermite polynomial multiplied by the original Gaussian, where σ is the standard deviation of the Gaussian and the scale factor ensures that the function integrates to unity.
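The Hermite relation can be illustrated with a minimal sketch. It assumes the standard unit-area Gaussian with standard deviation σ and the probabilists' Hermite polynomials He_n; the normalization used in the original model may differ, and the function names are hypothetical.

```python
import math

def gaussian(x, sigma):
    """Unit-area Gaussian with standard deviation sigma (assumed normalization)."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

def hermite_prob(n, u):
    """Probabilists' Hermite polynomial He_n(u) via He_{k+1} = u*He_k - k*He_{k-1}."""
    h_prev, h = 1.0, u
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, u * h - k * h_prev
    return h

def gaussian_derivative(n, x, sigma):
    """n-th derivative of the Gaussian: (-1/sigma)^n * He_n(x/sigma) * G(x)."""
    return (-1.0 / sigma) ** n * hermite_prob(n, x / sigma) * gaussian(x, sigma)

# Cross-check the third derivative against a central difference of the second.
x0, sigma, h = 0.7, 1.5, 1e-5
numeric = (gaussian_derivative(2, x0 + h, sigma) - gaussian_derivative(2, x0 - h, sigma)) / (2 * h)
closed_form = gaussian_derivative(3, x0, sigma)
```

Sampling `gaussian_derivative` on a pixel grid would give the taps of a derivative filter of a given order; the kernel size and sampling are left open here because the source does not state them.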
2.3 Stage III. steering filtering
2.4 Stage IV. Taylor truncation
The three Taylor expansion derivatives are constructed in one large image using the completed set of basis filter responses. According to the original model, the expansions are truncated after the third order in the primary direction and the second order in the orthogonal and temporal directions.
2.5 Stage V. quotients
2.6 Stage VI. velocity primitives
3 Multi-criteria motivation for tuning McGM
Potential benefits of GPUs in the McGM context have been explored in the literature, where the authors studied the viability of the algorithm on these novel devices. Throughput results with respect to a single CPU were satisfactory in terms of performance, achieving ×32 speedups for movies of 256 × 256 resolution.
We would like to emphasize that this GPU-based motion estimation scheme is an alternative worth considering in terms of Mpixel/s compared with other systems used for such motion estimation algorithms. However, the characteristics of the algorithm create a bottleneck: memory requirements grow with each stage. This disadvantage limits GPU viability. In the largest memory usage configuration considered in that study, 3.5 GB of global memory was used, which was close to the capacity limit of a single GPU. Although GPU memory capacity is greater nowadays, the problem persists with larger input resolutions.
Performance of the GPU versus CPU
To reduce the memory consumption of the algorithm, one option is not to store some of the temporary data, recalculating them when necessary at the expense of reduced throughput under real-time conditions. The most memory-demanding stages in the McGM algorithm are the Spatial Filtering and Steering stages. On the one hand, an efficient way to reduce memory needs is to perform the Steering stage with fewer θ angles at the expense of accuracy degradation. On the other hand, it is possible to use a numerical derivative instead of the Gaussian counterpart in the Spatial Filtering stage in order to allow faster derivative recalculation. This alternative scheme avoids storing intermediate data, which saves a huge amount of memory, and recalculates the data whenever they are used. A simple numerical differentiating filter was used, based on the associative property of convolution: $I \otimes G_x = I \otimes (G_0 \otimes D_x) = (I \otimes G_0) \otimes D_x$. The number of operations performed in $(I \otimes G_0) \otimes D_x$ is smaller than in Gaussian derivative filtering, which makes the convolution process faster.
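The identity used above can be checked numerically with a small one-dimensional sketch. The kernels below are toy stand-ins chosen for illustration (a three-tap smoothing kernel in place of G_0 and a central difference in place of D_x); they are assumptions, not the filters of the original implementation.

```python
def convolve(a, b):
    """Full discrete 1-D convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

signal = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0]  # toy 1-D image row I
g0 = [0.25, 0.5, 0.25]                           # smoothing kernel standing in for G_0
dx = [0.5, 0.0, -0.5]                            # central-difference kernel standing in for D_x

gx = convolve(g0, dx)                            # precomputed derivative kernel G_x = G_0 * D_x
direct = convolve(signal, gx)                    # I * G_x
reused = convolve(convolve(signal, g0), dx)      # (I * G_0) * D_x

# Both routes give the same samples, so only I * G_0 needs to be kept in memory.
max_diff = max(abs(p - q) for p, q in zip(direct, reused))
```

Under this scheme the smoothed image is computed once, and each derivative is regenerated on demand with a short difference kernel rather than being stored.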
Filter accuracy degradation using a numerical derivative instead of the Gaussian counterpart, from the first to the fifth derivative order
Although the degradation in filtering accuracy is minor, an experiment comparing motion estimation degradation was carried out to evaluate the loss of accuracy in the overall algorithm. As benchmarks, we used a couple of synthetic sequences widely accepted in this context: the ‘diverging tree’ and the ‘translating tree’, both created by David Fleet at the University of Toronto. The ‘diverging tree’ shows an expansive motion of a tree (in camera zoom mode) with an asymmetric velocity range depending on the pixel position (null in the central focus, and 1.4 pixels/frame and 2 pixels/frame at the left and right boundaries, respectively). The ‘translating tree’ shows the translational motion of a tree with an asymmetric velocity range depending on the pixel position (zero to 1.73 pixels/frame and zero to 2.3 pixels/frame at the left and right borders, respectively). As the error metric, we used Barron's angular error, considered to be one of the most accepted metrics in the specialized literature.
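For reference, a minimal sketch of Barron's angular error follows. It treats each velocity (u, v) as the space-time direction (u, v, 1) and measures the angle between the true and estimated directions; the function names and the list-of-pairs flow representation are assumptions made for illustration.

```python
import math

def barron_angular_error(u_true, v_true, u_est, v_est):
    """Angle in radians between the space-time vectors (u, v, 1) of true and estimated flow."""
    dot = u_true * u_est + v_true * v_est + 1.0
    norm = math.sqrt(u_true**2 + v_true**2 + 1.0) * math.sqrt(u_est**2 + v_est**2 + 1.0)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def mean_angular_error(true_flow, est_flow):
    """Mean angular error over paired lists of (u, v) velocities."""
    errors = [barron_angular_error(ut, vt, ue, ve)
              for (ut, vt), (ue, ve) in zip(true_flow, est_flow)]
    return sum(errors) / len(errors)
```

Because of the appended unit component, the metric stays defined for zero velocities, which matters for the null-motion centre of the ‘diverging tree’ sequence.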
Overall degradation measured as mean absolute error of Barron’s angle
As observed, the ‘diverging tree’ experiment behaves reasonably well with numerical derivatives, whose impact is reduced with a higher order of accuracy. Nevertheless, in the ‘translating tree’ experiment, the algorithm is more sensitive to numerical derivatives than to variation in the number of angles. Because of the disparity observed in Table 3, it is advisable to explore the space of feasible solutions for any given set of input data. Given the large number of parameters to configure, some relating to the McGM algorithm and others to the available resources, genetic algorithms (GAs) can be useful to reduce the time spent on exploration.
3.1 Multi-criteria optimization description
The use of GAs is motivated by the infeasibility of exhaustively exploring a large solution space. In our context, the target is to find a compromise that reduces the GPU's memory usage with negligible accuracy degradation and that allows the motion estimation system to adapt itself, in a reasonable time, to appreciable changes in environmental conditions.
The multi-objective problem can be stated as: minimize $z = f(x) = (f_1(x), f_2(x), f_3(x))$ subject to $x \in X$, where z is the objective vector with three objectives to be minimized: execution time $f_1$, memory usage $f_2$, and loss of accuracy $f_3$; x is the decision vector, and X is the feasible region in the decision space, which corresponds to all possible McGM configurations with respect to the derivative decision and the number of angles involved. In GA terminology, x corresponds to a chromosome. In our context:
$D_x$ corresponds to the derivative to be computed in spatial filtering. This information is stored in a two-dimensional array whose values determine how each derivative is computed, either by means of a Gaussian filter or by numerical differentiation of a given order. The position in the two-dimensional array is related to the derivative order.
The number of θ angles to be performed in the steering stage, which can be assigned as an integer.
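One possible encoding of such a chromosome is sketched below. The gene values, the maximum derivative order, the accuracy orders, and the candidate angle counts are all hypothetical choices made for illustration; the source does not specify them.

```python
import random
from dataclasses import dataclass
from typing import List

GAUSSIAN = 0  # gene value meaning "use the Gaussian derivative filter"

@dataclass
class Chromosome:
    # deriv_mode[i][j]: how the derivative of order i in x and j in y is computed;
    # 0 = Gaussian filter, k > 0 = numerical differentiation of accuracy order k.
    deriv_mode: List[List[int]]
    n_angles: int  # number of theta angles used in the steering stage

def random_chromosome(rng, max_order=5, accuracy_orders=(2, 4, 6),
                      angle_choices=(4, 8, 12, 16, 24)):
    """Draw a random configuration (all ranges are illustrative assumptions)."""
    genes = (GAUSSIAN,) + tuple(accuracy_orders)
    modes = [[rng.choice(genes) for _ in range(max_order + 1)]
             for _ in range(max_order + 1)]
    return Chromosome(modes, rng.choice(angle_choices))

def mutate(ch, rng, p_mut=0.01, accuracy_orders=(2, 4, 6),
           angle_choices=(4, 8, 12, 16, 24)):
    """Return a copy in which each gene is resampled with probability p_mut."""
    genes = (GAUSSIAN,) + tuple(accuracy_orders)
    modes = [[rng.choice(genes) if rng.random() < p_mut else g for g in row]
             for row in ch.deriv_mode]
    angles = rng.choice(angle_choices) if rng.random() < p_mut else ch.n_angles
    return Chromosome(modes, angles)
```

A call such as `random_chromosome(random.Random(0))` yields one candidate McGM configuration; a full population is a list of these.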
3.2 Our multi-GPU implementation
Over the last few years, a great number of multi-objective evolutionary algorithms have been developed [21–23]. A review of GAs can be found in a tutorial, where the authors summarize their most relevant features.
For this study, we have chosen NSGA-II for the following advantages:
Weights are not required, so it is not necessary to study the impact of f i (x) and assign them.
Its computational requirement is low: non-dominated sorting costs O(MN²) for M objectives and a population of size N, which is less than in earlier non-dominated sorting GAs.
Its ‘good’ behavior and ability to find a set of solutions near the true Pareto-optimal with few iterations.
It is widely used and amply tested.
Initially, a random population is created in pop.
The population is sorted based on the non-domination scheme.
Each individual is assigned a fitness, which means that every individual of the population is ranked into levels. The first level, or Pareto-front, is the most preferable.
A binary tournament selection and combination is carried out.
A mutation phase is done.
A combined population R is formed as the union of the old population pop and the new one new_pop. The population R has size 2 × pop_size.
R is ranked by means of the McGM algorithm and sorted according to a non-domination scheme.
A new population pop of size pop_size is selected from R.
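The non-domination sorting used in the steps above can be sketched as follows, for objective vectors that are all minimized (time, memory, error). This is a generic illustration of the ranking step only; NSGA-II additionally uses a crowding distance to truncate the last accepted front, which is omitted here, and the sample values are made up.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(objs):
    """Return fronts as lists of indices; fronts[0] is the Pareto front."""
    n = len(objs)
    dominated_by = [[] for _ in range(n)]  # indices that individual i dominates
    counts = [0] * n                       # how many individuals dominate i
    for i in range(n):
        for j in range(n):
            if i != j:
                if dominates(objs[i], objs[j]):
                    dominated_by[i].append(j)
                elif dominates(objs[j], objs[i]):
                    counts[i] += 1
    fronts = [[i for i in range(n) if counts[i] == 0]]
    while fronts[-1]:
        nxt = []
        for i in fronts[-1]:
            for j in dominated_by[i]:
                counts[j] -= 1
                if counts[j] == 0:
                    nxt.append(j)
        fronts.append(nxt)
    return fronts[:-1]  # drop the trailing empty front

# Illustrative (time, memory, error) triples, not measured data.
sample = [(1.0, 3.0, 0.2), (2.0, 2.0, 0.1), (2.5, 3.5, 0.3), (3.0, 4.0, 0.4)]
fronts = non_dominated_sort(sample)  # [[0, 1], [2], [3]]
```

The sorting itself is cheap compared with obtaining the objective vectors, since each vector requires a full McGM run.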
The fast non-dominated sorting is the most computationally expensive part of the GA, because it involves ranking every individual of the population. We propose that this task be performed entirely on multiple GPUs, since this is more efficient than using a CPU from a computational point of view. Most GA operators are executed on the CPU due to their low computational demand.
Ranking an individual of the population means running the McGM algorithm with the configuration encoded in its chromosome, that is, with the specified way of computing the derivatives in the Spatial Filtering stage and the specified number of angles in the Steering stage. Several levels of parallelism are exploited: a coarser level, where the non-dominated sorting is evaluated in parallel on several GPUs, and a finer level, which exploits the data parallelism available in each stage of the McGM algorithm. Algorithm 1 summarizes our parallel implementation, where pop_size, ngens, and %mutation are GA input parameters that correspond to population size, number of generations, and mutation probability, respectively. The OpenMP paradigm is used to distribute the non-dominated sort across multiple devices by means of #pragma omp parallel for directives. Our implementation generates Pareto-optimal solutions as a set of points given by motion estimation execution time, pixel accuracy error, and GPU memory usage. This allows one of the best solutions to be chosen according to the available computational resources, which favors dynamic tuning depending on current conditions.
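The coarse-grain level can be illustrated with a host-side sketch in which one worker per device evaluates a slice of the population. The original implementation uses OpenMP in C/C++; the version below is only a Python analogue, and `evaluate` is a hypothetical callback standing in for a full McGM run on the given device.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_on_device(device_id, individuals, evaluate):
    """Evaluate one slice of the population on a single device, sequentially."""
    return [evaluate(ind, device_id) for ind in individuals]

def evaluate_population(population, n_devices, evaluate):
    """Split the population round-robin across devices and evaluate the slices concurrently."""
    slices = [population[d::n_devices] for d in range(n_devices)]
    with ThreadPoolExecutor(max_workers=n_devices) as pool:
        futures = [pool.submit(evaluate_on_device, d, slices[d], evaluate)
                   for d in range(n_devices)]
        per_device = [f.result() for f in futures]
    # Undo the round-robin split so results line up with the input order.
    results = [None] * len(population)
    for d, chunk in enumerate(per_device):
        results[d::n_devices] = chunk
    return results
```

A static round-robin split is a reasonable first choice when individuals cost roughly the same to evaluate; if configurations differ widely in run time, a dynamic schedule would balance the load better.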
Algorithm 1 pareto_front = multiGPU_NSGAII(pop_size, ngens, %mutation)
4.1 Work environment
The systems used are based on Tesla technology. The first one consists of 2 six-core Intel Xeon E5645 processors (2.40 GHz with 12 MB cache memory and Hyper-threading technology) and 2 Tesla M2070 GPUs. The second one is equipped with 2 quad-core Intel Xeon E5530 processors (2.40 GHz with 8 MB cache memory and Hyper-threading technology), connected to 4 Tesla C1060 GPUs. In both cases, the operating system is Debian with a 2.6.38 kernel; the compiler used is GNU g++ v.4.4 with compilation flags -O3 -m64 -fopenmp, and CUDA C/C++ SDK v.4.2 with the -O3 -fopenmp -arch sm_20/13 flags enabled.
The system based on the Tesla M2070 incorporates Fermi technology, but because few such devices were available, the scalability study was completed on the system based on 4 Tesla C1060 GPUs, which allows parallel efficiency rates to be projected for more modern systems.
4.2 Multicriteria results
Multi-objective GAs are used to look for optimal solutions in a huge search space. In our context, they are employed to achieve a set of optimal solutions that reduce the GPU’s memory usage in the McGM algorithm without losing significant accuracy in the motion estimation scenario. As previously mentioned, the tests were performed using the ‘diverging tree’ and the ‘translating tree’ benchmarks, which are widely accepted in this area.
The first experiment evaluated both the convergence of the GA and the set of optimal solutions reached. For this purpose, a Euclidean distance metric between consecutive solutions was employed. The GA implemented incorporates a stop condition based on this metric: the search ends when the metric remains invariant for a certain number of iterations, which ensures that the non-dominated solutions have converged to the optimal Pareto-front.
In this experiment, the population size was fixed to 500 with 1% mutation probability. The results obtained indicated that after a certain number of generations, the GA barely improved the non-dominated solutions, although it reported new pairs.
Population size affects only the final execution time; the optimal solutions reached are of similar quality. Empirically, a mutation rate of 1% gave the best GA performance. Higher mutation rates only cause significant variations between consecutive generations, which means more generations are needed to reach the convergence criterion. In particular, greater mutation rates increase the number of iterations by between 15 and 320%.
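A stop condition of this kind might be sketched as follows. The exact distance definition is not given in the text, so the measure below (mean distance from each point of the current front to its nearest neighbour in the previous front) is an assumption, as are the `tol` and `patience` parameters.

```python
import math

def front_distance(prev_front, curr_front):
    """Mean Euclidean distance from each current point to its nearest previous point."""
    return sum(min(math.dist(p, q) for q in prev_front)
               for p in curr_front) / len(curr_front)

class ConvergenceMonitor:
    """Signals convergence once the front has been stable for `patience` generations."""

    def __init__(self, tol=1e-3, patience=5):
        self.tol, self.patience = tol, patience
        self.prev, self.stalled = None, 0

    def update(self, front):
        """Feed the current front (list of objective tuples); return True to stop."""
        if self.prev is not None and front_distance(self.prev, front) <= self.tol:
            self.stalled += 1
        else:
            self.stalled = 0
        self.prev = front
        return self.stalled >= self.patience
```

In a GA loop, `update` would be called once per generation with the current first front, and the loop would exit either when it returns True or when the generation limit is reached.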
As shown in Figure 2, optimal solutions are generated with a significant reduction in memory requirements, some of them even more accurate than the original McGM algorithm for the ‘translating tree’ benchmark.
4.3 Multi-GPU results
Multi-GPU execution times for Tesla M2070 based system
Multi-GPU execution times for Tesla C1060 based system
Moreover, the use of multiple levels of parallelism yields multiplicative accelerations: first, the speedups achieved in the multi-GPU system, which can be up to ×3.71 with 4 GPUs enabled; and second, the accelerations of up to ×32 that can be achieved by exploiting the data parallelism on a GPU. On the one hand, the combination of both accelerations reduces the exploration time needed to reach an optimal solution by 99.2% compared with a general-purpose processor. On the other hand, the use of a multi-GPU system not only delivers greater FLOPS rates than a CPU, but is also beneficial in terms of power consumption (MFLOPS/watt).
Moreover, although GA search times are significant, GAs deliver suboptimal solutions that already meet the response time or resource consumption requirements, and these solutions are gradually refined as the GA evolves. This feature, coupled with the possibility of reducing the population size, leads to a substantial decrease in simulation times, which opens the possibility of building an intelligent system that auto-corrects/adapts depending on the specific requirements or on substantial changes in the environment.
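The multiplicative effect can be checked with simple arithmetic on the figures quoted above; the variable names are illustrative, and the parallel efficiency is derived here rather than quoted from the text.

```python
multi_gpu_speedup = 3.71       # 4 GPUs versus 1 GPU
data_parallel_speedup = 32.0   # 1 GPU versus 1 CPU
n_gpus = 4

combined = multi_gpu_speedup * data_parallel_speedup  # about x118.7 versus a CPU
time_reduction = 1.0 - 1.0 / combined                 # fraction of exploration time saved
efficiency = multi_gpu_speedup / n_gpus               # multi-GPU parallel efficiency

print(f"combined speedup: x{combined:.2f}")
print(f"time reduction:   {time_reduction * 100:.1f}%")
print(f"efficiency:       {efficiency * 100:.2f}%")
```

A combined speedup near ×118.7 corresponds to a time reduction of about 99.2%, consistent with the figure reported in the text.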
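As a sketch of how such a self-adapting system might use the Pareto set, the selector below picks the most accurate configuration that satisfies the current time and memory budgets. The data layout, the function name, and the sample points are hypothetical and are not measurements from this study.

```python
def pick_configuration(pareto, max_time=None, max_memory=None, max_error=None):
    """Return the feasible Pareto point with the lowest error, or None if none fits."""
    def feasible(p):
        t, m, e = p["objectives"]  # (execution time, memory usage, accuracy error)
        return ((max_time is None or t <= max_time) and
                (max_memory is None or m <= max_memory) and
                (max_error is None or e <= max_error))
    candidates = [p for p in pareto if feasible(p)]
    return min(candidates, key=lambda p: p["objectives"][2], default=None)

# Made-up Pareto points for illustration only.
pareto = [
    {"name": "A", "objectives": (1.0, 3.5, 0.10)},
    {"name": "B", "objectives": (0.5, 1.8, 0.12)},
    {"name": "C", "objectives": (0.3, 0.9, 0.20)},
]
choice = pick_configuration(pareto, max_memory=2.0)
```

Re-running the selector whenever the budgets change is one way to obtain the dynamic tuning described above without repeating the GA search.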
4.4 Visual result
For the ‘diverging tree’, a reduction of 75% in memory usage returns the same precision using the Barron metric and takes 50% of the McGM execution time (MEtime) of the original algorithm. However, the configuration that reduces memory usage by 50% degrades the accuracy by 22%, with a speedup of ×3.3.
For the ‘translating tree’ benchmark, a solution with half the memory requirements is more accurate (Barron’s error is 0.13 radians less than for the original McGM) and ×3.5 faster.
Despite the popularity of Barron's metric in the scientific community in the context of motion estimation, some authors [26, 27] point out shortcomings due to its asymmetry and its bias for large flow vectors.
4.5 Other error metrics
Although Barron’s metric is probably the most widely used in the motion estimation field, there are other metrics used by the machine vision community that must be taken into account in order to enhance the visibility and generality of the results obtained.
Best configuration achieved for a reduction of 75 and 50% memory requirements using McCane and Otte&Nagel metric
A new and highly parallel approach is presented to overcome the GPU memory usage problems that occurred in our previous implementation of a well-known neuromorphic motion estimation algorithm. This context provides the main motivation for using evolutionary algorithms to solve multi-criteria optimization problems. The use of GAs based on a multi-GPU scheme allowed for quick exploration of feasible solutions with any set of input data. The choice of NSGA-II is motivated by the good results observed in a few iterations and a near-optimal Pareto-front.
From the viewpoint of reaching a solution that meets the requirements of memory consumption, we observed:
For the ‘diverging tree’, a reduction of 75% in memory usage returns the same precision for all metrics considered and takes 50% of the McGM execution time of the original algorithm. A configuration that reduces memory usage by 50% degrades the accuracy by 15 to 25%, with speedups ranging from ×2.8 to ×3.5.
For the ‘translating tree’, a configuration that has half of the memory requirements is more accurate in terms of error and is between ×3.3 and ×4 faster.
From the point of view of multi-GPU efficiency, we observed:
Speedups of ×3.71 are achieved when four GPUs are enabled.
Our implementation is a scalable approach due to both a well-balanced workload and low-impact communication between host and device.
A multiplicative effect: ×3.71 speedups in a multi-GPU system multiplied by the ×32 acceleration obtained by exploiting the data parallelism on a GPU. As a result, the GA time needed to reach an optimal solution is reduced by 99.2% compared with a CPU.
An alternative to be considered in terms of power consumption (MFLOPS/watt).
Because of these encouraging results, the possibility exists for building an intelligent system that auto-corrects/adapts depending on specific requirements or environmental condition variations as the GA evolves.
Future work will focus on combining this system with an environment predictor, with the possibility of real-time execution and self-reconfiguration depending on the external constraints and the resources available in the platform. This system is expected to contribute to new machine vision trends and to be useful for many real-world applications.
The present study has been supported by Spanish Projects CICYT-TIN 2008/508, CICYT-TIN 2012-32180 and Ingenio Consolider ESP00C-07-20811.
- Shaaban M, Goel S, Bayoumi M: Motion estimation algorithm for real-time systems. IEEE Workshop on Signal Processing Systems 2004, 257-262.
- Kang JY, Gupta S, Shah S, Gaudiot JL: An efficient pim (processor-in-memory) architecture for motion estimation. IEEE International Conference on Application-Specific Systems, Architectures, and Processors 2003, 282-292.
- Kang JY, Gupta S, Gaudiot JL: An efficient data-distribution mechanism in a processor-in-memory (pim) architecture applied to motion estimation. IEEE Trans. Comput 2008, 57(3):375-388.
- Oh HS, Lee HK: Block-matching algorithm based on an adaptive reduction of the search area for motion estimation. Real-Time Imag 2000, 6(5):407-414. 10.1006/rtim.1999.0184
- Huang CL, Chen YT: Motion estimation method using a 3d steerable filter. Image Vis. Comput 1995, 13(1):21-32. 10.1016/0262-8856(95)91465-P
- Baker S, Gross R, Matthews I: Lucas-kanade 20 years on: a unifying framework: Part 3. Int. J. Comput. Vis 2002, 56: 221-255.
- Chi YM, Tran TD, Etienne-Cummings R: Optical flow approximation of sub-pixel accurate block matching for video coding. IEEE ICASSP 2003, 1: 1017-1020.
- Johnston A, McOwan PW, Benton CP: A unified account of three apparent motion illusions. Vis. Res 1995, 35(8):1109-1123. 10.1016/0042-6989(94)00175-L
- McOwan PW, Johnston A, Benton CP: Robust velocity computation from a biologically motivated model of motion perception. Proc. Royal Soc. B 1999, 266: 509-518. 10.1098/rspb.1999.0666
- Liang X, McOwan PW, Johnston A: Biologically inspired framework for spatial and spectral velocity estimations. J. Opt. Soc. Am. A 2011, 28(4):713-723. 10.1364/JOSAA.28.000713
- Botella G, García A, Rodriguez-Alvarez M, Ros E, Meyer-Bäse U, Molina MC: Robust bioinspired architecture for optical-flow computation. IEEE Trans. VLSI Syst 2010, 18(4):616-629.
- Mattes L, Kofuji S: Overcoming the GPU memory limitation on FDTD through the use of overlapping subgrids. Int. Conference on Microwave and Millimeter Wave Technology 2010, 1536-1539.
- Zhou Y, Garland M: Interactive point-based rendering of higher-order tetrahedral data. IEEE Transactions on Visualization and Computer Graphics 2006, 12(5):1229-1236.
- Ayuso F, Botella G, Garcia C, Prieto M, Tirado F: GPU-based acceleration of bioinspired motion estimation model. Concurrency and Computation: Practice and Experience 2012, in press. http://dx.doi.org/10.1002/cpe.2946
- Snowden RJ, Hess RF: Temporal frequency filters in the human peripheral visual field. Vis. Res 1992, 32(1):61-72. 10.1016/0042-6989(92)90113-W
- Koenderink JJ: Optic flow. Vision Research 1986, 26: 161-180.
- Fornberg B: Generation of finite difference formulas on arbitrarily spaced grids. Math. Comput 1988, 51(184):699-706. 10.1090/S0025-5718-1988-0935077-0
- Fleet DJ: Measurement of Image Velocity. Norwell, MA, USA: Kluwer Academic Publishers; 1992.
- Barron JL, Fleet DJ, Beauchemin SS: Performance of optical flow techniques. Int. J. Comput. Vis 1994, 12: 43-77. 10.1007/BF01420984
- Konak A, Coit D, Smith D: Multi-objective optimization using genetic algorithms: a tutorial. Reliab. Eng. Syst. Safety 2006, 91(9):992-1007. 10.1016/j.ress.2005.11.018
- Fonseca C, Fleming P: Genetic algorithms for multiobjective optimization: formulation, discussion and generalization. Int. Conference on Genetic Algorithms 1993, 416-423.
- Srinivas N, Deb K: Multiobjective optimization using nondominated sorting in genetic algorithms. Evol. Comput 1994, 2(3):221-248. 10.1162/evco.1994.2.3.221
- Coello Coello CA: 20 years of evolutionary multi-objective optimization: what has been done and what remains to be done. In Computational Intelligence: Principles and Practice, chap. 4. Edited by: Yen GY, Fogel DB. Vancouver, Canada: IEEE Computational Intelligence Society; 2006:73-88. ISBN 0-9787135-0-8
- Zitzler E, Laumanns M, Bleuler S: A tutorial on evolutionary multiobjective optimization. In Metaheuristics for Multiobjective Optimisation (Springer-Verlag) 2003, 535: 3-38.
- Deb K, Pratap A, Agarwal S, Meyarivan T: A fast elitist multi-objective genetic algorithm: Nsga-ii. IEEE Trans. Evol. Comput 2002, 6: 182-197.
- Otte M, Nagel HH: Estimation of optical flow based on higher-order spatiotemporal derivatives in interlaced and non-interlaced image sequences. Artif. Intell 1995, 78(1):5-43.
- McCane B, Novins K, Crannitch D, Galvin B: On benchmarking optical flow. Comput. Vis. Image Underst 2001, 84(1):126-143. 10.1006/cviu.2001.0930
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.