 Research Article
 Open Access
Multi-GPU based on multicriteria optimization for motion estimation system
EURASIP Journal on Advances in Signal Processing, volume 2013, Article number: 23 (2013)
Abstract
Graphics processor units (GPUs) offer high performance and power efficiency for a large number of data-parallel applications. Previous research has shown that a GPU-based version of a neuromorphic motion estimation algorithm can achieve a ×32 speedup using these devices. However, memory consumption creates a bottleneck, due to the expansive tree of signal processing operations performed, that limits the viability of the accelerator. In the present contribution, we reduce this memory consumption, using an evolutionary algorithm to find the best configuration: a trade-off solution between resource consumption, parallel efficiency, and accuracy. A multilevel parallel scheme is exploited: a coarse-grain level by means of multi-GPU systems, and a finer level through data parallelism. To achieve a more relevant analysis, optical flow benchmarks were used to validate this study. The satisfactory results open the possibility of building an intelligent motion estimation system that adapts itself according to real-time, resource consumption, and accuracy requirements.
1 Introduction
Motion estimation and compensation are crucial for multimedia coding, characterized by high memory requirements and computational complexity. In MPEG processing, motion estimation is acknowledged as the most time-consuming stage [1], accounting for up to 90% of the total execution time [2, 3]. Additionally, motion estimation has several applications within the multimedia scope, such as segmentation, extraction of 3D structure, pattern tracking, filtering, compression, and deblurring. The motion estimation models and algorithms developed so far can be classified into three main categories: matching domain approximations [4], energy models [5], and gradient models [6].
Block matching algorithms have the advantages of robustness, low-cost VLSI implementation (because of their regular parallel procedure), and low overhead (since they contain one vector per block). Nevertheless, they have many drawbacks: a block may contain several moving objects, and they may fail under zoom, rotational motion, local deformation, and blocking artifacts. In addition, they usually estimate motion by minimizing an error metric, which does not necessarily reveal the true movement. Energy models are probabilistic, delivering a population of solutions that do not indicate motion itself, and are not usually used for multimedia purposes.
The gradient-based family can estimate the motion vector of every single pixel, giving a dense representation of the processed frame. There are several examples of video compression using gradient-based algorithms [7]. Recursive algorithms belonging to this family do not have to transmit motion information. Nevertheless, this algorithm family has drawbacks with large motion vectors (severe motion), noisy images, and changes in illumination. The present approach is based on the Multichannel Gradient Model (McGM) [8–10], a neuromorphic algorithm designed to allow the construction of viable, highly robust, front-end processors for image recognition systems [11].
The increased computing capabilities of graphics processing units (GPUs) in recent years have encouraged their use as accelerators in many areas, such as scientific simulations, computer vision, bioinformatics, cryptography, and finance, among others. This is largely due to impressive performance rates. For example, one of the latest GPUs from Nvidia, the GTX 680, achieves three teraflops in single precision with 1536 cores and also incorporates the newer Kepler architecture. Current trends seem to indicate that this capacity will grow even more with the incorporation of 22 and 28 nm technologies. Recently, for example, AMD announced its Radeon 8000 Series, branded as Sea Islands, and Intel is manufacturing Knights Corner products. However, key points that dramatically affect performance rates include the efficient use of the memory hierarchy and the exploitation of parallelism capabilities.
The increasing amount of information to be processed also plays a role, because the use of these devices as accelerators is limited by DDR memory restrictions. To address this problem, research [12] has often proposed data reuse alternatives with the aim of minimizing the memory traffic between GPU and CPU. Another approach, in the field of rendering meshes, can be found in [13]: a solution that uses algorithms that are more efficient in terms of memory consumption, alongside other techniques based on simplification or information compression. The GPU memory reduction proposed here is addressed in a motion estimation scenario, which, to the best of our knowledge, has not been covered in the current literature.
In a previous study [14], we developed a GPU-based McGM implementation. In the present article, we address an efficient solution for dense and robust per-pixel motion estimation with respect to GPU memory consumption, which limits GPU viability.
This article is organized as follows: Section 2 moves through a specific neuromorphic model; Section 3 presents the motivation of this study where multiobjective optimization is used; and Section 4 shows performance and visual results. Finally, Section 5 concludes with the main contributions of this study.
2 Multichannel gradient model (McGM)
This original algorithm was proposed by Johnston et al. We have applied Johnston's description of the McGM model [9, 10] while adding many specific variations to improve the viability of the GPU implementation, as we will discuss in later sections. Figure 1 shows a simplified scheme of the processing pipeline to be completed.
2.1 Stage I. temporal filtering
Taking as a starting point the study performed by Hess and Snowden [15] on temporal processing in human beings, we model three different temporal channels: one low-pass filter and two band-pass filters with central frequencies of 10 and 18 Hz, respectively. These channels can be implemented using Gaussian differentiation in the log-time domain.
2.2 Stage II. spatial filtering
In the space domain, the shape of the receptive fields of the primary visual cortex can be modeled either by using Gabor functions, where the impulse responses are defined by harmonic functions multiplied by a Gaussian, or by using a derivative set of Gaussians [16]. The Gaussian is a unique function in many ways and is of particular importance to biology.
When the differentiation order increases, the Gaussians are fitted and tuned to higher spatial frequencies. Finally, a range of independent channels is constructed.
The n-th Gaussian derivative can be expressed as a Hermite polynomial multiplied by the original Gaussian, where σ is the standard deviation of the Gaussian and the scale factor ensures that the function integrates to unity.
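Since the displayed equation is not reproduced here, the relationship can be illustrated numerically. The following Python fragment is a sketch (not the paper's code) that evaluates the n-th derivative of a unit-area Gaussian through the physicists' Hermite three-term recurrence:

```python
import math

def hermite(n, t):
    # physicists' Hermite polynomial H_n(t) via the recurrence
    # H_{k+1}(t) = 2t*H_k(t) - 2k*H_{k-1}(t), with H_0 = 1, H_1 = 2t
    h0, h1 = 1.0, 2.0 * t
    if n == 0:
        return h0
    for k in range(1, n):
        h0, h1 = h1, 2.0 * t * h1 - 2.0 * k * h0
    return h1

def gaussian_derivative(n, x, sigma):
    """n-th derivative at x of the unit-area Gaussian with std. dev. sigma:
    d^n/dx^n g(x) = (-1/(sigma*sqrt(2)))^n * H_n(x/(sigma*sqrt(2))) * g(x)."""
    g = math.exp(-x * x / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))
    t = x / (sigma * math.sqrt(2.0))
    return (-1.0 / (sigma * math.sqrt(2.0))) ** n * hermite(n, t) * g
```

As the text notes, increasing the differentiation order n tunes the resulting filter to higher spatial frequencies.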
2.3 Stage III. steering filtering
The steering stage projects the spatio-temporal filters calculated in the previous stages onto different orientations. Calling n and m the derivative orders in the x and y directions, respectively, θ the projected angle, and D the derivative operator, the general expression is derived as a linear combination of filters belonging to the same-order basis.
2.4 Stage IV. Taylor truncation
At this stage, a truncated Taylor expansion is performed using each oriented filter previously calculated. This function represents a robust structure that gathers all the spatio-temporal information of the sequence, approximating one generic pixel by the set of derivatives from its neighborhood, which can be written as follows:
The three Taylor expansion derivatives are assembled into one large image using the complete set of basis filter responses. According to the original model [9], the expansions are truncated after the third order in the primary direction and the second order in the orthogonal and temporal directions.
2.5 Stage V. quotients
This is the last stage of the common pathway calculation; the following stages implement the modulus and phase estimation with separate expressions. The goal here is to compute a quotient of every sextet's components:
2.6 Stage VI. velocity primitives
The previous stages compute the visual information considering a Taylor representation of each pixel and calculating the speed for a range of orientations in order to simulate the orientation columns found in the striate cortex [9]. This is accomplished by rotating the coordinate system and Gaussian derivative filters (Steering Stage) to a number of primary directions. Next, the speed measurements–parallel and orthogonal–are placed in primary directions in order to yield a vector of speed measurements, whose components are speed and orthogonal speed:
The raw measurements of speed are also conditioned by including the measurements of the image structure XY/XX and XY/YY, where the final conditioned speed vectors are given by:
∑ is the number of orientations at which speed is evaluated. Inverse speed is also calculated:
Inverse speed is evaluated using different terms from those used to compute speed, and so constitutes an additional independent measurement. Finally, the motion modulus is calculated through a quotient of determinants:
The direction of motion is extracted by calculating a measurement of phase that is combined across all speed related measurements, since they are in phase:
3 Multicriteria motivation for tuning the McGM
The potential benefits of GPUs in the McGM context have been explored in the literature [14], where the authors studied the viability of these novel devices. Throughput results with respect to a single CPU were satisfactory in terms of performance, achieving ×32 speedups for 256×256 resolution movies.
We would like to emphasize that this particular GPU-based motion estimation scheme is an alternative to consider in terms of Mpixel/s compared with other special-purpose systems used for such motion estimation algorithms, although the algorithm's features create a bottleneck: the memory requirements increase at each stage, with an upward trend. This disadvantage limits GPU viability. For the largest memory-usage configuration considered in [14], 3.5 GB of global memory was used, which was close to the capacity limit of a single GPU. Although GPU memory capacity is greater nowadays, this problem is still present for larger input resolutions.
The scope of this study is to explore mechanisms to reduce the amount of data without losing efficiency or accuracy. To highlight the memory handicap of GPUs, Table 1 summarizes the performance results observed for the McGM algorithm on a graphics device compared with a single CPU. The performance of each stage of the algorithm is shown in Mpixel/s (Mpps), together with the global throughput for a particular model configuration: three temporal derivatives, a temporal convolution window of 15 frames, five spatial derivatives, a spatial separable convolution window of 31 pixels, and 12 steered angles. As shown in this table, the GPU implementation amply fulfilled real-time requirements for all of the resolutions considered. This is further shown in the last column, which corresponds to overall performance measured in frames per second (fps). While general-purpose processors can only reach real-time rates for small video resolutions, GPU-based systems enabled higher-resolution movies where more DDR memory capacity was available.
In order to reduce the algorithm's memory consumption, one particular solution is not to store some of the temporary data computations, recalculating them when necessary at the expense of reduced throughput under real-time conditions. The most memory-demanding stages of the McGM algorithm are the Spatial Filtering and Steering stages. On the one hand, an efficient way to reduce memory needs is to perform the Steering stage with fewer θ angles, at the expense of accuracy degradation. On the other hand, it is possible to use a numerical derivative [17] instead of the Gaussian counterpart in the Spatial Filtering stage in order to allow faster derivative recalculation. This alternative scheme avoids storing intermediate computations, saving a huge amount of memory, and recalculates data whenever they are used. A simple numerical differentiating filter was used based on the commutative property of convolution: I⊗G _{ x }=I⊗(G _{0}⊗D _{ x })=(I⊗G _{0})⊗D _{ x }. The number of operations performed in (I⊗G _{0})⊗D _{ x } is smaller than in the Gaussian derivative filtering, making the convolution process faster.
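The commutative scheme above can be illustrated in one dimension. The sketch below is illustrative only (the 3-tap central-difference kernel D_x is our assumption, not necessarily the paper's filter): it checks that smoothing with G_0 and then differencing approximates direct filtering with the sampled Gaussian derivative G_x:

```python
import math

def conv(signal, kernel):
    # 'same'-size 1-D convolution with zero padding; kernel assumed centered
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i + j - r
            if 0 <= idx < len(signal):
                acc += signal[idx] * k
        out.append(acc)
    return out

sigma = 2.0
radius = 6  # ~3*sigma support
xs = list(range(-radius, radius + 1))
g0 = [math.exp(-x * x / (2.0 * sigma ** 2)) for x in xs]
norm = sum(g0)
g0 = [v / norm for v in g0]                         # smoothing kernel G_0 (unit sum)
gx = [-x / sigma ** 2 * v for x, v in zip(xs, g0)]  # sampled Gaussian derivative G_x
dx = [0.5, 0.0, -0.5]                               # assumed central-difference kernel D_x

signal = [math.sin(0.2 * i) for i in range(64)]
direct = conv(signal, gx)              # I (*) G_x
two_step = conv(conv(signal, g0), dx)  # (I (*) G_0) (*) D_x
# compare interior samples only, away from zero-padded borders
err = max(abs(p - q) for p, q in zip(direct[10:54], two_step[10:54]))
```

The two-step path touches each sample with a much shorter derivative kernel, which is what makes on-the-fly recalculation cheap.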
Table 2 shows the error in computing G _{0}⊗D _{ x } in order to evaluate the accuracy degradation. Filter degradation denotes the difference (G _{0}⊗D _{ G })−(G _{0}⊗D _{ N }), where D _{ G } and D _{ N } correspond to Gaussian and numerical derivative filtering, respectively. As can be appreciated, the loss of accuracy is minor for 9–31 pixel filters, reaching a maximum error of 3%. A priori, we may conclude that performing numerical derivatives rarely introduces considerable error.
Despite the small degradation in filtering accuracy, an experiment comparing motion estimation degradation was carried out to evaluate the loss of accuracy in the overall algorithm. As benchmarks, we used a couple of synthetic sequences widely accepted in this context: the 'diverging tree' and the 'translating tree', both created by David Fleet at the University of Toronto [18]. The 'diverging tree' shows the expansive motion of a tree (camera zoom) with an asymmetric velocity range depending on pixel position (null at the central focus, and 1.4 and 2 pixels/frame at the left and right boundaries, respectively). The 'translating tree' shows the translational motion of a tree with an asymmetric velocity range depending on pixel position (zero to 1.73 and zero to 2.3 pixels/frame at the left and right borders, respectively). As an error metric, we used Barron's [19], considered one of the most accepted metrics in the specialized literature.
Barron's metric, Equation (11), measures the deviation from the correct space-time orientation, the velocity being represented as a 3D unit direction vector. This vector wraps both modulus (speed) and phase (direction) into a single value, reducing the amplification of directional errors for small velocities.
Since the vector is self-normalized, the angle between the measured velocity v _{ e } and the correct one v _{ c } is given by Equation (12). This error measurement is calculated for every pixel for which a velocity measurement was recovered.
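Equation (12) amounts to the arc-cosine of the dot product of the two unit direction vectors (u, v, 1)/‖(u, v, 1)‖. A minimal Python sketch of this per-pixel measurement (illustrative, not the paper's implementation) is:

```python
import math

def barron_angular_error(ve, vc):
    """Angular error (radians) between the estimated and correct 2-D flow
    vectors, each extended to the 3-D unit direction (u, v, 1)/||(u, v, 1)||."""
    def unit3(flow):
        u, v = flow
        n = math.sqrt(u * u + v * v + 1.0)
        return (u / n, v / n, 1.0 / n)
    a, b = unit3(ve), unit3(vc)
    # clamp the dot product to guard against rounding outside [-1, 1]
    dot = min(1.0, max(-1.0, sum(x * y for x, y in zip(a, b))))
    return math.acos(dot)
```

The constant third component is what keeps the error bounded for small velocities, where a purely 2-D angle would be ill-conditioned.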
Table 3 shows the error in Barron's angle when a numerical derivative is used instead of its Gaussian counterpart in spatial filtering, together with a significant reduction of the θ angles in the Steering stage. Columns O(h), O(h ^{2}), O(h ^{3}), and O(h ^{4}) denote the observed Barron angle error when numerical derivatives of first-, second-, third-, and fourth-order accuracy are used, respectively. #θ denotes the maximum number of θ angles projected in the Steering stage; the table shows the impact of halving or quartering the number of θ angles.
As observed, the 'diverging tree' experiment behaves reasonably well with numerical derivatives, their impact decreasing with a higher order of accuracy. Nevertheless, in the 'translating tree' experiment, the algorithm is more vulnerable to numerical derivatives than to variations in the number of angles. Given this disparity observed in Table 3, it is advisable to explore the space of feasible solutions for any set of input data. Given the large number of parameters to configure, relative on one hand to the McGM algorithm and on the other to the available resources, genetic algorithms (GAs) can be useful to reduce this time-consuming exploration.
3.1 Multicriteria optimization description
The use of GAs arises from the non-viability of exhaustively exploring such a large solution space. In our context, the target is to find a compromise that reduces the GPU's memory usage with negligible accuracy degradation, allowing the motion estimation system to adapt itself to appreciable environmental conditions and changes in a reasonable time.
The goal of multiobjective optimization [20] is to simultaneously optimize several objectives that may conflict. Trade-offs among the different variables involved therefore also need to be considered. In our context, we consider the following three-objective minimization problem:
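The displayed problem is not reproduced above; consistent with the definitions that follow, it can be written as:

```latex
\min_{x \in X} \; \mathbf{z} = F(x) = \bigl( f_1(x),\, f_2(x),\, f_3(x) \bigr)
```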
where z is the objective vector with three objectives to be minimized: execution time f _{1}, memory usage f _{2}, and loss of accuracy f _{3}; x is the decision vector, and X is the feasible region in the decision space, which corresponds to all possible McGM configurations with respect to the derivative decisions and the number of angles involved. In GA terminology, x corresponds to a chromosome. In our context:

D _{ x } corresponds to the derivative to be computed in spatial filtering. This information is stored in a two-dimensional array whose values determine whether each derivative is computed by Gaussian differentiation or by numerical differentiation of a given order. The position in the two-dimensional array is related to the derivative order.

The number of θ angles to be performed in the steering stage, which can be assigned as an integer.
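The chromosome described in the two items above can be sketched as follows; the names, bounds, and encoding are illustrative assumptions, not the paper's actual data structure:

```python
import random
from dataclasses import dataclass

GAUSSIAN = 0  # 0 selects the Gaussian derivative;
              # k > 0 selects a numerical derivative of accuracy order O(h^k)

@dataclass
class Chromosome:
    # derivative_method[i][j]: how the derivative of order i in x and
    # order j in y is computed (array position encodes derivative order)
    derivative_method: list
    n_angles: int  # number of theta angles performed in the Steering stage

def random_chromosome(max_dx=5, max_dy=2, max_order=4, max_angles=12):
    """Random individual matching the model configuration in the text
    (five spatial derivative orders, up to 12 steered angles)."""
    table = [[random.randint(0, max_order) for _ in range(max_dy + 1)]
             for _ in range(max_dx + 1)]
    return Chromosome(table, random.randint(3, max_angles))
```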
3.2 Our multiGPU implementation
Over the last few years, a great number of multiobjective evolutionary algorithms have been developed [21–23]. A review of GAs can be found in the tutorial [24], where the authors cover their most relevant features.
For this study, we have chosen the NSGA-II [25] for the following advantages:

Weights are not required, so it is not necessary to study the impact of f _{ i }(x) and assign them.

Its computational complexity is O(MN ^{2}), which is lower than that of earlier nondominated sorting approaches.

Its 'good' behavior and ability to find a set of solutions near the true Pareto-optimal front in few iterations.

It is widely used and amply tested.
The NSGA-II is based on a fast non-dominated sorting procedure in which a fast crowded-distance estimation is carried out, involving a simple crowded-comparison operator [25]. The NSGA-II algorithm can be summarized in the following stages:

1.
Initially, a random population pop is created.

2.
The population is sorted based on the nondomination scheme.

3.
Each individual is assigned a fitness, i.e., the population is ranked into levels, where the first level, or Pareto front, is the most preferable.

4.
A binary tournament selection and combination is carried out.

5.
A mutation phase is done.

6.
A combined population R is formed from the union of the old population pop with the new one new_pop. The population R has size 2*pop_size.

7.
R is ranked by means of the McGM algorithm and sorted according to a nondomination scheme.

8.
A new population pop of size pop_size is selected.
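Steps 2–3 and 7 above rely on the fast non-dominated sorting of [25]. A compact Python sketch of that procedure (illustrative, for three minimized objectives such as time, memory, and error) is:

```python
def dominates(a, b):
    # a dominates b if it is no worse in every objective and strictly
    # better in at least one (all objectives are minimized)
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def fast_nondominated_sort(objs):
    """Return a list of fronts (lists of indices); front 0 is the Pareto front."""
    n = len(objs)
    dominated = [[] for _ in range(n)]  # individuals each i dominates
    dom_count = [0] * n                 # how many individuals dominate i
    fronts = [[]]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(objs[i], objs[j]):
                dominated[i].append(j)
            elif dominates(objs[j], objs[i]):
                dom_count[i] += 1
        if dom_count[i] == 0:
            fronts[0].append(i)
    k = 0
    while fronts[k]:
        nxt = []
        for i in fronts[k]:
            for j in dominated[i]:
                dom_count[j] -= 1
                if dom_count[j] == 0:
                    nxt.append(j)
        fronts.append(nxt)
        k += 1
    fronts.pop()  # the last front is always empty
    return fronts

# objective triples (time, memory, error) for four individuals
pop = [(1.0, 2.0, 3.0), (2.0, 1.0, 3.0), (2.0, 2.0, 4.0), (1.0, 1.0, 3.0)]
print(fast_nondominated_sort(pop))  # → [[3], [0, 1], [2]]
```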
The fast non-dominated sorting is the most computationally costly part of the GA, because it involves ranking every individual of the population. We perform this task entirely on multiple GPUs, since this is more efficient than using a CPU from a computational point of view. Most GA operators are executed on the CPU due to their low computational demand.
Ranking an individual of the population means computing the McGM algorithm with the chromosome's configuration, i.e., computing the derivatives in the Spatial Filtering stage and the number of angles in the Steering stage. Several levels of parallelism are exploited: a coarser level, where the non-dominated sorting is evaluated in parallel on several GPUs, and a finer level, by means of the data parallelism available in each stage of the McGM algorithm. Algorithm 1 summarizes our parallel implementation, where pop_size, ngens, and %mutation are GA input parameters corresponding to population size, number of generations, and mutation probability, respectively. The OpenMP paradigm is used to distribute the non-dominated sort across multiple devices by means of #pragma omp parallel for directives. Our implementation generates Pareto-optimal solutions as a set of points of motion estimation execution time, accuracy pixel error, and GPU memory usage. This feature allows one of the best solutions to be chosen taking into account the available computational resources, favoring dynamic tuning depending on current conditions.
Algorithm 1 pareto_front = multiGPU_NSGAII(pop_size, ngens, %mutation)
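The coarse-grain level of Algorithm 1 (whose body is not reproduced here) distributes the fitness evaluation across devices. A hypothetical Python analogue of the paper's OpenMP loop, with evaluate_fitness standing in for the actual McGM CUDA kernels, might look like:

```python
from concurrent.futures import ThreadPoolExecutor

N_GPUS = 2  # assumption: number of devices available

def evaluate_fitness(individual, gpu_id):
    # placeholder for launching the McGM pipeline on device gpu_id and
    # measuring (execution time, memory usage, accuracy error)
    return (float(individual), float(gpu_id), 0.0)

def rank_population(population):
    # one worker per GPU, mirroring '#pragma omp parallel for' over devices
    with ThreadPoolExecutor(max_workers=N_GPUS) as pool:
        futures = [pool.submit(evaluate_fitness, ind, i % N_GPUS)
                   for i, ind in enumerate(population)]
        return [f.result() for f in futures]
```

A round-robin assignment like this keeps the workload well balanced, which the results section credits for the observed scalability.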
4 Results
4.1 Work environment
The systems used are based on Tesla technology. The first consists of 2 Intel Xeon E5645 processors with six cores (2.40 GHz, 12 MB cache memory, Hyperthreading technology) and 2 Tesla M2070 GPUs. The second is equipped with 2 Quad Intel Xeon E5530 processors (2.40 GHz, 8 MB cache memory, Hyperthreading technology) connected to 4 Tesla C1060 GPUs. In both cases, the operating system is Debian with a 2.6.38 kernel; the compiler used is GNU g++ v.4.4 with the compilation flags -O3 -m64 -fopenmp, and CUDA C/C++ SDK v.4.2 with the -O3 -fopenmp -arch sm_20/13 flags enabled.
The system based on Tesla M2070s incorporates Fermi technology, but due to the scarce number of devices available, a scalability study was completed on the system based on 4 Tesla C1060 GPUs, which allows parallel efficiency rates to be projected onto more modern systems.
4.2 Multicriteria results
Multiobjective GAs are used to look for optimal solutions in a huge search space. In our context, they are employed to achieve a set of optimal solutions that reduce the GPU’s memory usage in the McGM algorithm without losing significant accuracy in the motion estimation scenario. As previously mentioned, the tests were performed using the ‘diverging tree’ and the ‘translating tree’ benchmarks, which are widely accepted in this area.
The first experiment evaluated both the convergence of the GA and the set of optimal solutions reached. For this purpose, a Euclidean distance metric between consecutive solutions was employed, as described in [25]. The implemented GA incorporates a stop condition based on this Euclidean metric remaining invariant for a certain number of iterations, to ensure the non-dominated solutions converge to the optimal Pareto front.
Figure 2 shows the evolution of the set of nondominated solutions throughout the iterations with a severe stop condition. To facilitate its visualization, only the GPU’s memory reduction and the error difference were included, although the GA also optimizes the motion estimation time. Barron’s angle error ψ _{ E } corresponds to the difference of mean Barron error with respect to the original McGM counterpart.
In this experiment, the population size was fixed to 500 with 1% mutation probability. The results obtained indicated that after a certain number of generations, the GA barely improved the non-dominated solutions, although it still reported new pairs.
Population size only affects the final execution time, with optimal solutions of similar quality being achieved. Empirically, a 1% mutation rate gave the best GA performance. Higher mutation rates introduce significant variations between consecutive generations, which means more generations are necessary to reach the convergence criterion; in particular, greater mutation rates increase the number of iterations required by between 15 and 320%.
As shown in Figure 2, optimal solutions are generated with a significant reduction in memory requirements, even achieving more accurate solutions than the original McGM algorithm for the 'translating tree' benchmark.
4.3 MultiGPU results
Table 4 shows the GA time, measured in seconds, on the system based on the Tesla M2070 for the best GA configuration: a 1% mutation rate, 500 individuals, and a severe convergence condition to find the Pareto front. The 'diverging tree' was used as the benchmark, although similar performance rates were observed with the 'translating tree'; note that the benchmark choice only affects the number of generations processed to reach an optimal solution. As expected, the fitness evaluation is by far the most costly stage of the GA. The information exchange overhead between host and devices is not significant, which yields satisfactory speedups of ×1.79 for 2 GPUs.
Analogous results were obtained on the system with a larger number of graphics devices. Table 5 shows even higher accelerations when 2 GPUs are enabled. Furthermore, scalability rates remain satisfactory with 4 GPUs, achieving ×3.71 speedups. These computational results show that our multi-GPU implementation is efficient in terms of scalability (95 and 93% efficiency using 2 and 4 GPUs, respectively), and the tendency indicates that GA convergence times would be even lower if more computational resources were available. We can conclude that this successful scalability makes GAs useful for solving problems of this nature. These good performance results are due to both a well-balanced workload and the low overhead involved in data exchange.
Moreover, the use of multiple levels of parallelism yields multiplicative accelerations: first, the speedups achieved in the multi-GPU system, up to ×3.71 with 4 GPUs enabled; and second, the accelerations of up to ×32 that can be achieved by exploiting the data parallelism on each GPU. On one hand, the combination of both accelerations reduces the exploration time needed to reach an optimal solution by 99.2% compared with a general-purpose processor. On the other hand, the use of a multi-GPU system not only delivers greater FLOPS rates than a CPU, but is also beneficial in terms of power consumption (MFLOPS/watt).
Moreover, although GA search times are considerable, GAs can quickly provide suboptimal solutions that meet the requirements of response time or resource consumption, and these solutions are gradually refined as the GA evolves. This feature, coupled with the possibility of reducing the population size, brings an impressive decrease in simulation times, which opens the possibility of building an intelligent system that corrects and adapts itself depending on specific requirements or substantial environmental changes.
4.4 Visual results
Finally, visual results are presented for both benchmarks considered. Figure 3 shows the main differences in motion estimation outputs for the 'diverging tree' benchmark. The original McGM outputs appear at the top of the figure; in the center and at the bottom are their counterparts for GA solutions with 75 and 50% of the memory requirements. The motion estimation time consumption (ME-time) is also indicated. The motion phase (the direction of the pixels) is color-coded using the frame boundary (each particular pixel color points outward to the border color of the frame). The modulus, or velocity magnitude, is represented by a linear intensity of gray scales. Similarly, Figure 4 displays GA solutions for the 'translating tree' benchmark.
For the 'diverging tree', a reduction of 75% in memory usage returns the same precision under the Barron metric with 50% of the McGM execution time (ME-time) of the original algorithm. However, the configuration that reduces memory usage by 50% degrades the accuracy by 22%, with a speedup of ×3.3.
For the 'translating tree' benchmark, a solution with half the memory requirements is more accurate (Barron's error is 0.13 radians lower than the original McGM) and ×3.5 faster.
Despite the popularity of Barron's metric in the scientific community in the context of motion estimation, some authors [26, 27] point out specific shortcomings due to its asymmetry and its bias for large flow vectors.
4.5 Other error metrics
Although Barron's metric [19] is probably the most used in the motion estimation scope, there are other metrics used by the machine vision community that must be taken into account in order to enhance the visibility and generality of the results obtained.
Otte and Nagel [26, 27] remarked on its asymmetry and bias for large optical-flow vectors. Based on this drawback, a new metric was proposed that accounts for the magnitude difference between the bidimensional ground-truth flow vector (ofv _{ c }) and the estimated one (ofv _{ e }), as shown in Equation (14):
McCane et al. [27] claim this is not sufficient because it discounts error in regions of small flow. They propose two metrics to overcome these problems. The first is the angle difference between the correct three-dimensional vector v _{ c } and the estimated one v _{ e }, as used in Barron's metric (Equation (15)), but with the third component replaced by δ. In our experiments we set δ=0.75. This threshold modulates the error, considering it less significant in zones of small flow than in zones of large flow.
The second metric is the normalized magnitude of the vector difference between the estimated and the correct flow vectors. The normalization factor is the magnitude of the correct flow, and the effect of small flows is taken into account using a significance threshold T, as shown in Equation (16); T is chosen to be 0.5 pixels. The effect of this threshold is a normalized error equal to unity in such regions.
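Since Equations (14)–(16) are not reproduced above, one plausible reading of the three metrics can be sketched as follows; the exact handling of flows below the threshold T is our assumption, not a definition taken from the paper:

```python
import math

def endpoint_error(ofv_e, ofv_c):
    # Otte & Nagel style magnitude of the difference between 2-D flow vectors
    return math.hypot(ofv_e[0] - ofv_c[0], ofv_e[1] - ofv_c[1])

def mccane_angle(ofv_e, ofv_c, delta=0.75):
    # Barron-style angular error, but with the third component set to delta
    def unit3(flow):
        u, v = flow
        n = math.sqrt(u * u + v * v + delta * delta)
        return (u / n, v / n, delta / n)
    a, b = unit3(ofv_e), unit3(ofv_c)
    dot = min(1.0, max(-1.0, sum(x * y for x, y in zip(a, b))))
    return math.acos(dot)

def normalized_magnitude(ofv_e, ofv_c, T=0.5):
    # magnitude of the difference, normalized by the correct-flow magnitude;
    # assumed thresholding: negligible correct flow with a significant
    # estimate scores 1, two negligible flows score 0
    nc = math.hypot(ofv_c[0], ofv_c[1])
    ne = math.hypot(ofv_e[0], ofv_e[1])
    if nc >= T:
        return endpoint_error(ofv_e, ofv_c) / nc
    if ne >= T:
        return 1.0
    return 0.0
```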
Figure 5 shows the trend of the NSGA-II algorithm under the McCane and Otte & Nagel metrics. Analogously to Section 4.2, the 'translating tree' and 'diverging tree' sequences are used as benchmarks.
For the sake of clarity, Table 6 summarizes the main successful configurations reported by the GA execution. The results observed are consistent regardless of the metric used. While for the 'translating tree' the GA improves motion estimation effectiveness with a significant reduction in memory requirements, for the 'diverging tree' no degradation is observed at 75% memory usage under any metric. From the viewpoint of execution times, performance results are as expected: on the one hand, a 50% reduction in memory requirements translates into speedups from ×3.3 to ×4; on the other hand, 75% memory consumption yields an average 50% reduction in motion estimation execution time.
5 Conclusions
A new and highly parallel approach has been presented to overcome the GPU memory usage problems that occurred in our previous implementation of a well-known neuromorphic motion estimation algorithm. This context provides the main motivation for using evolutionary algorithms to solve multicriteria optimization problems. The use of GAs based on a multi-GPU scheme allowed quick exploration of the feasible solutions for any set of input data. The choice of NSGA-II is motivated by the good results observed in few iterations, reaching a near-optimal Pareto front.
From the viewpoint of reaching a solution that meets the requirements of memory consumption, we observed:

For the 'diverging tree', a reduction of 75% in memory usage returns the same precision under all the metrics considered with 50% of the McGM execution time of the original algorithm. A configuration that reduces memory usage by 50% degrades the accuracy by 15 to 25%, with speedups ranging from ×2.8 to ×3.5.

For the 'translating tree', a configuration with half the memory requirements is more accurate in terms of error and is between ×3.3 and ×4 faster.
From the point of view of multi-GPU efficiency, we observed:

Successful performance of ×3.71 speedup is achieved when four GPUs are enabled.

Our implementation is a scalable approach due to both a well-balanced workload and low-impact communication between host and devices.

A multiplicative effect: ×3.71 speedup in the multi-GPU system combined with the ×32 acceleration obtained by exploiting the data parallelism on each GPU, reducing the GA time needed to reach an optimal solution by 99.2% compared with a CPU.

An alternative to be considered in terms of power consumption (MFLOPS/watt).
Because of these encouraging results, it becomes possible to build an intelligent system that auto-corrects and adapts to specific requirements or to changing environmental conditions as the GA evolves.
Future work will combine this system with an environment predictor, enabling real-time execution and self-reconfiguration according to external constraints and the resources available on the platform. This system is expected to contribute to new machine vision trends, useful for many real-world applications.
References
1. Shaaban M, Goel S, Bayoumi M: Motion estimation algorithm for real-time systems. IEEE Workshop on Signal Processing Systems 2004, 257-262.
2. Kang JY, Gupta S, Shah S, Gaudiot JL: An efficient PIM (processor-in-memory) architecture for motion estimation. IEEE International Conference on Application-Specific Systems, Architectures, and Processors 2003, 282-292.
3. Kang JY, Gupta S, Gaudiot JL: An efficient data-distribution mechanism in a processor-in-memory (PIM) architecture applied to motion estimation. IEEE Trans. Comput. 2008, 57(3):375-388.
4. Oh HS, Lee HK: Block-matching algorithm based on an adaptive reduction of the search area for motion estimation. Real-Time Imaging 2000, 6(5):407-414. 10.1006/rtim.1999.0184
5. Huang CL, Chen YT: Motion estimation method using a 3D steerable filter. Image Vis. Comput. 1995, 13(1):21-32. 10.1016/0262-8856(95)91465-P
6. Baker S, Gross R, Matthews I: Lucas-Kanade 20 years on: a unifying framework: Part 3. Int. J. Comput. Vis. 2002, 56: 221-255.
7. Chi YM, Tran TD, Etienne-Cummings R: Optical flow approximation of sub-pixel accurate block matching for video coding. IEEE ICASSP 2003, 1: 1017-1020.
8. Johnston A, McOwan PW, Benton CP: A unified account of three apparent motion illusions. Vis. Res. 1995, 35(8):1109-1123. 10.1016/0042-6989(94)00175-L
9. McOwan PW, Johnston A, Benton CP: Robust velocity computation from a biologically motivated model of motion perception. Proc. Royal Soc. B 1999, 266: 509-518. 10.1098/rspb.1999.0666
10. Liang X, McOwan PW, Johnston A: Biologically inspired framework for spatial and spectral velocity estimations. J. Opt. Soc. Am. A 2011, 28(4):713-723. 10.1364/JOSAA.28.000713
11. Botella G, García A, Rodriguez-Alvarez M, Ros E, Meyer-Bäse U, Molina MC: Robust bio-inspired architecture for optical-flow computation. IEEE Trans. VLSI Syst. 2010, 18(4):616-629.
12. Mattes L, Kofuji S: Overcoming the GPU memory limitation on FDTD through the use of overlapping subgrids. Int. Conference on Microwave and Millimeter Wave Technology 2010, 1536-1539.
13. Zhou Y, Garland M: Interactive point-based rendering of higher-order tetrahedral data. IEEE Transactions on Visualization and Computer Graphics 2006, 12(5):1229-1236.
14. Ayuso F, Botella G, Garcia C, Prieto M, Tirado F: GPU-based acceleration of bio-inspired motion estimation model. Concurrency and Computation: Practice and Experience (in press, 2012). http://dx.doi.org/10.1002/cpe.2946
15. Snowden RJ, Hess RF: Temporal frequency filters in the human peripheral visual field. Vis. Res. 1992, 32(1):61-72. 10.1016/0042-6989(92)90113-W
16. Koenderink JJ: Optic flow. Vision Research 1986, 26: 161-180.
17. Fornberg B: Generation of finite difference formulas on arbitrarily spaced grids. Math. Comput. 1988, 51(184):699-706. 10.1090/S0025-5718-1988-0935077-0
18. Fleet DJ: Measurement of Image Velocity. Norwell, MA, USA: Kluwer Academic Publishers; 1992.
19. Barron JL, Fleet DJ, Beauchemin SS: Performance of optical flow techniques. Int. J. Comput. Vis. 1994, 12: 43-77. 10.1007/BF01420984
20. Konak A, Coit D, Smith D: Multiobjective optimization using genetic algorithms: a tutorial. Reliab. Eng. Syst. Safety 2006, 91(9):992-1007. 10.1016/j.ress.2005.11.018
21. Fonseca C, Fleming P: Genetic algorithms for multiobjective optimization: formulation, discussion and generalization. Int. Conference on Genetic Algorithms 1993, 416-423.
22. Srinivas N, Deb K: Multiobjective optimization using nondominated sorting in genetic algorithms. Evol. Comput. 1994, 2(3):221-248. 10.1162/evco.1994.2.3.221
23. Coello Coello CA: 20 years of evolutionary multiobjective optimization: what has been done and what remains to be done. In Computational Intelligence: Principles and Practice, chap. 4. Edited by: Yen GY, Fogel DB. Vancouver, Canada: IEEE Computational Intelligence Society; 2006:73-88. ISBN 0978713508
24. Zitzler E, Laumanns M, Bleuler S: A tutorial on evolutionary multiobjective optimization. In Metaheuristics for Multiobjective Optimisation (Springer-Verlag) 2003, 535: 3-38.
25. Deb K, Pratap A, Agarwal S, Meyarivan T: A fast elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2000, 6: 182-197.
26. Otte M, Nagel HH: Estimation of optical flow based on higher-order spatiotemporal derivatives in interlaced and non-interlaced image sequences. Artif. Intell. 1995, 78(1):5-43.
27. McCane B, Novins K, Crannitch D, Galvin B: On benchmarking optical flow. Comput. Vis. Image Underst. 2001, 84(1):126-143. 10.1006/cviu.2001.0930
Acknowledgements
The present study was supported by the Spanish projects CICYT-TIN 2008/508, CICYT-TIN 2012-32180, and Ingenio Consolider ESP00C0720811.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Keywords
 GPGPU
 Motion estimation
 Memory reduction technique
 Multiobjective optimization