A multi-GPU motion estimation system based on multicriteria optimization

Graphics processing units (GPUs) offer high performance and power efficiency for a large number of data-parallel applications. Previous research has shown that a GPU-based version of a neuromorphic motion estimation algorithm can achieve a ×32 speedup on these devices. However, memory consumption creates a bottleneck due to the expansive tree of signal processing operations performed, and this limits the viability of the accelerator. In the present contribution, we reduce that memory consumption, using an evolutionary algorithm to find the best configuration as a trade-off among resource consumption, parallel efficiency, and accuracy. A multilevel parallel scheme is exploited: a coarse-grained level by means of multi-GPU systems, and a finer level by data parallelism. To make the analysis more relevant, several optical flow benchmarks were used to validate this study. The satisfactory results open the possibility of building an intelligent motion estimation system that adapts itself to real-time, resource-consumption, and accuracy requirements.


Introduction
Motion estimation and compensation are crucial for multimedia coding, which is characterized by high memory requirements and computational complexity. In MPEG processing, motion estimation is acknowledged as the most time-consuming stage [1], accounting for up to 90% of the total execution time [2,3]. Additionally, motion estimation has several applications within the multimedia scope, such as segmentation, extraction of 3D structure, pattern tracking, filtering, compression, and de-blurring. The motion estimation models and algorithms developed to date can be classified into three main categories: matching domain approximations [4], energy models [5], and gradient models [6].
Block matching algorithms have the advantages of robustness, low-cost VLSI implementation (because of their regular parallel procedure), and low overhead (since they produce one vector per block). Nevertheless, they have many drawbacks: a block may contain several moving objects, and they may fail under zoom, rotational motion, local deformation, and blocking artifacts. In addition, they usually estimate motion by minimizing an error metric, which does not necessarily reveal the true movement. Energy models are probabilistic, delivering a population of solutions that does not indicate the motion itself, and are not usually used for multimedia purposes.
*Correspondence: garsanca@dacya.ucm.es, Computer Architecture Department, Complutense University of Madrid, Madrid, Spain
The gradient-based family can estimate the motion vector of every single pixel, giving a dense representation of the processed frame. There are several examples of video compression using gradient-based algorithms [7]. Recursive algorithms belonging to this family do not have to transmit motion information. Nevertheless, this algorithm family struggles with large motion vectors (severe motion), noisy images, and changes in illumination. The present approach is based on the Multichannel Gradient Model (McGM) [8][9][10], a neuromorphic algorithm designed to allow the construction of viable, highly robust front-end processors for image recognition systems [11].
The increased computing capabilities of graphics processing units (GPUs) in recent years have extended their use as accelerators in many areas, such as scientific simulation, computer vision, bioinformatics, cryptography, and finance, among others. This growth is largely due to impressive performance rates. For example, one of the latest GPUs from Nvidia, the GTX 680, achieves about three teraflops in single precision with 1536 cores and incorporates the newer Kepler architecture. Current trends indicate that this capacity will grow even more with the incorporation of 22 and 28 nm technologies. Recently, for example, AMD announced its Radeon 8000 Series, branded as Sea Islands, and Intel is manufacturing Knights Corner products. However, key points that dramatically affect performance rates include the efficient use of the memory hierarchy and the exploitation of parallelism capabilities.
The increasing amount of information to be processed also plays a role, because the use of these devices as accelerators is limited by DDR memory restrictions. To address this problem, research such as [12] has proposed data-reuse alternatives that aim to minimize the memory traffic between GPU and CPU. Another approach, in the field of mesh rendering, can be found in [13]: a solution that uses more memory-efficient algorithms alongside other techniques based on simplification or information compression. The GPU memory reduction proposed here is addressed in a motion estimation scenario which, to the best of our knowledge, has no existing solution in the current literature.
In a previous study [14], we developed a GPU-based McGM implementation. In the present article, we present an efficient solution for dense and robust per-pixel motion estimation that addresses the GPU memory consumption limiting GPU viability.
This article is organized as follows: Section 2 describes the specific neuromorphic model; Section 3 presents the motivation for this study, where multiobjective optimization is used; and Section 4 shows performance and visual results. Finally, Section 5 concludes with the main contributions of this study.

Multichannel gradient model (McGM)
The original algorithm was proposed by Johnston et al. We have followed Johnston's description of the McGM model [9,10] while adding many specific variations to improve the viability of the GPU implementation, as we will discuss in later sections. Figure 1 shows a simplified scheme of the processing pipeline.

Stage I. temporal filtering
Taking as a starting point the study by Hess and Snowden [15] on temporal processing in humans, we model three different temporal channels: one low-pass filter and two band-pass filters with central frequencies of 10 and 18 Hz, respectively. These channels can be implemented by differentiating a Gaussian in the log-time domain (Equation (1)).
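As an illustration, the three channels can be sketched as a sampled Gaussian in log-time plus its first two log-time derivatives, which behave as band-pass filters. The parameterization below (mu, sigma, fps) is our own illustrative choice, not the fitted values of [15]:

```python
import numpy as np

def temporal_channels(n_frames=15, fps=50.0, mu=0.1, sigma=0.6):
    """Three temporal channels as Gaussian derivatives in log-time.
    mu and sigma are illustrative values, not the model's fitted parameters."""
    t = np.arange(1, n_frames + 1) / fps           # frame times, t > 0
    u = (np.log(t) - np.log(mu)) / sigma           # log-time coordinate
    g = np.exp(-u**2 / 2.0)                        # low-pass: Gaussian in log-time
    d1 = -u * g                                    # 1st log-time derivative: band-pass
    d2 = (u**2 - 1.0) * g                          # 2nd log-time derivative: band-pass
    g /= g.sum()                                   # unit DC gain for the low-pass channel
    d1 -= d1.mean()                                # zero DC gain for the band-pass channels
    d2 -= d2.mean()
    return g, d1, d2
```

The normalization step reflects the usual convention for smoothing versus differentiating kernels: the low-pass channel preserves the mean intensity, while the band-pass channels respond only to temporal change.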

Stage II. spatial filtering
In the space domain, the shape of the receptive fields of the primary visual cortex can be modeled either by using Gabor functions, where the impulse responses are defined by harmonic functions multiplied by a Gaussian, or by using a set of Gaussian derivatives [16]. The Gaussian is a unique function in many ways and is of particular importance in biology. As the differentiation order increases, the Gaussians become tuned to higher spatial frequencies. In this way, a range of independent channels is constructed.
The nth Gaussian derivative can be expressed as a Hermite polynomial multiplied by the original Gaussian:

$$G_n(x) = \left(\frac{-1}{\sigma\sqrt{2}}\right)^{n} H_n\!\left(\frac{x}{\sigma\sqrt{2}}\right) G_0(x), \qquad G_0(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2}$$

(where σ is the standard deviation of the Gaussian, H_n is the nth Hermite polynomial, and the scale factor of G_0 ensures the function integrates to unity).
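This identity can be sketched directly in code. The kernel builder below uses the physicists' Hermite recurrence; the support radius and the discrete sampling convention are our assumptions:

```python
import numpy as np

def gaussian_derivative_kernel(n, sigma, radius=15):
    """Sampled n-th Gaussian derivative via the Hermite identity
    G_n(x) = (-1/(sigma*sqrt(2)))**n * H_n(x/(sigma*sqrt(2))) * G_0(x)."""
    x = np.arange(-radius, radius + 1, dtype=float)
    g0 = np.exp(-x**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))
    u = x / (sigma * np.sqrt(2.0))
    # Physicists' Hermite polynomials: H_0 = 1, H_1 = 2u, H_{k+1} = 2u H_k - 2k H_{k-1}
    h_prev, h = np.ones_like(u), 2.0 * u
    if n == 0:
        hn = h_prev
    elif n == 1:
        hn = h
    else:
        for k in range(1, n):
            h_prev, h = h, 2.0 * u * h - 2.0 * k * h_prev
        hn = h
    return (-1.0 / (sigma * np.sqrt(2.0)))**n * hn * g0
```

A quick sanity check is that the n = 1 kernel matches a finite difference of the n = 0 kernel, which is how the identity is typically validated in practice.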

Stage III. steering filtering
The steering stage projects the spatio-temporal filters calculated in the previous stages onto different orientations. Calling n and m the derivative orders in the x and y directions, respectively, θ the projection angle, and D the derivative operator, the steered filter is derived as a linear combination of filters belonging to the same-order basis:

$$D^{\theta}_{n,m} = (\cos\theta\, D_x + \sin\theta\, D_y)^{n}\,(\cos\theta\, D_y - \sin\theta\, D_x)^{m}$$
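For the first-order case this reduces to the familiar combination cos(θ)·G_x + sin(θ)·G_y, so no rotated filter ever needs to be computed explicitly. A minimal numerical check (with illustrative sigma and support) might look like:

```python
import numpy as np

def gaussian_2d(sigma=2.0, radius=10):
    ax = np.arange(-radius, radius + 1, dtype=float)
    xx, yy = np.meshgrid(ax, ax)
    return xx, yy, np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))

def steered_first_derivative(theta, sigma=2.0, radius=10):
    """First-order steering: the derivative along direction theta is the
    linear combination cos(theta)*G_x + sin(theta)*G_y of the basis filters."""
    xx, yy, g = gaussian_2d(sigma, radius)
    gx = -xx / sigma**2 * g        # analytic dG/dx
    gy = -yy / sigma**2 * g        # analytic dG/dy
    return np.cos(theta) * gx + np.sin(theta) * gy
```

Because the Gaussian is isotropic, the steered combination coincides with the analytic derivative taken along the rotated axis x' = x cos(θ) + y sin(θ).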

Stage IV. Taylor truncation
At this stage, a truncated Taylor expansion is performed using each oriented filter previously calculated. This structure robustly gathers the spatio-temporal information of the sequence, approximating a generic pixel by the set of derivatives of its neighborhood:

$$I(x+\delta x,\, y+\delta y,\, t+\delta t) \approx \sum_{i+j+k \le N} \frac{\delta x^{i}\,\delta y^{j}\,\delta t^{k}}{i!\,j!\,k!}\, \frac{\partial^{\,i+j+k} I}{\partial x^{i}\,\partial y^{j}\,\partial t^{k}}$$

The three directions of the Taylor expansion are assembled into one large image using the complete set of basis filter responses. Following the original model [9], the expansion is truncated after the third order in the primary direction and the second order in the orthogonal and temporal directions.

Stage V. quotients
This is the last stage of the common pathway. The following stages implement the modulus and phase estimation with separate expressions. The goal here is to compute a quotient of every component of the sextet.

Stage VI. velocity primitives
The previous stages encode the visual information as a Taylor representation of each pixel, and the speed is calculated for a range of orientations in order to simulate the orientation columns found in the striate cortex [9]. This is accomplished by rotating the coordinate system and the Gaussian derivative filters (Steering stage) to a number of primary directions. Next, the speed measurements, parallel and orthogonal, are taken along the primary directions in order to yield a vector of speed measurements whose components are the speed s and the orthogonal speed s⊥. The raw speed measurements are also conditioned by including the image-structure measurements XY/XX and XY/YY, which yield the final conditioned speed vectors; θ runs over the number of orientations at which speed is evaluated. The inverse speed (ŝ, ŝ⊥) is also calculated; it is evaluated using different terms from those used to compute speed, and so constitutes an additional independent measurement. Finally, the motion modulus is calculated through a quotient of determinants, and the direction of motion is extracted by calculating a measurement of phase combined across all speed-related measurements, since they are in phase:

$$\text{phase} = \arctan\frac{(s+\hat{s})\sin\theta + (s_{\perp}+\hat{s}_{\perp})\cos\theta}{(s+\hat{s})\cos\theta - (s_{\perp}+\hat{s}_{\perp})\sin\theta} \qquad (10)$$

Multi-criteria motivation for tuning McGM
The potential benefits of GPUs in the McGM context have been explored in the literature [14], where the authors studied the viability of these novel devices. Throughput results with respect to a single CPU were satisfactory in terms of performance, achieving ×32 speedups for 256×256 resolution movies. We would like to emphasize that this particular GPU-based motion estimation scheme is a competitive alternative in terms of Mpixel/s compared to other systems used for such motion estimation algorithms. However, the algorithm's features create a bottleneck: memory requirements increase at each stage, with an upward trend, and this disadvantage limits GPU viability. In the largest memory-usage configuration considered in [14], 3.5 GB of global memory was used, which was close to the capacity limit of a single GPU. Although GPU memory capacity is greater nowadays, this problem persists for larger input resolutions. The scope of this study is to explore mechanisms to reduce the amount of data without losing efficiency or accuracy. To highlight the memory handicap of GPUs, Table 1 summarizes the performance observed with the McGM algorithm on a graphics device compared with a single CPU. The performance of each stage of the algorithm is shown in Mpixel/s (Mpps), together with the global throughput for a particular model configuration: three temporal derivatives, a temporal convolution window of 15 frames, five spatial derivatives, a spatial separable convolution window of 31 pixels, and 12 steered angles. Moreover, as shown in this table, the GPU implementation amply fulfilled real-time requirements in all of the resolution configurations considered; this is further shown in the last column, which corresponds to overall performance measured in frames per second (fps).
While general-purpose processors can only reach real-time rates for small video resolutions, GPU-based systems enable higher-resolution movies provided that enough DDR memory capacity is available.
To reduce the algorithm's memory consumption, one particular solution is to avoid storing some of the temporary data computations, recalculating them when necessary at the expense of reduced throughput under real-time conditions. The most memory-demanding stages of the McGM algorithm are the Spatial Filtering and Steering stages. On the one hand, an efficient way to reduce memory needs is to perform the Steering stage with fewer θ angles, at the expense of accuracy degradation. On the other hand, it is possible to use a numerical derivative [17] instead of its Gaussian counterpart in the Spatial Filtering stage in order to allow faster derivative recalculation. This alternative scheme avoids storing intermediate computations, saving a huge amount of memory, and recalculates them whenever the data are needed. A simple numerical differentiating filter can be used thanks to the commutative and associative properties of convolution:

$$I \otimes (G_0 \otimes D_x) = (I \otimes G_0) \otimes D_x$$

The number of operations performed in (I ⊗ G₀) ⊗ D_x is smaller than in the Gaussian derivative filtering, making the convolution process faster. Table 2 shows the error in approximating G₀ ⊗ D_x in order to evaluate the accuracy degradation; filter degradation denotes the difference between D_G and D_N, the Gaussian and numerical derivative filters, respectively. As can be appreciated, the loss of accuracy is not significant for 9-31 pixel filters, reaching a maximum error of 3%. A priori, we may conclude that performing numerical derivatives rarely creates considerable error.
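A small numerical sketch of this trade-off (the signal length, sigma, and filter support are illustrative choices, not the paper's configuration) shows both the convolution identity and the closeness of the cheap central difference to full Gaussian-derivative filtering:

```python
import numpy as np

def conv_same(a, k):
    # 1-D 'same'-size convolution with zero padding; enough for this sketch
    return np.convolve(a, k, mode='same')

rng = np.random.default_rng(0)
signal = rng.standard_normal(256)

sigma = 3.0
x = np.arange(-15, 16, dtype=float)
g0 = np.exp(-x**2 / (2.0 * sigma**2))
g0 /= g0.sum()                      # smoothing kernel G0
g1 = -x / sigma**2 * g0             # analytic Gaussian derivative kernel
dx = np.array([0.5, 0.0, -0.5])     # O(h^2) central-difference filter D_x

# Filtering the smoothed signal with D_x matches filtering with the combined
# kernel (G0 conv D_x): convolution is associative and commutative.
lhs = conv_same(conv_same(signal, g0), dx)
rhs = conv_same(signal, conv_same(g0, dx))

# The cheap central difference on the smoothed signal also approximates the
# full Gaussian-derivative filtering with g1.
exact = conv_same(signal, g1)
```

Only the smoothed image needs to be stored; the three-tap `dx` filter is far cheaper to reapply than the 31-tap Gaussian derivative, which is the memory-for-recomputation trade the text describes.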
Although the degradation in filter accuracy is unimportant, an experiment comparing motion estimation degradation was carried out to evaluate the loss of accuracy of the overall algorithm. As benchmarks, we used two synthetic sequences widely accepted in this context: the 'diverging tree' and the 'translating tree', both created by David Fleet at the University of Toronto [18]. The 'diverging tree' shows the expansive motion of a tree (camera zoom) with an asymmetric velocity range depending on pixel position (null at the central focus, and 1.4 and 2 pixels/frame at the left and right boundaries, respectively). The 'translating tree' shows the translational motion of a tree with an asymmetric velocity range depending on pixel position (zero to 1.73 and zero to 2.3 pixels/frame at the left and right borders, respectively). As the error metric, we used Barron's angular error [19], considered one of the most accepted metrics in the specialized literature.
Barron's metric (Equation (11)) measures the deviation from the correct space-time orientation, with the velocity represented as a 3D unit direction vector:

$$\mathbf{v} = \frac{1}{\sqrt{u^2+v^2+1}}\,(u,\, v,\, 1)^{T} \qquad (11)$$

This vector wraps both modulus (speed) and phase (direction) into a single value and reduces the impact of directional errors at small velocities. Since the vector is self-normalized, the angle between the measured velocity v_e and the correct one v_c is given by

$$\psi_E = \arccos(\mathbf{v}_c \cdot \mathbf{v}_e) \qquad (12)$$

This error measurement is calculated for every pixel for which a velocity measurement was recovered. Table 3 shows the error in Barron's angle when a numerical derivative is used instead of its Gaussian counterpart in spatial filtering, combined with a significant reduction of the θ angles in the Steering stage.
O(h), O(h²), O(h³), and O(h⁴) denote the observed Barron angle error when using numerical derivatives of first-, second-, third-, and fourth-order accuracy, respectively. #θ denotes the maximum number of θ angles projected in the Steering stage; the table shows the impact of halving or quartering the number of θs.
As observed, the 'diverging tree' experiment behaves reasonably well with numerical derivatives, whose impact shrinks as the order of accuracy grows. Nevertheless, in the 'translating tree' experiment, the algorithm is more vulnerable to numerical derivatives than to the variation in the number of angles. Owing to the disparity observed in Table 3, it is advisable to explore the space of feasible solutions for any set of input data. Given the large number of parameters to configure, some relative to the McGM algorithm and others to the available resources, genetic algorithms (GAs) can be useful to reduce this time-consuming exploration.

Multi-criteria optimization description
The use of GAs arises from the non-viability of exhaustively exploring such a large solution space. In our context, the target is to find a compromise that reduces the GPU's memory usage with negligible accuracy degradation, allowing the motion estimation system to adapt itself to appreciable environmental changes in a reasonable time.
The goal of multi-objective optimization [20] is to simultaneously optimize several objectives that may conflict, so trade-offs among the variables involved need to be considered. In our context, we consider the following three-objective minimization problem:

$$\min_{\mathbf{x} \in X} \; \mathbf{z} = F(\mathbf{x}) = \big(f_1(\mathbf{x}),\, f_2(\mathbf{x}),\, f_3(\mathbf{x})\big)$$

where z is the objective vector with three objectives to be minimized: execution time f₁, memory usage f₂, and loss of accuracy f₃; x is the decision vector; and X is the feasible region in the decision space, which corresponds to all possible McGM configurations with respect to the derivative decisions and the number of angles involved. In GA terminology, x corresponds to a chromosome, which in our context comprises:
• D_x, the way each derivative is computed in Spatial Filtering. This information is stored in a two-dimensional array whose values determine whether a derivative is computed by means of Gaussian or order-n numerical differentiation; the array position is related to the derivative order.
• The number of θ angles to be performed in the Steering stage, encoded as an integer.
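A possible encoding of such a chromosome can be sketched as below; all names, the mode constants, and the bounds are illustrative assumptions, not the paper's exact representation:

```python
import numpy as np
from dataclasses import dataclass

# Hypothetical encoding constants (ours, for illustration):
MAX_DERIV = 5                          # spatial derivative orders used by the model
GAUSSIAN, NUM_O2, NUM_O4 = 0, 1, 2     # Gaussian vs. numerical O(h^2) / O(h^4) derivative

@dataclass
class Chromosome:
    deriv_mode: np.ndarray   # (MAX_DERIV+1, MAX_DERIV+1): how derivative (n, m) is computed
    n_angles: int            # number of theta angles performed in the Steering stage

def random_chromosome(rng, max_angles=12):
    """Sample a random decision vector x from the feasible region X."""
    modes = rng.integers(0, 3, size=(MAX_DERIV + 1, MAX_DERIV + 1))
    return Chromosome(modes, int(rng.integers(3, max_angles + 1)))
```

Each chromosome maps to one McGM configuration, and the three objectives (time, memory, error) are obtained by actually running the pipeline with that configuration.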

Our multi-GPU implementation
Over the last few years, a great number of multi-objective evolutionary algorithms have been developed [21][22][23]; a review can be found in the tutorial [24], where the authors present their most relevant features. For this study, we have chosen NSGA-II [25] for the following advantages:
• Weights are not required, so it is not necessary to study the impact of each f_i(x) and assign one.
• Its computational complexity, O(MN²) for M objectives and N individuals, is lower than that of comparable algorithms.
• Its good behavior and ability to find a set of solutions near the true Pareto-optimal front in few iterations.
• It is widely used and amply tested.
NSGA-II is based on a fast non-dominated sorting procedure with a crowded-distance estimation and a simple crowded-comparison operator [25]. The algorithm can be summarized in the following stages:
1. Initially, a random population pop is created.
2. The population is sorted based on the non-domination scheme.
3. A fitness is assigned, meaning every individual of the population is ranked into levels; the first level, or Pareto front, is the most preferable.
4. A binary tournament selection and combination is carried out.
5. A mutation phase is performed.
The fast non-dominated sorting is the most computationally expensive part of the GA, because it involves ranking every individual of the population. We therefore perform this task entirely on multi-GPUs, which is more efficient than using a CPU from a computational point of view. Most GA operators are executed on the CPU due to their low computational demand.
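The ranking idea of steps 2-3 can be illustrated with a simple O(N²) non-dominated sort. NSGA-II's actual procedure is the faster bookkeeping variant described in [25], but the fronts it produces are the same:

```python
def non_dominated_sort(objectives):
    """Rank candidate solutions of a minimization problem into Pareto fronts.
    objectives: list of (f1, f2, ...) tuples. Returns a list of fronts, each
    a list of indices into `objectives`."""
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    n = len(objectives)
    dominated_by = [0] * n                 # how many solutions dominate i
    dominating = [[] for _ in range(n)]    # solutions that i dominates
    for i in range(n):
        for j in range(n):
            if i != j and dominates(objectives[i], objectives[j]):
                dominating[i].append(j)
                dominated_by[j] += 1

    fronts, current = [], [i for i in range(n) if dominated_by[i] == 0]
    while current:
        fronts.append(current)
        nxt = []
        for i in current:
            for j in dominating[i]:
                dominated_by[j] -= 1
                if dominated_by[j] == 0:
                    nxt.append(j)
        current = nxt
    return fronts
```

The first front contains the individuals no one dominates (the current Pareto approximation); removing it and repeating peels off the subsequent levels used for fitness assignment.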
Ranking an individual of the population means running the McGM algorithm with that chromosome's configuration: computing the derivatives in the Spatial Filtering stage and the number of angles in the Steering stage accordingly. Several levels of parallelism are exploited: a coarser level, where the non-dominated sorting is evaluated in parallel on several GPUs, and a finer level, by means of the data parallelism available in each stage of the McGM algorithm. Algorithm 1 summarizes our parallel implementation, where pop_size, ngens, and %mutation are GA input parameters corresponding to population size, number of generations, and mutation probability, respectively. The OpenMP paradigm is used to distribute the non-dominated sort across multiple devices by means of #pragma omp parallel for directives. Our implementation generates Pareto-optimal solutions as a set of points (motion estimation execution time, pixel accuracy error, GPU memory usage). This feature allows choosing one of the best solutions according to the available computational resources, favoring dynamic tuning depending on current conditions.
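The coarse-grained distribution can be sketched as follows. The paper uses OpenMP in C/C++; this Python analogue with a thread pool, and the placeholder `evaluate_on_gpu` function, are purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

N_GPUS = 4   # devices available; the paper drives one OpenMP thread per GPU

def evaluate_on_gpu(individual, gpu_id):
    """Hypothetical stand-in for running the McGM pipeline on device gpu_id
    and returning the objective triple (time, memory, error). A real version
    would select the device (e.g., cudaSetDevice) and launch the pipeline."""
    return (sum(individual), len(individual), gpu_id)   # placeholder objectives

def rank_population(population):
    # Round-robin assignment of individuals to devices, mirroring the
    # '#pragma omp parallel for' distribution over the population.
    with ThreadPoolExecutor(max_workers=N_GPUS) as pool:
        futures = [pool.submit(evaluate_on_gpu, ind, i % N_GPUS)
                   for i, ind in enumerate(population)]
        return [f.result() for f in futures]
```

Because fitness evaluations are independent, this scheme keeps all devices busy with a well-balanced workload, which is what the scalability results in Section 4 rely on.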

Work environment
The systems used are based on Tesla technology. The first consists of 2 Intel Xeon E5645 six-core processors (2.40 GHz with 12 MB cache memory and Hyper-Threading technology) and 2 Tesla M2070 GPUs. The second is equipped with 2 quad-core Intel Xeon E5530 processors (2.40 GHz with 8 MB cache memory and Hyper-Threading technology), connected to 4 Tesla C1060 GPUs. In both cases, the operating system is Debian with a 2.6.38 kernel; the compiler is GNU g++ v4.4 with the flags -O3 -m64 -fopenmp, and CUDA C/C++ SDK v4.2 with the -O3 -fopenmp -arch=sm_20/sm_13 flags enabled.
The system based on the Tesla M2070 incorporates Fermi technology, but due to the scarce number of devices available, a scalability study was completed with the system based on 4 Tesla C1060 GPUs, which allows parallel-efficiency projections for more modern systems.

Multicriteria results
Multi-objective GAs are used to look for optimal solutions in a huge search space. In our context, they are employed to obtain a set of optimal solutions that reduce the GPU's memory usage in the McGM algorithm without significant loss of accuracy in the motion estimation scenario. As previously mentioned, the tests were performed using the 'diverging tree' and 'translating tree' benchmarks, which are widely accepted in this area.
The first experiment evaluated both the convergence of the GA and the set of optimal solutions reached. For this purpose, a Euclidean distance metric between consecutive solutions was employed, as described in [25]. The implemented GA incorporates a stop condition based on this Euclidean metric: when it remains invariant for a certain number of iterations, the non-dominated solutions are assumed to have converged to the optimal Pareto front. Figure 2 shows the evolution of the set of non-dominated solutions throughout the iterations with a severe stop condition. To facilitate visualization, only the GPU's memory reduction and the error difference are included, although the GA also optimizes the motion estimation time. Barron's angle error ψ_E corresponds to the difference of the mean Barron error with respect to the original McGM counterpart.
In this experiment, the population size was fixed at 500 with a 1% mutation probability. The results indicate that, after a certain number of generations, the GA barely improves the non-dominated solutions, although it reports new pairs. Population size only affects the final execution time, with optimal solutions of similar quality being reached. Empirically, a 1% mutation rate gave the best GA performance: higher mutation rates introduce significant variations between consecutive generations, so more generations are needed to reach the convergence criterion. In particular, greater mutation rates increased the number of iterations by between 15 and 320%.
As shown in Figure 2, the optimal solutions achieve a significant reduction in memory requirements, and for the 'translating tree' benchmark they are even more accurate than the original McGM algorithm. Table 4 shows the GA time, measured in seconds on the Tesla M2070 system, for the best GA configuration: 1% mutation rate, 500 individuals, and a severe convergence condition to find the Pareto front. The 'diverging tree' was used as the benchmark, although similar performance rates were observed with the 'translating tree'; note that the benchmark choice only affects the number of generations processed to reach an optimal solution. As expected, the fitness evaluation is by far the most costly stage of the GA. The information exchange overhead between host and devices is not significant, which yields satisfactory speedups of ×1.79 for 2 GPUs. Analogous results were obtained on the system with a larger number of graphics devices: Table 5 shows even higher accelerations when 2 GPUs are enabled. Furthermore, scalability remains satisfactory with 4 GPUs, achieving ×3.71 speedups. These computational results show that our multi-GPU implementation is efficient in terms of scalability (95 and 93% parallel efficiency using 2 and 4 GPUs, respectively), and the tendency indicates that GA convergence times would be even lower if more computational resources were available. We can conclude that this successful scalability, due to both a well-balanced workload and the low overhead involved in data exchange, makes GAs useful for solving problems of this nature.

Multi-GPU results
Moreover, the use of multiple levels of parallelism yields multiplicative accelerations: first, the speedups achieved in the multi-GPU system, up to ×3.71 with 4 GPUs enabled; and second, the up to ×32 acceleration achieved by exploiting the data parallelism within a single GPU. On the one hand, the combination of both accelerations reduces the exploration time needed to reach an optimal solution by 99.2% compared with a general-purpose processor. On the other hand, the use of a multi-GPU system not only reports greater FLOPS rates than a CPU but is also beneficial in terms of power consumption (MFLOPS/watt).
Moreover, although GA search times are significant, suboptimal solutions that meet the response time or resource consumption requirements are available early and are gradually refined as the GA evolves. This feature, coupled with the possibility of reducing the population size, produces an impressive decrease in simulation times and opens the possibility of building an intelligent system that auto-corrects and adapts depending on specific requirements or substantial environment changes.

Visual results
Finally, visual results are presented for both benchmarks considered. Figure 3 shows the main differences in motion estimation outputs for the 'diverging tree' benchmark compared to the original algorithm. A configuration using 75% of the memory maintains the original precision; however, the configuration that reduces memory usage by 50% degrades the accuracy by 22% with a speedup of ×3.3.
For the 'translating tree' benchmark, a solution with half the memory requirements is even more accurate (Barron's error is 0.13 radians lower than the original McGM) and ×3.5 faster.
Despite the Barron metric's popularity in the motion estimation context, some authors [26,27] point out specific shortcomings due to its asymmetry and its bias for large flow vectors.

Other error metrics
Although Barron's metric [19] is probably the most used in the motion estimation scope, other metrics used by the machine vision community should be taken into account in order to enhance the visibility and generality of the results obtained.
Otte and Nagel [26] remarked on the asymmetry and bias of Barron's metric for extensive optical-flow vectors. Based on this drawback, they proposed a new metric that accounts for the magnitude difference between the two-dimensional ground-truth flow vector (ofv_c) and the estimated one (ofv_e), as shown in Equation (14):

$$E_{\text{mag}} = \lVert \mathbf{ofv}_c - \mathbf{ofv}_e \rVert \qquad (14)$$

McCane et al. [27] claim this is not sufficient, since it discounts error in regions of small flow, and propose two metrics to overcome this problem. The first is the angle difference between the correct three-dimensional vector v_c and the estimated one v_e used in Barron's metric (Equation (15)), but with the third component replaced by δ. In our experiments we assign δ = 0.75; this threshold makes the error less significant in zones of small flow than in zones of large flow.
An additional metric is the normalized magnitude of the vector difference between the estimated and correct flow vectors. The normalization factor is the magnitude of the correct flow, and the effect of small flows is taken into account using a significance threshold T, as shown in Equation (16); T is chosen to be 0.5 pixels, so that near-zero correct flows yield a bounded normalized error rather than one that diverges. Figure 5 shows the trend of the NSGA-II algorithm considering the McCane and Otte&Nagel metrics. Analogously to Section 4.2, the 'translating tree' and 'diverging tree' sequences are also used as benchmarks.
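These three metrics can be transcribed as follows; since Equations (14)-(16) are not reproduced here, the exact treatment of δ and T is our reading of [26,27], and the function names are ours:

```python
import numpy as np

def otte_nagel_error(ofv_e, ofv_c):
    """Magnitude of the 2-D flow-vector difference (our reading of Eq. (14))."""
    return float(np.linalg.norm(np.asarray(ofv_e, float) - np.asarray(ofv_c, float)))

def mccane_angle_error(ofv_e, ofv_c, delta=0.75):
    """Barron-style angle with the third component replaced by delta (Eq. (15));
    a larger delta makes errors in small-flow zones less significant."""
    ve = np.array([ofv_e[0], ofv_e[1], delta], float)
    vc = np.array([ofv_c[0], ofv_c[1], delta], float)
    ve /= np.linalg.norm(ve)
    vc /= np.linalg.norm(vc)
    return float(np.arccos(np.clip(ve @ vc, -1.0, 1.0)))

def mccane_magnitude_error(ofv_e, ofv_c, T=0.5):
    """Normalized magnitude difference (our reading of Eq. (16)): normalize by
    the correct-flow magnitude, clamped below by the threshold T (0.5 pixels)
    so that near-zero flows do not blow the error up."""
    diff = np.linalg.norm(np.asarray(ofv_e, float) - np.asarray(ofv_c, float))
    return float(diff / max(float(np.linalg.norm(np.asarray(ofv_c, float))), T))
```

Evaluating all three per pixel, as in Figure 5, gives complementary views: the angle metric discounts small flows, while the normalized magnitude metric keeps their contribution bounded.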
For the sake of clarity, Table 6 summarizes the main successful configurations reported by the GA execution. The results observed are consistent regardless of the metric used: while for the 'translating tree' motion estimation effectiveness improves with a significant reduction in memory requirements, for the 'diverging tree' no degradation is observed at 75% memory usage for any of the metrics. From the viewpoint of execution times, the performance results are as expected. On the one hand, a 50% reduction in memory requirements translates into speedups from ×3.3 to ×4; on the other hand, 75% memory consumption yields an average 50% reduction in motion estimation execution time.

Conclusions
A new, highly parallel approach has been presented to overcome the GPU memory usage problems that occurred in our previous implementation of a well-known neuromorphic motion estimation algorithm. This context provides the main motivation for using evolutionary algorithms to solve multi-criteria optimization problems. The use of GAs on a multi-GPU scheme allowed a quick exploration of the feasible solutions for any set of input data. The choice of NSGA-II is motivated by the good results observed in few iterations and a near-optimal Pareto front.
From the viewpoint of reaching a solution that meets the memory consumption requirements, we observed:
• For the 'diverging tree', a configuration using 75% of the memory returns the same precision for all the metrics considered and 50% of the McGM execution time compared to the original algorithm. A configuration that reduces memory usage by 50% degrades the accuracy by 15 to 25%, with speedups that vary from ×2.8 to ×3.5.
• For the 'translating tree', a configuration with half the memory requirements is more accurate in terms of error and between ×3.3 and ×4 faster.
From the point of view of multi-GPU efficiency, we observed:
• Successful speedups of ×3.71 are achieved when four GPUs are enabled.
• Our implementation is a scalable approach, due to both a well-balanced workload and low-impact communication between host and device.
• A multiplicative effect: the ×3.71 speedup in the multi-GPU system combines with the ×32 acceleration obtained by exploiting the data parallelism within each GPU, reducing the GA time to reach an optimal solution by 99.2% compared with a CPU.
• An alternative to be considered in terms of power consumption (MFLOPS/watt).
Because of these encouraging results, it is possible to build an intelligent system that auto-corrects and adapts, as the GA evolves, to specific requirements or environmental condition variations.
Future lines of work are based on reusing this system with an environment predictor, opening the possibility of real-time execution and self-reconfiguration depending on the external constraints and the resources available in the platform. This system is expected to contribute to the new machine vision trends, useful for many real-world applications.