- Research
- Open Access

# Experimental study of image representation spaces in variational disparity calculation

- Jarno Ralli^{1} (Email author)
- Javier Díaz^{1}
- Pablo Guzmán^{1}
- Eduardo Ros^{1}

**2012**:254

https://doi.org/10.1186/1687-6180-2012-254

© Ralli et al; licensee Springer. 2012

**Received:** 16 September 2011 **Accepted:** 13 November 2012 **Published:** 10 December 2012

## Abstract

Correspondence techniques start from the assumption, based on the Lambertian reflection model, that the apparent brightness of a surface is independent of the observer’s angle of view. From this, a grey value constancy assumption is derived, which states that a change in brightness of a particular image pixel is proportional to a change in its position. This constancy assumption can be extended directly for vector valued images, such as RGB. It is clear that the grey value constancy assumption does not hold for surfaces with a non-Lambertian behaviour and, therefore, the underlying image representation is crucial when using real image sequences under varying lighting conditions and noise from the imaging device. In order for the correspondence methods to produce good, temporally coherent results, properties such as robustness to noise, illumination invariance, and stability with respect to small geometrical deformations are all desired properties of the representation. In this article, we study how different image representation spaces complement each other and how the chosen representations benefit from the combination in terms of both robustness and accuracy. The model used for establishing the correspondences, based on the calculus of variations, is itself considered robust. However, we show that considerable improvements are possible, especially in the case of real image sequences, by using an appropriate image representation. We also show how optimum (or near optimum) parameters, related to each representation space, can be efficiently found.

## Keywords

- Mean Square Error
- Input Image
- Image Noise
- Regularisation Term
- Image Representation

## Introduction

The optical flow constraint [1], based on the Lambertian reflection model, states that a change in the brightness of a pixel is proportional to a change in its position, i.e., the grey level of a pixel is assumed to stay constant over time. This same constancy concept can also be used in disparity calculation by taking into account the epipolar geometry of the imaging devices (e.g., a stereo-rig). The grey level constancy, which does not hold for surfaces with a non-Lambertian behaviour, can be extended to vector valued images with different image representations. In this study, we use a method based on the calculus of variations for approximating the disparity map. Variational correspondence models typically have two terms: the first is a data term (e.g., based on the grey level constancy), while the second is a regularisation term used to make the solution smooth. In order to make the data term more robust with respect to non-Lambertian behaviour, different image representations can be used. Some of the problems in establishing correspondences arise from the imaging devices (e.g., camera/lens parameters being slightly different, noise due to imaging devices) and some from the actual scene being observed (e.g., lighting conditions, geometrical deformations due to the camera setup). Typically, illumination differences and optics-related ‘errors’ are modelled as a multiplicative type of error, while the imaging device itself is modelled as a combination of both multiplicative and additive ‘errors’. It is clear that the underlying image representation is crucial in order for any correspondence method to generate correct, temporally coherent estimates in ‘real’ image sequences.

In this article, we study how combinations of different image representations behave with respect to both illumination errors and noise, ranking the results accordingly. We believe that such information is useful to the part of the vision community that concentrates on applications, such as obstacle detection in vehicle-related scenarios [2], segmentation [3], and so on. Although other authors address similar issues [4, 5], we find these to be slightly limited in scope due to a reduced ‘test bench’, e.g., a small number of test images or image representations. Also, in most cases, the way in which the parameters related to the model(s) have been chosen is not satisfactorily explained. Therefore, the main contribution of our article is an analysis of the different image representations supported by a more detailed and systematic evaluation methodology. For example, we show how optimum (or near optimum) parameters for the algorithm, related to each representation space, can be found. This is a small but important contribution in the case of real, non-controlled scenarios. The standard image representation is the RGB space; the others, obtained via image transformations, are: gradient, gradient magnitude, log-derivative, HSV, *rϕθ*, and the phase component of an image filtered using a bank of Gabor filters.

This work is a comparative study of the chosen image representations, and it is beyond the scope of this article to explain why certain representations perform better than others in certain situations. Under realistic illumination conditions, with surfaces both complying and not complying with the Lambertian reflection model, theoretical studies can become overly complex, as we show next. It is typically thought that chromaticity spaces are illumination invariant, but under realistic lighting conditions, this is not necessarily so [6]. One of the physical models that explains light reflected by an object is the Dichromatic Reflection Model [7] (DRM), which in its basic form assumes that there is a single source of light [7], which is unrealistic in the case of real images (unless the lighting conditions can be controlled). A good example of this is given by Maxwell et al. in their Bi-Illuminant Dichromatic Reflection article [6]: in a typical outdoor case, the main illuminants are sunlight and skylight, where fully lit objects are dominated by sunlight while objects in the shade are dominated by skylight. Thus, as the illumination intensity decreases, the hue of the observed colour becomes bluish. For the above mentioned reasons, chromatic spaces (e.g., HSV, *rϕθ*) are not totally illumination invariant under realistic lighting conditions. Therefore, in general, we do not speak of illumination invariance in this article but of illumination robustness, or of a robust image representation with respect to illumination changes and noise. By illumination error, we refer to varying illumination conditions between the left and right stereo cameras.

Next, we present the background material and related work, some sources of error, and the variational method. The remaining sections describe the proposed methodologies, results, and conclusions.

## Background material and related work

## Proposed methodologies

### Searching for optimal parameters with differential evolution

Since the main idea of this study is to rank the chosen image representation spaces with respect to robustness, we have to find an optimum (or near optimum) set of parameter vectors (*b*_{1}, *b*_{2}, *α*) for each case while avoiding over-fitting. As was already mentioned, using a human operator would be prone to bias. Therefore, we have decided to use a gradient-free, stochastic, population-based function minimiser called differential evolution^{b} (DE) [22, 23]. The rationale for using DE is that it has empirically been shown to find the optimum (or a near optimum), it is computationally efficient, and the cost-function evaluations can be efficiently parallelised. The principal idea behind DE is to represent the parameters to be optimised as vectors, where each vector is a population member whose fitness is described by the cost-function value. The population at time *t* is also known as a generation; thus, the system evolves with respect to an artificial time *t*, measured in cycles. Following the survival-of-the-fittest principle, the ‘fittest’ members contribute more to the coming populations, and their characteristics gradually displace those of the weaker members, thereby minimising (or maximising) the function value [22, 23]. Two members (parents) are stochastically combined into a new one (offspring), possibly with mutation, which then competes against the other members in the coming generations. In our case, a single member is a vector (*b*_{1}, *b*_{2}, *α*), while the cost-function value is the mean squared error (MSE) given by Equation (31).
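As an illustrative sketch (the article does not list which DE variant it uses), the classic DE/rand/1/bin scheme can be written as follows. The `pop_size` and `cycles` defaults match the values reported later in this section, while `F`, `CR`, and the seed are assumed values, not taken from the article:

```python
import numpy as np

def differential_evolution(cost, bounds, pop_size=25, cycles=20,
                           F=0.7, CR=0.9, seed=0):
    """Minimal DE/rand/1/bin minimiser (illustrative sketch).

    cost   : maps a parameter vector (e.g. (b1, b2, alpha)) to a scalar
    bounds : sequence of (low, high) pairs, one per parameter
    """
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    dim = len(bounds)
    # Initial population: each member is one candidate parameter vector.
    pop = rng.uniform(bounds[:, 0], bounds[:, 1], size=(pop_size, dim))
    fitness = np.array([cost(m) for m in pop])
    for _ in range(cycles):                       # one cycle = one generation
        for i in range(pop_size):
            # Pick three distinct members other than the target i.
            a, b, c = rng.choice([j for j in range(pop_size) if j != i],
                                 size=3, replace=False)
            # Mutation: donor from a scaled difference of two members.
            donor = np.clip(pop[a] + F * (pop[b] - pop[c]),
                            bounds[:, 0], bounds[:, 1])
            # Binomial crossover between target (parent) and donor.
            mask = rng.random(dim) < CR
            mask[rng.integers(dim)] = True        # at least one donor gene
            trial = np.where(mask, donor, pop[i])
            # Selection: the offspring replaces the parent only if fitter.
            f_trial = cost(trial)
            if f_trial <= fitness[i]:
                pop[i], fitness[i] = trial, f_trial
    best = int(np.argmin(fitness))
    return pop[best], fitness[best]
```

In the article's setting, `cost` would run the disparity calculation for one parameter vector and return the MSE of Equation (31); here it can be any scalar function of the parameters.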

In order to compare the results obtained using different combinations of the image representations, we adopt a strategy typically used in pattern recognition: the input set (a set of stereo-images) is divided into a learning, a validation, and a test set. The learning set is used for obtaining the optimal parameters, while the validation set is used to prevent over-fitting: during the optimisation, when the error for the validation set starts to increase, we stop the optimisation process, thereby keeping the solution ‘general’. This methodology is completely general and can be applied to any other image registration algorithm with only small modifications. DE itself is computationally efficient, the problem being that several function evaluations (one per population member) are needed per cycle. The following table displays the parameters related to the DE, allowing us to approximate the computational effort.

**DE parameters**

| Parameter | Value |
|---|---|
| Population members | 25 |
| Iterations | 20 |
| Training + validation sets | 15 + 5 |
| Image representations | 34 |
| Total | 340,000 |

(25 members × 20 iterations × 20 images × 34 representations = 340,000 cost-function evaluations.)

### Image transformations

Combinations of ∇*I*, HS(V), (*r*)*ϕθ*, PHASE, LOGD and |∇*I*| (together with RGB and RGBN) are tested, except |∇*I*| and PHASE+LOGD. The 34 tested combinations can be seen in Appendix 1, Table 2. As mentioned previously, Mileva et al. tested some of the same representations earlier in [4]. Some combinations were left out because of practical issues related to the computational time (see Section Searching for optimal parameters with differential evolution for more information related to the computational time). A preliminary ‘small scale’ experiment was conducted in order to decide which representations would be studied more carefully. In the following section, we briefly describe the different input representations under study.

**Tested illumination error types**

| Global | Local |
|---|---|
| Additive (GA) | Additive (LA) |
| Multiplicative (GM) | Multiplicative (LM) |
| Multiplicative and additive (GMA) | Multiplicative and additive (LMA) |

#### RGBN (normalized RGB)

Each channel of the RGB image is divided by a normalisation factor *N*. In our tests, both images are normalised using their own factor, ${N}_{i}=max({R}_{i},{G}_{i},{B}_{i})$, *i* being the image in question (e.g., the left or right image). The transformation is given by Equation (8).

RGBN is robust with respect to global multiplicative illumination changes.
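Reading ${N}_{i}$ as image *i*'s own global maximum over its R, G and B values (one scalar per image — our reading of Equation (8)), the transformation and its robustness to a global multiplicative change can be sketched as:

```python
import numpy as np

def rgbn(image):
    # N is this image's own normalisation factor: the maximum value
    # taken over all pixels of its R, G and B channels (our reading).
    return image.astype(float) / image.max()
```

Because a global multiplicative change scales both the channels and the factor *N* by the same constant, it cancels out in the ratio.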

#### ∇*I*

where the sub-index indicates the variable with respect to which the term has been differentiated. The gradient constancy term is robust with respect to both global and local additive illumination changes.
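As a minimal illustration of this additive robustness (a sketch, not the article's implementation), the gradient channels can be computed with finite differences; a constant offset added to the image then leaves them unchanged:

```python
import numpy as np

def gradient_representation(channel):
    """Spatial derivatives (I_x, I_y) of one image channel.

    A global additive illumination change I -> I + ga cancels out in
    the finite differences, so the representation is unaffected.
    """
    iy, ix = np.gradient(channel.astype(float))  # derivatives along rows, cols
    return ix, iy
```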

#### |∇*I*|

where the sub-index indicates the variable with respect to which the term has been differentiated. In general, this term is robust with respect to both local and global additive illumination changes.

#### HS(V)

where $min=min(R,G,B)$ and $max=max(R,G,B)$. As can be understood from (11), the H and S components are illumination robust, while the V component is not; therefore, V is excluded from the representation. In the rest of the text, HS(V) refers to the image representation with only the H and S components.
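Equation (11) is not reproduced above; the standard RGB→HS conversion it presumably refers to can be sketched as follows (hue in degrees; the exact conventions are an assumption on our part):

```python
import numpy as np

def hs_channels(rgb):
    """H and S of HS(V) from a float RGB image; V = max is discarded
    because it is not robust to illumination changes."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mx = np.max(rgb, axis=-1)
    mn = np.min(rgb, axis=-1)
    delta = mx - mn
    # S = (max - min) / max: a multiplicative change of RGB cancels out.
    s = np.where(mx > 0, delta / np.where(mx > 0, mx, 1.0), 0.0)
    h = np.zeros_like(mx)
    nz = delta > 0
    rmax = nz & (mx == r)                  # hue sector depends on which
    gmax = nz & (mx == g) & ~rmax          # channel attains the maximum
    bmax = nz & ~rmax & ~gmax
    h[rmax] = (60.0 * (g - b)[rmax] / delta[rmax]) % 360.0
    h[gmax] = 60.0 * (b - r)[gmax] / delta[gmax] + 120.0
    h[bmax] = 60.0 * (r - g)[bmax] / delta[bmax] + 240.0
    return h, s
```

Since both H and S are built from channel ratios, a global multiplicative illumination change leaves them unchanged.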

#### (*r*)*ϕθ*

While HSV describes a colour in a cylindrical coordinate system, *rϕθ* does so in a spherical one: *r* indicates the magnitude of the colour vector, while *ϕ* is the zenith and *θ* the azimuth, as in:

As can be observed from (12), both *ϕ* and *θ* are illumination robust, while the magnitude *r* is not; therefore, we exclude *r* from the representation. In the rest of the text, (*r*)*ϕθ* and *spherical* refer to an image representation based on *ϕ* and *θ*.
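Equation (12) is not quoted above; treating (R, G, B) as a vector, one common spherical-coordinate convention (an assumption on our part, not the article's exact definition) gives the two retained angles as:

```python
import numpy as np

def phi_theta(rgb):
    """Zenith ϕ and azimuth θ of the RGB colour vector; the magnitude r
    is discarded because it is not illumination robust."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mag = np.sqrt(r ** 2 + g ** 2 + b ** 2)
    safe = np.where(mag > 0, mag, 1.0)             # avoid division by zero
    phi = np.arccos(np.clip(b / safe, -1.0, 1.0))  # zenith, from the B axis
    theta = np.arctan2(g, r)                       # azimuth in the R-G plane
    return phi, theta
```

Both angles are ratios of channel values, so a global multiplicative change of the colour vector leaves them unchanged.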

#### LOGD

where the sub-index indicates the variable with respect to which the term has been differentiated. The log-derivative image representation is robust with respect to both additive and multiplicative local illumination changes.

#### PHASE

The reason for choosing the phase representation is threefold: (a) the phase component is robust with respect to illumination changes; (b) cells with a similar behaviour have been found in the visual cortex of primates [27], which might well mean that evolution has found this kind of representation to be meaningful (even if we might not be able to exploit it completely yet); and (c) the phase component is stable with respect to small geometrical deformations (as shown by Fleet and Jepson [28, 29]). The phase is a local component extracted from the subtraction of local values. Therefore, it does not depend on an absolute illumination measure, but rather on the relative illumination measures of two local estimations (which are subtracted). This makes the estimation robust against illumination artifacts (such as shadows, which increase or decrease local illumination but do not affect local ratios so dramatically). In a similar way, if noise (multiplicative or additive) affects a certain local region uniformly, on average the illumination ratio (on which phase is based) will be less affected than the absolute illumination value. The filtering stage with a set of specific filters can be regarded as band-pass filtering, since only the components that match the set of filters are allowed (or not discarded) for further processing. Gabor filters have specific properties that make them of special interest in general image processing tasks [30].

where *x* = (*x*, *y*) is the image position, *f*_{0} denotes the peak frequency, *θ* the orientation of the filter with respect to the horizontal axis, and *h*_{ c }(·) and *h*_{ s }(·) denote the even (real) and odd (imaginary) parts. The filter responses (band-pass signals) are generated by convolving an input image with a filter as in:

where *I* denotes an input image, ∗ denotes convolution, and *C*(*x*; *θ*) and *S*(*x*; *θ*) are the even and odd responses corresponding to a filter with orientation *θ*. From the even and odd responses, two different representation spaces can be built, namely phase and energy, as follows:

where *E*(*x*; *θ*) is the energy response and *ω*(*x*; *θ*) the phase response of a filter corresponding to an orientation *θ*. As can be observed from (15), the input image *I* can contain several components (e.g., RGB, HSV), each of which would be convolved independently to extract energy and phase. However, in order to keep the computation time reasonable, the input images are first converted into grey-level images, after which the filter responses are calculated. Therefore, the transformation can be defined by:

^{c}.
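The filter-bank equations are not reproduced above. A single-filter sketch of the transformation, with illustrative values (not the article's) for the peak frequency *f*_{0}, orientation *θ*, and Gaussian envelope width *σ*, could look like this; `filter_fft` implements a ‘same’-size linear convolution via the FFT:

```python
import numpy as np

def filter_fft(image, kernel):
    # Linear convolution via zero-padded FFTs, cropped to 'same' size.
    H, W = image.shape
    kh, kw = kernel.shape
    fh, fw = H + kh - 1, W + kw - 1
    full = np.fft.ifft2(np.fft.fft2(image, (fh, fw)) *
                        np.fft.fft2(kernel, (fh, fw)))
    return full[kh // 2:kh // 2 + H, kw // 2:kw // 2 + W]

def gabor_phase(image, f0=0.25, theta=0.0, sigma=4.0):
    """Even/odd responses C and S of one complex Gabor filter, and the
    derived energy E(x; θ) and phase ω(x; θ) representations."""
    half = int(3 * sigma)
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)         # rotated coordinate
    gauss = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    kernel = gauss * np.exp(1j * 2 * np.pi * f0 * xr)  # h_c + j * h_s
    resp = filter_fft(image.astype(float), kernel)
    C, S = resp.real, resp.imag                        # even and odd responses
    energy = np.sqrt(C ** 2 + S ** 2)                  # E(x; θ)
    phase = np.arctan2(S, C)                           # ω(x; θ)
    return phase, energy
```

Because both C and S scale linearly with the input, multiplying the image by a constant leaves the phase unchanged, which is the illumination robustness argued for above.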

### Induced illumination errors and image noise

**Tested image noise types**

| Luminance | Chrominance | Salt & pepper |
|---|---|---|
| mild (nLM) | mild (nCM) | mild (nSPM) |
| severe (nLS) | severe (nCS) | severe (nSPS) |

**Learn-, validation-, and test sets**

| Run | Learn | | | Test | Validation |
|---|---|---|---|---|---|
| 1 | Lampshade2 | Cloth1 | Rocks2 | Aloe | Bowling2 |
| | Baby3 | Reindeer | Baby2 | Baby1 | Laundry |
| | Cones | Plastic | Tsukuba | Books | Moebius |
| | Art | Wood1 | Rocks1 | Lampshade1 | Venus |
| | Dolls | Cloth3 | Cloth2 | Wood2 | Teddy |

From Table 3, we can observe that both global and local illumination errors are used. The difference between these types is that in the former case the error is the same for all positions, while in the latter the error is a function of the pixel position. In the illumination error case (both local and global), we apply the error to only one of the images. Especially in the local case, this simulates a ‘glare’.

^{d}.

It can be argued that the noise present in the images could be reduced or eliminated by a de-noising pre-processing step and that the study should thus be centred more on illumination-type errors. However, if any of the tested image representations proves to be sufficiently robust with respect to both illumination errors and image noise, fewer pre-processing steps would be needed. This would certainly be beneficial in real applications, which may suffer from restricted computational power.

Before describing the mathematical formulations for each of the models previously mentioned, we need to make certain definitions.

However, before going any further, we would like to point out that the illumination error and noise models are applied to the RGB input images before transforming these into the corresponding representations. For the sake of readability, the notation used is explained here again. We consider the (discrete) image to be a mapping $I(\overrightarrow{x},\phantom{\rule{0.3em}{0ex}}k):\Omega \to {\mathbb{R}}^{K+}$, where the domain of the image is *Ω*:=[1,…,*N*]×[1,…,*M*], *M* and *N* being the number of columns and rows of the image, while *K* defines the number of channels. In our case *K*=3, since the input images are RGB. After the error or noise models have been applied, image values are limited to [0,…,255]. The position vector can be written as $\overrightarrow{x}\left(i\right)=\left(x\left(i\right),\phantom{\rule{0.3em}{0ex}}y\left(i\right)\right)$, where *i*∈[1,…,*P*], with *P* being the number of pixels (i.e., *P*=*MN*). When referring to individual pixels, instead of writing $I\left(\overrightarrow{x}\left(i\right),\phantom{\rule{0.3em}{0ex}}k\right)$, we write *I*(*i*, *k*), with *i* defined as previously and *k* being the channel in question.

#### Global illumination errors

where *i*=1,…,*P*, *k*=1,…,3, and *ga* is the additive error, while *gm* is the multiplicative error. For additive error, we have used *ga*=25 and for multiplicative error, we have used *gm*=1.1.
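The error equations themselves are not reproduced above; with the stated magnitudes *ga* = 25 and *gm* = 1.1, and with the clipping to [0,…,255] described in the notation paragraph, the global models can be sketched as:

```python
import numpy as np

ga, gm = 25.0, 1.1  # additive and multiplicative magnitudes from the text

def global_additive(image):
    # GA: I'(i, k) = I(i, k) + ga, clipped to the valid range [0, 255].
    return np.clip(image + ga, 0, 255)

def global_multiplicative(image):
    # GM: I'(i, k) = gm * I(i, k), clipped to [0, 255].
    return np.clip(image * gm, 0, 255)
```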

#### Local illumination errors

The local error is generated using a two-dimensional Gaussian function whose mean *μ*, covariance *Σ*, and scaling factor *sF* are defined as follows:

*N* and *M* are the number of columns and rows in the image (as previously), and |*Σ*| is the determinant of the covariance. The scaling factor *sF* simply scales the values into the range [0,…,0.35]. In other words, we simulate an illumination error that most strongly influences the centre of the image in question, as can be observed from Figure 2 (LMA case). With the above in place, we define the local illumination error function as $E\left(i\right):={\mathcal{N}}_{2}\left(\overrightarrow{x}\left(i\right),\phantom{\rule{0.3em}{0ex}}\mu ,\phantom{\rule{0.3em}{0ex}}\Sigma \right)\,\mathrm{sF}$ and, thus, the local illumination errors are as follows:

where *i*=1,…,*P* and *k*=1,…,3. The multiplier 255 is simply due to the ‘scaling’ of the image representation (i.e., [0,…,255]).
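The defining equations for *μ*, *Σ*, and *sF* are not reproduced above; a sketch consistent with the text (a centre-weighted 2D Gaussian rescaled into [0, 0.35]) could look like the following, where the covariance values are an assumption on our part since the exact ones are not quoted:

```python
import numpy as np

def local_error_field(rows, cols, s_max=0.35):
    """E(i): a 2D Gaussian centred on the image, rescaled to [0, s_max].
    The mean and covariance values here are illustrative assumptions."""
    mu = np.array([(cols - 1) / 2.0, (rows - 1) / 2.0])    # image centre (x, y)
    cov = np.diag([(cols / 4.0) ** 2, (rows / 4.0) ** 2])  # assumed spread
    y, x = np.mgrid[0:rows, 0:cols]
    d = np.stack([x - mu[0], y - mu[1]], axis=-1)
    q = np.einsum('...i,ij,...j->...', d, np.linalg.inv(cov), d)
    g = np.exp(-0.5 * q) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    return g / g.max() * s_max            # sF: rescale into [0, 0.35]

def local_additive(image):
    # LA: add 255 * E(i) to every channel, clipping to [0, 255].
    E = local_error_field(image.shape[0], image.shape[1])[..., None]
    return np.clip(image + 255.0 * E, 0, 255)

def local_multiplicative(image):
    # LM (assumed form): scale each pixel by (1 + E(i)).
    E = local_error_field(image.shape[0], image.shape[1])[..., None]
    return np.clip(image * (1.0 + E), 0, 255)
```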

#### Luminance noise

where *i*=1,…,*P* and *k*=1,…,3. The index (*k*−1)*P* + *i* just makes sure that a different value is applied to each pixel in each channel.

#### Chrominance noise

where *i*=1,…,*P* and *k*=1.

#### Salt&pepper noise

A vector of uniformly distributed random values, *SP*:=[*U*(0,1),…,*U*(0,1)], is generated; we use this same vector for generating this type of noise for all the images.

where *i*=1,…,*P*.
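The noise equation itself is not reproduced above. Using a single seeded vector *SP* as described, and an assumed corruption fraction `p` (the mild/severe levels are not specified in this excerpt), the model can be sketched as:

```python
import numpy as np

def salt_and_pepper(image, p=0.05):
    """Set a pixel (all channels) to 0 ('pepper') or 255 ('salt') when its
    entry of the shared vector SP falls below p/2 or above 1 - p/2."""
    flat = image.reshape(-1, image.shape[-1]).copy()
    # SP := [U(0,1), ...]: the same (seeded) vector for every image.
    sp = np.random.default_rng(0).random(flat.shape[0])
    flat[sp < p / 2] = 0.0
    flat[sp > 1 - p / 2] = 255.0
    return flat.reshape(image.shape)
```

Reusing the same vector means every image is corrupted at the same pixel positions, as the text requires.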

### Experiments

Images from the Middlebury^{e} database, with ground-truth, were used for the quantitative experiments (Figure 3). We have used images of approximately 370 × 420 pixels (rows × columns). For the qualitative analysis and functional validation, images from the DRIVSCO^{f} and GRASP^{g} projects were used.

Even if no rigorous image analysis was used when choosing the images, both the learn and test sets were carefully chosen by taking the following into consideration: (a) none of the sets contains known cases where the variational method fails completely; (b) both very textured (e.g., Aloe and Cloth1) and less textured cases (e.g., Plastic and Wood1) are included. Even though less textured cases are considerably more difficult for stereo algorithms, these were included so that the parameters found by the DE algorithm would be more ‘generic’. In Appendix 1, Table 9, typical disparity values for each image are given, along with an example of the mean squared error for the calculated disparity maps. The reason for not including cases where the algorithm fails is that in such cases the effect of the used image representation would be negligible and, thus, would not convey useful information for our study. Variational methods (and any other method known to us) are known to fail with images that do not contain enough spatial features to approximate the disparity correctly. However, in [31] we propose a solution to this problem by using both spatial and temporal constraints.

#### *K*-fold cross-validation

We use *k*-fold cross-validation [32, 33] to statistically test how well the obtained results generalise. In our case, due to the size of the data set, we use 5-fold cross-validation: the data set is broken into five sets, each containing five images. We then run the DE and analyse the results five times, using three of the sets for learning, one for validation, and one for testing. In each run, the sets for learning, validation, and testing are different. Results are based on all five runs. Above is the list of sets for the first run (Table 5). Information related to the image sets for all the different runs can be found in Appendix 1, Table 6.
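The concrete fold assignments are those of Tables 5 and 6; a sketch of a 5-fold rotation that yields the same 15/5/5 learn/validation/test structure (illustrative, not reproducing the article's exact assignment) might look like:

```python
def five_fold_splits(images):
    """Split 25 images into 5 folds of 5; each run takes 3 folds for
    learning, 1 for validation and 1 for testing, rotating per run."""
    folds = [images[i::5] for i in range(5)]          # 5 disjoint folds
    runs = []
    for r in range(5):
        test = folds[r]
        validation = folds[(r + 1) % 5]
        learn = [im for j, fold in enumerate(folds)
                 if j not in (r, (r + 1) % 5) for im in fold]
        runs.append((learn, validation, test))
    return runs
```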

**Ranking for combined error+noise and original images**

| Rank | Error + noise | MSE | Original images | MSE |
|---|---|---|---|---|
| 1. | ∇ | 72.3 | ∇ | 35.1 |
| 2. | ∇ | 83.6 | HS(V)+LOGD | 37.5 |
| 3. | PHASE+\|∇*I*\| | 84.1 | ∇ | 39.0 |
| 4. | ∇ | 87.4 | ∇ | 39.4 |
| 5. | PHASE | 92.3 | HS(V)+PHASE | 40.6 |
| 6. | ( | 92.4 | ( | 42.2 |
| 7. | ∇ | 92.6 | ∇ | 44.9 |
| 8. | RGBn+LOGD | 92.8 | ∇ | 45.6 |
| 9. | LOGD | 97.5 | ∇ | 46.0 |
| 10. | RGB+PHASE | 102.7 | RGB+LOGD | 46.0 |
| 11. | LOGD+\|∇*I*\| | 105.5 | RGBn+LOGD | 46.8 |
| 12. | RGBn+\|∇*I*\| | 111.5 | RGBn+PHASE | 47.1 |
| 13. | HS(V)+\|∇*I*\| | 112.0 | PHASE+\|∇*I*\| | 47.7 |
| 14. | RGB+\|∇*I*\| | 112.9 | RGB+PHASE | 47.9 |
| 15. | ∇ | 114.8 | PHASE | 48.3 |
| 16. | RGBn+PHASE | 120.0 | LOGD | 50.7 |
| 17. | HS(V)+PHASE | 120.2 | HS(V)+(*r*)*ϕθ* | 53.2 |
| 18. | RGB+LOGD | 125.7 | ( | 53.6 |
| 19. | ∇ | 134.1 | LOGD+\|∇*I*\| | 55.2 |
| 20. | ( | 139.8 | HS(V)+\|∇*I*\| | 55.9 |
| 21. | ∇ | 175.0 | ( | 56.9 |
| 22. | ∇ | 180.4 | RGB+\|∇*I*\| | 57.7 |
| 23. | HS(V)+LOGD | 278.4 | ∇ | 59.8 |
| 24. | HS(V) | 293.8 | RGBn+\|∇*I*\| | 62.1 |
| 25. | RGB+HS(V) | 360.8 | RGBn+HS(V) | 74.8 |
| 26. | RGBn+HS(V) | 373.8 | ∇ | 99.3 |
| 27. | RGBn | 374.4 | ( | 103.7 |
| 28. | RGB | 380.7 | RGB+(*r*)*ϕθ* | 119.4 |
| 29. | RGB+RGBn | 394.3 | HS(V) | 134.3 |
| 30. | ( | 394.8 | RGBn+(*r*)*ϕθ* | 166.3 |
| 31. | HS(V)+(*r*)*ϕθ* | 563.8 | RGB+HS(V) | 178.8 |
| 32. | ( | 712.2 | RGB | 224.8 |
| 33. | RGBn+(*r*)*ϕθ* | 716.7 | RGB+RGBn | 239.1 |
| 34. | RGB+(*r*)*ϕθ* | 727.4 | RGBn | 260.3 |

**Combined ranking**

| Rank | Representation space | Summed rank |
|---|---|---|
| 1. | ∇ | 9 |
| 2. | ∇ | 10 |
| 3. | ∇ | 12 |
| 4. | ( | 12 |
| 5. | PHASE+\|∇*I*\| | 16 |
| 6. | ∇ | 19 |
| 7. | RGBN+LOGD | 19 |
| 8. | PHASE | 20 |
| 9. | HS(V)+PHASE | 22 |
| 10. | ∇ | 23 |
| 11. | ∇ | 24 |
| 12. | RGB+PHASE | 24 |
| 13. | HS(V)+LOGD | 25 |
| 14. | LOGD | 25 |
| 15. | RGB+LOGD | 28 |
| 16. | RGBN+PHASE | 28 |
| 17. | ∇ | 30 |
| 18. | LOGD+\|∇*I*\| | 30 |
| 19. | HS(V)+\|∇*I*\| | 33 |
| 20. | RGB+\|∇*I*\| | 36 |
| 21. | RGBN+\|∇*I*\| | 36 |
| 22. | ( | 41 |
| 23. | ∇ | 45 |
| 24. | HS(V)+(*r*)*ϕθ* | 48 |
| 25. | ( | 50 |
| 26. | RGBN+HS(V) | 51 |
| 27. | HS(V) | 53 |
| 28. | RGB+HS(V) | 56 |
| 29. | ( | 57 |
| 30. | RGB | 60 |
| 31. | RGBN | 61 |
| 32. | RGB+RGBN | 62 |
| 33. | RGB+(*r*)*ϕθ* | 62 |
| 34. | RGBN+(*r*)*ϕθ* | 63 |

#### Error metric

where *d* is the calculated disparity map, *dgt* is the ground truth, *P* is the number of pixels, and *S* is the number of images in the set (e.g., for a single image, *S*=1) for which the mean squared error is calculated.
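Equation (31) itself is not reproduced above. Consistent with these definitions, the mean squared error over a set of *S* images presumably takes the form (our reconstruction, not the article's typesetting):

$$\mathrm{MSE}=\frac{1}{SP}\sum_{s=1}^{S}\sum_{i=1}^{P}\left(d_{s}(i)-d_{gt,s}(i)\right)^{2}$$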

The most robust representation was ∇*I*+PHASE, while the second most robust was ∇*I* without any combinations. Since ∇*I* and PHASE represent different physical quantities (the gradient and the phase of the image signal, as the names suggest), and both have been shown to be robust, it is not surprising that a combination of the two was the most robust representation. In general, representations based on both ∇*I* and PHASE were amongst the most robust representations. On the other hand, ∇*I*+LOGD was the most accurate representation with the original images (i.e., without induced errors or noise). In general, representations based on ∇*I* produced good results with the original images.

**Tested image representation combinations**

| Term | None | RGB | RGBN | \|∇I\| | HS(V) | (r)ϕθ | Phase | LOGD |
|---|---|---|---|---|---|---|---|---|
| None | | | | | | | | |
| RGB | X | X | X | X | X | X | X | |
| RGBN | X | X | X | X | X | X | | |
| ∇I | X | X | X | X | X | X | X | X |
| HS(V) | X | X | X | X | X | | | |
| (r)ϕθ | X | X | X | X | | | | |
| Phase | X | X | | | | | | |
| LOGD | X | X | | | | | | |

From the combined ranking (Table 8), it can be seen that the best overall representation was ∇*I* alone. It can also be noted that the first three are all based on ∇*I*. However, ∇*I*+PHASE is slightly more robust than ∇*I* alone, but not as accurate. This is clear from the figures presented in the Results section.

**Learn-, validation-, and test sets**

| Run | Learn | | | Test | Validation |
|---|---|---|---|---|---|
| 1 | Lampshade2 | Cloth1 | Rocks2 | Aloe | Bowling2 |
| | Baby3 | Reindeer | Baby2 | Baby1 | Laundry |
| | Cones | Plastic | Tsukuba | Books | Moebius |
| | Art | Wood1 | Rocks1 | Lampshade1 | Venus |
| | Dolls | Cloth3 | Cloth2 | Wood2 | Teddy |
| 2 | Baby1 | Cloth1 | Teddy | Art | Rocks1 |
| | Aloe | Wood1 | Reindeer | Baby2 | Rocks2 |
| | Lampshade1 | Laundry | Bowling2 | Cloth3 | Cloth2 |
| | Dolls | Wood2 | Lampshade2 | Plastic | Books |
| | Cones | Baby3 | Moebius | Tsukuba | Venus |
| 3 | Aloe | Rocks1 | Lampshade1 | Baby1 | Baby3 |
| | Dolls | Venus | Moebius | Bowling2 | Wood2 |
| | Laundry | Tsukuba | Rocks2 | Lampshade2 | Plastic |
| | Cones | Baby2 | Books | Reindeer | Cloth2 |
| | Wood1 | Art | Cloth3 | Teddy | Cloth1 |
| 4 | Baby3 | Cones | Tsukuba | Baby1 | Books |
| | Rocks2 | Art | Cloth3 | Cloth2 | Lampshade2 |
| | Laundry | Dolls | Reindeer | Teddy | Cloth1 |
| | Plastic | Bowling2 | Lampshade1 | Venus | Rocks1 |
| | Aloe | Wood2 | Baby2 | Wood1 | Moebius |
| 5 | Bowling2 | Books | Reindeer | Baby2 | Teddy |
| | Tsukuba | Cloth3 | Rocks2 | Cloth1 | Baby3 |
| | Moebius | Aloe | Laundry | Plastic | Venus |
| | Cones | Wood1 | Art | Rocks1 | Cloth2 |
| | Lampshade1 | Dolls | Lampshade2 | Wood2 | Baby1 |

## Results

In this section, we present the results both quantitatively and qualitatively. First, the results are ranked according to how well each representation has performed in terms of accuracy and robustness. Then, we study how combining different representations affects their accuracy and robustness. Finally, we present the results for some real applications in terms of visual quality, since ground-truth is not available for these cases.

### Ranking

Here, we rank each of the representation spaces in order to gain better insight into the robustness and accuracy of each representation. By robustness and accuracy, we mean the following: (a) a representation is considered robust when results based on it are affected only slightly by noise or image errors; (b) a representation is considered accurate when it gives good results on the original (i.e., noiseless) images. While this may not be the standard terminology, we find that using these terms makes it easier to explain the results. In Table 7, each of the representations is ranked with respect to (a) the original images and (b) the combined illumination errors and noise types, while Table 8 combines the aforementioned results into a single ranking. The MSE value in the tables is based on all the different runs (see Section Experiments). In the case of the combined error and noise (error+noise in the tables), the MSE value is calculated based on all the different illumination errors and noise types for the five different runs.

### Improvement due to combined representation spaces

Here, we study how the error of each representation changes when it is combined with another, e.g., how the error of ∇*I* changes when combined with |∇*I*|, therefore allowing us to deduce whether ∇*I* benefits from the combination. Results are given with respect to error; thus, a positive change in the error naturally means a greater error, and vice versa. Figure 4 displays the results for ∇*I*, PHASE, and LOGD, while Figure 5 gives the same for (*r*)*ϕθ*, HS(V), and RGB. We have left out the results for RGBN on purpose, since it was the worst performer and its results were, in general, similar to those of RGB.

As it can be observed, combining ∇*I* with any of the representations, apart from (*r*)*ϕθ*, has improved both accuracy and robustness. Combining (*r*)*ϕθ* with ∇*I* improves robustness but at the same time, worsens accuracy. The situation with PHASE is similar: combining PHASE with other representations, apart from ∇*I*, has improved both accuracy and robustness; when combined with ∇*I*, accuracy worsens slightly while robustness improves. From Table 7, it can be observed that ∇*I*+PHASE is more robust than ∇*I* alone (first and second positions) with error+noise, while ∇*I* ranks seventh and ∇*I*+PHASE ranks ninth with the original images.

### Visual qualitative interpretation

Figure 6 displays results for ∇*I*, ∇*I*+PHASE, ∇*I*+HS(V), PHASE, and RGB. A video of the results for DRIVSCO is available at^{h}. These representations were chosen since (a) ∇*I* was the overall ‘winner’ for the combined results (see Table 8); (b) ∇*I*+PHASE was the most robust one; (c) ∇*I*+HS(V) was the most accurate one; (d) PHASE is both robust and accurate; and (e) RGB is the ‘standard’ representation from typical cameras. The parameters used were the same in all the cases presented here and are those from the first run (out of five) of the 5-fold cross-validation. The reasoning, confirmed by the results, is that any robust representation should be able to generate reasonable results with any of the parameter sets found in the cross-validation scheme.

As can be observed from Figure 6, the results are somewhat similar for all the representations; however, RGB has produced visually slightly worse results.

Figure 7 shows results for the DRIVSCO^{i} sequence (Additional file 1). Here, ∇*I*+PHASE has produced the most consistent results: the results for the road are far better than with any of the other representations. On the other hand, ∇*I*+HS(V) has produced the best results for the trailer; obtaining correct approximations for the trailer is challenging since it tends to ‘fuse’ with the trees. RGB has produced very low-quality results: for example, scene interpretation based on these results would be very challenging, if not impossible.

Additional file 1: **DRIVSCO sequence disparity results.** Disparity calculation results for the DRIVSCO sequence using different image representations. (FLV 5 MB)

Figure 8 shows results for a robotic grasping scene. Both ∇*I* and ∇*I*+HS(V) have produced good results: the object of interest lying on the table is recognisable in the disparity map. ∇*I*+PHASE or PHASE alone has increased ‘leakage’ of disparity values between the object of interest and the shelf. On the other hand, PHASE representation has produced the best results for the table, especially for the lowest part. Again, RGB has produced low quality results.

Altogether, visual qualitative interpretation of the results using real image sequences is in line with the quantitative analysis. Both ∇*I* and ∇*I*+PHASE produce good results even with real image sequences. However, the former produces slightly more accurate results while the latter representation is more robust.

## Conclusions

We have shown that the quality of a disparity map generated by a variational method, under illumination changes and image noise, depends significantly on the image representation used. By combining different representations, we generated and tested 34 different cases and found several complementary spaces that are only slightly affected even under severe illumination errors and image noise. Accuracy differences of 7-fold (without noise) and 10-fold (with noise and illumination errors) were found between the best and worst representations, which highlights the relevance of an appropriate input representation for low-level estimations such as stereo. This gain in accuracy and robustness to noise can be of critical importance in application scenarios with real, uncontrolled scenes rather than well-behaved test images (e.g., automatic navigation, advanced robotics, CGI). Amongst the tested combinations, the ∇*I* representation stood out as one of the most accurate and least affected by illumination errors or noise. By combining ∇*I* with PHASE, the joint representation space was the most robust amongst the tested spaces; thus we can say that these representations complement each other. These findings were also confirmed by the qualitative experiments on natural scenes in uncontrolled scenarios.

There are some studies similar to ours, carried out on a smaller scale. However, those studies typically provide little information on how the optimum (or near optimum) parameters of the algorithm were obtained for each representation space. In this study, we used a well-known, derivative-free, stochastic algorithm called DE (differential evolution) for the reasons given in the text. We argue that manually obtained parameters are subject to bias from the human operator and can therefore be expected to confirm the operator's expectations. Three different sets of images were used for obtaining the parameters and testing each of the representations, in order to avoid over-fitting. The proposed methodology for estimating model parameters can be extended to many other computer vision algorithms; our contribution should therefore lead to more robust computer vision systems capable of working in real applications.
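The parameter-search scheme described above can be sketched compactly. This is a minimal illustration, not the authors' code: the disparity method is replaced by a toy error surface, and the names (`evaluate_fold`, the bounds, the fold split) are hypothetical stand-ins. It shows how DE scores each candidate parameter vector by its mean error over the training folds, with one fold held out for testing.

```python
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)
images = np.arange(20)                               # stand-ins for the training images
folds = np.array_split(rng.permutation(images), 5)   # 5-fold split, as in the paper

def evaluate_fold(params, fold):
    # Toy per-image error surface with its minimum at params = (0.5, 2.0).
    # A real evaluation would run the variational method and return the MSE
    # against ground-truth disparity for the images in `fold`.
    a, b = params
    return float(np.mean((a - 0.5) ** 2 + (b - 2.0) ** 2 + 0.0 * fold))

def objective(params):
    # Score a candidate on folds 1..4; fold 0 is held out for testing.
    return np.mean([evaluate_fold(params, f) for f in folds[1:]])

result = differential_evolution(objective,
                                bounds=[(0.0, 1.0), (0.0, 5.0)],
                                seed=0, tol=1e-8)
print(result.x)   # near [0.5, 2.0]
```

Because DE is derivative-free and population-based, the same loop works unchanged when `evaluate_fold` is an expensive, non-differentiable black box such as a full disparity computation.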

### Future study

The weighting factors (*b*_{1} and *b*_{2} in (1)) are applied equally to all of the ‘channels’ of each image representation. Since some of the channels are more robust than others, as in the case of HSV for example, each channel should have its own weighting factor. Since this study allows us to narrow down the number of useful representations, we propose to study the best-behaving ones in more detail, with separate per-channel weighting factors where needed.
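The proposed extension can be illustrated with a short sketch. This is a hypothetical fragment, not the paper's implementation: instead of one weight per representation, each channel of a vector-valued representation (e.g. H, S, V) receives its own weight when the data term is assembled; the function name and the weights are made up for illustration.

```python
import numpy as np

def weighted_data_term(residuals, weights):
    """Sum of per-channel weighted squared constancy residuals.

    residuals: (channels, H, W) array of data-term residuals
    weights:   (channels,) per-channel weighting factors
    """
    w = np.asarray(weights, dtype=float)[:, None, None]
    return np.sum(w * residuals ** 2, axis=0)

# Example: down-weight the brightness (V) channel, which is the least
# illumination-invariant, relative to H and S.
res = np.ones((3, 4, 4))                       # unit residuals in all channels
term = weighted_data_term(res, [1.0, 1.0, 0.25])
print(term[0, 0])   # 2.25 = 1.0 + 1.0 + 0.25
```

The per-channel weights would themselves become additional parameters for the DE-based cross-validation scheme used in this study.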

## Appendix 1

### Image representations and sets

### Typical disparity values

The following table displays the minimum, maximum, mean, and standard deviation (STD) of the ground-truth disparity for each of the images used. In the same table we also give the MSE (mean squared error) for each image, calculated using the parameters from the 1st run for the ∇*I* based image representation. The numbers in the lowest row of the table are the mean, standard deviation, and MSE for the whole image set.

**Typical disparity values for each image, and MSE for each image using parameters from the 1st run for the ∇*I* image representation**

Image | Min | Max | Mean | Std | MSE |
---|---|---|---|---|---|
Aloe | 14.33 | 70.33 | 24.13 | 9.34 | 43.7 |
Art | 24.33 | 74.67 | 44.36 | 14.37 | 95.5 |
Baby1 | 8.33 | 45.33 | 27.79 | 10.66 | 24.6 |
Baby2 | 13.33 | 51.67 | 28.95 | 12.05 | 7.8 |
Baby3 | 15.67 | 51.00 | 42.15 | 6.93 | 11.7 |
Books | 21.67 | 73.67 | 43.00 | 14.71 | 8.7 |
Bowling2 | 13.33 | 66.00 | 46.88 | 15.96 | 116.0 |
Cloth1 | 13.00 | 57.33 | 38.28 | 8.93 | 0.6 |
Cloth2 | 14.00 | 76.00 | 53.24 | 12.37 | 30.6 |
Cloth3 | 15.00 | 55.33 | 36.28 | 11.31 | 5.1 |
Cones | 5.50 | 55.00 | 33.54 | 11.58 | 8.7 |
Dolls | 3.00 | 73.67 | 45.85 | 14.09 | 4.2 |
Lampshade1 | 14.00 | 64.67 | 35.87 | 15.90 | 78.2 |
Lampshade2 | 8.67 | 65.33 | 38.92 | 14.46 | 75.5 |
Laundry | 11.67 | 77.33 | 40.29 | 12.95 | 27.3 |
Moebius | 21.33 | 72.67 | 37.19 | 11.20 | 12.7 |
Plastic | 7.67 | 65.33 | 45.27 | 13.36 | 15.4 |
Reindeer | 3.67 | 67.00 | 41.54 | 15.01 | 112.8 |
Rocks1 | 19.33 | 56.67 | 37.53 | 9.50 | 1.7 |
Rocks2 | 23.33 | 56.00 | 38.57 | 7.00 | 1.3 |
Teddy | 12.50 | 52.75 | 27.38 | 9.02 | 8.8 |
Tsukuba | 5.00 | 14.00 | 6.79 | 2.67 | 2.1 |
Venus | 3.00 | 19.75 | 8.89 | 4.09 | 1.0 |
Wood1 | 21.67 | 71.67 | 40.83 | 12.71 | 102.5 |
Wood2 | 14.33 | 72.33 | 48.89 | 15.46 | 126.6 |
All images |  |  | 37.08 | 15.82 | 36.9 |
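The per-image statistics reported above are straightforward to compute from a ground-truth disparity map and an estimate. This is a minimal sketch with synthetic stand-in arrays; real maps would be loaded from the Middlebury data, and the noise level here is arbitrary.

```python
import numpy as np

def disparity_stats(gt, est):
    """Return (min, max, mean, std, MSE) for one image's disparity."""
    err = est - gt
    return (float(gt.min()), float(gt.max()),
            float(gt.mean()), float(gt.std()),
            float(np.mean(err ** 2)))

rng = np.random.default_rng(1)
gt = rng.uniform(5.0, 70.0, size=(100, 100))      # synthetic ground-truth disparity
est = gt + rng.normal(0.0, 2.0, size=gt.shape)    # estimate corrupted by noise
mn, mx, mean, std, mse = disparity_stats(gt, est)
print(round(mse, 1))   # close to 4.0, the variance of the added noise
```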

## Endnotes

^{a}http://www.jarnoralli.fi/.

^{b}http://www.icsi.berkeley.edu/~storn/code.html.

^{c}http://vision.middlebury.edu/stereo/data/.

^{d}http://vision.middlebury.edu/stereo/data/.

^{e}http://vision.middlebury.edu/stereo/data/.

^{f}http://www.pspc.dibe.unige.it/~drivsco/.

^{g}http://www.csc.kth.se/grasp/.

^{h}http://www.jarnoralli.fi/.

^{i}http://www.jarnoralli.fi/joomla/publications/representation-space.

## Declarations

### Acknowledgements

We thank Dr. Javier Sánchez and Dr. Luis Álvarez from the University of Las Palmas de Gran Canaria, and Dr. Stefan Roth from the Technical University of Darmstadt, for helping us get started with the variational methods. Further, we would like to thank Dr. Andrés Bruhn for our fruitful conversations during the DRIVSCO workshop 2009. We would also like to thank the PDC Center for High Performance Computing of the Royal Institute of Technology (KTH, Sweden) for letting us use their clusters. This work was supported by the EU research project TOMSY (FP7-270436), the Spanish Grants DINAM-VISION (DPI2007-61683) and RECVIS (TIN2008-06893-C03-02), Andalusia’s regional projects (P06-TIC-5060 and TIC-3873) and the Granada Excellence Network of Innovation Laboratories’ project PYR-2010-19.

## Authors’ Affiliations

## References

- Horn BKP, Schunck BG: Determining optical flow.
*Artif. Intell*1981, 17: 185-203. 10.1016/0004-3702(81)90024-2View ArticleGoogle Scholar - Huang Y, Young K: Binocular image sequence analysis: integration of stereo disparity and optic flow for improved obstacle detection and tracking.
*EURASIP J. Adv. Signal Process*2008, 2008: 10.MATHGoogle Scholar - Björkman M, Kragic D: Active 3D segmentation through fixation of previously unseen objects. In
*Proceedings of the British Machine Vision Conference*. BMVA Press; 2010:pp. 119.1-119.11. 10.5244/C.24.119Google Scholar - Mileva Y, Bruhn A, Weickert J: Illumination-robust variational optical flow with photometric invariants. In
*DAGM07- Volume LNCS*. Heidelberg, Germany; 2007:152-162.Google Scholar - Wöhler C, d’Angelo P: Stereo image analysis of non-lambertian surfaces.
*Int. J. Comput. Vision*2009, 81(2):172-190. 10.1007/s11263-008-0157-1View ArticleGoogle Scholar - Maxwell B, Friedhoff R, Smith C: A bi-illuminant dichromatic reflection model for understanding images. In
*IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008)*. Anchorage, Alaska, USA; 2008:1-8.View ArticleGoogle Scholar - Shafer S: Using color to separate reflection components.
*Tech. rep*. 1984. [TR 136, Computer Science Department, University of Rochester]Google Scholar - Lowe DG: Distinctive image features from scale-invariant keypoints.
*Int. J. Comput. Vision*. 2004, 60: 91-110.View ArticleGoogle Scholar - Bay H, Ess A, Tuytelaars T, Gool LV: Speeded-up robust features (SURF).
*Comput. Vis. Image Underst*2008, 110: 346-359. 10.1016/j.cviu.2007.09.014View ArticleGoogle Scholar - Bruhn A: Variational optic flow computation: accurate modelling and efficient numerics.
*PhD thesis*. Saarland University, Saarbrücken, Germany; 2006.Google Scholar - Brox T: From pixels to regions: partial differential equations in image analysis.
*PhD thesis*. Saarland University, Saarbrücken, Germany; 2005.Google Scholar - Bruhn A, Weickert J, Kohlberger T, Schnörr C: A multigrid platform for real-time motion computation with discontinuity-preserving variational methods.
*Int. J. Comput. Vision*2006, 70(3):257-277. 10.1007/s11263-006-6616-7View ArticleGoogle Scholar - Slesareva N, Bruhn A, Weickert J: Optic flow goes stereo: a variational method for estimating discontinuity-preserving dense disparity maps. In
*DAGM05- Volume LNCS 3663*. Vienna, Austria; 2005:pp. 33-40.Google Scholar - Brox T, Bruhn A, Papenberg N, Weickert J: High accuracy optical flow estimation based on a theory for warping. In
*ECCV04-Volume LNCS 3024*. Prague, Czech Republic; 2004:pp. 25-36.Google Scholar - Black M, Anandan P: Robust dynamic motion estimation over time. In
*Proc. Computer Vision and Pattern Recognition*. Maui, Hawaii, USA; 1991:pp. 296-302.Google Scholar - Weickert J, Schnörr C: A theoretical framework for convex regularizers in PDE-based computation of image motion.
*Int. J. Comput. Vision*2001, 45(3):245-264. 10.1023/A:1013614317973View ArticleMATHGoogle Scholar - Nagel H, Enkelmann W: An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences.
*PAMI*1986, 8(5):565-593.View ArticleGoogle Scholar - Alvarez L, Weickert J, Sánchez J: Reliable estimation of dense optical flow fields with large displacements.
*Int. J. Comput. Vision*2000, 39: 41-56. 10.1023/A:1008170101536View ArticleMATHGoogle Scholar - Zimmer H, Bruhn A, Weickert J, Valgaerts L, Salgado A, Rosenhahn B, Seidel H: Complementary optic flow. In
*EMMCVPR , vol. 5681 of Lecture Notes in Computer Science*. Bonn, Germany; 2009:pp. 207-220.Google Scholar - Blake A, Zisserman A:
*Visual Reconstruction*. The MIT Press, Cambridge, Massachusetts London, England; 1987.Google Scholar - Trottenberg U, Oosterlee C, Schüller A:
*Multigrid*. Academic Press, A Harcourt Science and Technology Company Harcourt Place. 32 Jamestown Road, London NW1 7BY UK; 2001.MATHGoogle Scholar - Storn R, Price K: Differential evolution—a simple and efficient adaptive scheme for global optimization over continuous spaces.
*Tech. rep*1995. [TR-95-012, ICSI]Google Scholar - Storn R, Price K: Differential evolution—a simple, efficient heuristic for global optimization over continuous spaces.
*J. Global Optimiz*1997, 11(4):341-359. 10.1023/A:1008202821328MathSciNetView ArticleMATHGoogle Scholar - Plagianakos VP, Vrahatis MN: Parallel evolutionary training algorithms for hardware-friendly neural networks.
*Nat. Comput*2002, 1: 307-322. 10.1023/A:1016545907026MathSciNetView ArticleMATHGoogle Scholar - Tasoulis DK, Pavlidis N, Plagianakos VP, Vrahatis MN: Parallel differential evolution. In
*IEEE Congress on Evolutionary Computation (CEC)*. Portland, OR, USA; 2004:pp. 1-6.Google Scholar - Epitropakis MG, Plagianakos VP, Vrahatis MN: Hardware-friendly higher-order neural network training using distributed evolutionary algorithms.
*Appl. Soft Comput*2010, 10: 398-408. 10.1016/j.asoc.2009.08.010View ArticleGoogle Scholar - Hubel D, Wiesel T: Anatomical demonstration of columns in the monkey striate cortex.
*Nature*1969, 221: 747-750. 10.1038/221747a0View ArticleGoogle Scholar - Fleet D, Jepson A: Stability of phase information.
*IEEE Trans. Pattern Anal. Mach. Intell*1993, 15(12):1253-1268. 10.1109/34.250844View ArticleGoogle Scholar - Fleet D, Jepson A: Phase-based disparity measurement.
*Comput. Vision Graphics Image Process*1991, 53(2):198-210.MATHGoogle Scholar - Sabatini S, Gastaldi G, Solari F, Pauwels K, Hulle MV, Díaz Jx, Ros J, Pugeault N, Krüger N: A compact harmonic code for early vision based on anisotropic frequency channels.
*Comput. Vis. Image Underst*2010, 114: 681-699. 10.1016/j.cviu.2010.03.008View ArticleGoogle Scholar - Ralli J, Díaz J, Ros E: Spatial and temporal constraints in variational correspondence methods.
*Mach. Vision Appl*2011, 1-13.Google Scholar - Devijver PA, Kittler J:
*Pattern Recognition: a Statistical Approach*. Prentice Hall; 1982. ISBN 13: 9780136542360, ISBN 10: 0136542360MATHGoogle Scholar - Kohavi R: A study of cross-validation and bootstrap for accuracy estimation and model selection. In
*Proceedings of the 14th International Joint Conference on Artificial Intelligence- Volume 2*. Morgan Kaufmann, Montreal, Quebec, Canada; 1995:pp. 1137-1143.Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.