### 5.1 Appearance-based analysis

The first part of the observation model deals with the appearance of the objects. The aim is to obtain the probability *p*_{a}(*z*_{i,k} | *s*_{i,k}) of the current appearance observation given the object state *s*_{i,k} (note the subscript *a*, which denotes "appearance"). In other words, we would like to know whether the current appearance-related measurements support the hypothesized object state. To derive the probability *p*_{a}(*z*_{i,k} | *s*_{i,k}), we proceed at two levels. First, the probability that a pixel belongs to a vehicle is defined according to the observation for that pixel. Second, by analyzing the pixel-wise information around the position given by *s*_{i,k}, the final observation model is defined at region level.

The pixel-wise model aims to provide the probability that a pixel belongs to a vehicle. This is addressed as a classification problem, and it is therefore necessary to define the different categories expected in the image. In particular, the rectified image (see the example in Figure 2) contains mainly three types of elements: vehicles, road pavement, and lane markings. A fourth class is also included in the model to account for any other kind of element (such as median stripes or guard rails).

The Bayesian approach is adopted to address this classification problem. Specifically, the four classes are denoted by \mathcal{S}=\left\{P,\ L,\ V,\ U\right\}, which correspond to the pavement, lane markings, vehicles, and unidentified elements, respectively. Let us also denote by *X*_{i} the event that a pixel *x* is classified as belonging to the class i\in \mathcal{S}. Then, if the current measurement for pixel *x* is represented by *z*_{x}, the posterior probability that the pixel *x* corresponds to *X*_{i} is given by the Bayes rule

P\left({X}_{i}|{z}_{x}\right)=\frac{p\left({z}_{x}|{X}_{i}\right)P\left({X}_{i}\right)}{P\left({z}_{x}\right)}

(12)

where *p*(*z*_{x} | *X*_{i}) is the likelihood function, *P*(*X*_{i}) is the prior probability of class *X*_{i}, and *P*(*z*_{x}) is the evidence, computed as P\left({z}_{x}\right)={\sum}_{i\in \mathcal{S}}p\left({z}_{x}|{X}_{i}\right)P\left({X}_{i}\right), which is a scale factor that ensures that the posterior probabilities sum to one. Likelihoods and prior probabilities are defined in the following section.

#### 5.1.1 Likelihood functions

In order to construct the likelihood functions, a set of features has to be defined that constitutes the current observation regarding appearance. These features should achieve a high degree of separation between classes while, at the same time, being significant for a broad set of scenarios. In general terms, the following considerations hold when analyzing the appearance of the bird's-eye view images. First, the road pavement is usually homogeneous, with slight intensity variations among pixels. In turn, lane markings constitute near-vertical stripes of high intensity, surrounded by regions of lower intensity. As for vehicles, they typically feature very low-intensity regions in their lower part, due to the vehicle's shadow and wheels. Hence, two features are used for the definition of the appearance-based likelihood model, namely the intensity value, *I*_{x}, and the response to a lane-marking detector, *R*_{x}. For the latter, any of the methods available in the literature can be utilized [33, 34]. For this work, a lane marking detector similar to that presented in [35] is used, whose response is defined in every row of the image as

{R}_{x}=2{I}_{x}-\left({I}_{x-\tau}+{I}_{x+\tau}\right)

(13)

where *τ* is the expected width of a lane marking in the rectified domain. The likelihood models are defined as parametric functions of these two features. In particular, they are modeled as Gaussian probability density functions:

p\left({I}_{x}|{X}_{i}\right)=\frac{1}{\sqrt{2\pi}{\sigma}_{I,i}}\text{exp}\left(-\frac{1}{2{\sigma}_{I,i}^{2}}{\left({I}_{x}-{\mu}_{I,i}\right)}^{2}\right)

(14)

p\left({R}_{x}|{X}_{i}\right)=\frac{1}{\sqrt{2\pi}{\sigma}_{R,i}}\text{exp}\left(-\frac{1}{2{\sigma}_{R,i}^{2}}{\left({R}_{x}-{\mu}_{R,i}\right)}^{2}\right)

(15)

where the parameters for the intensity and the lane marking detector are denoted by the subscripts '*I*' and '*R*', respectively. The distribution of the class corresponding to unidentified elements, which would intuitively be uniform for both features, is also modeled as a Gaussian, with a very large fixed variance, to ease further processing. Additionally, the features are assumed to be conditionally independent given each class *X*_{i}, thus

p\left({z}_{x}|{X}_{i}\right)=p\left({I}_{x}|{X}_{i}\right)p\left({R}_{x}|{X}_{i}\right)

(16)
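
The pixel-wise classifier of Equations (12)-(16) can be sketched in Python as follows. The class parameters and priors used here are hypothetical placeholders for illustration only; in the actual system they are estimated by the EM procedure described next.

```python
import numpy as np

def lane_marking_response(row, tau):
    """Lane-marking detector response (Eq. 13): R_x = 2*I_x - (I_{x-tau} + I_{x+tau}),
    computed along an image row, with border pixels padded by replication."""
    padded = np.pad(row.astype(float), tau, mode="edge")
    center = padded[tau:-tau]
    return 2 * center - (padded[:-2 * tau] + padded[2 * tau:])

def gaussian(z, mu, sigma):
    """Gaussian likelihood p(z | X_i), as in Eqs. (14)-(15)."""
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def posterior(I_x, R_x, params, priors):
    """Posterior P(X_i | z_x) by Bayes' rule (Eq. 12), with the joint
    likelihood factorized as p(I|X_i) * p(R|X_i) (Eq. 16).
    params maps class -> (mu_I, sigma_I, mu_R, sigma_R)."""
    unnorm = {}
    for c, (mu_I, s_I, mu_R, s_R) in params.items():
        unnorm[c] = gaussian(I_x, mu_I, s_I) * gaussian(R_x, mu_R, s_R) * priors[c]
    evidence = sum(unnorm.values())  # scale factor P(z_x)
    return {c: v / evidence for c, v in unnorm.items()}
```

A bright pixel with a strong detector response should then be assigned to the lane-marking class under any reasonable parameterization.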

The parameters of the likelihood models in (14) and (15) are estimated via the expectation-maximization (EM) algorithm. This method is extensively used for Gaussian mixture-density parameter estimation (see [36] for details) and is thus well suited to the posed problem. In particular, it provides an analytical maximum-likelihood solution that is found iteratively. In addition, it is simple, easy to implement, and converges quickly when a good initialization is available. In this case, such an initialization is readily available from the previous frame; that is, the results from the previous image can recursively be used as the starting point for each incoming image. The data distribution is given by

p\left({I}_{x}\right)=\sum _{i\in \mathcal{S}}p\left({X}_{i}\right)p\left({I}_{x}|{X}_{i}\right)

(17)

p\left({R}_{x}\right)=\sum _{i\in \mathcal{S}}p\left({X}_{i}\right)p\left({R}_{x}|{X}_{i}\right)

(18)

Since the densities of the features *I*_{x} and *R*_{x} are independent, the optimization is carried out separately for each feature. Let us first rewrite expression (17) so that the dependence on the parameters is explicit:

p\left({I}_{x}|{\Theta}_{I}\right)=\sum _{i\in \mathcal{S}}{\omega}_{I,i}p\left({I}_{x}|{\Theta}_{I,i}\right)

(19)

where *Θ*_{I,i} = {*µ*_{I,i}, *σ*_{I,i}} and *Θ*_{I} = {*Θ*_{I,i}}_{i∈{P,L,V}}. Observe that the prior probabilities have been substituted by factors *ω*_{I,i} to adopt the notation typical of mixture models. The set of unknown parameters comprises the parameters of the densities and the mixing coefficients, *Θ* = {*Θ*_{I,i}, *ω*_{I,i}}_{i∈{P,L,V}}. The parameters resulting from the final EM iteration are fed into the Bayesian model defined in Equations (12)-(15). The process is completely analogous for the feature *R*_{x}.
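
A minimal sketch of the EM iteration for a one-dimensional Gaussian mixture such as (19) is given below. For brevity it uses an arbitrary component count and starting values, and omits the fixed-variance class and the recursive frame-to-frame initialization described above.

```python
import numpy as np

def em_gmm_1d(data, mu, sigma, w, n_iter=50):
    """EM for a 1-D Gaussian mixture: alternate responsibilities (E-step)
    and closed-form maximum-likelihood parameter updates (M-step)."""
    data = np.asarray(data, dtype=float)
    mu, sigma, w = (np.array(v, dtype=float) for v in (mu, sigma, w))
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample
        lik = w * np.exp(-0.5 * ((data[:, None] - mu) / sigma) ** 2) \
              / (np.sqrt(2 * np.pi) * sigma)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: update means, standard deviations, and mixing weights
        Nk = resp.sum(axis=0)
        mu = (resp * data[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((resp * (data[:, None] - mu) ** 2).sum(axis=0) / Nk)
        w = Nk / len(data)
    return mu, sigma, w
```

With a good initialization, as is available here from the previous frame, a handful of iterations typically suffices.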

#### 5.1.2 Appearance-based likelihood model

The result of the proposed appearance-based likelihood model is a set of pixel-wise probabilities for each of the classes. Naturally, in order to know the likelihood of the current object state candidate, we must evaluate the region around the vehicle position given by *s*_{i,k} = (*x*_{i,k}, *y*_{i,k}). The vehicle position has been defined as the midpoint of its lower edge (i.e., the segment delimiting the transition from road to vehicle). Hence, if the candidate state is good, we expect pixels in the neighborhood above *s*_{i,k} to display a high probability of belonging to the vehicle class, *p*(*X*_{V} | *z*_{x}), while the neighborhood below *s*_{i,k} should involve low vehicle probabilities. Therefore, the appearance-based likelihood of the object state *s*_{i,k} is defined as

{p}_{a}\left({z}_{i,k}|{s}_{i,k}\right)=\frac{1}{\left(w+1\right)h}\left(\sum _{x\in {R}_{a}}p\left({X}_{V}|{z}_{x}\right)+\sum _{x\in {R}_{b}}\left(1-p\left({X}_{V}|{z}_{x}\right)\right)\right)

where *R*_{a} is the region of size (*w* + 1) × *h*/2 above *s*_{i,k}, *R*_{a} = {*x*_{i,k} - *w*/2 ≤ *x* < *x*_{i,k} + *w*/2; *y*_{i,k} - *h*/2 ≤ *y* < *y*_{i,k}}, and *R*_{b} is the region of the same size below *s*_{i,k}, *R*_{b} = {*x*_{i,k} - *w*/2 ≤ *x* < *x*_{i,k} + *w*/2; *y*_{i,k} < *y* ≤ *y*_{i,k} + *h*/2}.
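
A possible implementation of this region-level score is sketched below. It assumes integer pixel coordinates and takes the (*w* + 1) columns as inclusive of both ends, consistent with the normalization factor; all names are illustrative.

```python
import numpy as np

def appearance_likelihood(p_vehicle, x, y, w, h):
    """Region-level appearance likelihood p_a(z|s): average vehicle
    probability in the region above (x, y) plus average non-vehicle
    probability in the region below it. `p_vehicle` is the pixel-wise
    map p(X_V | z_x), indexed as [row, col]; (x, y) is the midpoint of
    the candidate's lower edge."""
    cols = slice(x - w // 2, x + w // 2 + 1)       # w + 1 columns
    Ra = p_vehicle[y - h // 2 : y, cols]           # h/2 rows above the edge
    Rb = 1.0 - p_vehicle[y + 1 : y + h // 2 + 1, cols]  # h/2 rows below
    return (Ra.sum() + Rb.sum()) / ((w + 1) * h)
```

A candidate whose upper neighborhood is pure "vehicle" and whose lower neighborhood is pure "road" scores exactly 1.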

### 5.2 Motion-based analysis

As mentioned above, the second source of information for the definition of the likelihood model is motion analysis. Two-view geometry fundamentals are used to relate the previous and current views of the scene. In particular, the homography (i.e., projective transformation) of the road plane is estimated between these two points in time. This allows us to generate a prediction of the road plane appearance in future instants. However, vehicles (which are generally the only objects moving on the road plane) feature inherent motion in time, hence their projected position in the plane differs from that observed. The regions involving motion are identified through image alignment of the current image and the previous image warped with the homography. These regions will correspond to vehicles with high probability.

#### 5.2.1 Homography calculation

The first step toward image alignment is the calculation of the road plane homography between consecutive frames. As shown in [37], the homography that relates the points of a plane between two different views can be obtained from a minimum of four feature correspondences by means of the direct linear transformation (DLT). Indeed, in many applications the texture of the planar object makes it possible to obtain numerous feature correspondences using standard feature extraction and matching techniques, and to subsequently find a good approximation to the underlying homography. However, this is not the case in traffic environments: the road plane is highly homogeneous, and hence most of the points delivered by feature detectors applied to the images belong to background elements or vehicles, and few correspond to the road plane. Therefore, the resulting dominant homography (even if robust estimation techniques are used) is in general not that of the road plane.
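
For reference, a bare-bones DLT estimator might look as follows. This is illustrative only: it omits the coordinate normalization and the robust outlier rejection (e.g., RANSAC) that a practical implementation would require.

```python
import numpy as np

def dlt_homography(src, dst):
    """Direct linear transformation: estimate H such that dst ~ H @ src
    (in homogeneous coordinates) from >= 4 point correspondences, as the
    null vector of the 2n x 9 design matrix obtained via SVD."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.array(A, dtype=float))
    H = Vt[-1].reshape(3, 3)       # right singular vector of smallest value
    return H / H[2, 2]             # fix the projective scale
```

With noise-free correspondences the estimate is exact up to scale, which is why a single bad correspondence can corrupt the solution so strongly.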

To overcome this problem, we propose to exploit the specific nature of the environment. In particular, highways are expected to have different kinds of markings (mostly lane markings) painted on the road. Therefore, we propose to first use a standard lane marking detector (such as those described in [33–35]) and then to restrict the feature search to extended regions around lane markings. Nevertheless, the resulting set of correspondences will still typically be scarce, and some of them may be incorrect or inaccurate, depending on the sharpness of the lane marking corners and on the resolution of the image around them. Hence, the instantaneous homography computed from feature correspondences using DLT might be highly unreliable (errors in one of the points will have a large impact on the solution to the DLT equation system), and sometimes the number of points is not even sufficient to compute it.

For the above-mentioned reasons, intermediate processing of the instantaneous homography is necessary. This is achieved in this study by means of a linear estimation process based on Kalman filtering. Let us first inspect the analytical expression of the homography between two consecutive instants. Figure 3 illustrates the situation of a vehicle with an on-board camera moving on a flat road plane, *π*_{0} = (**n**^{T}, *d*)^{T}, where **n** = (0, 1, 0)^{T} and *d* is the distance between the camera and the ground plane. The coordinate system of the camera at time *k*_{1} is adopted as the world coordinate system, where the *Z*-axis indicates the driving direction. At time *k*_{2} the camera has moved to position **C**_{2}, and a rotation *R*_{x}(*α*) might have occurred around the *X*-axis due to camera shaking (*α* denotes the change in the pitch angle). An additional rotation *R*_{y}(*β*) models variations in the yaw angle (i.e., around the *Y*-axis), which must be considered in case the vehicle changes lane or takes a curve. From the previous discussion, and assuming a pinhole camera model, the camera projection matrices at times *k*_{1} and *k*_{2} are, respectively,

\begin{array}{c}{P}_{1}=K\left[I|\mathbf{0}\right]\\ {P}_{2}=K{R}_{x}\left(\alpha \right){R}_{y}\left(\beta \right)\left[I|-{\mathbf{C}}_{2}\right]\end{array}

(20)

The homography *H* relates the projections, **x**_{1} and **x**_{2}, of a 3D point **X** ∈ *π*_{0} in two different images. Its expression can be derived from Equation (20). In effect, for the first view it is **x**_{1} = *P*_{1}**X** = *K*[*I*|**0**]**X**, and hence any point in the ray \mathbf{X}={({\mathbf{x}}_{1}^{T}{({K}^{-1})}^{T},\ \rho )}^{T} projects to **x**_{1}. The intersection of this ray and the plane *π*_{0} determines the value of the parameter *ρ*: it is {\pi}_{0}^{T}\mathbf{X}={\mathbf{n}}^{T}{K}^{-1}{\mathbf{x}}_{1}+d\rho =0, and thus *ρ* = -**n**^{T}*K*^{-1}**x**_{1}/*d*. The projection **x**_{2} of the point **X** into the second view is

\begin{array}{ll}{\mathbf{x}}_{2}&={P}_{2}\mathbf{X}=K{R}_{x}\left(\alpha \right){R}_{y}\left(\beta \right)\left[I|-{\mathbf{C}}_{2}\right]\mathbf{X}\\ &=K{R}_{x}\left(\alpha \right)\left[{R}_{y}\left(\beta \right)|-{R}_{y}\left(\beta \right){\mathbf{C}}_{2}\right]{\left({\mathbf{x}}_{1}^{T}{\left({K}^{-1}\right)}^{T},\ \rho \right)}^{T}\\ &=K{R}_{x}\left(\alpha \right)\left[{R}_{y}\left(\beta \right){K}^{-1}{\mathbf{x}}_{1}+\mathbf{t}\rho \right]\\ &=K{R}_{x}\left(\alpha \right)\left[{R}_{y}\left(\beta \right)-\mathbf{t}{\mathbf{n}}^{T}/d\right]{K}^{-1}{\mathbf{x}}_{1}\end{array}

where **t** = -*R*_{y}(*β*)**C**_{2}. This vector constitutes the translation in the direction in which the vehicle is heading and is thus given by **t** = (0, 0, 1)^{T}*v*/*f*_{r}, where *v* is the velocity of the vehicle and *f*_{r} is the frame rate. From the above equations, the expression of the homography of the plane *π*_{0} between *k*_{1} and *k*_{2} is derived:

H=K{R}_{x}\left(\alpha \right)\left[{R}_{y}\left(\beta \right)-\mathbf{t}{\mathbf{n}}^{T}/d\right]{K}^{-1}

(21)
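
Equation (21) can be evaluated directly once the calibration matrix, the angles, the speed, and the camera height are known. The sketch below builds *H* from these quantities; the calibration values in the accompanying check are hypothetical.

```python
import numpy as np

def road_homography(K, alpha, beta, v, fr, d):
    """Road-plane homography between consecutive frames (Eq. 21):
    H = K Rx(alpha) [Ry(beta) - t n^T / d] K^{-1},
    with t = (0, 0, 1)^T v/fr and plane normal n = (0, 1, 0)^T."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])   # pitch
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])   # yaw
    t = np.array([[0.0], [0.0], [v / fr]])                  # forward translation
    n = np.array([[0.0, 1.0, 0.0]])                         # road-plane normal
    return K @ Rx @ (Ry - t @ n / d) @ np.linalg.inv(K)
```

As a sanity check, a stationary camera with no rotation (*α* = *β* = 0, *v* = 0) must yield the identity homography.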

At each time *k* we have a noisy approximation of the homography *H* of the road plane between the previous and the current instant. However, the evolution of *H* in the temporal domain is assumed to be smooth due to the intrinsic constraints of vehicle dynamics, therefore better estimates can be obtained by filtering the noisy measurements in time. Temporally filtered estimates of the homography are obtained by modeling *H* with a zero-order Kalman filter whose state vector is composed of the elements *H*_{ij} of the homography matrix. The design of the filter is summarized as follows:

\begin{array}{c}{\mathbf{x}}_{k}^{T}=\left\{{H}_{ij},\ 1\le i,j\le 3\right\}\\ {\mathbf{x}}_{k}={\mathbf{x}}_{k-1}+{\mathbf{w}}_{k}\\ {\mathbf{z}}_{k}^{T}=\left\{{H}_{ij}^{k},\ 1\le i,j\le 3\right\}\\ {\mathbf{z}}_{k}={\mathbf{x}}_{k}+{\mathbf{v}}_{k}\end{array}

The process and measurement noise, **w**_{k} and **v**_{k}, are assumed to be given by independent Gaussian distributions, *p*(**w**) ~ *N*(0, *Q*) and *p*(**v**) ~ *N*(0, *R*). Observe that the measurement vector is composed of the elements of the instantaneous homography matrix, *H*^{k}, computed from image correspondences. As stated above, measurements are expected to be prone to error due to the usually small set of correspondences available, hence the measurement noise should be tuned to be larger than the process noise (in the proposed configuration, *Q* = 10^{-6} and *R* = 10^{-3}).
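
A compact version of this zero-order filter, with diagonal *Q* and *R* as in the proposed configuration, could read:

```python
import numpy as np

class HomographyKalman:
    """Zero-order (random-walk) Kalman filter on the 9 homography elements:
    x_k = x_{k-1} + w_k,  z_k = x_k + v_k, with diagonal noise Q and R."""
    def __init__(self, H0, q=1e-6, r=1e-3):
        self.x = H0.flatten().astype(float)   # state: elements H_ij
        self.P = np.eye(9)                    # state covariance
        self.Q = q * np.eye(9)
        self.R = r * np.eye(9)

    def update(self, H_meas):
        """Predict with the identity transition, then correct with H^k."""
        P_pred = self.P + self.Q
        K_gain = P_pred @ np.linalg.inv(P_pred + self.R)
        self.x = self.x + K_gain @ (H_meas.flatten() - self.x)
        self.P = (np.eye(9) - K_gain) @ P_pred
        return self.x.reshape(3, 3)
```

Because *R* ≫ *Q*, the filter trusts its smooth internal estimate more than any single noisy measurement, which is precisely the desired behavior here.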

The designed filter provides corrected estimates for the homography at time *k*, {\widehat{H}}^{k}, built from the posterior estimate of the filter state, {\widehat{\mathbf{x}}}_{k}. Most importantly, this measure can be used as a prediction for the homography in the next time point. This prediction provides an effective reference to evaluate whether or not the computed instantaneous measurement may be erroneous. Indeed, at the current time *k*, we can compare the instantaneous homography *H*^{k} to the prediction made in the previous time instant {\widehat{H}}^{k-1}: if *H*^{k} is close to the expected value {\widehat{H}}^{k-1} then the filter equations will be conveniently updated; in contrast, if the matrices are significantly different, then it is natural to think that noisy correspondences were involved in the calculation of *H*^{k} .

The distance between matrices is measured according to the norm of the matrix of differences. Specifically, the norm induced by the 2-norm of a Euclidean space is used, which is obtained by performing the singular value decomposition (SVD) of the matrix and retaining its largest singular value [38]. The incoming matrices are accepted and introduced into the Kalman filtering framework only if \left|\right|{H}^{k}-{\widehat{H}}^{k-1}\left|\right|<{t}_{a}. Otherwise, the measured homography is deemed unreliable and the predicted homography is used instead. The threshold *t*_{a} modulates the maximum acceptable distance to the predicted matrix, which depends on the kinematic restrictions of the platform on which the camera is mounted.

In the case of highways, vehicle dynamics are bounded by the maximum speed, the maximum turning angle (i.e., yaw angle, *β*), and the maximum variation in the pitch angle, *α*, for a given frame rate. The maximum velocity is considered to be *v* = 120 km/h (33.3 m/s), as enforced by most national governments. Additionally, a maximum pitch angle variation of *α* = ±5° is considered, and an upper bound of *β* = ±3° is inferred for the turning angle according to standard road geometry design rules. Taking these bounds into account, and assuming an image processing rate of at least 1 fps, the threshold is experimentally set to *t*_{a} = 60.
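
The gating rule can be sketched as follows, where the spectral (induced 2-) norm of the difference matrix is obtained as its largest singular value:

```python
import numpy as np

def gated_homography(H_meas, H_pred, t_a=60.0):
    """Accept the instantaneous homography H^k only if its spectral distance
    to the prediction is below t_a; otherwise fall back on the prediction.
    ||A||_2 equals the largest singular value of A [38]."""
    dist = np.linalg.svd(H_meas - H_pred, compute_uv=False)[0]
    if dist < t_a:
        return H_meas, True    # reliable: feed into the Kalman filter
    return H_pred, False       # unreliable: use the predicted homography
```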

#### 5.2.2 Motion-based likelihood model

Once a time-filtered estimate of the homography {\widehat{H}}^{k} is available, reliable image alignment can be performed. Image alignment allows for the location of regions of the image likely featuring motion (and therefore likely containing vehicles). The previous image is aligned with the current image by warping it with {\widehat{H}}^{k}. Image alignment is exemplified in Figure 4. In the upper row, the snapshots of a sequence at times *k* - 1 and *k* are displayed. In Figure 4.c, the image in Figure 4.a warped with {\widehat{H}}^{k} is shown. Observe that this is very similar in the road region to the actual image at time *k* (Figure 4.b).

As suggested in the overview of Section 5.2, the reason for image alignment is that all elements in the road plane (except for the points of the vehicle that belong to this plane) are static, and thus their actual position matches that projected by the homography. In contrast, vehicles are moving, hence their positions in the road plane at time *k* significantly differ from those projected by the homography, which assumes they are static. Therefore, the differences between the image at time *k* and the image at time *k* - 1 warped with {\widehat{H}}^{k} should be null for all the elements of the road plane except for the contact zones of the vehicles with the road. The differences in these regions will be more significant when the velocity of the vehicles is high. Figure 4.d illustrates the difference between the current image (Figure 4.b) and the warped previous image (Figure 4.c) for this example. As can be observed, whiter pixels, which indicate a significant difference, appear in the areas of motion of the vehicles on the road. The transformation of the elements outside the road is naturally not well represented by {\widehat{H}}^{k} (this is the homography of the road plane), and thus random regions of strong differences arise in the background, which will be considered clutter.

The pixel-wise difference between the current image and the warped previous image provides information on the likelihood of the current object state candidate, *s*_{i,k}. Analogously to the appearance-based likelihood modeling, the region around the vehicle position indicated by *s*_{i,k} will be evaluated in order to derive its likelihood. Also, to preserve the duality with the appearance-based analysis, the processing is shifted to the rectified domain using the transformation *T* defined in Section 2. The resulting image, denoted *D*_{r}, is illustrated in Figure 4.e for the previous example. In particular, the likelihood of belonging to a region of motion is maximum at *x*_{max} = argmax(*D*_{r}(*x*)), hence a map of probabilities that the pixel *x* belongs to a moving region, denoted *p*(*m*|*x*), can be inferred for the whole image as *p*(*m*|*x*) = *D*_{r}(*x*)/*D*_{r}(*x*_{max}).
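
The construction of the motion probability map from the difference image can be sketched as below; the warping by {\widehat{H}}^{k} and the rectification by *T* are assumed to have been applied beforehand.

```python
import numpy as np

def motion_probability_map(curr, prev_warped):
    """Pixel-wise motion probability p(m|x): absolute difference between the
    current image and the warped previous image, normalized by its maximum."""
    D = np.abs(curr.astype(float) - prev_warped.astype(float))
    peak = D.max()
    return D / peak if peak > 0 else D   # guard against an all-zero difference
```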

From the above discussion, observe that the regions of strong differences lie between the current vehicle position and the position it would occupy if it were static (which is always closer to the camera). Therefore, as opposed to the appearance-based modeling (Section 5.1.2), we expect that pixels in the neighborhood below the current vehicle position, *s*_{i,k}, have high likelihood values *p*(*m*|*x*), whereas the neighborhood above *s*_{i,k} should involve small or null probabilities of motion. Hence, the likelihood of the current vehicle state *s*_{i,k} regarding the motion analysis is defined as

{p}_{m}\left({z}_{i,k}|{s}_{i,k}\right)=\frac{1}{\left(w+1\right)h}\left(\sum _{x\in {R}_{a}}\left(1-p\left(m|x\right)\right)+\sum _{x\in {R}_{b}}p\left(m|x\right)\right)

where the regions *R*_{a} and *R*_{b} are those defined in Section 5.1.2, and the subscript *m* denotes that the probability refers to the motion observation. The likelihood obtained from the motion-based analysis is finally combined with that achieved after the appearance-based analysis. The joint likelihood of a candidate state *s*_{i,k} is simply defined as the arithmetic mean of the likelihoods:

p\left({z}_{i,k}|{s}_{i,k}\right)=\frac{1}{2}\left({p}_{a}\left({z}_{i,k}|{s}_{i,k}\right)+{p}_{m}\left({z}_{i,k}|{s}_{i,k}\right)\right)

(22)

Note that, although the product of the likelihoods could have been used instead, the mean is preferred to prevent the joint likelihood from being dominated by a single very small factor: with a product, one near-zero likelihood would nullify the result regardless of the other cue.
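
The effect of choosing the mean over the product is easy to verify:

```python
def joint_likelihood(p_a, p_m):
    """Joint observation likelihood (Eq. 22): arithmetic mean of the
    appearance-based and motion-based likelihoods."""
    return 0.5 * (p_a + p_m)
```

If, say, the motion cue momentarily vanishes (p_m = 0) while the appearance cue remains strong, the mean keeps the candidate alive, whereas a product would discard it outright.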