In this section, we present the main contribution of this work: a complete vision-based traffic detection system that enhances the data supplied by standard FCD systems. The benefits of using computer vision instead of other technologies, such as radar-based systems, can be summarized as follows. Computer vision can compensate for the lower angular resolution of a low-cost radar and for the increased appearance of ghost radar targets (guard-rails, railings, lamp posts, reflections, etc.); these false positives are relevant and cannot simply be ignored. A camera has very good angular resolution and can be used to determine the height, width, and lateral speed of a target. Pattern recognition can be used to classify the object, and even weakly reflective targets such as pedestrians can be detected. Moreover, the cost of adding a vision system is significantly lower than the cost saved by using a simpler radar. Beyond cost reduction, a vision system can also contribute additional features such as road analysis and scene understanding.
Each individual vehicle is equipped with three FireWire cameras (forward-, rear- and side-looking cameras) that cover the local environment of the bus (see Figure 2). A common hardware trigger synchronizes the image acquisition of the three cameras and an onboard PC houses the computer vision software.
Each individual vehicle detection system provides information about the number of detected vehicles and both their relative position and speed. These results are combined with the GPS measurements and the data provided by the CAN bus in order to provide globally referenced traffic information. This scheme is described in Figure 3.
The three vision modules share the same conceptual architecture: lane detection, vehicle candidate selection, vehicle recognition, and tracking. The first step of each vision system consists of intelligently reducing the search space in the image plane in order to increase the performance of the vehicle detection module. Accordingly, road lane markings are detected and used as the guidelines that drive the vehicle search process (see Figure 4). The area contained within the lane limits is scanned in order to find vehicle candidates that are passed on to the vehicle recognition modules, which reduces the rate of false positives. If no lane markings are detected, a basic region of interest covering the front, rear, and side areas of the vehicle is used instead. Finally, a tracking stage is implemented using Kalman filtering techniques.
3.1. Lane Detection
An attention mechanism is necessary in order to filter out inappropriate candidate windows based on the lack of distinctive features, such as horizontal edges and vertical symmetrical structures, which are essential characteristics of road vehicles. This has the positive effect of decreasing both the total computation time and the rate of false positive detections. Lane markings are detected using gradient information in combination with a local thresholding method that is adapted to the width of the projected lane markings. Then, clothoid curves are fitted to the detected markings. The algorithm scans up to 25 lines in the candidate search area, from 2 meters in front of the camera position to the maximum range, in order to collect lane marking measurements. The proposed method implements a nonuniform spacing search that reduces certain instabilities in the fitted curve. The final state vector for each lane on the road [5] comprises the clothoid horizontal curvature parameters $c_{h0}$ and $c_{h1}$, the clothoid vertical curvature parameters $c_{v0}$ and $c_{v1}$, the lateral error $y_{\mathrm{off}}$ and orientation error $\theta$ with regard to the centre of the lane, and the lane width $w$. The clothoid curves are then estimated from the lane marking measurements using a Kalman filter for each lane.
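For reference, a common third-order approximation of the clothoid lane model expresses the lateral position of a lane marking at look-ahead distance $l$ as (this is the standard formulation; the exact parameterization used in [5] may differ slightly):

$$y(l) = y_{\mathrm{off}} + \theta\, l + \frac{c_{h0}}{2}\, l^{2} + \frac{c_{h1}}{6}\, l^{3},$$

with an analogous expression in the vertical plane governed by $c_{v0}$ and $c_{v1}$.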
Apart from the detected road lanes, additional virtual lanes are considered so as to cope with situations in which a vehicle is located between two lanes (e.g., while performing a lane change manoeuvre). Virtual lanes provide the necessary overlap between lanes, avoiding both misdetections and double detections caused by the two halves of a vehicle being separately detected as two potential vehicles. A virtual lane is placed so as to overlap two adjoining lanes. Figure 5 provides some examples of lane marking detection in real outdoor scenarios. The detected lanes determine the vehicle search area and help reduce false positive detections. In case no lane markings are detected by the system, fixed lanes corresponding to a straight road model are assumed instead.
3.2. Side Vehicle Detection
The side vehicle detection module [6] relies on the computation of optical flow. In order to reduce computation time, optical flow is computed only on Canny points of the image. Canny edge pixels are then matched and grouped together in order to detect clusters of pixels that can be considered candidate vehicles in the image. Classical clustering techniques are used to determine groups of pixels, as well as their likelihood of forming a single object. Even after pixel clustering, some clusters can still clearly belong to the same real object. A second grouping stage (double-stage) is therefore carried out among the different clusters in order to determine which of them can be further merged into a single blob. For this purpose, simple distance criteria are applied: two clusters that are very close to each other are finally grouped together. The reason for using a two-stage clustering process is that a small distance parameter in the first stage preserves useful information about the clusters in the scene, whereas a single clustering pass with a large distance parameter would produce overly coarse clusters, losing all information about the granular content of the points that provide optical flow in the image. A sketch of this idea is given below.
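In the sketch, `flow_points` is assumed to be an (N, 2) array of image coordinates of Canny pixels with valid optical flow, and the two distance parameters are hypothetical values rather than those used by the authors.

```python
# Illustrative two-stage clustering of optical-flow points (not the authors'
# exact implementation): a fine pass preserves granular structure, then
# nearby clusters are merged by a simple distance criterion on centroids.
import numpy as np
from scipy.cluster.hierarchy import fclusterdata

def two_stage_clustering(flow_points, d_small=8.0, d_large=30.0):
    # Stage 1: fine clustering with a small distance parameter.
    fine_labels = fclusterdata(flow_points, t=d_small, criterion='distance')
    centroids = np.array([flow_points[fine_labels == k].mean(axis=0)
                          for k in np.unique(fine_labels)])
    # Stage 2: merge clusters whose centroids lie close to each other,
    # so fragments of the same vehicle end up in a single blob.
    if len(centroids) < 2:
        return fine_labels, np.ones(len(centroids), dtype=int), centroids
    blob_labels = fclusterdata(centroids, t=d_large, criterion='distance')
    return fine_labels, blob_labels, centroids
```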
The selected clusters constitute the starting point for locating candidate vehicles in the image. For that purpose, the detected cluster positions are used as seed points to search for a collection of horizontal edges that could potentially represent the lower part of a car. The candidate is located on the detected horizontal edges that meet certain conditions of entropy and vertical symmetry. Some of the most critical aspects in side vehicle detection are the following: (1) shadows on the asphalt due to lampposts, other artefacts, or a large vehicle overtaking the ego-vehicle on the right lane; (2) the self-shadow reflected on the asphalt (especially problematic in sharp turns such as roundabouts) or on road protection fences; (3) robust performance in tunnels; and (4) avoiding false alarms due to vehicles on the third lane.
The flow diagram of the two-stage detection algorithm is depicted in Figure 6. As can be observed, a pre-detector first discriminates whether the detected object is behaving like a vehicle or not. If so, the frontal part of the vehicle is located in the region of interest and the vehicle mass centre is computed. If the frontal part of the vehicle is properly detected and its mass centre can be computed, a final warning message is issued. After being located, vehicle candidates are classified using a linear SVM classifier [7] with HOG features [8], previously trained with samples obtained from real road images, and at that point vehicle tracking starts. Tracking stops when the vehicle leaves the image. Sometimes the shadow of the vehicle remains in the image for a while after the vehicle itself has disappeared from the scene, causing the warning alarm to stay active for 1 or 2 seconds. This is not a problem, however, since during that time the overtaking car is running in parallel with the ego-vehicle although it is out of the image; maintaining the alarm in such cases turns out to be a desirable side effect.
Figure 7 shows an example of blind spot detection in a sequence of images. The indicator depicted in the upper-right part of the figure toggles from green to blue when a vehicle enters the blind spot area (indicated by a green polygon). A blue bounding box depicts the position of the detected vehicle.
3.3. Forward and Rear Vehicle Detection
Forward- and rear-looking vehicle detection systems share the same algorithmic core. The attention mechanism sequentially scans each road lane from the bottom of the image to the maximum range looking for a set of features that might represent a potential vehicle. Firstly, the vehicle contact point is searched for by means of the top-hat transformation. This operator allows the detection of contrasted objects on nonuniform backgrounds [9]. There are two different types of top-hat transformation: white hat and black hat. The white hat transformation is defined as the residue between the original image and its opening (the $\circ$ operator), whereas the black hat transformation is defined as the residue between the closing (the $\bullet$ operator) and the original image. For an image $f$ and structuring element $b$, they are analytically defined as

$$\mathrm{WTH}(f) = f - (f \circ b), \qquad \mathrm{BTH}(f) = (f \bullet b) - f.$$

The opening operator ($\circ$) is defined as the dilation of the erosion, and the closing operator ($\bullet$) as the erosion of the dilation (for more details see [10]). In our case we use the white hat operator, since it enhances the boundary between the vehicles and the road [11]. Horizontal contact points are preselected if the number of white top-hat features is greater than a configurable threshold. Then, candidates are preselected if the entropy of Canny points is high enough within a region defined by means of perspective constraints and prior knowledge of the target objects (see Figure 8). A minimal example of the white top-hat operation is given below.
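In this example the structuring-element size and thresholds are illustrative assumptions, not the values used in the system.

```python
# White top-hat: residue between the image and its opening, WTH(f) = f - (f o b).
# It highlights bright, contrasted structures such as the vehicle-road boundary.
import cv2

gray = cv2.imread('frame.png', cv2.IMREAD_GRAYSCALE)
b = cv2.getStructuringElement(cv2.MORPH_RECT, (17, 3))   # wide, flat element (assumed size)
white_hat = cv2.morphologyEx(gray, cv2.MORPH_TOPHAT, b)
# Rows whose number of strong top-hat responses exceeds a configurable
# threshold are kept as candidate horizontal contact points.
row_counts = (white_hat > 40).sum(axis=1)
contact_rows = [y for y, c in enumerate(row_counts) if c > 30]
```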
Before computing the Canny features, an adaptive thresholding method is applied. This process is based on an iterative algorithm that gradually increases the contrast of the image and compares the number of Canny points obtained in the contrast-increased image with the number of edges obtained in the current image. If the number of Canny features in the current image is higher than in the contrast-increased image, the algorithm stops; otherwise, the contrast is increased again and the process is repeated. This adaptive thresholding method yields robust image edges, as shown in the examples of Figure 9. A sketch of the iteration is given below.
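In this sketch, the gain step and Canny thresholds are illustrative assumptions rather than the parameters used by the authors.

```python
# Iteratively boost contrast while this yields more Canny edges; stop as soon
# as the boosted image produces no more edges than the current one.
import cv2

def adaptive_canny(gray, gain_step=1.1, t_low=50, t_high=150, max_iters=10):
    current = gray.copy()
    n_edges = cv2.countNonZero(cv2.Canny(current, t_low, t_high))
    for _ in range(max_iters):
        boosted = cv2.convertScaleAbs(current, alpha=gain_step, beta=0)
        n_boosted = cv2.countNonZero(cv2.Canny(boosted, t_low, t_high))
        if n_boosted <= n_edges:      # no gain in edge content: stop
            break
        current, n_edges = boosted, n_boosted
    return cv2.Canny(current, t_low, t_high)
```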
In a second step, vertical edge, horizontal edge, and grey level symmetries are computed, and candidates only pass to the next stage if their symmetry values are greater than a threshold. The vertical and horizontal edge symmetries are computed as listed in Algorithm 1, and the grey level symmetry computation procedure is shown in Algorithm 2. Some examples of the three types of symmetries are depicted in Figure 10.
Algorithm 1: Vertical and horizontal edge symmetries computation procedure.
Algorithm 2: Grey level symmetry computation procedure.
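Since the pseudocode of Algorithms 1 and 2 is not reproduced here, the following sketch shows one plausible way to score edge and grey level symmetry for a candidate window; it is illustrative only and not the authors' exact procedure.

```python
# Symmetry scores with respect to the vertical centre axis of a candidate window.
import numpy as np

def edge_symmetry(edge_win):
    """Fraction of edge pixels mirrored across the vertical centre axis."""
    half = edge_win.shape[1] // 2
    left = edge_win[:, :half] > 0
    right = np.fliplr(edge_win[:, -half:]) > 0
    matched = np.logical_and(left, right).sum()
    return matched / max(left.sum(), right.sum(), 1)

def gray_symmetry(gray_win, tol=20):
    """Fraction of mirrored pixel pairs whose grey levels differ by less than tol."""
    half = gray_win.shape[1] // 2
    left = gray_win[:, :half].astype(np.int16)
    right = np.fliplr(gray_win[:, -half:]).astype(np.int16)
    return float(np.mean(np.abs(left - right) < tol))
```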
Symmetry axes are linearly combined to obtain the final position of each candidate. Finally, a weighted score is defined as a function of the entropy of Canny points, the three symmetry values, and the distance to the host vehicle. This score is used to apply a per-lane non-maximum suppression process that removes overlapping candidates. An example of this process is shown in Figure 11.
The selected candidates are classified by means of a linear SVM classifier [7] in combination with histograms of oriented gradients (HOG) features [8]. We have developed and tested two different classifiers, one per module (forward and rear classifiers). All candidates are resized to a fixed size of 64 × 64 pixels to facilitate the feature extraction process. The rear-SVM classifier is trained with 2000 samples and tested with 1000 samples (1/1 positive/negative ratio), whereas the forward-SVM classifier is trained with 3000 samples and tested with 2000 samples (1/1 positive/negative ratio). Figures 12 and 13 depict some positive and negative samples of the forward and rear training and test data sets, respectively. Figure 14 shows a couple of examples of vehicle detection after linear SVM classification with HOG features.
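A hedged sketch of this classification step is shown below; the HOG parameters are common defaults and the training loop is omitted, so this is not necessarily the exact configuration used for the forward and rear classifiers.

```python
# HOG features on 64x64 candidates fed to a linear SVM (sketch only).
import cv2
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(patch):
    patch = cv2.resize(patch, (64, 64))          # fixed candidate size
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

# X_train: grayscale candidate patches, y_train: 1 = vehicle, 0 = non-vehicle.
# clf = LinearSVC(C=1.0).fit([hog_features(p) for p in X_train], y_train)
# is_vehicle = clf.predict([hog_features(candidate_patch)])[0] == 1
```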
Once an object has been classified as a vehicle in a predefined number of consecutive frames (empirically set to 3 in this work), the data association and tracking stages are triggered. The data association problem is addressed using feature matching techniques: Harris features are detected and matched between two consecutive frames, as depicted in Figure 15.
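The sketch below illustrates one way this matching can be performed, using Harris corners inside the tracked bounding box and pyramidal Lucas-Kanade to find their positions in the next frame; the parameters and the use of Lucas-Kanade are assumptions made for illustration.

```python
# Harris corners inside a vehicle's bounding box, matched to the next frame.
import cv2
import numpy as np

def match_harris(prev_gray, next_gray, bbox):
    u, v, w, h = bbox                          # top-left corner, width, height
    mask = np.zeros_like(prev_gray)
    mask[v:v + h, u:u + w] = 255
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=50, qualityLevel=0.01,
                                  minDistance=5, mask=mask,
                                  useHarrisDetector=True)
    if pts is None:
        return None, None
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    good = status.ravel() == 1
    return pts[good], nxt[good]                # matched point pairs
```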
Tracking is implemented using Kalman filtering techniques [12]. For this purpose, a dynamic state model and a measurement model must be defined. The proposed dynamic state model is simple. Let us consider the state vector $\mathbf{x}_k$, defined as

$$\mathbf{x}_k = \left[\, u \ \ v \ \ w \ \ h \,\right]^{T},$$

where $u$ and $v$ are the horizontal and vertical image coordinates of the top left corner of each object, and $w$ and $h$ are its width and height in the image plane. The dynamic model equation can then be written as

$$\mathbf{x}_{k+1} = A\,\mathbf{x}_k + \mathbf{w}_k,$$

where $A$ represents the system dynamics matrix (a function of the sample time between frames) and $\mathbf{w}_k$ is the noise associated with the model. Although the definition of $A$ is simple, it proves to be highly effective in practice, since the real-time operation of the system ensures that there will not be great differences in distance for the same vehicle between consecutive frames. The model noise has been modelled as a function of distance and camera resolution. The state model equation is used for prediction in the first step of the Kalman filter. The next step is to define the measurement model. The measurement vector is defined as $\mathbf{z}_k = \left[\, u_m \ \ v_m \ \ w_m \ \ h_m \,\right]^{T}$, and the measurement model equation is established as

$$\mathbf{z}_{k} = H\,\mathbf{x}_k + \mathbf{v}_k,$$

where $H$ represents the measurement matrix and $\mathbf{v}_k$ is the noise associated with the measurement process. The purpose of the Kalman filtering is to obtain a more stable position of the detected vehicles. In addition, oscillations in vehicle position due to the unevenness of the road make the $v$ coordinate of the detected vehicles change several pixels up or down, which in turn makes the distance estimate unstable, so the Kalman filter is necessary to minimize these oscillations.
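A minimal sketch of such a tracker, assuming a constant-position model over the state $[u, v, w, h]$ and illustrative noise magnitudes (the paper models the noise as a function of distance and camera resolution), could look as follows:

```python
# Kalman tracker over the bounding-box state [u, v, w, h] (sketch only).
import cv2
import numpy as np

kf = cv2.KalmanFilter(4, 4)                          # 4 state vars, 4 measurements
kf.transitionMatrix = np.eye(4, dtype=np.float32)    # A: little change between frames
kf.measurementMatrix = np.eye(4, dtype=np.float32)   # H: state measured directly
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
kf.measurementNoiseCov = np.eye(4, dtype=np.float32) * 1.0
kf.errorCovPost = np.eye(4, dtype=np.float32)

def track(measured_uvwh):
    kf.predict()
    z = np.asarray(measured_uvwh, dtype=np.float32).reshape(4, 1)
    return kf.correct(z).ravel()                     # filtered [u, v, w, h]
```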
3.4. FCD Integration
As depicted in Figure 3, the FCD integration (data fusion) module uses three sources of data: the measurements provided by the GPS, the data supplied by the CAN bus, and the outputs of the three vision-based vehicle detection modules. Whereas the GPS and CAN bus sample frequency is 1 Hz, the vision-based system operates in real time at 25 frames per second (25 Hz). The proposed data fusion scheme provides information at the lowest sample frequency (1 Hz), combining two consecutive GPS measurements, the host vehicle speed $v_{\mathrm{host}}$ (via CAN bus), and the outputs of the vision modules.
The outputs of the side, forward, and rear vehicle detection systems at frame $k$ are the number of detected vehicles $n_k$ and their corresponding distances to the host vehicle $d_i^k$ (where $d$ denotes a distance/range measurement). These outputs are combined to cover the whole local environment of the vehicle. The traffic load at frame $k$ is given by

$$L_k = \frac{n_k}{N_{\max}},$$

where $N_{\max}$ is the maximum number of vehicles in range that can be detected by the three systems (in our case $N_{\max}$ is 9 for two-lane roads and 13 for three-lane roads). The average road speed at frame $k$ is computed as

$$\bar{v}_k = v_{\mathrm{host}} + \frac{1}{n_k}\sum_{i=1}^{n_k} \frac{d_i^k - d_i^{k-1}}{\Delta t},$$

where $d_i^k$ and $d_i^{k-1}$ represent the distance between the host vehicle and vehicle $i$ at frames $k$ and $k-1$, respectively, $\Delta t$ is the sample time, $v_{\mathrm{host}}$ is the host vehicle speed provided by the CAN bus, and $n_k$ is the number of detected vehicles. Note that the distance values are filtered measurements, since they are obtained from the first two elements of the Kalman filter state vector ($u$ and $v$) using known camera geometry and ground-plane constraints.
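The two expressions above translate directly into code; the sketch below uses illustrative variable names and assumes distances in meters and speeds in m/s.

```python
# Extended-FCD quantities at frame k (sketch of the expressions above).
def traffic_load(n_detected, n_max):
    # n_max = 9 for two-lane roads, 13 for three-lane roads.
    return n_detected / float(n_max)

def average_road_speed(d_now, d_prev, dt, v_host):
    # d_now, d_prev: per-vehicle distances to the host at frames k and k-1,
    # dt: sample time (s), v_host: host speed from the CAN bus (m/s).
    rel_speeds = [(dk - dk_prev) / dt for dk, dk_prev in zip(d_now, d_prev)]
    return v_host + sum(rel_speeds) / len(rel_speeds)
```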
Two consecutive GPS measurements define both a spatial and a temporal segment. The temporal segment corresponds to the GPS sample time (1 second), and the spatial segment is defined as the globally referenced trajectory between the two GPS measurements. In order to obtain the extended FCD information (i.e., the road traffic load and the road speed) for this spatio-temporal segment, we integrate the values supplied by the vision modules over 25 consecutive frames. With this approach, dense coverage of the road traffic load and road speed is assured for host vehicle speeds of up to 180 km/h, since the total range of the vision modules covers more than 50 m (25 meters each for the rear- and forward-looking modules; the side range covers up to two thirds of the bus length in the adjacent lane). Obviously, this maximum speed will never be exceeded by a public bus. This approach facilitates further map-matching tasks, since the extended FCD information between two consecutive points is always globally referenced.
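As a quick check of this bound, note that

$$180\ \mathrm{km/h} = \frac{180\,000\ \mathrm{m}}{3600\ \mathrm{s}} = 50\ \mathrm{m/s},$$

so the host vehicle advances at most 50 m during one 1-second GPS interval, which is within the more than 50 m covered by the vision modules.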