Figure 2 shows the conceptual diagram of the proposed object detection and tracking system, which includes two subsystems: the improved YOLO (iYOLO) object detector and the double-layer LSTM (dLSTM) object refiner. After the iYOLO detector, the dLSTM refiner takes T consecutive outputs of the iYOLO to refine the final prediction. Before the dLSTM refiner, however, we need to spatially order the iYOLO outputs, which are the bounding boxes and confidences of the detected objects, to correctly characterize their spatial associations. After the spatial association, the dLSTM object refiner performs the final refinement of the iYOLO outputs. The iYOLO object detector, the multi-object spatial association, and the dLSTM refiner, shown in the top part of Fig. 2, are described in detail in the following subsections.
3.1 The iYOLO object detector
The proposed object detection system, as shown in the top part of Fig. 2, first resizes the images to 416 × 416 or 448 × 448 as the inputs of the improved YOLO (iYOLO) object detection network. To improve performance and reduce computation, the proposed iYOLO object detector is designed to classify 30 on-road moving objects, with the car, bus, and truck classes merged into one combined-vehicle class. The iYOLO also combines low-level and high-level features to detect the objects. The data representation, network structure, and loss functions of the iYOLO are described as follows.
For moving object detection, we need to predict not only the locations and box sizes of the detected objects but also their classes. Therefore, in the iYOLO, the output data is a three-dimensional array of size \({14} \times {14} \times D\), where D denotes the number of channels that represent the detections and classifications. The 14 × 14 array corresponds to the grid cells of the image; thus, there are 196 grid cells in total. Each grid cell, as shown in Fig. 3, contains five bounding boxes, which are called “anchor boxes”.
Each grid cell contains D elements, which carry the position, confidence, and class information, where D is given by:
$$D = B \times (5 + M),$$
(7)
where M is the number of classes and B denotes the number of anchor boxes in a single grid cell. As shown in Fig. 4, each grid cell contains B bounding boxes, and each bounding box comprises (5 + M) parameters. For the bth box, we have 5 parameters: its bounding box, \({{\{ }}x^{b} ,y^{b} ,w^{b} ,h^{b} {{\} }}\), and the occurrence probability, \(P^{b} = P(o^{b} )\). The bounding box is defined by its center coordinates \(x^{b}\) and \(y^{b}\) and by its width \(w^{b}\) and height \(h^{b}\). In Fig. 4, \(C_{i}^{b}\) denotes the conditional probability that the bth box contains the ith class:
$$C_{i}^{b} = P( \, i{\text{th}}\;{\text{ class}}|o^{b} ),$$
(8)
for \(1 \le i \le M\). Therefore, the probability of the ith class object in the bth box is given by:
$$P(i{\text{th}}\;{\text{class}}, \, o^{b} ) = C_{i}^{b} P^{b} ,$$
(9)
where \(P^{b}\) denotes the occurrence probability of the bth box. If \(C_{i}^{b} P^{b}\) passes a pre-defined threshold, we consider an object of the ith class to exist in the bth bounding box and to be detected by the proposed iYOLO detector.
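As a minimal sketch of Eqs. (7)–(9), the following Python snippet computes the channel number D and applies the detection threshold to one anchor box. The function names, the threshold value of 0.5, and the toy probabilities are illustrative assumptions and are not taken from the proposed system; only B = 5 and M = 30 come from the text.

```python
def channels_per_cell(num_anchors, num_classes):
    """Eq. (7): D = B * (5 + M)."""
    return num_anchors * (5 + num_classes)


def detect_classes(box_prob, class_probs, threshold):
    """Eqs. (8)-(9): keep the classes whose joint score passes the threshold.

    box_prob    -- P^b, occurrence probability of the b-th box
    class_probs -- [C_1^b, ..., C_M^b], conditional class probabilities
    threshold   -- pre-defined detection threshold (value assumed here)
    """
    detections = []
    for i, c in enumerate(class_probs, start=1):
        score = c * box_prob  # P(i-th class, o^b) = C_i^b * P^b
        if score > threshold:
            detections.append((i, score))
    return detections


# B = 5 anchor boxes and M = 30 classes, as stated in the text
D = channels_per_cell(5, 30)

# Toy example with only M = 3 classes for a single anchor box
hits = detect_classes(box_prob=0.9, class_probs=[0.1, 0.8, 0.1], threshold=0.5)
```

With B = 5 and M = 30, each grid cell carries D = 175 values, so the output array would be 14 × 14 × 175 under these settings.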
3.2 Network structure of iYOLO detector
The proposed iYOLO network structure, as shown in Fig. 5, is composed of several stages of convolutional layers and max pooling layers. The convolutional layers [38] with batch normalization [39] mostly use 3 × 3 or 1 × 1 convolutions. The pooling layers perform direct down-sampling with stride 2. In this paper, we merge the car, truck, and bus classes into the vehicle class. Thus, the number of output classes of the iYOLO is reduced to 30. As shown in Fig. 5, we eliminate three sets of two repeated 3 × 3 × 1024 convolution layers compared to the YOLOv2. We can remove these high-level layers, and thereby reduce the computations, because a single vehicle class represents all types of cars. The more high-level layers we remove, the less complex the model becomes.
To enhance the performance, the proposed iYOLO further includes two low-level features to help the final detection. As marked by the thick red lines and green-box functions in Fig. 5, we concatenate the outputs of the 12th (after the green-boxed max-pooling) and 17th (after the green-boxed 3 × 3 × 512 convolution) layers with the final feature. To keep the size of the low-level features the same as that of the high-level feature, we introduce two convolution layers, 3 × 3 × 128 and 1 × 1 × 64, for the first low-level feature and two convolution layers, 3 × 3 × 256 and 1 × 1 × 64, for the second low-level feature, and reorganize their features into half of the original resolution in the marked functions before the concatenation.
The proposed iYOLO outputs the bounding box location, classification, and confidence results simultaneously. Therefore, the training loss of the model is composed of three loss functions: (1) the location loss of the bounding box, g = \({{\{ }}x,y,w,h{{\} }}\), where (x, y) denotes the center position while w and h respectively represent the width and height of the bounding box; (2) the classification loss defined by the conditional probability for a specific class, \(p_{s} (i)\); and (3) the confidence loss related to the probability \(P_{s} (o)\) that an object exists in the sth grid cell. The total loss function is given by
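The reorganization of a low-level feature into half of its resolution can be illustrated with a small sketch. It assumes that the reorganization is a space-to-depth rearrangement similar to the passthrough (reorg) layer of YOLOv2; the exact channel ordering used in Fig. 5 is not specified in the text, and the function names and toy sizes below are hypothetical.

```python
def space_to_depth(feature, block=2):
    """Rearrange an H x W x C nested list into (H/block) x (W/block) x (C*block*block)."""
    height, width = len(feature), len(feature[0])
    assert height % block == 0 and width % block == 0
    out = []
    for r in range(0, height, block):
        row = []
        for c in range(0, width, block):
            cell = []
            # Stack the block x block neighbouring pixels along the channel axis
            for dr in range(block):
                for dc in range(block):
                    cell.extend(feature[r + dr][c + dc])
            row.append(cell)
        out.append(row)
    return out


def concat_channels(a, b):
    """Concatenate two feature maps of equal spatial size along the channel axis."""
    return [[pa + pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy low-level feature: 4 x 4 x 1, values 0..15
low = [[[r * 4 + c] for c in range(4)] for r in range(4)]
# Toy high-level feature: 2 x 2 x 2
high = [[[100, 101] for _ in range(2)] for _ in range(2)]

reorganized = space_to_depth(low)            # 2 x 2 x 4
fused = concat_channels(high, reorganized)   # 2 x 2 x 6
```

The rearrangement halves the spatial resolution without discarding any value, which is why the low-level feature can be concatenated with the high-level feature afterwards.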
$$L(f,c,g,\overline{g}) = L_{{{\text{loc}}}} (f,g,\overline{g}) + L_{{{\text{con}}}} (f,g) + L_{{{\text{cls}}}} (f,g,c),$$
(10)
where f is the input image data, c is the class confidence, and g and \(\overline{g}\) denote the predicted and the ground truth boxes, respectively. As stated in (10), the total loss is composed of the location loss, confidence loss, and class loss functions, which are balanced by the weighting factors \(\lambda_{{{\text{loc}}}}\), \(\lambda_{{{\text{obj}}}}\), \(\lambda_{{{\text{noobj}}}}\), and \(\lambda_{{{\text{cls}}}}\). These three loss functions are described as follows.
The location loss, \(L_{{{\text{loc}}}}\) is given as:
$$L_{{{\text{loc}}}} = \lambda_{{{\text{loc}}}} \sum\limits_{s = 1}^{{S^{2} }} {\sum\limits_{b = 1}^{B} {\alpha_{s,b}^{o} \left\| {g_{s} - \overline{g}_{s} } \right\|^{2} } } ,$$
(11)
where \(g_{s} {{ = \{ }}x_{s} ,y_{s} ,w_{s} ,h_{s} {{\} }}\) and \(\overline{g}_{s} {{ = \{ }}\overline{x}_{s} ,\overline{y}_{s} ,\overline{w}_{s} ,\overline{h}_{s} {{\} }}\) are the predicted and ground truth bounding boxes, and \(S^{2}\) and B denote the numbers of grid cells and anchor boxes, respectively. In (11), \(\lambda_{{{\text{loc}}}}\) represents the location weighting factor and \(\alpha_{s,b}^{o}\) indicates whether the bth box of the sth grid cell is responsible for the detection. If the intersection over union (IoU) of the bounding box passes the threshold of 0.6, the box-specific index \(\alpha_{s,b}^{o}\) is 1; otherwise, \(\alpha_{s,b}^{o}\) is 0.
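The location loss in (11) can be sketched in plain Python as follows. The sketch assumes that the predicted box is available per grid cell and per anchor box while the ground truth is given per grid cell, and that "passing" the IoU threshold means IoU ≥ 0.6; the weighting factor of 5.0 and all sample values (including the IoU values, which are supplied rather than computed) are illustrative assumptions.

```python
def responsibility(iou_value, iou_threshold=0.6):
    """alpha^o_{s,b}: 1 if the box passes the IoU threshold, otherwise 0."""
    return 1 if iou_value >= iou_threshold else 0


def location_loss(pred, truth, alpha, lambda_loc):
    """Eq. (11): weighted sum of squared box errors over responsible boxes.

    pred[s][b] -- predicted (x, y, w, h) of the b-th box in the s-th grid cell
    truth[s]   -- ground truth (x, y, w, h) for the s-th grid cell
    alpha[s][b] -- responsibility index in {0, 1}
    """
    total = 0.0
    for s, cell_boxes in enumerate(pred):
        for b, box in enumerate(cell_boxes):
            if alpha[s][b]:
                total += sum((p - t) ** 2 for p, t in zip(box, truth[s]))
    return lambda_loc * total


# Two grid cells with two anchor boxes each (toy values)
pred = [
    [(0.5, 0.5, 0.5, 0.5), (0.2, 0.2, 0.9, 0.9)],
    [(0.1, 0.1, 0.3, 0.3), (0.7, 0.7, 0.2, 0.2)],
]
truth = [(0.5, 0.5, 0.5, 0.4), (0.0, 0.0, 0.0, 0.0)]
ious = [[0.8, 0.3], [0.0, 0.0]]  # assumed IoU of each predicted box with the truth
alpha = [[responsibility(v) for v in row] for row in ious]

loss = location_loss(pred, truth, alpha, lambda_loc=5.0)
```

Only the first box of the first grid cell is responsible here, so the loss reduces to 5.0 × (0.5 − 0.4)².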
The confidence loss, \(L_{{{\text{con}}}}\) is expressed as:
$$L_{{{\text{con}}}} = \lambda_{{{\text{obj}}}} \sum\limits_{s = 1}^{{S^{2} }} {\sum\limits_{b = 1}^{B} {\alpha_{s,b}^{o} [(P_{s} (o) - P_{s} (\overline{o}))]^{2} } } + \lambda_{{{\text{noobj}}}} \sum\limits_{s = 1}^{{S^{2} }} {\sum\limits_{b = 1}^{B} {(1 - \alpha_{s,b}^{o} )[(P_{s} (o) - P_{s} (\overline{o}))]^{2} } } ,$$
(12)
where the first term is the bounding box confidence loss for boxes with objects while the second term is the confidence loss for boxes without objects. In (12), \(\lambda_{{{\text{obj}}}}\) and \(\lambda_{{{\text{noobj}}}}\) are the confidence weighting factors for the object and no-object cases, respectively. The loss values are only valid for responsible bounding boxes, \(\alpha_{s,b}^{o} = 1\), since the non-responsible bounding boxes do not have a ground truth label.
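A minimal sketch of the confidence loss in (12) is given below. It reads \(P_{s} (o)\) as the predicted confidence and \(P_{s} (\overline{o})\) as its target value, by analogy with g and \(\overline{g}\); this reading, the per-box indexing, and the weighting factors 1.0 and 0.5 are assumptions made for illustration.

```python
def confidence_loss(pred_conf, target_conf, alpha, lambda_obj, lambda_noobj):
    """Eq. (12): object and no-object confidence terms with separate weights.

    pred_conf[s][b]   -- predicted confidence of the b-th box in the s-th cell
    target_conf[s][b] -- target confidence of the same box
    alpha[s][b]       -- responsibility index in {0, 1}
    """
    obj_term = 0.0
    noobj_term = 0.0
    for s in range(len(pred_conf)):
        for b in range(len(pred_conf[s])):
            err = (pred_conf[s][b] - target_conf[s][b]) ** 2
            if alpha[s][b]:
                obj_term += err      # first sum, weighted by lambda_obj
            else:
                noobj_term += err    # second sum, weighted by lambda_noobj
    return lambda_obj * obj_term + lambda_noobj * noobj_term


# One grid cell with two anchor boxes (toy values)
conf_loss = confidence_loss(
    pred_conf=[[0.5, 0.5]],
    target_conf=[[1.0, 0.0]],
    alpha=[[1, 0]],
    lambda_obj=1.0,
    lambda_noobj=0.5,
)
```

A smaller no-object weight, as assumed here, keeps the many empty boxes from dominating the few boxes that contain objects.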
The classification loss, \(L_{{{\text{cls}}}}\) is given as:
$$L_{{{\text{cls}}}} = \lambda_{{{\text{cls}}}} \sum\limits_{s = 1}^{{S^{2} }} {\sum\limits_{b = 1}^{B} {\alpha_{s,b}^{o} \sum\limits_{{i \in {\text{classes}}}} {[(p_{s} (i) - \overline{p}_{s} (i))]^{2} } } } ,$$
(13)
where \(p_{s} (i)\) denotes the confidence of the ith class for the anchor box in the specific grid cell and \(\lambda_{{{\text{cls}}}}\) denotes the classification weighting factor. If the bounding box does not contain any object, the predicted probability should be decreased to close to 0. On the contrary, if the box contains an object, the predicted probability should be pushed to near 1.
Before discussing the proposed dLSTM object refiner, we should properly associate the iYOLO outputs of each detected object according to its spatial position. The spatial association of the temporal information of multiple objects is designed to collect all the outputs of the same physical object in a spatial-priority order so that they become the time-series inputs of the dLSTM object tracking modules.
3.3 Multi-object association
To associate a series of outputs with each detected object, we need a proper association rule to construct the detected data array used as the input of the dLSTM object refiner. Usually, the iYOLO lists the detection results in order of confidence, which is not a robust index for object association since the confidences vary from time to time. Therefore, we use the close-to-the-car distance as the object association index, since on-road moving objects travel smoothly around their nearby areas. For the ith detected object with its bounding box, \(g_{i} {{ = \{ }}x_{i} ,y_{i} ,w_{i} ,h_{i} {{\} }}\), the association should be based on the spatial positions in the image frame. We set the upper-left corner of the frame as the origin at (0, 0) and the lower-right corner at (W − 1, H − 1), where W and H are the width and height of the frame, respectively. The position (W/2, H), which is used to measure the closeness of the detected vehicle to the driving car, is set as the reference point. The closer a detected object is to the reference point, the more dangerous it is to the car. Thus, the closer the bounding box of a detected object is to the reference point, the more important the object is and the higher the priority we should give it. The detected objects are spatially ordered by a priority-regulated distance to the reference point as
$$d_{i} = \left( {(\Delta x_{i}^{d} )^{2} + \rho (\Delta y_{i}^{d} )^{2} } \right)^{1/2} ,$$
(14)
where the horizontal distance between the ith bounding box and the reference point is given as:
$$\Delta x_{i}^{d} = \mathop {\min }\limits_{x \in R} \left| {x - W/2} \right|,$$
(15)
with \(R = \left\{ {x\left| {(x_{i} - w_{i} /2) \le x \le (x_{i} + w_{i} /2)} \right.} \right\}\) and the vertical distance is expressed by
$$\Delta y_{i}^{d} = \, H - y_{i} - (h_{i} /2).$$
(16)
The horizontal distance \(\Delta x_{i}^{d}\) defined in (15) is the minimum horizontal displacement between any point in the bounding box and the reference point. If \(\Delta x_{i}^{d}\) is smaller, the bounding box is horizontally closer to the reference point. If \(\Delta y_{i}^{d}\) is smaller, the bottom of the bounding box of the object is closer to the bottom of the frame.
After computing the priority-regulated distances of all detected objects, the object indices are determined by sorting these distances: a smaller priority-regulated distance is given a higher order, i.e., a smaller priority index. With ρ = 0.5, Fig. 6 shows the order of the object confidences and the priority order given by the priority-regulated distances between the reference point and the detected objects. The order of the objects by the regulated distances is spatially stable since the spatial positions of real objects do not change too quickly. Even if an object moves horizontally and occludes other objects, the tracked objects remain reasonable and stable, since with one combined-vehicle class we do not care about the occluded ones. We can then focus on the objects that are geometrically close to the driving car and give them higher priorities for tracking.
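The ordering rule of (14)–(16) can be sketched as follows. The sketch assumes that the horizontal term measures the absolute distance from the nearest point of the box extent to x = W/2, which is zero when the box spans the frame center; the function names, the 416 × 416 frame, and the sample boxes are hypothetical.

```python
import math


def priority_distance(box, frame_w, frame_h, rho):
    """Priority-regulated distance d_i of Eq. (14) for a box (x, y, w, h)."""
    x, y, w, h = box  # center (x, y), width w, height h
    # Eq. (15): nearest horizontal distance from the box extent to x = W/2
    dx = max(0.0, abs(x - frame_w / 2) - w / 2)
    # Eq. (16): distance from the box bottom to the frame bottom
    dy = frame_h - y - h / 2
    return math.sqrt(dx ** 2 + rho * dy ** 2)


def priority_order(boxes, frame_w, frame_h, rho=0.5):
    """Return the box indices sorted from highest priority (smallest d_i) to lowest."""
    dists = [priority_distance(b, frame_w, frame_h, rho) for b in boxes]
    return sorted(range(len(boxes)), key=lambda i: dists[i])


boxes = [
    (50, 200, 40, 40),    # far left, mid-height
    (208, 380, 100, 60),  # centered, near the bottom of the frame
    (300, 350, 60, 80),   # right of center, fairly low
]
order = priority_order(boxes, frame_w=416, frame_h=416, rho=0.5)
```

Here the centered box near the bottom of the frame receives the first priority, followed by the box to the right and then the distant box on the left.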
3.4 LSTM refiner
After determining the priority order of the detected objects, we collect all bounding boxes with the same priority order as a 2D data array. For example, the data array for the first-priority object consists of \(g_{t}^{(1)} {{ = \{ }}x_{t}^{{(1{)}}} ,y_{t}^{{(1{)}}} ,w_{t}^{(1)} ,h_{t}^{(1)} {{\} }}\), for t = 1, 2, …, T. For simplicity, we omit the index of the priority order. For each detected object at instant T, we then collect an array of bounding boxes as
$$L = \left\{ {X_{1} ,X_{2} ,X_{3} \ldots ,X_{T} } \right\},$$
(17)
where \(X_{t} = [x_{t,1} ,y_{t,1} ,x_{t,2} ,y_{t,2} ]^{{\text{T}}}\) denotes the positions of the top-left corner \((x_{t,1} ,y_{t,1} )\) and the bottom-right corner \((x_{t,2} ,y_{t,2} )\) of the bounding box at the tth instant. After the collection of \(X_{t}\) for T consecutive samples, Fig. 7 shows the 2D time-series data array of bounding boxes for each detected object.
With the 2D data array for each object, the double-layer LSTM (dLSTM) is designed to reduce unstable bounding boxes. We might use a longer LSTM module to achieve better performance; however, it would introduce unnecessary tracking delay. As shown in Fig. 8, the dLSTM refiner contains K-element hidden state vectors over T time instants. The fully connected layer takes \(h_{T}^{(2)}\), the Tth hidden state vector of the second LSTM layer, and outputs the 4-element predicted position of the bounding box, \(\tilde{X}\). As stated in (17), the dLSTM network inputs a series of bounding box data \(X_{t}\) for t = 1, 2, …, T. In Fig. 8, \(c_{t}^{(l)}\) and \(h_{t}^{(l)}\) denote the cell and the hidden states of the lth layer at the tth time step of the dLSTM model, respectively. To make the model deeper for more accurate learning [30, 31], we stack two LSTM layers to achieve better time series prediction while avoiding a long delay. As stated in (6), the first LSTM layer inputs the location data array L in chronological order. It generates the K-dimension hidden state \(h_{t}^{(1)}\) and the K-dimension cell state \(c_{t}^{(1)}\). Then, the hidden state features of the first LSTM layer are used as the inputs of the second LSTM layer. In the dLSTM refiner, the hidden state \(h_{t}^{(l)}\) is treated as the decision output and the cell state \(c_{t}^{(l)}\) gets the updates from the output of the previous step \(h_{t - 1}^{(l)}\). The second LSTM layer returns only the last step of its output sequence, which drops the temporal dimension. Finally, the fully connected layer that follows the second LSTM layer maps the K-dimension feature vector to the predicted location \(\tilde{X}\) learned by the dLSTM module.
To train the dLSTM module, we use the IoU-location loss, which combines the intersection over union (IoU) and position mean square error (MSE) losses as
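The data flow of the dLSTM refiner can be sketched with a small forward pass in plain Python. The sketch uses the standard LSTM cell equations, which may differ in detail from the formulation in (6), and untrained random weights, so the output only demonstrates the shapes: the first layer returns the whole hidden sequence, the second layer returns only its last hidden state, and a fully connected layer maps it to four coordinates. The hidden size K = 8, the weight range, and the toy input boxes are assumptions; T = 10 follows the text.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_lstm(input_size, hidden_size, rng):
    """Random (weights, bias) pairs for the input, forget, candidate, and output gates."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {g: (mat(hidden_size, input_size + hidden_size), [0.0] * hidden_size)
            for g in ("i", "f", "g", "o")}


def affine(weights, bias, vec):
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def lstm_layer(params, sequence, hidden_size):
    """Run one LSTM layer and return the hidden state at every time step."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    outputs = []
    for x in sequence:
        z = list(x) + h  # concatenate the input with the previous hidden state
        i = [sigmoid(v) for v in affine(*params["i"], z)]
        f = [sigmoid(v) for v in affine(*params["f"], z)]
        g = [math.tanh(v) for v in affine(*params["g"], z)]
        o = [sigmoid(v) for v in affine(*params["o"], z)]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        outputs.append(h)
    return outputs


def dlstm_forward(params1, params2, fc, sequence, hidden_size):
    h1 = lstm_layer(params1, sequence, hidden_size)     # all h_t^(1), t = 1..T
    h2_last = lstm_layer(params2, h1, hidden_size)[-1]  # only h_T^(2)
    return affine(fc[0], fc[1], h2_last)                # 4-element box prediction


rng = random.Random(0)
K, T = 8, 10  # K is an assumed hidden size; T = 10 as in the text
layer1 = init_lstm(4, K, rng)
layer2 = init_lstm(K, K, rng)
fc = ([[rng.uniform(-0.1, 0.1) for _ in range(K)] for _ in range(4)], [0.0] * 4)

# T toy corner boxes [x1, y1, x2, y2] in normalized coordinates
L = [[0.40 + 0.01 * t, 0.50, 0.60 + 0.01 * t, 0.70] for t in range(T)]
X_tilde = dlstm_forward(layer1, layer2, fc, L, K)
```

In practice a deep learning framework would supply the LSTM layers and the training loop; the explicit loops here are only meant to make the two-layer stacking visible.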
$$L_{{{\text{dLSTM}}}} = - \alpha \cdot \frac{1}{n}\sum\limits_{i = 1}^{n} {\log ({\text{IoU}}(X_{i} ,\overline{X}_{i} ))} + \beta \cdot \frac{1}{n}\sum\limits_{i = 1}^{n} {\left\| {X_{i} - \overline{X}_{i} } \right\|^{2} } ,$$
(18)
where \(X_{i}\) and \(\overline{X}_{i}\) denote the locations of the predicted and ground truth bounding boxes, respectively. In (18), the IoU, which represents the ratio of the intersection to the union of the predicted and ground truth bounding boxes, satisfies 0 < IoU < 1. Thus, we use − log(IoU) as the first loss function. In addition, the mean square error (MSE) between \(X_{i}\) and \(\overline{X}_{i}\) is used as the coordinate loss function. To balance the IoU and MSE loss functions, we need to select α and β properly. After the training process, dLSTM refiners with the same weights are used for all detected objects in this paper.
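A minimal sketch of the IoU-location loss in (18) for corner-format boxes is given below. The small constant `eps`, which guards the logarithm when two boxes do not overlap, is an assumption added for numerical safety and is not part of (18); the values α = 1.0 and β = 0.5 are likewise illustrative.

```python
import math


def iou(a, b):
    """IoU of two corner-format boxes [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def dlstm_loss(preds, truths, alpha, beta, eps=1e-7):
    """Eq. (18): alpha * mean(-log IoU) + beta * mean squared coordinate error."""
    n = len(preds)
    iou_term = -sum(math.log(max(iou(p, t), eps)) for p, t in zip(preds, truths)) / n
    mse_term = sum(
        sum((pi - ti) ** 2 for pi, ti in zip(p, t)) for p, t in zip(preds, truths)
    ) / n
    return alpha * iou_term + beta * mse_term


# One predicted box shifted by one unit from its ground truth (IoU = 1/3)
example_loss = dlstm_loss([[0, 0, 2, 2]], [[1, 0, 3, 2]], alpha=1.0, beta=0.5)
```

For this example the IoU term is log 3 and the squared coordinate error is 2, so the loss equals log 3 + 1 under the assumed weights.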
3.5 Object tracking status
For each detected object, as shown in the bottom part of Fig. 2, we need to constantly monitor the object tracking condition, which could be newly-appeared, tracked, or disappeared. The tracking status helps to control the dLSTM refiner correctly. In addition to the bounding boxes, we also need to use the occurrence and conditional probabilities, as shown in the bottom part of Fig. 2, for effective object tracking. We assume that we have already initiated P dLSTM refiners to track P priority objects. For each detected object at instant T, we have collected the following tracking information from the iYOLO:
1. Priority Index: p,
2. Input Data Array: \(\{ X_{1} ,X_{2} ,X_{3} \ldots ,X_{T} \}\),
3. Output Predicted Data: \(\tilde{X}_{T}\),
4. T sets of top five confidences given by the iYOLO: {\(p(i*)\), i*}, for i* = 1, 2, …, 5,
where i* carries the index of the top i class and \(p(i*)\) = \(C_{i*}^{b} P^{b}\) records the confidences of the top five classes. As shown in Fig. 9, there are four possible conditions of confidence plots in practical applications, where Thp denotes the detection threshold of confidence. With all outputs of the detected object collected from the iYOLO, the status of the dLSTM refiner can be determined as follows:
At instant T + 1, for a detected object whose confidence estimated by the iYOLO is higher than the threshold, we first need to distinguish whether the object is a newly-appearing (red solid line) or stably-tracked (red dashed line) object, as shown in Fig. 9. With the same priority index, we first check the IoU of the bounding boxes of the object obtained at T and T + 1. If the IoU of the two consecutive bounding boxes is greater than a threshold and the object has the same class, we determine this object to be a stably-tracked one. In this case, we update the collected information by left-shifting the data array as
$$X_{t } = X_{{t + {1}}} ,{\text{ for}}\quad t = {1},{2}, \ldots ,T,$$
(19)
and the top five confidences.
If the IoU of the current and previous bounding boxes is lower than the threshold or the class index is different, we treat the object as a newly-appeared object. In this case, we need to initialize a new dLSTM refiner and a new data array filled with the same \(X_{T + 1}\) as
$$X_{t} = X_{T + 1} , \, \quad {\text{for}}\quad t = 1,2, \ldots ,T,$$
(20)
We store the new set of top five confidences with the same {\(p(i*)\), i*}, for i* = 1, 2, …, 5. The newly-appeared object becomes a tracked object in the next iterations. For both tracked and newly-appeared cases, we trust the detection ability of the iYOLO: once the iYOLO reports a confidence greater than the threshold, the object must be an active object. Note that the dLSTM refiner does not change the detection performance for the stably-tracked (red dashed line) and newly-appearing (red solid line) objects. For these two cases, the miss-detected (MD) counter, shown in the bottom part of Fig. 2, is reset to zero and the dLSTM refiners are used only to refine the tracked bounding boxes.
For a dLSTM refiner, we hope not only to improve the accuracy of the bounding boxes but also to raise the detection performance for the gradually-disappearing (blue dashed line) and unstably-tracked (blue solid line) conditions shown in Fig. 9, in which the dLSTM refiner receives miss-detected information from the iYOLO. Since objects do not disappear suddenly, we do not turn off the dLSTM refiner at once when a miss-detection happens. To improve the detection performance, we further design a miss-detected (MD) counter, which is reset to zero whenever the iYOLO actively detects the object, i.e., with sufficient confidence and a bounding box. Each time the detected object does not have large enough confidence, we increase the MD counter of the tracked object by one, until MD is larger than the miss-detection threshold, \(N_{{{\text{mis}}}}\). When MD ≤ \(N_{{{\text{mis}}}}\), the tracking system still treats the object as a tracked object but in the “unstably-detected” condition. For any unstably-detected object, the system gives the output of the dLSTM refiner as the compensation, i.e., \(\hat{X}_{T + 1} = \tilde{X}_{T}\). Once MD > \(N_{{{\text{mis}}}}\), the tracking system deletes the object and stops its dLSTM refiner. In this way, we can improve the detection performance for the unstably-detected and gradually-disappearing conditions shown in Fig. 9.
If the detected object has a larger bounding box, it should not disappear in a short time. On the contrary, if the object has a smaller bounding box, it could be closer to disappearing. As shown in Fig. 2, we suggest the adaptive missed-detection counting (AMC) threshold as:
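The two update rules in (19) and (20) can be sketched as one function that handles a detection whose confidence has already passed the threshold. The function name, the status strings, and the IoU threshold of 0.5 used in the example are hypothetical; the text does not give the threshold value.

```python
def update_history(history, new_box, iou_value, same_class, iou_threshold):
    """Update the T-element data array with the new box X_{T+1}.

    Returns (updated history, status), where status is 'tracked' or 'new'.
    """
    if iou_value > iou_threshold and same_class:
        # Stably tracked, Eq. (19): drop the oldest box and append X_{T+1}
        return history[1:] + [list(new_box)], "tracked"
    # Newly appeared, Eq. (20): fill all T slots with X_{T+1}
    return [list(new_box) for _ in history], "new"


history = [[t, t, t + 1, t + 1] for t in range(3)]  # toy array with T = 3
new_box = [3, 3, 4, 4]

tracked_hist, tracked_status = update_history(history, new_box, 0.8, True, 0.5)
fresh_hist, fresh_status = update_history(history, new_box, 0.1, True, 0.5)
```

In the newly-appeared case a new dLSTM refiner would also be initialized, which is outside the scope of this sketch.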
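The MD counter logic can be sketched as a single tracking step. The function name and return convention are hypothetical, and the miss-detection threshold of 2 in the example is only one of the values discussed in the text.

```python
def step_track(md_count, detected, yolo_box, refined_box, n_mis):
    """One tracking step; returns (new MD count, box used as X_hat_{T+1}, alive)."""
    if detected:
        # The iYOLO detects the object with sufficient confidence: reset MD
        return 0, yolo_box, True
    md_count += 1
    if md_count > n_mis:
        # MD > N_mis: delete the object and stop its dLSTM refiner
        return md_count, None, False
    # MD <= N_mis: unstably detected, compensate with the dLSTM output
    return md_count, refined_box, True


# Simulate one detection followed by three consecutive misses with N_mis = 2
md, trace = 0, []
for detected in [True, False, False, False]:
    md, box, alive = step_track(md, detected, "yolo", "dlstm", n_mis=2)
    trace.append((md, box, alive))
```

The object survives the first two misses by using the refiner output and is removed at the third miss.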
$$N_{{{\text{mis}}}} = \left\{ {\begin{array}{*{20}l} {10,} \hfill & {{\text{for}}\quad A_{T} \ge A_{\max } ,} \hfill \\ {5,} \hfill & {{\text{for}}\quad A_{\max } > A_{T} \ge A_{\min } ,} \hfill \\ {2,} \hfill & {{\text{for}}\quad A_{\min } \, > A_{T} .} \hfill \\ \end{array} } \right.$$
(20)
where \(A_{T} = w_{T} h_{T}\) denotes the area of the latest bounding box of the object detected by the iYOLO before time instant T. To terminate the dLSTM refiner, the AMC threshold is set to \(N_{{{\text{mis}}}}\) = 10, 5, and 2 for large, middle, and small detected bounding boxes, respectively. These values are chosen empirically, given that the dLSTM refiner takes 10 sets of bounding boxes from the iYOLO, i.e., T = 10. With the AMC threshold, we can raise the detection rate while limiting the false positive rate, and we can recover miss-detected objects, which could be frequently disturbed by various environmental changes. Since the dLSTM module needs the data vectors of consecutive bounding boxes of the detected object, for a miss-detected object, we replace the output of the iYOLO with \(\hat{X}_{T + 1} = \tilde{X}_{T}\).
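The AMC threshold can be sketched as a simple piecewise function of the box area. The values of \(A_{\min }\) and \(A_{\max }\) are not given in the text, so the areas used in the example below are purely hypothetical.

```python
def amc_threshold(width, height, area_min, area_max):
    """Adaptive missed-detection counting threshold N_mis from the box area."""
    area = width * height  # A_T = w_T * h_T
    if area >= area_max:
        return 10  # large box: tolerate more consecutive misses
    if area >= area_min:
        return 5   # middle-sized box
    return 2       # small box: likely close to disappearing


# Hypothetical area limits in squared pixels
A_MIN, A_MAX = 32 * 32, 96 * 96

n_large = amc_threshold(120, 100, A_MIN, A_MAX)
n_middle = amc_threshold(50, 40, A_MIN, A_MAX)
n_small = amc_threshold(20, 15, A_MIN, A_MAX)
```

The returned value would serve as the miss-detection threshold against which the MD counter of the corresponding object is compared.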