In this section, we describe our tracker. In Section 3.1, we present the main tracking framework of our algorithm, which is shown in Fig. 1. In Section 3.2, we introduce the tracker based on LADCF correlation filtering. In Section 3.3, we introduce the composite confidence evaluation criteria and the SVM-based redetector.
3.1 The main framework of the algorithm
The proposed algorithm combines a DCF tracker with a redetector for long-term tracking. First, the baseline correlation filter tracker estimates the translation in the tracking stage. Second, the maximum response value and the APCE criterion are utilized to judge the confidence level of the tracking result. Finally, when the confidence is higher than the threshold, the baseline tracker tracks the target alone. When the confidence level drops sharply, it indicates tracking failure; in that case, we do not update the model and instead exploit the SVM model to redetect the target object in the current frame. The structure of the algorithm in this paper is shown in Fig. 1.
The tracking framework is summarized as follows:

(1)
Position and scale detection: We utilize DSST to achieve target position and scale prediction. Let I_{t} denote the t-th frame and θ_{model} the filter model. When a new frame I_{t} arrives, we extract multi-scale search windows \( \left[{I}_t^{\mathrm{patch}}\left\{s\right\}\right] \), s = 1, 2, …, S, with S denoting the number of scales. For each scale s, the search window patch is centered at the previous target center position p_{t − 1} with a size of a^{N}n × a^{N}n pixels, where a is the scale factor and \( N=\left\lfloor \frac{2s-S-1}{2}\right\rfloor \). The basic search window size is n × n, determined by the target size ω × h and the padding parameter ϱ as \( n=\left(1+\upvarrho \right)\sqrt{\omega \times h} \). Bilinear interpolation is then applied to resize each patch to n × n. Next, we extract multi-channel features for each scale search window as \( \chi \left(s\right)\in {\mathbb{R}}^{D^2\times L} \). Given the filter template, the response score can be calculated efficiently in the frequency domain as [16]:
$$ \hat{f}(s)=\hat{x}(s)\odot {\hat{\theta}}_{model}^{\ast } $$
(1)
After applying the IDFT at each scale, the location of the maximum value of \( f\in {\mathbb{R}}^{D^2\times S} \) gives the estimated target position and scale.
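For illustration, the frequency-domain scoring of Eq. (1) over S scales can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the array shapes and the loop over pre-extracted feature patches are assumptions.

```python
import numpy as np

def detect(patches, theta_model):
    """Score S resized search-window patches against the filter template.

    patches:     (S, D, D, L) multi-channel features, one array per scale
    theta_model: (D, D, L) learned filter template
    Returns (best_score, best_scale, (row, col)) of the maximum response.
    """
    theta_hat = np.fft.fft2(theta_model, axes=(0, 1))
    best = (-np.inf, -1, (0, 0))
    for s in range(patches.shape[0]):
        x_hat = np.fft.fft2(patches[s], axes=(0, 1))
        # Eq. (1): element-wise product with the conjugated template in the
        # frequency domain, summed over channels, then the inverse DFT
        resp = np.real(np.fft.ifft2((x_hat * np.conj(theta_hat)).sum(axis=2)))
        idx = np.unravel_index(np.argmax(resp), resp.shape)
        if resp[idx] > best[0]:
            best = (resp[idx], s, idx)
    return best
```

Because Eq. (1) is a circular cross-correlation, a patch identical to the template produces its peak at zero displacement.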

(2)
Updating: We adopt the same updating strategy as the traditional DCF method:
$$ {\theta}_{model}=\left(1-\upalpha \right){\theta}_{model}+\upalpha \theta $$
(2)
where α is the updating rate. More specifically, since θ_{model} is not available in the learning stage of the first frame, we use a predefined mask in which only the target region is activated to optimize θ, as in BACF. We then initialize θ_{model} = θ after the learning stage of the first frame.
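The update rule of Eq. (2), including the first-frame initialization described above, can be sketched as follows (the default learning-rate value here is an assumption, not the paper's reported setting):

```python
import numpy as np

def update_model(theta_model, theta, alpha=0.05):
    """Eq. (2): running linear interpolation of the filter model.

    On the first frame theta_model does not exist yet, so the model is
    initialised directly to the newly learned filter theta.
    """
    if theta_model is None:
        return np.copy(theta)
    return (1.0 - alpha) * theta_model + alpha * theta
```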
3.2 Correlation filter tracker
In this paper, we set LADCF [16] as the baseline algorithm of our tracking approach.
The LADCF algorithm is a DCF-based tracking method that utilizes adaptive spatial feature selection and a temporal consistency constraint to reduce the impact of spatial boundary effects and temporal filter degradation. The feature selection process selects specific elements of the filter to retain discriminative and descriptive information, forming a low-dimensional, compact feature representation. Considering an n × n image patch \( x\in {\mathbb{R}}^{n^2} \) as a base sample for the DCF design, the circulant matrix for this sample is generated by collecting its full cyclic shifts, \( {X}^T={\left[{x}_1,{x}_2,\dots, {x}_{n^2}\right]}^T\in {\mathbb{R}}^{n^2\times {n}^2} \), with the corresponding Gaussian-shaped regression labels \( y=\left[{y}_1,{y}_2,\dots, {y}_{n^2}\right] \). The spatial feature selection embedded in the learning stage can be expressed as:
$$ {\displaystyle \begin{array}{c}\underset{\theta, \phi }{\mathrm{argmin}}{\left\Vert \theta \circledast x-y\right\Vert}_2^2+{\lambda}_1{\left\Vert \phi \right\Vert}_0\\ {}s.t.\kern0.5em \theta ={\theta}_{\phi }=\operatorname{diag}\left(\phi \right)\theta, \end{array}} $$
(3)
where θ denotes the target model in the form of a DCF, and ⊛ denotes the circular convolution operator. The indicator vector ϕ can be expressed in terms of θ, with ‖ϕ‖_{0} = ‖θ‖_{0}, and diag(ϕ) is the diagonal matrix generated from the indicator vector of selected features ϕ. Since the ℓ_{0}-norm is non-convex and the ℓ_{1}-norm is widely used to approximate sparsity [24], a temporally consistent model is constructed from the ℓ_{1}-norm relaxation of the spatial feature selection model [16]:
$$ \underset{\theta }{\mathrm{argmin}}{\left\Vert \theta \circledast x-y\right\Vert}_2^2+{\lambda}_1{\left\Vert \theta \right\Vert}_1+{\lambda}_2{\left\Vert \theta -{\theta}_{model}\right\Vert}_1 $$
(4)
where λ_{1} and λ_{2} are tuning parameters with λ_{1} ≪ λ_{2}, and θ_{model} denotes the model parameters estimated from the previous frame.
An ℓ_{2}-norm relaxation of the temporal term is adopted to further simplify the expression:
$$ \underset{\theta }{\mathrm{argmin}}{\left\Vert \theta \circledast x-y\right\Vert}_2^2+{\lambda}_1{\left\Vert \theta \right\Vert}_1+{\lambda}_2{\left\Vert \theta -{\theta}_{model}\right\Vert}_2^2 $$
(5)
where the lasso regularization controlled by λ_{1} selects the spatial features. In the above formula, the filter template model increases smoothness between consecutive frames to promote temporal consistency. In this way, the temporal consistency of spatial feature selection is preserved, extracting and retaining the diversity of the static and dynamic appearance.
Since the multi-channel features share the same spatial layout [16], the multi-channel input is represented as Χ = {x_{1}, x_{2}, …, x_{L}}, and the corresponding filter is represented as θ = {θ_{1}, θ_{2}, …, θ_{L}}. The objective can then be extended to the multi-channel case with structured sparsity [16]:
$$ \underset{\theta }{\mathrm{argmin}}{\sum}_{i=1}^L{\left\Vert {\theta}_i\circledast {x}_i-y\right\Vert}_2^2+{\lambda}_1{\left\Vert \sqrt{\sum_{i=1}^L{\theta}_i\odot {\theta}_i}\right\Vert}_1+{\lambda}_2{\sum}_{i=1}^L{\left\Vert {\theta}_i-{\theta}_{model\ i}\right\Vert}_2^2 $$
(6)
where \( {\theta}_i^j \) is the j-th element of the i-th channel filter \( {\theta}_i\in {\mathbb{R}}^{D^2} \), and ⊙ denotes the element-wise multiplication operator. The structured spatial feature selection term calculates the ℓ_{2}-norm across channels at each spatial location and then applies the ℓ_{1}-norm over locations to achieve joint sparsity.
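To make the structured sparsity term concrete, the sketch below evaluates the middle regularizer of Eq. (6) for a filter stored with one row per spatial location (this storage layout is an assumption for illustration):

```python
import numpy as np

def group_sparsity(theta):
    """l2,1-norm used as the middle term of Eq. (6): the l2-norm across
    the L channels at each of the D^2 spatial locations, then the l1
    (plain sum) over locations. theta has shape (D*D, L)."""
    return float(np.sqrt((theta ** 2).sum(axis=1)).sum())
```

A location whose channel vector is all zero contributes nothing, which is exactly what makes the term promote joint (location-wise) sparsity across channels.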
Subsequently, ADMM [27] is utilized to optimize the above formula: we introduce a relaxation variable θ′ to construct an objective amenable to convex optimization [31]. The global optimal solution of the model can then be obtained through ADMM via the augmented Lagrangian [16]:
$$ \mathcal{L}=\sum \limits_{i=1}^L{\left\Vert {\theta}_i\circledast {x}_i-y\right\Vert}_2^2+{\lambda}_1\sum \limits_{j=1}^{D^2}\sqrt{\sum \limits_{i=1}^L{\left({\theta}_i^j\right)}^2}+{\lambda}_2\sum \limits_{i=1}^L{\left\Vert {\theta}_i-{\theta}_{model\ i}\right\Vert}_2^2+\frac{\mu }{2}{\sum}_{i=1}^L{\left\Vert {\theta}_i-{\theta}_i^{\prime }+\frac{\eta_i}{\mu}\right\Vert}_2^2 $$
(7)
where \( \mathcal{H}=\left\{{\eta}_1,{\eta}_2,\dots, {\eta}_L\right\} \) are the Lagrange multipliers, and μ > 0 is the corresponding penalty parameter controlling the convergence rate [16, 32]. As \( \mathcal{L} \) is convex, ADMM is exploited iteratively to optimize the following subproblems with guaranteed convergence:
$$ \left\{\begin{array}{c}\theta =\arg \underset{\theta }{\min }\ \mathcal{L}\left(\theta, {\theta}^{\prime },\mathcal{H},\mu \right)\\ {}{\theta}^{\prime }=\arg \underset{\theta^{\prime }}{\min }\ \mathcal{L}\left(\theta, {\theta}^{\prime },\mathcal{H},\mu \right)\\ {}\mathcal{H}=\mathcal{H}+\mu \left(\theta -{\theta}^{\prime}\right)\end{array}\right. $$
(8)
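In ADMM solvers for objectives of this form, the θ′ subproblem typically reduces to a closed-form group soft-thresholding step, the proximal operator of the ℓ2,1 regularizer. The sketch below illustrates that standard operator; it is not the authors' exact derivation, and the row-per-location layout is assumed:

```python
import numpy as np

def group_soft_threshold(v, tau):
    """Proximal operator of tau * (l2,1-norm): shrink the channel vector
    at each spatial location (each row of v) toward zero by tau, zeroing
    any row whose l2-norm is below tau."""
    norms = np.sqrt((v ** 2).sum(axis=1, keepdims=True))
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return v * scale
```

Rows that survive the threshold correspond to the spatial locations the filter retains, which is how the solver realizes the adaptive spatial feature selection.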
3.3 Redetector
3.3.1 Confidence criterion
Most existing trackers do not consider whether the detection is accurate. In fact, once the target is detected incorrectly in the current frame, severely occluded, or completely missing, tracking may fail in subsequent frames.
We introduce a measure to determine the confidence degree of the tracking result, which is the first step of the redetection model. The peak value and the fluctuation of the response map reveal the confidence of the tracking result: the ideal response map has a single sharp peak while all other regions are smooth; otherwise, the response map fluctuates intensely. If uncertain samples are used to update the model in subsequent frames, the tracking model will be corrupted. Thus, we fuse two confidence evaluation criteria. The first is the maximum response value F_{max} of the current frame.
The second is the APCE measure, which is defined as:
$$ APCE=\frac{{\left|{F}_{max}-{F}_{min}\right|}^2}{mean\left({\sum}_{w,h}{\left({F}_{w,h}-{F}_{min}\right)}^2\right)} $$
(9)
where F_{max} and F_{min} are the maximum and minimum responses of the current frame, respectively, and F_{w, h} is the element in the w-th row and h-th column of the response matrix.
If the target is moving slowly and is easily distinguishable, the APCE value is generally high. However, if the target is undergoing fast motion with significant deformations, the value of APCE will be low even if the tracking is correct.
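Eq. (9) is straightforward to implement. The sketch below also includes a fusion rule comparing both criteria against their historical averages; that rule and its threshold values are assumptions about common practice, not the paper's exact thresholds:

```python
import numpy as np

def apce(response):
    """Eq. (9): average peak-to-correlation energy of a response map."""
    f_max, f_min = response.max(), response.min()
    return (f_max - f_min) ** 2 / np.mean((response - f_min) ** 2)

def is_confident(f_max, apce_val, hist_fmax, hist_apce, r1=0.6, r2=0.45):
    """Assumed fusion rule: trust the result only when both criteria
    exceed a fraction (r1, r2; values assumed) of their historical means."""
    return bool(f_max > r1 * np.mean(hist_fmax)
                and apce_val > r2 * np.mean(hist_apce))
```

A single sharp peak on a flat map yields a high APCE, while a multi-modal or noisy map drives it down, matching the behavior described above.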
3.3.2 Target redetection
In this section, we describe the redetection mechanism used in the case of tracking failure. In the redetection module, when the confidence level is lower than the threshold, an SVM [33] is used for redetection. Consider a sample set (x_{1}, y_{1}), (x_{2}, y_{2}), …, (x_{i}, y_{i}), …, with x_{i} ∈ R^{d} including positive and negative samples, where d is the dimension of a sample and y_{i} ∈ {+1, −1} is the sample label. The SVM separates the positive and negative samples to obtain the best classification hyperplane, defined as [33]:
$$ {\omega}^{\mathrm{T}}x+b=0 $$
(10)
where ω represents the weight vector and b denotes the bias term. In the linearly separable case, for a given dataset T and classification hyperplane, the following formula is used for classification:
$$ \left\{\begin{array}{c}{\omega}^{\mathrm{T}}x+b\le -1,\ {y}_i=-1\\ {}{\omega}^{\mathrm{T}}x+b\ge +1,\ {y}_i=+1\end{array}\right. $$
(11)
Combining the two inequalities, we can write:
$$ y\left({\omega}^Tx+b\right)\ge 1 $$
(12)
The distance from each support vector to the hyperplane can be written as:
$$ d=\frac{\left|{\omega}^Tx+b\right|}{\left\Vert \omega \right\Vert } $$
(13)
The problem of solving the maximum partition hyperplane of the SVM model can be expressed as the following constrained optimization problem:
$$ {\displaystyle \begin{array}{c}\ \mathit{\min}\frac{1}{2}{\left\Vert \omega \right\Vert}^2\\ {}s.t.{y}_i\left({\omega}^T{x}_i+b\right)\ge 1\end{array}} $$
(14)
Next, the Lagrangian function is introduced to solve the above problem [33]:
$$ L\left(\omega, b,c\right)=\frac{1}{2}{\left\Vert \omega \right\Vert}^2-{\sum}_{i=1}^l{c}_i{y}_i\left(\omega \cdot {x}_i+b\right)+{\sum}_{i=1}^l{c}_i $$
(15)
where c_{i} > 0 are the Lagrange multipliers. The solution of the optimization problem satisfies that the partial derivatives of L(ω, b, c) with respect to ω and b are zero. The corresponding decision function is expressed as:
$$ f(x)=\operatorname{sign}\left({\omega}^{\ast}\cdot x+{b}^{\ast}\right)=\operatorname{sign}\left\{{\sum}_{j=1}^{l}{c}_j^{\ast }{y}_j\left({x}_j\cdot x\right)+{b}^{\ast}\right\} $$
(16)
New sample points are then fed into the decision function to obtain their classification.
In the linearly inseparable case, we use a kernel function to map the samples to a high-dimensional space. In this work, we use the Gaussian kernel function:
$$ k\left({x}_i,{x}_j\right)=\exp \left(-\frac{{\left\Vert {x}_i-{x}_j\right\Vert}^2}{2{\sigma}^2}\right) $$
(17)
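Eqs. (16) and (17) can be illustrated together as follows. The support vectors, coefficients, bias, and σ in the sketch are hypothetical values chosen for demonstration, not learned ones:

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=0.5):
    """Eq. (17): Gaussian (RBF) kernel."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def decision(x, support_x, support_y, c, b, sigma=0.5):
    """Eq. (16) with the kernel trick: the sign of the weighted kernel
    sum over the support vectors plus the bias."""
    s = sum(cj * yj * gaussian_kernel(xj, x, sigma)
            for cj, yj, xj in zip(c, support_y, support_x))
    return np.sign(s + b)
```

With the Gaussian kernel, the decision value is dominated by the support vectors nearest the query point, so samples close to a positive support vector are classified as positive.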
During redetection, an exhaustive search is performed on the current frame using a sliding window, and HOG features are extracted from each image patch as the feature vector x in the above formulas. Then f(x) is calculated by formula (16), and we select the sample area with the largest decision value. When its response value is greater than the threshold, it is used as the new location of the tracking target.
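The redetection loop described above can be sketched as a generic sliding-window search. Here `score_fn` is a placeholder for the SVM decision value on the HOG features of a patch, and the threshold handling follows the description above; the function names are illustrative only:

```python
import numpy as np

def redetect(candidates, positions, score_fn, threshold):
    """Sliding-window re-detection sketch: score every candidate patch
    (score_fn stands in for the SVM decision value on HOG features) and
    accept the best candidate only if it clears the threshold."""
    scores = [score_fn(p) for p in candidates]
    best = int(np.argmax(scores))
    return positions[best] if scores[best] > threshold else None
```

Returning `None` when no candidate clears the threshold leaves the tracker free to keep searching in later frames instead of locking onto a wrong location.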
The training process of the SVM is as follows [33]. The confidence level determines the quality of a sample: samples with high confidence are used as positive samples, and samples with low confidence are used as negative samples. HOG features are extracted from the positive and negative samples to obtain feature vectors, represented as (x_{i}, y_{i}), i = 1, 2, …, n, where n denotes the number of training samples, x_{i} represents the HOG feature vector, and y_{i} represents the label of the sample: y_{i} = 1 if the training sample is positive and y_{i} = −1 if it is negative. For the binary classification of our samples, the loss function is defined in formula (18).
$$ Loss\left(x,y,\omega \right)=\max \left(0,1-y\left(x\cdot \omega \right)\right) $$
(18)
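A minimal sketch of the hinge loss in Eq. (18):

```python
import numpy as np

def hinge_loss(x, y, w):
    """Eq. (18): zero when sample (x, y) is classified with a margin of
    at least 1, positive otherwise; y is +1 or -1."""
    return max(0.0, 1.0 - y * float(np.dot(x, w)))
```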
When the loss is greater than zero, the parameters of the SVM are updated as follows.
$$ {\omega}^{\ast }={\sum}_{\mathrm{j}=1}^{\mathrm{l}}{c}_j^{\ast }{y}_j{x}_j $$
(19)
$$ {b}^{\ast }={y}_i-{\sum}_{j=1}^{l}{y}_j{c}_j^{\ast}\left({x}_j\cdot {x}_i\right) $$
(20)
where c_{j} is the Lagrangian coefficient, x_{j} is the feature vector extracted from the sample, and y_{j} is the corresponding label.