In this section, we describe our tracker. In Section 3.1, we present the main tracking framework of our algorithm, which is shown in Fig. 1. In Section 3.2, we introduce the tracker based on LADCF correlation filtering. In Section 3.3, we introduce the composite confidence evaluation criteria and the SVM-based redetector.
3.1 The main framework of the algorithm
The proposed algorithm combines a DCF tracker with a redetector for long-term tracking. First, the baseline correlation filter tracker estimates the translation in the tracking stage. Second, the maximum response value and the APCE criterion are utilized to judge the confidence level of the tracking result. Finally, when the confidence is higher than the threshold, the baseline tracker tracks the target alone. When the confidence level drops sharply, it indicates tracking failure; in that case, we do not update the model and instead exploit the SVM model to redetect the target object in the current frame. The structure of the algorithm in this paper is shown in Fig. 1.
The tracking framework is summarized as follows:

(1)
Position and scale detection: We utilize DSST to achieve target position and scale prediction. Let I_{t} denote the t-th frame and θ_{model} the filter model. When a new frame I_{t} arrives, we extract multi-scale search windows \( \left[{I}_t^{\mathrm{patch}}\left\{s\right\}\right] \), s = 1, 2, …, S, with S denoting the number of scales. For each scale s, the search window patch is centered at the previous target center position p_{t − 1} with a size of a^{N}n × a^{N}n pixels, where a is the scale factor and \( N=\left\lfloor \frac{2s-S-1}{2}\right\rfloor \). The basic search window size is n × n, determined by the target size ω × h and the padding parameter ϱ as \( n=\left(1+\upvarrho \right)\sqrt{\omega \times h} \). Bilinear interpolation is then applied to resize each patch to n × n. Next, we extract multi-channel features for each scale search window as \( \chi \left(s\right)\in {\mathbb{R}}^{D^2\times L} \). Given the filter template, the response score can be calculated efficiently in the frequency domain as [16]:
$$ \hat{f}(s)=\hat{x}(s)\odot {\hat{\theta}}_{model}^{\ast } $$
(1)
After applying the IDFT at each scale, the location of the maximum value of \( f\in {\mathbb{R}}^{D^2\times S} \) gives the estimated target position and scale.
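For illustration, the frequency-domain scoring of Eq. (1) over S scales can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the array shapes and the loop over pre-extracted feature patches are assumptions.

```python
import numpy as np

def detect(patches, theta_model):
    """Score S resized search-window patches against the filter template.

    patches:     (S, D, D, L) multi-channel features, one array per scale
    theta_model: (D, D, L) learned filter template
    Returns (best_score, best_scale, (row, col)) of the maximum response.
    """
    theta_hat = np.fft.fft2(theta_model, axes=(0, 1))
    best = (-np.inf, -1, (0, 0))
    for s in range(patches.shape[0]):
        x_hat = np.fft.fft2(patches[s], axes=(0, 1))
        # Eq. (1): element-wise product with the conjugated template in the
        # frequency domain, summed over channels, then the inverse DFT
        resp = np.real(np.fft.ifft2((x_hat * np.conj(theta_hat)).sum(axis=2)))
        idx = np.unravel_index(np.argmax(resp), resp.shape)
        if resp[idx] > best[0]:
            best = (resp[idx], s, idx)
    return best
```

Because Eq. (1) is a circular cross-correlation, a patch identical to the template produces its peak at zero displacement.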

(2)
Updating: We adopt the same updating strategy as the traditional DCF method:
$$ {\theta}_{model}=\left(1-\upalpha \right){\theta}_{model}+\upalpha \theta $$
(2)
where α is the updating rate. More specifically, since θ_{model} is not available in the learning stage of the first frame, we use a predefined mask in which only the target region is activated to optimize θ, as in BACF. We then initialize θ_{model} = θ after the learning stage of the first frame.
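The update rule of Eq. (2), including the first-frame initialization described above, can be sketched as follows (the default learning-rate value here is an assumption, not the paper's reported setting):

```python
import numpy as np

def update_model(theta_model, theta, alpha=0.05):
    """Eq. (2): running linear interpolation of the filter model.

    On the first frame theta_model does not exist yet, so the model is
    initialised directly to the newly learned filter theta.
    """
    if theta_model is None:
        return np.copy(theta)
    return (1.0 - alpha) * theta_model + alpha * theta
```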
3.2 Correlation filter tracker
In this paper, we set LADCF [16] as the baseline algorithm of our tracking approach.
The LADCF algorithm is a DCF-based tracking method that utilizes adaptive spatial feature selection and a temporal consistency constraint to reduce the impact of spatial boundary effects and temporal filter degradation. The feature selection process selects specific elements of the filter to retain discriminative and descriptive information, forming a low-dimensional, compact feature representation. Considering an n × n image patch \( x\in {\mathbb{R}}^{n^2} \) as a base sample for the DCF design, the circulant matrix for this sample is generated by collecting its full cyclic shifts, \( {X}^T={\left[{x}_1,{x}_2,\dots, {x}_{n^2}\right]}^T\in {\mathbb{R}}^{n^2\times {n}^2} \), with the corresponding Gaussian-shaped regression labels \( y=\left[{y}_1,{y}_2,\dots, {y}_{n^2}\right] \). The spatial feature selection embedded in the learning stage can be expressed as:
$$ {\displaystyle \begin{array}{c}\underset{\theta, \phi }{\mathrm{argmin}}{\left\Vert \theta \circledast x-y\right\Vert}_2^2+{\lambda}_1{\left\Vert \phi \right\Vert}_0\\ {}s.t.\kern0.5em \theta ={\theta}_{\phi }=\operatorname{diag}\left(\phi \right)\theta, \end{array}} $$
(3)
where θ denotes the target model in the form of a DCF, and ⊛ denotes the circular convolution operator. The indicator vector ϕ can be expressed in terms of θ, with ‖ϕ‖_{0} = ‖θ‖_{0}, and diag(ϕ) is the diagonal matrix generated from the indicator vector of selected features ϕ. Since the ℓ_{0}-norm is non-convex and the ℓ_{1}-norm is widely used to approximate sparsity [24], a temporally consistent model is constructed from the ℓ_{1}-norm relaxation of the spatial feature selection model [16]:
$$ \underset{\theta }{\mathrm{argmin}}{\left\Vert \theta \circledast x-y\right\Vert}_2^2+{\lambda}_1{\left\Vert \theta \right\Vert}_1+{\lambda}_2{\left\Vert \theta -{\theta}_{model}\right\Vert}_1 $$
(4)
where λ_{1} and λ_{2} are tuning parameters with λ_{1} ≪ λ_{2}, and θ_{model} denotes the model parameters estimated from the previous frame.
An ℓ_{2}-norm relaxation of the temporal term is adopted to further simplify the expression:
$$ \underset{\theta }{\mathrm{argmin}}{\left\Vert \theta \circledast x-y\right\Vert}_2^2+{\lambda}_1{\left\Vert \theta \right\Vert}_1+{\lambda}_2{\left\Vert \theta -{\theta}_{model}\right\Vert}_2^2 $$
(5)
where the lasso regularization controlled by λ_{1} selects the spatial features. In the above formula, the filter template model increases smoothness between consecutive frames to promote temporal consistency. In this way, the temporal consistency of spatial feature selection is preserved, extracting and retaining the diversity of the static and dynamic appearance.
Since the multi-channel features share the same spatial layout [16], the multi-channel input is represented as Χ = {x_{1}, x_{2}, …, x_{L}}, and the corresponding filter is represented as θ = {θ_{1}, θ_{2}, …, θ_{L}}. The objective can then be extended to the multi-channel case with structured sparsity [16]:
$$ \underset{\theta }{\mathrm{argmin}}{\sum}_{i=1}^L{\left\Vert {\theta}_i\circledast {x}_i-y\right\Vert}_2^2+{\lambda}_1{\left\Vert \sqrt{\sum_{i=1}^L{\theta}_i\odot {\theta}_i}\right\Vert}_1+{\lambda}_2{\sum}_{i=1}^L{\left\Vert {\theta}_i-{\theta}_{model\ i}\right\Vert}_2^2 $$
(6)
where \( {\theta}_i^j \) is the j-th element of the i-th channel filter \( {\theta}_i\in {\mathbb{R}}^{D^2} \), and ⊙ denotes the element-wise multiplication operator. The structured spatial feature selection term calculates the ℓ_{2}-norm across channels at each spatial location and then applies the ℓ_{1}-norm over locations to achieve joint sparsity.
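To make the structured sparsity term concrete, the sketch below evaluates the middle regularizer of Eq. (6) for a filter stored with one row per spatial location (this storage layout is an assumption for illustration):

```python
import numpy as np

def group_sparsity(theta):
    """l2,1-norm used as the middle term of Eq. (6): the l2-norm across
    the L channels at each of the D^2 spatial locations, then the l1
    (plain sum) over locations. theta has shape (D*D, L)."""
    return float(np.sqrt((theta ** 2).sum(axis=1)).sum())
```

A location whose channel vector is all zero contributes nothing, which is exactly what makes the term promote joint (location-wise) sparsity across channels.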
Subsequently, ADMM [27] is utilized to optimize the above formula: we introduce a relaxation variable θ′ to construct an objective amenable to convex optimization [31]. The global optimal solution of the model can then be obtained through ADMM via the augmented Lagrangian [16]:
$$ \mathcal{L}=\sum \limits_{i=1}^L{\left\Vert {\theta}_i\circledast {x}_i-y\right\Vert}_2^2+{\lambda}_1\sum \limits_{j=1}^{D^2}\sqrt{\sum \limits_{i=1}^L{\left({\theta}_i^j\right)}^2}+{\lambda}_2\sum \limits_{i=1}^L{\left\Vert {\theta}_i-{\theta}_{model\ i}\right\Vert}_2^2+\frac{\mu }{2}{\sum}_{i=1}^L{\left\Vert {\theta}_i-{\theta}_i^{\prime }+\frac{\eta_i}{\mu}\right\Vert}_2^2 $$
(7)
where \( \mathcal{H}=\left\{{\eta}_1,{\eta}_2,\dots, {\eta}_L\right\} \) are the Lagrange multipliers, and μ > 0 is the corresponding penalty parameter controlling the convergence rate [16, 32]. As \( \mathcal{L} \) is convex, ADMM is exploited iteratively to optimize the following subproblems with guaranteed convergence:
$$ \left\{\begin{array}{c}\theta =\arg \underset{\theta }{\min }\ \mathcal{L}\left(\theta, {\theta}^{\prime },\mathcal{H},\mu \right)\\ {}{\theta}^{\prime }=\arg \underset{\theta^{\prime }}{\min }\ \mathcal{L}\left(\theta, {\theta}^{\prime },\mathcal{H},\mu \right)\\ {}\mathcal{H}=\mathcal{H}+\mu \left(\theta -{\theta}^{\prime}\right)\end{array}\right. $$
(8)
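In ADMM solvers for objectives of this form, the θ′ subproblem typically reduces to a closed-form group soft-thresholding step, the proximal operator of the ℓ2,1 regularizer. The sketch below illustrates that standard operator; it is not the authors' exact derivation, and the row-per-location layout is assumed:

```python
import numpy as np

def group_soft_threshold(v, tau):
    """Proximal operator of tau * (l2,1-norm): shrink the channel vector
    at each spatial location (each row of v) toward zero by tau, zeroing
    any row whose l2-norm is below tau."""
    norms = np.sqrt((v ** 2).sum(axis=1, keepdims=True))
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return v * scale
```

Rows that survive the threshold correspond to the spatial locations the filter retains, which is how the solver realizes the adaptive spatial feature selection.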
3.3 Redetector
3.3.1 Confidence criterion
Most existing trackers do not consider whether the detection is accurate. In fact, once the target is detected incorrectly in the current frame, severely occluded, or completely missing, tracking may fail in subsequent frames.
We introduce a measure to determine the confidence degree of the tracking result, which is the first step of the redetection model. The peak value and the fluctuation of the response map reveal the confidence of the tracking result: the ideal response map has a single sharp peak while all other regions are smooth; otherwise, the response map fluctuates intensely. If uncertain samples are used to update the model in subsequent frames, the tracking model will be corrupted. Thus, we fuse two confidence evaluation criteria. The first is the maximum response value F_{max} of the current frame.
The second is the APCE measure, which is defined as:
$$ APCE=\frac{{\left|{F}_{max}-{F}_{min}\right|}^2}{mean\left({\sum}_{w,h}{\left({F}_{w,h}-{F}_{min}\right)}^2\right)} $$
(9)
where F_{max} and F_{min} are the maximum and minimum responses of the current frame, respectively, and F_{w, h} is the element in the w-th row and h-th column of the response matrix.
If the target is moving slowly and is easily distinguishable, the APCE value is generally high. However, if the target is undergoing fast motion with significant deformations, the value of APCE will be low even if the tracking is correct.
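Eq. (9) is straightforward to implement. The sketch below also includes a fusion rule comparing both criteria against their historical averages; that rule and its threshold values are assumptions about common practice, not the paper's exact thresholds:

```python
import numpy as np

def apce(response):
    """Eq. (9): average peak-to-correlation energy of a response map."""
    f_max, f_min = response.max(), response.min()
    return (f_max - f_min) ** 2 / np.mean((response - f_min) ** 2)

def is_confident(f_max, apce_val, hist_fmax, hist_apce, r1=0.6, r2=0.45):
    """Assumed fusion rule: trust the result only when both criteria
    exceed a fraction (r1, r2; values assumed) of their historical means."""
    return bool(f_max > r1 * np.mean(hist_fmax)
                and apce_val > r2 * np.mean(hist_apce))
```

A single sharp peak on a flat map yields a high APCE, while a multi-modal or noisy map drives it down, matching the behavior described above.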
3.3.2 Target redetection
In this section, we describe the redetection mechanism used in the case of tracking failure. In the redetection module, when the confidence level is lower than the threshold, an SVM [33] is used for redetection. Consider a sample set (x_{1}, y_{1}), (x_{2}, y_{2}), …, (x_{i}, y_{i}), …, with x_{i} ∈ R^{d} including positive and negative samples, where d is the dimension of a sample and y_{i} ∈ {+1, −1} is the sample label. The SVM separates the positive and negative samples to obtain the best classification hyperplane, defined as [33]:
$$ {\omega}^{\mathrm{T}}x+b=0 $$
(10)
where ω represents the weight vector and b denotes the bias term. In the linearly separable case, for a given dataset T and classification hyperplane, the following formula is used for classification:
$$ \left\{\begin{array}{c}{\omega}^{\mathrm{T}}x+b\le -1,\ {y}_i=-1\\ {}{\omega}^{\mathrm{T}}x+b\ge +1,\ {y}_i=+1\end{array}\right. $$
(11)
Combining the two inequalities, we can write:
$$ y\left({\omega}^Tx+b\right)\ge 1 $$
(12)
The distance from each support vector to the hyperplane can be written as:
$$ d=\frac{\left|{\omega}^Tx+b\right|}{\left\Vert \omega \right\Vert } $$
(13)
The problem of solving the maximum partition hyperplane of the SVM model can be expressed as the following constrained optimization problem:
$$ {\displaystyle \begin{array}{c}\ \mathit{\min}\frac{1}{2}{\left\Vert \omega \right\Vert}^2\\ {}s.t.{y}_i\left({\omega}^T{x}_i+b\right)\ge 1\end{array}} $$
(14)
Next, the Lagrangian function is introduced to solve the above problem [33]:
$$ L\left(\omega, b,c\right)=\frac{1}{2}{\left\Vert \omega \right\Vert}^2-{\sum}_{i=1}^l{c}_i{y}_i\left(\omega \cdot {x}_i+b\right)+{\sum}_{i=1}^l{c}_i $$
(15)
where c_{i} > 0 are the Lagrange multipliers. The solution of the optimization problem satisfies that the partial derivatives of L(ω, b, c) with respect to ω and b are zero. The corresponding decision function is expressed as:
$$ f(x)=\operatorname{sign}\left({\omega}^{\ast}\cdot x+{b}^{\ast}\right)=\operatorname{sign}\left\{{\sum}_{j=1}^{l}{c}_j^{\ast }{y}_j\left({x}_j\cdot x\right)+{b}^{\ast}\right\} $$
(16)
New sample points are then fed into the decision function to obtain their classification.
In the linearly inseparable case, we use a kernel function to map the samples to a high-dimensional space. In this work, we use the Gaussian kernel function:
$$ k\left({x}_i,{x}_j\right)=\exp \left(-\frac{{\left\Vert {x}_i-{x}_j\right\Vert}^2}{2{\sigma}^2}\right) $$
(17)
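Eqs. (16) and (17) can be illustrated together as follows. The support vectors, coefficients, bias, and σ in the sketch are hypothetical values chosen for demonstration, not learned ones:

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=0.5):
    """Eq. (17): Gaussian (RBF) kernel."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def decision(x, support_x, support_y, c, b, sigma=0.5):
    """Eq. (16) with the kernel trick: the sign of the weighted kernel
    sum over the support vectors plus the bias."""
    s = sum(cj * yj * gaussian_kernel(xj, x, sigma)
            for cj, yj, xj in zip(c, support_y, support_x))
    return np.sign(s + b)
```

With the Gaussian kernel, the decision value is dominated by the support vectors nearest the query point, so samples close to a positive support vector are classified as positive.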
During redetection, an exhaustive search is performed on the current frame using a sliding window, and HOG features are extracted from each image patch as the feature vector x in the above formulas. Then f(x) is calculated by formula (16), and we select the sample area with the largest decision value. When its response value is greater than the threshold, it is used as the new location of the tracking target.
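The redetection loop described above can be sketched as a generic sliding-window search. Here `score_fn` is a placeholder for the SVM decision value on the HOG features of a patch, and the threshold handling follows the description above; the function names are illustrative only:

```python
import numpy as np

def redetect(candidates, positions, score_fn, threshold):
    """Sliding-window re-detection sketch: score every candidate patch
    (score_fn stands in for the SVM decision value on HOG features) and
    accept the best candidate only if it clears the threshold."""
    scores = [score_fn(p) for p in candidates]
    best = int(np.argmax(scores))
    return positions[best] if scores[best] > threshold else None
```

Returning `None` when no candidate clears the threshold leaves the tracker free to keep searching in later frames instead of locking onto a wrong location.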
The training process of the SVM is as follows [33]. The confidence level determines the quality of a sample: samples with high confidence are used as positive samples, and samples with low confidence are used as negative samples. HOG features are extracted from the positive and negative samples to obtain feature vectors, represented as (x_{i}, y_{i}), i = 1, 2, …, n, where n denotes the number of training samples, x_{i} represents the HOG feature vector, and y_{i} represents the label of the sample: y_{i} = 1 if the training sample is positive and y_{i} = −1 if it is negative. For the binary classification of our samples, the loss function is defined in formula (18).
$$ Loss\left(x,y,\omega \right)=\max \left(0,1-y\left(x\cdot \omega \right)\right) $$
(18)
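A minimal sketch of the hinge loss in Eq. (18):

```python
import numpy as np

def hinge_loss(x, y, w):
    """Eq. (18): zero when sample (x, y) is classified with a margin of
    at least 1, positive otherwise; y is +1 or -1."""
    return max(0.0, 1.0 - y * float(np.dot(x, w)))
```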
When the loss is greater than zero, the parameters of the SVM are updated as follows.
$$ {\omega}^{\ast }={\sum}_{\mathrm{j}=1}^{\mathrm{l}}{c}_j^{\ast }{y}_j{x}_j $$
(19)
$$ {b}^{\ast }={y}_i-{\sum}_{j=1}^{l}{y}_j{c}_j^{\ast}\left({x}_j\cdot {x}_i\right) $$
(20)
where c_{j} is the Lagrangian coefficient, x_{j} is the feature vector extracted from the sample, and y_{j} is the corresponding label.