 Research
 Open Access
Learning spatial regularized correlation filters with response consistency and distractor repression for UAV tracking
EURASIP Journal on Advances in Signal Processing volume 2023, Article number: 35 (2023)
Abstract
Correlation filter-based trackers have made significant progress in visual object tracking for various types of unmanned aerial vehicle (UAV) applications due to their promising performance and efficiency. However, the boundary effect remains a challenging problem. Several methods enlarge search areas to handle this shortcoming but introduce more background noise, and the filter is prone to learn from distractors. To address this issue, we present spatial regularized correlation filters with response consistency and distractor repression. Specifically, a temporal constraint is introduced to reinforce the consistency across frames by minimizing the difference between consecutive correlation response maps. A dynamic spatial constraint is also integrated by exploiting the local maximum points of the correlation response produced during the detection phase to mitigate the interference from background distractors. The proposed appearance model optimizes the temporal and spatial constraints together with a spatial regularization weight simultaneously and can be solved effectively with the alternating direction method of multipliers algorithm. The spatial and temporal information concealed in the response maps is fully taken into consideration to boost overall tracking performance. Extensive experiments are conducted on a public UAV benchmark dataset with 123 challenging sequences. The experimental results and analysis demonstrate that the proposed method outperforms 12 state-of-the-art trackers in terms of both accuracy and robustness while operating efficiently in real time.
1 Introduction
Visual object tracking is widely used in many fields, especially in various types of unmanned aerial vehicle (UAV) applications, such as target following [1], autonomous landing [2, 3], and collision avoidance [4]. Although numerous visual tracking methods have been designed for UAVs [5], robust and accurate UAV tracking remains challenging due to numerous factors, such as aspect ratio change, fast motion, viewpoint change, low resolution, and illumination variation. Additionally, the inherent characteristics of UAVs, such as mechanical vibration, limited battery capacity, and limited computing power, also present great challenges for visual tracking.
In recent years, correlation filter (CF)-based trackers have gained increasing attention from researchers due to their satisfactory tracking performance and high computational efficiency [6,7,8,9,10,11,12]. Using the property of circulant matrices, the CF effectively transforms the correlation operation in the spatial domain into element-wise multiplication in the frequency domain to increase the computing speed. However, the cyclic shift operation brings undesired boundary effects, which introduce inaccurate negative samples and substantially degrade tracking performance. To address this issue, Danelljan et al. [8] introduced a spatial regularization to penalize filter coefficients in the background and proposed spatially regularized discriminative correlation filters (SRDCF) for object tracking. A larger set of negative samples is introduced to mitigate the boundary effect. In detection, a conventional CF produces a response map, and the object is believed to be located where the map’s value is the highest. The quality of the response map reflects, to some extent, the similarity between the target appearance model trained in previous frames and the actual target detected in the current frame. Ideally, the desired response map is unimodal and resembles Gaussian-shaped labels. However, during the practical detection process, the response map can be easily disturbed by complex factors in real scenarios, such as a similar object, partial or complete occlusion, and background clutter. Multiple peaks usually occur in the generated response map, and the tracker is prone to drift due to the interference from background distractors. Although the introduction of a spatial constraint in learning correlation filters improves tracking performance, this method lacks consideration of the spatial and temporal information hidden in response maps. As shown in Fig. 1, tracking failure occurs if the response value of any distractor exceeds that of the actual target.
The prediction result becomes an object that resembles the tracked target in appearance. In addition, the response maps between consecutive frames are not consistent. If the background distractors can be detected and suppressed, and the consistency between consecutive frames is constrained, the tracking accuracy can be improved to a certain extent.
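To make the circulant-matrix property mentioned above concrete, the following NumPy sketch (ours, not from the paper) verifies that circular cross-correlation computed via the FFT matches the direct spatial-domain computation; the patch and filter values are arbitrary stand-ins:

```python
import numpy as np

# A minimal sketch of the property above: circular cross-correlation in the
# spatial domain equals element-wise multiplication in the frequency domain.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # stand-in image patch
f = rng.standard_normal((8, 8))   # stand-in filter

# Frequency domain: R = F^{-1}( conj(F(f)) o F(x) )
R_fft = np.real(np.fft.ifft2(np.conj(np.fft.fft2(f)) * np.fft.fft2(x)))

# Direct circular cross-correlation for comparison
R_direct = np.zeros_like(x)
for u in range(8):
    for v in range(8):
        R_direct[u, v] = np.sum(f * np.roll(np.roll(x, -u, axis=0), -v, axis=1))
```

The two maps agree to numerical precision, which is exactly why CF trackers can afford dense sampling at high frame rates.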
Based on the aforementioned observations, this paper proposes spatial regularized correlation filters with response consistency and distractor repression for robust and efficient UAV tracking to thoroughly explore spatial and temporal information in response maps. Specifically, a temporal constraint is introduced to reinforce the response consistency between consecutive frames. By minimizing the difference between the correlation response from the current frame and the response map from the previous frame, consistency is sustained, and the temporal information in the response map is therefore efficiently integrated. Moreover, considering the disturbance of tracking scenario changes, a dynamic spatial constraint is integrated to suppress the impact of background distractions, which are automatically located by the local maximum points of the response map produced in the detection phase. Thus, the spatial information in the response map is incorporated in the learning phase to enhance the adaptability of the proposed appearance model in different UAV tracking scenarios. Compared to the baseline, the proposed method can suppress background distractors and ensure the quality of the response map, as shown in Fig. 2. The response maps between adjacent frames are also relatively continuous, which is attributed to the consideration of both spatial and temporal information hidden in response maps.
The main contributions of this work are summarized as follows:

(1) We propose a robust and efficient UAV tracking method by jointly learning spatial regularized correlation filters with response consistency and distractor repression. Spatial and temporal information hidden in response maps is taken into consideration to enhance the overall tracking performance.

(2) We apply the alternating direction method of multipliers (ADMM) algorithm to deduce the iteration solutions. Using the ADMM method, an efficient optimization algorithm is developed to find a solution for a spatially regularized CF with temporal and spatial constraints.

(3) The proposed method is evaluated and compared with 12 state-of-the-art trackers on a public UAV benchmark dataset with 123 challenging image sequences. Experimental results demonstrate that the proposed method outperforms other trackers in terms of accuracy and robustness while running efficiently in real time.
The remainder of this paper is organized as follows. Section 2 summarizes several related studies. Section 3 revisits the baseline SRDCF tracker and gives a detailed description of the proposed method. Experimental results are reported and analyzed in Sect. 4, and conclusions are finally drawn in Sect. 5.
2 Related work
2.1 Tracking with correlation filters
CF-based trackers have been widely applied in visual tracking tasks since the introduction of the minimum output sum of squared error (MOSSE) filter [6], which can reach a leading speed of 669 frames per second (FPS). Following the introduction of the MOSSE tracker, researchers have improved the performance of CF-based trackers from different aspects by introducing the kernel method [13], multi-channel formulation [7], part-based strategy [14,15,16], scale estimation [17, 18], effective features [19,20,21], long-term re-detection [22], and other techniques. Henriques et al. [7, 13] applied the kernel trick and multi-channel features to improve CF-based trackers. Liu et al. [14], Li et al. [15], and Fu et al. [16] exploited the part-based strategy in the CF model. By identifying the scale in a scaling pool, Li and Zhu [17] presented a scale adaptive with multiple features tracker (SAMF) for scale estimation. Danelljan et al. [18] trained a classifier on a scale pyramid for scale estimation and proposed a discriminative scale space tracker (DSST). For effective feature exploitation, Bertinetto et al. [19] utilized two complementary features to establish the target appearance model and proposed Staple, a real-time tracker. Moreover, to attain a more comprehensive object appearance, some works [20, 21] have incorporated deep features into the CF-based model. Nonetheless, the heavy computational load incurred by deep features limits their application in real-time UAV tracking tasks. For long-term re-detection, Ma et al. [22] introduced online random fern and support vector machine (SVM) classifiers to recover the target in case of tracking failure. Although various tracking methods have been put forth over time, it remains challenging to design a tracker with both favorable performance and satisfactory running speed.
2.2 Tracking with spatial and temporal information
Spatiotemporal information is known to offer essential cues for tracking tasks. To improve both tracking accuracy and robustness, some recent methods utilizing spatial information have been proposed [8, 9, 23]. Danelljan et al. [8] proposed SRDCF for visual tracking by incorporating spatial regularization to alleviate the boundary effect caused by the periodic assumption of training samples. Galoogahi et al. [23] trained a CF with limited boundaries (CFLB) to reduce the number of examples in a CF that are affected by boundary effects. Galoogahi et al. [9] further proposed learning background-aware correlation filters (BACF) for tracking by effectively modeling the target and its background. To enhance the tracking of objects with irregular shapes, Lukezic et al. [24] introduced an automatically estimated spatial reliability map and proposed the discriminative correlation filter with channel and spatial reliability (CSR-DCF). However, the enhancement brought by spatial information alone is insufficient. In addition to spatial information, the effective use of temporal information has rekindled increasing interest in the CF-based tracking community. SRDCF-decon [25] reweights its historical training samples to reduce the problem caused by sample corruption. However, depending on the size of the training set, the tracker may need to store and process a large number of historical samples, thereby sacrificing tracking efficiency. Li et al. [10] incorporated temporal regularization into the SRDCF and proposed spatial-temporal regularized correlation filters (STRCF) for object tracking. Li et al. [26] suggested learning augmented memory correlation filters (AMCF) for UAV tracking, in which multiple historical views are selected and stored for training to retain more historical appearance information. Huang et al. [11] introduced a regularization term to BACF to restrict the alteration rate of response maps and proposed aberrance repressed correlation filters (ARCF) for UAV tracking. Compared to [10, 11, 25, 26], our method fully exploits the rich spatiotemporal information concealed in response maps to improve the accuracy and robustness of the UAV tracking process.
3 Proposed method
3.1 Revisiting the SRDCF tracker
Unlike the conventional kernelized correlation filter (KCF) tracker, the SRDCF tracker introduces a spatial regularization in the learning process to penalize filter coefficients. This allows SRDCF to be learned on a significantly larger set of negative training samples without corrupting the positive samples, which greatly mitigates the boundary effect and achieves greater performance. The overall objective of SRDCF is to minimize:

$$\begin{aligned} \varepsilon ({{\varvec{f}}})=\sum _{k=1}^{T}\alpha _k\left\| \sum _{d=1}^{D}{{\varvec{x}}}_k^d*{{\varvec{f}}}^d-{{\varvec{y}}}_k\right\| ^2+\sum _{d=1}^{D}\left\| {{\varvec{w}}}\circ {{\varvec{f}}}^d\right\| ^2, \end{aligned}$$
(1)
where \(D=\{({{{\varvec{x}}}_k},{{{\varvec{y}}}_k})\}_{k=1}^T\) indicates a set of training samples, each sample \({{\varvec{x}}}_k=[{{{\varvec{x}}}_k^1},\ldots ,{{{\varvec{x}}}_k^D}]\) consists of D feature maps extracted from an image region with dimensions \(M\times {N}\), \(*\) denotes the convolution operator, \(\circ\) stands for the Hadamard product, \({{\varvec{y}}}_k\) represents the desired Gaussian-shaped labels, \({{\varvec{f}}}\) and \({{\varvec{w}}}\) are the correlation filter and spatial regularization matrix, respectively, the superscript d denotes the dth channel, and the weight \(\alpha _k\) indicates the impact of each sample \({{\varvec{x}}}_k^d\); it is set to emphasize more recent samples. In [8], Danelljan et al. employ the Gauss–Seidel method to iteratively update the CF \({{\varvec{f}}}\).
Although SRDCF is effective in mitigating boundary effects, it lacks consideration of the spatial and temporal information hidden in response maps. In addition, its failure to exploit the circulant matrix structure, together with the resulting large linear equations and the Gauss–Seidel solver, increases the computational burden. More implementation details can be found in [8].
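As a concrete illustration, the SRDCF loss described above (a data term weighted by \(\alpha _k\) plus a spatially weighted regularizer) can be evaluated in a few lines of NumPy. All shapes and values below are arbitrary stand-ins, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, D, T = 16, 16, 3, 2
x = rng.standard_normal((T, D, M, N))   # T training samples with D channels
y = rng.standard_normal((T, M, N))      # labels (Gaussian-shaped in practice)
f = rng.standard_normal((D, M, N))      # correlation filter
w = 1.0 + rng.random((M, N))            # spatial regularization weight
alpha = np.array([0.4, 0.6])            # emphasis on more recent samples

def circ_conv(a, b):
    # circular convolution evaluated via the FFT
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(b)))

data_term = sum(
    alpha[k] * np.sum((sum(circ_conv(x[k, d], f[d]) for d in range(D)) - y[k]) ** 2)
    for k in range(T)
)
reg_term = sum(np.sum((w * f[d]) ** 2) for d in range(D))
loss = data_term + reg_term
```

A solver would minimize this loss over `f`; the sketch only evaluates it to show how the two terms are assembled.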
3.2 Overall formulation
Motivated by the above discussion, we propose spatial regularized correlation filters with response consistency and distractor repression to enhance model stability and accuracy. The overall framework and flowchart of the proposed method are presented in Figs. 3 and 4, respectively. Building on the SRDCF tracker, the proposed method introduces distractor-repressed and response-consistent constraints to improve overall tracking performance.
The overall objective of the proposed method is to minimize the following loss function:
where \(\mathcal {C}_d\) and \(\mathcal {C}_r\) denote the distractor-repressed and response-consistent constraint terms, respectively, and \({{\varvec{w}}}\) and \(\lambda\) denote the spatial regularization weight and regularization parameter, respectively.
3.2.1 Distractor-repressed constraint
Ideally, the response map is unimodal and resembles Gaussian-shaped labels. However, the response map usually has multiple peaks because background distractors exist in actual detection. If the response of a background distractor transcends that of the target object, the tracker will drift to the distractor [12, 27]. In this work, we adopt a distractor-repressed constraint to suppress the interference from background distractors, which is obtained by:
where \(\Delta (\cdot )\) denotes the local maximum cropping function. The local maximum points in the response map \({{\varvec{R}}}\) indicate the presence of distractors. Only the top \(N_d\) local maxima are selected and counted as distractors, discarding the low response values. The cropping matrix \({{\varvec{P}}}^\mathrm{{T}}\) cuts the central area of \(\Delta (\cdot )\) to remove the maximum point within the object area. The factor \(\delta\) controls the repression strength, and \({{\varvec{I}}}\) is an identity vector. \([\varphi _{p,c}]\) denotes a shift operator that matches the peaks of the response map and the regression target, where the subscripts p and c denote the location difference of the two peaks. The distractor-repressed constraint term \(\mathcal {C}_d\) generates a dynamic regression target, in contrast to the fixed target label. The first term in (2) is the ridge regression term that convolves the training samples \({{\varvec{x}}}=[{{{\varvec{x}}}}^1,\ldots ,{{{\varvec{x}}}}^D]\) with the filter \({{\varvec{f}}}=[{{{\varvec{f}}}}^1,\ldots ,{{{\varvec{f}}}}^D]\) to fit the distractor-repressed label \({{\varvec{y}}}\). It acts as a dynamic spatial constraint to suppress the local maxima of the response map in the training phase.
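A minimal sketch of how the distractor candidates might be located. The 8-neighbour local-maximum test, the square central crop standing in for \({{\varvec{P}}}^\mathrm{{T}}\), and the top-\(N_d\) selection are all illustrative choices of ours, not the paper's exact implementation:

```python
import numpy as np

def local_maxima(R):
    # a point is a local maximum if it exceeds all 8 circularly shifted neighbours
    shifts = [np.roll(np.roll(R, du, 0), dv, 1)
              for du in (-1, 0, 1) for dv in (-1, 0, 1) if (du, dv) != (0, 0)]
    return R > np.maximum.reduce(shifts)

rng = np.random.default_rng(2)
R = rng.random((32, 32))          # stand-in response map
peaks = local_maxima(R)

# stand-in for the cropping matrix P^T: remove maxima inside the object area
cy, cx, half = 16, 16, 4
peaks[cy - half:cy + half, cx - half:cx + half] = False

# keep only the top N_d remaining local maxima as distractors
N_d = 5
vals = np.where(peaks, R, -np.inf).ravel()
top = np.argsort(vals)[-N_d:]
distractor_mask = np.zeros(R.size, dtype=bool)
distractor_mask[top] = True
distractor_mask = distractor_mask.reshape(R.shape)
```

The resulting mask marks the candidate distractor locations that the constraint would then repress with strength \(\delta\).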
3.2.2 Response-consistent constraint
Ideally, the appearance of the target and its context changes very little between adjacent frames, as the time interval is short. Therefore, the correlation responses of two consecutive frames should not differ much. However, abrupt appearance changes caused by partial or full occlusion and background clutter will lead to response anomalies. In this study, we introduce a response-consistent constraint \(\mathcal {C}_r\) to mitigate the influence of abrupt changes in response maps between two consecutive frames:
where \(\gamma\) is the regularization parameter and \(\sum _{d=1}^D{{{\varvec{x}}}_{t-1}^d*{{\varvec{f}}}_{t-1}^d}\) denotes the response map obtained in the \((t-1)\)th frame. The operator \([\varphi _{p,q}]\) shifts the two peaks of both response maps to coincide with each other. When abrupt appearance changes occur, the similarity between consecutive frames will suddenly drop, and thus the value of the response-consistent constraint term will be high. This term indicates that the desired response difference between consecutive frames should be zero, which helps suppress response inconsistency during training. It also acts as a temporal constraint to penalize filter coefficients when the response difference is unusually high.
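The response-consistent term can be sketched as follows; the peak alignment via `np.roll` is a plausible stand-in for the shift operator \([\varphi _{p,q}]\), and the function name is ours:

```python
import numpy as np

def consistency_cost(R_t, R_prev, gamma=0.8):
    # Align the previous response so its peak coincides with the current peak
    # (a stand-in for the shift operator), then penalize the squared difference.
    p = np.unravel_index(np.argmax(R_t), R_t.shape)
    q = np.unravel_index(np.argmax(R_prev), R_prev.shape)
    R_prev_shifted = np.roll(np.roll(R_prev, p[0] - q[0], axis=0),
                             p[1] - q[1], axis=1)
    return 0.5 * gamma * np.sum((R_t - R_prev_shifted) ** 2)
```

A pure translation of the scene yields zero cost after alignment, while occlusion or clutter that deforms the response map yields a large cost, which is the behaviour the constraint exploits.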
3.3 Optimization of formulation
Considering the convexity of (2), ADMM is introduced to obtain the globally optimal solution. To this end, we first introduce an auxiliary variable \({{\varvec{g}}}\), by requiring \({{\varvec{f}}}={{\varvec{g}}}\), and the step size parameter \(\mu\). The augmented Lagrangian form of (2) can be formulated as:
where \(\varvec{\rho }\) is the Lagrange multiplier. By introducing \({{\varvec{h}}}=\frac{1}{\mu }{\varvec{\rho }}\), (5) can be reformulated as:
The ADMM algorithm is then adopted by alternately solving the following subproblems:
The solution to each subproblem is detailed as follows:
Subproblem \({{\varvec{f}}}\): Using \({{\varvec{g}}}\) and \({{\varvec{h}}}\) obtained in the last iteration, the optimal \({{\varvec{f}}}\) can be determined by:
Based on the convolution theorem, the cyclic convolution operation in the spatial domain can be replaced by element-wise multiplication in the Fourier domain, and (8) can therefore be rewritten as:
where \(\hat{{{\varvec{f}}}}\) denotes the discrete Fourier transform (DFT) of the filter \({{\varvec{f}}}\). Since each pixel is independent, the solution can be obtained across all channels for every pixel separately. The optimization for the jth pixel can be further reformulated as:
where \(\hat{{{\varvec{R}}}}_{t-1}^s\) denotes the DFT of the shifted detection response from the previous frame.
Setting the derivative of (10) to zero, the closed-form solution for \(\mathcal {V}_j(\hat{{{\varvec{f}}}})\) can be obtained:
where the vector \({{\varvec{q}}}\) takes the form \({{\varvec{q}}}=\mathcal {V}_j(\hat{{{\varvec{x}}}})(\widehat{{{\varvec{y}}}\circ {\mathcal {C}}_d})_j +\mu \mathcal {V}_j(\hat{{{\varvec{g}}}})-\mu \mathcal {V}_j(\hat{{{\varvec{h}}}})+ \gamma \mathcal {V}_j(\hat{{{\varvec{x}}}})\hat{{{\varvec{R}}}}_{t-1}^s\). Since \(\mathcal {V}_j(\hat{{{\varvec{x}}}})\mathcal {V}_j(\hat{{{\varvec{x}}}})^\mathrm{{T}}\) is a rank-1 matrix, (11) can be solved with the Sherman–Morrison formula [28], i.e., \(({{\varvec{A}}}+{{\varvec{u}}}{{\varvec{v}}}^\mathrm{{T}})^{-1}={{{\varvec{A}}}}^{-1}-\frac{{{{\varvec{A}}}}^{-1}{{\varvec{u}}}{{\varvec{v}}}^\mathrm{{T}}{{{\varvec{A}}}}^{-1}}{1+{{\varvec{v}}}^\mathrm{{T}}{{{\varvec{A}}}}^{-1}{{\varvec{u}}}}\). In this case, \({{\varvec{A}}}=\frac{\mu }{1+\gamma }{{\varvec{I}}}\), and \({{\varvec{u}}}={{\varvec{v}}}=\mathcal {V}_j(\hat{{{\varvec{x}}}})\). As a result, (11) is equivalent to:
where \(b=\frac{\mu }{1+\gamma }+\mathcal {V}_j(\hat{{{\varvec{x}}}})^{\textrm{T}}\mathcal {V}_j(\hat{{{\varvec{x}}}})\), \(\gamma ^*=\frac{1}{1+\gamma }\left( \frac{\mu }{1+\gamma }\right) ^{-1}\), \(\mathcal {\hat{S}}_{\varvec{xx}}=\mathcal {V}_j(\hat{{{\varvec{x}}}})^{\textrm{T}}\mathcal {V}_j(\hat{{{\varvec{x}}}})\), \(\mathcal {\hat{S}}_{\varvec{xg}}=\mathcal {V}_j(\hat{{{\varvec{x}}}})^{\textrm{T}}\mathcal {V}_j(\hat{{{\varvec{g}}}})\), and \(\mathcal {\hat{S}}_{\varvec{xh}}=\mathcal {V}_j(\hat{{{\varvec{x}}}})^\mathrm{{T}}\mathcal {V}_j(\hat{{{\varvec{h}}}})\). Note that (12) contains only vector sum-product operations and can thus be computed efficiently. The filter \({{\varvec{f}}}\) can be further obtained by the inverse DFT of \(\hat{{{\varvec{f}}}}\).
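The Sherman–Morrison identity used here is easy to verify numerically; in this sketch, `u` stands in for \(\mathcal {V}_j(\hat{{{\varvec{x}}}})\) and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
D, mu, gamma = 4, 1.0, 0.8
u = rng.standard_normal((D, 1))       # stand-in for V_j(x_hat)
A = (mu / (1 + gamma)) * np.eye(D)    # A = (mu / (1 + gamma)) I

# Direct inverse of the rank-1 update ...
lhs = np.linalg.inv(A + u @ u.T)

# ... versus the Sherman-Morrison formula
Ainv = np.linalg.inv(A)
rhs = Ainv - (Ainv @ u @ u.T @ Ainv) / (1 + (u.T @ Ainv @ u).item())
```

Because `A` is a scaled identity, the formula reduces the per-pixel inverse to scalar operations, which is what makes the filter subproblem cheap.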
Subproblem \({{\varvec{g}}}\): Given \({{\varvec{f}}}\) and \({{\varvec{h}}}\), the optimal \({{\varvec{g}}}\) can be obtained by:
Each element of \({{\varvec{g}}}\) can be computed independently, and thus the closed-form solution of \({{\varvec{g}}}\) can be computed by:
As a result, the subproblems \({{\varvec{f}}}\) and \({{\varvec{g}}}\) are solved.
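The per-element closed form is not reproduced above; under the common assumption that the spatial term \(\frac{1}{2}\Vert {{\varvec{w}}}\circ {{\varvec{g}}}\Vert ^2\) attaches to the auxiliary variable \({{\varvec{g}}}\) (our assumption, not confirmed by the text), a plausible solution is \(g_j=\mu (f_j+h_j)/(w_j^2+\mu )\). The sketch below checks this candidate minimizer numerically; all names and values are ours:

```python
import numpy as np

# ASSUMPTION: subproblem g has the form
#   minimize_g  1/2 ||w o g||^2 + mu/2 ||f - g + h||^2,
# whose element-wise stationarity condition gives
#   g_j = mu * (f_j + h_j) / (w_j^2 + mu).
rng = np.random.default_rng(4)
mu = 1.0
f = rng.standard_normal(64)
h = rng.standard_normal(64)
w = 1.0 + rng.random(64)

g_star = mu * (f + h) / (w ** 2 + mu)

def cost(g):
    # the assumed subproblem objective, evaluated element-wise and summed
    return 0.5 * np.sum((w * g) ** 2) + 0.5 * mu * np.sum((f - g + h) ** 2)
```

Perturbing `g_star` in any direction increases `cost`, consistent with it being the minimizer of the assumed objective.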
Updating step size parameter \(\mu\): The step size parameter \(\mu\) is updated as follows:
where \(\mu ^{\max }\) and \(\beta\) denote the maximum value of \(\mu\) and the scale factor, respectively.
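Although the update equation itself is not reproduced here, a common ADMM schedule consistent with the parameters \(\beta\) and \(\mu ^{\max }\) reported in Sect. 4.1 is geometric growth capped at the maximum (a sketch of ours, assuming this standard rule):

```python
def update_mu(mu, beta=10.0, mu_max=1e4):
    # common ADMM step-size schedule: grow geometrically, cap at mu_max
    return min(beta * mu, mu_max)
```

Starting from \(\mu ^0=1\) with \(\beta =10\), the step size saturates at \(\mu ^{\max }=10^4\) after four updates, matching the small iteration count used in the experiments.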
3.4 Update of appearance model
The appearance model \(\hat{{{\varvec{x}}}}^{\textrm{model}}\) is updated as follows:
where \(\eta\) is the learning rate, \(\hat{{{\varvec{x}}}}\) is the object feature extracted at frame t, and \(\hat{{{\varvec{x}}}}_{t}^{\textrm{model}}\) is the model feature.
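A plausible sketch of this update is the standard linear interpolation used by many CF trackers (our assumption of the exact form, with the learning rate from Sect. 4.1):

```python
def update_model(x_model_prev, x_t, eta=0.0192):
    # standard linear-interpolation model update, a common choice in CF
    # trackers; assumed here, not reproduced from the paper's equation
    return (1 - eta) * x_model_prev + eta * x_t
```

With a small \(\eta\), the model adapts slowly, which trades responsiveness to appearance change for robustness against occlusion-induced corruption.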
3.5 Object location
When a new frame arrives, the filter trained in the last frame, \(\hat{{{\varvec{f}}}}_{t-1}\), is used to localize the object by searching for the peak in the response map, calculated as follows:
where \({{\varvec{z}}}_{t}^d\) denotes the feature map of the search area patch in the dth channel and \(\mathcal {F}^{-1}\) represents the inverse DFT. The target location \({{\varvec{l}}}_t\) in frame t corresponds to the maximum response value.
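A hedged sketch of this detection step, with the multi-channel convolution evaluated in the Fourier domain (array shapes and the helper name are ours):

```python
import numpy as np

def detect(z, f_hat_prev):
    # z: (D, M, N) search-area features; f_hat_prev: (D, M, N) DFT of the filter.
    # Response: R_t = F^{-1}( sum_d  z_hat^d o f_hat^d ), then argmax localizes
    # the target. A sketch of the detection formula, not the paper's code.
    z_hat = np.fft.fft2(z, axes=(-2, -1))
    R = np.real(np.fft.ifft2(np.sum(z_hat * f_hat_prev, axis=0)))
    return np.unravel_index(np.argmax(R), R.shape), R
```

With a delta filter (all-ones spectrum), the response simply reproduces the summed feature channels, so the argmax lands on the feature peak, which is a convenient sanity check.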
3.6 Tracking with response consistency and distractor repression
The details of our proposed method are summarized in Algorithm 1.
Algorithm 1 The proposed tracking method.

Input: The image frame t, the location \({{\varvec{l}}}_{t-1}\) and scale \({{\varvec{s}}}_{t-1}\) of the tracked object on frame \(t-1\), the appearance model \(\hat{{{\varvec{x}}}}_{t-1}^{model}\), and the filter \({{\varvec{f}}}_{t-1}\).
Output: The location \({{\varvec{l}}}_t\) and scale \({{\varvec{s}}}_t\) of the tracked object on frame t.
If \(t=1\) then
Extract \(\mathcal {C}_d^1\) centered at the ground truth \({{\varvec{l}}}_1\) using (3);
Use (12), (14), and (15) to initialize the filters \({{\varvec{f}}}_1\) and \({{\varvec{g}}}_1\);
Else
Crop the search image patch \({{\varvec{z}}}\) with S scales on frame t centered at \({{\varvec{l}}}_{t-1}\);
Extract grayscale, color names (CN) [29], and histogram of oriented gradient (HOG) [30] features \({[{{\varvec{z}}}(s)]}_{s=1}^S\) of the patch;
Generate the response map \({{\varvec{R}}}_t\) using (17);
Estimate the object location \({{\varvec{l}}}_t\) on frame t by searching for the highest value in \({{\varvec{R}}}_t\);
Extract \(\mathcal {C}_d^t\) centered at location \({{\varvec{l}}}_t\) using (3);
Update the filters \({{\varvec{f}}}_t\) and \({{\varvec{g}}}_t\) using (12), (14), and (15);
Update the appearance model using (16).
End
4 Experiments
In this section, we conduct experiments with the proposed method and 12 other state-of-the-art trackers for comparison, using a publicly available UAV benchmark dataset. The experimental settings are first introduced, the overall performance and attribute-based evaluation of all trackers on the UAV benchmark dataset are then presented, and the qualitative evaluations, ablation study, parameters, and tracking speed analysis are finally discussed.
4.1 Experimental settings
Parameter Settings: For the ADMM hyperparameters, we set \(\gamma =0.8\), \(\lambda =0.01\), the maximum value of the step size parameter \(\mu ^{\max }=10{,}000\), \(\beta =10\), and \(\mu ^0=1\). The number of ADMM iterations N is set to 5, and the learning rate \(\eta\) is 0.0192. The number of top local maxima \(N_d\) is set to 30, and \(\delta\) is 0.25. All parameters are kept the same for the following experiments. The experiments were carried out in MATLAB 2017b on an Intel(R) Core(TM) i7-7700 CPU (3.6 GHz) with 8 GB RAM.
Features: Only handcrafted features are utilized for appearance representation, namely grayscale, CN, and HOG features. The cell size for feature extraction is 4\(\times\)4, and the number of HOG orientation bins is 9.
Datasets: To evaluate the tracking performance, experiments were conducted using the public UAV benchmark dataset UAV123@10FPS [31]. The dataset consists of 123 challenging sequences (37,885 frames).
Evaluation Methodology: To analyze and evaluate the performance of our proposed method, precision and success rate based on one-pass evaluation (OPE) are employed as the main evaluation criteria. More details can be found in [32].
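For reference, the two OPE criteria reduce to simple computations; the helper names below are ours, not from the benchmark toolkit:

```python
import numpy as np

def precision_at(pred_centers, gt_centers, thresh=20.0):
    # distance precision: fraction of frames whose center location error
    # is within `thresh` pixels (the 20-pixel threshold is the usual report)
    err = np.linalg.norm(np.asarray(pred_centers, dtype=float)
                         - np.asarray(gt_centers, dtype=float), axis=1)
    return float(np.mean(err <= thresh))

def iou(a, b):
    # boxes as (x, y, w, h); success plots sweep a threshold over this
    # overlap from 0 to 1 and report the area under the resulting curve (AUC)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)
```

Averaging `precision_at` over a sequence gives the precision score, and thresholding `iou` per frame over a sweep of thresholds yields the success curve.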
4.2 Overall performance
The overall performance of our proposed method is evaluated against 12 state-of-the-art trackers, including KCF [7], TLD [33], LCT [21], SAMF [17], Struck [34], BACF [9], Staple [19], AMCF [26], SRDCF [8], SRDCF-decon [25], MEEM [35], and STRCF [10]. Figure 5 shows the precision and success plots of all the trackers on 123 challenging UAV sequences. In the precision plots, the distance precision scores at the 20-pixel threshold are 0.673 (our method), 0.627 (STRCF), 0.584 (MEEM), 0.584 (SRDCF-decon), 0.575 (SRDCF), 0.574 (AMCF), 0.573 (Staple), 0.572 (BACF), 0.509 (Struck), 0.465 (SAMF), 0.442 (LCT), 0.415 (TLD), and 0.406 (KCF), respectively. Our proposed method achieves the best precision among all tracking algorithms and outperforms the second- and third-best trackers by 4.6% and 8.9%, respectively. Likewise, in the success plots, the area-under-the-curve (AUC) success scores of all trackers are 0.476 (our method), 0.457 (STRCF), 0.429 (SRDCF-decon), 0.423 (SRDCF), 0.415 (Staple), 0.415 (AMCF), 0.413 (BACF), 0.380 (MEEM), 0.347 (Struck), 0.326 (SAMF), 0.289 (LCT), 0.286 (TLD), and 0.265 (KCF), respectively. Our proposed method also achieves an advantage of 1.9% and 4.7% over the second-best tracker STRCF and the third-best tracker SRDCF-decon. Thus, it can be summarized that our proposed method outperforms the other 12 state-of-the-art trackers in terms of precision and success rate on the 123 UAV sequences.
4.3 Attributebased evaluation
The benchmark sequences are annotated with 12 attributes, namely scale variation (SV), partial occlusion (POC), camera motion (CM), aspect ratio change (ARC), viewpoint change (VC), low resolution (LR), similar object (SOB), full occlusion (FOC), illumination variation (IV), out-of-view (OV), fast motion (FM), and background clutter (BC). These attributes affect the performance of a tracker and are used to evaluate the tracker in different scenarios. Figures 6 and 7 depict the precision and success plots of different attributes of all trackers on the 123 UAV sequences.
In the precision plots, our proposed method performs competitively across the challenging attributes compared with other state-of-the-art trackers. Our method ranks first in ten of the twelve attributes in the UAV123@10FPS benchmark, namely SV, POC, CM, ARC, VC, LR, SOB, FOC, IV, and FM. Tables 1 and 2 also present the precision and success scores of all trackers for the above attributes. As shown in Table 1, our method performs 5.1%, 3.2%, 2.1%, 7.3%, 5.2%, 4.8%, 2.1%, 1.2%, 8.4%, and 3.1% better in these attributes than the second-best trackers in precision scores at the 20-pixel threshold.
Similarly, in the success plots, our method ranks first in nine attributes, namely SV, POC, ARC, VC, LR, SOB, FOC, IV, and FM. As shown in Table 2, our method performs 2.2%, 2.2%, 2.9%, 1.5%, 4.2%, 0.5%, 0.6%, 4.1%, and 1.7% better in these attributes than the second-best trackers in AUC-based success scores.
Figures 8 and 9 present the analysis of precision and success scores for all trackers with different attributes. As can be seen in these figures, our proposed method ranks first in general among all trackers. Therefore, it can be concluded that the proposed method has achieved better tracking performance compared to other stateoftheart trackers.
4.4 Qualitative evaluation
For clearer visualization, Fig. 10 further exhibits the tracking results obtained by all trackers with several challenging sequences. Table 3 shows the number of frames and relevant attributes of these sequences. The tracking results show that the proposed method outperforms other stateoftheart trackers.
4.5 Ablation study
To validate its effectiveness, our method is compared to variants of itself with different modules enabled. The overall evaluation is presented in Table 4. With the response consistency (RC) and distractor repression (DR) modules added to the baseline (SRDCF), the performance improves steadily. Furthermore, our final tracker improves on the baseline method by 5.3% and 9.8% in terms of success rate and precision, respectively.
4.6 Parameters analysis
We investigate the parameter sensitivity of the distractor repression factor \(\delta\) in (3), the response-consistent constraint parameter \(\gamma\) in (4), and the number of local maxima for suppression \(N_d\) described in Sect. 3.2.1. We conduct experiments on our tracker using the UAV123@10FPS dataset for different parameters, with precision and success scores employed as the evaluation criteria. We fix the other parameters while changing the value of the analyzed one. The values of both \(\delta\) and \(N_d\) influence distractor repression and affect the tracking performance. In Fig. 11, we exhibit the precision and success scores of our tracker for various values of \(\delta\) and \(N_d\). Ranging from 0.663 to 0.673 and 0.468 to 0.476, respectively, the precision and success scores are only slightly influenced by the value of \(N_d\) when \(N_d\) is above 20. As only the top local maxima cause significant interference, suppressing the lower local maxima has little impact on the tracking performance. As for \(\delta\), when its value increases from 0.05 to 0.25, the tracking performance increases. When \(\delta\) varies from 0.25 to 1, both precision and success scores decrease. Specifically, when \(\delta\) is greater than 0.35, the variation of \(\delta\) has a relatively small impact on tracking performance, and the precision and success scores mostly remain within the ranges of 0.641 to 0.655 and 0.457 to 0.466, respectively. When the local maxima are fixed, further increasing the factor \(\delta\) does not improve the tracking performance. The response-consistent constraint parameter \(\gamma\) takes effect when abrupt appearance changes occur and is introduced as a temporal constraint. The value of \(\gamma\) varies from 0 to 1 with a step size of 0.1. Notably, our tracker with \(\gamma =0\) is the ‘Baseline+DR’ tracker. As \(\gamma\) gradually increases from 0, both precision and success scores exhibit an upward trend. When \(\gamma\) reaches 0.8, both precision and success scores attain their maximum values (0.673 and 0.476). As \(\gamma\) continues to increase, both precision and success scores decrease. The introduction of the response-consistent constraint maintains temporal smoothness and further improves the tracking performance on the basis of distractor repression. To achieve satisfactory performance, we set \(\delta\), \(\gamma\), and \(N_d\) to 0.25, 0.8, and 30, respectively. All parameters are kept the same for all experiments.
In addition, we also conducted a comparative experiment on the number of iterations of ADMM. Figure 12 presents the precision score, success score, and speed under different numbers of iterations. By comprehensively considering the accuracy and tracking speed, we finally set the number of iterations of ADMM to 5.
4.7 Tracking speed analysis
Figure 13 presents the tracking speed comparison results in terms of the number of processed frames per second (FPS), for all trackers on 123 challenging UAV sequences. The capturing speed of the original UAV sequences is 10 FPS. From Fig. 13, the speed of our method ranks sixth among all trackers and is greater than 10 FPS. Therefore, our method can meet the realtime requirement in UAV tracking tasks.
5 Conclusions
In this study, spatial regularized correlation filters with response consistency and distractor repression were proposed in the context of UAV-based tracking. By exploiting the response-consistent constraint to limit correlation response variations, temporal consistency across adjacent frames was pursued to enhance the discriminative power of the proposed appearance model. In addition, a distractor-repressed constraint was incorporated in the learning phase, serving as a dynamic spatial constraint to suppress the influence of distractors. An ADMM algorithm was developed to solve the appearance model efficiently. Spatial and temporal cues in the response maps were explored and encoded in the conventional CF framework to facilitate UAV tracking in complex scenarios and boost overall performance. Comparative experimental results over 123 challenging UAV sequences demonstrated that the proposed method outperforms 12 state-of-the-art trackers in terms of accuracy, robustness, and efficiency.
Availability of data and materials
Experiments were conducted on the public UAV benchmark dataset UAV123@10fps to evaluate tracking performance. The UAV123@10fps dataset is available at https://cemse.kaust.edu.sa/ivul/uav123.
Abbreviations
CF: Correlation filter
UAV: Unmanned aerial vehicle
ADMM: Alternating direction method of multipliers
FPS: Frames per second
DFT: Discrete Fourier transform
CN: Color name
HOG: Histogram of oriented gradient
OPE: One-pass evaluation
AUC: Area under the curve
RC: Response consistency
DR: Distractor repression
References
R. Li, M. Pang, C. Zhao, G. Zhou, L. Fang, Monocular long-term target following on UAVs, in 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2016, Las Vegas, NV, USA, June 26–July 1, 2016, p. 29–37 (2016). https://doi.org/10.1109/CVPRW.2016.11
C. Fu, A. Carrio, M.A. Olivares-Mendez, P. Campoy, Online learning-based robust visual tracking for autonomous landing of unmanned aerial vehicles, in 2014 International Conference on Unmanned Aircraft Systems (ICUAS), Orlando, FL, USA, p. 649–655 (2014)
S. Lin, M.A. Garratt, A.J. Lambert, Monocular vision-based real-time target recognition and tracking for autonomously landing a UAV in a cluttered shipboard environment. Auton. Robots 41(4), 881–901 (2017)
C. Fu, A. Carrio, M.A. Olivares-Méndez, R. Suarez-Fernandez, P.C. Cervera, Robust real-time vision-based aircraft tracking from unmanned aerial vehicles, in 2014 IEEE International Conference on Robotics and Automation, ICRA 2014, Hong Kong, China, May 31–June 7, 2014, p. 5441–5446 (2014). https://doi.org/10.1109/ICRA.2014.6907659
C. Fu, B. Li, F. Ding, F. Lin, G. Lu, Correlation filters for unmanned aerial vehicle-based aerial tracking: a review and experimental evaluation. IEEE Geosci. Remote Sens. Mag. 10(1), 125–160 (2021)
D.S. Bolme, J.R. Beveridge, B.A. Draper, Y.M. Lui, Visual object tracking using adaptive correlation filters, in The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13–18 June 2010, p. 2544–2550 (2010). https://doi.org/10.1109/CVPR.2010.5539960
J.F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 583–596 (2015)
M. Danelljan, G. Häger, F.S. Khan, M. Felsberg, Learning spatially regularized correlation filters for visual tracking, in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7–13, 2015, p. 4310–4318 (2015). https://doi.org/10.1109/ICCV.2015.490
H.K. Galoogahi, A. Fagg, S. Lucey, Learning background-aware correlation filters for visual tracking, in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22–29, 2017, p. 1144–1152 (2017). https://doi.org/10.1109/ICCV.2017.129
F. Li, C. Tian, W. Zuo, L. Zhang, M. Yang, Learning spatial-temporal regularized correlation filters for visual tracking, in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018, p. 4904–4913 (2018). https://doi.org/10.1109/CVPR.2018.00515
Z. Huang, C. Fu, Y. Li, F. Lin, P. Lu, Learning aberrance repressed correlation filters for real-time UAV tracking, in 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27–November 2, 2019, p. 2891–2900 (2019). https://doi.org/10.1109/ICCV.2019.00298
C. Fu, F. Ding, Y. Li, J. Jin, C. Feng, DR\({}^{\text{2}}\)Track: towards real-time visual tracking for UAV via distractor repressed dynamic regression, in IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020, Las Vegas, NV, USA, October 24, 2020–January 24, 2021, p. 1597–1604 (2020). https://doi.org/10.1109/IROS45743.2020.9341761
J.F. Henriques, R. Caseiro, P. Martins, J.P. Batista, Exploiting the circulant structure of tracking-by-detection with kernels, in Computer Vision—ECCV 2012—12th European Conference on Computer Vision, Florence, Italy, October 7–13, 2012, Proceedings, Part IV. Lecture Notes in Computer Science, vol. 7575, p. 702–715 (2012). https://doi.org/10.1007/978-3-642-33765-9_50
T. Liu, G. Wang, Q. Yang, Real-time part-based visual tracking via adaptive correlation filters, in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7–12, 2015, p. 4902–4912 (2015). https://doi.org/10.1109/CVPR.2015.7299124
Y. Li, J. Zhu, S.C.H. Hoi, Reliable patch trackers: Robust visual tracking by exploiting reliable patches, in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7–12, 2015, p. 353–361 (2015). https://doi.org/10.1109/CVPR.2015.7298632
C. Fu, Y. Zhang, Z. Huang, R. Duan, Z. Xie, Part-based background-aware tracking for UAV with convolutional features. IEEE Access 7, 79997–80010 (2019)
Y. Li, J. Zhu, A scale adaptive kernel correlation filter tracker with feature integration, in Computer Vision—ECCV 2014 Workshops—Zurich, Switzerland, September 6–7 and 12, 2014, Proceedings, Part II. Lecture Notes in Computer Science, vol. 8926, p. 254–265 (2014). https://doi.org/10.1007/978-3-319-16181-5_18
M. Danelljan, G. Häger, F.S. Khan, M. Felsberg, Accurate scale estimation for robust visual tracking, in British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1–5, 2014 (2014). http://www.bmva.org/bmvc/2014/papers/paper038/index.html
L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, P.H.S. Torr, Staple: complementary learners for real-time tracking, in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, p. 1401–1409 (2016). https://doi.org/10.1109/CVPR.2016.156
M. Danelljan, G. Häger, F.S. Khan, M. Felsberg, Convolutional features for correlation filter based visual tracking, in 2015 IEEE International Conference on Computer Vision Workshop, ICCV Workshops 2015, Santiago, Chile, December 7–13, 2015, p. 621–629 (2015). https://doi.org/10.1109/ICCVW.2015.84
Y. Li, C. Fu, Z. Huang, Y. Zhang, J. Pan, Intermittent contextual learning for key-filter-aware UAV object tracking using deep convolutional feature. IEEE Trans. Multimed. 23, 810–822 (2021)
C. Ma, X. Yang, C. Zhang, M. Yang, Long-term correlation tracking, in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7–12, 2015, p. 5388–5396 (2015). https://doi.org/10.1109/CVPR.2015.7299177
H.K. Galoogahi, T. Sim, S. Lucey, Correlation filters with limited boundaries, in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7–12, 2015, p. 4630–4638 (2015). https://doi.org/10.1109/CVPR.2015.7299094
A. Lukezic, T. Vojír, L.C. Zajc, J. Matas, M. Kristan, Discriminative correlation filter tracker with channel and spatial reliability. Int. J. Comput. Vis. 126(7), 671–688 (2018)
M. Danelljan, G. Häger, F.S. Khan, M. Felsberg, Adaptive decontamination of the training set: a unified formulation for discriminative visual tracking, in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, p. 1430–1438 (2016). https://doi.org/10.1109/CVPR.2016.159
Y. Li, C. Fu, F. Ding, Z. Huang, J. Pan, Augmented memory for correlation filters in real-time UAV tracking, in IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020, Las Vegas, NV, USA, October 24, 2020–January 24, 2021, p. 1559–1566 (2020). https://doi.org/10.1109/IROS45743.2020.9341595
W. Zhang, B. Kang, S. Zhang, Enhanced occlusion handling and multi-peak re-detection for long-term object tracking. J. Electron. Imaging 27(03), 033005 (2018)
K.B. Petersen, M.S. Pedersen, The Matrix Cookbook. Technical University of Denmark (2012). http://www2.compute.dtu.dk/pubdb/pubs/3274full.html
F.S. Khan, R.M. Anwer, J. van de Weijer, A.D. Bagdanov, M. Vanrell, A.M. López, Color attributes for object detection, in 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16–21, 2012, p. 3306–3313 (2012). https://doi.org/10.1109/CVPR.2012.6248068
N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20–26 June 2005, San Diego, CA, USA, p. 886–893 (2005). https://doi.org/10.1109/CVPR.2005.177
M. Mueller, N. Smith, B. Ghanem, A benchmark and simulator for UAV tracking, in Computer Vision—ECCV 2016—14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I. Lecture Notes in Computer Science, vol. 9905, p. 445–461 (2016). https://doi.org/10.1007/978-3-319-46448-0_27
Y. Wu, J. Lim, M. Yang, Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1834–1848 (2015)
Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell. 34(7), 1409–1422 (2012). https://doi.org/10.1109/TPAMI.2011.239
S. Hare, S. Golodetz, A. Saffari, V. Vineet, M. Cheng, S.L. Hicks, P.H.S. Torr, Struck: structured output tracking with kernels. IEEE Trans. Pattern Anal. Mach. Intell. 38(10), 2096–2109 (2016)
J. Zhang, S. Ma, S. Sclaroff, MEEM: robust tracking via multiple experts using entropy minimization, in Computer Vision—ECCV 2014—13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part VI. Lecture Notes in Computer Science, vol. 8694, p. 188–203 (2014). https://doi.org/10.1007/978-3-319-10599-4_13
Acknowledgements
The author thanks the editor and anonymous reviewers for their helpful comments and valuable suggestions.
Funding
This work was supported by the scientific research program of the Education Department of Shaanxi Provincial Government (20JK0487) and the Key R&D Program of the Shaanxi Province of China (No. 2022GY071).
Author information
Authors and Affiliations
Contributions
WZ received the B.S. degree in network engineering from Shaanxi University of Science and Technology, Xi’an, China, in 2010, the M.S. degree in computer science and technology from Shaanxi Normal University, Xi’an, China, in 2013, and the Ph.D. degree in computer science and technology from Northwest University, Xi’an, China, in 2019. She is currently a lecturer in the Department of Computer Science, Baoji University of Arts and Sciences. Her research interests include object tracking and machine learning. The author has read and approved the final version of the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhang, W. Learning spatial regularized correlation filters with response consistency and distractor repression for UAV tracking. EURASIP J. Adv. Signal Process. 2023, 35 (2023). https://doi.org/10.1186/s13634-023-00998-0
DOI: https://doi.org/10.1186/s13634-023-00998-0
Keywords
 Visual object tracking
 Unmanned aerial vehicle (UAV)
 Spatial–temporal information
 Correlation filter
 Response map