Long-term tracking with transformer and template update

To address tracking failures caused by the disappearance of the target during long-term tracking, this paper proposes a long-term target tracking network based on the visual transformer and template update. First, we construct a feature extraction network based on the transformer and adopt a knowledge distillation strategy to improve the effectiveness of the network for global feature extraction. Second, in the modeling transformer, the target features are fully fused with the search area features by the encoder, and the position information in the target query is learned by the decoder. Then, target prediction is performed on the information from the encoder-decoder to obtain tracking results. Meanwhile, we design a score head to judge the validity of the dynamic template of the current frame before tracking the next frame, and we select the appropriate dynamic template for the next frame according to the score. We perform extensive experiments on the LaSOT, VOT2021-LT, TrackingNet, TLP, and UAV123 datasets, and the experimental results demonstrate the effectiveness of our method.
In particular, it exceeds STARK by 0.8% (F-score) on VOT2021-LT, 1.0% (S score) on LaSOT, and 1.1% (NP score) on TrackingNet, which also demonstrates the superiority of the method in this paper.


Introduction
Although long-term visual tracking has received significant attention as a research hotspot in visual target tracking, building a robust long-term tracking framework remains a daunting task: it is conducted in more realistic scenarios with many unresolved difficulties, particularly target disappearance and reappearance [1].
Currently, most approaches combine a CNN with a transformer [2] for long-term visual tracking and have achieved good results [3,4]. Typically, researchers extract generic features of the input image through a CNN-based backbone network. However, a CNN focuses only on feature relationships within local neighborhoods when processing image content, ignoring the impact of global information on feature extraction. To extract global information better, and inspired by the success of the transformer and its recent application in computer vision [5], this paper replaces the convolutional neural network with the transformer-based DeiT [6]. Specifically, a distillation mechanism is used in the feature extraction process and combined with a teacher model for bootstrap optimization. The advantage is that the local receptive field and parameter sharing of the CNN can be transferred into the transformer, improving the transformer model's ability to process features.
In addition, when transformer models are used to handle target disappearance and reappearance, most methods emphasize temporal information as well as global information for long-term tracking. Template update is the typical way to introduce temporal information. Without it, a tracker specifies a target template in the first frame and uses this same template throughout subsequent tracking, ignoring the changes of the target across frames. Because long-term tracking tasks run for a long time, the target is likely to disappear or deform. A fixed template cannot follow these changes, which eventually leads to tracking failure. Therefore, we propose a score prediction head that judges the effectiveness of dynamic target templates using temporal information and the target state of the current frame, so that high-quality dynamic templates can be selected flexibly to improve tracking accuracy.
To conclude, this work improves the long-term tracking model in two ways. First, we use the transformer-based DeiT to replace the original ResNet [7] as the feature extraction backbone. DeiT exploits the transformer's ability to extract features with global dependencies, reducing the accumulation of errors in the subsequent tracking process and thereby improving long-term tracking performance. Second, a score prediction head is added to the dynamic template update branch. The score token performs cross-attention with the search area and the initial template to compute a score that judges the effectiveness of the current frame's dynamic template, providing a reliable dynamic template for tracking in the next frame.
In summary, this work has three contributions.

1. A transformer-based long-term tracking model is proposed to capture globally dependent target features in video sequences, allowing extensive communication between the target and the search region.
2. An efficient score prediction head is designed to select high-quality dynamic templates, through which additional temporal information is introduced to achieve an efficient transformer-based long-term tracker.
3. Our model demonstrates strong performance on five challenging benchmarks and achieves near real-time operation at 25 FPS on dual RTX 2080Ti GPUs.

Related works
The related work in this paper covers both long-term tracking and tracking paradigms.

Long-term tracking
Since 2018, long-term trackers have developed alongside the release of long-term tracking datasets. Equipping short-term trackers with re-detectors, so that they can cope with frequent target disappearance and reappearance, is currently the mainstream approach. For example, MBMD [8] exploits a SiamRPN-based network to regress the target in a local search region, or in every sliding window during re-detection. Valmadre et al. [9] propose a long-term tracker that adds a simple re-detector to SiamFC [10], and its performance is much better than that of the original SiamFC.
In [9, 11-13], trackers are equipped with a re-detection scheme for long-term tracking, but they merely track the target in a local search region, expecting the lost target to reappear around its previous location. This approach carries a high level of risk because the output of the short-term tracker is not always reliable. To avoid this risk, GlobalTrack [14] performs a global instance search for the target in every frame, but this method incurs a high computational cost and its results are still unsatisfactory. In the same year, template update strategies pushed the overall performance of long-term tracking to a new level. UpdateNet [15] applied template matching to SiamFC and DaSiamRPN [16] to predict target locations. LTMU [17] uses SiamRPN [18] as a re-detector and a meta-updater as an online updater to predict whether the current state is reliable enough to be used for an update in long-term tracking.

Tracking paradigm
At present, popular tracking methods [3,4,19,20] mostly combine a CNN with a transformer: the CNN serves as the backbone that extracts general features of the target, the transformer, with its powerful modeling ability, fuses the target and template features, and a simple head network generates the target state. This design shows strong performance in many works. For example, TransTrack [21] takes the features extracted by the CNN as the query and key; its detection branch learns to query target locations from the key, and its tracking branch queries the location in the current frame using the object features of the previous frame. Based on DETR [22], TrackFormer [23] queries target embeddings through a set of learnable object queries; the output embeddings are used for subsequent tasks such as bounding box regression or category prediction and are passed on to the tracking of the next frame. STARK extracts common features through ResNet, passes them to a transformer that models the global spatio-temporal feature dependencies between the target and the search area, and learns query embeddings to predict the target location. We argue that although CNN-extracted common features suit most tasks, this general approach is not well suited to long-term tracking. To avoid the CNN's shortcomings in handling long-term dependencies and understanding the global structure of objects, we propose a fully transformer-based tracker that contains only encoders, decoders, and two simple heads, leading to a more accurate tracker with a neat and compact architecture.

Methods
In this section, we describe our approach in detail. In Sect. 3.1, we introduce the motivation. In Sect. 3.2, we introduce the overall tracking framework of the approach. In Sect. 3.3, we introduce the transformer-based feature extraction network. In Sect. 3.4, we introduce the transformer network for the modeling part. In Sect. 3.5, we introduce two simple heads. Section 3.6 describes the training loss.

Motivation
In recent years, the more popular long-term tracking networks have used convolutional neural networks to extract target features. However, the disadvantage of convolutional neural networks is that the convolutional kernel focuses on information in local regions and ignores the global information of the target and the frame-to-frame dependencies. Therefore, if we can improve the tracking network's ability to extract global information with long-term dependencies, the overall performance of the long-term tracker will improve. At the same time, this paper argues that temporal information is important for tracker performance. If the morphological and positional information of the target in earlier video frames is available when tracking the target in the current frame, performance should improve for problems such as target disappearance and deformation.

Long-term tracking framework
In this section, we propose the transformer network for long-term visual tracking. The network architecture is shown in Fig. 1 and is composed of four parts: a feature extraction backbone, a transformer structure that builds the feature dependency model, a head network that predicts the target position, and a head network that controls the update of dynamic templates. In Fig. 1, X represents the search region of the current frame, T represents the template image of the initial target object, and Z represents a dynamically updated template sampled from intermediate frames.

Visual transformer feature extraction network
We use the transformer-based DeiT as the backbone, which introduces a teacher-student strategy for transformers. It reduces the dependence on large amounts of data by optimizing data augmentation and regularization strategies, and it improves the running speed to a certain extent. The core of DeiT is the introduction of distillation into the training of ViT [24] and the proposal of token-based distillation. An important component of DeiT is distillation training, in which a teacher model guides DeiT to learn better feature extraction for the target. The distillation process is roughly as shown in Fig. 2: a distillation token interacts with the class token and patch tokens in the self-attention layers, and the distillation token fed into the transformer is learned through backpropagation. This training strategy uses a convolutional network as the teacher network for distillation, which achieves better results with less data and fewer computing resources than using a transformer architecture as the teacher.
Fig. 1 The proposed tracking architecture
The input of the DeiT backbone is a triplet: a template image of the initial target object $X \in \mathbb{R}^{3 \times H_x \times W_x}$, a search area of the current frame $T \in \mathbb{R}^{3 \times H_t \times W_t}$, and a dynamically updated template image $Z \in \mathbb{R}^{3 \times H_z \times W_z}$ sampled from an intermediate frame. We split each input image into patches and then linearly project each patch to obtain a sequence of patch tokens. At the same time, we prepend a class token for classification before the patch tokens and append a distillation token for distillation training after them. When DeiT is trained with a distillation strategy, a lot of erroneous information is learned from the teacher network. The distillation token is designed to handle this erroneous information: it receives the label generated by the teacher network and participates in the overall information interaction process. To preserve the spatial location information between patches, we add position embeddings that encode the location of each token. The class token, patch tokens, and distillation token, with position embeddings added, are fed into the transformer encoder for processing. The outputs are the initial template feature $F_x \in \mathbb{R}^{\frac{H_x}{s} \times \frac{W_x}{s} \times C}$, the search area feature $F_t \in \mathbb{R}^{\frac{H_t}{s} \times \frac{W_t}{s} \times C}$, and the dynamic template feature $F_z \in \mathbb{R}^{\frac{H_z}{s} \times \frac{W_z}{s} \times C}$, respectively. This work uses only the feature extraction part of DeiT, that is, the MLP layer and the subsequent parts are removed and the rest is unchanged.
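The tokenization step above can be illustrated with a small sketch that counts how many tokens one input image contributes to the backbone. The patch size of 16 is an assumption (it is not stated in the text), the function name `num_tokens` is hypothetical, and the 128 × 128 and 320 × 320 input sizes are the template and search sizes reported in the training details.

```python
def num_tokens(height, width, patch_size=16, extra_tokens=2):
    """Count tokens for one image: patch tokens plus class and distillation tokens.

    patch_size=16 is an assumed value; extra_tokens=2 accounts for the
    class token and the distillation token described in the text.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    patches = (height // patch_size) * (width // patch_size)
    return patches + extra_tokens


template_tokens = num_tokens(128, 128)  # template or dynamic template
search_tokens = num_tokens(320, 320)    # search area
```

Under these assumptions the search area dominates the sequence length, which is one reason the search and template sizes matter for running speed.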
Transformer layers in DeiT contain only encoders, which are stacked from self-attention and feed-forward network (FFN) blocks. The specific structure of the encoder is shown in Fig. 3; it consists of a multi-head self-attention (MSA), a norm layer (LN), and a multi-layer perceptron (MLP), joined through residual connections. The specific process can be represented as:
$$X'_n = \mathrm{MSA}(\mathrm{LN}(X_{n-1})) + X_{n-1}$$
$$X_n = \mathrm{MLP}(\mathrm{LN}(X'_n)) + X'_n$$
where $X'_n$ represents the output of the multi-head self-attention and $X_n$ represents the output of the MLP, and a norm layer is applied before each block.

Modeling transformer
The transformer in the modeling phase consists of an encoder and a decoder, each with 6 layers. The encoder captures dependencies between all elements in the sequence and reinforces the original features with global contextual information, which allows the model to learn discriminative features for target localization. The decoder allows the target query to attend to all positions of the template and search area features, learning a robust representation that is finally passed to the heads.
The feature groups output from DeiT are stitched together and then pass through a 1 × 1 convolutional layer that reduces the number of channels from C to d, which matches the hidden dimension of the subsequent transformer encoder-decoder structure. The flatten and concatenated operations to obtain a total feature F = Hx

Encoder
Similar to the encoder in DeiT, this encoder also consists of stacked encoder layers, each of which includes a multi-head self-attention and a feed-forward network, where the feed-forward network contains a two-layer perceptron with GELU activation.

Decoder
Similar to the encoder, the decoder includes a self-attention, an encoder-decoder attention, and a feed-forward network. The inputs of this part are the enhanced feature sequence from the encoder and a preset target query. The target query and the enhanced feature sequence interact in the decoder layers, which extract the information of the tracked target and learn a more robust representation for the subsequent bounding box prediction and dynamic template update judgment.

Bounding box prediction head
To predict the bounding box more accurately and generate a more stable prediction, we follow the corner prediction head of STARK and use a fully convolutional corner localization head to directly estimate the bounding box of the tracked object. It uses only several Conv-BN-ReLU layers to predict the top-left and bottom-right corners, respectively. Finally, the bounding box is obtained by computing the expectation of the corner probability distributions.
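The final step, taking the expectation of a corner probability distribution, can be sketched in plain Python as follows. This is a minimal illustration rather than the head used in this work: the function names are hypothetical, the score maps are assumed to be small 2D lists of raw scores, a softmax is assumed to turn scores into probabilities, and the resulting coordinates are in grid units.

```python
import math


def softmax2d(scores):
    """Turn a 2D list of raw scores into a probability map that sums to 1."""
    peak = max(max(row) for row in scores)  # subtract the max for numerical stability
    exps = [[math.exp(v - peak) for v in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]


def expected_corner(prob_map):
    """Expected (x, y) grid coordinate under a corner probability map."""
    x = sum(p * col for row in prob_map for col, p in enumerate(row))
    y = sum(p * r for r, row in enumerate(prob_map) for p in row)
    return x, y


def box_from_heatmaps(tl_scores, br_scores):
    """Bounding box (x1, y1, x2, y2) from top-left and bottom-right score maps."""
    x1, y1 = expected_corner(softmax2d(tl_scores))
    x2, y2 = expected_corner(softmax2d(br_scores))
    return x1, y1, x2, y2
```

Because the expectation is a weighted average rather than an argmax, the predicted corner varies smoothly with the scores and can fall between grid cells.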

Score head
Dynamic templates play a key role in capturing temporal information and changes in the appearance of the target. If the target is completely occluded or out of view, or if target deformation causes the model to drift, the cropped dynamic template is not trustworthy. To solve these problems, we design a score head, which is composed of a score prediction head, a dynamic template update judgment, and a simple crop operation. This head controls how the dynamic template is updated by predicting a confidence score for the current frame.
The structure of the score prediction head is shown in Fig. 4; it mainly consists of a depth-wise cross-correlation, an attention block, and an MLP. First, a learnable score token acts as a query that interacts with the decoder's output, allowing the score token to encode the extracted enhanced target information. At the same time, the score token attends to the position of the target token in the dynamic template to compare with the target state in the next frame. Finally, the score is calculated through the MLP layer and a sigmoid activation. We use the score to decide when to update the dynamic template, which prevents the generation of inferior dynamic templates with blurred or severely deformed targets. To this end, we compare the score with a threshold τ: if the score is higher than τ, the current state is considered reliable and the dynamic template is cropped from this frame; if the score is below τ, the current state is considered unreliable and the dynamic template of the previous frame is kept. Here we set τ to 0.5.
Fig. 4 Structure of the score prediction head
To focus on more local spatial information of the target, the attention block in the score prediction head uses the asymmetric mixed attention proposed by MixFormer [25]. It performs a separable depth-wise convolutional projection on each feature map (i.e., query, key, and value), then flattens each feature map and processes it through a linear projection to generate the query, key, and value for the attention operation. This mixed attention is defined as follows:
$$\mathrm{Attention}_t = \mathrm{Softmax}\left(\frac{q_t k_t^{\top}}{\sqrt{D}}\right) v_t$$
where $D$ represents the dimension of the key, $q_t$, $k_t$, and $v_t$ represent the query, key, and value of the target, and $\mathrm{Attention}_t$ is the attention map of the target.
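The attention operation at the core of this block is the standard scaled dot-product attention, Softmax(q kᵀ / √D) v. A minimal pure-Python sketch is given below under simplifying assumptions: the depth-wise convolutional projection and the asymmetric handling of target and search tokens are omitted, inputs are plain lists of vectors, and the function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention.

    q: list of n query vectors, k: list of m key vectors,
    v: list of m value vectors. Returns n output vectors.
    """
    dim = len(k[0])  # D, the dimension of the key
    outputs = []
    for q_row in q:
        logits = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(dim)
            for k_row in k
        ]
        weights = softmax(logits)
        outputs.append(
            [sum(w * v_row[j] for w, v_row in zip(weights, v))
             for j in range(len(v[0]))]
        )
    return outputs
```

Each output vector is a weighted average of the value vectors, with weights given by how strongly the query matches each key.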

Training loss
This work jointly learns localization and classification, so training is divided into two stages: localization and classification. In the first stage, the whole network except the score head is trained end-to-end, with only the L1 loss and the GIoU [26] loss supervising the bounding box predictions. The loss function $L$ of the entire framework is:
$$L = \lambda_1 L_1(B, \hat{B}) + \lambda_2 L_{GIoU}(B, \hat{B})$$
In the second stage, only the score head is optimized, with the binary cross-entropy loss defined as
$$L_{ce} = -\left[ y_i \log(P_i) + (1 - y_i) \log(1 - P_i) \right]$$
where $B$ is the ground-truth bounding box, $\hat{B}$ is the predicted box, $\lambda_1$ and $\lambda_2$ are the loss weight coefficients, set to 5 and 2 in this work, $y_i$ is the ground-truth label, and $P_i$ is the predicted confidence.
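The two training objectives can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the training code of this work: boxes are assumed to be (x1, y1, x2, y2) tuples with positive area, the L1 term is averaged over the four coordinates, the weight 5 is assumed to apply to the L1 term and the weight 2 to the GIoU term, and all function names are hypothetical.

```python
import math


def giou(box_a, box_b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest axis-aligned box that encloses both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def box_loss(gt, pred, w_l1=5.0, w_giou=2.0):
    """Stage-one loss: weighted L1 plus weighted (1 - GIoU)."""
    l1 = sum(abs(g - p) for g, p in zip(gt, pred)) / 4.0  # mean over coordinates (assumed)
    return w_l1 * l1 + w_giou * (1.0 - giou(gt, pred))


def bce(label, prob, eps=1e-7):
    """Stage-two loss: binary cross-entropy for one predicted confidence."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))
```

The GIoU term stays informative even when the predicted and ground-truth boxes do not overlap, whereas a plain IoU would be zero in that case.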
During inference, three templates and corresponding features are initialized in the first frame, and they are fed into the network to generate a bounding box and a confidence score. The dynamic template is updated only when the update interval is reached and the confidence score is greater than the threshold τ. To improve efficiency, we set the update interval to 200 frames. The new template is cropped from the original image and then fed in as the new dynamic template image for feature extraction.
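The update rule described above can be written as a short sketch. It assumes that "the update interval is reached" means the frame index is a positive multiple of the interval, which is one possible reading; the function names are hypothetical, and the defaults of 200 frames and τ = 0.5 are the values given in the text.

```python
def should_update(frame_idx, score, interval=200, tau=0.5):
    """True when the dynamic template should be replaced at this frame."""
    on_interval = frame_idx > 0 and frame_idx % interval == 0
    return on_interval and score > tau


def update_frames(scores, interval=200, tau=0.5):
    """Frame indices at which the dynamic template would be replaced."""
    return [i for i, s in enumerate(scores)
            if should_update(i, s, interval, tau)]
```

When the score falls below the threshold at an interval frame, the previous dynamic template is simply kept until the next interval.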

Experimental results and discussion
In this section, we evaluate the proposed method on five benchmarks and compare it with other advanced tracking networks. Section 4.1 introduces the relevant details of the experiments. Section 4.2 presents the results of the quantitative evaluation of the 5 benchmarks, including the experimental datasets and the evaluation criteria of the experiments. Section 4.3 presents the ablation experiments, analyzing the results of the qualitative evaluation, and Sect. 4.4 describes the visualization results, providing visualizations of the LaSOT datasets to demonstrate the superiority of our model.

Implementation details
Our tracker is implemented using Python 3.6 and PyTorch 1.7.0. The experiments are conducted on a server with GeForce RTX 2080 Ti/PCIe/SSE2. Notably, this is a neat tracker without post-processing or multi-layer feature aggregation strategies.

Model
The backbone is initialized with parameters pre-trained for 300 epochs on ImageNet with a distillation strategy. The transformer of the backbone has only an encoder and no decoder, with 12 heads and 12 layers. The transformer structure of the modeling section consists of 1 encoder layer and 1 decoder layer, for a total of 6 transformers, including a multi-head attention layer (MSA) and a feed-forward network (FFN). The MSA has 8 heads with a width of 256, while the FFN has 2048 hidden units.

Training
The training data consists of the training splits of LaSOT [27], GOT-10K [28], COCO2017 [29], and TrackingNet [30]. The sizes of search images and templates are 320 × 320 pixels and 128 × 128 pixels, respectively, corresponding to $5^2$ and $2^2$ times the target bounding box area. The minimal training data unit for our model is a triplet consisting of two templates and one search image. The entire training process consists of two stages: the first stage requires 500 epochs and the second stage requires 50 epochs, each with 6000 samples. The network is optimized using the AdamW optimizer with a weight decay of $10^{-4}$. The initial learning rates for the backbone and the rest of the network are $10^{-5}$ and $10^{-4}$, respectively. To stabilize training and ensure convergence, gradient clipping and learning rate decay are adopted: the learning rate is reduced by a factor of 10 after the 400th epoch in the first stage and after the 40th epoch in the second stage.
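The step decay described above can be sketched as a small helper. It assumes 1-indexed epochs, so that epochs after the 400th (or 40th) use the reduced rate, and it assumes the same base learning rates apply in both stages, which the text does not state; the function names are hypothetical.

```python
def lr_multiplier(epoch, stage=1):
    """Learning-rate multiplier for a 1-indexed epoch in the given stage."""
    drop_epoch = 400 if stage == 1 else 40
    return 0.1 if epoch > drop_epoch else 1.0


def learning_rates(epoch, stage=1, backbone_lr=1e-5, rest_lr=1e-4):
    """Per-group learning rates after applying the step decay."""
    scale = lr_multiplier(epoch, stage)
    return {"backbone": backbone_lr * scale, "rest": rest_lr * scale}
```

Using a lower rate for the pre-trained backbone than for the newly initialized layers is a common way to avoid disturbing the pre-trained features early in training.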

LaSOT
LaSOT is a large-scale long-term tracking benchmark that contains 1400 videos with an average length of 2512 frames. It has 70 target categories with 20 videos in each category, covering a variety of challenges in the field. The test set comprises 20% of the videos, 280 in total. Fig. 5 shows that our model surpasses all other trackers by a large margin. Compared to STARK-ST101, SiamRCNN, and LTMU, our model achieves gains of 1.0%, 2.8%, and 7.6% on the LaSOT test set, respectively.
To verify the effectiveness of our method for different attributes, detailed results of the success rate of eight typical difficulties on the LaSOT dataset are provided in Figs. 6 and 7, including fast motion (FM), background clutter (BC), motion blur (MB), deformation (DEF), illumination change (IV), partial occlusion (POC), out-of-field (OV), and scale change (SV).
As shown by the results in Figs. 6 and 7, the adaptability of the tracking network becomes necessary in the case of deformation and out-of-field. The method proposed in this paper can effectively establish the long-term dependence of features, and accurately utilize the historical feature information in the case of target deformation or target disappearance. At the same time, our method also has some performance improvement in occlusion and scale change, thanks to the transformer feature extraction network's ability to obtain the most representative feature vectors. In the case of motion blur, the tracking networks need to be able to perform accurate target tracking in low-resolution video frames. Our method compensates for the effect of motion blur on the tracker to some extent and improves the performance of the tracker. However, the method in this paper seems to be ineffective in the case of background clutter and illumination changes, probably because too much background information is introduced during feature extraction and dynamic template update to affect the judgment of the tracker.

TrackingNet
TrackingNet is a video dataset for target tracking, carefully selected from large-scale object detection datasets. It contains a total of 30,643 videos with an average duration of 16.6 s; the test set includes 511 videos and 70 target categories. Table 1 shows that our model surpasses all other models by a large margin. Specifically, our model achieves the top-ranked NP of 88.0%, surpassing STARK by 1.1%.

UAV123
UAV123 includes 123 videos captured by a low-altitude drone platform, with a clean background and a wide variation in viewing angle, averaging 915 frames. The dataset includes challenges such as invisible targets, complete occlusion, and small target scale, which require the tracker to learn quickly and to extract global information. Table 1 shows our results on the UAV123 dataset. Our model outperforms all other models.
TrackingNet and UAV123 datasets adopt the one-pass evaluation (OPE) strategy, and the evaluation indicators are Precision (P), Normalized Precision (NP), and Success (S). The precision is calculated by comparing the distance between the predicted result and the ground truth, and the success rate is calculated by measuring the Intersection over Union (IoU) between the two. Since the precision is very sensitive to the target size and

TLP
This is a long-term single-target tracking dataset that includes 50 long HD videos from real scenes with an average sequence length of over 13,000 frames and a total video duration of over 400 minutes. Table 2 reports the AUC and precision scores on the TLP dataset. Our method has some gains compared to other trackers. For example, our method achieves 0.7% and 1.8% AUC gain compared to STARK-ST101 and LTMU, two methods that use template updates, respectively.

VOT2021-LT
The VOT2021-LT dataset contains 50 videos with 215,294 frames in total, in which target objects disappear and reappear frequently. The evaluation measures mainly include tracking precision (Pr), tracking recall (Re), and tracking F-score. Precision and recall are computed under a series of confidence thresholds. The F-score, defined as $F = \frac{2\,\mathrm{Pr}\,\mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}}$, is used to rank different trackers. We compare our model with currently popular trackers and report the evaluation results in Table 3. It can be seen that the F-score of our model is 70.3, which is better than all previous methods.
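The F-score above is the harmonic mean of precision and recall, which can be computed as in the following sketch. The function name is hypothetical, and returning 0 when both inputs are 0 is a convention chosen here to avoid division by zero.

```python
def f_score(precision, recall):
    """Harmonic mean of tracking precision and tracking recall."""
    if precision + recall == 0:
        return 0.0  # convention chosen here for the degenerate case
    return 2.0 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, the F-score is pulled toward the lower of the two values, so a tracker cannot rank well by maximizing precision while ignoring recall, or the reverse.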

Ablation study
In this section, we use the LaSOT dataset to perform an ablation analysis of our model. We run four experiments with different tracker variants, namely "Baseline," "Model1," "Model2," and "Ours." (1) "Baseline" denotes the STARK-ST101 model. (2) "Model1" denotes a model that uses DeiT as the feature extraction network. (3) "Model2" denotes a model that adds the score prediction head to STARK-ST101. (4) "Ours" denotes the long-term tracking network that uses both the DeiT feature extraction network and the score prediction head.
The results of the different variants on the LaSOT dataset are shown in Table 4, and their speed, FLOPs, and Params are shown in Table 5, from which we draw the following conclusions. (1) "Model1" achieves an S of 67.3, which is competitive with other state-of-the-art methods (shown in Fig. 5) and verifies the applicability of the transformer to long-term tracking tasks. However, it also has a corresponding flaw: the large number of parameters of the transformer model reduces the running speed. (2) Comparing "Model2" with "Baseline," we conclude that the score prediction head proposed in this work improves long-term performance to some extent, which also means that reliable dynamic templates bring better gains to the tracker. (3) Comparing "Ours" with "Model1" and "Model2" shows that both the transformer backbone and the score prediction head that controls template updates are indispensable throughout the tracking process.

Visualization
To show the actual tracking behavior of different networks in complex situations such as target disappearance, deformation, and scale change, we select video sequences (fox, motorcycle, skate, racing, and volleyball) from LaSOT, a large real-world scene tracking dataset, and visualize the tracking results in Fig. 8.
As shown in Fig. 8, the sequences from top to bottom are the "fox," "motorcycle," "skate," "racing," and "volleyball" sequences, respectively. In the "motorcycle" and "racing" sequences, the targets disappear and reappear frequently, and the proposed algorithm still obtains accurate target state estimates and high-quality tracking results thanks to the combined effect of the transformer and the template update mechanism. When the target disappears, STARK-ST101, SiamRCNN, and LTMU all drift easily to other similar targets and fail to recover the correct target in time when it reappears. In the "fox" and "skate" sequences, the targets undergo severe scale changes, occlusion, and deformation; STARK-ST101 and other trackers cannot obtain accurate target state estimates and fail, while our tracker tracks the target accurately and handles these situations better. It is worth noting that in the "volleyball" sequence, which combines fast motion, background clutter, and small targets, our tracker makes tracking errors. Specifically, when fast motion, background clutter, and small targets appear in the same video at the same time, the dynamic template strategy in our approach introduces too much background information, which affects the tracker's judgment and leads to tracking failure. In this case, SiamRCNN performs better than the method in this paper because it does not use a complex template update strategy.

Conclusion
This paper proposes a new transformer-based long-term tracking framework. We improve the existing transformer-based tracking network and enhance the feature extraction ability by introducing the visual transformer based on the attention mechanism as the backbone network. A score prediction head is designed to control dynamic template updating using asymmetric mixed attention and score token interactive learning. Experimental results show that our model designed in this work is better than the current mainstream tracking networks on five long-term tracking benchmarks. As there