DCNN + CF combines the complementary advantages of correlation filters and deep neural networks. On the one hand, it exploits the speed of the correlation filter to ensure real-time tracking; on the other hand, the deep convolutional neural network has powerful feature representation ability, which characterizes the target more accurately and ensures tracking accuracy. However, introducing deep learning into the correlation filter inevitably increases the computational burden, making it difficult to balance tracking accuracy and speed. In addition, in the actual tracking process, interference factors such as environmental illumination, target scaling, and target rotation make it difficult for the features extracted by the offline pre-trained network to accurately describe the disturbed target. In other words, the offline pre-trained feature network is not robust enough. In such a case, when the extracted features are fed to the subsequent tracking filter, it is difficult to accurately predict the position of the target, which inevitably results in target position drift of varying degrees and degrades the tracking accuracy to some extent. In the case of severe drift, the predicted position of the target is likely to move out of the tracking window, resulting in tracking failure.
To address these issues, inspired by [37, 51, 52], we regard the channels of the feature network that contribute most to the output response of the tracking system as the important channels, which play a decisive role in that response. All of the important channels are retained while the remaining channels are pruned to lighten the feature network, decrease the computational load, and enhance real-time performance. To this end, channel importance is defined as the decision criterion for selecting important channels. Based on this motivation, we first define the importance score of a channel from the perspective of the nature of the correlation filter (its Gaussian output response), and then use the importance score as the decision criterion for selecting important channels to generate a lightened feature network, enhancing the tracking speed while maintaining tracking accuracy. Furthermore, owing to its global statistical distribution characteristics, SSIM copes well with many kinds of interference factors; more importantly, SSIM conforms to human visual perception. Therefore, in this paper SSIM is employed to calculate the similarity of two consecutive frames of the video sequence, and it serves as the decision criterion for updating the feature network and for detecting and recovering from tracking failure.
In summary, we propose a deep correlation filter tracking model based on channel importance; its overall structure is shown in Fig. 1.
The overall structure is divided into five functional modules. The first module, initializing the filter and computing channel importance, has two functions: one is to initialize the tracking filter using the first frame of the video sequence; the other is to calculate the importance score of each channel using the output response of the filter correlated with the target feature maps of the second frame, and then to select the important channels based on these scores to generate the lightened feature network. The second module, normal tracking, is in charge of normal tracking. The third module, updating the filter, is responsible for updating the tracking filter online. The fourth module, online training, is responsible for training the feature network online. The fifth module, recovery, is in charge of tracking failure detection and tracking recovery.
To support these functions, two thresholds \({TH}_{1}\) and \({TH}_{2}\) (\({TH}_{1}>{TH}_{2}\)) are set to define different tracking states. During the tracking process, the SSIM value of the target boxes of two consecutive frames is calculated. When SSIM is greater than \({TH}_{1}\), normal tracking is performed. When SSIM falls within the range \(\left({TH}_{2},{TH}_{1}\right)\), the network is updated while tracking. When SSIM is less than \({TH}_{2}\), the tracking is judged to have failed and re-tracking is performed. A detailed description of each functional module is given in the following sections.
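For concreteness, this three-state decision rule can be written as a simple dispatch function. The sketch below is illustrative only (the function and state names are ours, not the authors' implementation); it assumes Python and SSIM values in [0, 1]:

```python
def tracking_state(ssim, th1, th2):
    """Map the SSIM of two consecutive target boxes to a tracking state.

    Assumes th1 > th2 (see Sect. 3.3.2); names are illustrative.
    """
    if ssim > th1:
        return "normal_tracking"         # track; update the filter only
    elif ssim > th2:
        return "update_feature_network"  # track while fine-tuning the network
    else:
        return "re_tracking"             # tracking failure: search the neighborhood
```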
3.1 Correlation filter tracking
3.1.1 Initialize the filter
Suppose the size and position of the tracking box \(\hbox{T}\) centered on the target are given in the first frame, and the ideal Gaussian response corresponding to this tracking box is \(\hbox{G}\in {R}^{M\times N}\), with its peak corresponding to the position of the tracking target. Furthermore, suppose that the feature network has been trained offline. The number of convolution kernels (i.e., the number of channels) of the output layer is D, the size of each convolution kernel is \(m\times n\), and the output feature map is denoted by \(\varphi \left(T\right)\) with size \(M\times N\) per channel, i.e., \(\varphi \left(T\right)\in {R}^{M\times N\times D}\). According to the literature [24, 29, 33, 53, 54], the tracking filter can be initialized as:
$${W}^{l}=\frac{{\widehat{\varphi }}^{l}\left(T\right)\odot {\widehat{G}}^{*}}{\sum_{l=1}^{D}{\widehat{\varphi }}^{l}(T)\odot ({{\widehat{\varphi }}^{l}(T))}^{*}+\lambda }\quad \left(l=1, 2,\ldots D\right)$$
(1)
Here, \(\odot\) denotes the Hadamard product, * denotes the complex conjugate, \(\wedge\) denotes the Fourier transform, and \({W}^{l}\) is the correlation filter of the lth feature channel. \({\widehat{\varphi }}^{l}(T)\) is the Fourier transform of the feature output by the lth feature channel, \({\widehat{G}}^{*}\) is the complex conjugate of the Fourier transform of the ideal Gaussian response G, and the constant \(\lambda \ge 0\) is the regularization coefficient.
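As a minimal sketch of Eq. (1), assuming NumPy, feature maps \(\varphi \left(T\right)\) stored as an array of shape (M, N, D), and an ideal Gaussian response G of shape (M, N); all names and the default \(\lambda\) are illustrative:

```python
import numpy as np

def init_filter(phi, G, lam=1e-4):
    """Initialize one correlation filter per channel (Eq. 1).

    phi : (M, N, D) feature maps of the target box T
    G   : (M, N) ideal Gaussian response centered on the target
    lam : regularization coefficient (lambda >= 0, an assumed default)
    Returns W_hat : (M, N, D) filters in the Fourier domain.
    """
    phi_hat = np.fft.fft2(phi, axes=(0, 1))  # per-channel 2-D FFT
    G_hat = np.fft.fft2(G)
    # The denominator is shared across channels: sum of per-channel power spectra
    denom = np.sum(phi_hat * np.conj(phi_hat), axis=2).real + lam
    return phi_hat * np.conj(G_hat)[..., None] / denom[..., None]
```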
3.1.2 Target location prediction
In the tracking process, a search box S centered on the target with the same size as T is constructed. The feature network is applied to S, and the extracted features are denoted as \({\varphi }^{l}\left(S\right) \left(l=1, 2, \ldots, D \right)\). Then the tracking response \({\left({r}_{i,j}\right)}_{M\times N}\) can be calculated by the following equations [54]:
$${g}^{l}={\left({\widehat{W}}^{l}\right)}^{*}\odot {\widehat{\varphi }}^{l}\left(S\right)\quad \left(l=1,2,\ldots D\right)$$
(2)
$${\left({r}_{i,j}\right)}_{M\times N}={\mathcal{F}}^{-1}\left(\sum_{l=1}^{D}{g}^{l}\right)$$
(3)
Here, \({g}^{l}\) represents the frequency output response of the lth tracking filter, \({\mathcal{F}}^{-1}\) represents the inverse Fourier transform, \({r}_{i,j}\in {R}^{M\times N}\) represents the total output response of all channels. Based on Eqs. (2) and (3), the coordinates \(\left(x,y\right)\) of the predicted position of the target can be obtained as:
$$\left(x,y\right)=\underset{\left(i,j\right)}{\mathrm{arg}\,\mathit{max}}\,{\left({r}_{i,j}\right)}_{M\times N}$$
(4)
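Eqs. (2)–(4) can be sketched in the same vein, reusing the Fourier-domain filters W_hat returned by the initialization sketch above; again this is an illustrative implementation, not the authors' code:

```python
import numpy as np

def predict_position(W_hat, phi_S):
    """Predict the target position from a search box S (Eqs. 2-4).

    W_hat : (M, N, D) Fourier-domain filters from init_filter above
    phi_S : (M, N, D) feature maps extracted from the search box S
    Returns (x, y): coordinates of the peak of the total response.
    """
    phi_hat = np.fft.fft2(phi_S, axes=(0, 1))
    g = np.conj(W_hat) * phi_hat                    # Eq. (2), per channel
    r = np.fft.ifft2(g.sum(axis=2)).real            # Eq. (3), total response
    return np.unravel_index(np.argmax(r), r.shape)  # Eq. (4), arg max
```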
3.1.3 On-line filter update
During the tracking process, the feature map \({\varphi }_{t-1}^{l}\left(S\right)\) of the target in frame \(t-1\) and the filter \({W}_{t-1}^{l}\) are employed to predict the position \(\left({x}_{t},{y}_{t}\right)\) of the target in frame t using Eqs. (2), (3), and (4). In frame t, a search box \({S}_{t}\) centered on \(\left({x}_{t},{y}_{t}\right)\) with the same size as T is constructed, and the ideal Gaussian response corresponding to \({S}_{t}\) is configured as \({G}_{t}\in {R}^{M\times N}\). The feature network is applied to \({S}_{t}\), and the extracted feature is denoted as \({\varphi }_{t}^{l}\left({S}_{t}\right)\). The filter \({W}_{t}^{l}\) is updated online by the following equation [54]:
$${W}_{t}^{l}=\frac{{\widehat{\varphi }}_{t}^{l}\left({S}_{t}\right)\odot {\left(\widehat{{G}_{t}}\right)}^{*}}{\sum_{l=1}^{D}{\widehat{\varphi }}_{t}^{l}({S}_{t})\odot ({{\widehat{\varphi }}_{t}^{l}({S}_{t}))}^{*}+\lambda }\quad \left(l=1, 2,\ldots D\right)$$
(5)
Here, \({\widehat{\varphi }}_{t}^{l}({S}_{t})\) represents the Fourier transform of the feature output \({\varphi }_{t}^{l}\left({S}_{t}\right)\) of the lth feature channel, and \({\left(\widehat{{G}_{t}}\right)}^{*}\) represents the complex conjugate of the Fourier transform of the ideal Gaussian response \({G}_{t}\).
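Since Eq. (5) has exactly the same form as Eq. (1), an implementation can simply reuse the initialization sketch above on the search-box features and the re-centered Gaussian response; a trivial wrapper, under the same assumptions:

```python
def update_filter(phi_St, G_t, lam=1e-4):
    """Online filter update (Eq. 5): same form as Eq. (1), computed on the
    features of S_t and the Gaussian G_t re-centered on the new prediction."""
    return init_filter(phi_St, G_t, lam)
```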
3.2 Important feature channel selection
3.2.1 Definition of channel importance
The output layer of the feature network has D output channels and outputs D feature maps. According to Eqs. (2) and (3), each feature map is correlated with its corresponding tracking filter, and the output of the filter is an approximately Gaussian response. Some typical outputs of the filters are shown in Fig. 2.
According to Eqs. (3) and (4), the final output response of the tracking system is the superposition of the output responses of all channels, and the peak value of the final output response corresponds to the predicted position of the target. The intuitive explanation of the whole process is shown in Fig. 3.
In terms of Eqs. (1)–(3), and from Figs. 2 and 3, the final output response of the tracking system can be regarded as an approximate two-dimensional Gaussian response. Therefore, the peak of this response corresponds to the predicted position \(\left(x,y\right)\) of the target. According to the characteristics of the two-dimensional normal distribution function [60], about 98% of the probability mass is concentrated in a rectangular box Q with a length of \(2\times 2.58{\sigma }_{x}\) and a width of \(2\times 2.58{\sigma }_{y}\), centered on \(\left({\mu }_{x},{\mu }_{y}\right)\), as shown in Fig. 4. Here, \({\mu }_{x}\) and \({\mu }_{y}\) are the means of the two-dimensional normal distribution, and \({\sigma }_{x}\) and \({\sigma }_{y}\) are its standard deviations. Obviously, \(\left(x,y\right)=\left({\mu }_{x},{\mu }_{y}\right)\).
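This coverage figure is easy to check numerically; the short script below, assuming SciPy, computes the probability mass of the axis-aligned box \(\pm 2.58\sigma\) for a bivariate normal with independent components:

```python
from scipy.stats import norm

k = 2.58
p1d = norm.cdf(k) - norm.cdf(-k)  # 1-D mass in [-k*sigma, +k*sigma]: ~0.9901
p_box = p1d ** 2                  # box Q for independent x, y components: ~0.9803
print(p1d, p_box)                 # about 99% per axis, about 98% for the box
```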
The final output response of the tracking system is the superposition of the output responses of all channels. However, as can be seen from Figs. 2 and 3, the channels contribute very differently to the final output response because the statistical distributions of their output responses differ. The larger the portion of a channel's output response that falls within the rectangular box Q, the greater its contribution to the total response; in other words, the more important the channel. It is therefore natural to retain the channels with larger contributions as the important ones and prune the rest to lighten the feature network.
Thus, in terms of the contribution of a channel to the final output response of the tracking system, the channel importance is defined as follows.
The sum \({\hbox{center}}^{l}\) of the output response of channel l falling into the rectangular box Q is defined as:
$$\left.\begin{array}{l}{\hbox{center}}^{l}=\sum_{i}\sum_{j}{\mathcal{F}}^{-1}\left({g}_{i,j}^{l}\right) \\ i\in \left({x}-2.58{\sigma }_{x},x+2.58{\sigma }_{x}\right),j\in \left(y-2.58{\sigma }_{y},y+2.58{\sigma }_{y}\right)\end{array}\right\}$$
(6)
The total sum \({\hbox{around}}^{l}\) of the output response of channel l over the whole response map (both inside and outside the rectangular box Q) is defined as:
$${\hbox{around}}^{l}=\sum_{i=1}^{M}\sum_{j=1}^{N}{\mathcal{F}}^{-1}\left({g}_{i,j}^{l}\right)$$
(7)
The importance score of channel l is then defined as:
$${\hbox{score}}^{l}=\frac{{\hbox{center}}^{l}}{{\hbox{around}}^{l}} \quad \left(l= 1, 2, \ldots, D\right)$$
(8)
The higher \({\hbox{score}}^{l}\) is, the more the channel response contributes to the final response, and the more important the channel is. The scores \({\hbox{score}}^{l}\) are arranged in descending order, and the channels corresponding to the first k scores are taken as the important channels for subsequent tracking.
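A sketch of Eqs. (6)–(8), assuming NumPy, the per-channel frequency responses \({g}^{l}\) from Eq. (2) stacked into one array, and the fitted parameters \(\left(x,y,{\sigma }_{x},{\sigma }_{y}\right)\) from Sect. 3.2.2; the choice of k is left to the caller:

```python
import numpy as np

def select_important_channels(g, x, y, sigma_x, sigma_y, k):
    """Score channels by the fraction of their response inside box Q
    (Eqs. 6-8) and return the indices of the k most important ones.

    g : (M, N, D) per-channel frequency responses from Eq. (2)
    """
    r = np.fft.ifft2(g, axes=(0, 1)).real  # per-channel spatial responses
    M, N, _ = r.shape
    i0, i1 = max(0, int(x - 2.58 * sigma_x)), min(M, int(x + 2.58 * sigma_x) + 1)
    j0, j1 = max(0, int(y - 2.58 * sigma_y)), min(N, int(y + 2.58 * sigma_y) + 1)
    center = r[i0:i1, j0:j1, :].sum(axis=(0, 1))  # Eq. (6)
    around = r.sum(axis=(0, 1))                   # Eq. (7)
    score = center / around                       # Eq. (8)
    return np.argsort(score)[::-1][:k]            # first k scores, descending
```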
3.2.2 Computing of \({\sigma }_{x}\) and \({\sigma }_{y}\)
Taking \({\sigma }_{y}\) as an example, this subsection describes the procedure for estimating the spread parameters of the two-dimensional Gaussian response.
The peak position \(\left(x,y\right)\) of the final response can be regarded as the mean of the total response, i.e., \(\left(x,y\right)=\left({\mu }_{x},{\mu }_{y}\right)\), and the spread of the final response is characterized by \(\left({\sigma }_{x},{\sigma }_{y}\right)\). Projecting the final response \({\left({r}_{i,j}\right)}_{M\times N}\) onto the YOZ plane, as shown in Fig. 5a, is approximately equivalent to taking the slice \(\left({\left({r}_{i,j}\right)}_{M\times N}|i=x\right)\); the result is shown in Fig. 5b. A one-dimensional Gaussian function can then be fitted to the projection curve of Fig. 5b, as shown in Fig. 5c.
Let \(f\left({y}_{j}\right)={r}_{i,j}|i=x, j=1, 2, \ldots, N\). The one-dimensional Gaussian fitting function is \(f\left({y}_{j}\right)=\frac{1}{\sqrt{2\pi }{\sigma }_{y}}{e}^{-\frac{{\left({y}_{j}-y\right)}^{2}}{2{\sigma }_{y}^{2}}}\), where y is the known mean and \({\sigma }_{y}\) is the parameter to be estimated. First, construct the likelihood function [61]:
$$L\left({\sigma }_{{y}}^{2}\right)=\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi }{\sigma }_{y}}{e}^{-\frac{{\left({y}_{i}-y\right)}^{2}}{2{\sigma }_{y}^{2}}}={\left(2\pi {\sigma }_{{y}}^{2}\right)}^{-\frac{N}{2}}{e}^{-\frac{1}{2{\sigma }_{{y}}^{2}}\sum_{i=1}^{N}{\left({y}_{i}-y\right)}^{2}}$$
(9)
Take the logarithm on both sides of Eq. (9) to get:
$$\hbox{ln} L\left({\sigma }_{{y}}^{2}\right)=-\frac{N}{2}\hbox{ln}\left(2\pi \right)-\frac{N}{2}\hbox{ln}\left({\sigma }_{{y}}^{2}\right)-\frac{1}{2{\sigma }_{{y}}^{2}}\sum_{i=1}^{N}{\left({y}_{i}-y\right)}^{2}$$
(10)
Setting the derivative of Eq. (10) with respect to \({\sigma }_{{y}}^{2}\) to zero gives:
$$\frac{\partial \hbox{ln}L\left({\sigma }_{{y}}^{2}\right)}{\partial {\sigma }_{y}^{2}}=-\frac{N}{{\sigma }_{{y}}^{2}}+\frac{1}{{2\left({\sigma }_{{y}}^{2}\right)}^{2}}\sum_{i=1}^{N}{\left({{y}}_{i}-y\right)}^{2}=0$$
(11)
The estimated value of \({\sigma }_{{y}}^{2}\) can be solved by Eq. (11):
$${\overline{\sigma }}_{{y}}^{2}=\frac{1}{N}\sum_{i=1}^{N}{\left({y}_{i}-y\right)}^{2}$$
(12)
Then, the one-dimensional Gaussian fitting function is \(f\left({y}_{j}\right)=\frac{1}{\sqrt{2\pi }{\overline{\sigma }}_{y}}{e}^{-\frac{{\left({y}_{j}-y\right)}^{2}}{2{\overline{\sigma }}_{{y}}^{2}}}\), as shown in Fig. 5c.
For computing \({\sigma }_{{x}}\), the final response \({\left({r}_{i,j}\right)}_{M\times N}\) is projected onto the XOZ plane, as shown in Fig. 6a; this is approximately equivalent to taking the slice \(\left({\left({r}_{i,j}\right)}_{M\times N}|j=y\right)\), and the result is shown in Fig. 6b. A one-dimensional Gaussian function is fitted to the projection curve of Fig. 6b in the same way as for \({\sigma }_{{y}}\); the result is shown in Fig. 6c.
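The estimator of Eq. (12) treats the projection curve as if its values were Gaussian samples. A common implementation choice, which we assume in the sketch below, is instead to normalize the slice through the peak into a discrete distribution over coordinates and take its weighted second moment about the peak; this is our reading, not a step spelled out in the paper:

```python
import numpy as np

def fit_sigmas(r, x, y):
    """Estimate (sigma_x, sigma_y) from the central slices of the final
    response (Sect. 3.2.2); each slice through the peak is normalized into
    a discrete distribution and its second moment about the peak is taken,
    mirroring the closed form of Eq. (12).
    """
    M, N = r.shape
    row = np.clip(r[x, :], 0, None)  # slice i = x (YOZ projection)
    col = np.clip(r[:, y], 0, None)  # slice j = y (XOZ projection)
    jj, ii = np.arange(N), np.arange(M)
    sigma_y = np.sqrt(np.sum(row / row.sum() * (jj - y) ** 2))
    sigma_x = np.sqrt(np.sum(col / col.sum() * (ii - x) ** 2))
    return sigma_x, sigma_y
```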
In addition, if there are similar objects in the image but not within the current search window, the overall performance of the tracking system is unaffected. If similar objects appear within the current search window, there may be more than one peak in Figs. 5 and 6, which would affect the overall performance. However, such a case is not discussed here because this paper only addresses single-target tracking.
3.3 On-line feature network update
The design of the feature network generally adopts an offline training and online fine-tuning strategy. In actual tracking applications, the target is very likely to encounter a variety of interference factors in the video sequence, such as changes in ambient lighting, target scaling, and target rotation. Therefore, the target features extracted by the offline-trained network cannot accurately describe the disturbed target, making it difficult to accurately predict the target position and restricting the performance of the tracking system. To solve this problem, many feature-network update strategies have been proposed around the similarity of targets in consecutive frames. Typical strategies include fully convolutional similarity learning, the peak signal-to-noise ratio (PSNR) method, and the SSIM method. Fully convolutional similarity learning takes a learned similarity function as the similarity criterion, traverses all possible positions of the target, and takes the candidate position with the largest similarity value as the final predicted position [34]. The traversal computation harms real-time performance. A further limitation is that the candidate with the largest similarity value is not necessarily truly similar (for example, the maximum normalized similarity value may be no greater than 0.5). PSNR is a widely used objective evaluation index based on error sensitivity. However, since it does not take the visual characteristics of the human eye into account, its evaluation results are not guaranteed to be consistent with perceived visual quality. Natural images are highly structured: there are strong correlations among the pixels of an image, and these correlations carry important information about the structure of objects in the visual scene. The Laboratory for Image and Video Engineering at the University of Texas at Austin proposed SSIM to measure the structural similarity of two images [55]. Taking perceptual changes in image structure information into account, SSIM measures image similarity in terms of luminance, contrast, and structure, and outperforms PSNR in evaluating image similarity [56]. In addition, SSIM has global statistical distribution characteristics, copes well with a variety of interference factors, and conforms to human visual perception. Therefore, in this paper SSIM is selected as the criterion to calculate the similarity of two consecutive frames of the video sequence and to make decisions on the update of the feature network and on the failure detection and recovery of the tracking network.
3.3.1 SSIM similarity criterion
Assuming that the two input images are a and b, SSIM is defined as [55]:
$$\hbox{SSIM}\left(a,b\right)={\left[l\left(a,b\right)\right]}^{\alpha }{\left[c\left(a,b\right)\right]}^{\beta }{\left[s\left(a,b\right)\right]}^{\gamma }$$
(13)
$$l\left(a,b\right)=\frac{2{\mu }_{a}{\mu }_{b}+{C}_{1}}{{\mu }_{a}^{2}+{\mu }_{b}^{2}+{C}_{1}}$$
(14)
$$c\left(a,b\right)=\frac{2{\sigma }_{a}{\sigma }_{b}+{C}_{2}}{{\sigma }_{a}^{2}+{\sigma }_{b}^{2}+{C}_{2}}$$
(15)
$$s\left(a,b\right)=\frac{{\sigma }_{ab}+{C}_{3}}{{\sigma }_{a}{\sigma }_{b}+{C}_{3}}$$
(16)
Here, \(\alpha >0\), \(\beta >0\), and \(\gamma >0\). \(l\left(a,b\right)\) is the luminance comparison, \(c\left(a,b\right)\) the contrast comparison, and \(s\left(a,b\right)\) the structure comparison. \({\mu }_{a}\) and \({\mu }_{b}\) are the means of images a and b, \({\sigma }_{a}\) and \({\sigma }_{b}\) their standard deviations, and \({\sigma }_{ab}\) their covariance. \({C}_{1}\), \({C}_{2}\), and \({C}_{3}\) are small constants that prevent the denominators from being zero.
In actual calculations, one generally takes \(\alpha =\beta =\gamma =1\) and \({C}_{3}={C}_{2}/2\). Substituting these parameters into Eqs. (13)–(16) and simplifying yields [55]:
$$\hbox{SSIM}\left(a,b\right)=\frac{\left(2{\mu }_{a}{\mu }_{b}+{C}_{1}\right)\left(2{\sigma }_{ab}+{C}_{2}\right)}{\left({\mu }_{a}^{2}+{\mu }_{b}^{2}+{C}_{1}\right)\left({\sigma }_{a}^{2}+{\sigma }_{b}^{2}+{C}_{2}\right)}$$
(17)
Choosing an appropriate threshold \({\hbox{TH}}_{1}\), the SSIM similarity criterion is:
$$\left.\begin{array}{ll}\hbox{SSIM}\left(a,b\right)\ge {\hbox{TH}}_{1},&\quad a\, \hbox{similar to} \,b \\ \hbox{SSIM}\left(a,b\right)<{\hbox{TH}}_{1},&\quad a \,\hbox{not similar to}\, b\end{array}\right\}$$
(18)
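A direct sketch of Eqs. (17) and (18) on two equally sized grayscale patches, assuming NumPy and the constants \({C}_{1}={\left({K}_{1}L\right)}^{2}\), \({C}_{2}={\left({K}_{2}L\right)}^{2}\) of [55] with dynamic range L; note that SSIM is computed here globally over the whole patch, whereas [55] also defines a locally windowed variant:

```python
import numpy as np

def ssim(a, b, L=255.0, K1=0.01, K2=0.03):
    """Global SSIM of two equally sized grayscale patches (Eq. 17)."""
    a, b = a.astype(np.float64), b.astype(np.float64)
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + C1) * (2 * cov + C2)) / \
           ((mu_a ** 2 + mu_b ** 2 + C1) * (var_a + var_b + C2))

def is_similar(a, b, th1):
    """SSIM similarity criterion (Eq. 18)."""
    return ssim(a, b) >= th1
```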
3.3.2 Strategy for updating feature network
Suppose the position of the tracking target in frame \(t-1\) is \(\left({x}_{t-1},{y}_{t-1}\right)\), and the predicted position of the target in frame t is \(\left({x}_{t},{y}_{t}\right)\). Two rectangular boxes \({S}_{t-1}\) and \({S}_{t}\), both with the same size as T, are constructed, centered at \(\left({x}_{t-1},{y}_{t-1}\right)\) in frame \(t-1\) and at \(\left({x}_{t},{y}_{t}\right)\) in frame t, respectively. Eq. (17) is used to calculate the SSIM of \({S}_{t-1}\) and \({S}_{t}\), and their similarity is determined according to Eq. (18). To this end, two thresholds \({\hbox{TH}}_{1}\) and \({\hbox{TH}}_{2}\) with \({\hbox{TH}}_{1}>{\hbox{TH}}_{2}\) are set for SSIM. When \(\hbox{SSIM}>{\hbox{TH}}_{1}\), the similarity between \({S}_{t-1}\) and \({S}_{t}\) is high and the tracking quality is good, so normal tracking continues; in this state, the correlation filter is updated but the feature network is not. When \({\hbox{TH}}_{2}<\hbox{SSIM}<{\hbox{TH}}_{1}\), the similarity between \({S}_{t-1}\) and \({S}_{t}\) has been reduced by interference factors such as illumination changes, occlusion, target zoom, or target rotation. Although tracking can be maintained, the tracking accuracy is reduced, which means the feature network needs to be updated to restore accuracy. At this time, the parameters of the feature network of frame t − 1 are used as initial values, and the feature network is iteratively updated using the features of the current frame's target. The specific update process is as follows (a code sketch follows the steps):
Step 1: Calculate the ideal Gaussian response map corresponding to the predicted target position \(\left({x}_{t},{y}_{t}\right)\), denoted as \({G}_{M\times N}={\left({g}_{i,j}\right)}_{M\times N}\);
Step 2: Apply the feature network of frame t − 1 to \({S}_{t}\), and extract the feature map of \({S}_{t}\) as \(\varphi \left({S}_{t}\right)\);
Step 3: Use Eqs. (2) and (3) to compute the predicted response \({\left({r}_{i,j}\right)}_{M\times N}\) of the target in frame t;
Step 4: Construct the loss function loss as:
$$\hbox{loss}=\sum_{i=1}^{M}\sum_{j=1}^{N}{\Vert {r}_{i,j}-{g}_{i,j}\Vert }^{2}$$
(19)
Step 5: Compute the gradient \(\frac{\partial \hbox{loss}}{\partial \varphi }\) and back-propagate it to update the convolutional kernel parameters of each channel layer by layer, finally obtaining the updated parameters of the feature network.
Step 6: Set t → (t − 1), (t + 1) → t to continue tracking in the next frame.
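A minimal sketch of Steps 1–5, assuming PyTorch; `net`, `correlate` (a function applying Eqs. (2)–(3) with the current filter), and all hyperparameters are illustrative assumptions, not the authors' implementation:

```python
import torch

def update_feature_network(net, correlate, S_t, G_t, lr=1e-3, n_iters=5):
    """Fine-tune the feature network online (Steps 1-5 above).

    net       : feature network initialized with the frame t-1 parameters
    correlate : function applying Eqs. (2)-(3) with the current filter
    S_t       : search-box tensor of frame t
    G_t       : ideal Gaussian response centered on the predicted position
    """
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(n_iters):
        optimizer.zero_grad()
        r = correlate(net(S_t))           # predicted response map
        loss = torch.sum((r - G_t) ** 2)  # Eq. (19)
        loss.backward()                   # back-propagate the gradient
        optimizer.step()                  # update the kernel parameters
    return net
```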
In addition, when SSIM is between \({\hbox{TH}}_{2}\) and \({\hbox{TH}}_{1}\), the parameters of the previous feature network are taken as initial values, the sum of squared errors between the Gaussian output response corresponding to the predicted target position and the ideal Gaussian response is taken as the loss function, and the feature network is trained by stochastic gradient descent so that it can capture the changes of the target and the environment and extract appropriate features. If there are similar objects in the image but not within the current search window, the update will not cause an incorrect object to be tracked. If similar objects appear within the current search window, the update may well cause an incorrect object to be tracked. However, such a case is not discussed here because this paper only focuses on single-target tracking.
3.4 Re-tracking of targets
When \(\hbox{SSIM}<{\hbox{TH}}_{2}\), it is considered that the tracking target has drifted out of the field of view, and the tracking is judged to have failed. At this time, the tracking target should be searched for in the neighborhood around the rectangular prediction box in the current or subsequent frames in order to recover from the failure. The specific re-tracking process is as follows (a code sketch follows the steps):
Step 1: Taking the position \(\left({x}_{t},{y}_{t}\right)\) predicted from frame t − 1 as the center, construct in frame t a target search area \(A\) that is n times as large as T, where \(n\in \left\{2,3,\ldots,\hbox{Around}\left(\frac{W\times H}{M\times N}\right)\right\}\), \(W\) is the width of the image, \(H\) is the height of the image, and \(\hbox{Around}\left(\cdot \right)\) denotes rounding;
Generally, \(n=\hbox{Around}\left(\frac{W\times H}{M\times N}\right)\) is not used, since this would amount to searching the entire image, whose computational cost is too heavy for real-time tracking. Moreover, the displacement of the target between two consecutive frames is generally small, so searching the whole image is unnecessary; an appropriate n can be selected according to the actual situation.
Step 2: Let a sliding window \(\hbox{SW}\) with the same size as \({S}_{t-1}\) (width \(M\) and height \(N\)) slide with a step of 1 from the upper left corner of \(A\) to the lower right corner. At each position, the SSIM between SW and \({S}_{t-1}\) is calculated according to Eq. (17). The maximum of all SSIM values in the search process is recorded as \(\hbox{MaxSSIM}\);
Step 3: If \(\hbox{MaxSSIM}>{\hbox{TH}}_{1}\), the target is considered found, and the window position corresponding to \(\hbox{MaxSSIM}\) is taken as the predicted position of the target. Use this position to update \(\left({x}_{t},{y}_{t}\right)\) and jump to Step 5;
Step 4: If \(\hbox{MaxSSIM}\le {\hbox{TH}}_{1}\), there may be no target to track in the current frame, and the search must continue in subsequent frames. Let \((t-1)\to (t-1)\) (the reference frame remains unchanged), \((t+1)\to t\), and jump to Step 6;
Step 5: Let \(t\to (t-1)\), \((t+1)\to t\);
Step 6: Continue tracking or searching in the next frame.
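The search loop of Steps 2–3 can be sketched as an exhaustive sliding-window scan; this illustration reuses the `ssim` function sketched in Sect. 3.3.1 and assumes the search area A and the reference patch \({S}_{t-1}\) have already been cropped as grayscale arrays:

```python
def re_track(A, S_prev, th1):
    """Slide an (M, N) window over search area A with step 1 (Step 2) and
    return the most similar window position and its SSIM (Step 3).

    The caller updates (x_t, y_t) if max_ssim > th1, and otherwise moves
    on to the next frame (Step 4).
    """
    M, N = S_prev.shape
    best, best_pos = -1.0, None
    for i in range(A.shape[0] - M + 1):
        for j in range(A.shape[1] - N + 1):
            s = ssim(A[i:i + M, j:j + N], S_prev)  # Eq. (17)
            if s > best:
                best, best_pos = s, (i, j)
    return best_pos, best
```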