Efficient visual tracking via low-complexity sparse representation
 Weizhi Lu^{1},
 Jinglin Zhang^{2},
 Kidiyo Kpalma^{1} and
 Joseph Ronsin^{1}
https://doi.org/10.1186/s1363401502007
© Lu et al.; licensee Springer. 2015
Received: 16 October 2014
Accepted: 23 January 2015
Published: 10 March 2015
Abstract
Thanks to its good performance in object recognition, sparse representation has recently been widely studied in visual object tracking. So far, most works have focused on improving performance, while little attention has been paid to the complexity of sparse representation. By reducing the computation load related to sparse representation by a factor of hundreds, this paper proposes by far the most computationally efficient tracking approach based on sparse representation. The proposal consists of just two stages of sparse representation: one for object detection and the other for object validation. Experimentally, it achieves better performance than some state-of-the-art methods in both accuracy and speed.
1 Introduction
Object tracking is a challenging task in computer vision due to the constant changes of object appearance and location. Sparse representation has recently been introduced in this area for its robustness in recognizing objects with high corruption [1]. Although related tracking works have been proposed with competitive performance, the application efficiency of sparse representation has not received enough attention. This paper is thus proposed to address this problem.
Sparse representation is mainly developed within the framework of the particle filter, where the representation error is used to measure the similarity between each particle and the dictionary. Currently, this method still faces challenges in terms of complexity and performance. Specifically, sparse representation has to be calculated for each particle, while the number of particles is often in the hundreds (e.g., 600 particles in [2-4]). This is a considerable computation cost, especially since each sparse solution is itself computationally expensive. Precisely, to represent the particle with relatively little error, sparse representation usually requires a relatively large dictionary (with a trivial template) and relatively dense coefficients, both of which increase the solution complexity. Regarding tracking performance, sparse representation cannot resolve the problem of identity drift if it is simply used to weight the particles. There are two major reasons. First, sparse representation cannot provide a reliable similarity measure because of potential overfitting, which tends to introduce excessive nonzero coefficients to reduce the representation error. In practice, it seems difficult to completely avoid the overfitting problem, because the sparsity of the sparse solution is usually unknown. Second, the similarity threshold between object and background is hard to determine with sparse representation, since the similarity level usually varies as the object appearance changes.
To address the aforementioned problems, this paper develops a simple but effective tracking-by-detection scheme that uses the distribution of sparse coefficients, instead of the sparse representation error, as the similarity measure. The proposed scheme consists of two-stage sparse representation. In the first step, the object of interest is detected by the largest sparse coefficient; in the second step, the detected object is validated with a binary classifier based on sparse representation, which outputs the decision in terms of the distribution of sparse coefficients.
The rest of this paper is organized as follows. In Section 2, the tracking works related to sparse representation are briefly reviewed. In Section 3, a brief summary of sparse representation is presented. In Section 4, the proposed tracker with two-step sparse representation is described and analyzed. In Section 5, extensive experiments are conducted with comparison to the state-of-the-art. Finally, a conclusion is given in Section 6.
2 Related work
Extensive literature has been proposed on object tracking. Due to the limited writing space, we mainly review the tracking works related to sparse representation in terms of performance and complexity.
Sparse representation is introduced into tracking mainly to improve the performance of recognition or feature selection. Mei and Ling [5] first introduced sparse representation into an online tracking system, where a high-dimensional trivial template is used to approximate noise and occlusion. Later, to improve high-dimensional feature selection, Liu et al. [6] attempted to learn discriminative high-dimensional features using dynamic sparsity group. To reduce the sensitivity to background noise in the selected object area, Wang et al. [2] and Jia et al. [7] applied the sparse coding histogram based on local patches to describe objects. Zhong et al. [8] proposed a collaborative model that weights particles by combining the confidences of a local descriptor and a holistic representation. Note that the tracking methods described above mainly focus on performance improvement, while ignoring the complexity of implementation. In the traditional tracking framework of the particle filter, the computation cost introduced by sparse representation usually cannot be ignored, because it has to be calculated for each particle. In this sense, it is of practical interest to reduce the complexity related to sparse representation. Mei et al. [3] proposed to discard insignificant samples by limiting their linear least square errors before resampling particles with the more computationally expensive \( {\ell}_1 \)-regularization. In terms of compressed sensing theory, random projection is introduced to reduce the feature dimension in [9,10]. Strictly speaking, random-projection-based feature selection arises from random projection theory rather than compressed sensing theory [11]. In this paper, we also apply a similar method for feature selection. According to random projection theory, we implement random projection with random matrices rather than the fixed matrices applied in [9,10]. 
By this means, the feature selection performance of random projection should be improved [12]. In [13], Bao et al. developed a fast sparse solution solver with the accelerated proximal gradient approach. However, their solver is sensitive to a parameter termed the Lipschitz constant, whose evaluation adds computation load during template updating. Liu and Sun [14] attempted to weight each particle only with the corresponding sparse coefficient, such that sparse representation needs to be conducted only once. This method seems very attractive in complexity; however, it should be noted that the magnitude of each coefficient in fact cannot be guaranteed to be ‘proportional’ to the similarity/correlation between the corresponding particle and the query object. Mathematically, with the principle of least squares, we can derive that the exact ‘proportion’ exists only when the subdictionary corresponding to the sparse coefficients is orthogonal. Obviously, this condition is hard to satisfy with realistic dictionaries. Even with this erroneous similarity measure, however, the method in [14] still presents relatively good performance. This is because, empirically, the particles corresponding to large coefficients tend to be similar to the query object. In this case, the selected particles with high weights are unlikely to change the attribute of the particle filter, so the tracking performance is not affected. Besides the theoretical limitation, the method in [14] also has a critical performance limitation: it is sensitive to identity drift, because an object that has left the scene can hardly be detected with the distribution of sparse coefficients alone. Zhang et al. [15] proposed to jointly represent particles by using multi-task learning to explore the interdependencies between particles. In addition, to detect an occlusion, the nonzero coefficients in the trivial template were used to locate occluded pixels in [3]. 
However, this method seems unreliable due to the potential overfitting solution. In particular, when the overfitting solution occurs as in Figure 1c, all pixels are likely to be classified as occlusion, though in fact there is no occlusion. In this paper, to reduce the complexity of sparse representation, we also exploit the sparse coefficients instead of representation error for similarity measure. However, we successfully avoid the limitations mentioned above by developing a novel tracking scheme.
To account for the change of object appearance, almost all the trackers mentioned above explore an adaptive appearance model for online sample updating [5] or learning [16,17]. It is known that the adaptive model is likely to lead to identity drift if the background samples cannot be effectively excluded. However, effectively discriminating the background samples from the object is still a challenging task because of the change of object appearance. For instance, in practice it is hard to set a decision threshold between the object and the background with representation error [4,16,18,19]. To address this problem, it is better to introduce a binary classifier which involves a definitive decision threshold [20-23]. Thus, in the proposed approach, we specially develop a binary classifier based on sparse representation. Compared with traditional binary classifiers, as will be shown later, the proposed classifier is competitive in both performance and complexity.
3 Sparse representation-based classification
where \( {\delta}_i\left(\beta \right) \) is a function that sums the elements of β corresponding to \( {\boldsymbol{D}}_{G_i} \). The solution to the k-sparse vector β can be simply derived with greedy algorithms of complexity \( \mathcal{O}(mnk) \), such as OMP [24] or least-angle regression (LARS) [25]. Note that to reduce the representation error, the dictionary D is often further concatenated with a trivial template consisting of two identity matrices \( \left[\boldsymbol{I}\kern0.5em -\boldsymbol{I}\right]\in {\mathbb{R}}^{m\times 2m} \), thereby dramatically increasing the solution complexity [2,3,5-7,9,15].
where 0<γ<1 is an empirical parameter. As will be detailed in Section 4.3, the feature of detecting novel objects can be used to detect the outlier during the object tracking.
4 Proposed tracking scheme
4.1 Random-projection-based feature selection
where \( \mathbf{R}\in {\mathbb{R}}^{d\times m} \) is a random projection matrix with d<m. The random matrix R is commonly constructed with elements drawn independently and identically from the Gaussian distribution. Here, for simpler computation, we exploit a sparser random matrix that holds exactly one nonzero entry, equal to ±1 with equal probability, in each column. This kind of matrix has shown better feature selection performance than Gaussian matrices [11], and it also performs well in the following tracking work. Note that to obtain relatively reliable feature selection, for each given y, random projection usually needs to be carried out several times, and the average result is then considered [12]. In this process, the matrix R is randomly generated. More precisely, in our approach the random projection together with sparse representation is repeated five times for each given y. Then, the average of the five sparse solutions β is used to make a decision for y.
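The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the NumPy-based layout, and the `solve` callback (any sparse solver that maps a projected dictionary and a projected query to a coefficient vector) are assumptions made here, and applying R to both the dictionary and the query is assumed from the context rather than taken from the paper's own code.

```python
import numpy as np

def sparse_projection_matrix(d, m, rng):
    """Return a d x m matrix with exactly one nonzero entry (+1 or -1) per column."""
    R = np.zeros((d, m))
    rows = rng.integers(0, d, size=m)        # row of the single nonzero in each column
    signs = rng.choice([-1.0, 1.0], size=m)  # +1 or -1 with equal probability
    R[rows, np.arange(m)] = signs
    return R

def averaged_solution(y, D, solve, d, n_repeats=5, seed=0):
    """Average the sparse solutions obtained under n_repeats independent projections.

    `solve(A, b)` is a hypothetical sparse solver supplied by the caller.
    """
    rng = np.random.default_rng(seed)
    m = D.shape[0]
    betas = []
    for _ in range(n_repeats):
        R = sparse_projection_matrix(d, m, rng)  # a fresh random matrix each time
        betas.append(solve(R @ D, R @ y))        # project dictionary and query alike
    return np.mean(betas, axis=0)
```

With one nonzero per column, each projected feature is a signed sum of a few pixels, so the product R @ y needs no multiplications in principle; the dense matrix here is used only to keep the sketch short.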
Despite its low implementation complexity, random projection clearly is not the best feature selection tool in terms of performance. However, considering the variation of object appearance, it is reasonable to argue that the feature comparison based on the sum of few randomly selected pixels is probably more robust than the conventional pixelwise comparison. This also explains why random projection can present satisfactory recognition performance in the proposed tracker.
4.2 Object detection
It is clear that the detection performance depends heavily on the reliability of the query object y extracted from the former frame. Here, we define a special template Y=[Y _{ s } Y _{ d }] to model the query object y. As in state-of-the-art trackers [4,8], Y _{ s } denotes a static appearance model consisting of the ground truth, manually or automatically detected from the first frame, as well as its perturbations with small Gaussian noise, and Y _{ d } represents a dynamic appearance model collecting object samples extracted from recent frames. To account for object appearance change, in this paper, a set of query samples, rather than one, is randomly selected from the two models above. The average sparse solution of the selected query objects is used to determine the detection result. Note that to avoid false detection, it is suggested to collect more query samples from the static model Y _{ s } than from the dynamic model Y _{ d }. The local patches of the dictionary D are collected with a sliding window of the size of the initialized object. Considering the continuity of the object movement, the search region of the sliding window can be restricted to a relatively small area, e.g., two or three times the object size. If the object is lost, the search region can be temporarily expanded. For better understanding, the object detection flow is sketched in Algorithm 1.
It is necessary to point out that the detection corresponding to the largest coefficient is not always reliable or correct. An obvious example is that an ‘object area’ defined by the largest coefficient would remain even if the object has been occluded or has left the image. To avoid this kind of false detection, we introduce a binary classifier to further validate the detection, as detailed in the sequel.
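As a rough illustration of the detection step, the following Python sketch represents each query sample over a dictionary of candidate patches and picks the patch with the largest averaged coefficient. It is a sketch under stated assumptions, not the authors' Algorithm 1: the `omp` and `detect` functions are written here for illustration, the dictionary columns are assumed to be nonzero vectorized patches, and random projection is omitted for brevity.

```python
import numpy as np

def omp(D, y, k):
    """Minimal orthogonal matching pursuit: at most k nonzero coefficients.

    Assumes the columns of D have unit norm.
    """
    beta = np.zeros(D.shape[1])
    support = []
    coef = np.zeros(0)
    residual = np.asarray(y, dtype=float).copy()
    for _ in range(k):
        corr = np.abs(D.T @ residual)
        if support:
            corr[support] = 0.0          # never reselect an atom
        j = int(np.argmax(corr))
        if corr[j] < 1e-12:              # residual already explained
            break
        support.append(j)
        # Least-squares refit on the current support
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    if support:
        beta[support] = coef
    return beta

def detect(D, queries, k=10):
    """Return the index of the candidate patch with the largest averaged coefficient."""
    D = D / np.linalg.norm(D, axis=0, keepdims=True)  # assumes no zero column
    beta_mean = np.mean([omp(D, q, k) for q in queries], axis=0)
    return int(np.argmax(beta_mean)), beta_mean
```

Note that `detect` always returns some index, which mirrors the point made above: a largest coefficient exists even when the object is absent, so a separate validation step is still needed.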
4.3 Object validation and template updating

It is computationally competitive, since it only involves simple operations of matrix-vector product.

The decision can be easily derived in terms of the distribution of sparse coefficients.

Compared with traditional binary classifiers, it has an exclusive advantage: it can detect outliers that are not included in the current background model, because in this case the sparse coefficients tend to scatter rather than concentrate [1]. This property has been verified in our recent work on multi-object tracking, in which the novel object is detected as an outlier according to the scattering sparse coefficients [26]. This implies that we can detect the background sample even when the background model is not robust or large. As a result, the computation and storage loads of background modelling can be significantly reduced.

In practice, the discrimination between the object and the background seems to be a multi-class classification problem rather than a binary classification problem, since the complex and dynamic background usually involves various kinds of feature subspaces, some of which might be similar to the object feature. In this case, the two opposite half-spaces trained by traditional binary classifiers, like SVM [26], probably overlap with each other, thereby deteriorating the classification performance. In contrast, sparse representation is robust to this problem, because it partially explores the similarity between individual samples rather than directly dividing the sample space into two parts [1].
The parameters of the proposed classifier are further detailed as follows. The positive samples \( {Z}_{G_p} \) come from the aforementioned static and dynamic appearance models Y, and the negative samples \( {Z}_{G_n} \) are collected by a sliding overlapping window from the neighborhood of the tracked object, so that they include only part of the object region, as opposed to the relatively complete object region in positive samples. Correspondingly, the sparse solution β is also divided into two parts: \( \beta =\left[{\beta}_{G_p}\kern1em {\beta}_{G_n}\right] \). Note that the classifier is not sensitive to the representation error, and so β can be very sparse, e.g., it holds at most ten nonzero entries in our experiments. In terms of the distribution of sparse coefficients, we propose two rules to define the positive output. One is that the largest coefficient of the sparse solution β corresponds to a positive sample of \( {Z}_{G_p} \); namely, \( \arg {\max}_i\left\{{\beta}_i\right\}\in {G}_p \). The other is that the sparse coefficients corresponding to positive samples, \( {\beta}_{G_p} \), take higher energy than those corresponding to negative samples, \( {\beta}_{G_n} \); that is, \( {\left\Vert {\beta}_{G_p}\right\Vert}_1/{\left\Vert \beta \right\Vert}_1>0.5 \). Empirically, the latter criterion is stricter than the former, since it measures the similarity between the candidate object and the whole positive subspace, instead of individual positive samples. In this paper, to present a relatively fluent tracking trajectory, the detection is positively labeled when either of the two criteria above is satisfied. For the template updating, however, we only apply the second criterion with a stricter threshold, i.e., \( {\left\Vert {\beta}_{G_p}\right\Vert}_1/{\left\Vert \beta \right\Vert}_1>0.8 \), which provides more reliable features and thus prevents the template from identity drift. In practice, the threshold value needs to be tuned empirically.
Recall that random projection needs to be carried out several times to achieve better feature selection performance for the unique candidate object [12]. In our experiments, the random projection together with sparse representation is repeated five times, and then the average value of five sparse solutions β is used for the final decision.
It is necessary to emphasize that the proposed classifier holds an exclusive advantage: it can detect the outlier. Typical outliers include dynamic background samples and sudden, large changes of object appearance, which usually cannot be well described with the current background model. In this case, as shown in Figure 4c, the sparse coefficients tend to scatter among the positive and negative subspaces rather than concentrating on one of them, namely \( {\left\Vert {\beta}_{G_p}\right\Vert}_1/{\left\Vert \beta \right\Vert}_1\approx 0.5 \). The outliers can then be easily excluded from the template updating if a relatively strict threshold is adopted, e.g., \( {\left\Vert {\beta}_{G_p}\right\Vert}_1/{\left\Vert \beta \right\Vert}_1>0.8 \). This advantage allows us to build a background model of relatively few samples, since the classifier is not very sensitive to the robustness of the background model. Finally, it is necessary to discuss the case where the object is lost. In this case, the object will be searched for again within a larger region. If the search fails, the object will be assumed to move with a constant velocity or keep still in the following few frames. The whole flow of this part is summarized in Algorithm 2.
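The two decision rules and the stricter update rule can be written down compactly. The Python sketch below is an illustration under stated assumptions: the function name `validate`, the convention that the first `n_pos` entries of β belong to the positive samples, and the default thresholds (0.5 and 0.8, taken from the text) are choices made here; the input β is assumed to be the averaged sparse solution.

```python
import numpy as np

def validate(beta, n_pos, accept_ratio=0.5, update_ratio=0.8):
    """Return (is_object, may_update) from the distribution of sparse coefficients.

    beta is assumed to be ordered as [beta_Gp, beta_Gn], with the first n_pos
    entries corresponding to positive samples.
    """
    beta = np.asarray(beta, dtype=float)
    total = np.abs(beta).sum()
    if total == 0.0:                      # empty solution: nothing to decide on
        return False, False
    ratio = np.abs(beta[:n_pos]).sum() / total          # ||beta_Gp||_1 / ||beta||_1
    largest_is_positive = int(np.argmax(beta)) < n_pos  # rule 1
    is_object = bool(largest_is_positive or ratio > accept_ratio)  # rule 1 or rule 2
    may_update = bool(ratio > update_ratio)             # stricter rule for the template
    return is_object, may_update
```

For example, a solution whose energy is spread about evenly over both groups can still be accepted through the first rule, but it is kept out of the template because its ratio stays near 0.5.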
4.4 Computation cost related to sparse representation

The proposed scheme only involves two-step sparse representation, in which sparse representation needs to be performed only twice if the feature is robust. To obtain relatively reliable features, in our experiments sparse representation along with random projection is repeated dozens of times, namely (N _{ c }+1)N _{ r } times. In contrast, in the traditional framework of the particle filter, sparse representation has to be carried out for each particle, such that the number of repetitions is usually in the hundreds, e.g., about 600 times in [2-4].

Recall that the solution complexity is roughly proportional to the dictionary size m×n and the sparsity k. In the traditional methods, the dictionary size is usually large, since to reduce representation error, it has to involve a high-dimensional trivial template \( \left[\boldsymbol{I}\kern0.5em -\boldsymbol{I}\right] \) with the size of the object feature. The sparsity k cannot be restricted unless the representation error is small enough. In contrast, the proposed approach is not sensitive to the representation error. Thus, in this paper, we significantly reduce the dictionary column size n by excluding the trivial template, while restricting the sparsity k to a relatively low value, e.g., k=10 in our experiments. Furthermore, the dictionary row size m is also drastically reduced with random projection.
Recently, some trackers based on sparse representation have been proposed with ‘real-time’ performance by reducing the complexity of the sparse solution [9,13]. However, these trackers cannot reduce the number of repetitions of sparse representation because of their particle filter framework. Thus, their computational gain is still limited compared with our approach. For a better understanding, here, we analyze two typical real-time trackers: the real-time compressed sensing tracker (RTCST) [9] and APGL1 [13]. For the tracker RTCST, compressed sensing theory is exploited to reduce the feature dimension with linear projection, thereby reducing the complexity of the sparse solution. The same strategy is also adopted in our approach with random projection theory [11]. So compared with our approach, the tracker RTCST has no computational advantage. Furthermore, it is worth noting that linear-projection-based feature selection is based on random projection theory rather than compressed sensing theory [11]. The tracker APGL1 is developed by exploring the accelerated proximal gradient (APG) approach for the sparse solution. The APG approach seems to be computationally attractive, since it does not require the operation of matrix inversion, which is of complexity \( \mathcal{O}\left({k}^3\right) \) and often involved in current solution algorithms. However, it should be noted that the convergence performance of the APG approach is sensitive to a parameter termed the Lipschitz constant, which needs to be evaluated with the largest singular value of the dictionary D. This implies that the singular value of the dictionary has to be calculated at each dictionary update, while computing the singular value has a relatively high complexity \( \mathcal{O}\left({n}^3\right) \). Hence, updating the Lipschitz constant will drastically degrade the computational advantage of the APG approach, especially in complex scenes where the dictionary needs to be frequently updated.
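To make the dependence on the largest singular value concrete, the short Python sketch below computes the Lipschitz constant of the gradient of the smooth term \( \frac{1}{2}{\left\Vert \boldsymbol{D}\beta -y\right\Vert}_2^2 \), which equals the squared largest singular value of D. The least-squares form of the smooth term and the function name are assumptions made for illustration; the exact objective used by APGL1 may differ.

```python
import numpy as np

def lipschitz_constant(D):
    """Lipschitz constant of the gradient of 0.5 * ||D @ beta - y||^2.

    The gradient is D.T @ (D @ beta - y), so its Lipschitz constant is the
    largest eigenvalue of D.T @ D, i.e. the squared spectral norm of D.
    """
    sigma_max = np.linalg.norm(D, 2)  # spectral norm = largest singular value
    return sigma_max ** 2

# Replacing even a single template (column) changes the spectrum, so the
# constant has to be re-evaluated after every dictionary update.
rng = np.random.default_rng(0)
D = rng.standard_normal((200, 40))
L_before = lipschitz_constant(D)
D[:, 0] = rng.standard_normal(200)    # simulate a template update
L_after = lipschitz_constant(D)
```

A gradient-type solver would use a step size of 1/L, which is why a stale or badly estimated constant affects convergence.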
5 Experiments
The attributes of ten tested videos on five typical challenges
Video clip  Light change  Fast motion  Scale/rotation  Partial occlusion  Complete occlusion  
Face_hand  \( \checkmark \)  \( \checkmark \)  
Paper  \( \checkmark \)  \( \checkmark \)  
PETS09_s2l1  \( \checkmark \)  
Tudcrossing  \( \checkmark \)  
Face_man  \( \checkmark \)  \( \checkmark \)  
Face_woman  \( \checkmark \)  
Animal  \( \checkmark \)  
Jumping  \( \checkmark \)  
David  \( \checkmark \)  \( \checkmark \)  
Girl  \( \checkmark \)  \( \checkmark \)  \( \checkmark \) 
For comparison, we run four known trackers: the incremental visual tracker (IVT) [16], the L1 tracker [5], the partial least squares (PLS) tracker [4], and the sparsity-based collaborative model (SCM) tracker [8]. These four trackers all explore an adaptive appearance model. The first two trackers mainly focus on the updating of dynamic appearances, while the latter two further introduce a static model to prevent identity drift. Note that the trackers L1 and SCM are both based on sparse representation. L1 is the first tracker that explores sparse representation, and to the best of our knowledge, SCM is currently the best known tracker based on sparse representation [27]. For fair comparison, the four trackers are all implemented with their original codes, and the tracked object is initialized with the same position. Regarding parameter tuning, we use the default parameters of L1 and PLS. Since the trackers IVT and SCM provide parameter options for some popular videos, we adopt their default parameters for the videos they have evaluated, and select suitable parameter options for other videos, e.g., using their parameters for ‘head’ and ‘pedestrian’ to track ‘head’ and ‘pedestrian’ in other videos. Recall that the proposed tracker cannot cope with scale changes. In contrast, the other four trackers all exploit the technique of affine transform. Empirically, some trackers are probably sensitive to the initialization. For fair and comprehensive performance evaluation, we exploit two initialization methods: one-pass evaluation (OPE) and spatial robustness evaluation (SRE) [27]. The OPE method simply initializes the tracker with the ground truth. The SRE method sets the initialization region by scaling the ground truth with five ratios 0.8, 0.9, 1, 1.1, and 1.2, and then the average tracking performance is considered.
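The SRE initialization can be illustrated with a small Python sketch. The box format (x, y, width, height) with (x, y) as the top-left corner, the choice to scale about the box center, and the function name are assumptions made here; the text only states that the ground truth is scaled with the five ratios.

```python
def sre_initial_boxes(box, ratios=(0.8, 0.9, 1.0, 1.1, 1.2)):
    """Return one initialization box per ratio, scaled about the box center.

    box is (x, y, w, h) with (x, y) the top-left corner (assumed convention).
    """
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0     # center stays fixed (assumption)
    boxes = []
    for r in ratios:
        sw, sh = w * r, h * r
        boxes.append((cx - sw / 2.0, cy - sh / 2.0, sw, sh))
    return boxes
```

The tracker would then be run once from each box, and the per-run scores averaged to give the SRE figure for a video.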
The parameters related to the two-step sparse representation are detailed as follows. In the step of object detection, ten query samples are collected from the static model Y _{ s } and five query samples from the dynamic model Y _{ d }. The step length of the sliding window is around 4 pixels. Both the object retrieval region and the background sampling region are at most three times the object size. In the step of object validation, the classifier consists of 50 positive samples and 100 negative samples. A positive detection is not used for model updating unless \( {\left\Vert {\beta}_{G_p}\right\Vert}_1/{\left\Vert \beta \right\Vert}_1>0.8 \). In the two steps above, the sparse representation based on random projection is repeated five times for each instance. The random projection matrix is of size (d=200, m=32×32). This means that the object is extracted and represented with a vector of size 32×32, which is further reduced to dimension 200 by random projection before performing sparse representation. The upper bound of the sparsity k is set to 10 during the sparse solution.
5.1 Computational efficiency
The implementation speed (frames per second) of the four trackers based on sparse representation with varying feature size
5.2 Quantitative evaluation
Average center location errors (in pixel)
Tracker  Face_hand OPE  Face_hand SRE  Paper OPE  Paper SRE  PETS09_s2l1 OPE  PETS09_s2l1 SRE  Tudcrossing OPE  Tudcrossing SRE  Face_man OPE  Face_man SRE  Face_woman OPE  Face_woman SRE  Animal OPE  Animal SRE  Jumping OPE  Jumping SRE  David OPE  David SRE  Girl OPE  Girl SRE  
IVT  57.80  51.23  38.77  38.64  51.26  42.17  29.11  22.30  7.41  18.01  39.99  30.41  10.02  21.27  15.25  15.10  5.49  6.74  36.50  30.22 
L1  43.98  45.12  70.42  55.28  45.04  45.22  11.49  11.17  28.96  23.47  24.06  20.30  81.02  33.08  55.68  52.03  58.32  57.00  24.53  45.13 
PLS  62.08  59.30  34.77  39.11  42.98  44.71  43.11  37.60  9.09  17.10  22.57  20.00  91.94  32.90  66.93  72.44  77.73  78.17  39.51  54.00 
SCM  23.82  18.17  40.89  37.01  32.12  23.00  4.36  6.35  3.06  6.81  4.68  4.58  25.75  14.22  3.73  3.66  47.20  40.90  110.49  51.18 
Ours  4.08  6.22  4.45  4.12  8.47  40.06  4.23  6.29  9.74  13.40  10.45  9.79  6.39  9.73  3.77  4.15  8.65  10.21  22.68  26.43 
Average overlap rates between the tracked region and the ground truth
Tracker  Face_hand OPE  Face_hand SRE  Paper OPE  Paper SRE  PETS09_s2l1 OPE  PETS09_s2l1 SRE  Tudcrossing OPE  Tudcrossing SRE  Face_man OPE  Face_man SRE  Face_woman OPE  Face_woman SRE  Animal OPE  Animal SRE  Jumping OPE  Jumping SRE  David OPE  David SRE  Girl OPE  Girl SRE  
IVT  0.40  0.37  0.48  0.40  0.20  0.21  0.15  0.15  0.56  0.32  0.46  0.45  0.74  0.58  0.52  0.45  0.64  0.63  0.37  0.38 
L1  0.35  0.30  0.12  0.16  0.39  0.35  0.54  0.51  0.36  0.32  0.54  0.54  0.23  0.37  0.11  0.14  0.26  0.24  0.48  0.25 
PLS  0.35  0.29  0.44  0.39  0.38  0.35  0.09  0.11  0.52  0.34  0.43  0.40  0.21  0.37  0.07  0.07  0.25  0.25  0.39  0.27 
SCM  0.70  0.61  0.54  0.48  0.38  0.47  0.69  0.70  0.67  0.61  0.82  0.78  0.57  0.62  0.80  0.80  0.37  0.36  0.13  0.28 
Ours  0.93  0.77  0.82  0.73  0.64  0.30  0.74  0.69  0.56  0.47  0.67  0.59  0.84  0.69  0.82  0.73  0.39  0.35  0.51  0.48 
By comparing the results of OPE and SRE in each video of Table 3, we can see that the performance of the proposed tracker is relatively stable in the two cases. This implies that the proposed tracker is robust to the scale of the initialization region. In fact, it presents poor performance only in the video ‘PETS09_s2l1,’ where it fails in the scaling cases of SRE. In contrast, the other four trackers seem sensitive to the initialization, which might be explained as follows. These trackers all exploit the technique of affine transform, which tends to gradually converge to a local part of the target when the feature is not robust. From Table 4, it can be observed that the SRE result is a little worse than the OPE result in each video, although as shown in Table 3, the proposed tracker in fact presents comparable performance in these two cases. As explained before, this is because the SRE method introduces a relatively large difference between the initialized tracking window and the ground truth in the first frame.
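For reference, the two per-frame measures reported in the tables can be computed as in the Python sketch below. The box format (x, y, width, height) and the definition of the overlap rate as intersection area over union area are assumptions made here (the latter is the common convention in tracking benchmarks); the text itself does not spell out the formula.

```python
import math

def center_error(a, b):
    """Euclidean distance in pixels between the centers of boxes a and b."""
    ax, ay = a[0] + a[2] / 2.0, a[1] + a[3] / 2.0
    bx, by = b[0] + b[2] / 2.0, b[1] + b[3] / 2.0
    return math.hypot(ax - bx, ay - by)

def overlap_rate(a, b):
    """Intersection area divided by union area of boxes a and b (assumed definition)."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0
```

Under this definition, a tracked window that is centered correctly but has the wrong size keeps a small center error while its overlap rate drops, which is consistent with the SRE overlap rates being lower than the OPE ones even where the center errors are comparable.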
5.3 Qualitative evaluation
Occlusion: The object occlusion has been the major challenge of online visual object tracking, which probably leads to the false object detection and gradually drift the identity of object model. However, in the proposed approach, the binary classifier based on sparse representation can effectively detect and prevent the occlusion from the updating of object model. In practice, the proposed approach is robust to object occlusion. To highlight the advantage of the proposal, we specially produce two challenging videos against longtime complete occlusions: sequences face_handand paper. In the sequence face_hand, the target face is completely occluded with hands for a long period. In the sequence paper, the target paper is completely occluded twice by another similar paper. In addition, the partial or complete occlusion cases can also be observed in the sequences PETS09_s2l1, Tudcrossing, face_man, face_woman, and girl.
The proposed approach performs well on the videos mentioned above. In contrast, the other four trackers fail, when the longtime complete occlusion occurs or the occlusion shares similar feature with the target. For instance, in the sequence face_hand, IVT, L1, and PLS early drift to the background when a shorttime occlusion occurs, and SCM finally drifts to one hand which covers the face for a long period. In the sequence paper, the four trackers all drift to the occlusion or background. It is interesting to note that SCM is robust to shorttime occlusion due to the application of static object model. Nevertheless, it remains sensitive to the longtime occlusion, as demonstrated in the sequence face_hand. This implies that SCM cannot effectively detect the collusion, which finally modifies the attribute of the dynamic object model by the accumulation of false samples.
Motion and blur: The fast or abrupt motion has been a great challenge for the traditional framework of particle filter, whose motion estimation parameters are usually continuous. However, this problem can be easily addressed within the proposed trackingbydetection scheme by expanding the object retrieval region. It is known that the blur caused by fast motion is unfavorable for object recognition. However, the fluent tracking results in sequences animal and jumping validate that the proposed approach works well in this case. It indicates that the random projection of raw image is robust to the blur. By exploring the sparse coding histogram as feature, SCM also performs well in this case. In contrast, the remaining three methods perform relatively worse. They all drift from the target in the sequence jumping.
Scale and rotation: There are drastic scale changes and inplane or outofplane rotations in the two sequences david and girl. They pose great challenges to the proposed approach which only holds a fixedsized tracking window. In this case, the object detection of the proposed approach is usually false. However, the false detection can be effectively identified with object validation. This will help the proposed tracker effectively avoid the identity drift caused by scale or rotation. So in the sequences david and girl, the proposed approach successfully recaptures the object after severe scalings or rotations. In contrast, the other four methods incline to lose the target for ever in the presence of severe scale changes or rotations, e.g., the outofplane rotation in the sequence girl.
Illumination: In theory, the proposed approach should not be sensitive to illumination changes, since the feature vector collected by random projection can be linearly scaled in the sparse solution. In practice, the proposed approach and the other four methods all perform well. For instance, in the sequence david, all five methods successfully track the object as it walks from the dark into the light in the first few frames.
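The scaling argument can be checked numerically. The Python sketch below is an illustration under stated assumptions: it models an illumination change as a global gain applied to the feature vector, and it uses plain matching pursuit as a stand-in for whichever sparse solver is actually employed. The dictionary is random, and all names (matching_pursuit, demo, gain) are hypothetical.

```python
import math
import random


def normalize(v):
    """Return v scaled to unit Euclidean norm."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def matching_pursuit(atoms, y, n_iter):
    """Greedy sparse approximation of y over unit-norm atoms.

    Returns (coefficients, relative_error) with
    relative_error = ||y - D x|| / ||y||.
    """
    coeffs = [0.0] * len(atoms)
    residual = list(y)
    for _ in range(n_iter):
        # Correlate every atom with the current residual and pick the best one.
        corr = [sum(a * r for a, r in zip(atom, residual)) for atom in atoms]
        j = max(range(len(atoms)), key=lambda i: abs(corr[i]))
        coeffs[j] += corr[j]
        residual = [r - corr[j] * a for r, a in zip(residual, atoms[j])]
    norm_y = math.sqrt(sum(v * v for v in y))
    norm_r = math.sqrt(sum(r * r for r in residual))
    return coeffs, norm_r / norm_y


def demo(gain=0.5, dim=16, n_atoms=8, n_iter=3, seed=0):
    """Solve for a feature vector and for its globally scaled copy."""
    rng = random.Random(seed)
    atoms = [normalize([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(n_atoms)]
    y = [rng.gauss(0, 1) for _ in range(dim)]
    x, err = matching_pursuit(atoms, y, n_iter)
    x_scaled, err_scaled = matching_pursuit(atoms, [gain * v for v in y], n_iter)
    return x, err, x_scaled, err_scaled
```

Under these assumptions, the coefficients of the scaled vector are the original coefficients multiplied by the gain, and the relative representation error is unchanged, so a similarity measure based on relative error is unaffected by a global gain. An additive brightness offset is not covered by this argument.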
Overall performance: The proposed approach shows better overall performance than the others owing to the robustness of sparse representation in both object detection and validation. The two trackers SCM and PLS both use a static object model to identify the object, but they cannot prevent false detections from updating the dynamic object model, so they perform worse than the proposed tracker in our experiments. Note that SCM clearly outperforms PLS. This can be explained by the fact that SCM uses both static and dynamic features to weight particles, while PLS adopts only the dynamic feature. The remaining two trackers, IVT and L1, cannot cope with severe appearance changes, since the ground truth is not preserved in their template updating.
6 Conclusions
This paper has proposed an efficient tracking-by-detection scheme based on two-stage sparse representation. To evaluate the proposed approach, extensive experiments were conducted on ten benchmark videos comprising various challenges such as illumination change, fast motion, scale and rotation, partial occlusion, and complete occlusion. Compared with traditional trackers based on sparse representation, the proposed tracker shows clear advantages in both accuracy and complexity. Specifically, it significantly reduces the computation cost related to sparse representation and thereby runs much faster than state-of-the-art methods. Thanks to its robustness to identity drift, it also achieves better tracking performance than state-of-the-art methods, especially in the presence of severe occlusions.
References
[1] J Wright, AY Yang, A Ganesh, SS Sastry, Y Ma, Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Machine Intelligence. 31(2), 210–227 (2009).
[2] Q Wang, F Chen, W Xu, MH Yang, Online discriminative object tracking with local sparse representation. IEEE workshop on application of computer vision (WACV), IEEE, Breckenridge, CO, 9–11 January, 2012.
[3] X Mei, H Ling, Y Wu, Minimum error bounded efficient ℓ1 tracker with occlusion detection. IEEE conference on computer vision and pattern recognition (CVPR), IEEE, Colorado Springs, CO, 21–23 June, 2011.
[4] Q Wang, F Chen, W Xu, MH Yang, Object tracking via partial least squares analysis. IEEE Trans. Image Process. 21(10), 4454–4465 (2012).
[5] X Mei, H Ling, Robust visual tracking using ℓ1 minimization. IEEE international conference on computer vision, IEEE, Kyoto, 29 September–2 October, 2009.
[6] B Liu, L Yang, J Huang, P Meer, L Gong, C Kulikowski, Robust and fast collaborative tracking with two stage sparse optimization. European Conference on Computer Vision (ECCV), Crete, Greece, 5–11 September, 2010.
[7] X Jia, H Lu, MH Yang, Visual tracking via adaptive structural local sparse appearance model. IEEE conference on computer vision and pattern recognition (CVPR), IEEE, Providence, RI, 16–21 June, 2012.
[8] W Zhong, H Lu, MH Yang, Robust object tracking via sparsity-based collaborative model. IEEE conference on computer vision and pattern recognition (CVPR), IEEE, Providence, RI, 16–21 June, 2012.
[9] H Li, C Shen, Q Shi, Real-time visual tracking using compressive sensing. IEEE conference on computer vision and pattern recognition (CVPR), IEEE, Colorado Springs, CO, 21–23 June, 2011.
[10] K Zhang, L Zhang, MH Yang, Real-time compressive tracking. European Conference on Computer Vision (ECCV), Firenze, Italy, 7–13 October, 2012.
[11] W Lu, W Li, K Kpalma, J Ronsin, Sparse matrix-based random projection for classification. arXiv:1312.3522 (2014).
[12] XZ Fern, CE Brodley, Random projection for high dimensional data clustering: A cluster ensemble approach. International Conference on Machine Learning (ICML), Washington, DC, 21–24 August, 2003.
[13] C Bao, Y Wu, H Ling, H Ji, Real time robust L1 tracker using accelerated proximal gradient approach. IEEE conference on computer vision and pattern recognition (CVPR), IEEE, Providence, RI, 16–21 June, 2012.
[14] H Liu, F Sun, Visual tracking using sparsity induced similarity. IEEE international conference on pattern recognition (ICPR), IEEE, Istanbul, 23–26 August, 2010.
[15] T Zhang, B Ghanem, S Liu, N Ahuja, Robust visual tracking via multi-task sparse learning. IEEE conference on computer vision and pattern recognition (CVPR), IEEE, Portland, OR, 23–28 June, 2013.
[16] D Ross, J Lim, RS Lin, MH Yang, Incremental learning for robust visual tracking. Int. J. Comput. Vision. 77(1), 125–141 (2008).
[17] D Wang, H Lu, MH Yang, Online object tracking with sparse prototypes. IEEE Trans. Image Process. 22(1), 314–325 (2013).
[18] J Kwon, KM Lee, Visual tracking decomposition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, San Francisco, CA, 13–18 June, 2010.
[19] A Adam, E Rivlin, I Shimshoni, Robust fragments-based tracking using the integral histogram. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, New York, NY, 17–22 June, 2006.
[20] B Babenko, MH Yang, S Belongie, Visual tracking with online multiple instance learning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Miami, FL, 20–26 June, 2009.
[21] S Avidan, Ensemble tracking. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, San Diego, CA, 20–26 June, 2005.
[22] H Grabner, H Bischof, On-line boosting and vision. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, New York, NY, 17–22 June, 2006.
[23] Z Kalal, J Matas, K Mikolajczyk, P-N learning: Bootstrapping binary classifiers by structural constraints. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, San Francisco, CA, 13–18 June, 2010.
[24] YC Pati, R Rezaiifar, PS Krishnaprasad, Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. Asilomar Conference on Signals, Systems and Computers (ACSSC), IEEE, Pacific Grove, CA, 1–3 November, 1993.
[25] B Efron, T Hastie, R Tibshirani, Least angle regression. Ann. Stat. 32, 407–499 (2004).
[26] W Lu, C Bai, K Kpalma, J Ronsin, Multi-object tracking using sparse representation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Vancouver, BC, 26–31 May, 2013.
[27] Y Wu, J Lim, MH Yang, Online object tracking: A benchmark. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Portland, OR, 23–28 June, 2013.
[28] J Mairal, F Bach, J Ponce, G Sapiro, Online learning for matrix factorization and sparse coding. J Machine Learning Res. 11, 19–60 (2010).
[29] M Everingham, LV Gool, C Williams, J Winn, A Zisserman, The pascal visual object classes (voc) challenge. Int. J. Comput. Vision. 88(2), 303–338 (2010).
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.