Object contour tracking via adaptive data-driven kernel

We present a novel approach to non-rigid object tracking based on an adaptive data-driven kernel. Conventional kernel-based trackers suffer from a constant kernel shape and from the scale and orientation selection problem when the tracked target changes in size. In contrast, the adaptive kernel robustly adapts to target variation and moves toward the actual target contour simultaneously with the mean shift iterations. The level set technique is introduced into the mean shift sample space both to cope with insufficient low-level information and to implement the evolution and update of the adaptive kernel. Since the active contour model is designed to drive the kernel constantly in the direction that maximizes the appearance similarity, the adaptive kernel continually captures the target shape, giving a better estimation bias and an accurate shift of the mean. Finally, the accurate target region avoids the performance loss caused by background pixels hidden inside the kernel and improves the quality of the samples fed to the next time step. Experimental results on a number of challenging sequences validate the effectiveness of the technique.


Introduction
Object tracking is a challenging research topic in computer vision. Numerous approaches have been devoted to computing the translation of an object across consecutive frames [1][2][3][4], among which the mean shift methods show impressive performance and have received considerable attention. First introduced in [5] as a nonparametric density estimator, mean shift iteratively computes the nearest mode of a point sample distribution. It was later applied by Comaniciu [6] to object tracking, where the cost function between two color histograms is minimized through the mean shift iterations.
Despite its promising performance [7][8][9][10], the traditional mean shift faces a significant problem: the lack of a clear mechanism for selecting the kernel scale. Since the scale of the mean shift kernel directly determines the size of the window within which sample weights are examined, and thereby affects the amount of kernel shift, it is a crucial parameter of the algorithm. However, there is currently no sound mechanism for choosing this scale. The intuitive approach is to search for the best scale by testing different kernel bandwidths and selecting the one that maximizes the appearance similarity. This kind of method easily results in performance loss because non-object regions inside the kernel contaminate the estimate. In order to better fit the object shape, anisotropic symmetric kernels have been introduced, which extend the selection problem from scale to orientation. By simultaneously controlling both scale and orientation, the estimation bias of the kernel can follow the underlying distribution (Fig. 1a), resulting in better mode estimation. Nevertheless, objects in practice may have complex shapes that cannot be described well by simple geometric shapes, even the most appropriate one (Fig. 1b). With the expectation that the kernel should ideally have the shape of the tracked object, some attempts have been made to use asymmetric kernels for dynamic tracking. However, most of them keep the kernel shape constant throughout the sequence, and few adapt it to the target variation over time.
*Correspondence: dongli@sdu.edu.cn 2 Shandong University, Weihai, China Full list of author information is available at the end of the article
In this paper, we derive an adaptive data-driven kernel to simultaneously address the kernel scale/orientation selection problem and the constancy of the kernel shape in non-rigid object tracking. The level set technique is introduced into the mean shift sample space both to cope with insufficient low-level information and to implement the evolution and update of the adaptive kernel. Since the active contour model is designed to drive the kernel constantly in the direction that maximizes the appearance similarity, the kernel robustly adapts to target variation and moves toward the actual target contour simultaneously with the mean shift iterations. As the adaptive kernel continually captures the target shape, it gives a better estimation bias, produces an accurate shift of the mean, and avoids the performance loss caused by non-object regions hidden inside the kernel. Briefly, our main contributions can be summarized as follows:
• In contrast to traditional mean shift methods, which use a fixed rectangle for target representation, we introduce the level set model into the mean shift framework to realize non-rigid object contour tracking.
• In contrast to traditional level set methods, which do not consider any knowledge of the target of interest, we evolve the level set curve in the mean shift sample space so that the curve converges to the result that maximizes the target appearance similarity.
• We propose an adaptive data-driven kernel based on the level set model within the mean shift framework, which addresses the fixed kernel shape and the kernel scale/orientation selection problem facing traditional kernel trackers.
Figure 1 illustrates the motivation and improvement of our proposed method.

Fig. 1 Motivation and improvement illustration of the proposed method. a Kernel scale/orientation selection. b Complex object shape. c The proposed data-driven kernel and its adaptation to target variation; the frame numbers in c are 191, 206, and 219, respectively, in the diving sequence

Tracking methods with kernel scale/orientation selection
After the intuitive 10% method in [6], Collins [11] proposed a method that uses a difference-of-Gaussian mean shift kernel for efficient blob tracking through scale space. Khan et al. [12] derive a multi-mode anisotropic mean shift, in which the center, size, and orientation of the bounding box are estimated simultaneously during tracking. In [13], the authors present a probabilistic formulation of kernel-based tracking in which EM estimation in conjunction with the KL divergence is used to develop an update scheme for the target center and kernel bandwidth.
However, all of these methods represent the objects roughly by kernels with simple geometric shapes, which easily results in background contamination. In contrast, the proposed data-driven kernel adapts to the shape of the actual object during tracking and also improves the quality of the samples used for the appearance model update.

Tracking methods using asymmetric kernel
In [14], asymmetric kernels are generated using implicit level set functions. After extending the search space to a higher dimension, the method simultaneously estimates the new object location, scale, and orientation. Yi et al. [15] propose an object tracking method based on the mean shift algorithm. They use an object mask to construct the asymmetric kernel and implement probabilistic estimation of the orientation change and scale adaptation. These methods, however, keep the kernel shape constant during tracking, so scale and orientation estimation alone cannot keep up with the object shape under out-of-plane rotations. In contrast, we evolve the data-driven kernel and adapt it to target variation simultaneously with the mean shift iterations in order to track deformable objects.

Tracking methods using level set
The level set technique has been widely used for dynamic tracking [16][17][18][19]. Bibby et al. [20] derive a posterior framework for robust tracking of multiple previously unseen objects, in which the shapes are implicit contours represented using level sets. In [21], the authors add the Mumford-Shah model to the particle filter framework. Once the particle filter gives the candidate positions in the prediction step, the level set curve evolution is applied, without considering any target bias, to give the candidate contours. In [22], dynamical statistical shape priors are introduced and integrated in a Bayesian framework for level set-based image sequence tracking. In [23], the authors propose a fragment-based tracking method within the level set framework, where the whole target and background are segmented by an efficient region-growing procedure. In contrast, our method introduces the active contour model into the mean shift sample space both to cope with insufficient low-level information and to obtain the adaptive kernel that maximizes the appearance similarity for non-rigid object tracking within the mean shift framework.

The mean shift estimation
The mean shift method iteratively computes the closest mode of a sample distribution, starting from a hypothesized mode. Specifically, consider a probability density function $f(x)$. Given $n$ sample points $x_i$, $i = 1, \cdots, n$, in $d$-dimensional space, the kernel density estimate (also known as the Parzen window estimate) of $f(x)$ can be written as

$$\hat{f}(x) = \frac{1}{n h^d} \sum_{i=1}^{n} w(x_i)\, K\!\left(\frac{x - x_i}{h}\right)$$

where $w(x_i) \geq 0$ is the weight of the sample $x_i$, and $K(x)$ is a radially symmetric kernel satisfying $\int K(x)\,dx = 1$. The bandwidth $h$ defines the scale at which the samples are considered for the probability density estimation. The point with the highest probability density at the current scale $h$ can then be calculated by the mean shift method as follows:

$$m_h(x) = \frac{\sum_{i=1}^{n} x_i\, w(x_i)\, g\!\left(\left\| \frac{x - x_i}{h} \right\|^2\right)}{\sum_{i=1}^{n} w(x_i)\, g\!\left(\left\| \frac{x - x_i}{h} \right\|^2\right)}$$

where the kernel profiles $k(x)$ and $g(x)$ satisfy the relationship $g(x) = -k'(x)$. The kernel recursively moves from the current location $x$ to the new location $m_h(x)$ according to the mean shift vector and finally converges to the nearest mode.
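The mode-seeking iteration described above can be sketched in a few lines. This is a minimal illustration rather than the tracker's implementation: it assumes a Gaussian profile (so that $g$ is again a Gaussian up to a constant factor), 2-D samples, and hypothetical names such as `mean_shift_mode`.

```python
import math


def mean_shift_mode(samples, weights, start, bandwidth, max_iter=100, tol=1e-6):
    """Move from `start` to the nearest mode of a weighted 2-D sample set.

    Assumption: Gaussian profile k(r) = exp(-r / 2), hence g(r) = -k'(r)
    is proportional to exp(-r / 2); the constant cancels in the ratio.
    """
    x, y = start
    for _ in range(max_iter):
        num_x = num_y = den = 0.0
        for (sx, sy), w in zip(samples, weights):
            # Squared distance to the sample, scaled by the bandwidth h.
            r = ((x - sx) ** 2 + (y - sy) ** 2) / bandwidth ** 2
            g = math.exp(-0.5 * r)
            num_x += w * g * sx
            num_y += w * g * sy
            den += w * g
        if den == 0.0:
            break  # no sample contributes at this location
        new_x, new_y = num_x / den, num_y / den
        shift = math.hypot(new_x - x, new_y - y)
        x, y = new_x, new_y
        if shift < tol:
            break  # the mean shift vector has vanished: a mode is reached
    return x, y


# Four samples around (0.5, 0.5) and one distant, lightly weighted sample.
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (8.0, 8.0)]
weights = [1.0, 1.0, 1.0, 1.0, 0.5]
mode = mean_shift_mode(samples, weights, start=(2.0, 2.0), bandwidth=1.5)
```

Starting from (2, 2), the iteration settles on the cluster of four nearby samples; the distant sample has practically no influence at this bandwidth, which is the sense in which the bandwidth sets the scale of the estimate.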

Kernel representation
The kernel is a crucial factor in the performance of the mean shift algorithm: it defines the scale of the target candidate and the number of samples considered in the mode-seeking process. An inappropriate kernel may result in either noisy background contamination or poor object localization. An ideal kernel is expected to have the shape of the actual tracked object, which may be complex, and to be capable of adapting to object variation. Level set methods, first proposed by Osher and Sethian [24,25], offer a very effective representation of contours and are widely used. The basic idea of the level set approach is to embed the contour $C$ as the zero level set of the graph of a higher dimensional function $\phi(x, y, \tau)$, that is,

$$C(\tau) = \{(x, y) \mid \phi(x, y, \tau) = 0\}$$

where $\tau$ is an artificial time-marching parameter, and then to evolve the graph so that this level set moves according to the prescribed flow. In this manner, the level set may develop singularities and change topology, while $\phi$ itself remains smooth and maintains the form of a graph. Given these properties, the level set is a reasonable choice for representing the expected adaptive kernel. A kernel function $K : R^d \to R$ in the mean shift framework is supposed to satisfy

$$K(x) = k(\|x\|^2)$$

where $\|x\|^2 = x^T x$ and $k : [0, \infty] \to R$ is the profile function with the following properties:
• $k$ is non-negative.

Sun et al. EURASIP Journal on Advances in Signal Processing
(2020) 2020:9
The implicit level set function $\phi(x)$, which encodes the signed distances of the pixels $x$ from the object boundary, provides a smooth and differentiable function and basically meets the requirements of a mean shift kernel. The one exception is that the signed distance function is negative outside the object boundary. Therefore, as in [14], we truncate the level set function outside the boundary (setting it to 0) and normalize the inside portion to meet the density estimator standard. Figure 2 illustrates the level set kernel mechanism. In [14], the asymmetric kernel is constructed only once and used unchanged throughout the sequence. Since it does not adapt to changes in object shape, the method can only estimate the object scale and the orientation of in-plane rotations. Under 3D or in-depth rotations, it is a challenge for the method to keep up with the object shape (Fig. 2f). In contrast, we introduce the active contour model into the mean shift sample space and derive a data-driven kernel that adapts to the object shape and moves toward the actual target contour simultaneously with the mean shift iterations.
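The truncate-and-normalize construction can be illustrated on a small binary mask. The sketch below is an assumption-laden approximation: the signed distance is taken as the distance to the nearest pixel of the opposite class (a brute-force stand-in for a proper distance transform), normalization is to unit sum, and `level_set_kernel` is a hypothetical name.

```python
import math


def level_set_kernel(mask):
    """Build a normalized kernel from a binary object mask (rows of 0/1)."""
    h, w = len(mask), len(mask[0])
    inside = [(r, c) for r in range(h) for c in range(w) if mask[r][c]]
    outside = [(r, c) for r in range(h) for c in range(w) if not mask[r][c]]
    if not inside or not outside:
        raise ValueError("mask must contain both object and background pixels")

    def nearest(p, others):
        # Brute-force Euclidean distance to the closest pixel in `others`.
        return min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in others)

    # Approximate signed distance: positive inside, negative outside.
    phi = [[0.0] * w for _ in range(h)]
    for p in inside:
        phi[p[0]][p[1]] = nearest(p, outside)
    for p in outside:
        phi[p[0]][p[1]] = -nearest(p, inside)

    # Truncate the outside portion to zero, then normalize to unit sum.
    truncated = [[max(v, 0.0) for v in row] for row in phi]
    total = sum(sum(row) for row in truncated)
    return [[v / total for v in row] for row in truncated]


# A 3x3 object centered in a 5x5 window.
mask = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)]
        for r in range(5)]
kernel = level_set_kernel(mask)
```

Pixels deep inside the object receive the largest kernel values and background pixels receive zero, so samples near the object center dominate the density estimate while background samples are excluded.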

Data-driven kernel evolution
Our goal is to evolve the kernel to the image area expected to contain the tracked target. Let $I_\tau : x \to R^m$ denote the image at time $\tau$, which maps a pixel $x = [x\ y]^T \in R^2$ to a value, where the value is a scalar for a grayscale image ($m = 1$) or a three-element vector for an RGB image ($m = 3$). Effective image preprocessing techniques could also be used to generate the value. Let $C$ denote the kernel contour. The contour is then deformed, in the form of its embedding level set function, until it minimizes an image-based energy function.
Given an initial kernel region learned from previous observations, we extend the candidate object region to a larger ring of neighboring pixels, within which the samples are evaluated by the Bhattacharyya measure. A new kernel function can therefore be adapted without being confined to the current kernel scope. Let $q$ and $p$ denote the color distribution functions generated from the object model and the candidate region, respectively; then the weight at pixel $x$ is given by

$$w(x) = \sqrt{\frac{q(I_\tau(x))}{p(I_\tau(x))}}$$

The weight map of the candidate object region contains two kinds of samples: samples that are more likely to belong to the target than to the background receive larger weights, and samples that are more likely to come from the background receive smaller weights. To distinguish these samples, we introduce the active contour model into this sample space as an unsupervised clustering scheme that automatically separates the samples into two classes (foreground/background) and drives the kernel to the area most likely to be the target.
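As a small illustration of the weight map, the sketch below assumes the form commonly used in mean shift tracking, in which a pixel whose color falls into histogram bin $u$ receives the weight $\sqrt{q_u / p_u}$. The function names and the guard `eps` against empty candidate bins are assumptions made for this example.

```python
import math


def bhattacharyya_coefficient(q, p):
    """Similarity between two normalized histograms (1.0 means identical)."""
    return sum(math.sqrt(qu * pu) for qu, pu in zip(q, p))


def sample_weights(bin_indices, q, p, eps=1e-12):
    """Weight of each pixel from the model (q) and candidate (p) histograms.

    `bin_indices[i]` is the histogram bin of pixel i; `eps` avoids a
    division by zero when a candidate bin is empty.
    """
    return [math.sqrt(q[u] / max(p[u], eps)) for u in bin_indices]


q = [0.7, 0.2, 0.1]        # target model: bin 0 dominates
p = [0.35, 0.40, 0.25]     # candidate region: diluted by background colors
pixel_bins = [0, 1, 2, 0]  # histogram bin of each candidate pixel
weights = sample_weights(pixel_bins, q, p)
similarity = bhattacharyya_coefficient(q, p)
```

Pixels in bin 0, which is over-represented in the model relative to the candidate, receive weights above 1, while pixels in the background-dominated bins receive weights below 1. These are the two kinds of samples that the active contour is meant to separate.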
Let $m_t$ and $m_b$ denote the within-class weight centers of the target and background classes; then we can define the variance of a cluster $D_*$ around its center by

$$\sigma_*^2 = \frac{1}{|D_*|} \sum_{x \in D_*} \left(w(x) - m_*\right)^2$$

where $* \in \{t, b\}$ denotes the target and background, respectively. Under the intuition that the weight values of pixels on the object and on the background should both be tightly clustered, i.e., have low within-class variance, we use the sum of squared errors as the clustering criterion function

$$J = \sum_{* \in \{t, b\}} \sum_{x \in D_*} \left(w(x) - m_*\right)^2$$

Optimizing the clustering criterion function is a combinatorial optimization problem and has been proved to be an NP problem. Since exhaustive computation is unrealistic, we bring this problem into the level set framework and convert the iterative search for an approximate solution into level set function evolution. The energy function of the active contour (10) is defined over $\Omega^+$, the region inside the curve $C$, which captures the samples belonging to the object class, and $\Omega^-$, the region outside $C$, which captures the samples of the background class. It also involves $T(x)$, an edge-detecting function of the image gradient, in which $\nabla$ denotes the spatial gradient operator, $*$ denotes convolution, and $G_\sigma$ is the Gaussian filter with standard deviation $\sigma$. $\xi$ and $\mu$ are coefficients that weight the relative importance of each term. The first two terms measure the within-class variation of the object and background classes. The third term ensures that the division between the two classes lies on the object boundary. The last term measures the length of the curve $C$ and smooths the region boundaries. Therefore, by minimizing the energy function (10), we expect to obtain a classification that both tightly clusters the object and background samples and places the division on the object edge.
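The role of the two within-class terms can be illustrated without any level set machinery. The sketch below is a deliberately simplified stand-in, not the active contour evolution of the paper: it ignores the edge and length terms entirely and alternates between fixing the class means and reassigning each weight to the nearer mean, which is ordinary two-means clustering on the sample weights. All names are hypothetical.

```python
def two_class_split(weights, max_iter=50):
    """Split sample weights into target/background by alternating updates.

    Returns (labels, m_t, m_b), where labels[i] is True for the target class.
    """
    m_t, m_b = max(weights), min(weights)  # initial class centers
    labels = []
    for _ in range(max_iter):
        # Assignment step: each sample joins the class with the nearer center.
        new_labels = [abs(w - m_t) <= abs(w - m_b) for w in weights]
        if new_labels == labels:
            break  # assignments are stable, so the criterion cannot decrease
        labels = new_labels
        # Update step: recompute the within-class centers.
        target = [w for w, is_t in zip(weights, labels) if is_t]
        background = [w for w, is_t in zip(weights, labels) if not is_t]
        if target:
            m_t = sum(target) / len(target)
        if background:
            m_b = sum(background) / len(background)
    return labels, m_t, m_b


def sum_squared_error(weights, labels, m_t, m_b):
    """Sum of squared deviations of each weight from its class center."""
    return sum((w - (m_t if is_t else m_b)) ** 2
               for w, is_t in zip(weights, labels))


weights = [1.4, 1.3, 1.5, 0.6, 0.7, 0.5, 1.45]
labels, m_t, m_b = two_class_split(weights)
error = sum_squared_error(weights, labels, m_t, m_b)
```

Unlike this sketch, the contour-based formulation assigns classes by spatial region (inside or outside the curve) rather than per sample, which is what lets the edge and length terms regularize the result.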
Employing the level set function as a differentiable threshold operator, we unify the integration region and rewrite (10) as the level set energy functional (12), where $\Omega = \Omega^+ \cup \Omega^-$ is the image domain, $H(\cdot)$ denotes the Heaviside function, with $H(z) = 1$ if $z \geq 0$ and $H(z) = 0$ otherwise, and $\delta_0(\cdot)$ is the Dirac function. By fixing the class means of the target and background samples as

$$m_t = \frac{\int_\Omega w(x)\, H(\phi(x))\, dx}{\int_\Omega H(\phi(x))\, dx}, \qquad m_b = \frac{\int_\Omega w(x)\, \left(1 - H(\phi(x))\right) dx}{\int_\Omega \left(1 - H(\phi(x))\right) dx}$$

and minimizing the energy functional (12), we obtain the associated Euler-Lagrange equation, which is implemented by gradient descent on $\phi$, with div denoting the divergence operator. Since the data-driven kernel is designed within the mean shift sample space to move constantly toward the image area with maximum appearance similarity, the kernel curve in the proposed algorithm can be steered to the target region from a wide variety of states, without requiring the initial curve to lie completely inside or completely outside the target.

Mean shift formulation
For an initial kernel contour $C_{\tau-1}$ learned from previous observations, we evolve it according to the new observation at time $\tau$, $I_\tau$, and the target model $q$, as discussed in Section 4.2. This is realized by gradient descent on the image energy $E_k$: the curve, with $S_\tau$ denoting the curve at time $\tau$, goes through $M$ iterations in the direction that reduces the energy $E_k$ as fast as possible. Based on the resulting object/background division contour $C_\tau$, we obtain the corresponding kernel $K(x, y)$ as described in Section 4.1. The density estimator is then evaluated over the $N$ samples in $\Omega^+$, the region inside $C_\tau$, and the mean shift vector that maximizes the density is computed from it. Figure 3 illustrates the tracking mechanism of the proposed algorithm.

Results and discussion
In this section, the proposed method is first evaluated qualitatively on several video sequences that pose different tracking challenges. All the sequences are recordings of real-world objects. The method is then tested on two public datasets for quantitative evaluation. In all cases, the target objects and candidates are modeled in RGB space by a weighted histogram with 16 bins along each dimension. The initial curve in the first frame is a rough polygon supplied manually, while the subsequent ones are taken from the results of the previous frame.
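The appearance model just described, a weighted RGB histogram with 16 bins per channel, can be sketched as follows. The sketch assumes 8-bit channels and takes the per-pixel kernel values as given; `rgb_bin` and `weighted_histogram` are hypothetical names.

```python
def rgb_bin(pixel, bins=16):
    """Map an 8-bit (r, g, b) pixel to a single index in [0, bins**3)."""
    r, g, b = pixel
    step = 256 // bins  # width of one bin along a channel (16 levels)
    return (r // step) * bins * bins + (g // step) * bins + (b // step)


def weighted_histogram(pixels, kernel_values, bins=16):
    """Normalized color histogram in which each pixel votes with its kernel value."""
    hist = [0.0] * bins ** 3
    for pixel, k in zip(pixels, kernel_values):
        hist[rgb_bin(pixel, bins)] += k
    total = sum(hist)
    return [v / total for v in hist] if total > 0 else hist


pixels = [(255, 0, 0), (250, 5, 3), (0, 0, 255)]
kernel_values = [0.5, 0.3, 0.2]
hist = weighted_histogram(pixels, kernel_values)
```

The two reddish pixels fall into the same bin and jointly contribute 0.8 of the mass. With 16 bins per channel, the histogram has 4096 bins in total, most of which stay empty for a typical target.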

Qualitative evaluation
Fig. 3 Tracking mechanism of the proposed method, using the pedestrian sequence as an example. a, b The initial curve obtained from the previous frame, where we allow a 15-pixel extension for the new kernel adaptation. c The corresponding candidate sample weights. d The active contour model is included in this sample space to separate the samples into two classes (foreground/background) and e obtain the adaptive kernel that maximizes the appearance similarity. g The level set function of the contour. f, h The corresponding kernel, which is used within the mean shift framework for deforming object tracking (i)

The first sequence consists of 230 frames and shows a waving hand with significant shape deformation as well as scale and rotation changes. From the tracking results shown in Fig. 4 (red), we can see that the proposed method accurately follows the target owing to the adaptation of the data-driven kernel to the object shape variation. For the same sequence, the conventional mean shift tracker (green), which uses a typical symmetric kernel in conjunction with bandwidth selection, cannot represent the target well. We further compared three mean shift-based algorithms on a high jump sequence to show the advantage of our approach. This sequence records a high jump competition, in which an athlete undergoes significant shape deformation together with fast and drastic motion. The three algorithms we tested are (a) standard mean shift using a symmetric kernel with scale selection [6], (b) a constant asymmetric kernel-based tracker with both scale and orientation adaptation [14], and (c) the proposed method. Figure 5 shows the tracking results of these algorithms. We can see that in typical mean shift, background pixels inside the rough kernel region easily cause performance loss, and the tracker is not guaranteed to focus on the target accurately.
The algorithm (b), which keeps a constant kernel shape throughout the image frames, cannot represent the deforming target well by scale and orientation adjustment alone. The proposed algorithm, in contrast, effectively adapts the kernel to the target variation and obtains satisfactory results. We then compared our work with a conventional level set-based deformable object tracker [21] on a pedestrian sequence. This sequence shows a woman with a multicolored appearance walking in a cluttered street with large posture changes and occlusions. In [21], the traditional Mumford-Shah method is added to the particle filter framework without considering any target bias. Since the typical level set model emphasizes intensity consistency only, its convergence on multicolored regions depends heavily on the initial curve. Therefore, poor results are obtained, as shown in Fig. 6a, owing to the unreliable initial curves derived from the prediction step of the particle filter. In contrast, the proposed algorithm, based on the weighted mean shift samples, segments the pixels of the two classes and obtains the accurate target contour. Additionally, we use the rate of decrease of the object size over the previous few frames as an occlusion detector. Once an occlusion is detected, we slow down the update of the target density distribution, enabling tracking to resume when the target reappears (Fig. 6b).
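The occlusion handling described above is stated only qualitatively, so the following is a sketch under explicit assumptions rather than the authors' procedure: the shrink rate is measured over a hypothetical window of 3 frames, an occlusion is declared above a hypothetical threshold of 30%, and the model is blended linearly with hypothetical update rates of 0.1 (normal) and 0.01 (occluded).

```python
def shrink_rate(areas, window=3):
    """Relative decrease in object area over the last `window` frames."""
    if len(areas) <= window or areas[-1 - window] == 0:
        return 0.0  # not enough history to judge
    reference = areas[-1 - window]
    return (reference - areas[-1]) / reference


def update_model(q, p, areas, base_rate=0.1, slow_rate=0.01, threshold=0.3):
    """Blend the target model q toward the current observation p.

    All numeric defaults are illustrative assumptions. Returns the updated
    model and the occlusion flag.
    """
    occluded = shrink_rate(areas) > threshold
    rate = slow_rate if occluded else base_rate
    new_q = [(1.0 - rate) * qu + rate * pu for qu, pu in zip(q, p)]
    return new_q, occluded


areas = [1000, 980, 700, 500]  # object size (pixels) in recent frames
q = [0.5, 0.5]                 # current target model (two bins for brevity)
p = [1.0, 0.0]                 # observation from the shrunken region
new_q, occluded = update_model(q, p, areas)
```

Here the area has halved within three frames, so the update is slowed and the model barely moves toward the (probably contaminated) observation, which keeps the stored appearance usable when the target reappears.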
Another three challenging sequences were tested to further evaluate the proposed method. The first sequence shows a complex scenario in which a girl moves quickly in a circular path with a boy, undergoing significant scale changes and shape deformation as she moves toward or away from the camera. It is a challenge for traditional symmetric kernel or intensity edge-based level set methods to represent the child accurately. As we can see in Fig. 7a, the proposed method gives satisfactory results, demonstrating the effectiveness of the technique. Compared with the method of [23], where the whole target and background are segmented into intensity-consistent fragments and modeled separately by GMMs, our method includes the active contour model in the mean shift sample space and aims to obtain the adaptive kernel for deformable object tracking within the mean shift framework, overcoming the computational complexity problem facing traditional contour trackers. The second sequence contains a man riding on a busy road, with the camera moving fast and the background changing dramatically. From the tracking results shown in Fig. 7b, we can see that our method performs well even in a complicated scene. The third sequence shows a toy QQ being pulled across a table with a cluttered background behind it and a similar icon beside it. During this course, large appearance changes occur when the toy is occluded or turned around. Figure 7c shows the tracking results for this sequence, indicating the competence of the proposed method in dealing with these challenging cases.

Fig. 5 Tracking results on the high jump sequence for frames 0, 13, 27, 39, and 64. a Standard mean shift [8]. b The method in [34]. c The proposed method

Quantitative evaluation
In this part, for quantitative analysis, we evaluate the proposed method on two public sets of challenging video sequences and compare it with several state-of-the-art tracking methods. The first dataset is VOT2014 [26], which comprises 25 sequences (more than 10,000 frames in total), and the second is VOT2016, which consists of 60 sequences. These sequences show various objects with different challenges for visual tracking, including large shape deformations, scale variations, illumination variations, occlusion, and so on. First, we compare the proposed method with several related bounding box trackers that also make use of segmentation techniques for target tracking: the DF tracker in [27], which divides the image into several layers that represent the probabilities of a pixel taking each feature value, defining the distribution field as the image descriptor for target modeling; PixelTrack in [28], which combines a generalized Hough transform-based detector with a probabilistic segmentation method in a co-training manner to track deformable objects; and the reliable patch tracker in [29], which divides the target into rectangular patches, tracks them with kernelized correlation filters [30], and integrates them within a particle filter framework [31]. We also compare the proposed method with other relevant contour trackers, which likewise exploit segmentation techniques to extract the target object contour for dynamic tracking. The first method is HoughTrack (HT), proposed by Godec et al. [32], which uses a patch-based voting algorithm with Hough forests [33]. By back-projecting the patches that voted for the object center, the authors initialize a graph-cut algorithm to segment the foreground from the background. The second method is the SLSM in [34], in which a single boosting target model is learnt to guide the level set curve evolution toward the target region of interest.
For the quantitative analysis, we determine for each video the percentage of frames in which the object is correctly tracked. Since the ground truth annotation included in the datasets is a rotated bounding box, and so that the contour trackers can be compared fairly with the bounding box trackers, we measure the tracking accuracy using the Agarwal criterion [35], as in [32] and [34]. It is defined as

$$\text{score} = \frac{|R_T \cap R_{GT}|}{|R_T|}$$

where $R_T$ is the target region output by the tracking algorithm and $R_{GT}$ is the ground truth. In each image frame, the tracking is considered correct if the Agarwal overlap measure is above a threshold (set to 0.5). Since the VOT2016 dataset contains 60 sequences, for reasons of space we select the VOT2014 dataset to show the complete evaluation results of the compared methods (see Table 1). As we can see, the proposed method outperforms the others on 12 out of 25 video sequences and also in the average percentage of correctly tracked frames. Table 2 summarizes the quantitative analysis of the compared methods on the VOT2016 dataset. Figure 8 gives some tracking results of the proposed method on the two datasets.

Table 2 Evaluation results of the compared methods on VOT2016 dataset: percentage of correctly tracked frames (score > 0.5). Columns: Methods, Pix [28], DF [27], HT [32], SLSM [34], RPT [29]

Finally, we show the ability of the proposed method to work on grayscale images. The first sequence captures a fish whose shape undergoes sudden deformation as it turns or gets occluded. The second sequence shows a toy dog that is held and swayed under a lamp, with large appearance and illumination changes as the toy moves and turns. Figure 9 shows the tracking results for these grayscale video sequences. As we can see in the images, our method performs well even with large appearance changes and severe occlusions in grayscale images.
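The per-frame correctness test used in the quantitative evaluation can be sketched on pixel sets. The sketch assumes that the Agarwal criterion is the fraction of the tracked region that lies inside the ground truth region, $|R_T \cap R_{GT}| / |R_T|$; regions are represented as sets of pixel coordinates, and the function names are hypothetical.

```python
def agarwal_score(tracked, ground_truth):
    """Fraction of the tracked region that lies inside the ground truth."""
    tracked, ground_truth = set(tracked), set(ground_truth)
    if not tracked:
        return 0.0  # an empty output counts as a miss
    return len(tracked & ground_truth) / len(tracked)


def percent_correct(frames, threshold=0.5):
    """Percentage of frames whose score is above the threshold."""
    scores = [agarwal_score(t, g) for t, g in frames]
    return 100.0 * sum(s > threshold for s in scores) / len(scores)


def box(x0, x1, y0, y1):
    """All pixel coordinates in the half-open rectangle [x0, x1) x [y0, y1)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


ground_truth = box(0, 10, 0, 10)
frame_1 = (box(2, 8, 2, 8), ground_truth)    # contour fully inside the box
frame_2 = (box(8, 12, 0, 4), ground_truth)   # half of the output falls outside
rate = percent_correct([frame_1, frame_2])
```

Because the score is normalized by the tracked region only, a tight contour inside a loose ground truth box is not penalized for the background that the box necessarily contains, which is why this measure suits a comparison between contour and bounding box trackers.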

Conclusion
We have presented a novel data-driven kernel in this paper for non-rigid object tracking. By introducing the active contour model into the mean shift sample space, the adaptive kernel can be evolved and updated to adapt to target variation simultaneously with the mean shift iterations.
Since the active contour model is designed to drive the kernel constantly in the direction that maximizes the appearance similarity, the adaptive kernel continually captures the target shape, giving a better estimation bias and an accurate shift of the mean, and thereby addresses the problems of constant kernel shape and scale/orientation selection facing typical kernel-based trackers. Experimental results have verified the effectiveness of the proposed method in many complicated scenes.