- Research Article
- Open Access
Automatic Moving Object Segmentation from Video Sequences Using Alternate Flashing System
© Jae-Kyun Ahn et al. 2010
- Received: 12 December 2009
- Accepted: 31 March 2010
- Published: 10 May 2010
A novel algorithm to extract moving objects from video sequences is proposed in this paper. The proposed algorithm employs a flashing system to obtain an alternate series of lit and unlit frames from a single camera. For each unlit frame, the proposed algorithm synthesizes the corresponding lit frame using a motion-compensated interpolation scheme. Then, by comparing the unlit frame with the lit frame, we construct the sensitivity map, which provides depth cues. In addition to the sensitivity term, color, coherence, and smoothness terms are employed to define an energy function, which is minimized to yield segmentation results. Moreover, we develop a faster version of the proposed algorithm, which reduces the computational complexity significantly at the cost of slight performance degradation. Experiments on various test sequences show that the proposed algorithm provides high-quality segmentation results.
- Segmentation Result
- Object Segmentation
- Object Contour
- Color Term
- Smoothness Term
Due to advances in computation and communication technologies, interest in video content has increased significantly, and it has become increasingly important to analyze and understand video content automatically using computer vision techniques. To address the growing demand, various video analysis techniques have been introduced. Among them, moving object segmentation is a fundamental tool, which is widely used in a variety of applications. In particular, it plays an important preprocessing role in vision-based human motion capture and analysis, since the shape of a human subject after the segmentation is one of the main features for understanding human behaviors [1]. For example, in human pose estimation, 2D outlines of a human subject, which are extracted from one or more viewpoints using object segmentation techniques, are employed to reconstruct the 3D shape of a generic humanoid model [2–5]. Also, based on moving object segmentation, the outline of a human body can be tracked and used in human gesture analysis and human-machine interface [6, 7]. Moreover, the body shape and dynamics of a human subject can be used to recognize his or her identity [8, 9]. Therefore, the development of accurate video object segmentation techniques is essential for understanding human behaviors.
Many approaches have been proposed for video object segmentation. They can be classified roughly into two categories: semi-automatic and automatic methods. Semi-automatic methods [10–13] first identify regions of interest coarsely using initial user interactions. Then, based on the initial information, they construct color, position, or motion models of objects and the background. The models are then used to separate the objects from the background more accurately. In [10], a background subtraction method was proposed to segment objects in video sequences with static backgrounds. It extracts moving objects by subtracting a given background from each frame in a video sequence. In [11], Criminisi et al. proposed a discriminative model, which is composed of motion, color, and contrast cues with spatial and temporal priors. Their algorithm achieves high-quality video segmentation in real time, but it works only if ground truth data is available for training the model parameters. Also, tracking-based algorithms have been proposed in [12, 13]. They extract objects in the first frame based on users' markings, and then track the objects in subsequent frames using color, position, and temporal cues. These semi-automatic methods [10–13] can achieve relatively accurate segmentation results using initial interactions. However, the interactions prevent them from being used in applications that require full automation.
On the other hand, automatic video segmentation methods extract objects without initial interactions [14–17]. They have an object detection stage, which defines the objects of interest. Since objects of interest are usually moving, motion information is typically employed to distinguish the objects from the background. The motion field between consecutive frames is estimated, and then regions are classified as object or background based on the motion information. Chien et al. [14] proposed a background registration technique to estimate a reliable background image. Moving objects are extracted by comparing each frame with the estimated background. Tsaig and Averbuch's algorithm [15] divides each frame into small regions, finds the matching regions between consecutive frames, and declares the regions with large motions as objects. Yin et al.'s algorithm [16] learns segmentation likelihoods from the spatial contexts of motion information, and extracts objects automatically with tree-based classifiers. In [17], Zhang et al. proposed estimating the depth information of sparse points to detect foreground objects. These automatic methods [14–17] are effective, provided that objects and the background exhibit different motion characteristics. However, they may not provide accurate results for sequences in which objects move little or not at all.
Recently, a new approach to automatic object segmentation, which uses extra information, such as depth, flash/no-flash difference, and depth-of-field (DoF), has been introduced [18–21]. Kolmogorov et al.'s algorithm [18] uses a stereo camera to estimate depth information, which is in turn used to extract foreground objects. It does not depend on the motion information between successive frames, but on the disparity information between stereo views. However, disparity estimation is itself a challenging task that requires heavy computation. In [19, 20], a flash is used to extract foreground objects with a single camera. After acquiring an ordinary image without flash, these methods capture an additional image lit by a flash. Then, by comparing the flash image with the no-flash one, color and intensity differences are obtained to extract objects. An alternative method is to use a matting model. In an image with a shallow DoF, objects are focused while the background is not. Thus, the focused objects can be extracted automatically.
In this paper, we propose a novel algorithm to extract objects as well as humans from video sequences automatically. We extend the image segmentation techniques in [19, 20], which use a pair of flash and no-flash images, to the video segmentation case. The proposed algorithm is a tracking-based scheme using an alternate flashing system. When acquiring a video sequence, we capture even and odd frames with and without flash lights, respectively. Then, we find matching points between lit and unlit frames to construct a sensitivity map, from which the depth information can be inferred. In addition to the sensitivity map, color and temporal features are used to define an energy function, which is minimized by a graph cut algorithm to yield segmentation results. Simulation results demonstrate that the proposed algorithm provides reliable segmentation results.
The main contributions of this paper can be summarized as follows. First, we design a dedicated flashing system to capture an alternate series of lit and unlit frames. Second, we develop an efficient motion-compensated interpolation scheme, which matches lit and unlit frames to construct a sensitivity map. Third, we use the sensitivity map to accurately extract complex and deformable objects, especially humans, which are hard to segment out using conventional segmentation algorithms. Last, we implement a faster version of the proposed algorithm, which can be employed in real-time segmentation applications.
The rest of this paper is organized as follows. Section 2 describes our flashing system. Section 3 explains the features for segmentation, and Section 4 details the energy minimization scheme. Section 5 discusses implementation issues for real-time segmentation. Section 6 provides simulation results. Finally, Section 7 concludes the paper.
2.1. Video Acquisition
When we capture a video of human subjects, alternate flashing may annoy them. To alleviate the annoyance, we set the frame rate to 120 frames/s, which corresponds to the flashing frequency of 60 Hz. At this relatively high frequency, humans can hardly notice flickering and the lights appear to be turned on steadily.
Since the proposed algorithm can achieve accurate segmentation results, it can be employed in various applications in which the flashing system can be installed. For example, it can be used to understand human behaviors in indoor environments, to substitute backgrounds in video conferencing applications [23, 24], and to help mobile robots detect obstacles [25]. Note that the flashing system is less effective in bright outdoor environments. Also, the current prototype of the flashing system is relatively bulky, but we expect that its size can be reduced by more sophisticated packaging so that it can be built into a handheld camera system.
2.2. Matching between Lit and Unlit Frames
Using the alternate flashing system, we capture an input sequence at a frame rate of 120 frames/s. The proposed algorithm extracts the object layer from the unlit sequence, which has a frame rate of 60 frames/s. As shown in Figure 2, for each unlit frame, the proposed algorithm synthesizes the corresponding lit frame at the same time instance, and then compares the unlit frame with the synthesized lit frame to derive the depth information.
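The frame bookkeeping described here can be sketched as follows. This is a minimal illustration, assuming frames are indexed from 0 and that even-indexed frames are lit while odd-indexed frames are unlit, as stated in the introduction; the function name `split_lit_unlit` is hypothetical and not part of the original system.

```python
def split_lit_unlit(frames):
    """Split an alternately flashed capture into lit and unlit streams.

    Assumption: index 0 is a lit frame, so even indices are lit and
    odd indices are unlit.
    """
    lit = frames[0::2]
    unlit = frames[1::2]
    return lit, unlit


capture_fps = 120
frames = list(range(12))  # stand-in for 0.1 s of captured frames
lit, unlit = split_lit_unlit(frames)

# Half of the captured frames are unlit, so the unlit stream runs at half the rate.
unlit_fps = capture_fps * len(unlit) / len(frames)
```

Segmentation then operates on `unlit` only, while `lit` supplies the frames from which a lit counterpart of each unlit frame is interpolated.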
The synthesized lit frame is interpolated from four neighboring frames, as shown in Figure 2. To synthesize it with a motion-compensated interpolation scheme, we develop a two-step motion estimation procedure. First, we estimate the global motion, which represents the motion of the background. Second, we refine the local motions of objects in a bilateral manner, using the information in the two subsequent frames as well as the two past frames.
In the optical flow equation in (2), the coefficients are the spatial derivatives and the temporal derivative of the image intensity. By plugging (1) into the optical flow equation in (2), a linear system of equations for the six unknown parameters is derived and then solved using the least squares method. Note that an equation is set up only for each pixel in the already segmented background layer, to avoid the effects of individual object motions on the global background motion estimation. Then, the global motion over half of the estimation interval is approximated as half of the estimated global motion.
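As an illustration of this least squares step, the sketch below fits a six-parameter affine motion model to optical flow constraints. The parameterization u = a1 + a2·x + a3·y, v = a4 + a5·x + a6·y and all function names are assumptions made for this example; the exact form of (1) is not reproduced here, and the derivatives are synthetic rather than computed from real frames.

```python
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def estimate_affine_motion(samples):
    """Least squares affine motion from optical flow constraints.

    samples: (x, y, ix, iy, it) tuples taken from background pixels only,
    where ix, iy, it are the spatial and temporal intensity derivatives.
    Each sample contributes one equation ix*u + iy*v + it = 0.
    """
    ata = [[0.0] * 6 for _ in range(6)]
    atb = [0.0] * 6
    for x, y, ix, iy, it in samples:
        row = [ix, ix * x, ix * y, iy, iy * x, iy * y]
        for i in range(6):
            atb[i] += row[i] * (-it)
            for j in range(6):
                ata[i][j] += row[i] * row[j]
    return solve_linear(ata, atb)


# Synthetic check: build constraints from known parameters and recover them.
rng = random.Random(0)
true_params = [1.0, 0.01, -0.02, -0.5, 0.03, 0.01]
samples = []
for _ in range(200):
    x, y = rng.uniform(0, 351), rng.uniform(0, 287)
    ix, iy = rng.uniform(-1, 1), rng.uniform(-1, 1)
    u = true_params[0] + true_params[1] * x + true_params[2] * y
    v = true_params[3] + true_params[4] * x + true_params[5] * y
    samples.append((x, y, ix, iy, -(ix * u + iy * v)))

estimated = estimate_affine_motion(samples)
# Halving every parameter halves the displacement field, which matches the
# approximation of the motion over half of the estimation interval.
half_step = [0.5 * p for p in estimated]
```

Restricting `samples` to background pixels is what keeps independently moving objects from biasing the fit.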
In the Gaussian mixture model in (8), each component is a Gaussian distribution with its own mean and variance, scaled by a mixing weight. One weighted component represents the sensitivity distribution of background pixels. Similarly, the other represents the sensitivity distribution of object pixels. Therefore, the two weighted components can be interpreted as the likelihoods that a pixel belongs to the background layer and the object layer, respectively.
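A two-component mixture of this kind is typically fitted with the EM algorithm, which the paper also names for this purpose. The sketch below is a minimal one-dimensional version under stated assumptions: component 0 is taken to be the background (lower sensitivity) and component 1 the object, the initialization and iteration count are arbitrary choices, and the data are synthetic.

```python
import math
import random


def gaussian(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(values, iterations=50):
    """EM for a two-component 1-D Gaussian mixture (hypothetical helper)."""
    lo, hi = min(values), max(values)
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    variances = [((hi - lo) / 4.0) ** 2 + 1e-6] * 2
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in values:
            p = [weights[k] * gaussian(x, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1] + 1e-300
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            variances[k] = sum(r[k] * (x - means[k]) ** 2
                               for r, x in zip(resp, values)) / nk + 1e-6
            weights[k] = nk / len(values)
    return weights, means, variances


def layer_likelihoods(x, weights, means, variances):
    """Weighted component densities: (background, object) under the assumption above."""
    return tuple(weights[k] * gaussian(x, means[k], variances[k]) for k in range(2))


# Synthetic sensitivities: a dim background cluster and a brighter object cluster.
rng = random.Random(1)
sensitivities = ([rng.gauss(0.1, 0.05) for _ in range(600)]
                 + [rng.gauss(0.6, 0.1) for _ in range(400)])
weights, means, variances = fit_two_gaussians(sensitivities)
bg_like, obj_like = layer_likelihoods(0.55, weights, means, variances)
```

For a sensitivity near the brighter cluster, `obj_like` exceeds `bg_like`, which is the behavior the likelihood interpretation relies on.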
Since the sensitivity is a main feature for the classification, the exposure time for lit frames, which affects the quality of the sensitivity map, should be selected carefully. Note that the exposure time for unlit frames is set to record the natural moods and colors of scenes properly. If the exposure time for lit frames is identical to that for unlit frames, the intensities of object pixels may be saturated due to the limited dynamic range of the camera, making the sensitivity map unreliable. On the other hand, if the exposure time is too short, the pixels in the lit frame may be underexposed. Therefore, in this work, we set the exposure time for lit frames by considering the tradeoff between the saturation and the underexposure problems.
Although the sensitivity is a robust feature for segmentation, there are limitations in separating objects from the background using only the sensitivity map. Since the amount of reflected light is determined not only by the distance from the camera to the object but also by the surface albedo and normals, the sensitivity map does not match the depth information perfectly. Therefore, to achieve more reliable segmentation results, we use color and temporal coherence as additional features.
Colors make it easy to distinguish the layers, since they do not change dramatically between adjacent frames. After segmenting the previous unlit frame, we estimate the probability density functions (pdf's) of colors in the object layer and the background layer, respectively, and use those pdf's to segment the current unlit frame. We regard the sensitivity as an additional color component, and represent each pixel in an augmented color space whose extra component is the sensitivity.
where c_d is the volume of the unit d-dimensional sphere: c_1 = 2, c_2 = π, c_3 = 4π/3, and so forth. The Epanechnikov kernel is radially symmetric and unimodal. It has the advantage that it can be calculated more quickly than the Gaussian kernel.
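For concreteness, a kernel density estimate with the multivariate Epanechnikov kernel can be sketched as below. The kernel is written in its standard textbook form, K(u) = (d + 2)(1 − ‖u‖²)/(2·c_d) for ‖u‖ < 1 and 0 otherwise, which is assumed here to agree with the paper's definition; the sample vectors and function names are illustrative only.

```python
import math


def unit_ball_volume(d):
    """Volume of the unit d-dimensional sphere: 2, pi, 4*pi/3, ..."""
    return math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)


def epanechnikov(u):
    """Multivariate Epanechnikov kernel evaluated at the vector u."""
    d = len(u)
    sq = sum(c * c for c in u)
    if sq >= 1.0:
        return 0.0
    return 0.5 * (d + 2) / unit_ball_volume(d) * (1.0 - sq)


def kernel_density(x, samples, bandwidth):
    """Kernel density estimate at x from a list of sample vectors."""
    d = len(x)
    total = sum(epanechnikov([(a - b) / bandwidth for a, b in zip(x, s)])
                for s in samples)
    return total / (len(samples) * bandwidth ** d)


# Illustrative 4-D samples: three color components plus a sensitivity value
# (arbitrary units chosen for this example).
object_samples = [(100, 120, 130, 40), (102, 119, 131, 42), (98, 121, 129, 38)]
near = kernel_density((101, 120, 130, 40), object_samples, bandwidth=2.0)
far = kernel_density((0, 0, 0, 0), object_samples, bandwidth=2.0)
```

Because the kernel is a truncated quadratic, samples farther than one bandwidth from the query contribute nothing and no exponential needs to be evaluated, which is where the speed advantage over the Gaussian kernel comes from.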
3.3. Temporal Coherence
Here, the coherence likelihood is based on the number of times that a pixel is classified as the object layer in the most recent unlit frames, where the pixel is traced back using its motion vector in the current frame, which is estimated using the method in Section 2. The coherence likelihood is higher if the pixel is assigned the same label as its predecessors.
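The counting idea can be sketched as follows. This is only an illustration: the coherence score is assumed to be the simple fraction of past frames that agree with a candidate label, and the pixel is traced back by applying the same motion vector to every past frame, both of which are simplifications that may differ from the paper's definition. All names are hypothetical.

```python
def object_count(history, x, y, motion):
    """Number of past label maps in which the traced-back pixel was object.

    history: list of 2-D label maps (1 = object, 0 = background), most recent last.
    motion: (dx, dy) displacement per frame, assumed constant over the window.
    """
    dx, dy = motion
    count = 0
    for age, labels in enumerate(reversed(history), start=1):
        px, py = x - age * dx, y - age * dy
        if 0 <= py < len(labels) and 0 <= px < len(labels[0]):
            count += labels[py][px]
    return count


def coherence_likelihood(history, x, y, motion, label):
    """Assumed score: fraction of past frames that agree with `label`."""
    n = len(history)
    k = object_count(history, x, y, motion)
    return k / n if label == 1 else (n - k) / n


# One-row label maps in which a single object pixel moves right by one per frame.
history = [
    [[0, 0, 0, 0]],
    [[1, 0, 0, 0]],
    [[0, 1, 0, 0]],
    [[0, 0, 1, 0]],
]
as_object = coherence_likelihood(history, x=3, y=0, motion=(1, 0), label=1)
as_background = coherence_likelihood(history, x=3, y=0, motion=(1, 0), label=0)
```

A pixel whose predecessors were mostly labeled object scores higher for the object label, so switching labels against the history is penalized.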
We assign a label, object or background, to each pixel in the current frame by minimizing an energy function defined over the label image, which is composed of the pixel labels. The energy function is composed of the sensitivity, color, temporal coherence, and smoothness terms, which impose constraints on the pixel labels.
This sensitivity term indicates that, if a pixel is assigned to one class, its sensitivity probability in that class should be higher than that in the other class. Note that the Gaussian mixture model of the overall sensitivity distribution in (8) can be regarded as the reliability of the sensitivity. By incorporating this reliability as a weight in the summation in (13), pixels with more reliable sensitivity values play more important roles in the energy minimization.
The color term constrains each pixel to be assigned the label with the higher color probability.
The temporal coherence term attempts to reduce outliers by penalizing a pixel that is assigned a label different from those of its temporal predecessors.
The energy minimization is carried out through the graph cut algorithm, which is an effective energy minimization method. The min-cut of a weighted graph provides the segmentation that best separates objects from the background.
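To make the link between min-cut and labeling concrete, the sketch below minimizes a small binary energy (per-pixel data costs plus a Potts smoothness penalty) on a one-dimensional row of pixels. It uses a plain Edmonds-Karp max-flow written for this example, not the specific graph cut implementation used by the authors, and the costs are made-up numbers.

```python
from collections import deque


def energy(labels, cost_obj, cost_bg, smooth):
    """Data cost of each label plus a penalty for each pair of differing neighbours."""
    data = sum(cost_obj[i] if l == 1 else cost_bg[i] for i, l in enumerate(labels))
    pairs = sum(smooth for a, b in zip(labels, labels[1:]) if a != b)
    return data + pairs


def min_cut_labels(cost_obj, cost_bg, smooth):
    """Binary labeling of a pixel row by s-t min-cut (1 = object, 0 = background)."""
    n = len(cost_obj)
    s, t = n, n + 1
    cap = [dict() for _ in range(n + 2)]

    def add_edge(u, v, c):
        cap[u][v] = cap[u].get(v, 0.0) + c
        cap[v].setdefault(u, 0.0)

    for i in range(n):
        add_edge(s, i, cost_bg[i])   # cut when pixel i ends on the background side
        add_edge(i, t, cost_obj[i])  # cut when pixel i stays on the object side
        if i + 1 < n:
            add_edge(i, i + 1, smooth)
            add_edge(i + 1, i, smooth)

    def bfs():
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    if v == t:
                        return parent
                    queue.append(v)
        return parent

    while True:
        parent = bfs()
        if t not in parent:
            break
        # Find the bottleneck capacity along the augmenting path.
        flow, v = float("inf"), t
        while parent[v] is not None:
            flow = min(flow, cap[parent[v]][v])
            v = parent[v]
        # Push the flow and update residual capacities.
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= flow
            cap[v][u] += flow
            v = u
    # Pixels still reachable from the source lie on the object side of the cut.
    return [1 if i in parent else 0 for i in range(n)]


cost_obj = [1, 1, 6, 1, 9, 9, 9]
cost_bg = [9, 9, 4, 9, 1, 1, 1]
smoothed = min_cut_labels(cost_obj, cost_bg, smooth=3)
unsmoothed = min_cut_labels(cost_obj, cost_bg, smooth=0)
```

Pixel 2 weakly prefers the background on its own, but with the smoothness penalty the cheapest cut keeps it with its object neighbours, which is the outlier suppression that the pairwise term is meant to provide.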
The proposed algorithm, described in Sections 2–4, achieves high quality segmentation results, but its computational complexity is relatively high. In this section, we develop a faster version of the proposed algorithm, which reduces the computational complexity significantly at the cost of slight performance degradation. The faster version can be used in real-time applications.
The proposed real-time segmentation algorithm also employs sensitivity, color, and coherence features. However, the feature computations are simplified as follows.
5.1. Simplified Motion Estimation
To compute the sensitivity map for an unlit frame, we synthesize the corresponding lit frame based on motion-compensated interpolation. Since the motion estimation demands high complexity, it is simplified in the following way. While both the global motion estimation and the local motion refinement are performed in Section 2.2, the real-time algorithm carries out the global motion estimation only. Furthermore, it uses the translation model instead of the affine model in (1). The two parameters for the translational motion are also estimated using the optical flow equation in (2) and the least squares method. Pixels near object boundaries tend to have high matching errors after the global motion compensation. Thus, we mark those pixels as void and use only the non-void pixels, whose matching errors are less than a threshold, to synthesize the lit frame and compute the sensitivities. The sensitivity distribution in (8) is also obtained and modeled using the EM algorithm, excluding the void pixels.
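The simplified estimation can be sketched as below: a two-parameter translation is fitted to the optical flow constraints by solving the 2×2 normal equations, and pixels with large matching errors are marked void. The function names, the synthetic derivatives, and the threshold value are assumptions for illustration.

```python
import random


def estimate_translation(samples):
    """Least squares translation (u, v) from (ix, iy, it) derivative triples.

    Each sample contributes the constraint ix*u + iy*v + it = 0.
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for ix, iy, it in samples:
        sxx += ix * ix
        sxy += ix * iy
        syy += iy * iy
        sxt += ix * it
        syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("gradients are degenerate; translation is not determined")
    u = (-sxt * syy + syt * sxy) / det
    v = (-syt * sxx + sxt * sxy) / det
    return u, v


def mark_void(matching_errors, threshold):
    """True where the matching error is too large for the pixel to be used."""
    return [err >= threshold for err in matching_errors]


# Synthetic check: constraints generated from a known translation.
rng = random.Random(2)
true_u, true_v = 1.5, -0.75
samples = []
for _ in range(100):
    ix, iy = rng.uniform(-1, 1), rng.uniform(-1, 1)
    samples.append((ix, iy, -(ix * true_u + iy * true_v)))
u, v = estimate_translation(samples)

void = mark_void([0.1, 5.0, 0.3], threshold=1.0)
```

Only pixels where `void` is false would contribute to the synthesized lit frame and to the EM fit of the sensitivity distribution.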
5.2. Block-Based Color Model
In (18), the density is evaluated at the color vector of each pixel. This equation is the same as (9), except for the reduction of the sample space from the frame to the block and the reduction of the color dimension. To reduce the complexity, a uniform kernel with a narrow bandwidth is employed in (18), and the color distributions for each block are saved as lookup tables.
5.3. Coherence Strip
The real-time algorithm exploits the property that an object generally does not change its position abruptly between consecutive frames. Specifically, given the object contour in the previous frame, we construct a coherence strip, in which the object contour in the current frame is likely to be located. The notion of the coherence strip helps to extract a spatio-temporally coherent video object as well as to reduce the computational complexity.
Here, the strip parameter is set to 10 for all experiments in this work. In addition, we translate the coherence strip in (20) by the object motion vector. The shifted spatio-temporal coherence is more accurate than (20), especially for an object with fast motion.
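One way to picture the coherence strip is as a band of pixels around the previous object contour, shifted by the object motion. The sketch below builds such a band on a small binary mask. It is an assumption-laden illustration: the band is defined by a square neighbourhood of a given half-width, the value 10 quoted in the text is assumed, not confirmed, to play the role of this half-width (a smaller value is used here so the example stays readable), and the function names are hypothetical.

```python
def contour_pixels(mask):
    """Object pixels that touch the background or the image border (4-neighbourhood)."""
    h, w = len(mask), len(mask[0])
    contour = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                    contour.append((x, y))
                    break
    return contour


def coherence_strip(prev_mask, half_width, motion=(0, 0)):
    """Band of pixels within `half_width` of the motion-shifted previous contour."""
    h, w = len(prev_mask), len(prev_mask[0])
    dx, dy = motion
    strip = [[False] * w for _ in range(h)]
    for cx, cy in contour_pixels(prev_mask):
        cx, cy = cx + dx, cy + dy
        for y in range(max(0, cy - half_width), min(h, cy + half_width + 1)):
            for x in range(max(0, cx - half_width), min(w, cx + half_width + 1)):
                strip[y][x] = True
    return strip


# Previous segmentation: a 5x5 object in the middle of a 9x9 frame.
prev_mask = [[1 if 2 <= x <= 6 and 2 <= y <= 6 else 0 for x in range(9)]
             for y in range(9)]
strip = coherence_strip(prev_mask, half_width=1)
shifted = coherence_strip(prev_mask, half_width=1, motion=(1, 0))
```

Restricting the labeling problem to pixels inside the strip is what reduces the computation, since pixels well inside or well outside the previous object are not revisited.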
5.4. Energy Minimization
The proposed video object segmentation algorithm is implemented in the C++ language on a personal computer with Pentium-IV 3.0 GHz CPU and 2 Gbyte memory. Two versions of the proposed algorithm are implemented: the proposed algorithm I denotes the algorithm described in Sections 2–4, whereas the proposed algorithm II denotes the faster algorithm in Section 5 for real-time applications.
We use several test sequences of CIF size (352 × 288), captured using the alternate flashing system. As mentioned in Section 2, the sequences are captured at a frame rate of 120 frames/s, and the segmentation is performed only on the unlit frames, at a frame rate of 60 frames/s. For the proposed algorithm I, the bandwidth of the kernel in (9) is set to 2, the parameter in (11) is set to 8, and the weights in (17) are fixed to 0.3, 0.32, and 0.012, respectively. For the proposed algorithm II, the parameter in (18) is 1, and the weights in (25) are 0.2, 0.12, and 0.04.
In this paper, we proposed an automatic video segmentation algorithm, which can provide high quality results using the alternate flashing system. By comparing unlit frames with lit ones, the proposed algorithm obtains the sensitivity map indicating depth information. The proposed algorithm also obtains the color pdf's for the object and the background layers, and constructs coherence likelihoods. By minimizing the energy function, composed of the sensitivity, color, coherence, and smoothness terms, the proposed algorithm obtains accurate segmentation results. Moreover, we developed a faster version of the proposed algorithm, which reduces the computational complexity significantly to achieve real-time segmentation. Experimental results on various test sequences demonstrated that the proposed algorithm provides reliable and accurate segmentation results.
This paper was supported partly by Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education, Science and Technology (2009-0083495), and partly by Seoul R & BD Program (no. ST090818).
- Moeslund TB, Hilton A, Krüger V: A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 2006, 104(2-3):90-126. doi:10.1016/j.cviu.2006.08.002
- Plänkers R, Fua P: Articulated soft objects for multiview shape and motion capture. IEEE Transactions on Pattern Analysis and Machine Intelligence 2003, 25(9):1182-1187. doi:10.1109/TPAMI.2003.1227995
- Carranza J, Theobalt C, Magnor MA, Seidel H-P: Free-viewpoint video of human actors. ACM Transactions on Graphics 2003, 22(3):569-577. doi:10.1145/882262.882309
- Agarwal A, Triggs B: 3D human pose from silhouettes by relevance vector regression. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), 2004, 882-888.
- Sminchisescu C, Kanaujia A, Li Z, Metaxas D: Discriminative density propagation for 3D human motion estimation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), 2005, 20-25.
- Cui Y, Weng J: Appearance-based hand sign recognition from intensity image sequences. Computer Vision and Image Understanding 2000, 78(2):157-176. doi:10.1006/cviu.2000.0837
- Song P, Yu H, Winkler S: Vision-based 3D finger interactions for mixed reality games with physics simulation. International Journal of Virtual Reality 2009, 8(2):1-6.
- Wang L, Tan T, Ning H, Hu W: Silhouette analysis-based gait recognition for human identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 2003, 25(12):1505-1518. doi:10.1109/TPAMI.2003.1251144
- Kale A, Sundaresan A, Rajagopalan AN, et al.: Identification of humans using gait. IEEE Transactions on Image Processing 2004, 13(9):1163-1173. doi:10.1109/TIP.2004.832865
- Sun J, Zhang W, Tang X, Shum H: Background cut. Proceedings of the European Conference on Computer Vision, 2006, 628-641.
- Criminisi A, Cross G, Blake A, Kolmogorov V: Bilayer segmentation of live video. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), 2006, 53-60.
- Liu Z, Shen L, Han Z, Zhang Z: A novel video object tracking approach based on kernel density estimation and Markov random field. Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), 2007, 373-376.
- Ahn J-K, Kim C-S: Real-time segmentation of objects from video sequences with non-stationary backgrounds using spatio-temporal coherence. Proceedings of the International Conference on Image Processing (ICIP '08), October 2008, San Diego, Calif, USA, 1544-1547.
- Chien S-Y, Ma S-Y, Chen L-G: Efficient moving object segmentation algorithm using background registration technique. IEEE Transactions on Circuits and Systems for Video Technology 2002, 12(7):577-586. doi:10.1109/TCSVT.2002.800516
- Tsaig Y, Averbuch A: Automatic segmentation of moving objects in video sequences: a region labeling approach. IEEE Transactions on Circuits and Systems for Video Technology 2002, 12(7):597-612. doi:10.1109/TCSVT.2002.800513
- Yin P, Criminisi A, Winn J, Essa I: Tree-based classifiers for bilayer video segmentation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), 2007.
- Zhang G, Jia J, Xiong W, Wong T-T, Heng P-A, Bao H: Moving object extraction with a hand-held camera. Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV '07), 2007.
- Kolmogorov V, Criminisi A, Blake A, Cross G, Rother C: Bi-layer segmentation of binocular stereo video. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), 2005, 407-414.
- Sun J, Kang SB, Xu Z-B, Tang X, Shum H-Y: Flash cut: foreground extraction with flash and no-flash image pairs. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), 2007.
- Sun J, Li Y, Kang SB, Shum H-Y: Flash matting. ACM Transactions on Graphics 2006, 25(1):772-778.
- Li H, Ngan KN: Unsupervised video segmentation with low depth of field. IEEE Transactions on Circuits and Systems for Video Technology 2007, 17(12):1742-1751.
- Point Grey Research: Triclops on-line manual. http://www.ptgrey.com/
- Baker H, Bhatti N, Tanguay D, et al.: Understanding performance in Coliseum, an immersive videoconferencing system. ACM Transactions on Multimedia Computing, Communications, and Applications 2005, 1(2):190-210. doi:10.1145/1062253.1062258
- Gharai L, Perkins C, Riley R, Mankin A: Large scale video conferencing: a digital amphitheater. Proceedings of the 8th International Conference on Distributed Multimedia Systems, 2002.
- Soumare S, Ohya A, Yuta S: Real-time obstacle avoidance by an autonomous mobile robot using an active vision sensor and a vertically emitted laser slit. In Intelligent Autonomous Systems 7. IOS Press; 2002.
- Horn BKP, Schunck BG: Determining optical flow. Artificial Intelligence 1981, 17(1–3):185-203.
- Smolić A, Sikora T, Ohm J-R: Long-term global motion estimation and its application for sprite coding, content description, and segmentation. IEEE Transactions on Circuits and Systems for Video Technology 1999, 9(8):1227-1242. doi:10.1109/76.809158
- Raskar R, Tan K-H, Feris R, Yu J, Turk M: Non-photorealistic camera: depth edge detection and stylized rendering using multi-flash imaging. ACM Transactions on Graphics 2004, 23(3):679-688. doi:10.1145/1015706.1015779
- Agrawal A, Raskar R, Nayar SK, Li Y: Removing photography artifacts using gradient projection and flash-exposure sampling. ACM Transactions on Graphics 2005, 24(3):828-835. doi:10.1145/1073204.1073269
- Debevec PE, Malik J: Recovering high dynamic range radiance maps from photographs. Proceedings of the ACM Conference on Computer Graphics (SIGGRAPH '97), 1997, 369-378.
- Moon TK: The expectation-maximization algorithm. IEEE Signal Processing Magazine 1996, 13(6):47-60. doi:10.1109/79.543975
- Silverman BW: Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, UK; 1986.
- Boykov YY, Jolly M-P: Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. Proceedings of the 8th International Conference on Computer Vision, 2001, 105-112.
- Boykov Y, Kolmogorov V: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 2004, 26(9):1124-1137. doi:10.1109/TPAMI.2004.60
- Rother C, Kolmogorov V, Blake A: "GrabCut": interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics 2004, 23(3):309-314. doi:10.1145/1015706.1015720
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.