Detection of moving objects in image plane for robot navigation using monocular vision
© Wang et al.; licensee Springer. 2012
Received: 8 May 2011
Accepted: 14 February 2012
Published: 14 February 2012
This article presents an algorithm for moving object detection (MOD) in robot visual simultaneous localization and mapping (SLAM). The MOD algorithm is designed based on the epipolar constraint that corresponding feature points on the image plane must satisfy. An essential matrix obtained from the state estimator is utilized to represent the epipolar constraint. Meanwhile, the speeded-up robust features (SURF) method is employed in the algorithm to provide robust detection of image features as well as a stable description of landmarks and moving objects in the visual SLAM system. Experiments are carried out on a hand-held monocular camera to verify the performance of the proposed algorithm. The results show that the integration of MOD and SURF is effective for robot navigation in dynamic environments.
In recent years, a growing number of researchers have addressed the simultaneous localization and mapping (SLAM) and moving object tracking (MOT) problems concurrently. Wang et al. [1] developed a consistency-based moving object detector and provided a framework to solve the SLAMMOT problem. Bibby and Reid [2] proposed a method that combines sliding window optimization and least-squares together with expectation maximization to perform reversible model selection and data association, which allows dynamic objects to be included directly in the SLAM estimation. Zhao et al. [3] used GPS data and control inputs to achieve global consistency in dynamic environments. There are many advantages to coping with the SLAM and MOT problems simultaneously. For example, mobile robots often navigate in dynamic environments crowded with moving objects; in this case, the SLAM estimate can be corrupted by the inclusion of moving entities if the information about moving objects is not taken into account. Furthermore, the robustness of robot localization and mapping algorithms can be improved if moving objects are discriminated from the stationary objects in the environment.
Using cameras to implement SLAM is the current trend because cameras are lightweight and low-cost and capture rich appearance and texture information about the surroundings. However, discriminating moving objects from stationary landmarks in dynamic environments remains a difficult problem in visual SLAM. To deal with this problem, we propose a moving object detection (MOD) algorithm based on the epipolar constraint for corresponding feature points on the image plane. Given an estimated essential matrix, it is possible to check whether a set of corresponding image points satisfies the epipolar constraint in the image plane. The epipolar constraint can therefore be utilized to distinguish moving objects from stationary landmarks in dynamic environments.
For visual SLAM systems, features in the environment are detected and extracted by analyzing the images taken by the robot vision system, and then the data association between the extracted features and the landmarks in the map is investigated. Many researchers [4, 5] employed the detector of Harris and Stephens [6] to extract apparent corner features from one image and tracked these point features in the consecutive images. The descriptors of Harris corner features are rectangular image patches. When the camera translates and rotates, the scale and orientation of the image patches change, and the detection and matching of Harris corners might fail unless these changes in scale and orientation are recovered. Instead of detecting corner features, some works [7, 8] detect features using the scale-invariant feature transform (SIFT) method [9], which provides a robust image feature detector. The unique properties of image features extracted by the SIFT method are further described by a high-dimensional description vector [9]. However, feature extraction by SIFT requires more computational cost than Harris's method [6]. To improve the computational speed, Bay et al. [10] introduced the concepts of integral images and box filters to detect and extract scale-invariant features, which they dubbed speeded-up robust features (SURF). The extracted SURF must be matched with the landmarks in the map of a SLAM system; the nearest-neighbor (NN) searching method [11] can be utilized to match such high-dimensional sets of description vectors. A minimal sketch of this detect-and-match pipeline is given below.
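The following Python sketch uses OpenCV's contrib implementation of SURF; the Hessian threshold, ratio-test value, and file names are illustrative assumptions, not values from the paper.

import cv2

# SURF lives in the contrib module (opencv-contrib-python) and may be
# unavailable in builds that exclude patented algorithms.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)  # hypothetical frames
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute 64-dimensional SURF descriptors.
kp1, des1 = surf.detectAndCompute(img1, None)
kp2, des2 = surf.detectAndCompute(img2, None)

# Nearest-neighbor search over descriptors, with a ratio test to
# discard ambiguous correspondences.
matcher = cv2.BFMatcher(cv2.NORM_L2)
pairs = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in pairs if m.distance < 0.7 * n.distance]
print(len(good), "putative correspondences")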
In this article, an online SLAM system with a moving object detector is developed based on the epipolar constraint for corresponding feature points on the image plane. The corresponding image features are obtained using the SURF method [10], and the epipolar constraint is evaluated using an estimated essential matrix. Moving object information is detected in the image plane and integrated into the MOT process so that the robustness of the SLAM algorithm can be considerably improved, particularly in highly dynamic environments where the surroundings of the robot are dominated by non-stationary objects. The contributions of this article are twofold. First, we develop an algorithm to solve the MOD problem in the image plane and integrate it with robot SLAM to improve the robustness of the state estimation and mapping processes. Second, the improved SLAM system is implemented on a hand-held monocular camera, which can be utilized as the sensor system for robot navigation in dynamic environments.
The SLAM problem with monocular vision is briefly introduced in Section 2. In Section 3, the proposed MOD algorithm is explained in detail. Experiments verifying the performance of the proposed algorithm are described in Section 4. Section 5 gives the concluding remarks.
2. SLAM with a free-moving monocular camera
where $x_{k|k-1}$ and $x_{k|k}$ represent the predicted and estimated state vectors, respectively; $K_k$ is the Kalman gain matrix; $P$ denotes the corresponding covariance matrix; $A_k$ and $W_k$ are the Jacobian matrices of the state equation $f$ with respect to the state vector $x_k$ and the noise variable $w_k$, respectively; and $H_k$ and $V_k$ are the Jacobian matrices of the measurement equation $g$ with respect to the state vector $x_k$ and the noise variable $v_k$, respectively.
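For reference, a standard extended Kalman filter prediction-update cycle consistent with the symbols defined above can be written as follows, where $Q_k$ and $R_k$ denote the process and measurement noise covariances (names we introduce here for completeness):

$$
\begin{aligned}
x_{k|k-1} &= f\left(x_{k-1|k-1},\,0\right), \qquad
P_{k|k-1} = A_k P_{k-1|k-1} A_k^{T} + W_k Q_k W_k^{T},\\
K_k &= P_{k|k-1} H_k^{T}\left(H_k P_{k|k-1} H_k^{T} + V_k R_k V_k^{T}\right)^{-1},\\
x_{k|k} &= x_{k|k-1} + K_k\left(z_k - g\left(x_{k|k-1},\,0\right)\right), \qquad
P_{k|k} = \left(I - K_k H_k\right) P_{k|k-1}.
\end{aligned}
$$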
2.1. Motion model
$x_C$ is a 12 × 1 state vector of the camera comprising the three-dimensional vectors of position $r$, rotational angle $\phi$, linear velocity $v$, and angular velocity $\omega$, all in the world frame; $m_i$ is the three-dimensional (3D) coordinate vector of the $i$th stationary landmark in the world frame; $O_j$ is the state vector of the $j$th moving object; and $n$ and $l$ are the numbers of landmarks and moving objects, respectively.
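Written out directly from these definitions, the full state vector stacks the camera state, the landmarks, and the moving objects:

$$
x_k = \begin{bmatrix} x_C^T & m_1^T & \cdots & m_n^T & O_1^T & \cdots & O_l^T \end{bmatrix}^T,
\qquad
x_C = \begin{bmatrix} r^T & \phi^T & v^T & \omega^T \end{bmatrix}^T .
$$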
where $p_{jk}$ and $v_{jk}$ are the position and linear velocity vectors of the $j$th moving object at time step $k$, respectively.
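A common choice consistent with this parametrization, and an assumption on our part rather than a statement from the paper, is a constant-velocity model for each moving object:

$$
O_{jk} = \begin{bmatrix} p_{jk} \\ v_{jk} \end{bmatrix},
\qquad
O_{j,k+1} = \begin{bmatrix} p_{jk} + v_{jk}\,\Delta t \\ v_{jk} \end{bmatrix} + w_{jk},
$$

where $\Delta t$ is the sampling interval and $w_{jk}$ is the process noise of the $j$th object.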
2.2. Vision sensor model
Moreover, the elements of the Jacobian matrices $H_k$ and $V_k$ are determined by taking the derivative of $z_i$ with respect to the state $x_k$ and the measurement noise $v_k$. The Jacobian matrices are obtained for the purpose of calculating the innovation covariance matrix in the EKF estimation process.
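As a sketch, assuming the standard perspective projection with the calibration parameters $f_u$, $f_v$, $u_0$, $v_0$ reported in Section 4, the measurement of landmark $m_i$ takes the form

$$
z_i = g(x_k, v_k) = \begin{bmatrix} u_0 + f_u\, x^c_i / z^c_i \\ v_0 + f_v\, y^c_i / z^c_i \end{bmatrix} + v_k,
$$

where $(x^c_i, y^c_i, z^c_i)$ are the coordinates of $m_i$ expressed in the camera frame; $H_k$ and $V_k$ then follow by differentiating this expression.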
2.3. Feature initialization
The derivative is evaluated at the predicted state $x_k = x_{k|k-1}$ and at $v_k = 0$.
2.4. Speeded-up robust features (SURF)
2.5. Implementation of SLAM
3. Moving object detection and tracking
Equation (27) indicates that the pixel coordinates of the corresponding feature in the second image are constrained to lie on the epipolar line.
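In the classical form due to Longuet-Higgins [20], the constraint and the induced epipolar line read

$$
\tilde{p}_2^{\,T} E\, \tilde{p}_1 = 0, \qquad \ell_2 = E\, \tilde{p}_1,
$$

where $\tilde{p}_1$ and $\tilde{p}_2$ are the normalized homogeneous coordinates of a corresponding point pair and $\ell_2$ is the epipolar line in the second image on which $\tilde{p}_2$ must lie.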
$D$ is utilized in this article to denote the pixel deviation from the epipolar line that is induced by the motion and measurement noise in the state estimation process. Depending on how the noise related to each constraint is measured, it is possible to design a threshold value in Equation (28) under which a set of corresponding image points is considered to satisfy the epipolar constraint. For example, suppose the image feature of a static object in the first image is located at $(I_x, I_y) = (50, 70)$ and the camera then moves 1 cm along the $z_c$-axis. The deviation of the corresponding image feature in the second image is then limited to a bounded range, as shown in Figure 7, as $I_x$ varies from 1 to 320. A minimal sketch of this deviation test follows.
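The following Python sketch is our illustration rather than the authors' implementation: it computes the deviation $D$ of a correspondence from its epipolar line, given an essential matrix $E$ from the state estimator and the intrinsic matrix $K$; the 2-pixel threshold is an assumed tuning value.

import numpy as np

def epipolar_deviation(E, K, p1, p2):
    # Pixel distance D of point p2 (image 2) from the epipolar line
    # induced by its correspondence p1 (image 1).
    K_inv = np.linalg.inv(K)
    F = K_inv.T @ E @ K_inv                      # fundamental matrix in pixel units
    l = F @ np.array([p1[0], p1[1], 1.0])        # epipolar line a*u + b*v + c = 0
    p2h = np.array([p2[0], p2[1], 1.0])
    return abs(p2h @ l) / np.hypot(l[0], l[1])   # point-to-line distance

def is_moving(E, K, p1, p2, threshold=2.0):
    # Features whose deviation exceeds the threshold violate the
    # epipolar constraint and are labeled as moving.
    return epipolar_deviation(E, K, p1, p2) > threshold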
4. Experimental results
In this section, the experimental works of the online SLAM with a moving object detector are implemented on a laptop computer running Microsoft Windows XP. The laptop is an Asus U5F with an Intel Core 2 Duo T5500 (1.66 GHz), Mobile Intel i945GM chipset, and 1 GB of DDR2 RAM. The free-moving monocular camera utilized in this work is a Logitech C120 CMOS webcam with 320 × 240 pixel resolution and a USB 2.0 interface. The camera is calibrated using the Matlab toolbox provided by Bouguet [23]. The focal lengths are $f_u$ = 364.4 pixels and $f_v$ = 357.4 pixels, and the offsets are $u_0$ = 156.0 pixels and $v_0$ = 112.1 pixels. We carried out three experiments: SLAM in a static environment, SLAM with MOT, and people detection and tracking.
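These calibration values assemble into the standard pinhole intrinsic matrix, which the deviation test of Section 3 needs in order to map pixel coordinates to normalized coordinates:

$$
K = \begin{bmatrix} f_u & 0 & u_0 \\ 0 & f_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}
  = \begin{bmatrix} 364.4 & 0 & 156.0 \\ 0 & 357.4 & 112.1 \\ 0 & 0 & 1 \end{bmatrix} .
$$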
4.1. SLAM in static environment
4.2. SLAM with MOT
4.3. People detection and tracking
In this research, we developed an algorithm for the detection and tracking of moving objects to improve the robustness of a robot visual SLAM system. SURF features are utilized to provide robust detection of image features and a stable description of the features. Three experiments were carried out on a monocular vision system: SLAM in a static environment, SLAM with MOT, and people detection and tracking. The results showed that the monocular SLAM system with the proposed algorithm has the capability to support robot systems that simultaneously navigate and track moving objects in dynamic environments.
This work was partially supported by the National Science Council in Taiwan under grant no. NSC100-2221-E-032-008 to Y.T. Wang.
1. Wang CC, Thorpe C, Thrun S, Hebert M, Durrant-Whyte H: Simultaneous localization, mapping and moving object tracking. Int J Robot Res 2007, 26(9):889-916. doi:10.1177/0278364907081229
2. Bibby C, Reid I: Simultaneous localisation and mapping in dynamic environments (SLAMIDE) with reversible data association. In Proceedings of Robotics: Science and Systems III. Georgia Institute of Technology, Atlanta; 2007.
3. Zhao H, Chiba M, Shibasaki R, Shao X, Cui J, Zha H: SLAM in a dynamic large outdoor environment using a laser scanner. In Proceedings of the IEEE International Conference on Robotics and Automation. Pasadena, California; 2008:1455-1462.
4. Davison AJ, Reid ID, Molton ND, Stasse O: MonoSLAM: real-time single camera SLAM. IEEE T Pattern Anal 2007, 29(6):1052-1067.
5. Paz LM, Pinies P, Tardos JD, Neira J: Large-scale 6-DOF SLAM with stereo-in-hand. IEEE T Robot 2008, 24(5):946-957.
6. Harris C, Stephens M: A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference. University of Manchester; 1988:147-151.
7. Karlsson N, Bernardo ED, Ostrowski J, Goncalves L, Pirjanian P, Munich ME: The vSLAM algorithm for robust localization and mapping. In Proceedings of the IEEE International Conference on Robotics and Automation. Barcelona, Spain; 2005:24-29.
8. Sim R, Elinas P, Little JJ: A study of the Rao-Blackwellised particle filter for efficient and accurate vision-based SLAM. Int J Comput Vision 2007, 74(3):303-318. doi:10.1007/s11263-006-0021-0
9. Lowe DG: Distinctive image features from scale-invariant keypoints. Int J Comput Vision 2004, 60(2):91-110.
10. Bay H, Tuytelaars T, Van Gool L: SURF: speeded up robust features. In Proceedings of the Ninth European Conference on Computer Vision. Lecture Notes in Computer Science 3951. Springer-Verlag, Berlin, Germany; 2006:404-417.
11. Shakhnarovich G, Darrell T, Indyk P: Nearest-Neighbor Methods in Learning and Vision. MIT Press, Cambridge, MA; 2005.
12. Smith R, Self M, Cheeseman P: Estimating uncertain spatial relationships in robotics. In Autonomous Robot Vehicles. Edited by Cox IJ, Wilfong GT. Springer-Verlag, New York; 1990:167-193.
13. Blom HAP, Bar-Shalom Y: The interacting multiple-model algorithm for systems with Markovian switching coefficients. IEEE T Automat Control 1988, 33:780-783. doi:10.1109/9.1299
14. Hutchinson S, Hager GD, Corke PI: A tutorial on visual servo control. IEEE T Robot Automat 1996, 12(5):651-670. doi:10.1109/70.538972
15. Sciavicco L, Siciliano B: Modelling and Control of Robot Manipulators. McGraw-Hill, New York; 1996.
16. Wang YT, Lin MC, Ju RC: Visual SLAM and moving object detection for a small-size humanoid robot. Int J Adv Robot Syst 2010, 7(2):133-138.
17. Civera J, Davison AJ, Montiel JMM: Inverse depth parametrization for monocular SLAM. IEEE T Robot 2008, 24(5):932-945.
18. Lindeberg T: Feature detection with automatic scale selection. Int J Comput Vision 1998, 30(2):79-116. doi:10.1023/A:1008045108935
19. Wang YT, Hung DY, Sun CH: Improving data association in robot SLAM with monocular vision. J Inf Sci Eng 2011, 27(6):1823-1837.
20. Longuet-Higgins HC: A computer algorithm for reconstructing a scene from two projections. Nature 1981, 293:133-135. doi:10.1038/293133a0
21. Hartley RI: In defense of the eight-point algorithm. IEEE T Pattern Anal 1997, 19(6):580-593. doi:10.1109/34.601246
22. Luong QT, Faugeras OD: The fundamental matrix: theory, algorithms, and stability analysis. Int J Comput Vision 1996, 17(1):43-75. doi:10.1007/BF00127818
23. Bouguet JY: Camera Calibration Toolbox for Matlab. 2011. http://www.vision.caltech.edu/bouguetj/calib_doc/
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.