# Detection of moving objects in image plane for robot navigation using monocular vision

- Yin-Tien Wang
^{1}Email author, - Chung-Hsun Sun
^{1}and - Ming-Jang Chiou
^{1}

**2012**:29

https://doi.org/10.1186/1687-6180-2012-29

© Wang et al; licensee Springer. 2012

**Received: **8 May 2011

**Accepted: **14 February 2012

**Published: **14 February 2012

## Abstract

This article presents an algorithm for moving object detection (MOD) in robot visual simultaneous localization and mapping (SLAM). This MOD algorithm is designed based on the defining epipolar constraint for the corresponding feature points on image plane. An essential matrix obtained using the state estimator is utilized to represent the epipolar constraint. Meanwhile, the method of speeded-up robust feature (SURF) is employed in the algorithm to provide a robust detection for image features as well as a better description of landmarks and of moving objects in visual SLAM system. Experiments are carried out on a hand-held monocular camera to verify the performances of the proposed algorithm. The results show that the integration of MOD and SURF is efficient for robot navigating in dynamic environments.

## Keywords

## 1. Introduction

In recent years, more and more researchers solve the simultaneous localization and mapping (SLAM) as well as the moving object tracking (MOT) problems concurrently. Wang et al. [1] developed a consistency-based moving object detector and provided a framework to solve the SLAMMOT problems. Bibby and Reid [2] proposed a method that combines sliding window optimization and least-squares together with expectation maximization to do reversible model selection and data association that allows dynamic objects to be included directly into the SLAM estimation. Zhao et al. [3] used GPS data and control inputs to achieve global consistency in dynamic environments. There are many advantages to cope with SLAM and MOT problems simultaneously: for example, mobile robots might navigate in a dynamic environment crowded with moving objects. In this case the SLAM could be corrupted with the inclusion of moving entities if the information of moving objects is not taken account. Furthermore, the robustness of robot localization and mapping algorithms can be improved if the moving objects are discriminated from the stationary objects in the environment.

Using cameras to implement SLAM is the current trend because of their light weight and low-cost features, as well as containing rich appearance and texture information of the surroundings. However, it is still a difficult problem in visual SLAM to discriminate the moving objects from the stationary landmarks in dynamic environments. To deal with this problem, we propose the moving object detection (MOD) algorithm based on the epipolar constraint for the corresponding feature points on image plane. Given an estimated essential matrix it is possible to investigate whether a set of corresponding image points satisfy the defining epipolar constraint in image plane. Therefore, the epipolar constraint can be utilized to distinguish the moving objects from the stationary landmarks in dynamic environments.

For visual SLAM systems, the features in the environment are detected and extracted by analyzing the image taken by the robot vision, and then the data association between the extracted features and the landmarks in the map is investigated. Many researchers [4, 5] employed the concept by Harris and Stephens [6] to extract apparent corner features from one image and tracked these point features in the consecutive image. The descriptors of the Harris corner features are rectangle image patches. When the camera translates and rotates, the scale and orientation of the image patches will be changed. The detection and matching of Harris corner might fail in this case, unless the variances in scale and orientation of the image patches are recovered. Instead of detecting corner features, some works [7, 8] detect the features by using the scale-invariant feature transform (SIFT) method [9] which provides a robust image feature detector. The unique properties of image features extracted by SIFT method are further described by using a high-dimensional description vector [9]. However, the feature extraction by SIFT requires more computational cost than that by Harris's method [6]. To improve the computational speed, Bay et al. [10] introduced the concept of integral images and box filter to detect and extract the scale-invariant features, which they dubbed speeded-up robust features (SURF). The extracted SURF must be matched with the landmarks in the map of a SLAM system. The nearest-neighbor (NN) searching method [11] can be utilized to match high-dimensional data sets of description vectors.

In this article, an online SLAM system with a moving object detector is developed based on the epipolar constraint for the corresponding feature points on image plane. The corresponding image features are obtained using the SURF method [10] and the epipolar constraint is calculated using an estimated essential matrix. Moving object information is detected in image plane and integrated into the MOT process such that the robustness of SLAM algorithm can be considerably improved, particularly in highly dynamic environments where surroundings of robots are dominated by non-stationary objects. The contributions in this article are twofold. First, we develop an algorithm to solve the problems for MOD in image plane, and then the algorithm is integrated with the robot SLAM to improve the robustness of state estimation and mapping processes. Second, the improved SLAM system is implemented on a hand-held monocular camera which can be utilized as the sensor system for robot navigation in dynamic environments.

The SLAM problem with monocular vision will be briefly introduced in Section 2. In Section 3, the proposed algorithm of MOD is explained in detail. Some examples to verify the performance of the data association algorithm are described in Section 4. Section 5 is the concluding remarks.

## 2. SLAM with a free-moving monocular vision

*k*can be expressed as

**x**

_{ k }is the state vector;

**u**

_{ k }is the input;

*w*

_{ k }is the process noise. The objective of the tracking problem is to recursively estimate the state

**x**

_{ k }of the target according to the measurement

**z**

_{ k }at time step

*k*,

*v*

_{ k }is the measurement noise. A hand-held monocular vision, as shown in Figure 1, is utilized in this article as the only sensing device for the measurement in SLAM system. We treat this hand-held vision sensor as a free-moving robot system with unknown inputs. The states of the system are estimated by solving the recursive SLAM problem using the extended Kalman filter (EKF) [12]

where **x**_{k|k-1}and **x**_{k|k}represent the predicted and estimated state vectors, respectively; K_{
k
} is Kalman gain matrix; P denotes the covariance matrix, respectively; A_{
k
} and W_{
k
} are the Jacobian matrices of the state equation *f* with respect to the state vector **x**_{
k
} and the noise variable *w*_{
k
}, respectively; H_{
k
} and V_{
k
} are the Jacobian matrices of the measurement *g* with respect to the state vector **x**_{
k
} and the noise variable *v*_{
k
}, respectively.

### 2.1. Motion model

*W*} and the camera frame {

*C*}, as shown in Figure 2. The state vector of the SLAM system with MOT in Equation (1) is arranged as

**x**_{
C
} is a 12 × 1 state vector of the camera including the three-dimensional vectors of position r, rotational angle ϕ, linear velocity **v**, and angular velocity ω, all in world frame; **m**_{
i
} is the three-dimensional (3D) coordinates of *i* th stationary landmark in world frame; **O**_{
j
} is the state vector of *j* th moving object; *n* and *l* are the number of the landmarks and of the moving objects, respectively.

**x**

_{ C }of the camera with a CV motion model at time step

*k*is expressed as:

_{ v }and w

_{ ω }are linear and angular velocity noise caused by acceleration, respectively. The state of

*i*th stationary landmark at time step

*k*is represented by 3D coordinates in space,

*k*can be expressed as

_{ jk }and s

_{ jk }are the state and motion mode of

*j*th moving object, respectively. The MOT problem can be expressed as a probability density function (pdf) in Bayesian probability

*p*(o

_{ k }|s

_{ k },

**z**

_{1:k}) is state inference;

**z**

_{1:k}is the set of measurements for time

*t*= 1 to

*k*;

*p*(s

_{ k }|

**z**

_{1:k}) is the mode learning. The EKF-based interacting multiple model (IMM) estimator [13] can be utilized to estimate the motion mode of a moving object. The state is computed at time step

*k*under each possible current model using

*r*filters, with each filter using a different combination of the previous model-conditioned estimates. The mode M

^{ j }at time step

*k*is assumed to be among the possible

*r*modes

*r*motion models, the object state o

_{ k }in Equation (7) is estimated. Instead of using the IMM estimator, a single CV model is utilized in the paper. That is, the moving objects are also presumed to move at a CV motion model. Their coordinates in 3D space are defined as

where p_{
jk
} and **v**_{
jk
} are the vectors of the position and linear velocity of *j* th moving object at time step *k*, respectively.

### 2.2. Vision sensor model

*m*is the number of the observed image features in current measurement. The perspective projection method [14] is employed to model the transformation from 3D space coordinate system to 2D image plane. For one observed image feature, the measurement is denoted as

*f*

_{ u }and

*f*

_{ v }are the focal lengths of the camera denoting the distance from the camera center to the image plane in

*u*- and

*v*-axis, respectively; (

*u*

_{0},

*v*

_{0}) is the offset pixel vector of the pixel image plane;

*α*

_{ C }is the camera skew coefficient; ${\mathit{h}}_{i}^{C}={\left[\begin{array}{ccc}\hfill {h}_{ix}^{C}\hfill & \hfill {h}_{iy}^{C}\hfill & \hfill {h}_{iz}^{C}\hfill \end{array}\right]}^{T}$ is the ray vector of

*i*th image feature in camera frame. The 3D coordinates of

*i*th image feature or landmark in world frame, as shown in Figure 2, is given as

*cϕ*= cos

*ϕ*and

*sϕ*= sin

*ϕ*;

*ϕ*

_{ x },

*ϕ*

_{ y }and

*ϕ*

_{ z }are the corresponding rotational angles in world frame. We can utilize Equatoin (10) to calculate the ray vector of an image feature in camera frame. The coordinates of the feature in image plane are obtained by substituting Equations (10) and (11) into Equation (9) with

*α*

_{ C }= 0

Moreover, the elements of the Jacobian matrices H_{
k
} and V_{
k
} are determined by taking the derivative of **z**_{
i
} with respect to the state **x**_{
k
} and the measurement noise *v*_{
k
}. The Jacobian matrices are obtained for the purpose of calculating the innovation covariance matrix in EKF estimation process [16].

### 2.3. Feature initialization

*m*image features with 3D position vectors,

*y*

_{ i },

*i*= 1,...,

*m*, which is described by the 6D state vector

*y*

_{ i }, the corresponding covariance matrix are initialized according to

*R*

_{ i }is the covariance of the measurement noise; σ

_{ρ}is the deviation of the estimated image depth. However, the inverse depth coordinates are 6D and computational costly. A switching criterion is established in reference [17] based on a linearity index

*L*

_{ d }. If

*L*

_{ d }>

*L*

_{d 0}, the inverse depth parameterization in Equation (13) is utilized as the augmented states; where

*L*

_{d 0}is the threshold value of the index defined in [17]. On the other hand, if

*L*

_{ d }≤

*L*

_{d 0}, then the state vector in Equation (10) is chosen and modified as

_{ i }, the correspondent elements of the Jacobian matrix

*H*

_{ k }are modified as:

The derivative is taken at ${x}_{k}={\widehat{x}}_{k\mid k-1}$ and *v*_{
k
} = 0.

### 2.4. Speeded-up robust features (SURF)

*D*

_{ ij }are the images filtered by the corresponding box filters;

*w*is a weight constant. The interest points or features are extracted by examining the extreme value of determinant of Hessian matrix. Furthermore, the unique properties of the extracted SURF are described by using a 64-dimensional description vector as shown in Figure 3 [9, 19].

### 2.5. Implementation of SLAM

## 3. Moving object detection and tracking

_{ C }is the matrix of camera intrinsic parameters which can be obtained from Equation (9). Note that the image coordinates are defined in camera frame. The essential matrix E in Equation (24) is defined as [20]

_{×}is the matrix representation of the cross product with t. The rotation matrix and translation vector would be determined using EKF estimator. Therefore, the essential matrix E can be calculated accordingly. Usually, the pixel coordinate constraint in Equation (24) is utilized to estimate the state vector and according essential matrix. Given a set of corresponding image points, it is possible to estimate the state vector and the essential matrix which optimally satisfy the pixel coordinate constraint. The most straight-forward approach is to set up a total least squares problem, commonly known as the eight-point algorithm [21]. On the contrary in SLAM problem, the state vector and the essential matrix is obtained using the state estimator. We could further utilize the estimated state vector and the essential matrix to investigate whether a set of corresponding image points satisfy the pixel coordinate constraint in Equation (24). First, define the known quantities in Equation (24) as a constant vector

Equation (27) indicates that the pixel coordinate of the corresponding feature in second image will be constrained on the epipolar line.

*I*

_{ x },

*I*

_{ y }) = (80,60). In the first simulation, the camera translates 1 cm along

*x*

_{ c }-axis direction, the corresponding image feature $\left({I}_{x}^{\prime},\phantom{\rule{0.3em}{0ex}}{I}_{y}^{\prime}\right)$ in the second image must be restricted in the epipolar line 1, as shown in Figure 6. Similarly, if the camera translates 1 cm along

*y*

_{ c }- or

*z*

_{ c }-axis direction, the corresponding features $\left({I}_{x}^{\prime},\phantom{\rule{0.3em}{0ex}}{I}_{y}^{\prime}\right)$ in the second image must be located on the epipolar line 2 or 3, respectively. In the second simulation, the camera first moves 1 cm in

*x*

_{ c }-axis and then rotates ϕ

_{ x }= 15° about

*x*

_{ c }-axis, the corresponding feature $\left({I}_{x}^{\prime},\phantom{\rule{0.3em}{0ex}}{I}_{y}^{\prime}\right)$ in second image must be located on the epipolar line 4, as shown in Figure 6. If the camera first moves 1 cm in

*x*

_{ c }-axis and then rotates ϕ

_{ y }= 15° about

*y*

_{ c }-axis or ϕ

_{ z }= 15° about

*z*

_{ c }-axis, the corresponding features $\left({I}_{x}^{\prime},\phantom{\rule{0.3em}{0ex}}{I}_{y}^{\prime}\right)$ in the second image must be located on the epipolar line 5 or 6, respectively.

*D*to represent the pixel deviation of the corresponding point from the epipolar line in image plane, as shown in Figure 7,

*D* is utilized in this article to denote the pixel deviation from the epipolar line which is induced by the motion and measurement noise in state estimation process. Depending on how the noise related to each constraint is measured, it is possible to design a threshold value in Equation (28) which satisfies the epipolar constraint for a given set of corresponding image points. For example, the image feature of a static object in the first image is located at (*I*_{
x
}, *I*_{
y
}) = (50,70) and then the camera moves 1 cm in *z*_{
c
}-axis. If the corresponding image feature $\left({I}_{x}^{\prime},\phantom{\rule{0.3em}{0ex}}{I}_{y}^{\prime}\right)$ in the second image is constrained within a deviation ${I}_{y}^{\prime}$ is limited in a range, as shown in Figure 7, as ${I}_{x}^{\prime}$ varying from 1 to 320.

*I*

_{ x },

*I*

_{ y }) = (160,120), the coefficients of the epipolar line obtained from Equation (26) are zero. Therefore, any point in the second image will satisfy the equation of the epipolar line in Equation (27). This situation is simulated and the result is depicted in Figure 8. In the simulation, the pixel coordinate in the first image

*I*

_{ x }varies from 0 to 320 and

*I*

_{ y }is fixed at 120. The corresponding feature $\left({I}_{x}^{\prime},\phantom{\rule{0.3em}{0ex}}{I}_{y}^{\prime}\right)$ in the second image is limited within the pixel coordinate constraint ${\mathit{h}}_{d}^{{C}^{T}}\mathit{Eh}{\prime}_{d}^{C}<1{0}^{-3}$. We can see that the range of the corresponding ${I}_{y}^{\prime}$ is unlimited when (

*I*

_{ x },

*I*

_{ y }) is close to (160,120), as shown in Figure 8. In real applications, those features located in a small region near the center of the image plane need a special treatment because the epipolar line constraint is not valid in this situation.

## 4. Experimental results

In this section, the experimental works of the online SLAM with a moving object detector are implemented on a laptop computer running Microsoft Window XP. The laptop computer is Asus U5F with Intel Core 2 Duo T5500 (1.66 GHz), Mobile Intel i945GM chipset and 1 Gb DDR2. The free-moving monocular camera utilized in this work is Logitech C120 CMOS web-cam with 320 × 240-pixels resolution and USB 2.0 interface. The camera is calibrated using the Matlab tool provided by Bouquet [23]. The focal lengths are *f*_{
u
} = 364.4 pixels and *f*_{
v
} = 357.4 pixels. The offset pixels are *u*_{0} = 156.0 pixels and *v*_{0} = 112.1 pixels, respectively. We carried out three experiments including the SLAM task in a static environment, SLAM with MOT, and people detection and tracking.

### 4.1. SLAM in static environment

*xyz*-axis are plotted in Figure 19. The camera pose deviations decrease suddenly at each loop-closure, because the old landmarks are revisited and the camera pose and landmark locations are updated accordingly. We also can see from Figure 19 that the lower-bound of the pose deviation is further decreased after each loop-closure.

### 4.2. SLAM with MOT

### 4.3. People detection and tracking

## 5. Conclusions

In this research, we developed an algorithm for detection and tracking of moving objects to improve the robustness of robot visual SLAM system. SURFs are also utilized to provide a robust detection of image features and a stable description of the features. Three experimental works have been carried out on a monocular vision system including SLAM in a static environment, SLAM with MOT, and people detection and tracking. The results showed that the monocular SLAM system with the proposed algorithm has the capability to support robot systems simultaneously navigating and tracking moving objects in dynamic environments.

## Declarations

### Acknowledgements

This article was partially supported by the National Science Council in Taiwan under grant no. NSC100-2221-E-032-008 to Y.T. Wang.

## Authors’ Affiliations

## References

- Wang CC, Thorpe C, Thrun S, Hebert M, Durrant-Whyte H: Simultaneous localization, mapping and moving object tracking.
*Int J Robot Res*2007, 26(9):889-916. 10.1177/0278364907081229View ArticleGoogle Scholar - Bibby C, Reid I: Simultaneous Localisation and Mapping in Dynamic Environments (SLAMIDE) with Reversible Data Association. In
*Proceedings of Robotics: Science and Systems III*. Georgia Institute of Technology, Atlanta; 2007.Google Scholar - Zhao H, Chiba M, Shibasaki R, Shao X, Cui J, Zha H: SLAM in a Dynamic Large Outdoor Environment using a Laser Scanner. In
*Proceedings of the IEEE International Conference on Robotics and Automation*. Pasadena, California; 2008:1455-1462.Google Scholar - Davison AJ, Reid ID, Molton ND, Stasse O: MonoSLAM: Real-time single camera SLAM.
*IEEE T Pattern Anal*2007, 29(6):1052-1067.View ArticleGoogle Scholar - Paz LM, Pinies P, Tardos JD, Neira J: Large-Scale 6-DOF SLAM with Stereo-in-Hand.
*IEEE T Robot*2008, 24(5):946-957.View ArticleGoogle Scholar - Harris C, Stephens M: A combined corner and edge detector. In
*Proceedings of the 4th Alvey Vision Conference*. University of Mancheter; 1988:147-151.Google Scholar - Karlsson N, Bernardo ED, Ostrowski J, Goncalves L, Pirjanian P, Munich ME: The vSLAM Algorithm for Robust Localization and Mapping. In
*Proceedings of the IEEE International Conference on Robotics and Automation*. Barcelona, Spain; 2005:24-29.Google Scholar - Sim R, Elinas P, Little JJ: A Study of the Rao-Blackwellised Particle Filter for Efficient and Accurate Vision-Based SLAM.
*Int J Comput Vision*2007, 74(3):303-318. 10.1007/s11263-006-0021-0View ArticleGoogle Scholar - Lowe DG: Distinctive image features from scale-invariant keypoints.
*Int J Comput Vision*2004, 60(2):91-110.View ArticleGoogle Scholar - Bay H, Tuytelaars T, Van Gool L: SURF: Speeded up robust features. In
*Proceedings of the ninth European Conference on Computer Vision*. Springer-Verlog, Berlin, German; 2006:404-417. Lecture Notes in Computer Science 3951Google Scholar - Shakhnarovich G, Darrell T, Indyk P:
*Nearest-Neighbor Methods in Learning and Vision*. MIT Press, Cambridge, MA; 2005.Google Scholar - Smith R, Self M, Cheeseman P: Estimating Uncertain Spatial Relationships in Robotics. In
*Autonomous Robot Vehicles*. Edited by: Cox IJ, Wilfong GT. Springer-Verlag, New York; 1990:167-193.View ArticleGoogle Scholar - Blom AP, Bar-Shalom Y: The interacting multiple-model algorithm for systems with Markovian switching coefficients.
*IEEE T Automat Control*1988, 33: 780-783. 10.1109/9.1299View ArticleMATHGoogle Scholar - Hutchinson S, Hager GD, Corke PI: A tutorial on visual servo control.
*IEEE T Robot Automat*1996, 12(5):651-670. 10.1109/70.538972View ArticleGoogle Scholar - Sciavicco L, Siciliano B:
*Modelling and Control of Robot Manipulators*. McGraw-Hill, New York; 1996.MATHGoogle Scholar - Wang YT, Lin MC, Ju RC: Visual SLAM and Moving Object Detection for a Small-size Humanoid Robot.
*Int J Adv Robot Syst*2010, 7(2):133-138.Google Scholar - Civera J, Davison AJ, Montiel JMM: Inverse Depth Parametrization for Monocular SLAM.
*IEEE Trans Robot*2008, 24(5):932-945.View ArticleGoogle Scholar - Lindeberg T: Feature detection with automatic scale selection.
*Int J Comput Vision*1998, 30(2):79-116. 10.1023/A:1008045108935View ArticleGoogle Scholar - Wang YT, Hung DY, Sun CH: Improving Data Association in Robot SLAM with Monocular Vision.
*J Inf Sci Eng*2011, 27(6):1823-1837.MathSciNetGoogle Scholar - Longuet-Higgins HC: A computer algorithm for reconstructing a scene from two projections.
*Nature*1981, 293: 133-135. 10.1038/293133a0View ArticleGoogle Scholar - Hartley RI: In Defense of the Eight-Point Algorithm.
*IEEE T Pattern Anal*1997, 19(6):580-593. 10.1109/34.601246View ArticleGoogle Scholar - Luong QT, Faugeras OD: The Fundamental Matrix: Theory, Algorithms, and Stability Analysis.
*Int J Comput Vision*1996, 17(1):43-75. 10.1007/BF00127818View ArticleGoogle Scholar - Bouguet JY: Camera Calibration Toolbox for Matlab.2011. [http://www.vision.caltech.edu/bouguetj/calib_doc/]Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.