
Detection of moving objects in image plane for robot navigation using monocular vision

Abstract

This article presents an algorithm for moving object detection (MOD) in robot visual simultaneous localization and mapping (SLAM). The MOD algorithm is designed based on the epipolar constraint for corresponding feature points on the image plane. An essential matrix obtained from the state estimator is utilized to represent the epipolar constraint. Meanwhile, the speeded-up robust feature (SURF) method is employed in the algorithm to provide robust detection of image features as well as a better description of landmarks and moving objects in the visual SLAM system. Experiments are carried out on a hand-held monocular camera to verify the performance of the proposed algorithm. The results show that the integration of MOD and SURF is effective for robot navigation in dynamic environments.

1. Introduction

In recent years, more and more researchers have solved the simultaneous localization and mapping (SLAM) and moving object tracking (MOT) problems concurrently. Wang et al. [1] developed a consistency-based moving object detector and provided a framework to solve the SLAMMOT problem. Bibby and Reid [2] proposed a method that combines sliding window optimization and least-squares together with expectation maximization to perform reversible model selection and data association, which allows dynamic objects to be included directly in the SLAM estimation. Zhao et al. [3] used GPS data and control inputs to achieve global consistency in dynamic environments. There are many advantages to coping with the SLAM and MOT problems simultaneously. For example, mobile robots might navigate in a dynamic environment crowded with moving objects; in this case, the SLAM estimate could be corrupted by the inclusion of moving entities if the information about moving objects is not taken into account. Furthermore, the robustness of robot localization and mapping algorithms can be improved if the moving objects are discriminated from the stationary objects in the environment.

Using cameras to implement SLAM is the current trend because they are lightweight and low-cost and provide rich appearance and texture information about the surroundings. However, discriminating moving objects from stationary landmarks in dynamic environments is still a difficult problem in visual SLAM. To deal with this problem, we propose a moving object detection (MOD) algorithm based on the epipolar constraint for corresponding feature points on the image plane. Given an estimated essential matrix, it is possible to investigate whether a set of corresponding image points satisfies the epipolar constraint in the image plane. Therefore, the epipolar constraint can be utilized to distinguish the moving objects from the stationary landmarks in dynamic environments.

For visual SLAM systems, features in the environment are detected and extracted by analyzing the images taken by the robot vision system, and the data association between the extracted features and the landmarks in the map is then investigated. Many researchers [4, 5] employed the detector of Harris and Stephens [6] to extract apparent corner features from one image and tracked these point features in consecutive images. The descriptors of the Harris corner features are rectangular image patches. When the camera translates and rotates, the scale and orientation of the image patches change, and the detection and matching of Harris corners might fail unless the variations in scale and orientation of the image patches are recovered. Instead of detecting corner features, some works [7, 8] detect features using the scale-invariant feature transform (SIFT) method [9], which provides a robust image feature detector. The unique properties of image features extracted by the SIFT method are further described by a high-dimensional description vector [9]. However, feature extraction by SIFT requires more computational cost than Harris's method [6]. To improve the computational speed, Bay et al. [10] introduced the concepts of integral images and box filters to detect and extract scale-invariant features, which they dubbed speeded-up robust features (SURF). The extracted SURF must be matched with the landmarks in the map of a SLAM system; the nearest-neighbor (NN) searching method [11] can be utilized to match the high-dimensional description vectors.

In this article, an online SLAM system with a moving object detector is developed based on the epipolar constraint for corresponding feature points on the image plane. The corresponding image features are obtained using the SURF method [10], and the epipolar constraint is evaluated using an estimated essential matrix. Moving object information is detected in the image plane and integrated into the MOT process so that the robustness of the SLAM algorithm can be considerably improved, particularly in highly dynamic environments where the surroundings of robots are dominated by non-stationary objects. The contributions of this article are twofold. First, we develop an algorithm to solve the MOD problem in the image plane and integrate it with robot SLAM to improve the robustness of the state estimation and mapping processes. Second, the improved SLAM system is implemented on a hand-held monocular camera which can be utilized as the sensor system for robot navigation in dynamic environments.

The SLAM problem with monocular vision is briefly introduced in Section 2. In Section 3, the proposed MOD algorithm is explained in detail. Experiments that verify the performance of the proposed algorithm are described in Section 4. Section 5 gives the concluding remarks.

2. SLAM with a free-moving monocular vision

SLAM can be treated as a target tracking problem for a robot system navigating in the environment [12]. The targets to be tracked include the state of the robot itself as well as the landmarks and moving objects in the environment. The state sequence of the SLAM system at time step k can be expressed as

x_k = f(x_{k-1}, u_{k-1}, w_{k-1})
(1)

where x_k is the state vector, u_k is the input, and w_k is the process noise. The objective of the tracking problem is to recursively estimate the state x_k of the target from the measurement z_k at time step k,

z_k = g(x_k, v_k)
(2)

where v_k is the measurement noise. A hand-held monocular camera, as shown in Figure 1, is utilized in this article as the only sensing device for measurement in the SLAM system. We treat this hand-held vision sensor as a free-moving robot system with unknown inputs. The states of the system are estimated by solving the recursive SLAM problem using the extended Kalman filter (EKF) [12]:

Figure 1. A free-moving monocular vision sensor.

x_{k|k-1} = f(x_{k-1|k-1}, u_{k-1}, 0)
(3a)
P_{k|k-1} = A_k P_{k-1|k-1} A_k^T + W_k Q_{k-1} W_k^T
(3b)
K_k = P_{k|k-1} H_k^T (H_k P_{k|k-1} H_k^T + V_k R_k V_k^T)^{-1}
(3c)
x_{k|k} = x_{k|k-1} + K_k (z_k - g(x_{k|k-1}, 0))
(3d)
P_{k|k} = (I - K_k H_k) P_{k|k-1}
(3e)

where x_{k|k-1} and x_{k|k} represent the predicted and estimated state vectors, respectively; K_k is the Kalman gain matrix; P denotes the state covariance matrix; A_k and W_k are the Jacobian matrices of the state equation f with respect to the state vector x_k and the noise variable w_k, respectively; H_k and V_k are the Jacobian matrices of the measurement equation g with respect to the state vector x_k and the noise variable v_k, respectively.
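For concreteness, the EKF recursion of Equations (3a)-(3e) can be summarized in the short Python/NumPy sketch below. This is only a minimal illustration, not the authors' implementation; the function names and the convention that the state/measurement models and their Jacobians are supplied by the caller are assumptions made here for clarity.

```python
import numpy as np

def ekf_predict(x, P, f, A, W, Q, u):
    """Prediction step, Equations (3a)-(3b)."""
    x_pred = f(x, u, np.zeros(Q.shape[0]))      # (3a) propagate state with zero noise
    P_pred = A @ P @ A.T + W @ Q @ W.T          # (3b) propagate covariance
    return x_pred, P_pred

def ekf_update(x_pred, P_pred, z, g, H, V, R):
    """Update step, Equations (3c)-(3e)."""
    S = H @ P_pred @ H.T + V @ R @ V.T          # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)         # (3c) Kalman gain
    x_est = x_pred + K @ (z - g(x_pred, np.zeros(R.shape[0])))  # (3d) state update
    P_est = (np.eye(P_pred.shape[0]) - K @ H) @ P_pred          # (3e) covariance update
    return x_est, P_est
```

In the SLAM context, f and g correspond to the motion and vision sensor models developed in the following subsections.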

2.1. Motion model

Two coordinate systems are set at the world frame {W} and the camera frame {C}, as shown in Figure 2. The state vector of the SLAM system with MOT in Equation (1) is arranged as

Figure 2. Coordinate setting for the camera system.

x = [x_C  m_1  m_2  \cdots  m_n  O_1  O_2  \cdots  O_l]^T
(4)

where x_C is a 12 × 1 state vector of the camera comprising the three-dimensional vectors of position r, rotational angle ϕ, linear velocity v, and angular velocity ω, all expressed in the world frame; m_i is the three-dimensional (3D) coordinate vector of the i-th stationary landmark in the world frame; O_j is the state vector of the j-th moving object; n and l are the numbers of landmarks and moving objects, respectively.

The motion of the hand-held camera is presumed to be at constant velocity (CV), with acceleration modeled as impulse noise from external forces. Therefore, the state x_C of the camera under the CV motion model at time step k is expressed as:

x_{Ck} = \begin{bmatrix} r_k \\ \phi_k \\ v_k \\ \omega_k \end{bmatrix} = \begin{bmatrix} r_{k-1} + (v_{k-1} + w_{v,k-1}) \Delta t \\ \phi_{k-1} + (\omega_{k-1} + w_{\omega,k-1}) \Delta t \\ v_{k-1} + w_{v,k-1} \\ \omega_{k-1} + w_{\omega,k-1} \end{bmatrix}
(5)

where w_v and w_ω are the linear and angular velocity noise terms caused by acceleration, respectively. The state of the i-th stationary landmark at time step k is represented by its 3D coordinates in space,

m_{ik} = [X_{ik}  Y_{ik}  Z_{ik}]^T
(6)
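As an illustration of the CV model in Equation (5), the fragment below propagates the 12-dimensional camera state over one time step; the landmark states of Equation (6) are simply kept unchanged because they are assumed stationary. This is a sketch under the notation above, not the authors' code; in the EKF prediction step the noise terms would be set to zero and enter only through the process noise covariance.

```python
import numpy as np

def propagate_camera_cv(x_c, dt, w_v=np.zeros(3), w_omega=np.zeros(3)):
    """Constant-velocity propagation of the camera state, Equation (5).

    x_c = [r(3), phi(3), v(3), omega(3)], all expressed in the world frame.
    """
    r, phi, v, omega = x_c[0:3], x_c[3:6], x_c[6:9], x_c[9:12]
    r_new = r + (v + w_v) * dt
    phi_new = phi + (omega + w_omega) * dt
    v_new = v + w_v
    omega_new = omega + w_omega
    return np.concatenate([r_new, phi_new, v_new, omega_new])

# Stationary landmarks (Equation (6)) keep their coordinates: m_i(k) = m_i(k-1).
```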

In the motion model for MOT, the targets to be tracked include the state and motion mode of each moving object in the environment. The state of the MOT system at time step k can be expressed as

O_{jk} = [o_{jk}  s_{jk}]^T, \quad \text{for } j = 1, 2, \ldots, l

where o_{jk} and s_{jk} are the state and motion mode of the j-th moving object, respectively. The MOT problem can be expressed as a probability density function (pdf) in Bayesian form:

p(o_k, s_k \mid z_{1:k}) = p(o_k \mid s_k, z_{1:k})\, p(s_k \mid z_{1:k})
(7)

where p(o_k | s_k, z_{1:k}) is the state inference; z_{1:k} is the set of measurements from time step 1 to k; p(s_k | z_{1:k}) is the mode learning. The EKF-based interacting multiple model (IMM) estimator [13] can be utilized to estimate the motion mode of a moving object. The state at time step k is computed under each possible current model using r filters, with each filter using a different combination of the previous model-conditioned estimates. The mode M_j at time step k is assumed to be among the r possible modes

M_j \in \mathbf{M} = \{ M_1, M_2, \ldots, M_r \}

Given r motion models, the object state o_k in Equation (7) is estimated. Instead of using the IMM estimator, a single CV model is utilized in this article; that is, the moving objects are also presumed to follow a CV motion model. Their coordinates in 3D space are propagated as

o_{jk} = \begin{bmatrix} p_{j,k-1} + (v_{j,k-1} + w_{j,k-1}) \Delta t \\ v_{j,k-1} + w_{j,k-1} \end{bmatrix}
(8)

where p_{jk} and v_{jk} are the position and linear velocity vectors of the j-th moving object at time step k, respectively.

2.2. Vision sensor model

The measurement vector of the monocular vision system is expressed as

z_k = [z_{1k}  z_{2k}  \cdots  z_{mk}]^T

where m is the number of observed image features in the current measurement. The perspective projection method [14] is employed to model the transformation from the 3D space coordinate system to the 2D image plane. For one observed image feature, the measurement is denoted as

z_{ik} = \begin{bmatrix} I_{ix} \\ I_{iy} \end{bmatrix} = \begin{bmatrix} u_0 + f_u \dfrac{h^C_{ix}}{h^C_{iz}} + \alpha_C f_u \dfrac{h^C_{iy}}{h^C_{iz}} \\[6pt] v_0 + f_v \dfrac{h^C_{iy}}{h^C_{iz}} \end{bmatrix}, \quad \text{for } i = 1, 2, \ldots, m
(9)

where f_u and f_v are the focal lengths of the camera, denoting the distance from the camera center to the image plane along the u- and v-axes, respectively; (u_0, v_0) is the offset pixel vector of the image plane; α_C is the camera skew coefficient; h_i^C = [h^C_{ix}  h^C_{iy}  h^C_{iz}]^T is the ray vector of the i-th image feature in the camera frame. The 3D coordinate vector of the i-th image feature or landmark in the world frame, as shown in Figure 2, is given as

m_i = [X_i  Y_i  Z_i]^T = r + R^W_C h^C_i
(10)

where R^W_C is the rotation matrix from the camera frame to the world frame, represented using the elementary rotations [15],

R^W_C = \begin{bmatrix} c\phi_y c\phi_z & s\phi_x s\phi_y c\phi_z - c\phi_x s\phi_z & c\phi_x s\phi_y c\phi_z + s\phi_x s\phi_z \\ c\phi_y s\phi_z & s\phi_x s\phi_y s\phi_z + c\phi_x c\phi_z & c\phi_x s\phi_y s\phi_z - s\phi_x c\phi_z \\ -s\phi_y & s\phi_x c\phi_y & c\phi_x c\phi_y \end{bmatrix}
(11)

where cϕ = cos ϕ and sϕ = sin ϕ; ϕ_x, ϕ_y, and ϕ_z are the corresponding rotational angles in the world frame. We can utilize Equation (10) to calculate the ray vector of an image feature in the camera frame. The coordinates of the feature in the image plane are then obtained by substituting Equations (10) and (11) into Equation (9) with α_C = 0:

I_{ix} = u_0 + f_u \dfrac{c\phi_y c\phi_z (X_i - r_x) + c\phi_y s\phi_z (Y_i - r_y) - s\phi_y (Z_i - r_z)}{(c\phi_x s\phi_y c\phi_z + s\phi_x s\phi_z)(X_i - r_x) + (c\phi_x s\phi_y s\phi_z - s\phi_x c\phi_z)(Y_i - r_y) + c\phi_x c\phi_y (Z_i - r_z)}
(12a)
I_{iy} = v_0 + f_v \dfrac{(s\phi_x s\phi_y c\phi_z - c\phi_x s\phi_z)(X_i - r_x) + (s\phi_x s\phi_y s\phi_z + c\phi_x c\phi_z)(Y_i - r_y) + s\phi_x c\phi_y (Z_i - r_z)}{(c\phi_x s\phi_y c\phi_z + s\phi_x s\phi_z)(X_i - r_x) + (c\phi_x s\phi_y s\phi_z - s\phi_x c\phi_z)(Y_i - r_y) + c\phi_x c\phi_y (Z_i - r_z)}
(12b)

Moreover, the elements of the Jacobian matrices H_k and V_k are determined by taking the derivatives of z_i with respect to the state x_k and the measurement noise v_k. These Jacobian matrices are needed to calculate the innovation covariance matrix in the EKF estimation process [16].
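The measurement model of Equations (9)-(12) can be sketched as follows. The rotation matrix follows Equation (11), the feature is transformed into the camera frame with the transpose of R (the inverse of Equation (10)), and the projection of Equation (9) is applied with zero skew. This is only an illustrative fragment; the function names are not from the article.

```python
import numpy as np

def rotation_wc(phi_x, phi_y, phi_z):
    """Rotation matrix R_C^W of Equation (11) (camera frame to world frame)."""
    cx, sx = np.cos(phi_x), np.sin(phi_x)
    cy, sy = np.cos(phi_y), np.sin(phi_y)
    cz, sz = np.cos(phi_z), np.sin(phi_z)
    return np.array([
        [cy * cz, sx * sy * cz - cx * sz, cx * sy * cz + sx * sz],
        [cy * sz, sx * sy * sz + cx * cz, cx * sy * sz - sx * cz],
        [-sy,     sx * cy,                cx * cy],
    ])

def project_feature(m_i, r, phi, fu, fv, u0, v0):
    """Project a world-frame point m_i into the image plane, Equations (9)-(12)."""
    R = rotation_wc(*phi)
    h_c = R.T @ (m_i - r)            # ray vector in the camera frame, from Equation (10)
    Ix = u0 + fu * h_c[0] / h_c[2]   # Equation (9) with alpha_C = 0
    Iy = v0 + fv * h_c[1] / h_c[2]
    return np.array([Ix, Iy])
```

The Jacobians H_k and V_k mentioned above are the derivatives of this projection with respect to the state and noise variables, evaluated at the predicted state.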

2.3. Feature initialization

Because the depth (range) information along the viewing ray cannot be recovered from a single image, feature initialization becomes an important topic. Some researchers have successfully solved this problem with either a time-delayed method [16] or an un-delayed method [17]. The un-delayed method is utilized in this research. When an image feature is selected, the spatial coordinates of the feature are calculated by employing the inverse depth parameterization method [17]. Assume that there are m image features with 3D position vectors y_i, i = 1, ..., m, each described by the 6D state vector

\hat{y}_i = [\hat{r}^W_{ix}  \hat{r}^W_{iy}  \hat{r}^W_{iz}  \hat{\theta}^W_i  \hat{\psi}^W_i  \hat{\rho}_i]^T
(13)

where \hat{r}^W = [\hat{r}^W_{ix}  \hat{r}^W_{iy}  \hat{r}^W_{iz}]^T indicates the estimated position of the camera when the feature was first observed, as shown in Figure 2; \hat{\rho}_i is the estimated inverse depth of the feature; \hat{\theta}^W_i and \hat{\psi}^W_i are the longitude and latitude angles of the spherical coordinate system located at the camera center. To compute the longitude and latitude angles, a normalized vector η^W_i in the direction of the ray vector is constructed using the perspective projection method:

\eta^W_i = R^W_C(\hat{\phi}^W) \begin{bmatrix} \dfrac{I_{ix} - u_0}{f_u} & \dfrac{I_{iy} - v_0}{f_v} & 1 \end{bmatrix}^T
(14)

Therefore, from Figure 2, the longitude and latitude angles of the spherical coordinate system can be obtained as

\hat{\theta}^W_i = \tan^{-1}\left( \dfrac{\eta^W_{iz}}{\eta^W_{ix}} \right)
(15a)
\hat{\psi}^W_i = \tan^{-1}\left( \dfrac{\eta^W_{iy}}{\sqrt{(\eta^W_{ix})^2 + (\eta^W_{iz})^2}} \right)
(15b)
(15b)

When image features are selected to be new landmarks or moving objects, the inverse depth parameterization vector in Equation (13) is appended as new augmented states in the EKF-based SLAM. Meanwhile, for each new state variable y_i, the corresponding covariance matrix is initialized according to

P^{new}_{k|k} = J \begin{bmatrix} P_{k|k} & 0 & 0 \\ 0 & R_i & 0 \\ 0 & 0 & \sigma^2_\rho \end{bmatrix} J^T
(16)
J = \begin{bmatrix} I & 0 \\ \dfrac{\partial y}{\partial r^W} \;\; \dfrac{\partial y}{\partial \phi^C} \;\; 0 \;\; \cdots \;\; 0 \;\; \dfrac{\partial y}{\partial z_i} \;\; \dfrac{\partial y}{\partial \rho_i} \end{bmatrix}
(17)

where R_i is the covariance of the measurement noise and σ_ρ is the standard deviation of the estimated inverse depth. However, the inverse depth coordinates are 6D and computationally costly. A switching criterion based on a linearity index L_d is established in reference [17]. If L_d > L_{d0}, the inverse depth parameterization in Equation (13) is utilized as the augmented state, where L_{d0} is the threshold value of the index defined in [17]. On the other hand, if L_d ≤ L_{d0}, the state vector in Equation (10) is chosen and modified as

\hat{Y}_i = \begin{bmatrix} \hat{Y}_{ix} \\ \hat{Y}_{iy} \\ \hat{Y}_{iz} \end{bmatrix} = \begin{bmatrix} \hat{r}^W_{ix} \\ \hat{r}^W_{iy} \\ \hat{r}^W_{iz} \end{bmatrix} + \dfrac{1}{\hat{\rho}_i} m(\hat{\theta}^W_i, \hat{\psi}^W_i)
(18)
m(\hat{\theta}^W_i, \hat{\psi}^W_i) = \begin{bmatrix} \cos\hat{\theta}^W_i \cos\hat{\psi}^W_i \\ \sin\hat{\psi}^W_i \\ \sin\hat{\theta}^W_i \cos\hat{\psi}^W_i \end{bmatrix}
(19)

where m(\hat{\theta}^W_i, \hat{\psi}^W_i) is the unit ray vector. In this case, the corresponding covariance matrix is rearranged as

P^{new} = J P J^T
(20)
J = \begin{bmatrix} I & 0 & 0 \\ 0 & \dfrac{\partial Y_i}{\partial y_i} & 0 \\ 0 & 0 & I \end{bmatrix}
(21)
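The pattern of Equations (16)-(17) and (20)-(21) is a standard covariance augmentation: a block-diagonal matrix of the old covariance, the measurement noise, and the depth prior is mapped through the Jacobian of the initialization function. The fragment below is a minimal sketch of this pattern; the Jacobian J is assumed to be supplied by the caller with the structure of Equation (17).

```python
import numpy as np
from scipy.linalg import block_diag

def augment_covariance(P, R_i, sigma_rho, J):
    """Covariance of the state after adding a new feature, Equation (16).

    P         : current state covariance
    R_i       : 2x2 measurement noise covariance of the image feature
    sigma_rho : standard deviation of the inverse-depth prior
    J         : Jacobian of the augmented state w.r.t. [old state, z_i, rho_i]
    """
    P_block = block_diag(P, R_i, np.array([[sigma_rho ** 2]]))
    return J @ P_block @ J.T
```

Equation (20) is the same operation without the extra noise blocks: the new covariance is simply J @ P @ J.T with the Jacobian of Equation (21).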

Furthermore, for each new state variable υ_i, the corresponding elements of the Jacobian matrix H_k are modified as:

H_k = \left. \dfrac{\partial g}{\partial x} \right|_{x_k, v_k} = \begin{bmatrix} \dfrac{\partial z_1}{\partial r^W} & \dfrac{\partial z_1}{\partial \phi^C} & \dfrac{\partial z_1}{\partial v^W} & \dfrac{\partial z_1}{\partial \omega^C} & \dfrac{\partial z_1}{\partial \upsilon_1} & \cdots & \dfrac{\partial z_1}{\partial \upsilon_n} \\ \dfrac{\partial z_2}{\partial r^W} & \dfrac{\partial z_2}{\partial \phi^C} & \dfrac{\partial z_2}{\partial v^W} & \dfrac{\partial z_2}{\partial \omega^C} & \dfrac{\partial z_2}{\partial \upsilon_1} & \cdots & \dfrac{\partial z_2}{\partial \upsilon_n} \\ \vdots & \vdots & \vdots & \vdots & \vdots & & \vdots \\ \dfrac{\partial z_m}{\partial r^W} & \dfrac{\partial z_m}{\partial \phi^C} & \dfrac{\partial z_m}{\partial v^W} & \dfrac{\partial z_m}{\partial \omega^C} & \dfrac{\partial z_m}{\partial \upsilon_1} & \cdots & \dfrac{\partial z_m}{\partial \upsilon_n} \end{bmatrix}
(22)

The derivatives are taken at x_k = \hat{x}_{k|k-1} and v_k = 0.
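To make the initialization of Section 2.3 concrete, the sketch below computes the inverse-depth parameters of Equations (13)-(15) from a pixel measurement and converts them back to Euclidean coordinates with Equations (18)-(19). It is an illustrative fragment, not the authors' code: the rotation matrix R (Equation (11)) and the initial inverse depth rho0 are assumed to be supplied by the caller.

```python
import numpy as np

def init_inverse_depth(Ix, Iy, r_cam, R, fu, fv, u0, v0, rho0):
    """Inverse-depth initialization of a feature, Equations (13)-(15).

    r_cam : camera position when the feature is first observed
    R     : camera rotation matrix R_C^W, Equation (11)
    rho0  : initial inverse-depth estimate (prior)
    """
    eta = R @ np.array([(Ix - u0) / fu, (Iy - v0) / fv, 1.0])   # Equation (14)
    theta = np.arctan2(eta[2], eta[0])                          # Equation (15a)
    psi = np.arctan2(eta[1], np.hypot(eta[0], eta[2]))          # Equation (15b)
    return np.concatenate([r_cam, [theta, psi, rho0]])          # 6D vector, Equation (13)

def inverse_depth_to_xyz(y):
    """Convert the 6D inverse-depth state to Euclidean coordinates, Equations (18)-(19)."""
    r, theta, psi, rho = y[0:3], y[3], y[4], y[5]
    m = np.array([np.cos(theta) * np.cos(psi),
                  np.sin(psi),
                  np.sin(theta) * np.cos(psi)])                 # unit ray vector, Equation (19)
    return r + m / rho                                          # Equation (18)
```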

2.4. Speeded-up robust features (SURF)

The basic concept of a scale-invariant method is to detect image features by investigating the determinant of the Hessian matrix H in scale space [18]. In order to speed up the detection of image features, Bay et al. [10] utilized integral images and box filters to approximate the Hessian computation on the image, so that the determinant of the Hessian matrix is approximated by

\det(H)_{approx} = D_{xx} D_{yy} - (w D_{xy})^2
(23)

where the D_{ij} are the images filtered by the corresponding box filters and w is a weighting constant. The interest points, or features, are extracted by examining the extrema of the determinant of the Hessian matrix. Furthermore, the unique properties of the extracted SURF are described by a 64-dimensional description vector, as shown in Figure 3 [9, 19].

Figure 3. 64-dimensional description vector for SURF.
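For reference, SURF keypoints and their 64-dimensional descriptors can be obtained with OpenCV's contrib module. This is only a hedged illustration of the detector described in [10], not the authors' implementation; it assumes an OpenCV build with the non-free xfeatures2d module enabled, and "frame.png" is a placeholder filename.

```python
import cv2

# Assumes opencv-contrib with the non-free SURF implementation available.
img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400, extended=False)  # 64-D descriptors
keypoints, descriptors = surf.detectAndCompute(img, None)
# 'descriptors' is an N x 64 array; each row describes one interest point.
```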

2.5. Implementation of SLAM

SLAM is implemented on the free-moving vision system by integrating the motion and sensor models as well as the extraction of SURF. A flowchart of the developed SLAM system is depicted in Figure 4. Images are captured by the monocular camera and features are extracted using the SURF method. In the SLAM flowchart, data association between the landmarks in the database and the extracted SURF image features is carried out using the NN ratio matching strategy [9], as sketched after Figure 4. A map management tactic is designed to handle the newly extracted features and the bad features in the system; the details of the map management have been explained in a previous article [19]. The properties of the newly extracted features are investigated, and moving objects are discriminated from stationary objects by the proposed detection algorithm described in the next section. All the stationary landmarks and moving objects are included in the state vector. On the other hand, features which are not continuously detected at each time step are treated as bad features and erased from the state vector in Equation (4).

Figure 4. Flowchart of EKF-based visual SLAM.
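The NN ratio matching strategy [9, 11] accepts a SURF descriptor as a match only if its nearest landmark descriptor is significantly closer than the second-nearest one. The sketch below illustrates the idea; the 0.7 ratio is an assumed value for illustration, not a figure reported in the article.

```python
import numpy as np

def nn_ratio_match(feature_desc, landmark_descs, ratio=0.7):
    """Return the index of the matched landmark, or None if the ratio test fails.

    feature_desc   : (64,)   SURF descriptor of the extracted image feature
    landmark_descs : (N, 64) descriptors of the landmarks stored in the map
    """
    d = np.linalg.norm(landmark_descs - feature_desc, axis=1)   # Euclidean distances
    if d.size < 2:
        return None
    nearest, second = np.argsort(d)[:2]
    return int(nearest) if d[nearest] < ratio * d[second] else None
```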

3. Moving object detection and tracking

In the flowchart of Figure 4, the MOD function block is designed based on the pixel coordinate constraint of a static object in the image plane. An object in space is represented by corresponding features in two consecutive images. The pixel coordinate constraint equation for these corresponding image features can be expressed as

(h^C_d)^T E\, h'^C_d = 0
(24)

where h^C_d and h'^C_d are the homogeneous normalized image coordinates of the corresponding features extracted from two consecutive images, images 1 and 2, respectively. They are defined as

h^C_d = K_C^{-1} \begin{bmatrix} I_x \\ I_y \\ 1 \end{bmatrix}, \quad h'^C_d = K_C^{-1} \begin{bmatrix} I'_x \\ I'_y \\ 1 \end{bmatrix}, \quad \text{and} \quad K_C = \begin{bmatrix} f_u & \alpha_C f_u & u_0 \\ 0 & f_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}

K_C is the matrix of camera intrinsic parameters, which can be obtained from Equation (9). Note that (I_x, I_y) and (I'_x, I'_y) are the pixel coordinates of the corresponding feature in the first and second images, respectively, and that the normalized coordinates are defined in the camera frame. The essential matrix E in Equation (24) is defined as [20]

E = [t]_\times R = \begin{bmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{bmatrix} R
(25)

where R is the rotation matrix and t is the translation vector describing the camera motion between the two views, and [t]_× is the matrix representation of the cross product with t. The rotation matrix and translation vector are determined using the EKF estimator; therefore, the essential matrix E can be calculated accordingly. Usually, the pixel coordinate constraint in Equation (24) is utilized to estimate the camera motion and the corresponding essential matrix: given a set of corresponding image points, it is possible to estimate the essential matrix that optimally satisfies the pixel coordinate constraint, the most straightforward approach being to set up a total least squares problem, commonly known as the eight-point algorithm [21]. In the SLAM problem, on the contrary, the state vector and the essential matrix are obtained from the state estimator. We can therefore utilize the estimated state vector and essential matrix to investigate whether a set of corresponding image points satisfies the pixel coordinate constraint in Equation (24). First, define the known quantities in Equation (24) as a constant vector

\begin{bmatrix} a & b & c \end{bmatrix} = \begin{bmatrix} I_x & I_y & 1 \end{bmatrix} (K_C^{-1})^T E\, K_C^{-1}
(26)

Equation (26) is further rearranged as the equation of the epipolar line constraint in the image plane for the two corresponding points [22], as shown in Figure 5:

Figure 5. Epipolar plane of two camera views.

a I'_x + b I'_y + c = 0
(27)

Equation (27) indicates that the pixel coordinates (I'_x, I'_y) of the corresponding feature in the second image are constrained to lie on this epipolar line.
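As an illustration of Equations (25)-(27), the sketch below constructs the essential matrix from the estimated rotation and translation and evaluates the epipolar line coefficients for a feature in the first image. R, t, and K are assumed to be supplied from the EKF state estimate and the camera calibration; the function names are illustrative.

```python
import numpy as np

def skew(t):
    """Matrix representation [t]_x of the cross product, Equation (25)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def epipolar_line(Ix, Iy, R, t, K):
    """Coefficients (a, b, c) of the epipolar line in the second image, Equation (26).

    R, t : rotation and translation of the camera between the two views (from the EKF)
    K    : camera intrinsic matrix K_C
    """
    E = skew(t) @ R                                        # essential matrix, Equation (25)
    K_inv = np.linalg.inv(K)
    abc = np.array([Ix, Iy, 1.0]) @ K_inv.T @ E @ K_inv    # Equation (26)
    return abc                                             # a*Ix' + b*Iy' + c = 0, Equation (27)
```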

Simulation of the pixel constraint on the corresponding features for the cases of camera translation and rotation is carried out and depicted in Figure 6. Assume that the image feature of a static object in the first image is located at (I_x, I_y) = (80, 60). In the first simulation, the camera translates 1 cm along the x_c-axis direction; the corresponding image feature (I'_x, I'_y) in the second image must then lie on epipolar line 1, as shown in Figure 6. Similarly, if the camera translates 1 cm along the y_c- or z_c-axis direction, the corresponding feature (I'_x, I'_y) in the second image must be located on epipolar line 2 or 3, respectively. In the second simulation, the camera first moves 1 cm along the x_c-axis and then rotates ϕ_x = 15° about the x_c-axis; the corresponding feature (I'_x, I'_y) in the second image must be located on epipolar line 4, as shown in Figure 6. If the camera first moves 1 cm along the x_c-axis and then rotates ϕ_y = 15° about the y_c-axis or ϕ_z = 15° about the z_c-axis, the corresponding feature (I'_x, I'_y) in the second image must be located on epipolar line 5 or 6, respectively.

Figure 6. Epipolar lines.

Motion and measurement noise are involved in the state estimation process. If the measurement of image points is subject to noise, which is the common case in any practical situation, it is not possible to find a corresponding feature point which satisfies the epipolar constraint exactly. That is, the corresponding feature point might not lie exactly on the epipolar line because of the presence of noise. Define the distance D to represent the pixel deviation of the corresponding point from the epipolar line in the image plane, as shown in Figure 7,

Figure 7. Deviation from the epipolar line.

D = \dfrac{\left| a I'_x + b I'_y + c \right|}{\sqrt{a^2 + b^2}}
(28)

D is utilized in this article to denote the pixel deviation from the epipolar line induced by the motion and measurement noise in the state estimation process. Depending on how the noise related to each constraint is measured, it is possible to design a threshold value on D in Equation (28) below which a given pair of corresponding image points is considered to satisfy the epipolar constraint. For example, suppose the image feature of a static object in the first image is located at (I_x, I_y) = (50, 70) and the camera then moves 1 cm along the z_c-axis. If the corresponding image feature (I'_x, I'_y) in the second image is constrained to stay within a deviation D of the epipolar line, then I'_y is limited to a narrow range as I'_x varies from 1 to 320, as shown in Figure 7.
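Equation (28) reduces to a point-to-line distance, so the epipolar check can be sketched as below. The threshold of 2 pixels is an assumed value for illustration only; the article does not report a specific number.

```python
import numpy as np

def epipolar_deviation(a, b, c, Ix2, Iy2):
    """Pixel deviation D of the corresponding point from the epipolar line, Equation (28)."""
    return abs(a * Ix2 + b * Iy2 + c) / np.hypot(a, b)

def satisfies_epipolar_constraint(a, b, c, Ix2, Iy2, threshold=2.0):
    """Treat the correspondence as consistent with a static point if D is below a threshold.

    The threshold (in pixels) depends on the noise level and is an assumed value here.
    """
    return epipolar_deviation(a, b, c, Ix2, Iy2) < threshold
```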

There are two situations in which the pixel coordinate constraint in Equation (24) results in trivial solutions. First, if the camera is motionless between two consecutive images, we can see from Equation (25) that the essential matrix becomes a zero matrix, so the object status cannot be obtained by investigating the pixel coordinate constraint. In this case, we assume the camera is stationary in space and compute the pixel distance between the corresponding features in two consecutive images to determine the object status. Second, if the image feature of a static object in the first image is located near the center of the image plane, (I_x, I_y) = (160, 120), the coefficients of the epipolar line obtained from Equation (26) are zero; therefore, any point in the second image will satisfy the epipolar line equation in Equation (27). This situation is simulated and the result is depicted in Figure 8. In the simulation, the pixel coordinate I_x in the first image varies from 0 to 320 while I_y is fixed at 120, and the corresponding feature (I'_x, I'_y) in the second image is limited within the pixel coordinate constraint (h^C_d)^T E h'^C_d < 10^{-3}. We can see that the range of the corresponding I'_y is unlimited when (I_x, I_y) is close to (160, 120), as shown in Figure 8. In real applications, features located in a small region near the center of the image plane need special treatment because the epipolar line constraint is not valid in this situation.

Figure 8. Deviation from epipolar line.
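Putting the pieces together, the MOD decision for one tracked feature might look like the sketch below. It handles the two degenerate situations described above: a near-zero essential matrix (camera essentially motionless, so the raw pixel displacement is used instead) and a feature near the image center, where the epipolar test is skipped. All thresholds are illustrative assumptions, not values from the article.

```python
import numpy as np

def is_moving(p1, p2, abc, E, image_center=(160.0, 120.0),
              eps_E=1e-6, center_radius=10.0, d_epi=2.0, d_pix=2.0):
    """Classify a feature as moving (True) or static (False).

    p1, p2 : pixel coordinates (Ix, Iy) of the feature in the first and second images
    abc    : epipolar line coefficients (a, b, c) from Equation (26)
    E      : essential matrix from Equation (25)
    """
    # Case 1: camera (almost) motionless -> E ~ 0; compare the raw pixel displacement instead.
    if np.linalg.norm(E) < eps_E:
        return np.hypot(p2[0] - p1[0], p2[1] - p1[1]) > d_pix

    # Case 2: feature near the image center -> epipolar line degenerates; skip the test
    # (the special treatment mentioned in the text, deferred to later frames here).
    if np.hypot(p1[0] - image_center[0], p1[1] - image_center[1]) < center_radius:
        return False

    # General case: deviation from the epipolar line, Equation (28).
    a, b, c = abc
    D = abs(a * p2[0] + b * p2[1] + c) / np.hypot(a, b)
    return D > d_epi
```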

4. Experimental results

In this section, the experiments on the online SLAM system with a moving object detector are implemented on a laptop computer running Microsoft Windows XP. The laptop computer is an Asus U5F with an Intel Core 2 Duo T5500 (1.66 GHz) processor, Mobile Intel i945GM chipset, and 1 GB of DDR2 memory. The free-moving monocular camera utilized in this work is a Logitech C120 CMOS web-cam with 320 × 240-pixel resolution and a USB 2.0 interface. The camera is calibrated using the Matlab toolbox provided by Bouguet [23]. The focal lengths are f_u = 364.4 pixels and f_v = 357.4 pixels, and the offset pixels are u_0 = 156.0 pixels and v_0 = 112.1 pixels. We carried out three experiments: SLAM in a static environment, SLAM with MOT, and people detection and tracking.

4.1. SLAM in static environment

In this experiment, the camera is carried by a person to circle around a bookshelf (1.5 m × 2 m floor dimensions) in our laboratory. The resultant map and the camera pose estimation are plotted in Figure 9. In this figure, the estimated states of the camera and landmarks are illustrated in a 3D plot. The ellipses indicate the uncertainty of the landmarks obtained from the extracted image features. The rectangular box represents the free-moving monocular camera and the solid line depicts the trajectory of the camera. More detailed image frames of the experimental results are illustrated in Figures 10, 11, 12, 13, 14, 15, 16, and 17. For each figure, the captured image is shown in the left panel and the top-view plot is depicted in the right panel. The (blue) circular marks in the left panels indicate the landmarks extracted from the captured image with an unknown image depth, while the (red) square marks represent the landmarks with a known and stable image depth. In the right panels, the estimated states of the camera and landmarks are illustrated in a 2D plot. The red (dark) ellipses represent the uncertainty of the landmarks with known image depths and the green (light) ellipses denote the uncertainty of the landmarks with unknown image depths. Meanwhile, the rectangular box represents the free-moving camera and the trajectory of the estimated camera pose is plotted as a solid line. As shown in the right panel of Figure 10 for the 31st image frame, the SLAM system starts up and captures four image features with known positions. These features help to initialize the map scale. After the start-up, more image features are extracted and treated as landmarks with unknown image depth, as indicated by the circular marks in Figure 11 for the 32nd frame. In Figure 12 for the 71st frame, the uncertainty of the image depth of feature 10 has been reduced to a small region (red ellipse). The SLAM system builds the environment map and estimates the camera pose concurrently while the camera is carried around the laboratory, as shown in Figure 13 for the 645th frame. In Figures 14 and 15 for the 2265th and 2340th frames, the camera returns to a place it has visited before and the trajectory loop is closed. Some old landmarks (with indices less than 100) are captured again and the covariance of the state vector is reduced gradually. The second and third loop-closures are depicted in Figures 16 and 17 for the 3355th and 4254th frames. In these frames, old landmarks are visited again and the covariance of the state vector is reduced further.

Figure 9. Three-dimensional map and camera pose estimation.

Figure 10. 31st frame: the SLAM system starts up with four known features, 0, 1, 2, 3.

Figure 11. 32nd frame: some new features are detected and initialized.

Figure 12. 71st frame: the image depth uncertainty of feature 10 decreases to a small region.

Figure 13. 645th frame: the SLAM system maps the environment and estimates the camera pose concurrently.

Figure 14. 2265th frame: before the loop closure.

Figure 15. 2340th frame: the first time the camera reaches the loop-closure.

Figure 16. 3355th frame: the second time the camera reaches the loop-closure.

Figure 17. 4254th frame: the third time the camera reaches the loop-closure.

The map size and the sampling frequency (Hz) at each frame are depicted in Figure 18. The map size increases to about 145 features at the first loop-closure and remains at an almost constant size thereafter (about 200 features at the third loop-closure). That is, the SLAM system can rely on the map of old landmarks for localization when it revisits the same environment. As shown in Figure 18, the sampling frequency is about 20 Hz when the map size is small; as the map size increases, it drops to and remains at about 5 Hz.

Figure 18. Map size and sampling frequency.

The deviations of the camera pose along the x-, y-, and z-axes are plotted in Figure 19. The camera pose deviations decrease suddenly at each loop-closure because old landmarks are revisited and the camera pose and landmark locations are updated accordingly. We can also see from Figure 19 that the lower bound of the pose deviation decreases further after each loop-closure.

Figure 19. The deviation of the camera pose estimation.

4.2. SLAM with MOT

In this example, the camera is carried to move around one corner of our laboratory. Meanwhile, SLAM is implemented to map the environment and estimate the camera pose, as well as to detect and track a moving object. The estimated camera pose and landmarks are illustrated in a 3D plot, as shown in Figure 20. The ellipses in the figure indicate the landmarks obtained from the extracted image features. The rectangular box represents the free-moving monocular camera and the solid line depicts the trajectory of the camera. More detailed image frames of the experimental results are illustrated in Figures 21, 22, 23, 24, 25, 26, 27, 28, 29, and 30. For each figure, the captured image is shown in the left panel and the top-view plot is depicted in the right panel. As shown in Figure 21 for the 52nd image frame, the SLAM system starts up and captures four image features with known 3D coordinates. After the start-up, some stationary features with unknown status are extracted and treated as new landmarks for mapping, as shown in the 98th frame in Figure 22. These stationary landmarks are initialized using inverse depth parameterization [17]. The system demonstrates a stable implementation of SLAM, as shown in Figures 22, 23, 24, and 25. In Figure 26, soon after feature no. 30 is detected, it is discriminated from the stationary features using the proposed MOD algorithm and then treated as a moving object. This moving object is tracked in the image plane and enclosed by a rectangle in Figures 26, 27, 28, 29, and 30. Therefore, the developed system demonstrates the capability of simultaneous localization, mapping, and MOT.

Figure 20. Three-dimensional map and camera pose estimation.

Figure 21. 52nd frame: system start-up with four known features, 0, 1, 2, 3.

Figure 22. 98th frame: three new features are initialized.

Figure 23. 330th frame: SLAM.

Figure 24. 526th frame: SLAM.

Figure 25. 563rd frame: one object (the box) is moving to the right.

Figure 26. 615th frame: feature 30 on the object is detected and initialized with large image depth uncertainty.

Figure 27. 621st frame: the object stops moving and the image depth uncertainty of the feature is reduced.

Figure 28. 634th frame: the image depth uncertainty of the tracked feature is further reduced and the feature is treated as a 3D point.

Figure 29. 657th frame: the object moves to the right again and the covariance increases.

Figure 30. 684th frame: the object keeps moving to the right and the covariance increases further.

4.3. People detection and tracking

In this experiment, the SLAM system with MOD is utilized to detect and track a moving person in the environment. The camera is carried to move around a corner of our laboratory. The estimated states of the camera pose, the landmarks, and the moving person are illustrated in 2D top-view plots in the right panels of Figures 31, 32, 33, 34, and 35; the captured image is shown in the left panel of each figure. As shown in Figure 31 for the 350th image frame, the SLAM system builds the environment map and estimates the camera pose concurrently and stably. The ellipses in the right panel indicate the landmarks obtained from the extracted image features. The rectangular box represents the free-moving monocular camera and the thin solid line depicts the trajectory of the camera. A person enters the scene, as shown in the 365th frame in Figure 32, and two image features on the human body are detected. These image features are discriminated from the stationary features using the proposed MOD algorithm, and are further initialized in the state vector and tracked using Equation (8). The SLAM system continuously tracks the moving person until he leaves the scene, as shown in Figures 33, 34, and 35. The trajectory of the moving person is depicted as a thick solid line in the right panel of each figure. Hence, the developed system also demonstrates the capability of simultaneous localization, mapping, and people tracking.

Figure 31. 350th frame: the system performing SLAM.

Figure 32. 365th frame: moving objects are detected.

Figure 33. 386th frame: the detected moving objects are tracked.

Figure 34. 392nd frame: the system performs SLAM and MOT.

Figure 35. 410th frame: MOT.

5. Conclusions

In this research, we developed an algorithm for the detection and tracking of moving objects to improve the robustness of a robot visual SLAM system. SURF is also utilized to provide robust detection of image features and a stable description of the features. Three experiments have been carried out on a monocular vision system: SLAM in a static environment, SLAM with MOT, and people detection and tracking. The results show that the monocular SLAM system with the proposed algorithm has the capability to support robot systems in simultaneously navigating and tracking moving objects in dynamic environments.

References

1. Wang CC, Thorpe C, Thrun S, Hebert M, Durrant-Whyte H: Simultaneous localization, mapping and moving object tracking. Int J Robot Res 2007, 26(9):889-916. doi:10.1177/0278364907081229

2. Bibby C, Reid I: Simultaneous Localisation and Mapping in Dynamic Environments (SLAMIDE) with Reversible Data Association. In Proceedings of Robotics: Science and Systems III. Georgia Institute of Technology, Atlanta; 2007.

3. Zhao H, Chiba M, Shibasaki R, Shao X, Cui J, Zha H: SLAM in a Dynamic Large Outdoor Environment using a Laser Scanner. In Proceedings of the IEEE International Conference on Robotics and Automation. Pasadena, California; 2008:1455-1462.

4. Davison AJ, Reid ID, Molton ND, Stasse O: MonoSLAM: Real-time single camera SLAM. IEEE T Pattern Anal 2007, 29(6):1052-1067.

5. Paz LM, Pinies P, Tardos JD, Neira J: Large-Scale 6-DOF SLAM with Stereo-in-Hand. IEEE T Robot 2008, 24(5):946-957.

6. Harris C, Stephens M: A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference. University of Manchester; 1988:147-151.

7. Karlsson N, Bernardo ED, Ostrowski J, Goncalves L, Pirjanian P, Munich ME: The vSLAM Algorithm for Robust Localization and Mapping. In Proceedings of the IEEE International Conference on Robotics and Automation. Barcelona, Spain; 2005:24-29.

8. Sim R, Elinas P, Little JJ: A Study of the Rao-Blackwellised Particle Filter for Efficient and Accurate Vision-Based SLAM. Int J Comput Vision 2007, 74(3):303-318. doi:10.1007/s11263-006-0021-0

9. Lowe DG: Distinctive image features from scale-invariant keypoints. Int J Comput Vision 2004, 60(2):91-110.

10. Bay H, Tuytelaars T, Van Gool L: SURF: Speeded up robust features. In Proceedings of the Ninth European Conference on Computer Vision. Lecture Notes in Computer Science 3951. Springer-Verlag, Berlin, Germany; 2006:404-417.

11. Shakhnarovich G, Darrell T, Indyk P: Nearest-Neighbor Methods in Learning and Vision. MIT Press, Cambridge, MA; 2005.

12. Smith R, Self M, Cheeseman P: Estimating Uncertain Spatial Relationships in Robotics. In Autonomous Robot Vehicles. Edited by Cox IJ, Wilfong GT. Springer-Verlag, New York; 1990:167-193.

13. Blom HAP, Bar-Shalom Y: The interacting multiple-model algorithm for systems with Markovian switching coefficients. IEEE T Automat Control 1988, 33:780-783. doi:10.1109/9.1299

14. Hutchinson S, Hager GD, Corke PI: A tutorial on visual servo control. IEEE T Robot Automat 1996, 12(5):651-670. doi:10.1109/70.538972

15. Sciavicco L, Siciliano B: Modelling and Control of Robot Manipulators. McGraw-Hill, New York; 1996.

16. Wang YT, Lin MC, Ju RC: Visual SLAM and Moving Object Detection for a Small-size Humanoid Robot. Int J Adv Robot Syst 2010, 7(2):133-138.

17. Civera J, Davison AJ, Montiel JMM: Inverse Depth Parametrization for Monocular SLAM. IEEE Trans Robot 2008, 24(5):932-945.

18. Lindeberg T: Feature detection with automatic scale selection. Int J Comput Vision 1998, 30(2):79-116. doi:10.1023/A:1008045108935

19. Wang YT, Hung DY, Sun CH: Improving Data Association in Robot SLAM with Monocular Vision. J Inf Sci Eng 2011, 27(6):1823-1837.

20. Longuet-Higgins HC: A computer algorithm for reconstructing a scene from two projections. Nature 1981, 293:133-135. doi:10.1038/293133a0

21. Hartley RI: In Defense of the Eight-Point Algorithm. IEEE T Pattern Anal 1997, 19(6):580-593. doi:10.1109/34.601246

22. Luong QT, Faugeras OD: The Fundamental Matrix: Theory, Algorithms, and Stability Analysis. Int J Comput Vision 1996, 17(1):43-75. doi:10.1007/BF00127818

23. Bouguet JY: Camera Calibration Toolbox for Matlab. 2011. [http://www.vision.caltech.edu/bouguetj/calib_doc/]


Acknowledgements

This article was partially supported by the National Science Council in Taiwan under grant no. NSC100-2221-E-032-008 to Y.T. Wang.

Author information

Correspondence to Yin-Tien Wang.

Additional information

Competing interests

The authors declare that they have no competing interests.


Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Wang, YT., Sun, CH. & Chiou, MJ. Detection of moving objects in image plane for robot navigation using monocular vision. EURASIP J. Adv. Signal Process. 2012, 29 (2012). https://doi.org/10.1186/1687-6180-2012-29
