EURASIP Journal on Applied Signal Processing 2003:1, 41–47
© 2003 Hindawi Publishing Corporation

Retrieval by Local Motion

Motion features play an important role in video retrieval. The current literature mostly addresses motion retrieval only by camera motion and the global motion of individual video objects in a video scene. In this paper, we propose two new motion descriptors that capture the local motion of the video object within its bounding box. The proposed descriptors are rotation and scale invariant and are based on the angular and circular area variances of the video object and on the variances of the angular radial transform coefficients. Experiments show that the rankings obtained by querying with our proposed descriptors closely match the human rankings.


INTRODUCTION
As the advancements in digital video compression resulted in the availability of large video databases, indexing and retrieval of video became a very active research area. Unlike still images, video has a temporal dimension that we can associate with motion features. We use this information as one of the key components to describe video sequences; for example, "this is the part where we were salsa dancing" or "this video shows my daughter skating for the first time." Consequently, motion features play an important role in content-based video retrieval.
It is possible to classify the types of video motion features into three groups.
(i) Global motion of the video or camera motion (e.g., camera zoom, pan, tilt, roll).
(ii) Global motion of the video objects within a frame (e.g., an object is moving from the left to the right of the scene).
(iii) Local motion of the video object (e.g., a person is raising his/her arms).
Camera operation analysis is generally performed by analyzing the directions of the motion vectors that are present in the compressed video bit stream [1,2,3] or by optical flow analysis in the spatial domain [4]. For example, panning and tilting motions are likely to be present if most of the motion vectors inside a frame are in the same direction. Similarly, zooming motion can be identified by determining whether or not the motion vectors at the top/left of the frame point in directions opposite to those of the motion vectors at the bottom/right of the frame [5,6].
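The two heuristics described above can be sketched in a few lines of Python. This is an illustrative toy and not the method of the cited works: the function name `classify_camera_motion`, the `(x, y, dx, dy)` tuple layout for a motion vector and its block position, and the thresholds `agree` and `cos_min` are all assumptions made for this example.

```python
import math

def classify_camera_motion(vectors, width, height, agree=0.8, cos_min=0.9):
    """Label a frame from its motion vectors, given as (x, y, dx, dy) tuples."""
    moving = [v for v in vectors if v[2] or v[3]]
    if not moving:
        return "static"
    mean_dx = sum(v[2] for v in moving) / len(moving)
    mean_dy = sum(v[3] for v in moving) / len(moving)
    mean_len = math.hypot(mean_dx, mean_dy)
    if mean_len > 0:
        # Pan/tilt: most vectors point close to the mean direction.
        aligned = sum(
            1 for _, _, dx, dy in moving
            if (dx * mean_dx + dy * mean_dy) / (math.hypot(dx, dy) * mean_len) >= cos_min
        )
        if aligned / len(moving) >= agree:
            return "pan/tilt"
    # Zoom: mean vectors of the top-left and bottom-right quadrants oppose each other.
    tl = [(dx, dy) for x, y, dx, dy in moving if x < width / 2 and y < height / 2]
    br = [(dx, dy) for x, y, dx, dy in moving if x >= width / 2 and y >= height / 2]
    if tl and br:
        tl_dx = sum(d[0] for d in tl) / len(tl)
        tl_dy = sum(d[1] for d in tl) / len(tl)
        br_dx = sum(d[0] for d in br) / len(br)
        br_dy = sum(d[1] for d in br) / len(br)
        if tl_dx * br_dx + tl_dy * br_dy < 0:
            return "zoom"
    return "other"
```

A practical detector would work on the decoded macroblock vectors and would need outlier handling; the sketch only mirrors the two rules stated in the text.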
Global motion of video objects is represented with their motion trajectories, which are formed by tracking the location of video objects (the object's mass center or some selected points on the object) over a sequence of frames. Forming motion trajectories generally requires segmentation of video objects in a video scene. In MPEG-4, the location information of the video object bounding box (the upper-left corner) is already available in the bit stream, making the formation of the trajectory a simple task [7]. The classification and matching of object motion trajectories is a challenging issue as the trajectories contain both the path and the velocity information of the objects. In [8], Little and Gu proposed to extract separate curves for the object path and speed and to match these two components separately. Rangarajan et al. [9] demonstrated two-dimensional motion trajectory matching through scale-space, and Chen and Chang [10] proposed to match the motion trajectories via a wavelet decomposition.
Most available content-based video retrieval systems in the literature employ camera motion features and/or global object motion for retrieval by motion. For example, the Jacob system [11] supports queries using common camera motion changes such as pan, zoom, and tilt. Another retrieval system, VideoQ, employs a spatio-temporal segmentation algorithm in order to retrieve individual objects with their global motion inside a scene [12]. It allows the user to specify an arbitrary polygonal trajectory for the query object and retrieves the video sequences that contain video objects with similar trajectories. Similar to VideoQ, NeTra-V supports spatio-temporal queries and utilizes motion histograms for global camera and video object motion retrieval [13]. Moreover, the content-based description standard MPEG-7 [14,15] supports motion descriptors, in particular, camera motion which characterizes the 3D camera operations, motion trajectory which captures 2D transitional motion of objects, parametric motion which describes the global deformations, and motion activity which specifies the intensity of action.
On the other hand, local motion, that is, the motion of video objects within their bounding boxes, could give valuable information about their articulated parts, elasticity, occlusion, and so forth. Classifying and identifying video objects using their local motion is potentially useful in many applications. For example, it could be useful to identify some suspicious human actions in surveillance video sequences. It could also be useful for efficient video compression, where the encoder can allocate more coding bits or a better communication channel for the video objects that demonstrate important actions, for example, a person running out of a store (there is a chance that the person might be a criminal) or a player scoring. Moreover, processing database queries such as "find a video sequence where people are dancing" would be possible only by enabling the retrieval of video objects with their local motion.

The current research in detecting the local motion of video objects has been restricted mostly to specific domains. Stalidis et al. employed a wavelet-based model using boundary points of magnetic resonance images (MRI) to describe the cardiac motion in [16]. Miyamori and Iisaku [17] proposed to classify the actions of tennis players using 2D appearance-based matching. Hoey and Little suggested a method for the classification of motion, which is based on the representation of flow fields with Zernike polynomials in [18]. Their method is applied to the classification of facial expressions. In [19], Fujiyoshi and Lipton presented a process to analyze human motion by first obtaining the skeleton of the objects and then determining the body posture and motion of skeleton segments to determine human activities. Human motion classification was also studied by other researchers including Little and Boyd in [20], where they proposed to recognize individuals by periodic variation in the shape of their motion, and Heisele and Woehler in [21], where they suggested discriminating pedestrians by characterizing the motion of the legs. Moreover, Cutler and Davis [22] proposed to characterize the local motion by detecting periodicity of the motion by Fourier analysis on the gray scale video. Most of the work in this area focuses on "recognizing" the motion of specific objects and assumes prior knowledge about the video content.
As the video object content becomes more widely available, mostly due to the emergence of 3D video capture devices [23,24], the object-based MPEG-4 [25] video encoding standard, and the availability of state-of-the-art segmentation algorithms [26,27], there is a need for more generic motion features that describe the local motion of video objects. In this paper, we propose two content-independent local motion descriptors. Motivated by the fact that any significant motion of video objects within their bounding box would very likely result in changes in their shape, our motion descriptors are based on the shape deformations of video objects. The first descriptor, angular circular local motion (ACLM), is computed by dividing the video object area into a number of angular and circular segments and computing the variance of each segment over a period of time. The other proposed descriptor is based on the variances of the angular radial transform (ART) coefficients. We assume that the segmented objects are obtained beforehand. The proposed descriptors are extracted using the video objects' binary shape masks.

The rest of the paper is organized as follows. Sections 2 and 3 describe the proposed local motion descriptors as well as their extraction and matching. Experimental results that illustrate the retrieval performance of our methods and the associated trade-offs are presented in Section 4. Conclusions are given in Section 5.

ANGULAR CIRCULAR LOCAL MOTION (ACLM) DESCRIPTOR
Unlike the shape of visual objects in still images, the shape of a video object is not fixed and is very likely to change with time. Given that the camera effects, such as zooming, are compensated for, the shape deformations in an object's lifespan could offer some valuable information about the object's local motion, occlusion, articulated parts, and elasticity. The variance of the object area is a good measure for such shape deformations. Nevertheless, it may not be sufficient to capture the motion of the video objects in some cases, especially if the object motion does not have an effect on the area of the object. For example, if an object has an articulated part that is rigid in shape, then the object's area may not change even if there is local motion.

Here, we propose to divide the binary shape mask of a video object into $M$ angular and $N$ circular segments and use the variance of the pixels that fall into each segment to describe the local motion. Variances are computed for each angular circular segment in the temporal direction using the temporal instances of the video objects. Then, the local motion feature matrix $R$ is formed for each video object as the $N \times M$ matrix whose entry $(n, m)$ is $\sigma^2_{n,m}$, where $M$ and $N$ are the number of angular and circular sections, respectively, and $\sigma^2_{n,m}$ is the variance of the pixels that fall into the segment $(n, m)$. In computing $\sigma^2_{n,m}$, $K$ is the number of temporal instances of the video object, $\mathrm{VOP}_k$ is the binary shape mask of the video object plane (VOP) at the $k$th instant, $\mathrm{VOP}_k(\rho, \theta)$ is the value of the binary shape mask in $\mathrm{VOP}_k$ at the $(\rho, \theta)$ position in the polar coordinate system centered at the mass center of $\mathrm{VOP}_k$, and $A(n, m)$, $\theta_m$, and $\rho_n$ are the area, the start angle, and the start radius of the angular circular segment $(n, m)$, respectively. These segment parameters are determined by $M$, $N$, and $\rho_{\max}$, where $\rho_{\max}$ is the largest, over the $K$ temporal instances, of the radii $\rho_{\mathrm{VOP}_k}$, and $\rho_{\mathrm{VOP}_k}$ is the radius of the tightest circle around $\mathrm{VOP}_k$ that is centered at the mass center of $\mathrm{VOP}_k$.

The proposed descriptor is scale invariant since the number of angular and circular segments is the same for all video objects, and the size of each segment is scaled with $\rho_{\max}$. We attain an approximate rotation invariance of the descriptor by employing an appropriate query matching method similar to the one used for matching the contour-based shape descriptor in MPEG-7 [14]. That is, we provide the rotation invariance by reordering the feature matrix $R$ so that the angular segment with the largest variance is in the first column of $R$. This is achieved by first summing each column of the feature matrix $R$ to obtain the $1 \times M$ projection vector $A$ and then finding the maximum element of $A$, which corresponds to the angular segment $m_L$ that has the largest variance. Finally, we circularly shift the columns of $R$ to the left by $m_L$ to obtain a rotation invariant feature vector.
The trade-offs associated with using different numbers of angular and circular segments for this descriptor are presented in Section 4.
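The extraction and alignment steps described in this section can be sketched in plain Python as follows. This is a minimal illustration and not the authors' implementation: it assumes uniform sectors and rings, normalizes the pixel count of each segment by the segment area, works directly on Cartesian pixel grids (lists of 0/1 rows) rather than on resampled polar masks, and all function names (`mass_center`, `enclosing_radius`, `segment_occupancy`, `aclm_descriptor`) are hypothetical.

```python
import math

def mass_center(mask):
    """Return the (row, col) centroid of the nonzero pixels of a binary mask."""
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    return sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts)

def enclosing_radius(mask):
    """Radius of the tightest centroid-centered circle that contains the object."""
    cr, cc = mass_center(mask)
    return max(math.hypot(r - cr, c - cc)
               for r, row in enumerate(mask) for c, v in enumerate(row) if v)

def segment_occupancy(mask, n_circ, m_ang, rho_max):
    """Object pixel count in each (n, m) segment, divided by the segment area."""
    cr, cc = mass_center(mask)
    occ = [[0.0] * m_ang for _ in range(n_circ)]
    for r, row in enumerate(mask):
        for c, v in enumerate(row):
            if not v:
                continue
            rho = math.hypot(r - cr, c - cc)
            theta = math.atan2(r - cr, c - cc) % (2 * math.pi)
            n = min(int(rho / rho_max * n_circ), n_circ - 1)
            m = min(int(theta / (2 * math.pi) * m_ang), m_ang - 1)
            occ[n][m] += 1
    for n in range(n_circ):
        # Area of one segment in ring n (uniform rings and sectors assumed).
        area = math.pi * rho_max ** 2 * ((n + 1) ** 2 - n ** 2) / (n_circ ** 2 * m_ang)
        for m in range(m_ang):
            occ[n][m] /= area
    return occ

def aclm_descriptor(masks, n_circ=3, m_ang=6):
    """N x M matrix of temporal variances, columns shifted for rotation alignment."""
    rho_max = max(enclosing_radius(mk) for mk in masks) or 1.0
    occ = [segment_occupancy(mk, n_circ, m_ang, rho_max) for mk in masks]
    k = len(occ)
    feat = [[0.0] * m_ang for _ in range(n_circ)]
    for n in range(n_circ):
        for m in range(m_ang):
            vals = [o[n][m] for o in occ]
            mu = sum(vals) / k
            feat[n][m] = sum((v - mu) ** 2 for v in vals) / k
    # Sum each column, find the angular segment with the largest variance,
    # and circularly shift the columns to the left by that index.
    col_sums = [sum(feat[n][m] for n in range(n_circ)) for m in range(m_ang)]
    m_l = col_sums.index(max(col_sums))
    return [row[m_l:] + row[:m_l] for row in feat]
```

Calling `aclm_descriptor(masks, n_circ=3, m_ang=6)` on the list of binary masks of one video object returns a 3 x 6 matrix; a static object yields variances that are zero up to rounding.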

ART-BASED LOCAL MOTION DESCRIPTOR
Employing angular radial transform (ART)-based shape descriptors is an efficient way to retrieve shape information as they are easy to extract and match. Consequently, an ART-based descriptor was recently adopted by MPEG-7 [14]. Here, we propose to use the variance of the ART coefficients, computed for each object plane of a video object, as a local motion descriptor. As the ART descriptors describe the region of a shape, unlike their contour-based counterparts such as curvature scale-space and Fourier descriptors, they are capable of representing holes and unconnected regions in the shape. Therefore, our proposed ART-based descriptor captures a large variety of shape region deformations caused by the local motion.

The ART transform is defined as [14]
\[ F_{nm} = \langle V_{nm}(\rho, \theta), f(\rho, \theta) \rangle = \int_{0}^{2\pi} \int_{0}^{1} V^{*}_{nm}(\rho, \theta) \, f(\rho, \theta) \, \rho \, d\rho \, d\theta, \]
where $F_{nm}$ is an ART coefficient of order $n$ and $m$, $f(\rho, \theta)$ is the binary shape map in polar coordinates, and $V_{nm}(\rho, \theta)$ is the ART basis function, which is separable along the angular and radial directions as follows:
\[ V_{nm}(\rho, \theta) = A_m(\theta) \, R_n(\rho). \]
The angular and radial basis functions are given by
\[ A_m(\theta) = \frac{1}{2\pi} e^{jm\theta}, \qquad R_n(\rho) = \begin{cases} 1, & n = 0, \\ 2\cos(\pi n \rho), & n \neq 0. \end{cases} \]

The discrete ART coefficients of a binary shape map are found as follows. First, the size of the binary shape data is normalized by a linear interpolation to a predefined width $W$ and height $H$, to obtain the size invariant shape map $I(x, y)$. The mass center of the binary shape map is aligned with the center of $I(x, y)$, that is, $I(W/2, H/2)$. Then, the discrete ART coefficients of the shape map of the object plane $k$ ($\mathrm{VOP}_k$) are computed by evaluating the inner product above as a sum over the pixels of $I(x, y)$, with $(\rho, \theta)$ taken relative to the center of $I(x, y)$.

The ART coefficients of the individual object planes are rotation variant. When ART coefficients are employed for still shape retrieval, the magnitudes of the ART coefficients are used for rotation invariance. Since we would like to capture any rotational changes that may be present in the shape of the video object when computing the variances in the ART coefficients, we employ the complex ART coefficients. The final ART-based local motion descriptor is defined as the magnitude of the complex variance of the ART coefficients computed over time, which is rotation invariant.
Because the area of the object shape is normalized for size prior to computing the ART coefficients, the local motion descriptor captures the real deformations of the shape, and it is robust to changes in the area of the video objects due to events such as camera zooming, partial occlusion, and so on. If the application requires the motion descriptor to capture such events, the size normalization of the descriptor should be done with respect to the largest object plane of the video object. The retrieval performance results of this descriptor, obtained by using various numbers of angular and radial functions, are presented in Section 4.
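The descriptor described in this section can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the shape maps are assumed to be already size normalized and centered, pixel radii are scaled by half of the larger map dimension so that pixels outside the unit disk are ignored, and "complex variance" is read as the mean of the squared (not conjugated) deviations from the temporal mean. The function names `art_basis`, `art_coefficients`, and `art_local_motion` are hypothetical.

```python
import cmath
import math

def art_basis(n, m, rho, theta):
    """ART basis V_nm(rho, theta) = A_m(theta) * R_n(rho)."""
    radial = 1.0 if n == 0 else 2.0 * math.cos(math.pi * n * rho)
    return radial * cmath.exp(1j * m * theta) / (2 * math.pi)

def art_coefficients(shape_map, n_rad=2, m_ang=4):
    """Complex ART coefficients F[n][m] of one normalized binary shape map."""
    h, w = len(shape_map), len(shape_map[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    radius = max(cy, cx) or 1.0  # assumption: unit disk spans the larger dimension
    coeffs = [[0j] * m_ang for _ in range(n_rad)]
    for y, row in enumerate(shape_map):
        for x, v in enumerate(row):
            if not v:
                continue
            rho = math.hypot(y - cy, x - cx) / radius
            if rho > 1.0:
                continue  # outside the unit disk
            theta = math.atan2(y - cy, x - cx)
            for n in range(n_rad):
                for m in range(m_ang):
                    coeffs[n][m] += art_basis(n, m, rho, theta).conjugate()
    return coeffs

def art_local_motion(shape_maps, n_rad=2, m_ang=4):
    """Magnitude of the temporal complex variance of each ART coefficient."""
    coeffs = [art_coefficients(s, n_rad, m_ang) for s in shape_maps]
    k = len(coeffs)
    descriptor = []
    for n in range(n_rad):
        for m in range(m_ang):
            vals = [c[n][m] for c in coeffs]
            mu = sum(vals) / k
            descriptor.append(abs(sum((v - mu) ** 2 for v in vals) / k))
    return descriptor
```

With this reading, rotating every object plane by the same angle multiplies each coefficient of angular order m by a fixed unit-magnitude phase, so the magnitudes returned by `art_local_motion` are unchanged.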

Performance evaluation
We present our retrieval results by utilizing the normalized modified retrieval rank (NMRR) measure used in the MPEG-7 standardization activity [28]. NMRR indicates not only how many of the correct items are retrieved, but also how highly they are ranked among the retrieved items. NMRR is given by
\[ \mathrm{NMRR}(q) = \frac{\dfrac{1}{NG(q)} \displaystyle\sum_{k=1}^{NG(q)} \mathrm{Rank}(k) \; - \; 0.5 \; - \; \dfrac{NG(q)}{2}}{K + 0.5 - 0.5 \, NG(q)}, \]
where $NG(q)$ is the number of ground truth items marked as similar to the query item $q$, $\mathrm{Rank}(k)$ is the ranking of the $k$th ground truth item by the retrieval algorithm, and $K$ is equal to $\min(4 \cdot NG(q), \, 2 \cdot GTM)$, where $GTM$ is the maximum of $NG(q)$ over all the queries. The NMRR is in the range [0, 1], and smaller values represent better retrieval performance. ANMRR is defined as the average NMRR over a range of queries.
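The measure can be sketched in a few lines of Python. This is a minimal illustration: the function names `nmrr` and `anmrr` are hypothetical, and the handling of ground truth items that are not retrieved within the first K results (they are assigned rank K + 1 here, which keeps the value within [0, 1]) is an assumption, since the text does not state the penalty.

```python
def nmrr(ranks, ng, gtm):
    """NMRR for one query.

    ranks: 1-based retrieval ranks of the NG ground truth items,
           with None for an item that was not retrieved at all.
    ng:    number of ground truth items for this query, NG(q).
    gtm:   maximum of NG(q) over all queries.
    """
    k = min(4 * ng, 2 * gtm)
    # Assumption: items ranked beyond K (or missing) are assigned rank K + 1.
    clipped = [r if r is not None and r <= k else k + 1 for r in ranks]
    average_rank = sum(clipped) / ng
    return (average_rank - 0.5 - ng / 2) / (k + 0.5 - 0.5 * ng)

def anmrr(per_query):
    """Average NMRR over a list of (ranks, ng) pairs, one pair per query."""
    gtm = max(ng for _, ng in per_query)
    return sum(nmrr(ranks, ng, gtm) for ranks, ng in per_query) / len(per_query)
```

A query whose ground truth items occupy the top NG positions scores 0, and a query that retrieves none of them within the first K positions scores 1.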

Retrieval performance
Here, we demonstrate the performance of each of our proposed local motion descriptors. Our database contains over 20 arbitrarily shaped video objects, coded in 2 to 3 different spatial resolutions each, resulting in an MPEG-4 object database of over 50 bit streams. The ANMRR values presented in this section are obtained by averaging the retrieval results of 12 query video objects that have a large variety of local motions. The ground truth objects are decided by having three human subjects rank the video objects for their local motion similarity to the query video objects. The similarity distance between two shapes is measured by computing the Euclidean distance between their local motion descriptors.
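The matching step amounts to sorting the database by Euclidean distance to the query descriptor, as in the following sketch. The function names `euclidean` and `rank_by_local_motion` and the dictionary layout (object name mapped to a flattened descriptor) are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_local_motion(query, database):
    """Return the object names in the database, most similar first.

    query:    flattened local motion descriptor of the query object.
    database: dict mapping an object name to its flattened descriptor.
    """
    return sorted(database, key=lambda name: euclidean(query, database[name]))
```

For the ACLM descriptor, the feature matrix would be flattened row by row after the rotation alignment, so that corresponding segments are compared.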
Retrieval performance results using the ACLM descriptor with various numbers of angular and circular segments are presented in Figure 1. Note that smaller ANMRR values represent a better retrieval performance. Employing a large number of angular and circular bins generally results in a better retrieval performance, but at the cost of more bits required to represent the descriptor. The highest retrieval rates (i.e., lowest ANMRR) here are obtained by using 6 angular and 3 circular segments (ANMRR = 0.090) and 8 angular and 2 circular segments (ANMRR = 0.089).
Some query examples using 6 angular and 3 circular segments are presented in Tables 1 and 2. Note that the dimensions given in the parentheses are not the dimensions of the video objects, but the resolutions of the video sequences from which they are extracted. The dimensions of the video objects are different for each plane of the video object. One important point to note is that, because of the simple upsampling/downsampling methods used to obtain various resolutions of the same video objects, the different resolutions of the same objects are not likely to have exactly the same shapes. Thus, even though our descriptor is scale invariant, the query distances corresponding to the different resolutions of the same object may not be identical.
The first query, shown in Figure 2, is a very low-motion anchorperson video object, News 1, which is coded in two different resolutions in our database. As presented in Table 1, using the ACLM descriptor, the two different resolutions of the News 1 video object are retrieved as the first two items. The other two highly ranked anchorperson video objects, illustrated in Figure 2, are also very low in motion. The Coastguard video object, ranked 7th, 8th, and 9th, is also an object without any articulated parts (a boat object and its waves) and with moderate local motion.

Our second query, Hall Monitor 1, is the video object of a walking man captured by a surveillance camera as shown in Figure 3. The query results for this object are presented in Table 2. The three different resolutions of the video object are ranked the highest, and another walking man video object from the same sequence, Hall Monitor 2, is ranked immediately after. The fish object, which has large moving fins and a tail as depicted in Figure 3, is ranked 6th. The different resolutions of a video object that contains a person playing tennis are ranked 8th and 9th.
As can be seen from these query examples, the ACLM descriptor successfully classifies the local motion of the video objects.
The number of angular and radial functions of the ART descriptor determines how accurately the shape is represented. Considering that video object shapes, unlike trademark shapes for example, generally do not contain much detail, using a small number of basis functions to represent the shape maps would be sufficient and result in a more compact descriptor. Representation with a small number of basis functions also makes the descriptor more robust to potential segmentation errors. The retrieval performance achieved by using different numbers of angular and radial functions is presented in Figure 4. As can be observed from the figure, employing 4 angular and 2 radial basis functions offers a good trade-off between the retrieval performance (ANMRR = 0.181141) and the compactness of the descriptor.

CONCLUSIONS
In this paper, we proposed two local motion descriptors for the retrieval of video objects. As presented in Section 4, the ranking obtained by employing our descriptors closely matches the human ranking. According to the ANMRR scores obtained, the ACLM descriptor offers a better retrieval rate than the ART-based descriptor. Given that each descriptor value is quantized to the [0, 255] range, the ACLM descriptor requires 16 bytes and the ART-based descriptor requires 8 bytes to represent. The ACLM descriptor is less computationally complex to extract. Nevertheless, if the ART coefficients of the video object are already computed and attached to the video objects as metadata for shape retrieval, then the extra computations required to extract the local motion descriptors based on the ART coefficients are minimal. Depending on the application, either of the proposed descriptors could be used for efficient video object retrieval by local motion.

Figure 1: Retrieval results of the ACLM descriptor obtained by using various numbers of angular and circular (CIR) segments.

Figure 3: The video objects classified as being similar in terms of their local motion to the query video object Hall Monitor 1.

Figure 4: Retrieval results of the ART-based local motion descriptor obtained by employing different numbers of angular and radial (RAD) basis functions.

Table 1: Local motion retrieval results for the News 1 video object query.

Table 2: Local motion retrieval results for the Hall Monitor 1 video object query.