Temporal Segmentation of MPEG Video Streams

Many algorithms for temporal video partitioning rely on the analysis of uncompressed video features. Since the information relevant to the partitioning process can be extracted directly from the MPEG compressed stream, higher efficiency can be achieved by utilizing information from the MPEG compressed domain. This paper introduces a real-time algorithm for scene change detection that analyses the statistics of macroblock features extracted directly from the MPEG stream. A method for extraction of a continuous frame difference that transforms the 3D video stream into a 1D curve is presented. This transform is then further employed to extract temporal units within the analysed video sequence. Results of computer simulations are reported.


INTRODUCTION
The development of highly efficient video compression technology, combined with the rapid increase in desktop computer performance and the decrease in storage cost, has led to a proliferation of digital video media. As a consequence, many terabytes of video data stored in large video databases are often not catalogued and are accessible only by sequential scanning of the sequences. To make the use of large video databases more efficient, we need to be able to automatically index, search, and retrieve relevant material.
It is important to stress that, even with leading-edge hardware accelerators, factors such as algorithm complexity and storage capacity still must be addressed. For example, although compression provides tremendous space savings, it often introduces processing inefficiencies when decompression is required to perform spatial processing for indexing and retrieval. With this in mind, one of the initial considerations in the development of a system for video retrieval is to enhance access capabilities within the existing compression representations.
Since the identification of the temporal structure of video is an essential task in video indexing and retrieval [1], shot detection has been generally accepted as the first step in indexing algorithm implementation. We define a shot as a sequence of frames that were (or appear to be) "continuously captured from the same camera" [2]. A scene is defined as a "collection of one or more adjoining shots that focus on an object or objects of interest" [3].
Shot change detection algorithms can be classified, according to the features used for processing, into uncompressed and compressed domain algorithms. Algorithms in the uncompressed domain utilize features extracted from the spatial video domain: pixel-wise differences [4], histograms [5], edge tracking [6], and so forth. These techniques are computationally demanding and time-consuming, and thus inferior to approaches based on compressed domain analysis.
Development in this area is particularly focused on the use of the prevalent MPEG compression standard. Pioneering work by Arman et al. [7] introduced the initial approach to compressed domain shot detection by analysing subsets of Discrete Cosine Transform (DCT) coefficients and their correlation. Yeo and Liu [8] analysed the sequence of reduced images extracted from the DC coefficients in the transformation domain, called the DC sequence. Sethi and Patel [9] used DC sequence histograms to apply the χ² statistical test. Continuing in a similar manner, Lee et al. [10] exploited information from the first few AC coefficients in the transformation domain and tracked binary edge maps to parse the video sequence. Although utilizing DCT coefficients proved much faster than spatial domain analysis, the processing time needed to apply motion compensation remained an obstacle to this approach. On the other hand, the algorithms that omitted motion compensation and analysed only I frames required a second pass to accurately detect a shot change at B or P frames. Meng et al. [11] presented an original approach by utilizing only features directly embedded in the MPEG stream: statistics on the numbers and types of prediction vectors used to encode P and B frames. Likewise, Kobla et al. [12] detect shot changes using discontinuous difference metrics and validate the changes by analysing the DCT data. A step forward was made by Pei and Chou [13], who matched patterns of macroblock (MB) types within abrupt or gradual changes against the expected shapes, combining this with partial spatial information. However, these methods have not shown real-time processing capabilities, and none of them generated a continuous output, essential for further scalable analysis.
In this paper, the main goal is to develop a new approach to the fundamental problems of a system for real-time video retrieval, searching, and browsing. The initial research objectives are directed towards the performance of the core video processing algorithms in the compressed domain, using the established international video standards: MPEG-1/2, H.263, and in future MPEG-4. This approach should introduce improvements in video retrieval with low access latency, as well as advances in processing speed and algorithm complexity. A method for extraction of a continuous frame difference that transforms the 3D video stream into a 1D curve is presented. This transform is then further employed to extract temporal units within the analysed video sequence.
This paper is organized as follows. In Section 2, the algorithm for detection of abrupt shot changes is presented. Section 3 describes the gradual transition detection algorithms, which build on a similar approach, and draws some related conclusions. Overall results are presented in Section 4, while Section 5 brings final conclusions and a summary of the paper.

SCENE CHANGE DETECTION
MPEG-2 encoders compress video by dividing each frame into blocks of size 16 × 16 called macroblocks (MBs) [14]. An MB carries information about the type of temporal prediction and the corresponding motion compensation vectors. The character of the MB prediction is recorded in an MPEG variable called MBType: a macroblock can be intracoded, forward referenced, backward referenced, or interpolated. Within a video sequence, a continuously strong interframe reference will be present as long as no significant changes occur in the scene. The "amount" of interframe reference in each frame and its temporal changes can therefore be used to define a metric that measures the probability of a scene change at a given frame. We propose to extract the MBType information from the MPEG stream and to use it to measure the "amount" of interframe reference. Scene changes are then detected by thresholding the resulting function.
Without loss of generality, we assume that a group of pictures (GOP) in the analysed MPEG stream has the standard frame structure [IBBPBBPBBPBBPBB]. Observe that this frame structure can be split into groups of three frames having the form of a triplet: IBB or PBB. In the sequel, both types of reference frames (I or P) are denoted by R_i, the first bidirectional frame of a triplet by B_i, and the second bidirectional frame by b_i. Thus, the MPEG sequence can be analysed as a sequence of frame triplets of the form

. . . R_{i−1} B_i b_{i+1} R_{i+2} B_{i+3} b_{i+4} . . .

This convention can be easily generalized to any other GOP structure. The possible locations of a cut within a frame triplet are depicted in Figure 1. If the first bidirectional frame B_i is the first frame of the next shot, the next reference frame R_{i+2} predicts a significant percentage of the interframe MBs in both B_i and b_{i+1}. If the scene change occurs at R_i, then the preceding bidirectional frames B_{i−2} and b_{i−1} will be mainly referenced to R_{i−3}. Finally, if the scene change occurs at b_i, then B_{i−1} will be referenced to R_{i−2}, while b_i will be referenced to R_{i+1}.
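The triplet grouping is straightforward to implement. The following Python sketch (purely illustrative; the function name is ours, not part of the described system) splits a GOP frame-type pattern into (reference, B, b) triplets:

```python
def split_into_triplets(frame_types):
    """Group a GOP frame-type pattern such as 'IBBPBBPBB...' into
    (reference, B, b) triplets, i.e. IBB or PBB groups of three."""
    if len(frame_types) % 3 != 0:
        raise ValueError("GOP length is not a multiple of the triplet length")
    return [
        (frame_types[k], frame_types[k + 1], frame_types[k + 2])
        for k in range(0, len(frame_types), 3)
    ]

# The standard GOP from the text yields five triplets: one IBB and four PBB.
triplets = split_into_triplets("IBBPBBPBBPBBPBB")
```

A non-standard GOP structure would only require a different grouping rule here, which is what the generalization mentioned above amounts to.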
If two frames are strongly referenced, then most of the MBs in each frame will have the corresponding type, forward, backward, or interpolated, depending on the type of reference. Thus, we can define a metric for the visual frame difference by analysing the percentage (or simply the number) of MBs in a frame that are forward referenced and/or backward referenced.
Let Φ_T(i) be the set containing all forward referenced MBs and B_T(i) the set containing all backward referenced MBs in a given frame with index i and type T. We denote the cardinality of Φ_T(i) by ϕ_T(i) and the cardinality of B_T(i) by β_T(i). The frame difference metric ∆(i) is then defined, for each frame type, as a linear combination of the cardinalities ϕ_T(i) and β_T(i), with the signs chosen so that ∆(i) grows when the regular interframe reference pattern of the triplet breaks. Since ∆(i) is a frame-to-frame difference metric, peaks in ∆(i) indicate strong and abrupt changes in the video content. The cut positions are determined by thresholding, using either a predefined constant threshold or an adaptive threshold.
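As a sketch of the thresholding step, the Python fragment below flags a frame as a cut when ∆(i) exceeds an adaptive threshold computed from its neighbourhood. The window length and the factor k are our assumptions for illustration, not values from the described algorithm:

```python
import statistics

def detect_cuts(delta, window=12, k=3.0):
    """Return the indices i where delta[i] exceeds the local mean plus
    k standard deviations, computed over a sliding window that excludes
    the frame under test (so a lone peak does not inflate its own threshold)."""
    cuts = []
    for i in range(len(delta)):
        lo, hi = max(0, i - window), min(len(delta), i + window + 1)
        neigh = [delta[j] for j in range(lo, hi) if j != i]
        mu = statistics.mean(neigh)
        sd = statistics.pstdev(neigh)
        if delta[i] > mu + k * sd:
            cuts.append(i)
    return cuts
```

A predefined constant threshold is the degenerate case where mu + k * sd is replaced by a fixed value.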

GRADUAL CHANGES DETECTION
The next step in the implementation of a shot change detection algorithm is the detection of gradual changes. Unlike cuts, gradual transitions do not show such a significant change in any of the features, and thus are more difficult to detect. Furthermore, there are various types of gradual changes: dissolves, where the frames of the first shot become dimmer while the frames of the second one become brighter and are superimposed; wipes, where the image of the second shot replaces the first one in a regular pattern, such as a vertical line; and so forth. Since the feature analysis for gradual change extraction inevitably requires additional processing, real-time implementation is even harder to achieve than it is for basic cut detection.
To reduce the additional processing for gradual changes, a new approach is applied. Since the change of features during a gradual transition lasts longer than the analysed frame triplet unit, it is essential to include in the difference metric a component that is proportional to the overall change in a GOP. The difference metric for a frame with index i and type T(i) thus becomes a linear combination of the cardinalities of the macroblock type sets within one GOP. In addition to the previously defined sets Φ_T(i) and B_T(i), the set of intracoded MBs is denoted by I_T(i), and the set of interpolated MBs by Π_T(i). The cardinalities of the corresponding sets are denoted by ϕ_T(i), β_T(i), ı_T(i), and π_T(i). The metric ∆(i) is proportional to the visual changes within the frame triplet as well as to the longer alterations during gradual transitions. During a gradual change, the number of intracoded macroblocks increases, because of the lack of visual similarity with both reference frames. On the contrary, the number of interpolated macroblocks falls, and this behaviour can be used to enhance the metric sensitivity.
As depicted in Figure 2, depending on the frame type there are three different linear combinations of the variables ϕ_T(i), β_T(i), ı_T(i), and π_T(i), covering both bidirectional frames in a frame triplet. Each linear combination has two main coefficients that are directly proportional to the visual content change within the predicted and reference frames of a frame triplet (k = +1), and two that are inversely proportional to it (k = −1). The additional factors k_ı and k_π describe the overall change in a triplet, one in direct (k_ı) and one in inverse (k_π) proportion. The coefficient values were determined by rule of thumb and are presented in Table 1.
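To make the structure of such a combination concrete, here is a hedged Python sketch for one bidirectional frame. The signs follow the description above (interframe reference terms with k = ±1, intracoded count in direct proportion via k_ı, interpolated count in inverse proportion via k_π), but the particular coefficient values are placeholders, not the tuned values of Table 1:

```python
def gradual_metric(phi, beta, intra, interp, k_i=0.5, k_pi=0.5):
    """Illustrative linear combination for one bidirectional frame:
    strong backward referencing (beta) signals a content change (+1),
    strong forward referencing (phi) argues against one (-1);
    intracoded MBs add to the metric (k_i, direct proportion) and
    interpolated MBs subtract (k_pi, inverse proportion), reflecting
    their behaviour during gradual transitions."""
    return beta - phi + k_i * intra - k_pi * interp
```

During a dissolve one would expect intra to rise and interp to fall, both of which push this metric upwards even when neither ϕ nor β changes sharply.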
The raw difference metric contains strong noise that makes further processing of the data almost impossible. However, we know that the source of this noise is the discontinuous nature of the difference metric: the metric value is determined separately for each frame, while the content change is based on frame triplets, so low-pass filtering with a kernel length proportional to the triplet length eliminates the noise. A filter with Gaussian pulse response g(i) = exp(−i²/2σ²), i ∈ [−4σ, 4σ], is applied with σ = 1.5. This value of σ is chosen to maximize the smoothing within one frame triplet. The metric with suppressed noise is calculated as the convolution of the Gaussian pulse response with the raw noisy metric, ∆s(i) = (g ∗ ∆)(i). After noise suppression, the same filtering procedure is applied to eliminate small spurious peaks and to smooth the difference metric function. As in the noise suppression, the filtering kernel is Gaussian, but with parameter σ = 3. The positions of the central points of the shot changes are determined by locating the local maxima of the smoothed metric curve. Continuing the process of Gaussian filtering with increasing kernel width, a scale-space of metric curves is generated, which enables a multiresolution analysis of the temporal structure within the analysed video sequence.
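The two-stage smoothing can be sketched as follows (Python; the kernel is a sampled, unit-sum Gaussian, and the zero-padding at the sequence edges is our implementation assumption):

```python
import math

def gaussian_kernel(sigma):
    """Sampled Gaussian pulse response on i in [-4*sigma, 4*sigma],
    normalised to unit sum so that smoothing preserves the mean level."""
    half = math.ceil(4 * sigma)
    g = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-half, half + 1)]
    total = sum(g)
    return [v / total for v in g]

def smooth(delta, sigma):
    """Convolve the raw metric with the Gaussian kernel (zero-padded edges)."""
    g = gaussian_kernel(sigma)
    half = len(g) // 2
    out = []
    for i in range(len(delta)):
        acc = 0.0
        for j, w in enumerate(g):
            k = i + j - half
            if 0 <= k < len(delta):
                acc += w * delta[k]
        out.append(acc)
    return out

# Noise suppression (sigma = 1.5), then spurious-peak smoothing (sigma = 3);
# repeating with larger sigmas builds the scale-space of metric curves.
smoothed = smooth(smooth([0.0] * 10 + [1.0] + [0.0] * 10, 1.5), 3)
```

Because the kernel is symmetric, an isolated peak keeps its position under smoothing, which is why the cut position can be read off as a local maximum of the smoothed curve.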

RESULTS
The collection of C++ classes called MPEG development classes, implemented by Dongge and Sethi [15], is used as the main tool for manipulating the MPEG streams, while the Berkeley mpeg2codec was used as the reference MPEG codec. Some test sequences were produced by the Multimedia and Vision Research Lab, Queen Mary, University of London, while others were provided by the School of Electronic Engineering, Dublin City University, Dublin, Ireland.
To show the typical behaviour of the first algorithm, a sample MPEG-2 video sequence was generated with three abrupt shot changes, at the 6th, 16th, and 23rd frames.
As depicted in Figure 3, the first cut is positioned at the rear b frame and, as proposed, the level of forward reference is high at the previous B frame, ϕ(5), while there is strong backward referencing at the present frame, β(6). In the same way, for the 16th frame (of type I) there are significant levels of ϕ(13) and ϕ(14), and the 23rd frame (of type B) has strong β(23) and β(24). The stages of the noise suppression and smoothing process are depicted in Figure 4, which shows the noisy raw metric, the metric after noise suppression, and the smoothed metric.
To evaluate the algorithm's behaviour, a statistical comparison "based on the number of missed detections (MD) and false alarms (FA), expressed as recall and precision" [2] is applied:

Recall = Detects / (Detects + MDs), Precision = Detects / (Detects + FAs).

The performance comparison and the dataset are based on the work of Gargi et al. [2]. The dataset is generated as an MPEG-1 sequence with resolution 320 × 240, having the same sequence length (1200 seconds) and the same number and type of transitions for each programme type (news, sports, and sitcom) as in that performance evaluation. Manually detected positions of the shot boundaries are taken as the ground truth, defining in that way the numbers of missed detections and false alarms. Among the compressed domain algorithms used in the comparison, one analyses DCT coefficient subsets and their correlation [7], while MB uses variance and prediction statistics of macroblock prediction types [11]; MD analyses DC sequence differences [8], whilst ME applies the χ² statistical test to DC coefficients [9]. Two algorithms from the spatial domain family showed the best results: 1D bin-to-bin colour histogram comparison in the LAB colour space, and 3D histogram intersection in the Munsell colour system.
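For reference, the two measures are computed as follows (a trivial Python helper; the example counts are hypothetical, chosen only to illustrate the formulas):

```python
def recall_precision(detects, missed, false_alarms):
    """Recall = Detects / (Detects + MDs) and
    Precision = Detects / (Detects + FAs), as defined by Gargi et al. [2]."""
    recall = detects / (detects + missed)
    precision = detects / (detects + false_alarms)
    return recall, precision

# Hypothetical counts: 90 correct detections, 10 misses, 5 false alarms.
r, p = recall_precision(90, 10, 5)  # recall = 0.90, precision = 90/95
```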

CONCLUSIONS
A novel scene change detection technique based on motion variables extracted from the MPEG video stream has been proposed. First, a method for abrupt change detection that uses an interframe reference measure derived only from the statistics of the macroblock types was introduced. Second, a similar interframe reference metric was applied in the algorithm for gradual shot change detection; the improved frame difference metric, which utilizes additional MBType information, enables the detection of longer transitions. Finally, the experimental results were presented in Section 4.
A performance comparison with other compressed domain algorithms shows much better results in terms of both recall and precision. Furthermore, an implementation of the algorithm on a 750 MHz PC workstation runs four times faster than real time for a CIF MPEG-1 stream. Unlike most MPEG-based video partitioning methods, this algorithm generates a continuous 1D frame difference metric, suitable for further steps of video indexing. A scale-space of curves can be generated to index the sequence in a hierarchical and scalable way.
Possibilities for improving real-time gradual shot change detection by multidimensional clustering of MPEG compressed features are being investigated. The multiresolution analysis of the temporal structure for hierarchical and scalable video indexing is also under development.

Figure 1: Possible positions of the cut in a frame triplet.

Table 1: Coefficients in the linear combination ∆(i).