A bottom-up summarization algorithm for videos in the wild

Video summarization aims to provide a compact video representation while preserving the essential activities of the original video. Most existing video summarization approaches rely on identifying important frames and optimizing a target energy through a globally optimal solution. However, a global optimum may fail to express continuous action or to reflect realistically how human beings perceive a story. In this paper, we present a bottom-up approach named clip growing for video summarization, which allows users to customize the quality of the video summaries. The proposed approach first uses clustering to oversegment video frames into video clips based on their similarity and proximity. Simultaneously, the importance of frames and clips is evaluated from their corresponding dissimilarity and representativeness. Then, video clips and frames are gradually selected according to their energy rank until the target length is reached. Experimental results on the SumMe dataset show that our algorithm produces promising results compared to existing algorithms. Several video summarization results are presented in the supplementary material.


Introduction
Videos in the wild are abundant in personal collections as well as on the web, and the demand for processing them has been increasing rapidly. A number of related works have been proposed over the past decade [1][2][3]. Such videos mostly have cluttered backgrounds and abundant human action, and most of them remain unedited and contain a large quantity of redundant information. Therefore, video processing tasks such as video summarization need to be performed, which not only present audiences with a compact version that captures the most informative parts of the video but also benefit companies whose business relies on video processing and search.
According to [4], there are two fundamental types of video summarization: unsupervised methods [5][6][7][8][9][10][11][12][13][14][15][16] and supervised methods [17][18][19][20][21][22][23][24][25]. However, these tasks are usually treated as independent. Through experiments, we found that they are actually related. The main idea of these tasks is first to measure the significance of video frames and then to select the appropriate video frames according to the different needs of users. Previous summarization methods seek a global optimum over the input frames under certain criteria, but this ideal conception seldom leads to satisfactory results. One possible reason is that people watch and understand videos from a local perspective rather than from a global perspective.
*Correspondence: quxingming@tju.edu.cn. 1 College of Intelligence and Computing, Tianjin University, Yaguan Road, Tianjin, China. 5 Bobby B. Lyle School of Engineering, Southern Methodist University, Boaz Lane, Dallas 75205, USA. Full list of author information is available at the end of the article.
In this paper, we propose a clip-based bottom-up approach for video summarization and experiment with both wild and non-wild videos. A video clip represents a spatiotemporally coherent frame sequence, which is initialized with a constant length and can be extended to a shot. The work most closely related to ours is that of Gygli et al. [26], where superframes, defined as sets of consecutive frames, are aligned with positions of a video that are appropriate for a cut. Inspired by superframes [26], superpixels [27,28], and video clip growth [29], our algorithm can be summarized as follows. First, clustering is used to oversegment the video into video clips. Intuitively, frames are not isolated, and all frames within a short period of time should have a high degree of similarity. Therefore, it is convenient to work with compact video clips when dealing with video processing tasks. Then, we perform an importance measure that assigns an energy value to each frame. We call this value the frame's "energy," which consists of two factors: dissimilarity energy and representativeness energy. Based on the frames' energy, the average energy of each video clip can also be calculated. Finally, the video clip growing algorithm selects the appropriate video clips by their energy to generate the video abstract.
Our main contributions are summarized as follows: (1) a bottom-up algorithm for video summarization, which can generate summaries of arbitrary length by gradually adding video clips and frames to the output; (2) an energy function that measures each frame's importance at the pixel level; and (3) a method that can be easily extended to other video processing applications.

Related work
Video summarization is an important topic that potentially enables faster browsing of large video collections and also more efficient content indexing and access.
Video summarization has been surveyed from multiple perspectives. Depending on whether the analyzed information is sourced directly from the video stream, Money and Agius classify video summarization into three categories: internal, external, and hybrid [30]. According to the form of the frames' temporal continuity, Truong and Venkatesh divide video summarization methods into two categories: key frames and video skims [31]. Panda et al. [4] classify video summarization methods into two categories: unsupervised and supervised methods.

Clustering
The basic idea of clustering methods is to produce the summary by clustering together similar frames or shots and then showing a limited number of frames per cluster. Based on color feature extraction from video frames and k-means clustering algorithm, Avila et al. [32] present a methodology for the production of static video summaries. Almeida et al. [5] present an approach for video summarization that works in the compressed domain and allows user interaction. The proposed method is based on both exploiting visual features extracted from the video stream and using a simple and fast algorithm to summarize the video content. Guan et al. [6] propose a top-down approach consisting of scene identification and scene summarization. The scene summarization is formulated as choosing those frames that best cover a set of local descriptors with minimal redundancy.

Energy minimization
Some work treats video summarization as a process of energy minimization, in which the energy is determined by the context in which frames appear in the video. Pritch et al. [8] propose to generate a short video that is a synopsis of an endless video stream, generated by webcams or surveillance cameras. Feng et al. [9] propose an online content-aware method that works in a stepwise manner and is hence applicable to endless video with less computational cost.

Sparse optimizations
The problem of finding the representatives can also be formulated as a sparse optimization problem. Yang et al. [10] formulate video summarization as a top keyframe selection problem using sparsity consistency, and a global optimization algorithm is introduced to solve the keyframe selection model. Vidal et al. [11] propose a framework to detect and reject outliers from the dataset using the solution of the proposed optimization program. Panda et al. [12] develop a diversity-aware sparse optimization method for multi-video summarization by exploring the complementarity within the videos. Panda and Roy-Chowdhury [13] propose an unsupervised framework for summarizing top related videos by exploring complementarity within videos, and a sparse optimization method is developed to extract a diverse summary that is both interesting and representative in describing the video collection. Vidal et al. [11] treat video summarization as the task of finding a few representatives for a dataset and formulate the problem of finding the representatives as a sparse multiple measurement vector problem. Dornaika and Aldine [14] propose a decremental Sparse Modeling Representative Selection (D-SMRS) in which the selection of the representatives is broken down into several nested processes. Meng et al. [15] propose to summarize a video into a few key objects by selecting representative object proposals generated from video frames. Zhao and Xing [16] propose online video highlighting, a principled way of generating a short video that summarizes the most important and interesting contents of an unedited and unstructured video, which would be costly in both time and money to process manually.

Leveraging crawled
Recently, several researchers have focused on leveraging crawled web images or videos for video summarization. Khosla et al. [33] develop a summarization algorithm that uses web-image-based prior information in an unsupervised manner. Kim et al. [34] develop a parallelizable approach for creating not only high-quality video summaries but also novel structural summaries of online images as storyline graphs. Panda and Roy-Chowdhury [35] develop an approach to extract a summary that simultaneously captures both important particularities arising in the given video and generalities identified from the set of videos. Song et al. [36] present TVSum, an unsupervised video summarization framework that uses title-based image search results to find visually important shots.

Supervised methods
Departing from unsupervised methods, recent work formulates video summarization as a supervised learning problem. Gygli et al. [17] introduce a method that uses a supervised approach in order to learn the importance of global characteristics of a summary and jointly optimizes for multiple objectives. Gong et al. [18] consider video summarization as a supervised subset selection problem and propose the sequential determinantal point process for diverse sequential subset selection. Sharghi et al. [19] develop a probabilistic model, Sequential and Hierarchical Determinantal Point Process (SH-DPP), for query-focused extractive video summarization. Zhang et al. [20] propose a subset selection technique that leverages supervision in the form of human-created summaries to perform automatic keyframe-based video summarization. Xiong et al. [21] propose a storyline representation that expresses an egocentric video as a set of story elements, jointly inferred through MRF inference, comprising actors, locations, supporting objects, and events, depicted on a timeline. Ghosh et al. [22] introduce egocentric features to train a regressor that predicts important regions. Yao et al. [23] propose a pairwise deep ranking model that employs deep learning techniques to learn the relationship between highlight and non-highlight video segments. Zhang et al. [24] automatically select keyframes or key subshots to summarize videos with a supervised learning technique based on long short-term memory. Potapov et al. [25] assign importance scores to each video segment with an SVM classifier, and the resulting video assembles the sequence of segments with the highest scores. However, the above approaches assume the availability of a large number of human-created video-summary pairs or importance annotations, which are difficult to obtain in real applications. Our method is designed to fill this gap. The novelty of the clip growing method lies in its bottom-up strategy. Different from all of the previous techniques, our method grows summarized videos from short to long; therefore, it can control the length of the summarized video precisely, frame by frame. Moreover, it does not use any specific information beyond the video content.

Quantitative evaluation and benchmark
Evaluating the correctness of a video summarization algorithm is not a straightforward task due to the lack of an objective ground truth. Ideally, in order to compare different algorithms, each one should be tested on the same datasets and measured using the same metrics. Unfortunately, no definitive quantitative evaluation protocol or benchmark has been established for previous works so far. However, several publicly available datasets of user videos have recently made a quantitative evaluation of video summarization algorithms possible. Mundur et al. [7] test algorithms on 50 randomly chosen video segments from the Open Video Project and develop an evaluation procedure with a significance factor, an overlap factor, and a compression factor. Avila et al. [32] demonstrate a validity evaluation by testing algorithms on a sample of videos from the Open Video Project. The quality of the summaries is evaluated by the accuracy rate and the error rate. Panda et al. [12] introduce the Tour20 dataset, which contains 140 videos with multiple manually created summaries. Song et al. [36] introduce the TVSum50 dataset, which contains 50 videos with shot-level importance scores annotated via crowdsourcing. Kim et al. [34] collect a dataset of 20 outdoor activities, consisting of 2.7M Flickr images and 16K YouTube videos, and evaluate algorithms via crowdsourcing using Amazon Mechanical Turk. Gygli et al. [26] contribute the SumMe dataset with human scores for video segments, which allows for an automatic evaluation of different methods. In this paper, we use SumMe as the benchmark for quantitative comparison.
Figure 1 presents the overview of our algorithm. As pre-processing, our approach employs dimensionality reduction to generate a low-dimensional eigenspace representation of each video frame. First, oversegmentation is performed to divide a video into clips. Second, the importance of each frame, which we call its energy, is measured by two factors: dissimilarity energy and representativeness energy. A video clip's energy can therefore be measured by summing the energy of all frames in the clip with a weight coefficient. Third, according to each video clip's energy, our algorithm selects video clips and frames with higher energy until the target length is reached. Finally, a video clip merging algorithm is applied to deal with conflicts in the clip growing process. Three merging cases are discussed in detail later.

Pre-processing
In order to reduce the amount of computation, a preprocessing step is presented in this section to convert frames into feature vectors.

Frame representation
Operating directly on frames makes the computational complexity extremely high and the operation hard to handle. Therefore, Singular Value Decomposition (SVD) is performed for dimensionality reduction:
$$F = U \Sigma V^{T}$$
where F is a frame. In this paper, we use 120 × 160 grayscale images. U and $V^{T}$ are the real left and right singular vector matrices, respectively, and Σ is the real λ × λ diagonal matrix. Next, the first λ left singular vectors are picked out and reshaped into a column vector x, which is used as the feature of a frame. In this paper, x is a 720D vector. After pre-processing, a consecutive sequence of frames is converted into a sequence of vectors.
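As an illustration of this pre-processing step, the following minimal sketch computes the feature vector of one frame with NumPy. The function name `frame_feature` and the value k = 6 are assumptions: k = 6 is inferred only from the fact that six left singular vectors of a 120-row frame give 720 entries, and it is not stated in the text.

```python
import numpy as np

def frame_feature(frame: np.ndarray, k: int = 6) -> np.ndarray:
    """Stack the first k left singular vectors of a grayscale frame into one vector."""
    # Thin SVD: frame = U @ diag(s) @ Vt, with U of shape (rows, min(rows, cols)).
    U, s, Vt = np.linalg.svd(frame.astype(np.float64), full_matrices=False)
    # Column-major flattening keeps each singular vector contiguous in x.
    return U[:, :k].flatten(order="F")

# Hypothetical 120 x 160 grayscale frame filled with random intensities.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(120, 160))
x = frame_feature(frame)  # 6 vectors of length 120 -> 720 entries
```

Singular vectors are only defined up to sign, so a practical implementation would probably fix a sign convention before comparing the features of different frames.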

Video oversegmentation
Because frames are not isolated, all frames in a short period of time should have a high degree of similarity. Therefore, it is convenient to work with video clips which are compact, local, and representative. The processing steps are as follows:

Frames distance measure
The distance between frames is measured from their similarity and proximity. The computation is done in [x, t] space, where x is the feature vector of a frame and t is its frame sequence number.
Here d_s and d_p denote the similarity distance and the proximity (temporal) distance between two frame vectors. Since the maximum possible distance between two frame vectors is bounded, whereas the temporal distance along the t axis depends on the video length, both terms need to be normalized before they are combined. Thus, min-max normalization is performed.
After that, a variable γ is used to balance their effects. The distance measure D is defined as the sum of the normalized similarity distance and the normalized temporal distance, balanced by γ. The greater the value of γ, the more the temporal proximity counts. In this paper, we use γ = 0.5. The experimental results show that the frames within each video clip are highly similar.
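The distance measure can be sketched as follows. The exact expressions are not reproduced here, so this sketch rests on three assumptions: d_s is the Euclidean distance between feature vectors, d_p is the absolute difference of frame indices, and the combination is D = d_s + γ·d_p after min-max normalization. The helper names are hypothetical.

```python
import numpy as np

def min_max(a: np.ndarray) -> np.ndarray:
    """Rescale an array to [0, 1]; a constant array maps to zeros."""
    span = a.max() - a.min()
    return (a - a.min()) / span if span > 0 else np.zeros_like(a, dtype=float)

def pairwise_distance(X: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    """Combined distance between all frame pairs; X holds one feature row per frame."""
    t = np.arange(len(X), dtype=float)
    d_s = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # assumed Euclidean
    d_p = np.abs(t[:, None] - t[None, :])                         # temporal gap in frames
    return min_max(d_s) + gamma * min_max(d_p)                    # assumed form of D

# Three toy frames with 2-D features.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = pairwise_distance(X)
```

With γ = 0.5, a pair of frames that is maximally far apart in both feature space and time receives D = 1.5 under these assumptions.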

Video oversegmentation
Given a sufficiently large number K, the oversegmentation algorithm divides the video into K video clips. For a video with N frames, the length L of each video clip is about N/K. To begin with, the video is divided into K equal video clips, and the center of each video clip is assigned as a cluster center. Since frames far from the cluster center frame generally do not belong to this cluster, we can safely assume that the frames belonging to a cluster center lie within a 2L area around the center on the t axis.
Next, for each cluster center C_k, we search the 2L area around the center and assign the nearest frames to this cluster until all the frames are classified. Then, we calculate the average frame vector of each of the K video clips to obtain K new cluster centers. The above process is repeated iteratively until the cluster centers converge.
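The clustering loop above resembles a one-dimensional variant of superpixel clustering and can be sketched as below. This is a minimal sketch under stated assumptions: fixed scales (the feature range and L) stand in for the min-max normalization, the combined distance is taken as d_s + γ·d_p, and convergence is detected when the labels stop changing. The function name `oversegment` is hypothetical.

```python
import numpy as np

def oversegment(X, K, gamma=0.5, max_iter=20):
    """Assign each of the N frames (rows of X) to one of K temporally local clusters."""
    N = len(X)
    L = N / K                                          # approximate clip length
    t = np.arange(N, dtype=float)
    labels = np.minimum((t // L).astype(int), K - 1)   # start from K equal clips
    centers_t = (np.arange(K) + 0.5) * L
    centers_x = X[centers_t.astype(int)].astype(float)
    # Assumed feature scale; falls back to 1.0 when all features are identical.
    s_scale = np.linalg.norm(X.max(axis=0) - X.min(axis=0)) or 1.0
    for _ in range(max_iter):
        best = np.full(N, np.inf)
        new_labels = labels.copy()
        for k in range(K):
            # Search only the 2L window around the center on the t axis.
            lo = max(0, int(np.floor(centers_t[k] - L)))
            hi = min(N, int(np.ceil(centers_t[k] + L)) + 1)
            d_s = np.linalg.norm(X[lo:hi] - centers_x[k], axis=1) / s_scale
            d_p = np.abs(t[lo:hi] - centers_t[k]) / L
            d = d_s + gamma * d_p
            closer = d < best[lo:hi]
            best[lo:hi][closer] = d[closer]
            new_labels[lo:hi][closer] = k
        if np.array_equal(new_labels, labels):
            break                                      # assignments are stable
        labels = new_labels
        for k in range(K):
            members = labels == k
            if members.any():                          # keep old center if cluster is empty
                centers_t[k] = t[members].mean()
                centers_x[k] = X[members].mean(axis=0)
    return labels

# Toy video: 25 dark frames followed by 35 bright frames, split into K = 2 clips.
X = np.concatenate([np.zeros((25, 1)), np.full((35, 1), 10.0)])
labels = oversegment(X, K=2)
```

In this toy example the initial boundary at frame 30 moves to the true content change at frame 25.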
Finally, some frames at the margins of video clips might be mislabeled after clustering, because frames at boundaries are often at similar distances from the centers of two adjacent clusters. Although only a few frames are mislabeled, they still affect later operations. Therefore, we re-label these frames using a voting window of length L_v. Specifically, in the window of length L_v centered at x_t, the label with the highest number of occurrences is used as the label of x_t.
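The voting window can be sketched as a sliding majority filter over the label sequence. The function name `vote_relabel` is hypothetical, and the tie-breaking rule (the smallest label wins) is an assumption that the text does not specify.

```python
import numpy as np

def vote_relabel(labels, window=15):
    """Replace each label by the most frequent label in a window centered on it."""
    labels = np.asarray(labels)
    half = window // 2
    out = labels.copy()
    for i in range(len(labels)):
        seg = labels[max(0, i - half): i + half + 1]   # window is clipped at the ends
        values, counts = np.unique(seg, return_counts=True)
        out[i] = values[np.argmax(counts)]             # ties go to the smallest label
    return out

# Frame 3 is mislabeled as 1 inside a run of 0s.
noisy = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1])
clean = vote_relabel(noisy, window=3)
```

Votes are read from the original labels rather than from already corrected ones, so the result does not depend on the scan direction.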
Besides, it is worth discussing the selection of K, which involves some trade-offs. A small K makes the video clips too long, so that some details may be missed, whereas a large K produces too many video clips, which increases the computational complexity. Therefore, we empirically set K = Length(video)/30, which means that the initial length of each video clip is about 1 s.

Importance measure
After oversegmentation, K video clips are generated. The next step is to measure the importance of each video clip. In this section, we introduce the importance of each frame, which we call its energy and which is evaluated from both dissimilarity and representativeness. Intuitively, if there is a great difference between a frame and its neighbor frames, this frame tends to have higher dissimilarity energy, and vice versa. If the content before and after a frame is very similar, this frame tends to have higher representativeness energy, and vice versa. Dissimilarity energy can be obtained directly by counting how many pixels change between two frames. If a pixel changes by more than a threshold, we call it an active pixel. The ratio of active pixels is used as the dissimilarity energy.
Here Ed(t) is the dissimilarity energy of the t-th frame, F represents the original frame (a and b denote the pixel location), and I is the indicator function: I = 1 when |F(a, b, t) − F(a, b, t + 1)| > σ; otherwise, I = 0. We use σ = 3 in this paper.
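The active pixel ratio can be sketched in a few lines. The sketch assumes that the change of a pixel is measured as an absolute intensity difference between consecutive grayscale frames; the function name `dissimilarity_energy` is hypothetical.

```python
import numpy as np

def dissimilarity_energy(frames, sigma=3):
    """Fraction of active pixels between each frame and its successor.

    frames has shape (T, H, W); the result has T - 1 entries because the
    last frame has no successor.
    """
    f = frames.astype(np.int32)               # avoid uint8 wrap-around on subtraction
    active = np.abs(f[1:] - f[:-1]) > sigma   # indicator I for every pixel
    return active.mean(axis=(1, 2))           # active pixel ratio per frame

# Three 2 x 2 frames: one pixel jumps by 10, then changes by only 2.
frames = np.zeros((3, 2, 2), dtype=np.uint8)
frames[1, 0, 0] = 10
frames[2, 0, 0] = 12
E_d = dissimilarity_energy(frames)
```

Here one of four pixels is active between the first two frames, and none is active between the last two because the change of 2 stays below σ = 3.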
To compute the representativeness energy of a frame, a sliding window of length L_w is created to collect the vectors before and after this frame. L_w can be adjusted according to the type of video. Generally, if the content of the video changes drastically, a small L_w should be used, and vice versa.
Here x̄_i is the average vector of the frames in the sliding window, and l = L_w/2. The representativeness energy is then calculated from this window average. Before combining the two energies, we need to bring the two terms to the same magnitude.
Here α and β are hyper-parameters controlling the importance of the dissimilarity and representativeness parts, respectively. After the importance measure, each frame has its own energy, so that the energy of a video clip can be obtained by simply computing the average energy of its frames.
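The following sketch illustrates one way the two energies could be computed and combined. The exact formulas are not reproduced in the text, so the forms used here are assumptions: representativeness is taken as one minus the normalized distance of a frame to its window average, and the frame energy as the weighted sum α·Ed + β·Er after min-max normalization. All function names are hypothetical.

```python
import numpy as np

def min_max(a):
    """Rescale values to [0, 1]; a constant input maps to zeros."""
    a = np.asarray(a, dtype=float)
    span = a.max() - a.min()
    return (a - a.min()) / span if span > 0 else np.zeros_like(a)

def representativeness_energy(X, window=10):
    """Assumed form: frames close to their window average score high."""
    l = window // 2
    dist = np.empty(len(X))
    for i in range(len(X)):
        neighbors = X[max(0, i - l): i + l + 1]
        dist[i] = np.linalg.norm(X[i] - neighbors.mean(axis=0))
    return 1.0 - min_max(dist)

def frame_energy(E_d, E_r, alpha=0.65, beta=0.35):
    """Assumed combination: weighted sum of the normalized energies."""
    return alpha * min_max(E_d) + beta * min_max(E_r)

def clip_energy(E, labels):
    """Average frame energy of every clip, keyed by clip label."""
    labels = np.asarray(labels)
    return {int(k): float(E[labels == k].mean()) for k in np.unique(labels)}

# Toy sequence with one outlier frame at index 3.
X = np.array([[0.0], [0.0], [0.0], [9.0], [0.0], [0.0], [0.0]])
E_r = representativeness_energy(X, window=2)
E_d = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])  # hypothetical dissimilarity values
E = frame_energy(E_d, E_r)
clips = clip_energy(E, [0, 0, 0, 1, 1, 1, 1])
```

The outlier frame receives the lowest representativeness but the highest dissimilarity, which shows how α and β trade the two factors against each other.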

Video clip growing
The energy of a video clip indicates the significance of the video over a period of time. Video clips and frames with higher energy tend to be selected. The video clip growing method takes a video clip candidate set C = {c_1, c_2, · · · , c_n} from Section 3.2 as input, where c_i represents a video clip with length L_ci. The left and right adjacent frames of c_i are called its "neighbors," and each c_i has its own neighbors. The idea of the proposed method is to pick the c_i with higher energy to form a selected set S of video clips. Then, by repeatedly growing each c in S through the addition of its neighbor frames, the total length of all c in S reaches T_L, where T_L is defined by the user. To obtain the output video S, we first sort all c in C by their average energy in descending order and select the first c as the initial S. Second, we pick out the c in S whose neighbor frame has the highest energy E_n and re-calculate the average energy E_ave of this c. If E_n is less than E_ave and the current length C_L of all c in S plus the length of the next c from C is less than T_L, we add the next c from C into S (Fig. 2 (3)); otherwise, we add the highest-energy neighbor frame into this c (Fig. 2 (1)). New frames or new video clips are added into S repeatedly until T_L is reached.
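The growing procedure can be sketched as follows. This is a simplified illustration rather than the reference implementation: the selected set S is stored as a boolean frame mask, so clips that touch or overlap are merged implicitly, and the weight coefficient is omitted. The names `runs` and `grow_clips` are hypothetical, and clips are given as half-open (start, end) frame intervals.

```python
import numpy as np

def runs(mask):
    """Return the [start, end) intervals of consecutive selected frames."""
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(edges[::2], edges[1::2])]

def grow_clips(E, clips, target_len):
    """Greedily select and grow clips until target_len frames are chosen."""
    E = np.asarray(E, dtype=float)
    queue = sorted(clips, key=lambda c: E[c[0]:c[1]].mean(), reverse=True)
    mask = np.zeros(len(E), dtype=bool)
    s, e = queue.pop(0)
    mask[s:e] = True                              # initial S: highest-energy clip
    while mask.sum() < target_len:
        best = None                               # (E_n, neighbor index, E_ave)
        for s, e in runs(mask):
            for n in (s - 1, e):                  # left and right neighbor frames
                if 0 <= n < len(E) and (best is None or E[n] > best[0]):
                    best = (E[n], n, E[s:e].mean())
        # Drop candidates that earlier growth has already covered completely.
        while queue and mask[queue[0][0]:queue[0][1]].all():
            queue.pop(0)
        if best is None:
            break                                 # every frame is already selected
        E_n, n, E_ave = best
        if queue and E_n < E_ave and mask.sum() + (queue[0][1] - queue[0][0]) < target_len:
            s, e = queue.pop(0)
            mask[s:e] = True                      # add the next candidate clip
        else:
            mask[n] = True                        # grow by one neighbor frame
    return runs(mask)

# Toy energies for 12 frames, pre-segmented into four clips of three frames.
E = [1, 1, 1, 9, 8, 9, 1, 1, 1, 5, 5, 5]
clips = [(0, 3), (3, 6), (6, 9), (9, 12)]
summary = grow_clips(E, clips, target_len=7)
```

In this toy run the two highest-energy clips are selected first, and the last frame is gained by growing because another three-frame clip would overshoot the target of seven frames.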

Weight coefficient
Since T_L can be reached by combining many shorter video clips or a few longer video clips, we introduce a weight coefficient to determine the number and lengths of video clips according to the users' preferences. Each clip c has its own weight coefficient.
Since the neighbor frame with the highest E_n has been picked out as described in the previous section, we now multiply E_n by the weight coefficient and then redo the comparison between E_n and E_ave. Each time a new neighbor frame is added, the weight coefficient needs to be updated. For example, if users prefer a larger number of shorter video clips, the weight coefficient can be assigned a value less than 1. Every time a neighbor frame is added into S, the weight coefficient becomes smaller. Thus, the product of E_n and its weight coefficient is more likely to be smaller than E_ave, and more video clips tend to be selected. Similarly, the weight coefficient can be increased when users prefer fewer, longer video clips. In this paper, the weight coefficient is defined as a strictly decreasing function of length(E_n), the current length of the video clip associated with E_n. The function uses a constant γ, distinct from the balancing variable of the distance measure, and we set γ = 0.1 in our experiments.
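The effect of the weight coefficient on the selection rule can be illustrated with the sketch below. The decreasing function used here, 1/(1 + γ·length), is only a hypothetical stand-in chosen for illustration, not the function defined in this paper; the names `omega` and `should_add_new_clip` are likewise hypothetical.

```python
def omega(length, gamma=0.1):
    """Hypothetical strictly decreasing weight, used as a stand-in only."""
    return 1.0 / (1.0 + gamma * length)

def should_add_new_clip(E_n, E_ave, clip_len, current_len, next_len, target_len, gamma=0.1):
    """Weighted version of the rule: add a new clip instead of growing this one."""
    weighted = omega(clip_len, gamma) * E_n       # longer clips damp E_n more strongly
    return weighted < E_ave and current_len + next_len < target_len

# A long clip (30 frames) with a strong neighbor frame still yields to a new clip.
long_clip = should_add_new_clip(E_n=0.9, E_ave=0.5, clip_len=30,
                                current_len=30, next_len=30, target_len=100)
# With zero accumulated length the weight is 1 and the clip keeps growing.
fresh_clip = should_add_new_clip(E_n=0.9, E_ave=0.5, clip_len=0,
                                 current_len=30, next_len=30, target_len=100)
```

Any strictly decreasing function has the same qualitative effect: the longer a clip has grown, the more likely the next addition is a new clip.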

Merge overlapping video clips
When growing video clips overlap, checking and merging need to be performed. A video clip is simply merged into its neighbor video clip when an overlap happens (if a video clip overlaps with two other video clips at the same time, it is merged into its left neighbor). Then, we check whether the new video clip overlaps with other video clips. The merging steps continue until there is no overlap. There are three cases in which merging needs to be performed (Fig. 2).
Case 1: A new video clip overlaps with one previous video clip (Fig. 2 (2)).
Case 2: Similar to Case 1, a new video clip overlaps with two previous video clips.
Case 3: A new video clip is adjacent to one previous video clip (Fig. 2 (4)).
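The three merging cases can be handled by a single interval-merging pass, sketched below with clips represented as half-open (start, end) frame intervals. This sorted sweep is a common formulation that yields the same final set of clips as repeated pairwise merging; the function name `merge_clips` is hypothetical.

```python
def merge_clips(clips):
    """Merge clips that overlap or are adjacent; clips are [start, end) intervals."""
    merged = []
    for start, end in sorted(clips):
        if merged and start <= merged[-1][1]:     # overlap (Cases 1, 2) or adjacency (Case 3)
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(c) for c in merged]

case1 = merge_clips([(0, 5), (4, 8)])             # one overlap
case2 = merge_clips([(0, 5), (6, 10), (3, 7)])    # new clip bridges two clips
case3 = merge_clips([(0, 5), (5, 9)])             # adjacent clips
apart = merge_clips([(0, 3), (5, 9)])             # separated clips stay apart
```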
The pseudo code of the video clip growing method is shown in Algorithm 1.

Dataset
Experiments have been conducted on the publicly available SumMe dataset [26] to verify the effectiveness of our algorithm. The SumMe dataset consists of 25 videos covering holidays, events, and sports. A detailed description of the dataset is available online (see footnote 1). To provide a quantitative comparison, we use the f-measure and human consistency defined in [26]. We compare our method with several existing methods in [26], including the random, uniform, clustering, and visual attention [37] baselines.
Algorithm 1: Video clip growing method
Input: Candidate set C
Output: Non-overlapping video clip selected set S
Initialize target video length T_L;
Sort all c in C by their average energy value in descending order;
Add the first c from C into S;
while C_L < T_L do
    pick out c whose E_n is highest;
    re-calculate the average energy E_ave of this c;
    if E_n < E_ave and C_L plus the length of the next c from C is less than T_L then
        add the next c from C into S;
    else
        add the highest-energy neighbor frame into this c;
    merge overlapping video clips in S;

Parameters selection
We provide ranges for the parameter choices. The oversegmentation number K can be in the range [10,300]. The voting window length can be in the range [3,25]. α and β, the hyper-parameters controlling the importance of Ed and Er, can be adjusted according to the video. Generally, if the video is intense, α should be greater than β; if the video is relatively calm, the opposite holds. In our experiments, we set α = 0.65, β = 0.35, K = 260, and a voting window length of 15.

Results
In order to be consistent with [26], f-measures at 15% summary length are used for our method. As Table 1 shows, our method performs best or second best on 21 of the 24 test videos. Our method achieves an average performance of 57% relative to the upper bound, which is 5% higher than the method in [26]. Compared against human consistency, our results exceed the mean human score on eight test videos. Furthermore, the results on some other videos are very close to the mean human score.
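For reference, a frame-level f-measure between a computed summary and human selections can be sketched as follows. The sketch assumes the standard harmonic mean of precision and recall over selected frames, averaged over users; the exact protocol of [26] may differ in details, and the function names are hypothetical.

```python
import numpy as np

def f_measure(summary, reference):
    """Harmonic mean of precision and recall over selected frames."""
    summary = np.asarray(summary, dtype=bool)
    reference = np.asarray(reference, dtype=bool)
    overlap = np.logical_and(summary, reference).sum()
    if overlap == 0:
        return 0.0
    precision = overlap / summary.sum()
    recall = overlap / reference.sum()
    return float(2 * precision * recall / (precision + recall))

def mean_f_measure(summary, references):
    """Average f-measure of one summary against several human selections."""
    return float(np.mean([f_measure(summary, r) for r in references]))

# Four-frame toy video: one computed summary, two human selections.
summary = [1, 1, 0, 0]
users = [[1, 0, 1, 0], [1, 1, 0, 0]]
score = mean_f_measure(summary, users)
```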
Table 1 F-measures at 15% summary length for our method, the baselines, and the human selections. We highlight the best (italics) and the second best (bold) computational methods. "Ran" represents random sampling. "Uniform" and "Cluster" are computational methods from [26]. "Att." is the visual attention method from [37]. "Superframe" is the method from [26]
In order to visualize the results, we draw the energy curve for all the videos and mark the selected video clips with green rectangles. In addition, an "energy curve" from users is defined to allow a fair comparison. In [26], 15 different people were asked to produce a summary with a length of about 15% of the total length. We initialize the energy of all frames to 0; whenever a frame is selected by a person, the energy of this frame is increased by a constant. The user-produced "energy curve" can therefore be plotted as shown in Fig. 3. As can be seen, our energy curves are very similar to the users' curves. The high-energy peaks are found accurately, and video clips subsequently grow in the vicinity of the peaks.
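The user-produced energy curve amounts to counting, for every frame, how many people selected it. A minimal sketch, assuming the human summaries are available as binary per-frame selection vectors; the function name and the constant `step` are hypothetical.

```python
import numpy as np

def user_energy_curve(selections, step=1.0):
    """Add a constant to a frame's energy for every person who selected it."""
    return step * np.asarray(selections, dtype=float).sum(axis=0)

# Two people, three frames: frame 0 was selected by both.
selections = [[1, 0, 1],
              [1, 1, 0]]
curve = user_energy_curve(selections)
```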

Discussion
For static camera videos (Air Force One, Paintball, Car over camera, Fire Domino), our results outperform all baselines. The reason is straightforward: only the violent object movements that attract users most cause Ed to rise, and there is no camera movement to make Er decrease. Thus, relatively accurate Er and Ed are obtained.
In addition to static camera videos, most of the other results exceed the baselines, and our method performs well on the other two video types. Our worst result is on the video "Excavators river crossing," which contains many shots that zoom in and out. Besides, the content of the video is repetitive. As can be seen in Fig. 3j, users pay attention only to the first time that people cross the river. However, our algorithm objectively considers all river crossings as equivalent, so the selected video clips are spread very uniformly.

The flexibility
Since our algorithm produces a video summary by adding frames, it can flexibly produce video summaries of different lengths. Figure 4 shows the results for "Base jumping" from the SumMe dataset at different lengths. High-energy peaks are not missed, and low-energy frames are not added.

Supplementary material
Video summaries of the videos "Base jumping," "Cooking," and "Valparaiso Downhill" are provided in Additional files 1, 2, 3, 4, 5, and 6. Figure 5 shows visual examples of the video summarization produced by the proposed method on the "Cockpit Landing" video from the SumMe dataset. It shows that the produced summaries can capture both repeated visual contents that reflect the global commonness and local contents that are representative of the video.

Other applications
By adjusting parameters to control the length of output video clips or frames, the proposed method could be applied to other video processing applications, e.g., key-frame extraction.

Limitation
The limitation of the proposed algorithm lies in the fact that it chooses frames based on their energy. Therefore, whether the computed energy curve is similar to the user's energy curve is very significant. However, this task is quite subjective, and the definition of the importance of frames varies among people. To calculate Ed, our algorithm assumes that dramatic changes in content lead to higher energy, while users might find such content less attractive because dramatic changes lead to chaos. To calculate Er, our algorithm assumes that a frame which is similar to its adjacent frames has higher energy, while users might find it boring because there is not much change before and after this frame. Therefore, the parameters of the algorithm need to be adjusted for different videos, which requires many experiments.
Fig. 4 Results of "Base jumping" from SumMe dataset for different summarization ratios. The selected frames are presented with a green bar on the time axis. The energy curves and the selected clips are shown for lengths (a) 15%, (b) 30%, (c) 45%, (d) 60%, (e) 75%, and (f) 90%

Conclusion and future work
In this paper, we have presented a greedy video clip growing algorithm for video summarization. Clustering is performed to oversegment videos into video clips. Then, we propose the frames' energy, which is used as the criterion for selecting video clips and adding frames. Clip growing allows users to customize the quality of the video summaries, which is important because different users often have different needs. By adjusting the parameters, our algorithm can also adapt to different types of video.
Rigorous experiments have been performed on SumMe dataset. Our results show that it is able to create good