A stereoscopic video conversion scheme based on spatio-temporal analysis of MPEG videos
© Lin et al.; licensee Springer. 2012
- Received: 9 March 2012
- Accepted: 18 October 2012
- Published: 12 November 2012
In this article, we propose an automatic stereoscopic video conversion scheme that accepts MPEG-encoded videos as input. The scheme is depth-based: it relies on spatio-temporal analysis of the decoded video data to obtain depth perception cues, such as temporal motion and spatial contrast, which reflect the relative depths of the foreground and background areas. The scheme is also shot-adaptive: shot change detection and shot classification are performed so that the algorithm and the parameters used for depth cue combination can be tuned for each shot. Depth is first estimated on a block basis and then upsampled with a locally adaptive joint trilateral upsampling algorithm, which reduces the computing load significantly. A recursive temporal filter reduces the depth fluctuations (and the resulting artifacts in the synthesized images) caused by wrong depth estimates. The traditional Depth-Image-Based-Rendering algorithm is used to synthesize the left- and right-view frames for 3D display. Subjective tests show that videos converted by our scheme offer perceived depth and visual quality comparable to those of videos converted from depth data calculated by stereo vision techniques. Our scheme is also shown to outperform the well-known TriDef software in terms of perceived 3D depth. Implemented with the OpenMP parallel programming model, our scheme runs in real time on a multi-core CPU platform.
- Stereoscopic video conversion
- Depth estimation
- Depth cue
- 3D perception
Recently, 3D (more accurately, stereo 3D) images and videos, which move home audio-visual entertainment towards greater perceptual realism, have been attracting more attention in applications such as multimedia, games, TV broadcasting, and augmented reality. With advances in 3D content capturing (e.g., dual-eye cameras or time-of-flight depth cameras) and stereoscopic display technologies, the influence of 3D videos on daily life is growing. Although many LCD-TV manufacturers have been promoting 3DTV products since 2010, their popularity is limited by the availability of 3D video content. Digital 3DTV broadcasting via Digital Broadcasting Satellite in Japan, the 3D Digital Multimedia Broadcast system in Korea, and the Advanced Three-dimensional Television System Technologies and FP7 framework programs in Europe are in operation or under development, but the sources of 3D video content are still not diverse enough. Since professional 3D video capturing devices are uncommon and normally expensive, the shortage of 3D video content motivates researchers to convert existing 2D videos into stereoscopic versions. A rapid supply of abundant 3D video content would in turn support quick progress in the consumer electronics industry.
The technique of converting a 2D video into a 3D stereoscopic version is called stereoscopic video conversion (SVC) or 2D-to-3D video conversion [3, 4]. Recently, researchers and companies such as Dynamic Digital Depth (DDD), HDlogix, Sony Image Works, and Victor Company of Japan have paid much attention to this technique. One class of SVC methods [5–7] tries to create a stereo effect (for a two-view display) without estimating a depth map. These depth-free methods analyze motion information and then directly synthesize the left and right views from the original image sequence. Their basic concept is similar to structure from motion. For example, Okino et al. proposed a Modified Time Difference (MTD) method, in which binocular images are generated by selecting two frames with a time delay determined by the magnitudes of the estimated motion vectors (MVs). One problem to be solved is how to choose an appropriate matching frame for a given base frame. Another is how to find a suitable mapping between the disparity and the magnitude of an MV. The MTD method is, however, only suitable for image sequences with horizontal object motion. The work of Wang et al. presents a similar strategy, but is restricted to image sequences without object motion. Kim et al.’s study accepts MPEG-4 video as input, extracts the background Video Object Plane (VOP) and the primary (foreground) VOP, classifies the background motion type (left motion, right motion, or static) according to its MV field, and finally assigns disparities to the foreground and background VOPs individually.
The other class estimates a depth map for each 2D color frame and then synthesizes a pair of left- and right-view (i.e., binocular) images for stereo display [8–11]. Compared with the depth-free methods, this 2D-plus-depth format is more advantageous from the viewpoint of applications. For example, multi-view video for autostereoscopic displays can be generated with the popular Depth-Image-Based-Rendering (DIBR) technique, and the “perceived depth” can be adjusted under varying viewing conditions. In addition, the overhead for compressing the depth map is only about 20% of the bit rate, whereas that for compressing the secondary view (i.e., the right view) might be over 40%.
Though depth plays such an important role in 3DTV applications, deriving it from a mono-view image or video is challenging. In one study, depth maps were built by using only MVs extracted from the compressed video. Estimating depths solely from the motion cue, however, cannot suit all types of videos. In another, image depths were calculated by measuring and combining other cues such as the contrast, sharpness, and chrominance of the input image. In a third, an object-based SVC method was proposed, where depth ordering (based on occlusion reasoning) and depth consistency between detected objects are analyzed for depth estimation.
Other studies [9, 15], on the other hand, emphasize separating the foreground from the background, estimating their depths individually, and then combining them into a single depth image. To provide an impressive stereo effect, the estimation of a background depth profile is necessary. For example, Angot et al. define six profiles of background depths and select a proper one according to image features. Another popular method is to establish the depth geometry of backgrounds by detecting vanishing points/lines in the image. However, the line features necessary for vanishing point/line detection might not be apparent, or might not exist at all, in a video sequence.
TriDef, a well-known software package developed by DDD, adopts an off-line machine learning algorithm [10, 18], where image features, such as color components and 2D coordinates, are used to construct a relation between an image pixel and its associated depth by using a classifier (e.g., a neural network). Since it is straightforward and fast to extract these image features and estimate depth (using the neural classifier as an estimator), the conversion can easily be achieved in real time. However, a severe drawback is that pixels located in the lower and central parts of an image are likely to be assigned nearer depths, resulting in a smoothly tilted depth pattern for most input images. That is, TriDef produces unlayered depths regardless of the image content.
Most SVC methods [3, 9, 14, 19] estimate depths solely from cues in a single frame. The drawback is that depth fluctuation may occur in the temporal domain (since depth cue estimation is an ill-posed problem and thus unstable) and make viewers uncomfortable. Depths should therefore be estimated by referring to information from previous frames as well. For practical reasons, however, a large number of buffers for storing information propagated from previous frames should be avoided.
From the viewpoint of human aid, SVC methods can be categorized into three classes: manual, semi-automatic, and fully automatic. Though manual and semi-automatic SVC methods can provide high-quality depth, they are very time-consuming. Fully automatic SVC methods, on the other hand, can convert existing 2D videos to 3D content more efficiently. Therefore, in this article, we aim to develop a depth-based automatic SVC scheme that generates stereoscopic videos from the popular MPEG videos. Our method automatically detects shot changes, classifies the video shot that follows, and accordingly applies the proper algorithms and tuned parameters to estimate initial depth maps based on depth cues from spatio-temporal analysis. Subtitles play an important part in most commercial videos, and their depth arrangement in 3D videos affects viewers’ comfort. The detection of subtitle regions and their corresponding depth assignment are therefore developed in this article. Furthermore, interpolation and recursive temporal filtering of the initially estimated depth maps are performed to make depth edges conform to color edges and to avoid depth fluctuation, respectively. Since the temporal filtering is done recursively, extra buffering of information from previous frames is kept to a minimum. To make real-time conversion a reality purely in software, parallel programming on a multi-core CPU platform is also implemented.
The remainder of this article is organized as follows. Section 2 describes the design concept of our scheme. Sections 3, 4, and 5 elaborate the pre-processing, depth estimation, and post-processing steps of our scheme, respectively. Section 6 gives the experimental results, and Section 7 draws some conclusions and outlines future work.
- shot classification is necessary to enable shot-adaptive depth estimation;
- for speed, motion information (e.g., MVs) is extracted directly from the compressed bit stream rather than re-estimated from the reconstructed pixels;
- depth estimation should refer to as few previous frames as possible, to avoid large memory buffers and a long time delay;
- 3D perception artifacts in the temporal domain are much more observable than those in the spatial domain and should be eliminated with priority; and
- parallel processing on prevailing multi-core platforms should be optimized for speedup.
As for depth estimation, we adopt a shot-adaptive strategy, by which features including inter-frame difference, frame complexity, and camera motion parameters at the shot-change boundary frame are analyzed for shot classification. Four shot categories are designed for analysis. To be consistent with MPEG videos, all computations of depth cues in spatial domain are block-based. To make depths smooth within an object and sharp near object boundaries, a depth-based foreground segmentation algorithm is developed for further depth refinement.
Spatial misalignment of edges between the depth and color images is known to degrade the stereo visual quality. This means that spatial blockiness and jerkiness in the depth map should be avoided. On the other hand, most current SVC works place less emphasis on the perception artifacts that result from temporal depth inconsistency. Hence, our post-processing stage scales up the block-based initial depth map, aligns it to color edges, and simultaneously smooths it in the temporal domain. To achieve this, the Joint Bilateral Upsampling (JBU) algorithm is first modified to interpolate the estimated depth map, and recursive temporal filtering is then performed to eliminate possible 3D perception artifacts (e.g., depth fluctuation) resulting from wrong depth estimates.
As for view synthesis, the popular image warping technique DIBR is applied to construct the left and right views for 3D display. Each part of the proposed SVC scheme is elaborated in the following sections.
3.1. Shot change detection
where $\alpha$ is a pre-defined constant, $T_S(u, v)$ is a region-dependent threshold, $\min(\cdot)$ represents the minimum operator, and $U(\cdot)$ denotes the unit step function, with $U(x) = 1$ for $x > 0$ and $U(x) = 0$ for $x \le 0$. The average HD value over the past $k$ frames is calculated and used as an alternative threshold to adapt to various scenic changes. Since humans often pay more attention to the central zones of a frame, the $T_S(u, v)$ values of those zones are set higher than the others. A shot change is then detected by thresholding SHD(t).
3.2. MV refinement and compensation
Motion parallax has been reported to be the dominant depth cue for human beings at a viewing distance of less than 10 m. It is revealed by image MVs, whose magnitudes are often inversely proportional to the depth of a moving object (i.e., a larger MV possibly corresponds to a nearer object). However, MVs retrieved from MPEG videos are encoding-oriented and might be incorrect from the viewpoint of motion analysis. They should be refined before being further analyzed or processed.
The procedure for refining MVs is similar to one proposed previously, but much easier and faster to implement. First, four bins (corresponding to the four quadrants of the Euclidean plane) are prepared for direction histogramming of the 3 × 3 MVs around the MacroBlock under consideration (MB, 16 × 16 pixels). A bin is taken as the dominant cluster if its size is above a threshold (e.g., 4). The MV of the current MB is replaced with the mean MV of the dominant cluster if it does not belong to that cluster, and remains unchanged otherwise. The result of this sub-procedure is denoted as $(MV_H^C, MV_V^C)$.
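The refinement step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `quadrant` and `refine_mv` are hypothetical, zero-valued components are assigned to the non-negative side of each axis, and the neighborhood is simply clipped at frame borders.

```python
import numpy as np

def quadrant(mv):
    """Map an MV (horizontal, vertical) to one of four direction bins.
    Assumption: zero components count as non-negative."""
    h, v = mv
    if h >= 0 and v >= 0:
        return 0
    if h < 0 and v >= 0:
        return 1
    if h < 0 and v < 0:
        return 2
    return 3

def refine_mv(mv_field, row, col, threshold=4):
    """Refine the MV of the MB at (row, col) from its 3 x 3 neighborhood.

    mv_field: float array of shape (rows, cols, 2), one MV per MB.
    threshold: a bin is dominant only if its size is above this value.
    """
    rows, cols, _ = mv_field.shape
    r0, r1 = max(row - 1, 0), min(row + 2, rows)
    c0, c1 = max(col - 1, 0), min(col + 2, cols)
    window = mv_field[r0:r1, c0:c1].reshape(-1, 2)

    # Direction histogram over the four quadrants.
    bins = np.array([quadrant(mv) for mv in window])
    counts = np.bincount(bins, minlength=4)
    dominant = int(np.argmax(counts))

    current = mv_field[row, col]
    # No dominant cluster, or the current MV already belongs to it: keep it.
    if counts[dominant] <= threshold or quadrant(current) == dominant:
        return current.copy()
    # Otherwise replace it with the mean MV of the dominant cluster.
    return window[bins == dominant].mean(axis=0)
```

For a 3 × 3 field in which eight MVs equal (2, 1) and the central one is (−3, −2), the central MV is replaced by (2, 1).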
where ΔH and ΔV stand for the compensation amounts calculated according to the camera motion parameters.
3.3. MV interpolation
where (x, y) and (x′, y′) represent the coordinates of a corresponding point pair between two consecutive frames, p1–p6 are the transform parameters, and the superscript T denotes the transpose operator. The point pairs established by at least three MVs of an MB and its eight neighbors can be used to derive a system of equations based on Equation (4), which is solved for p1–p6 by the least-squares method.
Based on the computed p1–p6, the corresponding point (x′, y′) in the reference frame for a given block (8 × 8 pixels) can be obtained by substituting the (x, y) coordinates of its left-upper corner point into the right-hand side of Equation (4). The interpolated MV for that block can then be derived as (x′ − x, y′ − y). In case of insufficient point-pairs for solving p1–p6, the MV of an inter-coded MB is copied to its four descendants, or a maximum value (usually the value of search range for motion estimation) is assigned for intra-coded MBs.
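A least-squares fit of this kind can be sketched as below. The sketch assumes that Equation (4) is the usual six-parameter affine model, x′ = p1·x + p2·y + p3 and y′ = p4·x + p5·y + p6; the ordering of the parameters and the function names `fit_affine` and `interpolate_mv` are our assumptions for illustration.

```python
import numpy as np

def fit_affine(points, mvs):
    """Least-squares estimate of (p1, ..., p6) from point/MV pairs.

    points: (N, 2) array of (x, y) positions in the current frame.
    mvs:    (N, 2) array of MVs, so that (x', y') = (x, y) + MV.
    At least three non-collinear points are required.
    """
    points = np.asarray(points, dtype=float)
    mvs = np.asarray(mvs, dtype=float)
    if len(points) < 3:
        raise ValueError("at least three point pairs are needed")
    targets = points + mvs
    # Each row of A is (x, y, 1); x' and y' are fitted independently.
    A = np.column_stack([points, np.ones(len(points))])
    px, *_ = np.linalg.lstsq(A, targets[:, 0], rcond=None)
    py, *_ = np.linalg.lstsq(A, targets[:, 1], rcond=None)
    return np.concatenate([px, py])

def interpolate_mv(params, x, y):
    """MV of the block whose upper-left corner is (x, y)."""
    p1, p2, p3, p4, p5, p6 = params
    x_ref = p1 * x + p2 * y + p3
    y_ref = p4 * x + p5 * y + p6
    return x_ref - x, y_ref - y
```

With four MB corners that all carry the MV (2, −1), the fit returns a pure translation, and the interpolated MV of any 8 × 8 block is again (2, −1). The fallback for too few point pairs (copying the MB's MV or assigning the maximum value) is left out of the sketch.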
4.1. Shot classification
A non-adaptive SVC scheme does not perform satisfactorily across different kinds of video content. In the proposed SVC scheme, each segmented shot is classified into one of four categories: (C1) neither object nor camera motion exists, (C2) no object motion but camera motion exists, (C3) object motion exists and frame complexity is low, and (C4) object motion exists and frame complexity is high. To determine the category of a video shot, features in terms of inter-frame difference, frame complexity, and camera motion parameters are calculated. Note that, because of the real-time requirement, shot classification is based on features calculated only from the shot change boundary frame (instead of the frames of the whole shot). That is, the result of shot classification (C1–C4) remains in effect until the next shot change boundary frame is detected and classified.
where $T_F(u, v)$ is a threshold smaller than $T_S(u, v)$ in Equation (2). Frame complexity is measured by the variance of the pixel values in a frame: the larger the variance, the higher the frame complexity.
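The four-way decision described in this subsection can be sketched as follows. How the object-motion and camera-motion flags are derived from the inter-frame difference and the camera motion parameters is not reproduced here, so they are passed in as booleans; the function names and the default complexity threshold are hypothetical.

```python
import numpy as np

def frame_complexity(luma):
    """Frame complexity measured as the variance of the pixel values."""
    return float(np.var(np.asarray(luma, dtype=float)))

def classify_shot(object_motion, camera_motion, complexity,
                  complexity_threshold=1000.0):
    """Return the shot category 'C1'..'C4'.

    object_motion, camera_motion: booleans obtained from the boundary frame.
    complexity_threshold: hypothetical value, to be tuned experimentally.
    """
    if not object_motion:
        # C1: static scene; C2: camera motion only.
        return "C2" if camera_motion else "C1"
    # Object motion exists: split by frame complexity (C3 low, C4 high).
    return "C4" if complexity > complexity_threshold else "C3"
```

Since the classification uses only the boundary frame, the returned label would simply be kept until the next shot change is detected.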
4.2. Initial depth estimation
The human visual system perceives depth by combining multiple cues from all domains to estimate the distances of objects or the relative displacements between them. Popular monoscopic depth perception cues include motion parallax, texture gradient, brightness, atmospheric perspective, linear perspective, and so on [4, 29]. Therefore, one issue of SVC is to compute monocular cues and suitably fuse them to obtain stereoscopic information. In our system, the frame next to the shot change boundary is used for initial depth estimation.
Motion parallax [4, 29] describes the relative motion of objects against the background. It is the dominant depth cue for human beings at a viewing distance of less than 10 m and has been widely adopted for depth estimation [11, 19]. Atmospheric perspective [4, 29], also called aerial perspective, describes how the space or atmosphere between an object and an observer affects the appearance of the object: a far object looks hazy or is of low contrast [4, 29]. Here, we devise a depth estimation algorithm based on motion parallax and atmospheric perspective.
4.2.1. Motion parallax cue
where $MV_H$ and $MV_V$ are the horizontal and vertical components, respectively, after MV interpolation, and (u, v) denotes the (8 × 8) block index. Note that at the shot change boundary frame, the motion cue is unreliable and should be ignored. On the other hand, the MV field of the previous frame is retained if the current frame is intra-coded (i.e., an I frame) and not at a shot change boundary.
4.2.2. Atmospheric perspective cue
where $I(t)$ is the $t$-th luminance image at the original resolution, $\bar{I}(t, u, v)$ denotes the mean value of the $(u, v)$-th block, and $I_H(t, u, v)$ and $I_L(t, u, v)$ represent the average values of the pixels above and below $\bar{I}(t, u, v)$, respectively. According to Equation (10), the smaller $f_C$ is, the lower the contrast. In addition, since $I_H(t, u, v)$ and $I_L(t, u, v)$ are averages rather than extreme values, $f_C$ is expected to be more robust than the contrast computed traditionally.
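The block-wise quantities defined above can be sketched as follows. The exact form of Equation (10) is not reproduced here, so the sketch assumes the simplest measure consistent with the description, namely the difference between the two averages, f_C = I_H − I_L; this choice and the function name `contrast_cue` are assumptions.

```python
import numpy as np

def contrast_cue(luma, block=8):
    """Per-block contrast from the averages above and below the block mean.

    luma: 2D luminance image; trailing rows/columns that do not fill a
    whole block are ignored. Returns an array with one value per block.
    Assumption: f_C = I_H - I_L (flat blocks get 0).
    """
    h, w = luma.shape
    rows, cols = h // block, w // block
    f_c = np.zeros((rows, cols))
    for u in range(rows):
        for v in range(cols):
            blk = luma[u * block:(u + 1) * block,
                       v * block:(v + 1) * block].astype(float)
            mean = blk.mean()
            high = blk[blk > mean]   # pixels above the block mean
            low = blk[blk < mean]    # pixels below the block mean
            if high.size and low.size:
                f_c[u, v] = high.mean() - low.mean()
    return f_c
```

A flat block yields 0, while a block that is half 0 and half 100 yields 100, in line with the statement that a smaller value means lower contrast.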
4.2.3. Combination of depth cues
The overall depth can be estimated as a weighted combination of different depth cues. Since each shot is classified into one of four categories, as addressed in Section 4.1, the initial depth $d_E$ for each frame therein is calculated adaptively with different parameters below:
C1: $d_E(t, u, v) = d_E(t - 1, u, v)$
where $\omega_M = 0.6$, $\omega_C = 0.4$;
where $\omega_M = 0.8$, $\omega_C = 0.2$;
where $\omega_M$ and $\omega_C$ are the pre-determined weighting parameters of the motion parallax and atmospheric perspective cues, respectively, and the two cues enter the combination as normalized versions (between 0 and 255) of $f_M$ and $f_C$. The values of $\omega_M$ and $\omega_C$ are determined experimentally. Note that normalization is performed before the weighted combination. Larger normalized motion and contrast cues lead to larger values of $d_E$, which stand for nearer distances. Though more cues could be collected for more accurate depth estimation, only two cues (one from the temporal domain and the other from the spatial domain) are chosen for speed.
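The normalize-then-combine step can be sketched as below. Min-max scaling to the range 0 to 255 is an assumption (the text only states the target range), and the function names are hypothetical; the weights are passed in so that any of the category-specific settings can be used.

```python
import numpy as np

def normalize_0_255(cue):
    """Scale a cue map to [0, 255] by min-max normalization (assumed)."""
    cue = np.asarray(cue, dtype=float)
    span = cue.max() - cue.min()
    if span == 0:
        return np.zeros_like(cue)
    return 255.0 * (cue - cue.min()) / span

def combine_cues(f_m, f_c, w_m, w_c):
    """Weighted combination of the normalized motion and contrast cues.
    Larger outputs stand for nearer blocks."""
    return w_m * normalize_0_255(f_m) + w_c * normalize_0_255(f_c)
```

For example, with weights 0.6 and 0.4, a block with the largest motion cue and the smallest contrast cue receives 0.6 × 255 = 153, and a block with the opposite extremes receives 0.4 × 255 = 102.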
4.3. Depth-based foreground segmentation and foreground depth refinement
By binary thresholding of the initial depth map $d_E$, the foreground area (represented by an object mask $\Omega_t^F$) can be segmented out.
Note that all the above processes in Sections 4.2 and 4.3 are based on the blocks of 8 × 8 pixels, hence speeding the processing.
4.4. Depth assignment for subtitles
In addition to the video content, the impact of the subtitles’ depth on 3D visual quality should also be considered. Though some commercial software can add external subtitles to a video manually and assign depths to them at the same time, our focus is to automatically detect subtitles embedded in the frames and assign depths to them (to the best of the authors’ knowledge, few works have discussed this issue). Flickering or depth fluctuation of the subtitles lowers the perceived visual quality and makes viewers uncomfortable, which motivates us to assign a constant depth to the detected subtitles along the whole sequence. However, we still face the choice of maintaining a constant depth for the whole subtitle area or for the individual characters in the subtitles. Since the latter alternative requires precise and stable character segmentation for a subtitle that persists over several frames, the former is selected in our system for the sake of stable quality.
The depth map estimated in Section 4 (i.e., $d_L(t, u, v)$ in Equation 11) is block-based for speed. It should be scaled up to the pixel level and aligned to the color edges for better 3D perception. The post-processing is described in detail here.
5.1. Locally adaptive joint trilateral upsampling (LA-JTU)
Edge misalignment between the color and depth images after depth upsampling may cause visual artifacts when the left- and right-view images are synthesized. To enlarge $d_L(t, u, v)$ and spatially smooth it while registering the depth edges to the corresponding color edges, the JBU algorithm has been proposed. Its rationale is to interpolate and smooth a depth map, while preserving the edge information, by computing a weighted average for each pixel (x, y) in the high-resolution depth image $d_H$. Within a local window $\Omega_{(x,y)}$ centered at (x, y), each pixel (x′, y′) is associated with a weight that is a function of its Euclidean distance and color difference with respect to the central pixel. However, JBU does not work well when (x, y) and (x′, y′) (1) have similar colors but different depths, or (2) have different colors but similar depths. In these two cases, wrong depth interpolation for (x, y) will cause ghost and flickering artifacts.
where $T_d$ is a threshold and $I_q$, $q = r, g, b$, represent the three color components.
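For reference, the baseline that LA-JTU modifies can be sketched as follows. The code implements plain joint bilateral upsampling with Gaussian spatial and range weights on a single-channel guide image, not the locally adaptive trilateral variant proposed here; the function name, the Gaussian kernels, and all default parameter values are assumptions.

```python
import numpy as np

def joint_bilateral_upsample(depth_low, guide, scale, radius=2,
                             sigma_s=1.0, sigma_r=10.0):
    """Upsample depth_low (h x w) to the size of guide (h*scale x w*scale).

    Each output pixel is a weighted average of nearby low-resolution depth
    samples; the weight combines the spatial distance (in low-resolution
    coordinates) and the guide-intensity difference at full resolution.
    """
    guide = np.asarray(guide, dtype=float)
    H, W = guide.shape
    h, w = depth_low.shape
    out = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            yl, xl = y / scale, x / scale          # low-res position
            cy, cx = int(yl), int(xl)
            num = den = 0.0
            for qy in range(max(cy - radius, 0), min(cy + radius + 1, h)):
                for qx in range(max(cx - radius, 0), min(cx + radius + 1, w)):
                    ds2 = (qy - yl) ** 2 + (qx - xl) ** 2
                    # Guide value at the full-res location of the sample.
                    gy, gx = min(qy * scale, H - 1), min(qx * scale, W - 1)
                    dr = guide[y, x] - guide[gy, gx]
                    wgt = (np.exp(-ds2 / (2 * sigma_s ** 2))
                           * np.exp(-dr * dr / (2 * sigma_r ** 2)))
                    num += wgt * depth_low[qy, qx]
                    den += wgt
            out[y, x] = num / den
    return out
```

With a guide image that has a sharp vertical edge, the upsampled depth switches value at the same column as the guide, which is the edge-registration behavior described above. The two failure cases listed in the text arise because the range weight depends only on color, which is what the additional depth-dependent term of LA-JTU is meant to address.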
5.2. Temporal filtering
As noted, the depth map $d_H(t, x, y)$ is created by referring to only two consecutive frames (i.e., t and t − 1). This has the disadvantage that depths are unstable from frame to frame, especially when the motion or contrast cue fluctuates due to varying lighting. In our experience, depth artifacts in the temporal domain are much more harmful to perceived 3D quality than those in the spatial domain. Also note that large depth variations will result in discontinuity or bending of object contours (especially lines in the vertical direction) after image rendering (discussed later). In fact, the distortion of object contours and of horizontal/vertical lines in an image is an important factor in measuring video quality. This motivates us to apply temporal filtering to the $d_H(t, x, y)$ maps before they are used for image rendering.
where $\omega_t$, $0 < \omega_t < 1.0$, is the weighting factor that determines the temporal smoothness of the depth map, and the left-hand side is the filtered result. The previously estimated depth $d_H(t - k, x, y)$ thus has an exponentially weighted contribution, $(\omega_t)^k \cdot (1 - \omega_t) \cdot d_H(t - k, x, y)$, to the filtered result. Taking $\omega_t = 0.75$ as an example, only about 8% of the contribution is left after four frames of decay, i.e., $0.08 \cdot d_H(t - 4, x, y)$. This kind of exponentially weighted running-average filter requires no window size, and only one buffer is needed to store the previous filtered result. Note that at the shot change boundary frame, $\omega_t$ is set to 0 to prevent incorrect depth propagation (see Figure 3).
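The recursion described here can be sketched as follows, assuming the filter has the standard form of an exponentially weighted running average, filtered(t) = ω_t · filtered(t − 1) + (1 − ω_t) · d_H(t), which is consistent with the contribution formula given in the text. The class name `RecursiveDepthFilter` is hypothetical.

```python
import numpy as np

class RecursiveDepthFilter:
    """Exponentially weighted running average of depth maps.

    Only the previous filtered map is stored, so a single buffer suffices.
    """

    def __init__(self, weight=0.75):
        if not 0.0 < weight < 1.0:
            raise ValueError("weight must lie strictly between 0 and 1")
        self.weight = weight
        self.state = None  # previous filtered depth map

    def update(self, depth, shot_change=False):
        depth = np.asarray(depth, dtype=float)
        if self.state is None or shot_change:
            # Equivalent to a weight of 0: no history is propagated.
            self.state = depth.copy()
        else:
            self.state = (self.weight * self.state
                          + (1.0 - self.weight) * depth)
        return self.state
```

Feeding a single depth value of 100 followed by four frames of 0 leaves 0.75⁴ × 0.25 × 100 ≈ 7.9, which matches the figure of about 8% after four frames.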
To evaluate the performance of the proposed SVC scheme, several image sequences, namely “Breakdancer”, “Flamenco”, “Akko&kayo”, “Ballet”, “Close to you”, “True legend”, “2012”, “New moon”, and one music video, all with a frame size of 640 × 480 pixels, are used for testing. Among them, depths calculated by stereo vision techniques are provided for Breakdancer and Ballet (thanks to Microsoft Co.) and are considered as ground truth. All the video clips are MPEG-encoded (with an “IPPP…I” encoding structure) at a frame rate of 30 Hz. The parameter settings in post-processing are (1) a window size (i.e., $\Omega_{(x,y)}$) of 17 × 17 pixels for LA-JTU and (2) $\omega_t = 0.75$. The converted 3D videos are played on an Acer 3D notebook (Aspire 5738DG) with a 3D display (odd-even interleaved scan lines, viewed with polarizing glasses).
where $T_\varphi$ is a pre-defined threshold and $N_H$ and $N_W$ are the height and width of a frame, respectively.
The better the three indices (perceived depth, visual quality, and temporal smoothness) are, the better the performance of an SVC method is. For the subjective tests, five grades, similar to the mean opinion score (MOS), are adopted: 5 (excellent), 4 (good), 3 (fair), 2 (poor), and 1 (bad).
6.1. Depth estimation and view synthesis
It is observed from Row 4 of Figure 6 that combination of motion and contrast cues according to the shot-classification result is effective in identifying the foreground objects (especially for Figure 6d2, d3). The result demonstrates that block-based initial depth estimation, enhanced with LA-JTU and temporal filtering, is sufficient to provide satisfactory depth maps for stereo conversion, while keeping the conversion time limited for real-time applications (see Section 6.5 later). As for the detection of the subtitle region, the result in Figure 6d2, where the constant depth is set to 255, is also satisfactory.
6.2. Evaluation of depth assignment for subtitles
We experiment with four types of depth assignment for subtitles: (1) the estimated depth without any constant assignment, (2) constant 255, (3) constant 128, and (4) a constant depth for all segmented characters in the subtitles. The subjective test results in terms of MOS are 3.11, 4.21, 4.061, and 3.063, respectively. This result matches earlier analysis. Viewers perceive poor visual comfort (e.g., flickering artifacts) when depth discontinuity in the subtitle region occurs in either the temporal (type 1) or the spatial (type 4) domain. In addition, the difference in human perception is not significant when the constant depth value is changed (types 2 and 3).
6.3. Effectiveness of recursive temporal filtering
6.4. Subjective comparison with TriDef
The TriDef 3D software [4, 18] implements DDD’s own SVC scheme to make existing 2D photos and movies viewable in 3D. Here, subjective tests are conducted on seven video clips to show the difference in human perception between the 3D videos converted by our SVC scheme and those converted by TriDef.
Comparisons on perceived depth and visual quality
Comparisons on perceived depth
6.5. Parallel programming
The proposed SVC scheme is implemented with the OpenMP parallel programming model on a personal computer with a multi-core CPU to meet the real-time requirement. On a multi-core platform, a program is composed of several threads that can be executed in parallel by multiple cores.
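Because the per-block operations are independent, the frame can be divided among threads. The sketch below illustrates this data-parallel split with Python's standard `concurrent.futures` thread pool rather than OpenMP, so it is only an analogue of the implementation described here; the per-block operation (a block mean) and the function names are hypothetical stand-ins for the actual depth-cue computations.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def block_means(luma, block=8):
    """Mean of every block x block tile (serial reference version)."""
    h, w = luma.shape
    rows, cols = h // block, w // block
    trimmed = luma[:rows * block, :cols * block].astype(float)
    return trimmed.reshape(rows, block, cols, block).mean(axis=(1, 3))

def block_means_parallel(luma, block=8, workers=4):
    """Same result, with the block rows split into strips, one per task."""
    rows = luma.shape[0] // block
    bounds = np.linspace(0, rows, workers + 1).astype(int)
    strips = [luma[a * block:b * block]
              for a, b in zip(bounds[:-1], bounds[1:]) if b > a]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves the strip order, so the parts stack back in place.
        parts = list(pool.map(lambda s: block_means(s, block), strips))
    return np.vstack(parts)
```

For a 480 × 640 frame and four workers, each task processes 15 of the 60 block rows, and the stacked result equals the serial one.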
Speed performance of our SVC scheme for 640 × 480 pixels at 30 Hz format video (seconds per frame, with OpenMP)
- Processing steps listed: MPEG decoding (640 × 480 pixels at 30 Hz); view synthesis/interleaving (output: 640 × 480 pixels)
- Intel Core 2 Quad (2.4 GHz): 0.0322 (31.05 Hz)
- Intel i7 (2.67 GHz): 0.0252 (39.68 Hz)
In this article, an automatic SVC scheme accepting MPEG videos as input is proposed. Our depth estimation is based on spatio-temporal analysis of the video data, including the estimation of motion and contrast cues that reflect the relative depths of the foreground and background areas. The distinguishing features of this study are (1) a shot-adaptive (categories C1–C4) algorithm that adapts to diverse video content, (2) initial depth estimation at a down-sampled resolution followed by the LA-JTU algorithm, which reduces the computing load significantly, (3) a recursive temporal filter (Equation 20) that reduces possible depth fluctuations resulting from wrong depth estimates, and (4) a processing architecture suitable for real-time implementation on multi-core platforms.
Several kinds of videos are tested to evaluate the performance of the proposed SVC scheme. Our scheme is capable of converting videos of 640 × 480 pixels in real time (above 30 Hz) on commercial multi-core CPU platforms. The results show that our processing scheme lessens the impact of depth fluctuation on perceived 3D quality. Subjective tests show that videos converted by our SVC scheme offer perceived depth and visual quality comparable to those of videos converted from depth data calculated by stereo vision techniques. Our SVC scheme is also shown to outperform the well-known TriDef software in terms of perceived 3D depth.
In principle, our algorithm is applicable to most kinds of video shots, except those containing dim or low-contrast scenes, which make our depth cue estimation ineffective. One direction for future work is to explore and combine other monoscopic depth perception cues for more accurate depth estimation within a given processing-time limit. In addition, a more sophisticated technique based on human visual perception is needed to prevent the viewing discomfort caused by substantial depth estimation errors or fluctuations. In the future, videos in HD or higher-resolution formats will become more popular, which will necessitate the use of a graphics processing unit (GPU) for real-time conversion. Fortunately, our proposed algorithm consists of local, regular, and repeated operations, which makes its implementation on a GPU easier.
This research was supported by the National Science Council, Taiwan, under the grant of NSC 99-2221-E-194-003-MY3 and NSC 100-2628-E-212-001.
- Cheng CM, Lin SJ, Lai SH: Spatio-temporally consistent novel view synthesis algorithm from video-plus depth sequences for autostereoscopic displays. IEEE Trans. Broadcast. 2011, 57(2):523-532.View ArticleGoogle Scholar
- Quan HT, Barkowsky M, Callet PL: The importance of visual attention in improving the 3D-TV viewing experience: overview and new perspectives. IEEE Trans. Broadcast. 2011, 57(2):421-431.View ArticleGoogle Scholar
- Lin GS, Yeh CY, Chen WC, Lie WN: A 2D to 3D conversion scheme based on depth cues analysis for MPEG videos. In Proceedings of the IEEE International Conference on Multimedia and Expo. Singapore; 2010:1141-1145.Google Scholar
- Zhang L, Vazquez C, Knorr A: 3D-TV content creation: automatic 2D-to-3D video conversion. IEEE Trans. Broadcast. 2011, 57(2):372-383.View ArticleGoogle Scholar
- Wang HM, Chen YH, Yang JF: A novel matching frame selection method for stereoscopic video generation. In Proceedings of the IEEE Int'l Conf. on Multimedia and Expo. New York, USA; 2009:1174-1177.Google Scholar
- Okino T, Murata H, Taima K, Iinuma T, Oketani K: New television with 2D/3D image conversion technologies. Proc. SPIE 1996, 2653: 96-103. 10.1117/12.237421View ArticleGoogle Scholar
- Kim M, Park A, Cho Y: Object-based stereoscopic conversion of MPEG-4 encoded data. In Proceedings of the IEEE Pacific Rim Conference on Multimedia. Tokyo, Japan; 2004:491-498.Google Scholar
- Murata H, Mori Y: A Real-Time 2D to 3D Image Conversion Technique Using Image Depth. SID, DIGEST; 1998:919-922.Google Scholar
- Kim D, Min D, Sohn K: A stereoscopic video generation method using stereoscopic display characterization and motion analysis. IEEE Trans. Broadcast. 2008, 54(2):188-197.View ArticleGoogle Scholar
- Harman P, Flack J, Fox S, Dowley M: Rapid 2D to 3D conversion. Proc. SPIE 2002, 6696: 78-86.View ArticleGoogle Scholar
- Pourazad M, Nasiopoulos P, Ward R: Generating the depth map from the motion information of H.264-encoded 2D video sequence. EURASIP J. Image Video Process 2010, 2010: 1-13.
- Tekalp AM, Kurutepe E, Civanlar MR: 3DTV over IP: end-to-end streaming of multiview video. IEEE Signal Process. Mag. 2007, 24: 77-87.
- Chiang JC, Chen WC, Liu LM, Hsu KF, Lie WN: A fast H.264/AVC-based stereo video encoding algorithm based on hierarchical two-stage neural classification. IEEE J. Sel. Topics. Signal Process. 2011, 5(2):309-320.
- Feng Y, Ren JC, Jiang JM: Object-based 2D-to-3D video conversion for effective stereoscopic content generation in 3D-TV applications. IEEE Trans. Broadcast. 2011, 57(2):500-509.
- Guo G, Zhang N, Huo L, Gao W: 2D to 3D conversion based on edge defocus and segmentation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing 2008, 2181-2184.
- Angot LJ, Huang WJ, Liu KC: A 2D to 3D video and image conversion technique based on a bilateral filter. In Proceedings of SPIE-IS&T Electronic Imaging, vol. 7526. San Jose, USA; 2010.
- Ideses I, Yaroslavsky L, Fishbain B: Depth map manipulation for 3D visualization. In Proceedings of the 3DTV Conference: The True Vision—Capture, Transmission and Display of 3D Video. Istanbul, Turkey; 2008:337-340.
- TriDef 3D display software. http://www.tridef.com/3d-experience/
- Ideses I, Yaroslavsky LP, Fishbain B: Real-time 2D to 3D video conversion. J. Real-Time Image Process. 2007, 2: 3-9. 10.1007/s11554-007-0038-9
- Oh KJ, Yea S, Vetro A, Ho YS: Depth reconstruction filter and down/up sampling for depth coding in 3-D video. IEEE Signal Process. Lett. 2009, 16(9):747-750.
- Kopf J, Cohen MF, Lischinski D, Uyttendaele M: Joint bilateral upsampling. ACM Trans. Graph. 2007, 26(3):96-1-96-5.
- Fehn C: Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV. Proc. SPIE 2004, 5291: 93-104. 10.1117/12.524762
- Lin GS, Chang MK, Chiu ST: A feature-based Scheme for detecting and classifying video-shot transitions based on spatio-temporal analysis and fuzzy classification. Int. J. Pattern Recognit. Artif. Intell. 2009, 23(6):1179-1200. 10.1142/S0218001409007521
- Fehn C, Kauff P, Op de Beeck M, Ernst F, Ijsselsteijn W, Pollefeys M, Van Gool L, Ofek E, Sexton I: An evolutionary and optimised approach on 3D-TV. Proceedings of the International Broadcast Convention 2002, 357-365.
- Lie WN, Lai CM: News video summarization based on spatial and motion feature analysis. In Proceedings of the Pacific-Rim Conference on Multimedia. Tokyo, Japan; 2004:246-255.
- Tan YP, Saur DD, Kulkarni SR: Rapid estimation of camera motion from compressed video with application to video annotation. IEEE Trans. Circuits Syst. Video Technol. 2000, 10(1):133-146. 10.1109/76.825867
- Lee MC, Chen WG, Lin CLB, Gu C, Markoc T, Zabinsky SI, Szeliski R: A layered video object coding system using sprite and affine motion model. IEEE Trans. Circuits Syst. Video Technol. 1997, 7(1):130-145. 10.1109/76.554424
- Young MJ, Landy MS, Maloney LT: A perturbation analysis of depth perception from combination of texture and motion cues. Vis. Res. 1993, 33(18):2685-2696. 10.1016/0042-6989(93)90228-O
- Mendiburu B: 3D Movie Making—Stereoscopic Digital Cinema From Script to Screen. Focal Press; 2009.
- Ko J, Kim M, Kim C: 2D-to-3D stereoscopic conversion: depth-map estimation in a 2D single-view image. Proc. SPIE 2007, 6696: 66962A.1-66962A.9.
- Wu HR, Rao KR: Digital Video Image Quality and Perceptual Coding. CRC Press, Taylor & Francis Group; 2006.
- Gonzalez RC, Woods RE: Digital Image Processing. 3rd edition. Prentice-Hall; 2008.
- Z depth. http://www.sonycreativesoftware.com/zdepth
- Pinson MH, Wolf S: A new standardized method for objectively measuring video quality. IEEE Trans. Broadcast. 2004, 50(3):312-322. 10.1109/TBC.2004.834028
- Zitnick CL, Kang SB, Uyttendaele M, Winder S, Szeliski R: High-quality video view interpolation using a layered representation. ACM Trans. Graph. 2004, 23(3):600-608. 10.1145/1015706.1015766
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.