- Research Article
- Open Access
Stereoscopic Visual Attention-Based Regional Bit Allocation Optimization for Multiview Video Coding
© Yun Zhang et al. 2010
- Received: 26 December 2009
- Accepted: 18 June 2010
- Published: 29 June 2010
We propose a Stereoscopic Visual Attention- (SVA-) based regional bit allocation optimization for Multiview Video Coding (MVC) that exploits visual redundancies in human perception. We first propose a novel SVA model, in which multiple perceptual stimuli, including depth, motion, intensity, color, and orientation contrast, are used to simulate the visual attention mechanisms of the human visual system with stereoscopic perception. Then, a semantic region-of-interest (ROI) is extracted based on the saliency maps of SVA. Both objective and subjective evaluations of the extracted ROIs indicate that the proposed SVA-based ROI extraction scheme outperforms schemes that use only spatial and/or temporal visual attention cues. Finally, using the extracted SVA-based ROIs, a regional bit allocation optimization scheme is presented that allocates more bits to SVA-based ROIs for high image quality and fewer bits to background regions for efficient compression. Experimental results on MVC show that the proposed regional bit allocation algorithm can achieve over 20% bit-rate saving while maintaining the subjective image quality. Meanwhile, the image quality of ROIs is improved at the cost of imperceptible image quality degradation in the background.
- Background Region
- Mean Opinion Score
- Stereoscopic Video
- Multiview Video
- Visual Attention Model
Three-Dimensional Video (3DV) provides a Three-Dimensional (3D) depth impression and allows users to freely choose a view of a visual scene. With these features, it enables many multimedia applications, such as photorealistic rendering of 3D scenes, free viewpoint television [2], 3D television broadcasting, and 3D games, to introduce new and exciting features for users. Multiview video plus depth [3] supports high image quality and low-complexity rendering of a continuum of output views. It has become the main representation of 3D scenes and has been applied to many multiview multimedia applications. However, multiview video requires a huge amount of storage and transmission bandwidth, several times that of traditional monoview video. Thus, it is necessary to develop efficient Multiview Video Coding (MVC) algorithms for practical use.
MVC has been studied on the basis of several video coding standards, including MPEG-2, MPEG-4, H.263, and H.264. Since the Moving Picture Experts Group (MPEG) recognized the importance of MVC technologies, an Ad Hoc Group (AHG) on 3D Audio and Visual (3DAV) was established. MPEG surveyed several MVC schemes, such as "Group-of-GOP prediction (GoGOP)", "sequential view prediction", and "checkerboard decomposition" [4]. Yea and Vetro proposed a view synthesis prediction-based MVC scheme for improving interview compression efficiency [5]. Yun et al. developed an efficient MVC algorithm which adaptively selects the optimal prediction structure according to the spatiotemporal correlation of the 3DV sequence [6]. Merkle et al. proposed another MVC scheme using Hierarchical B Pictures (MVC-HBP) and achieved superior compression efficiency and temporal scalability [7]. This scheme has been adopted into the MVC standardization draft by the Joint Video Team (JVT) and is used in the Joint Multiview Video Model (JMVM).
In many of the previous MVC schemes [4–7], intra, inter, and interview prediction and compensation technologies are adopted to reduce spatial, temporal, and interview redundancies. Additionally, YUV color space transform, integer transform, and quantization technologies are utilized to exploit visual redundancies, including chroma redundancies and high-frequency redundancies. According to studies on visual psychology, the Human Visual System (HVS) does not treat visual information equally from region to region of the video content. The HVS is more sensitive to distortion in Regions-Of-Interest (ROIs), or attention areas, than in background regions. These regional interests give rise to further visual redundancies in 3DV. However, previous MVC schemes have not taken the regional selective property and 3D depth perception of the HVS into consideration. Applying the concept of ROI to video coding is regarded as a promising way to improve coding efficiency by exploiting regional visual redundancies. However, two major problems must be tackled: ROI detection and ROI-based bit allocation.
For unsupervised ROI extraction, visual attention has been introduced as one of the key technologies in video/image systems [10, 11]. Accordingly, many efforts have been devoted to research on visual attention models [11–16] so as to simulate the visual attention mechanism of the HVS accurately. Itti and Koch developed a bottom-up visual attention model [12] for still images based on Treisman's stimulus integration theory [13]. It generates a saliency map by integrating perceptual stimuli from intensity contrast, colour contrast, and orientation contrast. Zhai et al. used low-level features as well as cognitive features, such as skin colour and captions, in their visual attention model [14]. Motion is another important cue for visual attention detection in video; thus, a bottom-up spatiotemporal visual attention model was proposed for video sequences in [15]. Wang et al. proposed a segment-based video attention detection method [16]. Ma et al. proposed a combined bottom-up and top-down visual attention model that integrates multiple features, including image contrast, motion, face detection, audition, and text [11]. However, all these visual attention models were proposed either for static images or for single-view video and did not take stereoscopic or depth perception into account. Moreover, stereoscopic parallax is not available in single-view video.
From the video coding point of view, many bit allocation algorithms [17–24] have been proposed for improving compression efficiency. Kaminsky et al. proposed a complexity-rate-distortion model to dynamically allocate bits under both complexity and distortion constraints [17]. Lu et al. proposed a Group-Of-Pictures (GOP-) level bit allocation scheme [18], and Shen et al. proposed a frame-level bit allocation method which decreases the average standard deviation of video quality [19]. Özbek and Tekalp proposed bit allocation among views for scalable multiview video coding [20]. All these bit allocation schemes improve the average Peak Signal-to-Noise Ratio (PSNR) but do not take the regional selective properties of the HVS into account. Chen et al. and Wang et al. proposed bit allocation schemes that allocate more bits to ROIs for the MPEG-4 standard [21, 22]. These two schemes require very high ROI extraction accuracy. Chi et al. proposed ROI video coding based on H.263+ for low bit-rate multimedia communications [23]. In this scheme, the ROI is extracted according to a skin-color cue, and a fuzzy logic controller is designed to adaptively adjust the quantization parameter (QP) of each macroblock (MB). Tang et al. proposed a bit allocation scheme for 2D video coding which is guided by visual sensitivity considering motion and texture structures [24]. However, these bit allocation schemes were proposed for single-view video coding and cannot be directly applied to MVC because interview prediction is adopted in MVC.
In this paper, we propose a Stereoscopic Visual Attention- (SVA-) based regional bit allocation for improving MVC coding efficiency. We first present a framework of MVC in Section 2. In Section 3, we propose an SVA model to simulate the visual attention mechanism of the HVS. Then, an SVA-based bit allocation optimization algorithm is proposed for MVC in Section 4. Section 5 presents the regional selective image quality metrics which are adopted in the coding performance evaluation. In Section 6, SVA-based ROI extraction and multiview video coding experiments are performed and evaluated with various multiview video test sequences. Finally, Section 7 gives conclusions.
At the client side, the color and depth bitstream is demultiplexed and decoded by the MVC decoder. With the decoded multiview color videos, the depth videos, and the transmitted camera parameters, the view generation module renders a continuum of output views through depth image-based rendering [25]. Depending on the type of display device, for example, HDTV, stereoscopic display, or multiview display, a different number of views is displayed.
3.1. Framework of SVA Model
Here, the SVA saliency map is computed from the image saliency, motion saliency, and depth saliency together with D, the intensity of the depth map, which indicates the distance between the video content and the imaging camera/viewer.
3.2. Spatial Attention Detection for Static Image
We adopted Itti's bottom-up attention model [12, 27] as our spatial visual attention model. The seven neuronal features implemented are sensitive to color contrast (red/green and blue/yellow), intensity contrast, and four orientations (0°, 45°, 90°, and 135°) for static images. Centre and surround scales are obtained using dyadic Gaussian pyramids with nine levels. Centre-Surround Differences (CSD) are then computed as the pointwise differences across pyramid levels; six CSD feature maps are computed for each of the seven features, yielding a total of 42 feature maps. Finally, all feature maps are integrated into a single scalar image saliency map.
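The pyramid and centre-surround steps described above can be sketched in a few lines. This is only an illustrative sketch, not the implementation used in the experiments: it assumes the scale choice of the original Itti model (centre levels 2 to 4, surround levels 3 or 4 levels coarser), replaces Gaussian filtering with plain 2x2 block averaging, and uses nearest-neighbour upsampling. The function names are hypothetical.

```python
from itertools import product

def downsample(img):
    """Halve both dimensions by 2x2 block averaging (stand-in for Gaussian filtering)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1] +
              img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]

def pyramid(img, levels=9):
    """Dyadic pyramid: level 0 is the input, each further level is half the size."""
    out = [img]
    for _ in range(levels - 1):
        out.append(downsample(out[-1]))
    return out

def upsample_to(img, h, w):
    """Nearest-neighbour upsampling of a coarse map to h x w."""
    sh, sw = len(img), len(img[0])
    return [[img[y * sh // h][x * sw // w] for x in range(w)] for y in range(h)]

def center_surround(pyr, centers=(2, 3, 4), deltas=(3, 4)):
    """Pointwise |centre - surround| across levels; one map per (c, s) pair."""
    maps = {}
    for c, d in product(centers, deltas):
        s = c + d
        h, w = len(pyr[c]), len(pyr[c][0])
        surround = upsample_to(pyr[s], h, w)
        maps[(c, s)] = [[abs(pyr[c][y][x] - surround[y][x]) for x in range(w)]
                        for y in range(h)]
    return maps

# Toy 256x256 intensity image: a bright square on a dark background.
size = 256
image = [[255.0 if 96 <= x < 160 and 96 <= y < 160 else 0.0 for x in range(size)]
         for y in range(size)]
pyr = pyramid(image)
csd = center_surround(pyr)
n_features = 7                       # 2 colour + 1 intensity + 4 orientation channels
total_maps = n_features * len(csd)   # six maps per feature
print(len(csd), total_maps)
```

With three centre levels and two surround offsets there are six (c, s) pairs per feature, which is where the count of 42 maps for seven features comes from.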
3.3. Temporal Attention Detection
where ⊖ denotes the across-level difference between two maps at the center (c) and the surround (s) levels of the respective feature pyramids, with c ∈ {2, 3, 4}, s = c + δ, and δ ∈ {3, 4}; ⊕ is across-level addition; and N(·) is a normalization operator. Several other normalization strategies are available, such as learning and iterative localized interactions. However, these strategies are supervised or very time consuming. Therefore, we adopted the "Naive" strategy for its low complexity and unsupervised operation; the normalization operator rescales the saliency values to the fixed range [0, 255] (255 indicates the most salient).
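A minimal sketch of such a range normalization is given below. It assumes a simple min-max rescaling, which is one straightforward way to map saliency values onto 0 to 255; the function name `normalize_0_255` is hypothetical and the exact operator used in the experiments may differ.

```python
def normalize_0_255(saliency):
    """Min-max rescale a 2D saliency map so that its values span [0, 255]."""
    flat = [v for row in saliency for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A flat map carries no contrast; return all zeros instead of dividing by zero.
        return [[0.0 for _ in row] for row in saliency]
    scale = 255.0 / (hi - lo)
    return [[(v - lo) * scale for v in row] for row in saliency]

normalized = normalize_0_255([[2.0, 4.0], [6.0, 10.0]])
print(normalized)
```

Because the operation needs only the minimum and maximum of each map, it is unsupervised and costs one pass over the data.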
3.4. Depth Impacts on SVA and Depth Attention Detection
When watching 3D video, people are usually more interested in regions that appear to move out of the screen, that is, pop-out regions, which have small depth values or large disparities.
As the distance between a video object and the viewer/camera increases, the degree of interest in the video object decreases.
Objects outside the Depth-Of-Field (DOF) of the camera system are usually not attention areas, for example, defocus-blurred background or foreground objects.
Depth-discontinuous regions or depth-contrast regions are usually attention areas in 3DV, as they provide a strong depth sensation, especially when view angles or view positions are switching.
The depth map is an 8-bit gray image that can be captured by a depth camera or computed from multiview video. Each pixel in the depth map represents the relative distance between a video object and the camera. In this paper, we first estimate the disparity of each pixel in the multiview video using a stereo matching method. Then, the disparity is converted into perceptive depth. Finally, the intensity of each pixel in the depth map is mapped to an irregular space with nonuniform quantization. The HVS perceptive depth, Z, is given by
where "⌊·⌋" is the floor operation, z_f and z_n indicate the farthest and nearest depth, respectively, f is the focal length, and B is the baseline between cameras. The space between z_f and z_n is divided into narrow intervals around the z_n plane and wide intervals around the z_f plane.
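The nonuniform mapping can be illustrated with a short sketch. It assumes the inverse-depth quantization commonly used for 8-bit depth maps, in which the nearest plane maps to 255 and the farthest plane to 0, and the usual pinhole relation between disparity and depth; the exact expressions of the original equation are not reproduced here, and the function names and sample depths are hypothetical.

```python
import math

def disparity_to_depth(disparity, focal_length, baseline):
    """Pinhole stereo relation z = f * B / d (assumed form)."""
    return focal_length * baseline / disparity

def quantize_depth(z, z_near, z_far):
    """Map depth z in [z_near, z_far] to an 8-bit level, uniform in 1/z (assumed form)."""
    v = 255.0 * (1.0 / z - 1.0 / z_far) / (1.0 / z_near - 1.0 / z_far)
    return int(math.floor(v + 0.5))

z_near, z_far = 2.0, 10.0   # hypothetical scene range, arbitrary units
near_level = quantize_depth(z_near, z_near, z_far)
far_level = quantize_depth(z_far, z_near, z_far)
# Levels spent on one unit of depth next to each plane.
levels_near = near_level - quantize_depth(z_near + 1.0, z_near, z_far)
levels_far = quantize_depth(z_far - 1.0, z_near, z_far) - far_level
print(near_level, far_level, levels_near, levels_far)
```

In this sketch, one unit of depth next to the near plane consumes far more quantization levels than one unit next to the far plane, which matches the narrow-versus-wide interval behaviour described in the text.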
w_x and w_y are the width and height of each boundary depression level, and W and H are the width and height of the stereoscopic video, respectively.
3.5. Depth-Based Fusion for SVA Model
Here, separate weighting coefficients are assigned to the depth saliency, motion saliency, and image saliency, subject to constraints on their values. A relatively larger weighting coefficient is given to the more dominant saliency. The fusion also includes a weighted term for the correlation between each pair of saliencies a and b, and a scaling function applied to the depth intensity video. If the depth video is not provided, (13) reduces to a spatiotemporal scheme that fuses motion and still-image saliency.
Here, the frame counts are the number of views and the number of time instants in one GOP, and i and j are the temporal and interview positions, respectively. The rate term denotes the number of bits used to encode the frame at position (i, j) when its ROIs and its background regions are coded with different QPs, which are separated by the QP difference between the ROI and the background regions.
where is coefficient independent to , and is a negative value which indicates the slope of image quality degradation. is a negative value and it will decrease as increases to improve compression ratio.
where "⌊·⌋" denotes the floor operation. Meanwhile, the result is truncated to 0 if it is smaller than 0. The coefficients, including A and T, are bQP dependent and are modeled experimentally from the MVC experiments presented in Section 6.3.
where Γ is the maximum pixel value, here it is 255.
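As an illustration of how a regional PSNR of this kind can be evaluated, the sketch below computes the PSNR separately over ROI and background pixels of a luminance frame. It assumes the standard definition, 10·log10(Γ²/MSE), restricted to the pixels of one region; the function name `regional_psnr` and the toy data are hypothetical.

```python
import math

def regional_psnr(original, reconstructed, mask, region, peak=255):
    """PSNR over the pixels whose mask value equals `region`."""
    squared_error, count = 0.0, 0
    for o_row, r_row, m_row in zip(original, reconstructed, mask):
        for o, r, m in zip(o_row, r_row, m_row):
            if m == region:
                squared_error += (o - r) ** 2
                count += 1
    if count == 0:
        raise ValueError("region is empty")
    mse = squared_error / count
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

# Toy 2x2 frame: top row is ROI (1), bottom row is background (0).
original = [[100, 100], [100, 100]]
reconstructed = [[101, 100], [100, 104]]
mask = [[1, 1], [0, 0]]
psnr_roi = regional_psnr(original, reconstructed, mask, region=1)
psnr_bg = regional_psnr(original, reconstructed, mask, region=0)
print(round(psnr_roi, 2), round(psnr_bg, 2))
```

In the toy frame the ROI has the smaller error and therefore the higher PSNR, which is the ordering the regional bit allocation aims for.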
where the six model coefficients are derived from subjective quality evaluation experiments. In the following sections, the PMOSs of PSNR_Y and SSIM are denoted by PMOS_PSNR and PMOS_SSIM, respectively.
Table 1: Parameters and features of the test multiview videos (camera spacing/array; content features).
- 20 cm/1D arc; slow and fast motion
- 20 cm/1D arc; slow and very fast motion
- 6.5 cm/1D; complex indoor scene, slow motion
- 6.5 cm/1D; outdoor scene and complex background
- Large image size, slow motion
- Simple background and very slow motion
6.1. SVA-Based ROI Extraction
In 3DV, the motion-salient object is usually the most salient region in the visually attended area; next is the image saliency. Depth saliency is relatively less important and is given a smaller weighting coefficient than motion saliency and image saliency, unless the 3DV provides strong depth perception. So in the experiments, a relatively larger weighting coefficient is given to the dominant, more important motion saliency, and the three weighting coefficients are empirically set as 0.2, 0.35, and 0.45, subject to the constraints on these coefficients. On the other hand, in multiview video, image, motion, and depth saliencies are correlated with each other. The correlation between image and motion saliencies is higher than the other two correlations, that is, the correlations between depth and image saliency and between depth and motion saliency. This is because detected moving objects are likely to be textured objects, whereas there is no explicit correlation between depth and image/motion saliency. Thus, the weighting coefficient of the image-motion correlation is larger than those of the other two correlations, and the three are empirically set as 0.6, 0.2, and 0.2, respectively. In practice, to accurately simulate the mechanism of human visual attention, the values of the weighting coefficients and of the correlation coefficients should be adjusted according to the motion, textural, and depth characteristics of the multiview video sequences.
In the depth video, the z_f and z_n planes are mapped to 0 and 255, respectively, by the non-uniform quantization process in (8), which treats the z_f plane as infinitely far away and supposes that saliency in the z_f plane is completely unimportant. However, the z_f planes of the video sequences are usually not at infinity. So, we use a scaling function with a positive constant to remap the range between the z_n and z_f planes and take the saliency in the z_f plane into account. The constant reflects the importance of the saliency in the z_f plane compared with that in the z_n plane. It increases as the z_f plane approaches the z_n plane and decreases to 0 as z_f becomes infinite. In the SVA extraction experiments, the constant is set to 50 because most of the test video sequences are indoor scenes whose z_f planes are close to the z_n plane.
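To make the role of the weights and of the depth scaling concrete, the sketch below fuses three per-pixel saliency values. It is a simplified illustration under several assumptions, not a reproduction of (13): the correlation terms are omitted; the weights 0.2, 0.35, and 0.45 are assigned to depth, image, and motion saliency following the stated importance ordering; and the scaling function is assumed to take the form (D + γ)/(255 + γ) with γ = 50. All names are hypothetical.

```python
# Assumed assignment of the three reported weights (depth < image < motion).
K_DEPTH, K_IMAGE, K_MOTION = 0.2, 0.35, 0.45
GAMMA = 50.0  # positive constant of the depth scaling function

def depth_scale(depth_intensity, gamma=GAMMA):
    """Assumed scaling: 1.0 at the nearest plane (255), gamma/(255+gamma) at the farthest (0)."""
    return (depth_intensity + gamma) / (255.0 + gamma)

def fuse_saliency(s_depth, s_motion, s_image, depth_intensity):
    """Depth-scaled weighted sum of three saliency values in [0, 255]."""
    weighted = K_DEPTH * s_depth + K_MOTION * s_motion + K_IMAGE * s_image
    return depth_scale(depth_intensity) * weighted

near = fuse_saliency(255.0, 255.0, 255.0, depth_intensity=255.0)
far = fuse_saliency(255.0, 255.0, 255.0, depth_intensity=0.0)
no_offset = fuse_saliency(255.0, 255.0, 255.0, depth_intensity=0.0) * 0.0  # gamma -> 0 limit
print(round(near, 2), round(far, 2))
```

Under this assumed form, an equally salient pixel keeps its full saliency at the nearest plane but is attenuated, not zeroed, at the farthest plane; letting the constant shrink toward 0 would remove far-plane saliency entirely.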
Figure 7(g) illustrates the motion saliency maps and Figure 7(k) shows the ROI extracted on the basis of motion saliency only. Generally, areas of large motion contrast are very likely to be potential attention areas. However, this is not always true. For example, in the Ballet sequence, the shadow of the dancing girl exhibits high motion contrast, but it is not an attentive area. This kind of noise can be eliminated by combining the depth saliency and static image saliency. Figure 7(h) shows the depth saliency extracted from the depth video using the algorithm proposed in Section 3.4. As the depth saliency map shows, the depth contrast regions are extracted as the most salient, which coincides with the observation that people are particularly interested in depth contrast regions because they provide a more impressive stereoscopic perception. Besides, regions with small depth, that is, large intensity in the depth map, are also extracted as salient regions, which accords with the fact that people are usually more interested in objects close to them than in objects far away. According to the extracted depth saliency of the various test sequences, the proposed depth saliency detection algorithm is efficient and maintains high accuracy as long as the depth map is accurate. However, for inaccurate depth maps and for sequences with weak depth perception, depth saliency alone is not sufficient to simulate visual attention. Such cases can be seen in Pantomime and Breakdancers.
Figure 7(i) shows the final SVA saliency map generated by the proposed SVA model. Figure 7(i) simulates the visual attention mechanism of the HVS better than Figures 7(f)–7(h) for all sequences. Taking the Ballet sequence as an example, the proposed SVA model suppresses the noise in the spatial saliency map (the black region on the wall in the color image), the noise in the motion saliency map (the shadow of the dancing girl), and the noise in the depth saliency map (the foreground floor), yielding a favorable saliency map and ROI. For the Doorflowers sequence, multiple attention cues, including motion (the two men and the door), static image attention (the clock, painting, and chair), and depth (the sculpture), are integrated well by the proposed model. Similar results can be found for the other multiview video sequences. Therefore, it can be concluded that the proposed model detects the SVA accurately and simulates the HVS well by fusing depth information, static image saliency, and motion saliency. Additionally, even when there is noise in the depth map and/or the image saliency, the proposed model still obtains a satisfactory SVA by jointly using depth, motion, and texture information and suppressing the noise in each channel. Thus, the proposed model is error resilient and highly robust.
The ROI extraction results, illustrated in Figures 7(j)–7(m), are generated by four schemes: the S-scheme, the T-scheme, the ST-scheme, and the proposed SVA scheme. The S-scheme extracts the ROI using static image information only. The T-scheme extracts the ROI using motion information only. The ST-scheme extracts the ROI using both static image information and motion information. The SVA scheme extracts the ROI based on our proposed SVA model. Figure 7(m) shows the extracted MB-level ROI based on SVA, and Figure 7(n) is the MB-level ROI mask, in which black blocks are ROI MBs, gray blocks are transitional MBs, and white blocks are background MBs. Comparing Figures 7(j), 7(k), and 7(l) with Figure 7(m), we can see that the ROIs extracted with the SVA model are similar to those extracted from static image saliency (S-scheme) for multiview video with simple texture, such as Pantomime and Champagne tower. However, for multiview video with complex texture, such as Dog, Ballet, Alt Moabit, and Doorflowers, the ROIs extracted with the proposed SVA model are much better than those of the S-scheme, T-scheme, and ST-scheme, because those schemes lack information from the depth or motion channel.
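A three-class MB-level mask of the kind shown in Figure 7(n) could be derived from a pixel-level ROI map along the following lines. This is a hypothetical sketch: the 16x16 MB size is standard, but the coverage threshold and the rule that background MBs touching an ROI MB become transitional are assumptions made for illustration, and only a single transitional ring is modeled.

```python
ROI, TRANSITIONAL, BACKGROUND = "roi", "transitional", "background"

def macroblock_mask(pixel_roi, mb=16, min_fraction=0.5):
    """Label each mb x mb block from a binary pixel map (1 = ROI pixel)."""
    rows, cols = len(pixel_roi) // mb, len(pixel_roi[0]) // mb
    labels = [[BACKGROUND] * cols for _ in range(rows)]
    # Step 1: a block is ROI when enough of its pixels are ROI pixels (assumed rule).
    for r in range(rows):
        for c in range(cols):
            hits = sum(pixel_roi[y][x]
                       for y in range(r * mb, (r + 1) * mb)
                       for x in range(c * mb, (c + 1) * mb))
            if hits >= min_fraction * mb * mb:
                labels[r][c] = ROI
    # Step 2: background blocks in the 8-neighbourhood of an ROI block become transitional.
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != BACKGROUND:
                continue
            neighbours = [labels[rr][cc]
                          for rr in range(max(0, r - 1), min(rows, r + 2))
                          for cc in range(max(0, c - 1), min(cols, c + 2))]
            if ROI in neighbours:
                labels[r][c] = TRANSITIONAL
    return labels

# Toy 64x64 pixel map whose ROI exactly covers macroblock (1, 1).
pixels = [[1 if 16 <= x < 32 and 16 <= y < 32 else 0 for x in range(64)]
          for y in range(64)]
labels = macroblock_mask(pixels)
for row in labels:
    print(row)
```

A transitional ring of this kind gives the encoder a place to step the quantization gradually between ROI and background instead of switching abruptly at the ROI boundary.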
6.2. Subjective Evaluation for SVA-Based ROI Extraction
Subjective evaluation of the SVA-based ROI extraction results has also been performed. A polarization-multiplexed display method is used for displaying stereo video and images. Stereoscopic images are played back through a stereoscopic dual projection system, in which two BenQ P8265 DLP projectors project the left and right view images onto a 150-inch silver screen. Viewers wear polarized glasses to watch the stereo video. The extracted ROI results are randomly ordered and displayed on a traditional monoview LCD display while the stereoscopic video is being displayed via the stereoscopic video system. The experiment is conducted in a special room with ambient illumination, color temperature, and ambient sound controlled according to the requirements of ITU-R Recommendation 500. Twenty participants were recruited on campus, aged from 22 to 32, 7 females and 13 males; 2 participants are experts, 15 participants have some stereoscopic image processing knowledge, and the remaining 3 participants have no image processing knowledge. That is, 18 participants are nonexperts who are not concerned with visual attention or ROI extraction in their normal work. All participants passed the color vision test and met the minimum criteria: visual acuity of 20:30 and stereoscopic visual acuity of 40 arcseconds.
Table 2: z-scores, mean opinion scores (MOS), and standard errors of MOS for the ROI extraction schemes.
As shown in Table 2, for five sequences, namely Champagne tower, Dog, Doorflowers, Alt Moabit, and Ballet, the ROIs generated by the proposed SVA-based ROI extraction scheme have the highest z-scores, which means these ROIs best match people's preferences. However, for the Breakdancers sequence, the z-score of the ST-scheme is 0.401 (better than the SVA scheme) because the sequence has dramatically high-speed motion that attracts more attention. For the Pantomime sequence, the proposed SVA scheme is ranked second because the sequence has a simple background and provides relatively weak stereoscopic perception; in addition, the extracted ROIs of the four schemes are quite similar and hard to distinguish. Overall, according to the average z-scores, the proposed SVA extraction scheme achieves the best performance for the test 3DV, followed by the ST-scheme. The S-scheme and T-scheme have relatively low performance and low robustness because they depend heavily on the texture and motion properties of the video sequences.
The middle four rows of Table 2 show the MOS of the ROI rankings, in which a smaller value indicates better performance. As far as MOS is concerned, similar results are found. The proposed SVA-based ROI extraction scheme has the best performance, as it has the lowest MOS for five test sequences and the lowest average MOS. The last four rows give the standard errors of MOS. The deviation for the SVA scheme (0.99 on average) is larger than that for the ST-scheme (0.77 on average), because depth sensation varies from person to person. While viewing the stereo video and images, some non-expert viewers seem to be more sensitive to depth perception. In contrast, expert viewers pay more attention to motion, textural, or semantic areas because they are already familiar with the depth sensation.
6.3. SVA-Based Regional Bit Allocation Optimization for MVC
To determine the optimal QP difference used in the MVC scheme, video coding experiments are implemented on JMVM 7.0 with the MVC-HBP prediction structure over a set of bQP and QP-difference values. The multiview video sequences Ballet and Breakdancers are adopted because together they have both slow and fast motion characteristics. Eight views and 91 frames in each view (6 GOPs with a GOP length of 15) are encoded. The two transitional-area parameters are empirically set as 3 and 6 for the first- and second-level transitional areas, respectively.
6.4. MVC Experiments
SVA-based MVC experiments are implemented on the JMVM 7.0 reference software with seven multiview video sequences and their ROI masks (Ballet, Breakdancers, Doorflowers, Alt Moabit, Pantomime, Champagne tower, and Dog) to evaluate the effectiveness of the proposed SVA-based bit allocation. The MVC-HBP prediction structure is adopted for the MVC simulation. Eight views are encoded with a GOP length of 15, fast motion/disparity estimation is enabled, and the search range is 64. There are three kinds of pictures in the MVC-HBP prediction structure: intracoded pictures (I-pictures), interpredicted pictures (P-pictures), and hierarchical bidirectionally predicted pictures (B-pictures). In the coding experiments, all B- and P-pictures are coded with regional bit allocation optimization, and I-pictures are coded with the original MVC scheme without bit allocation optimization. The bQP is set as 12, 17, 22, 27, or 32, and the QPs of the background and ROI are set according to (16) using the optimal value obtained in Figure 15. PMOS_SSIM and PMOS_PSNR are adopted to evaluate the image quality of the reconstructed video frames.
Objective image quality and coding bits corresponding to Figure 18.
Because people usually pay less attention to background regions and more attention to ROIs, the HVS is less sensitive to distortion in the background regions than in the ROIs. As a result, high image quality is required in the ROIs. For the Ballet multiview video sequence, the ΔPSNR of the ROI is 0.46 dB, while the PSNR of the background decreases. This means that the proposed SVA-based MVC scheme improves the image quality of the ROI by up to 0.46 dB; meanwhile, to improve the compression ratio, it allocates fewer bits to the background regions at the cost of their PSNR. In the proposed MVC scheme, the image quality of the ROIs is better than that of the background regions, that is, the PSNR of the ROI exceeds the PSNR of the background, which meets the requirements of the HVS. Thus, the perceived quality of the reconstructed images is improved. When evaluated by the regional selective image quality metrics, ΔPMOS_SSIM is 0.78 and ΔPMOS_PSNR is −0.70, which means that the difference between the qualities of reconstructed images coded by the proposed MVC scheme and by JMVM is tiny and imperceptible. More importantly, the bit-rate saving is 21.06% compared with the JMVM benchmark. Similar results are found for the Breakdancers, Doorflowers, Alt Moabit, and Dog sequences. For the Pantomime and Champagne tower sequences, because the background regions are very flat and smooth, MBs in these regions are coded with SKIP/DIRECT mode and only very few bits are allocated to them by the original JMVM; thus, relatively low saving ratios, 8.19% and 8.58%, are achieved by the proposed MVC scheme.
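For reference, a bit-rate saving ratio of this kind is conventionally computed as the relative reduction in coded bits with respect to the benchmark. The sketch below assumes that definition; the function name and the bit counts are made-up illustrative values, not measurements from the experiments.

```python
def bitrate_saving_percent(bits_benchmark, bits_proposed):
    """Relative reduction in coded bits with respect to the benchmark, in percent."""
    if bits_benchmark <= 0:
        raise ValueError("benchmark bit count must be positive")
    return 100.0 * (bits_benchmark - bits_proposed) / bits_benchmark

def delta_psnr(psnr_proposed, psnr_benchmark):
    """Quality difference in dB; positive means the proposed scheme is better."""
    return psnr_proposed - psnr_benchmark

# Hypothetical numbers for illustration only.
saving = bitrate_saving_percent(bits_benchmark=1000, bits_proposed=800)
gain = delta_psnr(psnr_proposed=40.5, psnr_benchmark=40.0)
print(saving, gain)
```

A positive saving together with a positive ROI ΔPSNR and a small background loss is the combination the regional bit allocation targets.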
In summary, the proposed MVC scheme achieves a significant bit-rate saving; meanwhile, the ROIs' image quality is improved at the cost of imperceptible quality degradation in the background regions. Additionally, the PSNR_Y of the ROI is better than that of the background, which meets the requirements of the HVS. Moreover, the proposed MVC scheme can save over 20% bit rate with imperceptible image quality degradation according to the evaluation of the regional selective image quality metrics.
A stereoscopic visual attention- (SVA-) based regional bit allocation optimization scheme is proposed to improve the compression efficiency of MVC. We proposed a bottom-up SVA model to simulate the visual attention mechanisms of the human visual system with stereoscopic perception. This model adopts multiple low-level perceptual stimuli, including color, intensity, orientation, motion, depth, and depth contrast. Then the semantic region-of-interest (ROI) is extracted based on the saliency maps of SVA. The proposed model not only simulates the stereoscopic visual attention of human eyes efficiently but also reduces the noise in each stimulus channel. Based on the extracted semantic ROIs, a regional bit allocation optimization scheme is also proposed for high compression efficiency by exploiting visual redundancies. Experimental results on MVC showed that the proposed bit allocation algorithm can achieve over 20% bit-rate saving at high bit rate while maintaining the same objective and subjective image quality. Meanwhile, the image quality of ROIs is improved at the cost of imperceptible image quality degradation in background regions, to which the human visual system is less sensitive. It can be foreseen that stereoscopic visual attention will play a more important role in areas such as content-oriented three-dimensional video processing, video retrieval, and computer vision in the future.
The Interactive Visual Media Group at Microsoft Research, HHI, and Nagoya University kindly provided the authors with multiview video sequences and depth maps. Thanks are due to Dr. Sam Kwong for many good suggestions and much help. This work is supported by the Natural Science Foundation of China (Grants 60872094 and 60832003) and the 863 Project of China (2009AA01Z327). It was also sponsored by the K. C. Wong Magna Fund in Ningbo University.
- Muller K, Merkle P, Wiegand T: Compressing time-varying visual content. IEEE Signal Processing Magazine 2007, 24(6):58-65.View ArticleGoogle Scholar
- Tanimoto M: Overview of free viewpoint television. Signal Processing: Image Communication 2006, 21(6):454-461. 10.1016/j.image.2006.03.009Google Scholar
- Merkle P, Smolic A, Müller K, Wiegand T: Multi-view video plus depth representation and coding. Proceedings of the International Conference on Image Processing (ICIP '07), 2007, San Antonio, Tex, USA 1: 201-204.Google Scholar
- Survey of algorithms used for multi-view video coding (MVC) ISO/IEC JTC1/SC29/WG11, N6909, Hong Kong, China, January 2005Google Scholar
- Yea S, Vetro A: View synthesis prediction for multiview video coding. Signal Processing: Image Communication 2009, 24(1-2):89-100. 10.1016/j.image.2008.10.007Google Scholar
- Yun Z, Jiang GY, Mei Y, Yo SH: Adaptive multiview video coding scheme based on spatiotemporal correlation analyses. ETRI Journal 2009, 31(2):151-161. 10.4218/etrij.09.0108.0350View ArticleGoogle Scholar
- Merkle P, Müller K: Efficient prediction structures for multiview video coding. IEEE Transactions on Circuits and Systems for Video Technology 2007, 17(11):1461-1473.View ArticleGoogle Scholar
- Lu Z, Lin W, Yang X, Ong E, Yao S: Modeling visual attention's modulatory aftereffects on visual sensitivity and quality evaluation. IEEE Transactions on Image Processing 2005, 14(11):1928-1942.View ArticleGoogle Scholar
- Ohm J-R: Encoding and reconstruction of multiview video objects. IEEE Signal Processing Magazine 1999, 16(3):47-54. 10.1109/79.768572View ArticleGoogle Scholar
- Han J, Ngan KN, Li M, Zhang H-J: Unsupervised extraction of visual attention objects in color images. IEEE Transactions on Circuits and Systems for Video Technology 2006, 16(1):141-145.View ArticleGoogle Scholar
- Ma Y-F, Hua X-S, Lu L, Zhang H-J: A generic framework of user attention model and its application in video summarization. IEEE Transactions on Multimedia 2005, 7(5):907-919.View ArticleGoogle Scholar
- Itti L, Koch C: Computational modelling of visual attention. Nature Reviews Neuroscience 2001, 2(3):194-203. 10.1038/35058500View ArticleGoogle Scholar
- Treisman AM, Gelade G: A feature-integration theory of attention. Cognitive Psychology 1980, 12(1):97-136. 10.1016/0010-0285(80)90005-5View ArticleGoogle Scholar
- Zhai G, Chen Q, Yang X, Zhang W: Scalable visual sensitivity profile estimation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), April 2008, Las Vegas, Nev, USA 873-876.Google Scholar
- Zhai Y, Shah M: Visual attention detection in video sequences using spatiotemporal cues. Proceedings of the 14th Annual ACM International Conference on Multimedia (MM '06), October 2006, Santa Barbara, Calif, USA 815-824.View ArticleGoogle Scholar
- Wang PP, Zhang W, Li J, Zhang Y: Realtime detection of salient moving object: a multi-core solution. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), April 2008, Las Vegas, Nev, USA 1481-1484.Google Scholar
- Kaminsky E, Grois D, Hadar O: Dynamic computational complexity and bit allocation for optimizing H.264/AVC video compression. Journal of Visual Communication and Image Representation 2008, 19(1):56-74. 10.1016/j.jvcir.2007.05.002
- Lu Y, Xie J, Li H, Cui H: GOP-level bit allocation using reverse dynamic programming. Tsinghua Science and Technology 2009, 14(2):183-188. 10.1016/S1007-0214(09)70028-8
- Shen L, Liu Z, Zhang Z, Shi X: Frame-level bit allocation based on incremental PID algorithm and frame complexity estimation. Journal of Visual Communication and Image Representation 2009, 20(1):28-34. 10.1016/j.jvcir.2008.08.003
- Özbek N, Tekalp AM: Content-aware bit allocation in scalable multi-view video coding. Proceedings of the Multimedia Content Representation, Classification and Security (MRCS '06), September 2006, Lecture Notes in Computer Science 4105: 691-698.
- Chen Z, Han J, Ngan KN: Dynamic bit allocation for multiple video object coding. IEEE Transactions on Multimedia 2006, 8(6):1117-1124.
- Wang H, Schuster GM, Katsaggelos AK: Rate-distortion optimal bit allocation for object-based video coding. IEEE Transactions on Circuits and Systems for Video Technology 2005, 15(9):1113-1123.
- Chi M-C, Chen M-J, Yeh C-H, Jhu J-A: Region-of-interest video coding based on rate and distortion variations for H.263+. Signal Processing: Image Communication 2008, 23(2):127-142. 10.1016/j.image.2007.12.001
- Tang C-W, Chen C-H, Yu Y-H, Tsai C-J: Visual sensitivity guided bit allocation for video coding. IEEE Transactions on Multimedia 2006, 8(1):11-18.
- Kauff P, Atzpadin N, Fehn C, Müller M, Schreer O, Smolic A, Tanger R: Depth map creation and image-based rendering for advanced 3DTV services providing interoperability and scalability. Signal Processing: Image Communication 2007, 22(2):217-234. 10.1016/j.image.2006.11.013
- Zhang Y, Jiang G, Yu M, Chen K: Stereoscopic visual attention model for 3D video. Proceedings of the International Multimedia Modeling Conference (MMM '10), January 2010, Lecture Notes in Computer Science 5916: 314-324.
- Itti L, Koch C, Niebur E: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 1998, 20(11):1254-1259. 10.1109/34.730558
- Barron JL, Fleet DJ, Beauchemin SS: Performance of optical flow techniques. International Journal of Computer Vision 1994, 12(1):43-77. 10.1007/BF01420984
- Itti L, Koch C: Feature combination strategies for saliency-based visual attention systems. Journal of Electronic Imaging 2001, 10(1):161-169. 10.1117/1.1333677
- Tanimoto M, Fujii T, Suzuki K: Improvement of depth map estimation and view synthesis. ISO/IEC JTC1/SC29/WG11, M15090, Antalya, Turkey, January 2008.
- Qi F, Wu JJ, Shi GM: Extracting regions of attention by imitating the human visual system. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '09), April 2009, Taipei, Taiwan 1905-1908.
- Zhang Y, Jiang GY, Yu M, Yang Y, Peng ZJ, Chen K: Depth perceptual region-of-interest based multiview video coding. Journal of Visual Communication and Image Representation 2010, 21(5-6):498-512. 10.1016/j.jvcir.2010.03.002
- Takagi K, Takishima Y, Nakajima Y: A study on rate distortion optimization scheme for JVT coder. Visual Communication and Image Processing, July 2003, Lugano, Switzerland, Proceedings of SPIE 5150: 914-923.
- Engelke U, Nguyen VX, Zepernick H-J: Regional attention to structural degradations for perceptual image quality metric design. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), March 2008, Las Vegas, Nev, USA 869-872.
- Wang Z, Bovik AC, Sheikh HR, Simoncelli EP: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 2004, 13(4):600-612. 10.1109/TIP.2003.819861
- Feldmann I, Mueller M, Zilly F, et al.: HHI test material for 3D video. ISO/IEC JTC1/SC29/WG11, M15413, Archamps, France, April 2008.
- Zitnick CL, Kang SB, Uyttendaele M, Winder S, Szeliski R: High-quality video view interpolation using a layered representation. In Proceedings of ACM SIGGRAPH Transactions on Graphics, August 2004, Los Angeles, Calif, USA. ACM; 600-608.
- Tanimoto M, Fujii T, Fukushima N: 1D parallel test sequences for MPEG-FTV. ISO/IEC JTC1/SC29/WG11, M15378, Archamps, France, April 2008.
- Stankiewicz O, Wegner K: Depth map estimation software version 2. ISO/IEC JTC1/SC29/WG11, M15338, Archamps, France, April 2008.
- ITU-R Recommendation BT.500-11: Methodology for the subjective assessment of the quality of television pictures. 2002.
- Rajae-Joordens R, Engel J: Paired comparisons in visual perception studies using small sample sizes. Displays 2005, 26(1):1-7. 10.1016/j.displa.2004.09.003
- Vetro A, Pandit P, Kimata H, Smolic A, Wang YK: Joint multiview video model (JMVM) 7.0. Joint Video Team of ITU-T VCEG and ISO/IEC MPEG, Antalya, Turkey; January 2008.
- Bjontegaard G: Calculation of average PSNR differences between RD-curves. ITU-T VCEG, VCEG-M33, April 2001.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.