Rate-distortion-optimized multi-view streaming in wireless environment using network coding

Greco, Claudio; D. Nemoianu, Irina; Cagnazzo, Marco; Pesquet-Popescu, Béatrice

doi:10.1186/s13634-016-0308-4

Research
Open access
Published: 09 February 2016

Claudio Greco¹,
Irina D. Nemoianu²,
Marco Cagnazzo ORCID: orcid.org/0000-0001-6731-3755³ &
…
Béatrice Pesquet-Popescu³

EURASIP Journal on Advances in Signal Processing volume 2016, Article number: 17 (2016)

Abstract

Multi-view video streaming is an emerging video paradigm that enables new interactive services, such as 3D video, free viewpoint television, and immersive teleconferencing. Because of their high bandwidth cost, multi-view streaming applications can greatly benefit from the use of network coding, in particular in transmission scenarios such as wireless networks, where the channels have limited capacity and are affected by losses. In this paper, we address the topic of cooperative streaming of multi-view video content, wherein users who recently acquired the content can contribute parts of it to their neighbors by providing linear combinations of the video packets. We propose a novel method for selection and network encoding of the transmitted frames based on the users’ preferences for the different views and the rate-distortion properties of the stream. Using network coding enables the users to retrieve the content in a faster and more reliable manner and without the need for coordination among the senders. Our experimental results show that our preference-based approach provides high-quality decoding even when the uplink capacity of each node is only a small fraction of the rate of the stream.

1 Introduction

In recent years, the advances in video acquisition, compression, transmission, and rendering have made possible the development of technologies that can enhance the viewers’ experience by including the third dimension. While traditional 2D video offers the viewer only a passive view of the scene, a more realistic experience can be obtained through applications such as 3D video or free viewpoint selection. 3D cinema productions have already generated big revenues, but other applications such as 3DTV and Free Viewpoint TV (FTV) [1, 2] are also becoming more desirable due to the increased affordability of 3D displays for home use.

Multi-view video (MVV) is one of the key elements of these applications; it consists of the simultaneous representation of a scene captured by N cameras placed in different spatial positions, called points of view.

By using more than two cameras during video acquisition, adjacent views act like local stereo pairs to guarantee stereoscopy to the viewer. This can be used to synthesize virtual views different from the acquired ones. This functionality is used in FTV where the user interactively controls the viewpoint in the scene. On the other hand, since 3D video could not be deployed if the quality perceived by the user does not exceed the existing 2D quality standards, the bandwidth for storage and transmission of the multiple views is accordingly increased.

A first solution for multi-view video transmission, known as simulcast [3], is to compress and send each view independently [4]. While simple to implement and backward compatible with the existing infrastructures, this technique does not take into account the redundancy due to the similarities among the views that can be used to further compress the data. On the other hand, it allows for easier switching between views, as the lack of inter-view prediction makes the views independently decodable.

The multi-view video coding (MVC) extension of the H.264/MPEG-4 AVC standard [5] exploits inter-view dependency in a simple yet effective way: images from other views (but at the same time instant) can be used as references for the current frame prediction (inter-view prediction). This is the only major change introduced in the MVC extension of H.264. The MVC extension of HEVC, referred to as MV-HEVC, is based on very similar principles [6]. With MVC, two main coding schemes are particularly worth mentioning: view progressive and fully hierarchical. In the view progressive architecture, the first view, called the base view, is encoded independently from the others. In any other view, for each group of pictures (GOP), there is one frame, the V-frame, that is predicted using only inter-view prediction from the corresponding I-frame in the base view. For all other frames, only temporal prediction is used. In the second architecture, both hierarchical temporal prediction and inter-view prediction are performed for all P/B-frames of all views except for the base view. These tools allow a rate reduction, for the same subjective quality, estimated at around 50 % with respect to the case of independent view coding (simulcast) [5].

Even though much of the recent research attention in 3D has been attracted by depth-based formats [7] (which allow virtual viewpoint synthesis), the interest in MVV coding is still very high, as witnessed by the activity of the ad hoc group on free viewpoint TV and super-multi-view video (i.e., video with more than 30 views, and holoscopic video) [8–10]. The quality of synthesized views generated with depth data is still questionable, to such a point that it is still not completely clear whether depth-based formats have a clear advantage over MVV or super-MVV, above all when subjective quality is considered [11]. In summary, super-MVV still seems to be a serious candidate for FTV and 3D video services [12].

Multi-view streaming becomes an even more challenging task in the context of mobile networking, where the high bitrate of multi-view content adds to the existing problems of mobile networking. Even though streaming applications are nowadays commonplace, and the technology involved has greatly advanced in the past few years [13, 14], in a wireless network it is difficult to meet the inherent requirement of continuous delivery necessary for an uninterrupted presentation of the content. The nodes move freely and independently in all directions, so the channel conditions of the links and the links themselves are unreliable and erratic, and individual nodes may connect and disconnect asynchronously [15].

Also, in the context of a streaming application, it would be desirable to have the quality of the received media degrade gracefully as the network environment and resources change and to tolerate losses to some extent. Even though techniques to provide graceful degradation and loss immunity exist, these usually require an increase in the bitrate of the stream, a condition that could be difficult to satisfy in a wireless network, where the nodes’ uplink capacity is typically quite limited.

One positive aspect of wireless networks with respect to video streaming is the inherently broadcast nature of the medium. This makes it more straightforward for a sender to multicast the content to several receivers, and it also allows a single receiver to collect video packets from several servers.

Recently, good results have been achieved, in the context of mobile video streaming, by exploiting the broadcast nature of the medium through the construction of video packet delivery overlays [16, 17]. These logical networks, built on top of the actual wireless network through the cooperation of nodes, make it possible to provide a streaming service with good video quality and graceful degradation.

However, these techniques were designed for single-view streams and relied on the use of multiple description coding (MDC) [18], a joint source-channel coding technique that does not combine well with multi-view video: MDC incurs an additional bitrate cost, and the bitrate of multi-view streams is already considerable.

In this article, we propose to use network coding for the robust delivery of MVV and super-MVV over an unreliable network such as a wireless network. To do so, we design a rate-distortion-optimized (RDO) scheduling algorithm that, at each sending opportunity, selects which video packet is to be added to the coding window, in such a way as to minimize the expected video distortion measured at the receiver. This optimization takes into account the preferences of the users in terms of requested views, an approach already successfully exploited for video caching of single-view streams in mobile environments [19]. Since the wireless medium is inherently broadcast, we exploit the fact that each receiver could be exposed to multiple senders. We thus ensure that senders transmit innovative packets (i.e., packets carrying novel information with respect to those already sent) even though they do not coordinate their actions.

The particularity of the coding structures of the multi-view representation is reflected in a non-trivial impact of each coded frame on the overall quality of the reconstructed multi-view content. If this impact is properly captured, it can be used to design an intelligent transmission scheme that allocates the limited channel capacities in a rate-distortion-optimized order (scheduling). In order to effectively disseminate the content to the end users, an analogous scheme can be devised to schedule the frames for transmission [20].

Network coding (NC) [21] has been proposed as an elegant and effective solution for multi-view transmission. In NC, instead of merely relaying packets, the intermediate nodes of a network send linear combinations of the packets they have previously received, with random coefficients taken from a finite field. The coding coefficients, needed to reconstruct the original packets, are typically sent along with the combinations as headers [22–25], unless more advanced reconstruction schemes are implemented at the receiver side [26, 27]. Used as an alternative to traditional routing, NC has proved beneficial to real-time streaming applications, both in terms of maximization of the throughput and in terms of reduction of the effects of losses [28–33].
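To make the notion of a coded packet concrete, the following minimal sketch builds one random linear combination of equal-length packets together with its coefficient header. It is an illustration only: the function name `encode_combination` and the prime field of size 257 are assumptions made here for brevity (practical systems commonly work over a binary extension field such as GF(2^8)), and packet symbols are represented as integers smaller than the field size.

```python
import random

P = 257  # assumed prime field size, chosen only to keep the arithmetic simple


def encode_combination(packets, rng):
    """Return (coeffs, payload): one random linear combination over GF(P).

    `packets` is a list of equal-length lists of integers in [0, P).
    `coeffs` plays the role of the header sent along with the payload.
    """
    coeffs = [rng.randrange(P) for _ in packets]
    length = len(packets[0])
    payload = [
        sum(c * pkt[i] for c, pkt in zip(coeffs, packets)) % P
        for i in range(length)
    ]
    return coeffs, payload


if __name__ == "__main__":
    source = [[1, 2, 3], [4, 5, 6]]
    header, mixed = encode_combination(source, random.Random(0))
    print(header, mixed)
```

A receiver that collects as many linearly independent headers as there are source packets can recover the originals by solving the corresponding linear system.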

In an NC-based transmission system, rather than sending the data packets, the users send mixed packets. The advantage of this technique is that even though the users act independently from each other, with high probability, each of them will contribute innovative information to the transmission [20, 34]. In the most common implementation of network coding, referred to as practical network coding (PNC) [25], the content is divided into groups of packets known as generations, and only packets belonging to the same generation can be mixed together. In our system, each packet contains only one encoded frame, and we only mix frames belonging to the same GOP. The set of packets actually used to generate a mixture is referred to as the coding window.

One technique based on the network coding principles has been proposed by Wang et al. [35] for peer-to-peer video-on-demand applications. More recently, Kao et al. [36] proposed a general framework able to provide an interactive streaming service, i.e., allowing random access operations to the users. However, neither of these techniques addresses the multi-view case, nor takes into account the rate-distortion properties of the stream, nor the users’ preferences.

Other existing works have tackled the subject of distributed video services, achieving similar properties, by proposing to use rateless codes—conceptually similar to network coding—for video delivery [37, 38]. However, even though these techniques have been proposed for video delivery, only the delay requirements of video streaming have been exploited, while our method is tailored for multi-view video content and in particular it uses the prediction structure of the encoded sequence in its optimization algorithm. It should be noted that in our method, a proper RDO-based scheduling is performed in order to provide the users with the best possible video quality given the limited channel capacity allocated to each node.

The rest of this article is organized as follows. In Section 2, we review some recent works closely related to our problem. Then, in Section 3, we present the system model, detailing and motivating our assumptions. In Section 4, we describe the selection method used to decide which frames will be included in the coding window of the transmitting nodes. In Section 5, we present the experimental validation of the proposed technique and analyze the results. Finally, in Section 6, we draw our conclusions and point out some directions for future work.

2 Related work

Unlike previous works on multi-view streaming, we neither focus on the source encoding of the content nor consider each client as an independent agent. Instead, we study how the distribution of the stream can take advantage of a priori knowledge about the different clients and, in particular, of the fact that they share common preferences, in this case in terms of preferred view.

Examples of work in the context of multi-view streaming that take user preferences into account include the source rate allocation technique proposed in [39] and the joint source-channel coding scheme introduced in [40].

While these works consider similar applications as ours, we address here a substantially different problem, in which the multi-view video has been already encoded, and we must decide, at each sending opportunity, about which parts of the content have to be included in the coding window for transmission. We also consider the case when the preference estimation used to decide the packet scheduling does not perfectly correspond to the actual user preferences.

In our work, we also rely on a network coding scheme that allows for the prioritization of certain packets with respect to others. Several works exist that make use of similar schemes, in which the video stream is divided into layers of priority and unequal error protection is given to the different layers using PNC.

For instance, in [29], a receiver-driven network coding strategy is proposed, where the receiving peers request packets from classes with varying importance. Packet classes are constructed based on the unequal contribution of the various video packets to the overall quality of the presentation or in scalable video streams. Prioritized transmission is achieved by varying the number of packets from each class that are used in network coding operations. The coding operations are driven by the children nodes that determine the optimal amount of coding allocated to each importance class of the data to which they subscribe.

The work in [29] was later extended to the case of multi-view video in [41]. Cameras’ streams are organized into layered subsets, with subsets organized based on their priority levels. These prioritized layers are transmitted in an unequal error protection (UEP) fashion, so that more important subsets are sent more reliably. Inter-view dependencies are built based on the subset organization; views from a given subset can depend on views of the same subset or lower ones. In this way, since lower subsets are more likely to be received than higher ones, every time a view has to be decoded, the reference view on which it depends has most likely already been received.

This work is related to ours both for its use of network coding and its application to multi-view content. However, there are notable differences both in the model of the service provided to the user and, as a consequence, in the utility function that is maximized.

In the scenario envisioned in this work, users request viewpoints that are, in general, synthesized from camera views either by coinciding with one of them or by using depth-image-based rendering on a pair of camera views bracketing the synthetic viewpoint. The distortion to minimize depends on the spatial distance between the synthetic view and each of the camera views used to reconstruct it. Priority, in the sense of a higher redundancy to ensure reception in the face of losses, is defined based on the utility of camera view subsets in reconstructing the synthetic views requested by the users.

In our work, on the other hand, the users are only interested in camera views, i.e., no view synthesis is used. This implies that, while in the abovementioned work different combinations of received camera views can satisfy the view request of a user, with different levels of distortion depending on their distance, in our scheme only the exact camera view the user is interested in can increase that user's quality of experience.

Furthermore, in our scheme, priority is not intended in the sense of loss protection but rather in the sense of arrival order. The different treatment of layers is not intended to differentiate the likelihood of their reception but rather the delay experienced by the users before they can start displaying the content. For this reason, while the network coding scheme used in [41] varies the number of packets from each layer in the coding window, in our scheme, all packets from lower layers are introduced in the coding window before any packet of a higher layer is introduced.

Notice that the work in [41] only considers the case of aligned and equally spaced cameras, so that correlation between views decreases with their distance. In a more recent work [42], the same authors extend this model to optimize other settings, but that work does not address the communication aspects.

Another relevant approach to video transmission from multiple senders is proposed in [43], wherein the authors jointly tackle the problem of defining an optimal schedule and an optimal network coding strategy using a prioritizing network coding scheme. Unlike ours, this work only considers the case of single-view content; therefore, there are no preferences to be taken into account, and the optimal schedule is unique. Furthermore, in order to find an optimal solution, this technique requires some degree of coordination among the senders, whereas we assume that coordination is not feasible and rely on randomization in order to circumvent this limitation.

3 System model

In order to optimize the rate-distortion performance of the transmitted content, we select the frames to be included in the coding window based on their popularity among the users. Before explaining in detail our proposed technique, in this section, we list and justify some assumptions about the system that will be used in the design of the technique.

From the point of view of the network, we assume that the users are connected in a (generally partial) mesh network in which each node can potentially receive from multiple servers. This reflects the case of wireless networks and in particular ad hoc networks. Furthermore, we assume that the connectivity among the users can be modeled with a set of independent channels, each of them having a given capacity C, expressed as a fraction of the encoded video bitrate. When C=100 %, each node is able to transmit all the packets of a GOP in the time allocated to a GOP. Still, these packets may be lost on the channels. We consider two models for these channels: a simple packet erasure channel (PEC) with loss rate ε, and a Gilbert-Elliott erasure channel (GE), characterized by loss rates in the good and bad states (ε_G and ε_B) and by transition probabilities (p_GB and p_BG). Notice that each channel does not necessarily provide sufficient capacity for transferring the whole multi-view stream. Our study will focus on the video quality achieved by a generic receiver R exposed to M senders or sources S_1,…,S_M. This scenario is represented in Fig. 1.
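The two channel models and the capacity constraint can be sketched as follows. This is a minimal simulation under stated assumptions, not the simulator used in the experiments: the function names are hypothetical, the Gilbert-Elliott chain is assumed to start in the good state and to change state after each packet, and the number of sending opportunities per GOP is assumed to be the capacity percentage applied to the GOP size, rounded down.

```python
import random


def pec_losses(n, eps, rng):
    """Packet erasure channel: each of n packets is lost independently with rate eps."""
    return [rng.random() < eps for _ in range(n)]


def ge_losses(n, eps_g, eps_b, p_gb, p_bg, rng):
    """Gilbert-Elliott erasure channel with a good and a bad state.

    Assumption: the chain starts in the good state and transitions after each packet.
    """
    bad = False
    lost = []
    for _ in range(n):
        eps = eps_b if bad else eps_g
        lost.append(rng.random() < eps)
        # State transition: good -> bad with p_gb, bad -> good with p_bg.
        if bad:
            if rng.random() < p_bg:
                bad = False
        elif rng.random() < p_gb:
            bad = True
    return lost


def sending_opportunities(gop_size, capacity_percent):
    """Packets a node can send per GOP (assumes equal-size packets, rounds down)."""
    return gop_size * capacity_percent // 100


if __name__ == "__main__":
    rng = random.Random(42)
    print(sum(pec_losses(1000, 0.1, rng)), "losses on the PEC")
    print(sum(ge_losses(1000, 0.01, 0.5, 0.05, 0.3, rng)), "losses on the GE channel")
```

The loss rates and transition probabilities in the usage lines are arbitrary illustrative values, not the ones used in the experimental section.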
Fig. 1 Simulated scenario for each receiver. I(v,k) and $\widehat{I}(v,k)$ are, respectively, the original and reconstructed versions of frame k of view v. S_m, m=1,…,M are the senders (or sources), NC_m the network coding modules, C_m the capacities of the channels, and RX is the buffer of receiver R
From the point of view of the content, we assume that the stream is encoded using H.264/MVC [5] or a similar inter-view prediction scheme, such as MV-HEVC [6]. In our experiments, the stream is encoded using the prediction structure depicted in Fig. 2, with N=5 views and W=8 pictures per view in a GOP. This structure is a compromise between view progressive and fully hierarchical MVC that uses inter-view prediction in order to achieve a better coding efficiency but is not fully hierarchical in order to reduce the dependencies among the frames, thus reducing the propagation of the effects of losses. However, it should be noted that our study can easily be extended to other coding techniques and prediction structures of multi-view content.
Fig. 2 Prediction structure used to encode the multi-view stream with temporal and inter-view prediction. Labels indicate prediction level. This structure provides a good trade-off between coding efficiency and loss propagation. Each row represents the timeline of a different view
For the users’ preferences, we assume that the choice of the preferred view for each user follows the same, known distribution. Even though the proposed method could be applied to any preference model, how the preference distribution is learned and tracked is outside the scope of this article and is not addressed in the following. However, these preferences may be easily learned and spread over the network with approaches similar to those shown in [17].
We assume that the preference distribution does not change too fast over time, that is, we assume that it can be considered valid for at least the duration of a GOP, defined as an independently decodable set of N×W frames, as depicted in Fig. 2. This implies that our system is able to work even when users’ preferences change as frequently as once per GOP, which typically lasts less than 1 s. Any change in preferences during a GOP will be taken into account at the next GOP.

An example of the complete system is shown in Fig. 3. The video server S sends the encoded video packets together with side information about RD characteristics of the sequence. Nodes 1 to 9 relay the video using the proposed system.

We focus on a given node receiving the video sequence from M sources (or senders), performing network coding and relaying the video to downlink nodes. For example, node 6 sees M=4 sources, i.e., nodes 1 to 4. Node 8 sees M=2 sources, i.e., nodes 5 and 6. We propose an algorithm to decide the order of inclusion of frames in the coding window. We assume (for simplicity) that nodes do not compete for capacity but that the available capacity may be less than the video coding bitrate. We model each channel’s capacity as a percentage of the encoded video bitrate, and we assume that each node has view preferences according to a given probability distribution.

4 Proposed method

In this section, we describe our proposed method of network encoding for a wireless streaming of multi-view video content based on the users’ preferences.

As we mentioned in Section 1, most practical implementations of NC are achieved by segmenting the data flow into generations and combining only packets belonging to the same generation. Packets are padded to the same length. All packets in a generation are jointly decoded, by solving a linear system, as soon as enough linearly independent combinations have been received. Since the coefficients are taken from a finite field, perfect reconstruction is assured.
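The joint decoding step can be illustrated with Gauss-Jordan elimination over a finite field. The sketch below is a minimal example under the same simplifying assumptions as a toy setting would require: the function name `decode_generation` is hypothetical, and a prime field of size 257 is assumed so that modular inverses are available through Python's built-in `pow` (Python 3.8 or later). It returns `None` when the received combinations do not yet span the whole generation.

```python
P = 257  # assumed prime field size (practical systems often use GF(2^8) instead)


def decode_generation(coeff_rows, payloads):
    """Solve A x = Y over GF(P) by Gauss-Jordan elimination.

    coeff_rows[i] is the coefficient header of the i-th received combination and
    payloads[i] its mixed payload. Returns the source packets, or None if the
    received combinations are not of full rank.
    """
    n = len(coeff_rows[0])
    rows = [list(c) + list(y) for c, y in zip(coeff_rows, payloads)]
    rank = 0
    for col in range(n):
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] % P), None)
        if pivot is None:
            return None  # not enough innovative combinations yet
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, P)
        rows[rank] = [(x * inv) % P for x in rows[rank]]
        # Clear this column in every other row.
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(a - f * b) % P for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return [rows[i][n:] for i in range(n)]


if __name__ == "__main__":
    # Two combinations of the source packets [10, 20] and [30, 40].
    print(decode_generation([[1, 2], [3, 4]], [[70, 100], [150, 220]]))
```

Because all operations are exact in the finite field, the recovered packets equal the originals whenever the coefficient matrix has full rank, which is the sense in which reconstruction is perfect.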

It has been proposed [29] to apply NC to video content delivery, dividing the video stream into layers of priority and providing unequal error protection for the different layers via PNC. Layered coding requires that all users receive at least the base layer, hence all received packets must be stored in a buffer until a sufficient number of independent combinations are received, which introduces a decoding delay that may be undesirable in real-time streaming applications.

There exist several techniques aimed at reducing the decoding delay, proposed by both the NC and the video coding communities. In our technique, we use an implementation of random linear network coding referred to as expanding window network coding (EWNC) [28, 32]. The key idea of EWNC is to increase the size of the coding window (i.e., the set of packets in the generation that may appear in combination vectors) for each new packet. Using Gaussian elimination at the receiver side, this method provides instant decodability of packets. Thanks to this property, EWNC is preferable over PNC in streaming applications. Even though PNC could achieve almost instant decodability using a small generation size, this would be ineffective in a wireless network, where a receiver could be surrounded by a large number of senders: if the size of the generation is smaller than the number of senders, some combinations will necessarily be linearly dependent. On the other hand, EWNC automatically adapts the coding window size, allowing early decodability, and innovation (i.e., linear independence) can be achieved if the senders include the packets in the coding window in a different order. However, these orders should take into account the RD properties of the video stream. In our previous work, we already successfully applied EWNC principles to multi-view streaming in the context of wireless networking [44], but we did not take into account the preferences of the users in terms of displayed view.
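The expanding-window idea can be sketched as a generator that adds one frame to the coding window at each transmission, following a sender-specific inclusion order. This is only our reading of the principle described above and may differ in detail from the schemes of [28, 32]: the name `ewnc_stream`, the prime field of size 257, the use of nonzero coefficients, and the frame identifiers (v, k) are assumptions made for illustration.

```python
import random

P = 257  # assumed prime field size, for illustration only


def ewnc_stream(frames, order, n_sent, rng):
    """Yield (window, coeffs, payload) for n_sent transmissions.

    frames: dict mapping a frame id to an equal-length list of integers in [0, P).
    order:  the sender's inclusion order; the window is order[:r] at the r-th
            transmission and stops growing once the whole generation is included.
    """
    for r in range(1, n_sent + 1):
        window = order[:r]
        # Nonzero coefficients, so the newest frame always contributes.
        coeffs = {f: rng.randrange(1, P) for f in window}
        length = len(frames[window[0]])
        payload = [
            sum(coeffs[f] * frames[f][i] for f in window) % P
            for i in range(length)
        ]
        yield window, coeffs, payload


if __name__ == "__main__":
    frames = {(1, 1): [5, 6], (1, 2): [7, 8], (2, 1): [9, 10]}
    order = [(1, 1), (1, 2), (2, 1)]
    for window, coeffs, payload in ewnc_stream(frames, order, 4, random.Random(1)):
        print(len(window), payload)
```

With no losses, the coefficient matrix seen by the receiver is lower triangular with a nonzero diagonal, so each new packet reveals one new frame immediately. Two senders using different inclusion orders produce windows that differ early on, which is what makes their combinations likely to be innovative with respect to each other.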

As mentioned in Section 1, in other works, user preferences were used to optimize the rate allocation in the encoding process. Here, we show how they can be used to decide which parts of the content have to be included in the coding window in order to optimize the rate-distortion properties of the transmitted stream.

We model the distribution of users’ preferences with a probability vector $\vec{p}$, such that $p_v$ is the probability that a member of the group chooses to watch view v∈{1,…,N} for the current GOP.

In our case, the transmitted packets will contain linear combinations of frames belonging to the same GOP. In order to select the order in which the frames will be included in the coding window, which we denote by $\mathcal{W}$, we proceed as follows. For each GOP, all the frames of the current GOP are stored in a bi-dimensional frame buffer $\vec{B}$, with N rows and W columns, where N is the number of views and W is the per-view time length of the GOP. For clarity, a summary of the notation used in this article is given in Table 1. The maximum possible size of the coding window, i.e., the generation size, will be the size of the GOP, NW, while the current size of the coding window will be denoted r≤NW.
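The notation introduced so far maps onto simple data structures, sketched below. The sketch does not implement the frame selection itself; it only shows the buffer, the preference vector, and the window-size constraint. The uniform preference vector, the helper names `grow_window` and `draw_preferred_view`, and the use of (v, k) pairs as frame identifiers are assumptions made for illustration.

```python
import random

N, W = 5, 8               # five views, eight pictures per view in a GOP
GENERATION_SIZE = N * W   # maximum size of the coding window

# Hypothetical uniform preference vector: p[v - 1] is the probability of view v.
p = [1 / N] * N

# Frame buffer B with N rows (views) and W columns (time instants).
B = [[(v, k) for k in range(1, W + 1)] for v in range(1, N + 1)]


def grow_window(window, frame):
    """Return the coding window extended by one frame, enforcing r <= N*W."""
    if frame in window:
        raise ValueError("frame already in the coding window")
    if len(window) >= GENERATION_SIZE:
        raise ValueError("coding window already spans the whole GOP")
    return window + [frame]


def draw_preferred_view(prefs, rng):
    """Draw the view (1-based) watched by one user for the current GOP."""
    return rng.choices(range(1, len(prefs) + 1), weights=prefs, k=1)[0]


if __name__ == "__main__":
    window = grow_window([], B[0][0])
    print(window, draw_preferred_view(p, random.Random(0)))
```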

Table 1 Summary of the notation used in this article
