Scalable video encoding with macroblock-level parallelism
© Sankaraiah et al.; licensee Springer. 2014
Received: 3 January 2014
Accepted: 1 September 2014
Published: 19 September 2014
The H.264 video codec provides a wide range of compression options and is widely deployed across video recording standards. The compression complexity increases when low-bit-rate video is required; hence, the encoding time is often a major issue when processing a large number of video files. One way to decrease the encoding time is to employ a parallel algorithm on a multicore system. In order to exploit the capability of a multicore processor, a scalable algorithm is proposed in this paper. Most of the parallelization methods proposed earlier suffer from limited scalability and from memory and data dependency issues. In this paper, we present the results obtained using data-level parallelism at the macroblock (MB) level for the encoder. MB-level parallelism is chosen mainly for its low memory requirement. This design allows the encoder to schedule the sequences onto the available logical cores for parallel processing. A load balancing mechanism is added that drives the encoding by macroblock index, thereby eliminating the need for a coordinator thread. In our implementation, a dynamic macroblock scheduling technique is used to improve the speedup. We also replace some of the pointers with advanced data structures to optimize memory usage. The results show that, with the proposed MB-level parallelism, higher speedup values can be achieved.
H.264 is a video coding standard developed jointly by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) through their Joint Video Team (JVT) partnership. H.264 was developed with the aim of providing good-quality video at lower bit rates than previous video compression standards. H.264 also provides the flexibility to serve a broad range of video applications by supporting various bit rates and resolutions. The improved bit rate efficiency of H.264 comes at the cost of increased complexity compared to existing standards, and the higher complexity of the H.264 encoder results in longer encoding time. This creates a need to improve the encoding time of the video for batch processing or real-time applications. Hardware acceleration or a parallel algorithm for a multicore processor is often needed to increase the processing speed of the encoder. Parallel algorithms have become increasingly popular with the spread of multicore processors over the years, even in mobile devices. Parallel algorithms for H.264 encoder design have been discussed in several papers [2–9]. These papers describe different levels of parallelism that can be applied to the H.264 encoder, such as GOP level, frame level, slice level, and macroblock level. Of these, macroblock-level parallelism is often favored for its fine granularity and its ability to prevent any video quality loss relative to its serial counterpart. Macroblock-level parallelism provides good scalability and load balancing. The main concerns in designing a macroblock-level algorithm are the access pattern, the data partitioning, and the load balancing. The macroblock access pattern defines the order in which the data is processed so as to reduce data dependencies. The data partitioning process defines how each macroblock can effectively be assigned to a separate processor core.
The load balancing mechanism ensures that each processor is given a similar amount of work, preventing any processing core from idling or starving.
In general, the threads do not all run in perfect lockstep; there can be a time difference of a few microseconds to milliseconds among them. This leads to load imbalance. In this paper, a dynamic thread scheduling strategy is proposed to solve the load imbalance problem. Furthermore, a new data access pattern technique is proposed to improve the encoding time. Another contribution of this paper is memory optimization using advanced data structures. Based on these strategies, this paper proposes a scalable algorithm that exploits the capability of a multicore processor using macroblock-level parallelism for the video encoder. The remainder of this paper is organized as follows: Section 2 describes previous work related to macroblock parallelism. Section 3 gives an overview of the design considerations for the parallel algorithm. In Section 4, the design and implementation of macroblock-level parallelism for the H.264 encoder are discussed in detail. In Section 5, the experimental results are presented and analyzed. Section 6 presents the conclusion and possible future work.
2 Literature review
Many researchers have been working on parallel algorithms. The popular parallel algorithms proposed so far operate at the GOP, frame, slice, and macroblock levels. Many researchers have implemented macroblock-level parallelism [3–10], but all the methods proposed so far have scalability issues. A method using SIMD instructions has been proposed to improve the encoding time of H.264; however, this approach is too complex to implement on personal computers. The parallel algorithm using the wave-front technique splits a frame into macroblocks and maps these blocks to different processors along the horizontal axis. This technique requires data communication among the parallel processing blocks (except for the outer blocks of a frame), slowing down the encoding process. The speedup values achieved with this implementation are 3.17 and 3.08 for quarter common intermediate format (QCIF) and common intermediate format (CIF) video formats, respectively. The macroblock region partition (MBRP) algorithm adopts the wave-front technique and focuses on reducing the data communication between processors using a new data partitioning method. This data partitioning method assigns a specific macroblock region to each processor, so that neighboring macroblocks are mostly handled by the same processor. However, in this implementation, the waiting time of the processors before starting to encode a new macroblock is high. The speedup values achieved with this method are 3.32 and 3.33 for CIF and standard definition (SD) video formats, respectively. The MBRP algorithm has not been applied to higher resolutions such as high definition (HD) and full high definition (FHD). A further macroblock-level parallelism method has also been reported. In this method, the data partitioning on the macroblocks eliminates the dependency among the macroblocks at the beginning of the encoding process.
Encoding of subsequent frames is initiated only when the reconstructed macroblocks constitute more than half of a frame. This method thus increases the concurrency of the thread-level parallelism to process multiple frames. The speedup values achieved with this method for CIF, SD, and HD video resolutions are around 3.8×. However, in this implementation, the authors used only I and P frames and did not include B frames. The dynamic data partition algorithm for macroblock-level parallelism reduces data communication overhead and improves concurrency. It achieves speedup values of 3.59 for CIF, 3.88 for 4CIF, and 3.89 for HD resolution video formats. Even though good speedup values are obtained, they are not consistent across video formats. Various thread-level techniques have also been proposed to effectively utilize a multicore processor; we have adopted some of these techniques in the proposed algorithm to improve the encoding time.
3 Design considerations
3.1 Data dependencies
3.2 Load balancing
Scalability and load balancing are the two major concerns when parallelizing a program [11, 12]. Scalability refers to the maximum number of threads that can be created for a parallel program. Load balancing is concerned with allocating the same amount of load to all the processing elements so that their execution times are nearly the same. The main challenge in implementing macroblock-level parallelism is to reduce the idle time of the processors: except for the first macroblock, a processor must wait until the reference macroblocks have been encoded. Task balancing at the MB level is performed by determining the execution time of each function dynamically using profiling. We have added a function call to the program that monitors each thread's functions and their execution times and keeps all the threads active, preventing them from going idle through interruptions during task execution. Load balancing is further improved by allowing each of the processing cores to access the shared structure and load the index of a macroblock that it can encode itself, without the need for a coordinator thread. A reference flag is created in the program for each thread to identify the thread's status; flag values of 0 and 1 indicate that the thread is active and not active, respectively. Each thread identifies the starting and ending positions of its region based on its own thread ID and reference flag. Each thread encodes only the macroblocks within its own region, after all the data dependencies are resolved, as shown in Figure 2. However, a thread performs entropy encoding whenever no macroblocks are available for it to encode. The idea is to balance the workload effectively among the threads so that no thread falls into an idle state: threads do not have to wait for macroblocks within their regions to become available. In this way, task balancing at the MB level is achieved dynamically without any thread going idle, which solves the load balancing problem.
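The coordinator-free region assignment can be sketched in C as follows. The function name `mb_region` and the even-split policy are illustrative assumptions for this sketch, not code taken from JM 18.0:

```c
#include <assert.h>

/* Illustrative sketch: each thread derives its own macroblock region
 * from its thread ID, so no coordinator thread is needed.  Names are
 * hypothetical; they do not appear in the JM 18.0 source. */

/* Reference-flag convention from the text: 0 = active, 1 = not active. */
enum { FLAG_ACTIVE = 0, FLAG_INACTIVE = 1 };

/* Compute the half-open range [start, end) of macroblock indices owned
 * by thread `tid` (0-based) when `total_mbs` macroblocks are split
 * across `nthreads` threads as evenly as possible. */
static void mb_region(int tid, int nthreads, int total_mbs,
                      int *start, int *end)
{
    int base = total_mbs / nthreads;   /* minimum MBs per thread */
    int rem  = total_mbs % nthreads;   /* first `rem` threads get one extra */
    *start = tid * base + (tid < rem ? tid : rem);
    *end   = *start + base + (tid < rem ? 1 : 0);
}
```

For example, the 396 macroblocks of a CIF frame split over four threads give each thread a contiguous range of 99 macroblocks, and every thread can compute its own range locally from its ID.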
The parallel algorithm of the encoder is designed in such a way that each thread is independent and continues its execution by checking the status of the reference flag. All the threads execute independently, without sharing cores, so no data race condition occurs; this incurs no extra latency and ultimately improves the overall encoding performance. Another benefit of this configuration is that no thread needs extra waiting cycles to acquire a mutual exclusion (mutex) lock. Whenever no macroblock is available for encoding, a thread enters the shared region, which only one thread may access at a time. Before entering the shared region, each thread acquires a mutex lock to gain access and execute the code inside the shared region. Because lock acquisition does not fail repeatedly, no thread falls into a waiting state; this incurs no extra latency and improves the overall encoding performance. The design therefore solves the thread synchronization problem without data race conditions or thread locking overhead.
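The mutex-protected shared region can be sketched with POSIX threads as below. The job counter is a hypothetical stand-in for the pool of pending entropy-coding tasks; only the locking pattern is the point:

```c
#include <pthread.h>
#include <assert.h>

/* Sketch of the shared region described above: when a thread has no
 * macroblock left in its own region, it enters a mutex-protected
 * section (here, pulling entropy-coding work from a shared pool).
 * The pool is hypothetical; it does not model the real JM structures. */

static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;
static int pending_jobs = 8;   /* stand-in for queued entropy-coding tasks */

/* Returns a job index, or -1 when no work remains. */
static int take_shared_job(void)
{
    int job;
    pthread_mutex_lock(&shared_lock);   /* one thread at a time */
    job = (pending_jobs > 0) ? --pending_jobs : -1;
    pthread_mutex_unlock(&shared_lock);
    return job;
}
```

Because threads enter this region only when their own regions are exhausted, contention on the lock stays low, matching the claim that no thread stalls on repeated failed acquisitions.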
3.3 Data partitioning
3.4 Parallel system design
4 Design and implementation of parallel video coding
4.1 Macroblock-level parallelism design methodology
where t represents the time cycle number, tid represents the thread ID, and MB represents the number of macroblocks per row.
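As a concrete illustration, the classic H.264 wave-front order (each macroblock depends on its left, top-left, top, and top-right neighbors) relates these three quantities as sketched below. This formula is an assumption for illustration; the encoder's actual scheduling relation may differ:

```c
#include <assert.h>

/* Assumed wave-front schedule: with the usual H.264 dependencies, the
 * MB at column x of row y may start two cycles after the MB above and
 * to its right, i.e. at cycle x + 2y.  If thread `tid` encodes row
 * `tid`, then at cycle `t` it works on column t - 2*tid of its row. */

/* Column processed by thread tid at cycle t, or -1 if the thread is
 * still waiting on dependencies (or has finished its row).
 * mb = number of macroblocks per row. */
static int wavefront_column(int t, int tid, int mb)
{
    int x = t - 2 * tid;
    return (x >= 0 && x < mb) ? x : -1;
}
```

Under this relation, thread 0 starts at cycle 0, thread 1 may only start at cycle 2, and so on: the staggered start times are exactly the dependency-induced idle time that the dynamic scheduling of Section 3.2 works to eliminate.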
4.3 Data structures for memory optimization
The source code of JM 18.0 makes extensive use of pointers throughout. All the structures are tangled together by these pointers, which works fine in serial mode. However, these pointers cause memory access problems when the code is parallelized: when a structure is made private to a processing core, the pointer's value is made private but not the memory location it points to. There are two problems to consider in tackling this issue. The first occurs when a structure in the higher hierarchy has a pointer pointing back to a structure in the lower hierarchy. The second occurs when a structure has two pointers, one pointing to a structure in the higher hierarchy and the other pointing to a structure in the lower hierarchy. In the original source code of JM 18.0, such pointers can be replaced with suitable data structures. To do this, the encoding parameter structure p_Enc (in the JM 18.0 encoder code) is made a global variable, and all the structures are linked directly or indirectly to the p_Enc structure.
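The aliasing problem described above can be demonstrated with a small C sketch. The two structures below are simplified stand-ins, not the actual JM 18.0 types:

```c
#include <assert.h>

/* Simplified stand-ins for the JM structure hierarchy.  SafeBlock
 * embeds its storage, so copying the struct also copies the data;
 * UnsafeBlock holds only a pointer, so a "private" copy of the struct
 * still aliases the same shared memory. */

typedef struct {
    int data[4];   /* embedded storage: duplicated on struct copy */
} SafeBlock;

typedef struct {
    int *data;     /* pointer: only the address is duplicated */
} UnsafeBlock;
```

Making a per-thread copy of `UnsafeBlock` leaves both copies pointing at the same array, which is precisely why the JM pointers had to be replaced with embedded data structures or anchored to the single global p_Enc.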
4.4 Cache performance metrics
MB-level parallelism is performed on all parallel threads without any dependencies by dynamically detecting each thread's status using the reference flag, so that the subsequent frames processed by each thread do not depend on the results of other threads. Each thread accesses only a specific portion of the memory (of its core) without altering the existing memory mapping structure, and each thread writes its results before reading them. As a result, processor cores rarely need to access external DRAM, and memory bandwidth bottlenecks are avoided. For multicore architectures, this is a main benefit of a flexible shared memory subsystem: it minimizes the data exchanges between pipeline stages and enables non-blocking handshaking between tasks.
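The write-before-read discipline can be illustrated with a short C sketch in which each thread fills only its own contiguous slice of a shared result array; the names and the per-MB computation are illustrative:

```c
#include <assert.h>

/* Sketch of the write-before-read pattern: each thread writes results
 * only into its own disjoint slice of a shared array, so no thread
 * ever reads memory that another thread may still be writing, and no
 * locks are needed on the result buffer. */

#define MB_TOTAL 8          /* tiny frame for illustration */

static int results[MB_TOTAL];

/* Thread `tid` of `nthreads` fills its own contiguous slice
 * (MB_TOTAL is assumed divisible by nthreads in this sketch). */
static void encode_region(int tid, int nthreads)
{
    int per = MB_TOTAL / nthreads;
    for (int i = tid * per; i < (tid + 1) * per; i++)
        results[i] = i * i;   /* stand-in for an encoded macroblock */
}
```

Because the slices are disjoint and each element is written before anything reads it, the threads never exchange cache lines for the result buffer, which is the memory bandwidth benefit claimed above.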
4.5 Simulation environment
In this implementation, an Intel i7 platform is used for simulation as a four-physical-core system and as an eight-logical-core system using hyper-threading technology. Each core has an independent L1 data cache, and data can be copied from the higher-level caches (L2 and L3) through four channels. To record the encoder's elapsed time, all existing native services and processes on the cores are closely monitored and controlled. It is also important to ensure that the computer is not running any additional background tasks during encoding, as these would add overhead on the processor. The experimental results are obtained for the H.264 high profile using I, P, and B frames. The experiments are conducted using the JM 18.0 reference software, compiled with Microsoft Visual Studio 2010 (Microsoft, Redmond, WA, USA), on an Intel i7 platform configured as follows: Intel Core™ i7 CPU 930 running at 2.8 GHz, with four 32-KB L1 D-caches, four 32-KB L1 I-caches, four 256-KB 8-way set-associative L2 caches, an 8-MB 16-way set-associative L3 cache, and 8 GB of RAM. The operating system used is Windows 7 Professional, 64-bit. The following additional settings are used to create the testing environment:
All external devices are disconnected from the computer except the keyboard and mouse.
All network adapter drivers are disabled.
Windows Aero, Gadgets, and Firewall are disabled.
Visual effects are set for best performance.
The power setting is changed to “Always on” for all devices.
All extra Windows features are removed, with the exception of the Microsoft .NET Framework.
All simulations are performed under this controlled environment, and the encoder's elapsed time is recorded using the VTune Amplifier of Intel Parallel Studio 2011 and AMD CodeAnalyst. Memory leaks are analyzed using Intel Parallel Inspector 2011. The parallel programming is implemented using OpenMP. The video sequences used in the simulation are of QCIF, CIF, SD, and HD resolutions. Scalability is tested by increasing the number of processing cores and applying homogeneous software optimization techniques to each core.
5 Experimental results
The H.264 reference software JM 18.0 is a sequential implementation in the C language. After modifying JM 18.0 with some optimized C data structures, it is parallelized using OpenMP. The simulation is performed using a high-motion video sequence (rush_hour) at different resolutions: QCIF, CIF, SD, and HD. In this implementation, 300 frames are encoded for all sequences, and for each resolution, configurations of two to eight threads are tested.
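The OpenMP parallelization pattern can be sketched as below. This is a minimal illustration assuming a loop body whose iterations are independent; the real per-macroblock work in JM 18.0 is of course far larger, and the cost function here is a stand-in:

```c
#include <assert.h>

/* Minimal OpenMP sketch of a parallel macroblock loop.  The dynamic
 * schedule mirrors the dynamic MB scheduling described in the paper;
 * the loop body is a stand-in for per-MB encoding work. */

#define N_MB 396   /* macroblocks in one CIF frame */

static long encode_frame(void)
{
    long cost = 0;
    /* Each iteration is independent here, so a parallel-for with a
     * reduction over the accumulated cost is safe. */
    #pragma omp parallel for schedule(dynamic) reduction(+:cost)
    for (int mb = 0; mb < N_MB; mb++)
        cost += mb % 7;       /* stand-in for per-MB encoding cost */
    return cost;
}
```

Compiled without -fopenmp, the pragma is ignored and the loop runs serially with the same result, which makes the serial and parallel versions easy to cross-check.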
5.1 CPU performance
5.2 Speedup performance
Speedup comparison with different HD video sequences with different configurations
The speedup values obtained using different video test sequences at HD resolution with two-thread, four-thread, and eight-thread configurations are shown in Table 2.
Table 2 compares the speedup of MBRP with data partitioning, dynamic data partitioning, and our proposed method.
A new scalable method based on macroblock-level parallelism has been presented. The proposed method offers good load balancing, scalability, and higher speedup values compared to existing methods. Unlike existing methods, where one thread is dedicated to assigning macroblock indices, the proposed method uses all the threads to encode macroblocks, leading to good load balancing; this is achieved using a dynamic scheduling technique. To obtain better scalability, the proposed method uses a dynamic data partitioning method. Experimental results show that speedup values close to the theoretical ones can be obtained: 1.97, 3.96, and 7.71 using two, four, and eight threads, respectively. Furthermore, the speedup values remain essentially constant across QCIF, CIF, SD, and HD resolutions and are achieved without degradation in video quality. Although the focus of this paper is on the H.264 encoder, the proposed technique can be applied to other video codecs and computationally intensive applications to speed up processing.
The authors would like to thank Intel Technology Sdn. Bhd. for sponsoring this research.
- Richardson IE: The H.264 Advanced Video Compression Standard. Wiley, London; 2010.
- Luo C, Sun J, Tao Z: The research of H.264/AVC video encoding parallel algorithm. In Second International Symposium on Intelligent Information Technology Application. Shanghai, China; 21–22 Dec 2008:201-205.
- Sankaraiah S, Lam HS, Eswaran C, Abdullah J: GOP level parallelism on H.264 video encoder for multicore architecture. IACSIT, Singapore.
- Lee J, Moon S, Sung W: H.264 decoder optimization exploiting SIMD instruction. Proc. IEEE Asia Pac. Conf. Circuits Syst 2004, 2: 1149-1152.
- Zhao Z, Liang P: A highly efficient parallel algorithm for H.264 video encoder. ICASSP 2006, 5: 489-492.
- Sun S, Wang D, Chen S: A highly efficient parallel algorithm for H.264 encoder based on macro-block region partition. High Perform. Comput. Commun. Lect. Notes Comput. Sci 2007, 4782: 577-585. 10.1007/978-3-540-75444-2_55
- Kim J, Park J, Lee K, Kim JT: Dynamic data partition algorithm for a parallel H.264 encoder. World Acad. Sci. Eng. Technol 2010, 72: 350-353.
- Sankaraiah S, Lam HS, Eswaran C, Abdullah J: Parallel full-HD video decoding for multicore architecture. In Lecture Notes in Electrical Engineering (LNEE). Edited by: Herawan T, Mat Deris M, Abawajy J. Springer, Singapore; 317-324.
- Chen Y, Li EQ, Zhou X, Ge S: Implementation of H.264 encoder and decoder on personal computers. J. Vis. Commun. Image Representation 2006, 17(2):509-532. 10.1016/j.jvcir.2005.05.004
- Ge S, Tian X, Chen YK: Efficient multithreading implementation of H.264 encoder on Intel hyper-threading architectures. In Proceedings of the 2003 Joint Conference of the Fourth International Conference on Information, Communications and Signal Processing and the Fourth Pacific Rim Conference on Multimedia. IEEE, Piscataway; 2003:469-473.
- Sankaraiah S, Lam HS, Eswaran C, Abdullah J: Performance optimization of video coding process on multi-core platform using GOP level parallelism. Int. J. Parallel Program. (Springer) 2014, 42(6):931-947. 10.1007/s10766-013-0267-4
- Kim YIL, Kim JT, Bae S, Baik H, Song HJ: H.264/AVC decoder parallelization and optimization on asymmetric multicore platform using dynamic load balancing. In IEEE International Conference on Multimedia and Expo. Hannover, Germany; 23–26 June 2008:1001-1004.
- Fraunhofer Heinrich Hertz Institute: JM 18.0, 2014. http://iphome.hhi.de/suehring/tml/download/old_jm/jm.18.0.zip. Accessed 30 Nov 2011.
- Taylor S: Optimizing Applications for Multi-core Processors: Using the Intel Integrated Performance Primitives. Intel Press, Santa Clara; 2007.
- Gerber R, Bik AJC, Smith KB, Tian X: The Software Optimization Cookbook: High-Performance Recipes for IA-32 Platforms. Intel Press, Santa Clara; 2005.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.