Comparing an FPGA to a Cell for an Image Processing Application
© Ryan N. Rakvic et al. 2010
Received: 2 December 2009
Accepted: 8 March 2010
Published: 18 April 2010
Modern advancements in configurable hardware, most notably Field-Programmable Gate Arrays (FPGAs), have provided an exciting opportunity to exploit the parallel nature of modern image processing algorithms. On the other hand, PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to run complex image processing algorithms at high performance. In this research project, our aim is to study the differences in performance of a modern image processing algorithm on these two hardware platforms. In particular, Iris Recognition Systems have recently become an attractive identification method because of their extremely high accuracy. Iris matching, a repeatedly executed portion of a modern iris recognition algorithm, is parallelized on an FPGA system and a Cell processor. We demonstrate a 2.5 times speedup of the parallelized algorithm on the FPGA system when compared to a Cell processor-based version.
For most of the history of computing, the amazing gains in performance we have experienced were due to two factors: decreasing feature size and increasing clock speed. However, there are fundamental physical limits to this approach: decreasing feature size gets more and more expensive and difficult due to the physics of the photolithographic process used to make CPUs, and increasing clock speed results in a corresponding increase in power consumption and heat dissipation requirements. Parallel computation has been in use for many years in high performance computing; however, in recent years, multicore architectures have become the dominant computer architecture for achieving performance gains. The signal of this shift away from ever increasing clock speeds occurred when Intel Corporation cancelled development of its new single core processors to focus development on dual core technology. Executing programs in parallel on hardware specifically designed with parallel capabilities is the new model to increase processor capabilities while not entering into the realm of extensive cooling and power requirements.
The Cell processor is a joint effort by Sony Computer Entertainment, Toshiba Corporation, and IBM that began in 2000, with the goal of designing a processor with an order of magnitude higher performance than desktop systems shipping in 2005. The result was the first-generation Cell Broadband Engine (BE) processor, which is a multicore chip composed of a 64-bit Power Architecture processor core and eight synergistic processor cores. A high-speed memory controller and high-bandwidth bus interface are also integrated on-chip.
According to IBM, the Cell BE is capable of achieving in many cases 10 times the performance of the latest PC processors. The first major commercial application of the Cell processor was in Sony's PlayStation3 game system. The PlayStation3 has only six SPU cores available because one core is reserved by the OS and one core is disabled in order to increase production yields. Sony has made it very easy to install a new Linux-based operating system onto the PlayStation3, thereby making the game system a popular choice for experimenting with the Cell BE.
Historically, programmers have thought in sequential terms, and programming these multicore processors can be difficult. Often, this involves completely redesigning an existing program from the ground up and implementing complex synchronization protocols. Parallel programming is based on the simple idea of division of labor: large problems can be broken up into smaller ones that can be worked on simultaneously. Making it more challenging is the fact that the synergistic processor elements (SPEs) in the Cell do not share memory with the Power processor element (PPE). Additionally, the SPEs are not visible to the operating system, thereby leaving all management of SPE code and data to the programmer.
Another popular approach to parallelization is to use Field Programmable Gate Arrays (FPGAs). FPGAs are complex programmable logic devices that are essentially a "blank slate" integrated circuit from the manufacturer and can be programmed with nearly any parallel logic function. They are fully customizable and the designer can prototype, simulate, and implement a parallel logic function without the costly process of having a new integrated circuit manufactured from scratch. FPGAs are commonly programmed via VHDL (VHSIC Hardware Description Language). VHDL statements are inherently parallel, not sequential. VHDL allows the programmer to dictate the type of hardware that is synthesized on an FPGA. For example, if you would like to have many ALUs that execute in parallel, then you program this in the VHDL code.
In this work, we have parallelized a repeatedly executed portion of an image processing algorithm on both an FPGA and a Cell processor. In Section 2 we present the Iris Recognition Algorithm and iris template matching. In Section 3, we present an approach to iris matching utilizing parallel logic with field-programmable gate arrays and Cell processors. In Section 4 we demonstrate this efficiency with a comparison between the FPGA, the Cell processor, and a sequential processor. We provide concluding statements in Section 5.
2. Iris Recognition Algorithm
Iris recognition stands out as one of the most accurate biometric methods in use today. One of the first iris recognition algorithms was introduced by pioneer Dr. John Daugman. Many iris recognition algorithms now exist; an alternate one, referred to as the Ridge Energy Direction (RED) algorithm, will be the basis for this work. What follows is a brief description of the RED algorithm. Since this research is focused on computational acceleration, we refer the reader to [6–12] for further detail on the recognition algorithms themselves.
Once the iris is segmented, the algorithm divides it into concentric annuli and radial lines, which results in a rectangular representation of the iris indexed by annulus and radial line. This step is effectively a rectangular to polar coordinate conversion. The energy of each pixel is merely the square of the value of the infrared intensity within the pixel and is used to distinguish features within the iris. The next step is to encode the iris image from two-dimensional brightness data down to a two-dimensional binary signature, referred to as the template ("Template Generation" in Figure 2). To accomplish this, the energy data are passed into two directional filters to determine the existence of ridges and their orientation. The RED algorithm uses this directional filtering to generate the iris template, a set of bits that meaningfully represents a person's iris.
A template mask is also created during this filtering process. If neither filter output value is above a certain threshold, then the mask bit is cleared for that particular pixel location. The template mask therefore identifies pixel locations where neither a vertical nor a horizontal direction is identified.
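The per-pixel encoding step can be sketched in C++ as follows. This is only an illustration under stated assumptions: the names `EncodedPixel` and `encode_pixel` are hypothetical, the `threshold` value is not given in the text, and the rule that the template bit records which filter response is larger is an assumption about the encoding rather than something stated here. Only the mask rule (the bit is cleared when neither response exceeds the threshold) comes from the description above.

```cpp
#include <cassert>

// Result of encoding one pixel: one template bit and one mask bit.
struct EncodedPixel {
    bool template_bit;  // ASSUMPTION: true when the vertical response dominates
    bool mask_bit;      // true = valid, false = cleared (no direction identified)
};

// Encode one pixel from its two directional filter responses.
// `threshold` is a hypothetical tuning parameter; the text gives no value.
inline EncodedPixel encode_pixel(double vertical, double horizontal,
                                 double threshold) {
    EncodedPixel out;
    // ASSUMPTION: the template bit records which direction is stronger.
    out.template_bit = vertical > horizontal;
    // From the text: the mask bit is cleared when neither filter output
    // is above the threshold.
    out.mask_bit = (vertical > threshold) || (horizontal > threshold);
    return out;
}
```

A pixel with a weak response in both directions therefore still receives a template bit, but its cleared mask bit excludes it from any later comparison.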
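Two iris templates are compared by computing the fractional Hamming distance (HD) between them:

$$\mathrm{HD} = \frac{\lVert (\mathrm{template}_A \otimes \mathrm{template}_B) \cap \mathrm{mask}_A \cap \mathrm{mask}_B \rVert}{\lVert \mathrm{mask}_A \cap \mathrm{mask}_B \rVert} \tag{1}$$

The $\otimes$ operator is the exclusive-or operation used to detect disagreement between corresponding bit pairs in the two templates, $\cap$ represents the binary AND function, $\lVert \cdot \rVert$ counts the bits that are set, and masks A and B identify the values in each template that are not corrupted by artifacts such as eyelids/eyelashes and specularities. The denominator of (1) ensures that only valid bits are included in the calculation, after artifacts are discounted. The lower the HD result, the greater the match between the two irises being compared. The fractional Hamming distance between two templates is compared to a predetermined threshold value and a match or nonmatch declaration is made.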
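As a reference point, the fractional Hamming distance described above can be written directly with `std::bitset`. This is a minimal sketch, not the authors' code: the 2048-bit template width is taken from the matching discussion later in the paper, the names `IrisBits` and `fractional_hd` are hypothetical, and returning 1.0 when no bit is valid in both masks is an assumption made only to avoid division by zero.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

constexpr std::size_t kTemplateBits = 2048;  // template width used in this paper
using IrisBits = std::bitset<kTemplateBits>;

// Fractional Hamming distance between two templates.
// Numerator: bits that differ AND are valid in both masks.
// Denominator: bits that are valid in both masks.
inline double fractional_hd(const IrisBits& template_a, const IrisBits& template_b,
                            const IrisBits& mask_a, const IrisBits& mask_b) {
    const IrisBits valid = mask_a & mask_b;
    const std::size_t valid_count = valid.count();
    if (valid_count == 0) {
        return 1.0;  // ASSUMPTION: treat "nothing comparable" as the worst score
    }
    const std::size_t differing = ((template_a ^ template_b) & valid).count();
    return static_cast<double>(differing) / static_cast<double>(valid_count);
}
```

A caller would compare the returned score with its own decision threshold; no threshold value is stated in the text, so none is built in here.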
The HD calculation, or iris matching, is critical to the throughput performance of iris recognition since this task is repeated many times, as shown in Figure 4. Traditional systems for HD calculation have been coded in sequential logic (software); databases have been spread across multiple processors to take advantage of the parallelism of the database search, but the inherent parallelism of the HD calculation itself has not been fully exploited.
3.1. Sequential on a CPU
Currently, iris recognition algorithms are deployed globally in a variety of systems ranging from computer access to building security to national-scale databases. These systems typically use central processing unit- (CPU-) based computers. CPU-based computers are general purpose machines, designed for all types of applications, and are to first order programmed as sequential machines, though there are provisions for multiprocessing and multithreading. Recently, there has been an interest in exploring the parallel nature of this application. It is challenging to exploit the inherent parallelism of many algorithms in such architectures.
We would like to highlight the sequential nature of this code. For example, since the XOR function is performed 32 bits at a time, a loop (denoted by the for statement) is necessary. Since it is computing 2048 bits, this loop is executed 64 times. Also, note that the XOR and AND computations are performed sequentially. These instructions could be scheduled to execute in parallel, but a modern CPU has a limited number of functional units, which limits the amount of parallel execution. Summation of the bits is performed using lookup tables. Finally, the HD score is computed as the ratio of the number of differences between the templates to the total number of bits that are not masked.
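The sequential structure described above can be sketched as follows. This is a hedged reconstruction from the prose, not the authors' listing: the names `TemplateWords`, `make_bit_count_table`, `count_bits`, and `sequential_hd` are hypothetical, the byte-wide lookup table is one plausible choice of table size, and returning 1.0 when every bit is masked is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kWords = 64;  // 2048 bits / 32 bits per word
using TemplateWords = std::array<std::uint32_t, kWords>;
using BitCountTable = std::array<std::uint8_t, 256>;

// Build a 256-entry table where table[v] is the number of set bits in byte v.
inline BitCountTable make_bit_count_table() {
    BitCountTable table{};
    for (std::size_t v = 1; v < 256; ++v) {
        table[v] = static_cast<std::uint8_t>((v & 1u) + table[v >> 1]);
    }
    return table;
}

// Count the set bits of a 32-bit word with four byte-wide table lookups.
inline unsigned count_bits(std::uint32_t w, const BitCountTable& t) {
    return t[w & 0xFFu] + t[(w >> 8) & 0xFFu] +
           t[(w >> 16) & 0xFFu] + t[(w >> 24) & 0xFFu];
}

// Sequential HD: one 32-bit XOR/AND step per loop iteration, 64 iterations.
inline double sequential_hd(const TemplateWords& a, const TemplateWords& b,
                            const TemplateWords& mask_a,
                            const TemplateWords& mask_b) {
    static const BitCountTable table = make_bit_count_table();
    unsigned differing = 0;
    unsigned valid = 0;
    for (std::size_t i = 0; i < kWords; ++i) {
        const std::uint32_t both = mask_a[i] & mask_b[i];
        differing += count_bits((a[i] ^ b[i]) & both, table);
        valid += count_bits(both, table);
    }
    if (valid == 0) {
        return 1.0;  // ASSUMPTION: nothing comparable counts as the worst score
    }
    return static_cast<double>(differing) / static_cast<double>(valid);
}
```

Each iteration depends only on word `i`, which is the independence that the parallel implementations exploit.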
3.2. Parallel on an FPGA
Field Programmable Gate Arrays (FPGAs) are complex programmable logic devices that are essentially a "blank slate" integrated circuit from the manufacturer and can be programmed with nearly any parallel logic function. They are fully customizable and the designer can prototype, simulate, and implement a parallel logic function without the costly process of having a new integrated circuit manufactured from scratch. FPGAs are commonly programmed via VHDL (VHSIC Hardware Description Language). VHDL statements are inherently parallel, not sequential. VHDL allows the programmer to dictate the type of hardware that is synthesized on an FPGA. Ideally, if 2,048 matching elements could fit onto the FPGA, all 2048 bits of the template could be compared at once, with a corresponding increase in throughput. Here we perform the same function as the aforementioned C++ code. However, we are doing this computation completely in parallel. There are 2,048 XOR gates and 4,096 AND gates required for this computation. In addition, adders are required for summing and calculating the score.
This code is contained within a "process" statement. The process statement is only initiated when a signal in the sensitivity list changes value. The sensitivity list of the process contains the clock signal, and therefore the code is executed once per clock cycle. In this code, the clock signal is drawn from our FPGA board, which contains a 50 MHz clock. Therefore, every 20 ns, this Hamming distance calculation is computed. This code is fully synthesizable and can be downloaded onto an FPGA for direct hardware execution.
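The timing claim is simple arithmetic, sketched here in C++ rather than VHDL. The function names `clock_period_ns` and `search_time_ns` are hypothetical, and the model assumes exactly one match completes per clock cycle, as the text states for this design.

```cpp
#include <cassert>
#include <cstdint>

// Clock period in whole nanoseconds for a clock frequency given in Hz.
// Integer division truncates for frequencies that do not divide 1e9 evenly.
inline std::uint64_t clock_period_ns(std::uint64_t clock_hz) {
    assert(clock_hz > 0);
    return 1000000000ULL / clock_hz;
}

// Time to compare one probe against `templates` stored templates when one
// match completes every clock cycle (pipeline fill and I/O are ignored).
inline std::uint64_t search_time_ns(std::uint64_t templates,
                                    std::uint64_t clock_hz) {
    return templates * clock_period_ns(clock_hz);
}
```

With the 50 MHz board clock, `clock_period_ns(50000000)` gives 20 ns, so under this model one million comparisons would take 20 ms.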
3.3. Parallel on a CELL
We have also parallelized the HD calculation on the Cell processor on the PlayStation3. As stated before, SPE management is left entirely to the programmer. We therefore have completely separate code and compilations for the PPE and the SPEs. The code on the PPE works as a master, spawning off threads of work to the 6 individual SPEs. The work is divided up on iris template matching boundaries, not within a template match. Therefore, each SPE is individually responsible for 1/6th of the HD comparisons. To maximize performance, the HD calculation is vectorized on the SPEs, taking advantage of the SIMD capabilities of the SPUs.
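The division of work on template boundaries can be illustrated with a small partitioning routine. This is a sketch in portable C++, not Cell SDK code: `WorkRange` and `partition_matches` are hypothetical names, and handing any remainder to the lowest-numbered workers is an assumption, since the text only says each SPE receives one sixth of the comparisons.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of template-match indices for one worker.
struct WorkRange {
    std::size_t begin;
    std::size_t end;
};

// Split `total` whole template matches across `workers` (6 SPEs on the PS3).
// No single match is ever split between two workers.
inline std::vector<WorkRange> partition_matches(std::size_t total,
                                                std::size_t workers) {
    assert(workers > 0);
    std::vector<WorkRange> ranges;
    ranges.reserve(workers);
    const std::size_t base = total / workers;
    const std::size_t extra = total % workers;  // ASSUMPTION: first workers absorb it
    std::size_t begin = 0;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t count = base + (w < extra ? 1 : 0);
        ranges.push_back(WorkRange{begin, begin + count});
        begin += count;
    }
    return ranges;
}
```

On the Cell, the PPE would hand each range to one SPE thread, and each SPE would then run the vectorized HD loop over its own range.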
The CPU experiment is executed on an Intel Xeon X5355 workstation-class machine. The processor is equipped with 8 cores, a 2.66 GHz clock, and an 8 MB L2 cache. While eight cores are available, only one core is used to perform this test, leaving all cache and memory resources to the code under test. The HD code was compiled under Windows XP using the Visual Studio software suite. The code has been fully optimized to enhance performance. Additionally, millions of matches were executed to ensure that the templates are fully cached in the on-chip L2 cache. We report the best-case per-match execution time.
The PlayStation3 is used for our Cell experiments. Fedora Core 8 was chosen for installation onto the PlayStation3. Fedora Core 8 is not the most recent release of Fedora but was chosen because it is the most recent release that has been fully adapted to the PlayStation3. Additionally, the installation procedures available online for FC8 are the most detailed and complete of any Linux distribution. Furthermore, the IBM SDK, which is required for writing code that runs on the Cell's SPUs, is specifically only released for the commercial Red Hat Enterprise Edition Linux or the freely available Fedora Core.
The FPGA experiment is executed on a DE2 board provided by Altera Corporation. The DE2 board includes a Cyclone-II EP2C35 FPGA chip, as well as the required programming interface. Although the DE2 board is utilized for this research, only the Cyclone-II chip is necessary to execute our algorithm. The Cyclone-II family is designed for high-performance, low-power applications. It contains over 30,000 logic elements (LE) and over 480,000 embedded memory bits. In order to program our VHDL onto the Cyclone-II, we utilize the Altera Quartus software for implementation of our VHDL program. The Quartus suite includes compilation, synthesis, simulation, and programming environments. We are able to determine the size required of our program on the FPGA, and the resulting execution time.
The optimized C++ code time is actually faster than some of the times reported in the literature for commercial implementations. We attribute this difference to improvements in CPU speed and efficiency between the time of our experiments and the previous reports. However, this indicates that our C++ code is a reasonable target for comparison and that we may reasonably expect similar improvements from application of FPGA technology to other HD-based algorithms.
All VHDL code is fully synthesizable and is downloaded onto our DE2 for direct hardware execution. As discussed above, our code is fully contained within a "process" statement. The process statement is only initiated when a signal in its sensitivity list changes values. The sensitivity list of our process contains the clock signal and therefore the code is executed once per clock cycle. In this code, the clock signal is drawn from our DE2 board which contains a 50 MHz clock. Therefore, every 20 ns, our calculation is computed.
Table 1: FPGA versus CPU comparison for iris match execution.

| | Optimized Xeon Code on PS3 | CELL (with 6 SPEs) | Cyclone-II EP2C35 (50 MHz) | Cyclone-II estimated @ 100 MHz | Stratix IV estimated @ 500 MHz |
| Time per match (ns) | | | 20 ns | 10 ns (est) | 2 ns (est) |
| Speedup over Xeon | | | | 38 | 190 |
| % usage of chip | | | 73% | | approx. 7.3% |
In the Cyclone-II FPGA, there are over 400,000 memory bits available for on-chip storage. The iris templates must be stored either in memory on the FPGA or off-chip. In one instance of our implementation, we have implemented a 2048-bit wide memory in VHDL. We have added this to our code to verify that a small database can be stored on chip. One of the two templates compared is received from this dual-ported, 2048-bit wide, single-cycle cache implemented on our Cyclone-II FPGA. Therefore, once per clock cycle, a 2048-bit vector is fetched from on-chip memory, and the HD calculation is performed. Again, therefore, the entire process can be executed in 20 ns. We have successfully implemented and tested the HD calculation with and without a memory device.
Also reported in Table 1 is the utilization of the FPGA resources. Our implementation of the Hamming Distance algorithm utilizes 73% of our Cyclone-II FPGA. In terms of on-chip memory usage, one of the two templates compared is stored in the dual-ported, 2048-bit wide, single-cycle cache implemented on our Cyclone-II FPGA. Each stored template consumes 0.7% of on-chip memory. We have added this to our code to verify that a small database of approximately 230 can be stored on chip.
The Cyclone-II is not built for performance and is also not a state-of-the-art design. A projection of the performance of a faster Cyclone-II (100 MHz) and a state-of-the-art Stratix IV (500 MHz) FPGA is given in Table 1. A still modest Cyclone version clocked at 100 MHz is able to outperform the sequential version by a factor of 38. The faster Stratix IV is projected to perform approximately 190 times faster than the sequential version. Additionally, our implementation on the Stratix IV would only consume approximately 7.3% of the chip. On-chip memory for the Stratix IV is also much larger, with 22.4 Mbits of on-chip storage. For example, a database consisting of 10,000 irises can be stored on the Stratix IV. We anticipate this storage scaling trend to continue into the future, with larger and larger database storage becoming available. If a larger database is necessary, we propose an implementation where a DRAM chip is provided as part of the package, and the on-chip database is concurrently loaded while Hamming distances are being computed. In addition, with a larger FPGA, it is possible to compute multiple matches in parallel. This available parallelism is also demonstrated in Table 1.
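The storage and speedup figures above can be cross-checked with a few lines of arithmetic. This sketch rests on assumptions that are not stated in the text: each database entry is taken to occupy exactly one 2048-bit template (mask storage and memory-block granularity are ignored), 1 Mbit is taken as 10^6 bits, and the function names are hypothetical. The baseline time it derives is an inference from the quoted speedups, not a figure reported in the text.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kTemplateBits = 2048;  // one stored template

// Number of whole templates that fit in `memory_bits` of on-chip memory.
// ASSUMPTION: one 2048-bit entry per iris, with no per-entry overhead.
inline std::uint64_t templates_that_fit(std::uint64_t memory_bits) {
    return memory_bits / kTemplateBits;
}

// Sequential per-match time implied by an FPGA match time and its quoted
// speedup. This is a consistency check on the quoted numbers only.
inline std::uint64_t implied_baseline_ns(std::uint64_t fpga_ns,
                                         std::uint64_t speedup) {
    return fpga_ns * speedup;
}
```

Under these assumptions, roughly 480,000 bits hold about 234 templates and 22.4 Mbits hold about 10,937, in line with the "approximately 230" and "10,000" figures quoted in the text. Both projected speedups (38 at 10 ns and 190 at 2 ns) imply the same sequential time of about 380 ns per match.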
The trend in modern computing is toward a multicore design. In this research, we are interested in the performance of a modern multicore Cell processor compared to an FPGA for an image processing algorithm. We demonstrate that a vital portion of an iris recognition algorithm can be parallelized on both systems, and that the FPGA implementation is 2.5 times faster than the Cell processor implementation. FPGAs have been on an impressive scaling trend over the last 10 years. We expect this scaling trend to continue in the short term, and we even believe that an FPGA could potentially be a part of the General Purpose Computer of tomorrow.
- Synergistic processing in cell's multicore architecture http://www.research.ibm.com/people/m/mikeg/papers/2006_ieeemicro.pdf
- Cell Broadband Engine Programming. IBM Developer Works; https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1741C509C5F64B3300257460006FD68D
- Cell Broadband Engine https://www-01.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine
- Daugman J: Probing the uniqueness and randomness of iriscodes: results from 200 billion iris pair comparisons. Proceedings of the IEEE 2006, 94(11):1927-1935.
- Ives RW, Broussard RP, Kennell LR, Rakvic RN, Etter DM: Iris recognition using the ridge energy direction (RED) algorithm. Proceedings of the 42nd Annual Asilomar Conference on Signals, Systems and Computers, November 2008, Pacific Grove, Calif, USA 1219-1223.
- Park C-H, Lee J-J, Smith MJT, Park K-H: Iris-based personal authentication using a normalized directional energy feature. Proceedings of Audio and Video Based Biometric Person Authentication Conference, 2003 2688: 224-232.
- Chen Y, Dass SC, Jain AK: Localized iris image quality using 2-D wavelets. Proceedings of the International Conference on Biometrics (ICB '06), January 2006, Hong Kong 373-381.
- Shao S, Xie M: Iris recognition based on feature extraction in kernel space. Proceedings of the IEEE Biometrics Symposium, September 2006, Baltimore, Md, USA
- Broussard RP, Kennell LR, Ives RW: Identifying discriminatory information content within the iris. Biometric Technology for Human Identification V, March 2008, Orlando, Fla, USA, Proceedings of SPIE
- Gupta G, Agarwal M: Iris recognition using non filter-based technique. Proceedings of the Biometrics Symposium, September 2005, Arlington, Va, USA 45-47.
- Ives RW, Kennell L, Broussard R, Soldan D: Iris recognition using directional energy. Proceedings of the IEEE International Conference on Image Processing (ICIP '08), October 2008, San Diego, Calif, USA
- Masek L: Recognition of human iris patterns for biometric identification, M.S. thesis. The University of Western Australia, Perth Crawley, Australia; 2003. http://www.csse.uwa.edu.au/~pk/studentprojects/libor/LiborMasekThesis.pdf
- Kennell L, Ives RW, Gaunt RM: Binary morphology and local statistics applied to iris segmentation for recognition. Proceedings of the IEEE International Conference on Image Processing (ICIP '06), October 2006, Atlanta, Ga, USA
- Daugman J: Statistical richness of visual phase information: update on recognizing persons by iris patterns. International Journal of Computer Vision 2001, 45(1):25-38. 10.1023/A:1012365806338
- Broussard RP, Rakvic RN, Ives RW: Accelerating iris template matching using commodity video graphics adapters. Proceedings of the 2nd IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS '08), September 2008, Crystal City, Va, USA
- Intel Corporation June 2008, http://www.intel.com/products/processor/manuals/index.htm
- Intel Corporation June 2008, http://processorfinder.intel.com/details.aspx?sSpec=SL9YM
- Altera Corporation June 2008, http://www.altera.com/education/univ/materials/boards/unv-de2-board.html
- Altera Corporation June 2008, http://www.altera.com/products/devices/cyclone2/cy2-index.jsp
- Daugman J: How iris recognition works. IEEE Transactions on Circuits and Systems for Video Technology 2004, 14(1):21-30. 10.1109/TCSVT.2003.818350
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.