Open Access

Rapid VLIW Processor Customization for Signal Processing Applications Using Combinational Hardware Functions

  • Raymond R. Hoare1,
  • Alex K. Jones1,
  • Dara Kusic1,
  • Joshua Fazekas1,
  • John Foster1,
  • Shenchih Tung1 and
  • Michael McCloud1
EURASIP Journal on Advances in Signal Processing 2006, 2006:046472

https://doi.org/10.1155/ASP/2006/46472

Received: 12 October 2004

Accepted: 12 July 2005

Published: 2 March 2006

Abstract

This paper presents an architecture that combines VLIW (very long instruction word) processing with the capability to introduce application-specific customized instructions and highly parallel combinational hardware functions for the acceleration of signal processing applications. To support this architecture, a compilation and design automation flow is described for algorithms written in C. The key contributions of this paper are as follows: (1) a 4-way VLIW processor implemented in an FPGA, (2) large speedups through hardware functions, (3) a hardware/software interface with zero overhead, (4) a design methodology for implementing signal processing applications on this architecture, and (5) tractable design automation techniques for extracting and synthesizing hardware functions. Several design tradeoffs for the architecture were examined, including the number of VLIW functional units and the register file size. The architecture was implemented on an Altera Stratix II FPGA. The Stratix II device was selected because it offers a large number of high-speed DSP (digital signal processing) blocks that execute multiply-accumulate operations. We tested our methodology and architecture for accelerating software using the MediaBench benchmark suite. Our combined VLIW processor with hardware functions was compared to software executing on a RISC processor, specifically the soft-core embedded NIOS II processor. For software kernels converted into hardware functions, we show a hardware performance multiplier of up to times that of software with an average times faster. For the entire application, in which only a portion of the software is converted to hardware, the performance improvement is as much as 30X over the nonaccelerated application, with a 12X improvement on average.

[1–62]

Authors’ Affiliations

(1) Department of Electrical and Computer Engineering, University of Pittsburgh

References

  1. Altera Corporation: Stratix II Device Handbook, Volume 1. Available on-line: http://www.altera.com
  2. Xilinx Incorporated: Virtex-4 Product Backgrounder. Available on-line: http://www.xilinx.com
  3. Lattice Semiconductor Corporation: LatticeECP and EC Family Data Sheet. Available on-line: http://www.latticesemi.com
  4. Apple Computer Inc: Optimizing with SHARK, Big Payoff, Small Effort.
  5. Suresh DC, Najjar WA, Vahid F, Villarreal JR, Stitt G: Profiling tools for hardware/software partitioning of embedded applications. Proceedings of ACM SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES '03), June 2003, San Diego, Calif, USA 189-198.
  6. De Micheli G, Ku D, Mailhot F, Truong T: The Olympus synthesis system. IEEE Design and Test of Computers 1990, 7(5):37-53. 10.1109/54.60605
  7. Lavagno L, Sentovich E: ECL: a specification environment for system-level design. Proceedings of 36th Design Automation Conference (DAC '99), June 1999, New Orleans, La, USA 511-516.
  8. Gupta S, Dutt N, Gupta R, Nicolau A: SPARK: a high-level synthesis framework for applying parallelizing compiler transformations. Proceedings of 16th IEEE International Conference on VLSI Design (VLSI Design '03), January 2003, New Delhi, India 461-466.
  9. Gupta S, Savoiu N, Dutt N, Gupta R, Nicolau A: Using global code motions to improve the quality of results for high-level synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2004, 23(2):302-312. 10.1109/TCAD.2003.822105
  10. Jones AK, Bagchi D, Pal S, Banerjee P, Choudhary A: Pact HDL: compiler targeting ASIC's and FPGA's with power and performance optimizations. In Power Aware Computing. Edited by: Graybill R, Melhem R. Kluwer Academic, Boston, Mass, USA; 2002:169-190. chapter 9.
  11. Tang X, Jiang T, Jones AK, Banerjee P: Behavioral synthesis of data-dominated circuits for minimal energy implementation. Proceedings of 18th IEEE International Conference on VLSI Design (VLSI Design '05), January 2005, Kolkata, India 267-273.
  12. Jung E: Behavioral synthesis using SystemC compiler. Proceedings of 13th Annual Synopsys Users Group Meeting (SNUG '03), March 2003, San Jose, Calif, USA.
  13. Black D, Smith S: Pushing the limits with behavioral compiler. Proceedings of 9th Annual Synopsys Users Group Meeting (SNUG '99), March 1999, San Jose, Calif, USA.
  14. Bartleson K: A New Standard for System-Level Design. Synopsys White Paper, 1999.
  15. Goering R: Behavioral Synthesis Crossroads. EE Times Article, 2004.
  16. Pursley DJ, Cline BL: A practical approach to hardware and software SoC tradeoffs using high-level synthesis for architectural exploration. Proceedings of the GSPx Conference, March–April 2003, Dallas, Tex, USA.
  17. Chappell S, Sullivan C: Handel-C for Co-Processing and Co-Design of Field Programmable System on Chip. Celoxica White Paper, 2002.
  18. Banerjee P, Haldar M, Nayak A, et al.: Overview of a compiler for synthesizing MATLAB programs onto FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2004, 12(3):312-324.
  19. Banerjee P, Shenoy N, Choudhary A, et al.: A MATLAB compiler for distributed, heterogeneous, reconfigurable computing systems. Proceedings of 8th Annual IEEE International Symposium on FPGAs for Custom Computing Machines (FCCM '00), April 2000, Napa Valley, Calif, USA 39-48.
  20. McCloud S: Catapult C Synthesis-Based Design Flow: Speeding Implementation and Increasing Flexibility. Mentor Graphics White Paper, 2004.
  21. Chaiyakul V, Gajski DD: Assignment decision diagram for high-level synthesis. In Tech. Rep. #92-103. University of California, Irvine, Calif, USA; December 1992.
  22. Chaiyakul V, Gajski DD, Ramachandran L: High-level transformations for minimizing syntactic variances. Proceedings of 30th Design Automation Conference (DAC '93), June 1993, Dallas, Tex, USA 413-418.
  23. Ghosh I, Fujita M: Automatic test pattern generation for functional RTL circuits using assignment decision diagrams. Proceedings of 37th Design Automation Conference (DAC '00), June 2000, Los Angeles, Calif, USA 43-48.
  24. Zhang L, Ghosh I, Hsiao M: Efficient sequential ATPG for functional RTL circuits. Proceedings of IEEE International Test Conference (ITC '03), September–October 2003, Charlotte, NC, USA 1: 290-298.
  25. Chouliaras VA, Nunez J: Scalar coprocessors for accelerating the G723.1 and G729A speech coders. IEEE Transactions on Consumer Electronics 2003, 49(3):703-710. 10.1109/TCE.2003.1233807
  26. Atzori E, Carta SM, Raffo L: 44.6% processing cycles reduction in GSM voice coding by low-power reconfigurable co-processor architecture. IEE Electronics Letters 2002, 38(24):1524-1526. 10.1049/el:20021019
  27. Hilgenstock J, Herrmann K, Otterstedt J, Niggemeyer D, Pirsch P: A video signal processor for MIMD multiprocessing. Proceedings of 35th Design Automation Conference (DAC '98), June 1998, San Francisco, Calif, USA 50-55.
  28. Garg R, Chung CY, Kim D, Kim Y: Boundary macroblock padding in MPEG-4 video decoding using a graphics coprocessor. IEEE Transactions on Circuits and Systems for Video Technology 2002, 12(8):719-723. 10.1109/TCSVT.2002.800857
  29. Hinds CN: An enhanced floating point coprocessor for embedded signal processing and graphics applications. Proceedings of Conference Record 33rd Asilomar Conference on Signals, Systems, and Computers, October 1999, Pacific Grove, Calif, USA 1: 147-151.
  30. Alves JC, Matos JS: RVC-a reconfigurable coprocessor for vector processing applications. Proceedings of 6th IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '98), April 1998, Napa Valley, Calif, USA 258-259.
  31. Bridges T, Kitchel SW, Wehrmeister RM: A CPU utilization limit for massively parallel MIMD computers. Proceedings of 4th Symposium on the Frontiers of Massively Parallel Computation, October 1992, McLean, Va, USA 83-92.
  32. Schmit H, Whelihan D, Tsai A, Moe M, Levine B, Taylor RR: PipeRench: A virtualized programmable datapath in 0.18 micron technology. Proceedings of IEEE Custom Integrated Circuits Conference (CICC '02), May 2002, Orlando, Fla, USA 63-66.
  33. Goldstein SC, Schmit H, Budiu M, Cadambi S, Moe M, Taylor RR: PipeRench: a reconfigurable architecture and compiler. Computer 2000, 33(4):70-77. 10.1109/2.839324
  34. Goldstein SC, Schmit H, Moe M, et al.: PipeRench: a coprocessor for streaming multimedia acceleration. Proceedings of 26th IEEE International Symposium on Computer Architecture (ISCA '99), May 1999, Atlanta, Ga, USA 28-39.
  35. Cadambi S, Weener J, Goldstein SC, Schmit H, Thomas DE: Managing pipeline-reconfigurable FPGAs. Proceedings of 6th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '98), February 1998, Monterey, Calif, USA 55-64.
  36. Schmit H: Incremental reconfiguration for pipelined applications. Proceedings of 5th Annual IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '97), April 1997, Napa Valley, Calif, USA 47-55.
  37. Levine BA, Schmit H: Efficient application representation for HASTE: hybrid architectures with a single, transformable executable. Proceedings of 11th Annual IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '03), April 2003, Napa Valley, Calif, USA 101-110.
  38. Ebeling C, Cronquist DC, Franklin P: RaPiD - reconfigurable pipelined datapath. Proceedings of 6th International Workshop on Field-Programmable Logic and Applications (FPL '96), September 1996, Darmstadt, Germany 126-135.
  39. Ebeling C, Cronquist DC, Franklin P, Fisher C: RaPiD - a configurable computing architecture for compute-intensive applications. In Tech. Rep. TR-96-11-03. University of Washington, Department of Computer Science & Engineering, Seattle, Wash, USA; 1996.
  40. Ebeling C, Cronquist DC, Franklin P, Secosky J, Berg SG: Mapping applications to the RaPiD configurable architecture. Proceedings of 5th Annual IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '97), April 1997, Napa Valley, Calif, USA 106-115.
  41. Cronquist DC, Franklin P, Berg SG, Ebeling C: Specifying and compiling applications for RaPiD. Proceedings of 6th IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '98), April 1998, Napa Valley, Calif, USA 116-125.
  42. Cronquist DC, Fisher C, Figueroa M, Franklin P, Ebeling C: Architecture design of reconfigurable pipelined datapaths. Proceedings of 20th Anniversary Conference on Advanced Research in VLSI, March 1999, Atlanta, Ga, USA 23-40.
  43. Mirsky E, DeHon A: MATRIX: a reconfigurable computing architecture with configurable instruction distribution and deployable resources. Proceedings of 4th IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '96), April 1996, Napa Valley, Calif, USA 157-166.
  44. Kapasi UJ, Dally WJ, Rixner S, Owens JD, Khailany B: The imagine stream processor. Proceedings of IEEE International Conference on Computer Design: VLSI in Computers and Processors, September 2002, Freiberg, Germany 282-288.
  45. Khailany B, Dally WJ, Kapasi UJ, et al.: Imagine: media processing with streams. IEEE Micro 2001, 21(2):35-46. 10.1109/40.918001
  46. Owens JD, Rixner S, Kapasi UJ, et al.: Media processing applications on the Imagine stream processor. Proceedings of IEEE International Conference on Computer Design: VLSI in Computers and Processors, September 2002, Freiberg, Germany 295-302.
  47. Hauser JR, Wawrzynek J: Garp: a MIPS processor with a reconfigurable coprocessor. Proceedings of 5th Annual IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '97), April 1997, Napa Valley, Calif, USA 12-21.
  48. Callahan TJ, Hauser JR, Wawrzynek J: The Garp architecture and C compiler. Computer 2000, 33(4):62-69. 10.1109/2.839323
  49. Callahan T: Kernel formation in Garpcc. Proceedings of 11th Annual IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '03), April 2003, Napa Valley, Calif, USA 308-309.
  50. Hauck S, Fry TW, Hosler MM, Kao JP: The Chimaera reconfigurable functional unit. Proceedings of 5th Annual IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '97), April 1997, Napa Valley, Calif, USA 87-96.
  51. Hauck S, Hosler MM, Fry TW: High-performance carry chains for FPGAs. Proceedings of ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '98), February 1998, Monterey, Calif, USA 223-233.
  52. Hoare R, Tung S, Werger K: A 64-way SIMD processing architecture on an FPGA. Proceedings of 15th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS '03), November 2003, Marina del Rey, Calif, USA 1: 345-350.
  53. Dutta S, Wolfe A, Wolf W, O'Connor KJ: Design issues for very-long-instruction-word VLSI video signal processors. Proceedings of IEEE Workshop on VLSI Signal Processing, IX, October–November 1996, San Francisco, Calif, USA 95-104.
  54. Capitanio A, Dutt N, Nicolau A: Partitioned register files for VLIWs: a preliminary analysis of tradeoffs. Proceedings of 25th Annual International Symposium on Microarchitecture (MICRO '92), December 1992, Portland, Ore, USA 292-300.
  55. Trimaran, An Infrastructure for Research in Instruction-Level Parallelism 1998, http://www.trimaran.org
  56. Jones AK, Hoare R, Kourtev IS, et al.: A 64-way VLIW/SIMD FPGA architecture and design flow. Proceedings of 11th IEEE International Conference on Electronics, Circuits and Systems (ICECS '04), December 2004, Tel Aviv, Israel 499-502.
  57. Lee C, Potkonjak M, Mangione-Smith WH: MediaBench: a tool for evaluating and synthesizing multimedia and communications systems. Proceedings of 30th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '97), December 1997, Research Triangle Park, NC, USA 330-335.
  58. Degener J, Bormann C: GSM 06.10 lossy speech compression library. Available on-line: http://kbs.cs.tu-berlin.de/~jutta/toast.html
  59. Golub G, Loan CFV: Matrix Computations. Johns Hopkins University Press, Baltimore, Md, USA; 1991.
  60. Hassibi B, Vikalo H: On sphere decoding algorithm. I. Expected complexity. Submitted to IEEE Transactions on Signal Processing, 2003.
  61. Hassibi B, Vikalo H: On sphere decoding algorithm. II. Examples. Submitted to IEEE Transactions on Signal Processing, 2003.
  62. Chobe Y, Narahari B, Simha R, Wong WF: Tritanium: augmenting the trimaran compiler infrastructure to support IA64 code generation. Proceedings of 1st Annual Workshop on Explicitly Parallel Instruction Computing Architectures and Compiler Techniques (EPIC '01), December 2001, Austin, Tex, USA 76-79.

Copyright

© Hoare et al. 2006