EURASIP Journal on Applied Signal Processing 2005:13, 2018–2025
© 2005 Hindawi Publishing Corporation

Scalable IC Platform for Smart Cameras

Smart cameras are among the emerging new fields of electronics. The points of interest are the application areas, software, and IC development. In order to reduce cost, it is worthwhile to invest in a single architecture that can be scaled in performance (and resulting power consumption) for the various application areas. In this paper, we show that the combination of an SIMD (single-instruction multiple-data) processor and a general-purpose DSP is very advantageous for the image processing tasks encountered in smart cameras. While the SIMD processor gives the very high performance necessary by exploiting the inherent data parallelism found in the pixel-crunching part of the algorithms, the DSP offers a friendly approach to the more complex tasks. The paper further shows that SIMD processors have very convenient scaling properties in silicon, making the complete SIMD-DSP architecture suitable for different application areas without changing the software suite. Analysis of the changes in power consumption due to scaling shows that for typical image processing tasks, it is beneficial to scale the SIMD processor to use the maximum level of parallelism available in the algorithm if the IC supply voltage can be lowered. If silicon cost is of importance, the parallelism of the processor should be scaled to just reach the desired performance given the speed of the silicon.


INTRODUCTION
Real-time video processing on (low-cost and low-power) programmable platforms is now becoming possible thanks to advances in integration techniques [1, 2, 3, 4]. This is relevant to a number of applications such as mobile communications, home robotics, and even industrial image processing [5, 6]. It is important that these platforms are programmable, since new applications for smart cameras emerge every month. The complexity (and possible error-proneness) of the algorithms and the fickleness of real-life scenes also strongly motivate complete programmability. Repeatedly building application-specific ICs or weakly programmable ICs is simply too costly and far from sufficient for this changing market.
It seems like a daunting task to create programmable hardware that is able to process multimillion pixels per second for complex decision tasks. However, we will show in this paper that this is possible by exploiting the inherent parallelism present in the various levels of image processing.
A desirable "silicon" property of the resulting parallel architectures is that they are easily scaled up or down in performance and/or power consumption whenever the application changes. This not only lowers the need to develop new architectures from scratch for new vision application areas, but it also allows the design team to use the same software suite with only some settings changed in include files. This significantly reduces the overall cost of vision solutions as they can be reused among the portfolio of the producer.
The two types of processors that we propose should definitely be included in smart camera architectures are the SIMD (single-instruction multiple-data) massively parallel processor and (one or more) general-purpose DSPs [7]. Enough has been written about general-purpose DSPs, so in this paper we will mainly focus on the merits of SIMD processors for the computationally demanding image processing tasks. After reading the paper, it will be clear that they have unique and very clear merits from a silicon and algorithmic point of view for realistic IC implementations of programmable smart cameras.
This paper deals with hardware processor architectures and their properties for scalable platforms. However, designing an easy-to-use software environment for these scalable platforms with multiple processor cores is of course an immense task. Even though all processors might be programmable in a similar language, still (automatic) decisions have to be made to decide on which processor to run which task, when, and how to communicate data. A recent result of the smart camera project is a programming method where algorithmic kernels in the shape of skeletons are used to describe the applications. Based on the available processor cores and memory in the final (scaled) architecture, different skeletons are chosen, linked, and scheduled. The virtue of this design suite is that several skeletons are available for a certain task, optimized for different processor cores. An implementation of a complete application can now survive scaling of the hardware. For more information on this software environment, the interested reader is kindly referred to [8].
The remainder of the paper is organized as follows. In Section 2, we show how well-known processor architectures map to the image processing algorithms. Section 3 deals with some application areas and their performance demands. This is followed by the proposed SIMD-VLIW-based vision platform for smart cameras in Section 4. The scaling properties of SIMD regarding silicon are discussed in Section 5. Finally, conclusions are drawn in Section 6.

ALGORITHM CLASSIFICATION
Applications on smart cameras typically take as input images or live video from the observed scene and produce low-rate data output in the form of decisions or identification results. Examples currently being worked on include person and object identification, gesture control, event recognition, and data measurement. More challenging applications are expected in the near future. The algorithms in the application areas of smart cameras can be grouped into three levels: low-level, intermediate-level, and high-level tasks. Figures 1 and 2 show the task classification and the corresponding data entities, respectively.
The low- or early-image processing level is associated with typical kernel operations like convolutions and data-dependent operations using a limited neighbourhood of the current pixels. In this part, often a classification or the initial steps towards pixel classification are performed. Because every pixel could be classified in the end as "interesting," the algorithms are essentially the same for every pixel. So, if more performance is needed at this level of image processing, with (now!) up to a billion pixels per second, it is very fruitful to exploit this inherent data parallelism by operating on more pixels per clock cycle. The processors enabling this have an SIMD architecture [9, 10]. In SIMD architectures, the same instruction is issued on all data items in parallel. This lowers the overhead of instruction fetch, decoding, and data access, leading to economical solutions that meet the high performance and throughput demands of this task level.
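To make the data-parallel model concrete, the following sketch is plain Python emulating lock-step SIMD execution of a 1D convolution over one image line; it is illustrative only and does not model the instruction set of any actual SIMD chip.

```python
# Illustrative only: emulate an SIMD linear processor array applying one
# instruction stream to every pixel of an image line in lock-step.
def simd_convolve_line(line, taps):
    """Apply the same FIR kernel to every pixel of one image line.

    Conceptually each processing element holds one pixel and executes the
    identical multiply-accumulate sequence; neighbour values travel over
    the inter-PE communication network (modelled here by list indexing).
    """
    half = len(taps) // 2
    # Replicate border pixels, a common edge policy for line convolution.
    padded = [line[0]] * half + list(line) + [line[-1]] * half
    out = []
    for p in range(len(line)):          # all PEs "in parallel"
        acc = 0
        for k, tap in enumerate(taps):  # the same instruction on every PE
            acc += tap * padded[p + k]
        out.append(acc)
    return out

smoothed = simd_convolve_line([10, 10, 40, 10, 10], [1, 2, 1])
```

Because every pixel runs the identical instruction sequence, the inner loop body is fetched and decoded once for the whole line, which is exactly the overhead saving the text describes.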
Also from a power consumption point of view, SIMD processors prove to be very good [11]. Specifically, the parallel architecture reduces the number of memory accesses, the clock speed, and the instruction-decoding overhead, thereby enabling higher performance at lower power consumption [3, 4]; see Section 5.1.
An important silicon property of SIMD processors is the regularity of the design. This enables cost-effective scaling of the platform for different performance regions by simply increasing or reducing the number of processors in the parallel array. The hardware design can be reused and, even more importantly, the software suite inherently remains the same. So, using SIMD for pixel processing lowers the design cost (and time-to-market) of rapidly changing applications.
In the intermediate level, measurements are performed on the objects found to analyze the quality or properties of objects in order to make decisions on the image contents. It appears that SIMD-type architectures can do these tasks, but they are not very efficient because only part of the image (or line) contains objects while the SIMD processors always process the entire image (or line). However, these tasks can easily be performed by a general-purpose DSP in the system, if the performance demands are met. If the performance needs to be increased, a viable approach is to use the property that similar algorithms are performed on the various objects, which leads to task-parallel object processing on different processors. Finally, in the high-level part of image processing, decisions are made and forwarded to the user. General-purpose processors are ideal for these tasks because they offer the flexibility to implement complex software tasks and are often capable of running an operating system and doing networking applications.

Figure 3: Face recognition actually has two parts, detection and recognition. The complete detection part is perfectly mapped to the SIMD processor doing the low-level operations. The recognition part is dealt with by the VLIW processor.
An illustrative application where we can easily indicate the three processing levels is face recognition and detection (see Figure 3). Here, the low-level part of the algorithm classifies each pixel as belonging to a face or not (face detection). This is mapped on "Xetal," an SIMD processor [4]. The intermediate level determines the features and identifies each found face object. Finally, in the high-level part of the algorithm, the decisions to open a door, sound an alarm, or start a new guest program are taken. The latter two tasks are performed by "TriMedia," a VLIW (very long instruction word) processor [12]. This example was chosen purely because it shows the three levels quite clearly; more information on this application can be found in [11, 13].

APPLICATION PROFILING
The nature of the application determines the scaling that needs to be applied to the SIMD processor and the accompanying DSP in order to meet the performance. Although the basic operations handled by the SIMD processor remain identical, the application dictates the way the computations are performed and their intensity. Consequently, the SIMD architecture needs to be scaled up or down. To a varying degree, all application segments look for better performance at lower cost and lower power consumption.

Mobile multimedia processing
This class of applications is characterized by low cost, moderate computational complexity, and low power. The application can usually live with a reduced quality of service, for example, lower frame rates and compressed video streams. The objective from the point of view of product manufacturers is cost reduction, which translates to reduction in silicon area and power-efficient computation. The latter objective can often be compromised for the former, since mobile devices are active for a short period of time compared to the standby duration and the battery energy is wasted mainly in the standby phase.
Thus, to cut costs, the SIMD has to be scaled down from the power-optimal massively parallel architecture. However, as technology shrinks, cost per unit area decreases and scaling down becomes less interesting, since the tendency in mobile video applications is towards increased frame sizes and rates, and more complex applications.
The computational complexity for this class of applications varies from 300 MOPs for basic camera preprocessing (VGA at 30 fps: 640 × 480 pixels × 30 frames/s × 30 OPs/pixel ≈ 277 MOPs) to 1.5 GOPs for more complex preprocessing including auto white balance and exposure-time control (about 150 operations per pixel).
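The arithmetic behind these figures is easy to reproduce; the helper below is our own naming, with the operation counts per pixel taken from the text.

```python
# Back-of-the-envelope check of the complexity figures quoted above.
def mops(width, height, fps, ops_per_pixel):
    """Mega-operations per second for a video stream."""
    return width * height * fps * ops_per_pixel / 1e6

basic = mops(640, 480, 30, 30)      # basic camera preprocessing
advanced = mops(640, 480, 30, 150)  # incl. auto white balance, exposure control
```

The basic case evaluates to 276.48 MOPs, which the text rounds to 277 MOPs; 150 OPs/pixel gives roughly 1.4 GOPs, in line with the quoted 1.5 GOPs.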

Intelligent home interfaces and home robotics
This class of applications corresponds to emerging house robots with vision features. Typical examples include intelligent devices with gesture and face recognition [13], autonomous video guidance for robots, and smart home surveillance cameras. These devices cover the medium-cost range. From a user point of view, the response times and accuracies of the intelligent devices are of high importance and imply faster burst performance. They need to operate in uncontrolled environments (lighting conditions, etc.), and smarter (more complex) algorithms are needed to achieve the desired performance. The power aspect remains an issue, especially in standalone modules such as battery-powered surveillance robots.
Because of the extra intelligence needed in this class of applications, the number of operations per pixel is on the order of 300 or more. This translates to more than 3 GOPs for a 30-frame-per-second VGA-size video stream. Even though the devices can exhibit long idle times until they are excited by an event, for example, an intruder in a scene, during their active periods the same degree of performance is required to guarantee real-time behaviour.

Industrial vision
Unlike the previous two cases, applications in this segment are cost-tolerant, and more emphasis is placed on achieving top performance, sometimes within a given power budget. The emergence of smart cameras has made it possible for various industrial applications to replace large, expensive PC-based vision systems with compact and light modules offering different vision functionalities. The massively parallel SIMD is very well suited for this class of applications from cost, performance, and power consumption points of view.
The scaling here mainly follows the incoming video format; one can expect a corresponding increase in the SIMD array size with increasing image sensor resolution. In this class of applications, a number of basic pixel-level operations such as edge detection, enhancement, morphology, and so forth, need to be performed. Because of the high video rates, often in excess of 500 fps, the computational complexity is easily more than 4 GOPs.

THE SIMD-VLIW VISION PLATFORM
The platform discussion in this section is based on the application profiling and algorithm classification discussed in the previous sections. The different aspects of the algorithmic levels have led us to choose a dual-processor approach where the low-level image processing and part of the intermediate level are (as in Figure 3) mapped on a massively parallel SIMD processor, "Xetal" [3]. The high-level image processing part and the remaining intermediate-level parts are mapped on a high-performance DSP core, "TriMedia" [12]. This DSP has a VLIW architecture where instruction fetch, data fetch, and processing are performed in a pipelined fashion. For most applications, the two processors can simply be connected in series as shown in Figure 4.
The first part of the smart camera architecture is a CMOS image sensor, which can capture up to 100 frames per second at a resolution of 1280 × 960 pixels. The Xetal processor contains 320 pixel-level processors organized as a linear processor array. The array processes an entire image line in 1 to 4 instruction issues, depending on the line width, since each pixel processor is responsible for 1 to 4 columns. Around 1600 instructions can be handled by each processor per line time, depending on the clock setting. It has 16 line memories to save information and a control processor with the program memory to host the programs [3]. Figure 5 shows the architecture of the Xetal processor in more detail.
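A quick sanity check of these figures; the numbers come from the text above, but the assumption that each instruction is issued once per column a PE serves is our reading, not an explicit statement of the paper.

```python
# Relating the Xetal figures quoted above (numbers from the text; the
# mapping of instructions to column sweeps is our interpretation).
SENSOR_WIDTH = 1280   # pixels per line of the image sensor
NUM_PES = 320         # processing elements in the linear array

cols_per_pe = SENSOR_WIDTH // NUM_PES   # columns served by each PE
# If every instruction is issued once per column a PE serves, the
# ~1600 instruction slots per line time leave this many distinct
# operations per pixel of a full-width line:
ops_per_pixel = 1600 // cols_per_pe
```

Under this reading, each PE serves 4 columns of a full-width line, leaving roughly 400 distinct operations per pixel per line time.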
The TriMedia functions as the high-level DSP in the architecture. It exploits limited instruction-level parallelism and can handle up to 5 operations in parallel.

PERFORMANCE-DRIVEN SIMD SCALING
The discussion on application profiling indicates that the performance requirements vary by orders of magnitude from 300 MOPs to more than 4 GOPs. In this section, we address the scaling of a massively parallel SIMD architecture to match the computational complexity of a given application.
The impact of scaling is studied with respect to the different SIMD building blocks and quantified in terms of silicon area and power dissipation. The silicon area directly relates to the cost while the power dissipation dictates the applicability of the device in a system with maximum power constraint.

SIMD performance and power scaling
In mobile applications, both battery life and packaging are important issues. For a given performance requirement, scaling the number of processors in the SIMD machine has direct impact on the power consumption. Power dissipation determines the complexity and cost of packaging and cooling of the devices.
In this section, we look at the performance-power tradeoff when scaling SIMD processors. The analysis is based on the assumption that dynamic power dissipation is the dominant component and uses the well-known CMOS dynamic power dissipation formula Power ∝ C V² f, where C is the switched capacitance, V is the supply voltage, and f is the switching frequency.
Energy consumption of an SIMD machine is decomposed into the following components: computation modules (E_comp), communication network (E_comm), memory blocks (E_mem), and control and address generation units (E_caddr). This decomposition helps to identify where most of the power is spent. Equations (1)–(5) give the intrinsic energy model of the components as a function of the convolution filter width (W), the number of processing elements (P), the number of pixels per image line (N), and the size of the working line memory (N(W − 1)). The model parameters have been derived with the high-level power-estimation tool Petrol [14, 15] and were later calibrated with measurement results of the Xetal chip. The basis for choosing a convolution algorithm in our investigation is the fact that convolution involves all four components (computation, memory, communication, and control) that contribute to energy consumption. In the formulae, it is assumed that the different SIMD configurations operate at the same supply voltage.

The computation energy (E_comp) is a quadratic function of the filter width (W) and does not depend on the number of processing elements, as the same number of arithmetic operations needs to be done for all configurations. The other energy components, on the other hand, depend on all three dimensions (W, P, N) and have been modelled by a first-order approximation. In essence, scaling the SIMD architecture affects the number of accesses to the working line memories. With each access, a certain amount of energy is consumed by the communication channel, the control and address generator, and the memory block. As the number of processors increases, the number of accesses to memory decreases, thereby reducing the total energy dissipation. In general, the following relationship holds between the energy components: E_mem > E_comp > E_caddr > E_comm.
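Since equations (1)–(5) themselves are not reproduced in this text, the sketch below is a toy model with invented coefficients and an invented access count; it only mirrors the qualitative behaviour described above (computation energy independent of P, access-related terms shrinking as P grows).

```python
def line_energy(W, P, N):
    """Toy per-line energy model for a W-tap convolution on an SIMD array.

    All coefficients are invented for illustration; only the trends match
    the text: E_comp depends on W (not P), the other terms fall with P.
    """
    k_comp, k_mem, k_caddr, k_comm = 1.0, 1.0, 0.3, 0.1
    e_comp = k_comp * N * W * W            # same total arithmetic for any P
    # Toy access count: each pixel needs W - 1 neighbour values, and the
    # re-fetching from line memory grows with the columns per PE (N / P).
    accesses = (W - 1) * N * (N / P)
    e_mem = k_mem * accesses               # memory blocks
    e_caddr = k_caddr * accesses           # control and address generation
    e_comm = k_comm * accesses             # communication network
    return e_comp + e_mem + e_caddr + e_comm
```

Evaluating line_energy(5, P, 640) for growing P reproduces the flattening behaviour of Figure 6: the savings from extra PEs diminish once the access-related terms become small compared with E_comp.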
Figure 6 shows curves of the total energy for N = 640 (the number of pixels in a VGA image line) with the filter width as a parameter. The curves in Figure 6 show that beyond a certain degree of parallelism the saving in energy is very marginal. While the trend is the same, larger filter kernels benefit more from an increased number of PEs.
It should be noted that configurations with more processing elements (PEs) can handle increased throughput for the same algorithmic complexity. The increase is proportional to P, since P ≤ N and the filter kernels can be fully parallelised over the pixels in an image line. The minimum number of processing elements needed to meet the real-time constraint is given by P_min = C_alg × f_pixel / f_max, where C_alg is the algorithmic complexity in number of instructions, f_pixel the pixel rate, and f_max the maximum clock frequency of the PEs. The clock frequency of the PEs can be increased further by optimisation and pipelining. While optimisation for speed leads to larger PE sizes and increased computation energy dissipation, the impact of pipelining on the SIMD scaling needs to be investigated further.
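The P_min formula is straightforward to evaluate; the figures plugged in below (300 instructions/pixel, 50 MHz PEs) are illustrative values taken from the ranges mentioned elsewhere in the paper.

```python
import math

# P_min = C_alg * f_pixel / f_max, rounded up to a whole number of PEs.
def p_min(c_alg, f_pixel_hz, f_max_hz):
    """Minimum number of PEs to sustain real time."""
    return math.ceil(c_alg * f_pixel_hz / f_max_hz)

# Example: 300 instructions/pixel on a VGA 30 fps stream with 50 MHz PEs.
pes_needed = p_min(300, 640 * 480 * 30, 50e6)
```

For this example stream (9.216 Mpixels/s), 56 PEs suffice, comfortably below the 320 PEs of the Xetal array.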
When throughput enters the picture, power dissipation becomes a more convenient metric for comparison [16]. Figure 7 shows the power dissipation versus the number of PEs with performance (Perf = N_ops × f_pixel) as a parameter. The curves start at the P_min computed for PEs designed to run at f_max = 50 MHz. Increasing parallelism beyond P_min increases the chip cost (area), which is traded for lower power dissipation through supply voltage and frequency scaling [17].
Figure 8 shows the energy scaling factor that has been used to generate the power dissipation curves of Figure 7. The energy scaling factor (6) has been derived from the CMOS propagation-delay model given in [18]. The scaling factor starts at unity at the P_min corresponding to a given performance and decreases as the number of PEs increases. A threshold voltage of V_th = 0.5 V and a maximum supply voltage of V_dd,max = 1.8 V have been assumed in the plotted curves. To allow for a noise margin, the lowest operating voltage has been set to V_dd,min = V_th + 0.2 V.
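Equation (6) is likewise not reproduced in this text, so the sketch below rebuilds the idea under an assumed alpha-power propagation-delay model (alpha = 2 is our assumption, as is the bisection solver), using the V_th and supply-voltage limits quoted above.

```python
# Assumed alpha-power delay model; supply limits as quoted in the text.
V_TH, VDD_MAX, VDD_MIN, ALPHA = 0.5, 1.8, 0.7, 2.0

def rel_speed(vdd):
    """Maximum clock at vdd, relative to the clock at VDD_MAX."""
    def raw(v):
        return (v - V_TH) ** ALPHA / v   # alpha-power delay model
    return raw(vdd) / raw(VDD_MAX)

def energy_scale(p, p_min):
    """Dynamic-energy scaling factor when p >= p_min PEs are used.

    With p PEs the clock may run slower by p_min / p, so the supply can
    be lowered (bisection below) until the speed target is just met;
    dynamic energy then scales as (vdd / VDD_MAX) ** 2.
    """
    target = p_min / p
    lo, hi = VDD_MIN, VDD_MAX
    for _ in range(60):
        mid = (lo + hi) / 2
        if rel_speed(mid) >= target:
            hi = mid
        else:
            lo = mid
    return (max(hi, VDD_MIN) / VDD_MAX) ** 2
```

At p = p_min the factor is 1; with these constants, doubling the number of PEs roughly halves the energy per operation, and the factor bottoms out once V_dd,min is reached, matching the saturation visible in Figure 8.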

Scaling the linear processor array
Due to its regularity, the linear processor array (LPA) can easily be scaled according to the desired performance, cost, and power figures. The silicon area of the processor array is given by A_parray = P × A_PE, where A_PE is the area of a single processing element. The array area scales linearly with the number of processing elements (P).

Scaling the line memories
The size of the on-chip line memories is dictated by the algorithm to be executed and is independent of the number of PEs used in the SIMD configuration. From an area point of view, the line memories do not scale; they only change in layout shape, since more pixels are allocated per PE as the number of PEs decreases. The silicon area contribution of the line memories is A_lmem = N_lines × A_line.

Scaling the global controller
Like the line memories, the global controller does not scale with a change in the number of PEs. This is to be expected since, by the SIMD principle, the global controller is a resource shared by all PEs. The global controller area is simply A_gcon.

Scaling the program memory
The program memory scaling depends on how the impact of loop overhead is addressed. When there are fewer PEs than pixels in a line, algorithms need to be iterated over partitions of an image line. The overhead is associated with the instructions that control the iterations and can be considerable when the loop body is small. A straightforward way of reducing the loop overhead is to unroll the loop by replicating the algorithm code. This increases the program memory by an amount that is a function of the length of the unrolled code and the number of replications. The area of the program memory can be given by A_pmem = N_ops × N_unroll × (1 + γ) × A_instr, where A_instr is the area of a single instruction and γ is the increase factor related to address-width expansion. Alternatively, special loop-control hardware in the global controller can avoid the cost of replicating code.

Scaling case study
To summarize the SIMD scaling issue, we collect the components into one cost function described in terms of silicon area: A_SIMD = A_parray + A_lmem + A_gcon + A_pmem. Figure 9 shows how the SIMD area scales with the number of PEs. The curves are offset by an amount equivalent to the line memory and global controller areas, which do not scale. Since the size of the program memory is small relative to the other components, the relative impact of loop unrolling is minimal for large array sizes. For a lower number of processing elements (P < 200), the area of the nonscaling components dominates. Under this condition, it is sensible not to scale down the number of PEs, so that the performance loss in the no-loop-unrolling case can be compensated for. When combined with the power scaling curves given earlier, the area scaling curve provides a quantitative means for performance-driven SIMD scaling.
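The cost function can be sketched numerically; all area coefficients below are invented, normalised units, so only the linear-plus-offset shape of the result is meaningful.

```python
# A_SIMD = A_parray + A_lmem + A_gcon + A_pmem, with invented coefficients.
def simd_area(p, n_lines=16, a_pe=1.0, a_line=2.0, a_gcon=50.0,
              n_ops=1600, n_unroll=1, gamma=0.05, a_instr=0.01):
    a_parray = p * a_pe                            # scales linearly with P
    a_lmem = n_lines * a_line                      # fixed by the algorithm
    a_pmem = n_ops * n_unroll * (1 + gamma) * a_instr
    return a_parray + a_lmem + a_gcon + a_pmem
```

Increasing n_unroll exposes the loop-unrolling trade-off: program memory grows, but it stays small next to the array and line-memory terms for large P, as the text observes.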

CONCLUSIONS
In this paper, we have motivated the use of a scalable programmable architecture for video processing in the various applications of smart cameras. It appears that the highly parallel nature of image processing algorithms makes it possible to put the major part of the load on an SIMD-type processor. Another processor in the system has to be a general-purpose microcontroller, microprocessor, or DSP.
We have also shown that the SIMD architecture scales nicely with regard to performance, power consumption, and silicon area (cost), all while the costly software suite remains the same. Exploiting parallelism saves energy and increases performance, but the gain starts to level off beyond a certain number of processors. The SIMD area increase is linear in the number of processors, with an offset. For practical low-cost applications, a design at maximum silicon speed is preferred, with a sufficient level of parallelism to obtain some power savings. When voltage scaling is used, the lowest power consumption for a given performance is achieved with just enough processors to meet that performance at the maximum silicon speed attainable at the lowest supply voltage.

Richard P. Kleihorst received the M.S. and Ph.D. degrees in electrical engineering from Delft University of Technology, the Netherlands, in 1989 and 1994, respectively. In 1989, he worked at the Philips Research Laboratories in Eindhoven, the Netherlands, on fuzzy classification techniques for video-speed optical character recognition. From 1990 to 1994, he worked as a Research Assistant, investigating the application of order statistics for image processing in the Laboratory for Information Theory, Delft University of Technology, the Netherlands. In 1994, he joined the VLSI Design Group, Philips Research Laboratories, Eindhoven, the Netherlands. He worked on single-chip MPEG-2 encoding, embedded compression techniques, and parallel image processing. Currently he focuses on programmable architectures for real-time high-performance computer vision. During 2002-2004, he was a Committee Member of the IEEE International On-Line Testing Symposium, and since 2003, he has been a Committee Member of the Advanced Concepts for Intelligent Vision Systems Conference. At present, his interests include digital video processing, with emphasis on coupling expertise between DSP and silicon implementations.