Segmentation algorithm via Cellular Neural/Nonlinear Network: implementation on Bio-inspired hardware platform

The Bio-inspired (Bi-i) Cellular Vision System is a computing platform consisting of sensing, array sensing-processing, and digital signal processing. The platform is based on the Cellular Neural/Nonlinear Network (CNN) paradigm. This article presents the implementation of a novel CNN-based segmentation algorithm onto the Bi-i system. Each part of the algorithm, along with the corresponding implementation on the hardware platform, is described in detail throughout the article. The experimental results, carried out for the Foreman and Car-phone video sequences, highlight the feasibility of the approach, which provides a frame rate of about 26 frames/s. Comparisons with existing CNN-based methods show that the conceived approach is more accurate, thus representing a good trade-off between real-time requirements and accuracy.


Introduction
Due to the recent advances in communication technologies, the interest in video contents has increased significantly, and it has become more and more important to automatically analyze and understand video contents using computer vision techniques. In this regard, segmentation is essentially the first step toward many image analysis and computer vision problems [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]. With the recent advances in several new multimedia applications, there is the need to develop segmentation algorithms running on efficient hardware platforms [16][17][18]. To this purpose, in [16] an algorithm for the real-time segmentation of endoscopic images running on a special-purpose hardware architecture is described. The architecture detects the gastrointestinal lumen regions and generates binary segmented regions. In [17], a segmentation algorithm was proposed, along with the corresponding hardware architecture, mainly based on a connected component analysis of the binary difference image. In [18], a multiple-features neural-network-based segmentation algorithm and its hardware implementation have been proposed. The algorithm incorporates static and dynamic features simultaneously in one scheme for segmenting a frame in an image sequence.
Referring to the development of segmentation algorithms running on hardware platforms, this article focuses on the implementation of algorithms running on the Cellular Neural/Nonlinear Network (CNN) Universal Machine [5][6][7]. This architecture offers great computational capabilities, which are suitable for complex image-analysis operations in object-oriented approaches [8][9][10]. Note that so far few CNN algorithms for segmenting a video sequence into moving objects have been introduced [5,6]. These segmentation algorithms were only simulated, i.e., their hardware implementation is substantially lacking. Based on these considerations, this article presents the implementation of a novel CNN-based segmentation algorithm onto the Bio-inspired (Bi-i) Cellular Vision System [9]. This system builds on CNN type (ACE16k) and DSP type (TX 6×) microprocessors [9]. The proposed segmentation approach focuses on the algorithmic issues of the Bi-i platform, rather than on the architectural ones. This algorithmic approach has been conceived with the aim of fully exploiting both the capabilities offered by the Bi-i system, that is, the analog processing based on the ACE16k as well as the digital processing based on the DSP. We would point out that, referring to the segmentation process, the goal of our approach is to find moving objects in video sequences characterized by an almost static background. We do not consider in this article still images or moving objects in a video captured by a camera located on a moving platform, where the background is also moving.
* Correspondence: giuseppe.grassi@unisalento.it. 2 Dipartimento di Ingegneria dell'Innovazione, Università del Salento, 73100 Lecce, Italy. Full list of author information is available at the end of the article.
The article is organized as follows. Section 2 briefly reviews the basic notions of the CNN model and the Bi-i cellular vision architecture. Then the segmentation algorithm is described in detail (see the block diagram in Figure 1). In particular, in Section 3, the motion detection is described, whereas Section 4 presents the edge detection phase, which consists of two blocks, the preliminary edge detection and the final edge detection. In Section 5, the object detection block is illustrated. All the algorithms are described from the point of view of their implementation on the Bi-i, that is, for each task it is specified which templates (of the CNN) run on the ACE16k chip and which parts run on the DSP. Finally, Section 6 reports comparisons between the proposed approach and the segmentation algorithms described in [3] and [5], which have also been implemented on the Bi-i Cellular Vision System.

Cellular Neural/Nonlinear Networks and Bio-Inspired Cellular Vision System
Cellular Neural/Nonlinear Networks represent an information processing system described by nonlinear ordinary differential equations (ODEs). These networks, which are composed of a large number of locally connected analog processing elements (called cells), are described by the following set of ODEs [1] (written here in the standard CNN state and output form):
$$\dot{x}_{ij}(t) = -x_{ij}(t) + \sum_{kl \in N_r} A_{ij,kl}\, y_{kl}(t) + \sum_{kl \in N_r} B_{ij,kl}\, u_{kl}(t) + I_{ij} \qquad (1)$$
$$y_{ij}(t) = \frac{1}{2}\left( \left| x_{ij}(t) + 1 \right| - \left| x_{ij}(t) - 1 \right| \right) \qquad (2)$$
where $x_{ij}(t)$ is the state, $y_{ij}(t)$ the output, and $u_{ij}(t)$ the input. The constant $I_{ij}$ is the cell current, which could also be interpreted as a space-varying threshold [19]. Moreover, $A_{ij,kl}$ and $B_{ij,kl}$ are the parameters forming the feedback template A and the control template B, respectively, whereas $kl \in N_r$ is a grid point in the neighborhood within the radius r of the cell ij [20].
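To make the role of the templates concrete, the following is a minimal software sketch of the cell dynamics, assuming the standard CNN form (state derivative equal to minus the state, plus the A-weighted outputs, plus the B-weighted inputs, plus the bias) with the piecewise-linear output. The forward-Euler integration, the step size, the zero-valued boundary, and the function names are illustrative assumptions; they do not describe how the ACE16k chip computes.

```python
def cnn_output(x):
    """Piecewise-linear CNN output: saturates at -1 and +1."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def simulate_cnn(u, x0, A, B, bias, dt=0.1, steps=200):
    """Integrate a 3x3-template CNN with forward Euler (illustrative only).

    u    : input image (list of rows)
    x0   : initial state image, same size as u
    A, B : 3x3 feedback and control templates
    bias : cell current I (space-invariant in this sketch)
    Cells outside the grid are treated as zero (assumed boundary condition).
    """
    rows, cols = len(u), len(u[0])
    x = [row[:] for row in x0]
    for _ in range(steps):
        y = [[cnn_output(v) for v in row] for row in x]
        new_x = [row[:] for row in x]
        for i in range(rows):
            for j in range(cols):
                acc = -x[i][j] + bias
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        k, l = i + di, j + dj
                        if 0 <= k < rows and 0 <= l < cols:
                            acc += A[di + 1][dj + 1] * y[k][l]
                            acc += B[di + 1][dj + 1] * u[k][l]
                new_x[i][j] = x[i][j] + dt * acc
        x = new_x
    return [[cnn_output(v) for v in row] for row in x]


# Hypothetical template pair: a self-feedback of 2 makes every cell bistable,
# so the output settles to the sign of the initial state.
A_SIGN = [[0, 0, 0], [0, 2, 0], [0, 0, 0]]
B_ZERO = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```

With these hypothetical templates, a cell starting from a positive state settles to output +1 and a cell starting from a negative state settles to -1, which illustrates how a template acts as an instruction applied to the whole array at once.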
Since the cells cooperate in order to solve a given computational task, CNNs have provided in recent years an ideal framework for programmable analog array computing, where the instructions are represented by the templates. This is in fact the basic idea underlying the CNN Universal Machine [1], where the architecture combines analog array operations with logic operations (therefore named as analogic computing). A global programming unit was included in the architecture, along with the integration of an array of sensors. Moreover, local memories were added to each computing cell [1]. The physical implementations of the CNN Universal Machine with integrated sensor array proved the physical feasibility of the architecture [11,12].
Recently, a Bio-inspired (Bi-i) Cellular Vision System has been introduced, which combines Analogic Cellular Engine (ACE16k) and DSP type microprocessors [9]. Its algorithmic framework contains several feedback and automatic control mechanisms among the different processing stages [9]. In particular, this article exploits the Bi-i Version 2 (V2), which has been described in detail in reference [9]. The main hardware building blocks of this Bi-i architecture are illustrated in Figure 2. It has a color (1280 × 1024) CMOS sensor array (IBIS 5-C), two high-end digital signal processors (TX C6415 and TX C6701), and a communication processor (ETRAX 100) with some external interfaces (USB, FireWire, and a general digital I/O, in addition to the Ethernet and RS232).
Figure 1 Block diagram of the overall segmentation algorithm.
Figure 2 The main hardware building blocks of the Bi-i cellular vision system described in [9].
Referring to the Analogic Cellular Engine ACE16k, note that a full description can be found in [12]. Herein, we recall that it represents a low-resolution (128 × 128) grayscale image sensor array processor. Thus, the Bi-i is a reconfigurable device, i.e., it can be used as a monocular or a binocular device with a proper selection of a high-resolution CMOS sensor (IBIS 5-C) and a low-resolution CNN sensor processor (ACE16k) [9].
Two tools can be used in order to program the Bi-i Vision System, i.e., the analogic macro code (AMC) and the software development kit (SDK). In particular, by using the AMC language, the Bi-i Vision System can be programmed for simple analogic routines [9], whereas the SDK is used to design more complex algorithms (see Appendix). Referring to the image processing library (IPL), note that the so-called TACE_IPL is a library developed within the SDK. It contains useful functions for morphological and grey-scale processing in the ACE16k chip (see Appendix). Additionally, the Bi-i V2 includes an InstantVision™ library [9].
Finally, note that throughout the article, the attention is focused on the way the proposed segmentation algorithm is implemented onto the Bi-i Cellular Vision System. Namely, each step of the algorithm has been conceived with the aim of fully exploiting the Bi-i capabilities, i.e., the processing based on the ACE16k chip as well as the processing based on the DSP.

Motion detection
This section illustrates the motion detection algorithm (Figure 1). Let $Y_i^{\mathrm{LP}}$ and $Y_{i-3}^{\mathrm{LP}}$ be two gray-level images, processed by low-pass (LP) filtering, and let $Y_i^{\mathrm{MD}}$ be the motion detection (MD) mask. In order to implement the motion detection onto the Bi-i, the first step (see Equation 3) consists in computing the difference between the current frame $Y_i^{\mathrm{LP}}$ and the third preceding frame $Y_{i-3}^{\mathrm{LP}}$ using the ACE16k chip. The indices i and i-3 denote that the frames i-2 and i-1 are skipped. Namely, the analysis of the video sequences considered throughout the article suggests that it is not necessary to compute the difference between successive frames; a difference taken across three frames is enough. However, every frame is still evaluated by the algorithm, each against a reference frame that is three frames older. This means that every frame must be stored, because the frame i + 1 requires the frame i-2 as a reference.
Then, according to Step 2 in Equation 3, positive and negative threshold operations are applied to the difference image via the ConvLAMtoLLM function [13] implemented on the ACE16k chip. This function (included in the SDK) converts a grey-level image stored in the local analog memory (LAM) into a binary image stored in the local logic memory (LLM). Successively, the logic OR operation is applied between the output of the positive threshold and the output of the negative threshold. The resulting image includes all the changed pixels.
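The two steps described in this section (frame difference, then positive and negative thresholds combined by a logic OR) can be sketched in software as follows. This is only an illustrative analogue of what runs on the ACE16k: the function and class names, the threshold value passed by the caller, and the use of plain Python lists are assumptions, and `ConvLAMtoLLM` itself is not called here.

```python
from collections import deque


def motion_mask(current, reference, thr):
    """Binary mask of the pixels that changed between two gray-level frames.

    current, reference : low-pass filtered frames i and i-3 (lists of rows)
    thr                : hypothetical threshold on the gray-level difference
    """
    mask = []
    for row_c, row_r in zip(current, reference):
        out_row = []
        for c, r in zip(row_c, row_r):
            diff = c - r
            positive = diff > thr    # positive threshold on the difference
            negative = diff < -thr   # negative threshold on the difference
            out_row.append(positive or negative)  # logic OR of the two
        mask.append(out_row)
    return mask


class FrameBuffer:
    """Keeps the last lag+1 frames so that frame i can be paired with frame i-lag."""

    def __init__(self, lag=3):
        self.frames = deque(maxlen=lag + 1)

    def push(self, frame):
        """Store a frame; return the frame `lag` steps older, or None if not yet available."""
        self.frames.append(frame)
        if len(self.frames) == self.frames.maxlen:
            return self.frames[0]
        return None
```

Because every incoming frame is pushed into the buffer, each frame is evaluated against its own reference, which is why all frames have to be stored even though the difference spans three frames.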

Edge detection
The proposed edge detection phase consists of two blocks, the preliminary edge detection and the final edge detection (see Figure 1). In the first block, the CNN-based dual window operator (proposed by Grassi and Vecchio [10]) is exploited to reveal edges as zero-crossing points of a difference function, depending on the minimum and maximum values in the two windows. After this preliminary selection of edge candidates, the second block enables accurate edge detection to be obtained, using a technique able to highlight the discontinuity areas.

Preliminary edge detection
The aim of this phase is to locate the edge candidates. The dual window operator is based on a criterion able to localize the mean point within the transition area between two uniform luminance areas [10]. Thus, the first step consists in determining the minimum and maximum values in the two considered windows. Given the input image $Y_i^{\mathrm{LP}}$, we consider for each sample $s \in Y_i^{\mathrm{LP}}(x, y)$ two concentric circular windows, centered in s and having radius r and R, respectively (r < R). Let $M_R$ and $m_R$ be the maximum and minimum values of $Y_i^{\mathrm{LP}}$ within the window of radius R, and let $M_r$ and $m_r$ be the maximum and minimum values within the window of radius r [10]. Note that, for the video sequences considered throughout the article, we have taken the values r = 1 pixel and R = 2 pixels. For each sample s, let us define the difference function $D(s) = a_1(s) - a_2(s)$, where $a_1(s) = M_R - M_r$ and $a_2(s) = m_r - m_R$. By assuming that s is the middle point in a luminance transition, the relationship $a_1(s) = a_2(s)$ holds. In the case of noise, the change in the sign of the difference function D(s) is a more effective indicator of the presence of a contour [10]. Since D(s) approximates the directional derivative of the luminance signal along the gradient direction [10], solving D(s) = 0 is equivalent to finding the flex points of luminance transitions. In particular, we look for zero-points and zero-crossing points of D(s). Hence, a threshold is required, so that the zero-points are the samples s satisfying the condition -threshold < D(s) < threshold. Successively, edge samples are detected according to the algorithm (4) [10]. In other words, by applying the algorithm (4) to the sample itself and to the four neighboring samples, preliminary edge detection is achieved. In order to effectively implement (4) onto the Bi-i, the first step is the computation of D(s), which can be realized using order-statistics filters.
Order-statistics filters are nonlinear spatial filters that enable maximum and minimum values to be readily computed on the Bi-i platform. They order the pixels contained in a neighborhood of the current pixel, and then replace the pixel in the centre of the neighborhood with the value determined by the selected method. Therefore, these filters are well suited to find the minimum and maximum values in the neighborhood of the current pixel. The implementation of D(s) gives the images in Figure 4a, c for Foreman and Car-phone, respectively.
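As an illustration of how the difference function can be obtained from minimum and maximum filters, the sketch below computes D(s) = (M_R - M_r) - (m_r - m_R) on a small image. It is a plain software analogue, not the DSP code of the article: the discretization of the circular windows (pixels whose squared distance from the centre does not exceed the squared radius), the border handling (pixels outside the image are ignored), and all function names are assumptions.

```python
def disc_offsets(radius):
    """Offsets (dy, dx) of the pixels inside a discrete circular window."""
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= radius * radius]


def window_min_max(img, y, x, offsets):
    """Minimum and maximum of img over the window centred at (y, x)."""
    rows, cols = len(img), len(img[0])
    values = [img[y + dy][x + dx]
              for dy, dx in offsets
              if 0 <= y + dy < rows and 0 <= x + dx < cols]
    return min(values), max(values)


def dual_window_difference(img, r=1, R=2):
    """Difference function D(s) of the dual window operator, for every pixel."""
    small, large = disc_offsets(r), disc_offsets(R)
    D = []
    for y in range(len(img)):
        row = []
        for x in range(len(img[0])):
            m_r, M_r = window_min_max(img, y, x, small)
            m_R, M_R = window_min_max(img, y, x, large)
            a1 = M_R - M_r
            a2 = m_r - m_R
            row.append(a1 - a2)
        D.append(row)
    return D


# A vertical luminance ramp: dark on the left, bright on the right.
ramp = [[0, 0, 0, 10, 20, 20, 20] for _ in range(5)]
D = dual_window_difference(ramp)
# On the middle row, D is positive on the dark side of the transition,
# zero at its middle point, and negative on the bright side.
```

On this ramp, D changes sign across the transition and vanishes at its middle point; D is also zero in the flat area far from the transition, which is the reason why a further selection step is needed afterwards.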
Going to Step 2, the threshold is implemented on the ACE16k using the ConvLAMtoLLM function. Then, the relationship -threshold < D(s) < threshold is satisfied by implementing the operations inversion, OR, and inversion again on the ACE16k chip. Note that we look for samples s such that D(s) = 0. Additionally, we look for samples s satisfying the condition D(s) ≥ 0 while, simultaneously, D(s) is negative in a cross-shaped neighborhood of s. Specifically, at least one of the four conditions $D(x_0 \pm 1, y_0 \pm 1) < 0$ must be satisfied. Thus, we need to compute D(s) by exploring proper neighborhoods of $(x_0, y_0)$, two examples of which are reported in
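The selection of the edge candidates can be sketched as follows. The sketch marks a sample as a candidate when D(s) lies inside the threshold band (zero-point) or when D(s) ≥ 0 and at least one neighbor has a negative D (zero-crossing). Reading the cross-shaped neighborhood as the four horizontal and vertical neighbors is an interpretation made here, and the function name and threshold value are hypothetical; on the Bi-i the same result is obtained with thresholding and logic operations on the ACE16k.

```python
CROSS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # assumed cross-shaped neighborhood


def edge_candidates(D, threshold):
    """Mark zero-points and zero-crossing points of the difference function D."""
    rows, cols = len(D), len(D[0])
    edges = [[False] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            d = D[y][x]
            # Zero-point: D(s) inside the band (-threshold, threshold).
            zero_point = -threshold < d < threshold
            # Zero-crossing: D(s) >= 0 with a negative value among the neighbors.
            zero_crossing = d >= 0 and any(
                0 <= y + dy < rows and 0 <= x + dx < cols and D[y + dy][x + dx] < 0
                for dy, dx in CROSS
            )
            edges[y][x] = zero_point or zero_crossing
    return edges
```

Note that the zero-point test also fires in areas of constant luminance, where D is zero without any edge being present; this is the behavior that the final edge detection phase is designed to correct.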

Final edge detection
The aim of this phase is to better select the previously detected edges. Referring to the previous section, note that the zeros of D(s) are not only flex points of luminance transitions, but also the set of pixels having a neighborhood where luminance is almost constant [10].
Since noise causes small fluctuations, these fluctuations may generate changes in the sign of D that would be incorrectly assumed as edge points. Therefore, in order to better select the edges detected in the previous phase, we need to integrate the available information with the slope of the luminance signal. To this purpose, note that $M_R$ and $m_R$ identify the direction of maximum slope in the neighborhood of s [10]. Therefore, by suitably exploiting $M_R$ and $m_R$, we first need to generate a matrix S, which takes into account the slope of the luminance signal. Then, a threshold gradient operation is applied to S, with the aim of obtaining a gradient matrix G. Namely, the final objective is to obtain an image that includes all the edges selected by the gradient operation (i.e., $Y_i^{\mathrm{grad}}$). Successively, the image $Y_i^{\mathrm{grad}}$ needs to be cleaned and skeletonized, in order to reduce all the edges to one-pixel thin lines. The image reporting the
Figure caption (partial): (e) neighborhood of $(x_0, y_0)$ containing an edge; (f) neighborhood of $(x_0, y_0)$ not containing any edge; (g) edges obtained by the condition -threshold < D(s) < threshold; (h) edges obtained by the four conditions on the neighborhoods of $(x_0, y_0)$.
In order to effectively implement the algorithm (5) onto the Bi-i, at first the matrix D(s) is processed by means of the ConvLAMtoLLM function, which implements the threshold 'zero' on D(s). Then, the pixels in D that correspond to D(s) ≥ 0 assume the maximum value of the luminance signal (within the window of radius R) and generate the image $M_R^D$. Similarly, the pixels in D that correspond to D(s) < 0 assume the minimum value of the luminance signal and generate the image $m_R^D$. Then, in order to implement the matrix S(s), we need the new switch template (6). The matrix S(s) is generated on the ACE16k chip, where $M_R^D$ is used as input, $m_R^D$ as state, and the output of the 'zero' threshold as mask. Referring to the template (6), we have chosen the name switch since the image S(s) is obtained by 'switching' between $M_R(s)$ and $m_R(s)$, depending on the mask values. Note that the template (6), by providing the matrix S(s), enables the slope of the luminance signal to be taken into account. The experimental results for S(s) are reported in Figures 5a and 6a for Foreman and Car-phone, respectively.
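The effect of the switch operation can be illustrated with a pixel-wise selection. The sketch below is only a software reading of the description above (take the maximum image where D(s) ≥ 0 and the minimum image elsewhere); it does not reproduce the CNN template (6), and the function names are hypothetical.

```python
def switch(input_img, state_img, mask):
    """Select input_img where mask is True and state_img where it is False."""
    return [[inp if m else st for inp, st, m in zip(row_i, row_s, row_m)]
            for row_i, row_s, row_m in zip(input_img, state_img, mask)]


def slope_matrix(D, M_R, m_R):
    """Build S(s): the window maximum where D(s) >= 0, the window minimum otherwise."""
    zero_threshold = [[d >= 0 for d in row] for row in D]  # 'zero' threshold on D
    return switch(M_R, m_R, zero_threshold)
```

For instance, with D = [[3, -2], [0, -1]], a maximum image [[9, 9], [8, 8]] and a minimum image [[1, 1], [2, 2]], the resulting S is [[9, 1], [8, 2]].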
Then, according to the algorithm (5), we need to implement the threshold gradient operation onto the Bi-i. This can be done using a sequence of eight templates, applied in the eight directions N, NW, NE, W, E, SW, S, and SE. For example, referring to the NW direction, the novel template (7) is implemented on the ACE16k, where the bias is used as a threshold level (herein, thres = -1.1). The remaining seven templates can be easily derived from (7). Then the logic OR is applied to the eight output images in order to obtain a single image, which is denoted by G(s) (see Figure 5b). Note that G stands for gradient, given that it represents the output of the threshold gradient (7). However, the image G needs to be cleaned, since it usually contains some open lines (see the upper left side in Figure 5b). These open lines can be deleted by applying the prune template. The output of the prune function is reported in Figure 5c, where it can be seen that the open line in the upper left part has been partially deleted. Note that the prune function also enables the black part in Figure 5c to become more compact (i.e., the white dots in the black part have disappeared). Then, the hollow template reported in [13] has to be applied. This template, running on the ACE16k chip, enables the concave locations of objects to be filled. The output of the hollow is shown in Figure 5d. The white part in Figure 5d indicates that the corresponding part in the image S(s) does not contain information related to edges. Since the hollow is time-consuming, it is useful to carry out this operation by exploiting the great computational power offered by the CNN chip.
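The structure of the threshold gradient (eight directional operations whose binary outputs are combined by a logic OR) can be sketched as follows. Since the coefficients of the template (7) are not reproduced here, the directional operation is replaced by a plain thresholded difference between a pixel and its neighbor in the given direction; this substitution, the function names, and the threshold scale are all assumptions and should not be read as the template actually used on the ACE16k.

```python
DIRECTIONS = {
    "N": (-1, 0), "NE": (-1, 1), "E": (0, 1), "SE": (1, 1),
    "S": (1, 0), "SW": (1, -1), "W": (0, -1), "NW": (-1, -1),
}


def directional_threshold(S, offset, thr):
    """True where S(s) exceeds its neighbor in one direction by more than thr."""
    dy, dx = offset
    rows, cols = len(S), len(S[0])
    out = [[False] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                out[y][x] = S[y][x] - S[ny][nx] > thr
    return out


def threshold_gradient(S, thr):
    """Logic OR of the eight directional outputs, giving the binary image G."""
    rows, cols = len(S), len(S[0])
    G = [[False] * cols for _ in range(rows)]
    for offset in DIRECTIONS.values():
        part = directional_threshold(S, offset, thr)
        for y in range(rows):
            for x in range(cols):
                G[y][x] = G[y][x] or part[y][x]
    return G
```

The pruning and hollow operations that follow are morphological clean-up steps applied to G and are not covered by this sketch.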
Finally, by using the switch template (6) with input =

Object detection
The proposed object detection phase can be described using an iterative procedure. In the first step, the hole-filler template is applied to the inverted image of $Y_i^{\mathrm{final\ edge}}$ with the aim of filling all the holes. Figure 7 depicts the outputs of the hole-filler after different processing times, in order to show the system behavior when the processing times are increased. Note that the hole-filler has to be applied in a recursive way, in order to fill more and more holes. However, differently from Figure 7, which has an explanatory purpose, we need to apply this template by slowly increasing the processing times. Namely, if we slowly increase the processing times, it is possible to highlight at most two closed objects at a time, so that these objects can be extracted in the next steps. As a consequence, the hole-filler plays an important role: by slowly filling the holes in a morphological way, it enables the closed objects to be extracted in the next steps of the algorithm. In order to implement the second step, the logic XOR is applied between the output of the hole-filler (i.e., where the image $Y_i^{\mathrm{dilation}\,(k+1)}$ is used as input and the image $Y_i^{\mathrm{final\ edge}}$ as state. In order to show how the recall template works, Figure 9 shows its output after different processing times. Note that the recall template has to be applied in a recursive way. In particular, by increasing the processing times, more and more objects are recalled (see Figure 9). However, differently from Figure 9, which has an explanatory purpose, herein we need to apply this template by slowly increasing the processing times. Namely, in order to guarantee a satisfactory total frame rate, we need to recall few objects at a time, so that the processing times due to the recall template are not large. In this way, the slow recursive application of the recall template does not affect the overall system performance.
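For readers more familiar with software morphology, the converged behavior of the hole-filler and recall operations can be approximated by connected-region propagation. The sketch below is such an analogue, under stated assumptions: 4-connectivity, binary images stored as lists of booleans, and hypothetical function names. It returns the fully converged result, whereas on the ACE16k the templates are run for short, slowly increasing processing times so that only a few objects are filled or recalled at a time.

```python
from collections import deque

NEIGHBORS_4 = ((-1, 0), (1, 0), (0, -1), (0, 1))  # assumed connectivity


def recall(marker, mask):
    """Keep the connected regions of mask that contain at least one marker pixel."""
    rows, cols = len(mask), len(mask[0])
    out = [[False] * cols for _ in range(rows)]
    queue = deque()
    for y in range(rows):
        for x in range(cols):
            if marker[y][x] and mask[y][x]:
                out[y][x] = True
                queue.append((y, x))
    while queue:  # propagate the marker inside the mask
        y, x = queue.popleft()
        for dy, dx in NEIGHBORS_4:
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols and mask[ny][nx] and not out[ny][nx]:
                out[ny][nx] = True
                queue.append((ny, nx))
    return out


def fill_holes(edges):
    """Fill every region of the edge image that is not connected to the image border."""
    rows, cols = len(edges), len(edges[0])
    background = [[not e for e in row] for row in edges]
    border = [[y in (0, rows - 1) or x in (0, cols - 1) for x in range(cols)]
              for y in range(rows)]
    outside = recall(border, background)  # background reachable from the border
    return [[not o for o in row] for row in outside]
```

In this analogue, a closed contour is turned into a solid object by `fill_holes`, while `recall` retains only those regions of one image that are touched by another image, which mirrors the input/state roles described above.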
In conclusion, the recall template plays an important role: by taking into account the image containing the final edge (state), it enables the objects enclosed in the dilated image (input) to be recalled and subsequently extracted. Now, by applying the recall template (11) using the image in Figure 8b as input and the image in Figure 5f as state, the image reported in Figure 10a is obtained. This image, indicated by $Y_i^{\mathrm{recall}\,(k+1)}$, is constituted by groups of objects. In order to obtain new objects at each iteration, we need to detect the changes between the images $Y_i^{\mathrm{recall}\,(k+1)}$ and $Y_i^{\mathrm{changes}\,(k)}$, as indicated by Step 6. To this purpose, we can apply the logic XOR between $Y_i^{\mathrm{recall}\,(k+1)}$ and $Y_i^{\mathrm{changes}\,(k)}$. If changes are detected, we need to check whether the extracted object belongs to the moving objects. This operation is implemented by applying the AND operation between the output of the previous XOR and the motion detection mask $Y_i^{\mathrm{MD}}$. The output of the AND is indicated by $Y_i^{\mathrm{extracted}\,(k+1)}$. For example, the objects extracted after the first iteration are shown in Figure 10b. Finally, the extracted object Y This iterative procedure is carried out until all the objects are extracted. Namely, the procedure ends when the condition Y is achieved for two consecutive iterations. Figures 8 and 10 summarize some of the fundamental steps of the object detection algorithm for the Foreman video sequence. Similar results have been obtained for the Car-phone video sequence.
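The change-detection part of one iteration (XOR between the recalled image and the image from the previous iteration, followed by an AND with the motion detection mask) can be sketched as below. Only this single step is shown; how the images are updated between iterations and the exact stopping condition are not reproduced, and the function names are hypothetical.

```python
def logic_xor(a, b):
    """Pixel-wise XOR of two binary images."""
    return [[p != q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def logic_and(a, b):
    """Pixel-wise AND of two binary images."""
    return [[p and q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def extract_moving_changes(recalled, previous, motion_mask):
    """One extraction step: new pixels since the previous iteration that also moved.

    Returns the extracted image and a flag telling whether any change was detected.
    """
    changes = logic_xor(recalled, previous)
    changed = any(any(row) for row in changes)
    extracted = logic_and(changes, motion_mask)
    return extracted, changed
```

When the returned flag is False, no new object has been recalled in the current iteration, which is the kind of situation the stopping rule of the procedure has to detect.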

Discussion
We discuss the results of our approach by making comparisons with the previous CNN-based methods illustrated in [3] and [5]. We would remark that the comparison between the proposed approach and the methods in [3] and [5] is homogeneous, since we have implemented all these techniques on the same hardware platform (i.e., the Bi-i). At first, we compare these approaches by visual inspection. By analyzing the results in Figures 11 and 12, it can be noticed that the proposed technique provides more accurate segmented objects than the ones obtained by the techniques in [5] and [3]. For example, the analysis of Figure 11a suggests that the proposed approach is able to detect the man's mouth, eyes, and nose. Note the absence of open lines too. The methods depicted in Figure 11b, c do not offer similar capabilities. Referring to Figure 12a, note that we have obtained an accurate result, since the man's mouth, eyes, and nose are detected, along with some moving parts in the back of the car. Again, the approaches depicted in Figure 12b, c do not reach similar performance. It can be concluded that, by exploiting the proposed approach, the edges are much closer to the real edges than those obtained by the methods in [5] and [3].
An estimate of the processing time achievable by the proposed approach is given in Table 1. Note that the motion detection and the object detection phases can be fully implemented on the ACE16k chip, whereas the edge detection phase requires that some parts be implemented on the DSP (see Section 4). The sum of the processing times of the different phases is 37767 μs, which gives a frame rate of about 26 frames/s. Note that the computational load is mainly due to the DSP in the edge detection phase (28778 μs) and, specifically, to the presence of the order-statistics filters. On the other hand, these filters are required to implement the dual window operator, which is in turn required to achieve accurate edge detection, as explained in [10]. Namely, edge detection is a crucial step for segmentation: if the edges are detected accurately, the images can be segmented correctly. If we analyze the result in reference [5], we note that the authors use a threshold gradient algorithm, which is not particularly suitable for edge detection. On the other hand, the dual window operator is one of the best edge detectors (see [10]), even though its implementation is time-consuming. Referring to the processing times measured on the Bi-i for the methods in [3] and [5], their values are 13861 and 5254 μs, respectively. The corresponding frame rates are 72 and 190 frames/s, respectively, while our approach gives 26 frames/s. Thus, the segmentation methods in [3] and [5] are faster than the proposed approach, even though they are less accurate, as confirmed by Figures 11 and 12. Nevertheless, we believe that 26 frames/s can be considered a satisfactory frame rate for the proposed approach, since it represents a good trade-off between accuracy and speed.
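The frame rates quoted above follow directly from the measured processing times, since a frame that takes t microseconds allows 10^6 / t frames per second. The short check below uses only the figures reported in the text; the variable and function names are illustrative.

```python
# Processing times per frame, in microseconds, as reported in the text.
PROPOSED_TOTAL_US = 37767   # sum of all phases of the proposed approach
PROPOSED_DSP_EDGE_US = 28778  # DSP part of the edge detection phase
METHOD_3_US = 13861         # method in [3]
METHOD_5_US = 5254          # method in [5]


def frames_per_second(time_us):
    """Frame rate achievable when one frame takes time_us microseconds."""
    return 1e6 / time_us


def share(part_us, total_us):
    """Fraction of the total processing time spent in one part."""
    return part_us / total_us


rates = {
    "proposed": frames_per_second(PROPOSED_TOTAL_US),
    "method [3]": frames_per_second(METHOD_3_US),
    "method [5]": frames_per_second(METHOD_5_US),
}
dsp_share = share(PROPOSED_DSP_EDGE_US, PROPOSED_TOTAL_US)

for name, fps in rates.items():
    print(f"{name}: {fps:.1f} frames/s")
print(f"DSP edge detection share: {dsp_share:.0%}")
```

The same arithmetic shows that the DSP part of the edge detection accounts for roughly three quarters of the total processing time, which is consistent with the remark that the order-statistics filters dominate the computational load.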
Finally, we would point out that, while we were conducting this research, a novel Bio-inspired architecture called the Eye-RIS vision system was introduced [21]. It is based on the Q-Eye chip [21], which represents an evolution of the ACE family with the aim of overcoming the main drawbacks of ACE chips, such as lack of robustness and large power consumption. Our plan is to implement the segmentation algorithm developed herein on the Eye-RIS vision system in the near future. To this purpose, note that one of the authors (F. Karabiber) has already started to work on the Eye-RIS vision system, as shown by the results published in [22].
Figure 11 Foreman video sequence. (a) segmentation by our method; (b) segmentation by the method in [5]; (c) early segmentation in [3].

Conclusion
This article has presented the implementation of a novel CNN-based segmentation algorithm onto a Bio-inspired hardware platform, called the Bi-i Cellular Vision System [9]. This platform combines the analog processing based on the ACE16k processor [11] and the digital processing based on the DSP. The experimental results, carried out for some benchmark video sequences, have shown the feasibility of the approach, which provides a satisfactory frame rate of about 26 frames/s. Finally, comparisons with the CNN-based techniques in [5] and [3] have highlighted the accuracy of the proposed method.