- Research
- Open Access

# Initial condition for efficient mapping of level set algorithms on many-core architectures

- Gábor János Tornai^{1}
- György Cserey^{1}

**2014**:30

https://doi.org/10.1186/1687-6180-2014-30

© Tornai and Cserey; licensee Springer. 2014

**Received:** 17 September 2013 **Accepted:** 20 February 2014 **Published:** 10 March 2014

## Abstract

In this paper, we investigated the effect of adding more small curves to the initial condition, which determines the required number of iterations of a fast level set (LS) evolution. As a result, we discovered two new theorems and developed a proof on the worst case of the required number of iterations. Furthermore, we found that these kinds of initial conditions map well onto many-core architectures. To show this, we have included two case studies which are presented on different platforms. One runs on a graphical processing unit (GPU) and the other is executed on a cellular nonlinear network universal machine (CNN-UM). With the new initial conditions, the steady-state solutions of the LS are reached in fewer than eight iterations depending on the granularity of the initial condition. These dense iterations can be calculated very quickly on many-core platforms according to the two case studies. In the case of the proposed dense initial condition on GPU, there is a significant speedup compared to the sparse initial condition in all cases since our dense initial condition together with the algorithm utilizes the properties of the underlying architecture. Therefore, greater performance gain can be achieved (up to 18 times speedup compared to the sparse initial condition on GPU). Additionally, we have validated our concept against numerically approximated LS evolution of standard flows (mean curvature, Chan-Vese, geodesic active regions). The Dice indices between the fast LS evolutions and the evolutions of the numerically approximated partial differential equations are in the range of 0.99±0.003.

## Keywords

- Graphical Processing Unit
- Gaussian Mixture Model
- Cellular Neural Network
- Active Front
- Speed Function

## Introduction

The use of level-set (LS)-based curve evolution has become an interesting research topic due to its versatility and accuracy. These flows are widely used in various fields like computational geometry, fluid mechanics, image processing, computer vision, and materials science [1]. In general, the method entails that one evolves a curve, surface, or image with a partial differential equation (PDE) and obtains the result at a point in the evolution. There is a subset of problems where only the steady state of the LS evolution is of practical interest like segmentation and detection. In this article, only this subset is considered. In addition, we do not form any operator or speedfield (*F*) for driving the evolution of the LSs. However, two theoretical worst-case bounds of the required number of iterations are proposed to reach the steady state for a well-defined class of LS-based evolution. These bounds depend only on the initial condition. Furthermore, the bounds only allow an extremely small number of iterations if the evolution is calculated with a properly chosen initial condition. These kinds of evolutions are calculated very quickly on many-core devices.

The subject of this paper is both theoretical and practical. The theoretical side consists of the two new theorems on the worst case of the required number of iterations of the Shi LS evolution [2]. This evolution omits the numerical solution of the underlying PDE and successfully approximates it with a rule-based evolution. It is based on the sign of the driving forces (*F*) normal to the curves to be changed. The first theorem gives a general bound, and the second one assumes a special kind of discrete convexity defined in subsection ‘Basic definitions’. The practical side is presented through two case studies, namely, the LS evolution of Shi is mapped in a straightforward way on two completely different many-core architectures. With a lot of small curves in the initial condition, which would be infeasible on a conventional single-core processor, the proposed theorems ensure a small number of iterations. Additionally, with the change in the initial condition (instead of one curve, a lot of small curves are used), the computing width of the many-core platform is utilized.

The first successful method to speed up the LS evolution was introduced in [3]. A local method was proposed in [4] with better characteristics. These methods are labeled as narrow banding methods. However, we are not presenting any PDE operators and do not design any speedfield. Instead, we direct the reader to the work of Sapiro [5], who gives a detailed picture of the art of PDE operator design for a given purpose. Furthermore, there are several results [6–9] regarding region- and model-based evolutions. In [10] the authors used the Sobolev norm instead of the standard inner-product-based *L*_{2} norm and showed that this norm makes it possible to implement new energies that were otherwise considered infeasible.

There are multiple results reporting successful mapping of LS evolution to many-core platforms. An interactive 3D solver for GPU is presented in [11]. This realization uses a graphics application programming interface (API) (OpenGL) and the rendering pipeline. Two later works [12, 13] applied the computing-unified device architecture (CUDA) of NVIDIA (NVIDIA Corporation, Santa Clara, CA, USA). Both papers worked with 3D volumes. In [12] the authors mapped a sparse solver, while in [13] a higher-order scheme was used to evolve the LSs. Cellular neural networks (CNN) [14] proved to be an inspiring construct. There have been results regarding the mapping of LS-like evolutions to CNN [15–17]. Cserey et al. [15] successfully mapped the nonlinear histogram modification to CNN. A later work [17] realized an online boundary detection algorithm based on LS curve evolution to extract the volume of the right atrium. These papers and results indicate that various LS evolutions can be mapped and used on different many-core platforms. In this paper, we do not present a new numerical variant or a many-core parallel implementation of the narrow band algorithm. Instead, we focus on the given type of evolution on many-core devices, and for this evolution we give two theorems that bound from above the required number of iterations of the evolution process.

This paper is organized as follows. The next section summarizes the theory for curve evolution realized as LS motion, then some definitions follow and the two theorems on worst-case bound of the required number of iterations with their proofs close the theoretical part. Section ‘Many-core hardware platforms’ describes the hardware platforms used in the two case studies. Section ‘Experiments’ presents the experimental results. In section ‘Validation,’ we validate our method against three numerically solved PDEs of LS flows. It is followed by section ‘Discussion’, and the last section concludes our paper.

## Theory

### Basic curve evolution

Consider curves *γ*(*s*,*t*) (or surfaces) in **R**^{k} (*k*=2,3), where *s* parameterizes the curve and *t* is the (artificial) time. Our aim is to trace *γ*(*s*), which moves normal to itself, with a given speed function *F*. If there are no restrictions on the sign of *F*, then the motion of *γ* can be complex. Furthermore, it is hardly trackable by curve-based (Lagrangian) methods since these methods cannot handle topological changes and become unstable near singularities. However, the initial curve *γ*(*s*,*t*=0) can be embedded as the zero LS of a higher dimensional function so the evolution of this function is linked to the propagation of the front through the time evolution of this initial value problem. This family of methods is called level set methods (LSM). The evolution equation of these methods is as follows:
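For reference, a common form of the LS equation in the literature is given below. It is stated under the assumptions that *ϕ* is negative inside the curve and that a positive *F* moves the front outward along its normal; other sign conventions flip the sign of the second term.

$$\frac{\partial \phi}{\partial t} + F\,\lvert \nabla \phi \rvert = 0, \qquad \phi(\mathbf{x}, t=0)\ \text{given}.$$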

The curve *γ* is embedded into a function that is called the LS function. It is denoted by *ϕ*. The uniformly sampled domain of *ϕ* is denoted by *D*, and a point **x**∈*D* is characterized by its coordinates (**x**=(*x*_{1},…,*x*_{k})). The region enclosed by *γ* is referred to as the object region and is denoted by *Ω*. Similarly, the region outside *γ* is referred to as the background region and is denoted by *Γ*. Let *Ω*^{∗} and *Γ*^{∗} denote the true object region and true background region, respectively.

*ϕ* is chosen to be negative inside *Ω* and positive in *Γ*. The zero LS is represented implicitly by *ϕ*. The LS evolution process of [2] is used throughout this work. In the neighborhood of the zero LS, two sets are defined uniquely with respect to the selected discrete neighborhood:

*N*(**x**) is the selected neighborhood defined by Equation 4, but can be any discrete neighborhood known from discrete topology. *L*_{in} and *L*_{out} are referred to as the active front, and *ϕ* is defined as an approximated signed distance function:

The LSM of Shi uses only the sign of the speed function *F* to determine the motion of the active front at a given point. The speed function itself is chosen according to the requirements and can be arbitrary (Chan-Vese, geodesic active contour, geodesic active region (GAR) - Gaussian mixture model (GMM), nonparametric mixture model, etc.). It is (re)calculated in each cycle for the active front, but only its sign at each point of the active front determines the action on that point. Furthermore, from an implementation point of view, the two sets of the active front are realized as two separate lists. The motion of the active front is realized by two local switch operators, one for outward and one for inward motion. By applying *switch_in()* to any point in *L*_{out} where *F* is positive, the active front is moved outward by one grid point at that location. Namely, the point is moved from *L*_{out} to *L*_{in}, its exterior neighbors are put into *L*_{out}, and its interior neighbors are deleted from *L*_{in}. The *switch_out()* procedure operates similarly in the other direction.

One iteration of the evolution of the algorithm performs four scans. Firstly, it scans *L*_{out} for *switch_in()* then it scans *L*_{in} for elements that are not in the active front any more (all neighbors reside in *Ω*), afterwards, *L*_{in} is scanned for *switch_out()*. Lastly, *L*_{out} is scanned for elements that are not in the active front any more. Basically, the algorithm makes one step inward or outward according to the sign of the speed function. If it contains a curvature-dependent smoothing term, then sharpened gaussian filtering (shock-filtered heat diffusion) can be applied just on the active front itself in a different cycle after a given number of evolution cycles. For further details see [2, 18, 19].
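The four scans described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of [2]: it assumes a 2D grid, a four-connected neighborhood, and an LS function stored as a dictionary with the values −3 (interior), −1 (*L*_{in}), +1 (*L*_{out}), and +3 (exterior). The function names, the stopping test in `run`, and the small test grid are hypothetical.

```python
# Sketch of a Shi-style fast level set iteration on a small 2D grid.
# Assumptions (not from the text): 4-neighborhood, phi stored in a dict with
# values -3 (interior), -1 (L_in), +1 (L_out), +3 (exterior).

def neighbors(p, shape):
    """Yield the 4-connected neighbors of p that lie inside the grid."""
    r, c = p
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        q = (r + dr, c + dc)
        if 0 <= q[0] < shape[0] and 0 <= q[1] < shape[1]:
            yield q

def init(shape, omega):
    """Build phi, L_in and L_out from an initial object region omega."""
    phi = {(r, c): (-3 if (r, c) in omega else 3)
           for r in range(shape[0]) for c in range(shape[1])}
    l_in, l_out = set(), set()
    for p in phi:
        inside = p in omega
        if any((q in omega) != inside for q in neighbors(p, shape)):
            if inside:
                l_in.add(p)
                phi[p] = -1
            else:
                l_out.add(p)
                phi[p] = 1
    return phi, l_in, l_out

def switch_in(p, phi, l_in, l_out, shape):
    """Move p from L_out to L_in; its exterior neighbors join L_out."""
    l_out.discard(p)
    l_in.add(p)
    phi[p] = -1
    for q in neighbors(p, shape):
        if phi[q] == 3:
            phi[q] = 1
            l_out.add(q)

def switch_out(p, phi, l_in, l_out, shape):
    """Move p from L_in to L_out; its interior neighbors join L_in."""
    l_in.discard(p)
    l_out.add(p)
    phi[p] = 1
    for q in neighbors(p, shape):
        if phi[q] == -3:
            phi[q] = -1
            l_in.add(q)

def iterate(phi, l_in, l_out, F, shape):
    """One iteration consisting of the four scans."""
    for p in [p for p in l_out if F(p) > 0]:      # scan 1: outward step
        switch_in(p, phi, l_in, l_out, shape)
    for p in list(l_in):                          # scan 2: clean L_in
        if all(phi[q] < 0 for q in neighbors(p, shape)):
            l_in.discard(p)
            phi[p] = -3
    for p in [p for p in l_in if F(p) < 0]:       # scan 3: inward step
        switch_out(p, phi, l_in, l_out, shape)
    for p in list(l_out):                         # scan 4: clean L_out
        if all(phi[q] > 0 for q in neighbors(p, shape)):
            l_out.discard(p)
            phi[p] = 3

def run(shape, omega, F, max_iter=1000):
    """Iterate until no front point wants to move; return (object, iterations)."""
    phi, l_in, l_out = init(shape, omega)
    n = 0
    while n < max_iter and (any(F(p) > 0 for p in l_out)
                            or any(F(p) < 0 for p in l_in)):
        iterate(phi, l_in, l_out, F, shape)
        n += 1
    return {p for p, v in phi.items() if v < 0}, n

# Usage: grow a single seed pixel into a 3x3 target on a 7x7 grid.
target = {(r, c) for r in range(2, 5) for c in range(2, 5)}
result, n_it = run((7, 7), {(3, 3)}, lambda p: 1 if p in target else -1)
```

In this sketch the deletion of interior neighbors from *L*_{in} is carried out by the cleaning scan rather than inside `switch_in`, which keeps the two switch operators symmetric.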

### Basic definitions

A path *p* between **x** and **y** is a sequence of points **x**_{l} ∈ *D* (*l*=0,1,…,*L*) subject to **x**_{l} ∈ *N*(**x**_{l+1}), **x**=**x**_{0}, and **y**=**x**_{L}. A set of points $\mathcal{A}$ forms a connected region if and only if there exists a path *p* between every $\mathbf{x},\mathbf{y}\in \mathcal{A}$ such that every **x**_{l} ∈ *p* is an element of $\mathcal{A}$.

The length of a path is a non-negative integer (*L*) and *L*=|*p*|−1, where |.| denotes the number of points in the path. A minimum path *p*_{min} is a shortest path, meaning there are no shorter paths *p*^{′} between **x** and **y**. A minimum path is usually not unique and can depend on the chosen discrete neighborhood. The distance between **x** and **y** is a non-negative integer that is exactly the length of a minimum path between the two points. This is a real metric and is going to be referred to as *d*_{d}.

Within a connected region $\mathcal{A}$, a path *p* between **x** and **y** is minimal if and only if $\mathcal{A}\cap p=p$ and there are no shorter paths *p*^{′} within $\mathcal{A}$ between **x** and **y**. Like the minimum path, the minimal path may not be unique and may depend on the chosen neighborhood. The diameter *B* of a connected region is the longest minimum path having at least its endpoints within the connected region. A connected region is considered convex if all minimal paths are minimum paths at the same time. A configuration *C*={*D*×*ϕ*} is the actual state of the LS function, namely, the shape of the zero LS and the connected regions (*Ω*_{p},*Γ*_{q}) composing the object and the background region.
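The definitions above can be made concrete with a short Python sketch. It assumes a four-connected neighborhood and a domain *D* large enough that the minimum path between two points is unobstructed, so that *d*_{d} reduces to the Manhattan distance; the function names are hypothetical. The minimal path length is found by a breadth-first search restricted to the region.

```python
from collections import deque
from itertools import combinations

N4 = ((1, 0), (-1, 0), (0, 1), (0, -1))

def path_length(x, y, allowed):
    """Length of a shortest 4-connected path from x to y inside `allowed`.

    Returns None when no such path exists.
    """
    dist = {x: 0}
    queue = deque([x])
    while queue:
        p = queue.popleft()
        if p == y:
            return dist[p]
        for dr, dc in N4:
            q = (p[0] + dr, p[1] + dc)
            if q in allowed and q not in dist:
                dist[q] = dist[p] + 1
                queue.append(q)
    return None

def d_d(x, y):
    """Minimum path length on an unobstructed grid (assumption: 4-neighborhood)."""
    return abs(x[0] - y[0]) + abs(x[1] - y[1])

def diameter(region):
    """Longest minimum path whose endpoints lie in the region."""
    return max((d_d(x, y) for x, y in combinations(region, 2)), default=0)

def is_convex(region):
    """True if every minimal path (inside the region) is also a minimum path."""
    return all(path_length(x, y, region) == d_d(x, y)
               for x, y in combinations(region, 2))

square = {(r, c) for r in range(3) for c in range(3)}
u_shape = square - {(0, 1), (1, 1)}
```

With these definitions the 3×3 square is convex, while the U-shaped region is not: the two tips of the U are two steps apart in the plane but six steps apart inside the region.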

Now we have all the necessary tools to establish proper worst-case bounds on the number of iterations required by the Shi LSM to converge.

### Theoretical results: worst-case bounds

#### Theorem 1 (General bound)

Let the true object region *Ω*^{∗} be composed of *P* connected regions ${\Omega}_{p}^{\ast}$ (where $p=1\dots P$) and the true background region *Γ*^{∗} be composed of *Q* connected regions ${\Gamma}_{q}^{\ast}$ (where $q=1\dots Q$). Assume that *F*>0 in *Ω*^{∗} and *F*<0 in *Γ*^{∗}. At initialization, *C* is chosen such that *Ω*=∪_{i}*Ω*_{i}, *Γ*=∪_{j}*Γ*_{j}, ${\Omega}_{p}^{\ast}\cap \Omega \ne \varnothing$ for all $p=1\dots P$, and $(D\setminus \Omega )\cap {\Gamma}_{q}^{\ast}\ne \varnothing$ for all $q=1\dots Q$. Then, the Shi LSM converges to *Ω*^{∗} in ${N}_{\text{it}}\le \max\left(\max_{i}|{\Omega}_{i}|,\ \max_{j}|{\Gamma}_{j}|\right)$ iterations, where |.| denotes the number of elements in the region.

#### Theorem 2 (Convex bound)

If either *Ω*^{∗} or *Γ*^{∗} is convex, then the Shi LSM converges to *Ω*^{∗} in ${N}_{\text{it}}\le \max\left(\max_{i}{B}_{{\Omega}_{i}},\ \max_{j}{B}_{{\Gamma}_{j}}\right)$ iterations, where *B* denotes the diameter of the given region.

*Proof of Theorem 1 on general bound*. Let ${\Omega}^{a}={\Omega}^{\ast}\cap \Omega =\bigcup _{p=1}^{P}{\Omega}_{p}^{a}$. These are fixed sets and will not change during the evolution process. Furthermore, *F*(*x*_{k})>0, ∀*x*_{k}∈*Ω*^{a}, which ensures that *Ω*^{a}⊆*Ω* as *Ω* evolves.

For each *Ω*_{i}, two cases are possible. First case: *Ω*_{i}∩*Ω*^{∗}=*∅*. Therefore, *Ω*_{i}⊆*Γ*^{∗}, so *F*(**x**)<0. On the boundary of *Ω*_{i}, *L*_{in,i}, a *switch_out* operation is applied, so the diameter of *Ω*_{i} decreases by two in every iteration. Second case: *Ω*_{i}∩*Ω*^{∗}≠*∅*. Therefore, the longest possible path in *Ω*_{i} gives the upper bound on the number of iterations, which is in turn bounded from above by the number of points in *Ω*_{i}. Following similar arguments, we can also show this for *Γ*_{j}. Taking the maximum of the upper bounds completes the proof of Theorem 1.

*Proof of Theorem 2 on convex bound.* Obviously, the first case of the proof of Theorem 1 obeys the desired bound. The second case is as follows. Since *Ω*^{∗} is convex, the length of the longest path is bounded by the diameter of *Ω*_{i}. In the worst case, *Ω*_{i}∩*Ω*^{∗} is one of the endpoints of the diameter. Following similar arguments, we can also show this for *Γ*_{j}. Taking the maximum of the diameters in each initial and background region completes the proof of Theorem 2.

The two theorems give an upper bound on *N*_{it}, and the iteration cycle checking the stopping condition is not necessary once the number of iterations has reached this upper bound. This worst-case bound is approached if *Ω*^{∗} or *Γ*^{∗} is degenerate in some sense (see Figure 1D and Table 1 for example). However, in many cases the stricter bound can be applied. We emphasize that the worst-case bound is a quantity derived from the initial condition.

**Table 1 Experimental validation of the theorems**

Number of circles | 1 | 2 | 4 | 8 | 16 | 24 | 32 | 64 |
---|---|---|---|---|---|---|---|---|
Bound according to Theorem 1 | 64^{2} | 32^{2} | 256 | 64 | 16 | 9 | 4 | 1 |
Bound according to Theorem 2 | 127 | 63 | 31 | 15 | 7 | 5 | 3 | 1 |
*N*_{it}, circle | 26 | 16 | 9 | 6 | 4 | 3 | 3 | 1 |
*N*_{it}, degenerate | 145 | 68 | 18 | 7 | 6 | 3 | 3 | 1 |

The possibility of choosing the initial shape of the regions *Ω*_{i} and *Γ*_{j} is essential to minimize the required number of iterations. It shall be noted that according to the Shi LSM, all calculations are done in the active front, which is directly connected to the initial shape of the aforementioned regions. The smaller both *Ω*_{i} and *Γ*_{j} are made, the smaller the worst-case bounds become. This statement leads us to section ‘Experiments’, namely, how to construct initial conditions that are minimal in the sense of worst-case bounds and can be mapped and processed efficiently on a many-core architecture. It should be noted that the presentation above does not depend on the dimensionality of the data, so the theorems are general from this point of view and the dimension can be arbitrary.
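How the general bound shrinks with finer initial regions can be illustrated with a small Python sketch. It assumes a four-connected neighborhood, takes *Γ* as the complement of *Ω* in the domain, and uses a chessboard-like layout of square tiles; the function names and the 8×8 domain are hypothetical choices for illustration. The bound of Theorem 1 is then the size of the largest connected region of either kind.

```python
from collections import deque

def components(points):
    """Split a set of grid points into 4-connected regions."""
    remaining, regions = set(points), []
    while remaining:
        seed = remaining.pop()
        region, queue = {seed}, deque([seed])
        while queue:
            r, c = queue.popleft()
            for q in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if q in remaining:
                    remaining.discard(q)
                    region.add(q)
                    queue.append(q)
        regions.append(region)
    return regions

def general_bound(shape, omega):
    """Theorem 1 bound: size of the largest initial region of Omega or Gamma."""
    domain = {(r, c) for r in range(shape[0]) for c in range(shape[1])}
    gamma = domain - omega  # assumption: the background is the complement
    return max(len(reg) for reg in components(omega) + components(gamma))

def chessboard(shape, tile):
    """Object region made of every second tile x tile square."""
    return {(r, c) for r in range(shape[0]) for c in range(shape[1])
            if (r // tile + c // tile) % 2 == 0}

bounds = {tile: general_bound((8, 8), chessboard((8, 8), tile))
          for tile in (1, 2, 4, 8)}
```

Tiles of the same kind touch only diagonally, so under the four-connected neighborhood each tile is its own region and the bound equals the tile area. The convex bound of Theorem 2 would replace the region size with the region diameter.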

## Many-core hardware platforms

### CNN universal machine

In a CNN, each cell is connected only to the cells within its local neighborhood (*N*_{r}). The connections are weighted and these weights are called templates. The templates define the operator that is applied on the input and the state, hence, they define the output dynamics.

Here *x*_{ij}, *u*_{ij}, and *y*_{ij} stand for the state, input, and output of the cell *ij*, respectively; *A*, *B*, and *z* are the template parameters: *A* for feedback, *B* for input, and *z*_{ij} for bias. The CNN-UM, in addition to the standard CNN, contains memories and control units to allow performing series of template operations and branching.

The experiments were done on an Eye-RIS v1.3 vision system (VS) (Anafocus Ltd., Seville, Spain). It consists of a Q-Eye, an Altera NIOS-II 32-bit RISC microprocessor, and on-chip RAM. The Q-Eye is a quarter common intermediate format (QCIF) monochrome image sensor processor with 7- to 8-bit *de facto* accuracy. It is a fine-grain CNN-UM implementation capable of nearest-neighborhood operations. The microprocessor handles the memory and the I/O ports and organizes the execution. The power consumption of the complete VS is below 750 mW.

### GPU

Recent GPUs are feasible for nongraphic operations as well and programmable through general purpose APIs like C for CUDA [21] or OpenCL [22]. In this paper, OpenCL nomenclature is used. The description below is a brief overview of GPUs and, in addition to the basics, gives only those details that have great influence on the LS evolution.

A function that can be executed on the GPU is called a kernel. Any call to a kernel must specify an NDRange for that call. This defines not only the number of work-items to be launched, but also the arrangement of groups of work-items to work-groups and work-groups to the NDRange. The dimensionality of a work-group can be one, two, or three.

Physically, the smallest execution unit is the computing element. A few computing elements together with a given amount of SDRAM, a scheduling unit, and a special function unit form a computing unit (CU). A device consists of several CUs and a global memory (off-chip).

The experiments were done on an NVIDIA GTX 780 GPU. It has 12 CUs, 192 computing elements and 48 KB of shared memory in each CU, and 3 GB of global memory. The hosting PC runs on an Intel Core i7-2600 CPU @ 3.4 GHz with 8 GB of system memory; the operating system is Debian with Linux kernel, and the GPU driver version is 325.15.

## Experiments

Theorems 1 and 2 give upperbounds on the required number of iterations (*N*_{it}). A practical proposal of this paper is to construct configurations that have as low worst-case bounds on *N*_{it} as feasible and can be computed efficiently on many-core architectures. This scenario is presented through two case studies. The first one is on an Eye-RIS v1.3 VS that is a hardware implementation of the CNN-UM and the second one is on a GPU.

The whole image is covered with many small active fronts, and as a consequence, the intersection condition (${\Omega}_{p}^{\ast}\cap \Omega \ne \varnothing $) is automatically fulfilled. In the case studies, the speedfield *F* was very simple: +1 for the object region and −1 for the background regions.

### A case study on CNN-UM

In subsection ‘CNN universal machine,’ a short overview was given on the CNN-UM. Now the details of the mapped algorithm are described. The perspective in this scenario is the precedence of locality which becomes increasingly important as the technology feature size decreases, and the delay together with power consumption of global communication increases.

In addition to *L*_{in} and *L*_{out}, two other sets are defined.

Templates AND and OR denote elementary logic, ANDNOT performs logic subtraction (*Op*1∧¬*Op*2), and DIL4 and ERODE4 are the four-connected dilation and erosion (spatial logic). All templates are of the nearest neighbor kind. In the ‘Update *L*_{out}’ phase, *foutmask* is computed first. It contains the points that are going to move outward. *foutmask* is used in three different ways. It is subtracted (ANDNOT) from *L*_{out}, added (OR) to *L*_{in}, and dilated (DIL4, ANDNOT, AND) to generate its own outer neighbors. This is the new stepped *L*_{out} part, and the unchanged parts are added with an OR operation. The resulting set is finalized as the new *L*_{out} (black rectangle in Figure 2). From the old *F*_{out}, the new *L*_{out} is subtracted (ANDNOT) to get the new *F*_{out} (again, black rectangle in the ‘Update *L*_{out}’ phase). Finally, the modified *L*_{in} is added to *F*_{in}. In the ‘Clean *L*_{in}’ phase, the merged *foutmask*, *L*_{in}, and *F*_{in} form the only input. The new *L*_{in} is the outer pixel layer of this merged input. The new *F*_{in} is obtained by a simple four-connected erosion, while *L*_{in} is the result of a subtraction. ‘Update *L*_{in}’ and ‘Clean *L*_{out}’ are nearly identical; only the inputs of the operations are switched, and another mask is used (*finmask*).
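The ‘Update *L*_{out}’ phase can be mimicked with set operations, as in the Python sketch below. Binary images are modelled as sets of pixels that are switched on, so AND, OR, and ANDNOT become set intersection, union, and difference. The sketch assumes that *F*_{out} holds the exterior pixels that are not in *L*_{out}, which is an assumption made for this illustration; the function names are hypothetical, and the merge into *F*_{in} and the cleaning phases are not modelled.

```python
# Binary images are modelled as sets of (row, col) pixels that are "on".
N4 = ((1, 0), (-1, 0), (0, 1), (0, -1))

def dil4(mask):
    """Four-connected dilation (may leave the image; a later AND clips it)."""
    return mask | {(r + dr, c + dc) for r, c in mask for dr, dc in N4}

def update_l_out(l_in, l_out, f_out, foutmask):
    """'Update L_out' phase expressed with AND / OR / ANDNOT / DIL4.

    Assumption: f_out holds the exterior pixels that are not in l_out.
    """
    unchanged = l_out - foutmask                    # ANDNOT: points that stay
    new_l_in = l_in | foutmask                      # OR: moved points join L_in
    stepped = (dil4(foutmask) - foutmask) & f_out   # DIL4, ANDNOT, AND
    new_l_out = stepped | unchanged                 # OR
    new_f_out = f_out - new_l_out                   # ANDNOT
    return new_l_in, new_l_out, new_f_out

# Usage: a single object pixel on a 7x7 image, all four front points move out.
domain = {(r, c) for r in range(7) for c in range(7)}
l_in = {(3, 3)}
l_out = {(2, 3), (4, 3), (3, 2), (3, 4)}
f_out = domain - l_in - l_out
new_l_in, new_l_out, new_f_out = update_l_out(l_in, l_out, f_out, set(l_out))
```

Every operation acts on whole images and uses only nearest-neighbor information, which is why the phase maps naturally onto a sequence of templates.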

The algorithm is implemented on an Eye-RIS 1.3 VSoC. One step of the algorithm is performed in 400 to 440 µs on a QCIF image. It must be noted that the actual computing is finished within 60 to 70 µs and the remaining time (340 to 370 µs) is required for the data movement from the main memory of the Eye-RIS (on the Altera NIOS-II microprocessor) to the Q-Eye chip memory.

### A case study on GPU

The evolution process is divided into two steps. The first one is the planner step and the second is the evolution step. The planner creates the so-called plan. It contains the position offsets of the 16×16 tiles that are actually processed in the evolution step. The planner works on the indicator image. The indicator is a tiny image, and each pixel of the indicator is *true* if the corresponding tile of the input image shall be processed in this iteration and *false* otherwise. The size of the plan is calculated by a work-group-wise local prefix sum, and a global atomic addition is used to correctly determine the offset of the work-group within the plan. The source of the planner kernel is provided as Additional file 1.
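The planner logic can be sketched sequentially in Python. This is not the kernel of Additional file 1: the work-groups are simulated by slices of a flattened indicator, the global atomic addition is replaced by a plain counter, and the group size of four is a hypothetical value chosen to keep the example small.

```python
from itertools import accumulate

TILE = 16  # tile edge length in pixels, as in the text

def build_plan(indicator, group_size=4):
    """Collect the pixel offsets of all tiles flagged in the indicator image."""
    coords = [(ty, tx) for ty, row in enumerate(indicator)
              for tx in range(len(row))]
    flags = [1 if indicator[ty][tx] else 0 for ty, tx in coords]
    plan = []
    counter = 0  # stands in for the global atomic counter
    for start in range(0, len(coords), group_size):
        group = flags[start:start + group_size]
        # Exclusive prefix sum: slot of each active tile within its group.
        local = [0] + list(accumulate(group))[:-1]
        total = sum(group)
        base = counter        # offset reserved for this work-group
        counter += total      # an atomic addition in the parallel version
        plan.extend([None] * total)
        for i, flag in enumerate(group):
            if flag:
                ty, tx = coords[start + i]
                plan[base + local[i]] = (ty * TILE, tx * TILE)
    return plan

indicator = [[True, False, True],
             [False, False, True]]
plan = build_plan(indicator)
```

On a GPU the work-groups reserve their offsets in an arbitrary order, so the order of the tiles in the plan may differ between runs, while the set of tiles stays the same.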

The evolution kernel processes only those tiles of the LS function that are inserted in the plan. The evolution kernel makes a step in either the inward or the outward direction depending on the sign of the force field on the LS function. This is done simultaneously, unlike in the sequential algorithm. Each work-group processes a 16×16 tile provided in the plan and writes the complete tile back to the global memory. First, each work-item calculates the new value of its pixel of the LS function. Then the neighbors of each pixel are updated as the *switch_out()* and *switch_in()* operations require, and the active front is cleaned to maintain the two-pixel width. If there is no activity inside the tile, then the corresponding pixel of the indicator image is set to *false*. The boundary of the tile requires special care, namely, to properly update the corresponding neighboring pixels of the LS function and the indicator. The source of the evolution kernel is provided as Additional file 2.

**Table 2 Time measurements on NVIDIA GTX 780 GPU compared to Intel core i7 CPU**

Data size | Initial condition | $\bar{T}_{\text{iteration}}$ on GPU (µs) | $\bar{T}_{\text{iteration}}$ on CPU (µs) | N | Speedup |
---|---|---|---|---|---|
256×256 | 1×1 | 129 | 1,610 | 32 | 12.5 |
256×256 | 2×2 | 126 | 2,242 | 59 | 17 |
256×256 | 8×8 | 140 | 3,164 | 20 | 22 |
256×256 | 32×32 | 143 | 8,874 | 8 | 62 |
512×512 | 1×1 | 317 | 3,190 | 64 | 10 |
512×512 | 4×4 | 167 | 8,724 | 40 | 52 |
512×512 | 16×16 | 157 | 12,897 | 25 | 82 |
512×512 | 64×64 | 123 | 16,246 | 18 | 132 |
1,024×1,024 | 1×1 | 534 | 6,431 | 129 | 12 |
1,024×1,024 | 8×8 | 548 | 27,461 | 55 | 50 |
1,024×1,024 | 32×32 | 590 | 43,739 | 32 | 74 |
1,024×1,024 | 128×128 | 490 | 84,078 | 12 | 171 |
2,048×2,048 | 1×1 | 560 | 14,972 | 210 | 26 |
2,048×2,048 | 16×16 | 703 | 79,920 | 79 | 113 |
2,048×2,048 | 64×64 | 830 | 198,980 | 28 | 239 |
2,048×2,048 | 256×256 | 684 | 327,541 | 7 | 478 |

### Number of iterations

In the experiments, several initial configurations were tested. In each configuration, regions of *Ω* and *Γ* were placed in a chessboard-like pattern as shown in Figure 1A,B. Two sample objects that shall be detected are presented in Figure 1C,D. Additionally, the two objects represent the two object families: the degenerate and the convex ones, having the worst-case bounds stated in Theorems 1 and 2.

Iteration results are presented in Table 1 together with the two different bounds of the given configuration. The number of iterations (*N*_{it}) was measured on the original sequential algorithm of Shi and these values are presented in the table. It is below or equal to the worst-case bounds in every case.

In the case of CNN-UM, *N*_{it} coincides with the values presented in the table, while in the case of the GPU implementation, *N*_{it} is consistently higher by one iteration. This means that it exceeded the bounds in the case of *n*=32 and *n*=64. The reason is as follows: the boundary pixels of the subregion have a one-iteration delay in the cleaning process. This causes the additional iteration, so it is not a violation of the theorems.

## Validation

Given the object regions *Ω*_{1} and *Ω*_{2} of the two different methods, the Dice coefficient is defined as $2|{\Omega}_{1}\cap {\Omega}_{2}|/(|{\Omega}_{1}|+|{\Omega}_{2}|)$.

Its value is in the range of 0 and 1; 0 means complete difference and 1 means complete agreement.
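As a small illustration, the coefficient can be computed on regions stored as sets of pixel coordinates. The sketch assumes the usual Dice definition, twice the size of the intersection divided by the sum of the two region sizes; the function name and the convention for two empty regions are hypothetical.

```python
def dice(omega1, omega2):
    """Dice coefficient of two regions given as collections of pixels."""
    a, b = set(omega1), set(omega2)
    if not a and not b:
        return 1.0  # convention chosen here: two empty regions agree completely
    return 2 * len(a & b) / (len(a) + len(b))

# Usage: two 4x4 squares shifted by one pixel against each other.
square_a = {(r, c) for r in range(4) for c in range(4)}
square_b = {(r, c + 1) for r, c in square_a}
overlap = dice(square_a, square_b)
```

The two squares share 12 of their 16 pixels each, so the coefficient is 0.75.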

### Mean curvature flow

*κ* is the (Euclidean) curvature of the LS. It is the norm of the second derivative of *γ* with respect to the (Euclidean) arc length (*κ*=∥*γ*_{ss}(*s*)∥, where *s* is the arc length parametrization of the curve). Another possible, precise, and easier way to calculate the curvature of an LS from *ϕ* is as follows:
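A common expression for the curvature of a level set, assumed in the sketch below, is the divergence of the normalized gradient, *κ* = div(∇*ϕ*/|∇*ϕ*|), which in 2D expands to (*ϕ*_{xx}*ϕ*_{y}^{2} − 2*ϕ*_{x}*ϕ*_{y}*ϕ*_{xy} + *ϕ*_{yy}*ϕ*_{x}^{2})/(*ϕ*_{x}^{2}+*ϕ*_{y}^{2})^{3/2}. The Python sketch approximates the derivatives with central differences; the function name and the step size *h* are hypothetical.

```python
import math

def curvature(phi, x, y, h=1.0):
    """Curvature of the level set of phi through (x, y) by central differences."""
    px = (phi(x + h, y) - phi(x - h, y)) / (2 * h)
    py = (phi(x, y + h) - phi(x, y - h)) / (2 * h)
    pxx = (phi(x + h, y) - 2 * phi(x, y) + phi(x - h, y)) / h**2
    pyy = (phi(x, y + h) - 2 * phi(x, y) + phi(x, y - h)) / h**2
    pxy = (phi(x + h, y + h) - phi(x + h, y - h)
           - phi(x - h, y + h) + phi(x - h, y - h)) / (4 * h**2)
    grad2 = px**2 + py**2
    if grad2 == 0.0:
        return 0.0  # curvature is undefined where the gradient vanishes
    return (pxx * py**2 - 2 * px * py * pxy + pyy * px**2) / grad2**1.5

# Signed distance to a circle of radius 10: negative inside, positive outside.
def circle(x, y):
    return math.hypot(x, y) - 10.0

kappa = curvature(circle, 10.0, 0.0)
```

For a signed distance function that is negative inside the curve, a circle of radius 10 yields a curvature close to 1/10 on the zero LS, which gives a quick sanity check of the sign convention.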

This force term appears in almost every LS flow as a smoothing and regularizing term. The steady-state solution is a circle with infinitesimal diameter. In practice, the object region vanishes. In this case, not only the steady state but the evolution itself is also investigated. This is an autonomous motion and does not have any control term from an external image.

The details of the numerical approximation are as follows. The LS function *ϕ* is a signed distance function. It was recalculated after every 30 iterations. The time (*T*_{maximum}) ran to 800 units. The time step size (*Δt*) was set to 0.4. The curvature was calculated from the LS function according to Equation 11.

In the case of the fast LS evolution, the curvature was calculated according to the work of Merriman, Bence, and Osher (MBO) [23, 24], namely, by *G*⊗*ϕ*, where *G* is a 2D Gaussian of a given variance.

### Chan-Vese flow

The parameters are *μ*=1, *λ*_{1}=0.8, and *λ*_{2}=0.8. *I* represents the input image intensities, and the constants *c*_{1}=0.5 and *c*_{2}=0 are simply the means of pixel intensities inside and outside the zero LS. The artificial time parameter runs to 180 units, and the time step is 0.5 units. The total number of iterations is 360. The initial condition is 25 circles arranged uniformly in five rows and five columns, each with a diameter of 27 pixels. The LS function (signed distance) is recalculated every 30 iterations for the numerical solution. The steady states of the two Chan-Vese evolutions are shown in Figure 5A. The Dice index of the two states is 0.998.

### Geodesic active regions flow

*R*_{1} and *R*_{2} are the regions to be separated, *b* is a strictly decreasing function of boundary probability, and *α* is a balancing constant. In our case *α*=0.3, and *b* is defined as follows:

Here *G* is a 2D Gaussian with *σ*=3. The GMM parameters are calculated from the image histogram with a recursive expectation maximization algorithm. The artificial time runs to 6 units, and the time step is 0.02 units. The total number of iterations is 300. The LS function (signed distance) is recalculated every 30 iterations for the numerical solution. The initial condition is the same as in the case of the Chan-Vese evolution: 5×5 circles, each with a diameter of 27 pixels. Steady states are shown in Figure 5B. The Dice index of the two states is 0.998.

## Discussion

In this paper, given our investigation of the initial condition and the required number of iterations as a function of it, we presented two bounds on the required number of iterations of an LS evolution. The bounds were proven theoretically and validated experimentally with the original algorithm and also with two different mappings of the algorithm on many-core machines (GPU, CNN-UM). The bounds depend only on the initial configuration of the LS function. The many-core realizations required not only a very small number of iterations less than or equal to the bounds, but the execution of an iteration was also fast (see Table 2 for detailed measurement data).

In addition to the drastic decrease of the required number of iterations, the total execution time decreases as well if a dense initial condition is used for the evolution. The total execution time on CPU with the sparse initial condition is comparable to the total execution time with the dense initial condition. For the smaller images, the dense initial condition was less effective by 15% to 30%, but in the case of the biggest image, the dense initial condition was faster by 35%. In the case of the dense initial condition on GPU, there is a significant speedup compared to the sparse initial condition in all cases since our proposed dense initial condition together with the algorithm utilizes the properties of the underlying architecture. Therefore, a greater performance gain can be achieved on GPU if the dense initial condition is used.

A great property of the results is the scalability. This is true for the performance as a function of cores and for the number of iterations as a function of size of the disjoint active fronts. Considering the chessboard-like initial configuration with increasingly finer regions, the general bound is proportional to the area of the regions and the convex bound is proportional to the half perimeter of the regions. This is changed in three dimensions to the volume of region in the case of general bound and half of the longest perimeter of the volume in the case of a convex bound.

The assumption on *F* is stronger in Theorem 1 than the one that was given in the convergence analysis in [2]. In the examples presented there, our stronger assumption holds for at least one of the regions *Ω*^{∗}, *Γ*^{∗}. However, there may be cases when the sign of *F* changes for a short period of iterations. Typically, this is the case when, inside the true object region, the actual state of the LS function contains a concave background region with high negative curvature. In these cases, the curvature-based term can be greater than the region term (the pixel-intensity-based terms), but this is a temporary effect. As soon as the local concavity has vanished, the region term becomes greater again and the sign of *F* changes back. Furthermore, as was declared in the introduction, the construction of the speedfield and its components is out of the scope of this paper. Additionally, the validations indicate that the method converges *de facto* to the same state as the exact numerical solutions.

The fact that the active front of the initial condition covers the whole image has a special consequence: separate, disjoint regions of the same object, or multiple target objects, can be found automatically without user interaction. For example, the grey matter of the brain on an MRI slice is disconnected and may be composed of 8 to 20 disjoint regions on the given slice. The problem of detecting all regions is greatly simplified with our dense initial condition. Similarly, a selected group of cells on a histology image shows this property as well. Additionally, histology images can be extremely large (2 to 30 Mpixel), and the performance gain of our proposed method (initial condition together with the parallel algorithm) becomes more pronounced on larger images. A conventional sparse initialization can easily fail at this task if the initial condition is chosen poorly; see, for instance, the initialization and evolution of the gold standard LS implementation of [25], which is a widely used framework for medical image segmentation and analysis.

*a priori* information into the initial condition. Figure 7 shows a histology image from the skin, where one class of cells is to be segmented.

It must be emphasized that the case studies presented here are not necessarily optimal mappings of the Shi LS evolution. The purpose of presenting them is twofold: (1) to highlight the advantage of the proposed initial condition concept, especially on those machines, and (2) to give a proof-of-concept mapping of this fast evolution on two very differently organized (virtual and physical) many-core machines.

## Conclusions

LS-based algorithms are feasible tools for automatically detecting or segmenting an object on an image or on a region of it. In this paper, it was shown theoretically and experimentally, through two case studies, that the initial condition plays an essential role in decreasing the execution time. It must be emphasized that this is only validated on many-core architectures, where the computations can be distributed among the cores. Furthermore, based on the initial condition configuration, two worst-case bounds were given on the required number of iterations, depending on the convexity of the object to be found. The bounds are proven theoretically and validated experimentally. Additionally, the execution time of one iteration was measured on both architectures. It was below 70+370 µs on the Eye-RIS system handling a QCIF image (where 70 µs is the processing time and 370 µs is the outer memory delay). The timing results for the GPU are presented in detail in Table 2. On the GPU, the proposed dense initial condition gives a significant speedup over the sparse initial condition in all cases, since our dense initial condition together with the algorithm exploits the properties of the underlying architecture. Therefore, a greater performance gain can be achieved (up to 18 times speedup compared to the sparse initial condition on GPU).

The results and tools presented in this paper provide a method to efficiently map LS algorithms on many-core architectures and ensure bounds on the execution time through the two theorems.

## Declarations

### Acknowledgements

This research was supported by the European Union and the State of Hungary, cofinanced by the European Social Fund in the framework of TÁMOP 4.2.4.A/1-11-1-2012-0001 (National Excellence Program). The support grants TÁMOP-4.2.1.B-11/2/KMR-2011-0002 and TÁMOP-4.2.2/B-10/1-2010-0014 are also gratefully acknowledged. The authors would like to thank Ádám Rák for his help and suggestions.

## References

1. Sethian JA: *Level Set Methods and Fast Marching Methods: Evolving Interfaces in Computational Geometry, Fluid Mechanics, Computer Vision, and Materials Science*. New York: Cambridge University; 2000.
2. Shi Y, Karl W: A real-time algorithm for the approximation of level-set-based curve evolution. *IEEE Trans. Image Process.* 2008, 17(5):645-656. [http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4480128]
3. Adalsteinsson D, Sethian JA: A fast level set method for propagating interfaces. *J. Comput. Phys.* 1995, 118(2):269-277. [http://www.sciencedirect.com/science/article/pii/S0021999185710984] 10.1006/jcph.1995.1098
4. Peng D, Merriman B, Osher S, Zhao H, Kang M: A PDE-based fast local level set method. *J. Comput. Phys.* 1999, 155(2):410-438. [http://www.sciencedirect.com/science/article/pii/S0021999199963453] 10.1006/jcph.1999.6345
5. Sapiro G: *Geometric Partial Differential Equations and Image Analysis*. New York: Cambridge University; 2001.
6. Chan T, Vese L: Active contours without edges. *Image Process. IEEE Trans.* 2001, 10(2):266-277. 10.1109/83.902291
7. Paragios N, Deriche R: Geodesic active regions: a new framework to deal with frame partition problems in computer vision. *J. Vis. Commun. Imag. Rep.* 2002, 13(1–2):249-268. [http://www.sciencedirect.com/science/article/pii/S1047320301904754]
8. Joshi N, Brady M: Non-parametric mixture model based evolution of level sets and application to medical images. *Int. J. Comput. Vis.* 2010, 88: 52-68. [http://dx.doi.org/10.1007/s11263-009-0290-5] 10.1007/s11263-009-0290-5
9. Bertelli L, Chandrasekaran S, Gibou F, Manjunath BS: On the length and area regularization for multiphase level set segmentation. *Int. J. Comput. Vis.* 2010, 90(3):267-282. [http://www.springerlink.com/index/10.1007/s11263-010-0348-4] 10.1007/s11263-010-0348-4
10. Sundaramoorthi G, Yezzi A, Mennucci A, Sapiro G: New possibilities with Sobolev active contours. *Int. J. Comput. Vis.* 2009, 84: 113-129. [http://dx.doi.org/10.1007/s11263-008-0133-9] 10.1007/s11263-008-0133-9
11. Lefohn AE, Cates JE, Whitaker RT: *Interactive, GPU-based level sets for 3D segmentation*. ed. by Ellis RE, Peters TM, Medical Image Computing and Computer-Assisted Intervention - MICCAI 2003, Lecture Notes in Computer Science. vol. 2878 (Springer, Berlin Heidelberg, 2003) pp. 564–572, [http://dx.doi.org/10.1007/978-3-540-39899-8_70]
12. Roberts M, Packer J, Sousa MC, Mitchell JR: A work-efficient GPU algorithm for level set segmentation. In *Proceedings of the Conference on High Performance Graphics*. ACM, New York; 2010:123-132.
13. Sharma O, Zhang Q, Anton Q, Bajaj C: Multi-domain, higher order level set scheme for 3D image segmentation on the GPU. In *2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*. San Francisco, 13–18 June 2010; 2211-2216.
14. Chua LO, Yang L: Cellular neural networks: applications. *Circuits Syst. IEEE Trans.* 1988, 35(10):1273-1290. 10.1109/31.7601
15. Cserey G, Rekeczky C, Földesy P: PDE based histogram modification with embedded morphological processing of the level-sets. *J. Circuits, Syst. Comput.* 2003, 12(04):519-538. 10.1142/S021812660300101X
16. Rekeczky C, Roska T: Calculating local and global PDEs by analogic diffusion and wave algorithms. In *Proceedings of the European Conference on Circuit Theory and Design, Volume 2*. Helsinki University of Technology, Espoo; 2001:17-20.
17. Hillier D, Czeilinger Z, Vobornik A, Rekeczky C: Online 3-D reconstruction of the right atrium from echocardiography data via a topographic cellular contour extraction algorithm. *Biomed. Eng. IEEE Trans.* 2010, 57(2):384-396.
18. Shi Y, Karl WC: A fast level set method without solving PDEs. In *Proceedings on IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’05)*. Philadelphia, 18–23 March 2005; 2005:97-100.
19. Shi Y: Object based dynamic imaging with level set methods. *PhD Thesis, Boston University College of Engineering*; 2005.
20. Chua LO, Roska T, Venetianer PL: The CNN is universal as the Turing machine. *Circuits Syst. I: Fundam. Theory Appl. IEEE Trans.* 1993, 40(4):289-291. 10.1109/81.224308
21. NVIDIA: CUDA C Programming Guide 2011. [https://developer.nvidia.com/cuda-toolkit-archive]. Accessed 23 January 2012
22. OpenCL specification 1.1 2011. [http://www.khronos.org/opencl/]. Accessed 1 April 2012
23. Merriman B, Bence J, Osher S: Diffusion generated motion by mean curvature. *Computational Crystal Growers Workshop*. Edited by Taylor J (Providence, RI, 1992) pp. 73–83.
24. Merriman B, Bence JK, Osher SJ: Motion of multiple junctions: a level set approach. *J. Comput. Phys.* 1994, 112(2):334-363. [http://www.sciencedirect.com/science/article/pii/S0021999184711053] 10.1006/jcph.1994.1105
25. The insight toolkit 2012. [http://www.itk.org]. Accessed 20 November 2012

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.