In this article, a low-cost system for 2D eye gaze estimation from low-resolution webcam images is presented. Two algorithms are proposed for this purpose: one detects the eyeball and a stable approximate pupil-center, and the other detects the direction of eye movements. The eyeball is detected using the deformable angular integral search by minimum intensity (DAISMI) algorithm. The deformable template-based 2D gaze estimation (DTBGE) algorithm is employed as a noise filter for making stable movement decisions. DTBGE operates on binary images, whereas DAISMI operates on gray-scale images. Right and left eye estimates are evaluated separately. DAISMI finds the stable approximate pupil-center location by calculating the mass-center of the eyeball border vertices, which is used for the initial deformable template alignment. DTBGE starts from this initial alignment and updates the template alignment frame by frame according to the resulting eye movements and eyeball size. The horizontal and vertical deviation of eye movements relative to the eyeball size is assumed to be directly proportional to the deviation of cursor movements for a given screen size and resolution. The core advantage of the system is that it does not use the real pupil-center as a reference point for gaze estimation, which makes it more robust against corneal reflection. Visual angle accuracy is used for the evaluation and benchmarking of the system. The effectiveness of the proposed system is demonstrated and experimental results are shown.
Gaze information plays an important role in determining the point of regard (PoR), which serves as a proxy for the user's attention or intention in human-computer interaction and can be used as a form of input instead of the keyboard and mouse [1–3]. Every eye gaze tracking system has a different visual angle accuracy, which determines the usable size of selection targets such as buttons, icons, pictures, and text. Systems with a smaller visual angle error are more robust in target selection, and visual angle accuracy is generally used for benchmarking eye gaze systems.
Various systems have been described in the literature. Deng et al. present a local deformable template method for locating the eye and extracting eye features. A system based on a dual-state model for eye tracking has also been proposed. Both of these methods require manual initialization of the eye location. Witzner et al. model the iris as an ellipse, but the ellipse is fitted locally to the image through an EM and RANSAC optimization process. A system based on feature detection using one camera is suggested by Smith et al. Noureddin et al. suggest a two-camera solution in which a fixed wide-angle camera with a rotating mirror directs the orientation of a narrow-angle camera. A practical gaze point detection system with dual cameras is also proposed by Park et al. They employ IR-LED illuminators for the wide- and narrow-view cameras to overcome the problem of specular reflection on glasses, and attain an RMS error of 2.89 cm between the computed and the real positions. A non-intrusive eye gaze estimation system is proposed by Yoo et al. Their system allows ample head movement and yields quite accurate results under large head motion using the cross-ratio. Artificial neural networks (ANN) have also been applied to eye gaze estimation in the literature. This method uses cropped images of the eyes for training the activation functions. It requires a calibration and training process, which is not practical in everyday use.
Guestrin et al. claim that the point of gaze (POG) can be estimated only if the head is completely stationary when the system employs one camera and one light source. Head movements are therefore the major problem of low-cost 2D eye gaze trackers. This article neglects head movements and estimates the approximate POG using shape- and intensity-based methods.
The remainder of the article is organized as follows. In Section 2, the face and eye-socket detection stage is presented. In Section 3, the deformable angular integral search by minimum intensity (DAISMI) algorithm is introduced. In Section 4, the deformable template-based 2D gaze estimation (DTBGE) algorithm is described together with its mathematical formulation. In Section 5, a dwell-time-based virtual keyboard is employed to measure the visual angle accuracy of the system, and the experimental evaluation and benchmarking of the system are given. Section 6 discusses the results and concludes the paper.
2. Face and eyes socket detection
The fundamental requirement of a simultaneous eye tracking and gaze detection system is accurate detection of the eye sockets. This can be achieved with Haar-like object detectors, which were originally proposed by Paul Viola and later extended by Rainer Lienhart. The method allows a classifier trained with sample views of a particular object to detect that object in a whole image. When implemented correctly, it is fast, efficient, and accurate. It is also effective in detecting objects that are partially occluded or when the video frame is noisy. By applying it in a multistep detection approach, the eye region can be detected precisely by first recognizing the face and then setting a region of interest on it. This requires suitable Haar-cascade descriptors for the frontal face and the two-eye region. Santana et al. present fully trained Haar-cascade descriptors, which are employed in our system and applied by means of the Intel OpenCV library. Figure 1 shows a successful screenshot of face and eye socket detection.
The detected eye sockets contain the eyeball, eyebrow, and eyelashes together. However, in most intensity-based trackers, eyebrows and eyelashes are an obstacle to robust eyeball detection. The pupil is known to be the darkest contour in an eye-socket; however, eyebrows and eyelashes are also dark and have intensity values similar to those of the pupil. This complicates detection and hinders the development of robust eye tracking systems.
The mass-center of an eye-socket, weighted towards low intensities, is considered the most probable pupil location. If the mass-center of an eye-socket is updated repeatedly, the resulting center of mass moves closer to the pupil-center at each step. Accordingly, the mass-center of an eye-socket is derived in x steps, each of which updates the size and position of the socket. Gaussian blurring is applied before processing. The size of the socket is reduced by p percent in each step. Our system uses x = 2 and p = 15%. The resulting eye-socket is smaller and mostly excludes the eyebrows. Equation 1 describes the center of mass estimation as follows:
where c is the position of the mass-center, P is the position, and I is the intensity of each pixel within the eye-socket, which contains n pixels. Figure 2 shows screenshots from the socket reduction process.
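The socket reduction loop can be illustrated with a minimal sketch. It assumes that the weighting towards low intensities means each pixel is weighted by 255 − I, so darker pixels pull the center harder; the exact weighting of Equation 1 may differ. The function names `dark_mass_center` and `reduce_socket` are hypothetical, the image is a plain list of rows of gray values to keep the sketch dependency-free, and the Gaussian blurring step is omitted.

```python
def dark_mass_center(img, x0, y0, w, h):
    """Intensity-weighted centroid of the window (x0, y0, w, h).

    Assumption: weight = 255 - I, so darker pixels contribute more.
    """
    total = sx = sy = 0.0
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            wgt = 255 - img[y][x]
            total += wgt
            sx += wgt * x
            sy += wgt * y
    if total == 0:  # uniformly white window: fall back to the geometric center
        return x0 + (w - 1) / 2.0, y0 + (h - 1) / 2.0
    return sx / total, sy / total


def reduce_socket(img, steps=2, p=0.15):
    """Shrink the socket by fraction p per step, re-centering on the dark centroid."""
    img_h, img_w = len(img), len(img[0])
    x0, y0, w, h = 0, 0, img_w, img_h
    for _ in range(steps):
        cx, cy = dark_mass_center(img, x0, y0, w, h)
        w = max(1, int(round(w * (1 - p))))
        h = max(1, int(round(h * (1 - p))))
        # Center the smaller window on the centroid, clamped to the image.
        x0 = min(max(int(round(cx - w / 2.0)), 0), img_w - w)
        y0 = min(max(int(round(cy - h / 2.0)), 0), img_h - h)
    return x0, y0, w, h
```

With the values used in the text (x = 2 steps, p = 15%), calling `reduce_socket(img)` returns the rectangle of the derived socket in image coordinates.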
The sockets derived from the original ones locate the eyeball more precisely. However, the center of mass often falls on the eyelashes, which are still very close to the eyeball. The derived sockets are therefore used as eye estimates, in which DAISMI detects the stable approximate pupil-center position by averaging the positions of the eyeball border vertices. This position is needed for the initial template orientation of DTBGE and for the final movement displacement vector.
3. Deformable angular integral search algorithm by minimum intensity
The shape of the eyeball is the most commonly used feature because the eyeball is circular by nature. Shape-based methods are considered good at handling shape, scale, and rotation changes. Another cue is the intensity of the eyeball in an eye model: the eyeball is known to be the densest dark contour in an eye estimate. This article proposes the DAISMI algorithm, which searches for the densest dark contour in an eye estimate. Linear-search algorithms are conventionally used in color trackers. However, they are weak in eyeball detection because of corneal reflection and varying illumination. Since the eyeball is circular, the noise distribution caused by varying illumination prevents linear search from succeeding in a color-tracking system. DAISMI models the shape information of the eyeball and tracks the possible eyeball contours angularly by means of deformable template windows. Figure 3 shows the basic configuration of the eyeball template. The eyeball shape is modeled on the assumptions that the eyeball borders are the locations with the maximum variance in intensity and that the intensity distribution within the entire template is homogeneous. The model also assumes that border pixels have lower intensity values than the average intensity of the derived eye-socket. First, the template is oriented on each derived eye-socket. Ideally, the center of the eyeball template should lie over the eyeball contour. Therefore, the darkest pixel in the derived eye-socket is found by a simple linear-search algorithm, and its location is taken as the initial pupil-center location in the derived eye-socket. However, the position of the darkest pixel in the derived eye-socket is noisy; that is, it varies with the illumination. This is known as the basic problem of all low-cost eye trackers.
DAISMI does not aim to find the real pupil-center. Instead, it finds the mass-center of the border vertices by constructing a deformable template that originates from the initial pupil-center position, which is estimated as the darkest pixel in the derived eye-socket using a simple linear-search algorithm. Under the conditions of the model, the proposed algorithm finds the locations of the border vertices that have the maximum integral of average low intensities through a template window. DAISMI does not start searching from the template origin; it starts from the last index of the template window and moves towards the origin's index. In other words, it searches from the outside inwards in each window. The index of the pixel that shows the maximum intensity variation with respect to the previous index in each window is then calculated using a cut-off parameter. The cut-off parameter determines the initial candidate index that shows the maximum variation with respect to the previous index, and it is used to exclude the eyebrows. In each step of this process, the maximum integral of the average low intensities of the pixels from the resulting index to the origin's index is found. The index with the maximum integral of average low intensities is taken as a vertex point of the eyeball border. The same process is applied to all other windows in the template. The model is expected to fit the real eyeball in the derived eye-socket to a degree that depends on the number of windows: using more windows increases the probability of a match between the model and the real eyeball. Finally, the mass-center of the resulting border vertices is calculated in terms of low intensities.
The resulting mass-center position is taken as the stable approximate pupil-center position, and the four border vertices located farthest from the mass-center horizontally and vertically are taken as the end-points of the minor and major axes of the ellipse. Table 1 shows the pseudo-code of the algorithm.
Figure 4 shows successful snapshots of eyeball detection by DAISMI, in which the number of windows is 80, the window size is 25% of the derived eye-socket's height, the tolerance value is 0.15, and the cut-off ratio is 1.25.
The success rate of fitting depends on four parameters of the model: number of windows, window size, tolerance, and cut-off ratio. A larger window size produces larger elliptical fittings with higher accuracy in pupil-center detection, where the center of the ellipse is taken as the pupil-center. In addition, increasing the number of windows makes it more likely that the model fits the eyeball. Figure 5 shows successful snapshots of DAISMI after ellipse completion.
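The outer-to-inner angular search can be illustrated with a heavily simplified sketch. This is not the pseudo-code of Table 1. It assumes that each window is a single ray of pixels cast from the template origin, that the border vertex on a ray is the last dark pixel before the sharpest rise in intensity, and that the vertices are averaged without weights. The tolerance, the cut-off ratio, and the integral of average low intensities are omitted. The names `border_vertices` and `vertex_mass_center` are hypothetical.

```python
import math


def border_vertices(img, ox, oy, n_windows=80, win_len=25):
    """Return one estimated border vertex (x, y) per angular window."""
    h, w = len(img), len(img[0])
    vertices = []
    for k in range(n_windows):
        theta = 2.0 * math.pi * k / n_windows
        # Sample the window as a ray of pixels starting at the template origin.
        ray = []
        for r in range(win_len):
            x = int(round(ox + r * math.cos(theta)))
            y = int(round(oy - r * math.sin(theta)))  # image y axis points down
            if not (0 <= x < w and 0 <= y < h):
                break
            ray.append((x, y, img[y][x]))
        # Scan from the outer end towards the origin and keep the index whose
        # intensity rises most sharply relative to its inner neighbour.
        best_r, best_jump = None, 0
        for r in range(len(ray) - 1, 0, -1):
            jump = ray[r][2] - ray[r - 1][2]
            if jump > best_jump:
                best_r, best_jump = r, jump
        if best_r is not None:
            vertices.append(ray[best_r - 1][:2])  # last dark pixel on the ray
    return vertices


def vertex_mass_center(vertices):
    """Unweighted mean of the border vertices (simplification of the text)."""
    n = len(vertices)
    return (sum(x for x, _ in vertices) / n, sum(y for _, y in vertices) / n)
```

On an ideal dark disk, the unweighted mean of a uniform angular fan lands roughly midway between the noisy origin and the true center, so in this simplified form the noise of the origin is damped rather than removed.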
4. DTBGE algorithm
In this section, a binary deformable eyeball template is modeled and 2D gaze estimation is performed based on the displacement of eye movements. For any horizontal or vertical eye movement, the displacement vector is estimated with respect to the previous frame.
The algorithm consists of four steps as follows:
4.1 The basic configuration
The algorithm aims to decide on the direction of the eyeball movement using a deformable template as shown in Figure 3. In other words, DTBGE is a kind of noise filter that avoids incorrect decisions caused by noisy pupil-center coordinates. DAISMI finds the approximate window size, which is taken as the Euclidean distance between the template origin and the eyeball border vertices. DTBGE creates another template for gaze direction detection using the window size estimated by DAISMI and converts the source image into a binary image using the p-tile thresholding algorithm, which adapts the thresholding ratio frame by frame according to the varying ratio between black and white pixels over the detected eyeball. Figure 6 shows the orientation of the windows on the binary source image.
The template is created with eight windows, although the number may vary depending on the desired accuracy. A higher number of windows gives more precise decisions, but eight windows gave the best results when the memory workload and speed of the system are taken into account.
In this regard, the internal energy of a single window is defined as follows:
According to Equation 2, black pixels with intensity '0' have a negative impact on the internal energy of a single window, while white pixels with intensity '255' affect the result positively.
4.2 Adjustment of template size
According to the definition of the internal energy of a single window in Equation 2, the size of the template varies depending on the angular integral of the window energies over the angle θ.
According to Equation 3, the size of the template increases when the angular integral of the window energies is less than '0' and decreases when it is greater than '0'.
Figure 7 shows a deforming eyeball model in two cases: shrinking and expanding. The decision is made according to Equation 3. In both cases, the eyeball size is updated and a new template is created in each frame using the updated window size.
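The size update can be sketched as follows, under stated assumptions: each white pixel contributes +1 and each black pixel −1 to the energy of its window, the angular integral is approximated by a sum over the windows, and the window size changes by one pixel per frame. Only the signs are taken from the text; the ±1 weights, the unit step, and the names `window_energy` and `update_window_size` are hypothetical.

```python
def window_energy(pixels):
    """Energy of one window of a binary image.

    Assumption: +1 per white (255) pixel and -1 per black (0) pixel.
    """
    return sum(1 if p == 255 else -1 for p in pixels)


def update_window_size(windows, size, step=1, min_size=1):
    """Grow the template when black dominates, shrink it when white dominates."""
    total = sum(window_energy(w) for w in windows)
    if total < 0:   # windows are mostly black: the eyeball exceeds the template
        return size + step
    if total > 0:   # windows are mostly white: the template exceeds the eyeball
        return max(min_size, size - step)
    return size
```

Here `windows` is a list with one pixel sequence per template window, sampled from the thresholded frame.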
4.3 Horizontal movement of the object
The decision on the horizontal movement of the deformable template is made using cosine functions (Figure 8). Cosine functions produce negative values when the angle θ is between 90° and 270°. Using Equation 4, the direction of the horizontal movement is determined.
According to Equation 4, the horizontal movement is decided to be towards the right-hand side when the angular integral of each window's internal energy weighted by the cosine function is less than '0'; conversely, it is towards the left-hand side when the integral is greater than '0' (Figure 9).
If Emove(x) > 0, the decision is made to the (←) direction.
If Emove(x) < 0, the decision is made to the (→) direction.
Finally, the horizontal movement is applied.
4.4 Vertical movement of the object
The decision on the vertical movement of the deformable template is made using sine functions (Figure 10). Sine functions produce positive values when the angle θ is between 0° and 180°. Using Equation 5, the direction of the vertical movement is determined.
According to Equation 5, the vertical movement (Figure 11) is decided to be upwards when the angular integral of each window's internal energy weighted by the sine function is less than '0'; conversely, it is downwards when the integral is greater than '0'.
If Emove(y) > 0, the decision is made to the (↓) direction.
If Emove(y) < 0, the decision is made to the (↑) direction.
Finally, the vertical movement is applied.
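The two direction decisions can be sketched together, assuming that the angular integrals are approximated by sums over equally spaced windows and that window k sits at angle θ = 2πk/n, with 0° pointing right and 90° pointing up. The sign conventions follow the text; the dead-zone `eps`, which guards against floating-point residue when the energies are balanced, and the name `movement_decision` are hypothetical.

```python
import math


def movement_decision(energies, eps=1e-9):
    """Return (horizontal, vertical) decisions from per-window energies.

    energies[k] is the internal energy of the window at angle 2*pi*k/n.
    """
    n = len(energies)
    ex = sum(e * math.cos(2.0 * math.pi * k / n) for k, e in enumerate(energies))
    ey = sum(e * math.sin(2.0 * math.pi * k / n) for k, e in enumerate(energies))
    # Negative sums mean black (eyeball) pixels dominate on the positive side.
    horizontal = "right" if ex < -eps else "left" if ex > eps else "none"
    vertical = "up" if ey < -eps else "down" if ey > eps else "none"
    return horizontal, vertical
```

For example, with eight windows, a strongly negative energy in the window at 0° and positive energies elsewhere yields a decision to the right with no vertical movement.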
The calibration procedure is reduced to an agreement between the users and the system: all users are asked to look at the center of the screen before tracking starts, and the system assumes that the initial gaze position is the center of the screen. After tracking starts, DAISMI produces a displacement vector in terms of the stable approximate pupil-center position. DTBGE decides which movements are reliable when the displacement vectors are noisy. When the users move their eyes, the pixel-wise pupil displacement filtered by DTBGE is mapped to the screen relative to the initial gaze point and the screen size.
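One possible form of this mapping is sketched below. It assumes a purely proportional model in which a pupil displacement equal to the full eyeball width (or height) corresponds to the full screen width (or height), starting from the screen center. The text does not state the scale factor, so this factor, the clamping to the screen, and the name `gaze_to_cursor` are assumptions made for illustration.

```python
def gaze_to_cursor(dx, dy, eye_w, eye_h, screen_w, screen_h):
    """Map a filtered pupil displacement (pixels in the eye image) to a cursor position.

    Assumption: displacement relative to eyeball size scales linearly to
    cursor displacement relative to screen size, starting at the screen center.
    """
    cx = screen_w / 2.0 + (dx / eye_w) * screen_w
    cy = screen_h / 2.0 + (dy / eye_h) * screen_h
    # Keep the cursor inside the screen.
    cx = min(max(cx, 0), screen_w - 1)
    cy = min(max(cy, 0), screen_h - 1)
    return int(round(cx)), int(round(cy))
```

A zero displacement keeps the cursor at the screen center, which matches the assumption that the initial gaze point is the center of the screen.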
5. Experimental results
The system was developed and tested primarily on a Windows XP PC with a 1.86 GHz Intel Pentium Dual CPU (T2390) and 1.75 GB RAM. The monitor size is 12.1'' (307 mm), with a width of 9.72'' (247 mm), a height of 7.28'' (185 mm), and a 4:3 aspect ratio. Video is captured with a Logitech Quickcam Pro 4000 webcam at 30 frames per second and processed as binary images of 160 × 120 pixels using various utilities from the Intel OpenCV library.
5.1 Eye tracking experiment
This experiment tests whether the eye tracking system can successfully locate the pupils of the users under different conditions. A dataset of 300 images taken under varying lighting conditions and head positions, and with complex backgrounds, was collected. The experimental results show that the performance rate can reach 100% at 15 frames per second (FPS) with a camera input image size of 160 × 120. Figure 12 illustrates some of the successful examples.
5.2 2D gaze estimation and mouse control experiment
The second experiment is performed by asking six users to move the cursor onto virtual keyboard buttons (Figure 13), which are generated relative to the screen size. Each user is given a brief introduction to how the system works and starts the experiment without any practice. Each user conducts the experiment five times (Table 2).
When users are asked to gaze at the expected button, they might look at any pixel on the button. Therefore, gaze accuracy cannot be measured pixel-wise. In addition, the accuracy rate depends on the size of the selection targets, which are buttons in this case: if the buttons are enlarged, they are easier to hit and the accuracy of the system increases. To overcome this difficulty, relative unit error distance values, calculated with respect to the true hits on buttons, are used. A unit error distance of '0' means that the user hits the expected button successfully. A distance of '1' means that the user hits a button neighbouring the expected button, and the gaze error is taken as 1 button. A distance of '√2' means that the user hits a button at the corner of the expected button (1 button away horizontally and 1 button away vertically). Lastly, a distance of '√5' means that the user hits a button that is 1 button away horizontally and 2 buttons away vertically (or the reverse). According to the experimental results in Table 2, users hit the expected button after different durations. The durations do not show the elapsed time between the initial gaze button and the expected gaze button; they show the total elapsed time for starting the tracking system, moving the cursor to the initial gaze button, and moving it to the expected button. The elapsed time between the initial gaze button and the expected button is on the order of milliseconds and is ignored. Users were observed to have difficulty bringing the cursor to the initial gaze button. However, even when they do not achieve a true hit, their resulting gaze is very close to the expected gaze in terms of unit error distance. Therefore, unit error distance values are not used for gaze accuracy evaluation.
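The unit error distance can be illustrated with a short sketch, assuming it is the Euclidean distance between the expected button and the hit button measured in whole buttons on the keyboard grid. The (column, row) addressing and the name `unit_error_distance` are hypothetical.

```python
import math


def unit_error_distance(expected, hit):
    """Euclidean distance between two buttons, in button units.

    Both arguments are (column, row) positions on the keyboard grid.
    """
    return math.hypot(hit[0] - expected[0], hit[1] - expected[1])
```

Under this assumption, a true hit gives 0, a direct neighbour gives 1, a diagonal neighbour gives √2, and an offset of one button along one axis and two along the other gives √5.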
For this reason, most gaze trackers are evaluated with respect to their visual angle accuracy. The visual angle is the angle a viewed object subtends at the eye, usually stated in degrees of arc. Figure 14 shows how to measure the visual angle of a subject's eye looking at an object at a certain distance from the eye.
The visual angle V is measured directly using a theodolite placed at point O and is calculated using the formula:
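Assuming the standard definition of the visual angle, with S the linear size of the viewed object and D its distance from the eye (both symbols are assumptions here, since they are not defined in the surrounding text), the relation is commonly written as:

```latex
V = 2 \arctan\left(\frac{S}{2D}\right)
```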
To measure the visual angle accuracy of the system, the minimum true saccadic displacement produced by the users' eye movements is needed. For our system, according to the experimental results in Table 2, the minimum saccadic displacement is 1 unit button.
In our virtual keyboard application, the size of 1 unit button is 75 px × 75 px. However, this size is relative to the screen size and resolution. To find the real size in millimetres, the pixel unit is first converted to millimetres.
The visual angle is estimated according to the system settings in Table 3 and the formulas in Equations 6-8. It cannot be calculated precisely because the users' distance to the monitor is assumed to be 400 mm and the minimum saccadic displacement is assumed to be 1 unit button. These inputs are not exact but are very close to reality.
Finally, the horizontal and vertical visual angles are estimated. According to Table 4, the horizontal visual angle is 2.07° and the vertical visual angle is 2.48°.
The visual angle accuracy of the system is therefore considered to be around 2°.
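The estimate can be reproduced with a short calculation. The sketch uses the standard visual angle relation V = 2·arctan(S / 2D) together with the values given in the text: a 75 px button, a 247 mm × 185 mm screen, and a 400 mm viewing distance. The screen resolution is not stated in the text; 1280 × 800 pixels is assumed here because it reproduces the reported angles. The function names are hypothetical.

```python
import math


def px_to_mm(px, screen_mm, screen_px):
    """Convert a length in pixels to millimetres along one screen axis."""
    return px * screen_mm / screen_px


def visual_angle_deg(size_mm, distance_mm):
    """Visual angle in degrees subtended by an object of the given size."""
    return math.degrees(2.0 * math.atan(size_mm / (2.0 * distance_mm)))


# Assumed resolution: 1280 x 800 px (not stated in the text).
horizontal = visual_angle_deg(px_to_mm(75, 247, 1280), 400)
vertical = visual_angle_deg(px_to_mm(75, 185, 800), 400)
print(round(horizontal, 2), round(vertical, 2))  # 2.07 2.48
```

Because the viewing distance and the resolution are assumptions, the result should be read as an approximation of the angle rather than a measurement.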
6. Conclusions and discussions
In this article, shape- and intensity-based deformable pupil-center detection and movement decision algorithms are introduced for developing a low-cost 2D gaze estimation and cursor control system that does not depend on the real position of the pupil-center. In many other algorithms based on PoR information, the real pupil-center position is required to eliminate noisy eye movements. In contrast, DTBGE employs a new deformable template model that decides on the directions of eye movements in binary source images without knowing the real pupil-center. DAISMI finds the stable approximate pupil-center location by searching deformable eyeball models that include low-intensity pixels within the template windows. It finds the position of the eyeball from its border vertices, and the mass-center of the vertices is taken as the stable approximate pupil-center position. Even though the origin of the template is noisy and vibrates, the border vertices are almost stable and vibrate less. The stable approximate position of the pupil is sufficient for robust eye movement detection and for the corresponding pupil-center displacement. This system is useful for those who want to build 2D gaze estimation and mouse control systems with low-resolution sources such as simple webcams without an infra-red filter. The experimental results are encouraging: the visual angle accuracy is around 2° at a speed of 15 fps.
The comparison of the system with other systems in the same category is as follows:
As seen in Table 5, our system requires no calibration, no training process, and no infra-red filtering with a head-mounted camera, and it is the only system that does not need the real pupil-center position for 2D gaze estimation. Even though the system uses low-resolution webcam images (160 × 120), its success rate is close to those of the other systems. For this reason, the system is a suitable option for those who cannot afford expensive trackers, do not want calibration, and want high accuracy from a simple webcam. The performance of the system could be improved further if auto-thresholding were made more robust. The ratio between white and black pixels in the template windows is the only parameter that determines the performance of the system, and this ratio depends solely on the thresholding algorithm. Better thresholding algorithms can therefore yield better results.
ANN: artificial neural network
DAISMI: deformable angular integral search by minimum intensity
DTBGE: deformable template-based 2D gaze estimation
FPS: frame per second
POG: point of gaze
PoR: point of regard.
Jacob RJK, Karn KS: Eye tracking in human-computer interaction and usability research: ready to deliver the promises. In The Mind's Eye: Cognitive and Applied Aspects of Eye Movement Research. Elsevier Science, Amsterdam; 2003:573-605.
Santana MC, Déniz-Suárez O, Antón-Canalís L, Lorenzo-Navarro J: Face and facial feature detection evaluation-performance evaluation of public domain HAAR detectors for face and facial feature detection. In Proceedings of the Third International Conference on Computer Vision Theory and Applications (VISAPP 2008). Volume 2. Madeira, Portugal; 2008:167-172.
Long X, Kiderman A, Tonguz OK: A High speed eye tracking system with robust pupil center estimation algorithm. In Proceedings of the 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS 2007). Lyon, France; 2007.
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.