
Tracking of moving human in different overlapping cameras using Kalman filter optimized


Tracking objects is a crucial problem in image processing and machine vision, involving the representation of position changes of an object and following it in a sequence of video images. Though it has a history in military applications, tracking has become increasingly important since the 1980s due to its wide-ranging applications in different areas. This study focuses on tracking moving objects with human identity and identifying individuals through their appearance, using an Artificial Neural Network (ANN) classification algorithm. The Kalman filter is an important tool in this process, as it can predict the movement trajectory and estimate the position of moving objects. The tracking error is reduced by weighting the filter using a fuzzy logic algorithm for each moving human. After tracking people, they are identified using the features extracted from the histogram of images by ANN. However, there are various challenges in implementing this method, which can be addressed by using Genetic Algorithm (GA) for feature selection. The simulations in this study aim to evaluate the convergence rate and estimation error of the filter. The results show that the proposed method achieves better results than other similar methods in tracking position in three different datasets. Moreover, the proposed method performs 8% better on average than other similar algorithms in night vision, cloud vision, and daylight vision situations.

1 Introduction

Detecting and tracking moving objects is a basic step in video analysis. It is widely used in machine vision systems such as monitoring systems, traffic control, automatic navigation, human-computer interaction, and robotics, because these systems must receive and process video from the surrounding environment and analyze the behavior and events in it [1]. Since accuracy and speed are important for the performance of these systems, various methods have been presented for tracking moving objects with low time consumption and high accuracy. According to the definition presented by Sadkhan et al. [2], tracking moving objects means following the movement trajectory of one or more moving objects in a sequence of input images. The tracked object might be any object, such as a fish in water, a boat on the sea, a pedestrian on the pavement, or a vehicle on the highway, which should be located according to the application [3]. Tracking individuals is the first and most important step in such systems. Due to the limited sight range of the cameras, it is not possible to examine the whole environment being monitored [4, 5]. Detecting individuals in a network of cameras is very complicated because of the individuals' appearance features and factors such as illumination, the position of the individuals relative to the camera, and the sight angle [6, 7]. Such systems usually use features like movement mode, geometric modes, and individuals' appearance. Since the movement feature is highly dependent on the appearance time and position in the previous frames, movement filters cannot detect individuals during occlusion or while they are entering the camera's sight angle [8, 9]. Tracking techniques should be robust to challenges such as illumination variations, sudden changes in the movement direction of objects, the presence of various objects in the sight range of the camera, and overlapping.
In recent years, appearance-based systems have shown better performance for detecting and tracking individuals when they enter the sight angle of the camera [10]. Color is the most important feature extracted from individuals' appearance and is usually employed as a histogram for individual detection because of its simplicity and computation speed. However, the accuracy of this method in a system with a network of cameras is low [11, 12], for two main reasons: likely changes in individuals' appearance and changes in the illumination of the environment across cameras. In this study, to increase the accuracy of the histogram method and improve it for redetection of individuals, it is suggested to select the body histogram features using GA. The features extracted from the histogram are the colors of different body parts, considering their position with respect to the cameras. In this paper, GA is used for feature reduction and for increasing detection accuracy and speed.

Segmentation of moving objects in a sequence of video images is one of the main parts of many machine vision applications [13, 14]. In fact, the output of the detection and tracking system, namely the objects being tracked, is used as the input for higher-order processing such as movement interpretation, object counting, and behavior detection [15]. Object detection methods are divided into two groups: feature-based and movement-based. In feature-based methods, objects are specified in terms of shape, color, and other features [16,17,18]. Another important issue in all machine vision applications is camera adjustment, because its accuracy directly affects system performance [19]. In this study, considering the requirement for monitoring systems to be real-time, the background removal method is used, and an adaptive model of the background is created to reduce the sensitivity of the detection algorithm to issues such as shadows in images, stationary objects in the image, unwanted camera movements, and illumination changes, so that proper performance is ensured. Such problems are outside the scope of the topics that object tracking researchers are usually interested in [20, 21].

In this study, the tracking of a moving human in different overlapping cameras is optimized using the Kalman filter. The Kalman filter is a set of equations that offers an efficient recursive computational tool for estimating the state of a process: it supports the estimation of past, present, and even future states, and it can do so even when the precise nature of the modeled system is unknown [22]. The Kalman filter uses a form of feedback control to estimate a process: the filter estimates the state of the process at some time and then obtains feedback in the form of noisy measurements. For this purpose, a hierarchical approach is presented in which, first, the targets of each camera are determined and their different characteristics are identified according to the received data (Fig. 1a). In the next step, an individual target is tracked across different cameras using an accurate approach based on the system trained with previous data (Fig. 1b). Thus, the camera search problem is solved. Since the appearance and movement cues of a target within one camera are consistent over a short time window, solving the tracking problem with a hierarchical method is common: first, tracks in short time windows are generated, and then they are merged to form complete trajectories. Usually, tracking across cameras is more challenging than tracking within a single camera, because the individuals' appearance might differ significantly owing to illumination changes; however, by training the machine system and accurately extracting the target features, this problem can be solved.

Fig. 1

General idea of the proposed framework. a First, the targets in each camera are determined; b the targets and features of the same individual from different overlapping cameras are associated, which solves the individual tracking problem across cameras

Therefore, this work presents an integrated three-layer framework for solving the tracking problem both within each camera and across cameras. In the first two layers, each camera is synchronized and the primary characteristics and appearance of the individuals (including color and size) are detected. In the third layer, all tracks of an individual in all cameras are associated using artificial intelligence.

In summary, the innovations of this research focus on optimizing the tracking of moving humans using a combination of techniques and algorithms, providing a hierarchical approach, feature selection with GA, and optimal Kalman filtering. A comprehensive evaluation and comparison of the proposed method with other algorithms shows its effectiveness and practicality in different scenarios. The details of the contribution of this research are as follows:

  • Purpose: The purpose of the research is to track moving objects with a focus on identifying human identity through appearance analysis. To achieve this goal, it uses a combination of techniques including Kalman filter, Artificial Neural Networks (ANN), Fuzzy Logic and Genetic Algorithms (GA).

  • Methodology: The research uses a three-layer hierarchical approach. It starts by identifying people in each camera feed using color and size features. Then, it tracks people in different cameras using Kalman filter optimized with fuzzy logic. In addition, a CNN-based algorithm is used for object detection and tracking.

  • Feature selection: In this study, a feature selection process using GA is introduced to optimize the feature set for neural network training, reducing complexity and increasing classification accuracy.

  • Evaluation: This research evaluates the performance of the proposed method in terms of convergence rate, estimation error and tracking accuracy. It performs simulations on three different datasets, including night visibility, cloud visibility, and daytime visibility scenarios.

  • Comparison: This research compares the proposed method with the Mean Shift algorithm and other advanced algorithms such as YOLO and Fast RCNN in terms of tracking accuracy, precision, recall and F1 score. It also considers the time complexity for performance evaluation.

The rest of the paper is organized as follows. The second section gives an overview of previous works and compares them with this study in a table. The third section introduces the general structure and components of the people tracking system. The fourth section describes the proposed method for detecting and identifying the tracked people. The fifth section presents the simulation results and related considerations. Finally, the sixth section presents the conclusion.

2 Background

Tracking in a video means continuously detecting the position of an object or individual and updating its movement while the target or the camera is moving. Although various studies have been conducted in this context, tracking remains complicated because of the low resolution of the cameras, high compression of the video, the complicated nature of the movement and geometry of the object, changes in illumination, the need for real-time processing, and the different appearance of individuals in environments with different cameras. Thus, more studies and novel approaches are required. Recently introduced systems mainly use three features: movement mode, geometric modes, and individuals' appearance. Each of these features might perform differently in tracking and matching individuals as they enter the sight angle of the cameras. To this end, this section reviews studies on tracking with one camera and on redetection in a network of cameras with overlapping sight.

In movement-based tracking, an object in the current frame is matched to the object in the previous frame whose movement parameters, such as position, speed, and acceleration, are similar. This type of tracker uses the geometric center of each individual, called the center of mass. In tracking systems based on the movement feature, some assumptions are applied to the system and the monitored environment to reduce complexity. For example, Ciaparrone et al. [23] have assumed that the frame rate of the video sequence is sufficiently high, so the individual's position does not change significantly from one frame to the next. Similarly, Nallasivam et al. [24] have assumed that the individual's appearance, walking speed, and stopping or starting of walking change gradually (Kang et al. [25]), or that the individual's speed is constant. These assumptions are not valid in unconstrained environments because the trajectory might change or individuals might stop suddenly and unexpectedly (Gong and Shu [26]). Thus, this method is not suitable for tracking multiple individuals in real-time applications.

In geometric models, shape features are used for tracking and detection. Edge detection, object boundary matching, moments, area, and size are the object features used for tracking. Geometric models employ individuals and their extracted features for detecting individuals in a wide range of applications. Algorithms that are invariant to scale, translation, and rotation can be used for tracking and individual matching. Moments are used as quantitative measures of the shape of a set of points [27]. The moment of order 0 is the area, the moment of order 1 gives the mean (center of mass), and the moment of order 2 gives the scatter of the data; these are used for statistical analysis. To match object shapes, n moments are used [28]. In another case, invariant moments, which are linear combinations of central moments, have been introduced. Sreekala et al. [29] have performed object tracking and detection using a vector of three invariants and have used the Euclidean distance to match them.

Appearance information has been used frequently in object tracking and detection. Important appearance-based methods for tracking individuals include probability distribution functions, 1D and 2D appearance models, multiple constant points on individuals, and appearance models built from multiple sight angles. Color histograms and the Bhattacharyya distance have been used to find the best match of the individuals' positions. Among studies in which the tracking method is not automatic, Chen et al. [30] can be mentioned, in which training is offline; they selected a number of histogram bins in the RGB space and obtained the optimal trajectory using dynamic programming. Many researchers have used multiple overlapping cameras to increase tracking performance. Ghaznavi et al. [31] studied about 400 papers in the context of tracking and reviewed the problems of tracking individuals and the methods used to resolve them, including the primary human appearance model, tracking, position estimation, and behavior identification. Halkarnikar et al. [32] introduced many existing techniques used for individual tracking in detail and gave a brief report of them. In other studies, some of the tracking methods have been reviewed and their performance compared [33, 34].

Redetection is the task of observing an individual in the sight angle of one camera and detecting the same individual again in the sight angle of the same camera or another camera. This is an essential step in environments with multiple cameras with independent sight angles. An individual might enter the sight angle of a camera several times. If redetection succeeds and the times and number of occasions on which the individual enters and exits the sight angle of a camera are obtained, this information is very helpful for detecting and analyzing the individual's behavior. For instance, Mirbakhsh et al. [35] and Duan et al. [36] used position and speed information of the moving individuals to determine the relationship between the time and position of the individuals. Trik et al. [4] assumed that there is a specific relationship between individuals in these cameras and used a Markov model to model the individuals' dynamics. In many other papers, histogram matching, such as color histogram matching, has been used as a signature [37, 38]. In a recent study, Lind et al. [39] used a hidden Markov model to model the edges of the tracked region. This model considers several image cues and uses a joint probabilistic data association filter to estimate how likely it is that the picture will change. Some of these techniques focus on pixel-accurate object tracking in offline mode, while others focus on real-time tracking of rough object boundaries. None of them, however, can automatically adjust the weights for each section of the contour, nor can they be scaled to meet both goals within a single generic framework. Table 1 lists a few object-segmentation and tracking methods from the literature along with some of their characteristics.

Table 1 Efficiency of existing methods for object segmentation and object tracking compared to the proposed method

3 The proposed algorithm for detecting moving objects

The proposed algorithm can process online and offline video as shown in the flowchart of Fig. 2. (Threshold is the boundary that determines the real movement of the object).

Fig. 2

Block diagram of the implemented algorithm

A video file in the .avi format is read as input after suitable threshold values have been chosen. The red, green, and blue intensities of each frame are extracted, and a histogram is created to identify the background. The frames are then converted to grayscale images. To identify the foreground, the background is subtracted from the subsequent frames. After the moving objects are found, the shadow is removed to determine the region of each moving object. The moving objects are then displayed at the output with rectangular bounding boxes and labeled, and morphologic operations are applied.

3.1 Threshold value

It is important to choose suitable threshold values for the background, the standard deviation, and the moving object area. The standard deviation (STD) is a statistical measure used during processing to remove the shadow cast by moving objects. Here, the background threshold is set to 250 pixels, the STD threshold to 0.25, and the moving object area threshold to 8 pixels. In this technique, a block is defined as 8 × 8 pixels.

3.2 Input film

The input video is in AVI format; AVI stands for Audio Video Interleave. An AVI file is a RIFF-formatted container for audio and video data. Audio and video are stored together in the file, which makes synchronized playback of both possible. Audio data in AVI files is typically stored in uncompressed PCM format with various parameters, whereas video data is typically stored in compressed form, with a variety of codecs and settings. The MATLAB functions aviread and aviinfo are used to read the input AVI video. The technique is evaluated on input video files with various frame counts.

3.3 Extraction

After the input video file is read, the red, green, and blue intensity features are extracted separately to simplify computing the histogram. The red, green, and blue intensities of the input video frames are read using the (1,:,:), (2,:,:), and (3,:,:) image indexing expressions.

3.4 Histogram

An image histogram is the graphical representation of an image's pixels as a function of their intensity. A histogram is made up of a number of bins, each of which covers a particular range of intensities. When a histogram is calculated, all of the image's pixels are examined and each pixel is assigned to a bin based on its intensity. The final value of a bin is the number of pixels assigned to it. The square root of the number of pixels is typically used to choose the number of bins. The image histogram is a crucial tool for analyzing the properties of an image, as it shows the background and the grey scale values at the same time. Here, the background is extracted using a histogram. Additionally, features extracted from the histograms of object images are employed for recognition across several overlapping cameras. The histogram of a digital image with L possible intensity levels in the range [0, G] is defined as the discrete function:

$$h\left( {r_{k} } \right) \, = \, n_{k}$$

where r_k is the kth intensity level in the range [0, G] and n_k is the number of pixels in the image that have intensity r_k. The value of G is 255, 65,535, and 1 for images of class uint8, uint16, and double, respectively. Since MATLAB indices cannot begin at 0, r_1 represents intensity level 0, r_2 intensity level 1, and so on, up to r_L, which represents intensity level G. For images of class uint8 and uint16, G = L - 1. It is often useful to work with normalized histograms, obtained by dividing all elements of h(r_k) by the total number of pixels in the image, denoted by n.

$$P\left( {r_{k} } \right) \, = \, h\left( {r_{k} } \right)/n \, = \, n_{k} / \, n$$

For k = 1, 2, …, L, P(r_k) is an estimate of the probability of occurrence of intensity level r_k. The MATLAB functions histc and imhist are used in this section.
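The two definitions above can be illustrated with a short Python sketch. It is not the authors' MATLAB code (which relies on histc and imhist); the function names and the toy image are assumptions made for illustration, and the list index k is zero-based, so h[k] counts the pixels with intensity k.

```python
from collections import Counter

def histogram(pixels, levels):
    """Return h, where h[k] is the number of pixels with intensity k."""
    counts = Counter(pixels)
    return [counts.get(k, 0) for k in range(levels)]

def normalized_histogram(pixels, levels):
    """Return p, where p[k] = h[k] / n estimates the probability of intensity k."""
    n = len(pixels)
    return [h_k / n for h_k in histogram(pixels, levels)]

# Hypothetical 2 x 4 image with L = 4 intensity levels, flattened row by row.
image = [0, 1, 1, 3,
         3, 3, 0, 1]
h = histogram(image, levels=4)
p = normalized_histogram(image, levels=4)
print(h)  # [2, 3, 0, 3]
print(p)  # [0.25, 0.375, 0.0, 0.375]
```

The normalized values sum to 1, which is what allows P(r_k) to be read as an estimate of a probability.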

3.5 Grey scale image

Grey scale images are devoid of color; each pixel carries only an intensity. Grey scale values range from 0 (black) to 1 (white). After the histogram is calculated, the frames are converted into grey scale images to simplify processing and to allow morphologic operations to be applied.

3.6 Subtraction

The suggested approach dynamically subtracts the background from each input video frame in turn and compares the result with the background threshold. A pixel is regarded as foreground if the difference exceeds the background threshold and as background if it does not. The foreground is updated at every frame.
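A minimal Python sketch of this thresholded subtraction is given below. The paper does not state how the background model is updated, so the running-average update, the blending factor alpha, the threshold of 30, and the toy frames are all assumptions made for illustration.

```python
def foreground_mask(frame, background, threshold):
    """Mark a pixel as foreground (1) when it differs from the background
    by more than `threshold`, otherwise as background (0)."""
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(frow, brow)]
        for frow, brow in zip(frame, background)
    ]

def update_background(background, frame, mask, alpha=0.05):
    """Blend the frame into the background wherever no motion was found.
    The running-average rule and `alpha` are assumptions of this sketch."""
    return [
        [b if m else (1 - alpha) * b + alpha * f
         for b, f, m in zip(brow, frow, mrow)]
        for brow, frow, mrow in zip(background, frame, mask)
    ]

# Hypothetical 2 x 3 grey scale frames with intensities in 0..255.
background = [[10, 10, 10], [10, 10, 10]]
frame = [[12, 200, 10], [9, 180, 11]]
mask = foreground_mask(frame, background, threshold=30)
background = update_background(background, frame, mask)
print(mask)  # [[0, 1, 0], [0, 1, 0]]
```

Leaving foreground pixels out of the update keeps a moving object from being absorbed into the background model too quickly.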

3.7 Removing shadow

Each frame is processed in 8 × 8 blocks: a function is applied to each block, and its output is compared with the variance threshold. A block whose result falls below the variance threshold is interpreted as shadow and assigned logic 0; otherwise, it is assigned logic 1.
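The block test can be sketched in Python as follows. The paper does not name the block function, so this sketch assumes it is the standard deviation of the block and that the threshold is the STD value of 0.25 from Sect. 3.1; the function name and the toy image are hypothetical.

```python
from statistics import pstdev

def shadow_mask(gray, block=8, std_threshold=0.25):
    """Return one flag per block: 0 (shadow) when the block's standard
    deviation is below the threshold, otherwise 1."""
    rows, cols = len(gray), len(gray[0])
    mask = []
    for r0 in range(0, rows, block):
        mask_row = []
        for c0 in range(0, cols, block):
            values = [gray[r][c]
                      for r in range(r0, min(r0 + block, rows))
                      for c in range(c0, min(c0 + block, cols))]
            mask_row.append(0 if pstdev(values) < std_threshold else 1)
        mask.append(mask_row)
    return mask

# Hypothetical 8 x 16 image: a flat left block and a high-contrast right block.
gray = [[0.4] * 8 + [0.0, 1.0] * 4 for _ in range(8)]
print(shadow_mask(gray))  # [[0, 1]]
```

The flat block has no spread and is flagged as shadow, while the alternating block has a standard deviation of 0.5 and is kept.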

3.8 Morphologic operations

Morphology is a broad category of image processing techniques that process images based on shapes. Morphologic operations apply a structuring element to an input image and produce an output image of the same size. Erosion and dilation are the two primary morphologic operations. In a morphologic operation, the value of each pixel in the output image is determined by comparing the corresponding pixel in the input image with its neighbors. By choosing the size and shape of the neighborhood, a morphologic operation can be constructed that is sensitive to specific shapes in the input image.

Dilation: Dilation is the operation that makes the objects in a binary image larger or thicker. A shape known as the structuring element controls the manner and extent of this thickening. In other words, dilation uses the structuring element to probe and expand the shapes present in the input image. Dilation is commutative: A ⊕ B = B ⊕ A. By convention in image processing, the first operand represents the image and the second operand represents the structuring element, which is typically much smaller than the image. As an illustration, consider applying a 3 × 3 square structuring element to the simple binary image A, which contains a rectangular object.

$$\left[ {\begin{array}{*{20}l} 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ \end{array} } \right]$$

After dilation, the binary image A is as follows:

$$\left[ {\begin{array}{*{20}l} 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ \end{array} } \right]$$

Erosion: Erosion shrinks or thins the objects in a binary image. As in dilation, a structuring element controls the manner and extent of the shrinking. Erosion is the opposite of dilation. After the primary processing steps, a 3 × 3 square matrix is used as the structuring element for the binary image A. In practice, dilation and erosion are typically used in various combinations: an image is passed through a series of dilations and erosions that use the same structuring element or, occasionally, different ones. The MATLAB functions imdilate and imerode are used in this section.
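The dilation example above can be reproduced with a small Python sketch. It stands in for the MATLAB functions imdilate and imerode rather than reproducing them; the function names are hypothetical, and the erosion treats pixels outside the image as not constraining the result, which is an assumption of this sketch.

```python
def dilate(image, size=3):
    """Binary dilation with a size x size square structuring element:
    a pixel becomes 1 if any pixel in its neighborhood is 1."""
    rows, cols, half = len(image), len(image[0]), size // 2
    return [
        [1 if any(image[rr][cc]
                  for rr in range(max(0, r - half), min(rows, r + half + 1))
                  for cc in range(max(0, c - half), min(cols, c + half + 1)))
         else 0
         for c in range(cols)]
        for r in range(rows)
    ]

def erode(image, size=3):
    """Binary erosion with the same element: a pixel stays 1 only if
    every in-image pixel of its neighborhood is 1."""
    rows, cols, half = len(image), len(image[0]), size // 2
    return [
        [1 if all(image[rr][cc]
                  for rr in range(max(0, r - half), min(rows, r + half + 1))
                  for cc in range(max(0, c - half), min(cols, c + half + 1)))
         else 0
         for c in range(cols)]
        for r in range(rows)
    ]

# Binary image A from the text: 9 x 10, with a 3 x 4 rectangle of ones.
A = [[0] * 10 for _ in range(9)]
for r in range(3, 6):
    for c in range(3, 7):
        A[r][c] = 1

D = dilate(A)
for row in D:
    print(row)
```

The dilated rectangle grows by one pixel on every side, matching the second matrix above, and eroding D with the same element recovers A.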

4 Methodology for target detection

The object detection algorithm based on CNN includes two steps: detecting and tracking the object. The general block diagram of the proposed system is shown in Fig. 3.

Fig. 3

Block diagram of the proposed system

In this system, a video is given as input, and its frames are extracted for further processing. The two main algorithms, object detection and tracking, are carried out using deep learning methods and a Kalman filter optimized by fuzzy logic. Extraction of a moving object using background subtraction and object tracking using the Kalman filter were discussed in previous sections. Object detection is tested using the computer vision algorithm under different conditions, including illumination changes and occlusion. Thus, in this study, the object detection algorithm based on background removal and noise removal is used. After objects are detected, their positions are needed to start tracking. The cameras perform tracking via a Kalman filter, which is optimized by a fuzzy PI controller for accurate tracking. To track moving objects and individuals in overlapping and non-overlapping cameras, a detection system is required that is trained on the target detected and tracked in the first camera and labels that target for the subsequent cameras. In this method, instead of the common algorithm based on computer vision, a CNN-based tracking algorithm is used. The object tracking algorithm based on CNN is shown in Fig. 4.

Fig. 4

Flowchart for object detection in overlapping and non-overlapping cameras

Object tracking is an essential step in computer vision algorithms. For tracking to be robust, object understanding and knowledge like movement and its changes during time are required. The tracker should be able to use its model and make decisions about new observations.

In this method, the pretrained weight parameters of the model are loaded. The model can incorporate temporal information. Instead of focusing on specific objects at test time, the pretrained model is trained on a variety of objects and operates in real time. It can track objects at a speed of 150 frames/second and is also able to handle occlusion.

In this method, the locations obtained from the object detection algorithm based on background subtraction are passed to the object tracking algorithm based on the Kalman filter optimized using a fuzzy PI controller. The initial positions are learnt by the model, and the same points are then searched for with the CNN model at test time.
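To make the hand-off from detection to tracking concrete, the following Python sketch runs one predict/update cycle of a conventional Kalman filter on a single image coordinate. It assumes a constant-velocity motion model and hypothetical noise levels q and r; the fuzzy PI tuning used in this study is not reproduced here.

```python
def kalman_step(x, P, z, dt=1.0, q=1e-3, r=1.0):
    """One predict/update cycle for the state x = [position, velocity].

    P is the 2 x 2 state covariance, z the measured position (for example
    the centroid from background subtraction), q the process-noise level
    and r the measurement-noise variance (both assumed values).
    """
    # Predict: x = F x and P = F P F^T + Q, with F = [[1, dt], [0, 1]].
    pos = x[0] + dt * x[1]
    vel = x[1]
    p00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + q
    p01 = P[0][1] + dt * P[1][1]
    p10 = P[1][0] + dt * P[1][1]
    p11 = P[1][1] + q
    # Update with H = [1, 0]: innovation, gain, corrected state and covariance.
    s = p00 + r
    k0, k1 = p00 / s, p10 / s
    innovation = z - pos
    x_new = [pos + k0 * innovation, vel + k1 * innovation]
    P_new = [[(1 - k0) * p00, (1 - k0) * p01],
             [p10 - k1 * p00, p11 - k1 * p01]]
    return x_new, P_new

# Hypothetical target moving 2 pixels per frame along one image axis.
x, P = [0.0, 0.0], [[100.0, 0.0], [0.0, 100.0]]
for t in range(1, 51):
    x, P = kalman_step(x, P, z=2.0 * t)
print(x)  # position estimate near 100, velocity estimate near 2
```

For a 2D image, the same step can be applied independently to the X and Y coordinates, which is a simplification that ignores any coupling between the two axes.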

4.1 CNN with features selected using GA

The CNN model was inspired by the simple and complex cells of the brain's visual system. Unlike classic neural networks, a CNN exploits the spatial information between image pixels. For image processing, a CNN comprises two sections (feature extraction and classification) built from three types of layers: convolution, pooling, and fully-connected layers. The two sections of the CNN and the layers are shown in Fig. 5.

Fig. 5

Flowchart of the algorithms used in this study

In this study, the histogram features are extracted in MATLAB, and the neural network is trained in MATLAB on the feature values obtained from the images of the first camera. The information received in each frame for the image processed by the moving object detection algorithm is then used by the other cameras to determine whether an object or person is moving. In this design, the tracking algorithm tracks the individual of interest in each frame and marks its label for the different cameras.

Genetic Algorithm (GA): GA is an adaptive heuristic search algorithm that belongs to evolutionary computation. It is inspired by Darwin's theory of evolution, that is, survival of the fittest. The method provides solutions by emulating processes such as selection, cross-over, mutation, and acceptance [42].

4.1.1 Optimizing feature selection

In this study, to improve the selection of features for training, each feature is used as the input of a fourth-order polynomial function (Eq. 5) for selecting disease type. The coefficients of this function are obtained using GA according to the following equation. The objective defined in this study is to reduce the error between the output of the polynomial function and the target for the given input features; the error is calculated after a fixed number of periods, equal to 300, for every feature. Accordingly, the feature with the minimum error, computed from the polynomial coefficients found by GA, is selected as the best feature. The neural network is trained based on these features. The related instructions are presented in the second appendix. The value of the function for the 16 features is given in the following.

$${\text{Final value}} = \begin{array}{*{20}l} {[14.0160658410323} \hfill & {13.5847355178953} \hfill & {14.9161632555649} \hfill & {13.9472064981342} \hfill \\ {150.608628948777} \hfill & {810.302856290688} \hfill & {14.6897366154422} \hfill & {14.8593188522127} \hfill \\ {17141.2435259272} \hfill & {15.0005639024906} \hfill & {10080.5231924715} \hfill & {27.2403298186408} \hfill \\ {15.0000000000056} \hfill & {169.285241880704} \hfill & {27015087247208.2} \hfill & {11.3403758520762]} \hfill \\ \end{array}$$
$$z = \sum_{i} \left| x_{1} y_{i,1}^{4} + x_{2} y_{i,1}^{3} + x_{3} y_{i,1}^{2} + x_{4} y_{i,1} + x_{5} - y_{i,2} \right|$$
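The objective z and the selection of the feature with the minimum error can be sketched in Python as follows. The GA details (population size, one-point cross-over, Gaussian mutation, keeping the better half) and the toy data are assumptions made for illustration and are not taken from the authors' appendix; only the fourth-order objective and the 300 periods come from the text.

```python
import random

def poly_error(x, samples):
    """Objective z: summed absolute error of a fourth-order polynomial.
    x holds five coefficients; samples is a list of (feature, target) pairs."""
    return sum(abs(x[0] * f**4 + x[1] * f**3 + x[2] * f**2 + x[3] * f + x[4] - t)
               for f, t in samples)

def ga_fit(samples, generations=300, pop_size=30, sigma=0.1, seed=0):
    """Search for polynomial coefficients with a small, assumed GA."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda x: poly_error(x, samples))
        parents = pop[:pop_size // 2]           # selection: keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, 5)           # one-point cross-over
            child = a[:cut] + b[cut:]
            child[rng.randrange(5)] += rng.gauss(0, sigma)  # mutation
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda x: poly_error(x, samples))

# Hypothetical data: two candidate features and one target column.
targets = [1.0, 2.0, 3.0, 4.0]
features = {"f1": [0.1, 0.2, 0.3, 0.4], "f2": [0.9, 0.1, 0.5, 0.3]}
errors = {}
for name, values in features.items():
    samples = list(zip(values, targets))
    errors[name] = poly_error(ga_fit(samples), samples)
best = min(errors, key=errors.get)
print(best, errors)
```

Because the better half of the population is carried over unchanged, the best error found never gets worse from one generation to the next.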

In general, the project is carried out for individual tracking in overlapping cameras according to Fig. 5. First, the moving object or individual is identified through background subtraction. Next, the Kalman filter optimized using fuzzy PI logic is used for tracking. After identifying the moving individual and numbering it, its histogram is extracted to train the neural network. Finally, the trained network is used for other cameras.

5 Simulation results

In the following, the results of the proposed detection and tracking technique on human samples are considered. Figure 6 shows the result of all processing, segmentation, and separation steps for the selected video sample. Finally, the histogram of the selected individual is extracted and given to the neural network. The average length of the tracked video sequences for the chosen images in the PETS dataset is 7 s, which is longer than the sequences of the other datasets. Long videos present difficulties because they contain more object samples and longer occlusions, and they require more memory because there are more frames.

Fig. 6

Representing the tracking results and extracting color histogram features of moving individuals
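The color histogram features mentioned above can be sketched as follows in Python. The number of bins, the per-channel layout, and the function name `color_histogram` are assumptions for illustration; the paper does not specify them in this section. The input is assumed to be an 8-bit image patch of shape (height, width, channels) cropped around the detected individual.

```python
import numpy as np

def color_histogram(patch, bins=16):
    """Concatenate per-channel histograms of an 8-bit image patch.

    Returns a feature vector of length bins * channels that sums to 1,
    so patches of different sizes are comparable.
    """
    patch = np.asarray(patch)
    feats = []
    for c in range(patch.shape[2]):
        hist, _ = np.histogram(patch[..., c], bins=bins, range=(0, 256))
        feats.append(hist)
    feats = np.concatenate(feats).astype(float)
    total = feats.sum()
    return feats / total if total > 0 else feats
```

Normalizing by the total count reduces the dependence of the feature vector on the size of the person in the image, which varies with camera distance.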

Figure 7 shows the results of training the neural network on the selected individual for detection. The network comprises five layers that perform individual detection in different time periods.

Fig. 7

Training CNN after 6 periods

Next, the performance of Kalman filter tracking optimized using fuzzy logic is examined in Figs. 8 and 9 for the movement of a ball that moves faster than a human; its coordinates in the 2D image space are plotted against time. Figure 8 shows the results for the conventional filter and Fig. 9 the results for the filter optimized with fuzzy logic, so the two ball-tracking results can be compared.

Fig. 8

Tracking with conventional Kalman filter in X, Y coordinate of the 2D image vs. time

Fig. 9

Tracking with conventional Kalman filter optimized using fuzzy PI controller in X, Y coordinate of the 2D image vs. time

5.1 Results in PETS 2009

This section presents the results of the proposed method, referred to as OSM (matching of the target image), in simulations performed on PETS 2009 samples. The PETS 2009 dataset is a database of photographs and short clips from overlapping cameras that is used to test programs requiring overlapping camera views. Many similar studies have used it, and it is also used in this paper to evaluate the proposed solution.

Figures 10, 11, and 12 show that the proposed algorithm has been tested successfully in challenging scenarios. Person recognition is carried out across the overlapping cameras. After the neural network is trained with person-specific labels, the different parts of the system can be analyzed and results can be obtained for the other cameras based on the color residual characteristics. Detections outside the area of interest (AOI) are easily removed using the initial positions, and the proposed algorithm also removes false detections assigned to it within the AOI. Figures 10 and 11 show the individual detection results for the other overlapping cameras in the target area. Figure 12 shows the results of tracking and identifying the moving person in the images of the different cameras for an example with a higher human density.

Fig. 10

Showing the stages of segmentation and extraction of characteristics and identification of the target person based on the trained neural network

Fig. 11

Recognition and identification of the target person by the trained neural network for cameras 2, 3, and 4 in one frame

Fig. 12

Detecting the target individual in cameras a, b, c, and d and tracking the individual in the videos using the Kalman filter

5.2 Comparison of simulation results of the proposed method with the mean shift algorithm

In this section, the Mean Shift algorithm [43] is used as a baseline to assess the proposed method. Mean Shift is applied to consecutive video frames: the initial location and size of the search window in the current frame are set from the values that the algorithm obtained in the preceding frame. In the test video, the moving target and its shadow have similar colors, the targets are pedestrians moving at a constant speed, and background noise is present. The positioning error of the center is the Euclidean distance between the actual position and the tracking result, so the best result is obtained when this distance is smallest. When the difference between the actual target coordinates and the tracking coordinates is minimal, the scale control performance and the control performance of the Mean Shift algorithm are consistent and low. The existing and proposed algorithms are analyzed quantitatively using the center positioning error (CPE) and distance accuracy (DA). CPE is the Euclidean distance between the target's center of detection and its actual position, and DA is an evaluation metric for distance accuracy. Table 2 compares the Mean Shift algorithm and the proposed method on three measures: center positioning error, distance accuracy, and average target tracking error. The results in Table 2 show a significant difference in all three measures and demonstrate that the proposed method produces better results. The analysis also suggests that tracking the target with grayscale information alone is difficult and that the mean shift technique becomes extremely unreliable when the background color matches the target color.
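The evaluation measures can be illustrated with a short Python sketch. CPE follows the definition in the text. The text does not define how DA is computed, so the threshold-based form below (the fraction of frames whose CPE does not exceed a pixel threshold) is an assumption, as are the default threshold of 20 pixels and the function names.

```python
import numpy as np

def center_positioning_error(tracked, actual):
    """Per-frame Euclidean distance between tracked and actual centers."""
    tracked = np.asarray(tracked, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.linalg.norm(tracked - actual, axis=1)

def distance_accuracy(cpe, threshold=20.0):
    """Assumed DA: share of frames with CPE at or below the threshold."""
    return float(np.mean(np.asarray(cpe) <= threshold))

def mean_tracking_error(cpe):
    """Average CPE over the sequence."""
    return float(np.mean(cpe))
```

Under these definitions a better tracker has a lower CPE and mean tracking error and a higher DA.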

Table 2 Evaluating the outcomes of the suggested approach and the mean shift algorithm

Table 2 shows that the suggested approach outperforms the Mean Shift algorithm in terms of tracking accuracy and mean target tracking error. The computation times for the suggested approach and the Mean Shift algorithm are shown in Fig. 13. The specifications of the system for evaluation are given in Table 3.

Fig. 13

Comparison of the suggested algorithm’s and the Mean Shift algorithm’s time complexity

Table 3 The system requirements

Time complexity is a crucial factor in live video processing. Figure 13 shows the time needed to process the video frames, comparing the Mean Shift algorithm with the proposed method in terms of the total number of frames and the time in milliseconds. The graph shows that the proposed method needs little time to process the video frames.

5.3 Comparison of results for efficiency measures

The results of the proposed method are compared with those of the advanced YOLO [44] and Fast RCNN [45] algorithms. The performance measures true positive, true negative, false positive, false negative, accuracy, precision, and recall are calculated under three environmental conditions: clear daytime vision, hazy daytime vision, and nighttime vision. The F1 score balances precision and recall by taking both into account, which makes it an appropriate criterion for assessing the accuracy of an experiment. The F1 score ranges from zero to one.

This criterion is determined by a classifier's precision and recall, where precision is the ratio of samples that are correctly predicted as positive to all samples that are predicted to be positive, and recall is the ratio of samples that are correctly predicted as positive to all samples that are actually positive.
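These measures follow directly from the four confusion counts. The sketch below uses the standard definitions; the function names and the convention of returning zero when a denominator is zero are choices made for this illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all samples that are classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1 score)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Because F1 is a harmonic mean, it is high only when both precision and recall are high, which is why it is preferred here over accuracy alone.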

Table 4 lists the performance measures for clear daytime vision, where the F1 score of the proposed algorithm differs only insignificantly, in the third decimal place. Table 5 shows the performance measures for cloudy vision, where the F1 scores differ significantly. For cloudy vision, the proposed method performs better than the YOLO and Fast RCNN algorithms.

Table 4 Performance indicators for clear daytime vision
Table 5 Performance indicators for hazy daytime vision

Table 6 shows the performance measures for night vision, where the F1 score varies considerably. The accuracy values of the Fast RCNN method and the proposed method are identical and equal to 0.915, and the accuracy values of the YOLO and Fast RCNN algorithms are quite close. In terms of the F1 score, the proposed method performs better than the Fast RCNN algorithm and worse than the YOLO algorithm for night vision.

Table 6 Performance indicators for nighttime vision

Figure 14 compares the F1 score of the proposed method with those of the more advanced YOLO and Fast RCNN algorithms in clear daytime, cloudy daytime, and nighttime settings. The graph shows that the values of the three algorithms differ only slightly, with the proposed method outperforming the others in cloudy vision. The YOLO and Fast RCNN architectures must be trained and tested on a powerful GPU, which results in high time complexity and greater computing cost.

Fig. 14

F1 score performance of the proposed method compared to cutting-edge algorithms

6 Conclusion

In this study, a technique was presented for detecting individuals with a multi-camera system using color histogram knowledge. We started by creating congestion maps by integrating all predicted views on the background and parallel planes. Moving objects create significant congestion in the map and generate a specific shape around them. A solution was developed to detect individuals and track them across different overlapping cameras, and a common solution was presented that includes the neural network technique. This method has a main drawback. It relies on detecting individuals from the color histogram of their clothes. To track an individual at different points using overlapping cameras, the Kalman filter optimized using the fuzzy logic algorithm is then used. The results obtained for the PETS 2009 samples indicate good performance.

A complete framework for detecting individuals in images and videos obtained from multiple visual sensors was discussed, and concepts from multi-view camera geometry, computer vision, and pattern recognition were used in two different approaches. The first approach is based on the analysis of congestion maps with a different spatial core using Kalman filter tracking. In the second approach, a key point extraction process based on the neural network is presented for the histogram of individuals' images. These key points are verified by the presence of a specific shape. The methods were tested on three challenging datasets. Individual tracking with multiple cameras yields significant values, which are related to the positions that should be identified. As observed, a significant reduction in the false detection rate (FDR) is achieved, so the overall performance of multi-view tracking is improved.

The height of the cameras has a considerable impact on identification using multi-camera congestion maps and homography congestion limits. The homographic congestion constraint defines the relationship between image uncertainties and the real world. A presumptive camera, or a set of cameras, providing a top view locates objects more reliably but is unreliable in determining an object's height. On the other hand, if a set of presumptive cameras is located at a lower altitude, height is determined more easily, but the estimate obtained is inaccurate. The effect of height on congestion maps is presented using empirical analysis. Features such as appearance, texture, and individual detection might be useful in this case.

To better understand the contribution of each component of our method, we develop a basic tracker that uses 3D coupling binary correlation as the basic method (Hofmann et al. [46]) and apply subgraph condense search (Yuan et al. [43]) as the follow-up solution. However, the color histogram of the individual is used to reduce the dependency on the distance and height of the camera. This basic algorithm is compared with our implementation of the neural network optimization for tracking two camera views in Table 7. The basic algorithm improves the values of MOTA and MT and reduces ML and IDS compared to the method based on network flow optimization (Hofmann et al. [46]), indicating that neural network tracking performs better for searching the subgraphs or subscripts.

Table 7 Effect of different components on the proposed tracker
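For readers unfamiliar with the tracking measures in Table 7, the sketch below computes MOTA under the widely used CLEAR MOT definition, in which misses, false positives, and identity switches (IDS) are summed over all frames and divided by the total number of ground-truth objects. It is assumed, not confirmed by the text, that the paper uses this standard definition; the function name is illustrative.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multiple object tracking accuracy (CLEAR MOT definition).

    All counts are totals over the whole sequence. The value is at most 1
    and can be negative when the errors outnumber the ground-truth objects.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth
```

Higher MOTA and MT (mostly tracked) values and lower ML (mostly lost) and IDS values indicate better tracking.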

The evaluations showed that the proposed approach performs 8% better than other similar methods under night vision, cloudy vision, and daytime vision conditions. As mentioned in Sect. 4.1, the reason for this superiority is the selection of features with the help of the genetic algorithm, which significantly improves this part by selecting the best features and enabling classification with less complexity for the neural network system.

The following suggestions may increase performance in future work. More advanced neural network architectures, such as Convolutional Neural Networks (CNN) with attention mechanisms or Recurrent Neural Networks (RNN), can be explored to increase the system's mining and tracking capabilities. A full hyperparameter optimization can be performed to fine-tune the parameters of the neural network and the Kalman filter, which may also improve tracking accuracy. The algorithm can also be optimized for real-time processing to handle high frame rates and provide timely tracking results in practical applications. Further, the evaluation criteria can be expanded to include measures of tracking quality under different conditions, such as occlusion handling and tracking consistency.

Availability of data and materials

The data used to support the findings of this study are available from the corresponding author upon request.


References

  1. S. Huang, G. Zong, H. Wang, X. Zhao, K.H. Alharbi, Command filter-based adaptive fuzzy self-triggered control for MIMO nonlinear systems with time-varying full-state constraints. Int. J. Fuzzy Syst. (2023).

  2. A.S. Sadkhan, S.R. Talebiyan, N. Farzaneh, An investigate on moving object tracking and detection in images, in 2021 1st Babylon International Conference on Information Technology and Science (BICITS) (pp. 69–75). IEEE. (2021)

  3. W. Luo, J. Xing, A. Milan, X. Zhang, W. Liu, T.K. Kim, Multiple object tracking: a literature review. Artif. Intell. 293, 103448 (2021)

  4. M. Trik, A.M.N.G. Molk, F. Ghasemi, P. Pouryeganeh, A hybrid selection strategy based on traffic analysis for improving performance in networks on chip. J. Sens. (2022).

  5. F. Cheng, B. Niu, N. Xu, X. Zhao, A.M. Ahmad, Fault detection and performance recovery design with deferred actuator replacement via a low-computation method, in IEEE Transactions on Automation Science and Engineering. (2023)

  6. M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernandez, R. Pflugfelder, The visual object tracking vot2015 challenge results, in Proceedings of the IEEE international conference on computer vision workshops (pp. 1–23). (2015)

  7. M. Samiei, A. Hassani, S. Sarspy, I.E. Komari, M. Trik, F. Hassanpour, Classification of skin cancer stages using a AHP fuzzy technique within the context of big data healthcare. J. Cancer Res. Clin. Oncol. (2023).

  8. S. Yue, B. Niu, H. Wang, L. Zhang, A.M. Ahmad, Hierarchical sliding mode-based adaptive fuzzy control for uncertain switched under-actuated nonlinear systems with input saturation and dead-zone. Robotic Intelligence and Automation 43(5), 523–536 (2023)

  9. S. Yıldırım, M. Khalafi, T. Güzel, H. Satık, M. Yılmaz, Supply curves in electricity markets: a framework for dynamic modeling and monte carlo forecasting, in IEEE Transactions on Power Systems. (2022)

  10. A. Mirbakhsh, How hospitals response to disasters; a conceptual deep reinforcement learning approach. Adv. Eng. Days (AED) 6, 114–116 (2023)

  11. S.V. Kothiya, K.B. Mistree, A review on real time object tracking in video sequences, in Electrical, Electronics, Signals, Communication and Optimization (EESCO), 2015 International Conference on (pp. 1–4). (2015)

  12. F. Tang, H. Wang, L. Zhang, N. Xu, A.M. Ahmad, Adaptive optimized consensus control for a class of nonlinear multi-agent systems with asymmetric input saturation constraints and hybrid faults. Commun. Nonlinear Sci. Numer. Simul. 126, 107446 (2023)

  13. T. Wang, L. Zhang, N. Xu, K.H. Alharbi, Adaptive critic learning for approximate optimal event-triggered tracking control of nonlinear systems with prescribed performances. International Journal of Control, 1–15 (2023)

  14. E. Jafari, A. Dolati, K. Layeghi, Object tracking using fuzzy-based improved graph, interesting patches and multi-label MRF optimization. Multimed. Syst. (2023).

  15. R. Wakkary, D. Oogjes, A. Behzad, Two years or more of co-speculation: polylogues of philosophers, designers, and a tilting bowl. ACM Trans. Comput. Human Interact. 29(5), 1–44 (2022)

  16. K. Zhang, S. Jiang, R. Zhao, P. Wang, C. Jia, Y. Song, Connectivity of organic matter pores in the Lower Silurian Longmaxi Formation shale, Sichuan Basin, Southern China: analyses from helium ion microscope and focused ion beam scanning electron microscope. Geological Journal 57(5), 1912–1924 (2022)

  17. W. Wu, N. Xu, B. Niu, X. Zhao, A.M. Ahmad, Low-computation adaptive saturated self-triggered tracking control of uncertain networked systems. Electronics 12(13), 2771 (2023)

  18. S. Koohfar, L. Dietz, An adaptive temporal attention mechanism to address distribution shifts, in NeurIPS 2022 Workshop on Robustness in Sequence Modeling. (2022)

  19. S. Guo, X. Zhao, H. Wang, N. Xu, Distributed consensus of heterogeneous switched nonlinear multiagent systems with input quantization and DoS attacks. Appl. Math. Comput. 456, 128127 (2023)

  20. R. Yao, G. Lin, S. Xia, J. Zhao, Y. Zhou, Video object segmentation and tracking: a survey. ACM Trans. Intell. Syst. Technol. (TIST) 11(4), 1–47 (2020)

  21. Y. Zhao, B. Niu, G. Zong, X. Zhao, K.H. Alharbi, Neural network-based adaptive optimal containment control for non-affine nonlinear multi-agent systems within an identifier-actor-critic framework. J. Franklin Inst. 360(12), 8118–8143 (2023)

  22. F. Farahi, H.S. Yazdi, Probabilistic Kalman filter for moving object tracking. Signal Process. Image Commun. 82, 115751 (2020)

  23. G. Ciaparrone, F.L. Sánchez, S. Tabik, L. Troiano, R. Tagliaferri, F. Herrera, Deep learning in video multi-object tracking: a survey. Neurocomputing 381, 61–88 (2020)

  24. M. Nallasivam, V. Senniappan, Moving human target detection and tracking in video frames. Stud. Inf. Control 30(1), 119–129 (2021)

  25. S. Yan, Z. Gu, J.H. Park, X. Xie, A delay-kernel-dependent approach to saturated control of linear systems with mixed delays. Automatica 152, 110984 (2023)

  26. M. Gong, Y. Shu, Real-time detection and motion recognition of human moving objects based on deep learning and multi-scale feature fusion in video. IEEE Access 8, 25811–25822 (2020)

  27. H. Arefanjazi, M. Ataei, M. Ekramian, A. Montazeri, A robust distributed observer design for Lipschitz nonlinear systems with time-varying switching topology. J. Franklin Inst. 360(14), 10728–10744 (2023)

  28. E. Real, J. Shlens, S. Mazzocchi, X. Pan, V. Vanhoucke, Youtube-boundingboxes: a large high-precision human-annotated data set for object detection in video, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5296–5305). (2017)

  29. K. Sreekala, N.N. Raj, S. Gupta, G. Anitha, A.K. Nanda, A. Chaturvedi, Deep convolutional neural network with Kalman filter based objected tracking and detection in underwater communications. Wirel. Netw. (2023).

  30. C. Chen, K. Liu, N. Kehtarnavaz, Real-time human action recognition based on depth motion maps. J. Real-Time Image Proc. 12, 155–163 (2016)

  31. A. Ghaznavi, Y. Lin, M. Douvidzon, A. Szmelter, A. Rodrigues, M. Blackman, J. Xu, A monolithic 3D printed axisymmetric co-flow single and compound emulsion generator. Micromachines 13(2), 188 (2022)

  32. P.P. Halkarnikar, A. Dhakne, P. Shevatekar, P.A. Meshram, S.R. Vasekar, J. Garud, Multiple objects tracking using the Kalman filter method in outdoor video. Math. Stat. Eng. Appl. 71(4), 7711–7719 (2022)

  33. S. Koohfar, L. Dietz, Adjustable context-aware transformer, in International Workshop on Advanced Analytics and Learning on Temporal Data (pp. 3–17). Cham: Springer International Publishing. (2022)

  34. A. Jalal, Y.H. Kim, Y.J. Kim, S. Kamal, D. Kim, Robust human activity recognition from depth video using spatiotemporal multi-fused features. Pattern Recogn. 61, 295–308 (2017)

  35. A. Mirbakhsh, J. Lee, D. Besenski, Spring–mass–damper-based platooning logic for automated vehicles. Transp. Res. Rec. 2677(5), 1264–1274 (2023)

  36. Y. Duan, X. Zhang, Z. Li, A new quaternion-based Kalman filter for human body motion tracking using the second estimator of the optimal quaternion algorithm and the joint angle constraint method with inertial and magnetic sensors. Sensors 20(21), 6018 (2020)

  37. J. Sun, Y. Zhang, M. Trik, PBPHS: a profile-based predictive handover strategy for 5G networks. Cybern. Syst. 53(6), 1–22 (2022)

  38. B. HassanVandi, R. Kurdi, M. Trik, Applying a modified triple modular redundancy mechanism to enhance the reliability in software-defined network. Int. J. Electr. Comput. Sci. (IJECS) 3(1), 10–16 (2021)

  39. A. Lind, S. Wu, A. Hadachi, Application of Gaussian mixtures in a multimodal Kalman filter to estimate the state of a nonlinearly moving system using sparse inaccurate measurements in a cellular radio network. Sensors 23(7), 3603 (2023)

  40. S. Barawkar, M. Kumar, Force-torque (FT) based multi-drone cooperative transport using fuzzy logic and low-cost and imprecise FT sensor, in Proceedings of the Institution of Mechanical Engineers, Part G: Journal of Aerospace Engineering, 09544100231153686. (2023)

  41. M. Trik, H. Akhavan, A.M. Bidgoli, A.M.N.G. Molk, H. Vashani, S.P. Mozaffari, A new adaptive selection strategy for reducing latency in networks on chip. Integration 89, 9–24 (2023)

  42. Z. Wang, Z. Jin, Z. Yang, W. Zhao, M. Trik, Increasing efficiency for routing in internet of things using binary gray wolf optimization and fuzzy logic. J. King Saud Univ. Comput. Inf. Sci. 35(9), 101732 (2023)

  43. Q. Yuan, J. Chang, Y. Luo, T. Ma, D. Wang, Automatic cables segmentation from a substation device based on 3D point cloud. Mach. Vis. Appl. 34(1), 9 (2023)

  44. V. Kshirsagar, R.H. Bhalerao, M. Chaturvedi, Modified YOLO module for efficient object tracking in a video. IEEE Latin America Transactions, 100. (2023)

  45. U. Mittal, P. Chawla, R. Tiwari, EnsembleNet: a hybrid approach for vehicle detection and estimation of traffic density based on faster R-CNN and YOLO models. Neural Comput. Appl. 35(6), 4755–4774 (2023)

  46. M. Hofmann, D. Wolf, G. Rigoll, Hypergraphs for joint multi-view reconstruction and multi-object tracking, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2013)


Not applicable


Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Author information

Authors and Affiliations



MMY developed the original idea, analyzed the data, and wrote the manuscript. Related works and the system model were done by SM. The implementation and simulation of the idea were done by HD. The re-analysis of the training and test datasets as well as the critical review of this version in terms of important intellectual content were done by RG.

Corresponding author

Correspondence to Seyed Saleh Mohseni.

Ethics declarations

Ethics approval and consent to participate

This research does not involve human participants, human data, or human tissue.

Consent for publication

The manuscript does not contain individual data.

Competing interests

The authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


About this article


Cite this article

Yousefi, S.M.M., Mohseni, S.S., Dehbovid, H. et al. Tracking of moving human in different overlapping cameras using Kalman filter optimized. EURASIP J. Adv. Signal Process. 2023, 114 (2023).
