2.1 Multisensor data fusion
Human beings are themselves a complex multisensor information fusion system, performing information fusion all the time. Multisensor information fusion uses multiple sensors to obtain relevant information and applies data preprocessing, correlation, filtering, integration, and other operations to form a framework that supports decision making, thereby achieving identification, tracking, and situation assessment [4]. In summary, a multisensor data fusion system includes the following three parts:

1.
Sensor. Sensors are the cornerstone of a multisensor data fusion system; without sensors, no data can be obtained. Multiple sensors can obtain more comprehensive and reliable data.

2.
Data. Data are the processing object of the multisensor data fusion system and the carrier of fusion. The quality of the data determines the upper limit of the fusion system's performance; the fusion algorithm can only approach this upper limit.

3.
Fusion. Fusion is the core of a multisensor data fusion system. When the quality of the information cannot be improved, the task of fusion is to mine the available information to the greatest extent and make decisions based on the data.
Data fusion performs multilevel processing on multisource data, and each level of processing abstracts the original data to a certain extent. It mainly includes data detection, calibration, correlation, and estimation [5]. Data fusion can be divided into three levels according to the degree of abstraction of data processing in the fusion system:

1.
Pixel-level fusion. This is currently the most widely used fusion method. It operates directly on the original image data, processing and integrating the pixels one by one to achieve image fusion. Because the data processed at the pixel level are raw data without any conversion, the information expressed is the most accurate. However, because every pixel of the original data is processed, the data volume is large and the fusion efficiency is low.

2.
Feature-level fusion. This method fuses images based on their feature information. An algorithm extracts features such as edges and contours from each image, and these features are then fused. This reduces the amount of data in the fusion process and improves fusion efficiency, but it places higher demands on feature extraction, and the quality of feature extraction directly affects the fusion result [6, 7].

3.
Decision-level fusion. This type of method is the most complicated. Expert knowledge related to the image content must be prepared before image processing, and the image is then adjusted and processed in a targeted way. This fusion method is relatively abstract and demands more expert knowledge, but because it is strongly targeted, the fusion result is also closer to ideal [8, 9].
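As a toy illustration of the pixel-level case above, the sketch below fuses two registered grayscale images by a per-pixel weighted average. This is a minimal example under the assumption that the images are already aligned and of equal size; practical pixel-level fusion methods (e.g., multiscale transforms) are considerably more involved.

```python
import numpy as np

def pixel_level_fuse(img_a, img_b, weight=0.5):
    """Fuse two registered grayscale images pixel by pixel
    with a simple weighted average (illustrative only)."""
    a = img_a.astype(np.float64)
    b = img_b.astype(np.float64)
    fused = weight * a + (1.0 - weight) * b
    return np.clip(fused, 0, 255).astype(np.uint8)

# Two toy 2x2 "images" standing in for registered sensor outputs.
img_a = np.array([[0, 100], [200, 255]], dtype=np.uint8)
img_b = np.array([[100, 100], [100, 100]], dtype=np.uint8)
print(pixel_level_fuse(img_a, img_b))  # element-wise average of the two arrays
```

Because every pixel is visited, the cost grows linearly with image size, which reflects the efficiency drawback noted above.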
Shannon entropy is defined as the average amount of information after excluding redundancy, and Shannon defined information as "something that can remove uncertainty." Many modern scholars have proposed the opposite formulation of Shannon's definition, namely that information is the increase of certainty. According to information theory, the multidimensional information created by fusing multiple one-dimensional pieces of information is more informative than any single one-dimensional piece, which is the theoretical ground for multisensor data fusion. Below we give the proof from the perspective of Shannon entropy [10].
Suppose the Shannon entropy H(X) of the random variable X is a function of the probability distribution P_{1}, P_{2}, …, P_{n}. According to the definition of Shannon entropy:
$$H(X) = - \sum\limits_{j = 1}^{n} {P_{j} \log P_{j} }$$
(1)
where \(0 \le P_{j} \le 1\), it is easy to see that:
$$H(P_{1} ,P_{2} , \ldots ,P_{n} ) \ge 0$$
(2)
The equality in formula (2) holds if and only if every term on the right side of formula (1) is 0, subject to the normalization constraint:
$$\sum\limits_{j = 1}^{n} {P_{j} } = 1$$
(3)
Combining formulas (2) and (3), we can see that formula (2) holds with equality when \(P_{j} = 1\) for some j and \(P_{k} = 0\) for all \(k \ne j\).
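Formulas (1)-(3) can be checked numerically. The sketch below computes the entropy of a discrete distribution (using log base 2 and the convention \(0 \log 0 = 0\)) and confirms that it is nonnegative and vanishes exactly in the degenerate case:

```python
import math

def shannon_entropy(probs):
    """H = -sum_j P_j * log2(P_j), skipping zero-probability terms
    (the convention 0*log 0 = 0); clamped at 0 to absorb -0.0."""
    return max(0.0, -sum(p * math.log2(p) for p in probs if p > 0))

print(shannon_entropy([0.5, 0.5]))  # 1.0 bit: maximal for two outcomes
print(shannon_entropy([1.0, 0.0]))  # 0.0: the degenerate case P_j = 1, P_k = 0
```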
Assuming that the Shannon entropies of random variables X and Y are H(X) and H(Y), respectively, and that their joint Shannon entropy is H(XY), the additivity of Shannon entropy gives:
$$H(XY) = H(X) + H(Y|X)$$
(4)
Suppose the random variable Y takes m values and the conditional transition probability is \(P_{jk} = P(Y = y_{k} |X = x_{j} )\). Combining formulas (1) and (4), the Shannon entropy of the two-dimensional random variable can be written as:
$$H(P_{1} P_{11} ,P_{1} P_{12} , \ldots ,P_{1} P_{1m} ,P_{2} P_{21} ,P_{2} P_{22} , \ldots ,P_{2} P_{2m} , \ldots ,P_{n} P_{n1} ,P_{n} P_{n2} , \ldots ,P_{n} P_{nm} ) = H(P_{1} ,P_{2} , \ldots ,P_{n} ) + \sum\limits_{j = 1}^{n} {P_{j} H(P_{j1} ,P_{j2} , \ldots ,P_{jm} )}$$
(5)
From the nonnegativity of Shannon entropy and \(0 \le P_{j} \le 1\), formula (6) follows:
$$H(P_{1} P_{11} ,P_{1} P_{12} , \ldots ,P_{1} P_{1m} ,P_{2} P_{21} ,P_{2} P_{22} , \ldots ,P_{2} P_{2m} , \ldots ,P_{n} P_{n1} ,P_{n} P_{n2} , \ldots ,P_{n} P_{nm} ) \ge H(P_{1} ,P_{2} , \ldots ,P_{n} )$$
(6)
Generalizing to the scenario of n random variables \(X_{1} ,X_{2} , \ldots ,X_{n}\), the additivity of Shannon entropy yields formula (7):
$$H(X_{1} X_{2} \ldots X_{n} ) = H(X_{1} ) + H(X_{2} |X_{1} ) + \cdots + H(X_{n} |X_{1} X_{2} \ldots X_{n - 1} )$$
(7)
Shannon entropy describes the uncertainty of a system or variable rather than the amount of information the system contains; however, when a random variable takes a specific value, the information gained equals the Shannon entropy [11]. The larger the Shannon entropy, the more information the random variable conveys when it takes a specific value. Combining formulas (6) and (7), it can be inferred that the multidimensional information fused from multiple single-dimensional pieces of information contains more information about a specific target than any single-dimensional piece alone [12].
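The chain rule in formulas (4)-(5) and the inequality in formula (6) can likewise be verified on a small example. The probability values below are arbitrary illustrative numbers, not data from the text:

```python
import math

def H(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return max(0.0, -sum(p * math.log2(p) for p in probs if p > 0))

# Toy example: X has marginal distribution (P_1, P_2), and
# P_cond[j][k] is the transition probability P(Y = y_k | X = x_j).
P = [0.3, 0.7]
P_cond = [[0.9, 0.1], [0.2, 0.8]]
# Joint distribution entries P_j * P_jk, as on the left side of formula (5).
joint = [P[j] * P_cond[j][k] for j in range(2) for k in range(2)]

H_X = H(P)
H_XY = H(joint)
H_Y_given_X = sum(P[j] * H(P_cond[j]) for j in range(2))

# Chain rule, formulas (4)/(5): H(XY) = H(X) + H(Y|X).
assert abs(H_XY - (H_X + H_Y_given_X)) < 1e-12
# Inequality, formula (6): the joint variable is at least as uncertain as X.
assert H_XY >= H_X
```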
The functional model diagram of multisensor data fusion is shown in Fig. 1.
In fact, multisensor information fusion is a functional simulation of the human brain's integrated processing of complex problems. Compared with a single sensor, multisensor information fusion technology can enhance the survivability, reliability, and robustness of the whole system, improve the credibility and accuracy of the data, extend temporal and spatial coverage, and increase real-time performance and information utilization when addressing problems of detection, tracking, and object identification.
2.2 Image processing
Image processing refers to the use of a computer to process the image to be recognized so that it meets the subsequent needs of the recognition process. It is mainly divided into two steps: image preprocessing and image segmentation [13]. Image preprocessing mainly includes image restoration and image transformation. Its main purpose is to remove interference and noise from the image, enhance the useful information in the image, and improve the detectability of the target object. At the same time, because of the real-time requirements of image processing, the image must be re-encoded and compressed to reduce the complexity and improve the computational efficiency of subsequent algorithms. Existing image segmentation methods mainly include edge-based segmentation, threshold-based segmentation, and region-based segmentation [14].
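Of the segmentation families just listed, threshold-based segmentation is the simplest to sketch: every pixel brighter than a chosen threshold is labeled as foreground. The fixed threshold below is an arbitrary illustrative choice; practical methods select it adaptively (e.g., Otsu's method).

```python
import numpy as np

def threshold_segment(gray, t=128):
    """Threshold-based segmentation: pixels brighter than t
    become foreground (1); all others become background (0)."""
    return (gray > t).astype(np.uint8)

# Toy 2x2 grayscale image standing in for a preprocessed input.
gray = np.array([[10, 200],
                 [130, 90]], dtype=np.uint8)
print(threshold_segment(gray))  # binary foreground/background mask
```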
2.3 Salient area detection
Salient area detection aims to find the most salient target area in a picture. When observing a picture, a specific target object often attracts our attention immediately. When processing image scene information, saliency detection can identify the areas to be processed first, so that computing resources are allocated rationally and the amount of calculation is reduced. Detecting the salient areas of an image therefore has high application value. Generally speaking, salient area detection methods are divided into two types: top-down and bottom-up. The top-down approach is task-driven and requires high-level information for supervised training and learning. It involves complex cross-disciplinary issues, because it may require combining neurology, physiology, and other related subject areas. The bottom-up method is data-driven; it mainly uses low-level information such as color contrast and spatial layout to obtain the salient target area, and it is simple and fast. Relevant studies in recent years have shown that this type of saliency detection method works well and has been widely used in image segmentation, target recognition, visual tracking, and other fields [15].

1.
Bottom-up salient area detection
Bottom-up, data-driven salient area detection is independent of human cognition; the saliency value is calculated by extracting low-level features of the image, such as color, orientation, brightness, or texture. Bottom-up methods can be further divided into those based on local contrast and those based on global contrast.

2.
Top-down saliency area detection
Top-down salient area detection is a task-driven computing model. For example, if you want to find a person, you will pay attention to human-shaped objects when looking at the image; if you want to find a dog, you will ignore people in the image and pay attention to the dog. Therefore, top-down salient area detection generally looks for a specific kind of object [16]. Most top-down methods require training on a large amount of image data, which is computationally intensive, and they yield different results for different tasks, so they are not universal.
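The bottom-up, global-contrast idea described in item 1 above can be sketched in a few lines: score each pixel by how far its color lies from the image's mean color, then normalize. This is a deliberately simplified stand-in for published global-contrast methods, which use richer statistics.

```python
import numpy as np

def global_contrast_saliency(img):
    """Bottom-up global-contrast sketch: a pixel's saliency is its
    Euclidean distance from the image's mean color, scaled to [0, 1]."""
    mean = img.reshape(-1, img.shape[-1]).mean(axis=0)
    dist = np.linalg.norm(img.astype(np.float64) - mean, axis=-1)
    return dist / dist.max() if dist.max() > 0 else dist

# Toy 2x2 RGB image: one red pixel among gray pixels stands out.
img = np.array([[[128, 128, 128], [128, 128, 128]],
                [[255,   0,   0], [128, 128, 128]]], dtype=np.uint8)
sal = global_contrast_saliency(img)
# The red pixel receives the maximal saliency score of 1.0.
```

No training data or task description is needed here, which is exactly the data-driven property that distinguishes bottom-up methods from the top-down ones in item 2.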