Water surface object detection using panoramic vision based on improved single-shot multibox detector

In view of the deficiencies in traditional visual water surface object detection, such as the existence of non-detection zones, failure to acquire global information, and deficiencies in a single-shot multibox detector (SSD) object detection algorithm such as remote detection and low detection precision of small objects, this study proposes a water surface object detection algorithm from panoramic vision based on an improved SSD. We reconstruct the backbone network for the SSD algorithm, replace VVG16 with a ResNet-50 network, and add five layers of feature extraction. More abundant semantic information of the shallow feature graph is obtained through a feature pyramid network structure with deconvolution. An experiment is conducted by building a water surface object dataset. Results showed the mean Average Precision (mAP) of the improved algorithm are increased by 4.03%, compared with the existing SSD detecting Algorithm. Improved algorithm can effectively improve the overall detection precision of water surface objects and enhance the detection effect of remote objects.

successful applications in image classification, object detection, object segmentation and many image and video domains.Water surface object detection technology is starting to adopt this method with high accuracy, fast speed and strong generalization ability. Surface object detection requires high real-time performance and recognition accuracy, and the monitored object needs to be detected at a long distance. Although the actual length of the long distance water surface object can reach tens of meters or even hundreds of meters, it only occupies dozens or even fewer pixels on the imaging plane. Among the current various object detection algorithms, the Single-shot Multibox Detector (SSD) algorithm has a good performance in detection speed and detection accuracy. However, its detection accuracy for small objects is not as satisfactory as expected, when the SSD algorithm is applied to the detection of water surface objects, it is difficult to detect small objects in the long distance water surface. Water surface object detection technology is starting to adopt this method with high accuracy, fast speed and strong generalization ability. Surface object detection requires high real-time performance and recognition accuracy, and the monitored object needs to be detected at a long distance. Although the actual length of the long distance water surface object can reach tens of meters or even hundreds of meters, it only occupies dozens or even fewer pixels on the imaging plane. Among the current various object detection algorithms, the Single-shot Multibox Detector(SSD) algorithm has a good performance in detection speed and detection accuracy. However, its detection accuracy for small objects is not as satisfactory as expected, when the SSD algorithm is applied to the detection of water surface objects, it is difficult to detect small objects in the long distance water surface.
Currently, object detection is mostly based on traditional vision. Traditional vision detection system have many deficiencies, such as limited vision, existence of nondetection zones, and the failure to acquire global information. A panoramic visual image is a composite image with high resolution and a wide viewing angle obtained by processing a few overlapped images. It has advantages such as a fast generation rate, high resolution, high fidelity of scene restoration, and low hardware requirements [2]. The unique advantages of the panoramic vision system can solve the viewing angle limitations of the traditional vision system, and more effectively meet the needs of large field of view, large range, and long distance. It is widely used in the field of intelligent navigation of ships [3].
In this work, a panoramic camera that generates a panoramic image will be used as a detection tool, use an improved SSD algorithm to detect water surface objects.
The main contributions of the work are as follows: (1) Preprocess the panoramic image to obtain a rectangular panoramic image.
(2) Design a new SSD algorithm backbone network to obtain rich small object semantic information. (3) Adopt Feature Pyramid Network (FPN) structure and add deconvolution operation to solve the problem of the lack of semantic information in the shallow feature map and the lack of detailed information in the deep feature map of the SSD model. (4) Construct a water surface object dataset to verify the effectiveness of the method.
The remainder of this paper is organized as follows. Section 2 describes previous work in object detection. Section 3 introduced the improvement method we proposed. Section 4 describes dataset, training environment and evaluation indicators of the system. Section 5 reports and discusses the experimental results. Section 6 presents the conclusions of this paper.

Related work
Surface object detection methods are mainly divided into traditional detection methods and deep learning detection methods. Traditional object detection methods are based on background modeling, image segmentation, feature extraction and learning, color features, and color space transformation, or saliency detection [4]. For instance, Wijnhoven et al. [5] used the histogram of oriented gradients (HOG) to extract ship object features, and achieve object recognition through online learning and design classifiers. Mirghasemi et al. [6]. used the particle swarm optimization algorithm to transform the color space of the object to improve the robustness of the object recognition algorithm. Albrecht et al. [7] achieved saliency detection by constructing regional complexity features, regional differences features, surrounding differences features, and water and sky classification methods to improve the detection effect of the saliency analysis algorithm. These methods conduct object classification and detection for specific tasks; hence they are characterized by poor generalization ability and long detection times.
Object detection methods based on deep learning mainly include one-stage and two-stage methods. The one-stage object detection method uses the idea of regression analysis, omits the stage of candidate region generation, and directly obtains object classification and location information. The one-stage methods include the You Only Look Once (YOLO) [8][9][10][11] series and single-shot multibox detector (SSD) [12] series, which have fast but less accurate. The two-stage object detection method mainly generates candidate regions through selective search or region proposal network (RPN), and then perform classification and regression on the candidate regions to obtain the detection results. The two-stage methods include Fast-RCNN, Faster-RCNN, R-FCN, and Mask-RCNN [13][14][15][16], which have high accuracy but slow speed. To solve the ship identification problem in a video image, Cao et al. [17] proposed to extract object features based on image segmentation and convolutional neural networks and realized automatic identification of ships. Qi et al. [18] improved Faster R-CNN, and implemented ship object detection with image downscaling and scene narrowing methods, which shortened the detection time and improved the detection precision of Faster R-CNN. Lin et al. [19] proposed a new network architecture based on the faster R-CNN is proposed to further improve the detection performance by using squeeze and excitation mechanism. However, using squeeze and excitation mechanism will cause a decrease in the real-time performance of the network. According to the characteristics of ship target in the SAR images, Xiao et al. [20] make several improvements such as enlarging the input, proposal optimization, database target categorization, and weight balance on the basis of the standard Faster R-CNN. Zhang et al. [21] used two deep learning algorithms, the Mask R-CNN algorithm and the Faster R-CNN algorithm to build ship target feature extraction and recognition models based on deep convolutional neural networks.The above is based on the two-stage detection method, the detection accuracy can be high and satisfactory, but the real-time performance is low. Li et al. [22] proposed a ship object detection algorithm based on the improved YOLOV3-Tiny, intensified and reconstructed the shallow information based on characteristics of ship objects, and introduced a residual network that greatly improved the accuracy of ship object detection on the sea surface. Xu et al. [23] proposed a YOLO-based multiscale object detection algorithm and designed an adaptive feature fusion module and new loss function. This algorithm has obvious advantages in terms of multiscale detection, for it improves the object detection precision without increasing the detection time. Li et al. [24] proposed a novel target detection method by fusing DenseNet in YOLOV3 to improve the stability of detection to decrease the feature loss, while the target feature is transmitted in the layers of a deep neural network. Jie et al. [25] presented ship detection and tracking of ships using the improved YOLOv3 detection algorithm and Deep Simple Online and Real-time Tracking (Deep SORT) tracking algorithm.
Compared with Faster R-CNN and YOLO, the SSD algorithm has integrated grid thinking from the YOLO algorithm and an anchor mechanism from Faster R-CNN. As a result, the SSD algorithm can quickly detect objects without decreasing the detection accuracy. Li et al. [26] applied the SSD algorithm to a railway scene with UAV surveillance, designed a three-step, multi-block SSD mechanism, and conducted sample detection with the shift learning method, which overall accuracy increased by 9.2% over traditional SSD. Yin et al. [27] formulated an object detection algorithm classifying and extracting the multiscale feature graph, dividing feature graphs with different scales in the SSD algorithm into low-and high-level feature graphs. The low-level feature graph extracts features of a shallow feature enhancement (SFE) module, and the high-level feature graph adopts two-stage deconvolution, which increases mAP by 2.4% compared to the SSD algorithm. Zhang et al. [28] proposed a lightweight feature optimizing network Table 1 Comparison of various water surface object detection algorithms

Object detection algorithm Algorithm characteristics Advantage Limitation
Improve Faster-RCNN [18][19][20] Image downscaling and scene narrowing [18] Squeeze and excitation echanism [19] Enlarging the inputproposal optimization, database target categorization, and weight balance [20] High accuracy Low real-time Mask-RCNN [20] Instance segmentation [21] High accuracy Low real-time Improve YOLO [22][23][24][25] Reconstructed the shallow information and introduced a residual network [22] Designed an adaptive feature fusion module and new loss function [23] Fused DenseNet in YOLOV3 [24] Improved YOLOv3 and Real-time tracking [25] High real-time Low accuracy Improve SSD [28] Lightweight feature optimizing network and feature fusion module [28] High accuracy and real-time (LFO-Net) based on popular single shot detector (SSD) model for single polarization SAR image ship detection.For ship detection, this method designed a simple lightweight network, proposing a bidirectional feature fusion module including semantic aggregation and feature reuse blocks, and used an attention mechanism to optimize features. Compared with the above, the water surface target dataset constructed in this work contains five different targets and can be used for object classification. In addtion,this work uses a panoramic camera as a tool for object detection with a wider range compared to ordinary visual sensors. A comparison is shown in Table 1

Panorama image preprocessing
The generation of a panoramic image relies on the processing of multiple photographs and mapping them on the geometric surface, forming an image by analyzing the picture seams and a seamless mosaic with image mosaic technology. Based on different projection modes, panoramic images can be classified as plane, cylindrical, spherical, and cubic. According to practical task requirements, this study adopts spherical projection.
A spherical panoramic image has a 360-degree horizontal view angle and 180-degree vertical view angle, and is shot by a fisheye lens. When identifying an object, three-dimensional spherical coordinates must be converted to two-dimensional plane coordinates. One common method is longitudinal and latitudinal mapping [29]. It is expanded based on the projection of spherical longitude and latitude on the surface of a circumscribed cylinder. The closer it gets to the two poles, the more apparent is the distortion. Set any point on the sphere as M(φ, θ) and convert it to plane coordinates N(x, y) through formula (1).
(1) x = W * (φ/2π) y = L * (θ + π/2)/π  Fig. 3 The backbone structure diagram of the improved SSD network where φ and θ are the longitude and latitude, respectively, and W and L are the width and length, respectively, of the plane image. A rectangular plane graph with a lengthwidth ratio of 2:1 is obtained, shown as Fig. 1.

Improve SSD object detection algorithm
The SSD algorithm is based on a feedforward neural network. The network structure shown as Fig. 2, takes a VGG16 network as the backbone, changing the last two fully connected layers to convolutional layers and adding four convolutional layers. There are six feature extraction layers with different scales. The SSD object detection algorithm has the following steps: (1) Add convolutional layers with decreasing scales to the basic network, generate a multi-scale feature graph, and extract object features; (2) Set prior boxes with different scales and length-width ratios for every pixel at the feature layer; (3) Connect feature graphs of different sizes to the ultimate detection layer to position and classify objects; (4) Calculate and output the result with the non-maximum suppression algorithm.
The SSD algorithm has advantages in terms of detection speed and precision, but suffers from poor detection of small objects, mainly because of insufficient extraction of deep features and limited information about small objects. The SSD algorithm detects objects of different sizes by utilizing feature graphs with different scales. Shallow and deep feature graphs are used to detect small and large objects, respectively. A shallow feature graph is characterized by a large mapping size and abundant details and features of objects, but it has poor semantic information. A deep feature graph has a small mapping size and abundant semantic and abstract information, but insufficient details and features of objects. It generally detects small objects through the conv4_3 layer at the lowest layer, but the precision is poor in this case. Therefore, the detection effect for small remote objects is unsatisfactory.

Backbone network improvement
To obtain abundant semantic information of small objects, ResNet-50 [30] with residual learning units is used instead of VGG16 as the backbone network. ResNet-50 can solve the "vanishing gradient" problem caused by a deep network in the neural network. It has a deeper network than VGG16, can better extract feature graphs with more abundant semantic information, and has fewer parameters and a more prominent effect.
The improved backbone network is shown as Fig. 3. We set the conv4_x layer in the ResNet-50 network structure as the first feature extraction layer of SSD, remove the conv5_x and fully connected layers, and add five convolutional layers. Parameters for the backbone network after modification are shown in Table 2.

Feature pyramid structure
A top-down feature pyramid network (FPN) [31] structure can be adopted to solve the problem of insufficient semantic information in the shallow feature graph of the SSD model and insufficient detailed information in the deep feature graph. It integrates information of feature graphs in different sizes to offer sufficient semantic information of the shallow feature graph and detailed information of the deep feature graph and improve the overall detection precision.
Deconvolution can be adopted during integration of feature graphs in different sizes. This is the reverse of convolution. Deconvolution is similar to upsampling methods such as bilinear, nearest neighbor, and area interpolation, and can obtain a feature graph of high dimension. We enlarge the size of the input image by zero fill between neighboring elements, and then carry out a convolution operation. The deconvolution equation is where S is the stride, K is the size of the convolution kernel, P is the padding size, I is the size of the input feature graph, and D is the size of the output feature graph. As shown in Fig. 4, a 5 × 5 feature graph is obtained from deconvolution of a 3 × 3 feature graph.
Hence the adoption of deconvolution for feature graphs at the intermediate and upper layers of the SSD feature pyramid can obtain feature graphs with higher dimensions. This is integrated with corresponding original feature graphs with consistent dimensions to integrate the shallow feature graph in the feature pyramid with deep semantic information. Figure 5 shows the network structure of the feature pyramid adopting the deconvolution operation. We carry out deconvolution for Conv9_x, Conv8_x, Conv7_x, Conv6_x, and Conv5_x, and integrate these with the next neighboring layer. The parameters are shown in Table 3. Six feature graphs of different sizes are generated after integration.

Default boxes adjustment
The SSD algorithm locates objects of different sizes by setting a series of a default boxes of different scales on different layer feature maps. Suppose we want to use m feature maps for prediction. The scale of the default boxes for each feature map is computed as: where S min is 0.2 and S max is 0.9, meaning the lowest layer has a scale of 0.2 and the highest layer has a scale of 0.9. Through the analysis of data samples, in order to meet the requirements of small object size, we will adjust the values of S min and S max to 0.1 and 0.8.

Experimental data set and platform
To realize the quick detection of a water surface object, a dataset of network learning must be built. We built five common ship image datasets through connection, shooting, crawler search, and other methods, as shown in Fig. 6. We obtained 1757 images by expanding them through data enhancements such as flip and color adjustment. The distribution of target numbers for each type is shown in Fig. 7. All images were manually annotated with the LabelImg tool, and they adopted the VOC format. The hardware environment was an AMD Ryzen 7 4800H CPU and Nvidia GeForce RTX 2060 graphics card, and the software environment was a 64-bit Windows 10 operating system and Ten-sorFlow deep learning framework.

Evaluation and training
Advantages and disadvantages of the performance of object detection models can be evaluated from the aspects of detection precision and speed. Detection speed has units of frames per second (FPS) as the evaluation index, i.e., the number of images the model can detect per second. For single-category detection, we take precision, recall, and average precision as evaluation indexes for detection precision. These are defined as where TP, FP, and FN are the numbers of accurately detected, wrongly detected, and undetected objects, respectively. The area of the graph encircled by the precision rate and recall rate is considered the average detection precision (AP) of this kind of object.
In the case of multiple categories of detection, detection precision generally takes the mean average precision, as an evaluation index, where N is the number of object categories in the dataset.
The entire model basically adopts training strategies of the original SSD, including data enhancement, dimension, and scale setting of the prior frame, loss function, and non-maximum suppression. The dataset is divided into training and testing sets in an 8:2 ratio, and 10% of the training set data is taken as a certification set. The training speed can be accelerated by using the trained ResNet50 pre-training weight and freezing the universal part of the backbone network based on transfer learning. We input the resolution ratio image of 300 × 300, and set the batch size as 16. The initial learning rate of training is 0.0005. We adopt the early stopping function to prevent the training from overfitting, and end the training when the loss value does not decrease after 500 times. The changing curve of the loss function during the training process is shown as Fig. 8. There are 2,162 iterations in the entire training, each iteration took 62 s, and the loss value remains stable at 1.26.

Result analysis and discussion
To verify the performance of the proposed SSD algorithm, an experiment was conducted on the constructed water surface object dataset. The experimental results are shown as Table 4. It can be seen from the results that the proposed algorithm showed good performance in detection of all kinds of objects. The average detection precision was over 80%, and it surpassed 90% for fishing boat, cargo ship, and warship, which demonstrates the effectiveness of this algorithm. The low accuracy of sailboats is mainly due to the small number of samples in the data set and the relatively small size of sailboats.
To further verify the performance of the algorithm, an experimental comparison was carried out with other object detection algorithms such as SSD, Faster-RCNN andY-OLOv3, with results as shown in Table 5.
It can be seen that the network structure is complicated, with the added FPN network and deconvolution operation, and the real-time performance of the algorithm was less than that of algorithms 2, 3 and 4, but the mAP increased by 3.8%, 2.3% and 1.6%, respectively. Compared with algorithm 1, the mAP increased by 1.2% and 1.6%, and the detection real-time performance of the proposed algorithm was clearly superior.
To show the advantages of the algorithm in this study, a picture with multiple objects for detection was selected. The detection result is shown in Fig. 9. It can be seen from Fig. 9 that algorithm 2 and 3 had missed detection for the picture on the left. All five algorithms had various degrees of missed detection for the middle picture, with the algorithm of this study and algorithm 4 having the lowest number of missed detection, and algorithm 1 also had erroneous detection. For the picture on the right, only the algorithm of this study could detect all ships. Through the experiments above, the improved SSD algorithm proposed in this study is seen to be superior to the other algorithms in terms of detection precision, especially for remote object detection.
To reflect the advantages of panoramic visual detection, we used the improved algorithm in this work for panoramic vision detection. The detection results are shown in Fig. 10. It can be seen from Fig. 10 that compared to ordinary visual pictures, panoramic visual pictures have a wider field of view. The target on the left side of the picture is a cruise ship, and the right side is three freight boats.

Conclusion
We carried out detection of five common ships on the sea surface based on the improved SSD algorithm and panoramic vision. More abundant object environmental information was obtained through the panoramic view. The reconstruction of the backbone network with the advantages of Resnet50 improved the network depth, reduced the calculated number of parameters, and increased the object detection speed. We integrated feature graphs of different sizes and took full advantage of the semantic information of shallow feature graphs by integrating a feature pyramid network with deconvolution to improve the detection precision of remote objects. Experimental results revealed that the mean Average Precision (mAP) of the improved algorithm are increased by 4.03%, compared with the existing SSD detecting Algorithm, effectively reduce erroneous detection and missed detection of remote objects, and realize real-time detection. In addition, panoramic visual detection has a broader field of vision than ordinary visual detection, which can conduct global detection. We will next study methods to simplify the network structure so as to improve the detection speed, and expand the dataset to realize the detection of more objects.