PSENet-based efficient scene text detection

Text detection is a key technique in computer vision applications, but efficient and precise text detection remains challenging. In this paper, an efficient scene text detection scheme is proposed based on the Progressive Scale Expansion Network (PSENet). A Mixed Pooling Module (MPM) is designed to capture the dependence of text information at different distances, where different pooling operations are employed to better extract text shape information. The backbone network is optimized by combining two extensions of the Residual Network (ResNet), i.e., ResNeXt and Res2Net, to enhance feature extraction. Experimental results show that the precision of our scheme is improved by more than 5% compared with the original PSENet.

For the corner detection-based algorithms, Tychsen-Smith et al. adopted corner detection and directed sparse sampling in DENet to replace the region proposal network of the RCNN model [13]. Pengyuan et al. detected scene text by locating the corner points of the text box and segmenting the text area by relative position [14]. Wang et al. combined the corner points, center points and regression points of the text box boundary in a single fully convolutional network [15]. This algorithm largely solves the problem of inclined text in documents, but its precision is reduced because the corner points are difficult to determine. For segmentation-based algorithms, Deng et al. proposed the PixelLink algorithm to separate adjacent text areas, which predicts the pixel links between different text blocks through binary text prediction and binary link prediction [16]. Zhang et al. adopted the maximally stable extremal regions (MSER) method to detect candidate characters from the extracted text regions, and grouped the characters into words or text lines according to prior rules [17]. Xie et al. proposed a supervised pyramid context network, which builds on the powerful segmentation framework of Mask RCNN and uses context information to detect text of arbitrary shape [18]. The Text Context Module and Re-Score Module proposed in that work effectively suppress false detections.
Recently, Long et al. proposed the TextSnake algorithm based on semantic segmentation, which was the first to address curved text by introducing discs and the text centerline [19]. Although the above-mentioned segmentation-based algorithms handle curved text, they have difficulty distinguishing adjacent or overlapping texts accurately. Most previous scene text detection algorithms use multi-feature prediction methods such as feature pyramid networks or spatial pooling to handle complex scenes. Among them, traditional spatial aggregation methods are not well suited to pixel-level prediction based on semantic segmentation, because they all probe the input feature maps within square windows. Therefore, this paper proposes an efficient text detection method based on a Mixed Pooling Module (MPM), which considers not only the regular N × N kernel shape but also long, narrow kernels, i.e., 1 × N or N × 1. The MPM with different pooling operations is designed to locate the scene text regions precisely.
In contrast to the above-mentioned segmentation-based algorithms, Wenhai et al. adopted a Feature Pyramid Network (FPN) to convert images into feature maps of different scales, and then fused the features from different scales to achieve multi-scale prediction. In addition, the progressive scale expansion algorithm was adopted to separate adjacent text instances [20]. However, the single max pooling operation of PSENet loses some neighborhood feature information, so the network cannot adapt to the varying text lengths of different scenes or capture both the long and short features of scene text objects. To prevent the loss of adjacent feature information, various methods such as pyramid pooling and dilated convolution have been proposed in recent years [20,21]. Extensive experiments have shown that spatial pooling is an effective means of capturing long-range context information and performs well in pixel-level prediction tasks. However, the above methods all use square windows to gather input feature information. Because a square window does not match the shape of text, it cannot flexibly capture the anisotropic context information that is widespread in real scenes. Scene text has no obvious boundary, and text in images has its own characteristics: English and Chinese text lines, for example, are generally long, horizontally dense and vertically sparse. To sum up, accurate localization of text regions relies not only on large-scale saliency information but also on small-scale boundary features. Hence, this paper presents an efficient scene text detection scheme based on PSENet from two aspects. Firstly, a Mixed Pooling Module (MPM) is designed to locate the scene text regions precisely. Then, the backbone network is optimized to enhance the effectiveness of multi-scale feature extraction. Specifically, the main contributions of this paper can be summarized as follows:
1. The MPM is designed with different pooling operations. The original PSENet adopts a single pooling operation. Its complexity is low, but for adjacent texts it easily causes missed detections and false detections. Considering the diversity of text length in different scenes, the MPM with different pooling operations is designed to locate the scene text regions precisely. By extracting context information at different distances, it captures the relevance of text information more effectively, and then extracts the text shape features and locates the scene text regions.
2. The backbone network is optimized by the combination of ResNeXt and Res2Net. In the original PSENet, the Residual Network (ResNet) is adopted as the backbone network to alleviate the vanishing gradient problem and make it easier to increase the network depth. However, ResNet does not enhance the extraction of multi-scale features, which are very important for scene text detection. Therefore, the backbone network is optimized here by combining ResNeXt and Res2Net, to fuse different types of deep feature information and characterize the text features at a more fine-grained level, further improving the text detection precision.
The remainder of the paper is organized as follows: Sect. 2 describes the proposed scheme in detail. Experimental results are shown and analyzed in Sect. 3. Sect. 4 discusses the proposed scheme and Sect. 5 concludes the paper.

Proposed scheme
To achieve efficient and precise text detection, a scene text detection scheme based on PSENet is proposed. By designing the Mixed Pooling Module (MPM) and optimizing the backbone network, a detection model is trained that detects scene text more accurately than the original PSENet model. The diagram of the proposed scheme is shown in Fig. 1.
As shown in Fig. 2a, ResNet is adopted as the backbone network of the original PSENet to perform basic image feature extraction. The shortcut connections introduced in ResNet alleviate the vanishing gradient problem and allow deeper network structures. However, during feature extraction, ResNet obtains only four equivalent feature scales through different combinations of convolution operations. Multi-scale feature representations of the backbone network are very important for vision tasks, because an effective backbone network needs to locate objects of different scales in the scene. Therefore, the backbone network is optimized by combining ResNeXt and Res2Net to enhance feature extraction, as shown in Fig. 2b. In addition, the MPM is embedded into the backbone network to capture both long-distance and short-distance correlations between different locations. As a result, the backbone network extracts the shape of the scene text better, which improves the text detection precision of the model.

Mixed pooling module
As shown in Fig. 3, the MPM is designed with different pooling kernel sizes to adapt to the characteristics of image text and to avoid the noise introduced by rectangular filtering. The MPM adapts better to scene text features by using different pooling kernel sizes and pooling operations of different dimensions. As shown in Fig. 3a, the one-dimensional pooling operations in the MPM effectively enlarge the receptive field of the backbone network in the horizontal or vertical direction and further improve its long-range dependencies at the high-level semantic level. The pooling operations with square kernels enable the model to capture a large range of contextual information. In general, pooling with different kernel shapes obtains more local contextual information in complex text scenarios. Figure 3a shows a simplified MPM diagram, and Fig. 3b provides detailed information about the design of the MPM. As shown in Fig. 3b, the MPM has two one-dimensional pooling layers, which adapt better to the text features of the text image. In addition, there are two spatial pooling layers and one layer that preserves the original spatial information, which effectively capture the context information of dense text areas. Note that the bin sizes of the feature maps after the two spatial pooling layers are 18 × 18 and 10 × 10, and the bin sizes after the two one-dimensional pooling layers are N × 1 and 1 × N. Finally, all five sub-paths are combined by summation. These MPMs are then embedded into the PSENet backbone network. Since the output of the backbone has 2,048 channels, a 1 × 1 convolutional layer is first connected to the backbone to reduce the output channels from 2,048 to 1,024, and then two MPMs are embedded.
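The five-path structure described above can be illustrated with a minimal sketch in plain Python on a single-channel feature map. This is not the implementation used in the paper: the convolutions that normally follow each pooling branch are omitted, average pooling and nearest-neighbour upsampling are assumed, the bin sizes (2 and 1) are toy values chosen to fit a 4 × 4 map rather than the 18 × 18 and 10 × 10 bins stated above, and all function names are hypothetical.

```python
from typing import List, Sequence

Grid = List[List[float]]


def row_strip_pool(x: Grid) -> Grid:
    """1 x N strip pooling: average each row, then broadcast back to H x W."""
    return [[sum(row) / len(row)] * len(row) for row in x]


def col_strip_pool(x: Grid) -> Grid:
    """N x 1 strip pooling: average each column, then broadcast back to H x W."""
    h, w = len(x), len(x[0])
    means = [sum(x[r][c] for r in range(h)) / h for c in range(w)]
    return [list(means) for _ in range(h)]


def square_pool(x: Grid, bins: int) -> Grid:
    """Average-pool into bins x bins cells, then upsample by repetition."""
    h, w = len(x), len(x[0])
    if h % bins or w % bins:
        raise ValueError("this sketch requires H and W divisible by bins")
    bh, bw = h // bins, w // bins
    out = [[0.0] * w for _ in range(h)]
    for bi in range(bins):
        for bj in range(bins):
            cells = [(r, c)
                     for r in range(bi * bh, (bi + 1) * bh)
                     for c in range(bj * bw, (bj + 1) * bw)]
            mean = sum(x[r][c] for r, c in cells) / len(cells)
            for r, c in cells:
                out[r][c] = mean
    return out


def mixed_pool(x: Grid, bin_sizes: Sequence[int] = (2, 1)) -> Grid:
    """Sum five sub-paths: identity, two strip poolings, two square poolings."""
    branches = [x, row_strip_pool(x), col_strip_pool(x)]
    branches += [square_pool(x, b) for b in bin_sizes]
    h, w = len(x), len(x[0])
    return [[sum(b[r][c] for b in branches) for c in range(w)] for r in range(h)]


# A toy map with one horizontal "text line" in row 1.
feature = [[0.0] * 4, [1.0] * 4, [0.0] * 4, [0.0] * 4]
fused = mixed_pool(feature)
```

In this toy map the row-wise strip branch responds only along the text row, whereas the square branches spread the response over neighbouring rows, which is the anisotropy argument made in the text.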

Optimizing the backbone network
As shown in Fig. 4, the backbone is optimized by combining two extensions of the residual network (ResNet), i.e., ResNeXt [22] and Res2Net [23]. The essence of ResNeXt is group convolution; it combines the ideas of ResNet [24] and Inception [25] and effectively reduces the number of parameters. Res2Net increases the range of receptive fields of each network layer and improves multi-scale feature extraction by constructing hierarchical residual-like connections within a single residual block.
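The parameter saving of group convolution can be checked with simple arithmetic. The sketch below counts the weights of a convolutional layer (bias terms excluded); the channel width of 128 and the group count of 32 are illustrative assumptions and are not values taken from this paper, and `conv_params` is a hypothetical helper name.

```python
def conv_params(c_in: int, c_out: int, kernel: int, groups: int = 1) -> int:
    """Number of weights in a (grouped) 2-D convolution, without bias.

    Each output channel is connected to only c_in / groups input channels.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * c_out * kernel * kernel


standard = conv_params(128, 128, 3)            # ordinary 3 x 3 convolution
grouped = conv_params(128, 128, 3, groups=32)  # same layer split into 32 groups
print(standard, grouped, standard // grouped)  # 147456 4608 32
```

For the same input and output widths, splitting the layer into g groups divides the number of weights by g, which is why a grouped block can be made wider at a comparable parameter budget.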
The number of channels of each feature subset $x_i$, where $i \in \{1, 2, \ldots, s\}$, is equal to $1/s$ of that of the input feature map, and all subsets have the same spatial size. Except for $x_1$, every $x_i$ has a corresponding 3 × 3 convolution, denoted by $K_i(\cdot)$. $Y_i$ denotes the output of $K_i(\cdot)$, and the input of $K_i(\cdot)$ is the sum of the feature subset $x_i$ and the output of $K_{i-1}(\cdot)$. The 3 × 3 convolution at $x_1$ is omitted so that $s$ can be increased while the number of parameters is reduced. Therefore, $Y_i$ can be written as follows:
$$
Y_i =
\begin{cases}
x_i, & i = 1, \\
K_i(x_i), & i = 2, \\
K_i(x_i + Y_{i-1}), & 2 < i \le s.
\end{cases}
$$
Note that Res2Net performs multi-scale processing and fuses the information of different scales through a 1 × 1 convolution, thus processing the feature information effectively. The optimized backbone network in this paper is conducive to extracting global and local information and effectively improves the feature extraction ability of the network, which improves the text detection accuracy of the model.
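The hierarchical connections described in this subsection can be sketched as follows. This is a toy illustration under stated assumptions, not the network used in the paper: channels are plain lists of floats, each 3 × 3 convolution $K_i$ is replaced by an arbitrary callable supplied by the caller, the final 1 × 1 fusion convolution is left out, and the function names are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Conv = Callable[[Vector], Vector]


def split_channels(x: Sequence[float], s: int) -> List[Vector]:
    """Split the channels into s subsets, each holding 1/s of the channels."""
    if len(x) % s:
        raise ValueError("channel count must be divisible by s")
    width = len(x) // s
    return [list(x[i * width:(i + 1) * width]) for i in range(s)]


def res2net_outputs(subsets: List[Vector], convs: Sequence[Conv]) -> List[Vector]:
    """Compute Y_1..Y_s; convs holds K_2..K_s (x_1 has no convolution)."""
    if len(convs) != len(subsets) - 1:
        raise ValueError("need one conv per subset except the first")
    outputs = [subsets[0]]                      # Y_1 = x_1
    previous = None
    for x_i, k_i in zip(subsets[1:], convs):
        if previous is None:
            conv_input = x_i                    # Y_2 = K_2(x_2)
        else:                                   # Y_i = K_i(x_i + Y_{i-1})
            conv_input = [a + b for a, b in zip(x_i, previous)]
        previous = k_i(conv_input)
        outputs.append(previous)
    return outputs


def double(v: Vector) -> Vector:
    """Stand-in for a 3 x 3 convolution: doubles every value."""
    return [2.0 * a for a in v]


subsets = split_channels([1, 2, 3, 4, 5, 6, 7, 8], s=4)
ys = res2net_outputs(subsets, [double, double, double])
print(ys)  # [[1, 2], [6.0, 8.0], [22.0, 28.0], [58.0, 72.0]]
```

Because each output feeds the next subset, later outputs have passed through more convolutions than earlier ones, so the block mixes several receptive-field sizes.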

Experimental results comparison and analysis
To evaluate the performance of the proposed scheme, experiments are conducted on the ICDAR2015 and ICDAR2017-MLT datasets. Precision, recall and F-measure are used for evaluation. The experimental environment is as follows: CPU: i7-8700 3.20 GHz, RAM: 16 GB, GPU: NVIDIA GeForce GTX1060Ti 6 GB, and deep learning framework: PyTorch.
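For reference, the three evaluation metrics can be computed as in the short sketch below, which uses the standard definitions with the F-measure as the harmonic mean of precision and recall. The detection counts in the first example are made up for illustration; the second example uses the precision and recall reported for the ResNet-152 based model in the backbone comparison. The function names are hypothetical.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts: 80 correct boxes, 20 false detections, 20 missed texts.
p, r = precision_recall(tp=80, fp=20, fn=20)
print(round(f_measure(p, r), 4))             # 0.8

# Values in percent, as reported for the ResNet-152 based model.
print(round(f_measure(85.51, 80.69), 2))     # 83.03
```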
When the proposed scheme is trained on the ICDAR2015 dataset, 1,000 ICDAR2015 training images and 500 ICDAR2015 verification images are used to train the model. The batch size is set to 2, and 600 epochs are performed on a single GPU. The initial learning rate is $1 \times 10^{-4}$ and the final learning rate is $1 \times 10^{-7}$. Between epochs 200 and 400, the attenuation rate is set to $5 \times 10^{-4}$. In the above training process, the loss balance is set to 0.7, the online hard example mining to 3, the kernel size to 0.5, the number of kernels to 6, and the aspect ratio of the input image to the output image to 1 [20]. The affine transform is adopted to process the training data. The details are given below.
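As a hedged illustration of two of these settings: in the PSENet paper [20], the ground-truth kernels are obtained by shrinking each text polygon with scale ratios $r_i = 1 - (1 - m)(n - i)/(n - 1)$, and the shrink offset is $d_i = \mathrm{Area} \times (1 - r_i^2) / \mathrm{Perimeter}$. The sketch below assumes that the "kernel size" of 0.5 and the "number of kernels" of 6 given above correspond to the minimal scale $m$ and the kernel count $n$ of that formula; this mapping is an interpretation, and the rectangle used in the example is made up.

```python
from typing import List


def kernel_scales(m: float, n: int) -> List[float]:
    """Scale ratios r_1..r_n, growing linearly from m to 1 (PSENet label rule)."""
    if n < 2:
        raise ValueError("at least two kernels are needed")
    return [1.0 - (1.0 - m) * (n - i) / (n - 1) for i in range(1, n + 1)]


def shrink_offset(area: float, perimeter: float, ratio: float) -> float:
    """Margin in pixels by which the polygon is shrunk for a given ratio."""
    return area * (1.0 - ratio ** 2) / perimeter


scales = kernel_scales(m=0.5, n=6)
print([round(r, 2) for r in scales])   # [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# Hypothetical 100 x 20 pixel text box: area 2000, perimeter 240.
print(shrink_offset(2000.0, 240.0, scales[0]))   # 6.25
```

The smallest kernel (ratio 0.5) keeps adjacent text instances well separated, and the largest (ratio 1.0) is the full text region that the progressive expansion eventually recovers.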

Performance comparison of mixed pooling
To evaluate the MPM performance, experiments are conducted on the ICDAR2015 dataset, where ResNet and ResNet-MPM are compared. As shown in Table 1, when the MPM is embedded into the PSENet backbone network, the network performance improves markedly. The detection precision increases by more than 5%, and the recall and F-measure also increase, while the scale of the network model does not increase. The experimental results demonstrate that the MPM significantly improves the performance of the model while requiring hardly any additional model parameters.

Performance comparison of the backbone network
To further verify the performance of the optimized PSENet backbone network, experiments are conducted on the basis of the previous experimental steps. Compared with the original PSENet model, the model trained with the proposed scheme improves in precision, recall, F-measure, frames per second (FPS) and model scale, as shown in Table 2.
In addition, experiments are conducted to compare the performance of the optimized ResNet-MPM-50 and ResNet-152, as shown in Fig. 5. As shown in Fig. 5a, when the number of epochs reaches 540, the ResNet-152 based model achieves a precision of 85.51%, a recall of 80.69%, an F-measure of 83.03%, and a loss of 0.3720. Figure 5b shows that the optimized ResNet-MPM-50 based model reaches its optimum when the number of epochs reaches 350, with a precision of 87.26%, a recall of 80.79%, an F-measure of 84.00%, and a loss of 0.3275. The experimental results demonstrate that the model trained with the proposed scheme performs even better than the ResNet-152 based model. Moreover, its scale is about half of that of the ResNet-152 based model. It is worth noting that the proposed scheme effectively improves the performance of the model without deepening the network.

Comparison with classical scene text detection algorithms
To evaluate the performance of the proposed scheme, experiments are conducted on the ICDAR2015 dataset, in which the proposed scheme is compared with the original PSENet and several other classical scene text detection algorithms. As shown in Table 3, compared with the original PSENet, the proposed scheme improves the precision by 5.76% and the FPS by 0.24. It is worth noting that the proposed scheme has the highest detection precision among the compared scene text detection algorithms.
To evaluate the performance of the proposed scheme on multi-directional text, experiments are conducted on the ICDAR2017-MLT dataset. As shown in Table 4, the precision, recall and F-measure of the proposed scheme are 77.67%, 69.98% and 73.83%, respectively, and the precision is higher than that of the original PSENet by more than 3%.
To show intuitively that the proposed scheme performs better than the original PSENet, two sets of experimental results are shown in Fig. 6, and the effectiveness of the proposed scheme is analyzed. As shown in Fig. 6 (a1, a2), the original PSENet suffers from missed detections and false detections. Missed or false detection of text objects often leads to failure in text recognition tasks: because the text cannot be accurately detected and recognized, the final semantic information cannot be understood. Compared with the original PSENet, the proposed scheme identifies the text without missed or false detections and precisely locates the scene text regions and object boundaries in an image, as shown in Fig. 6 (b1, b2). In conclusion, the proposed scheme has higher precision and a lower false detection rate than the original PSENet.

Discussion
In this paper, scene text detection is optimized by both the Mixed Pooling Module (MPM) and the fusion networks (i.e., ResNeXt and Res2Net). The idea of this scheme is to collect more context information with different pooling operations. Furthermore, the fusion mechanism is used to enhance the multi-scale feature extraction ability of the backbone network. We discuss the following aspects in detail.
1. The effect of the MPM. The designed MPM improves the precision and reduces the missed detections and false detections of our scheme. The underlying reason is that the targeted pooling module extracts more comprehensive information efficiently for different text scenes, inspired by the assumptions behind the pooling strategies in [29,30].
2. The effect of the fusion networks. Related studies have shown that ResNet is mainly used to solve the vanishing gradient problem in deep neural networks, but its performance still needs to be improved to handle the diversity of scene text features. Hence, the backbone network is optimized by the fusion networks to enhance the multi-scale feature extraction capability. The experimental results illustrate the effectiveness of the fusion networks on the ICDAR2015 dataset.
However, because of the complex post-processing steps and the heavy network, the scene text detection algorithm still does not meet the requirement of real-time detection. At the same time, text detection is only the first key step of scene text recognition, and end-to-end text spotting frameworks are a hot topic and research trend [31-33]. Therefore, our future research will focus on how to simplify the post-processing steps and build lightweight networks, aiming to improve the model speed and to build an end-to-end detection and recognition framework.

Conclusion
An efficient scene text detection scheme based on PSENet is proposed to address the missed detections and false detections found in most scene text detection algorithms. The Mixed Pooling Module captures the dependency between different text positions, collects context information, and precisely locates the scene text regions and object boundaries in an image. Additionally, the backbone network is optimized by combining ResNeXt and Res2Net to further improve its multi-scale feature extraction ability. Experimental results demonstrate that, compared with common scene text detection algorithms, the proposed scheme has a lower missed detection rate and higher detection precision. Specifically, the precision of the proposed scheme is improved by more than 5% compared with the original PSENet.