Hand-drawn sketch recognition with a double-channel convolutional neural network

In hand-drawn sketch recognition, traditional deep learning methods suffer from insufficient feature extraction and a low recognition rate. To solve this problem, a new algorithm based on a dual-channel convolutional neural network (CNN) is proposed. First, the sketch is preprocessed to obtain a smooth sketch, and its contour is obtained by a contour extraction algorithm. Then, the sketch and its contour are used as the two input images of the CNN. Finally, feature fusion is carried out in the fully connected layer, and the classification results are obtained with a softmax classifier. Experimental results show that this method effectively improves the recognition rate of hand-drawn sketches.

and sent to the classifier. Common hand-crafted features include the shape context feature [8], the scale-invariant feature transform [9], and the histogram of oriented gradients feature [10], but these hand-crafted features, designed for natural images, are not suitable for abstract and sparse hand-drawn sketches. Li et al. [11] showed that fusing different local features by multiple kernel learning helps to improve recognition performance. The Fisher vector (FV) has also been applied to hand-drawn sketch recognition and obtained a high recognition rate [12].
In recent years, deep learning has developed rapidly within the field of machine learning. In essence, a deep learning model is a nonlinear network with multiple hidden layers. By training on large-scale raw data, the network learns to extract the characteristics of the data and to predict or classify samples. In image recognition and computer vision, CNN has achieved the most remarkable results [13]. In addition, deep learning has been widely used in pedestrian detection [14], gesture recognition [15], natural language processing [16], data mining, and speech recognition. Compared with other deep neural networks such as the deep belief network [17] and stacked autoencoders [18], CNN can process two-dimensional images directly. When a two-dimensional image is converted into a one-dimensional vector, the spatial structure of the input data is lost. With the development of deep learning, several deep learning models have been applied to sketch recognition, such as VGG [19], ResNet [20], and Alex-Net [21]. However, these models are mainly designed for natural images with color and texture. Because a hand-drawn sketch lacks color and texture information, they are not well suited to hand-drawn sketch recognition. In reference [22], the user is required to draw semantic symbols following an explicit prompt and then click a button. In reference [23], a time threshold is used, which requires users to pause clearly after drawing each semantic symbol. In addition, special graphical symbols (such as arrows) are used for grouping [24]. The constraints of these algorithms weaken the natural drawing experience of the hand-drawn interface and limit its capacity for rapid expression and modeling.
CNN's multi-channel mechanism is used to access different views of the data (such as the red, green, and blue channels of color images, or stereo audio tracks) [25]. With additional input information, a CNN can learn more features and improve the classification performance of the model. Therefore, in order to improve the recognition rate of sketch recognition, this paper also uses the sketch contour as input data of the CNN and proposes a dual-channel CNN.

Introduction
This section first introduces the CNN, then presents the contour extraction method for hand-drawn sketches, and finally introduces the proposed algorithm.

Convolutional neural network
A CNN requires little human intervention. Its weight update follows the traditional backpropagation (BP) neural network: error backpropagation updates the parameters automatically. Because so little manual design is needed, a CNN can take the image directly as input and automatically extract image features for recognition. The weight sharing and local receptive fields of a CNN not only reduce the number of parameters in the network, but also resemble the way visual nerve cells in animals work. As a result, the recognition accuracy and efficiency of the network are greatly improved.
A CNN has two typical characteristics. First, the neurons of two adjacent layers are connected locally through the convolution kernel rather than fully. The convolution layer connected to the input image is therefore built from local links to pixel blocks, instead of the traditional full connection to every pixel. Second, the weight parameters of a convolution kernel are shared within the same layer. These two features greatly reduce the number of parameters of the deep network, reduce the complexity of the model, and speed up training. They give CNN a great advantage when processing raw pixel values. The main components of a CNN are the convolution layer, the pooling layer, the activation function, the fully connected layer, and the classifier, as shown in Fig. 1.
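As a rough illustration of how local connection and weight sharing reduce the parameter count, the following sketch compares a fully connected layer with a convolution layer on a 225 × 225 single-channel input. The layer sizes (100 hidden units, 32 kernels of size 5 × 5) are hypothetical values chosen for the example, not the configuration used in this paper.

```python
def fully_connected_params(height, width, hidden_units):
    """One weight per (pixel, hidden unit) pair plus one bias per hidden unit."""
    return height * width * hidden_units + hidden_units


def conv_params(kernel_size, in_maps, out_maps):
    """One shared kernel per (input map, output map) pair plus one bias per output map."""
    return kernel_size * kernel_size * in_maps * out_maps + out_maps


# Hypothetical layer sizes, chosen only to make the comparison concrete.
fc_count = fully_connected_params(225, 225, hidden_units=100)
conv_count = conv_params(kernel_size=5, in_maps=1, out_maps=32)

print("fully connected:", fc_count)   # 5062600
print("convolution:    ", conv_count)  # 832
```

The convolution layer needs far fewer parameters because each kernel is reused at every position of the input.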

Convolution layer
The convolution layer is the most important layer for feature extraction in a CNN. The convolution operation produces a new feature map by convolving a kernel with the input sample image, or with the output feature maps of the previous layer, and then applying the activation function. Each layer has multiple feature maps, each of which represents one feature of the image. The convolution operation can be expressed as follows:
$$x_j^l = f\left(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\right)$$
Here $x_j^l$ is the $j$th feature map of layer $l$, and $f(\cdot)$ is the activation function, which is described in more detail in the following sections. $M_j$ represents the input sample image or the set of all input feature maps, $k_{ij}^l$ represents the convolution kernel in layer $l$, and $*$ denotes convolution. After the convolution operation, the bias $b_j^l$ is added to the result, and the new feature map is then formed by the activation function.
The error backpropagation algorithm is used to update the CNN weights. The first step of the update is to calculate the gradient at each layer.
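The forward pass of one output feature map can be sketched in plain Python as below. This is a minimal sketch under stated assumptions: the sliding-window sum is written in cross-correlation form without padding (as most CNN implementations do), ReLU is assumed for the activation f because the text does not fix one, and the names `conv2d_valid` and `conv_layer_map` are illustrative.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding (cross-correlation form)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(value):
    """Assumed activation function f."""
    return max(0.0, value)


def conv_layer_map(input_maps, kernels, bias, activation=relu):
    """Compute one output map: f(sum_i x_i * k_ij + b_j).

    `input_maps` plays the role of the set M_j, and `kernels[i]` is the
    kernel k_ij that links input map i to this output map.
    """
    partial = [conv2d_valid(x, k) for x, k in zip(input_maps, kernels)]
    out_h, out_w = len(partial[0]), len(partial[0][0])
    return [[activation(sum(p[r][c] for p in partial) + bias)
             for c in range(out_w)]
            for r in range(out_h)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature_map = conv_layer_map([image], [kernel], bias=5.0)
print(feature_map)  # [[1.0, 1.0], [1.0, 1.0]]
```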

Downsampling and pooling layer
The downsampling layer reduces the dimension of the image feature maps while keeping their features; this operation is usually called pooling. In downsampling, each feature map of the previous layer is reduced in dimension to give one output feature map, so input and output maps correspond one to one. The n output feature maps of the preceding convolution layer are therefore the input of the downsampling layer, and n output feature maps are obtained after dimension reduction. The following formula represents the process of downsampling:
$$x_j^l = f\left(\beta_j^l \, \mathrm{down}\left(x_j^{l-1}\right) + b_j^l\right)$$
where $\mathrm{down}(\cdot)$ represents the downsampling operation. The pixel values in each n × n region of the input feature map are combined into one value of the output feature map, so the input feature map is reduced by a factor of n in both the horizontal and vertical directions. The final value of a pixel in the output feature map also depends on the multiplicative bias $\beta$ and the additive bias $b$. After the activation function is applied, the final pixel value is obtained.
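A minimal sketch of this downsampling step is given below. It assumes that down(·) averages each n × n block; the text does not say whether the block is averaged, summed, or reduced by its maximum, so the averaging is an assumption, as are the function names.

```python
def down(feature_map, n):
    """Average every non-overlapping n x n block (assumed pooling rule)."""
    out_h = len(feature_map) // n
    out_w = len(feature_map[0]) // n
    return [[sum(feature_map[r * n + i][c * n + j]
                 for i in range(n) for j in range(n)) / (n * n)
             for c in range(out_w)]
            for r in range(out_h)]


def pooling_layer_map(feature_map, n, beta, bias, activation):
    """Compute f(beta * down(x) + b) for one feature map."""
    return [[activation(beta * value + bias) for value in row]
            for row in down(feature_map, n)]


feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
# Identity activation keeps the arithmetic easy to follow.
pooled = pooling_layer_map(feature_map, n=2, beta=2.0, bias=-1.0,
                           activation=lambda value: value)
print(pooled)  # [[6.0, 10.0], [22.0, 26.0]]
```

The 4 × 4 input becomes a 2 × 2 output, that is, it shrinks by a factor of n = 2 in each direction.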

Fully connected layer and softmax classifier
The fully connected layer usually sits between the last pooling layer and the classifier and fuses the different features represented by the multiple feature maps. Each neuron in the fully connected layer is connected to all the neurons of the previous layer and produces one output feature.
The fully connected layer combines all the features of the previous layer and then feeds them into the softmax classifier.
The input sample image is convolved and downsampled layer by layer to obtain a relatively complete feature set. These features are then passed to a classifier, which gives the predicted category of the sample image. Next, the difference between the predicted value and the actual value is computed, and this difference is propagated back by a gradient-based algorithm to train the whole neural network. Generally, the last downsampling layer cannot be connected directly to the classifier; its output must first pass through one or two fully connected layers, which transform its dimension, before it can serve as the input of the classifier. The softmax classifier is the usual choice in a CNN.
The softmax classifier is suitable for multi-class classification; its prototype is the logistic regression model for binary classification. In logistic regression, the category label of a sample is $y$ ($y = 0$ or $y = 1$). There are $m$ data samples $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$ with input features $x^{(i)} \in \mathbb{R}^{n+1}$. The category label of each sample is 0 or 1, that is, $y^{(i)} \in \{0, 1\}$, and the hypothesis function is as follows:
$$h_\theta(x) = \frac{1}{1 + \exp\left(-\theta^{T} x\right)}$$
where $\theta$ is the model parameter, from which the cost function is built. By adjusting $\theta$ to minimize the cost function, the predicted category of an input sample can be obtained.
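The relation between the logistic hypothesis and the softmax classifier can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the logistic hypothesis is taken in its standard sigmoid form.

```python
import math


def sigmoid_hypothesis(theta, x):
    """Logistic regression hypothesis 1 / (1 + exp(-theta^T x))."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn class scores into probabilities that sum to one."""
    largest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])
predicted_class = probabilities.index(max(probabilities))

# With two classes and scores (z, 0), softmax reduces to the sigmoid of z.
two_class = softmax([2.0, 0.0])[0]
binary = sigmoid_hypothesis([2.0], [1.0])
```

In the two-class case the first softmax probability equals the logistic hypothesis, which is why logistic regression is described as the prototype of the softmax classifier.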

The training process
There are three main ways of training a CNN: fully supervised, fully unsupervised, and a combination of the two. This paper adopts supervised learning, in which the neural network is trained with a supervised signal. The supervised signal is the true category of each sample. During learning, the CNN extracts the features of the input image and gives the predicted category of the sample at the output. The CNN backpropagates the difference between the predicted value and the actual value to adjust the network parameters continuously. Eventually, the network is able to classify the input image samples correctly.
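The supervised update can be illustrated on the smallest possible model. The sketch below trains a logistic regression classifier by gradient descent instead of a full CNN; the toy data, learning rate, and number of epochs are hypothetical, and only the principle (feed back the difference between prediction and label) carries over to the network described here.

```python
import math


def predict(theta, x):
    """Predicted probability that x belongs to class 1."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))


def cross_entropy(theta, samples):
    """Average cost over all labelled samples."""
    eps = 1e-12  # avoids log(0)
    return -sum(y * math.log(predict(theta, x) + eps)
                + (1 - y) * math.log(1.0 - predict(theta, x) + eps)
                for x, y in samples) / len(samples)


def train(samples, learning_rate=0.5, epochs=200):
    """Batch gradient descent driven by the error (prediction - label)."""
    theta = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        gradient = [0.0] * len(theta)
        for x, y in samples:
            error = predict(theta, x) - y  # supervised signal enters here
            for k, xk in enumerate(x):
                gradient[k] += error * xk / len(samples)
        theta = [t - learning_rate * g for t, g in zip(theta, gradient)]
    return theta


# Toy samples: the first component is the constant 1 (bias term).
samples = [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 3.0], 1), ([1.0, 4.0], 1)]
initial_cost = cross_entropy([0.0, 0.0], samples)
theta = train(samples)
final_cost = cross_entropy(theta, samples)
```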

Proposed algorithm
In this part, we first introduce the method of extracting sketch contours. Then, a dual-channel CNN is proposed. The extraction process of sketch contour features is shown in Fig. 2. First, the input sketch is preprocessed to obtain a smooth sketch. Then, the contour of the sketch is extracted.
Because of the way sketches are drawn by hand, strokes inevitably overlap in places, leaving redundant unclosed curve segments. In order to get a smooth sketch contour, the input sketch must be preprocessed.
The algorithm for eliminating unclosed curve segments can be described as follows:
(a) Scan the image line by line. If a point is found to be a curve endpoint, go to (b). If the whole image contains no curve endpoint, exit. Whether a point is a curve endpoint is judged from the eight directions of its 3 × 3 neighborhood. If no other point is present in any of the eight directions, the point is isolated and is treated as a curve endpoint. If exactly one of the eight directions contains a point, it is a curve endpoint. If three or more of the eight directions contain a point, it is a curve endpoint. If exactly two of the eight directions contain a point and the two directions are adjacent, it is a curve endpoint; otherwise, it is not.
(b) Eliminate the endpoint that was found. Then, determine whether the point adjacent to this endpoint is a curve endpoint. If it is, eliminate that point as well and examine the next adjacent point. If it is not, return to (a).
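The endpoint test of step (a) can be sketched as follows. The code follows the rules literally as they are stated above, including the rule for three or more occupied directions. The direction numbering (0 = right, then counterclockwise) is borrowed from the contour tracking description, and the function names and the treatment of image borders as empty are assumptions.

```python
# (row, column) offsets of the eight directions, 0 = right, counterclockwise.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
           (0, -1), (1, -1), (1, 0), (1, 1)]


def occupied_directions(image, row, col):
    """Return the directions in which the 3 x 3 neighbourhood holds a point."""
    found = []
    for direction, (dr, dc) in enumerate(OFFSETS):
        r, c = row + dr, col + dc
        inside = 0 <= r < len(image) and 0 <= c < len(image[0])
        if inside and image[r][c]:
            found.append(direction)
    return found


def is_curve_endpoint(image, row, col):
    """Apply the endpoint rules of step (a) to one point."""
    directions = occupied_directions(image, row, col)
    if len(directions) <= 1:
        return True  # isolated point, or exactly one neighbour
    if len(directions) == 2:
        # Adjacent directions differ by one step on the circle 0..7.
        return (directions[1] - directions[0]) % 8 in (1, 7)
    return True  # three or more neighbours, as the rule is stated in the text


isolated = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
adjacent_pair = [[0, 0, 1], [0, 1, 1], [0, 0, 0]]
straight_line = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
print(is_curve_endpoint(isolated, 1, 1),
      is_curve_endpoint(adjacent_pair, 1, 1),
      is_curve_endpoint(straight_line, 1, 1))  # True True False
```

Step (b) would call this test repeatedly, clearing each endpoint and moving on to its neighbour until the test fails.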
For the hand-drawn sketch, an adaptive tracking algorithm based on the direction of the eight-connected neighborhood is used to extract the contour of the sketch. The original image is represented by I(x, y), and C(x, y) represents the binary image of the contour. The current direction is d_i; the directions are numbered 0, 1, ..., 7 counterclockwise, starting from the right. The leftmost point on the top line of the sketch, point_c, is selected as the first point. In the 3 × 3 area around this point, there are no other points in the upper, left, and right directions. Therefore, d_i = 2 is selected, and the contour of the sketch is searched in the counterclockwise direction. The specific algorithm is described as follows:
Step 1. Initialization. Set C to zero and d_i = 2; the direction array DI is set to DI = {0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7}.
Step 2. Scan the original image I line by line, from top to bottom and from left to right, to obtain the starting point of the contour, point_c. The current point, point_now, is initialized to point_c.
Step 4. If the current point point_now coincides with the starting point point_c, exit. Otherwise, return to step 3.
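Steps 1 and 2 of the tracking algorithm can be sketched as below. Only the initialization and the search for the starting point are shown; the neighbor search of step 3 is not reproduced because the text above does not spell it out. The function name and the use of (row, column) coordinates are assumptions.

```python
def initialise_tracking(image):
    """Set up contour image C, direction d_i, array DI and the start point."""
    height, width = len(image), len(image[0])
    contour = [[0] * width for _ in range(height)]  # C set to zero
    d_i = 2                                         # initial direction
    # Doubling the array lets DI[d_i + k] be read for k = 0..7 without a modulo.
    DI = list(range(8)) * 2
    # Raster scan: top to bottom, then left to right.
    point_c = next(((row, col)
                    for row in range(height)
                    for col in range(width)
                    if image[row][col]), None)
    point_now = point_c
    return contour, d_i, DI, point_c, point_now


sketch = [[0, 0, 0, 0],
          [0, 0, 1, 1],
          [0, 1, 1, 0]]
contour, d_i, DI, point_c, point_now = initialise_tracking(sketch)
print(point_c)  # (1, 2)
```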
The result of the algorithm, which has good robustness, is shown in Fig. 2. A smooth contour of the input sketch can be obtained once the sketch has been preprocessed. CNN's multi-channel mechanism is used to access different views of the data, such as the red, green, and blue channels of color images, or stereo audio tracks [26]. With additional input information, a CNN can learn more features and improve the classification performance of the model. Therefore, in order to optimize the training process of hand-drawn sketch recognition, this paper proposes a dual-channel CNN. Figure 3 shows the structure of the network. The network consists of two relatively independent convolution networks. The first input of the network is the hand-drawn sketch image, and the second input is the contour of the hand-drawn sketch [27-35].
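The fusion of the two channels can be sketched as follows. This is a minimal sketch, not the network of Fig. 3: the two feature vectors stand in for the outputs of the sketch branch and the contour branch, fusion is assumed to be a simple concatenation before a fully connected layer, and all weights are made-up numbers.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Turn class scores into probabilities that sum to one."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dual_channel_forward(sketch_features, contour_features, weights, biases):
    """Fuse the features of both channels and classify them."""
    fused = sketch_features + contour_features  # concatenation (assumed)
    return softmax(dense(fused, weights, biases))


# Placeholder features, as if produced by the two convolutional branches.
sketch_features = [1.0, 0.0]
contour_features = [0.5]
weights = [[1.0, 0.0, 0.0],   # class 0 looks at the sketch channel
           [0.0, 0.0, 2.0]]   # class 1 looks at the contour channel
biases = [0.0, 0.0]
probabilities = dual_channel_forward(sketch_features, contour_features,
                                     weights, biases)
print(probabilities)  # [0.5, 0.5]
```

Because the fully connected layer sees both feature vectors at once, the error at its output reaches the weights of both channels during training.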
In a dual-channel CNN, each channel contains the same number of convolution layers and parameters but has independent weights. After the pooling layer, the two channels are connected to the fully connected layer, which performs the fully connected mapping. The two channels feed a fully connected hidden layer, which generates the output for the logistic regression classifier. The weights of each channel are updated separately, but the final error is obtained from the two output layers together. In this sense, the two output layers behave like a single layer whose two parts are offset from each other.
Results and discussion

Experimental preparation
The experiment was run on a computer with Windows 7, a 3.60 GHz i7 processor, 32 GB of DDR memory, and a 1024 GB hard disk. The software used is Matlab 2017a. In 2012, Eitz et al. [1] collected the largest hand-drawn sketch data set, which contains 250 categories of hand-drawn sketches, each with 80 different sketches. The original size of each sketch is 1111×1111 pixels, as shown in Fig. 4. In the experiment, 4-fold cross-validation was used, with three folds for training and one for testing. The evaluation index of this experiment is the recognition rate over all test samples.
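The 4-fold protocol can be sketched in a few lines. The sketch assumes that the 250 × 80 samples are indexed category by category and that folds are assigned by interleaving the indices; the paper does not state how the folds were drawn, so both points are assumptions.

```python
def four_fold_splits(num_samples, num_folds=4):
    """Yield (train_indices, test_indices) for each fold in turn."""
    # Interleaved assignment: sample i goes to fold i % num_folds.
    folds = [list(range(k, num_samples, num_folds)) for k in range(num_folds)]
    for test_fold in range(num_folds):
        test = folds[test_fold]
        train = [i for k, fold in enumerate(folds) if k != test_fold
                 for i in fold]
        yield train, test


splits = list(four_fold_splits(250 * 80))
for train, test in splits:
    print(len(train), len(test))  # 15000 5000 for every fold
```

With 80 sketches per category, this assignment places 20 sketches of every category in each test fold.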

Experimental results and analysis
[Table 1. Columns: SIFT-Fisher [2], MKL-SVM [10], FV-SP [2], Alex-Net [11], Proposed]
Zhang EURASIP Journal on Advances in Signal Processing (2021) 2021:73
Deep learning requires a large amount of training data, and a lack of training data tends to cause overfitting. In order to reduce the influence of overfitting, this paper manually expands the hand-drawn sketch data set used in the experiment and obtains a new, augmented data set. The specific steps are as follows:
Step 1. Dimension reduction. Reduce all the hand-drawn sketch images from the original size of 1111×1111 to 256×256.
Step 2. Extract the slices. From each 256×256 image, select five slices of size 225×225: the center, the upper left corner, the lower left corner, the upper right corner, and the lower right corner. Of the resulting five slices, the 225×225 center slices of all samples make up the original data set.
Step 3. Flip horizontally. Flip the five slices obtained in step 2 horizontally to obtain five new slices. The 10 slices of each sample obtained in steps 2 and 3 constitute the augmented data set, so the augmented data set is 10 times the size of the original data set.
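Steps 2 and 3 can be sketched as below for an image stored as a list of rows. The resizing of step 1 is left out because it needs an image library. Taking the center slice at offset (256 − 225) // 2 = 15 is an assumption, since the odd margin of 31 pixels cannot be split evenly; the function names are illustrative.

```python
def crop(image, top, left, size):
    """Cut a size x size slice whose upper left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def five_crop_offsets(full=256, size=225):
    """(top, left) of the center, upper left, lower left, upper right, lower right."""
    edge = full - size
    centre = edge // 2
    return [(centre, centre), (0, 0), (edge, 0), (0, edge), (edge, edge)]


def augment(image, size=225):
    """Return the five slices followed by their horizontal flips."""
    slices = [crop(image, top, left, size)
              for top, left in five_crop_offsets(len(image), size)]
    return slices + [flip_horizontal(s) for s in slices]


# Synthetic 256 x 256 image whose pixel value encodes its position.
image = [[row * 256 + col for col in range(256)] for row in range(256)]
augmented = augment(image)
print(len(augmented), len(augmented[0]), len(augmented[0][0]))  # 10 225 225
```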
The proposed algorithm is compared with several other popular sketch recognition methods: HOG-SVM [1], SIFT-Fisher [2], MKL-SVM [10], FV-SP [2], and Alex-Net [11]. The experimental results are shown in Table 1. Compared with the traditional non-deep-learning methods HOG-SVM, SIFT-Fisher, MKL-SVM, and FV-SP, the recognition rate of the proposed algorithm is higher by 16.1, 9.98, 5.9, and 3.2, respectively. The results show that the deep learning method has stronger feature extraction and nonlinear expression ability than the non-deep-learning methods. Compared with the classical deep learning method Alex-Net, the accuracy is improved by 5.1. The results show that the proposed algorithm, a dual-channel CNN, helps improve the recognition rate of hand-drawn sketches (Table 2). The results are shown in Figs. 1 to 5, and the comparison between our method and other methods on the COAD dataset is shown in Figs. 6 and 7.
Our method is superior to the other methods in training time, recognition accuracy, and energy consumption.
In order to improve the recognition rate of hand-drawn sketch recognition, a hand-drawn sketch recognition algorithm based on a dual-channel convolutional neural network is proposed. First, the sketch is preprocessed and its contour information is extracted. Second, the sketch and its contour are used as the two input channels of a convolutional neural network. Finally, the features are fused in the fully connected layer, and a softmax classifier gives the classification results. The experimental results show that the proposed method achieves a higher recognition rate than the existing mainstream sketch recognition methods. Our future work is to improve the recognition accuracy of handwriting input.