In this section, we first introduce the CNN, then present the contour extraction method for hand-drawn sketches, and finally describe the proposed algorithm.
2.1 Convolutional neural network
CNN is an algorithm that requires little human intervention. Its weight-update process borrows from the traditional BP neural network: error backpropagation is used to update the parameters automatically. Because of this, a CNN can take the image directly as input and automatically extract image features for recognition. The weight-sharing and local receptive-field characteristics of CNN not only reduce the number of parameters in the network but also resemble the way animal visual nerve cells work, which greatly improves the recognition accuracy and efficiency of the network.
A CNN has two typical characteristics. First, neurons in adjacent layers are connected locally through convolution kernels rather than fully connected; the convolution layer connected to the input image therefore operates on pixel blocks instead of the traditional pixel-wise full connection. Second, the weight parameters of a convolution kernel are shared within the same layer. These two features greatly reduce the number of parameters of a deep network, lower the model complexity, and accelerate training, which gives CNN a great advantage in processing pixel data. The main components of a CNN are the convolution layer, pooling layer, activation function, fully connected layer, and classifier, as shown in Fig. 1.
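To illustrate the effect of local connection and weight sharing, the following minimal sketch (with hypothetical layer sizes, given only for illustration) compares the parameter count of a fully connected mapping with that of a shared convolution kernel:

```python
# Hypothetical sizes, chosen only for illustration.
H, W = 32, 32          # input image size
C_out = 16             # number of output feature maps
K = 3                  # convolution kernel size

# Fully connected: every output unit sees every input pixel.
fc_params = (H * W) * (H * W * C_out)

# Convolution with weight sharing: one K x K kernel (plus bias) per output map.
conv_params = C_out * (K * K + 1)

print(f"fully connected:    {fc_params:,} parameters")   # 16,777,216
print(f"shared convolution: {conv_params:,} parameters")  # 160
```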
2.2 Convolution layer
The convolution layer is the most important network layer for feature extraction in a CNN. The convolution operation produces a new feature map by convolving a kernel with the input sample image (or with the feature maps output by the previous layer) and applying an activation function. Each layer contains multiple feature maps, each of which represents one feature of the image. The convolution operation can be expressed as follows.
$$ {x}_j^l=f\left(\sum \limits_{i\in {M}_j}{x}_i^{l-1}\ast {k}_{ij}^l+{b}_j^l\right) $$
(1)
Each layer in the CNN has multiple feature maps. Let \( {x}_j^l \) denote the jth feature map of layer l, where f(⋅) is the activation function, described in more detail in the following sections. Mj denotes the set of input feature maps (or the input sample image), \( {k}_{ij}^l \) denotes the convolution kernel in layer l, and ∗ denotes convolution. After the convolution operation, the bias b is added to the result, and the new feature map is formed by the activation function.
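For clarity, a minimal Python sketch of Eq. (1) is given below. The helper names are illustrative, and cross-correlation is used in place of the flipped convolution, as is common in CNN implementations:

```python
import numpy as np

def conv2d_valid(x, k):
    """2-D 'valid' cross-correlation of a single feature map x with kernel k."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k)
    return out

def conv_layer(prev_maps, kernels, biases, f=np.tanh):
    """Eq. (1): x_j^l = f( sum_{i in M_j} x_i^{l-1} * k_ij^l + b_j^l ).

    prev_maps   : list of feature maps x_i^{l-1}
    kernels[i][j]: kernel k_ij^l connecting input map i to output map j
    biases[j]   : bias b_j^l of output map j
    """
    n_out = len(biases)
    new_maps = []
    for j in range(n_out):
        s = sum(conv2d_valid(prev_maps[i], kernels[i][j])
                for i in range(len(prev_maps)))
        new_maps.append(f(s + biases[j]))
    return new_maps
```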
The error backpropagation algorithm is used to update the CNN weights, and the first step of the update process is to compute the gradient at each layer.
2.2.1 Downsampling and pooling layer
The downsampling layer performs feature extraction by reducing the dimensions of the image feature maps; this operation is usually called pooling. In downsampling, the dimensions of the feature maps of the previous layer are reduced to obtain feature maps in one-to-one correspondence with them. Therefore, the n output feature maps of the preceding convolution layer are used as the input of the downsampling layer, and n output feature maps are obtained after dimension reduction. The following formula expresses the downsampling (pooling) process.
$$ {x}_j^l=f\left({\beta}_j^l down\left({x}_j^{l-1}\right)+{b}_j^l\right) $$
(2)
where down(⋅) denotes the downsampling operation. The pixel values in each n × n region of the input feature map are combined to give a single value in the output feature map, so the dimensions of the input feature map are reduced by a factor of n in both the horizontal and vertical directions. The final pixel value of the output feature map also depends on the multiplicative bias β and the additive bias b, and is obtained after applying the activation function.
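A minimal sketch of Eq. (2), using mean pooling as the down(⋅) operation (the choice of mean pooling and the helper names are assumptions made only for illustration):

```python
import numpy as np

def downsample(x, n):
    """down(.): mean-pool each non-overlapping n x n block of feature map x."""
    h, w = (x.shape[0] // n) * n, (x.shape[1] // n) * n
    return x[:h, :w].reshape(h // n, n, w // n, n).mean(axis=(1, 3))

def pooling_layer(prev_map, beta, b, n=2, f=np.tanh):
    """Eq. (2): x_j^l = f( beta_j^l * down(x_j^{l-1}) + b_j^l )."""
    return f(beta * downsample(prev_map, n) + b)
```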
2.2.2 Fully connected layer and softmax classifier
The fully connected layer is usually placed between the pooling layer and the final classifier layer to fuse the different features represented by the multiple feature maps. Each neuron in the fully connected layer is connected to all the neurons in the previous layer and has its own output. The fully connected layer thus combines all the features learned by the previous layers and then feeds them into the softmax classifier.
The input sample image is convolved and downsampled layer by layer to obtain a relatively complete feature set. These features are then classified by a classifier to obtain the predicted category of the sample image, and the difference between the predicted value and the true value is propagated back by a gradient-based algorithm to train the whole neural network. In general, the last downsampling layer cannot be connected directly to the classifier; its output must first pass through one or two fully connected layers, whose dimension transformation serves as the classifier input. The softmax classifier is the one usually used in CNNs.
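The fusion of the pooled feature maps and the softmax output can be sketched as follows (the names and the flattening-by-concatenation step are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def fully_connected(feature_maps, W, b):
    """Flatten the pooled feature maps and apply a dense mapping."""
    v = np.concatenate([m.ravel() for m in feature_maps])
    return W @ v + b

def softmax(z):
    """Softmax classifier output: class probabilities from the FC activations."""
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()
```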
The softmax classifier is suitable for multi-class problems; its prototype is the logistic regression model for binary classification. In logistic regression, the sample category label is y (y = 0 or y = 1). There are m data samples \( \left\{\left({x}^{(1)},{y}^{(1)}\right),\left({x}^{(2)},{y}^{(2)}\right),\dots, \left({x}^{(m)},{y}^{(m)}\right)\right\} \) with input features \( {x}^{(i)}\in {\mathbb{R}}^{n+1} \). The category label of each sample is 0 or 1, that is, \( {y}^{(i)}\in \left\{0,1\right\} \), and the hypothesis function is shown as follows:
$$ {h}_{\theta }(x)=\frac{1}{1+\exp \left(-{\theta}^Tx\right)} $$
(3)
where θ is the model parameter from which the cost function is constructed. By adjusting θ to minimize the cost function, the predicted category of the input sample can be obtained.
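A small numerical illustration of Eq. (3) (the parameter and input values are hypothetical):

```python
import numpy as np

def h(theta, x):
    """Eq. (3): logistic hypothesis h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + np.exp(-theta @ x))

# Hypothetical example: the output is the predicted probability that y = 1.
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.7])   # the first component acts as the intercept term
print(h(theta, x))               # ~0.83
```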
2.2.3 The training process
There are three main ways to train a CNN: fully supervised, fully unsupervised, and a combination of supervised and unsupervised learning. This paper adopts the supervised learning method. In supervised learning, the neural network is trained with a supervision signal, namely the true class label of each sample. During learning, the CNN extracts features from the input image and gives the predicted class of the sample at the output. The difference between the predicted value and the true value is backpropagated to continuously adjust the network parameters, until the network classifies the input image samples correctly.
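One supervised update step can be sketched as follows for a single softmax layer trained with cross-entropy loss; the names and sizes are hypothetical, and the full network additionally contains the convolution and pooling layers described above:

```python
import numpy as np

def supervised_step(x, y_true, W, b, lr=0.1):
    """One supervised training step: predict, compare with the true label,
    backpropagate the error, and adjust the parameters."""
    z = W @ x + b
    p = np.exp(z - z.max()); p /= p.sum()       # predicted class probabilities
    target = np.zeros_like(p); target[y_true] = 1.0
    dz = p - target                              # gradient of cross-entropy w.r.t. z
    W -= lr * np.outer(dz, x)                    # parameter updates
    b -= lr * dz
    return W, b
```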
2.3 Proposed algorithm
In this part, we first introduce the method for extracting sketch contours. Then, the dual-channel CNN is proposed.
The extraction process of sketch contour features is shown in Fig. 2. First, the input sketch is preprocessed to obtain a smooth sketch. Then, the outline of the sketch is extracted.
Due to the characteristics of hand drawing, strokes inevitably overlap, and the overlapping areas contain redundant unclosed curve segments. To obtain a smooth sketch contour, the input sketch must first be preprocessed.
The algorithm for eliminating the unclosed curve segments can be described as follows; a sketch of the endpoint test is given after the two steps:
(a) Scan the picture along the line direction. If a point is found to be a curve endpoint, go to (b); if no curve endpoint is found in the whole picture, exit. Whether a point is a curve endpoint is judged from its 3 × 3 neighborhood: if none of the eight directions around the point contains another point, the point is isolated and is a curve endpoint; if exactly one of the eight directions contains a point, it is a curve endpoint; if three or more of the eight directions contain points, it is a curve endpoint; if exactly two of the eight directions contain points and those two directions are adjacent, it is a curve endpoint; otherwise, it is not.
(b) Eliminate the endpoint that was found. Then determine whether the point adjacent to this endpoint is itself a curve endpoint. If it is, continue to eliminate it and examine the next adjacent point; if it is not, return to (a).
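The endpoint test on the 3 × 3 neighborhood can be sketched as follows (the direction numbering and the border handling are illustrative assumptions; the tested point is assumed not to lie on the image border):

```python
import numpy as np

# Eight neighbor offsets, listed counterclockwise starting from the right,
# matching the direction numbering 0..7 used later for contour tracking.
NEIGHBORS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
             (0, -1), (1, -1), (1, 0), (1, 1)]

def is_curve_endpoint(img, r, c):
    """Endpoint test following the rules above: 0, 1, or >= 3 occupied
    directions -> endpoint; exactly 2 occupied directions -> endpoint only
    if the two directions are adjacent."""
    occupied = [i for i, (dr, dc) in enumerate(NEIGHBORS)
                if img[r + dr, c + dc]]
    if len(occupied) != 2:
        return True                    # 0, 1, or >= 3 neighbors
    a, b = occupied
    return (b - a) % 8 in (1, 7)       # the two occupied directions are adjacent
```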
For the hand-drawn sketch, an adaptive tracking algorithm based on the directions of the eight-connected neighborhood is used to extract the contour of the sketch. The original image is denoted by I(x, y), and C(x, y) denotes the binary image of the contour. The current direction is di, with the directions numbered 0, 1, …, 7 counterclockwise starting from the right. A point pointc at the left of the top line of the image is selected as the first point; in its 3 × 3 neighborhood there are no other points in the upper, left, or right directions. Then di = 2 is selected and the contour of the sketch is traced counterclockwise. The specific algorithm is described as follows, and a sketch of the tracking loop is given after the steps:
Step 1. Initialization. Set C to zero and di = 2, and set the direction array DI = {0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7}.
Step 2. Scan the original image I line by line, from top to bottom and left to right, to obtain the starting point pointc of the contour. Initialize the current point pointnow to pointc.
Step 3. Add the current point pointnow to the binary contour image C(x, y). Search around pointnow in the order DI[di], DI[di + 1], …, DI[di + 7] to find the next adjacent boundary point pointnext. If a point found in one of these directions belongs to the original image I and is not equal to the initial point pointc, this point is pointnext. If the next point is found in direction DI[i], the new search direction di is the direction next to the opposite of DI[i], that is, di = (DI[i] + 4 + 1) mod 8. Then assign pointnext to pointnow.
Step 4. If the point pointnow coincides with the starting point pointc, exit; otherwise, return to Step 3.
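A simplified sketch of the tracking loop is given below. It follows Steps 1-4 with the same direction numbering as above; termination is handled in the standard way, i.e., the loop stops when the trace returns to pointc, and open curves as well as border effects are not treated:

```python
import numpy as np

# Directions 0..7, counterclockwise starting from the right.
DIRS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
        (0, -1), (1, -1), (1, 0), (1, 1)]

def trace_contour(I):
    """Trace the contour of binary image I and return the contour image C."""
    C = np.zeros_like(I)                                   # Step 1
    di = 2
    ys, xs = np.nonzero(I)                                 # Step 2: start point
    point_c = (ys.min(), xs[ys == ys.min()].min())         # top-most, left-most
    point_now = point_c

    while True:
        C[point_now] = 1                                   # Step 3
        for step in range(8):
            d = (di + step) % 8
            dr, dc = DIRS[d]
            cand = (point_now[0] + dr, point_now[1] + dc)
            if (0 <= cand[0] < I.shape[0] and 0 <= cand[1] < I.shape[1]
                    and I[cand]):
                di = (d + 4 + 1) % 8                       # new search direction
                point_now = cand
                break
        else:
            return C                                       # isolated point: stop
        if point_now == point_c:                           # Step 4
            return C
```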
The effect of the algorithm is shown in Fig. 2; it is robust, and a smooth contour of the input sketch is obtained by preprocessing the sketch.
CNN’s multi-channel mechanism is used to access different views of the data, such as the red, green, and blue channels of a color image or the tracks of stereo audio [26]. By adding input information, the CNN can learn more features and improve the classification performance of the model. Therefore, to optimize the training process of hand-drawn sketch recognition, this paper proposes a dual-channel CNN. Figure 3 shows the structure of the network, which consists of two relatively independent convolution networks. The first input of the network is the hand-drawn image, and the second input is the contour of the hand-drawn sketch [27,28,29,30,31,32,33,34,35].
In the dual-channel CNN, the two channels contain the same number of convolution layers and parameters but have independent weights. After the pooling layers, the two channels are connected to the fully connected layer and perform the fully connected mapping. The two channels feed a common fully connected hidden layer, which generates the input of the logistic regression classifier. The weights of each channel are updated separately, but the final error is obtained through the two output layers together, so the two output layers act like a single layer that has been split between the channels.
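At a high level, the forward pass of the dual-channel network can be sketched as follows (the channel function is a placeholder for the convolution and pooling stack of Eqs. (1)-(2), and fusing the two channels by concatenation is an illustrative assumption):

```python
import numpy as np

def channel_forward(img, params):
    """Placeholder for one channel's convolution/pooling stack (Eqs. (1)-(2));
    here it simply flattens the input so the sketch stays short."""
    return img.ravel()

def dual_channel_forward(sketch_img, contour_img, params1, params2, W, b):
    """Two independent channels whose features are fused by a shared fully
    connected layer feeding a softmax classifier."""
    f1 = channel_forward(sketch_img, params1)    # channel 1: hand-drawn image
    f2 = channel_forward(contour_img, params2)   # channel 2: extracted contour
    z = W @ np.concatenate([f1, f2]) + b         # shared fully connected layer
    e = np.exp(z - z.max())
    return e / e.sum()                           # softmax class probabilities
```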