Hand pose estimation based on improved NSRM network

Hand pose estimation is the basis of dynamic gesture recognition. In vision-based hand pose estimation, performance suffers from the high flexibility of hand joints, local similarity among joints, and severe occlusion. In this paper, the structural relations between hand joints are modeled, and an improved nonparametric structure regularization machine (NSRM) is used to achieve more accurate hand pose estimation. Building on the NSRM network, the backbone is replaced with the new high-resolution network proposed in this paper to improve performance, and the number of parameters is then decreased by reducing the input and output channels of some convolutional layers. Experiments on a public dataset show that the improved NSRM network achieves higher accuracy and faster inference for hand pose estimation.


Background and significance
Hand pose estimation infers the 2D or 3D positions of hand keypoints in an input image and has wide application potential in virtual reality [1], human-computer interaction [2,3] and other fields [4,5]. Due to the high flexibility of hand joints, local similarity and severe occlusion, hand pose estimation remains an open research problem.
In recent years, more and more methods have emerged in the field of hand pose estimation, including multi-view RGB systems [6,7], depth-based methods [8-10] and monocular RGB methods [11,12], and the accuracy, speed and other aspects of performance have improved steadily. Although 3D hand pose estimation [3,5,13,14] has attracted increasing attention, 2D hand pose estimation [2,15,16] remains an essential research direction: many 3D algorithms rely on a corresponding 2D algorithm [3,17], obtaining their estimates by mapping features from 2D space to 3D space. With the emergence of Deep Convolutional Neural Networks (DCNN), human pose estimation has also made significant progress, with excellent networks emerging in this field such as the Convolutional Pose Machine (CPM) [18], the Residual Network [19] and the Stacked Hourglass Network (SHG) [20]. These methods implicitly encode body-part information and have been used as 2D pose estimation sub-modules [7,11,12]. However, although DCNNs have strong representation ability, they cannot capture the complex structural relationships between hand keypoints, which makes severe hand occlusion difficult to handle.
The Nonparametric Structure Regularization Machine (NSRM) [21] adopts a cascaded multi-task architecture to jointly learn keypoint and hand-structure representations, using synthetic hand masks to guide keypoint structure learning. The High-Resolution Net (HRNet) [22,23], an effective method in the pose estimation field, maintains high-resolution representations throughout the network. It consists of multiple branches with different resolutions: the low-resolution branches capture context information, while the high-resolution branch preserves spatial information. Through multi-scale fusion between branches, HRNet generates high-resolution feature maps with rich semantics.
To better solve the inaccurate localization of keypoints caused by occlusion, interference and the complex structural relationships of the human hand in 2D hand pose estimation, the main contributions of this paper are threefold: 1. We introduce an attention mechanism into the HRNet framework to obtain a more accurate and efficient model, NHRNet. 2. We decrease the input and output channels of convolutional layers in the stages of NSRM to reduce the number of network parameters. 3. We improve the backbone of NSRM by replacing VGG-19 with NHRNet to decrease the localization error of keypoints.
The rest of this paper is organized as follows. Section 1.2 discusses related work; Sect. 2 describes our proposed network; Sect. 3 shows our experimental details; Sect. 4 discusses the experimental results; finally, Sect. 5 concludes the paper.

Related work
Research on hand pose estimation has developed rapidly in recent years. Hand pose estimation is a crucial task in computer vision whose goal is to detect the important skeletal points of the hand. Existing methods can be roughly divided into three classes: generative methods, discriminative methods and hybrid methods.
The generative method usually builds a deformable hand model and defines an objective function that measures the similarity between the captured image and the hand model. To optimize this objective, the parameters of the hand model are adjusted iteratively to fit the input image, so as to obtain the optimal solution. Sridhar et al. [24] proposed a Gaussian mixture model of the hand and established a new objective function to estimate hand posture; the model requires relatively few computational resources, which makes estimation very fast. Romero et al. [25] presented a related approach.
The discriminative method does not need to know the size of the hand or to impose motion constraints. It learns a mapping between observed features and the predicted output from training data, so as to estimate the hand pose in the current image. Kong et al. [26] proposed a network architecture called the Rotation-invariant Mixed Graphical Model Network (R-MGMN) for 2D hand pose estimation in monocular vision. Fang et al. [27] proposed a joint graph inference module based on a Graph Convolutional Network (GCN) to model complex dependencies among joints while enhancing the representation capability of each pixel; the offset of each pixel to a joint is estimated in its neighborhood, and the joint position is estimated as a weighted average over all pixel predictions. Building on the Generative Adversarial Network (GAN) of Goodfellow et al. [28], data synthesis in the skeleton space has been used to output hand poses. Kourbane et al. [29] proposed a 2D hand pose estimation method with multi-scale heatmap regression that adopts the hand skeleton as additional information to constrain the regression problem.
The hybrid method combines the generative and discriminative methods. Zhang et al. [30] proposed a unified optimization framework to jointly track hand pose and object motion: the hand and object are first segmented by a Deep Neural Network (DNN), the current hand pose is predicted from the previous pose with a pretrained Long Short-Term Memory (LSTM) network, and the target model is then reconstructed with a non-rigid fusion technique. Chen et al. [31] proposed the Spherical Part Model (SPM) to represent the hand pose, which estimates hand posture more precisely by exploiting prior knowledge of the hand.

Principle and framework of NSRM
NSRM learns keypoint representations for 2D hand pose estimation guided by synthetic hand masks. It introduces a new probabilistic representation of hand limbs, whose synthetic masks can be obtained from the keypoints without additional data annotation. The hand model contains 21 keypoints and 20 limbs connecting them, as shown in Fig. 1. The NSRM network structure (G1_6) is shown in Fig. 2. First, the backbone network extracts a feature map of the hand. Second, the feature map is fed into the structure-learning stages to obtain the hand structure representation. Then, the feature map and the structure representation are fused. Finally, the pose-learning stages output the keypoint coordinates.

Representation and composition of limb mask
For a limb L between any two keypoints i and j, two masks are defined: the Limb Deterministic Mask (LDM) and the Limb Probabilistic Mask (LPM), as shown in Fig. 3a, b. The pixels of L are those in a fixed-width rectangle centered on the line segment p_i p_j between keypoints i and j, i.e.,

L = { p : 0 ≤ (p − p_i) · (p_j − p_i) ≤ ‖p_j − p_i‖², |(p − p_i) · u⊥| ≤ σ_LDM },

where u⊥ is a unit vector perpendicular to p_i p_j, and σ_LDM is a hyperparameter that controls the limb width.
LDM assigns a point p the pixel value 1 if it lies in the rectangular area of limb L, and 0 otherwise, i.e.,

LDM(p) = 1 if p ∈ L, and LDM(p) = 0 otherwise,

where p is any pixel in the image. LDM performs poorly in real scenes because of this hard assignment, so LPM is proposed to solve the problem. Instead of representing a point with absolute 0/1 values, LPM scores it with a Gaussian of the distance from pixel p to the segment p_i p_j with bandwidth σ²_LPM, i.e.,

LPM(p) = exp(−D(p, p_i p_j)² / σ²_LPM),

Fig. 3 Two mask representations
where D(p, p_i p_j) is the distance between the pixel p and the line segment p_i p_j, and σ_LPM is the hyperparameter that controls the spread of the Gaussian distribution. The mask representation comes in four variants, named LDM_G1 (shown in Fig. 3c), LPM_G1 (shown in Fig. 3d), LDM_G1_6 and LPM_G1_6. G1 represents the whole hand with a single mask, while G6 divides the limbs into six groups, the palm and the five fingers, as shown in Fig. 4. G1 captures the entire hand structure, while G6 pays more attention to local detail in regions of the hand. G1_6 denotes the combination of the G1 and G6 versions.
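To make the mask construction concrete, the sketch below computes an LPM heatmap with NumPy. The function name, the (x, y) pixel layout and the clamped projection onto the segment are our own illustrative choices, not code from the paper:

```python
import numpy as np

def limb_probabilistic_mask(h, w, p_i, p_j, sigma_lpm):
    """LPM sketch: Gaussian of the distance from each pixel to segment p_i p_j.
    Names and array layout are illustrative, not the paper's implementation."""
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs, ys], axis=-1).astype(np.float64)   # (h, w, 2) in (x, y)
    p_i = np.asarray(p_i, dtype=np.float64)
    p_j = np.asarray(p_j, dtype=np.float64)
    seg = p_j - p_i
    seg_len2 = max(float(seg @ seg), 1e-12)
    # Project each pixel onto the segment and clamp to its endpoints.
    t = np.clip((pts - p_i) @ seg / seg_len2, 0.0, 1.0)
    closest = p_i + t[..., None] * seg
    d2 = np.sum((pts - closest) ** 2, axis=-1)             # squared distance D^2
    return np.exp(-d2 / sigma_lpm ** 2)
```

Pixels on the segment receive value 1, and the value decays with the Gaussian bandwidth σ_LPM away from it, which is the soft behavior LPM is designed to provide.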

Loss function
The loss function of the structural modules adopts cross-entropy loss, i.e.,

L_S = − Σ_{t=1}^{T_S} Σ_{g=1}^{G} Σ_p [ S*_g(p) log Ŝ^t_g(p) + (1 − S*_g(p)) log(1 − Ŝ^t_g(p)) ],

where T_S is the number of structure-learning stages, Ŝ^t_g(p) is the prediction of the structural module for group g at pixel p in stage t, S*_g(p) is the ground-truth combined limb mask, and G is the number of groups (7 for G1_6, 1 for G1). The keypoint confidence map (KCM) of keypoint k is defined as a 2D Gaussian centered on the annotated keypoint x_k with standard deviation σ_KCM, i.e.,

C*_k(p) = exp(−‖p − x_k‖² / σ²_KCM).

The loss function of the position prediction module adopts the sum-of-squared-error loss, i.e.,

L_K = Σ_{t=1}^{T_K} Σ_{k=1}^{21} Σ_p ( Ĉ^t_k(p) − C*_k(p) )²,

where T_K is the number of pose-learning stages and Ĉ^t_k(p) is the predicted confidence at pixel p for keypoint k in stage t.
The total loss function is

L = L_K + λ1 L_S^{G1} + λ2 L_S^{G6},

where λ1 and λ2 are hyperparameters controlling the relative weight of the two structure groups.
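Under this reading (G1 and G6 structure losses weighted by λ1 and λ2, plus the keypoint loss), the total loss could be sketched in PyTorch as follows. The channel split (channel 0 = G1 whole-hand mask, channels 1-6 = G6 part masks) and the reductions are our assumptions, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def nsrm_total_loss(struct_preds, struct_gt, kpt_preds, kpt_gt,
                    lam1=0.1, lam2=0.02):
    """Hedged sketch of the total loss.
    struct_preds: list over structure stages of (B, 7, H, W) mask probabilities
    struct_gt:    (B, 7, H, W) ground-truth limb masks (channel 0 = G1)
    kpt_preds:    list over pose stages of (B, 21, H, W) confidence maps
    kpt_gt:       (B, 21, H, W) target KCMs
    """
    # Structure loss: cross-entropy, split into G1 and G6 groups.
    l_g1 = sum(F.binary_cross_entropy(s[:, :1], struct_gt[:, :1])
               for s in struct_preds)
    l_g6 = sum(F.binary_cross_entropy(s[:, 1:], struct_gt[:, 1:])
               for s in struct_preds)
    # Keypoint loss: sum-of-squared-error between predicted and target KCMs.
    l_k = sum(F.mse_loss(c, kpt_gt, reduction='sum') for c in kpt_preds)
    return l_k + lam1 * l_g1 + lam2 * l_g6
```

The default weights match the training settings reported later (λ1 = 0.1, λ2 = 0.02).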

Improved NSRM network
Examining the NSRM network based on the LPM_G1_6 mask, its backbone is the classic VGG-19. The quality of the backbone architecture directly determines the strength of feature extraction, so its importance is self-evident. Although VGG-19 can effectively extract feature information of hand keypoints, the resolution of its feature maps is very low and spatial structure is lost, which leads to incomplete or ineffective feature extraction. Moreover, VGG-19 produces a 128-channel feature map, which increases the number of parameters in the NSRM network and consumes more computing resources. The HRNet pose estimation network instead connects feature maps of different resolutions in parallel rather than simply concatenating them, so the whole network maintains a high-resolution representation. Representations at different resolutions within the same stage are fused repeatedly to enhance the capture of image information and improve prediction accuracy. Therefore, the original backbone VGG-19 is replaced by an HRNet-based network.

NHRNet model
The HRNet network mainly consists of four stages, and its structure is shown in Fig. 5. At the end of each stage, the network is connected to the next stage through an exchange unit, and the feature map is output after the last stage. HRNet has four parallel sub-networks whose resolutions are 1/4, 1/8, 1/16 and 1/32 of the input image, respectively. The network first reduces the image resolution to 1/4 of the input through two 3 × 3 convolutions with stride 2, then enters the first stage, which belongs to the first sub-network. The first stage includes four Bottleneck residual units with 64 output channels (represented by one unit in Fig. 5). In each unit, a feature map is generated by a convolution module containing a convolution layer, a Batch Normalization (BN) layer and a Rectified Linear Unit (ReLU) layer, and the newly generated feature map is fused with the input feature map to form the output. There are 1, 4 and 3 resolution blocks in the second, third and fourth stages, respectively, and each resolution block contains 4 Basic residual units (represented by one unit in Fig. 5).
In a Basic unit, the feature map is generated by two convolution modules (each containing a convolution layer, a BN layer and a ReLU layer), and the new feature map is fused with the input feature map to form the output. HRNet performs cross-resolution feature fusion among the sub-networks through exchange units. Each new sub-network has half the resolution of the previous one but twice the number of channels. Cross-resolution fusion is achieved by 2× (4× or 8×) down-sampling, convolution units and 2× (4× or 8×) up-sampling. Repeated multi-resolution fusion through the exchange units lets each high-to-low resolution representation repeatedly receive information from the other parallel branches, finally producing high-resolution features rich in information.
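The exchange-unit idea can be illustrated with a toy two-branch module; `ExchangeUnit` and its layer choices are our simplification (real HRNet fuses up to four branches and includes batch normalization):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExchangeUnit(nn.Module):
    """Toy two-branch HRNet-style exchange unit: each branch outputs the sum
    of itself and the resampled other branch. A sketch of the idea only."""
    def __init__(self, c_high, c_low):
        super().__init__()
        # high -> low: strided 3x3 conv halves resolution, matches channels
        self.down = nn.Conv2d(c_high, c_low, 3, stride=2, padding=1)
        # low -> high: 1x1 conv matches channels, then 2x nearest upsampling
        self.up = nn.Conv2d(c_low, c_high, 1)

    def forward(self, x_high, x_low):
        y_high = x_high + F.interpolate(self.up(x_low), scale_factor=2,
                                        mode='nearest')
        y_low = x_low + self.down(x_high)
        return y_high, y_low
```

Both branches keep their own resolution while repeatedly absorbing information from the other, which is the mechanism the text describes.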
By introducing an attention mechanism, the network focuses on the input information most relevant to the current task and filters out irrelevant information, improving its efficiency and accuracy. The SA module proposed in SANet (Shuffle Attention Networks) [32] is easy to implement and to plug into existing model frameworks. The architecture of the SA module is shown in Fig. 6. It combines spatial attention and channel attention, aiming to capture pixel-level pairwise relationships and channel dependencies together. The SA module first groups the channel dimension and then operates on each group in parallel. Within each group, it applies global average pooling (F_gp) and group normalization (GN) to obtain channel-wise and spatial-wise information, respectively. All groups are then aggregated, and a channel shuffle operator enables information exchange between groups.
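A minimal sketch of such an SA layer, following the description above (channel grouping, a pooled channel branch, a group-normalized spatial branch, and a final channel shuffle); the parameter shapes and initializations are illustrative rather than taken from the SANet code:

```python
import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    """Sketch of the SA module [32]: channels are grouped, each group is split
    into a channel-attention half (global average pooling) and a
    spatial-attention half (group norm), then a channel shuffle mixes groups."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)          # channels per half-group
        self.cw = nn.Parameter(torch.zeros(1, c, 1, 1))  # channel-branch scale
        self.cb = nn.Parameter(torch.ones(1, c, 1, 1))   # channel-branch bias
        self.sw = nn.Parameter(torch.zeros(1, c, 1, 1))  # spatial-branch scale
        self.sb = nn.Parameter(torch.ones(1, c, 1, 1))   # spatial-branch bias
        self.gn = nn.GroupNorm(c, c)

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.view(b * self.groups, c // self.groups, h, w)
        x0, x1 = x.chunk(2, dim=1)
        # channel attention: squeeze spatially (F_gp), then gate
        a0 = torch.sigmoid(self.cw * x0.mean((2, 3), keepdim=True) + self.cb)
        x0 = x0 * a0
        # spatial attention: group-normalize, then gate
        a1 = torch.sigmoid(self.sw * self.gn(x1) + self.sb)
        x1 = x1 * a1
        out = torch.cat([x0, x1], dim=1).view(b, c, h, w)
        # channel shuffle (2 groups) to exchange information between groups
        return out.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```

The module is shape-preserving, which is what allows it to be dropped into an existing residual block.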
On the basis of HRNet, an attention path is added by inserting the SA module after the second convolution module of the Basic residual unit. This attention mechanism makes the network pay more attention to the skeleton points of the human hand; the resulting model is called NHRNet (New HRNet). The framework of NHRNet is shown in Fig. 7, where the orange squares represent the improved Basic modules; the details of one such module are also shown in Fig. 7. The input of the Basic module is x ∈ R^{c×w×h}, and the module adopts a residual mapping: denoting the required underlying mapping as H(x), we have H(x) = F(x) + x, where F is the function computed by the two 3 × 3 convolutions of the Basic module. After adding SA, the mapping becomes

H(x) = SA(F(x)) + x.
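The modified Basic unit H(x) = SA(F(x)) + x can be sketched as follows; `BasicBlockSA` is a hypothetical name, and passing `nn.Identity()` (the default) recovers the plain Basic block:

```python
import torch
import torch.nn as nn

class BasicBlockSA(nn.Module):
    """Basic residual unit with an attention module on the residual branch,
    i.e. H(x) = SA(F(x)) + x. `sa_module` stands in for the SA layer."""
    def __init__(self, channels, sa_module=None):
        super().__init__()
        self.f = nn.Sequential(                      # F: two 3x3 conv modules
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.sa = sa_module if sa_module is not None else nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.sa(self.f(x)) + x)     # H(x) = SA(F(x)) + x
```

Because the SA layer preserves the tensor shape, the residual addition with the identity path is unchanged.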

Improved backbone network
To improve the accuracy of hand keypoint detection, this paper improves the NSRM network by replacing the backbone VGG-19 with NHRNet. The number of generated feature channels is decreased from 128 to 32, and the size of the feature map also changes with the new backbone. Therefore, the output channels of the first convolution layer in the first stage are changed to 128, and from the second stage to the sixth stage, the output channels of the first convolution layer and the input channels of the seventh convolution layer are changed to 32, which improves performance while reducing the number of network parameters. The improved network structure is shown in Fig. 8.
In Fig. 8, the modules in front of the green arrow are those of the original NSRM network, and those behind the green arrow are the improved modules. The improved network uses NHRNet as the backbone, which generates a 32-channel feature map. The architecture has six stages; each stage contains five convolution layers with 7 × 7 kernels and two convolution layers with 1 × 1 kernels (except the first stage). The first three stages learn the synthetic mask representation, and the last three stages learn the pose representation. The improved model works as follows: all hand images are resized to 256 × 256 and fed into the model, producing a 64 × 64 feature representation for the mask method. As shown in Fig. 8, each stage consists of a series of convolution layers with k × k kernels; the specific input and output channel numbers and kernel sizes are given in the rectangular boxes below. The output of structure stage 3 is fed to each keypoint stage. The two kinds of stages output tensors of size (batchsize, 3, 7, 64, 64) and (batchsize, 3, 21, 64, 64), respectively, representing the scores of the seven mask groups and the confidence of each keypoint in the image.
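A hedged sketch of one of the later stages (stages 2-6) as described above: five 7 × 7 convolutions whose first layer outputs 32 channels, followed by two 1 × 1 convolutions whose second (the seventh layer) takes 32 input channels. The intermediate widths and the helper name are illustrative, not the exact table in Fig. 8:

```python
import torch
import torch.nn as nn

def make_stage(in_ch, out_ch, mid_ch=32):
    """One later stage (2-6) of the improved NSRM: five 7x7 convolutions then
    two 1x1 convolutions. out_ch is 7 for structure stages (the G1_6 masks)
    and 21 for keypoint stages; mid_ch=32 reflects the reduced channel width."""
    layers, c = [], in_ch
    for _ in range(5):                              # five 7x7 conv layers
        layers += [nn.Conv2d(c, mid_ch, 7, padding=3), nn.ReLU(inplace=True)]
        c = mid_ch
    layers += [nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True),  # 6th conv
               nn.Conv2d(mid_ch, out_ch, 1)]                         # 7th conv
    return nn.Sequential(*layers)
```

With padding chosen to preserve spatial size, a stage maps its input to a 7- or 21-channel map of the same 64 × 64 resolution.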

Experiments
The CMU Panoptic Hand dataset [4] was selected for model training and testing, and the effectiveness of the proposed model was verified by comparison with existing state-of-the-art models.

Dataset and evaluation index
The CMU Panoptic Hand dataset contains 14,817 images, each annotated with 21 keypoints of the right hand. Since this paper focuses on hand pose estimation rather than hand detection, each image is cropped to 2.2 times the largest side of the bounding box containing all hand keypoints. The cropped dataset is randomly split into training (80%), validation (10%) and test (10%) subsets, with 11,853, 1482 and 1482 hand targets, respectively. The quality of a detector d_0 is measured by the Probability of Correct Keypoint (PCK), defined as the proportion of keypoints whose normalized distance to the corresponding ground-truth location is below a threshold σ. For a particular keypoint p, this paper approximates PCK_σ^p(d_0) [7] on a test set T as

PCK_σ^p(d_0) = (1/|T|) Σ_{f∈T} δ(‖x_f^p − y_f^p‖ / L_f < σ),

where x_f^p is the predicted location of the p-th joint in frame f, y_f^p is its true location, L_f is the normalization length, δ(·) is an indicator function, and σ is the threshold. Because the dataset does not explicitly provide the hand size, we normalize with respect to the side length of the tightest hand bounding box, and report mean PCK (mPCK) over the thresholds σ = {0.04, 0.06, 0.08, 0.10, 0.12}.
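The PCK and mPCK metrics can be computed directly; the sketch below assumes arrays of predicted and ground-truth keypoints plus per-image bounding-box side lengths (the function and argument names are ours):

```python
import numpy as np

def pck(pred, gt, box_sizes, sigma):
    """Fraction of keypoints whose normalized distance to ground truth is
    below sigma. pred, gt: (N, 21, 2) arrays of keypoint coordinates;
    box_sizes: (N,) side lengths of the tightest hand bounding boxes."""
    d = np.linalg.norm(pred - gt, axis=-1) / box_sizes[:, None]
    return float((d < sigma).mean())

def mpck(pred, gt, box_sizes, sigmas=(0.04, 0.06, 0.08, 0.10, 0.12)):
    """Mean PCK over the thresholds used in this paper."""
    return float(np.mean([pck(pred, gt, box_sizes, s) for s in sigmas]))
```

For example, predictions off by 5 pixels on a 100-pixel hand box have normalized error 0.05, so they count as correct at σ = 0.06 but not at σ = 0.04.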

Train and test settings
The model was trained on the training set and tested on the validation and test sets. The experiments ran on Windows 10 with a P106-100 graphics card with 6 GB of memory and an Intel(R) Core(TM) i5-4460 CPU @ 3.20 GHz. The software environment is Python 3.7 + PyTorch 1.2.0 + CUDA 10.0 + cuDNN 7.4.0.
In this study, the Adam optimizer was used to train the model; we set λ1 to 0.1 and λ2 to 0.02, and the learning rate decays by a factor of 0.1 every 20 epochs. The batch size is set to 8, the total number of training epochs to 80, the initial learning rate to 1e−4, and the other parameters to their default values.
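These settings map directly onto a standard PyTorch training setup; the sketch below uses a placeholder module in place of the improved NSRM model and assumes the 0.1 decay applies to the learning rate:

```python
import torch

# Sketch of the training configuration described above (the Conv2d is a
# placeholder; the real model is the improved NSRM network).
model = torch.nn.Conv2d(3, 32, 3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # initial lr 1e-4
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(80):                  # 80 training epochs
    # ... iterate over batches of size 8, compute the loss, backprop ...
    optimizer.step()                     # placeholder parameter update
    scheduler.step()                     # lr decays by 0.1 every 20 epochs
```

After 80 epochs the learning rate has been decayed four times, ending at 1e−8.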

Comparison with NSRM
To compare the effectiveness of the original and improved NSRM networks in detecting hand keypoints, eight hand pictures in different states were selected from the test set, and the results are shown in Fig. 9.
Figure 9a, c, e, g, i, k, m, o shows the detection results of the original NSRM network, and Fig. 9b, d, f, h, j, l, n, p shows those of the improved NSRM network. There are 4 keypoints for each finger and 1 keypoint at the wrist, for a total of 21 keypoints. In Fig. 9a, b, a clenched right fist is detected; by comparison, some keypoints of the thumb and middle finger are not accurately detected in Fig. 9a, while in Fig. 9b every keypoint of the hand is detected well by the improved NSRM network. In Fig. 9c, d, a right hand with one finger exposed is detected. Both networks roughly infer the locations of the 21 keypoints, but the predicted locations of the occluded keypoints still differ slightly from the true locations. In Fig. 9e, f, a right hand with two fingers exposed is detected, and both networks roughly infer the 21 keypoints. In Fig. 9g, h, the right hand is obscured by the head, and only the three exposed fingers can be detected. In Fig. 9i, j, a right hand with four fingers exposed is detected, and the greatest ambiguity in these two images arises at the occluded keypoints.
The improved NSRM model was evaluated on the validation and test sets. The performance of the improved NSRM model and other state-of-the-art models on the test set is listed in Table 1, and the PCK curves of the different networks are shown in Fig. 10.
The PCK of the improved NSRM network is higher than that of the original network and the other networks at all thresholds. At a threshold of 0.04, the PCK of the improved NSRM network is 6.95% higher than that of NSRM; its average PCK is 2.71% higher, and it has 25M fewer parameters than NSRM, indicating that the improved NSRM network outperforms the other models.

Picture test in real scene
To test the generalization ability of the improved NSRM network, eight images from real scenes were selected for hand keypoint detection; the results are shown in Fig. 11. Figure 11a shows the back of the hand with the five fingers open. Figure 11b shows the front of the hand with the five fingers open. Figure 11c shows the index finger extended and the other fingers bent. Figure 11d shows the thumb extended and the other four fingers half closed. Figure 11e shows the back of the hand with the thumb and index finger straight and the other three fingers half closed. Figure 11f shows a hand holding a cup, with the thumb occluded. Figure 11g shows the back of the hand with the thumb half occluded, the index finger straight, and the other three fingers half closed. Figure 11h shows the side of the hand with the fingers slightly open and slightly occluded. The improved NSRM network identifies the approximate positions of the hand keypoints in all eight images.

Conclusion
Human hand joints exhibit high flexibility, local similarity and serious occlusion, all of which strongly affect hand pose estimation. To adapt to complex hand postures and establish the structural relationships between hand joints, this study replaced the backbone of the NSRM network with NHRNet and reduced the input and output channels of some convolution layers, achieving more accurate and faster hand pose estimation. On the CMU Panoptic Hand dataset, the PCK of the improved NSRM model is higher than that of the other networks at all thresholds. Compared with the NSRM network, PCK increases by 6.95% at a threshold of 0.04, the average PCK increases by 2.71%, and the number of parameters is reduced. The experiments on the test set and in real scenes also show that the improved NSRM network can identify hand keypoints in different states. Therefore, the improved NSRM is an excellent hand pose estimation model.