2.1 Principle and framework of NSRM
NSRM for 2D hand pose estimation learns keypoint representations by synthesizing hand masks. It introduces a new representation of hand joint probability: the synthetic mask is obtained directly from the keypoints, without additional data annotation. The hand model contains 21 keypoints and 20 limbs connecting them, as shown in Fig. 1.
The NSRM network structure (G1_6) is shown in Fig. 2. First, feature extraction is carried out by the backbone network to obtain the feature map of the hand. Second, the feature map is fed into the model to learn the limb structure and obtain the hand structure representation. The feature map and the structure representation are then fused. Finally, the model learns the hand posture and outputs the keypoint coordinates.
2.1.1 Representation and composition of limb mask
For limb L between any two keypoints i and j, two masks are defined: Limb Deterministic Mask (LDM) and Limb Probabilistic Mask (LPM), as shown in Fig. 3a, b.
The pixels of L are those in the fixed-width rectangle centered on the line segment \(\overline{{p_{i} p_{j} }}\) between keypoints i and j, i.e.,
$$\left\{ \begin{array}{l} 0 \le \left( p - p_{j} \right)^{T} \left( p_{i} - p_{j} \right) \le \left\| p_{i} - p_{j} \right\|_{2}^{2} \\ \left| \left( p - p_{j} \right)^{T} u^{\bot} \right| \le \sigma_{\text{LDM}} \end{array} \right.$$
(1)
where \(u^{\bot}\) is the unit vector perpendicular to \(\overline{{p_{i} p_{j} }}\), and \(\sigma_{\text{LDM}}\) is a hyperparameter that controls the limb width.
The LDM assigns a point p the value 1 if p lies in the rectangular area of limb L, and 0 otherwise, i.e.,
$$S_{\text{LDM}} \left( p\left| L \right. \right) = \left\{ \begin{array}{ll} 1 & \text{if}\; p \in L \\ 0 & \text{otherwise} \end{array} \right.$$
(2)
where p is any pixel in the image, and \(p \in L\) means that p lies within the rectangular region of limb L.
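For concreteness, the following is a minimal NumPy sketch of the LDM construction of Eqs. (1)–(2); the pixel-grid layout, the function name and the default value of \(\sigma_{\text{LDM}}\) are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def limb_deterministic_mask(p_i, p_j, height, width, sigma_ldm=5.0):
    """Binary LDM for the limb between keypoints p_i and p_j (Eqs. 1-2).

    p_i, p_j  : (2,) arrays holding (x, y) pixel coordinates of the two keypoints.
    sigma_ldm : half-width of the rectangle around segment p_i p_j (assumed default).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    p = np.stack([xs, ys], axis=-1).astype(np.float64)             # (H, W, 2) pixel grid

    d = p_i - p_j                                                   # limb direction vector
    u_perp = np.array([-d[1], d[0]]) / (np.linalg.norm(d) + 1e-8)   # unit normal to p_i p_j

    proj = (p - p_j) @ d                                            # (p - p_j)^T (p_i - p_j)
    along = (proj >= 0) & (proj <= d @ d)                           # within the segment extent
    across = np.abs((p - p_j) @ u_perp) <= sigma_ldm                # within the limb width

    return (along & across).astype(np.float32)                      # S_LDM(p | L): 1 inside, 0 outside
```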
LDM performs poorly in real scenes because of its coarse, hard assignment, so LPM is proposed to address this problem. Instead of assigning a point the absolute values 0 or 1, LPM assigns it a Gaussian score determined by the distance from pixel p to the segment \(\overline{{p_{i} p_{j} }}\) and the variance \(\sigma_{\text{LPM}}^{2}\), i.e.,
$$S_{{{\text{LPM}}}} \left( {p\left| L \right.} \right) = \exp \left( { - \frac{{D\left( {p,\overline{{p_{i} p_{j} }} } \right)}}{{2\sigma_{{{\text{LPM}}}}^{2} }}} \right)$$
(3)
where \(D\left( {p,\overline{{p_{i} p_{j} }} } \right)\) is the distance between the pixel p and the line segment \(\overline{{p_{i} p_{j} }}\), and \(\sigma_{\text{LPM}}\) is the hyperparameter that controls the spread of the Gaussian distribution.
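A corresponding sketch of the LPM of Eq. (3) is given below, assuming that \(D\) denotes the squared Euclidean distance from p to the segment (which makes Eq. (3) a Gaussian profile); the default \(\sigma_{\text{LPM}}\) is a placeholder.

```python
import numpy as np

def limb_probabilistic_mask(p_i, p_j, height, width, sigma_lpm=2.0):
    """Soft LPM for the limb between keypoints p_i and p_j (Eq. 3).

    D is taken here as the squared Euclidean distance from pixel p to the
    segment p_i p_j (an assumption, making Eq. (3) a Gaussian profile).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    p = np.stack([xs, ys], axis=-1).astype(np.float64)              # (H, W, 2) pixel grid

    d = p_i - p_j
    t = np.clip(((p - p_j) @ d) / (d @ d + 1e-8), 0.0, 1.0)          # projection onto the segment
    nearest = p_j + t[..., None] * d                                 # closest point on p_i p_j
    dist_sq = np.sum((p - nearest) ** 2, axis=-1)                    # D(p, segment)

    return np.exp(-dist_sq / (2.0 * sigma_lpm ** 2)).astype(np.float32)
```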
The mask representation can be divided into four types: LDM_G1 (Fig. 3c), LPM_G1 (Fig. 3d), LDM_G1_6 and LPM_G1_6. G1 represents the whole hand with a single mask, while G6 divides the limbs into six groups, the palm and the five fingers, as shown in Fig. 4. G1 captures the overall hand structure, whereas G6 pays more attention to local detail in the hand. G1_6 denotes the combination of the G1 and G6 versions.
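As an illustration of the grouping, the snippet below lists one possible assignment of the 20 limbs to the G1, G6 and G1_6 groups, assuming the common 21-keypoint hand layout (wrist followed by four keypoints per finger); the exact indices and the composition of the palm group are assumptions, not the paper's definition.

```python
# Illustrative limb grouping under the common 21-keypoint hand layout
# (0 = wrist, then 4 keypoints per finger); indices are assumptions.
FINGERS = {
    "thumb":  [(0, 1), (1, 2), (2, 3), (3, 4)],
    "index":  [(0, 5), (5, 6), (6, 7), (7, 8)],
    "middle": [(0, 9), (9, 10), (10, 11), (11, 12)],
    "ring":   [(0, 13), (13, 14), (14, 15), (15, 16)],
    "pinky":  [(0, 17), (17, 18), (18, 19), (19, 20)],
}

ALL_LIMBS = [limb for limbs in FINGERS.values() for limb in limbs]   # 20 limbs in total
G1 = [ALL_LIMBS]                                                     # one mask for the whole hand
G6 = [[(0, 1), (0, 5), (0, 9), (0, 13), (0, 17)]] + \
     [limbs[1:] for limbs in FINGERS.values()]                       # palm group + five finger groups
G1_6 = G1 + G6                                                       # 7 groups in total
```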
2.1.2 Loss function
The loss function of structural modules adopts cross-entropy loss, i.e.,
$$L_{\text{S}} = - \sum\limits_{t = 1}^{T_{\text{S}}} \sum\limits_{g \in G} \sum\limits_{p \in I} \left[ S^{*} \left( p\left| g \right. \right) \log \hat{S}_{t} \left( p\left| g \right. \right) + \left( 1 - S^{*} \left( p\left| g \right. \right) \right) \log \left( 1 - \hat{S}_{t} \left( p\left| g \right. \right) \right) \right]$$
(4)
where \(T_{\text{S}}\) is the number of structure-learning stages, \(\hat{S}_{t} \left( p\left| g \right. \right)\) is the prediction of the structure module at stage t, \(S^{*} \left( p\left| g \right. \right)\) is the ground-truth mask obtained by composing the limbs of group g, and G is the set of limb groups (7 groups for G1_6 and 1 for G1).
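A minimal PyTorch sketch of Eq. (4) is shown below; the tensor layout (stages, batch, groups, height, width) and the clamping constant are illustrative assumptions.

```python
import torch

def structure_loss(pred_masks, target_masks, eps=1e-6):
    """Cross-entropy structure loss of Eq. (4), summed over stages, groups,
    pixels (and the batch dimension here).

    pred_masks   : (T_S, B, G, H, W) predicted masks S^_t(p|g), values in (0, 1).
    target_masks : (B, G, H, W) composed ground-truth limb masks S*(p|g).
    """
    target = target_masks.unsqueeze(0).expand_as(pred_masks)
    pred = pred_masks.clamp(eps, 1.0 - eps)                  # avoid log(0)
    ce = -(target * torch.log(pred) + (1.0 - target) * torch.log(1.0 - pred))
    return ce.sum()
```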
The ground-truth keypoint confidence map (KCM) of keypoint k is defined as a 2D Gaussian distribution centered on the annotated keypoint position with standard deviation \(\sigma_{\text{KCM}}\), i.e.,
$$C^{*} \left( {p\left| k \right.} \right) = \exp \left\{ { - \frac{{\left\| {p - p_{k}^{*} } \right\|_{2}^{2} }}{{2\sigma_{{{\text{KCM}}}}^{2} }}} \right\}$$
(5)
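The ground-truth KCM of Eq. (5) can be generated, for example, as follows; the coordinate convention and the default \(\sigma_{\text{KCM}}\) are illustrative.

```python
import numpy as np

def keypoint_confidence_map(p_k, height, width, sigma_kcm=1.0):
    """Ground-truth KCM of Eq. (5): a 2D Gaussian centered at keypoint p_k = (x, y)."""
    ys, xs = np.mgrid[0:height, 0:width]
    dist_sq = (xs - p_k[0]) ** 2 + (ys - p_k[1]) ** 2
    return np.exp(-dist_sq / (2.0 * sigma_kcm ** 2)).astype(np.float32)
```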
The loss function of the position prediction module adopts the sum-of-squared-error loss, i.e.,
$$L_{K} = \sum\limits_{t = 1}^{T_{K}} \sum\limits_{k = 1}^{K} \sum\limits_{p \in I} \left\| C^{*} \left( p\left| k \right. \right) - \hat{C}_{t} \left( p\left| k \right. \right) \right\|_{2}^{2}$$
(6)
where \(T_{K}\) is the number of pose-learning stages, and \(\hat{C}_{t} \left( p\left| k \right. \right)\) is the prediction of the position module at pixel p for keypoint k at stage t.
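A corresponding sketch of Eq. (6), under the same assumed tensor layout as above:

```python
import torch

def keypoint_loss(pred_maps, target_maps):
    """Sum-of-squared-error keypoint loss of Eq. (6).

    pred_maps   : (T_K, B, K, H, W) predicted confidence maps C^_t(p|k).
    target_maps : (B, K, H, W) ground-truth KCMs C*(p|k).
    """
    target = target_maps.unsqueeze(0).expand_as(pred_maps)
    return ((target - pred_maps) ** 2).sum()
```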
The total loss function is:
$$L = \left\{ \begin{array}{ll} L_{K} + \lambda_{1} L_{S}^{G1} & \text{G1} \\ L_{K} + \lambda_{1} L_{S}^{G1} + \lambda_{2} L_{S}^{G6} & \text{G1\_6} \end{array} \right.$$
(7)
where \(\lambda_{1}\), \(\lambda_{2}\) are hyperparameters for controlling relative weight.
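Combining the two terms according to Eq. (7) then amounts to the following small sketch (the default \(\lambda\) values are placeholders):

```python
def total_loss(l_k, l_s_g1, l_s_g6=None, lambda1=1.0, lambda2=1.0):
    """Total loss of Eq. (7); lambda1 and lambda2 are relative-weight hyperparameters."""
    if l_s_g6 is None:                                   # G1 configuration
        return l_k + lambda1 * l_s_g1
    return l_k + lambda1 * l_s_g1 + lambda2 * l_s_g6     # G1_6 configuration
```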
2.2 Improved NSRM network
The backbone of the NSRM network based on the LPM_G1_6 mask is the classic VGG-19. The backbone architecture directly determines the strength of feature extraction, so its quality is critical. Although the VGG-19 network can extract feature information of hand keypoints, the resolution of its output feature maps is very low and the spatial structure is lost, so the extracted image features may be incomplete or invalid. Moreover, VGG-19 produces a 128-channel feature map, which increases the number of parameters of the NSRM network and consumes more computing resources. The HRNet pose estimation network connects feature maps of different resolutions in parallel instead of simply concatenating them, so the whole network maintains a high-resolution representation. Representations at different resolutions within the same stage are fused repeatedly, which enhances the capture of image information and improves prediction accuracy. Therefore, the original VGG-19 backbone is replaced by the HRNet network.
2.2.1 NHRNet model
The HRNet network consists mainly of four stages, and its structure is shown in Fig. 5. At the end of each stage, the network is connected to the next stage through an exchange unit; after the last stage, the output feature map is obtained. HRNet has four parallel sub-networks, whose resolutions are 1/4, 1/8, 1/16 and 1/32 of the input image, respectively. The network first reduces the image resolution to 1/4 of the input through stride-2 3 × 3 convolutions and then enters the first stage, which belongs to the first sub-network. The first stage contains four Bottleneck residual units with 64 output channels (represented by one unit in Fig. 5). In each Bottleneck unit, a new feature map is generated by convolution modules consisting of a convolution layer, a Batch Normalization (BN) layer and a Rectified Linear Unit (ReLU) layer, and this new feature map is fused with the input feature map to form the output. The second, third and fourth stages contain 1, 4 and 3 multi-resolution blocks, respectively, and each block contains four Basic residual units (represented by one unit in Fig. 5). In each Basic unit, the feature map is generated by two convolution modules (each with a convolution layer, a BN layer and a ReLU layer) and is fused with the input feature map to form the output.
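The stage layout described above can be summarized compactly as follows; the per-branch channel widths are the standard HRNet-W32 values and the two stem convolutions are assumptions consistent with the 1/4 input resolution, as the paper only states the 64 channels of the first stage.

```python
# Compact summary of the HRNet layout (channel widths assumed from HRNet-W32).
HRNET_CONFIG = {
    "stem":   {"convs": 2, "kernel": 3, "stride": 2},     # input -> 1/4 resolution
    "stage1": {"branches": 1, "block": "Bottleneck", "units": 4, "channels": [64]},
    "stage2": {"branches": 2, "block": "Basic", "modules": 1, "units": 4, "channels": [32, 64]},
    "stage3": {"branches": 3, "block": "Basic", "modules": 4, "units": 4, "channels": [32, 64, 128]},
    "stage4": {"branches": 4, "block": "Basic", "modules": 3, "units": 4, "channels": [32, 64, 128, 256]},
}
```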
The HRNet network performs cross-resolution feature fusion among the sub-networks through exchange units. Each newly added sub-network has half the resolution but twice the number of channels of the previous one. Cross-resolution fusion is achieved by 2× (4× or 8×) down-sampling, convolution units and 2× (4× or 8×) up-sampling. The exchange units realize repeated multi-resolution fusion, so that each high-to-low-resolution representation repeatedly receives information from the other parallel branches, and a high-resolution feature rich in information is finally obtained.
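The following is a minimal PyTorch sketch of such an exchange unit, not the official HRNet implementation: each output branch sums all input branches after bringing them to its resolution, using strided 3 × 3 convolutions for down-sampling and a 1 × 1 convolution plus interpolation for up-sampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExchangeUnit(nn.Module):
    """Sketch of HRNet cross-resolution fusion between parallel branches."""

    def __init__(self, channels):                 # e.g. channels = [32, 64, 128]
        super().__init__()
        self.transforms = nn.ModuleList()
        for i, c_out in enumerate(channels):      # i: output branch index
            row = nn.ModuleList()
            for j, c_in in enumerate(channels):   # j: input branch index
                if j == i:
                    row.append(nn.Identity())
                elif j < i:                       # higher resolution -> down-sample (i - j) times
                    convs = []
                    for k in range(i - j):
                        convs += [nn.Conv2d(c_in if k == 0 else c_out, c_out, 3, stride=2, padding=1),
                                  nn.BatchNorm2d(c_out)]
                    row.append(nn.Sequential(*convs))
                else:                             # lower resolution -> 1x1 conv, up-sample in forward()
                    row.append(nn.Sequential(nn.Conv2d(c_in, c_out, 1), nn.BatchNorm2d(c_out)))
            self.transforms.append(row)

    def forward(self, xs):                        # xs: feature maps, high to low resolution
        outs = []
        for i in range(len(xs)):
            fused = 0
            for j, x in enumerate(xs):
                y = self.transforms[i][j](x)
                if j > i:                         # bring lower resolution up to the target size
                    y = F.interpolate(y, size=xs[i].shape[-2:], mode="nearest")
                fused = fused + y
            outs.append(F.relu(fused))
        return outs
```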
By introducing an attention mechanism, the network focuses on the input information that is most relevant to the current task and suppresses irrelevant information, which improves its efficiency and accuracy. The SA module proposed in SANet (Shuffle Attention Networks) [32] is simple to implement and easy to plug into existing model frameworks. The architecture of the SA module is shown in Fig. 6. The SA module combines spatial attention and channel attention, aiming to capture pixel-level pairwise relationships and channel dependencies simultaneously. It first divides the channel dimension into groups and then processes each group in parallel. For each group, SA performs global average pooling (\(F_{{{\text{gp}}}}\)) and group normalization (GN) operations, respectively, to obtain channel-wise and spatial-wise information. All groups are then aggregated, and a channel shuffle operator enables information exchange between the groups.
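A minimal PyTorch sketch of the SA module along these lines is given below; the number of groups and the parameter initialization follow the public SA-Net design and may differ from the configuration used in this paper.

```python
import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    """Sketch of the SA module [32]: split channels into groups, apply channel
    attention (global average pooling) to one half of each group and spatial
    attention (group normalization) to the other, then channel shuffle."""

    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)                       # channels per half-group
        self.cw = nn.Parameter(torch.zeros(1, c, 1, 1))    # channel-attention weight
        self.cb = nn.Parameter(torch.ones(1, c, 1, 1))     # channel-attention bias
        self.sw = nn.Parameter(torch.zeros(1, c, 1, 1))    # spatial-attention weight
        self.sb = nn.Parameter(torch.ones(1, c, 1, 1))     # spatial-attention bias
        self.gn = nn.GroupNorm(c, c)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.reshape(b * self.groups, c // self.groups, h, w)
        xc, xs = x.chunk(2, dim=1)                         # channel / spatial branches

        # channel attention: F_gp (global average pooling) -> scale -> sigmoid gate
        attn_c = self.sigmoid(self.cw * xc.mean(dim=(2, 3), keepdim=True) + self.cb)
        xc = xc * attn_c

        # spatial attention: GN -> scale -> sigmoid gate
        attn_s = self.sigmoid(self.sw * self.gn(xs) + self.sb)
        xs = xs * attn_s

        out = torch.cat([xc, xs], dim=1).reshape(b, c, h, w)
        # channel shuffle: mix information between the groups
        out = out.reshape(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
        return out
```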
On the basis of the HRNet network, an attention branch is added by introducing the SA module after the second convolution module of the Basic residual unit. This attention mechanism makes the network pay more attention to the skeleton points of the human hand, and the resulting model is called NHRNet (New HRNet). The framework of NHRNet is shown in Fig. 7, where the orange squares represent the improved Basic modules; the details of one such module are also shown in Fig. 7. The input of the Basic module is \(x \in {\mathbb{R}}^{c \times w \times h}\), and the module adopts residual mapping: denoting the required underlying mapping by \(H(x)\), we have \(H(x) = F(x) + x\), where F is the pair of 3 × 3 convolutions used in the Basic module. After adding SA, the mapping becomes \(H(x) = SA(F(x)) + x\).
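The improved Basic unit \(H(x) = SA(F(x)) + x\) can then be sketched as follows, reusing the ShuffleAttention sketch above; the channel counts and the position of the final ReLU are assumptions.

```python
import torch
import torch.nn as nn

class BasicBlockSA(nn.Module):
    """Sketch of the improved Basic residual unit: H(x) = SA(F(x)) + x,
    where F is the usual pair of 3x3 convolutions."""

    def __init__(self, channels, groups=8):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.sa = ShuffleAttention(channels, groups)   # SA inserted after the second conv module
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.sa(self.f(x)) + x)       # residual connection around SA(F(x))
```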
2.2.2 Improved backbone network
To improve the accuracy of human hand keypoint detection, this paper improves the NSRM network by replacing the VGG-19 backbone with NHRNet. The number of backbone output channels is reduced from 128 to 32, and the size of the resulting feature map also changes with the new backbone. Therefore, the output channels of the first convolution layer in the first stage are set to 128, and from the second stage to the sixth stage the output channels of the first convolution layer and the input channels of the seventh convolution layer are set to 32, which improves network performance and reduces the number of network parameters. The improved network structure is shown in Fig. 8: the modules in front of the green arrows are those of the original NSRM network, and the modules behind the green arrows are the improved ones.
The backbone of the improved network is NHRNet, which produces a 32-channel feature map. The entire architecture has six stages; each stage contains five convolution layers with 7 × 7 kernels and two convolution layers with 1 × 1 kernels (except the first stage). The first three stages learn the synthetic mask representation, and the last three stages learn the posture representation. The improved model works as follows. All hand images are first resized to 256 × 256 and fed into the model, producing a 64 × 64 feature representation map for the mask method. As shown in Fig. 8, each stage consists of a series of convolution layers with k × k kernels; the specific input and output channel numbers and kernel sizes are given in the rectangular boxes below. The output of structure stage 3 is fed to every keypoint stage. The two kinds of stages output a tensor of size (batchsize, 3, 7, 64, 64) and a tensor of size (batchsize, 3, 21, 64, 64), respectively, representing the scores of the seven mask groups and the confidence score of each keypoint at every pixel of the image.
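As a usage illustration, one common way to turn the final keypoint-stage output into image coordinates is an argmax over each 64 × 64 confidence map followed by scaling back to the 256 × 256 input; this decoding step is an assumption and is not described above.

```python
import torch

def decode_keypoints(keypoint_out, stride=4):
    """Illustrative decoding of the final keypoint stage: take the argmax of
    each 64 x 64 confidence map and map it back to 256 x 256 image coordinates.

    keypoint_out : (batch, 3, 21, 64, 64) tensor from the three keypoint stages.
    """
    heatmaps = keypoint_out[:, -1]                          # use the last stage: (B, 21, 64, 64)
    b, k, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(b, k, -1).argmax(dim=-1)    # peak location per keypoint
    ys, xs = flat_idx // w, flat_idx % w
    return torch.stack([xs, ys], dim=-1).float() * stride   # (B, 21, 2) in input-image pixels

coords = decode_keypoints(torch.rand(2, 3, 21, 64, 64))
print(coords.shape)        # torch.Size([2, 21, 2])
```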