Among the different structured-light profilometry principles, Fourier transform profilometry (FTP) has the lowest computational complexity. The deformed structured-light pattern I(u, v) captured by the camera can be written as:
$$\begin{aligned} I(u,v)=a(u,v)+b(u,v)\cos [\varphi (u,v)+2\pi f_{0}u] \end{aligned}$$
(1)
where a(u, v) is the background light intensity at pixel (u, v), b(u, v) is the amplitude of the structured-light pattern, \(f_0\) is the fundamental frequency of the striped structured light, and \(\varphi (u, v)\) is the phase modulated by the surface height h(u, v). The height h(u, v) is then expressed as:
$$\begin{aligned} h(u,v)=\frac{ l_{0}\varphi (u,v)}{2\pi f_{0}d} \end{aligned}$$
(2)
where d represents the central distance between the camera and the projector, and \(l_0\) represents the distance between the reference plane and the camera; both are geometric parameters of the structured-light device.
Converting the trigonometric function in Eq. 1 into exponential form and letting \(c(u, v)=\frac{1}{2}b(u,v)\exp(i\varphi (u,v))\), the fundamental component \(c(u,v)\exp (i2\pi f_{0}u)\) can be isolated by band-pass filtering the spectrum around \(f_0\) and applying the inverse Fourier transform; the phase \(\varphi (u,v)\) modulated by the measured surface can then be expressed as:
$$\begin{aligned} \varphi (u,v)=\arctan \frac{Im[c(u,v)\exp (i2\pi f_{0}u)]}{Re[c(u,v)\exp (i2\pi f_{0}u)]} \end{aligned}$$
(3)
The phase \(\varphi (u,v)\) calculated by Eq. 3 is wrapped into the range \((-\pi , \pi )\); after phase unwrapping, the final height map h(u, v) is obtained from Eq. 2 [32].
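For concreteness, a minimal numerical sketch of this FTP pipeline (Eqs. 1-3) is given below, assuming the fringe carrier runs along the image columns; the band-pass half-width bw and all parameter names are illustrative choices, not part of the method above.

```python
# A minimal numerical sketch of the FTP pipeline (Eqs. 1-3), assuming the
# fringe carrier runs along the image columns. `f0` is in cycles per pixel;
# the band-pass half-width `bw` is an illustrative choice.
import numpy as np

def ftp_height(I, f0, l0, d, bw=0.5):
    """Recover the height map h(u, v) from one deformed fringe image I."""
    _, W = I.shape
    F = np.fft.fft(I, axis=1)                 # spectrum along fringe direction
    freqs = np.fft.fftfreq(W)                 # cycles per pixel
    # Band-pass filter: keep only the fundamental component around +f0,
    # rejecting the DC term a(u, v) and the conjugate component at -f0.
    mask = np.abs(freqs - f0) < bw * f0
    c_carrier = np.fft.ifft(F * mask[None, :], axis=1)   # c(u,v)*exp(i2*pi*f0*u)
    u = np.arange(W)[None, :]
    wrapped = np.angle(c_carrier) - 2 * np.pi * f0 * u   # remove the carrier
    phi = np.unwrap(wrapped, axis=1)          # phase unwrapping
    return l0 * phi / (2 * np.pi * f0 * d)    # height via Eq. 2
```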
3.1 Global feature extraction of structured-light pattern
At present, most structured-light measurement algorithms based on deep neural networks adopt an encoder–decoder framework: feature maps are extracted from the input structured-light pattern by a pre-trained network and then fed into the decoder to generate height information.
In the encoder, feature extraction must capture the global features of the structured-light pattern, especially when the measured surface contains discontinuous sections. Two approaches are currently used, contrasted in the sketch below. (1) Reducing the resolution (scale) of the convolution-layer feature maps through down-sampling operations (e.g., pooling layers) lets the network relate feature information between distant positions of the original pattern. However, the output of a convolution layer represents feature information at distinct spatial positions, and the pattern is effectively segmented into grids from which only the local features of each part are obtained. The rescaling of the pattern during encoding and decoding loses information, which reduces the accuracy of the 3D reconstruction or forces reliance on still deeper convolution and pooling operations [33]. (2) The other method is dilated convolution: in contrast to the pooling layer, which enlarges the receptive field at the cost of information loss, a dilated convolution network avoids the down-sampling operation [34]. By introducing a dilation rate, dilated convolution inserts blanks between the elements of the convolution kernel, expanding the kernel to a larger receptive field. However, its sampling is sparse: when multiple dilated convolutions are stacked in the network, the skipped pixels break the continuity of information and the correlation between feature maps, which reduces 3D reconstruction accuracy at object edges and for small-scale objects [35].
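The following Keras sketch contrasts the two strategies; the layer sizes and dilation rate are illustrative assumptions, not the configurations used in the cited works.

```python
# A Keras sketch contrasting the two receptive-field strategies; layer sizes
# and the dilation rate are illustrative, not the cited configurations.
import tensorflow as tf
from tensorflow.keras import layers

x = layers.Input(shape=(256, 256, 1))

# (1) Pooling: enlarges the receptive field but halves the resolution, so
#     information is lost and must be recovered by the decoder later.
pooled = layers.Conv2D(32, 3, padding='same', activation='relu')(x)
pooled = layers.MaxPooling2D(2)(pooled)               # 256 -> 128

# (2) Dilated convolution: a 3x3 kernel at dilation rate 3 covers a 7x7
#     area without down-sampling, but samples the input sparsely.
dilated = layers.Conv2D(32, 3, dilation_rate=3,
                        padding='same', activation='relu')(x)

print(pooled.shape, dilated.shape)  # (None, 128, 128, 32) (None, 256, 256, 32)
```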
Nowadays, most existing studies rely on deep stacks of convolution layers to extract global feature maps of the structured-light pattern, which leads to a large number of learnable parameters, long training times, and difficult deployment. Therefore, for efficient and accurate 3D reconstruction, a key step is to capture more global information with a network of limited depth [36, 37].
For global feature-map extraction, self-attention greatly improves the acquisition of large-scale interactions; its main operation is to compute a weighted average over the values of the hidden units, as sketched below. Moreover, the self-attention mechanism captures wide-range interactions without increasing the parameter count, which helps reduce the number of learnable parameters of the network model. This is significant for large-scale modeling in high-resolution structured-light profilometry [38, 39].
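The sketch below shows a minimal single-head self-attention operation: every output token is a weighted average of all value tokens, so one layer already relates arbitrarily distant positions, and its learnable parameters depend only on the channel count, not on the image size.

```python
# Minimal single-head self-attention: each output token is a weighted
# average of all value tokens. The learnable parameters (three Dense
# projections) depend only on the channel count, not on spatial extent.
import tensorflow as tf
from tensorflow.keras import layers

def self_attention(x):
    """x: (batch, tokens, channels) -> (batch, tokens, channels)."""
    c = x.shape[-1]
    q, k, v = layers.Dense(c)(x), layers.Dense(c)(x), layers.Dense(c)(x)
    scores = tf.matmul(q, k, transpose_b=True) / (float(c) ** 0.5)
    weights = tf.nn.softmax(scores, axis=-1)   # pairwise interaction weights
    return tf.matmul(weights, v)               # weighted average of values

y = self_attention(tf.random.normal((2, 64, 32)))   # e.g., 64 tokens
```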
At present, the transformer uses self-attention to acquire long-range interaction information. Compared with CNNs, the transformer requires fewer computing resources, has achieved excellent performance in NLP, image classification, and other tasks [40, 41], and has become a research hotspot in deep learning. The underlying structure of the transformer is similar to that of ResNet: it divides the image into multiple patches of a specified size, and this leads to two disadvantages. First, boundary pixels cannot use adjacent pixels outside their patch for image restoration; second, the restored image may exhibit boundary artifacts around each patch [42].
As an improved vision transformer, the swin transformer adopts a novel general-purpose architecture based on shifted windows and hierarchical representation. Compared with previous vision transformers, the swin transformer introduces the idea of locality and uses shifted windows to compute self-attention over non-overlapping patches, which also greatly reduces computing consumption [43, 44], as illustrated below.
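The following sketch illustrates the (shifted-)window partition behind this idea; the window size and cyclic-shift offset are arbitrary examples, not the swin transformer's exact defaults.

```python
# Sketch of the (shifted-)window partition: self-attention is computed
# inside each non-overlapping window, and a cyclic shift between blocks
# lets neighboring windows exchange information.
import tensorflow as tf

def window_partition(x, ws):
    """(B, H, W, C) -> (num_windows*B, ws*ws, C) token groups."""
    _, H, W, C = x.shape
    x = tf.reshape(x, (-1, H // ws, ws, W // ws, ws, C))
    x = tf.transpose(x, (0, 1, 3, 2, 4, 5))
    return tf.reshape(x, (-1, ws * ws, C))

x = tf.random.normal((1, 8, 8, 32))
windows = window_partition(x, ws=4)                 # attention runs per window
shifted = tf.roll(x, shift=(-2, -2), axis=(1, 2))   # cyclic shift for next block
print(windows.shape)                                # (4, 16, 32)
```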
3.2 Dual-path hybrid submodule
Convolution has good local perception ability but lacks long-range information interaction, so it loses the global features of the structured-light pattern. If the network relies only on deeper convolution and pooling layers to expand the receptive field, the result is a huge number of learnable parameters and over-fitting. A pure transformer or swin transformer network has an obvious advantage in global perception of the pattern, but pattern detail is lost in the division into patches [45]. In [46], a hybrid network structure is proposed that combines convolutional operations and self-attention mechanisms for enhanced representation learning, significantly improving the representation ability of the base network at comparable parameter complexity. Inspired by this, we present a dual-path hybrid submodule for feature learning with two parallel subpaths: local and global features are represented by a convolution path and a swin transformer path, respectively, and each convolution block has a corresponding parallel swin transformer block for feature interaction [47]. The diagram of the dual-path hybrid submodule is shown in Fig. 1.
In the convolution path of the dual-path hybrid submodule, the feature map \(f_i\) output by the previous submodule is transmitted directly to the convolution path for local feature extraction; this feature is also serialized by an FC (feature coupling) Down block and sent indirectly to the swin transformer path for global feature extraction. The global feature \(p_s\) output by the swin transformer is converted into 3D form \(f_u\) (\(H_j \times W_j \times C_j\)) by an FC Up block and coupled with the output feature \(f_c\) of the convolution layer by the Average layer. An UpSampling2D layer and a Dense layer follow the Average layer to keep the feature dimensions consistent with the residual information from the encoder. After the feature information and residual information are concatenated, they serve as the input \(f_j\) for the next submodule [48, 49].
In the swin transformer path, the tensor \(p_i\) output by the previous submodule and the serialized 2D feature map passed from the convolution layer are likewise coupled by the Average layer and then passed to the current swin transformer block for global feature extraction. The tensor \(p_s\) obtained from the swin transformer feeds two branches: one is coupled into the convolution path to provide global feature information, and the other is upsampled by a patch expanding layer and passed to the next submodule for further global feature representation.
The FC Down block is composed of a patch extracting layer, a patch embedding layer, and a LayerNormalization layer: the 3D feature map \(f_i\) is serialized into 2D patches by the patch extracting layer, and these patches are tokenized by the patch embedding layer so that their dimension matches the preceding \(p_i\); the LayerNormalization layer helps avoid vanishing gradients. In the FC Up block, a patch expanding 2D layer reshapes the serialized global feature \(p_s\) into 3D form; its channel dimension is then adjusted by a 1×1 convolution layer, and the result is output through the BatchNormalization layer. A hedged sketch of both blocks follows.
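The Keras sketch below follows the block descriptions above; the exact patch sizes and embedding dimensions are not specified in the text, so the arguments here are assumptions.

```python
# A hedged Keras sketch of the two coupling blocks; patch size and embedding
# dimension are not specified in the text, so the arguments are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def fc_down(f_i, patch_size, embed_dim):
    """Serialize a 3D feature map f_i into tokens for the swin path."""
    # Patch extracting + patch embedding in one strided projection.
    p = layers.Conv2D(embed_dim, patch_size, strides=patch_size)(f_i)
    p = layers.Reshape((-1, embed_dim))(p)      # (batch, tokens, embed_dim)
    return layers.LayerNormalization()(p)       # helps avoid vanishing gradients

def fc_up(p_s, h, w, channels):
    """Reshape serialized tokens p_s back into a 3D map (h, w, channels)."""
    f_u = layers.Reshape((h, w, -1))(p_s)       # patch expanding (2D)
    f_u = layers.Conv2D(channels, 1)(f_u)       # 1x1 conv adjusts channel dim
    return layers.BatchNormalization()(f_u)
```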
With this dual-path hybrid submodule, for feature maps at different scales, the convolution path and the swin transformer path extract local and global features, respectively, and the two kinds of features are strongly fused by the coupling blocks. Through the hybrid submodule, the number of neural network layers can be effectively reduced while a high-precision 3D reconstruction is obtained. One possible wiring is sketched below.
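Building on the FC block sketches above, the following is our reading of the submodule's data flow (Fig. 1), not the exact implementation; the swin transformer block is abstracted as a callable, and all shapes are assumptions.

```python
# One possible wiring of the submodule, reusing fc_down/fc_up from the
# sketch above; the swin transformer block is abstracted as a callable.
from tensorflow.keras import layers

def dual_path_submodule(f_i, p_i, skip, swin_block, h, w, c):
    # Convolution path: local features f_c.
    f_c = layers.Conv2D(c, 3, padding='same', activation='relu')(f_i)
    f_c = layers.BatchNormalization()(f_c)
    # Swin path: couple serialized conv features with the incoming tokens p_i.
    tokens = layers.Average()([p_i, fc_down(f_i, patch_size=1, embed_dim=c)])
    p_s = swin_block(tokens)                      # global features
    # Couple the global features back into the convolution path.
    fused = layers.Average()([f_c, fc_up(p_s, h, w, c)])
    fused = layers.UpSampling2D(2)(fused)         # match encoder residual scale
    fused = layers.Dense(c)(fused)
    f_j = layers.Concatenate()([fused, skip])     # input for the next submodule
    # p_s would pass through a patch expanding layer before the next submodule.
    return f_j, p_s
```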
3.3 The proposed dual-path hybrid decoder network
Based on the dual-path hybrid submodule described above, we propose a novel dual-path decoder network for single-shot structured-light profilometry, improved from the classic UNet [50]; the final network architecture is shown in Fig. 2.
There are three convolution blocks in the encoder, each consisting of two 3×3 convolution layers, two BatchNormalization layers, and a MaxPooling layer; between the convolution blocks, 2× down-sampling is performed by the MaxPooling layer. Compared with the four down-sampling convolution blocks and the bottom convolution block of UNet, the proposed network eliminates the deepest convolution block to reduce the overall model size; the global feature information is instead extracted and represented by the hybrid submodules in the decoder. Meanwhile, each convolution block also outputs residual information that skips to the decoder, which helps avoid vanishing gradients during back-propagation [51].
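A minimal sketch of one such encoder block follows; the filter count is an illustrative assumption.

```python
# A minimal sketch of one encoder convolution block as described; the
# filter count is illustrative.
from tensorflow.keras import layers

def encoder_block(x, filters):
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    skip = layers.BatchNormalization()(x)   # residual skipped to the decoder
    down = layers.MaxPooling2D(2)(skip)     # 2x down-sampling
    return down, skip
```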
The decoder is composed of four dual-path hybrid submodules in series, which mainly represent the local and global features of the structured-light pattern and rescale the feature maps in the two paths by the UpSampling layer and the patch expanding layer, respectively. Note that in the decoder, each convolution block consists of one 3×3 convolution layer and one BatchNormalization layer, while each swin transformer block is composed of two swin transformer layers.
The output layer of the model is a 1×1 convolution, and the final 3D height map is output in the form of linear regression.
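As a sketch, the output head could be written as follows; `inputs` and `decoder_output` are hypothetical stand-ins for the assembled encoder/decoder, and the MSE loss is our assumption, since training details are given elsewhere in the paper.

```python
# Sketch of the regression output head; `inputs` and `decoder_output` are
# hypothetical stand-ins for the assembled network, and the MSE loss is an
# assumption, not a detail stated in this section.
import tensorflow as tf
from tensorflow.keras import layers

height_map = layers.Conv2D(1, 1, activation='linear')(decoder_output)
model = tf.keras.Model(inputs, height_map)
model.compile(optimizer='adam', loss='mse')
```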