### 2.1 Visual fingerprint displacement detection method

As stated before, the crowdsourced image is sent back to the server. For a Region of Interest (RoI) rectangle of size \(w\times h\) in the image, \(k \times k\) bins are formed, each of size approximately \(\frac{w\times h}{k^2}\). In the last convolutional layer, \(k^2\) score maps are produced for each category, and the pooling scheme is defined as

$$\begin{aligned} r_c(i,j|\Theta )=\sum _{(x,y)\in bin(i,j)}z_{i,j,c}(x+x_0,y+y_0|\Theta )/n, \end{aligned}$$

(1)

where \(r_c(i,j)\) is the response in the (*i*, *j*)th bin for the *c*th category, \(z_{i,j,c}\) is one of the \(k^2(c+1)\) score maps, \((x_0,y_0)\) is the top-left corner of the RoI, *n* is the number of pixels in the bin, and \(\Theta\) denotes all parameters of the R-FCN. Then, the \(k^2\) scores vote by averaging over the RoI, and a \((c+1)\)-dimensional vector is generated as \(r_c(\Theta )=\sum _{i,j}r_c(i,j|\Theta )\). Finally, the softmax responses are computed by \(s_c(\Theta )=e^{r_c(\Theta )}/\sum ^{c}_{i=0}e^{r_{c_i}(\Theta )}\). Note that the output of R-FCN for each RoI is the particular category \(c_i\) and its bounding box \({\mathbf{S}}_i=(s_x,s_y,s_w,s_h)\). In general, the output from the pretrained R-FCN can be expressed as

$$\begin{aligned} {\mathbf{S}}=\{{{\mathbf{S}}}_1,{\mathbf{S}}_2,\dots ,{\mathbf{S}}_j,\dots ,{\mathbf{S}}_m\}, \end{aligned}$$

(2)

where \({\mathbf{S}}_j\) is the *j*th semantic region and *m* is the total number of semantic regions labeled by R-FCN.
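The pooling, voting, and softmax steps around Eq. (1) can be illustrated with a minimal pure-Python sketch. It is not the R-FCN implementation: the nested-list layout of `score_maps`, the integer bin boundaries, and the function names are assumptions made for illustration, and the RoI is assumed to be at least \(k\) pixels wide and high so that no bin is empty.

```python
import math

def psroi_pool(score_maps, roi, k):
    """Position-sensitive average pooling for one category (Eq. 1).

    score_maps[i][j] is the 2D score map (list of rows) dedicated to bin (i, j);
    roi = (x0, y0, w, h) is given in score-map pixels (assumed layout).
    """
    x0, y0, w, h = roi
    pooled = [[0.0] * k for _ in range(k)]
    for i in range(k):          # bin row
        for j in range(k):      # bin column
            # Integer bin boundaries, relative to the top-left corner of the RoI.
            y_lo, y_hi = (i * h) // k, ((i + 1) * h) // k
            x_lo, x_hi = (j * w) // k, ((j + 1) * w) // k
            values = [score_maps[i][j][y0 + y][x0 + x]
                      for y in range(y_lo, y_hi)
                      for x in range(x_lo, x_hi)]
            pooled[i][j] = sum(values) / len(values)   # division by n in Eq. (1)
    return pooled

def vote(pooled):
    """Combine the k*k bin responses into one score, summed as in r_c(Theta).

    Dividing the result by k*k would give the average vote instead.
    """
    return sum(sum(row) for row in pooled)

def softmax(responses):
    """Numerically stable softmax over the per-category responses."""
    m = max(responses)
    exps = [math.exp(r - m) for r in responses]
    total = sum(exps)
    return [e / total for e in exps]
```

Calling `psroi_pool` once per category, passing each result through `vote`, and applying `softmax` to the resulting list gives one probability per category for the RoI.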

Then, the SURF descriptors are extracted, which are expressed as

$$\begin{aligned} \mathbf {F}=\{F_1,F_2,\dots ,F_i,\dots ,F_n\}, \end{aligned}$$

(3)

where *n* is the number of SURF descriptors and \(F_i\) is a descriptor vector. With the help of the descriptors, 2D features and 3D points can be mapped uniquely by [34]. Typically, the 2D image coordinates and the 3D point world coordinates serve as the coefficients of our proposed method after semantic segmentation. The pixel coordinates of each SURF descriptor make it easy to judge which \({\mathbf{S}}_j\) it belongs to. This is described as

$$\begin{aligned} {F}_{s_j}\in {\mathbf{S}}_j, \end{aligned}$$

(4)

where \({F}_{s_j} \subset \mathbf {F}\). Then, RANSAC is used to refine the 2D-2D matches between the crowdsourced image and the reference image. The remaining SURF descriptors, together with their semantic labels, can be used to check the displacement of each object. We define the number of matched SURF descriptors in each semantic region before RANSAC as \(n^{{\mathbf{S}}_{j}}_{\mathrm{match}}\), and the number remaining after RANSAC as \(n^{{\mathbf{S}}_{j}}_{\mathrm{filter}}\). Typically, when the ratio \(\phi _{{\mathbf{S}}_j}=n^{{\mathbf{S}}_{j}}_{\mathrm{filter}}/n^{{\mathbf{S}}_{j}}_{\mathrm{match}}\) is lower than a threshold \(\phi _{\mathrm{thr}}\), it can be assumed that the corresponding semantic object is displaced compared with the reference. Furthermore, considering the randomness of RANSAC, *m* groups of PnP equations are established, one for each semantic region. Using the EPnP [38] algorithm, the localization result of the crowdsourced user can be recalculated, where \(\{{\mathbf {R}}_{\mathrm{filter}}, {\mathbf {t}}_{\mathrm{filter}}\}_{\mathrm{RANSAC}}\) is the solution obtained from all the RANSAC-filtered 2D-3D correspondences. When the number of correspondences in a region is too small to obtain an accurate solution, the reprojection error, which is a gold standard for judging the correctness of a solution, is calculated instead of solving the EPnP equations. This second check corrects misjudgments of the first step as far as possible.
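The ratio test can be sketched as follows. This is only an illustration under stated assumptions: the bounding box \((s_x,s_y,s_w,s_h)\) is read as top-left corner plus width and height, a feature lying in overlapping boxes is assigned to the first one, the RANSAC inlier mask is taken as given, and the function names and the example threshold are hypothetical.

```python
def region_of(point, regions):
    """Return the index of the first box (s_x, s_y, s_w, s_h) containing point, else None."""
    u, v = point
    for j, (sx, sy, sw, sh) in enumerate(regions):
        if sx <= u < sx + sw and sy <= v < sy + sh:
            return j
    return None

def displaced_by_ratio(points, inlier_mask, regions, phi_thr):
    """Compute phi = n_filter / n_match per region and flag regions with phi < phi_thr.

    points      : pixel coordinates of the matched features (before RANSAC)
    inlier_mask : True where the match survived RANSAC (assumed to be given)
    """
    n_match = [0] * len(regions)
    n_filter = [0] * len(regions)
    for point, is_inlier in zip(points, inlier_mask):
        j = region_of(point, regions)
        if j is None:
            continue                      # feature outside every semantic region
        n_match[j] += 1
        n_filter[j] += int(is_inlier)
    # Regions without any match get no ratio instead of a division by zero.
    ratios = [nf / nm if nm else None for nf, nm in zip(n_filter, n_match)]
    flagged = [j for j, phi in enumerate(ratios)
               if phi is not None and phi < phi_thr]
    return ratios, flagged

# Two regions with four matches each; the second loses most matches in RANSAC.
regions = [(0, 0, 10, 10), (20, 0, 10, 10)]
points = [(1, 1), (2, 2), (3, 3), (4, 4), (21, 1), (22, 2), (23, 3), (24, 4), (50, 50)]
mask = [True, True, True, False, False, False, True, False, True]
ratios, flagged = displaced_by_ratio(points, mask, regions, phi_thr=0.5)
```

In this toy input the first region keeps three of four matches and the second keeps one of four, so only the second region falls below the illustrative threshold of 0.5.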

Finally, the displaced object can be judged by

$$\begin{aligned} \!\! {\mathbf{S}}^{\mathrm{disp}} \!\!= & {} \!\! \{{\mathbf{S}}_i | \phi _{{\mathbf{S}}_i}\!<\!\phi _{\mathrm{thr}} \} \!\cup \! \{{\mathbf{S}}_i | \epsilon _{\mathrm{reproj}}^{{\mathbf{S}}_i}\!<\!\epsilon _{\mathrm{reproj}}^{\mathrm{thr}} \}, n \!\le \! 5 \end{aligned}$$

(5a)

$$\begin{aligned} \!\! {\mathbf{S}}^{\mathrm{disp}} \!\!= & {} \!\! \{{\mathbf{S}}_i | \phi _{{\mathbf{S}}_i}\!<\!\phi _{\mathrm{thr}} \} \!\cup \! \{{\mathbf{S}}_i | \epsilon _{\mathrm{diff}}^{{\mathbf{S}}_i}\!<\!\epsilon _{\mathrm{diff}}^{\mathrm{thr}} \}, n \!>\! 5, \end{aligned}$$

(5b)

where \({{\mathbf{S}}}^{\mathrm{disp}}\) is the set of displaced semantic regions in the image plane, \(\phi _{{\mathbf{S}}_i}\) is the ratio for the semantic region \({\mathbf{S}}_i\), \(\phi _{\mathrm{thr}}\) is its threshold, \(\epsilon _{\mathrm{reproj}}^{{\mathbf{S}}_i}\) is the reprojection error of the correspondences labeled by \({\mathbf{S}}_i\), and \(\epsilon _{\mathrm{reproj}}^{\mathrm{thr}}\) is the reprojection error threshold. \(\epsilon _{\mathrm{diff}}^{{\mathbf{S}}_i}\) is the difference between \(\{{\mathbf {R}}_{\mathrm{filter}}, {\mathbf {t}}_{\mathrm{filter}}\}_{{\mathbf{S}}_i}\) and \(\{{\mathbf {R}}_{\mathrm{filter}}, {\mathbf {t}}_{\mathrm{filter}}\}_{\mathrm{RANSAC}}\), and \(\epsilon _{\mathrm{diff}}^{\mathrm{thr}}\) is the difference threshold. Here *n* is the number of correspondences available for \({\mathbf{S}}_i\). When a semantic object is unshifted, the corresponding region is denoted as \({\mathbf{S}}^{\mathrm{unsh}}\). Consequently, the displaced objects are ranked according to their semantic regions in the image plane. The proposed method is summarized in Algorithm 1.
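As an illustration of the quantity \(\epsilon _{\mathrm{reproj}}^{{\mathbf{S}}_i}\), the sketch below computes a mean reprojection error for the correspondences of one region under a given pose. It assumes a pinhole camera with a single focal length and no distortion, and it takes the mean pixel distance as the error measure; the source does not state which norm or aggregation is used, so that choice and the function names are assumptions.

```python
import math

def project(K, R, t, X):
    """Pinhole projection of a 3D world point X to pixel coordinates (u, v)."""
    # Camera-frame coordinates P = R X + t.
    P = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    # K is assumed to hold one focal length f and the principal point (u0, v0).
    f, u0, v0 = K[0][0], K[0][2], K[1][2]
    return (f * P[0] / P[2] + u0, f * P[1] / P[2] + v0)

def mean_reprojection_error(K, R, t, correspondences):
    """Mean pixel distance between observed (u, v) and the projected 3D points.

    correspondences is a list of ((u, v), (x, y, z)) pairs for one region.
    """
    errors = []
    for (u, v), X in correspondences:
        pu, pv = project(K, R, t, X)
        errors.append(math.hypot(pu - u, pv - v))
    return sum(errors) / len(errors)

# Toy example: identity rotation, zero translation, two points 5 units ahead.
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
# Each observation is offset by (3, 4) pixels from the exact projection.
observed = [((323.0, 244.0), (0.0, 0.0, 5.0)), ((423.0, 244.0), (1.0, 0.0, 5.0))]
error = mean_reprojection_error(K, R, t, observed)
```

The resulting value is what would be compared against \(\epsilon _{\mathrm{reproj}}^{\mathrm{thr}}\) for regions with few correspondences.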

### 2.2 Visual fingerprint relocation method

In the previous subsection, the displaced objects have been identified, as well as the unshifted ones. In this subsection, the translation vector of the displaced object is calculated to refresh the visual fingerprint database. First of all, a benchmark method is presented. Although it can be derived very simply from the *PnP* equation, we give a brief derivation here for comparison with our proposed method. Another reason is that it is the first proposal for solving the visual fingerprint refreshing problem. The semantic 2D-3D correspondences can be clustered into two categories, described as two sets \({\mathbf{T}}^{\mathrm{disp}}=\{u_d,v_d,x_d,y_d,z_d\}\) and \({\mathbf{T}}^{\mathrm{fix}}=\{u_f,v_f,x_f,y_f,z_f\}\). \({\mathbf{T}}^{\mathrm{disp}}\) represents the set of displaced 2D-3D correspondences, while \({\mathbf{T}}^{\mathrm{fix}}\) is the set of fixed ones. The rotation matrix and translation vector of the crowdsourced query image can be calculated by EPnP [38] with the coefficients from \({\mathbf{T}}^{\mathrm{fix}}\), and are denoted as \(\mathbf {R}\) and \({\mathbf {t}}=[t_1, t_2, t_3]^{\mathrm{T}}\). The components of the relative translation vector of the displaced object are defined as \(d_x,d_y,d_z\), respectively. Then, according to the PnP equation, without considering the rotation of the displaced object, we have

$$\begin{aligned} {\lambda }_i\begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix}= \mathbf {K} ({\mathbf {R}}\begin{bmatrix} x_i+d_x \\ y_i+d_y \\ z_i+d_z \end{bmatrix}+\begin{bmatrix} t_1 \\ t_2 \\ t_3 \end{bmatrix}), \end{aligned}$$

(6)

where \(u_i\), \(v_i\) are the coordinates of the *i*th 2D feature, \(\lambda _i\) is its depth factor, \(x_i\), \(y_i\), \(z_i\) are the coordinates of its corresponding 3D point, and \({\mathbf {K}}\) is the camera intrinsic matrix of the crowdsourced user. The intrinsic matrix can be roughly recovered from the crowdsourced image, or sent accurately by the user as part of the feedback data. \({\mathbf {K}}\) can further be represented as

$$\begin{aligned} \mathbf {K}=\begin{bmatrix} f &{}0 &{} u_0 \\ 0 &{}f &{} v_0 \\ 0 &{}0 &{}1 \end{bmatrix}, \end{aligned}$$

(7)

where *f* is the focal length and \(u_0,v_0\) are the coordinates of the principal point of the image. Note that the coefficients of (6) are from the set \({\mathbf{T}}^{\mathrm{disp}}\). For a tuple of \({\mathbf{T}}^{\mathrm{disp}}\), (6) can be transformed into three coupled linear equations with four unknowns. A natural idea is to eliminate one unknown before solving the equations. When \(\lambda _i\) is eliminated, we have

$$\begin{aligned} \underbrace{\mathbf {A}^{i}}_{2 \times 3}\begin{bmatrix} d_x \\ d_y \\ d_z \end{bmatrix}=\underbrace{\mathbf {B}^{i}}_{2 \times 1}, \end{aligned}$$

(8)

where \(\mathbf {A}^i_{11}=fr_{11}+(u_0-u_i)r_{31}\), \(\mathbf {A}^i_{12}=fr_{12}+(u_0-u_i)r_{32}\), \(\mathbf {A}^i_{13}=fr_{13}+(u_0-u_i)r_{33}\), \(\mathbf {A}^i_{21}=fr_{21}+(v_0-v_i)r_{31}\), \(\mathbf {A}^i_{22}=fr_{22}+(v_0-v_i)r_{32}\), \(\mathbf {A}^i_{23}=fr_{23}+(v_0-v_i)r_{33}\), \(\mathbf {B}^i_1=(u_i-u_0)({\mathbf {R}}^3\mathbf {X}_i+t_3)-f({\mathbf {R}}^1\mathbf {X}_i+t_1)\), \(\mathbf {B}^i_2=(v_i-v_0)({\mathbf {R}}^3\mathbf {X}_i+t_3)-f({\mathbf {R}}^2\mathbf {X}_i+t_2)\). Note that \(\mathbf {X}_i=[x_i,y_i,z_i]^{\mathrm{T}}\), and that \(r_{ij}\) and \({\mathbf {R}}^i\) are the elements and the *i*th row vector of the rotation matrix \({\mathbf {R}}\), respectively.
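The elimination step can be written out explicitly. With \(\mathbf {X}_i=[x_i,y_i,z_i]^{\mathrm{T}}\), \(\mathbf {d}=[d_x,d_y,d_z]^{\mathrm{T}}\), and \({\mathbf {K}}\) as in (7), the three rows of (6) read

$$\begin{aligned} \lambda _i u_i&=f\left( {\mathbf {R}}^1(\mathbf {X}_i+\mathbf {d})+t_1\right) +u_0\left( {\mathbf {R}}^3(\mathbf {X}_i+\mathbf {d})+t_3\right) , \\ \lambda _i v_i&=f\left( {\mathbf {R}}^2(\mathbf {X}_i+\mathbf {d})+t_2\right) +v_0\left( {\mathbf {R}}^3(\mathbf {X}_i+\mathbf {d})+t_3\right) , \\ \lambda _i&={\mathbf {R}}^3(\mathbf {X}_i+\mathbf {d})+t_3. \end{aligned}$$

Substituting the third row into the first two and collecting the terms in \(\mathbf {d}\) on the left-hand side gives

$$\begin{aligned} \left( f{\mathbf {R}}^1+(u_0-u_i){\mathbf {R}}^3\right) \mathbf {d}&=(u_i-u_0)({\mathbf {R}}^3\mathbf {X}_i+t_3)-f({\mathbf {R}}^1\mathbf {X}_i+t_1), \\ \left( f{\mathbf {R}}^2+(v_0-v_i){\mathbf {R}}^3\right) \mathbf {d}&=(v_i-v_0)({\mathbf {R}}^3\mathbf {X}_i+t_3)-f({\mathbf {R}}^2\mathbf {X}_i+t_2), \end{aligned}$$

whose left-hand coefficients are the two rows of \(\mathbf {A}^i\) and whose right-hand sides are the two entries of \(\mathbf {B}^i\).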

It is clear that each tuple of coefficients from \({\mathbf{T}}^{\mathrm{disp}}\) yields two equations. Therefore, when there are *n* tuples in the set \({\mathbf{T}}^{\mathrm{disp}}\), \(2\times n\) linear equations are generated. In matrix form, they can be expressed compactly as

$$\begin{aligned} \mathbf {A}\mathbf {d}=\mathbf {B}, \end{aligned}$$

(9)

where \(\mathbf {d}=[d_x, d_y, d_z]^{\mathrm{T}}\) is the unknown vector. A direct least-squares solution can be obtained as

$$\begin{aligned} \mathbf {d}=(\mathbf {A}^{\mathrm{T}}\mathbf {A})^{-1}\mathbf {A}^{\mathrm{T}}\mathbf {B}. \end{aligned}$$

(10)

Finally, we have

$$\begin{aligned} \mathbf {x}_s^r=\mathbf {x}_s+\mathbf {d}, \end{aligned}$$

(11)

where \(\mathbf {x}_s^r\) is the refreshed location of the displaced visual fingerprint and \(\mathbf {x}_s\) is the original one in the database. This is defined as our benchmark, which we call the DLS method.
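A minimal pure-Python sketch of the DLS benchmark is given below. It stacks the two equations per correspondence obtained by eliminating \(\lambda _i\) from (6) and solves the normal equations of (10). The function names, the input layout, and the small Gaussian-elimination helper are assumptions made for illustration; at least two correspondences with distinct image coordinates are assumed so that the \(3\times 3\) system is solvable.

```python
def solve3(M, b):
    """Solve a 3x3 linear system M x = b by Gaussian elimination with partial pivoting."""
    aug = [list(M[r]) + [b[r]] for r in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, 3):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, 4):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):                      # back substitution
        x[r] = (aug[r][3] - sum(aug[r][c] * x[c] for c in range(r + 1, 3))) / aug[r][r]
    return x

def dls_translation(f, u0, v0, R, t, correspondences):
    """Estimate d = [d_x, d_y, d_z] from displaced 2D-3D correspondences.

    correspondences is a list of ((u, v), (x, y, z)) pairs, where (x, y, z)
    is the stored (not yet refreshed) 3D point and (u, v) its new observation.
    """
    A, B = [], []
    for (u, v), X in correspondences:
        # Camera-frame coordinates of the stored point: P = R X + t.
        P = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
        A.append([f * R[0][c] + (u0 - u) * R[2][c] for c in range(3)])
        B.append((u - u0) * P[2] - f * P[0])
        A.append([f * R[1][c] + (v0 - v) * R[2][c] for c in range(3)])
        B.append((v - v0) * P[2] - f * P[1])
    # Normal equations (A^T A) d = A^T B, as in Eq. (10).
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(3)] for i in range(3)]
    AtB = [sum(row[i] * b for row, b in zip(A, B)) for i in range(3)]
    return solve3(AtA, AtB)
```

Adding the returned vector to each stored 3D point of the displaced object then gives the refreshed location of (11). Solving the normal equations directly is kept here because it mirrors (10); for ill-conditioned inputs a QR- or SVD-based least-squares solver would be the more robust choice.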

Our proposed method is derived from the PnP equation, which is

$$\begin{aligned} {\lambda }_i\begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix}= \mathbf {K} ({\mathbf {R}}\begin{bmatrix} x^i\\ y^i \\ z^i \end{bmatrix}+\mathbf {t}). \end{aligned}$$

(12)

Note that the point coordinates \([x^i, y^i, z^i]^{\mathrm{T}}\) are the projection of the image coordinates \([u_i,v_i]\) in the crowdsourced query image, determined by the depth \(\lambda _i\), rotation matrix \(\mathbf {R}\), and translation vector \({\mathbf {t}}\). Different from Eq. (6), the goal is to calculate the point coordinates of each semantic feature in the world coordinate system. Eq. (12) provides 3 equations with 4 unknowns, which admits infinitely many solutions. Thus, we formulate a minimization problem to select a solution, which is

$$\begin{aligned} \mathop {\min } \,\,\,\sum _{j,k,j\ne k}^{3}{( \lambda _i^j-\lambda _i^k)}^2, \end{aligned}$$

(13)

where \(\lambda _i^j\) denotes the *j*th expression of \(\lambda _i\) from Eq. (12). Typically, (13) is a quadratic programming problem. The optimal solution can be obtained by setting the first derivative of (13) to zero. The final simplified form is a set of 3 first-degree equations, which is easy to compute. With *n* tuples of projected point coordinates, their center can be obtained as \(\mathbf {c}^p\), while the original center of the displaced object is \(\mathbf {c}^r\). The translation vector of the center coordinate, \(\mathbf {c}\), is expressed as

$$\begin{aligned} \mathbf {c}=\mathbf {c}^p-\mathbf {c}^r. \end{aligned}$$

(14)

Generally, this value is used as an index to measure the accuracy of the method, since \(\mathbf {c}\) is predefined in the simulation. The proposed method is summarized in Algorithm 2.
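The final step, Eq. (14), can be sketched as follows. The code does not reproduce the minimization of (13): it assumes the depth \(\lambda _i\) of each feature is already available and simply inverts Eq. (12) to obtain the point in world coordinates, then takes the difference of the two centers. The function names and the input layout are hypothetical, and \({\mathbf {K}}\) is assumed to have the form of (7).

```python
def back_project(f, u0, v0, R, t, uv, depth):
    """Invert Eq. (12) for one pixel: X = R^T (depth * K^{-1} [u, v, 1]^T - t)."""
    u, v = uv
    # depth * K^{-1} [u, v, 1]^T for K with focal length f and principal point (u0, v0).
    P = [depth * (u - u0) / f, depth * (v - v0) / f, depth]
    diff = [P[r] - t[r] for r in range(3)]
    # R is a rotation matrix, so its inverse is its transpose.
    return [sum(R[r][c] * diff[r] for r in range(3)) for c in range(3)]

def centroid(points):
    """Arithmetic mean of a list of 3D points."""
    n = len(points)
    return [sum(p[c] for p in points) / n for c in range(3)]

def center_translation(projected_points, reference_points):
    """Translation of the object center, c = c^p - c^r (Eq. 14)."""
    c_p = centroid(projected_points)
    c_r = centroid(reference_points)
    return [c_p[c] - c_r[c] for c in range(3)]
```

In a simulation where the object is shifted by a known vector, comparing the output of `center_translation` with that vector gives the accuracy index described above.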