A visual fingerprint update algorithm based on crowdsourced localization and deep learning for smart IoV

Recently, deep learning and vision-based technologies have shown great significance for the prospective development of the smart Internet of Vehicles (IoV). When a smart vehicle enters the indoor parking area of a shopping mall, vision-based localization technology can provide a reliable parking service. Vision-based techniques rely on a visual map in which the positions of the reference objects do not change. Some researchers have proposed automatic visual fingerprinting (AVF) methods, which aim to reduce the cost of building the visual map database. However, the AVF method still costs too much once an object has been displaced, since it cannot determine the specific location of the displaced object. Given the smart IoV and the development of deep learning approaches, we propose in this paper an algorithm based on crowdsourcing and deep learning to solve this problem. Firstly, we propose a Region-based Fully Convolutional Network (R-FCN) based method that uses the feedback of crowdsourced images to locate the specific displaced object in the visual map database. Secondly, we propose a method based on quadratic programming (QP) for solving the translation vector of the displaced objects, which finally solves the problem of updating the visual map database. The simulation results show that our method provides higher detection sensitivity and higher accuracy of the corrected relocation results. Our proposed algorithm therefore outperforms the compared one, which is verified by simulations on both synthetic and real data.

greater delay. Therefore, vision-based technology is a promising solution for the positioning and navigation of smart vehicles after entering the indoor environment.
Vision-based localization technology has unique advantages, which are most evident in composite positioning technologies, for instance WiFi and vision [6], inertia and vision [7], Lidar and vision [8], and hybrid localization technology [9]. Visual localization plays a key role in each of these system architectures. The premise of accurate visual localization is to establish a visual map database of the area of interest during the offline stage. Typically, with the help of customized equipment [10][11][12], the visual fingerprint can be collected very accurately in a given area. However, these devices must be specially customized and are therefore expensive. To solve this problem, Farhang et al. proposed an automatic visual fingerprinting method based on a consumer-grade device in [13], which was further improved in our previous work [14]. These methods solve the visual fingerprinting problem well in different ways. However, some objects in the area of interest may be moved at random, which leads to localization deviation. Unfortunately, none of the existing visual fingerprinting methods can directly solve the visual fingerprint updating problem caused by displaced objects in the localization area.
In contrast, the problem of updating the WiFi radio map is well studied in its research community. The methods for updating a WiFi radio map fall into three categories. The first predicts the Received Signal Strength Indication (RSSI) fingerprint with a particular radio propagation model; [15][16][17] are examples of this type. The second leverages a deep learning framework to generate the renewed radio fingerprint by training on a large amount of time-varying RSSI, Signal-to-Noise Ratio (SNR), or Channel State Information (CSI) data, such as [18][19][20]. These two kinds of methods rely on the fingerprint changing according to fixed patterns. Unlike the radio fingerprint, the visual fingerprint has no such model for a learning algorithm to exploit. The third category is crowdsourcing. When the user is cooperative, the fingerprint used in his/her localization can be used to update the available radio map. Several methods have been proposed in this category for finding the best updated RSSI fingerprint [21][22][23][24][25]. Although these methods cannot be used directly for updating the visual fingerprint, they provide an inspiring idea.
A technology that should be mentioned in this section is the Region-based Fully Convolutional Network (R-FCN), which was proposed by Dai et al. in [26]. R-FCN provides an effective solution for image recognition, as demonstrated by [27][28][29]. We also use this framework for semantic segmentation of crowdsourced query images. From the regions generated by R-FCN, the semantic Speeded-Up Robust Features (SURF) are formed. Fortunately, there are many existing frameworks for deep learning implementation, among which Caffe is a representative one. A number of works have shown its superior performance [30][31][32]. More details about Caffe can be found in [33].
Therefore, the purpose of this paper is to propose a visual fingerprint update algorithm based on crowdsourced visual localization. More specifically, a displaced visual fingerprint detection method and a quadratic programming (QP) based visual fingerprint update method are proposed. The main contributions are as follows.
1. A visual fingerprint update algorithm based on crowdsourced visual localization is proposed. When the users are cooperative, the algorithm can automatically detect whether the objects in the localization area have been displaced relative to their locations in the visual map database.
2. A novel method is proposed for detecting the positional change of the visual fingerprints. A region-based fully convolutional network is used for labeling the SURF descriptors in the query image, which are defined as Semantic SURF. Whereas the traditional Perspective-n-Point (PnP) solver uses all 2D-3D correspondences, our method is calculated with semantic 2D-3D correspondences. Our proposed method has a higher detection ratio, which is shown by simulation results on synthetic and real data.
3. A Quadratic Programming (QP) based visual fingerprint update method is also proposed. In this way, the fingerprint can be updated automatically without the real crowdsourced localization results. The accuracy is higher than that of the compared method under different configurations.
The rest of this paper is organized as follows. In Sect. 1.2, some related works are discussed. Section 1.3 describes the visual fingerprint update problem in visual localization and presents our proposed system model. In Sect. 2, we propose our semantic SURF-based detection method and QP-based update method. Section 3 explains the experimental setup. Section 4 provides the simulation results, and the conclusion is drawn in Sect. 5.

Related works
As far as we know, this paper is the first proposal for solving the visual fingerprint update problem. Therefore, in this section, we introduce the works relevant to our proposed algorithm. Farhang et al. proposed an automatic visual fingerprinting (AVF) method for the first time in [13], which improves the efficiency of collecting the fingerprint and reduces the cost of acquisition equipment. Once an object is displaced and is captured in the query image of a crowdsourced user, the result of visual localization will deviate. At present, the only solution appears to be periodic rescanning by the AVF method to correct the database error caused by object displacement. However, this leads to two problems: how to set the optimal acquisition cycle, and the labor and time cost of a complete rescan. Visual localization is more accurate than the other methods, mainly because of its 6-degree-of-freedom mapping equations. The coefficients of these equations are defined by 2D-3D correspondences. An effective and efficient method for finding 2D-3D correspondences was proposed in [34]. The method represents a 3D point by 2D image feature descriptors when the point is generated by the Structure from Motion (SfM) technique. The mean value of the two matched 2D feature descriptors is saved for representing the 3D point, and the descriptor can be that of [35], SURF [36], or any other feature. To match a 2D image feature with a 3D point, a Fast Library for Approximate Nearest Neighbors (FLANN) [37] search or a bag-of-visual-words approach can be used. In our proposed algorithm, we also use this method to find reliable 2D-3D correspondences. More specifically, SURF and FLANN are chosen in the framework. Efficient PnP (EPnP) is the most widely used solver for PnP problems [38], and is a classical choice for its O(n) complexity when the camera internal parameters are known.
The algorithm uses linearization and re-linearization to solve for the weights of a linear combination of matrix eigenvectors derived from the 3D-2D correspondences. With these weights, the camera coordinates of the 3D points can be calculated. Then, the rotation matrix and translation vector are obtained with the help of Singular Value Decomposition (SVD) by solving a matrix maximum-trace problem. Furthermore, a more accurate result is reached by using the closed-form solution as the initial input of a Gauss-Newton scheme. Displaced objects in the indoor environment of interest may lead to localization error because of the mismatch between the 2D feature image coordinates and the 3D point world coordinates. A natural idea is to set a threshold that acts as a criterion for the 3-dimensional localization results. When the feedback of a crowdsourced user is beyond the threshold, it can be assumed that some reference objects in the scene may have moved. This method can be treated as the traditional strategy and as the benchmark of our proposal. Specifically, EPnP with all 2D-3D correspondences and a judging threshold is chosen as our benchmark. Since the only self-verifiable localization result in a large-scale environment is the vertical one, there is only one predefined threshold for judging the reliability of the localization result.
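The SVD step mentioned in this paragraph, recovering a rotation matrix and translation vector once the 3D points are known in both the world and camera frames, can be sketched as follows. This is a minimal NumPy illustration of the maximum-trace (orthogonal Procrustes) step only, not an EPnP implementation; the function name `rigid_transform_svd` and its argument layout are assumptions made for this example.

```python
import numpy as np

def rigid_transform_svd(world_pts, cam_pts):
    """Find R, t such that cam_pts ~= world_pts @ R.T + t (both arrays are n x 3)."""
    cw = world_pts.mean(axis=0)
    cc = cam_pts.mean(axis=0)
    # Cross-covariance of the centred point sets.
    H = (world_pts - cw).T @ (cam_pts - cc)
    U, _, Vt = np.linalg.svd(H)
    # Sign correction keeps det(R) = +1 (a rotation, not a reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    t = cc - R @ cw
    return R, t
```

In a full solver the result of this closed-form step would serve as the starting point of the Gauss-Newton refinement described in the text.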
RANdom SAmple Consensus (RANSAC) is well known for filtering outliers in a dataset by estimating the parameters of a mathematical model of the samples [39]. In the computer vision field, it is generally applied to refine matched pairs during the offline stage because it is time-consuming. For the algorithm proposed in this paper, the computational cost of RANSAC is not so restrictive. Therefore, it can be used for filtering mismatches, so that a corrected localization result is generated, which can also locate the displaced semantic object.
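The hypothesize-and-verify loop of RANSAC can be illustrated with a deliberately simple model. The sketch below assumes that matched keypoints differ by a pure 2D translation, which is a stand-in chosen for brevity and not the geometric model used in the paper; the function name, the tolerance `tol`, and the iteration count are likewise hypothetical.

```python
import random

def ransac_translation(src, dst, tol=2.0, iters=100, seed=0):
    """Return (best_shift, inlier_indices) for matches with dst ~= src + shift."""
    rng = random.Random(seed)
    best_shift, best_inliers = (0.0, 0.0), []
    for _ in range(iters):
        # A single match is a minimal sample for a translation model.
        k = rng.randrange(len(src))
        dx, dy = dst[k][0] - src[k][0], dst[k][1] - src[k][1]
        # Count the matches that agree with this hypothesis.
        inliers = [i for i, (s, d) in enumerate(zip(src, dst))
                   if abs(d[0] - s[0] - dx) <= tol and abs(d[1] - s[1] - dy) <= tol]
        if len(inliers) > len(best_inliers):
            best_shift, best_inliers = (dx, dy), inliers
    return best_shift, best_inliers
```

The number of iterations and the tolerance play the roles of the iteration count and maximum tolerable error that govern the running time of RANSAC.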
In recent years, Deep Learning (DL) based methods have been embedded into the indoor localization framework. One role of DL in localization is to generate new kinds of fingerprints in place of the traditional ones. Hernández et al. [40] proposed a deep learning feature for localization, which was trained by multisensor fingerprints. Ma et al. constructed hierarchical convolutional features for visual tracking in [41]. However, considering the cost of training, the improvement in localization accuracy is limited. Another application is to map the image information directly to its location through the DL network. This approach entails an obvious cost, since it needs a substantial set of training data. Although it shows some practical value in the latest research, such as [42,43], it still has no advantage from a theoretical point of view. A third application leverages DL for semantic segmentation, which provides a pixel-wise classification and is embedded into the DL-based visual localization framework to enhance performance. A representative DL-based framework is VLocNET++, proposed in [44]. Unfortunately, it cannot be used to solve the problem addressed in this paper. DL can instead be used as a fundamental tool for filtering features, which provides more accurate coefficients for solving the PnP equations. We define this as Semantic SURF, and it is a basic tool for tackling the visual fingerprint update problem. Specifically, in this paper, we propose a visual fingerprint update algorithm under the crowdsourced framework, which contains a semantic SURF-based visual fingerprint displacement detection method with the help of R-FCN, and a visual fingerprint update method. In this way, the fingerprints can be updated automatically and reliably.

System model
Once the layout of the localization region is known, the visual fingerprint is collected by the visual sensors mounted on a smart vehicle, which is labeled in green. When the vehicle collects the fingerprint of the extinguisher, the extinguisher's position is recorded in the database as the position framed by the grey box. Thereafter, the extinguisher is moved to the position of the red dotted box. When the camera labeled in red tries to locate itself from its visual sensor and the surrounding visual fingerprint, the visual localization result will deviate because of the displaced extinguisher in the query image. Meanwhile, the camera labeled in blue will obtain an accurate result since its reference fingerprint is fixed, for example the unmoved ashbin. This scenario is shown in Fig. 1.
The cameras labeled in red and in blue can both be regarded as volunteers, which are located at different times after the visual fingerprint database has been generated. Whenever a vehicle has completed visual localization or navigation in the scene, as long as it is willing to send the data back to the server cooperatively, our proposed algorithm works automatically without any human aid. Note that our proposed algorithm does not need the ground truth location of the query image or any other additional information. When the query image from the volunteer is sent back to the server, the visual fingerprint update algorithm is initiated. A block diagram of our proposed algorithm is shown in Fig. 2.
The location of the query image is recalculated on the server to validate the availability of the feedback from the crowdsourced user. As shown in Fig. 2, the visual fingerprint displacement detection method is applied first. When the results obtained by this method are beyond the predetermined threshold, the feedback from the crowdsourced user is invalid, and the overall process terminates. Otherwise, when the method locks the displaced fingerprint, the subsequent fingerprint relocation method is initiated. Thereafter, the new location of the displaced visual fingerprint is calculated, and the database is refreshed. Our proposed algorithm thus automatically detects and updates the displaced visual fingerprint and substitutes for periodic manual fingerprint scanning. Although the overall algorithm typically operates on the server, the calculation should complete as quickly as possible in consideration of the amount of crowdsourced data. As shown in Fig. 2, the visual fingerprint displacement detection method mainly consists of R-FCN, RANSAC, and EPnP. The online computational complexity of R-FCN is O(n), as is that of EPnP, while that of RANSAC is mainly affected by the number of iterations and the value of the maximum tolerable error.
A brief illustration of the role of the region-based fully convolutional network (R-FCN) in this paper is provided in Fig. 3. The crowdsourced image is an input of R-FCN, whose output is the corresponding semantic segmentation and the label with its score.
In the next step, the semantic SURF of each region is extracted, as shown in Fig. 4. It can be further filtered by RANSAC against the corresponding features in the reference image. According to its matching results with the point cloud, several bundles of semantic PnP equations are established, and different localization results are calculated. Alternatively, the semantic point cloud can also be generated by leveraging the semantic SURF. Besides matching, the upper images in Fig. 4 also describe the traditional way of generating point clouds from two frames of a sequential image sequence, while the lower images represent a novel way of creating semantic point clouds by R-FCN. When all groups of semantic SURF are beyond the predefined threshold, the crowdsourced data is determined to be invalid by the proposed method, and the overall process terminates as a consequence. When a group of semantic SURF is within the predefined threshold, the displaced object is locked by the method automatically. The result is input into the visual fingerprint relocation method, which aims to provide a reliable new location of the displaced visual fingerprint locked by the previous method. The translation can then be solved from equations whose coefficients come from the semantic 2D-3D correspondences. The optimal value of the translation is the solution of a constructed quadratic programming problem. Finally, the refreshed location is recorded in the visual fingerprint database. We explain the visual fingerprint displacement detection method and how it is defined in Sect. 2.1. Then, the visual fingerprint relocation method is detailed in Sect. 2.2.

Visual fingerprint displacement detection method
As stated before, the crowdsourced image is sent back to the server. For a Region of Interest (RoI) rectangle of size $w \times h$ in the image, $k \times k$ bins are formed, each of size approximately $\frac{w \times h}{k^{2}}$. In the last convolutional layer, $k^{2}$ score maps are produced for each category, and the pooling scheme is defined as

$$r_c(i, j \mid \Theta) = \sum_{(x, y) \in \mathrm{bin}(i, j)} \frac{z_{i,j,c}(x + x_0,\, y + y_0 \mid \Theta)}{n}, \tag{1}$$

where $r_c(i, j)$ is the response in the $(i, j)$th bin for the $c$th category, $z_{i,j,c}$ is one score map out of the $k^{2}(C + 1)$ score maps for $C$ object categories plus background, $(x_0, y_0)$ is the top-left corner of the RoI, $n$ is the number of pixels in the bin, and $\Theta$ denotes all parameters of the R-FCN. Then, the $k^{2}$ scores are voted by averaging on the RoI. A $(C + 1)$-dimensional vector is generated as $r_c(\Theta) = \sum_{i,j} r_c(i, j \mid \Theta)$. Finally, the softmax responses are computed by $s_c(\Theta) = e^{r_c(\Theta)} / \sum_{c'=0}^{C} e^{r_{c'}(\Theta)}$. Note that the output of R-FCN for each RoI is the particular category $c_i$ and its bounding box $S_i = (s_x, s_y, s_w, s_h)$. In general, the output of the pretrained R-FCN can be expressed as

$$S = \{S_1, S_2, \ldots, S_j, \ldots, S_m\}, \tag{2}$$

where $S_j$ is the $j$th semantic region and $m$ is the total number of semantic regions labeled by R-FCN.
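The position-sensitive pooling and softmax described above can be sketched in NumPy as follows. This is a minimal illustration under assumptions: the layout of `score_maps` as a `(k*k, C+1, H, W)` array, the integer pixel RoI, and the function name `ps_roi_pool` are all hypothetical, and a trained R-FCN (for example in Caffe) performs this inside the network rather than in Python.

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k):
    """score_maps: array of shape (k*k, C+1, H, W); roi: (x0, y0, w, h) in pixels.
    Returns the softmax scores over the C+1 categories for this RoI."""
    x0, y0, w, h = roi
    n_cls = score_maps.shape[1]
    r = np.zeros(n_cls)
    for i in range(k):          # bin row
        for j in range(k):      # bin column
            ys = y0 + int(np.floor(i * h / k))
            ye = y0 + int(np.ceil((i + 1) * h / k))
            xs = x0 + int(np.floor(j * w / k))
            xe = x0 + int(np.ceil((j + 1) * w / k))
            # Average-pool the (i, j)-th score map inside bin (i, j) only,
            # then accumulate the per-bin responses as in r_c = sum_ij r_c(i, j).
            r += score_maps[i * k + j, :, ys:ye, xs:xe].mean(axis=(1, 2))
    e = np.exp(r - r.max())     # numerically stable softmax
    return e / e.sum()
```

Each bin reads only its own score map, which is what makes the pooling position-sensitive; the category with the largest softmax response becomes the label of the RoI.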
Then, the SURF descriptors are extracted, which are expressed as

$$F = \{F_1, F_2, \ldots, F_n\},$$

where $n$ is the number of SURF descriptors and each $F_i$ is a vector. With the help of the descriptor, 2D features and 3D points can be mapped uniquely by the method of [34]. Typically, the 2D image coordinates and the 3D point world coordinates are the coefficients of our proposed method after semantic segmentation. The pixel coordinates $(u_i, v_i)$ of each SURF descriptor can easily be used to judge which region $S_j$ it belongs to, which is described as

$$F^{s_j} = \{F_i \mid (u_i, v_i) \in S_j\},$$

where $F^{s_j} \subset F$. Then, RANSAC is used for refining the 2D-2D matches between the crowdsourced image and the reference image. The remaining SURF descriptors with their semantic labels can be used for checking the displacement of the object. We define the number of matched SURF descriptors of each semantic region before RANSAC as $n^{s_j}_{\mathrm{match}}$, and the number remaining after RANSAC as $n^{s_j}_{\mathrm{filter}}$. Typically, when the ratio $\varphi^{s_j} = n^{s_j}_{\mathrm{filter}} / n^{s_j}_{\mathrm{match}}$ is lower than a threshold $\varphi_{\mathrm{thr}}$, it can be assumed that the corresponding semantic object is displaced compared with the reference. Furthermore, considering the randomness of RANSAC, $m$ groups of PnP equations are established, one for each semantic region. With the EPnP algorithm [38], the localization result of the crowdsourced user is recalculated for each group as $\{R_{\mathrm{filter}}, t_{\mathrm{filter}}\}^{s_j}$, while $\{R_{\mathrm{filter}}, t_{\mathrm{filter}}\}^{\mathrm{RANSAC}}$ is the solution from all the RANSAC-filtered 2D-3D correspondences. When the number of correspondences in a group is too small to give an accurate solution, the reprojection error, which is a gold standard for judging the correctness of a solution, is calculated instead of solving the EPnP equations. This corrects misjudgments of the first step as far as possible.
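The grouping of features by semantic region and the ratio test can be sketched as below. The sketch assumes that keypoints are given as pixel coordinates, that bounding boxes follow the $(s_x, s_y, s_w, s_h)$ layout, and that the matched and RANSAC-surviving keypoints are available as index sets; the function names are hypothetical, and the default of 5% for the ratio threshold is the value reported later in the simulations.

```python
def assign_to_regions(keypoints, boxes):
    """Map each keypoint index to the first box (sx, sy, sw, sh) that contains it."""
    groups = {j: [] for j in range(len(boxes))}
    for idx, (u, v) in enumerate(keypoints):
        for j, (sx, sy, sw, sh) in enumerate(boxes):
            if sx <= u < sx + sw and sy <= v < sy + sh:
                groups[j].append(idx)
                break
    return groups

def suspect_regions(groups, matched, inliers, phi_thr=0.05):
    """Return the regions whose ratio n_filter / n_match falls below phi_thr."""
    flagged = []
    for j, members in groups.items():
        n_match = sum(1 for i in members if i in matched)    # before RANSAC
        n_filter = sum(1 for i in members if i in inliers)   # after RANSAC
        if n_match > 0 and n_filter / n_match < phi_thr:
            flagged.append(j)
    return flagged
```

A region flagged here is only a candidate; in the method described above, the per-region pose or reprojection error is then used to confirm or reject it.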
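Finally, the displaced object is judged by comparing three quantities with their thresholds. Here, $S_{\mathrm{disp}}$ is the displaced semantic region in the image plane, $\varphi^{S_i}$ is the ratio of the semantic region $S_i$, $\varphi_{\mathrm{thr}}$ is its threshold, $\epsilon^{S_i}_{\mathrm{reproj}}$ is the reprojection error of the correspondences labeled by $S_i$, and $\epsilon^{\mathrm{thr}}_{\mathrm{reproj}}$ is the reprojection error threshold. $\epsilon^{S_i}_{\mathrm{diff}}$ is the difference between $\{R_{\mathrm{filter}}, t_{\mathrm{filter}}\}^{s_i}$ and $\{R_{\mathrm{filter}}, t_{\mathrm{filter}}\}^{\mathrm{RANSAC}}$, and $\epsilon^{\mathrm{thr}}_{\mathrm{diff}}$ is the difference threshold. When the semantic object is unshifted, the corresponding region is defined as $S_{\mathrm{unsh}}$.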
Consequently, the displaced objects are ranked according to their semantic regions $S = \{S_1, S_2, \ldots, S_j, \ldots, S_m\}$ of Eq. (2) in the image plane. The proposed method is summarized in Algorithm 1.

Visual fingerprint relocation method
In the previous subsection, the displaced object was locked, as were the unshifted objects. In this subsection, the translation vector of the displaced object is calculated for refreshing the visual fingerprint database. First, a benchmark method is illustrated. Although it can be derived very simply from the PnP equation, we give a brief derivation here for comparison with our proposed method, and because it is the first proposal for solving the visual fingerprint refreshing problem. The semantic 2D-3D correspondences can be clustered into two sets: $T_{\mathrm{disp}}$ is the set of displaced 2D-3D correspondences, while $T_{\mathrm{fix}}$ is the set of fixed ones. The rotation matrix and translation vector of the crowdsourced query image can be calculated by EPnP [38] with the coefficients from $T_{\mathrm{fix}}$, and are denoted as $R$ and $t = [t_1, t_2, t_3]^T$. The relative translation of the displaced object along the three axes is defined as $c_x$, $c_y$, $c_z$. Then, according to the PnP equation, without considering the rotation of the displaced object, we have

$$\lambda_i \begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = K \left( R \begin{bmatrix} x_i + c_x \\ y_i + c_y \\ z_i + c_z \end{bmatrix} + t \right), \tag{6}$$

where $(u_i, v_i)$ is the $i$th 2D feature coordinate, $\lambda_i$ is its depth factor, $(x_i, y_i, z_i)$ is its corresponding 3D point coordinate, and $K$ is the camera internal matrix of the crowdsourced user. The internal matrix can be roughly recovered from the crowdsourced image, or sent accurately by the user as a part of the feedback data. $K$ can further be represented as

$$K = \begin{bmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{bmatrix}, \tag{7}$$

where $f$ is the focal length and $(u_0, v_0)$ is the principal point coordinate of the image. Note that the coefficients of (6) are from the set $T_{\mathrm{disp}}$. For a tuple of $T_{\mathrm{disp}}$, (6) can be transformed into a simplified form of three linear equations with four unknowns. A very natural idea is to perform an elimination before solving the equations. When $\lambda_i$ is eliminated, we have

$$\begin{aligned} \left[(u_i - u_0) R_3 - f R_1\right] (\mathbf{x}_i + \mathbf{c}) &= f t_1 - (u_i - u_0)\, t_3, \\ \left[(v_i - v_0) R_3 - f R_2\right] (\mathbf{x}_i + \mathbf{c}) &= f t_2 - (v_i - v_0)\, t_3, \end{aligned} \tag{8}$$

where $\mathbf{x}_i = [x_i, y_i, z_i]^T$ and $\mathbf{c} = [c_x, c_y, c_z]^T$. Note that $r_{ij}$ and $R_i = [r_{i1}, r_{i2}, r_{i3}]$ are the elements and the $i$th row vector of the rotation matrix $R$, respectively.
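It is clear that two equations can be obtained for each tuple of coefficients from $T_{\mathrm{disp}}$. Therefore, when there are $n$ tuples in the set $T_{\mathrm{disp}}$, $2n$ linear equations are generated. According to linear algebra, they can be written in the simplified form

$$A \mathbf{c} = \mathbf{b}, \tag{9}$$

where $A$ is the $2n \times 3$ coefficient matrix, $\mathbf{b}$ is the corresponding $2n$-dimensional vector, and $\mathbf{c} = [c_x, c_y, c_z]^T$ is the unknown vector. A direct least-squares solution is

$$\mathbf{c} = (A^T A)^{-1} A^T \mathbf{b}. \tag{10}$$

Finally, we have

$$\mathbf{x}^{r}_{s} = \mathbf{x}_{s} + \mathbf{c}, \tag{11}$$

where $\mathbf{x}^{r}_{s}$ is the refreshed location of the displaced visual fingerprint and $\mathbf{x}_{s}$ is the primitive one in the database. This is defined as our benchmark, which we call the DLS method.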
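A minimal NumPy sketch of the DLS benchmark is given below. It assumes a pinhole camera with a single focal length $f$ and principal point $(u_0, v_0)$, a pose $(R, t)$ already estimated from the fixed correspondences, and that each displaced correspondence contributes two rows obtained by eliminating the depth factor from the projection equation; the function name and array layouts are hypothetical.

```python
import numpy as np

def dls_translation(uv, X, K, R, t):
    """Least-squares shift c such that the points X + c reproject onto the pixels uv.
    uv: (n, 2) pixels, X: (n, 3) database coordinates, K: 3x3, R: 3x3, t: (3,)."""
    f, u0, v0 = K[0, 0], K[0, 2], K[1, 2]
    rows, rhs = [], []
    for (u, v), x in zip(uv, X):
        for pix, p0, axis in ((u, u0, 0), (v, v0, 1)):
            # Row of A: (pix - p0) * R_3 - f * R_axis (depth factor eliminated).
            a = (pix - p0) * R[2] - f * R[axis]
            # Move the known term a @ x to the right-hand side.
            rhs.append(f * t[axis] - (pix - p0) * t[2] - a @ x)
            rows.append(a)
    # Solves A c = b in the least-squares sense, like c = (A^T A)^{-1} A^T b.
    c, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return c
```

With noise-free correspondences two points already determine the three unknowns; with real matches the accuracy of the solution depends on how well the pose and the 2D-3D matches are known.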
Our proposed method is derived from the PnP equation written in the form

$$\begin{bmatrix} x_i \\ y_i \\ z_i \end{bmatrix} = R^{-1} \left( \lambda_i K^{-1} \begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} - t \right). \tag{12}$$

Note that the point coordinates $[x_i, y_i, z_i]^T$ are the projection of the image coordinates $[u_i, v_i]$ in the crowdsourced query image by the depth $\lambda_i$, rotation matrix $R$, and translation vector $t$. Different from Eq. (6), the goal is to calculate the point coordinates of each semantic feature in the world coordinate system. Eq. (12) gives 3 equations with 4 unknowns, which admits infinitely many solutions. Thus, we formulate a minimization problem, Eq. (13), as our goal, in which $\lambda^{j}_{i}$ denotes the $j$th expression of $\lambda_i$ from Eq. (12). Eq. (13) is a quadratic programming problem. The optimal solution is obtained by setting the first derivative of (13) to zero. The final simplified form is 3 equations of the first degree, which is quite easy to compute. With $n$ tuples of projected point coordinates, the center is obtained as $\mathbf{c}_p$, while the primitive center of the displaced object is $\mathbf{c}_r$. The translation vector of the center coordinate is represented as

$$\mathbf{c} = \mathbf{c}_p - \mathbf{c}_r. \tag{14}$$

Generally, this value is used as an index to measure the accuracy of the method, since $\mathbf{c}$ is predefined in the simulation. The proposed method is summarized in Algorithm 2.
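The back-projection of a pixel into world coordinates and the translation of the object center can be sketched as follows. This sketch takes the depth factors as given inputs, which is an assumption made only to keep the example short: in the method described above the depths come from the quadratic programming step, which is not reproduced here. The function names are hypothetical.

```python
import numpy as np

def back_project(uv, depths, K, R, t):
    """World coordinates of the pixels uv (n, 2) at the given depth factors (n,)."""
    pix_h = np.column_stack([uv, np.ones(len(uv))])           # homogeneous pixels
    cam = depths[:, None] * (np.linalg.inv(K) @ pix_h.T).T    # camera-frame points
    # Row-vector form of R^T (cam - t); R^{-1} = R^T for a rotation matrix.
    return (cam - t) @ R

def centre_translation(projected_pts, original_pts):
    """Shift between the centre of the re-projected points and the stored centre."""
    return projected_pts.mean(axis=0) - original_pts.mean(axis=0)
```

Averaging over all features of the object means that the estimated shift is a property of the whole semantic object rather than of any single feature.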

Comparison algorithm
Some researchers have proposed visual localization methods that leverage known prerequisites, such as human height [45] or the vertical direction [46]. A traditional idea is borrowed from this kind of setting: conversely, we can judge whether the localization result is right or wrong by the known prerequisites. However, in the application of pedestrian visual localization, these prerequisites will not be satisfied all the time. Fortunately, the localization error can still be used for ranking the displaced object. Typically, the localization result has an error in every direction. Since the user is unaware of his real location in the horizontal or vertical direction, a traditional method can only leverage the human perception of distance in the vertical direction to judge the accuracy of the localization result. Thus, the conventional trick is to set an error threshold in the vertical direction. Once the result is beyond the threshold, it is determined that some mismatches exist in the 2D-3D correspondences. The threshold method referred to in the comparison is a concrete realization of this traditional idea.
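The threshold method amounts to a single comparison per trial, and the detection ratio reported for it is the fraction of trials that raise an alarm. A minimal sketch, with a hypothetical function name and made-up error values used only for illustration:

```python
def detection_ratio(vertical_errors, threshold):
    """Fraction of trials whose absolute vertical error exceeds the threshold."""
    alarms = [abs(e) > threshold for e in vertical_errors]
    return sum(alarms) / len(alarms)

# Illustrative vertical errors in metres (not measured data).
errors = [0.05, -0.3, 0.5, 0.1]
ratios = [detection_ratio(errors, thr) for thr in (0.0, 0.2, 0.9)]
```

Because an alarm requires the error to exceed the threshold, the ratio can only stay equal or fall as the threshold grows, which is the dependence on the threshold that the comparison in the results examines.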

Synthetic data
To minimize the impact of related components on our proposed method, the experiment is first conducted on synthetic data. It can be assumed that the 2D-3D matching and the semantic segmentation both have 100% accuracy in the synthetic data set. Thus, the focus is on the visual fingerprint displacement detection and relocation methods. For simplicity, we assume that the first camera is set as the origin of the world coordinate system, while the second camera is assumed to be translated several units from the first camera along the x-axis. The reason is that the two cameras need to keep a certain distance from each other while maintaining the overlap between the images. The camera coordinates of the 3D points are generated randomly within a box of [−2, 2] × [−2, 2] × [6, 8], and the pixel coordinates in the two cameras are then projected by a pinhole camera model with principal point (960, 540) and a focal length of around 1000. The y-axis angle of the second camera is chosen randomly from −10° to 0°, while the angle of the first camera varies from 0° to 10°. In each trial, 500 points are generated. Then, the K-means algorithm is applied to cluster the points. As a common assumption, all points in a cluster belong to the same object. Consequently, any object can be used to simulate a displaced one. The shifted distance is set to 0.5 to 1 units from the original position. If an object is displaced too far, or a new one is placed in the scene, it will not exist in the crowdsourced image at that particular location. These two situations are beyond the scope of this paper. The setting of the crowdsourced camera is the same as that of the second camera in the synthetic data simulation. Both the displaced points and the fixed ones are projected onto the image plane. With these image coordinates and their corresponding points, the location can be calculated by a PnP algorithm such as EPnP [38].
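The synthetic setup above can be approximated with a few lines of NumPy. In this sketch the baseline of one unit, the exact focal length of 1000, the random seed, and the rule that selects the displaced "object" (points with x greater than 1, used here in place of a K-means cluster) are all assumptions made for illustration.

```python
import numpy as np

def rot_y(deg):
    """Rotation matrix about the y-axis for an angle in degrees."""
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), 0.0, np.sin(a)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(a), 0.0, np.cos(a)]])

def project(points, K, R, t):
    """Pinhole projection of world points (n, 3) to pixel coordinates (n, 2)."""
    cam = points @ R.T + t
    pix = cam @ K.T
    return pix[:, :2] / pix[:, 2:3]

rng = np.random.default_rng(0)
K = np.array([[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]])
points = rng.uniform([-2, -2, 6], [2, 2, 8], size=(500, 3))

R1, t1 = rot_y(rng.uniform(0, 10)), np.zeros(3)
R2, t2 = rot_y(rng.uniform(-10, 0)), np.array([-1.0, 0.0, 0.0])  # assumed baseline
uv1 = project(points, K, R1, t1)
uv2 = project(points, K, R2, t2)

# Displace one "object": a spatial subset stands in for a K-means cluster.
mask = points[:, 0] > 1.0
displaced = points.copy()
displaced[mask] += np.array([0.75, 0.0, 0.0])  # shift length within 0.5 to 1 units
uv_query = project(displaced, K, R2, t2)       # crowdsourced view of the changed scene
```

The pairs `(uv_query, points)` then play the role of 2D-3D correspondences in which the masked subset is inconsistent with the database.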
We divided the synthetic data simulation into two parts according to the number of displaced objects. In the first part there is only one displaced object, which is the more likely case in an indoor environment. In the second part there are two displaced objects at the same time, which is more complex. Both results were achieved under the condition of $\varphi_{\mathrm{thr}} = 5\%$, $\epsilon^{\mathrm{thr}}_{\mathrm{reproj}} = 100$, and $\epsilon^{\mathrm{thr}}_{\mathrm{diff}} = 1$, which are the optimal settings in our simulations. We find that when $\varphi_{\mathrm{thr}}$ is larger, the RANSAC threshold method reports false alarms in simulations using synthetic data, especially when the number of 2D-3D correspondences is originally small. Meanwhile, $\epsilon^{\mathrm{thr}}_{\mathrm{reproj}} = 100$ and $\epsilon^{\mathrm{thr}}_{\mathrm{diff}} = 1$ are chosen to tolerate the errors introduced by the 2D-3D matching procedure for the real dataset. The threshold could be set lower in the simulations with synthetic data, e.g. $\epsilon^{\mathrm{thr}}_{\mathrm{reproj}} = 20$ as in the original implementation of EPnP [38].

Real dataset
The real dataset is chosen from images shot at the communication research center of Harbin Institute of Technology, whose floor plan and layout are shown in Fig. 1. The experiment site is mainly the area marked in yellow on the plan. The main aisle is about 50 m long and 3 m wide. To simulate the smart vehicles conveniently, we use wheeled equipment instead, with the camera mounted on its roof. The training set contains a total of 800 images with a resolution of 1920 × 1080. Some samples of the reference images have been shown previously in Figs. 3 and 4. By leveraging semantic segmentation, we  Table 1.
A typical example of the displaced objects in the above scene is shown in Fig. 5. The extinguishers in the red box of Fig. 5a are in their original positions while the fingerprint is scanned with a Canon EOS 1300D digital camera. In the red box of Fig. 5b, they have been moved, and the image was shot by a crowdsourced user with an iPhone 6S camera. The crowdsourced images were also collected by Samsung Galaxy S9plus and iPhone XR cameras, which have different internal parameters. It should be noted that images with low quality are not considered in this paper.
The learning parameters of the R-FCN are set to their default values. The network is trained with the Caffe framework on a server with an Intel(R) Xeon(R) Gold 5118 CPU @2.3GHz, 64GB memory, NVIDIA SMI 418.56 GPU, and 64-bit Ubuntu OS. More details about the R-FCN configurations for our simulation environments can be found in our previous conference paper [47].
Figure 6 shows the detection probability curves of the traditional method and our proposed method w.r.t. the threshold value. As mentioned before, the traditional method uses all 2D-3D correspondences as the coefficients of the PnP equations, and the solution is provided by the PnP solver. In our simulation, EPnP is chosen due to its accuracy and efficiency. For each threshold sampling point, the results are obtained from 5000 repeated random trials. The traditional method is more sensitive to the human-perceivable threshold, which means that its performance is disturbed by the threshold value. The threshold value is also contradictory. The larger the value, the more sensitive the crowdsourced user. However, the detection probability decreases dramatically with the increase of the threshold value. The threshold is set from 0 to 0.9 m because the localization error can be perceived with the help of an infrared or laser distance sensor equipped on the device. Moreover, a human will sense at least a 20 cm error in the vertical direction.

Results from synthetic dataset
Fig. 7 shows the logarithmic localization error for different sets of 2D-3D correspondences. The mean and maximum errors are used for comparison. It can be concluded that the localization error is small when the 2D-3D correspondences are from the fixed objects, whose points are labeled as FOP1s and FOP2s. The logarithmic error is positive when the 2D-3D correspondences are from a displaced object, labeled as MOPs. The difference between the errors from fixed objects and from displaced ones is obvious. When the correspondences are mixed, labeled as APs, it is hard to tell whether the positioning result is correct. From the results, it is easy to tell which object is displaced by its semantic label.
Once the displaced object is locked, the next step in our proposed algorithm is to relocate the point cloud fingerprint. In the previous step, the ground truth rotation matrix and translation vector of the crowdsourced image can be calculated from the correspondences of the fixed objects. The process is followed by RANSAC for all correspondences. This makes our method based on semantic 2D-3D correspondences more intuitive and efficient. Since the translation vector of the displaced object is predefined, the error between the ground truth and the solved one can be calculated. The predefined translation vector varies by a random value between 0.5 and 0.8 in the x-axis and y-axis. The CDF curve is shown in Fig. 8. The error CDF curve of our proposed method is much better than that of the benchmark, which illustrates that the position of the fingerprint is refreshed more accurately by our proposed method.
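An error CDF curve of the kind compared here is the empirical distribution of the per-trial relocation errors. A minimal sketch, with a hypothetical function name:

```python
def empirical_cdf(errors):
    """Return the sorted errors and the fraction of trials at or below each value."""
    xs = sorted(errors)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]
```

A method whose curve rises faster reaches a given fraction of trials at a smaller error, which is the sense in which one CDF curve is better than another.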
The second part of the synthetic dataset results is shown in Figs. 9, 10, and 11. There are two unmoved objects and two displaced objects, which means that the 2D-3D correspondences contain more mismatches. The total number of 2D-3D correspondences is 500. Note that, to ensure that EPnP can be solved normally, at least 10 points are generated on each object. In Fig. 9, the detection probability of both displaced objects is higher than that of the traditional method. A comparison of Figs. 6 and 9 shows that the detection ratio of the traditional method does improve with more mismatched correspondences, but the trend is the same. Meanwhile, the performance of our proposed method remains unchanged. Figure 10 shows the logarithmic error of the localization results from different bundles of 2D-3D correspondences. Compared with Fig. 7, the error gap between fixed and displaced objects still exists, which again allows the displaced object to be identified accurately. Fig. 11 shows the CDF curves of the relocation error for the different displaced objects. The relocation errors of the two displaced objects under our proposed method differ slightly, and both are better than that of the traditional method. The performance is also well preserved in comparison with Fig. 8.
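For readers who wish to reproduce curves of this kind, an empirical CDF of the relocation error can be computed as in the sketch below. The error samples are made-up values for two hypothetical methods and do not correspond to the data behind Figs. 8 and 11.

```python
def empirical_cdf(errors, x):
    """Fraction of error samples that are less than or equal to x."""
    if not errors:
        raise ValueError("errors must not be empty")
    return sum(1 for e in errors if e <= x) / len(errors)


# Made-up relocation errors in metres for two methods (assumption).
proposed = [0.05, 0.08, 0.10, 0.12, 0.20]
baseline = [0.10, 0.25, 0.30, 0.45, 0.60]

for x in (0.1, 0.2, 0.5):
    print(x, empirical_cdf(proposed, x), empirical_cdf(baseline, x))
```

A curve that rises faster, that is, reaches a given probability at a smaller error, indicates the more accurate method.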

Results from real dataset
The detection result on the real dataset is shown in Fig. 12. To simplify the simulation, we select a random location in the scene that contains four semantic objects. One object is deliberately moved to a specific location. It should be noted that objects are also displaced during normal use, and the chosen location only serves to make measuring the ground truth convenient.
Fig. 12 The detection ratio comparison between our proposed method and the threshold method. The false detection ratio in the real-data simulation is also shown.
After the semantic object is shifted, 50 test images are collected near the reference location; their precise locations are unknown. The results show clearly that the successful detection ratio of the displaced object under our proposed method is independent of the human-perceivable threshold, while the ratio of the compared method decreases as the threshold increases. However, unlike in the synthetic data simulation, some fixed objects are misjudged. There are 55% and 30% misjudgments for the fixed objects labeled FO2 and FO3, respectively. The main reason for this difference is that the number of effective semantic features on FO2 and FO3 is smaller in the real data simulation than in the synthetic one. The average percentage of features on FO2 is 5.95%, while that on FO3 is 17.64%, both of which are much smaller than that on FO1. The reprojection errors of FO2 and FO3 can be obtained by using the localization results of FO1, which helps to eliminate such misjudgments and reduces them to 5% and 20%. Figure 13 shows the localization result of the displaced object. Our proposed method outperforms the compared method when the error is smaller than 0.94 m. Compared with the simulation results on synthetic data, the convergence of the Cumulative Distribution Function (CDF) curve is slower than that of the compared method in the real data simulation. In summary, the performance of our proposed method on the real data is worse than on the synthetic data. The reason is that the feature matching algorithm cannot provide 100% accuracy, which leads to misjudgments and distortion of the equation coefficients. As a result, the solution of our proposed method has a few deviations.
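The reprojection check mentioned above can be sketched as follows. The sketch assumes a pinhole model with normalized image coordinates and a camera pose (R, t) estimated from FO1. The tolerance value, the helper names, and all numerical values are hypothetical and chosen only for illustration.

```python
import math


def project(R, t, X):
    """Pinhole projection of a 3D point to normalized image coordinates."""
    c = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if c[2] <= 0:
        raise ValueError("point is behind the camera")
    return c[0] / c[2], c[1] / c[2]


def mean_reprojection_error(R, t, points, observations):
    """Average distance between projected map points and observed features."""
    total = 0.0
    for X, (u, v) in zip(points, observations):
        pu, pv = project(R, t, X)
        total += math.hypot(pu - u, pv - v)
    return total / len(points)


def is_fixed(R, t, points, observations, tolerance=0.01):
    """Treat an object as fixed if its map points reproject close to the image."""
    return mean_reprojection_error(R, t, points, observations) <= tolerance


# Hypothetical pose obtained from FO1 and map points of another object.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 5.0]
map_points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.5], [0.0, 1.0, 1.0]]

# Case 1: the object has not moved, so observations match the projections.
obs_fixed = [project(R, t, X) for X in map_points]
# Case 2: the object was shifted by 0.6 along x before the image was taken.
obs_moved = [project(R, t, [X[0] + 0.6, X[1], X[2]]) for X in map_points]

print(is_fixed(R, t, map_points, obs_fixed))
print(is_fixed(R, t, map_points, obs_moved))
```

Because the pose comes from an object with many reliable features, a sparsely featured fixed object is judged by its reprojection error rather than by its own, less stable, pose estimate.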

Operation efficiency
As is known, the computational complexity of EPnP is O(n). Let m denote the number of semantic segmentation classes. Typically, m ≤ 10 in an indoor environment such as our experiment site. Thus, according to Algorithm 1, the computational complexity of our proposed visual fingerprint displacement detection method is O(m × n). Meanwhile, the computational complexity of the proposed visual fingerprint relocation method is O(n) by Algorithm 2, where n is the number of visual fingerprints belonging to the displaced object. The average running time versus the number n of corresponding 2D-3D points is shown in Fig. 14.
The running time of each sampling point is the average of 1000 trials. As the number of features extracted from the images increases, the time required by both methods increases slowly. Meanwhile, the number of displaced objects does not significantly affect the performance of the method. The experimental results show that the proposed method is efficient. However, compared with the other steps of the proposed algorithm, the average running time of semantic segmentation by R-FCN is still long, at 1.63 s. For now, this restricts the algorithm to running on the server side.

Comparison with AVF reconstruction
In this subsection, we compare the advantages and disadvantages of our proposed visual fingerprint updating algorithm with those of reconstruction by the AVF method, as listed in Table 2. The average processing time and relocation error are obtained from 500 trials. In the experiment, we set one moving object, whose translation varies from 50 to 100 cm. Under this condition, it is convenient to calculate the relocation error of the fingerprint. It should be noted that the AVF method is easily affected by the flow of people in the scene. By contrast, our proposed algorithm is more flexible with the aid of crowdsourced localization.

Conclusion
In this paper, we propose an algorithm based on crowdsourcing and deep learning to solve the challenging visual fingerprint update problem for smart vehicles. The algorithm detects whether a reference object in the crowdsourced image has been displaced and provides a refreshed location to assist subsequent vehicles. The simulation results are obtained from synthetic data with various configurations. In addition, a real indoor dataset is used to test the performance of our proposed algorithm against the synthetic-data results. In summary, our proposed algorithm achieves nearly 100% detection probability, while the average probability of the threshold method is 60%. The accuracy of fingerprints relocated by our proposed algorithm is 42% higher than that of the DLS method. Although the accuracy of our proposed algorithm is 10% lower than the