Efficient Feature Matching and Pose-graph Initialization for SfM
Wentian Gan; Zongqian Zhan; Xin Wang
In the last few decades, Structure-from-Motion (SfM) has been intensively studied in the fields of computer vision and photogrammetry.
However, it is still challenging to deal with large-scale datasets (such as crowdsourced or UAV images), for which feature matching and pose-graph generation are the key bottlenecks in terms of time efficiency.
Similar to the goal of Barath et al. (2021), we propose a method that improves the feature matching and initial pose-graph generation algorithms for SfM (in a pose graph, nodes are images and an edge connects two images that survive two-view geometric verification).
In contrast to Barath et al. (2021), we improve the pose graph by using several orthogonal minimum spanning trees (MSTs) to build a more complete pose graph; pose-graph generation then starts from the two-view epipolar geometries of the image pairs that lie on these orthogonal MSTs.
The overall workflow of our method is shown in Fig. 1; the green parts denote our main contributions.
The goal of this work is to propose a solution that significantly improves the time efficiency of pose-graph generation for an image dataset while maintaining the performance of the subsequent SfM processing, i.e., our method is expected to yield the input for SfM more time-efficiently.
First, to generate a more complete pose graph faster, several orthogonal MSTs are built from the image similarity degrees and serve as a pre-built initialization of the pose graph.
We propose to use multiple orthogonal MSTs to generate a robust pose graph and further improve the time efficiency ("orthogonal" means that no two MSTs share any edge).
To obtain an MST, first, each image is considered as the root node of its own tree in the graph G, and each edge (image pair) connects two such nodes. Second, the edge (image pair) with the highest similarity degree is retrieved from the estimated similarity matrix; if its two vertices belong to different trees, the corresponding trees are merged into one and the edge is added to the MST being built, otherwise the image pair with the next highest similarity degree is processed. This second step is repeated until only one tree is left in the graph G, which is the final MST. In this way, a set of edges T = {(i, j) | i, j ∈ G} that covers all images with the lowest cost is found. After the first MST is built, all of its edges are added to an excluded set, and multiple orthogonal MSTs are obtained by repeating the above steps several times.
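The construction described above is essentially a Kruskal-style procedure over edges sorted by similarity. The following is a minimal sketch of how it could be implemented (Python, with a simple union-find forest; the function names and data layout are illustrative assumptions, not the authors' code):

```python
# Kruskal-style construction of several edge-disjoint ("orthogonal") MSTs.
# Edge weights are similarity degrees, so picking the highest-similarity edges
# corresponds to the minimum-cost spanning tree described above.

def find(parent, i):
    # Union-find "find" with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def build_orthogonal_msts(n_images, sim_edges, n_trees=3):
    """sim_edges: list of (similarity, i, j); returns n_trees edge-disjoint spanning trees."""
    excluded = set()                          # edges already used by earlier MSTs
    msts = []
    for _ in range(n_trees):
        parent = list(range(n_images))        # every image starts as its own root node
        tree = []
        # Visit image pairs from the highest to the lowest similarity degree.
        for sim, i, j in sorted(sim_edges, key=lambda e: -e[0]):
            if (i, j) in excluded:
                continue
            ri, rj = find(parent, i), find(parent, j)
            if ri != rj:                      # the pair connects two different trees
                parent[ri] = rj               # merge them
                tree.append((i, j))
                if len(tree) == n_images - 1: # spanning tree is complete
                    break
        excluded.update(tree)                 # enforce orthogonality for the next MST
        msts.append(tree)
    return msts
```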
Second, to verify unknown two-view geometries more time-efficiently, a heuristic A* algorithm is employed to find a visible path between the two views, based on which the corresponding two-view geometry is estimated via propagation along the path.
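The propagation step can be sketched as chaining the relative poses of consecutive edges along the path returned by A*. The snippet below is an illustration under assumptions (each edge stores a relative rotation and translation with the convention x_b ≈ R_ab x_a + t_ab, and the per-edge translation scales are mutually consistent), not the paper's exact formulation:

```python
import numpy as np

def skew(t):
    # Cross-product matrix [t]_x used to compose the essential matrix.
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def propagate_two_view(path, rel_pose):
    """path: list of image ids from A*; rel_pose[(a, b)] = (R_ab, t_ab)."""
    R_acc, t_acc = np.eye(3), np.zeros(3)
    for a, b in zip(path[:-1], path[1:]):
        R_ab, t_ab = rel_pose[(a, b)]
        R_acc = R_ab @ R_acc                  # compose rotations along the path
        t_acc = R_ab @ t_acc + t_ab           # compose translations along the path
    E = skew(t_acc) @ R_acc                   # essential matrix between the path ends
    return R_acc, t_acc, E
```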
Third, to speed up feature matching, a hashing strategy based on epipolar lines (calculated from the estimated relative pose) is utilized to narrow down the candidate matchable points.
We use a method called epipolar hash, based on the essential matrix estimated in the previous step (two-view geometry propagation), to find the points that lie near the epipolar lines projected onto the source image. Exploiting the structure of epipolar geometry, we divide the image plane into bins according to angle: all bins share the same epipole, and each bin spans an angle of [0, π)/#bins (or [a, b]/#bins when the angular range is restricted to [a, b]). The bins hit by an epipolar line are the candidate bins to be explored.
Finally, feature matching is performed considering only the features in the candidate bins.
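A possible realisation of this epipolar hash is sketched below; the bin count, the neighbouring-bin tolerance and all names are illustrative assumptions rather than the authors' implementation:

```python
# Hash source-image features into angular bins around the epipole, then, for a
# query epipolar line, return only the features in the bin(s) it hits.
import numpy as np

def build_epipolar_hash(points, epipole, n_bins=64):
    # Angle of each feature as seen from the epipole, folded into [0, pi)
    # because an epipolar line covers two opposite directions.
    d = points - epipole
    angles = np.mod(np.arctan2(d[:, 1], d[:, 0]), np.pi)
    bin_ids = (angles / np.pi * n_bins).astype(int) % n_bins
    table = {}
    for idx, b in enumerate(bin_ids):
        table.setdefault(b, []).append(idx)
    return table

def candidates_for_line(line, table, n_bins=64):
    """line = (a, b, c): epipolar line a*x + b*y + c = 0 in the source image."""
    a, b, _ = line
    # Every epipolar line passes through the epipole, so its direction angle
    # identifies the bin it crosses; neighbouring bins tolerate pose noise.
    angle = np.mod(np.arctan2(-a, b), np.pi)
    b_id = int(angle / np.pi * n_bins)
    return sum((table.get(k % n_bins, []) for k in (b_id - 1, b_id, b_id + 1)), [])
```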
To validate the performance of the proposed method, we test it on several image datasets.
To evaluate the time efficiency, two popular packages (Colmap and openMVG) and one state-of-the-art method, Barath et al. (2021), are compared (Section 4.2). Finally, the SfM results are reported using the evaluation metrics of mean track length, mean reprojection error and number of registered images.
Our method is tested on three image datasets from Knapitsch et al. (2017), named Family, Lighthouse and Playground; the table below lists the image size and number of images of these three datasets.
| Dataset | Image Size | Number of Images |
| --- | --- | --- |
| Family | 1920×1080 | 152 |
| Lighthouse | 2048×1080 | 200 |
| Playground | 1920×1080 | 307 |
Before testing our method, the first step is to compute the similarity degree matrix among images, which has dimensions n × n and is normalized into (0, 1); a higher value indicates that the two corresponding images are more likely to overlap with each other.
To do this, we use ResNet50 as the backbone and GeM (Radenović et al., 2018) as the aggregation layer, pre-trained on GLD-v1 (Noh et al., 2017), to extract one global feature per image. After that, the similarity degree is computed via cosine distance; in our experiments, image pairs with a similarity over 0.5 are eligible for subsequent processing, and 3 orthogonal MSTs are used to generate the starting pose graph.
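As an illustration of this initialization step, a minimal sketch of the similarity-matrix computation and pair filtering is given below (it assumes the global descriptors are already extracted and L2-normalised, so a dot product equals cosine similarity; the ResNet50+GeM extraction itself is omitted):

```python
import numpy as np

def similarity_matrix(desc):
    """desc: (n, d) array of L2-normalised global image descriptors."""
    sim = desc @ desc.T                 # cosine similarity between every image pair
    np.fill_diagonal(sim, 0.0)          # ignore self-similarity
    return sim

def eligible_pairs(sim, threshold=0.5):
    # Keep only the image pairs whose similarity exceeds the 0.5 threshold
    # used in the experiments; returns (i, j) pairs with i < j.
    i, j = np.where(np.triu(sim, k=1) > threshold)
    return list(zip(i.tolist(), j.tolist()))
```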
Performance in terms of time efficiency:

| Dataset | Method | Feature extraction (s) | Pose graph (s) |
| --- | --- | --- | --- |
| Family | our method | 128.25 | 106.14 |
| Family | Barath et al. (2021) | 91.83 | 305.82 |
| Family | openMVG | 273.99 | 139.13 |
| Family | Colmap | 13.98 (GPU) | 381.36 |
| Lighthouse | our method | 121.18 | 104.12 |
| Lighthouse | Barath et al. (2021) | 120.00 | 149.47 |
| Lighthouse | openMVG | 172.13 | 115.40 |
| Lighthouse | Colmap | 15.96 (GPU) | 439.68 |
| Playground | our method | 189.06 | 277.31 |
| Playground | Barath et al. (2021) | 180.56 | 391.10 |
| Playground | openMVG | 331.21 | 308.24 |
| Playground | Colmap | 31.86 (GPU) | 1465.32 |
We can see that our method is always the fastest at generating the pose graph and exhibits a significant reduction in processing time compared to Colmap. Colmap is the slowest at computing the pose graph, primarily due to its brute-force (BF-search) matching strategy. The second fastest pipeline is openMVG, which utilizes the Fast-Cascade-Hashing-L2 matching algorithm; while this approach brings a significant speed improvement, it typically needs a greater amount of memory (Cheng et al., 2014). As the feature extraction step in Colmap benefits from GPU acceleration, the total time of this initialization part was not included in the comparative analysis. Compared to Barath et al. (2021), we are in general faster, which can be explained by the benefit of the initialization stage using orthogonal MSTs.
The performance of SfM relies on the quality of the feature matching and two-view geometric results. Therefore, in this section, the popular framework openMVG is employed as the SfM engine, taking the output of our method and of Barath et al. (2021) as input to estimate image poses and a sparse 3D point cloud. Three evaluation metrics are computed (a brief computation sketch is given after the list):
- Mean Track Length (MTL). The track length denotes the number of images in which a 3D point is visible. The mean track length is the track length averaged over all triangulated 3D points. The MTL typically indicates the quality of feature matching, i.e., a higher MTL means better feature matching.
- Mean Reprojection Error (MRE). The reprojection error is a geometric discrepancy measured as the distance between a reprojected 2D point and its corresponding feature point on the image. The MRE indicates how consistent the whole SfM result is with the projection model, such as the pin-hole model.
- Registered Images Number (RIN). The number of successfully registered images in the photogrammetric block; a higher number of successfully registered images indicates a more complete pose graph.
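For reference, the three metrics could be computed from a reconstruction as in the following sketch (the data layout, i.e., tracks as lists of observing image ids and per-observation pixel residuals, is a hypothetical assumption for illustration):

```python
import numpy as np

def mean_track_length(tracks):
    # tracks: list of lists, each inner list holds the image ids observing a 3D point.
    return float(np.mean([len(t) for t in tracks]))

def mean_reprojection_error(residuals):
    # residuals: per-observation 2D distances (pixels) between reprojections and features.
    return float(np.mean(residuals))

def registered_images_number(registered_ids, total):
    return f"{len(set(registered_ids))}/{total}"
```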
| Dataset | Method | MTL | MRE (px) | RIN |
| --- | --- | --- | --- | --- |
| Family | our method | 7.53 | 0.50 | 152/152 |
| Family | Barath et al. (2021) | 3.80 | 0.83 | 141/152 |
| Family | openMVG | 7.02 | 0.67 | 152/152 |
| Family | Colmap | 8.06 | 0.87 | 152/152 |
| Lighthouse | our method | 5.41 | 0.61 | 198/200 |
| Lighthouse | Barath et al. (2021) | 5.35 | 0.63 | 186/200 |
| Lighthouse | openMVG | 5.21 | 0.90 | 196/200 |
| Lighthouse | Colmap | 7.74 | 0.66 | 200/200 |
| Playground | our method | 5.73 | 0.49 | 307/307 |
| Playground | Barath et al. (2021) | 5.67 | 0.68 | 307/307 |
| Playground | openMVG | 4.20 | 0.99 | 302/307 |
| Playground | Colmap | 6.03 | 0.65 | 307/307 |
It can be observed that the method proposed in this paper exhibits the highest reconstruction accuracy in terms of MRE, while the Colmap-based pipeline always exhibits the longest MTL. In terms of the number of successfully registered images, the proposed method is only slightly inferior to Colmap. For Colmap, the inherent feature extraction module is used, which generates approximately 3000 more feature points per image; as a result, the corresponding MTL and RIN are always the best, since more correspondences are found and more edges are likely to pass the two-view geometric verification.
Compared to the original pose-graph method of Barath et al. (2021), our method shows superior results on all three metrics, which demonstrates the efficacy of the proposed integration of multiple orthogonal MSTs. The visual SfM reconstruction results are shown below.
If you have any questions or advice, you can contact us through following address:
- gwt2019@whu.edu.cn, Wentian Gan, Wuhan University
- xwang@sgg.whu.edu.cn, Xin Wang, Wuhan University
- zqzhan@sgg.whu.edu.cn, Zongqian Zhan, Wuhan University
Thanks for your support!