Efficient Feature Matching and Pose-graph Initialization for SfM
Wentian Gan; Zongqian Zhan; Xin Wang
In the last few decades, Structure-from-Motion (SfM) has been intensively studied in the fields of computer vision and photogrammetry.
However, it is still challenging to deal with large-scale datasets (such as crowdsourced or UAV images), for which feature matching and pose-graph generation are the key bottlenecks in terms of time efficiency.
Similar to the goal of Barath et al. (2021), we propose a method that improves the feature matching and initial pose-graph generation algorithms for SfM (in a pose graph, nodes are images and an edge connects two images that survive two-view geometric verification).
In contrast to Barath et al. (2021), we improve the pose graph by using several orthogonal minimum spanning trees (MSTs) to build a more complete pose graph; pose-graph generation then starts from the two-view epipolar geometries of the image pairs that lie on these orthogonal MSTs.
The overall workflow of our method is shown in Fig. 1; the green parts denote our main contributions.
The goal of this work is to propose a solution that significantly improves the time efficiency of pose-graph generation for an image dataset while maintaining the performance of the subsequent SfM processing, i.e., our method is expected to yield the input for SfM more time-efficiently.
First, to generate a more complete pose graph faster, several orthogonal MSTs are built from the image similarity degrees and serve as a pre-built initialization of the pose graph.
We propose to use multiple orthogonal MSTs to generate a robust pose graph and further improve the time efficiency ("orthogonal" means that no two MSTs share any edge).
To obtain an MST, first, each image is considered as the root node of its own tree in the graph G, and each edge (image pair) connects two such nodes. Second, the edge (image pair) with the highest similarity degree is retrieved from the estimated similarity matrix; if its two vertices belong to different trees, the corresponding trees are merged into one and the edge is added to the MST being built, otherwise the image pair with the next highest similarity degree is processed. This second step is repeated until only one tree is left in the graph G, which is the final MST. In this way, a set of edges T = {(i, j) | i, j ∈ G} that covers all images with the lowest cost is found. After the first MST is built, all of its edges are added to an excluded set, and multiple orthogonal MSTs are obtained by repeating the above steps several times.
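The construction described above is essentially a Kruskal-style procedure over edges sorted by similarity. The following is a minimal sketch of how it could be implemented (Python, with a simple union-find forest; the function names and data layout are illustrative assumptions, not the authors' code):

```python
# Kruskal-style construction of several edge-disjoint ("orthogonal") MSTs.
# Edge weights are similarity degrees, so picking the highest-similarity edges
# corresponds to the minimum-cost spanning tree described above.

def find(parent, i):
    # Union-find "find" with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def build_orthogonal_msts(n_images, sim_edges, n_trees=3):
    """sim_edges: list of (similarity, i, j); returns n_trees edge-disjoint spanning trees."""
    excluded = set()                          # edges already used by earlier MSTs
    msts = []
    for _ in range(n_trees):
        parent = list(range(n_images))        # every image starts as its own root node
        tree = []
        # Visit image pairs from the highest to the lowest similarity degree.
        for sim, i, j in sorted(sim_edges, key=lambda e: -e[0]):
            if (i, j) in excluded:
                continue
            ri, rj = find(parent, i), find(parent, j)
            if ri != rj:                      # the pair connects two different trees
                parent[ri] = rj               # merge them
                tree.append((i, j))
                if len(tree) == n_images - 1: # spanning tree is complete
                    break
        excluded.update(tree)                 # enforce orthogonality for the next MST
        msts.append(tree)
    return msts
```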
Second, to verify unknown two-view geometries more time-efficiently, a heuristic A* algorithm is employed to find a visible path between the two views, based on which the corresponding two-view geometry is estimated via propagation along the path.
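The propagation step can be sketched as chaining the relative poses of consecutive edges along the path returned by A*. The snippet below is an illustration under assumptions (each edge stores a relative rotation and translation with the convention x_b ≈ R_ab x_a + t_ab, and the per-edge translation scales are mutually consistent), not the paper's exact formulation:

```python
import numpy as np

def skew(t):
    # Cross-product matrix [t]_x used to compose the essential matrix.
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def propagate_two_view(path, rel_pose):
    """path: list of image ids from A*; rel_pose[(a, b)] = (R_ab, t_ab)."""
    R_acc, t_acc = np.eye(3), np.zeros(3)
    for a, b in zip(path[:-1], path[1:]):
        R_ab, t_ab = rel_pose[(a, b)]
        R_acc = R_ab @ R_acc                  # compose rotations along the path
        t_acc = R_ab @ t_acc + t_ab           # compose translations along the path
    E = skew(t_acc) @ R_acc                   # essential matrix between the path ends
    return R_acc, t_acc, E
```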
Third, to speed up feature matching, a hashing strategy based on epipolar lines (calculated from the estimated relative pose) is utilized to narrow down the candidate matchable points.
We use a method called epipolar hash, based on the essential matrix estimated in the previous step (two-view geometry propagation), to find the points that lie near the epipolar lines projected onto the source image. Exploiting the structure of epipolar geometry, we divide the image plane into bins according to angle: all bins share the same epipole, and each bin spans an angle of [0, π)/#bins (or [a, b]/#bins when the angular range is restricted to [a, b]). The bins hit by an epipolar line are the candidate bins to be explored.
Finally, feature matching is performed considering only the features in the candidate bins.
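A possible realisation of this epipolar hash is sketched below; the bin count, the neighbouring-bin tolerance and all names are illustrative assumptions rather than the authors' implementation:

```python
# Hash source-image features into angular bins around the epipole, then, for a
# query epipolar line, return only the features in the bin(s) it hits.
import numpy as np

def build_epipolar_hash(points, epipole, n_bins=64):
    # Angle of each feature as seen from the epipole, folded into [0, pi)
    # because an epipolar line covers two opposite directions.
    d = points - epipole
    angles = np.mod(np.arctan2(d[:, 1], d[:, 0]), np.pi)
    bin_ids = (angles / np.pi * n_bins).astype(int) % n_bins
    table = {}
    for idx, b in enumerate(bin_ids):
        table.setdefault(b, []).append(idx)
    return table

def candidates_for_line(line, table, n_bins=64):
    """line = (a, b, c): epipolar line a*x + b*y + c = 0 in the source image."""
    a, b, _ = line
    # Every epipolar line passes through the epipole, so its direction angle
    # identifies the bin it crosses; neighbouring bins tolerate pose noise.
    angle = np.mod(np.arctan2(-a, b), np.pi)
    b_id = int(angle / np.pi * n_bins)
    return sum((table.get(k % n_bins, []) for k in (b_id - 1, b_id, b_id + 1)), [])
```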
To validate the performance of the proposed method, we test it on several image datasets.
To evaluate the time efficiency, two popular packages (Colmap and openMVG) and one state-of-the-art method, Barath et al. (2021), are compared (Section 4.2). Finally, the SfM results are reported using the evaluation metrics of mean track length, mean reprojection error and number of registered images.
Our method is tested on three image datasets from Knapitsch et al. (2017), named Family, Lighthouse and Playground; the table below lists the image size and number of images of these three datasets.
| Dataset | Image Size | Number of Images |
| --- | --- | --- |
| Family | 1920×1080 | 152 |
| Lighthouse | 2048×1080 | 200 |
| Playground | 1920×1080 | 307 |
Before testing our method, the first step is to compute the similarity degree matrix among images, which has dimensions n × n and is normalized into (0, 1); a higher value indicates that the two corresponding images are more likely to overlap with each other.
To do this, we use ResNet50 as the backbone and GeM (Radenović et al., 2018) as the aggregation layer, pre-trained on GLD-v1 (Noh et al., 2017), to extract one global feature per image. After that, the similarity degree is computed via cosine distance; in our experiments, image pairs with a similarity over 0.5 are eligible for subsequent processing, and 3 orthogonal MSTs are used to generate the starting pose graph.
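As an illustration of this initialization step, a minimal sketch of the similarity-matrix computation and pair filtering is given below (it assumes the global descriptors are already extracted and L2-normalised, so a dot product equals cosine similarity; the ResNet50+GeM extraction itself is omitted):

```python
import numpy as np

def similarity_matrix(desc):
    """desc: (n, d) array of L2-normalised global image descriptors."""
    sim = desc @ desc.T                 # cosine similarity between every image pair
    np.fill_diagonal(sim, 0.0)          # ignore self-similarity
    return sim

def eligible_pairs(sim, threshold=0.5):
    # Keep only the image pairs whose similarity exceeds the 0.5 threshold
    # used in the experiments; returns (i, j) pairs with i < j.
    i, j = np.where(np.triu(sim, k=1) > threshold)
    return list(zip(i.tolist(), j.tolist()))
```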
Performance in terms of time efficiency:

| Dataset | Method | Feature extraction (s) | Pose graph (s) |
| --- | --- | --- | --- |
| Family | our method | 128.25 | 106.14 |
| Family | Barath et al. (2021) | 91.83 | 305.82 |
| Family | openMVG | 273.99 | 139.13 |
| Family | Colmap | 13.98 (GPU) | 381.36 |
| Lighthouse | our method | 121.18 | 104.12 |
| Lighthouse | Barath et al. (2021) | 120.00 | 149.47 |
| Lighthouse | openMVG | 172.13 | 115.40 |
| Lighthouse | Colmap | 15.96 (GPU) | 439.68 |
| Playground | our method | 189.06 | 277.31 |
| Playground | Barath et al. (2021) | 180.56 | 391.10 |
| Playground | openMVG | 331.21 | 308.24 |
| Playground | Colmap | 31.86 (GPU) | 1465.32 |
We can see that our method is always the fastest at generating the pose graph and exhibits a significant reduction in processing time compared to Colmap. Colmap is the slowest at computing the pose graph, primarily due to its brute-force (BF-search) matching strategy. The second fastest pipeline is openMVG, which utilizes the Fast-Cascade-Hashing-L2 matching algorithm; while this approach brings a significant speed improvement, it typically needs a greater amount of memory (Cheng et al., 2014). As the feature extraction step in Colmap benefits from GPU acceleration, the total time of this initialization part was not included in the comparative analysis. Compared to Barath et al. (2021), we are in general faster, which can be explained by the benefit of the initialization stage using orthogonal MSTs.
The performance of SfM relies on the quality of the feature matching and two-view geometric results. Therefore, in this section, the popular framework openMVG is employed as the SfM engine, taking the output of our method and of Barath et al. (2021) as input to estimate image poses and a sparse 3D point cloud. Three evaluation metrics are computed (a brief computation sketch is given after the list):
- Mean Track Length (MTL). The track length denotes the number of images in which a 3D point is visible. The mean track length is the track length averaged over all triangulated 3D points. The MTL typically indicates the quality of feature matching, i.e., a higher MTL means better feature matching.
- Mean Reprojection Error (MRE). The reprojection error is a geometric discrepancy measured as the distance between a reprojected 2D point and its corresponding feature point on the image. The MRE indicates how consistent the whole SfM result is with the projection model, such as the pin-hole model.
- Registered Images Number (RIN). The number of successfully registered images in the photogrammetric block; a higher number of successfully registered images indicates a more complete pose graph.
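For reference, the three metrics could be computed from a reconstruction as in the following sketch (the data layout, i.e., tracks as lists of observing image ids and per-observation pixel residuals, is a hypothetical assumption for illustration):

```python
import numpy as np

def mean_track_length(tracks):
    # tracks: list of lists, each inner list holds the image ids observing a 3D point.
    return float(np.mean([len(t) for t in tracks]))

def mean_reprojection_error(residuals):
    # residuals: per-observation 2D distances (pixels) between reprojections and features.
    return float(np.mean(residuals))

def registered_images_number(registered_ids, total):
    return f"{len(set(registered_ids))}/{total}"
```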
| Dataset | Method | MTL | MRE (px) | RIN |
| --- | --- | --- | --- | --- |
| Family | our method | 7.53 | 0.50 | 152/152 |
| Family | Barath et al. (2021) | 3.80 | 0.83 | 141/152 |
| Family | openMVG | 7.02 | 0.67 | 152/152 |
| Family | Colmap | 8.06 | 0.87 | 152/152 |
| Lighthouse | our method | 5.41 | 0.61 | 198/200 |
| Lighthouse | Barath et al. (2021) | 5.35 | 0.63 | 186/200 |
| Lighthouse | openMVG | 5.21 | 0.90 | 196/200 |
| Lighthouse | Colmap | 7.74 | 0.66 | 200/200 |
| Playground | our method | 5.73 | 0.49 | 307/307 |
| Playground | Barath et al. (2021) | 5.67 | 0.68 | 307/307 |
| Playground | openMVG | 4.20 | 0.99 | 302/307 |
| Playground | Colmap | 6.03 | 0.65 | 307/307 |
It can be observed that the method proposed in this paper exhibits the highest reconstruction accuracy in terms of MRE, while the Colmap-based pipeline always exhibits the longest MTL. In terms of the number of successfully registered images, the proposed method is only slightly inferior to Colmap. For Colmap, the inherent feature extraction module is used, which generates approximately 3000 more feature points per image; as a result, the corresponding MTL and RIN are always the best, since more correspondences are found and more edges are likely to pass the two-view geometric verification.
Compared to the original pose-graph method of Barath et al. (2021), our method shows superior results on all three metrics, which demonstrates the efficacy of the proposed integration of multiple orthogonal MSTs. The visual SfM reconstruction results are shown below.
If you have any questions or advice, you can contact us through following address:
- gwt2019@whu.edu.cn, Wentian Gan, Wuhan University
- xwang@sgg.whu.edu.cn, Xin Wang, Wuhan University
- zqzhan@sgg.whu.edu.cn, Zongqian Zhan, Wuhan University
Thanks for your support!