I have followed this competition for a long time and wrote recaps of the 2021 and 2020 editions; see the workshop homepage 1 and the competition page 2 for details.
What is the biggest difference this year? That’s right: Kaggle! The competition was actually supposed to move to Kaggle last year, but the pandemic and the time needed to prepare the data pushed that plan to this year.
This year, 3,846 people took part in 642 teams, 128 of them first-time participants (18 newcomers in the Top 20). They came from 60 countries and made 14,170 submissions in total.
That is more than 25 times the participants and 150 times the submissions of previous years.
Fun fact: the winning solution was completed just 48 hours before the deadline.
Differences
Compared with previous years, this year’s competition differs in the following respects:
- Contestants must submit notebooks that process the competition data offline.
- Contestants cannot see the test set, which makes cheating difficult.
- The setup allows algorithms to be iterated quickly.
Apart from that, there are a few more differences:
- The multiview track (multi-view matching) was cut, and the competition focuses only on the stereo track. There are several reasons for this, the main one being the “technical problem” that it is hard to run and evaluate multi-view matching within a limited, reasonable amount of time.
- New datasets and evaluation criteria. In previous years the ground-truth translations had no scale, so only rotation accuracy could be evaluated; this year’s translations carry scale information, so rotation and translation can be evaluated together (a sketch of such a pose-error computation follows this list). In addition, this year used a non-public dataset from Google that is not available online.
- Time limit. Total compute time is capped at 9 hours on a Kaggle GPU instance, with no overruns allowed! This forces contestants to weigh which algorithms they can afford. A simple example: a semantic segmentation mask might help the metric, but it demands too much compute, so it cannot be used.
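Since the ground truth now carries scale, both components of the relative pose can be scored. Below is a minimal sketch of how such errors might be computed; the competition’s actual metric (thresholding and aggregation into a final score) is not reproduced here, and the function names are my own.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Angular distance in degrees between two 3x3 rotation matrices."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error(t_est, t_gt):
    """Euclidean distance between translation vectors; meaningful only
    because this year's ground-truth translations carry scale."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))
```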
Top solutions
Unlike in past years, almost all of the top-ranked teams used a similar pipeline this year; although the implementation details differed, the core ideas were alike.
The second-place solution (hkchkc) took a different route: it used no pre- or post-processing, and it held the top spot for a long stretch of the competition.
A general feature-matching pipeline, used by many contestants, runs as follows:
- First, use an off-the-shelf model to obtain initial matches. Such models include LoFTR 5, SuperGlue 7, and QuadTree Attention LoFTR 6. Some teams also applied scale augmentation (multi-scale feature matching).
- Next, estimate the co-visible area of the two images. A common method is to cluster the matches with K-means or DBSCAN (Top1 scheme) and then find the bounding box (bbox) on the image that contains the largest number of matching pairs; this may yield several potential co-visible regions. The fundamental matrix is then computed from the matches of each co-visible region. An alternative that avoids clustering is to run MAGSAC++ for a few rounds at a low threshold to discard some false matches, and then derive the bbox from the surviving matches.
- Crop each bbox region and resize it to a preset image size. Match these “zoomed-in” images again with a matching algorithm, project the new matches back onto the original images, and concatenate them with the original matches. Note that the zoomed-in matches are added to the original matches, not substituted for them.
- (Optional) Filter overly concentrated matching pairs with non-maximum suppression, such as ANMS 9 or a radius-based NMS.
- Solve the fundamental matrix with OpenCV’s MAGSAC++ (a code sketch of this pipeline follows the list).
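Below is a minimal sketch of the crop-and-rematch pipeline, assuming a generic placeholder `match_fn(img0, img1) -> (pts0, pts1)` standing in for LoFTR/SuperGlue; the DBSCAN parameters, crop handling, and MAGSAC++ settings are illustrative values, not any contestant’s actual configuration.

```python
import cv2
import numpy as np
from sklearn.cluster import DBSCAN

def covisible_bbox(pts, eps=50.0, min_samples=5):
    """Cluster matched keypoints with DBSCAN and return the bounding box
    of the largest cluster, a rough estimate of the co-visible region."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
    valid = labels[labels >= 0]
    if valid.size == 0:
        return None
    cluster = pts[labels == np.bincount(valid).argmax()]
    x0, y0 = cluster.min(axis=0)
    x1, y1 = cluster.max(axis=0)
    return int(x0), int(y0), int(x1), int(y1)

def crop_and_rematch(img0, img1, match_fn, size=840):
    # Stage 1: initial matches on the full images.
    pts0, pts1 = match_fn(img0, img1)

    out0, out1 = [pts0], [pts1]
    box0, box1 = covisible_bbox(pts0), covisible_bbox(pts1)
    if box0 is not None and box1 is not None:
        # Stage 2: crop the co-visible regions, resize, and re-match.
        crops = []
        for img, (x0, y0, x1, y1) in ((img0, box0), (img1, box1)):
            crop = img[y0:y1, x0:x1]
            s = size / max(crop.shape[:2])
            crops.append((cv2.resize(crop, None, fx=s, fy=s), s, (x0, y0)))
        (c0, s0, o0), (c1, s1, o1) = crops
        q0, q1 = match_fn(c0, c1)
        # Project zoomed-in matches back to the original coordinates and
        # ADD them to the stage-1 matches (do not replace them).
        out0.append(q0 / s0 + o0)
        out1.append(q1 / s1 + o1)
    pts0, pts1 = np.vstack(out0), np.vstack(out1)

    # Solve the fundamental matrix with OpenCV's MAGSAC++
    # (positional args: method, reproj. threshold, confidence, maxIters).
    F, inlier_mask = cv2.findFundamentalMat(
        pts0, pts1, cv2.USAC_MAGSAC, 0.25, 0.9999, 10000)
    return F, inlier_mask
```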
The Top1 and Top2 solutions are described in more detail below.
Top1 ideas
Achieving this ranking entirely with open-source matchers (LoFTR 5, SuperGlue 7, DKM 8) is remarkable! Another key to the result is the “2-stage” strategy of multi-scale image cropping and matching.
The specific steps are as follows:
- Initial image matching: LoFTR 5 extracts features at a resolution of 840; SuperPoint+SuperGlue 7 extracts multi-scale features at resolutions of 840, 1024, and 1280; the matches from the two matchers are then concatenated. This is the stage-1 matching (a sketch of the multi-resolution concatenation follows this list).
- Cluster the matches with DBSCAN, keeping the best 80-90% of matching pairs; then take the bbox of the corresponding region and crop it. The author calls this method mkpt_crop; it effectively filters outliers and extracts the co-visible area.
- Re-match the cropped co-visible area with LoFTR 5, SuperGlue 7, and DKM 8; this is the stage-2 matching of the scheme.
- Concatenate the matched pairs from the two stages and solve the fundamental matrix with MAGSAC.
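A minimal sketch of the multi-resolution stage-1 matching, again with a placeholder `match_fn`; the key detail is rescaling keypoints back to the original image coordinates before concatenating matches from different resolutions.

```python
import cv2
import numpy as np

def multires_matches(img0, img1, match_fn, sizes=(840, 1024, 1280)):
    """Run a matcher at several resolutions and concatenate the matches,
    mapping all keypoints back to original image coordinates."""
    all0, all1 = [], []
    for size in sizes:
        s0 = size / max(img0.shape[:2])
        s1 = size / max(img1.shape[:2])
        r0 = cv2.resize(img0, None, fx=s0, fy=s0)
        r1 = cv2.resize(img1, None, fx=s1, fy=s1)
        pts0, pts1 = match_fn(r0, r1)
        all0.append(pts0 / s0)  # undo the resize for img0's keypoints
        all1.append(pts1 / s1)  # undo the resize for img1's keypoints
    return np.vstack(all0), np.vstack(all1)
```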
Top2 ideas
A few key points:
- A single model (the baseline) already reaches 0.838 (public) / 0.833 (private). The model is similar in structure to LoFTR 5: a transformer-based, detector-free matcher that matches images directly without keypoints. The author notes that the paper is still under blind review, so little can be disclosed for now.
- Ensembling other matchers (baseline + QuadTree 6 + LoFTR 5 + SuperGlue 7) further improves accuracy.
- Normalized positional encoding is used (a sketch follows this list).
- No pre- or post-processing is used.
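The exact formulation is unpublished, but the idea behind normalizing a LoFTR-style 2D sinusoidal positional encoding can be sketched: scale the position grid relative to a reference (training) feature-map size so the encoding does not drift when inference images are larger. This is my own illustrative reconstruction, not the Top2 author’s code.

```python
import numpy as np

def normalized_sine_pe(h, w, dim, train_hw=(60, 60)):
    """2D sinusoidal positional encoding with positions normalized by a
    reference feature-map size; at train_hw it reduces to the usual grid."""
    assert dim % 4 == 0
    ys = np.arange(h)[:, None] * (train_hw[0] / h)  # rescaled row coords
    xs = np.arange(w)[None, :] * (train_hw[1] / w)  # rescaled col coords
    freqs = 10000.0 ** (np.arange(dim // 4) / (dim // 4))
    pe = np.zeros((h, w, dim), dtype=np.float32)
    for i, f in enumerate(freqs):
        pe[..., 4 * i + 0] = np.sin(xs / f)
        pe[..., 4 * i + 1] = np.cos(xs / f)
        pe[..., 4 * i + 2] = np.sin(ys / f)
        pe[..., 4 * i + 3] = np.cos(ys / f)
    return pe
```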
Useful tricks
- Swapping the order of the input images improves the accuracy of LoFTR-like matchers (a sketch follows this list).
- Normalizing the positional encoding of LoFTR-like matchers helps (Top2 scheme).
- Different image-resize strategies make little difference.
- Refining match coordinates with ECO-TR is effective (not open source).
- Adding matches from local descriptors with non-learned matchers, such as DISK 11 or ALIKE 12, does not help.
- Semantic segmentation masks (sky/people) did not help either.
Summary
- The “2-stage” approach is quite effective for image matching tasks: first find the co-visible area, then zoom in and match again.
- It is better to solve the “recall” problem first, i.e. find as many matches as possible, possibly with several different matchers; modern RANSACs can be trusted to recover the pose even from fewer inliers.
- LoFTR 5 is very sensitive to the input image size, which deserves further study.
References
1. Image Matching: Local Features & Beyond, homepage: https://image-matching-workshop.github.io ↩
2. Image Matching Challenge 2022, homepage: https://www.kaggle.com/competitions/image-matching-challenge-2022 ↩
3. Image Matching Challenge 2022 Recap, Dmytro Mishkin, https://ducha-aiki.github.io/wide-baseline-stereo-blog/2022/07/05/IMC2022-Recap.html, homepage: http://dmytro.ai ↩
4. Competition is Finalized: Congrats to our Winners, Recap, https://www.kaggle.com/competitions/image-matching-challenge-2022/discussion/329650 ↩
5. LoFTR: Detector-Free Local Feature Matching with Transformers, CVPR 2021, code: https://github.com/zju3dv/LoFTR, pdf: https://arxiv.org/abs/2104.00680 ↩
6. QuadTree Attention for Vision Transformers, ICLR 2022, code: https://github.com/Tangshitao/QuadTreeAttention, pdf: https://arxiv.org/abs/2201.02767 ↩
7. SuperGlue: Learning Feature Matching with Graph Neural Networks, CVPR 2020, code: https://github.com/magicleap/SuperGluePretrainedNetwork, pdf: https://arxiv.org/abs/1911.11763 ↩
8. DKM: Deep Kernelized Dense Geometric Matching, arXiv 2022, code: https://github.com/Parskatt/DKM, pdf: https://arxiv.org/abs/2202.00667 ↩
9. ANMS: Efficient Adaptive Non-Maximal Suppression Algorithms for Homogeneous Spatial Keypoint Distribution, code: https://github.com/BAILOOL/ANMS-Codes, pdf: https://www.researchgate.net/publication/323388062_Efficient_adaptive_non-maximal_suppression_algorithms_for_homogeneous_spatial_keypoint_distribution ↩
10. OANet: Learning Two-View Correspondences and Geometry Using Order-Aware Network, code: https://github.com/zjhthu/OANet, pdf: https://arxiv.org/abs/1908.04964 ↩
11. DISK: Learning Local Features with Policy Gradient, NeurIPS 2020, code: https://github.com/cvlab-epfl/disk, pdf: https://arxiv.org/abs/2006.13566 ↩
12. ALIKE: Accurate and Lightweight Keypoint Detection and Descriptor Extraction, IEEE Transactions on Multimedia 2022, code: https://github.com/Shiaoming/ALIKE, pdf: https://arxiv.org/abs/2112.02906 ↩
13. ASLFeat: Learning Local Features of Accurate Shape and Localization, CVPR 2020, code: https://github.com/lzx551402/ASLFeat, pdf: https://arxiv.org/abs/2003.10071 ↩
This article is reprinted from https://www.vincentqin.tech/posts/imc2022/