Issues with the evaluation #1
Comments
I think what you're trying to convey with Figure 1 is that you can achieve an AUC@5 of 56 at a speed of 40 ms, but that's not actually the case. This could lead to misunderstandings and confusion.
Thank you for your reminder and suggestions. We have changed Figure 1 in the camera-ready version to the right image shown above to dispel the potential misunderstanding.
Thank you very much for the extensive reply. 1) Figure 1: Thank you for updating the figure. As @Master-cai mentions, reporting accuracy and speed from two different datasets was confusing. After some investigation, it seems that there might be a problem with your evaluation of LightGlue on ScanNet. We actually have a fully-reproducible evaluation available at cvg/glue-factory#25. Running it with 2k keypoints on full-resolution images (1296x968) with the basic RANSAC, we obtain an AUC of 17.7 / 34.6 / 51.2 while you report 14.8 / 30.8 / 47.5. Running this evaluation at only 1k keypoints or 640x480 does not make much sense given that dense approaches can leverage a lot more correspondences. As for the speed, running LightGlue on 2k keypoints takes less than 20 ms on an RTX 3090 (see our benchmark) while you report >30 ms. Your paper does not mention whether this time includes the feature extraction by SuperPoint (~10 ms).
2) Timing: Indeed, the speed reported in the LightGlue paper does not include the time of feature extraction. In most applications, such as SfM, features are extracted once per image and subsequently cached. Since each image is matched to multiple others, the matching time dominates by far. For sparse matching, I think that including the extraction time is thus not fair and does not reflect realistic operating conditions. For dense matching, it should probably be included because caching the dense features is unfeasible due to their high memory footprint. To summarize, I suggest that a more appropriate Figure 1 would look like the following: Alternatively, you could use the MegaDepth evaluation and report the average matching time per image. The image resolution is a great way to control the speed vs accuracy trade-off for dense approaches, so you could even report multiple values per approach. 3) RANSAC threshold and variant: Thank you for providing these numbers, this is very insightful. Nice to see that better tuning of the inlier threshold and switching to LO-RANSAC can both increase the accuracy of your approach. I think that LO-RANSAC should become the standard in future evaluations because it is much less sensitive to the choice of inlier threshold, making the evaluation a lot more reliable and closer to downstream applications like SfM.
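For reference, the AUC numbers quoted above are the area under the recall curve of the per-pair pose error (the maximum of the angular rotation and translation errors) at thresholds of 5/10/20 degrees. Below is a minimal sketch following the common SuperGlue-style implementation; the exact glue-factory code may differ in details:

```python
import numpy as np

def pose_auc(errors_deg, thresholds=(5, 10, 20)):
    """Area under the recall-vs-error curve, one value per threshold."""
    errors = np.sort(np.asarray(errors_deg, dtype=np.float64))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    errors = np.concatenate(([0.0], errors))      # start the curve at the origin
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for t in thresholds:
        last = np.searchsorted(errors, t)
        e = np.concatenate((errors[:last], [t]))  # clip the curve at the threshold
        r = np.concatenate((recall[:last], [recall[last - 1]]))
        aucs.append(np.trapz(r, x=e) / t)
    return aucs

# errors_deg: one value per image pair, e.g. max(rotation_err, translation_err)
print(pose_auc([1.2, 3.4, 7.8, 25.0]))  # -> [AUC@5, AUC@10, AUC@20]
```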
Thank you very much for the detailed and insightful reply. 1) Whether to Include SP Time:
Indeed, in the SfM application, the sparse features can be extracted once and stored. However, we believe that for applications such as visual localization, which are more latency-sensitive than offline SfM, reporting only the matching time without the sparse detection and description time (i.e., the SP time) does not reflect practical operating conditions: 1) For visual localization based on SfM models (and pre-stored features), the feature detection and description time for each query image cannot be eliminated; reporting only the matching time does not capture this scenario. 2) As argued in MeshLoc, storing the sparse features of an entire database map takes a lot of space, especially for large-scale scenes ("E.g., storing SuperPoint features for the Aachen Day-Night v1.1 dataset requires more than 25 GB…"). Moreover, loading pre-stored database features instead of re-extracting them at localization time also adds data-loading time. This is why MeshLoc proposes to store a mesh instead of SfM features. For matching methods used in this practical setting, we think it is more appropriate to report the end-to-end matching time (SP + LG). Therefore, we think it is better to report both the (SP+LG) and (LG only) times to cover both the SfM and the visual localization settings for a more comprehensive understanding. 2) The experimental setting of SP+LG:
We first clarify that we compute the average running time of SuperPoint+LightGlue across the 1500 ScanNet pairs. This differs from the LightGlue benchmark in two main respects. First, since the results in the LightGlue paper did not use flash attention when comparing with other methods (as stated in the issue), we opted for a fair comparison by disabling flash attention for all methods. Second, the LightGlue benchmark repeats on a single pair, whereas we average over 1500 pairs.
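For clarity, a rough sketch of this timing protocol; `matcher` and `pairs` are placeholders for the SP+LG pipeline and the 1500 ScanNet pairs, and the flash-attention switch shown is the generic PyTorch 2.x one (the exact mechanism used by each method may differ):

```python
import time
import torch

@torch.no_grad()
def mean_latency_ms(matcher, pairs, device="cuda"):
    """Per-pair latency averaged over all pairs, batch size 1."""
    times = []
    for data in pairs:
        data = {k: v.to(device) for k, v in data.items()}
        torch.cuda.synchronize()            # finish any pending GPU work
        t0 = time.perf_counter()
        _ = matcher(data)
        torch.cuda.synchronize()            # wait for this forward pass to finish
        times.append((time.perf_counter() - t0) * 1e3)
    return sum(times) / len(times)

# Disable flash attention for all methods (fair comparison):
# with torch.backends.cuda.sdp_kernel(enable_flash=False):
#     print(mean_latency_ms(matcher, scannet_pairs))
```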
Regarding the accuracy we reported, we first clarify that the AUC in our paper corresponds to SP[640/2048/0.0005/4]+LG prune[0.95/0.95], where 640 follows the ScanNet setting of SuperGlue (and 9/10 follow-up papers) and (2048/0.0005/4) uses the default LightGlue parameters. Using a 1296x968 image as the input to SuperPoint not only greatly increases the SuperPoint time (even exceeding the total time of our full model), but the extraction of higher-quality features may also make the matching process of LightGlue pruning easier (faster) and better (+2~3 AUC@5). In this context, especially considering the real-time feature extraction needs of downstream localization tasks, comparing only the time taken by LightGlue against other end-to-end methods may not be entirely fair. What's more, the AUC of 17.7/34.6/51.2 at cvg/glue-factory#25 corresponds to LG no_prune, which requires flash attention to bring the time down from 36.6 ms to 19.4 ms, as shown in the table above. The accuracy of all settings is reproduced with the forked gluefactory, using gluefactory's environment (torch 2.2.1) on an RTX 3090:
If we don't follow the ScanNet setting of SuperGlue, as shown in the table below, we can achieve similar accuracy at a lower resolution with faster speeds. We also provide reproduction scripts.
In summary, if we don't follow SuperGlue's 640x480 ScanNet setting, a more accurate Figure 1 might look like this: Note that these hardware-specific accelerations are not available on all hardware. Flash attention requires Turing (sm_75) or newer architectures, while FP16 requires Volta (sm_70) or newer. That is, V100 GPUs cannot use flash attention, and GPUs older than the V100 can use neither flash attention nor FP16.
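A quick way to check whether a given GPU supports these accelerations, using the compute-capability cutoffs mentioned above (a minimal sketch):

```python
import torch

major, minor = torch.cuda.get_device_capability(0)  # e.g. (8, 6) for an RTX 3090
sm = 10 * major + minor
print(torch.cuda.get_device_name(0), f"(sm_{sm})")
print("FP16 supported (Volta, sm_70 or newer):            ", sm >= 70)
print("Flash attention supported (Turing, sm_75 or newer):", sm >= 75)
```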
3) RANSAC threshold and variant:
We agree that LO-RANSAC is a better choice than OpenCV RANSAC for practical use, as it is more robust and less sensitive to the choice of inlier threshold. We also ran more experiments with LO-RANSAC on ScanNet and found that the accuracy gap still exists.
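For reference, a minimal sketch of the two estimators being compared, assuming matched keypoints in pixels and PINHOLE intrinsics; the poselib call follows its Python bindings as used in glue-factory, but the exact argument and return formats may vary by version:

```python
import cv2
import numpy as np
import poselib  # pip install poselib (provides LO-RANSAC estimators)

def pose_opencv_ransac(kpts0, kpts1, K0, K1, thresh_px=0.5):
    """Classic OpenCV RANSAC on normalized coordinates (SuperGlue-style)."""
    norm0 = cv2.undistortPoints(kpts0[:, None].astype(np.float64), K0, None)[:, 0]
    norm1 = cv2.undistortPoints(kpts1[:, None].astype(np.float64), K1, None)[:, 0]
    # convert the pixel threshold to normalized coordinates via the mean focal length
    f_mean = np.mean([K0[0, 0], K0[1, 1], K1[0, 0], K1[1, 1]])
    E, mask = cv2.findEssentialMat(norm0, norm1, np.eye(3), method=cv2.RANSAC,
                                   prob=0.99999, threshold=thresh_px / f_mean)
    _, R, t, _ = cv2.recoverPose(E, norm0, norm1, np.eye(3), mask=mask)
    return R, t

def pose_lo_ransac(kpts0, kpts1, cam0, cam1, thresh_px=1.0):
    """LO-RANSAC via poselib; cam0/cam1 are COLMAP-style camera dicts, e.g.
    {"model": "PINHOLE", "width": w, "height": h, "params": [fx, fy, cx, cy]}."""
    pose, info = poselib.estimate_relative_pose(
        kpts0, kpts1, cam0, cam1, {"max_epipolar_error": thresh_px}, {})
    return pose.R, pose.t
```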
I am afraid "image pairs per second" cannot simply be obtained as the inverse of "processing time per pair", unless one can make sure the GPU is 100% utilized (i.e., with batched inference). I don't think this is guaranteed by the discussion above and in the LightGlue repo. Correct me if I am wrong @sarlinpe @Phil26AT. I guess the community should report both latency and throughput, as shown in the "THE EFFICIENCY MISNOMER" paper. One should measure them carefully and rigorously.
We agree that "fps cannot be simply obtained as the inverse of latency" and thank you for the comment. Perhaps our Figure 1 would have been more accurately described using "forward passes per second (1/latency)" to avoid confusion with throughput. As one of LightGlue's core novelties and designs, adaptive pruning significantly reduces LightGlue's latency with negligible effect on accuracy. Therefore we follow the LightGlue paper's setting, which uses 1/latency for the running-time comparison in the teaser figure. Furthermore, as emphasized in the "THE EFFICIENCY MISNOMER" paper you mentioned ("we argue that cost indicators that one cares about strongly depend on the applications and setup in which the models are supposed to be used"), we believe that for the image matching task, latency is more crucial for online applications like SLAM and visual localization, which are more sensitive to speed than offline SfM.
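To illustrate the distinction, here is a rough sketch of a throughput measurement that saturates the GPU with batched inference, in contrast to the batch-1 latency whose inverse the teaser figure reports (`matcher` and the batched input are placeholders):

```python
import time
import torch

@torch.no_grad()
def throughput_pairs_per_s(matcher, batched_input, batch_size, n_iters=50):
    """Image pairs per second with the GPU kept busy by batched inference."""
    for _ in range(5):                       # warm-up iterations
        matcher(batched_input)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n_iters):
        matcher(batched_input)
    torch.cuda.synchronize()
    return n_iters * batch_size / (time.perf_counter() - t0)
```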
I want to know why this result is not consistent with the paper. Is it the result of training on ScanNet?
It is the result of using the same checkpoint as the paper reports (trained on MegaDepth); the only difference is the confidence threshold (0.2 vs. 0.1).
Yes, with the same ckpt but different confidence threshold.
If you use an RTX 3090, both the MegaDepth results and the ScanNet results in the paper and in this issue should be numerically reproducible with the scripts in reproduce_test and varied_size. To reproduce the numerical results on an RTX 3090, please make sure to use a new environment created by following the README instructions with miniconda. Additionally, ensure that you do not share the GPU with other Python programs when running the scripts. You can also share your Python and Torch versions to help with verification.
Thanks for your reply.
Also, make sure not to use multi-GPU parallel testing; you need to use a single GPU with a batch size of 1. Another quick way to reproduce is to use an RTX 3090 and CPU from AutoDL with the provided image (python=3.8, torch=2.0.0, cuda=11.8) to prevent issues caused by specific CPU models. Although we have not observed any impact of the CPU model on numerical results, there is still a small probability that the numerical results may be affected by it.
I ran the MegaDepth test using the official environment, a single GPU (RTX 3090), and the configuration is as follows
and got this result:
For ScanNet, however, using the same environment, I was able to achieve the accuracy reported in the paper.
Besides using cloud servers like AutoDL to rule out potential hardware issues as a last resort, I also suggest using a freshly cloned repository to prevent modifications to the code from impacting the results. Additionally, confirm that the PyTorch version is 2.0.0+cu118, e.g. with the quick check below.
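A quick sanity check of the environment (a sketch; the expected values follow the versions mentioned above):

```python
import torch

print(torch.__version__)               # expect 2.0.0+cu118
print(torch.version.cuda)              # expect 11.8
print(torch.backends.cudnn.version())
print(torch.cuda.get_device_name(0))   # expect an RTX 3090
```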
I verified that my Python and PyTorch versions are consistent with the source project. I also tried running with fp32 but didn't get much improvement. I want to know how the Ours and Optimized results in this paper correspond: Ours = Full + fp32 + no flash? Optimized = Opt. + mp + no flash? I wonder if it could be a parameter-setting problem, e.g. the border and so on.
Yes.
No, directly running the scripts in reproduce_test from a freshly cloned codebase will use exactly the same parameter settings as the paper.
Thanks for your reply. I'll try it again.
Hi,
Before everyone gets too excited, I need to point out some obvious issues in the evaluation described in the paper.
In Figure 1, the inference time of the semi-dense approaches is largely underestimated because it is computed at a much lower resolution than the pose accuracy (on MegaDepth). This is evidenced by Table 8: 56.4 AUC@5° and 40.1 ms (Table 1) actually correspond to resolutions 1184×1184 and 640×640, respectively. In reality, the proposed approach is much slower: the inference time at this resolution is 139 ms (compare this to LightGlue's 30 ms). For the reported inference time, the proposed approach is actually not more accurate than LightGlue (and most likely less).
The same story goes for other semi-dense matchers: for LoFTR it should be much higher than 66 ms, closer to 180 ms (LightGlue paper, Table 2). Even at this resolution, the accuracy gap might completely vanish when using a modern implementation of RANSAC, as found in PoseLib. Evidence of this can also be found in LightGlue, Table 2 (LO-RANSAC). This can be easily evaluated in glue-factory, so this omission is surprising.
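For context, the resolution knob being discussed is usually implemented as a resize of the longer image side before matching; a minimal sketch of that convention (the exact preprocessing, e.g. square padding, may differ per repository):

```python
import cv2

def resize_long_side(image, target=1184):
    """Resize so that the longer side equals `target`, keeping the aspect ratio.
    The returned scale must also be applied to keypoints and intrinsics."""
    h, w = image.shape[:2]
    scale = target / max(h, w)
    resized = cv2.resize(image, (round(w * scale), round(h * scale)),
                         interpolation=cv2.INTER_AREA)
    return resized, scale
```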
We'd appreciate having the authors comment on this - @wyf2020 @hxy-123 @Cuistiano - thank you!
cc @Phil26AT