Issues with the evaluation #1
Comments
I think what you're trying to convey with Figure 1 is that you can achieve an AUC@5 of 56 at a speed of 40 ms, but that's not actually the case. This could lead to misunderstandings and confusion.
Thank you for your reminder and suggestions. We have changed Figure 1 in the camera-ready version to the right image shown above to dispel the potential misunderstanding.
Thank you very much for the extensive reply. 1) Figure 1: Thank you for updating the figure. As @Master-cai mentions, reporting accuracy and speed from two different datasets was confusing. After some investigation, it seems that there might be a problem with your evaluation of LightGlue on ScanNet. We actually have a fully-reproducible evaluation available at cvg/glue-factory#25. Running it with 2k keypoints on full-resolution images (1296x968) with the basic RANSAC, we obtain an AUC of 17.7 / 34.6 / 51.2 while you report 14.8 / 30.8 / 47.5. Running this evaluation at only 1k keypoints or 640x480 does not make much sense given that dense approaches can leverage a lot more correspondences. As for the speed, running LightGlue on 2k keypoints takes less than 20 ms on an RTX 3090 (see our benchmark) while you report >30 ms. Your paper does not mention whether this time includes the feature extraction by SuperPoint (~10 ms).
2) Timing: Indeed, the speed reported in the LightGlue paper does not include the time of feature extraction. In most applications, such as SfM, features are extracted once per image and subsequently cached. Since each image is matched to multiple others, the matching time dominates by far. For sparse matching, I think that including the extraction time is thus not fair and does not reflect realistic operating conditions. For dense matching, it should probably be included because caching the dense features is unfeasible due to their high memory footprint. To summarize, I suggest that a more appropriate Figure 1 would look like the following: Alternatively, you could use the MegaDepth evaluation and report the average matching time per image. The image resolution is a great way to control the speed vs accuracy trade-off for dense approaches, so you could even report multiple values per approach. 3) RANSAC threshold and variant: Thank you for providing these numbers, this is very insightful. Nice to see that better tuning of the inlier threshold and switching to LO-RANSAC can both increase the accuracy of your approach. I think that LO-RANSAC should become the standard in future evaluations because it is much less sensitive to the choice of inlier threshold, making the evaluation a lot more reliable and closer to downstream applications like SfM.
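For reference, the AUC numbers quoted above are the area under the recall curve of the per-pair pose error (the maximum of the angular rotation and translation errors) at thresholds of 5/10/20 degrees. Below is a minimal sketch following the common SuperGlue-style implementation; the exact glue-factory code may differ in details:

```python
import numpy as np

def pose_auc(errors_deg, thresholds=(5, 10, 20)):
    """Area under the recall-vs-error curve, one value per threshold."""
    errors = np.sort(np.asarray(errors_deg, dtype=np.float64))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    errors = np.concatenate(([0.0], errors))      # start the curve at the origin
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for t in thresholds:
        last = np.searchsorted(errors, t)
        e = np.concatenate((errors[:last], [t]))  # clip the curve at the threshold
        r = np.concatenate((recall[:last], [recall[last - 1]]))
        aucs.append(np.trapz(r, x=e) / t)
    return aucs

# errors_deg: one value per image pair, e.g. max(rotation_err, translation_err)
print(pose_auc([1.2, 3.4, 7.8, 25.0]))  # -> [AUC@5, AUC@10, AUC@20]
```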
Thank you very much for the detailed and insightful reply. 1) Whether to Include SP Time:
Indeed, in the SfM application, the sparse features can be extracted once and stored. However, we believe that for applications such as visual localization, which are more latency-sensitive than offline SfM, reporting only the matching time without the sparse detection and description time (i.e., the SP time) does not reflect practical operating conditions: 1) For visual localization based on SfM models (and pre-stored features), the feature detection and description time for each query image cannot be eliminated; reporting only the matching time does not capture this scenario. 2) As argued in MeshLoc, storing the sparse features of an entire database map takes a lot of space, especially for large-scale scenes ("E.g., storing SuperPoint features for the Aachen Day-Night v1.1 dataset requires more than 25 GB…"). Moreover, loading pre-stored database features instead of re-extracting them at localization time also adds data-loading time. This is why MeshLoc proposes to store a mesh instead of SfM features. For matching methods used in this practical setting, we think it is more appropriate to report the end-to-end matching time (SP + LG). Therefore, we think it is better to report both the (SP+LG) and (LG only) times to cover both the SfM and the visual localization settings for a more comprehensive understanding. 2) The experimental setting of SP+LG:
We first clarify that we compute the average running time of SuperPoint+LightGlue across the 1500 ScanNet pairs. This differs from the LightGlue benchmark in two main respects. First, since the results in the LightGlue paper did not use flash attention when comparing with other methods (as stated in the issue), we opted for a fair comparison by disabling flash attention for all methods. Second, the LightGlue benchmark repeats on a single pair, whereas we average over 1500 pairs.
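For clarity, a rough sketch of this timing protocol; `matcher` and `pairs` are placeholders for the SP+LG pipeline and the 1500 ScanNet pairs, and the flash-attention switch shown is the generic PyTorch 2.x one (the exact mechanism used by each method may differ):

```python
import time
import torch

@torch.no_grad()
def mean_latency_ms(matcher, pairs, device="cuda"):
    """Per-pair latency averaged over all pairs, batch size 1."""
    times = []
    for data in pairs:
        data = {k: v.to(device) for k, v in data.items()}
        torch.cuda.synchronize()            # finish any pending GPU work
        t0 = time.perf_counter()
        _ = matcher(data)
        torch.cuda.synchronize()            # wait for this forward pass to finish
        times.append((time.perf_counter() - t0) * 1e3)
    return sum(times) / len(times)

# Disable flash attention for all methods (fair comparison):
# with torch.backends.cuda.sdp_kernel(enable_flash=False):
#     print(mean_latency_ms(matcher, scannet_pairs))
```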
Regarding the accuracy we reported, we first clarify that the AUC in our paper corresponds to SP[640/2048/0.0005/4]+LG prune[0.95/0.95], where 640 follows the ScanNet setting of SuperGlue (and 9/10 follow-up papers) and (2048/0.0005/4) uses the default LightGlue parameters. Using a 1296x968 image as the input to SuperPoint not only greatly increases the SuperPoint time (even exceeding the total time of our full model), but the extraction of higher-quality features may also make the matching process of LightGlue pruning easier (faster) and better (+2~3 AUC@5). In this context, especially considering the real-time feature extraction needs of downstream localization tasks, comparing only the time taken by LightGlue against other end-to-end methods may not be entirely fair. What's more, the AUC of 17.7/34.6/51.2 at cvg/glue-factory#25 corresponds to LG no_prune, which requires flash attention to bring the time down from 36.6 ms to 19.4 ms, as shown in the table above. The accuracy of all settings is reproduced with the forked gluefactory, using gluefactory's environment (torch 2.2.1) on an RTX 3090:
If we don't follow the ScanNet setting of SuperGlue, as shown in the table below, we can achieve similar accuracy at a lower resolution with faster speeds. We also provide reproduction scripts.
In summary, if we don't follow SuperGlue's 640x480 ScanNet setting, a more accurate Figure 1 might look like this: Note that these hardware-specific accelerations are not available on all hardware. Flash attention requires Turing (sm_75) or newer architectures, while FP16 requires Volta (sm_70) or newer. That is, V100 GPUs cannot use flash attention, and GPUs older than the V100 can use neither flash attention nor FP16.
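A quick way to check whether a given GPU supports these accelerations, using the compute-capability cutoffs mentioned above (a minimal sketch):

```python
import torch

major, minor = torch.cuda.get_device_capability(0)  # e.g. (8, 6) for an RTX 3090
sm = 10 * major + minor
print(torch.cuda.get_device_name(0), f"(sm_{sm})")
print("FP16 supported (Volta, sm_70 or newer):            ", sm >= 70)
print("Flash attention supported (Turing, sm_75 or newer):", sm >= 75)
```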
3) RANSAC threshold and variant:
We agree that LO-RANSAC is a better choice than OpenCV RANSAC for practical use, as it is more robust and less sensitive to the choice of inlier threshold. We also ran more experiments with LO-RANSAC on ScanNet and found that the accuracy gap still exists.
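For reference, a minimal sketch of the two estimators being compared, assuming matched keypoints in pixels and PINHOLE intrinsics; the poselib call follows its Python bindings as used in glue-factory, but the exact argument and return formats may vary by version:

```python
import cv2
import numpy as np
import poselib  # pip install poselib (provides LO-RANSAC estimators)

def pose_opencv_ransac(kpts0, kpts1, K0, K1, thresh_px=0.5):
    """Classic OpenCV RANSAC on normalized coordinates (SuperGlue-style)."""
    norm0 = cv2.undistortPoints(kpts0[:, None].astype(np.float64), K0, None)[:, 0]
    norm1 = cv2.undistortPoints(kpts1[:, None].astype(np.float64), K1, None)[:, 0]
    # convert the pixel threshold to normalized coordinates via the mean focal length
    f_mean = np.mean([K0[0, 0], K0[1, 1], K1[0, 0], K1[1, 1]])
    E, mask = cv2.findEssentialMat(norm0, norm1, np.eye(3), method=cv2.RANSAC,
                                   prob=0.99999, threshold=thresh_px / f_mean)
    _, R, t, _ = cv2.recoverPose(E, norm0, norm1, np.eye(3), mask=mask)
    return R, t

def pose_lo_ransac(kpts0, kpts1, cam0, cam1, thresh_px=1.0):
    """LO-RANSAC via poselib; cam0/cam1 are COLMAP-style camera dicts, e.g.
    {"model": "PINHOLE", "width": w, "height": h, "params": [fx, fy, cx, cy]}."""
    pose, info = poselib.estimate_relative_pose(
        kpts0, kpts1, cam0, cam1, {"max_epipolar_error": thresh_px}, {})
    return pose.R, pose.t
```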
I am afraid "image pairs per second" cannot simply be obtained as the inverse of "processing time per pair", unless one can make sure the GPU is 100% utilized (i.e., with batched inference). I don't think this is guaranteed by the discussion above and in the LightGlue repo. Correct me if I am wrong @sarlinpe @Phil26AT. I guess the community should report both latency and throughput, as shown in the "THE EFFICIENCY MISNOMER" paper. One should measure them carefully and rigorously.
We agree that "fps cannot be simply obtained as the inverse of latency" and thank you for the comment. Perhaps our Figure 1 would have been more accurately described using "forward passes per second (1/latency)" to avoid confusion with throughput. As one of LightGlue's core novelties and designs, adaptive pruning significantly reduces LightGlue's latency with negligible effect on accuracy. Therefore we follow the LightGlue paper's setting, which uses 1/latency for the running-time comparison in the teaser figure. Furthermore, as emphasized in the "THE EFFICIENCY MISNOMER" paper you mentioned ("we argue that cost indicators that one cares about strongly depend on the applications and setup in which the models are supposed to be used"), we believe that for the image matching task, latency is more crucial for online applications like SLAM and visual localization, which are more sensitive to speed than offline SfM.
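To illustrate the distinction, here is a rough sketch of a throughput measurement that saturates the GPU with batched inference, in contrast to the batch-1 latency whose inverse the teaser figure reports (`matcher` and the batched input are placeholders):

```python
import time
import torch

@torch.no_grad()
def throughput_pairs_per_s(matcher, batched_input, batch_size, n_iters=50):
    """Image pairs per second with the GPU kept busy by batched inference."""
    for _ in range(5):                       # warm-up iterations
        matcher(batched_input)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n_iters):
        matcher(batched_input)
    torch.cuda.synchronize()
    return n_iters * batch_size / (time.perf_counter() - t0)
```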
I want to know why this result is not consistent with the paper. Is it the result of training on ScanNet?
It is the result of using the same checkpoint as the paper reports (trained on MegaDepth); the only difference is the confidence threshold (0.2 vs. 0.1).
Yes, with the same ckpt but different confidence threshold.
If you use an RTX 3090, both the MegaDepth results and the ScanNet results in the paper and in this issue should be numerically reproducible with the scripts in reproduce_test and varied_size. To reproduce the numerical results on an RTX 3090, please make sure to use a new environment created by following the README instructions with miniconda. Additionally, ensure that you do not share the GPU with other Python programs when running the scripts. You can also share your Python and Torch versions to help with verification.
Thanks for your reply.
Also, make sure not to use multi-GPU parallel testing; you need to use a single GPU with a batch size of 1. Another quick way to reproduce is to use an RTX 3090 and CPU from AutoDL with the provided image (python=3.8, torch=2.0.0, cuda=11.8) to prevent issues caused by specific CPU models. Although we have not observed any impact of the CPU model on numerical results, there is still a small probability that the numerical results may be affected by it.
I ran the MegaDepth test using the official environment, a single GPU (RTX 3090), and the configuration is as follows
and got this result:
For ScanNet, however, using the same environment, I was able to achieve the accuracy reported in the paper.
Besides using cloud servers like AutoDL to rule out potential hardware issues as a last resort, I also suggest using a freshly cloned repository to prevent modifications to the code from impacting the results. Additionally, confirm that the PyTorch version is 2.0.0+cu118, e.g. with the quick check below.
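A quick sanity check of the environment (a sketch; the expected values follow the versions mentioned above):

```python
import torch

print(torch.__version__)               # expect 2.0.0+cu118
print(torch.version.cuda)              # expect 11.8
print(torch.backends.cudnn.version())
print(torch.cuda.get_device_name(0))   # expect an RTX 3090
```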
I verified that my Python and PyTorch versions are consistent with the source project. I also tried running with fp32 but didn't get much improvement. I want to know how the Ours and Optimized results in this paper correspond: Ours = Full + fp32 + no flash? Optimized = Opt. + mp + no flash? I wonder if it could be a parameter-setting problem, e.g. the border and so on.
Yes.
No, directly running the scripts in reproduce_test from a freshly cloned codebase will use exactly the same parameter settings as the paper.
Thanks for your reply. I'll try it again.
Hi,
Before everyone gets too excited, I need to point out some obvious issues in the evaluation described in the paper.
In Figure 1, the inference time of the semi-dense approaches is largely underestimated because it is computed at a much lower resolution than the pose accuracy (on MegaDepth). This is evidenced by Table 8: 56.4 AUC@5° and 40.1 ms (Table 1) actually correspond to resolutions 1184×1184 and 640×640, respectively. In reality, the proposed approach is much slower: the inference time at this resolution is 139 ms (compare this to LightGlue's 30 ms). For the reported inference time, the proposed approach is actually not more accurate than LightGlue (and most likely less).
The same story goes for other semi-dense matchers: for LoFTR it should be much higher than 66 ms, closer to 180 ms (LightGlue paper, Table 2). Even at this resolution, the accuracy gap might completely vanish when using a modern implementation of RANSAC, as found in PoseLib. Evidence of this can also be found in LightGlue, Table 2 (LO-RANSAC). This can be easily evaluated in glue-factory, so this omission is surprising.
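For context, the resolution knob being discussed is usually implemented as a resize of the longer image side before matching; a minimal sketch of that convention (the exact preprocessing, e.g. square padding, may differ per repository):

```python
import cv2

def resize_long_side(image, target=1184):
    """Resize so that the longer side equals `target`, keeping the aspect ratio.
    The returned scale must also be applied to keypoints and intrinsics."""
    h, w = image.shape[:2]
    scale = target / max(h, w)
    resized = cv2.resize(image, (round(w * scale), round(h * scale)),
                         interpolation=cv2.INTER_AREA)
    return resized, scale
```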
We'd appreciate having the authors comment on this - @wyf2020 @hxy-123 @Cuistiano - thank you!
cc @Phil26AT