Support Multi-threaded Benchmarks for Production-Realistic Performance Metrics #568

Open
mesibo opened this issue Jan 17, 2025 · 3 comments

Comments

@mesibo

mesibo commented Jan 17, 2025

Currently, ANN-Benchmarks enforces single-CPU execution during experiments, disabling multi-threading at the hardware level (the AWS runs use a single-CPU mode), so even libraries that support multi-core/multi-threaded execution cannot take advantage of it. While batch mode exists, it does not capture the performance benefits that multi-threading could offer.

Several ANN libraries use multi-core/multi-threaded processing to improve performance, and single-threaded benchmarks are unlikely to reflect their true potential in production. Most end users run ANN search on multi-core CPUs, and they should not end up choosing the wrong implementation based on single-CPU numbers. Publishing both single- and multi-threaded benchmarks would give users more realistic results that match a production setup.

We would like to request that multi-threaded benchmarks be added alongside the existing single-threaded tests:

  • Optionally, add a new flag (e.g., --threading) to enable multi-threaded evaluation (a rough sketch of such a flag follows this list)
  • Create separate performance plots for single-threaded vs. multi-threaded execution
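For concreteness, here is a rough sketch of what the proposed flag could look like. This is hypothetical: `--threading` is not part of ann-benchmarks today, and the real integration would go through the framework's existing argument parsing.

```python
# Hypothetical sketch of the proposed flag; ann-benchmarks does not have
# --threading today, and the actual wiring would go through the framework's
# existing argument parser.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--threading",
    action="store_true",
    help="let implementations use their multi-threaded query mode",
)

args = parser.parse_args(["--threading"])
print(args.threading)  # True
```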

We're happy to contribute to the implementation if needed, and we have already made some local changes to benchmark ANN implementations in multi-threaded mode.

Looking forward to your thoughts on this.

@maumueller
Collaborator

@mesibo Thanks for the initiative; I agree that reporting non-single-threaded results is interesting.

Could you detail what you are missing with the present --batch mode? The default implementation is of course not very interesting, but each implementation can specify its own behaviour.

@mesibo
Author

mesibo commented Jan 17, 2025

Actually, the default ann-benchmarks implementation is very well thought out and already supports multi-threading on local systems (e.g., the HNSWLib module uses multiple threads; as a side note, the set_num_threads() call in fit() possibly has no effect). So I don't think major changes are required. The main point is to publish benchmarks for multi-threaded scenarios rather than restricting them to a single CPU. Given the changing landscape, this would be more meaningful and better aligned with production use cases.
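For reference, this is roughly the threading surface hnswlib exposes (a minimal sketch using hnswlib's public Python API; the parameter values are illustrative, not the ann-benchmarks module's defaults):

```python
# Minimal sketch of hnswlib's threading knobs (public Python API; the
# parameter values here are illustrative, not ann-benchmarks defaults).
import numpy as np
import hnswlib

dim, n = 128, 10_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)

# add_items() parallelizes internally; num_threads=-1 uses all cores.
index.add_items(data, num_threads=-1)

# set_num_threads() affects subsequent calls that do not pass num_threads
# explicitly, e.g. knn_query() below.
index.set_num_threads(4)
labels, distances = index.knn_query(data[:10], k=5)
```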

Could you detail what you are missing with the present --batch mode?

You are correct that --batch mode could work; we're still exploring it (the FAISS GPU module implements batch mode). However, it may not be ideal in all cases. Let me explain our scenario:

Unlike most synchronous algorithms, our algorithm is asynchronous and analyzes patterns in vectors before indexing them. It uses multi-core systems for this pre-processing and hence has to be asynchronous. However, this poses no issue during indexing, as it is a one-time operation, and we only need to wait in the fit() function until indexing is complete. Hence, no changes are required on the indexing side.
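To illustrate the pattern, here is a hypothetical sketch (not our actual code; `_index_chunk` stands in for the asynchronous pattern analysis and indexing):

```python
# Hypothetical sketch of a fit() that dispatches asynchronous, multi-core
# indexing and blocks until it completes; the names are illustrative only.
import concurrent.futures
import numpy as np

class AsyncANN:
    def _index_chunk(self, chunk):
        # Stand-in for the real pattern analysis + indexing of one chunk.
        return chunk.sum()

    def fit(self, X):
        chunks = np.array_split(X, 4)
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = [pool.submit(self._index_chunk, c) for c in chunks]
            # Indexing is a one-time operation, so it is fine to simply
            # block here until every asynchronous worker has finished.
            concurrent.futures.wait(futures)

AsyncANN().fit(np.random.rand(1_000, 32).astype(np.float32))
```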

However, querying is also asynchronous, and the algorithm can handle multiple query vectors in parallel. This is where integrating it into the ann-benchmarks framework becomes slightly challenging. Running single_query() benchmarks doesn't work because the call returns quickly without the query having been processed (due to its asynchronous nature). And unlike the fit() case, single_query() cannot simply wait for each query to finish, as that would negate the benefits of parallel processing.

While --batch mode could work, it introduces unnecessary overhead by maintaining a large queue, even though the algorithm typically processes only 8–10 vectors at a time, leading to higher memory usage.
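What we have in mind is closer to a bounded window of in-flight queries, something like this hypothetical sketch (`async_query` stands in for our asynchronous search entry point):

```python
# Hypothetical sketch: cap the number of in-flight queries at ~8 instead of
# queueing the whole query set; async_query is a stand-in for the real
# asynchronous search entry point.
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 8

def bounded_batch_query(queries, n, async_query):
    window = threading.Semaphore(MAX_IN_FLIGHT)
    results = [None] * len(queries)

    def run(i, q):
        try:
            results[i] = async_query(q, n)
        finally:
            window.release()  # free a slot once this query completes

    with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
        for i, q in enumerate(queries):
            window.acquire()  # block while the window is full
            pool.submit(run, i, q)
    return results

# Demo with a trivial stand-in for the asynchronous search:
print(bounded_batch_query(list(range(20)), n=10, async_query=lambda q, n: q))
```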

We are still investigating potential solutions, but any suggestions or guidance you can provide would be greatly appreciated.

@maumueller
Collaborator

Thanks for making such a strong case, @mesibo. I'd be very interested to see the work you are doing and how the current framework stands in the way of achieving the best performance for your implementation.

I'm not entirely sure I understand what you mean by

While --batch mode could work, it introduces unnecessary overhead by maintaining a large queue, even though the algorithm typically processes 8–10 vectors at a time, leading to higher memory usage and additional overhead.

Batch mode will bypass the single_query method and instead call https://github.com/erikbern/ann-benchmarks/blob/main/ann_benchmarks/runner.py#L86-L121. This passes the whole set of query vectors at once to the batch_query method, which your implementation is supposed to provide.
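As a minimal sketch of what an implementation-side override could look like (assuming the module interface in ann_benchmarks/algorithms/base/module.py; `my_parallel_search` is a hypothetical stand-in for your own multi-threaded search):

```python
# Minimal sketch of a batch-mode override, assuming the BaseANN interface
# from ann_benchmarks/algorithms/base/module.py; my_parallel_search() is a
# hypothetical stand-in for your own multi-threaded search.
from ann_benchmarks.algorithms.base.module import BaseANN

class MyAlgo(BaseANN):
    def batch_query(self, X, n):
        # With --batch, the runner hands over the whole query set in one
        # call, so the implementation can parallelize however it likes.
        self.res = my_parallel_search(X, n)  # hypothetical helper

    def get_batch_results(self):
        return self.res
```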

The only place where we use a multiprocessing queue is in our implementation of --parallelism, which allows carrying out independent single-threaded experiments in parallel.
