
Inner Product Metric Produces Negative Distances for Similar Embeddings with Quantized Int8 Data #3188

Open
LukasKriesch opened this issue Jan 23, 2025 · 0 comments

Description
I’m using the USearch library to perform semantic search over int8-quantized embeddings with the inner product ("ip") metric. The results appear to be correctly ranked by similarity, but the distances returned by the search are negative. I would like to understand this behavior and confirm whether it is expected.

Steps to Reproduce
Compute embeddings using the following model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "mixedbread-ai/deepset-mxbai-embed-de-large-v1",
    model_kwargs={"torch_dtype": "float16"},
)
```

Quantize the embeddings with precomputed ranges:

```python
from sentence_transformers.quantization import quantize_embeddings

int8_embeddings = quantize_embeddings(
    embeddings,
    precision="int8",
    ranges=ranges,
)
```

Add the quantized embeddings to a Usearch index:


```python
from usearch.index import Index

index = Index(ndim=1024, metric="ip", dtype="i8")
index.add(list(ids), int8_embeddings)
```

Perform a query with a normalized embedding:

```python
query = model.encode("query: Pizza", normalize_embeddings=True)
query = quantize_embeddings(query, precision="int8", ranges=ranges)
matches = index.search(query, 10, return_distances=True)
```

Observed Behavior
The distances returned by the index are negative, even for highly similar embeddings.
For example:
Top result distance: -426537.0
Less similar result distance: -436325.0
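For what it's worth, the magnitudes line up with raw int8 dot products. Below is a standalone NumPy sketch (no USearch involved; it assumes, purely hypothetically, that the "ip" distance is computed as 1 − ⟨a, b⟩, which is what I'm asking about) showing how 1024-dimensional int8-range vectors naturally yield dot products of this magnitude, so any "1 − dot" style distance goes strongly negative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two similar int8-range vectors, 1024-dimensional like the real embeddings.
a = rng.integers(-128, 128, size=1024).astype(np.int64)
b = a + rng.integers(-3, 4, size=1024)  # slight perturbation of a

dot = int(a @ b)    # raw inner product of the quantized vectors
distance = 1 - dot  # hypothetical "1 - dot product" distance

print(dot, distance)  # dot is large and positive, distance strongly negative
```

This matches the shape of what I observe: more-similar pairs have larger dot products, and therefore *more* negative distances.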

Expected Behavior
I expected the distances to be positive or to align more clearly with cosine similarity values in a normalized embedding space.

Questions
1. Is it expected for the inner product ("ip") metric to produce negative distances when used with int8-quantized embeddings?
2. Should I interpret less negative distances as higher similarity, and is there a recommended way to convert these distances into similarity scores (e.g., by inverting the sign)?
3. Could this behavior be related to the quantization ranges, or to normalization of the embeddings before adding them to the index?
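For question 2, this is the kind of conversion I have in mind, as a sketch under the assumption that the returned distance equals 1 − ⟨a, b⟩ (the actual definition is exactly what I'd like confirmed); the helper name `ip_distance_to_cosine` is mine, not a library function:

```python
import numpy as np

def ip_distance_to_cosine(distance, a_int8, b_int8):
    """Hypothetical conversion: if distance == 1 - <a, b>, recover the raw
    dot product and rescale by the int8 vector norms to get a value in [-1, 1]."""
    dot = 1.0 - distance
    na = np.linalg.norm(a_int8.astype(np.float64))
    nb = np.linalg.norm(b_int8.astype(np.float64))
    return dot / (na * nb)

# Sanity check: a vector compared with itself should give similarity 1.0.
v = np.arange(-64, 64, dtype=np.int64).repeat(8)  # 1024-dim int8-range vector
d = 1 - int(v @ v)  # the distance such an index would hypothetically return
print(ip_distance_to_cosine(d, v, v))  # ~1.0 up to float rounding
```

If the "1 − dot" assumption is wrong, the rescaling step would need to change accordingly.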
Additional Information
Embedding Model: mixedbread-ai/deepset-mxbai-embed-de-large-v1

USearch Index Configuration:

```python
index = Index(ndim=1024, metric="ip", dtype="i8")
```

Quantization Ranges:

Precomputed global min and max values across the entire dataset:

```python
import numpy as np

embedding_min = embeddings.min(axis=0)
embedding_max = embeddings.max(axis=0)
ranges = np.vstack([embedding_min, embedding_max])
```
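To illustrate the scale mismatch I suspect is involved, here is a self-contained sketch with synthetic data and a simplified scalar quantizer standing in for `quantize_embeddings` (so the exact rounding may differ from the library's): normalized float embeddings keep dot products in [-1, 1], while their int8 counterparts produce dot products several orders of magnitude larger, which matches the magnitude of the reported distances:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the real embeddings: 100 normalized 1024-dim vectors.
embeddings = rng.standard_normal((100, 1024)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Global per-dimension ranges, as in the snippet above.
lo = embeddings.min(axis=0)
hi = embeddings.max(axis=0)

# Simplified int8 scalar quantization (an approximation of the library's behavior).
scale = 255.0 / (hi - lo)
int8_emb = np.clip(np.round((embeddings - lo) * scale) - 128, -128, 127).astype(np.int8)

float_dot = float(embeddings[0] @ embeddings[0])  # self dot of a unit vector: ~1.0
int8_dot = int(int8_emb[0].astype(np.int64) @ int8_emb[0].astype(np.int64))

print(float_dot, int8_dot)  # int8_dot is several orders of magnitude larger
```

So even if the ranking is preserved, any distance derived from the raw int8 dot product will live on this much larger scale rather than in a cosine-like [-1, 1] range.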