
Inner Product Metric Produces Negative Distances for Similar Embeddings with Quantized Int8 Data #3188

Open
LukasKriesch opened this issue Jan 23, 2025 · 0 comments

Description
I’m using the USearch library to perform semantic search over int8-quantized embeddings with the inner product ("ip") metric. The results appear to be correctly ranked by similarity, but the distances returned by the search are negative. I would like to understand this behavior and confirm whether it is expected.

Steps to Reproduce
Compute embeddings using the following model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "mixedbread-ai/deepset-mxbai-embed-de-large-v1",
    model_kwargs={"torch_dtype": "float16"},
)
```

Quantize the embeddings with precomputed ranges:

```python
from sentence_transformers.quantization import quantize_embeddings

int8_embeddings = quantize_embeddings(
    embeddings,
    precision="int8",
    ranges=ranges,
)
```

Add the quantized embeddings to a Usearch index:


```python
from usearch.index import Index

index = Index(ndim=1024, metric="ip", dtype="i8")
index.add(list(ids), int8_embeddings)
```

Perform a query with a normalized embedding:

```python
query = model.encode("query: Pizza", normalize_embeddings=True)
query = quantize_embeddings(query, precision="int8", ranges=ranges)
matches = index.search(query, 10, return_distances=True)
```

Observed Behavior
The distances returned by the index are negative, even for highly similar embeddings.
For example:
Top result distance: -426537.0
Less similar result distance: -436325.0
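For what it's worth, the magnitudes line up with raw int8 dot products. Below is a standalone NumPy sketch (no USearch involved; it assumes, purely hypothetically, that the "ip" distance is computed as 1 − ⟨a, b⟩, which is what I'm asking about) showing how 1024-dimensional int8-range vectors naturally yield dot products of this magnitude, so any "1 − dot" style distance goes strongly negative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two similar int8-range vectors, 1024-dimensional like the real embeddings.
a = rng.integers(-128, 128, size=1024).astype(np.int64)
b = a + rng.integers(-3, 4, size=1024)  # slight perturbation of a

dot = int(a @ b)    # raw inner product of the quantized vectors
distance = 1 - dot  # hypothetical "1 - dot product" distance

print(dot, distance)  # dot is large and positive, distance strongly negative
```

This matches the shape of what I observe: more-similar pairs have larger dot products, and therefore *more* negative distances.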

Expected Behavior
I expected the distances to be positive or to align more clearly with cosine similarity values in a normalized embedding space.

Questions
1. Is it expected for the inner product ("ip") metric to produce negative distances when used with int8-quantized embeddings?
2. Should I interpret less negative distances as higher similarity, and is there a recommended way to convert these distances into similarity scores (e.g., by inverting the sign)?
3. Could this behavior be related to the quantization ranges, or to normalization of the embeddings before adding them to the index?
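For question 2, this is the kind of conversion I have in mind, as a sketch under the assumption that the returned distance equals 1 − ⟨a, b⟩ (the actual definition is exactly what I'd like confirmed); the helper name `ip_distance_to_cosine` is mine, not a library function:

```python
import numpy as np

def ip_distance_to_cosine(distance, a_int8, b_int8):
    """Hypothetical conversion: if distance == 1 - <a, b>, recover the raw
    dot product and rescale by the int8 vector norms to get a value in [-1, 1]."""
    dot = 1.0 - distance
    na = np.linalg.norm(a_int8.astype(np.float64))
    nb = np.linalg.norm(b_int8.astype(np.float64))
    return dot / (na * nb)

# Sanity check: a vector compared with itself should give similarity 1.0.
v = np.arange(-64, 64, dtype=np.int64).repeat(8)  # 1024-dim int8-range vector
d = 1 - int(v @ v)  # the distance such an index would hypothetically return
print(ip_distance_to_cosine(d, v, v))  # ~1.0 up to float rounding
```

If the "1 − dot" assumption is wrong, the rescaling step would need to change accordingly.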
Additional Information
Embedding Model: mixedbread-ai/deepset-mxbai-embed-de-large-v1

USearch Index Configuration:

```python
index = Index(ndim=1024, metric="ip", dtype="i8")
```

Quantization Ranges:

Precomputed global min and max values across the entire dataset:

```python
import numpy as np

embedding_min = embeddings.min(axis=0)
embedding_max = embeddings.max(axis=0)
ranges = np.vstack([embedding_min, embedding_max])
```
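To illustrate the scale mismatch I suspect is involved, here is a self-contained sketch with synthetic data and a simplified scalar quantizer standing in for `quantize_embeddings` (so the exact rounding may differ from the library's): normalized float embeddings keep dot products in [-1, 1], while their int8 counterparts produce dot products several orders of magnitude larger, which matches the magnitude of the reported distances:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the real embeddings: 100 normalized 1024-dim vectors.
embeddings = rng.standard_normal((100, 1024)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Global per-dimension ranges, as in the snippet above.
lo = embeddings.min(axis=0)
hi = embeddings.max(axis=0)

# Simplified int8 scalar quantization (an approximation of the library's behavior).
scale = 255.0 / (hi - lo)
int8_emb = np.clip(np.round((embeddings - lo) * scale) - 128, -128, 127).astype(np.int8)

float_dot = float(embeddings[0] @ embeddings[0])  # self dot of a unit vector: ~1.0
int8_dot = int(int8_emb[0].astype(np.int64) @ int8_emb[0].astype(np.int64))

print(float_dot, int8_dot)  # int8_dot is several orders of magnitude larger
```

So even if the ranking is preserved, any distance derived from the raw int8 dot product will live on this much larger scale rather than in a cosine-like [-1, 1] range.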