update retrieval quality article #1241
base: master
✅ Deploy Preview for condescending-goldwasser-91acf0 ready!
[Loading a dataset from Hugging Face hub](/documentation/tutorials/huggingface-datasets/) tutorial, `Qdrant/arxiv-titles-instructorxl-embeddings`
from the [Hugging Face hub](https://huggingface.co/datasets/Qdrant/arxiv-titles-instructorxl-embeddings). Let's download it in a streaming
mode, as we are only going to use part of it.
We’ll use a pre-embedded dataset from Hugging Face to train and test Qdrant’s search capabilities. First, load and split the dataset for training (1,000 items) and testing (100 items).
The dataset sizes stated here differ from the values in the code.
@thierrypdamiba @davidmyriel I actually liked that the previous version said embedding quality is crucial (maybe we paid it a bit more attention than required) and explained why we're comparing exact search to ANN. Now the tutorial has become a bit faceless.
@joein @davidmyriel I added information about embedding quality and ANN vs. exact search. I also updated the dataset numbers to reflect the code.
qdrant-landing/package.json
Outdated
@@ -21,7 +21,8 @@
 "anchor-js": "^5.0.0",
 "bootstrap": "^5.3.3",
 "clipboard": "^2.0.11",
-"qdrant-page-search": "^1.0.8"
+"qdrant-page-search": "^1.0.8",
+"react-router-dom": "^6.27.0"
why do we need this?
We don't need it. Removing now.
Update text and format to better reflect the benefit of ANN vs KNN/exact search and why a user would want to measure retrieval quality. TODO: Add screenshots of how you can do this in the web UI.
- **m**: This parameter determines the maximum number of connections per node in the HNSW graph. A higher value for `m` increases the connectivity of the graph, potentially improving search accuracy at the cost of increased memory usage and indexing time. The default value for `m` is 16.
- **ef_construct**: This parameter controls the size of the dynamic candidate list during index construction. A higher value of `ef_construct` leads to a more exhaustive search during the indexing phase, resulting in a higher quality graph and improved search accuracy. However, this comes at the cost of longer indexing times. The default value for `ef_construct` is 100.

We will use the untuned HNSW as the baseline to compare how changes affect the precision of the search. Initially, we will use the default values of `m` (16) and `ef_construct` (100) for the HNSW algorithm. Later, we will double these values to observe their impact on retrieval quality.
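As a point of reference for the baseline comparison, here is a minimal sketch of how precision@k could be computed once you have the ids returned by an ANN search and by an exact search for the same query. The function names and the sample ids are hypothetical and are not taken from the article's code.

```python
def precision_at_k(ann_ids, exact_ids, k):
    """Share of the top-k exact (ground-truth) ids that the ANN search also returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    ann_top = set(ann_ids[:k])
    exact_top = set(exact_ids[:k])
    return len(ann_top & exact_top) / k


def average_precision_at_k(pairs, k):
    """Mean precision@k over (ann_ids, exact_ids) pairs, one pair per test query."""
    if not pairs:
        raise ValueError("at least one query is required")
    return sum(precision_at_k(ann, exact, k) for ann, exact in pairs) / len(pairs)


# Hypothetical ids for two test queries: (ANN result, exact result).
sample_pairs = [
    ([1, 2, 3, 4], [1, 2, 5, 6]),  # 2 of 4 ground-truth ids found
    ([7, 8, 9, 10], [10, 9, 8, 7]),  # same ids, order does not matter
]
baseline = average_precision_at_k(sample_pairs, k=4)
```

Running the same measurement before and after changing `m` and `ef_construct` gives two numbers that can be compared directly.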
We have already written what the default values are, so we can shorten this sentence, like
"We'll use the default `m` and `ef` as a baseline and then tweak the params to see how that affects the precision of the search."
- If you require higher precision, increase `m` and `ef_construct` while considering the increased memory usage and indexing time.
- If memory and indexing time are critical constraints, tune the parameters incrementally to find the right balance.
By the way, there is also a third parameter: `ef` (also known as `efSearch`). It controls the number of neighbors evaluated during the search. A higher value may increase precision, but it also increases latency.
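To make the precision/latency trade-off of `ef` concrete, here is a toy sketch of a best-first search over a single graph layer, in the style of HNSW. It is a simplified illustration and not Qdrant's implementation; the graph, the precomputed distances, and the function name are all made up for the example.

```python
import heapq


def search_layer(graph, dist, entry, ef):
    """Best-first search over one graph layer, keeping at most `ef` results.

    graph: dict mapping node -> list of neighbor nodes
    dist:  dict mapping node -> distance to the query (precomputed for this toy)
    Returns (results sorted by distance, number of nodes evaluated).
    """
    if ef < 1:
        raise ValueError("ef must be at least 1")
    visited = {entry}
    candidates = [(dist[entry], entry)]  # min-heap: closest candidate first
    results = [(-dist[entry], entry)]    # max-heap via negation: worst result first
    evaluated = 1
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break  # closest candidate is worse than the worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            evaluated += 1
            if len(results) < ef or dist[nb] < -results[0][0]:
                heapq.heappush(candidates, (dist[nb], nb))
                heapq.heappush(results, (-dist[nb], nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    return sorted((-nd, n) for nd, n in results), evaluated


# Toy data: node 5 is the true nearest neighbor, but it is only reachable
# through node 2, which looks worse than node 1 from the entry point.
GRAPH = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2, 5], 5: [4]}
DIST = {0: 5.0, 1: 3.0, 2: 4.0, 3: 3.5, 4: 2.0, 5: 0.0}

narrow, narrow_cost = search_layer(GRAPH, DIST, entry=0, ef=1)
wide, wide_cost = search_layer(GRAPH, DIST, entry=0, ef=3)
```

On this toy graph, the narrow search settles on node 1 after evaluating fewer nodes, while the wider search reaches node 5 at the cost of evaluating every node, which mirrors the trade-off described above.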
```
Response:
This step measures the initial retrieval quality before any tuning of the HNSW parameters. The HNSW (Hierarchical Navigable Small World) algorithm has two key parameters that influence search performance and quality:
We could provide a bit more detail here:
There are two types of parameters users can tune: index-time parameters and search-time parameters.
Index time: `m` and `ef_construct`. Search time: `ef`.
I think we might want to mention this here, rather than just adding a brief sentence at the end of the article.
However, I don't find the code adjustments to be a necessity.
Make changes to the retrieval quality article