-
Notifications
You must be signed in to change notification settings - Fork 689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MB-59102: Merging kNN hits from a distributed index with exact search hits #1936
Conversation
CascadingRadium
commented
Dec 12, 2023
- Reverts commit MB-59102: Merging KNN Results #1910, which was an earlier attempt to address this issue.
- Implements the PreSearch Construct in Bleve alias search, enabling a preliminary query to collect metadata from all alias indexes before executing the main search query in MultiSearch. PreSearch gathers KNN results from all alias indexes, selecting the top K results. This facilitates the main Bleve Query to operate within the context of documents that matched the KNN query, ensuring seamless functionality of existing Bleve constructs such as Faceting, Sorting, Pagination, SearchAfter, and SearchBefore.
- Introduces the KNN Collector construct to merge and obtain accurate Top K results from multiple Zap Segments' KNN results.
- Enhances KNN Unit Tests for greater generality.
- Addresses an issue where errors generated within the Top N Document handler were being discarded.
- Resolves an issue where document matches failing to meet the SearchAfter clause weren't being returned to the pool.
Thank you @CascadingRadium . I'll share my review by tomorrow. |
@CascadingRadium I tried this change E2E with couchbase. Things don't work on 2 accounts -
Non-vector indexes are affected as well, so this needs a bit more testing from your end. |
@abhinavdangeti Re point 1, I believe this PR is the cause(#1935) since this doesn't allow an empty string as the KNN operator. If it is, it should be reverted. |
Thanks @metonymic-smokey , here's the PR adjusting that - #1937 |
there's a separate PR in cbft that I will raise today. the marshal unmatshal in logic there needs to be updated |
I don't see this still. Are any gRPC changes required in cbft to support this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still trying to understand the details of the implementation, although i get the core logic behind it. will keep at it periodically and let you know any concerns i have
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go with this. Well done @CascadingRadium .
Keep an eye out for any optimizations we can fish out in this area for the future.
- Cleanup up unused code paths since the merging of #1936 changed the overall approach, resulting in a lot of dead code. - Optimized the KNN collector to not update the allHits map for every hit and instead just use its internal heaps when the Final() is called. - Typo Fix - Optimized the OptimizeVR's Finish() method to concurrently search across segments. Results of Optimization Dataset - SIFT 1 Million dataset Index - 1 partition index Before this commit - Query Execution time - 2.25 seconds With this commit - Query Execution time - 1.70 seconds Percentage Reduction - **24.4%** --------- Co-authored-by: Abhinav Dangeti <[email protected]>