-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Composite indexes and filtered searches #121
Comments
We published a blog post about this feature in the release announcement for pgvecto.rs 0.2: https://blog.pgvecto.rs/pgvectors-02-unifying-relational-queries-and-vector-search-in-postgresql. |
Here are some discussions: tensorchord/pgvecto.rs#188 |
Thank you for your interest! There're two ways to do filtering with vector search, pre-filtering(check filter condition inside the index) and post-filtering(check filter condition outside the index). For post-filtering inside postgres, it means the vector index is an iterator to provide candidates to postgres. Then postgres will check whether each element satisfies the filter condition. The technique to make the index an iterator is called VBASE. This is supported by VectorChord out of the box. For pre-filtering inside postgres, it's more likely the composite index you mentioned. However, the composite index needs to be claimed in the indexing process, like FreshDiskANN only works for a specific type of query (i.e. OR clause with tag selection) that doesn't fit postgres. |
@VoVAllen I think these examples still suffer from the inefficiency I described earlier where you may have to skip many results to find the For example in my My understanding from https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf is that FilteredDiskANN addresses this by using the filter value (e.g. This composite index approach is easier to implement for a filter like Please correct me if I'm misunderstanding something though as I've never actually worked on Postgres index internals before. |
You're right about the post-filtering approach. That's why I also mentioned the prefilter approach. However prefilter means you need to store the information to be filtered along with the vector. It will have extra costs on both storage and computation. There's no simple composite index for multiple fields, that you cannot combine two indexes together. For the query like FilteredDiskANN is specific for certain type of query, as I mentioned, the tag selection scenario. You can consider it as something like create partial index on each user_id. It will make the query like
The implementation is actually different from what you mentioned.
Ref: https://www.postgresql.org/docs/current/indexes-multicolumn.html The btree index is on the leading column. Other columns are just used to check the condition. |
@VoVAllen I read that a little differently. My understanding is that a multi column BTREE index is actually ordering the data based on the composite order of all keys. What you are describing sounds a little more like "covering indexes" which are described at https://www.postgresql.org/docs/current/indexes-index-only-scans.html#INDEXES-INDEX-ONLY-SCANS . These are a way to include extra columns in the index without actually ordering by them. It allows doing the filtering without a heap fetch and I assume that's what you're referring to with "just used to check the condition".
Since "leading columns" is plural I believe it is talking about doing equality checks on all of the columns in order. Additionally they mention the point I was trying to make about combining sorting and the index to limit the amount of the index scanned.
Also that page doesn't talk about I'll try to illustrate the kind of benefits we get from a multi-column index when combining filtering and ordering with an example: Example with/without BTREE indexesCREATE TABLE posts (
id bigserial PRIMARY KEY,
user_id bigint,
created_at timestamp
);
INSERT INTO posts (
SELECT
generate_series(1,100000) as id,
floor(random()*100)+1 as user_id,
NOW() + (random() * (NOW()+'100 days' - NOW())) as created_at
);
explain analyze select id, user_id, created_at from posts where user_id = 5 order by created_at limit 5;
create index on posts (user_id);
explain analyze select id, user_id, created_at from posts where user_id = 5 order by created_at limit 5;
create index on posts (user_id) include(created_at);
explain analyze select id, user_id, created_at from posts where user_id = 5 order by created_at limit 5;
create index on posts (user_id, created_at);
explain analyze select id, user_id, created_at from posts where user_id = 5 order by created_at limit 5;
CREATE TABLE
INSERT 0 100000
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Limit (cost=1895.42..1895.43 rows=5 width=24) (actual time=3.900..3.901 rows=5 loops=1)
-> Sort (cost=1895.42..1896.67 rows=500 width=24) (actual time=3.899..3.900 rows=5 loops=1)
Sort Key: created_at
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on posts (cost=0.00..1887.11 rows=500 width=24) (actual time=0.017..3.866 rows=1038 loops=1)
Filter: (user_id = 5)
Rows Removed by Filter: 98962
Planning Time: 0.046 ms
Execution Time: 3.907 ms
(9 rows)
CREATE INDEX
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=378.35..378.36 rows=5 width=24) (actual time=0.226..0.226 rows=5 loops=1)
-> Sort (cost=378.35..379.60 rows=500 width=24) (actual time=0.225..0.225 rows=5 loops=1)
Sort Key: created_at
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using posts_user_id_idx on posts (cost=0.29..370.04 rows=500 width=24) (actual time=0.008..0.195 rows=1038 loops=1)
Index Cond: (user_id = 5)
Planning Time: 0.113 ms
Execution Time: 0.232 ms
(8 rows)
CREATE INDEX
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=378.35..378.36 rows=5 width=24) (actual time=0.229..0.230 rows=5 loops=1)
-> Sort (cost=378.35..379.60 rows=500 width=24) (actual time=0.229..0.229 rows=5 loops=1)
Sort Key: created_at
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using posts_user_id_idx on posts (cost=0.29..370.04 rows=500 width=24) (actual time=0.005..0.196 rows=1038 loops=1)
Index Cond: (user_id = 5)
Planning Time: 0.110 ms
Execution Time: 0.236 ms
(8 rows)
CREATE INDEX
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..4.12 rows=5 width=24) (actual time=0.077..0.079 rows=5 loops=1)
-> Index Scan using posts_user_id_created_at_idx1 on posts (cost=0.42..371.17 rows=500 width=24) (actual time=0.076..0.078 rows=5 loops=1)
Index Cond: (user_id = 5)
Planning Time: 0.097 ms
Execution Time: 0.083 ms
(5 rows) For the example query
This indexing approach is not easy to map onto an inverted index or a graph index but from my reading of the paper https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf the FilteredDiskANN approach does try to accomplish something similar by ensuring that nearby graph nodes are also "nearby" based on the filters that you want to apply. So traversing the index (graph) yields you a set of results that match your filter with far fewer steps through the graph than a regular HNSW index. I'm not sure how to relate that to IVF indexes but I'm trying to highlight the major performance opportunities we have if we can get an index that is actually organized by the filter conditions as well as the vector distances. |
I think there are 2 simple approaches to expand an IVFFlat index with the ability to filter more efficiently. For the example I described of: SELECT * FROM posts WHERE user_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5 You could do:
|
Thank you. Your example makes a lot of sense to me. However, I think it only works if the leading operator is an equal constraint. If it's |
@VoVAllen yes I think you're right about that. The use case I'm mainly interested in is things like |
It's possible to make each posting list a btree organized by user_id to make it faster. But the code will be much more complex and I'm not sure how much speedup we will see. It depends a lot on how many users there are and how many posts there are for each user. |
Problem
I was interested to understand how this extension performs when combining other filters (
WHERE
clauses) with vector searches. This is a common issue with vector search indexes and there is a long discussion about it in pgvector at pgvector/pgvector#259 .I couldn't find an existing issue or mention of the topic in this repo except for this question #77 (comment) so I thought I'd open an issue.
Basically the idea would be to support queries like:
Typically in Postgres such queries are optimized with a composite index on
user_id
andembedding
. Composite indexes are simple for aBTREE
index but much more complicated for other index types.Without a composite index Postgres query planner has to choose between loading all
posts WHERE user_id = 123
and sorting those in memory and then limiting to the top 5 or it has to look at the entire orderedembedding <-> '[3,1,2]'
list and keep searching until it finds 5 matches whereuser_id = 123
. The choice usually depends on statistics which indicate which will be least wasteful. But in many applications both options will be very wasteful.The worst case scenario happens when there are many users and every user has many posts. In that case the
posts WHERE user_id = 123
needs to sort a lot of posts in memory and theembedding <-> '[3,1,2]'
option would need to skip many irrelevant posts by different users.Solution
This issue is mainly opened as a starting point to the conversation. There aren't easy answers here but usually only tradeoffs and I wanted to hear from the maintainers about their thoughts on this problem.
Related research
So far the most thorough discussion I've seen about this topic is in https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf but this has not yielded any production grade database nor a Postgres extension to my knowledge.
The text was updated successfully, but these errors were encountered: