discussion: embedding column update problem for a vector db #120
This is actually a complex, general question about how to keep data and its vectors in sync. Here's my opinionated suggestion:
CREATE OR REPLACE FUNCTION ensure_columns_updated()
RETURNS TRIGGER AS $$
BEGIN
    -- Check if text has been updated.
    IF NEW.text IS DISTINCT FROM OLD.text THEN
        -- Throw an exception if the vector has not been updated.
        IF NEW.vector IS NOT DISTINCT FROM OLD.vector THEN
            RAISE EXCEPTION 'vector must be updated when text is updated';
        END IF;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ensure_columns_updated_trigger
BEFORE UPDATE ON your_table_name
FOR EACH ROW
EXECUTE FUNCTION ensure_columns_updated();
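The same guard can be demonstrated end to end. Below is a minimal sketch using Python's stdlib `sqlite3` (SQLite trigger syntax differs from plpgsql, and the `docs` table and column names are illustrative, but the idea is identical: reject any update that changes the text without also changing the vector):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, vector TEXT);
-- Reject updates that change text but leave vector untouched.
CREATE TRIGGER ensure_columns_updated
BEFORE UPDATE ON docs
WHEN NEW.text IS NOT OLD.text AND NEW.vector IS OLD.vector
BEGIN
    SELECT RAISE(ABORT, 'vector must be updated when text is updated');
END;
""")
conn.execute("INSERT INTO docs VALUES (1, 'hello', '[0.1, 0.2]')")

try:
    # Changing text alone trips the trigger.
    conn.execute("UPDATE docs SET text = 'changed' WHERE id = 1")
except sqlite3.DatabaseError as exc:
    print(exc)

# Updating both columns together passes the trigger.
conn.execute("UPDATE docs SET text = 'changed', vector = '[0.3, 0.4]' WHERE id = 1")
```

This only enforces that the two columns change together; it cannot verify that the new vector actually corresponds to the new text.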
DO $$
DECLARE
    batch_size INT := 500;   -- adjust the batch size as needed
    last_id BIGINT := 0;     -- keyset cursor; assumes a monotonically increasing id
    batch_max_id BIGINT;
    rows_updated INT;
BEGIN
    LOOP
        -- Find the upper id bound of the next batch that still needs an embedding.
        -- Keyset pagination is used instead of OFFSET: updated rows drop out of
        -- the "description_vector IS NULL" set, so an advancing OFFSET would
        -- silently skip unprocessed rows.
        SELECT max(id) INTO batch_max_id
        FROM (
            SELECT id
            FROM cases
            WHERE description_vector IS NULL
              AND id > last_id
            ORDER BY id ASC
            LIMIT batch_size
        ) AS batch;

        -- Exit the loop when no rows are left.
        EXIT WHEN batch_max_id IS NULL;

        RAISE NOTICE 'Processing ids in (%, %]', last_id, batch_max_id;

        UPDATE cases SET description_vector = azure_openai.create_embeddings(
            'text-embedding-3-small', -- example deployment name in Azure OpenAI
            COALESCE(data#>>'{name}', 'default_value')
                || COALESCE(LEFT(data#>>'{casebody, opinions, 0}', 8000), 'default_value'),
            1536,    -- dimension
            3600000, -- timeout_ms
            false,   -- throw_on_error: failed rows stay NULL for a later re-run
            10,      -- max_attempts
            2000     -- retry_delay_ms
        )::vector
        WHERE description_vector IS NULL
          AND id > last_id
          AND id <= batch_max_id;

        -- Get the number of rows updated.
        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        RAISE NOTICE '% rows updated', rows_updated;

        last_id := batch_max_id;

        -- Commit each batch to avoid one long-running transaction
        -- (allowed in a DO block since PostgreSQL 11).
        COMMIT;
    END LOOP;
END $$;
Timescale has https://github.com/timescale/pgai, which is dedicated to your scenario. But personally, I don't think this is good practice.
I also think it is not a good idea to maintain this stuff in SQL.
@aseaday What type of ORM are you using? Could a new SDK for SQLAlchemy help in your scenario?
I'm also in favor of using https://github.com/dbos-inc/ to handle the text -> vector mapping. It supports cron jobs.
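Whatever scheduler runs it, the periodic job itself is simple. A minimal stdlib sketch (using `sqlite3` as a stand-in for Postgres, `fake_embed` for a real embedding call, and illustrative table/column names):

```python
import json
import sqlite3

def fake_embed(text: str) -> list[float]:
    # Placeholder for a real embedding model call.
    return [float(len(text))]

def backfill_embeddings(conn: sqlite3.Connection, batch_size: int = 500) -> int:
    """Fill in missing embeddings in batches; returns the number of rows updated."""
    total = 0
    while True:
        # Rows that still need an embedding.
        rows = conn.execute(
            "SELECT id, text FROM docs WHERE vector IS NULL ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return total
        conn.executemany(
            "UPDATE docs SET vector = ? WHERE id = ?",
            [(json.dumps(fake_embed(text)), row_id) for row_id, text in rows],
        )
        conn.commit()  # commit per batch to avoid one long transaction
        total += len(rows)
```

A scheduler (cron, pg_cron, or a workflow engine like DBOS) would then invoke `backfill_embeddings` on an interval.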
We use EF Core, which is the official ORM bundled with .NET.
@aseaday Another solution is to use generated columns. However, that may prevent data insertion if the embedding model doesn't work. If not, we have to set up a separate async job to convert text to embeddings, either inside Postgres or outside it using Python or another SDK.
Immich handles this through a job scheduler. One job inserts the content, and once complete queues another job that generates an embedding and inserts it into the database. Processes that rely on the embedding to exist get queued once the embedding job is complete. |
However, jobs are at the asset-level, so there is unfortunately a lot of communication overhead and no batching. |
I am building a content discovery system these days, and I use a vector DB (pgvector.rs).
Here is a requirement I call a "chain reaction":
an embedding should be updated whenever I change certain columns of a record in a table.
I don't want to explicitly update both the content columns and the embedding. Is there a better solution, something like
DEFAULT CURRENT_TIMESTAMP ON UPDATE
? What do you think about this problem?