
discussion: embedding column update problem for a vector db #120

Open
aseaday opened this issue Nov 30, 2024 · 9 comments
Labels
question Further information is requested status/needs-discussion 🪧

Comments

@aseaday
Member

aseaday commented Nov 30, 2024

I am building a content discovery system these days, and I use a vector DB (pgvector.rs).
Here is a requirement I call "chain reaction":

An embedding should be regenerated when I change certain columns of a record in a table.

I don't want to have to explicitly update both the content column and its embedding. Is there a better solution, something like DEFAULT CURRENT_TIMESTAMP ON UPDATE? What do you think about this problem?

@aseaday aseaday changed the title Embedding column update problem for a vector db discussion: embedding column update problem for a vector db Nov 30, 2024
@gaocegege gaocegege added question Further information is requested status/needs-discussion 🪧 labels Dec 2, 2024
@VoVAllen
Member

VoVAllen commented Dec 2, 2024

This is actually a complex, general question about how to keep data and its vectors in sync. Here's my opinionated suggestion:

  • Force callers to update the vector whenever they update the text. You can enforce this with a trigger that rejects updates which change the text but not the vector:
CREATE OR REPLACE FUNCTION ensure_columns_updated()
RETURNS TRIGGER AS $$
BEGIN
    -- Check if text has been updated.
    IF NEW.text IS DISTINCT FROM OLD.text THEN
        -- Throw an exception if the vector has not been updated.
        IF NEW.vector IS NOT DISTINCT FROM OLD.vector THEN
            RAISE EXCEPTION 'vector must be updated when text is updated';
        END IF;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ensure_columns_updated_trigger
BEFORE UPDATE ON your_table_name
FOR EACH ROW
EXECUTE FUNCTION ensure_columns_updated();
  • Generate embeddings for the NULL rows with an external service (for example, a cron job that runs every minute), or use an extension (pgai, https://github.com/tembo-io/pg_vectorize) to generate the embeddings inside Postgres, like:
DO $$
DECLARE
    batch_size INT := 500;  -- adjust the batch size as needed
    rows_updated INT;
BEGIN
    LOOP
        -- Update one batch of rows whose embedding is still NULL.
        UPDATE cases SET description_vector = azure_openai.create_embeddings(
            'text-embedding-3-small',  -- example deployment name in Azure OpenAI
            COALESCE(data#>>'{name}', 'default_value') || COALESCE(LEFT(data#>>'{casebody, opinions, 0}', 8000), 'default_value'),
            1536,     -- dimension
            3600000,  -- timeout_ms
            false,    -- throw_on_error
            10,       -- max_attempts
            2000      -- retry_delay_ms
        )::vector
        WHERE id IN (
            SELECT id
            FROM cases
            WHERE description_vector IS NULL
            ORDER BY id ASC
            LIMIT batch_size
        );

        -- Get the number of rows updated.
        GET DIAGNOSTICS rows_updated = ROW_COUNT;

        -- Exit the loop when no NULL rows remain. Updated rows drop out of
        -- the WHERE clause on their own, so no OFFSET is needed; paginating
        -- with an OFFSET over the shrinking NULL set would skip rows.
        IF rows_updated = 0 THEN
            EXIT;
        END IF;

        -- Commit each batch to avoid a long-running transaction
        -- (allowed in DO blocks since PostgreSQL 11, outside a transaction block).
        COMMIT;
    END LOOP;
END $$;
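The external-service variant of the second bullet can also live outside the database. Below is a minimal Python sketch of that backfill loop; sqlite3 and a `fake_embed` stub stand in for Postgres and a real embedding API, and the `docs` table and column names are illustrative assumptions, not anything from the thread:

```python
import json
import sqlite3

def fake_embed(texts):
    # Placeholder: a real implementation would call an embedding service here.
    return [[float(len(t)), 0.0] for t in texts]

def backfill_embeddings(conn, batch_size=500):
    """Fill NULL embedding columns in batches until none remain."""
    total = 0
    while True:
        # Grab the next batch of rows that still lack an embedding.
        rows = conn.execute(
            "SELECT id, text FROM docs WHERE embedding IS NULL "
            "ORDER BY id LIMIT ?", (batch_size,)
        ).fetchall()
        if not rows:
            break
        vectors = fake_embed([text for _, text in rows])
        conn.executemany(
            "UPDATE docs SET embedding = ? WHERE id = ?",
            [(json.dumps(v), rid) for (rid, _), v in zip(rows, vectors)],
        )
        conn.commit()  # commit per batch to avoid long-running transactions
        total += len(rows)
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)")
conn.executemany("INSERT INTO docs (text) VALUES (?)", [("a",), ("bb",), ("ccc",)])
print(backfill_embeddings(conn, batch_size=2))  # 3
```

A cron entry or scheduler would simply invoke this function periodically; combined with the trigger above, writers cannot forget to invalidate the vector, and this job repairs the NULLs.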

@VoVAllen
Member

VoVAllen commented Dec 2, 2024

timescale's https://github.com/timescale/pgai is dedicated to your scenario, but I personally don't think this is good practice:

  • SQL is hard to maintain. It's generally a bad pattern to write application logic in SQL functions; in the long run, nobody knows whether the SQL can be changed or whether the extension can be updated safely.
  • Different iteration cycles. You want a database to be stable, so less maintenance and fewer upgrades are preferred. But AI models and methods iterate very fast, so the interface, and the function itself, may need to change frequently.
  • Security. To keep the database secure, we grant it minimum permissions, including a restrictive network policy. But an embedding service needs to communicate with an external endpoint, and may need a special network topology to reach self-hosted models. That is hard to do securely and properly.

@aseaday
Member Author

aseaday commented Dec 2, 2024

https://github.com/timescale/pgai

I also think it is not a good idea to maintain this stuff in SQL.
But for now, the problem is that there is no standard or implementation, such as an ORM, that handles this. On our team, some members care only about the ordinary logic, like CRUD, so they may write a handler that changes the title or description but leaves the original embedding in place.
In my opinion, a new spec that handles this problem at the ORM level would be a good choice.
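To make the ORM-level idea concrete, here is a minimal, framework-free Python sketch of the invalidation rule such a spec would enforce: mutating the text automatically nulls the embedding, so a CRUD handler cannot leave a stale vector behind. The `Document` class and its fields are hypothetical; real ORMs expose hooks for this pattern (SQLAlchemy events, EF Core SaveChanges interceptors):

```python
class Document:
    """Toy model of the 'chain reaction': changing text invalidates the embedding."""

    def __init__(self, text, embedding=None):
        self._text = text
        self.embedding = embedding

    @property
    def text(self):
        return self._text

    @text.setter
    def text(self, value):
        if value != self._text:
            self._text = value
            self.embedding = None  # force re-embedding by a later job

doc = Document("hello", embedding=[0.1, 0.2])
doc.text = "hello world"   # handler only touches the text...
print(doc.embedding)       # None -- ...but the stale vector is gone
```

Paired with an async backfill job that fills NULL embeddings, this keeps the re-embedding concern out of every individual CRUD handler.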

@VoVAllen
Member

VoVAllen commented Dec 3, 2024

@aseaday What type of ORM are you using? Could a new SDK for SQLAlchemy help in your scenario?

@VoVAllen
Member

VoVAllen commented Dec 3, 2024

I'm also in favor of using https://github.com/dbos-inc/ to handle the text -> vector mapping. It supports cron jobs.

@aseaday
Member Author

aseaday commented Dec 3, 2024

@aseaday What type of ORM are you using? Could a new SDK for SQLAlchemy help in your scenario?

We use EF Core, the official ORM bundled with .NET.
I simply built a binlog event pipeline to solve this problem, but I think it could be cleaner and better designed.

@VoVAllen
Member

VoVAllen commented Dec 4, 2024

@aseaday Another solution is to use generated columns. However, that may block data insertion if the embedding model doesn't work. Otherwise, we have to set up a separate async job to convert text to embeddings, either inside Postgres or outside it using Python or another SDK.

@mertalev

mertalev commented Dec 4, 2024

Immich handles this through a job scheduler. One job inserts the content and, once complete, queues another job that generates an embedding and inserts it into the database. Processes that rely on the embedding existing are queued once the embedding job completes.

@mertalev

mertalev commented Dec 4, 2024

However, jobs are at the asset level, so there is unfortunately a lot of communication overhead and no batching.
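The chained-job pattern described above can be sketched in a few lines of Python. This is an illustrative toy, not Immich's actual code: inserting content enqueues an embedding job, and work that depends on the embedding is enqueued only after that job completes:

```python
from queue import Queue

jobs = Queue()   # stand-in for a real job scheduler
db = {}          # stand-in for the database

def insert_content(doc_id, text):
    db[doc_id] = {"text": text, "embedding": None}
    jobs.put(("embed", doc_id))   # chain the embedding job

def embed(doc_id):
    # Fake embedding; a real job would call a model or service.
    db[doc_id]["embedding"] = [float(len(db[doc_id]["text"]))]
    jobs.put(("index", doc_id))   # chain work that needs the embedding

def index(doc_id):
    # By construction, this runs only after the embedding exists.
    assert db[doc_id]["embedding"] is not None

handlers = {"embed": embed, "index": index}
insert_content(1, "hello")
while not jobs.empty():
    name, doc_id = jobs.get()
    handlers[name](doc_id)
print(db[1]["embedding"])  # [5.0]
```

The per-asset overhead mentioned above comes from each `embed` job carrying a single document; batching would mean draining several queued IDs into one embedding call.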

4 participants