
discussion: embedding column update problem for a vector db #120

Open
aseaday opened this issue Nov 30, 2024 · 9 comments
Labels
question Further information is requested status/needs-discussion 🪧

Comments

@aseaday
Member

aseaday commented Nov 30, 2024

I am building a content discovery system these days, and I use a vector DB (pgvector.rs).
Here is a requirement I call "chain reaction":

An embedding should be regenerated when I change certain columns of a record in a table.

I don't want to have to explicitly update both the content column and its embedding. Is there a better solution, something like DEFAULT CURRENT_TIMESTAMP ON UPDATE? What do you think about this problem?

@aseaday aseaday changed the title Embedding column update problem for a vector db discussion: embedding column update problem for a vector db Nov 30, 2024
@gaocegege gaocegege added question Further information is requested status/needs-discussion 🪧 labels Dec 2, 2024
@VoVAllen
Member

VoVAllen commented Dec 2, 2024

This is actually a complex, general question about how to keep data and its vectors in sync. Here's my opinionated suggestion:

  • Force callers to update the vector whenever they update the text. You can enforce this with a trigger that rejects updates which change the text but not the vector:
CREATE OR REPLACE FUNCTION ensure_columns_updated()
RETURNS TRIGGER AS $$
BEGIN
    -- Check if text has been updated.
    IF NEW.text IS DISTINCT FROM OLD.text THEN
        -- Throw an exception if the vector has not been updated.
        IF NEW.vector IS NOT DISTINCT FROM OLD.vector THEN
            RAISE EXCEPTION 'vector must be updated when text is updated';
        END IF;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ensure_columns_updated_trigger
BEFORE UPDATE ON your_table_name
FOR EACH ROW
EXECUTE FUNCTION ensure_columns_updated();
  • Generate embeddings for the NULL rows with an external service (for example, a cron job that runs every minute), or use an extension (pgai, https://github.com/tembo-io/pg_vectorize) to generate the embeddings inside Postgres, like:
DO $$
DECLARE
    batch_size INT := 500;  -- adjust the batch size as needed
    rows_updated INT;
BEGIN
    LOOP
        -- Update one batch of rows whose embedding is still NULL.
        UPDATE cases SET description_vector = azure_openai.create_embeddings(
            'text-embedding-3-small',  -- example deployment name in Azure OpenAI
            COALESCE(data#>>'{name}', 'default_value') || COALESCE(LEFT(data#>>'{casebody, opinions, 0}', 8000), 'default_value'),
            1536,     -- dimension
            3600000,  -- timeout_ms
            false,    -- throw_on_error
            10,       -- max_attempts
            2000      -- retry_delay_ms
        )::vector
        WHERE id IN (
            SELECT id
            FROM cases
            WHERE description_vector IS NULL
            ORDER BY id ASC
            LIMIT batch_size
        );

        -- Get the number of rows updated.
        GET DIAGNOSTICS rows_updated = ROW_COUNT;

        -- Exit the loop when no NULL rows remain. Updated rows drop out of
        -- the WHERE clause on their own, so no OFFSET is needed; paginating
        -- with an OFFSET over the shrinking NULL set would skip rows.
        IF rows_updated = 0 THEN
            EXIT;
        END IF;

        -- Commit each batch to avoid a long-running transaction
        -- (allowed in DO blocks since PostgreSQL 11, outside a transaction block).
        COMMIT;
    END LOOP;
END $$;
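The external-service variant of the second bullet can also live outside the database. Below is a minimal Python sketch of that backfill loop; sqlite3 and a `fake_embed` stub stand in for Postgres and a real embedding API, and the `docs` table and column names are illustrative assumptions, not anything from the thread:

```python
import json
import sqlite3

def fake_embed(texts):
    # Placeholder: a real implementation would call an embedding service here.
    return [[float(len(t)), 0.0] for t in texts]

def backfill_embeddings(conn, batch_size=500):
    """Fill NULL embedding columns in batches until none remain."""
    total = 0
    while True:
        # Grab the next batch of rows that still lack an embedding.
        rows = conn.execute(
            "SELECT id, text FROM docs WHERE embedding IS NULL "
            "ORDER BY id LIMIT ?", (batch_size,)
        ).fetchall()
        if not rows:
            break
        vectors = fake_embed([text for _, text in rows])
        conn.executemany(
            "UPDATE docs SET embedding = ? WHERE id = ?",
            [(json.dumps(v), rid) for (rid, _), v in zip(rows, vectors)],
        )
        conn.commit()  # commit per batch to avoid long-running transactions
        total += len(rows)
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)")
conn.executemany("INSERT INTO docs (text) VALUES (?)", [("a",), ("bb",), ("ccc",)])
print(backfill_embeddings(conn, batch_size=2))  # 3
```

A cron entry or scheduler would simply invoke this function periodically; combined with the trigger above, writers cannot forget to invalidate the vector, and this job repairs the NULLs.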

@VoVAllen
Member

VoVAllen commented Dec 2, 2024

timescale's https://github.com/timescale/pgai is dedicated to your scenario, but I personally don't think this is good practice:

  • SQL is hard to maintain. It's generally a bad pattern to write application logic in SQL functions; in the long run, nobody knows whether the SQL can be changed or whether the extension can be updated safely.
  • Different iteration cycles. You want a database to be stable, so less maintenance and fewer upgrades are preferred. But AI models and methods iterate very fast, so the interface, and the function itself, may need to change frequently.
  • Security. To keep the database secure, we grant it minimum permissions, including a restrictive network policy. But an embedding service needs to communicate with an external endpoint, and may need a special network topology to reach self-hosted models. That is hard to do securely and properly.

@aseaday
Member Author

aseaday commented Dec 2, 2024

https://github.com/timescale/pgai

I also think it is not a good idea to maintain this stuff in SQL.
But for now, the problem is that there is no standard or implementation, such as an ORM, that handles this. On our team, some members care only about the ordinary logic, like CRUD, so they may write a handler that changes the title or description but leaves the original embedding in place.
In my opinion, a new spec that handles this problem at the ORM level would be a good choice.
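To make the ORM-level idea concrete, here is a minimal, framework-free Python sketch of the invalidation rule such a spec would enforce: mutating the text automatically nulls the embedding, so a CRUD handler cannot leave a stale vector behind. The `Document` class and its fields are hypothetical; real ORMs expose hooks for this pattern (SQLAlchemy events, EF Core SaveChanges interceptors):

```python
class Document:
    """Toy model of the 'chain reaction': changing text invalidates the embedding."""

    def __init__(self, text, embedding=None):
        self._text = text
        self.embedding = embedding

    @property
    def text(self):
        return self._text

    @text.setter
    def text(self, value):
        if value != self._text:
            self._text = value
            self.embedding = None  # force re-embedding by a later job

doc = Document("hello", embedding=[0.1, 0.2])
doc.text = "hello world"   # handler only touches the text...
print(doc.embedding)       # None -- ...but the stale vector is gone
```

Paired with an async backfill job that fills NULL embeddings, this keeps the re-embedding concern out of every individual CRUD handler.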

@VoVAllen
Member

VoVAllen commented Dec 3, 2024

@aseaday What type of ORM are you using? Could a new SDK for SQLAlchemy help in your scenario?

@VoVAllen
Member

VoVAllen commented Dec 3, 2024

I'm also in favor of using https://github.com/dbos-inc/ to handle the text -> vector mapping. It supports cron jobs.

@aseaday
Member Author

aseaday commented Dec 3, 2024

@aseaday What type of ORM are you using? Could a new SDK for SQLAlchemy help in your scenario?

We use EF Core, the official ORM bundled with .NET.
I simply built a binlog event pipeline to solve this problem, but I think it could be cleaner and better designed.

@VoVAllen
Member

VoVAllen commented Dec 4, 2024

@aseaday Another solution is to use generated columns. However, that may block data insertion if the embedding model doesn't work. Otherwise, we have to set up a separate async job to convert text to embeddings, either inside Postgres or outside it using Python or another SDK.

@mertalev

mertalev commented Dec 4, 2024

Immich handles this through a job scheduler. One job inserts the content and, once complete, queues another job that generates an embedding and inserts it into the database. Processes that rely on the embedding existing are queued once the embedding job completes.

@mertalev

mertalev commented Dec 4, 2024

However, jobs are at the asset level, so there is unfortunately a lot of communication overhead and no batching.
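The chained-job pattern described above can be sketched in a few lines of Python. This is an illustrative toy, not Immich's actual code: inserting content enqueues an embedding job, and work that depends on the embedding is enqueued only after that job completes:

```python
from queue import Queue

jobs = Queue()   # stand-in for a real job scheduler
db = {}          # stand-in for the database

def insert_content(doc_id, text):
    db[doc_id] = {"text": text, "embedding": None}
    jobs.put(("embed", doc_id))   # chain the embedding job

def embed(doc_id):
    # Fake embedding; a real job would call a model or service.
    db[doc_id]["embedding"] = [float(len(db[doc_id]["text"]))]
    jobs.put(("index", doc_id))   # chain work that needs the embedding

def index(doc_id):
    # By construction, this runs only after the embedding exists.
    assert db[doc_id]["embedding"] is not None

handlers = {"embed": embed, "index": index}
insert_content(1, "hello")
while not jobs.empty():
    name, doc_id = jobs.get()
    handlers[name](doc_id)
print(db[1]["embedding"])  # [5.0]
```

The per-asset overhead mentioned above comes from each `embed` job carrying a single document; batching would mean draining several queued IDs into one embedding call.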

4 participants