Fix Dataprep Ingest Data Issue. #1271

Open
wants to merge 7 commits into
base: main

letonghan
Collaborator

@letonghan letonghan commented Feb 7, 2025

Description

Fix Dataprep Ingest Data Issue.

Root Cause:
An update to the langchain_huggingface package changed the output of HuggingFaceEndpointEmbeddings.embed_documents.

Trace:

  1. The updated langchain_huggingface.HuggingFaceEndpointEmbeddings returns embedding vectors of the wrong size.
  2. The wrong-size vectors are saved into the Redis database, and the indices are not created correctly.
  3. As a result, the retriever cannot retrieve data from Redis through the index.
  4. RAG then appears not to work, because the uploaded file cannot be retrieved from the database.
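
The failure mode in the trace above could be caught before anything is written to the vector store. The following is a minimal sketch, not code from this PR: `validate_embeddings` and its parameters are hypothetical names, and it assumes the embedder is expected to return one flat list of floats per input text.

```python
from typing import List, Sequence


def validate_embeddings(
    vectors: Sequence[Sequence[float]], n_texts: int, expected_dim: int
) -> List[List[float]]:
    """Hypothetical guard: reject embedder output with an unexpected shape.

    Raises ValueError instead of letting wrong-size vectors reach the store.
    """
    if len(vectors) != n_texts:
        raise ValueError(f"expected {n_texts} vectors, got {len(vectors)}")

    checked: List[List[float]] = []
    for i, vec in enumerate(vectors):
        # Each entry must be a flat list/tuple of exactly expected_dim numbers.
        if not isinstance(vec, (list, tuple)) or len(vec) != expected_dim:
            raise ValueError(
                f"vector {i} has the wrong shape; expected {expected_dim} floats"
            )
        if not all(isinstance(x, (int, float)) for x in vec):
            raise ValueError(f"vector {i} contains non-numeric entries")
        checked.append([float(x) for x in vec])
    return checked
```

Calling such a check right after `embed_documents` would turn a silent indexing problem into an immediate error at ingest time.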

Solution:
Replace every use of langchain_huggingface.HuggingFaceEndpointEmbeddings with langchain_community.embeddings.HuggingFaceInferenceAPIEmbeddings, and update the related READMEs and scripts.
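
For illustration, the swap might look roughly like the sketch below. This is not the actual diff: `build_embedder`, the environment variable names, and the default model are assumptions made for the example; only the class name and its module come from the description above.

```python
import os


def build_embedder():
    """Hypothetical helper that constructs the replacement embedder."""
    # Imported lazily so the sketch can be read without the package installed.
    from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings

    # Environment variable names and the default model are assumptions.
    return HuggingFaceInferenceAPIEmbeddings(
        api_key=os.getenv("HUGGINGFACEHUB_API_TOKEN", ""),
        model_name=os.getenv("EMBED_MODEL", "BAAI/bge-base-en-v1.5"),
        api_url=os.getenv("TEI_EMBEDDING_ENDPOINT", ""),
    )
```

Both classes expose `embed_documents` and `embed_query`, so call sites that only use those two methods should not need further changes under this assumption.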

Issues

opea-project/GenAIExamples#1473
opea-project/GenAIExamples#1482

Type of change

List the type of change like below. Please delete options that are not relevant.

  • Others (enhancement, documentation, validation, etc.)

Dependencies

None

Tests

Tested locally.

Trace:
1. The update of `langchain_huggingface.HuggingFaceEndpointEmbeddings` produces embedding vectors of the wrong size.
2. The wrong-size vectors are saved into the Redis database as type
   `byte`, and the indices are not created correctly.
3. As a result, the retriever cannot retrieve data from Redis through the
   index.
4. RAG then appears to `not work`, because the uploaded file cannot be
   retrieved from the database.

Solution:
Replace every use of `langchain_huggingface.HuggingFaceEndpointEmbeddings`
with `langchain_community.embeddings.HuggingFaceInferenceAPIEmbeddings`,
and update the related READMEs and scripts.

Related issue: opea-project/GenAIExamples#1482

Signed-off-by: letonghan <[email protected]>
Review thread on comps/dataprep/src/integrations/redis.py (outdated, resolved)
@lianhao
Collaborator

lianhao commented Feb 7, 2025

One more thing: I noticed that the retriever also uses HuggingFaceEndpointEmbeddings. Should we change the retriever to use HuggingFaceInferenceAPIEmbeddings if the latest HuggingFaceEndpointEmbeddings is buggy?

@letonghan
Collaborator Author

> One more thing: I noticed that the retriever also uses HuggingFaceEndpointEmbeddings. Should we change the retriever to use HuggingFaceInferenceAPIEmbeddings if the latest HuggingFaceEndpointEmbeddings is buggy?

The embed_documents function is not used in the Retriever component, so the change does not affect the retriever's functionality.
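
To make the point concrete, here is a toy sketch of a query path that never calls `embed_documents`. It is not the Retriever component's code: `QueryOnlyEmbedder` and `retrieve` are hypothetical names, and the in-memory list stands in for the Redis index.

```python
from typing import List, Tuple


class QueryOnlyEmbedder:
    """Hypothetical stub whose embed_documents is deliberately broken."""

    def embed_query(self, text: str) -> List[float]:
        # Toy embedding: text length plus a constant component.
        return [float(len(text)), 1.0]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        raise RuntimeError("embed_documents is not used on the query path")


def retrieve(embedder, query: str, index: List[Tuple[str, List[float]]]) -> str:
    """Return the id of the stored vector with the largest dot product."""
    q = embedder.embed_query(query)  # only embed_query is called here
    return max(index, key=lambda item: sum(a * b for a, b in zip(q, item[1])))[0]
```

In this toy setup the lookup succeeds even though `embed_documents` would raise, which mirrors the claim that a defect limited to that method leaves the retrieval path untouched.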

Signed-off-by: letonghan <[email protected]>
2 participants