Add semantic chunking pipeline #812

Open
davidmezzetti opened this issue Nov 17, 2024 · 2 comments
Comments

@davidmezzetti
Member

davidmezzetti commented Nov 17, 2024

Evaluate using the Chonkie library to add more sophisticated chunking to txtai.

From reviewing the Chonkie code, it appears that an interface wrapping a txtai vectors instance could be passed in as the embedding model. With this, any supported txtai vectors model (Hugging Face, sentence-transformers, LiteLLM, llama.cpp, etc.) could be used.
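
A rough sketch of the wrapping idea, not the actual Chonkie or txtai API. `EmbeddingsAdapter`, `semantic_chunks`, and `toy_embed` are hypothetical names made up for illustration, and the assumption is that a txtai vectors instance can be reduced to a function that maps a list of texts to a list of vectors. The chunking rule here (start a new chunk when consecutive sentences fall below a cosine similarity threshold) is a simplified stand-in for what a semantic chunker does.

```python
import math
import re
from typing import Callable, List, Sequence

# Assumption: the embedding backend can be expressed as a function that maps
# a list of texts to a list of vectors. This is not a confirmed txtai API.
EmbedFn = Callable[[List[str]], List[Sequence[float]]]


class EmbeddingsAdapter:
    """Hypothetical adapter exposing a small embeddings interface."""

    def __init__(self, embed_fn: EmbedFn):
        self.embed_fn = embed_fn

    def embed_batch(self, texts: List[str]) -> List[List[float]]:
        # Delegate to the wrapped backend and normalize the output to lists
        return [list(vector) for vector in self.embed_fn(texts)]

    def embed(self, text: str) -> List[float]:
        return self.embed_batch([text])[0]

    @staticmethod
    def similarity(a: Sequence[float], b: Sequence[float]) -> float:
        # Cosine similarity, 0.0 when either vector has zero length
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0


def semantic_chunks(text: str, embeddings: EmbeddingsAdapter, threshold: float = 0.3) -> List[str]:
    """Group consecutive sentences while their similarity stays at or above threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []

    vectors = embeddings.embed_batch(sentences)
    chunks = [[sentences[0]]]
    for previous, current, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if embeddings.similarity(previous, current) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])

    return [" ".join(chunk) for chunk in chunks]


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in model for the demo: word counts over a tiny fixed vocabulary."""
    vocab = ["cat", "dog", "pet", "stock", "market", "price"]
    return [[float(re.findall(r"\w+", t.lower()).count(w)) for w in vocab] for t in texts]


if __name__ == "__main__":
    adapter = EmbeddingsAdapter(toy_embed)
    text = "The cat is a pet. The dog is a pet. The stock market fell. The market price dropped."
    for chunk in semantic_chunks(text, adapter):
        print(chunk)
```

In a real integration, `toy_embed` would be swapped for a call into whichever txtai vectors model is configured, so the chunker itself would not need to know which backend produced the vectors.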

@davidmezzetti davidmezzetti changed the title Add Chonkie pipeline Add semantic chunking pipeline Nov 17, 2024
@davidmezzetti davidmezzetti self-assigned this Nov 20, 2024
@davidmezzetti davidmezzetti added this to the v8.1.0 milestone Nov 20, 2024
@davidmezzetti davidmezzetti modified the milestones: v8.1.0, v8.2.0 Dec 10, 2024
@davidmezzetti
Member Author

Docling also has some chunkers that are worth considering: https://ds4sd.github.io/docling/examples/hybrid_chunking/

@bhavnicksm

Hey @davidmezzetti! 👋

I'm glad that you found Chonkie useful for txtai! I would be happy to help accelerate the integration of Chonkie here~ Please let me know if any support is needed.

BTW, we do offer support for customizable embeddings for SemanticChunker and SDPMChunker! Unfortunately, the newly added LateChunker is limited to sentence-transformers only at the moment (though it is a very strong chunking technique).

Thanks 😊

2 participants