documentloaders: ms office docs #1068
base: main
}

var text strings.Builder
for entry, err := doc.Next(); err == nil; entry, err = doc.Next() {
Can doc.Next() return an error for any reason other than there being no more entries?
Umm, that's a good question. I'll have to check and get back to you on that. I'll submit a fix next week ;)
// Process the binary content
for j := 0; j < i; j++ {
// Extract readable ASCII text
if buf[j] >= 32 && buf[j] <= 126 {
I think these numbers should be unexported constants.
Got it!
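A minimal sketch of what the suggested change could look like, assuming the bounds are only needed inside the package. The constant names and the `extractPrintableASCII` helper are hypothetical and are not taken from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// Bounds of the printable ASCII range. The names are illustrative.
const (
	asciiPrintableMin = 32  // space
	asciiPrintableMax = 126 // tilde
)

// extractPrintableASCII keeps only bytes in the printable ASCII range and
// drops everything else (control bytes, DEL, and non-ASCII bytes).
func extractPrintableASCII(buf []byte) string {
	var text strings.Builder
	for _, b := range buf {
		if b >= asciiPrintableMin && b <= asciiPrintableMax {
			text.WriteByte(b)
		}
	}
	return text.String()
}

func main() {
	raw := []byte{0x00, 'H', 'i', 0x07, ' ', '~', 0x7f}
	fmt.Printf("%q\n", extractPrintableASCII(raw)) // prints "Hi ~"
}
```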
I've added a document loader that reads and parses the MS Office file types .doc, .docx, .xls, .xlsx, .ppt, and .pptx. It turned out to be a bit more complicated than I expected, so rather than extracting only the text, I load all of the document data into schema.Document.PageContent.
For Excel files, the loader extracts the sheets and numbers them in the metadata. The .docx and .pptx files are just XML, so I didn't extract the text and simply dumped the XML into PageContent; at a later date, maybe someone who understands the file formats better than I do can build a decent document structure into schema.Document{}.
At the same time, I think there are some advantages to the LLM having access to the entire document structure, not just the text strings.