documentloaders: ms office docs #1068

Open: wants to merge 7 commits into base: main

Conversation

Struki84 (Contributor) commented Nov 19, 2024

I've added a document loader that reads and parses MS Office file types: .doc, .docx, .xls, .xlsx, .ppt, and .pptx. This turned out to be more complicated than I expected, so instead of extracting just the text, I put all of the document data into schema.Document.PageContent.

For Excel files, the loader extracts the sheets and numbers them in the metadata. The .docx and .pptx files are just XML, so I didn't extract the text and simply dumped the XML into PageContent. At a later date, someone who understands the file formats better than I do can build a decent document structure into schema.Document{}.

At the same time, I think there are some advantages to the LLM having access to the entire document structure, not just the text strings.

PR Checklist

  • Read the Contributing documentation.
  • Read the Code of conduct documentation.
  • Name your Pull Request title clearly, concisely, and prefixed with the name of the primarily affected package you changed according to Good commit messages (such as memory: add interfaces for X, Y or util: add whizzbang helpers).
  • Check that there isn't already a PR that solves the problem the same way to avoid creating a duplicate.
  • Provide a description in this PR that addresses what the PR is solving, or reference the issue that it solves (e.g. Fixes #123).
  • Contains test coverage for new functions.
  • Passes all golangci-lint checks.

}

var text strings.Builder
for entry, err := doc.Next(); err == nil; entry, err = doc.Next() {
Collaborator

Can doc.Next() return an error for a reason other than there being no more entries?

Contributor Author

Umm, that's a good question. I'll have to check and get back to you on that. I'll submit a fix next week ;)

// Process the binary content
for j := 0; j < i; j++ {
// Extract readable ASCII text
if buf[j] >= 32 && buf[j] <= 126 {
Collaborator

I think these numbers should be unexported constants.

Contributor Author

Got it!
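
One way the suggestion could look, as a sketch rather than the change that was actually made: the constant and function names below are hypothetical, and the filter assumes the intent is to keep only bytes in the printable ASCII range (32 through 126) and drop everything else.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical names for the bounds that appear as 32 and 126 in the diff.
const (
	asciiPrintableMin = 0x20 // 32, the space character
	asciiPrintableMax = 0x7E // 126, the tilde character
)

// isPrintableASCII reports whether b falls inside the printable ASCII range.
func isPrintableASCII(b byte) bool {
	return b >= asciiPrintableMin && b <= asciiPrintableMax
}

// printableASCII returns only the printable ASCII bytes of buf, in order.
func printableASCII(buf []byte) string {
	var text strings.Builder
	for _, b := range buf {
		if isPrintableASCII(b) {
			text.WriteByte(b)
		}
	}
	return text.String()
}

func main() {
	raw := []byte{0x00, 'H', 'i', 0x07, '!', 0xFF}
	fmt.Printf("%q\n", printableASCII(raw))
}
```

Naming the bounds also gives the range check a single home, so the loop body no longer needs a comment to explain what the numbers mean.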

2 participants