documentloaders: ms office docs #1068

Struki84 · 2024-11-19T11:03:26Z

I've added a document loader that reads and parses MS Office file types: .doc, .docx, .xls, .xlsx, .ppt, and .pptx. It turned out to be more complicated than I expected, so instead of extracting only the text, I take all the document data and put it in schema.Document.PageContent.

For Excel files, the loader extracts the sheets and numbers them in the metadata. The .docx and .pptx files are just XML, so I didn't extract the text and simply dumped the XML into PageContent. At a later date, maybe someone who understands the file formats better than I do can build a decent document structure into schema.Document{}.

At the same time, I think there are some advantages to the LLM having access to the entire document structure, not just the text strings.

PR Checklist

Read the Contributing documentation.
Read the Code of conduct documentation.
Name your Pull Request title clearly, concisely, and prefixed with the name of the primarily affected package you changed according to Good commit messages (such as memory: add interfaces for X, Y or util: add whizzbang helpers).
Check that there isn't already a PR that solves the problem the same way to avoid creating a duplicate.
Provide a description in this PR that addresses what the PR is solving, or reference the issue that it solves (e.g. Fixes #123).
Contains test coverage for new functions.
Passes all golangci-lint checks.


FluffyKebab · 2025-01-05T04:42:30Z

documentloaders/ms_office.go

+	}
+
+	var text strings.Builder
+	for entry, err := doc.Next(); err == nil; entry, err = doc.Next() {


Can doc.Next() return an error for any reason other than there being no more entries?

Struki84 · Jan 6, 2025

Umm, that's a good question. I'll have to check and get back to you on that. I'll submit a fix next week ;)
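One way to address the question, sketched under an assumption: that the reader behind doc.Next() signals the end of the entries with io.EOF, which should be checked against the library's documentation. The `entryIterator` interface, `collectEntries`, `sliceIterator`, and `errCorrupt` below are hypothetical stand-ins, not the PR's types. The sketch shows the loop treating io.EOF as normal termination and returning every other error.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// entryIterator is a hypothetical stand-in for the reader used in the PR.
// Assumption: Next returns io.EOF once there are no more entries.
type entryIterator interface {
	Next() (string, error)
}

// errCorrupt is a made-up error used to simulate a failure mid-iteration.
var errCorrupt = errors.New("corrupt entry")

// collectEntries drains the iterator. io.EOF ends the loop normally; any
// other error is wrapped and returned instead of being silently dropped.
func collectEntries(it entryIterator) ([]string, error) {
	var entries []string
	for {
		entry, err := it.Next()
		if errors.Is(err, io.EOF) {
			return entries, nil
		}
		if err != nil {
			return entries, fmt.Errorf("read next entry: %w", err)
		}
		entries = append(entries, entry)
	}
}

// sliceIterator is a small fake for exercising collectEntries.
type sliceIterator struct {
	items []string
	fail  error // returned once items run out; nil means io.EOF
}

func (s *sliceIterator) Next() (string, error) {
	if len(s.items) == 0 {
		if s.fail != nil {
			return "", s.fail
		}
		return "", io.EOF
	}
	item := s.items[0]
	s.items = s.items[1:]
	return item, nil
}

func main() {
	entries, err := collectEntries(&sliceIterator{items: []string{"first", "second"}})
	fmt.Println(entries, err)

	_, err = collectEntries(&sliceIterator{items: []string{"first"}, fail: errCorrupt})
	fmt.Println(err)
}
```

With the `err == nil` loop condition from the diff, a read failure and a clean end of input look identical to the caller, so a truncated or corrupt file would produce partial text without any error.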

FluffyKebab · 2025-01-05T04:45:07Z

documentloaders/ms_office.go

+				// Process the binary content
+				for j := 0; j < i; j++ {
+					// Extract readable ASCII text
+					if buf[j] >= 32 && buf[j] <= 126 {


I think these numbers should be unexported constants.
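A minimal sketch of what that could look like. The constant names and the `extractPrintableASCII` helper are hypothetical; the values 32 and 126 come from the diff and are the first and last printable ASCII characters (space and tilde).

```go
package main

import (
	"fmt"
	"strings"
)

// Bounds of the printable ASCII range. Names are illustrative only.
const (
	asciiPrintableMin = 32  // ' ' (space)
	asciiPrintableMax = 126 // '~' (tilde)
)

// extractPrintableASCII keeps only the bytes that fall inside the printable
// ASCII range and drops everything else (control bytes, high bytes).
func extractPrintableASCII(buf []byte) string {
	var text strings.Builder
	for _, b := range buf {
		if b >= asciiPrintableMin && b <= asciiPrintableMax {
			text.WriteByte(b)
		}
	}
	return text.String()
}

func main() {
	raw := []byte{0x00, 'H', 'i', 0x07, ' ', '~', 0x7f, 0xff}
	fmt.Printf("%q\n", extractPrintableASCII(raw))
}
```

Ranging over the slice also removes the manual index `j`, though the original loop bound `i` (the number of bytes read) would need to be applied by slicing the buffer first, for example `buf[:i]`.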

Struki84 added 7 commits November 5, 2024 00:01

Added core office docs documentloader

c92caf8

Added loader for all the common office doc types, doc, docx, ppt, pptx, xls, xlsx

934f859

Added dependencies

e2f0d82

Added test files

13b8236

Added tests

192abd6

Added comments

6ef49eb

Fixed the linter errors

47d9761

FluffyKebab reviewed Jan 5, 2025
