[Feature Request] [PDF] Option to use PDF outlines (ToC metadata) for header extraction #107

ChiNoel-osu · 2024-08-22T08:45:28Z

Some PDFs have built-in outlines (or chapters). When an outline is available, we could use it instead of font size, which would make header extraction much more accurate.
Header extraction based on font size can break when a title mixes several font sizes for aesthetic reasons.
For example, take a numbered chapter title that reads "1 Introduction", where the "1" is larger than "Introduction". The extracted header would look like this:

### Introduction
# 1

I tried to implement it myself, but I'm not sure how to map the outline text to the extracted text and prepend the appropriate hashtags, especially when the outline text differs slightly from the extracted text.
For example, an outline entry "4 Appendix" may refer to the extracted text "4Appendix".
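
One way to picture the mapping is to compare both texts with all whitespace removed, so that "4 Appendix" and "4Appendix" count as the same string. This is only a minimal sketch under assumptions: `squash` and `apply_outline_headers` are hypothetical helper names, `lines` stands in for the already extracted text lines, and `toc` is assumed to be a list of (level, title) pairs, for instance the first two fields of the entries that PyMuPDF's `Document.get_toc()` returns.

```python
import re

def squash(text):
    """Drop all whitespace and ignore case, so '4 Appendix' equals '4Appendix'."""
    return re.sub(r"\s+", "", text).casefold()

def apply_outline_headers(lines, toc):
    """Prefix every line that matches an outline title with Markdown hashtags.

    lines: extracted text lines (hypothetical input).
    toc:   (level, title) pairs taken from the PDF outline (assumed shape).
    """
    # Build the lookup once so each line costs a single dictionary access.
    levels = {squash(title): level for level, title in toc}
    result = []
    for line in lines:
        level = levels.get(squash(line))
        result.append(f"{'#' * level} {line}" if level else line)
    return result

toc = [(1, "1 Introduction"), (2, "4 Appendix")]
lines = ["1 Introduction", "Some body text.", "4Appendix", "More text."]
print(apply_outline_headers(lines, toc))
```

The sketch only handles whitespace and case differences, and it would also mark a matching line on a table-of-contents page, so it is a starting point rather than a full answer.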

JorjMcKie · 2024-08-22T12:01:01Z

Maintainer

This is an excellent idea, and one I have been considering myself for some time. I ran into the same issue: what if the text found in the document is only almost equal to the text in the outline item ...

But I keep thinking about it. Maybe both texts can be made (more) equal with some extra treatment before comparing, or by using regular expressions ...
Whatever the solution, it will cost some performance.
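
To make the "almost equal" case concrete, here is a hedged sketch that normalizes both texts and then accepts a match above a similarity threshold, using `difflib.SequenceMatcher` from the Python standard library. The helper names and the 0.9 threshold are assumptions chosen for illustration, not anything the library does today. It also shows where the performance cost comes from: every candidate line is compared against every outline title.

```python
import re
from difflib import SequenceMatcher

def squash(text):
    """Remove whitespace and ignore case before comparing."""
    return re.sub(r"\s+", "", text).casefold()

def best_outline_match(line, titles, threshold=0.9):
    """Return the outline title most similar to `line`, or None.

    threshold is an assumed cut-off; 1.0 would demand equality
    after normalization.
    """
    key = squash(line)
    best_title, best_score = None, threshold
    for title in titles:
        # ratio() is 1.0 for identical strings and lower for partial matches.
        score = SequenceMatcher(None, key, squash(title)).ratio()
        if score >= best_score:
            best_title, best_score = title, score
    return best_title

titles = ["1 Introduction", "4 Appendix"]
print(best_outline_match("4Appendix", titles))
print(best_outline_match("Some body text.", titles))
```

A lower threshold tolerates more differences but raises the risk of tagging ordinary body text as a header, so the value would need tuning on real documents.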

0 replies

ChiNoel-osu · 2024-08-22T14:18:23Z

Author

I was also thinking about using regex, but I suck at it, so I ended up doing outline.split(" ",1)[-1], which turns "4 Appendix" into "Appendix". I then search for that string from the end of the text (so it doesn't replace the one on the ToC page lol), put hashtags in front, and hope there's no other "Appendix" occurrence.

Come to think of it, I could append a "\n" to the search string to improve accuracy, but whatever :/

It works for my small set of documents (they all have consistent formatting), but it's more of a hack than a solution.
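
For readers who want to try it, the hack could look roughly like the sketch below. The function name `mark_last_occurrence` and the `level` parameter are made up for illustration. As one small deviation from the description above, the hashtags go at the start of the line that contains the match rather than directly in front of the matched string, so a chapter number fused to the title stays inside the header.

```python
def mark_last_occurrence(text, outline_title, level=1):
    """Put hashtags before the last line that ends with the outline title."""
    # Drop the leading chapter number: "4 Appendix" -> "Appendix".
    needle = outline_title.split(" ", 1)[-1]
    # Search from the end so an earlier mention (e.g. on the ToC page)
    # is skipped; the trailing "\n" only matches at the end of a line.
    pos = text.rfind(needle + "\n")
    if pos == -1:
        return text  # title not found, leave the text unchanged
    # Back up to the start of the line that contains the match.
    line_start = text.rfind("\n", 0, pos) + 1
    return text[:line_start] + "#" * level + " " + text[line_start:]

text = "Contents\n4 Appendix\nBody text.\n4Appendix\nTables follow.\n"
print(mark_last_occurrence(text, "4 Appendix", level=2))
```

It still relies on the title text not reappearing at the end of a later line, so the same caveat about consistent formatting applies.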

1 reply

JorjMcKie Aug 22, 2024
Maintainer

I appreciate everything you wrote, and I hate regex too!
Quote from a friend:

I had a problem, so I started using regex.
Now I have two problems.
