[Feature Request] [PDF] Option to use PDF outlines (ToC metadata) for header extraction #107
This is an excellent idea, which I have also been considering for some time. And I run into the same issue: what if the text found in the document is only almost equal to the text in the outline item? I keep thinking about it. Maybe both texts can be made (more) equal with some extra treatment before comparing, or by using regular expressions ...
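As a rough illustration of the "extra treatment before comparing" idea, here is a minimal sketch. The helper names `normalize` and `titles_match` and the 0.9 threshold are assumptions made for this example, not part of any existing API. Both strings are reduced to a comparison key (Unicode-folded, lowercased, with everything except letters and digits removed), and a similarity ratio from the standard library's `difflib` serves as a fallback for near matches.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Build a comparison key: NFKC-normalize, case-fold, keep only letters and digits."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"[\W_]+", "", text)


def titles_match(outline_title: str, line_text: str, threshold: float = 0.9) -> bool:
    """Return True if the two titles are equal or nearly equal after normalization.

    The threshold is an arbitrary starting point (an assumption) and would
    need tuning against real documents.
    """
    a, b = normalize(outline_title), normalize(line_text)
    if not a or not b:
        return False  # never match empty keys
    if a == b:
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

With this key, `4 Appendix` and `4Appendix` compare equal because whitespace is dropped before the comparison. The fuzzy fallback is a trade-off: it tolerates small differences but can also produce false positives on very short titles.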
I was also thinking about using regex but I suck at it so I ended up doing
Come to think of it, I could add a "\n" at the back to improve accuracy, but whatever :/ It works for my small set of documents (they all have consistent formatting) but it's more of a hack than a solution.
Some PDFs have built-in outlines (or chapters), so instead of relying on font size for headers, we can use those for much more accurate header extraction (when the outlines are available).
Header extraction based on font size can break when a title has multiple font sizes for aesthetic reasons.
For example: if a numbered chapter title says `1 Introduction` but the `1` is larger than `Introduction`, the extracted header would look like this:

I tried to implement it myself but I'm not sure how I can map the outline text to the extracted text and append the appropriate hashtags, especially when the outline text is slightly different from the extracted text.
For example: an outline entry `4 Appendix` may refer to the extracted text `4Appendix`.
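The mapping described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the outline is assumed to be available as `(level, title)` pairs in document order, the extracted text as a list of lines, and `apply_outline` and `_key` are hypothetical names. Lines are compared with all whitespace removed, and a matching line is replaced by the outline title prefixed with one `#` per outline level.

```python
import re


def _key(text: str) -> str:
    """Comparison key: drop all whitespace and ignore case."""
    return re.sub(r"\s+", "", text).casefold()


def apply_outline(lines, toc):
    """Prefix lines that match an outline entry with Markdown hashtags.

    lines: extracted text lines (list of str).
    toc:   outline entries as (level, title) pairs in document order
           (an assumed shape for this sketch).
    """
    keys = [_key(title) for _, title in toc]
    out = []
    pending = 0  # index of the first outline entry not yet matched
    for line in lines:
        k = _key(line)
        # Search only entries that have not been matched yet, so the
        # outline order is respected and a missed entry does not block
        # the entries that follow it.
        hit = next((j for j in range(pending, len(toc)) if k and keys[j] == k), None)
        if hit is None:
            out.append(line)
        else:
            level, title = toc[hit]
            out.append("#" * level + " " + title)
            pending = hit + 1
    return out


extracted = ["4Appendix", "Some text", "4.1 Extra", "more"]
outline = [(1, "4 Appendix"), (2, "4.1 Extra")]
print(apply_outline(extracted, outline))
```

Using the outline title as the header text sidesteps the mixed-font-size problem, since the header no longer depends on how the title was extracted. One known weakness of this sketch: a body line that happens to equal a later outline title would be promoted to a header and cause earlier entries to be skipped.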