Incorrect bounding box on TimesNewRomanPSMT #749

Open
grinay opened this issue Jan 10, 2024 · 6 comments
Labels
document-reading Related to reading documents

Comments

@grinay
Contributor

grinay commented Jan 10, 2024

Hi. I found an issue with the bounding boxes for the font TimesNewRomanPSMT.
document11.pdf
If you extract the bounding box for the word "difficulty" on the first page
image
you will see that the bounding box is shifted. I tested the same case in itextsharp and found that it uses the font descender to calculate the bounding box.
Looking through the PdfPig code, I can't find any use of the font descender. Is that correct? Could you advise how to fix this?

@BobLd
Collaborator

BobLd commented Jan 14, 2024

Hi @grinay, would you be able to share a screenshot of the itextsharp bounding box?

The bounding box here looks correct to me, but I might be wrong. For words, the bounding boxes start from the baseline points. Maybe compare the bounding box of this word with the bounding boxes of its letters.

@grinay
Contributor Author

grinay commented Jan 18, 2024

Hi @BobLd, I've made a simple test with itextsharp: https://gist.github.com/grinay/ff5bba3c0b9b6a81f11413ca669583ff.
It outputs the positions yStart: 552,51666, yEnd: 564,51666.
Here is how I originally extracted the positions for letters; based on this information I later take the max and min of the Y positions.

foreach (var letter in word.Letters)
{
    var rectangle = new[]
    {
        (float)letter.StartBaseLine.X - (float)pigPage.MediaBox.Bounds.Left,
        (float)textLine.BoundingBox.Bottom - (float)pigPage.MediaBox.Bounds.Bottom,
        (float)letter.EndBaseLine.X - (float)pigPage.MediaBox.Bounds.Left,
        (float)textLine.BoundingBox.Top - (float)pigPage.MediaBox.Bounds.Bottom,
    };
    ....
}

I was able to fix it without modifying pdfpig.
I wrote a function that uses reflection to reach the resource store of the Page (resourceStore -> loadedFonts) and get the font descriptors. Currently I only handle the font types Type1FontSimple and TrueTypeSimpleFont.
Once I have access to the font descriptor, I can correct the Y position:

// Current font
var fontName = word.Letters.First().FontName;
var pointSize = word.Letters.First().PointSize;
var fontDescriptor = fontDescriptors.FirstOrDefault(x => x.FontName == fontName);

foreach (var letter in word.Letters)
{
    // If the font has a descriptor (Descent and Ascent), correct the bounding box
    if (fontDescriptor != null)
    {
        var descent = fontDescriptor.Descent;
        var ascent = fontDescriptor.Ascent;
        var fontBbox = fontDescriptor.BoundingBox;

        // This logic is taken from itextsharp: rescale so that ascent - descent == 1000
        var maxAscent = Math.Max((decimal)fontBbox.TopRight.Y, ascent);
        var minDescent = Math.Min((decimal)fontBbox.BottomRight.Y, descent);
        ascent = maxAscent * 1000 / (maxAscent - minDescent);
        descent = minDescent * 1000 / (maxAscent - minDescent);

        // Offsets from the baseline, scaled from 1/1000 units to the point size
        var bboxHeight = descent / 1000 * (decimal)pointSize;  // bottom offset
        var bboxHeight2 = ascent / 1000 * (decimal)pointSize;  // top offset

        rectangle = new[]
        {
            (float)letter.StartBaseLine.X - (float)pigPage.MediaBox.Bounds.Left,
            (float)letter.StartBaseLine.Y + (float)bboxHeight - (float)pigPage.MediaBox.Bounds.Bottom,
            (float)letter.EndBaseLine.X - (float)pigPage.MediaBox.Bounds.Left,
            (float)letter.StartBaseLine.Y + (float)bboxHeight2 - (float)pigPage.MediaBox.Bounds.Bottom,
        };
    }
    ...
}

After this fix everything looks correct. I've tested on other documents, and it works.
image
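
To make the arithmetic in the workaround above easier to follow, here is a minimal, self-contained sketch of the same normalization. The names FontMetricsSketch, Normalize and VerticalExtent are hypothetical and are not part of PdfPig or itextsharp. It uses double instead of decimal for brevity and assumes unrotated text.

```csharp
using System;

// Hypothetical helper for illustration only; not a PdfPig or itextsharp API.
public static class FontMetricsSketch
{
    // Inputs are font descriptor values in 1/1000 text space units.
    // Returns ascent and descent rescaled so that Ascent - Descent == 1000.
    public static (double Ascent, double Descent) Normalize(
        double bboxTop, double bboxBottom, double ascent, double descent)
    {
        double maxAscent = Math.Max(bboxTop, ascent);
        double minDescent = Math.Min(bboxBottom, descent);
        double height = maxAscent - minDescent;
        if (height <= 0)
        {
            throw new ArgumentException("Font metrics must span a positive height.");
        }

        return (maxAscent * 1000 / height, minDescent * 1000 / height);
    }

    // Vertical extent of a glyph box around a baseline for a given point size.
    public static (double Bottom, double Top) VerticalExtent(
        double baselineY, double pointSize, double normAscent, double normDescent)
    {
        double bottom = baselineY + normDescent / 1000 * pointSize;
        double top = baselineY + normAscent / 1000 * pointSize;
        return (bottom, top);
    }
}
```

With invented inputs (bbox top 900, bbox bottom -300, ascent 891, descent -216), Normalize gives 750 and -250, so a baseline at y = 100 with a 12 point font spans 97 to 109. Because the rescaled ascent and descent always differ by 1000, the box height always equals the point size. The itextsharp output quoted earlier (552,51666 to 564,51666) spans exactly 12 units, which would fit a 12 point font, although the point size is not stated in this thread.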

@BobLd
Collaborator

BobLd commented Jan 18, 2024

@grinay Thanks a lot for the great explanation of your fix. I had a look on my side and I think there's an easier way to achieve what you want to do.

Can you try to compute the word bounding box using the following:

foreach (var word in page.GetWords())
{
	var first = word.Letters[0];
	var last = word.Letters[word.Letters.Count - 1];

	double x1 = first.GlyphRectangle.TopLeft.X;
	double x2 = last.GlyphRectangle.TopRight.X;
	double y1 = word.Letters.Max(l => l.GlyphRectangle.TopLeft.Y);
	double y2 = word.Letters.Min(l => l.GlyphRectangle.BottomLeft.Y);
	var bbox = new PdfRectangle(x1, y1, x2, y2); // This is the bbox you're looking for
	DrawRectangle(bbox, canvas, redPaint, size.Height, Scale);
}

I've pushed my code to https://github.com/BobLd/PdfPig/tree/issue-749-word-bbox (NB: this branch is not directly based on master and contains changes that are in #757).

Have a look in the test GenerateLetterGlyphImages, Issue749(). It will generate an image saved in the ImagesGlyphs folder:
image

You can see (above) that the "difficulty" bbox is now what you want. One difference between my code and yours might be the bbox of words that do not contain letters with ascenders and descenders, for example "The" and "in", in "The difficulty in".
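
The difference described above can be sketched without PdfPig. The Box type and GlyphUnionSketch class below are hypothetical stand-ins (not PdfPig's PdfRectangle), and the coordinates used in the note afterwards are invented. Unlike the snippet above, this sketch takes the horizontal extent from all letters rather than from the first and last, and it assumes unrotated text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a glyph rectangle; not PdfPig's PdfRectangle.
public readonly struct Box
{
    public Box(double left, double bottom, double right, double top)
    {
        Left = left;
        Bottom = bottom;
        Right = right;
        Top = top;
    }

    public double Left { get; }
    public double Bottom { get; }
    public double Right { get; }
    public double Top { get; }
}

public static class GlyphUnionSketch
{
    // Smallest axis-aligned box that contains every glyph box.
    public static Box Union(IReadOnlyCollection<Box> glyphs)
    {
        if (glyphs.Count == 0)
        {
            throw new ArgumentException("At least one glyph box is required.", nameof(glyphs));
        }

        return new Box(
            glyphs.Min(g => g.Left),
            glyphs.Min(g => g.Bottom),
            glyphs.Max(g => g.Right),
            glyphs.Max(g => g.Top));
    }
}
```

For example, two glyph boxes that both sit on a baseline at y = 100 produce a union whose bottom is 100, while adding a glyph whose box reaches down to 97.4 lowers the union's bottom to 97.4. A union of glyph boxes therefore changes height from word to word, whereas a box derived from font descriptor metrics has the same height for every word at a given point size.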

@BobLd
Collaborator

BobLd commented Jan 18, 2024

Let's leave the issue open, as I'd like to do some further tests.

@grinay
Contributor Author

grinay commented Jan 19, 2024

@BobLd yes, it looks good in the image. As I understand it, this will not work on the current master branch, right? Should we wait until you merge your changes? By the way, the image shows that for some words, like "program", the upper boundary is shifted down. That was the reason we had to extend boxes by up to 20%, as in other documents the boxes were very inaccurate for us. After I applied the fix with the font descriptor, this problem disappeared and we removed the code that extended the boxes.

@BobLd
Collaborator

BobLd commented Jan 20, 2024

@grinay I think you can try my fix with the current PdfPig version, or with the latest pre-released version. But do let me know if that does not work.

As you point out, I do think some issues remain (especially the upper boundary problem), and your approach might be the solution. This is what I want to look into.

I am referring to the upper boundary in your initial screenshot:
image
The upper boundary is too high compared to what it should be, and that might be related to the ascender/descender. The lower boundary is correct though, as we use the baseline points of letters (i.e. not the bottom points, so it excludes any descender) as the bottom of the word bounding box (this is different from what you are looking for).
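
As a rough illustration of the baseline point made above, the following sketch measures how far glyphs extend below a baseline-based bottom edge. The class name, method name and tuple shape are hypothetical and are not PdfPig types; the sketch assumes unrotated text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not a PdfPig API.
public static class BaselineVsGlyphSketch
{
    // Each letter is described by the Y of its baseline and the Y of its glyph box bottom.
    // Returns how far the lowest glyph reaches below the lowest baseline.
    public static double DescenderOverhang(
        IReadOnlyCollection<(double BaselineY, double GlyphBottomY)> letters)
    {
        if (letters.Count == 0)
        {
            throw new ArgumentException("At least one letter is required.", nameof(letters));
        }

        double baselineBottom = letters.Min(l => l.BaselineY);
        double glyphBottom = letters.Min(l => l.GlyphBottomY);
        return baselineBottom - glyphBottom;
    }
}
```

A result of 0 means no glyph drops below the baseline, so a baseline-based bottom and a glyph-based bottom coincide. A positive result is the descender depth that a baseline-based word box leaves out.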

Let's leave the issue open for now so that we keep that in mind

@EliotJones added the document-reading (Related to reading documents) label Feb 18, 2024

3 participants