Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce IBlock and ILettersBlock interfaces #864

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

davebrokit
Copy link
Contributor

Part of #855

@@ -0,0 +1,306 @@
namespace UglyToad.PdfPig.DocumentLayoutAnalysis.Export
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a copy of PageXmlTextExporter with changes to make it more general. With this you pass a list of IBlocks and it generates the PageXml for that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you exclude this class from the PR? This could be added back later on in a different PR if need be. Ideally, we want to avoid creating different versions of classes that do the same thing (PageXmlGeneralExporter and PageXmlTextExporter)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will Raise a separate PR and can discuss on that

@@ -0,0 +1,306 @@
namespace UglyToad.PdfPig.DocumentLayoutAnalysis.Export
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you exclude this class from the PR? This could be added back later on in a different PR if need be. Ideally, we want to avoid creating different versions of classes that do the same thing (PageXmlGeneralExporter and PageXmlTextExporter)

/// <summary>
/// Interface for classes with a bounding box
/// </summary>
public interface IBlock
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change IBoundingBox to IBlock? Let's revert back to IBoundingBox as the intent is clearer

/// </summary>
TextOrientation TextOrientation { get; }

/// <summary>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are you trying to achieve here by adding the letters collection? Letters collection should only be at Word level.

Having a collection of Letters at Letter level does not make sense and is counterintuitive... Similar can be said for TextBlock and TextLine classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the reading orders use TextSequence to order words. This exposes the letters so we can get that.

"Having a collection of Letters at Letter level does not make sense and is counterintuitive... ". Goal there was to make Letters behave similar to other text blocks. But your comment makes sense so I'll change the Letter class

Copy link
Collaborator

@BobLd BobLd Jun 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to be able to compute a TextSequence value at TextBlock and TextLine level, exposing IReadOnlyList<Letter> is not the solution (as such, the ILettersBlock interface does not make sense to me). You need to find another way to compute the TextSequence value...

Copy link
Contributor Author

@davebrokit davebrokit Jun 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ILettersBlock interface represents a collection of letters. Anyone can implement their own class that follows this contract as long as that class represents a collection of letters. E.g You can create a Paragraph class that implements this interface. It would make sense then to expose the Letter's that form the collection

For issue #855 there are effectively 2 types of behaviour in the reading orders. Using the TextSequence of letters or using the BoundingBox in order to determine reading order.

The implementation for when the BoundingBox is used is to create an IBoundingBox interface and use the BoundingBox property.

For the TextSequence there are 3 choices I tried:

  1. Compute the Average TextSequence in the Reading Orders. This was currently done in multiple places. To reflect the IBoundingBox changes in the next PR this would mean introducing a generic function that checks type and if type is TextBlock TextLine or Word then return calculate the average TextSequence. This option is more complex (generics can be painful) and limits the RenderingReadingOrderDetector to only work with the above classes. UnsupervisedReadingOrderDetector also has an option to use RendingOrder which would be limited to the classes above.
  2. ILettersBlock with a field AverageTextSequence. This works but you force every implementer of ILettersBlock to implement AverageTextSequence. Not necessarily useful for anything but the ReadingOrderDetectors
  3. ILettersBlock expose IReadOnlyList<Letter> This felt like the best solution as RenderingReadingOrderDetector and UnsupervisedReadingOrderDetector code was simpler (if it has letters property then we know we can calculate the average text sequence), this interface then represents a collection of Letters and there's tons of fun stuff library users can do with that :). And it's consistent with the use of a property on the interface as we do when ordering by Bounding Boxes.

With option 3 UnsupervisedReadingOrderDetector looks like this with generics.

public IEnumerable<TBlock> Get<TBlock>(IEnumerable<TBlock> blocks)
     where TBlock : IBoundingBox
{
    // Ordered by is a stable sort: if the keys of two elements are equal, the order of the elements is preserved 
    var ordered = blocks.OrderBy(b => GetAverageTextSequenceOrDefaultToZero(b));

    //..

    return ordered;
}

private double GetAverageTextSequenceOrDefaultToZero<TBlock>(TBlock block)
    where TBlock : IBoundingBox
{
    if (block is ILettersBlock textBlock)
    {
        return textBlock.Letters.Average(x => x.TextSequence);
    }

    return 0;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I think the approach you want to take is too generic. For example UnsupervisedReadingOrderDetector only makes sense at TextBlock level and should stay the same (see paper is used at the time). This does not mean we could not have ReadingOrderDetector at other levels. I do agree with the need of a IBoundingBox interface though.

Nothing prevents you to do the following:

public interface IReadingOrderDetector<T> where T: IBoundingBox
{
        IEnumerable<T> Get(IReadOnlyList<T> textBlocks);
}

[...]

public class RenderingReadingOrderDetector : IReadingOrderDetector<TextBlock> 
{
        [...]
}

[...]

public class UnsupervisedReadingOrderDetector : IReadingOrderDetector<TextBlock>
{
        [...]
}

One mistake (I believe) I did at the time was to set a reading order property at TextBlock level (block.SetReadingOrder(readingOrder++);). This property is in fact not needed and is redundant (DefaultReadingOrderDetector doesn't even sets it).

  • Also, exposing these Letter collection everywhere means you are allocating a lot of arrays, which is something we really want to avoid (see previous PRs about optimisation by @iamcarbon):
words.SelectMany(w => w.Letters).ToList().AsReadOnly();

All that being said, I feel we moved too far away from the original problem raised in #844. In line with what was described in the issue, I too believe the issue is with

I guess this issue is due to: XYLeaf.GetLines use Words.Groupy(x => x.BoundingBox.Bottom) is not robust if words baselines/bottoms have some tiny differences.

I hope you won't take my feedback too harshely as you've already contributed greatly to the repo, but I'm not willing to merge the PR as is.

Copy link
Contributor Author

@davebrokit davebrokit Jun 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues with your feedback. It's a productive discussion as several ways to implement this feature. I had some of the questions you raised when implementing so happy to discuss them with someone rather than just me :)

The issue in question here is #855. #844 will be raised on a separate PR but depends on this for some of the refactoring I performed.

Your comments:
Approach too generic
a. UnsupervisedReadingOrderDetector ->

  • I added some tests at the Word level and it doesn't make sense. But UnsupervisedReadingOrderDetector does make sense at the TextLine level, and would make sense if someone creates blocks that are a combination of TextBlocks and something else (Example is I've used it for my own custom class is a wrapper around a TextBlock or a Table from TabulaSharp which you wrote). So I would argue that it is generic enough to be used at any level apart from the word level (Any class that implements IBoundingBox)
  • Secondly the interface should be generic enough that someone can contribute to the library their own reading detector orders for Word's. For example a words in a line reading order detector. In the follow on PR I added something like that.

Nothing prevents you to do the following:
I tried that with the initial implementation. Following the argument in a) that is should be generic enough then we have to include the T parameter in the class definition otherwise we have multiple UnsupervisedReadingOrderDetector one for each type and if a user has their own custom type things get funky... . The problem I ran into here is that this breaks compatabilty with existing code using the library as you need add the T parameter when constructing:

public class RenderingReadingOrderDetector<T> : IReadingOrderDetector<T> 
{
        [...]
}

public class UnsupervisedReadingOrderDetector<T>: IReadingOrderDetector<T>
{
        [...]
}

var it = new RenderingReadingOrderDetector<TextBlock>()

Using a generic on the method keeps compatibilty with existing code.

c. SetReadingOrder -> I kept it in for compatibilty sake but happy to take it out.

Also, exposing these Letter collection everywhere means you are allocating a lot of arrays, which is something we really want to avoid (see previous PRs about optimisation by
I'm happy to change it to a lazy property that computes when fetched. That would remove the performance issue

public IReadOnlyList<Letter> Letters => words.SelectMany(w => w.Letters).AsReadOnly();

It can also be cached in a field after computed the first time to make it more efficent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BobLd Just wanted to follow up on above

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davebrokit I'm still here, I just don't have the time short term to focus on this area of PdfPig. I'll try to make time soon to have another look at how to address the points we discsussed above

/// <summary>
/// This letter as as List of Letters in order to implement ILettersBlock interface
/// </summary>
public IReadOnlyList<Letter> Letters => [this];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see previous comment

@EliotJones
Copy link
Member

@BobLd are we waiting on any further changes for this PR?

@BobLd
Copy link
Collaborator

BobLd commented Sep 29, 2024

@EliotJones I'd like to leave this PR open as the point addressed here are relevant. I won't have time short term to focus on this part of the library as i'd like to focus on bug fixes and implementing missing parts of the pdf specifications

@davebrokit
Copy link
Contributor Author

Hi @EliotJones @BobLd

It'll be great to get some progress on this PR. I've responded to your comments @BobLd on the changes you requested:

Done:

  • Split my big PR into 3 PRs
  • change to IBoundingBox
  • Exclude PageXmlGeneralExporter from PR

Todo (was waiting for response to my comment before making the change):

  • Remove IBoundingBox interface from letter Letter
  • Remove SetReadingOrder
  • Change Letters property to lazy collection (depending on response to comment)

The contentious issue is should Word, TextLine, TextBlock implement the new ILettersBlock interface and the reading orders use the IBoundingBox in a future PR. I've made my arguments why I think they should and I'm waiting on a response. I can then make the necessary changes and hopefully close this PR after that.

It'll be great to get a response as this is functionality I'm using and the PR has been hanging around for 3 months 😞

I also have 2 follow on PRs that are dependent on this: One to address the line segementer issue raised in #844 and one to change the Reading order detectors that finish off this issue #855. The code has been written for them, but I haven't raised them yet as changes may need to be made following the outcome of this discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants