Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added expected result of lexical analysis of CompactedPDFSyntaxTest.pdf as JSON #6

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

frankrem
Copy link

@frankrem frankrem commented Mar 11, 2022

File CompactedPDFSyntaxTest.pdf.json provides a JSON representation of the PDF test file and can be used to assert correctness of the PDF lexical analyzer.

CompactedSyntax/README.md Show resolved Hide resolved
@@ -0,0 +1,1658 @@
[
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This JSON appears not to document the trailer - is that correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And "Null" should really be "null"

Copy link
Author

@frankrem frankrem Mar 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And "Null" should really be "null"

Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This JSON appears not to document the trailer - is that correct?

This becomes a matter of defining the scope of the lexical analyser (e.g. does it include decryption and object stream parsing?). After the indirect objects have been parsed and identified by object number, the trailer has no meaning anymore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trailer in the test PDF does have a very specific parsing test (comment after name without whitespace) that is not duplicated elsewhere which is why I asked.

And no matter which way you look at or name things, something somewhere has to lex and parse the startxref, trailer(s), xref(s), incremental updates, etc. before you get to knowing anything about potential decrypting (which itself requires lexing/parsing of certain key data structures) which is before you eventually get to lexing/parsing the majority of main body objects in a PDF. The PDF spec is also not prescriptive about things such as lazy resolution, when and in what order you lex, parse or validate certain syntactic constructs, etc. - arguably this is just one reason "shadow attacks" are successful.

I've no problem in accepting your JSON contribution (which I think is really useful as a human-understandable version) so long as such caveats and assumptions are clearly documented as not all lexer/parsers use the same assumptions, hence why I was asking about the originating source.

Maybe the simpler solution is just to put this documentation into the README.md?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I agree. I will update the README.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And "Null" should really be "null"

To be consistent with the casing of null, I have pushed a new version that formats JSON properties and PDF object types lower-case.

"Operands": [
{
"Type": "Real",
"Value": 0.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be ".0" not "0.0"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand why you want to distinguish between .0 and 0.0 after lexical analysis?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly because not all lexers and tokenizers are equal.
And there are nasty things like signed zeros that can and do occur (not for this specific line, but my other comments elsewhere).
And parsing JSON is also parser-dependent (see https://seriot.ch/projects/parsing_json.html and https://labs.bishopfox.com/tech-blog/an-exploration-of-json-interoperability-vulnerabilities).
So if the value of "Value" was a string rather than a post-processed number then it would avoid all assumptions but still keep the nicely decomposed structure from the PDF as a more friendly JSON (thank you!). (And, yes, slightly more work for each user)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly because not all lexers and tokenisers are equal (both in terms of implementation details and technology) a neutral reference is helpful. By representing a real or integer as a string, verification would still rely on the parsing of the particular lexer. In my view 0.0 is the most neutral representation of all zero real numbers such as -.0, -0.0, +.0, +0.00000, -.00000, etc.

https://json-schema.org/understanding-json-schema/reference/numeric.html explicitly discourages representing a number as a string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants