Added expected result of lexical analysis of CompactedPDFSyntaxTest.pdf as JSON #6
base: main
Conversation
@@ -0,0 +1,1658 @@
[
This JSON appears not to document the trailer - is that correct?
And "Null" should really be "null"
> And "Null" should really be "null"

Resolved
> This JSON appears not to document the trailer - is that correct?
This becomes a matter of defining the scope of the lexical analyser (e.g. does it include decryption and object stream parsing?). After the indirect objects have been parsed and identified by object number, the trailer has no meaning anymore.
The trailer in the test PDF does contain a very specific parsing test (a comment immediately after a name, with no intervening whitespace) that is not duplicated elsewhere, which is why I asked.
And no matter which way you look at or name things, something somewhere has to lex and parse the startxref, trailer(s), xref(s), incremental updates, etc. before you can know anything about potential decryption (which itself requires lexing/parsing of certain key data structures), which in turn comes before you eventually lex/parse the majority of main-body objects in a PDF. The PDF spec is also not prescriptive about things such as lazy resolution, or when and in what order you lex, parse, or validate certain syntactic constructs - arguably this is one reason "shadow attacks" are successful.
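As a minimal sketch of the bootstrap step described above (not code from this repository): before any body object can be lexed, a reader has to scan backwards from EOF for the `startxref` keyword to find the cross-reference offset. The function name and tail-window size here are assumptions for illustration only.

```python
def find_startxref_offset(pdf_bytes: bytes, tail_window: int = 1024) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    tail = pdf_bytes[-tail_window:]
    idx = tail.rfind(b"startxref")
    if idx < 0:
        raise ValueError("no startxref keyword in file tail")
    # The offset is the first integer token after the keyword.
    after = tail[idx + len(b"startxref"):].split()
    return int(after[0])

sample = b"...%body...\nstartxref\n1234\n%%EOF\n"
print(find_startxref_offset(sample))  # 1234
```

Only after this offset is known can the xref table(s) and trailer dictionaries be parsed, and only then can decryption keys (if any) be derived.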
I've no problem accepting your JSON contribution (which I think is really useful as a human-understandable version) so long as such caveats and assumptions are clearly documented, since not all lexers/parsers use the same assumptions - hence my asking about the originating source.
Maybe the simpler solution is just to put this documentation into the README.md?
Thank you. I agree. I will update the README.
> And "Null" should really be "null"
To be consistent with the casing of null, I have pushed a new version that formats JSON properties and PDF object types in lower case.
"Operands": [
{
"Type": "Real",
"Value": 0.0
This should be ".0" not "0.0"
Can you help me understand why you want to distinguish between .0 and 0.0 after lexical analysis?
Mainly because not all lexers and tokenizers are equal.
And there are nasty things like signed zeros that can and do occur (not on this specific line, but see my other comments elsewhere).
And parsing JSON is also parser-dependent (see https://seriot.ch/projects/parsing_json.html and https://labs.bishopfox.com/tech-blog/an-exploration-of-json-interoperability-vulnerabilities).
So if the value of "Value" were a string rather than a post-processed number, it would avoid all such assumptions while still keeping the nicely decomposed structure from the PDF as a more friendly JSON (thank you!). (And, yes, slightly more work for each user.)
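To illustrate the interoperability point above: `.0` is valid PDF real syntax but not valid JSON, so PDF reals could not simply be copied verbatim into the reference file as JSON numbers. A quick check with Python's standard parser:

```python
import json

# A JSON number with an explicit leading digit parses fine.
ok = json.loads('{"Value": 0.0}')
print(ok)

# A leading-dot real, legal in PDF syntax, is rejected as JSON.
try:
    json.loads('{"Value": .0}')
    rejected = False
except json.JSONDecodeError:
    rejected = True
print("'.0' rejected:", rejected)  # True
```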
Exactly because not all lexers and tokenisers are equal (both in terms of implementation details and technology) a neutral reference is helpful. By representing a real or integer as a string, verification would still rely on the parsing of the particular lexer. In my view 0.0 is the most neutral representation of all zero real numbers such as -.0, -0.0, +.0, +0.00000, -.00000, etc.
https://json-schema.org/understanding-json-schema/reference/numeric.html explicitly discourages representing a number as a string.
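A small sketch of the neutrality claim above: every PDF spelling of a zero real compares equal to 0.0 once lexed into a number, although (per the earlier signed-zero point) IEEE 754 still carries the sign bit of a negative zero.

```python
import math

variants = ["-.0", "-0.0", "+.0", "+0.00000", "-.00000", "0.0"]

# All spellings collapse to the same numeric value.
all_equal_zero = all(float(v) == 0.0 for v in variants)
print(all_equal_zero)                    # True

# But the sign bit of a negative zero is still observable.
print(math.copysign(1.0, float("-.0")))  # -1.0
```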
The file CompactedPDFSyntaxTest.pdf.json provides a JSON representation of the PDF test file and can be used to assert the correctness of the PDF lexical analyzer.
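A hypothetical sketch of that verification step; the inline JSON and token list below are stand-ins for the contents of CompactedPDFSyntaxTest.pdf.json and for the output of a lexer under test (neither is taken from the actual file).

```python
import json

# Stand-in for the reference file's expected token stream.
reference_json = '[{"type": "real", "value": 0.0}, {"type": "null"}]'
expected = json.loads(reference_json)

# Stand-in for the token list a lexer under test would produce.
actual = [{"type": "real", "value": 0.0}, {"type": "null"}]

matches = (actual == expected)
print(matches)  # True
```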