Compressing XMP Metadata streams. #491
Comments
@faceless2 I suspect that the utility/benefit of having a single uncompressed document-level Metadata stream that can be easily scraped out of a PDF file might be degraded if all object-level metadata streams were uncompressed. |
Yes, uncompressed object level metadata is a pain. |
Oh, I agree - the only way to identify which metadata applies to what is to parse the file properly. But as a first approximation of the content of the PDF (i.e. not just the PDF itself, but any images etc. it contains), I think scanning it is probably still useful. I also should have noted that this sort of quick scan is the reason we are required to add the "W5M0MpCehiHzreSzNTczkc9d" xpacket around the XMP. There's no reason for this string to exist unless you're byte-scanning the file. |
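For illustration, the kind of byte scan described above could be sketched roughly as follows. This is only a minimal sketch, not any particular tool's implementation: the function name `scan_xmp_packets`, the regular expression, and the sample bytes are all assumptions made for the example, and it only works when the packet is stored uncompressed and unencrypted.

```python
import re

# The fixed id that appears in the xpacket header around XMP.
XPACKET_ID = b"W5M0MpCehiHzreSzNTczkc9d"

# Header processing instruction, then anything, then the trailer.
# re.DOTALL lets "." match newlines inside the packet body.
_PACKET_RE = re.compile(
    rb"<\?xpacket begin=[^>]*?id=[\"']" + XPACKET_ID + rb"[\"'][^>]*?\?>"
    rb".*?"
    rb"<\?xpacket end=[\"'][rw][\"']\?>",
    re.DOTALL,
)


def scan_xmp_packets(data: bytes) -> list[bytes]:
    """Return every xpacket-wrapped XMP packet found by a raw byte scan.

    No PDF structure is interpreted, so the scan cannot tell which
    object a packet belongs to.
    """
    return [m.group(0) for m in _PACKET_RE.finditer(data)]


# Hypothetical sample: one uncompressed Metadata stream in a PDF-like file.
sample_packet = (
    b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>\n'
    b'<?xpacket end="w"?>'
)
sample_file = (
    b"%PDF-1.7\n1 0 obj\n<< /Type /Metadata /Subtype /XML >>\nstream\n"
    + sample_packet
    + b"\nendstream\nendobj\n%%EOF\n"
)
found = scan_xmp_packets(sample_file)
```

The scan returns every matching packet in file order, which is exactly why it cannot distinguish document-level from object-level metadata.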
And what about incremental updates that rewrite the DocCatalog Metadata stream? Adobe XMP Spec, part 3 states: "PDF files that have been incrementally saved can have multiple packets that all look like the “main” XMP metadata. During an incremental save, new data (including XMP packets) is written to the end of the file without [...]" AFAICT no PDFa publication clarifies this (either "PDF 2.0 Application Note 003: Use of object metadata streams" or "Technical Note 0003: Metadata in PDF/A-1"). If we were to give guidance, I think updating "PDF 2.0 Application Note 003: Use of object metadata streams" would be more appropriate than ISO 32000. |
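To make the incremental-update concern concrete, here is a small sketch that simulates a file with an appended update and counts the packet markers a byte scanner would see. The helper names (`packet_id_offsets`, `make_stream`) and the sample data are hypothetical, and the "file" is only PDF-like bytes, not a valid PDF.

```python
XPACKET_ID = b"W5M0MpCehiHzreSzNTczkc9d"


def packet_id_offsets(data: bytes) -> list[int]:
    """Return the byte offset of every occurrence of the xpacket id."""
    offsets = []
    pos = data.find(XPACKET_ID)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(XPACKET_ID, pos + 1)
    return offsets


def make_stream(title: bytes) -> bytes:
    """Build a hypothetical uncompressed Metadata stream body."""
    return (
        b"<< /Type /Metadata /Subtype /XML >>\nstream\n"
        b'<?xpacket begin="\xef\xbb\xbf" id="' + XPACKET_ID + b'"?>\n'
        b"<title>" + title + b"</title>\n"
        b'<?xpacket end="w"?>\nendstream\n'
    )


# Original save, then an incremental update appended after the first %%EOF.
original = b"%PDF-1.7\n" + make_stream(b"Draft") + b"%%EOF\n"
updated = original + make_stream(b"Final") + b"%%EOF\n"

offsets = packet_id_offsets(updated)
```

Both packets look like the "main" metadata to a byte scanner; only the file offsets differ, so any choice between first and last is a heuristic rather than something the scan itself can justify.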
Well, if we were to give guidance I think the above comments show that the guidance is mostly for the Catalog Metadata, as that's the one we care about. An application note on Object metadata is not the first place I'd look for this. |
Proposal for a note after Table 348 modifying the input from Mike to make the distinction between "main" document-level and other Metadata streams and to avoid a normative "should": Note 3. It is best practice that the document-level Metadata stream is stored in the PDF uncompressed and does not specify a /Filter (except /Crypt if appropriate). The reason is that such Metadata can be quickly found and extracted from a PDF file by a parser without knowledge of the PDF format. Prerequisite to this is that all other Metadata streams, e.g. on object-level, are compressed. Up to the native speakers to improve wording... |
Suggested "native speaker" rewrite... Note 3. Common best practice is to store the document-level Metadata stream uncompressed in the PDF without a /Filter (except /Crypt if appropriate). This practice allows parsers to quickly find and extract the metadata from a PDF file without any knowledge of the PDF format. This approach implies that object-level Metadata streams in PDF files are always compressed. |
Consider "(unless encrypted)" in place of "(except /Crypt if appropriate)". |
Looking at it from the other direction, what are these (mythical?) XMP-scanning programs? Are they widely used? Do they pick out the first stream or the last, in the case of incrementally-updated ones? Is this a use-case which has long been intended but isn't used? Or is there a very widely-used XMP-scanner whose behaviour we should take into account? |
That's different. "without a /Filter (except /Crypt if appropriate)" would only allow the Crypt filter and refers to the case of the Identity filter. "without a /Filter (unless encrypted)" would allow arbitrary filters in encrypted contexts. |
Rewrite looks good Duff - "Common best practice for document-level Metadata" sounds good. But I think that last sentence needs to go because clearly uncompressed object-level Metadata streams are very common, so we can't say they're "always compressed". I'd be OK with not saying anything about object-level Metadata, or we can decide here and now on what best practice is for them. I suspect not saying anything is going to be an easier option. @johnwhitington that's a fair question, and while I don't have an answer this approach was described in the very first XMP specification way back in 2004:
|
If you are going to encrypt the metadata, the case for not compressing that stream basically falls apart, so why not compress before encrypting? |
We should not call software that "scans" for XMP like this "parsers", especially if these words are in ISO 32000! They are not "parsing". And there are definitely use-cases, workflows, and many extant PDFs where Metadata is encrypted (including with proprietary encryption methods) so we must not bias against those - thus I'm strongly against "common best practice" as that is an unproven assumption. "Some uses of PDF may..." sure. |
As mentioned in my message, the Crypt exception refers to the Identity filter which is one way to mark the stream as non-encrypted in an otherwise encrypted file. |
There's a long-standing practice in PDF that XMP Metadata streams should not be compressed, but there is no note to this effect. So this issue raises two questions:
The "XMP Specification Part 3 (2020)" is quiet on this, and defers to ISO32000 for the details of embedding XMP in PDF. The nearest it gets is noting that:
ISO32000 is quiet on this too. Although we do have the EncryptMetadata key which prevents Metadata being encrypted, nowhere is it explained why you might want to do this (the reason is that it's still readable even if you don't have the password, but that's not going to happen if it's compressed).
Personally I think this practice is still useful - although PDF file parsers are everywhere, I imagine the archiving community amongst others would appreciate the value of a quick scan to extract XMP without having to parse the file. So I'd propose a note, something like this as a third note after table 348 on pp715/716.
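As a rough illustration of why compression defeats the quick scan, the sketch below builds the same hypothetical packet twice, once as a plain stream and once deflated with zlib (standing in for a /FlateDecode filter), and checks whether the xpacket id is still visible in the raw bytes. The packet content and variable names are assumptions made for the example.

```python
import zlib

XPACKET_ID = b"W5M0MpCehiHzreSzNTczkc9d"

# Hypothetical XMP packet with some repetitive content so it deflates well.
packet = (
    b'<?xpacket begin="\xef\xbb\xbf" id="' + XPACKET_ID + b'"?>\n'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">\n'
    + b"<dc:subject>archive</dc:subject>\n" * 8
    + b"</x:xmpmeta>\n"
    b'<?xpacket end="w"?>'
)

# Stream stored as-is: the packet bytes appear verbatim in the file.
plain_stream = (
    b"<< /Type /Metadata /Subtype /XML /Length %d >>\nstream\n%s\nendstream"
    % (len(packet), packet)
)

# Stream stored deflated: only the compressed bytes appear in the file.
deflated = zlib.compress(packet)
flate_stream = (
    b"<< /Type /Metadata /Subtype /XML /Filter /FlateDecode /Length %d >>"
    b"\nstream\n%s\nendstream" % (len(deflated), deflated)
)

visible_in_plain = XPACKET_ID in plain_stream
visible_in_flate = XPACKET_ID in flate_stream
```

The deflated stream still round-trips through `zlib.decompress`, but recovering it requires locating and decoding the stream, which is the file parsing the quick scan was meant to avoid.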