Compressing XMP Metadata streams. #491

Open
faceless2 opened this issue Nov 12, 2024 · 14 comments
Labels
documentation Improvements or additions to documentation

Comments

@faceless2

faceless2 commented Nov 12, 2024

There's a long-standing practice in PDF that XMP Metadata streams should not be compressed, but there is no note to this effect. So this issue raises two questions:

  1. Is it still considered best practice to not compress XMP Metadata Streams?
  2. If so, should we have a note explaining this as a "should" in ISO32K?

The "XMP Specification Part 3 (2020)" is quiet on this, and defers to ISO32000 for the details of embedding XMP in PDF. The nearest it gets is noting that:

Although Distiller 5 will attach the XMP to the associated object in the PDF file, the XMP stream in the PDF will be Flate-compressed. This makes the object XMP packet in the PDF invisible to external packet scanners. The XMP will be visible to software processing the PDF format and decompressing the stream. Distiller 6 and later do not compress the XMP packet stream.
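The effect described in that passage can be reproduced in a few lines. This is only a sketch: the packet below is a made-up, minimal example rather than real Distiller output, and `zlib` stands in for whatever Flate encoder a PDF producer would use.

```python
import zlib

# The fixed xpacket ID that byte scanners look for.
XPACKET_ID = b"W5M0MpCehiHzreSzNTczkc9d"

# Hypothetical, minimal XMP packet used only for this illustration.
packet = (
    b'<?xpacket begin="\xef\xbb\xbf" id="' + XPACKET_ID + b'"?>\n'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">\n'
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>\n'
    b'</x:xmpmeta>\n'
    b'<?xpacket end="w"?>'
)

# zlib.compress produces the zlib/deflate data that /FlateDecode reads.
compressed = zlib.compress(packet)

# A scanner searching the raw file bytes sees the marker only when the
# stream is stored uncompressed.
marker_visible_raw = XPACKET_ID in packet
marker_visible_compressed = XPACKET_ID in compressed

print(marker_visible_raw, marker_visible_compressed)
```

Software that understands PDF can still call the equivalent of `zlib.decompress` on the stream and recover the packet unchanged; only the format-unaware scanner loses sight of it.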

ISO32000 is quiet on this too. Although we do have the EncryptMetadata key, which prevents Metadata from being encrypted, nowhere is it explained why you might want to do this (the reason is that the metadata is still readable even if you don't have the password, but that's not going to happen if it's compressed).

Personally I think this practice is still useful - although PDF file parsers are everywhere, I imagine the archiving community amongst others would appreciate the value of a quick scan to extract XMP without having to parse the file. So I'd propose a note, something like this as a third note after table 348 on pp715/716.

Note 3. Metadata streams should be stored in the PDF uncompressed, and should not specify a /Filter (except /Crypt where appropriate). Uncompressed Metadata streams can be quickly found and extracted from a PDF file by a parser without any knowledge of the PDF format.

@faceless2 added the bug and documentation labels and removed the bug label, Nov 12, 2024
@datalogics-pgallot

@faceless2 I suspect that the utility/benefit of having a single uncompressed document-level Metadata stream that can be easily scraped out of a PDF file might be degraded if all object-level metadata streams were uncompressed.

@DietrichSeggern

DietrichSeggern commented Nov 12, 2024

Yes, uncompressed object level metadata is a pain.
a) it unnecessarily increases file size (sometimes dramatically)
b) it diminishes the advantage of uncompressed document level metadata because that cannot easily be identified anymore

@faceless2
Author

Oh, I agree - the only way to identify which metadata applies to what is to parse the file properly. But as a first approximation of the content of the PDF (i.e. not just the PDF itself, but any images etc. it contains), I think scanning it is probably still useful.

I also should have noted that this sort of quick scan is the reason we are required to add the "W5M0MpCehiHzreSzNTczkc9d" xpacket around the XMP. There's no reason for this string to exist unless you're byte-scanning the file.
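For illustration, here is a minimal sketch of what such a byte scan might look like. It assumes 8-bit (UTF-8) packets stored uncompressed and unencrypted, and the function name `find_xmp_packets` is hypothetical; it is not taken from any existing scanner.

```python
XPACKET_ID = b"W5M0MpCehiHzreSzNTczkc9d"
HEADER_START = b"<?xpacket begin="
TRAILER_START = b"<?xpacket end="


def find_xmp_packets(data: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, packet) pairs for every xpacket found in raw bytes.

    No knowledge of the containing file format is used, so the scan cannot
    tell which object a packet belongs to or whether it is obsolete.
    """
    packets = []
    pos = 0
    while True:
        id_at = data.find(XPACKET_ID, pos)
        if id_at < 0:
            break
        # Walk back from the ID to the start of its header ...
        start = data.rfind(HEADER_START, pos, id_at)
        # ... and forward to the trailer processing instruction.
        trailer = data.find(TRAILER_START, id_at)
        if start < 0 or trailer < 0:
            pos = id_at + len(XPACKET_ID)
            continue
        close = data.find(b"?>", trailer)
        if close < 0:
            break
        packets.append((start, data[start:close + 2]))
        pos = close + 2
    return packets
```

A caller would read the whole file with `open(path, "rb").read()` and pass the bytes in. Because a file can hold several packets (for example one per embedded image), the sketch returns all of them and leaves the choice of which one matters to the caller.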

@petervwyatt
Member

And what about incremental updates that rewrite the DocCatalog Metadata stream?

Adobe XMP Spec, part 3 states: "PDF files that have been incrementally saved can have multiple packets that all look like the “main” XMP metadata. During an incremental save, new data (including XMP packets) is written to the end of the file without removing the old. Top-level PDF dictionaries are also rewritten, so an application that understands PDF can check the dictionary to find only the new packet."

AFAICT no PDFa publication clarifies this (either "PDF 2.0 Application Note 003: Use of object metadata streams" or "Technical Note 0003: Metadata in PDF/A-1")

If we were to give guidance, I think updating "PDF 2.0 Application Note 003: Use of object metadata streams" would be more appropriate than ISO 32000.

@faceless2
Author

updating "PDF 2.0 Application Note 003: Use of object metadata streams" would be more appropriate than ISO 32000.

Well, if we were to give guidance I think the above comments show that the guidance is mostly for the Catalog Metadata, as that's the one we care about. An application note on Object metadata is not the first place I'd look for this.

@DietrichSeggern

Proposal for a note after Table 348 modifying the input from Mike to make the distinction between "main" document-level and other Metadata streams and to avoid a normative "should":

Note 3. It is best practice that the document-level Metadata stream is stored in the PDF uncompressed and does not specify a /Filter (except /Crypt if appropriate). The reason is that such Metadata can be quickly found and extracted from a PDF file by a parser without knowledge of the PDF format. Prerequisite to this is that all other Metadata streams, e.g. on object-level, are compressed.

Up to the native speakers to improve wording...

@DuffJohnson
Member

Suggested "native speaker" rewrite...

Note 3. Common best practice is to store the document-level Metadata stream uncompressed in the PDF without a /Filter (except /Crypt if appropriate). This practice allows parsers to quickly find and extract the metadata from a PDF file without any knowledge of the PDF format. This approach implies that object-level Metadata streams in PDF files are always compressed.

@datalogics-pgallot

Consider "(unless encrypted)" in place of "(except /Crypt if appropriate)".

@johnwhitington

Looking at it from the other direction, what are these (mythical?) XMP-scanning programs? Are they widely used? Do they pick out the first stream or the last, in the case of incrementally-updated ones? Is this a use-case which has long been intended but isn't used? Or is there a very widely-used XMP-scanner whose behaviour we should take into account?

@mkl-public

Consider "(unless encrypted)" in place of "(except /Crypt if appropriate)".

That's different.

"without a /Filter (except /Crypt if appropriate)" would only allow the Crypt filter and refers to the case of the Identity filter.

"without a /Filter (unless encrypted)" would allow arbitrary filters in encrypted contexts.

@faceless2
Author

faceless2 commented Nov 13, 2024

Rewrite looks good Duff - "Common best practice for document-level Metadata" sounds good.

But I think that last sentence needs to go because clearly uncompressed object-level Metadata streams are very common, so we can't say they're "always compressed".

I'd be OK with not saying anything about object-level Metadata, or we can decide here and now on what best practice is for them. I suspect not saying anything is going to be the easier option.

@johnwhitington that's a fair question, and while I don't have an answer this approach was described in the very first XMP specification way back in 2004:

Scanning Files for XMP Packets

This section explains how files can be scanned for XMP Packets, and why this should be done with caution.

Caveats

Knowledge of individual file formats provides the best way for an application to get access to XMP Packets. See Chapter 5, “Embedding XMP Metadata in Application Files” for detailed information on how XMP data is stored in specific file formats.

Lacking this information, applications can find XMP Packets by scanning the file. However, this should be considered a last resort, especially if it is necessary to modify the data. Without knowledge of the file format, simply locating packets may not be sufficient. The following are some possible drawbacks:

  • It may not be possible to determine which resource the XMP is associated with. If a JPEG image with XMP is placed in a page layout file of an application that is unaware of XMP, that file has one XMP Packet that refers to just the image, not the entire layout.
  • When there is more than one XMP Packet in a file, it may be impossible to determine which is the “main” XMP, and what the overall resource containment hierarchy is in a compound document.
  • Some packets could be obsolete. For example, PDF files allow incremental saves. Therefore, when changes are made to the document, there might be multiple packets, only one of which reflects the current state of the file.

Scanning Hints

A file should be scanned byte-by-byte until a valid header is found. First, the scanner should look for a byte pattern that represents the text...

@datalogics-pgallot

@mkl-public

"without a /Filter (unless encrypted)" would allow arbitrary filters in encrypted contexts.

If you are going to encrypt the metadata, the case for not compressing that stream basically falls apart, so why not compress before encrypting?

@petervwyatt
Member

We should not call software that "scans" for XMP like this "parsers", especially if these words are in ISO 32000! They are not "parsing".

And there are definitely use-cases, workflows, and many extant PDFs where Metadata is encrypted (including with proprietary encryption methods) so we must not bias against those - thus I'm strongly against "common best practice" as that is an unproven assumption. "Some uses of PDF may..." sure.

@mkl-public

@datalogics-pgallot

"without a /Filter (unless encrypted)" would allow arbitrary filters in encrypted contexts.

If you are going to encrypt the metadata, the case for not compressing that stream basically falls apart, so why not compress before encrypting?

As mentioned in my message, the Crypt exception refers to the Identity filter which is one way to mark the stream as non-encrypted in an otherwise encrypted file.
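To make the Identity case concrete, here is a rough sketch of a Metadata stream object written with and without that filter. The helper name `metadata_stream_object` is hypothetical, and the key names (`/Crypt`, `/CryptFilterDecodeParms`, `/Name /Identity`) are given as I understand them from ISO 32000, so they should be checked against the specification before being relied on.

```python
def metadata_stream_object(obj_num: int, xmp: bytes,
                           identity_crypt: bool = False) -> bytes:
    """Serialise an uncompressed Metadata stream as a PDF indirect object.

    With identity_crypt=True the only filter is /Crypt naming the Identity
    crypt filter, which marks the stream as not encrypted while leaving the
    XMP bytes exactly as supplied.
    """
    entries = [
        b"/Type /Metadata",
        b"/Subtype /XML",
        b"/Length %d" % len(xmp),
    ]
    if identity_crypt:
        entries.append(b"/Filter /Crypt")
        entries.append(
            b"/DecodeParms << /Type /CryptFilterDecodeParms /Name /Identity >>"
        )
    header = b"%d 0 obj\n<< " % obj_num + b" ".join(entries) + b" >>\nstream\n"
    return header + xmp + b"\nendstream\nendobj\n"


# Placeholder payload; a real stream would carry a complete XMP packet.
xmp = b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>...<?xpacket end="w"?>'
plain = metadata_stream_object(7, xmp)
marked = metadata_stream_object(7, xmp, identity_crypt=True)
```

In both variants the packet bytes appear verbatim in the file, which is the property a byte scan depends on. A wording such as "unless encrypted" would additionally permit filters that change those bytes.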
