Form fields missing when rendering after flatten #140

Open
bavardage opened this issue Mar 6, 2024 · 15 comments

@bavardage

I may be misunderstanding how flatten() works, but doing something like this:

examples/form_flatten.rs

use image::ImageFormat;
use pdfium_render::prelude::*;

pub fn main() -> Result<(), PdfiumError> {
    // For general comments about pdfium-render and binding to Pdfium, see export.rs.

    let pdfium = Pdfium::default();

    let document = pdfium.load_pdf_from_file("test/form-test.pdf", None)?; // Load the sample file...

    document
        .metadata()
        .iter()
        .enumerate()
        .for_each(|(index, tag)| println!("{}: {:#?} = {}", index, tag.tag_type(), tag.value()));

    match document.form() {
        Some(form) => println!(
            "PDF contains an embedded form of type {:#?}",
            form.form_type()
        ),
        None => println!("PDF does not contain an embedded form"),
    };

    let dpi = 200.0;

    let render_config = PdfRenderConfig::new()
        .scale_page_by_factor(dpi as f32 / 72.0)
        .render_form_data(true) // Rendering of form data and annotations is the default...
        .render_annotations(true) // ... but for the sake of demonstration we are explicit here.
        .highlight_text_form_fields(PdfColor::YELLOW.with_alpha(128))
        .highlight_checkbox_form_fields(PdfColor::BLUE.with_alpha(128));

    let mut page = document.pages().first().expect("to get first page");
    page.flatten()?;

    page.render_with_config(&render_config)?
        .as_image()
        .as_rgba8()
        .ok_or(PdfiumError::ImageError)?
        .save_with_format("flattened-form-page.jpg", ImageFormat::Jpeg)
        .map_err(|_| PdfiumError::ImageError)?;

    Ok(())
}

and running as
cargo run --example form_flatten

end up with:
[image: flattened-form-page]

whereas I think I would have expected the form fields to be inlined and thus rendered (without any yellow background).

Something is happening, because no yellow highlighted fields are rendered!

@ajrcarey ajrcarey self-assigned this Mar 6, 2024
@ajrcarey
Owner

ajrcarey commented Mar 9, 2024

Hi @bavardage , thank you for reporting the issue, and for your example code that demonstrates the problem. I can reproduce the problem.

The Pdfium documentation on FPDFPage_Flatten() is minimal, but there is some commentary in discussions (e.g. https://bugs.chromium.org/p/pdfium/issues/detail?id=2055) that suggests flattening in Pdfium has some limitations. At first I thought the fact that this sample document contains an XFA form might be a deal-breaker.

But it turns out all you need to do is drop and reload the page handle in order to get the flattened page:

    ...
    let mut page = document.pages().first().expect("to get first page");
    page.flatten()?;

    // Close and reopen the page
    let page = document.pages().first().expect("to get first page");
    ...

Now the output will include the filled-in form fields.

This work-around should help you in the short term. It would certainly be nice if pdfium-render did this automatically for you, though. I will leave this issue open to track making those changes.

@ajrcarey
Owner

ajrcarey commented Mar 9, 2024

Added page drop/auto-reload to PdfPage::flatten(). Updated README. Fix ready to release as part of crate version 0.8.19.

In the meantime, you can take pdfium-render as a git dependency in your Cargo.toml to access the updated behaviour.

@bavardage
Author

This is great - thanks so much!

@chriskyndrid

chriskyndrid commented Jul 4, 2024

I realize this is a closed issue, however, I'm having the same problem. I have a multi-page PDF that maps HashMap values into the form:

pub fn fill_form(
        &self,
        document: &mut PdfDocument,
        data: HashMap<usize, HashMap<String, String>>,
    ) -> SystemSyncResult<()> {
        let pages = document.pages();

        for (page_index, page) in pages.iter().enumerate() {
            if let Some(page_data) = data.get(&page_index) {
                let annotations = page.annotations();

                for (annotation_index, mut annotation) in annotations.iter().enumerate() {
                    if let Some(mut field) = annotation.as_form_field_mut() {
                        if let Some(field_name) = field.name() {
                            if let Some(value) = page_data.get(&field_name) {
                                if let Some(mut text_field) = field.as_text_field_mut() {
                                    log_message!(
                                        SysLogLevel::Debug,
                                        "Page {}, text field {}: {:?} currently has value: {}",
                                        page_index,
                                        annotation_index,
                                        field_name,
                                        format!("{:?}", text_field.value()),
                                    );

                                    text_field.set_value(value)?;

                                    log_message!(
                                        SysLogLevel::Debug,
                                        "Page {}, text field {}: {:?} now has updated value: {}",
                                        page_index,
                                        annotation_index,
                                        field_name,
                                        format!("{:?}", text_field.value()),
                                    );
                                }
                            }
                        }
                    }
                }
            }
        }

        Ok(())
    }
    

I call flatten:

    pub fn flatten(&self, document: &mut PdfDocument) -> SystemSyncResult<()> {
        let pages = document.pages();

        for mut page in pages.iter() {
            page.flatten()?;
        }

        Ok(())
    }

And the PDF is blank (devoid of any content) after the flatten operation.

Commenting out the flatten operation:

pub fn safe_fill_form(
        &self,
        bytes: Vec<u8>,
        data: HashMap<usize, HashMap<String, String>>,
        copies: Option<usize>,
    ) -> SystemSyncResult<Vec<u8>> {
        match std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
            let mut document = self.load_pdf(bytes, copies)?;
            self.fill_form(&mut document, data)?;
            // self.flatten(&mut document)?;
            self.save_to_bytes(&document)
        })) {
            Ok(result) => result,
            Err(_) => Err(SystemError::new(
                file!(),
                line!(),
                SystemErrorCode::GenericErrorCode,
                SystemErrorType::GenericErrorType,
                "PDF Production panicked with error",
            )
            .into()),
        }
    }

This results in a filled PDF, but the form fields are still fillable (which is what I want to avoid). In other projects I have used iText extensively to produce millions of PDFs. The form fill feature, in my opinion, is one of the best features available in this library for business use cases. Allowing designers to design the layout and then merging data in is an excellent use case. Thanks for your efforts on this.

I'm using a PDF 1.3 variant form I created in Illustrator.

I'm using the latest 0.8.22 crate:

pub fn flatten(&mut self) -> Result<(), PdfiumError> {
        // TODO: AJRC - 28/5/22 - consider allowing the caller to set the FLAT_NORMALDISPLAY or FLAT_PRINT flag.
        let flag = FLAT_PRINT;

        match self
            .bindings()
            .FPDFPage_Flatten(self.page_handle, flag as c_int) as u32
        {
            FLATTEN_SUCCESS => {
                self.regenerate_content()?;

                // As noted at https://bugs.chromium.org/p/pdfium/issues/detail?id=2055,
                // FPDFPage_Flatten() updates the underlying dictionaries and content streams for
                // the page, but does not update the FPDF_Page structure. We must reload the
                // page for the effects of the flatten operation to be visible. For more information, see:
                // https://github.com/ajrcarey/pdfium-render/issues/140

                self.reload_in_place();
                Ok(())
            }
            FLATTEN_NOTHINGTODO => Ok(()),
            FLATTEN_FAIL => Err(PdfiumError::PageFlattenFailure),
            _ => Err(PdfiumError::PageFlattenFailure),
        }
    }

Let me know if I'm doing something wrong here. Thanks again for your time.

@ajrcarey ajrcarey reopened this Jul 4, 2024
@ajrcarey
Owner

ajrcarey commented Jul 4, 2024

Hi @chriskyndrid , thank you for adding to the issue. My first thought is that it may be something specific to your document. Are you able to share a page that exhibits the problem, along with a sample call to your fill_form() function with hash map values that trigger the problem?

@ajrcarey
Owner

ajrcarey commented Jul 4, 2024

Ok, I think I can reproduce the problem using the test/form-test.pdf sample file. I adjusted your code sample slightly to make the following test case:

use pdfium_render::prelude::*;
use std::collections::HashMap;

pub fn main() -> Result<(), PdfiumError> {
    let pdfium = Pdfium::new(Pdfium::bind_to_library(
        Pdfium::pdfium_platform_library_name_at_path("../pdfium/"),
    )?);

    let mut document = pdfium.load_pdf_from_file("../pdfium/test/form-test.pdf", None)?; // Load the sample file...

    let mut x = HashMap::new();

    x.insert(
        "form1[0].#subform[0].Pt1Line1a_FamilyName[0]".to_string(),
        "foo".to_string(),
    );

    let mut y = HashMap::new();

    y.insert(0, x);

    fill_form(&mut document, y)?;
    flatten(&mut document)?;

    document.save_to_file("./output.pdf")?;

    Ok(())
}

pub fn fill_form(
    document: &mut PdfDocument,
    data: HashMap<usize, HashMap<String, String>>,
) -> Result<(), PdfiumError> {
    let pages = document.pages();

    for (page_index, page) in pages.iter().enumerate() {
        if let Some(page_data) = data.get(&page_index) {
            let annotations = page.annotations();

            for (annotation_index, mut annotation) in annotations.iter().enumerate() {
                if let Some(field) = annotation.as_form_field_mut() {
                    if let Some(field_name) = field.name() {
                        if let Some(value) = page_data.get(&field_name) {
                            if let Some(text_field) = field.as_text_field_mut() {
                                println!(
                                    "Page {}, text field {}: {:?} currently has value: {}",
                                    page_index,
                                    annotation_index,
                                    field_name,
                                    format!("{:?}", text_field.value()),
                                );

                                println!("Applying value: {}", value);

                                text_field.set_value(value)?;

                                println!(
                                    "Page {}, text field {}: {:?} now has updated value: {}",
                                    page_index,
                                    annotation_index,
                                    field_name,
                                    format!("{:?}", text_field.value()),
                                );
                            }
                        }
                    }
                }
            }
        }
    }

    Ok(())
}

pub fn flatten(document: &mut PdfDocument) -> Result<(), PdfiumError> {
    let pages = document.pages();

    for mut page in pages.iter() {
        page.flatten()?;
    }

    Ok(())
}

Running the sample produces the following debugging output:

Page 0, text field 0: "form1[0].#subform[0].Pt1Line1a_FamilyName[0]" currently has value: None
Applying value: foo
Page 0, text field 0: "form1[0].#subform[0].Pt1Line1a_FamilyName[0]" now has updated value: Some("foo")

When the call to flatten() is left in, the resulting output.pdf document does not contain the updated form field value. When the call to flatten() is commented out, the resulting output.pdf document does contain the updated form field value, but the form field is still editable. I believe this matches your problem description.

@ajrcarey
Copy link
Owner

ajrcarey commented Jul 4, 2024

My first thought was that dropping and reloading the page handle (which is what happens inside PdfPage::flatten()) might be the source of the problem, but no, the same behaviour happens if you call FPDFPage_Flatten() directly:

pub fn flatten(document: &mut PdfDocument) -> Result<(), PdfiumError> {
    let pages = document.pages();

    for mut page in pages.iter() {
        page.bindings()
            .FPDFPage_Flatten(page.bindings().get_handle_from_page(&page), 1);
        // page.flatten()?;
    }

    Ok(())
}

If the problem is in Pdfium itself, it might be hard to mitigate from pdfium-render. I'll continue looking at this tomorrow.

@chriskyndrid

@ajrcarey

Thanks for looking into this. I think it is intrinsic to Pdfium. Regrettably, I also ran into other issues getting various readers to process and display the filled form data, similar to what @liammcdermott noted elsewhere. In addition, even with the non-flattened form fields, every time I sent the document to print in Chrome or Firefox, the data in the form fields would not render. I've observed this behavior in Chrome before when filling forms manually, too.

This morning I wrote a solution for the form filling and mapping using the rust native lopdf crate. It’s working well, with some gotchas I had to address with non-ascii chars requiring conversion to UTF-16.
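For readers hitting the same non-ASCII gotcha: PDF text strings that cannot be expressed in PDFDocEncoding can be written as UTF-16BE, prefixed with the byte order mark `FE FF`. The following is a minimal sketch of such a conversion using only the standard library. The function name `to_utf16be_with_bom` is hypothetical, and whether the lopdf-based solution described here prepends the byte order mark in the same way is an assumption.

```rust
/// Encode a Rust string as UTF-16BE bytes with a leading byte order mark
/// (0xFE 0xFF), suitable for use as the raw bytes of a PDF text string.
/// Hypothetical helper for illustration; it is not part of lopdf or pdfium-render.
fn to_utf16be_with_bom(text: &str) -> Vec<u8> {
    // Two bytes for the byte order mark, then at least two bytes per character.
    let mut bytes = Vec::with_capacity(2 + text.len() * 2);
    bytes.extend_from_slice(&[0xFE, 0xFF]);

    // encode_utf16() yields surrogate pairs for characters outside the
    // Basic Multilingual Plane, so those are handled without extra work.
    for unit in text.encode_utf16() {
        bytes.extend_from_slice(&unit.to_be_bytes());
    }

    bytes
}
```

The resulting bytes could then be stored as the field's value string by whichever PDF library is in use.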

I believe I can address the flatten requirement as well in lopdf and will work on that solution this weekend.

I did not experience any issues with printing or rendering post fill in lopdf. It was more work, but I think may be the solution for now…and I can run parallel generation of pdf docs.

In any case thanks again for your efforts on this and getting back to me so quickly.

@ajrcarey
Owner

ajrcarey commented Jul 5, 2024

Yes, I agree with you that the problem is upstream; there's an open bug https://issues.chromium.org/issues/40584154 which looks related, and it's been open for five years :/

See also: https://issues.chromium.org/issues/42271909

Pdfium's flatten operation is at https://pdfium.googlesource.com/pdfium/+/main/fpdfsdk/fpdf_flatten.cpp#292. I would like to build this locally with some debugging output to see if I can figure out what's happening. The problem happens with both XFA and non-XFA (AcroForm) forms.

It looks like pdftk can also fill forms and flatten fairly trivially: https://superuser.com/a/1291383

@chriskyndrid

@ajrcarey

Thanks again for all your effort and investigation on this. Although I have some concerns about lopdf and its long-term maintenance given the activity level on the crate, I like the idea of a Rust-native solution.

Once I get flattening working in lopdf, I’ll post the Struct that does all the work here in case somebody else has this need until pdfium is ready.

Thanks again!

@chriskyndrid

chriskyndrid commented Jul 7, 2024

I ended up implementing a solution with lopdf. I can now successfully fill forms in an editable manner or map straight to a flattened document. My implementation is about 1200 lines across 10 files or so, so I may end up creating a crate for this, since there isn't really anything readily available in Rust that is easy to work with for form manipulation/flattening. I forgot how much of a pain working with PDFs is.

My code runs the form fill/flatten operations in parallel across templates and uses cosmic_text for sizing text into the bounding boxes of a form. It automatically reduces the font size down to a threshold before truncating the text, so that the text fits into the desired region of the form bounding box on the page. It's all fairly alpha right now, but on my development machine, mapping 100 fields across 4 pages (two 2-page templates) takes about 15ms in a release build with lopdf, including Actix API overhead, Postgres data retrieval, and all the PDF operations (flatten, compress, cleanup, and merging). I still have some issues to work out with character encoding in flattened text, but otherwise it's working pretty well.
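The shrink-then-truncate idea described above can be sketched roughly as follows. This is not the cosmic_text-based implementation; it is a simplified single-line illustration that assumes a fixed-width font such as Courier, whose glyphs are all 0.6 em wide, so text width can be estimated without a shaping library. The names `fit_single_line` and `text_width`, the 0.5pt step, and the parameter layout are all hypothetical.

```rust
/// Courier is monospaced: every glyph advances 600/1000 of the font size.
const COURIER_ADVANCE_EM: f32 = 0.6;

/// Estimated width in points of `text` set in Courier at `font_size`.
fn text_width(text: &str, font_size: f32) -> f32 {
    text.chars().count() as f32 * COURIER_ADVANCE_EM * font_size
}

/// Fit one line of text into `box_width` points.
///
/// Starting from `start_size`, the font size is reduced in 0.5pt steps until
/// the text fits or `min_size` is reached. If the text is still too wide at
/// `min_size`, it is truncated to the number of characters that do fit.
/// Returns the chosen font size and the (possibly truncated) text.
fn fit_single_line(text: &str, box_width: f32, start_size: f32, min_size: f32) -> (f32, String) {
    let mut size = start_size;

    while size > min_size && text_width(text, size) > box_width {
        size = (size - 0.5).max(min_size);
    }

    if text_width(text, size) <= box_width {
        return (size, text.to_string());
    }

    // Still too wide at the minimum size: keep only the characters that fit.
    let max_chars = (box_width / (COURIER_ADVANCE_EM * size)).floor() as usize;
    (size, text.chars().take(max_chars).collect())
}
```

A proportional font or multi-line wrapping would need real glyph metrics, which is presumably where a text layout crate comes in.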

I think you can close this issue.

@ajrcarey
Owner

Thanks for the update, @chriskyndrid . Good to hear you found a solution. If you do end up making it public, I'd be keen to take a look at it because I'm kind of thinking a similar approach needs to happen here, where the best way to achieve a consistent flatten result in pdfium-render is to actually handle the flatten operation in pdfium-render rather than relying on Pdfium itself.

For that reason, I'll leave the issue open for now, although it may not make much progress in the short term.

I'm curious as to whether you had to regenerate appearance streams for your flattened elements at any point?

@chriskyndrid

chriskyndrid commented Jul 15, 2024

@ajrcarey what I have right now is fairly alpha and somewhat kludgy/hacky, but is working for my use case, which is currently only dealing with text fields. I did not have to regenerate the appearance streams to get things to work. This is an example that is currently specific to my codebase:

For flatten operations I use the coordinates and handle all the text placement, sizing, line wrapping, etc manually. I'm specifically suggesting a font size, but my text sizing/placement routines will determine fitment down to some lower threshold (6pt in my case).

use crate::log_message;
use crate::pdf::lopdf::cleanup::Cleanup;
use crate::pdf::lopdf::font::{Font, PDFFont};
use crate::pdf::lopdf::pdf::PDF;
use crate::pdf::lopdf::placement::Placement;
use crate::pdf::lopdf::utility::Utility;
use crate::prelude::*;
use erp_contrib::lopdf::content::{Content, Operation};
use erp_contrib::lopdf::{dictionary, Document, Object, ObjectId, Stream, StringFormat};
use std::collections::HashMap;

pub trait FormFlatten: Utility + Placement + Cleanup + Font {
    fn fill_field_updates(
        document: &Document,
        fields: &Vec<Object>,
        parent_name: Option<&str>,
        parent_value: Option<&String>,
        page_data: &HashMap<String, String>,
        updates: &mut Vec<(ObjectId, ObjectId, (f32, f32, f32, f32), String)>,
    ) -> SystemSyncResult<()> {
        for field_ref in fields {
            let field_id = field_ref.as_reference()?;
            let field = document.get_object(field_id)?.as_dict()?;
            let name_string = if let Ok(name_obj) = field.get(b"T") {
                if let Ok(name) = name_obj.as_str() {
                    let mut name_string =
                        Self::clean_string(&String::from_utf8_lossy(name).to_string());
                    if let Some(parent) = parent_name {
                        name_string = format!("{}.{}", parent, name_string);
                    }
                    Some(name_string)
                } else {
                    None
                }
            } else {
                None
            };

            let value = if let Some(name_string) = &name_string {
                log_message!(
                    SysLogLevel::Debug,
                    "Value from page_data key: {:?}",
                    name_string
                );
                if let Some(val) = page_data.get(name_string) {
                    val.to_string()
                } else {
                    "".to_string()
                }
            } else if let Some(parent_val) = parent_value {
                parent_val.clone()
            } else {
                "".to_string()
            };

            log_message!(SysLogLevel::Debug, "Field ID: {:?}", field_id);
            log_message!(SysLogLevel::Debug, "Field Name: {:?}", name_string);
            log_message!(SysLogLevel::Debug, "Value: {:?}", value);
            log_message!(SysLogLevel::Debug, "Field Dict: {:?}", field);

            if let (Ok(page_id), Ok(rect)) = (
                field.get(b"P").and_then(|p| p.as_reference()),
                field.get(b"Rect").and_then(|r| r.as_array()),
            ) {
                if rect.len() == 4 {
                    let x = rect[0].as_f32().unwrap_or(0.0);
                    let y = rect[1].as_f32().unwrap_or(0.0);
                    let width = rect[2].as_f32().unwrap_or(0.0) - x;
                    let height = rect[3].as_f32().unwrap_or(0.0) - y;

                    log_message!(
                        SysLogLevel::Debug,
                        "Page ID: {:?}, Rect: ({}, {}, {}, {})",
                        page_id,
                        x,
                        y,
                        width,
                        height
                    );

                    updates.push((field_id, page_id, (x, y, width, height), value.clone()));
                }
            }

            if let Ok(kids_obj) = field.get(b"Kids") {
                if let Ok(kids) = kids_obj.as_array() {
                    Self::fill_field_updates(
                        document,
                        kids,
                        name_string.as_deref(),
                        Some(&value),
                        page_data,
                        updates,
                    )?;
                }
            }
        }
        Ok(())
    }

    fn flatten_form_fields(
        document: &mut Document,
        page_data: &HashMap<String, String>,
        font: PDFFont,
    ) -> SystemSyncResult<()> {
        let mut updates = Vec::new();

        // Collect all necessary information
        let catalog = document
            .trailer
            .get(b"Root")
            .and_then(|obj| obj.as_reference())?;
        let catalog_dict = document.get_object(catalog)?.as_dict()?;
        let acroform = catalog_dict
            .get(b"AcroForm")
            .and_then(|obj| obj.as_reference())?;
        let acroform_dict = document.get_object(acroform)?.as_dict()?;
        let fields_list = acroform_dict.get(b"Fields")?.as_array()?;

        log_message!(SysLogLevel::Debug, "Fields List: {:?}", fields_list);

        Self::fill_field_updates(document, fields_list, None, None, page_data, &mut updates)?;

        log_message!(SysLogLevel::Debug, "Updates Collected: {:?}", updates);

        // Process collected information
        for (annotation_id, page_id, (x, y, width, height), value) in updates {
            let suggested_font_size = 12.0;
            let (lines_with_coords, optimal_font_size, _) = Self::get_text_placement(
                &value,
                x,
                y + height, // Start from top of the box
                width,
                height,
                suggested_font_size,
                Some(font.clone()),
            );

            log_message!(SysLogLevel::Debug, "Lines: {:?}", lines_with_coords);

            let mut operations = Vec::new();

            operations.push(Operation::new("q", vec![]));
            operations.push(Operation::new("BT", vec![]));
            operations.push(Operation::new(
                "Tf",
                vec!["FS1".into(), optimal_font_size.into()],
            ));

            if let Some((first_line, line_x, line_y)) = lines_with_coords.first() {
                operations.push(Operation::new(
                    "Td",
                    vec![(*line_x).into(), (*line_y).into()],
                ));
                operations.push(Operation::new(
                    "Tj",
                    vec![Object::String(
                        Self::to_ascii_unicode_point(first_line.clone()),
                        StringFormat::Hexadecimal,
                    )],
                ));
            }

            for i in 1..lines_with_coords.len() {
                let (_, _, previous_y) = lines_with_coords[i - 1];
                let (line, line_x, current_y) = &lines_with_coords[i];
                let y_difference = previous_y - current_y;
                operations.push(Operation::new("Td", vec![0.into(), (-y_difference).into()]));
                operations.push(Operation::new(
                    "Tj",
                    vec![Object::String(
                        Self::to_ascii_unicode_point(line.clone()),
                        StringFormat::Hexadecimal,
                    )],
                ));
            }

            operations.push(Operation::new("ET", vec![]));
            operations.push(Operation::new("Q", vec![]));

            let content = Content { operations };

            let content_id = document.add_object(Stream::new(dictionary! {}, content.encode()?));

            // Update the page to include the new content and resources
            if let Ok(page_obj) = document.get_object_mut(page_id) {
                if let Ok(page_dict) = page_obj.as_dict_mut() {
                    // Add new content to existing contents
                    let contents = page_dict.get(b"Contents")?;
                    let mut contents_array = match contents {
                        Object::Array(arr) => arr.clone(),
                        Object::Reference(id) => vec![Object::Reference(*id)],
                        _ => vec![],
                    };
                    contents_array.push(Object::Reference(content_id));
                    page_dict.set("Contents", Object::Array(contents_array));
                }
            }

            // Remove the annotation
            document.objects.remove(&annotation_id);
        }

        Self::cleanup_pdf(document)?;

        Ok(())
    }
}

impl FormFlatten for PDF {}
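To make the shape of the generated content easier to see, here is a standalone sketch of the two core steps in the code above: converting a `/Rect` array into an origin plus width and height, and emitting the `q BT Tf Td Tj ET Q` operator sequence as plain text. It uses only the standard library rather than lopdf's `Operation` type. The function names are hypothetical, and unlike the code above it writes PDF literal strings (with escaping) instead of hexadecimal strings, so it is only suitable for simple ASCII text.

```rust
/// Convert a /Rect array [llx, lly, urx, ury] into (x, y, width, height).
fn rect_to_box(rect: [f32; 4]) -> (f32, f32, f32, f32) {
    let [llx, lly, urx, ury] = rect;
    (llx, lly, urx - llx, ury - lly)
}

/// Escape the characters that are special inside a PDF literal string.
fn escape_pdf_literal(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        if matches!(ch, '(' | ')' | '\\') {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

/// Build the text operators for a list of (text, x, y) lines, where x and y
/// are absolute page coordinates. The first Td moves to the first line's
/// position; each later Td is an offset from the previous line, because Td
/// moves relative to the start of the current line.
fn build_text_ops(font_name: &str, font_size: f32, lines: &[(&str, f32, f32)]) -> String {
    let mut ops = String::from("q\nBT\n");
    ops.push_str(&format!("/{} {} Tf\n", font_name, font_size));

    let mut previous: Option<(f32, f32)> = None;
    for (text, x, y) in lines {
        let (dx, dy) = match previous {
            None => (*x, *y),
            Some((px, py)) => (*x - px, *y - py),
        };
        ops.push_str(&format!("{} {} Td\n", dx, dy));
        ops.push_str(&format!("({}) Tj\n", escape_pdf_literal(text)));
        previous = Some((*x, *y));
    }

    ops.push_str("ET\nQ\n");
    ops
}
```

The font name passed in (for example `FS1`, as used above) must match an entry in the page's font resources, or viewers will not be able to render the text.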

For fill operations, I'm just filling the value (V) of existing annotations.

use crate::log_message;
use crate::parallel::spawn_channel_task;
use crate::pdf::lopdf::form_flatten::FormFlatten;
use crate::pdf::lopdf::pdf::PDF;
use crate::pdf::lopdf::utility::Utility;
use crate::prelude::*;
use erp_contrib::crossbeam::channel::{unbounded, Receiver, SendError, Sender};
use erp_contrib::lopdf::{Document, Object, ObjectId};
use erp_contrib::{lopdf, rayon};
use std::collections::HashMap;
use std::sync::Arc;

pub trait FormFill: Utility {
    fn fill_annotation_updates(
        document: &Document,
        fields: &Vec<Object>,
        parent_name: Option<&str>,
        data: &HashMap<String, String>,
        annotation_updates: &mut Vec<(ObjectId, String)>,
    ) -> SystemSyncResult<()> {
        for field_ref in fields {
            let field_id = field_ref.as_reference()?;
            let field = document.get_object(field_id)?.as_dict()?;

            log_message!(SysLogLevel::Debug, "Field dictionary: {:?}", field);

            let name_string = if let Ok(name_obj) = field.get(b"T") {
                log_message!(SysLogLevel::Debug, "Got Name Obj: {:?}", name_obj);
                if let Ok(name) = name_obj.as_str() {
                    log_message!(SysLogLevel::Debug, "Got Name");
                    let mut name_string =
                        Self::clean_string(&String::from_utf8_lossy(name).to_string());
                    if let Some(parent) = parent_name {
                        name_string = format!("{}.{}", parent, name_string);
                    }
                    log_message!(
                        SysLogLevel::Debug,
                        "Constructed Field Name: {}",
                        name_string
                    );
                    Some(name_string)
                } else {
                    None
                }
            } else {
                None
            };

            if let Some(name_string) = &name_string {
                if let Some(value) = data.get(name_string) {
                    log_message!(
                        SysLogLevel::Debug,
                        "Field {} currently has value: {:?}",
                        name_string,
                        format!("{:?}", field.get(b"V")),
                    );
                    annotation_updates.push((field_id, value.clone()));
                }
            }

            if let Ok(kids_obj) = field.get(b"Kids") {
                if let Ok(kids) = kids_obj.as_array() {
                    Self::fill_annotation_updates(
                        document,
                        kids,
                        name_string.as_deref(),
                        data,
                        annotation_updates,
                    )?;
                }
            }
        }
        Ok(())
    }

    fn collect_annotation_updates(
        document: &Document,
        data: &HashMap<String, String>,
    ) -> SystemSyncResult<Vec<(ObjectId, String)>> {
        let mut annotation_updates = Vec::new();
        log_message!(SysLogLevel::Debug, "Hashmap Data: {:?}", data);

        let catalog = document
            .trailer
            .get(b"Root")
            .and_then(|obj| obj.as_reference())?;

        let catalog_dict = document.get_object(catalog)?.as_dict()?;

        let acroform = catalog_dict
            .get(b"AcroForm")
            .and_then(|obj| obj.as_reference())?;

        let acroform_dict = document.get_object(acroform)?.as_dict()?;

        let fields_list = acroform_dict.get(b"Fields")?.as_array()?;

        log_message!(SysLogLevel::Debug, "Field list value: {:?}", fields_list);

        Self::fill_annotation_updates(document, fields_list, None, data, &mut annotation_updates)?;

        Ok(annotation_updates)
    }

    fn apply_annotation_updates(
        document: &mut Document,
        annotation_updates: Vec<(ObjectId, String)>,
    ) -> SystemSyncResult<()> {
        for (annotation_id, value) in annotation_updates {
            let cleaned_value = Self::clean_string(&value);
            let utf16be_value = Self::to_utf16be(&cleaned_value);
            if let Ok(annotation_object) = document.get_object_mut(annotation_id) {
                if let Ok(dict) = annotation_object.as_dict_mut() {
                    dict.set(
                        "V",
                        Object::String(utf16be_value, lopdf::StringFormat::Literal),
                    );
                }
            }

            log_message!(
                SysLogLevel::Debug,
                "Text field {:?} now has updated value: {}",
                annotation_id,
                cleaned_value,
            );
        }

        Ok(())
    }

    fn fill_form(
        &self,
        data: Arc<HashMap<usize, HashMap<String, String>>>,
        flatten: bool,
    ) -> SystemSyncResult<()>;
}

impl FormFill for PDF {
    fn fill_form(
        &self,
        data: Arc<HashMap<usize, HashMap<String, String>>>,
        flatten: bool,
    ) -> SystemSyncResult<()> {
        let (tx_data, rx_data): (Sender<SystemSyncResult<()>>, Receiver<SystemSyncResult<()>>) =
            unbounded();
        for (i, document) in self.documents.iter().enumerate() {
            let tx_clone = tx_data.clone();
            let data_clone = data.clone();
            let document_clone = document.clone();
            let font = self.default_font.clone();
            spawn_channel_task(tx_clone, move || {
                let document = document_clone.clone();
                if let Some(page_data) = data_clone.get(&i) {
                    let mut document_guard = document.write();
                    if flatten {
                        PDF::flatten_form_fields(&mut document_guard, page_data, font)?;
                    } else {
                        let annotation_updates =
                            PDF::collect_annotation_updates(&document_guard, page_data)?;
                        PDF::apply_annotation_updates(&mut document_guard, annotation_updates)?;
                    }
                }

                Ok(())
            });
        }
        drop(tx_data);
        while let Ok(result) = rx_data.recv() {
            result?;
        }

        Ok(())
    }
}
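
The `to_utf16be` helper called in `apply_annotation_updates` is not shown above. As a minimal sketch of what such an encoder could look like, assuming the common PDF convention of prefixing UTF-16BE text strings with the byte order mark `FE FF` (the function name below is hypothetical, and the real helper may differ, for example in whether it emits the mark):

```rust
/// Hypothetical stand-in for the `to_utf16be` helper, which is not shown in this thread.
/// Encodes `text` as UTF-16 big-endian bytes, prefixed with the byte order mark FE FF.
fn to_utf16be_with_bom(text: &str) -> Vec<u8> {
    // Start with the big-endian byte order mark.
    let mut bytes = vec![0xFE, 0xFF];
    // `encode_utf16` yields u16 code units (including surrogate pairs);
    // each unit is written high byte first.
    for unit in text.encode_utf16() {
        bytes.extend_from_slice(&unit.to_be_bytes());
    }
    bytes
}
```

The resulting bytes are the kind of value that would be passed to `Object::String` for the `V` entry.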

To put a crate together I'd need to clean things up and package it for distribution, which would take some time. I'd also want to expand it to support other form field types, such as checkboxes. Right now I interact with it via:

pub struct PDF {
    pub documents: Vec<Arc<RwLock<Document>>>,
    pub default_font: PDFFont,
    pub copies: Option<usize>,
}

impl PDF {
    pub fn new() -> Self {
        Self {
            documents: Vec::new(),
            default_font: PDFFont::Courier,
            copies: None,
        }
    }

    pub fn set_font(mut self, font: PDFFont) -> Self {
        self.default_font = font;
        self
    }

    pub fn set_copies(mut self, copies: Option<usize>) -> Self {
        self.copies = copies;
        self
    }

    pub fn safe_fill_form(
        &mut self,
        bytes: Vec<u8>,
        data: HashMap<usize, HashMap<String, String>>,
        flatten: bool,
    ) -> SystemSyncResult<Vec<u8>> {
        match std::panic::catch_unwind(std::panic::AssertUnwindSafe(
            || -> SystemSyncResult<Vec<u8>> {
                // `Option<usize>` is `Copy`, and the `Arc` is used only once,
                // so neither needs an explicit clone.
                self.load_pdf(bytes, self.copies)?;
                self.fill_form(Arc::new(data), flatten)?;
                self.save()
            },
        )) {
            Ok(result) => result,
            Err(_) => Err(SystemError::new(
                file!(),
                line!(),
                SystemErrorCode::GenericErrorCode,
                SystemErrorType::GenericErrorType,
                "PDF Production panicked with error",
            )
            .into()),
        }
    }
}

In my case, a single source template is reused based on the number of pages/documents a caller wants. For example, a multi-page invoice would use the same source template: the caller requests the number of pages the invoice requires (say 3) and organizes the incoming HashMap so that it is indexed by the corresponding page/document (usize), where each inner map contains all the field mappings for the page/document at that ordinal position. I clone the template n times, process the flatten or fill operations in parallel, merge the resulting documents together, and then pass the merged document through optimization.
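
As an illustration of that layout, here is a minimal sketch of how a caller might build the `HashMap<usize, HashMap<String, String>>` for a multi-page invoice. The helper name and the `line.N.description` / `line.N.amount` field names are assumptions made up for this example; they are not part of the code above.

```rust
use std::collections::HashMap;

/// One map of field name -> value per page/document, keyed by ordinal position.
type PageData = HashMap<usize, HashMap<String, String>>;

/// Hypothetical helper: spreads invoice lines across pages, `per_page` lines each.
/// Field names such as "line.0.amount" are illustrative only.
/// `per_page` must be greater than zero (`chunks` panics otherwise).
fn build_invoice_data(lines: &[(&str, &str)], per_page: usize) -> PageData {
    let mut data = PageData::new();
    for (page, chunk) in lines.chunks(per_page).enumerate() {
        // Each page gets its own inner map, created on first use.
        let fields = data.entry(page).or_default();
        for (row, (description, amount)) in chunk.iter().enumerate() {
            fields.insert(format!("line.{row}.description"), description.to_string());
            fields.insert(format!("line.{row}.amount"), amount.to_string());
        }
    }
    data
}
```

The number of keys in the outer map is then the number of template copies to request, and the whole map is what would be handed to `safe_fill_form` as `data`.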

The optimization logic deduplicates content and moves everything into XObjects, deduplicates fonts, and so on. lopdf doesn't handle any of this, so cloning out a 1000-page document, for example, results in a very large file because redundant information (like fonts) gets cloned per page. It would be more efficient to deal with some of that as the pages are cloned out, rather than in a single optimization pass at the end, but that doesn't matter to me right now.
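
The optimization code itself is not shown here. As a rough sketch of the general idea only (content-based deduplication, not the actual implementation), identical byte streams can be collapsed to a single copy while recording which unique copy each original should point to. The function name and the plain `Vec<u8>` representation are assumptions for illustration; a real pass would operate on lopdf objects and rewrite references.

```rust
use std::collections::HashMap;

/// Illustrative only: collapses identical byte streams (e.g. embedded font data).
/// Returns the unique streams plus, for each input index, the index of its unique copy.
fn dedupe_streams(streams: &[Vec<u8>]) -> (Vec<Vec<u8>>, Vec<usize>) {
    let mut unique: Vec<Vec<u8>> = Vec::new();
    // Maps stream content to its position in `unique`.
    let mut seen: HashMap<&[u8], usize> = HashMap::new();
    let mut remap = Vec::with_capacity(streams.len());
    for stream in streams {
        let index = *seen.entry(stream.as_slice()).or_insert_with(|| {
            // First time this content is seen: keep one copy.
            unique.push(stream.clone());
            unique.len() - 1
        });
        remap.push(index);
    }
    (unique, remap)
}
```

With the `remap` table, every reference to a duplicated stream can be redirected to the single retained copy.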

Fields in the template, in my case, are delimited with a ".", e.g. named "question.1.prompt". In the PDF, such a name is stored as a hierarchy of field dictionaries (A -> B -> C), one per segment, which is why the code follows Kids to locate each field in the form.
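
To make the Kids traversal concrete, here is a minimal sketch using a simplified stand-in for the field tree. The `Field` struct and `find_field` function are hypothetical and use plain Rust types rather than lopdf dictionaries: `partial_name` plays the role of a field's `/T` entry and `kids` the role of `/Kids`.

```rust
/// Simplified stand-in for a form field dictionary (not a lopdf type).
struct Field {
    /// The field's partial name (the `/T` entry), e.g. "question".
    partial_name: String,
    /// Child fields (the `/Kids` entry).
    kids: Vec<Field>,
}

/// Resolves a dotted name such as "question.1.prompt" by descending
/// one level of `kids` per segment. Returns `None` if any segment is missing.
fn find_field<'a>(roots: &'a [Field], qualified_name: &str) -> Option<&'a Field> {
    let mut level = roots;
    let mut found = None;
    for segment in qualified_name.split('.') {
        let field = level.iter().find(|f| f.partial_name == segment)?;
        level = &field.kids;
        found = Some(field);
    }
    found
}
```

In this sketch, looking up "question.1.prompt" visits the "question" node, then its child "1", then that node's child "prompt", which is the terminal field whose value or widget would be updated.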

[Images: unfilled, flattened, and filled versions of the test template]

I used Illustrator to design the test templates.

@ajrcarey
Owner

ajrcarey commented Aug 4, 2024

Began work on an internal flatten implementation for pdfium-render behind feature flag flatten.

@ajrcarey
Owner

Test whether upstream issue resolved by https://pdfium.googlesource.com/pdfium/+/a4682f9a0f699288e7b84992d2096340133b74ac.
