Split a PDF document using OL Connect

In this tutorial, we explore how to use standard OL Connect techniques to split print-ready files, such as PDFs and PostScript files. The general steps, described in detail below, are:

  1. A Data Mapping configuration defines the document boundaries.
  2. The same Data Mapping configuration extracts the data needed for the file names.
  3. A Job Preset adds the extracted data to the document meta data.
  4. An Output Preset uses the meta data to name and store the individual files, and handles document separation.
  5. The resources are deployed.

Step 1: Define Document Boundaries

To split a file, we need to define where each document within it begins. A boundary can be based on a fixed page count, or it can be triggered by a change in specific text, such as an invoice number or a page number. In this tutorial, we’ll work with a PDF containing multiple invoices as our example.

Create a Data Mapping configuration

  1. Start OL Connect Designer and, from the Welcome Screen, click New.
  2. Go to the Data tab and select PDF. The New Data Mapping configuration dialog appears.
  3. Browse for the PDF file and click Finish.

This creates a new Data Mapping configuration and adds the selected PDF file to the Data Samples.

Set the Document Boundaries

  1. In the Settings tab, set the Trigger to:
    • On Page: If the documents always have a fixed number of pages.
    • On Text: If document boundaries are based on a specific changing value (e.g., invoice number, customer ID).
  2. In this example, we select On Text and configure it to recognize the text PAGE 1 OF in a specific area as the trigger. This ensures that the PDF is split correctly, even when the documents have a variable number of pages.
Screenshot of the Boundaries settings in a Data Mapping Configuration, set to apply document boundaries based on text recognition, configured to detect the phrase "Page 1 of" in a specific area.
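
The two trigger types can be illustrated outside OL Connect with a minimal Python sketch. It is not how the Data Mapper is implemented: the function names split_on_text and split_on_page are hypothetical, each page is assumed to be available as plain text, and the real On Text trigger also restricts the match to a specific area of the page, which is left out here.

```python
from typing import List


def split_on_text(pages: List[str], trigger: str = "PAGE 1 OF") -> List[List[int]]:
    """Group 0-based page indices into documents.

    A page whose text contains `trigger` starts a new document
    (a simplified stand-in for the On Text trigger).
    """
    documents: List[List[int]] = []
    for index, text in enumerate(pages):
        # Start a new document on a trigger page, or on the very first page.
        if trigger in text or not documents:
            documents.append([])
        documents[-1].append(index)
    return documents


def split_on_page(page_count: int, pages_per_document: int) -> List[List[int]]:
    """Group page indices into documents of a fixed size (On Page trigger)."""
    if pages_per_document < 1:
        raise ValueError("pages_per_document must be at least 1")
    return [
        list(range(start, min(start + pages_per_document, page_count)))
        for start in range(0, page_count, pages_per_document)
    ]


# Hypothetical page texts for three invoices of 2, 1 and 3 pages.
pages = [
    "Invoice PAGE 1 OF 2",
    "Invoice PAGE 2 OF 2",
    "Invoice PAGE 1 OF 1",
    "Invoice PAGE 1 OF 3",
    "Invoice PAGE 2 OF 3",
    "Invoice PAGE 3 OF 3",
]

print(split_on_text(pages))   # [[0, 1], [2], [3, 4, 5]]
print(split_on_page(6, 2))    # [[0, 1], [2, 3], [4, 5]]
```

The text-based variant copes with documents of varying length, which is why it is the choice for the invoice example.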

Step 2: Extract Data for the file names

Once document boundaries are defined, we extract data from the document for naming the output files.

Extract data

  1. Navigate to the Steps view.
  2. Select the text you want to extract and use for the file name. In our example, we’ve selected the Order Number.
  3. Right-click the highlighted text and choose Add Extraction.
  4. This creates a new data field that appears in the Data Model view, showing the value of the current record.
Screenshot of a portion of the input PDF document with a highlighted area indicating the extraction of the order number text.

Rename the extracted field

  1. In the Data Model view, locate the newly created field.
  2. Right-click on the field and choose Rename.
  3. Enter a descriptive name (e.g., OrderNo) that clearly represents the extracted data.
  4. Save the Data Mapping configuration to disk to ensure all changes are preserved.
Screenshot of the Data Model view displaying the OrderNo field with the extracted value for the current record/document.
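
Conceptually, the extraction turns each document into a record with one named field. The sketch below mimics that result in Python under loud assumptions: the Data Mapper extracts from a selected region of the page, whereas this stand-in uses a regular expression, and the "Order Number:" label and the sample value are hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical label; the real extraction is defined by a page region instead.
ORDER_PATTERN = re.compile(r"Order Number:\s*(\S+)")


def extract_record(first_page_text: str) -> Dict[str, Optional[str]]:
    """Return a one-field record, named OrderNo as in the Data Model view."""
    match = ORDER_PATTERN.search(first_page_text)
    return {"OrderNo": match.group(1) if match else None}


record = extract_record("ACME Corp\nOrder Number: SO-10423\nPAGE 1 OF 2")
print(record)  # {'OrderNo': 'SO-10423'}
```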

Step 3: Add extracted data to meta data

A Job Preset is used to include the extracted data in the meta data of the documents, ensuring it can be used in the Output Preset for naming the output files.

Create a Job Preset

  1. Open the Welcome Screen and click on New.
  2. Navigate to the Presets tab and select Job Preset.
  3. Choose the previously created Data Mapping configuration file.
  4. Check the Include meta data option.
Screenshot of the Job Creation dialog displaying the selected Data Mapping Configuration and the enabled Include metadata option.
  5. Click Next until you reach the Meta Data Options page.
  6. In the Meta Data Options step, go to the Document Tags tab and click the Add meta data icon, then select Add field meta data.
  7. Choose the OrderNo field and click OK. The OrderNo field now appears in the Document Tags overview and will be included in the document’s meta data.
  8. Finally, click Finish to save the Job Preset.
Screenshot of the Metadata Options page in the Job Preset, with the Document Tags section selected, displaying the OrderNo tag.

Step 4: Name files and split documents

The Output Preset handles file naming and document separation.

Set the file name

  1. Open the Welcome Screen and click on New.
  2. Navigate to the Presets tab and select Output Preset.
  3. Choose the previously created Job Preset file.
  4. Enable Separation to generate individual output files per document.
  5. Click Next until you reach the Print Options page.
  6. On the Print Options page, set the Output Type to Directory.
  7. Define a Job Output Mask using the meta data variables:
${document.metadata.OrderNo}.pdf

This dynamically generates filenames using the order number.
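
To make the substitution concrete, here is a minimal Python sketch of how such a mask resolves to one file name per document. It only handles the ${document.metadata.<name>} form shown above and is an illustration, not the OL Connect implementation; the function name expand_mask and the sample order numbers are hypothetical.

```python
import re
from typing import Mapping

# Matches only the ${document.metadata.<name>} form used in this tutorial.
PLACEHOLDER = re.compile(r"\$\{document\.metadata\.([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_mask(mask: str, metadata: Mapping[str, str]) -> str:
    """Replace each metadata placeholder in `mask` with the document's value."""

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in metadata:
            raise KeyError(f"no meta data field named {name!r}")
        return metadata[name]

    return PLACEHOLDER.sub(replace, mask)


# Hypothetical document meta data, one mapping per separated document.
documents = [{"OrderNo": "SO-10423"}, {"OrderNo": "SO-10424"}]
names = [expand_mask("${document.metadata.OrderNo}.pdf", doc) for doc in documents]
print(names)  # ['SO-10423.pdf', 'SO-10424.pdf']
```

Raising an error for a missing field is a choice made for this sketch; it makes a misspelled tag name visible immediately.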

Tip! Instead of enabling separation on the Separation Options page and typing the mask manually, you can do both in the Job Output Mask dialog, which opens when you click the Pencil icon next to the Job Output Mask field.

Note! The Job Output Folder can be set in the preset, but it can be overridden with job infos in OL Connect Workflow, or from within the All In One or Paginated Output nodes of OL Connect Automate.

Enable separations

  1. Click Next until you reach the Separation Options page.
  2. Set the Separation Settings to Document.
  3. Click Finish to save the Output Preset.

Step 5: Deploy resources

At this stage, you have a Data Mapping configuration that defines document boundaries and extracts the order number for each document. A Job Preset stores the order number in the document’s meta data, while an Output Preset generates a separate file for each document, using the extracted order number as the file name.

Before these resources can be used, they must be deployed to OL Connect Workflow, or to OL Connect Server when using OL Connect Automate. They can also be sent directly to OL Connect Server via the REST API.

This tutorial covers splitting a print-ready file. When print-ready files are only modified (split, given OMR marks, sorted for postal delivery, or similar), there is typically no need to merge data with OL Connect templates. In OL Connect Workflow, this is done with the Execute Data Mapping task with the Bypass content creation option enabled, as shown in the following image. With this option the job skips Content Creation, which improves overall performance.

Screenshot of the Execute Data Mapping Properties dialog with the "Bypass content creation" option checked.

The following image illustrates an OL Connect Automate flow, where the Data Mapping configuration is applied in the Document Mapping node (which applies Bypass content creation on the fly), and the Job and Output Presets are used in their respective steps.

Screenshot of a similar flow in OL Connect Automate, where the Document Mapping is used, automatically applying the "Bypass content creation" setting.

Conclusion

This method processes print-ready documents efficiently and automatically, and gives the output files data-driven names. The Output Preset can also add extra content such as text and barcodes, and can prepend a banner page to the job using the Additional Content options.
