
PDF Processing

Timothy W Belch edited this page Mar 2, 2020 · 5 revisions

MPS PDF Processing Utilities

This page describes the utilities that the MPS team uses to process PDFs.

The scripts that perform the processing reside at the following location:

ROOTDIR = tang.umdl.umich.edu:/quod-prep/prep/a/acls/mpspdfutil

PDF Optimization

This utility can be used to optimize one or more PDFs. Currently, a PDF can be optimized in the following ways:

  1. Selected images within the PDF can be resized by a specified percentage.

  2. Selected images within the PDF can be converted to a specified format.

  3. A page index representing the cover image can be specified and extracted to a specified format.

  4. For a PDF with the file name ebookISBN.pdf, bookmarks can be copied from a PDF named ebookISBN_web.pdf, if one exists in the same directory.

Below is the script usage syntax:

usage: pdf_optimize [options] pdf_file [pdf_file...]
 -c   Extract cover in the format 
      [bmp|jpeg|jpeg2000|png].
 -f   Resize images in the format 
      [bmp|jpeg|jpeg2000|png].
      The default is jpeg.
 -o   Additional Java VM options.
      Use -o "-Xms8192m -Xmx8192m" for large PDFs.
 -p   Cover page index [0-9]+.
      The --cover_format option must also be specified.
      Default is 0.
 -r   Resize %. The default is 100.
 -t   Dimension threshold [0-9]+.
      The default is 0.

The resulting PDF has the suffix _optimize_{resize_pct}pct appended to its base filename. For example, if the input PDF is named ebookISBN.pdf and its images are resized to 80% with the -r 80 option, then the resulting PDF is named ebookISBN_optimize_80pct.pdf. If a PDF named ebookISBN_web.pdf exists in the same directory, then the bookmarks from ebookISBN_web.pdf are copied into the resulting optimized PDF.
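The naming rule can be sketched as a small shell helper. The function name optimized_name is hypothetical and is not part of the MPS utilities. It only computes the expected output path and does not invoke pdf_optimize. It also assumes that the default of 100 appears in the suffix when -r is omitted, which this page does not confirm.

```shell
# Hypothetical helper (not part of mpspdfutil): predict the output path
# that pdf_optimize is described as producing for an input PDF.
optimized_name() {
    pdf_path=$1
    # Assumption: when -r is omitted, the documented default of 100
    # is what ends up in the suffix.
    resize_pct=${2:-100}
    # Strip a trailing ".pdf", then append the documented suffix.
    printf '%s_optimize_%spct.pdf\n' "${pdf_path%.pdf}" "$resize_pct"
}

optimized_name /tmp/ebookISBN.pdf 80
```

A helper like this can be used to check whether an optimized copy already exists before starting a long-running job.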

Below are a few sample invocations:

  1. To maintain the original size of all images within a PDF and use the JPEG format, invoke the following command:

    ROOTDIR/script/pdf_optimize -f jpeg /mnt/umptmm/MPS/BAR/compression/9781407336138/9781407336138.pdf

  2. To resize all images within a PDF to be 80% of their original size and use the JPEG format, invoke the following command:

    ROOTDIR/script/pdf_optimize -r 80 -f jpeg /mnt/umptmm/MPS/BAR/compression/9781407336138/9781407336138.pdf

  3. To resize all images within a PDF to be 80% of their original size, use the JPEG format, and extract the first page as the cover in the PNG format, invoke the following command:

    ROOTDIR/script/pdf_optimize -r 80 -f jpeg -c png -p 0 /mnt/umptmm/MPS/BAR/compression/9781407336138/9781407336138.pdf

  4. To resize all images within a PDF with both the width and height dimensions 1000 pixels or greater to be 80% of their original size and use the JPEG format, invoke the following command (NOTE: in this example, the nice command sets the scheduling priority to the least favorable value of 19):

    nice -19 ROOTDIR/script/pdf_optimize -r 80 -f jpeg -t 1000 /mnt/umptmm/MPS/BAR/compression/9781407336138/9781407336138.pdf
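The usage syntax accepts more than one PDF per invocation, although the samples above each pass a single file. The following is a minimal dry-run sketch of a multi-file call. The helper optimize_cmd and the fallback value of ROOTDIR are hypothetical, and the command is printed rather than executed. It uses the POSIX form nice -n 19 in place of the nice -19 spelling shown above.

```shell
# Assumption: ROOTDIR is the mpspdfutil directory on the host where the
# scripts are installed; the fallback path below is only a placeholder.
ROOTDIR=${ROOTDIR:-/path/to/mpspdfutil}

# Hypothetical helper: print (do not run) a single low-priority
# pdf_optimize command that covers every PDF passed as an argument.
optimize_cmd() {
    printf 'nice -n 19 %s/script/pdf_optimize -r 80 -f jpeg -t 1000' "$ROOTDIR"
    for pdf in "$@"; do
        printf ' %s' "$pdf"
    done
    printf '\n'
}

# Dry run: show the command for two PDFs.
optimize_cmd first.pdf second.pdf
```

The printed command is meant for review only. Paths that contain spaces would need quoting before the command is run.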

PDF Cover

This utility can be used to extract a cover page from a PDF. Below is the script usage syntax:

pdf_cover [options] pdf_file [pdf_file...]
 -c   Extract cover in the format 
      [bmp|jpeg|jpeg2000|png].
 -o   Additional Java VM options.
      Use -o "-Xms8192m -Xmx8192m" for large PDFs.
 -p   Cover page index [0-9]+.
      The --cover_format option must also be specified.
      Default is 0.

The extracted cover image has the suffix _cover appended to the PDF's base filename, and its extension matches the cover format. For example, if the input PDF is named ebookISBN.pdf and the cover is extracted from page 0 with the -c png option, then the resulting image is named ebookISBN_cover.png.

Below are a few sample invocations:

  1. To extract a cover from page 5 and store it as a PNG file, invoke the following command:

    ROOTDIR/script/pdf_cover -c png -p 5 /mnt/umptmm/MPS/BAR/compression/9781407336138/9781407336138.pdf

  2. To extract a cover from page 0 and store it as a JPEG file, invoke the following command:

    ROOTDIR/script/pdf_cover -c jpeg /mnt/umptmm/MPS/BAR/compression/9781407336138/9781407336138.pdf

    In this case, the -p option is not provided, so page 0 is used by default.

PDF Outline

This utility can be used to generate a new PDF that contains bookmarks copied from another. Below is the script usage syntax:

pdf_outline [options] pdf_file [pdf_file...]
 -c   Extract cover in the format 
      [bmp|jpeg|jpeg2000|png].
 -o   Additional Java VM options.
      Use -o "-Xms8192m -Xmx8192m" for large PDFs.
 -p   Cover page index [0-9]+.
      The --cover_format option must also be specified.
      Default is 0.

The resulting PDF has the suffix _outline appended to its base filename. For example, if the input PDF is named ebookISBN.pdf, then the resulting PDF is named ebookISBN_outline.pdf. If another PDF named ebookISBN_web.pdf exists in the same directory, then the bookmarks from ebookISBN_web.pdf are copied into the resulting PDF.

Below are a few sample invocations:

  1. To generate a new PDF with copied bookmarks, invoke the following command:

    ROOTDIR/script/pdf_outline /mnt/umptmm/MPS/BAR/compression/9781407336138/9781407336138.pdf

  2. To generate a new PDF with copied bookmarks and extract a cover as a PNG from page 3, invoke the following command:

    ROOTDIR/script/pdf_outline -c png -p 3 /mnt/umptmm/MPS/BAR/compression/9781407336138/9781407336138.pdf

PDF Outline Exists

This utility can be used to determine whether one or more PDFs contain bookmarks. Below is the script usage syntax:

pdf_has_outline [options] pdf_file [pdf_file...]
 -v   Display bookmark titles/pages.
 -o   Additional Java VM options.
      Use -o "-Xms8192m -Xmx8192m" for large PDFs.

By default, the total number of bookmarks is displayed. If an invalid bookmark is detected, it is counted in the error total. If the -v option is specified, then the bookmark titles and pages are also listed.

Below are a few sample invocations:

  1. To determine the total number of bookmarks, invoke the following command:

    ROOTDIR/script/pdf_has_outline /mnt/umptmm/MPS/BAR/compression/9781407338859/9781407338859.pdf

    Below is sample resulting output:

    "9781407338859.pdf" total bookmarks: 12 errors: 0

  2. To determine the total number of bookmarks and list their titles and pages, invoke the following command:

    ROOTDIR/script/pdf_has_outline -v /mnt/umptmm/MPS/BAR/compression/9781407338859/9781407338859.pdf

    Below is sample resulting output:

    "9781407338859.pdf" total bookmarks: 12 errors: 0

    Page 1: "Cover"

    Page 4: "copyright"

    Page 5: "Contents"

    Page 7: "Foreword"

    Page 8: "Acknowledgements"

    Page 9: "List of illustrations"

    Page 11: "Introduction"

    Page 12: "Major milestones in Hungarian Early Neolithic research during the 20th century"
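When many PDFs are checked at once, the summary lines can be filtered to find files that need attention. The sketch below assumes every summary line follows the format of the sample output above. The helper flag_outline_problems is hypothetical and is not part of the MPS utilities. It reads the utility's output on standard input and ignores the per-bookmark lines produced by -v.

```shell
# Hypothetical helper (not part of mpspdfutil): read pdf_has_outline
# output on stdin and print the names of PDFs that report zero
# bookmarks or at least one error.
# Assumes each summary line looks like the sample output:
#   "name.pdf" total bookmarks: 12 errors: 0
flag_outline_problems() {
    # Keep only summary lines, rewritten as: <bookmarks> <errors> <name>
    sed -n 's/^[[:space:]]*"\(.*\)" total bookmarks: \([0-9][0-9]*\) errors: \([0-9][0-9]*\)[[:space:]]*$/\2 \3 \1/p' |
    while read -r bookmarks errors name; do
        if [ "$bookmarks" -eq 0 ] || [ "$errors" -gt 0 ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

For example, the output of ROOTDIR/script/pdf_has_outline for a directory of PDFs could be piped into flag_outline_problems to list only the files with missing or invalid bookmarks.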