PDF Processing
Below are descriptions of the utilities used by the MPS team for processing PDFs.
The scripts that perform the processing reside at the following location:
ROOTDIR = tang.umdl.umich.edu:/quod-prep/prep/a/acls/mpspdfutil
The pdf_optimize utility optimizes one or more PDFs. Currently, a PDF can be optimized in the following ways:
- Selected images within the PDF can be resized by a specified percentage.
- Selected images within the PDF can be converted to a specified format.
- A page index representing the cover image can be specified, and that page can be extracted to a specified format.
The descriptions below use a PDF with the file name ebookISBN.pdf as the example.
Below is the script usage syntax:
usage: pdf_optimize [options] pdf_file [pdf_file...]
    -c  Extract cover in the format [bmp|jpeg|jpeg2000|png].
    -f  Resize images in the format [bmp|jpeg|jpeg2000|png].
        The default is jpeg.
    -o  Additional Java VM options.
        Use -o "-Xms8192m -Xmx8192m" for large PDFs.
    -p  Cover page index [0-9]+.
        The --cover_format option must also be specified.
        Default is 0.
    -r  Resize %. The default is 100.
    -t  Dimension threshold [0-9]+.
        The default is 0.
The resulting PDF has the suffix _optimize_{resize_pct}pct appended to its filename. For example, if the specified PDF has the filename ebookISBN.pdf and is resized to 80% with the --resize_pct 80 option, the resulting PDF has the name ebookISBN_optimize_80pct.pdf. If a PDF with the file name ebookISBN_web.pdf exists in the same directory, the bookmarks from ebookISBN_web.pdf are copied into the resulting optimized PDF.
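The naming rule above can be sketched as a pair of small shell helpers, which may be handy for checking in advance which output file to expect and whether a bookmark source is present. The function names optimized_name and bookmark_source are hypothetical and are not part of the MPS scripts. The sketch also assumes that the default resize percentage of 100 appears in the suffix when -r is omitted, which this page does not confirm.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the documented naming convention.
# They do not call pdf_optimize; they only compute file names.

# Print the expected output path for an input PDF and a resize percentage.
# Assumption: when no percentage is given, the documented default of 100 is used in the suffix.
optimized_name() {
    pdf_path=$1
    resize_pct=${2:-100}
    printf '%s_optimize_%spct.pdf\n' "${pdf_path%.pdf}" "$resize_pct"
}

# Print the path of the companion PDF whose bookmarks would be copied.
bookmark_source() {
    printf '%s_web.pdf\n' "${1%.pdf}"
}

pdf=ebookISBN.pdf
echo "output:    $(optimized_name "$pdf" 80)"
if [ -f "$(bookmark_source "$pdf")" ]; then
    echo "bookmarks: copied from $(bookmark_source "$pdf")"
else
    echo "bookmarks: no $(bookmark_source "$pdf") found"
fi
```

Keeping the name computation separate from the invocation makes it possible to check for stale or missing outputs without running the Java-based processing.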
Below are a few sample invocations:
- To maintain the original size of all images within a PDF and use the JPEG format, invoke the following command:
ROOTDIR/script/pdf_optimize -f jpeg /mnt/umptmm/MPS/BAR/compression/9781407336138/9781407336138.pdf
- To resize all images within a PDF to 80% of their original size and use the JPEG format, invoke the following command:
ROOTDIR/script/pdf_optimize -r 80 -f jpeg /mnt/umptmm/MPS/BAR/compression/9781407336138/9781407336138.pdf
- To resize all images within a PDF to 80% of their original size, use the JPEG format, and extract the first page as the cover in the PNG format, invoke the following command:
ROOTDIR/script/pdf_optimize -r 80 -f jpeg -c png -p 0 /mnt/umptmm/MPS/BAR/compression/9781407336138/9781407336138.pdf
- To resize only those images within a PDF whose width and height are both 1000 pixels or greater to 80% of their original size and use the JPEG format, invoke the following command (NOTE: in this example, the nice command sets the scheduling priority to the least favorable value of 19):
nice -19 ROOTDIR/script/pdf_optimize -r 80 -f jpeg -t 1000 /mnt/umptmm/MPS/BAR/compression/9781407336138/9781407336138.pdf
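Because pdf_optimize writes its output next to the input, a directory that has already been processed contains both original and derived PDFs. The following is a minimal dry-run sketch, assuming the naming conventions described on this page (_web.pdf, _optimize_{resize_pct}pct.pdf, _outline.pdf), that prints one command per original PDF instead of running anything. The function names is_source_pdf and print_optimize_commands are hypothetical, and ROOTDIR remains a placeholder as elsewhere on this page.

```shell
#!/bin/sh
# Hypothetical filter: succeed only for PDFs that look like originals,
# based on the file name suffixes described on this page.
is_source_pdf() {
    case $1 in
        *_web.pdf|*_outline.pdf|*_optimize_*pct.pdf) return 1 ;;
        *.pdf) return 0 ;;
        *) return 1 ;;
    esac
}

# Dry run: print (do not execute) one pdf_optimize command per original PDF
# found directly inside the given directory.
print_optimize_commands() {
    dir=$1
    for pdf in "$dir"/*.pdf; do
        [ -f "$pdf" ] || continue          # skip the unmatched glob pattern
        is_source_pdf "$pdf" || continue   # skip derived or companion PDFs
        printf 'nice -19 ROOTDIR/script/pdf_optimize -r 80 -f jpeg -t 1000 "%s"\n' "$pdf"
    done
}

print_optimize_commands .
```

Printing the commands first gives a chance to review the list before starting a long, memory-heavy run.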
The pdf_cover utility extracts a cover page from a PDF. Below is the script usage syntax:
pdf_cover [options] pdf_file [pdf_file...]
    -c  Extract cover in the format [bmp|jpeg|jpeg2000|png].
    -o  Additional Java VM options.
        Use -o "-Xms8192m -Xmx8192m" for large PDFs.
    -p  Cover page index [0-9]+.
        The --cover_format option must also be specified.
        Default is 0.
The resulting image file has the suffix _cover appended to the PDF's base filename, and its extension matches the extracted cover format. For example, if the specified PDF has the filename ebookISBN.pdf and the cover is extracted from page 0 with the -c png option, the resulting file has the name ebookISBN_cover.png.
Below are a few sample invocations:
- To extract a cover from page 5 and store it as a PNG file, invoke the following command:
ROOTDIR/script/pdf_cover -c png -p 5 /mnt/umptmm/MPS/BAR/compression/9781407336138/9781407336138.pdf
- To extract a cover from page 0 and store it as a JPEG file, invoke the following command:
ROOTDIR/script/pdf_cover -c jpeg /mnt/umptmm/MPS/BAR/compression/9781407336138/9781407336138.pdf
In this case, the -p option is not provided, so page 0 is used by default.
The pdf_outline utility generates a new PDF that contains bookmarks copied from another PDF. Below is the script usage syntax:
pdf_outline [options] pdf_file [pdf_file...]
    -c  Extract cover in the format [bmp|jpeg|jpeg2000|png].
    -o  Additional Java VM options.
        Use -o "-Xms8192m -Xmx8192m" for large PDFs.
    -p  Cover page index [0-9]+.
        The --cover_format option must also be specified.
        Default is 0.
The resulting PDF has the suffix _outline appended to its filename. For example, if the specified PDF has the filename ebookISBN.pdf, the resulting PDF has the name ebookISBN_outline.pdf. If another PDF with the file name ebookISBN_web.pdf exists within the same directory, the bookmarks from ebookISBN_web.pdf are copied into the resulting PDF.
Below are a few sample invocations:
- To generate a new PDF with copied bookmarks, invoke the following command:
ROOTDIR/script/pdf_outline /mnt/umptmm/MPS/BAR/compression/9781407336138/9781407336138.pdf
- To generate a new PDF with copied bookmarks and extract a cover as a PNG from page 3, invoke the following command:
ROOTDIR/script/pdf_outline -c png -p 3 /mnt/umptmm/MPS/BAR/compression/9781407336138/9781407336138.pdf
The pdf_has_outline utility determines whether one or more PDFs contain bookmarks. Below is the script usage syntax:
pdf_has_outline [options] pdf_file [pdf_file...]
    -v  Display bookmark titles/pages.
    -o  Additional Java VM options.
        Use -o "-Xms8192m -Xmx8192m" for large PDFs.
By default, the total number of bookmarks is displayed. If an invalid bookmark is detected, it is counted in the error count. If the -v option is specified, the bookmark titles and pages are also listed.
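When many PDFs are checked at once, the summary lines can be filtered to show only the files that need attention. The sketch below assumes the summary line always has the form shown in this section's sample output ("name.pdf" total bookmarks: N errors: M). The function name flag_outline_problems is hypothetical, and the input lines in the demo are made-up examples rather than real results.

```shell
#!/bin/sh
# Hypothetical helper: read pdf_has_outline output on stdin and print only
# the PDFs that have no bookmarks or at least one bookmark error.
# Lines that are not summary lines (for example, -v listings) are ignored.
flag_outline_problems() {
    awk '
    /^".*" total bookmarks: [0-9]+ errors: [0-9]+/ {
        # Split the summary line into the quoted name and the two counts.
        split($0, parts, "\" total bookmarks: ")
        name = substr(parts[1], 2)          # drop the leading quote
        split(parts[2], counts, " errors: ")
        total = counts[1] + 0
        errors = counts[2] + 0
        if (total == 0 || errors > 0)
            printf "%s: bookmarks=%d errors=%d\n", name, total, errors
    }'
}

# Demo with made-up summary lines; in practice, pipe the script output in.
printf '%s\n' \
    '"good.pdf" total bookmarks: 12 errors: 0' \
    '"empty.pdf" total bookmarks: 0 errors: 0' |
    flag_outline_problems
```

In practice the input would come from piping the output of ROOTDIR/script/pdf_has_outline into the function.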
Below are a few sample invocations:
- To determine the total number of bookmarks, invoke the following command:
ROOTDIR/script/pdf_has_outline /mnt/umptmm/MPS/BAR/compression/9781407338859/9781407338859.pdf
Below is sample resulting output:
"9781407338859.pdf" total bookmarks: 12 errors: 0
- To determine the total number of bookmarks and list their titles and pages, invoke the following command:
ROOTDIR/script/pdf_has_outline -v /mnt/umptmm/MPS/BAR/compression/9781407338859/9781407338859.pdf
Below is sample resulting output:
"9781407338859.pdf" total bookmarks: 12 errors: 0
Page 1: "Cover"
Page 4: "copyright"
Page 5: "Contents"
Page 7: "Foreword"
Page 8: "Acknowledgements"
Page 9: "List of illustrations"
Page 11: "Introduction"
Page 12: "Major milestones in Hungarian Early Neolithic research during the 20th century"