ptiff2doc is a shell script that combines the tiff files in a folder into
a PDF and/or DJVU file. It assumes the tiff files have been pre-processed
with a tool like ScanTailor. A hidden text layer, generated with tesseract
(OCR), is added to the PDF/DJVU document. This makes the PDF/DJVU file
searchable (also known as a sandwich PDF).
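
To illustrate the hidden text layer, here is a minimal sketch of how a single page could be turned into a searchable PDF with tesseract. The helper name `ocr_page` is hypothetical and this is not necessarily how ptiff2doc does it internally. It relies on tesseract's `pdf` output config, which writes the page image together with an invisible text layer.

```shell
# Hypothetical helper, not taken from ptiff2doc itself.
# Usage: ocr_page PAGE.tif [LANG]
ocr_page() {
    page=$1
    lang=${2:-deu}   # same default language as the -l option of ptiff2doc
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found" >&2
        return 127
    fi
    # Writes ${page%.tif}.pdf: the page image plus an invisible text layer.
    tesseract "$page" "${page%.tif}" -l "$lang" pdf
}
```

For example, `ocr_page page0001.tif deu+eng` would be expected to produce `page0001.pdf` next to the input file.
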
ptiff2doc uses parallel to process the tiff files in parallel across
several CPU cores. It is very resource hungry (CPU and disk space): expect
the temporary files to take up about twice the size of the folder
containing the tiff files (they are removed when the script finishes).
The temporary folder is created in the current working directory (cwd).
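
Since the temporary files land in the cwd, it can be worth checking the free space before a long run. The following is a rough sketch under the "about twice the folder size" rule of thumb stated above. The function names are made up for illustration and are not part of ptiff2doc.

```shell
# Hypothetical pre-flight check; the factor 2 is only the rule of thumb above.

# Estimated temporary space in KiB for a folder of tiff files.
needed_kb() {
    # du -sk prints "<size in KiB><TAB><path>"
    size_kb=$(du -sk "$1" | cut -f1)
    echo $(( size_kb * 2 ))
}

# Free space in KiB on the filesystem holding the given path (default: cwd).
available_kb() {
    # -P forces one output line per filesystem; column 4 is the free space
    df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }'
}

# Succeeds if the cwd has room for the estimated temporary files.
enough_space() {
    [ "$(available_kb .)" -ge "$(needed_kb "$1")" ]
}
```

A call such as `enough_space ./scans || echo "not enough free space in cwd" >&2` could then guard the actual run.
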
If you need more control over the created PDF/DJVU document, it's recommended to use gscan2pdf.
ptiff2doc depends on many external tools (see below). For convenience,
the required packages can be installed on Fedora with:
dnf install parallel libtiff-tools tesseract netpbm-progs djvulibre \
poppler-utils perl-Log-Log4perl gscan2pdf perl-File-Slurp perl-File-Temp \
perl-PDF-API2 perl-Getopt-Long perl-Encode perl-Encode-Locale perl-TimeDate
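
On other distributions the package names will differ, so a quick check for the required commands may help. This is only a sketch: `missing_tools` is a hypothetical helper, and the command names in the usage example (`parallel`, `tesseract`, `c44`) are just the ones mentioned in this README, not a complete list of what the script calls.

```shell
# Hypothetical helper: print every given command that is not found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Running `missing_tools parallel tesseract c44` prints nothing when all three commands are available.
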
./ptiff2doc.sh [OPTIONS] [FOLDER WITH TIFF FILES]
[FOLDER WITH TIFF FILES]
A folder with .tif files. If the folder is omitted,
the current working directory (cwd) is used.
Options [default value]:
-h | --help       This help
-b | --docname    The basename of the output document [book]
-d | --dpi        DPI setting for c44 [300]
-j | --djvu       Create .djvu
-p | --pdf        Create .pdf
-a | --author     Author to be set in .pdf/.djvu
-t | --title      Title to be set in .pdf/.djvu
-l | --language   Language setting for tesseract [deu]
                  See 'tesseract --list-langs' for supported languages
                    deu = German
                    eng = English
                    fin = Finnish
                  For mixed-language documents, 'deu+eng' is also possible
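
A mixed-language value such as 'deu+eng' only works if every listed language is installed for tesseract. The sketch below splits such a value and compares it against the output of `tesseract --list-langs`. Both function names are hypothetical and not part of ptiff2doc.

```shell
# Hypothetical helpers for validating a -l value before a long run.

# Print each language of a '+'-separated list on its own line.
split_langs() {
    printf '%s\n' "$1" | tr '+' '\n'
}

# Succeeds if every language in the list is known to tesseract.
langs_installed() {
    # Some tesseract versions print the list on stderr, so capture both streams.
    installed=$(tesseract --list-langs 2>&1) || return 127
    for lang in $(split_langs "$1"); do
        printf '%s\n' "$installed" | grep -qx "$lang" || {
            echo "missing tesseract language: $lang" >&2
            return 1
        }
    done
}
```

With the options listed above, a run for a German and English book could then look like `langs_installed deu+eng && ./ptiff2doc.sh -p -j -b mybook -l deu+eng ./scans`, where `mybook` and `./scans` are placeholder names.
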