This is a simple command-line tool written in Go that extracts text from PDF files and outputs it as plain text on standard output, or in CSV, JSON, or Parquet format.
This tool relies on the pdftotext command-line utility, which is part of the poppler-utils package.
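To illustrate the dependency, here is a minimal sketch of how a Go program can call pdftotext through os/exec. It is not taken from this tool's source: the function names `pdftotextArgs` and `extractText` are hypothetical, and the only pdftotext behavior assumed is that a trailing `-` argument sends the extracted text to standard output.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

// pdftotextArgs builds the argument list for pdftotext (hypothetical helper).
// The trailing "-" asks pdftotext to write the text to standard output
// instead of a .txt file next to the PDF.
func pdftotextArgs(pdfPath string) []string {
	return []string{pdfPath, "-"}
}

// extractText runs pdftotext on pdfPath and returns the extracted text.
func extractText(pdfPath string) (string, error) {
	// Fail early with a helpful message if poppler-utils is not installed.
	bin, err := exec.LookPath("pdftotext")
	if err != nil {
		return "", fmt.Errorf("pdftotext not found in PATH (install poppler-utils): %w", err)
	}

	var stdout, stderr bytes.Buffer
	cmd := exec.Command(bin, pdftotextArgs(pdfPath)...)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("pdftotext failed: %w: %s", err, stderr.String())
	}
	return stdout.String(), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: sketch <file.pdf>")
		return
	}
	text, err := extractText(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}
```

Checking for the binary with `exec.LookPath` before running it gives a clearer error than a failed `Run` when poppler-utils is missing.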
- Install Go: Make sure Go is installed on your system. You can download it from https://golang.org/.
- Install poppler-utils: You need to install the poppler-utils package, which provides the pdftotext utility.
  - On Debian/Ubuntu:
    sudo apt-get update
    sudo apt-get install poppler-utils
  - On CentOS/RHEL:
    sudo yum install poppler-utils
  - On macOS (using Homebrew):
    brew install poppler
- Build the pdf-parser:
  git clone <repository_url>
  cd pdf-parser
  go build -o pdf-parser main.go
To use the pdf-parser, run the following command:
./pdf-parser -input=<path_to_your_pdf_file> -output=<text|csv|json|parquet>
-input: (Required) The path to the input PDF file.
-output: (Optional) The output format. Can be text, csv, json, or parquet. Defaults to text.
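The flag behavior described above (a required -input, an optional -output that defaults to text) could be implemented with Go's standard flag package roughly as follows. This is a sketch under assumptions, not the tool's actual code: the `options` type, the `parseOptions` function, and the error messages are hypothetical.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"os"
)

// options holds the parsed command-line settings (hypothetical type).
type options struct {
	input  string
	output string
}

// validOutputs lists the output formats named in this README.
var validOutputs = map[string]bool{"text": true, "csv": true, "json": true, "parquet": true}

// parseOptions parses args and enforces the documented rules:
// -input is required, -output is optional and defaults to "text".
func parseOptions(args []string) (options, error) {
	fs := flag.NewFlagSet("pdf-parser", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the flag package from printing usage itself

	var opts options
	fs.StringVar(&opts.input, "input", "", "path to the input PDF file (required)")
	fs.StringVar(&opts.output, "output", "text", "output format: text, csv, json, or parquet")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	if opts.input == "" {
		return options{}, errors.New("-input is required")
	}
	if !validOutputs[opts.output] {
		return options{}, fmt.Errorf("unsupported -output value %q", opts.output)
	}
	return opts, nil
}

func main() {
	// Parse a fixed example command line so the sketch runs on its own.
	opts, err := parseOptions([]string{"-input=my_document.pdf", "-output=json"})
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(2)
	}
	fmt.Printf("input=%s output=%s\n", opts.input, opts.output)
}
```

Using a dedicated `flag.FlagSet` rather than the package-level flags keeps the parsing logic testable with arbitrary argument slices.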
- Text Output (default):
  ./pdf-parser -input=my_document.pdf
- CSV Output:
  ./pdf-parser -input=my_document.pdf -output=csv
- JSON Output:
  ./pdf-parser -input=my_document.pdf -output=json
- Parquet Output:
  ./pdf-parser -input=my_document.pdf -output=parquet
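This README does not specify the layout of the structured outputs, so the following is only one possible shape for the CSV and JSON formats. It assumes one record per line of extracted text with a 1-based line number. The `lineRecord` type, the column names, and the helper functions are all hypothetical. Parquet is omitted because it requires a third-party library.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// lineRecord is an assumed schema: one record per line of extracted text.
type lineRecord struct {
	Line int    `json:"line"`
	Text string `json:"text"`
}

// toRecords splits extracted text into numbered line records.
func toRecords(text string) []lineRecord {
	trimmed := strings.TrimRight(text, "\n")
	if trimmed == "" {
		return nil
	}
	var recs []lineRecord
	for i, l := range strings.Split(trimmed, "\n") {
		recs = append(recs, lineRecord{Line: i + 1, Text: l})
	}
	return recs
}

// toCSV renders the records as CSV with a header row.
func toCSV(recs []lineRecord) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write([]string{"line", "text"}); err != nil {
		return "", err
	}
	for _, r := range recs {
		if err := w.Write([]string{strconv.Itoa(r.Line), r.Text}); err != nil {
			return "", err
		}
	}
	w.Flush()
	return buf.String(), w.Error()
}

// toJSON renders the records as a JSON array.
func toJSON(recs []lineRecord) (string, error) {
	b, err := json.Marshal(recs)
	return string(b), err
}

func main() {
	recs := toRecords("Hello\nWorld, again\n")
	csvOut, _ := toCSV(recs)
	jsonOut, _ := toJSON(recs)
	fmt.Print(csvOut)
	fmt.Println(jsonOut)
}
```

The encoding/csv writer quotes fields that contain commas, so text lines with punctuation stay in a single column.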