A Python utility to extract embedded CSV files from PDF reports in bulk.
This project scans a folder of PDF reports, unpacks file attachments from each PDF using pdftk, and saves all extracted .csv files into an output directory with cleaned, prefixed filenames.
Many enterprise and reporting workflows generate PDF files that contain embedded CSV attachments. Manually opening each report and exporting attachments is slow and repetitive.
This script automates that process by:
- scanning a directory for PDF files
- extracting embedded attachments from each PDF
- filtering for CSV files only
- renaming the extracted CSVs using the source PDF filename
- saving everything into one output folder
Features:
- Bulk processing of PDF files
- Extracts embedded file attachments from PDFs
- Saves only CSV attachments
- Automatically renames output files to keep them organized
- Works well for report-processing and automation workflows
The script:
- Walks through a directory of PDF files
- Uses pdftk to unpack embedded files from each PDF
- Checks the extracted files for .csv attachments
- Renames each CSV using the original PDF filename as a prefix
- Moves the final files into the output directory
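The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of extractcsv.py: the function names `unpack_with_pdftk` and `extract_csvs`, the recursive search, and the temporary directory per PDF are all assumptions. The only external call is pdftk's `unpack_files` operation, which must be installed and on the PATH.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def unpack_with_pdftk(pdf_path, dest_dir):
    """Unpack every attachment of one PDF into dest_dir (hypothetical helper)."""
    # pdftk <input.pdf> unpack_files output <directory>
    subprocess.run(
        ["pdftk", str(pdf_path), "unpack_files", "output", str(dest_dir)],
        check=True,
    )


def extract_csvs(input_dir, output_dir, unpack=unpack_with_pdftk):
    """Collect CSV attachments from all PDFs under input_dir (hypothetical name).

    The unpack callable is injectable so the flow can be exercised
    without pdftk installed.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    # Assumption: subfolders are searched too; matching is case-sensitive.
    for pdf_path in sorted(input_dir.rglob("*.pdf")):
        # Unpack into a scratch directory so non-CSV attachments never
        # reach the output folder.
        with tempfile.TemporaryDirectory() as tmp:
            unpack(pdf_path, Path(tmp))
            for extracted in sorted(Path(tmp).iterdir()):
                if extracted.suffix.lower() != ".csv":
                    continue
                # Prefix the CSV name with the source PDF's base name.
                target = output_dir / f"{pdf_path.stem}_{extracted.name}"
                shutil.move(str(extracted), str(target))
                saved.append(target)
    return saved
```

Calling `extract_csvs("reports", "output")` would then process every PDF under a hypothetical `reports` folder. Note that this sketch overwrites an existing output file if two attachments map to the same name.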
If the input folder contains:
- report_january.pdf
- report_february.pdf
and those PDFs contain embedded CSV files such as:
- data.csv
- summary.csv
the output may look like:
- report_january_data.csv
- report_february_summary.csv
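The naming pattern in this example can be expressed as a small helper. This is only a sketch of the prefixing step: `prefixed_name` is a hypothetical name, and any further filename cleaning the script performs is left out here.

```python
from pathlib import Path


def prefixed_name(pdf_path, csv_name):
    """Build an output filename from the source PDF and the embedded CSV.

    Hypothetical helper: joins the PDF's base name (without extension)
    and the CSV's filename with an underscore.
    """
    return f"{Path(pdf_path).stem}_{Path(csv_name).name}"


print(prefixed_name("reports/report_january.pdf", "data.csv"))
# report_january_data.csv
```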
Project structure:
.
├── extractcsv.py
├── requirements.txt
└── README.md