A high-performance CLI tool for archiving PostgreSQL partitioned table data to S3-compatible object storage.
- Parallel Processing - Archive multiple partitions concurrently with configurable workers
- Beautiful Progress UI - Real-time progress tracking with dual progress bars
- Embedded Cache Viewer - Beautiful web interface with real-time updates:
- WebSocket Live Updates - Real-time data streaming without polling
- Interactive task monitoring showing current partition and operation
- Clickable partition names to jump directly to table row
- Shows archiver status (running/idle) with PID tracking
- Live statistics: total partitions, sizes, compression ratios
- Sortable table with S3 upload status indicators
- Smooth animations highlight data changes
- Error tracking with timestamps
- Auto-reconnecting WebSocket for reliability
- Intelligent Caching - Advanced caching system for maximum efficiency:
- Caches row counts for 24 hours (refreshed daily)
- Caches file metadata permanently (size, MD5, compression ratio)
- Tracks errors with timestamps
- Skip extraction/compression entirely when cached metadata matches S3
- Data Integrity - Comprehensive file integrity verification:
- Size comparison (both compressed and uncompressed)
- MD5 hash verification for single-part uploads
- Multipart ETag verification for large files (>100MB)
- Automatic multipart upload for files >100MB
- Smart Compression - Uses Zstandard compression with multi-core support
- Intelligent Resume - Three-level skip detection:
- Fast skip using cached metadata (no extraction needed)
- Skip if S3 file matches after local processing
- Re-upload if size or hash differs
- Flexible Partition Support - Handles multiple partition naming formats:
  - `table_YYYYMMDD` (e.g., `messages_20240315`)
  - `table_pYYYYMMDD` (e.g., `messages_p20240315`)
  - `table_YYYY_MM` (e.g., `messages_2024_03`)
- Go 1.21 or higher
- PostgreSQL database with partitioned tables (format: `tablename_YYYYMMDD`)
- S3-compatible object storage (Hetzner, AWS S3, MinIO, etc.)
git clone https://github.com/airframes/postgresql-archiver.git
cd postgresql-archiver
go build -o postgresql-archiver
Or install directly:
go install github.com/airframes/postgresql-archiver@latest
postgresql-archiver \
--db-user myuser \
--db-password mypass \
--db-name mydb \
--table flights \
--s3-endpoint https://fsn1.your-objectstorage.com \
--s3-bucket my-archive-bucket \
--s3-access-key YOUR_ACCESS_KEY \
--s3-secret-key YOUR_SECRET_KEY \
--start-date 2024-01-01 \
--end-date 2024-01-31
postgresql-archiver [flags]
PostgreSQL Archiver
A CLI tool to efficiently archive PostgreSQL partitioned table data to object storage.
Extracts data by day, converts to JSONL, compresses with zstd, and uploads to S3-compatible storage.
Usage:
postgresql-archiver [flags]
Flags:
--cache-viewer start embedded cache viewer web server
--config string config file (default is $HOME/.postgresql-archiver.yaml)
--db-host string PostgreSQL host (default "localhost")
--db-name string PostgreSQL database name
--db-password string PostgreSQL password
--db-port int PostgreSQL port (default 5432)
--db-user string PostgreSQL user
-d, --debug enable debug output
--dry-run perform a dry run without uploading
--end-date string end date (YYYY-MM-DD) (default "2025-08-27")
-h, --help help for postgresql-archiver
--s3-access-key string S3 access key
--s3-bucket string S3 bucket name
--s3-endpoint string S3-compatible endpoint URL
--s3-region string S3 region (default "auto")
--s3-secret-key string S3 secret key
--skip-count skip counting rows (faster startup, no progress bars)
--start-date string start date (YYYY-MM-DD)
--table string base table name (required)
--viewer-port int port for cache viewer web server (default 8080)
--workers int number of parallel workers (default 4)
- `--table` - Base table name (without date suffix)
- `--db-user` - PostgreSQL username
- `--db-name` - PostgreSQL database name
- `--s3-endpoint` - S3-compatible endpoint URL
- `--s3-bucket` - S3 bucket name
- `--s3-access-key` - S3 access key
- `--s3-secret-key` - S3 secret key
The tool supports three configuration methods (in order of precedence):
- Command-line flags (highest priority)
- Environment variables (prefix: `ARCHIVE_`)
- Configuration file (lowest priority)
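This layering is what Viper (see Acknowledgements) provides out of the box. Below is a minimal, illustrative sketch of how such precedence is typically wired; it is not the tool's exact setup, and the flag shown is just an example:

```go
package main

import (
	"fmt"

	"github.com/spf13/pflag"
	"github.com/spf13/viper"
)

func main() {
	// Command-line flags (highest priority).
	pflag.Int("workers", 4, "number of parallel workers")
	pflag.Parse()
	_ = viper.BindPFlags(pflag.CommandLine)

	// Environment variables with the ARCHIVE_ prefix (middle priority),
	// e.g. ARCHIVE_WORKERS=8.
	viper.SetEnvPrefix("ARCHIVE")
	viper.AutomaticEnv()

	// Config file (lowest priority).
	viper.SetConfigName(".postgresql-archiver")
	viper.SetConfigType("yaml")
	viper.AddConfigPath("$HOME")
	_ = viper.ReadInConfig() // missing file is not an error here

	fmt.Println("workers:", viper.GetInt("workers"))
}
```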
export ARCHIVE_DB_HOST=localhost
export ARCHIVE_DB_PORT=5432
export ARCHIVE_DB_USER=myuser
export ARCHIVE_DB_PASSWORD=mypass
export ARCHIVE_DB_NAME=mydb
export ARCHIVE_S3_ENDPOINT=https://fsn1.your-objectstorage.com
export ARCHIVE_S3_BUCKET=my-bucket
export ARCHIVE_S3_ACCESS_KEY=your_key
export ARCHIVE_S3_SECRET_KEY=your_secret
export ARCHIVE_TABLE=flights
export ARCHIVE_WORKERS=8
export ARCHIVE_CACHE_VIEWER=true
export ARCHIVE_VIEWER_PORT=8080
Create `~/.postgresql-archiver.yaml`:
db:
host: localhost
port: 5432
user: myuser
password: mypass
name: mydb
s3:
endpoint: https://fsn1.your-objectstorage.com
bucket: my-archive-bucket
access_key: your_access_key
secret_key: your_secret_key
region: auto
table: flights
workers: 8
start_date: "2024-01-01"
end_date: "2024-12-31"
cache_viewer: false # Enable embedded cache viewer
viewer_port: 8080 # Port for cache viewer web server
Files are organized in S3 with the following structure:
bucket/
└── export/
    └── table_name/
        └── YYYY/
            └── MM/
                └── YYYY-MM-DD.jsonl.zst
Example:
my-bucket/
└── export/
    └── flights/
        └── 2024/
            └── 01/
                ├── 2024-01-01.jsonl.zst
                ├── 2024-01-02.jsonl.zst
                └── 2024-01-03.jsonl.zst
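A small sketch of how a partition date maps onto this key layout; the `buildKey` helper is hypothetical and shown only for illustration:

```go
package main

import (
	"fmt"
	"time"
)

// buildKey derives the export/<table>/<YYYY>/<MM>/<YYYY-MM-DD>.jsonl.zst
// object key shown in the layout above.
func buildKey(table string, day time.Time) string {
	return fmt.Sprintf("export/%s/%04d/%02d/%s.jsonl.zst",
		table, day.Year(), int(day.Month()), day.Format("2006-01-02"))
}

func main() {
	day := time.Date(2024, time.January, 1, 0, 0, 0, 0, time.UTC)
	fmt.Println(buildKey("flights", day)) // export/flights/2024/01/2024-01-01.jsonl.zst
}
```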
The archiver includes an embedded web server for monitoring cache and progress:
# Start archiver with embedded cache viewer
postgresql-archiver --cache-viewer --viewer-port 8080 [other options]
# Or run standalone cache viewer
postgresql-archiver cache-viewer --port 8080
Features:
- WebSocket Real-time Updates: Live data streaming with automatic reconnection
- Interactive Status Panel:
- Shows current partition being processed with clickable link
- Displays specific operation (e.g., "Checking if exists", "Extracting", "Compressing", "Uploading")
- Progress bar with completion percentage and partition count
- Elapsed time tracking
- Visual Change Detection: Smooth animations highlight updated cells and stats
- S3 Upload Status: Shows which files are uploaded vs only processed locally
- Comprehensive Metrics: Shows both compressed and uncompressed sizes
- Compression Ratios: Visual display of space savings
- Error Tracking: Displays last error and timestamp for failed partitions
- Smart Rendering: No page flashing - only updates changed values
- Sortable Columns: Click any column header to sort (default: partition name)
- File Counts: Shows total partitions, processed, uploaded, and errors
- Process Monitoring: Checks if archiver is currently running via PID
- Connection Status: Visual indicator shows WebSocket connection state
Access the viewer at http://localhost:8080 (or your configured port).
The cache viewer uses modern web technologies for optimal performance:
- WebSocket Protocol: Bi-directional communication for instant updates
- Automatic Reconnection: Reconnects every 2 seconds if connection drops
- File System Monitoring: Watches cache directory for changes (500ms intervals)
- Efficient Updates: Only transmits and renders changed data
- No Polling Overhead: WebSocket eliminates the need for HTTP polling
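For orientation, here is a stripped-down sketch of a Gorilla WebSocket push handler of this kind. The handler name, payload shape, and tick interval are assumptions, not the viewer's actual internals:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{
	// Allow the viewer page served on the same host to connect.
	CheckOrigin: func(r *http.Request) bool { return true },
}

func wsHandler(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Println("upgrade:", err)
		return
	}
	defer conn.Close()

	// Push a status snapshot periodically; a real implementation would
	// push only when the cache or task file actually changes.
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		msg := map[string]any{"type": "status", "updatedAt": time.Now()}
		if err := conn.WriteJSON(msg); err != nil {
			return // client went away; its page will auto-reconnect
		}
	}
}

func main() {
	http.HandleFunc("/ws", wsHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```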
The tool features a beautiful terminal UI with:
- Per-partition progress bar: Shows real-time progress for data extraction, compression, and upload
- Overall progress bar: Tracks completion across all partitions
- Live statistics: Displays elapsed time, estimated remaining time, and recent completions
- Row counter: Shows progress through large tables during extraction
The tool automatically discovers partitions matching these naming patterns:
- Daily partitions (standard): `{base_table}_YYYYMMDD`
  - Example: `flights_20240101`, `flights_20240102`
- Daily partitions (with prefix): `{base_table}_pYYYYMMDD`
  - Example: `flights_p20240101`, `flights_p20240102`
- Monthly partitions: `{base_table}_YYYY_MM`
  - Example: `flights_2024_01`, `flights_2024_02`
  - Note: Monthly partitions are processed as the first day of the month

For example, if your base table is `flights`, the tool will find and process all of these:
- `flights_20240101` (daily)
- `flights_p20240102` (daily with prefix)
- `flights_2024_01` (monthly)
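A minimal sketch of the pattern matching this implies; the regular expressions and helper below are illustrative rather than the tool's exact discovery code:

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

var (
	dailyRe   = regexp.MustCompile(`^(.+)_p?(\d{8})$`)       // table_YYYYMMDD / table_pYYYYMMDD
	monthlyRe = regexp.MustCompile(`^(.+)_(\d{4})_(\d{2})$`) // table_YYYY_MM
)

// partitionDate extracts the partition's date; monthly partitions
// resolve to the first day of the month.
func partitionDate(name string) (time.Time, bool) {
	if m := dailyRe.FindStringSubmatch(name); m != nil {
		t, err := time.Parse("20060102", m[2])
		return t, err == nil
	}
	if m := monthlyRe.FindStringSubmatch(name); m != nil {
		t, err := time.Parse("2006_01", m[2]+"_"+m[3])
		return t, err == nil
	}
	return time.Time{}, false
}

func main() {
	for _, p := range []string{"flights_20240101", "flights_p20240102", "flights_2024_01"} {
		d, ok := partitionDate(p)
		fmt.Println(p, d.Format("2006-01-02"), ok)
	}
}
```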
Each row from the partition is exported as a single JSON object on its own line:
{"id":1,"flight_number":"AA123","departure":"2024-01-01T10:00:00Z"}
{"id":2,"flight_number":"UA456","departure":"2024-01-01T11:00:00Z"}
Uses Facebook's Zstandard compression with:
- Multi-core parallel compression
- "Better Compression" preset for optimal size/speed balance
- Typically achieves 5-10x compression ratios on JSON data
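A minimal sketch of the corresponding klauspost/compress setup, assuming the options implied above (better-compression level, one encoder goroutine per core):

```go
package archive

import (
	"io"
	"runtime"

	"github.com/klauspost/compress/zstd"
)

// compress copies src into dst as a zstd stream, using all CPU cores
// and the "better compression" preset described above.
func compress(dst io.Writer, src io.Reader) error {
	enc, err := zstd.NewWriter(dst,
		zstd.WithEncoderLevel(zstd.SpeedBetterCompression),
		zstd.WithEncoderConcurrency(runtime.NumCPU()),
	)
	if err != nil {
		return err
	}
	if _, err := io.Copy(enc, src); err != nil {
		enc.Close()
		return err
	}
	return enc.Close() // flushes the final frame; skipping this truncates the archive
}
```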
Files are skipped if:
- They already exist in S3 with the same path
- The file size matches (prevents re-uploading identical data)
Enable debug mode for detailed output:
postgresql-archiver --debug --table flights ...
Debug mode shows:
- Database connection details
- Discovered partitions and row counts
- Extraction progress (every 10,000 rows)
- Compression ratios
- Upload destinations
- Detailed error messages
Test your configuration without uploading:
postgresql-archiver --dry-run --table flights ...
This will:
- Connect to the database
- Discover partitions
- Extract and compress data
- Calculate file sizes and MD5 hashes
- Skip the actual upload
The archiver uses an intelligent two-tier caching system to maximize performance:
- Caches partition row counts for 24 hours
- Speeds up progress bar initialization
- Always recounts today's partition for accuracy
- Cache location: `~/.postgresql-archiver/cache/{table}_metadata.json`
- Caches compressed/uncompressed sizes, MD5 hash, and S3 upload status
- Tracks whether files have been successfully uploaded to S3
- Enables fast skipping without extraction/compression on subsequent runs
- Validates against S3 metadata before skipping
- Preserves all metadata when updating row counts
- Stores error messages with timestamps for failed uploads
- File metadata is kept permanently (only row counts expire after 24 hours)
On subsequent runs with cached metadata:
- Check cached size/MD5 against S3 (milliseconds)
- Skip extraction and compression if match found
- Result: 100-1000x faster for already-processed partitions
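An illustrative sketch of what a cached entry and the fast-path check might look like; the field names are assumptions, not the actual cache schema:

```go
package cache

import "time"

// FileMetadata is an illustrative shape for a per-partition cache entry.
type FileMetadata struct {
	Partition        string    `json:"partition"`
	CompressedSize   int64     `json:"compressed_size"`
	UncompressedSize int64     `json:"uncompressed_size"`
	MD5              string    `json:"md5"`
	Uploaded         bool      `json:"uploaded"`
	LastError        string    `json:"last_error,omitempty"`
	LastErrorAt      time.Time `json:"last_error_at,omitempty"`
}

// canSkip reports whether extraction and compression can be skipped:
// the cache must record a successful upload and the object in S3 must
// still match the cached size and MD5/ETag.
func canSkip(cached FileMetadata, s3Size int64, s3ETag string) bool {
	return cached.Uploaded &&
		cached.CompressedSize == s3Size &&
		cached.MD5 == s3ETag
}
```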
The archiver provides real-time monitoring capabilities:
- Creates PID file at `~/.postgresql-archiver/archiver.pid` when running
- Allows external tools to check if archiver is active
- Automatically cleaned up on exit
- Writes current task details to `~/.postgresql-archiver/current_task.json`
- Includes:
- Current operation (connecting, counting, extracting, uploading)
- Progress percentage
- Total and completed partitions
- Start time and last update time
- Updated in real-time during processing
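A simplified sketch of the PID-file and task-file writes described above; the JSON field names are illustrative:

```go
package status

import (
	"encoding/json"
	"os"
	"path/filepath"
	"strconv"
	"time"
)

// writePIDFile records the current process ID so external tools
// (such as the cache viewer) can tell whether the archiver is running.
func writePIDFile(dir string) error {
	pid := strconv.Itoa(os.Getpid())
	return os.WriteFile(filepath.Join(dir, "archiver.pid"), []byte(pid), 0o644)
}

// TaskStatus mirrors the kind of information written to current_task.json.
type TaskStatus struct {
	Operation string    `json:"operation"` // e.g. "extracting", "uploading"
	Partition string    `json:"partition"`
	Completed int       `json:"completed"`
	Total     int       `json:"total"`
	StartedAt time.Time `json:"started_at"`
	UpdatedAt time.Time `json:"updated_at"`
}

// writeTask refreshes the task file so the viewer sees live progress.
func writeTask(dir string, t TaskStatus) error {
	t.UpdatedAt = time.Now()
	b, err := json.MarshalIndent(t, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "current_task.json"), b, 0o644)
}
```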
The cache viewer provides REST API and WebSocket endpoints:
- `/api/cache` - Returns all cached metadata (REST)
- `/api/status` - Returns archiver running status and current task (REST)
- `/ws` - WebSocket endpoint for real-time updates
  - Sends cache updates when files change
  - Streams status updates during archiving
  - Automatic reconnection support
The archiver ensures data integrity through multiple verification methods:
- Calculates MD5 hash of compressed data
- Compares with S3 ETag (which is MD5 for single-part uploads)
- Only skips if both size and MD5 match exactly
- Automatically uses multipart upload for large files
- Calculates multipart ETag using S3's algorithm
- Verifies size and multipart ETag match before skipping
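S3's multipart ETag is the MD5 of the concatenated per-part MD5s, suffixed with the part count. A sketch of recomputing it locally, assuming 100 MB parts:

```go
package integrity

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
)

const partSize = 100 << 20 // 100 MB, matching the multipart threshold above

// multipartETag computes the S3-style ETag for a multipart upload:
// md5(md5(part1) || md5(part2) || ...) followed by "-<number of parts>".
func multipartETag(r io.Reader) (string, error) {
	var partSums []byte
	parts := 0
	buf := make([]byte, partSize)
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			sum := md5.Sum(buf[:n])
			partSums = append(partSums, sum[:]...)
			parts++
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break
		}
		if err != nil {
			return "", err
		}
	}
	final := md5.Sum(partSums)
	return fmt.Sprintf("%s-%d", hex.EncodeToString(final[:]), parts), nil
}
```

The result is only comparable when the original upload used the same part size, which is why the fixed threshold matters for verification.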
- First Run: Extract → Compress → Calculate MD5 → Upload → Cache metadata
- Subsequent Runs with Cache: Check cache → Compare with S3 → Skip if match
- Subsequent Runs without Cache: Extract → Compress → Calculate MD5 → Compare with S3 → Skip or upload
postgresql-archiver \
--table events \
--start-date $(date -d '30 days ago' +%Y-%m-%d) \
--config ~/.archive-config.yaml
postgresql-archiver \
--table transactions \
--start-date 2024-06-01 \
--end-date 2024-06-30 \
--debug \
--workers 8
postgresql-archiver \
--config production.yaml \
--table orders \
--dry-run \
--debug
The tool provides detailed error messages for common issues:
- Database Connection: Checks connectivity before processing
- Partition Discovery: Reports invalid partition formats
- Data Extraction: Handles large datasets with streaming
- Compression: Reports compression failures and ratios
- S3 Upload: Retries on transient failures
- Configuration: Validates all required parameters
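Transient-failure retries usually amount to a small exponential backoff loop; an illustrative sketch (the exact retry policy here is an assumption, not the tool's documented behavior):

```go
package retry

import (
	"context"
	"time"
)

// withRetry retries fn up to attempts times, doubling the delay after
// each failure, and gives up early if the context is cancelled.
func withRetry(ctx context.Context, attempts int, fn func() error) error {
	delay := time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		select {
		case <-time.After(delay):
			delay *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}
```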
- Increase Workers: Use `--workers` to process more partitions in parallel
- Network: Ensure good bandwidth to S3 endpoint
- Database: Add indexes on date columns for faster queries
- Memory: Tool streams data to minimize memory usage
- Compression: Multi-core zstd scales with CPU cores
The project includes a comprehensive test suite covering:
- Cache Operations: Row count and file metadata caching, TTL expiration, legacy migration
- Configuration Validation: Required fields, default values, date formats
- Process Management: PID file operations, task tracking, process status checks
Run tests with:
# Run all tests
go test ./...
# Run with verbose output
go test -v ./...
# Run with coverage
go test -cover ./...
# Run specific tests
go test -run TestPartitionCache ./cmd
Build and run with Docker:
# Build the Docker image
docker build -t postgresql-archiver .
# Run with environment variables
docker run --rm \
-e ARCHIVE_DB_HOST=host.docker.internal \
-e ARCHIVE_DB_USER=myuser \
-e ARCHIVE_DB_PASSWORD=mypass \
-e ARCHIVE_DB_NAME=mydb \
-e ARCHIVE_S3_ENDPOINT=https://s3.example.com \
-e ARCHIVE_S3_BUCKET=my-bucket \
-e ARCHIVE_S3_ACCESS_KEY=key \
-e ARCHIVE_S3_SECRET_KEY=secret \
-e ARCHIVE_TABLE=events \
postgresql-archiver
# Run with config file
docker run --rm \
-v ~/.postgresql-archiver.yaml:/root/.postgresql-archiver.yaml \
postgresql-archiver
- Go 1.21+
- PostgreSQL database for testing
- S3-compatible storage for testing
# Clone the repository
git clone https://github.com/airframesio/postgresql-archiver.git
cd postgresql-archiver
# Install dependencies
go mod download
# Build the binary
go build -o postgresql-archiver
# Run tests
go test ./...
# Build for different platforms
GOOS=linux GOARCH=amd64 go build -o postgresql-archiver-linux-amd64
GOOS=darwin GOARCH=arm64 go build -o postgresql-archiver-darwin-arm64
GOOS=windows GOARCH=amd64 go build -o postgresql-archiver.exe
The project uses GitHub Actions for continuous integration:
- Test Matrix: Tests on Go 1.21.x and 1.22.x
- Platforms: Linux, macOS, Windows
- Coverage: Runs tests with coverage reporting
- Linting: Ensures code quality with golangci-lint
- Binary Builds: Creates binaries for multiple platforms
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
MIT License - see LICENSE file for details
Built with these awesome libraries:
- Charmbracelet - Beautiful CLI components
- Cobra - CLI framework
- Viper - Configuration management
- klauspost/compress - Fast zstd compression
- AWS SDK for Go - S3 integration
- Gorilla WebSocket - WebSocket implementation