One command to rule them all:

```bash
make dev-setup
```
A batteries-included Docker Compose setup for running a complete DuckLake lakehouse locally. Get PostgreSQL catalog + MinIO object storage + DuckDB with zero configuration fuss.
- **Complete Lakehouse Stack** - PostgreSQL catalog + MinIO S3 storage + DuckDB compute
- **Zero Configuration** - Everything works out of the box with sensible defaults
- **Easy Makefile Interface** - Simple commands for all operations
- **ACID Transactions** - Full lakehouse capabilities with versioning
- **Demo Data Included** - Pre-loaded with 44k+ gene records for testing
- **Docker Everything** - No local dependencies except Docker
- **Data Persistence** - Your data survives container restarts
## Quick Start

```bash
# Clone and start everything
git clone https://github.com/MattOates/quack.git
cd quack
make demo

# Or for development setup
make dev-setup

# Connect and start querying
make shell
```

That's it! You now have a fully functional lakehouse running locally.
- [Architecture](#architecture)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [Makefile Commands](#makefile-commands)
- [Configuration](#configuration)
- [Examples](#examples)
- [Troubleshooting](#troubleshooting)
- [References](#references)
## Architecture

DuckLake is an open-source lakehouse solution built on DuckDB, providing ACID transactions, versioning, and metadata management via pluggable catalogs.
```mermaid
flowchart TD
    subgraph "Catalog Layer"
        PG[(PostgreSQL<br/>Metadata & Transactions)]
    end
    subgraph "Storage Layer"
        S3[(MinIO<br/>S3-Compatible Object Store)]
    end
    subgraph "Compute Layer"
        Init[ducklake-init<br/>Orchestrator]
        DB[DuckDB<br/>Query Engine]
    end
    subgraph "Interface"
        Make[Makefile<br/>Easy Commands]
        Shell[Interactive Shell]
    end

    Make -->|make shell| Shell
    Make -->|make up| Init
    Init -->|health checks| PG
    Init -->|health checks| S3
    Init -->|creates bucket| S3
    Init -->|ATTACH lakehouse| DB
    DB -->|metadata| PG
    DB -->|data files| S3
    Shell -->|queries| DB
```
- **PostgreSQL**: Stores lakehouse metadata, transaction logs, and schema information
- **MinIO**: S3-compatible object storage for Parquet data files
- **DuckDB**: High-performance analytical query engine with lakehouse extensions
- **ducklake-init**: Python orchestrator that configures and initializes everything
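Once the stack is up you can inspect these components directly with standard Docker Compose commands. This sketch assumes the orchestrator's Compose service is named `ducklake-init`, as the diagram suggests:

```bash
# List the running services and their state
docker compose ps

# Follow the orchestrator's startup log (service name assumed from the diagram)
docker compose logs -f ducklake-init
```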
## Prerequisites

- Docker Desktop or Docker + Docker Compose v2.0+
- 4GB+ RAM recommended
- macOS, Linux, or Windows with WSL2
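A quick sanity check before installing, using only standard Docker commands:

```bash
# Confirm Docker and Compose v2 are on the PATH
docker --version
docker compose version

# Show how much memory the Docker engine can use (bytes; aim for ~4GB+)
docker info --format '{{.MemTotal}}'
```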
## Installation

### Demo with sample data

```bash
git clone https://github.com/MattOates/quack.git
cd quack
make demo
```

This will:

- Build all Docker images
- Start PostgreSQL + MinIO + DuckDB
- Load 44k+ gene records for testing
- Show you the results
### Development setup

```bash
git clone https://github.com/MattOates/quack.git
cd quack
make dev-setup
```

This will build and start everything without demo data.
### Manual setup

```bash
git clone https://github.com/MattOates/quack.git
cd quack
make build   # Build Docker images
make up      # Start services
make shell   # Connect to DuckDB
```

## Usage

```bash
# Start everything
make up
# Connect to DuckDB shell (with DuckLake pre-attached)
make shell
# Check service health
make health
# View logs
make logs
# Stop everything
make down
```

Once connected via `make shell`, DuckLake is automatically attached as `the_ducklake`:
```sql
-- Your lakehouse is ready to use!
USE the_ducklake;

-- Create a table from remote data
CREATE TABLE my_data AS
SELECT * FROM read_csv_auto('https://example.com/data.csv');

-- Query with full SQL support
SELECT COUNT(*) FROM my_data;

-- DuckLake handles ACID transactions automatically
INSERT INTO my_data VALUES ('new', 'row');
```
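Because the catalog lives in PostgreSQL and the data files in MinIO, both on persistent volumes, tables created here should survive a full stop/start cycle. A quick check using only the documented targets:

```bash
# Stop and restart the whole stack; data should survive (see Data Persistence)
make down
make up

# Reconnect, then run: SELECT COUNT(*) FROM my_data;
make shell
```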
## Makefile Commands

### Quick Start

| Command | Description |
|---|---|
| `make help` | Show all available commands |
| `make dev-setup` | Complete development setup (build + start) |
| `make demo` | Run demo with sample gene data |
| `make shell` | Connect to DuckDB with DuckLake attached |
### Service Management

| Command | Description |
|---|---|
| `make build` | Build all Docker images (clean) |
| `make build-quick` | Build using cache |
| `make up` | Start all services |
| `make down` | Stop all services |
| `make restart` | Restart everything |
### Monitoring

| Command | Description |
|---|---|
| `make health` | Check service health |
| `make status` | Show service status |
| `make logs` | View all logs |
| `make logs-init` | View DuckLake init logs |
| `make test-connection` | Test DuckLake connection |
### Access

| Command | Description |
|---|---|
| `make shell` | DuckDB shell |
| `make psql` | PostgreSQL shell |
| `make minio-console` | Open MinIO web console |
| `make info` | Show connection details |
### Data Management

| Command | Description |
|---|---|
| `make backup-data` | Create timestamped backup |
| `make restore-data BACKUP_FILE=backup.tar.gz` | Restore from backup |
| `make clean-data` | Remove all data (with confirmation) |
| `make reset` | Complete reset (stop, clean, rebuild, start) |
### Maintenance

| Command | Description |
|---|---|
| `make pull` | Pull latest Docker images |
| `make prune` | Clean up Docker resources |
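These targets compose into a cautious upgrade routine. A sketch using only the documented targets (the backup filename is illustrative; substitute the timestamped file that `make backup-data` creates):

```bash
# Snapshot data, refresh images, rebuild, and restart
make backup-data
make pull
make build-quick
make restart

# Roll back if something breaks (filename is illustrative)
make restore-data BACKUP_FILE=backup_20241201_120000.tar.gz
```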
## Configuration

All service configuration is controlled via environment variables in `docker-compose.yml`. The defaults work out of the box, but you can customize as needed:
| Variable | Default | Purpose |
|---|---|---|
| `POSTGRES_USER` | `ducklake` | PostgreSQL catalog username |
| `POSTGRES_PASSWORD` | `ducklake` | PostgreSQL catalog password |
| `POSTGRES_DB` | `ducklake_catalog` | PostgreSQL database name |
| `AWS_ACCESS_KEY_ID` | `minioadmin` | MinIO access key |
| `AWS_SECRET_ACCESS_KEY` | `minioadmin` | MinIO secret key |
| `AWS_ENDPOINT_URL` | `http://minio:9000` | S3 endpoint URL |
| `BUCKET` | `ducklake` | S3 bucket for data files |
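To try different credentials without editing files, you can override variables at startup. This is a sketch that assumes `docker-compose.yml` substitutes these values from the shell environment; if it hardcodes them, edit the file directly instead:

```bash
# Override selected defaults for this run only (assumes the compose file
# substitutes these variables; otherwise edit docker-compose.yml directly)
POSTGRES_PASSWORD=change-me \
AWS_SECRET_ACCESS_KEY=change-me-too \
docker compose up -d
```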
### Access Points

| Service | URL | Credentials |
|---|---|---|
| MinIO Console | http://localhost:9000 | minioadmin/minioadmin |
| PostgreSQL | localhost:5432 | ducklake/ducklake |
| DuckDB Shell | `make shell` | Pre-configured |
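The same endpoints are reachable with standard clients from the host, using the credentials above:

```bash
# Open a psql session against the catalog (password: ducklake)
psql -h localhost -p 5432 -U ducklake -d ducklake_catalog

# List the lakehouse's data files in MinIO via the AWS CLI
AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin \
  aws --endpoint-url http://localhost:9000 s3 ls s3://ducklake/ --recursive
```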
## Examples

### Loading CSV data

```sql
-- Connect via: make shell

-- Load remote CSV data into your lakehouse
CREATE TABLE companies AS
SELECT * FROM read_csv_auto(
    'https://example.com/companies.csv',
    header = true
);

-- Query with full analytical SQL
SELECT
    industry,
    COUNT(*) as company_count,
    AVG(revenue) as avg_revenue
FROM companies
GROUP BY industry
ORDER BY avg_revenue DESC;
```

### Parquet ETL

```sql
-- Load Parquet files from S3/MinIO
CREATE TABLE events AS
SELECT * FROM read_parquet('s3://ducklake/raw/events/*.parquet');

-- Transform and store back to lakehouse
CREATE TABLE daily_summary AS
SELECT
    DATE(timestamp) as date,
    event_type,
    COUNT(*) as event_count
FROM events
GROUP BY DATE(timestamp), event_type;
```

### Time travel

```sql
-- DuckLake supports versioning and time travel
SELECT * FROM my_table AT (TIMESTAMP => TIMESTAMP '2024-01-01 10:00:00');

-- View table history via the lakehouse's snapshots
SELECT * FROM ducklake_snapshots('the_ducklake');
```

### External DuckDB clients

If you want to connect to the lakehouse from an external DuckDB client (outside the container), use this configuration:

```sql
-- Install required extensions
INSTALL ducklake;
INSTALL postgres;
INSTALL httpfs;
-- Configure S3 settings for MinIO
SET s3_url_style = 'path';
SET s3_endpoint = 'localhost:9000'; -- Note: localhost, not minio
SET s3_access_key_id = 'minioadmin';
SET s3_secret_access_key = 'minioadmin';
SET s3_region = 'us-east-1';
SET s3_use_ssl = false;
-- Attach to your lakehouse
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost user=ducklake password=ducklake'
AS my_lakehouse (DATA_PATH 's3://ducklake/lake/');
-- Now you can use it
USE my_lakehouse;
SHOW TABLES;
```

### Python client

You can also connect to `the_ducklake` from your own Python applications. Here's a complete example:
Install the dependencies first:

```bash
uv add --prerelease=allow duckdb boto3 psycopg2-binary
```

or

```bash
pip install --pre duckdb boto3 psycopg2-binary
```

Then run the script:

```python
#!/usr/bin/env -S uv run --script --prerelease=allow
#
# /// script
# requires-python = ">=3.13.5"
# dependencies = [
# "boto3",
# "psycopg",
# "duckdb",
# ]
# ///
import duckdb

def connect_to_ducklake():
    """Connect to the DuckLake lakehouse from Python."""
    # Create DuckDB connection
    con = duckdb.connect()

    # Install required extensions
    con.install_extension("ducklake")
    con.install_extension("postgres")
    con.install_extension("httpfs")

    # Configure S3/MinIO settings
    s3_config = {
        "s3_url_style": "path",
        "s3_endpoint": "localhost:9000",  # External connection
        "s3_access_key_id": "minioadmin",
        "s3_secret_access_key": "minioadmin",
        "s3_region": "us-east-1",
        "s3_use_ssl": "false",
    }
    for key, value in s3_config.items():
        con.execute(f"SET {key}='{value}';")

    # Attach to DuckLake
    attach_sql = """
        ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost user=ducklake password=ducklake'
        AS the_ducklake (DATA_PATH 's3://ducklake/lake/');
    """
    con.execute(attach_sql)
    con.execute("USE the_ducklake;")
    return con


def main():
    """Example usage of DuckLake connection."""
    # Make sure your quack stack is running first!
    #   cd /path/to/quack && make up
    con = connect_to_ducklake()

    # Example 1: List available tables
    tables = con.execute("SHOW TABLES;").fetchall()
    print("Available tables:", tables)

    # Example 2: Create a sample table
    con.execute("""
        CREATE OR REPLACE TABLE python_example AS
        SELECT
            'hello' as greeting,
            'world' as target,
            42 as meaning_of_life,
            current_timestamp as created_at
    """)

    # Example 3: Query the data
    result = con.execute("SELECT * FROM python_example;").fetchall()
    print("Sample data:", result)

    # Example 4: Generate a synthetic dataset
    con.execute("""
        CREATE OR REPLACE TABLE sample_data AS
        SELECT
            row_number() OVER () as id,
            'user_' || row_number() OVER () as username,
            random() * 100 as score
        FROM generate_series(1, 1000)
    """)

    # Example 5: Analytical query
    stats = con.execute("""
        SELECT
            COUNT(*) as total_records,
            AVG(score) as avg_score,
            MIN(score) as min_score,
            MAX(score) as max_score
        FROM sample_data
    """).fetchone()
    print(f"Dataset stats: {stats}")

    # Example 6: Work with time series data
    con.execute("""
        CREATE OR REPLACE TABLE time_series AS
        SELECT
            (current_date - INTERVAL (row_number() OVER ()) DAY) as date,
            random() * 1000 as value
        FROM generate_series(1, 365)
    """)
    monthly_summary = con.execute("""
        SELECT
            date_trunc('month', date) as month,
            COUNT(*) as records,
            AVG(value) as avg_value,
            SUM(value) as total_value
        FROM time_series
        GROUP BY date_trunc('month', date)
        ORDER BY month
    """).fetchall()
    print("Monthly summary:", monthly_summary[:3])  # Show first 3 months

    # Don't forget to close the connection
    con.close()
    print("✅ DuckLake connection example completed!")


if __name__ == "__main__":
    main()
```

### Production pattern

For production applications, consider using connection pooling and environment variables:

```python
import os
from contextlib import contextmanager

import duckdb


@contextmanager
def ducklake_connection():
    """Context manager for DuckLake connections."""
    con = None
    try:
        con = duckdb.connect()

        # Install extensions
        for ext in ["ducklake", "postgres", "httpfs"]:
            con.execute(f"INSTALL {ext};")

        # Configure from environment
        s3_config = {
            "s3_url_style": "path",
            "s3_endpoint": os.getenv("DUCKLAKE_S3_ENDPOINT", "localhost:9000"),
            "s3_access_key_id": os.getenv("DUCKLAKE_S3_ACCESS_KEY", "minioadmin"),
            "s3_secret_access_key": os.getenv("DUCKLAKE_S3_SECRET_KEY", "minioadmin"),
            "s3_region": os.getenv("DUCKLAKE_S3_REGION", "us-east-1"),
            "s3_use_ssl": "false",
        }
        for key, value in s3_config.items():
            con.execute(f"SET {key}='{value}';")

        # Connection details from environment
        pg_host = os.getenv("DUCKLAKE_PG_HOST", "localhost")
        pg_user = os.getenv("DUCKLAKE_PG_USER", "ducklake")
        pg_pass = os.getenv("DUCKLAKE_PG_PASSWORD", "ducklake")
        pg_db = os.getenv("DUCKLAKE_PG_DB", "ducklake_catalog")
        bucket = os.getenv("DUCKLAKE_BUCKET", "ducklake")

        attach_sql = f"""
            ATTACH 'ducklake:postgres:dbname={pg_db} host={pg_host} user={pg_user} password={pg_pass}'
            AS the_ducklake (DATA_PATH 's3://{bucket}/lake/');
        """
        con.execute(attach_sql)
        con.execute("USE the_ducklake;")
        yield con
    finally:
        if con:
            con.close()


# Usage with context manager
def analyze_data():
    with ducklake_connection() as con:
        # Your analysis code here
        result = con.execute("SELECT COUNT(*) FROM your_table;").fetchone()
        return result
```

### Environment variables

For the advanced example, you can set these environment variables:

```bash
export DUCKLAKE_S3_ENDPOINT="localhost:9000"
export DUCKLAKE_S3_ACCESS_KEY="minioadmin"
export DUCKLAKE_S3_SECRET_KEY="minioadmin"
export DUCKLAKE_PG_HOST="localhost"
export DUCKLAKE_PG_USER="ducklake"
export DUCKLAKE_PG_PASSWORD="ducklake"
export DUCKLAKE_PG_DB="ducklake_catalog"
export DUCKLAKE_BUCKET="ducklake"
```

## Troubleshooting

### Services won't start

```bash
# Check service status
make status
# Check health
make health
# View logs for issues
make logs
# Nuclear option - complete reset
make reset
```

### Connection issues

```bash
# Test the connection
make test-connection
# Check if services are healthy
make health
# View specific service logs
make logs-postgres   # or logs-minio, logs-init
```

### Performance issues

- **Increase Docker memory**: Ensure Docker has at least 4GB RAM allocated
- **Check disk space**: Ensure sufficient space in the `./data/` directory
- **Monitor logs**: Use `make logs` to check for errors
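When chasing performance problems, plain Docker and Unix tooling shows where the resources are going:

```bash
# Point-in-time resource usage for the running containers
docker stats --no-stream

# Free space on the filesystem backing ./data/
df -h ./data
```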
### Data issues

```bash
# Create backup before troubleshooting
make backup-data

# Clean slate restart
make clean-data   # ⚠️ This deletes all data!
make dev-setup

# Restore from backup if needed
make restore-data BACKUP_FILE=backup_20241201_120000.tar.gz
```

## Contributing

- Fork the repository
- Create a feature branch: `git checkout -b my-feature`
- Make changes and test with `make dev-setup`
- Commit changes: `git commit -am 'Add feature'`
- Push to branch: `git push origin my-feature`
- Open a Pull Request
## References

- DuckLake Documentation - Official DuckLake docs
- DuckDB Extensions Guide - DuckDB lakehouse extensions
- Docker Compose Reference - Container orchestration
- Apache Iceberg - Alternative lakehouse format
- DuckDB - Analytical database engine
- Apache Arrow - Columnar data format
🦆 Happy Quacking! 🦆

Made with ❤️ for the data community

Star this repo • Report Issues • Documentation