A production-grade data platform for a Stripe-style fintech, combining transactional storage, real-time change data capture, analytical warehousing, and batch orchestration. The project demonstrates an end-to-end flow from OLTP writes to analytics-ready marts, with infrastructure described as code.
┌─────────────┐     ┌──────────┐     ┌──────────────┐     ┌──────────────┐
│ PostgreSQL  │────▶│ Debezium │────▶│ Apache Kafka │────▶│  Consumers   │
│   (OLTP)    │     │  (CDC)   │     │ (Streaming)  │     │              │
└─────────────┘     └──────────┘     └──────┬───────┘     └──────┬───────┘
                                            │                    │
                                            ▼                    ▼
                                     ┌──────────────┐     ┌──────────────┐
                                     │   MongoDB    │     │  Snowflake   │
                                     │   (NoSQL)    │     │   (OLAP)     │
                                     └──────────────┘     └──────────────┘
                                            │                    │
                                            ▼                    ▼
                                     ┌──────────────┐     ┌──────────────┐
                                     │  ML Models   │     │  dbt Models  │
                                     │  (FastAPI)   │     │ (Transform)  │
                                     └──────────────┘     └──────────────┘

┌─────────────────────────────────────────────┐
│       Apache Airflow (Orchestration)        │
│   DAGs: ETL batch + refresh + monitoring    │
└─────────────────────────────────────────────┘
PostgreSQL is the system of record. Debezium reads the write-ahead log and streams row-level changes into Kafka, so the OLTP database is never queried for replication. Consumers fan the stream out to MongoDB (semi-structured data and ML features) and to the analytical warehouse. dbt transforms raw tables into a star schema, and Airflow orchestrates the batch ETL, the materialized-view refreshes, and the cross-system data-quality checks.
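To make the fan-out step concrete, here is a minimal sketch of how a consumer might translate one Debezium change event into a document-store action. It relies on the standard Debezium envelope (`op`, `before`, `after`); the function name `to_mongo_action`, the `id` key field, and the shape of the returned dictionary are assumptions for illustration, not the behaviour of `scripts/kafka_consumer.py`.

```python
from typing import Any, Optional


def to_mongo_action(event: Optional[dict[str, Any]],
                    key_field: str = "id") -> Optional[dict[str, Any]]:
    """Translate one Debezium change event into an upsert or delete action.

    `key_field` is a hypothetical primary-key column name.
    """
    if event is None:
        # Tombstone records (null value) follow deletes; nothing to apply.
        return None

    # Events may arrive wrapped as {"schema": ..., "payload": ...} or unwrapped.
    payload = event.get("payload", event)
    op = payload.get("op")

    if op in ("c", "u", "r"):  # create, update, snapshot read
        row = payload["after"]
        return {"action": "upsert",
                "filter": {key_field: row[key_field]},
                "document": row}
    if op == "d":  # delete: only the "before" image carries the key
        row = payload["before"]
        return {"action": "delete", "filter": {key_field: row[key_field]}}
    return None  # other operations (for example truncate) are skipped here
```

Treating creates, updates, and snapshot reads as the same upsert keeps the consumer idempotent, so replaying a Kafka partition after a restart should converge to the same state.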
- OLTP: PostgreSQL 16 (range partitioning, logical replication, UUID keys)
- OLAP: Snowflake (star schema)
- NoSQL staging: MongoDB 7 (feature store, TTL and text indexes)
- Streaming / CDC: Apache Kafka + Debezium PostgreSQL connector
- Orchestration: Apache Airflow 2.8
- Transformation: dbt (staging and marts models)
- Serving: FastAPI fraud-scoring API
- Infrastructure as Code: Terraform (AWS)
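As an illustration of the range partitioning mentioned for PostgreSQL, the sketch below generates the DDL for one monthly partition. It assumes a parent table that is already declared with `PARTITION BY RANGE` on a date or timestamp column; the table name `transactions` and the monthly granularity are hypothetical, and the actual scheme in `sql/02_oltp_indexes.sql` may differ.

```python
from datetime import date


def monthly_partition_ddl(parent: str, year: int, month: int) -> str:
    """Build the DDL for one monthly range partition of `parent`.

    `parent` is assumed to be a trusted identifier, not user input.
    """
    start = date(year, month, 1)
    # The upper bound is exclusive, so it is the first day of the next month.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{parent}_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


# Hypothetical usage: print the statement for December 2024.
print(monthly_partition_ddl("transactions", 2024, 12))
```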
stripe-data-architecture/
├── docker/
│   └── docker-compose.yml          # Full local infrastructure
├── sql/
│   ├── 01_oltp_schema.sql          # Normalized OLTP schema (3NF)
│   ├── 02_oltp_indexes.sql         # Indexes and partitioning
│   ├── 03_oltp_seed_data.sql       # Synthetic seed data
│   └── 04_olap_schema.sql          # OLAP star schema
├── mongodb/
│   └── init_collections.js         # MongoDB collections and indexes
├── debezium/
│   └── connector_postgres.json     # CDC connector configuration
├── airflow/
│   └── dags/
│       ├── dag_oltp_to_olap.py     # ETL PostgreSQL to OLAP
│       ├── dag_oltp_to_mongodb.py  # Sync to MongoDB
│       └── dag_data_quality.py     # Data-quality checks
├── dbt/
│   ├── dbt_project.yml
│   └── models/
│       ├── staging/                # stg_transactions, stg_merchants
│       └── marts/                  # fact_transactions, dimensions, daily revenue
├── scripts/
│   ├── start_pipeline.sh           # One-command startup
│   ├── kafka_consumer.py           # Kafka to MongoDB consumer
│   └── fraud_scoring_api.py        # FastAPI fraud-detection service
├── terraform/
│   └── main.tf                     # Cloud infrastructure (AWS)
├── docs/
│   ├── architecture_decisions.md   # Architecture Decision Records
│   └── mcd_stripe.mermaid          # Conceptual data model
├── .env.example
└── .gitignore
- Docker and Docker Compose
- Python 3.11+
- gettext (provides envsubst, used to inject secrets into the Debezium connector)
- Terraform (optional, only for cloud deployment)
- dbt (optional, to run transformations against the warehouse)
Copy the example environment file and adjust the values before starting anything:
cp .env.example .env

No real secret is committed. .env is gitignored. POSTGRES_PASSWORD (and the other
credentials in .env.example) are demo defaults; override them in your local .env. The
startup script loads .env and substitutes POSTGRES_PASSWORD into the Debezium connector
config at registration time, so the password is never hard-coded in connector_postgres.json.
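The startup script performs this substitution with envsubst. For readers who prefer to see the idea in Python, the following is a rough equivalent under stated assumptions: the template content, the connector name, and the helper `render_connector_config` are hypothetical, and the real `connector_postgres.json` may contain different properties.

```python
import json
from string import Template
from typing import Any, Mapping


def render_connector_config(template_json: str,
                            env: Mapping[str, str]) -> dict[str, Any]:
    """Fill ${VAR} placeholders in a connector config from `env`."""
    def fill(value: Any) -> Any:
        if isinstance(value, str):
            # substitute() raises KeyError when a referenced variable is missing.
            return Template(value).substitute(env)
        if isinstance(value, dict):
            return {k: fill(v) for k, v in value.items()}
        if isinstance(value, list):
            return [fill(v) for v in value]
        return value

    # Parsing first and substituting per value keeps the JSON valid even
    # when the secret contains quotes or backslashes.
    return fill(json.loads(template_json))


# Hypothetical template; the real connector_postgres.json may differ.
TEMPLATE = """
{
  "name": "postgres-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.password": "${POSTGRES_PASSWORD}"
  }
}
"""

config = render_connector_config(TEMPLATE, {"POSTGRES_PASSWORD": 'p"ss'})
```

The rendered dictionary can then be serialized with `json.dumps` and sent to the Kafka Connect REST endpoint (`POST /connectors` on the Debezium service). Failing loudly on a missing variable is a deliberate difference from envsubst, which substitutes an empty string.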
chmod +x scripts/start_pipeline.sh
./scripts/start_pipeline.sh
# Check the running services
docker-compose -f docker/docker-compose.yml ps

| Service | URL | Credentials |
|---|---|---|
| PostgreSQL | localhost:5432 | from your .env |
| MongoDB | localhost:27017 | from your .env |
| Kafka UI | http://localhost:8080 | none |
| Airflow | http://localhost:8081 | from your .env |
| Debezium | http://localhost:8083 | none |
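If you want a quick reachability check beyond `docker-compose ps`, a small probe like the one below may help. It is only a sketch: it tests that each port from the service table accepts a TCP connection, which does not prove the service is healthy, and the names `SERVICES`, `tcp_probe`, and `check_services` are illustrative rather than part of the repository.

```python
import socket
from typing import Callable

# Hosts and ports copied from the service table.
SERVICES = {
    "PostgreSQL": ("localhost", 5432),
    "MongoDB": ("localhost", 27017),
    "Kafka UI": ("localhost", 8080),
    "Airflow": ("localhost", 8081),
    "Debezium": ("localhost", 8083),
}


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(
    probe: Callable[[str, int], bool] = tcp_probe,
) -> dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: probe(host, port) for name, (host, port) in SERVICES.items()}
```

Calling `check_services()` after `start_pipeline.sh` finishes returns a dictionary of service names to booleans; the `probe` parameter exists so the function can be exercised without a running stack.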
Key decisions are documented as Architecture Decision Records in
docs/architecture_decisions.md, covering the choice of
PostgreSQL for OLTP, MongoDB as a feature store, Kafka and Debezium for CDC, the OLAP star
schema, Airflow for orchestration, and the security framework (PCI-DSS tokenization, AES-256
at rest, TLS 1.3, RBAC, GDPR soft-delete).
License: MIT