
Yatimai/stripe-data-architecture


Stripe Data Architecture

A production-grade data platform for a Stripe-style fintech, combining transactional storage, real-time change data capture, analytical warehousing, and batch orchestration. The project demonstrates an end-to-end flow from OLTP writes to analytics-ready marts, with infrastructure described as code.

Architecture

┌──────────────┐     ┌──────────┐     ┌──────────────┐     ┌──────────────┐
│  PostgreSQL  │────▶│ Debezium │────▶│ Apache Kafka │────▶│  Consumers   │
│    (OLTP)    │     │   (CDC)  │     │  (Streaming) │     │              │
└──────────────┘     └──────────┘     └──────┬───────┘     └──────┬───────┘
                                             │                    │
                                             ▼                    ▼
                                      ┌──────────────┐     ┌──────────────┐
                                      │   MongoDB    │     │  Snowflake   │
                                      │   (NoSQL)    │     │   (OLAP)     │
                                      └──────────────┘     └──────────────┘
                                             │                    │
                                             ▼                    ▼
                                      ┌──────────────┐     ┌──────────────┐
                                      │  ML Models   │     │  dbt Models  │
                                      │  (FastAPI)   │     │  (Transform) │
                                      └──────────────┘     └──────────────┘

                    ┌──────────────────────────────────────────────┐
                    │         Apache Airflow (Orchestration)       │
                    │    DAGs: ETL batch + refresh + monitoring    │
                    └──────────────────────────────────────────────┘

PostgreSQL is the system of record. Debezium reads the write-ahead log and streams row-level changes into Kafka, so the OLTP database is never queried for replication. Consumers fan the stream out to MongoDB (semi-structured data and ML features) and to the analytical warehouse. dbt transforms raw tables into a star schema, and Airflow orchestrates the batch ETL, the materialized-view refreshes, and the cross-system data-quality checks.
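To make the fan-out step concrete, here is a minimal sketch of how a consumer might translate one Debezium change event into a MongoDB-style upsert or delete. It relies on the standard Debezium envelope fields (op, before, after); the function name to_mongo_action, the key_field parameter, and the sample row are hypothetical and are not taken from scripts/kafka_consumer.py.

```python
import json
from typing import Any, Optional


def to_mongo_action(raw_value: Optional[bytes], key_field: str = "id") -> Optional[dict[str, Any]]:
    """Translate one Debezium change event into an upsert or delete description.

    Hypothetical helper: the real consumer in this repository may work differently.
    """
    if raw_value is None:
        # Kafka tombstone that Debezium emits after a delete; nothing to apply.
        return None

    event = json.loads(raw_value)
    # With the JSON converter and schemas enabled, the envelope sits under "payload".
    payload = event.get("payload", event)
    op = payload["op"]

    if op in ("c", "r", "u"):  # create, snapshot read, update
        row = payload["after"]
        return {"action": "upsert", "filter": {key_field: row[key_field]}, "document": row}
    if op == "d":  # delete: only "before" is populated
        row = payload["before"]
        return {"action": "delete", "filter": {key_field: row[key_field]}}
    return None  # other operations (for example truncate) are ignored in this sketch


# Example with a made-up row.
sample = json.dumps(
    {"payload": {"op": "u", "before": None, "after": {"id": "tx_1", "amount": 1200}}}
).encode()
print(to_mongo_action(sample))
```

Treating creates, snapshot reads, and updates as the same upsert keeps the consumer idempotent, so replaying a Kafka partition after a restart does not duplicate documents. This assumes key_field is the table's primary key, since a delete event is only guaranteed to carry the key columns in before under the default replica identity.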

Tech Stack

  • OLTP: PostgreSQL 16 (range partitioning, logical replication, UUID keys)
  • OLAP: Snowflake (star schema)
  • NoSQL staging: MongoDB 7 (feature store, TTL and text indexes)
  • Streaming / CDC: Apache Kafka + Debezium PostgreSQL connector
  • Orchestration: Apache Airflow 2.8
  • Transformation: dbt (staging and marts models)
  • Serving: FastAPI fraud-scoring API
  • Infrastructure as Code: Terraform (AWS)
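As an illustration of the range partitioning listed for PostgreSQL, the sketch below generates the DDL for one monthly partition. The parent table name transactions and the monthly granularity are assumptions made for the example; the actual partitioning scheme is defined in sql/02_oltp_indexes.sql.

```python
from datetime import date


def monthly_partition_ddl(parent: str, year: int, month: int) -> str:
    """Build the CREATE TABLE statement for one monthly range partition.

    Illustrative only: the table name and granularity are assumptions.
    """
    start = date(year, month, 1)
    # The upper bound of a range partition is exclusive, so use the first day
    # of the following month (rolling over into the next year in December).
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    child = f"{parent}_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {child} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


print(monthly_partition_ddl("transactions", 2024, 12))
```

Because the upper bound is exclusive, adjacent partitions can share a boundary date without overlapping.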

Project Structure

stripe-data-architecture/
├── docker/
│   └── docker-compose.yml          # Full local infrastructure
├── sql/
│   ├── 01_oltp_schema.sql          # Normalized OLTP schema (3NF)
│   ├── 02_oltp_indexes.sql         # Indexes and partitioning
│   ├── 03_oltp_seed_data.sql       # Synthetic seed data
│   └── 04_olap_schema.sql          # OLAP star schema
├── mongodb/
│   └── init_collections.js         # MongoDB collections and indexes
├── debezium/
│   └── connector_postgres.json     # CDC connector configuration
├── airflow/
│   └── dags/
│       ├── dag_oltp_to_olap.py     # ETL PostgreSQL to OLAP
│       ├── dag_oltp_to_mongodb.py  # Sync to MongoDB
│       └── dag_data_quality.py     # Data-quality checks
├── dbt/
│   ├── dbt_project.yml
│   └── models/
│       ├── staging/                # stg_transactions, stg_merchants
│       └── marts/                  # fact_transactions, dimensions, daily revenue
├── scripts/
│   ├── start_pipeline.sh           # One-command startup
│   ├── kafka_consumer.py           # Kafka to MongoDB consumer
│   └── fraud_scoring_api.py        # FastAPI fraud-detection service
├── terraform/
│   └── main.tf                     # Cloud infrastructure (AWS)
├── docs/
│   ├── architecture_decisions.md   # Architecture Decision Records
│   └── mcd_stripe.mermaid          # Conceptual data model
├── .env.example
└── .gitignore

Getting Started

Prerequisites

  • Docker and Docker Compose
  • Python 3.11+
  • gettext (provides envsubst, used to inject secrets into the Debezium connector)
  • Terraform (optional, only for cloud deployment)
  • dbt (optional, to run transformations against the warehouse)

Configuration

Copy the example environment file and adjust the values before starting anything:

cp .env.example .env

No real secrets are committed: the .env file is gitignored, and POSTGRES_PASSWORD (like the other credentials in .env.example) is a demo default that you should override in your local .env. The startup script loads .env and substitutes POSTGRES_PASSWORD into the Debezium connector config at registration time, so the password is never hard-coded in connector_postgres.json.
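The substitution step can be pictured with the following Python sketch, which mimics what envsubst does to a ${POSTGRES_PASSWORD} placeholder. The template is a hypothetical excerpt, not the actual contents of connector_postgres.json, and render_connector is an illustrative name; the startup script itself uses envsubst rather than Python.

```python
import json
from string import Template

# Hypothetical excerpt; the real connector_postgres.json may contain other keys.
CONNECTOR_TEMPLATE = """
{
  "name": "postgres-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.user": "postgres",
    "database.password": "${POSTGRES_PASSWORD}"
  }
}
"""


def render_connector(template: str, env: dict[str, str]) -> dict:
    """Fill ${VAR} placeholders from env and parse the result as JSON."""
    # substitute() raises KeyError when a referenced variable is missing,
    # which surfaces a forgotten .env entry before anything is registered.
    rendered = Template(template).substitute(env)
    return json.loads(rendered)


config = render_connector(CONNECTOR_TEMPLATE, {"POSTGRES_PASSWORD": "demo-password"})
print(config["config"]["database.password"])
```

As with envsubst, this is plain text substitution: a password containing a double quote or a backslash would produce invalid JSON unless it is escaped first.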

Running

chmod +x scripts/start_pipeline.sh
./scripts/start_pipeline.sh

# Check the running services
docker-compose -f docker/docker-compose.yml ps

Service endpoints

Service      URL                      Credentials
PostgreSQL   localhost:5432           from your .env
MongoDB      localhost:27017          from your .env
Kafka UI     http://localhost:8080    none
Airflow      http://localhost:8081    from your .env
Debezium     http://localhost:8083    none
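Once the stack is up, a quick reachability check over the ports listed above can confirm that every service is listening. This is a minimal sketch using only the standard library; the helper names tcp_probe and check_services are hypothetical and not part of the repository, and an open port does not prove the service behind it is healthy.

```python
import socket
from typing import Callable

# Host/port pairs taken from the service endpoints listed above.
SERVICES = {
    "PostgreSQL": ("localhost", 5432),
    "MongoDB": ("localhost", 27017),
    "Kafka UI": ("localhost", 8080),
    "Airflow": ("localhost", 8081),
    "Debezium": ("localhost", 8083),
}


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(probe: Callable[[str, int], bool] = tcp_probe) -> dict[str, bool]:
    """Probe every known service; the probe is injectable so it can be faked."""
    return {name: probe(host, port) for name, (host, port) in SERVICES.items()}


if __name__ == "__main__":
    for name, is_up in check_services().items():
        print(f"{name:<12} {'up' if is_up else 'DOWN'}")
```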

Design Notes

Key decisions are documented as Architecture Decision Records in docs/architecture_decisions.md, covering the choice of PostgreSQL for OLTP, MongoDB as a feature store, Kafka and Debezium for CDC, the OLAP star schema, Airflow for orchestration, and the security framework (PCI-DSS tokenization, AES-256 at rest, TLS 1.3, RBAC, GDPR soft-delete).

License

MIT
