Quick Start • Features • Architecture • Demo • Docs • Community
Agentic Data Engineering Platform is an open-source, production-ready ETL solution that combines the Medallion Architecture with AI-powered agents that autonomously profile, clean, and optimize your data—so you can focus on insights, not infrastructure.
- 🤖 **Autonomous Agents:** three autonomous agents work 24/7 to profile, score, and fix your data. No more manual data cleaning!
- ⚡ **Modern Stack:** built on modern tech that's 10x faster. Process millions of rows in seconds!
- 🏗️ **Medallion Architecture:** the industry-standard Bronze → Silver → Gold pattern. Scale from prototype to production!
- 📊 **Interactive Dashboard:** an interactive Streamlit interface. From data to decisions in minutes!
```bash
# 60 seconds to your first pipeline!
git clone <your-repo> && cd agentic-data-engineer
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python scripts/generate_sample_data.py
python src/orchestration/prefect_flows.py
streamlit run dashboards/streamlit_medallion_app.py
```

**Prerequisites:**

- ✅ Python 3.10 or higher
- ✅ 4GB RAM (minimum)
- ✅ 1GB free disk space
- ✅ Love for clean data 💙

<details>
<summary><b>Step 1: Clone & Setup Environment</b></summary>

```bash
# Clone the repository
git clone https://github.com/yourusername/agentic-data-engineer.git
cd agentic-data-engineer

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

</details>
<details>
<summary><b>Step 2: Initialize Project</b></summary>
```bash
# Run automated setup
python scripts/setup_initial.py

# Generate sample e-commerce data (1,000 records with quality issues)
python scripts/generate_sample_data.py
```

✅ Output: a sample dataset with intentional quality issues for testing the AI agents

</details>
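The repository's `scripts/generate_sample_data.py` is the real generator; as a rough illustration of what "intentional quality issues" can look like, here is a minimal standard-library sketch. The field names, issue rates, and output filename are illustrative assumptions, not the script's actual behavior:

```python
import csv
import random

random.seed(42)

def generate_messy_orders(n=1000, path="sample_orders.csv"):
    """Write n synthetic order records, injecting common quality issues."""
    rows = []
    for i in range(n):
        row = {
            "order_id": i,
            "customer": f"customer_{random.randint(1, 200)}",
            "amount": round(random.uniform(5, 500), 2),
        }
        roll = random.random()
        if roll < 0.02:
            row["amount"] = -row["amount"]             # negative amounts
        elif roll < 0.04:
            row["customer"] = f"  {row['customer']} "  # stray whitespace
        elif roll < 0.06:
            row["amount"] = ""                         # missing values
        rows.append(row)
        if roll > 0.98:
            rows.append(dict(row))                     # duplicate records
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "customer", "amount"])
        writer.writeheader()
        writer.writerows(rows)
    return rows

rows = generate_messy_orders()
```

Seeding the generator keeps the injected issues reproducible, which is what lets the agents be tested against a known set of defects.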
<details>
<summary><b>Step 3: Run Your First Pipeline</b></summary>

```bash
# Execute the complete ETL pipeline
python src/orchestration/prefect_flows.py
```

🎯 Watch as the agents:

- ✅ Profile your data (discover issues)
- ✅ Score data quality (0-100)
- ✅ Auto-remediate problems (fix issues)
- ✅ Create Bronze → Silver → Gold layers
- ✅ Generate business aggregates

Expected output:

```
🚀 Starting Agentic ETL Pipeline
✅ Extracted 1,000 rows
🔍 Profiling dataset: Found 10 issues
📊 Quality Score: 92/100
🔧 Auto-remediation: 7 actions taken
✅ Pipeline completed successfully!
```

</details>
Step 4: Launch Dashboard
```bash streamlit run dashboards/streamlit_medallion_app.py ```🌐 Open: http://localhost:8501
Explore 7 Interactive Pages:
- 🏠 Overview Dashboard
- 🥉 Bronze Layer Explorer
- 🥈 Silver Layer Analytics
- 🥇 Gold Layer Insights
- 📊 Quality Monitoring
- 🔍 Data Lineage
- ⚙️ Pipeline Performance
```python
# Traditional approach: manual, error-prone
df = pd.read_csv("data.csv")
df = df.dropna()           # Hope for the best?
df = df.drop_duplicates()  # Good enough?
# ... 50 more lines of cleaning code ...
```

```python
# Agentic approach: AI-powered, automatic
from src.agents.agentic_agents import DataProfilerAgent, RemediationAgent

profiler = DataProfilerAgent()
profile = profiler.profile_dataset(df, "my_data")
# 🔍 Discovers: 23 issues across 8 categories

remediation = RemediationAgent()
df_clean, actions = remediation.auto_remediate(df, profile['issues_detected'])
# 🔧 Fixed: whitespace, duplicates, negatives, outliers, formats
# ✅ Result: 98% quality score (up from 73%)
```

```
┌─────────────────────────────────────────────────────────────┐
│                        DATA JOURNEY                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  📥 Raw Sources (CSV, JSON, Parquet, APIs)                  │
│        ↓                                                    │
│  🥉 BRONZE LAYER                                            │
│     • Immutable raw data                                    │
│     • Full audit trail                                      │
│     • No transformations                                    │
│        ↓                                                    │
│  🥈 SILVER LAYER                                            │
│     • Deduplicated & cleaned                                │
│     • Schema validated                                      │
│     • Business rules applied                                │
│     • Ready for analytics                                   │
│        ↓                                                    │
│  🥇 GOLD LAYER                                              │
│     • Business aggregates                                   │
│     • KPIs & metrics                                        │
│     • Optimized for queries                                 │
│     • Dashboard-ready                                       │
│        ↓                                                    │
│  📊 CONSUMPTION (BI Tools, ML Models, APIs)                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
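The journey above can be sketched in plain Python (no DuckDB) to show what each promotion step actually does. The record shape and business rules below are illustrative assumptions, not the platform's real schema:

```python
from collections import defaultdict

# Bronze: raw records, kept exactly as ingested (including problems)
bronze = [
    {"order_id": 1, "customer": " alice ", "amount": 120.0},
    {"order_id": 1, "customer": " alice ", "amount": 120.0},  # duplicate
    {"order_id": 2, "customer": "bob", "amount": -30.0},      # invalid amount
    {"order_id": 3, "customer": "alice", "amount": 45.5},
]

# Silver: deduplicated, cleaned, business rules applied
seen = set()
silver = []
for rec in bronze:
    if rec["order_id"] in seen or rec["amount"] < 0:  # drop dupes and invalid rows
        continue
    seen.add(rec["order_id"])
    silver.append({**rec, "customer": rec["customer"].strip()})

# Gold: business aggregates, ready for dashboards
revenue_by_customer = defaultdict(float)
for rec in silver:
    revenue_by_customer[rec["customer"]] += rec["amount"]
```

The key property of the pattern is that each layer is derived from the one below it, so a bad transformation can always be rerun from the immutable Bronze copy.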
| Metric | Score | Trend | Status |
|---|---|---|---|
| Overall Quality | 92/100 | ↑ 3% | 🟢 Excellent |
| Completeness | 95% | ↑ 2% | 🟢 Great |
| Validity | 98% | → | 🟢 Perfect |
| Consistency | 88% | ↓ 1% | 🟡 Good |
| Accuracy | 91% | ↑ 4% | 🟢 Excellent |
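An overall score like the one in the table is typically a weighted average of the per-dimension scores. The platform's actual weights are not documented here, so the sketch below assumes equal weights (which yields 93 for the table's dimensions, close to but not exactly the reported 92, so the real weighting presumably differs):

```python
def quality_score(dimensions, weights=None):
    """Combine per-dimension scores (0-100) into one overall 0-100 score."""
    if weights is None:
        weights = {name: 1.0 for name in dimensions}  # equal weights by default
    total = sum(weights.values())
    return round(sum(dimensions[d] * weights[d] for d in dimensions) / total)

dims = {"completeness": 95, "validity": 98, "consistency": 88, "accuracy": 91}
overall = quality_score(dims)
```

Adjusting the weights lets a team make, say, completeness count for more than consistency without changing the scoring code.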
```
┌──────────────────────────────────────────────────────────────────┐
│                     AGENTIC CONTROL LAYER 🤖                     │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────────┐    ┌─────────────┐    ┌──────────────┐         │
│   │  Profiler   │───▶│   Quality   │───▶│ Remediation  │         │
│   │   Agent     │    │    Agent    │    │    Agent     │         │
│   │             │    │             │    │              │         │
│   │ • Discover  │    │ • Monitor   │    │ • Auto-fix   │         │
│   │ • Analyze   │    │ • Score     │    │ • Validate   │         │
│   │ • Report    │    │ • Alert     │    │ • Optimize   │         │
│   └─────────────┘    └─────────────┘    └──────────────┘         │
│                                                                  │
└────────────────────────────┬─────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────┐
│                    DATA PROCESSING LAYER ⚙️                      │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   🥉 Bronze       │   🥈 Silver        │   🥇 Gold               │
│   ────────────    │   ──────────────   │   ───────────           │
│   • Raw data      │   • Cleaned data   │   • Aggregates          │
│   • Parquet       │   • Validated      │   • KPIs                │
│   • Immutable     │   • Typed          │   • Metrics             │
│                                                                  │
└────────────────────────────┬─────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────┐
│                       STORAGE LAYER 💾                           │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   DuckDB (Analytical Database)                                   │
│   • OLAP optimized                                               │
│   • Columnar storage                                             │
│   • SQL interface                                                │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```
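The Profiler → Quality → Remediation loop in the control layer can be sketched as three small cooperating classes. The class and method names below mirror the diagram, not the real `agentic_agents` module, and the issue types are toy assumptions:

```python
class ProfilerAgent:
    def profile(self, rows):
        """Discover issues; returns (issue_kind, row_index) pairs."""
        issues = []
        for i, rec in enumerate(rows):
            if rec.get("amount") is None:
                issues.append(("missing_amount", i))
            elif rec["amount"] < 0:
                issues.append(("negative_amount", i))
        return issues

class QualityAgent:
    def score(self, rows, issues):
        """Score 0-100: the fraction of rows with no detected issue."""
        if not rows:
            return 100
        return round(100 * (1 - len(issues) / len(rows)))

class RemediationAgent:
    def fix(self, rows, issues):
        """Auto-fix: repair each flagged row without mutating the input."""
        fixed = [dict(r) for r in rows]
        for kind, i in issues:
            if kind == "negative_amount":
                fixed[i]["amount"] = abs(fixed[i]["amount"])
            elif kind == "missing_amount":
                fixed[i]["amount"] = 0.0
        return fixed

rows = [{"amount": 10.0}, {"amount": -5.0}, {"amount": None}]
profiler, quality, remediation = ProfilerAgent(), QualityAgent(), RemediationAgent()
issues = profiler.profile(rows)                        # discover
before = quality.score(rows, issues)                   # score the raw data
clean = remediation.fix(rows, issues)                  # auto-fix
after = quality.score(clean, profiler.profile(clean))  # re-score after fixes
```

The important structural point is the feedback loop: the quality agent re-scores the remediated data with a fresh profile, so fixes are verified rather than assumed.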
1️⃣ Beginner: Understanding the Basics
Time Investment: 30 minutes
You'll Learn: Core concepts, basic workflow
2️⃣ Intermediate: Customization
Time Investment: 2 hours
You'll Learn: Adapt platform to your needs
3️⃣ Advanced: Production Deployment
Time Investment: 4 hours
You'll Learn: Enterprise-grade deployment
```python
# Quick API examples

# 1. Data profiling
from src.agents.agentic_agents import DataProfilerAgent
profiler = DataProfilerAgent()
profile = profiler.profile_dataset(df, "my_dataset")

# 2. Quality scoring
from src.agents.agentic_agents import QualityAgent
quality = QualityAgent()
score = quality.calculate_quality_score(profile)

# 3. Auto-remediation
from src.agents.agentic_agents import RemediationAgent
remediation = RemediationAgent()
clean_df, actions = remediation.auto_remediate(df, profile['issues_detected'])

# 4. DuckDB operations
from src.database.duckdb_manager import MedallionDuckDB
db = MedallionDuckDB()
db.load_to_bronze(df, "my_table")
db.promote_to_silver("my_table", "my_table_clean")
```

Perfect for analyzing customer behavior, order patterns, and product performance.
✅ Handles messy transaction data
✅ Auto-cleans customer records
✅ Creates ready-to-use KPIs
Clean and validate financial transactions with confidence.
✅ Detects data anomalies
✅ Ensures compliance rules
✅ Tracks data lineage for audits
Transform raw data into executive-ready dashboards.
✅ Automated data prep
✅ Quality guarantees
✅ Fast query performance
Reliable, clean datasets for model training.
✅ Feature engineering ready
✅ Drift detection
✅ Reproducible pipelines
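Drift detection from the list above can be as simple as comparing distribution statistics between a training-time baseline and fresh data. This standard-library sketch illustrates the idea; the 2-sigma threshold and mean-shift metric are arbitrary assumptions, not the platform's actual method:

```python
import statistics

def drift_score(baseline, current):
    """How many baseline standard deviations the current mean has shifted."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    shift = abs(statistics.mean(current) - mu)
    return shift / sigma if sigma else float("inf")

baseline = [10.0, 11.0, 9.5, 10.5, 10.2]  # feature values at training time
stable   = [10.1, 10.4, 9.8]              # fresh data, same distribution
drifted  = [15.0, 16.2, 15.5]             # fresh data, shifted distribution

DRIFT_THRESHOLD = 2.0  # flag drift beyond 2 baseline stdevs (assumed cutoff)
```

Production systems usually compare full distributions (e.g. with population-stability or KS statistics) rather than just means, but the mean-shift version shows the shape of the check.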
- Medallion Architecture
- Basic AI Agents
- Streamlit Dashboard
- DuckDB Integration
- Sample Dataset
- LangChain Integration for NLP queries
- Advanced ML Anomaly Detection
- Real-time Streaming Support
- Multi-source Connectors (PostgreSQL, MySQL, S3)
- Data Versioning (Delta Lake)
- Cloud Deployment (AWS/Azure/GCP)
- Kubernetes Orchestration
- RBAC & Security
- GraphQL API
- Slack/Teams Integrations
- GPT-4 Powered Data Analysis
- Automated Feature Engineering
- Predictive Quality Monitoring
- Self-Optimizing Pipelines
We ❤️ contributions! Here's how you can help:
- 🐛 **Report Bugs:** found an issue? Open a bug report.
- 💡 **Suggest Features:** have an idea? Request a feature.
- 📝 **Improve Docs:** better explanations? Edit the docs.
- 🔧 **Submit Code:** fix or feature? Create a pull request.
- ⭐ **Star the Repo:** show support! Give us a star.
- 💬 **Join Discussion:** ask questions in GitHub Discussions.
```bash
# Fork and clone your fork
git clone https://github.com/YOUR_USERNAME/agentic-data-engineer.git

# Create a feature branch
git checkout -b feature/amazing-feature

# Make your changes and commit
git commit -m "Add amazing feature"

# Push and create a PR
git push origin feature/amazing-feature
```

- ✅ Follow the PEP 8 style guide
- ✅ Add docstrings to functions
- ✅ Include unit tests
- ✅ Update documentation
- ✅ Run `pytest` before submitting
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License - Do whatever you want!
✅ Commercial use
✅ Modification
✅ Distribution
✅ Private use
Built with amazing open-source tools:
- DuckDB - The SQLite of analytics
- Polars - Lightning-fast DataFrames
- Prefect - Modern workflow orchestration
- Streamlit - Beautiful data apps
- Pandera - Data validation
- Great Expectations - Data quality
- Evidently - ML monitoring
Special thanks to all contributors and the open-source community! 💙
If this project helped you, please consider:
⭐ Starring the repository
🐛 Reporting bugs
💡 Suggesting features
📢 Sharing with others
☕ Buying me a coffee
Built with ❤️ by Your Name | Last Updated: November 2024