37 changes: 26 additions & 11 deletions README.md
@@ -1,14 +1,29 @@
# Hacker News
# Hacker News API data processing pipeline

## Quick Start
### First-time setup
First, start Garage on its own via Docker Compose to set up the access keys:
```bash
docker compose up -d garage-webui
```
In the Garage UI, available at **http://localhost:3909/**:
1. Set up an access key.
2. Create a bucket named `bronze` accessible by that same access key.
3. Then update the following environment variables of `kafka-connect-setup` in the docker compose file:
1. AWS_ACCESS_KEY_ID
2. AWS_SECRET_ACCESS_KEY

With this setup, Garage keeps the access keys in its dedicated metadata folder.
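Concretely, the `environment` section of the `kafka-connect-setup` service might look like the sketch below. The service layout and the placeholder values are assumptions — substitute the key ID and secret generated in the Garage UI:

```yaml
# Hypothetical sketch of the kafka-connect-setup service in docker-compose.yml;
# adjust service and variable placement to match your actual compose file.
services:
  kafka-connect-setup:
    environment:
      AWS_ACCESS_KEY_ID: "GK..."            # placeholder: key ID from the Garage UI
      AWS_SECRET_ACCESS_KEY: "changeme"     # placeholder: secret from the Garage UI
```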

### Launch the project
From the project root, run:
```bash
docker compose up
```


---

## Available UIs
### Garage UI
Open in browser: **http://localhost:3909/**

@@ -29,17 +44,17 @@ jupyter notebook explore_data.ipynb
## 🏗️ Architecture

```
HN API → Kafka Producer → Kafka Topics
┌────────────────┐
│ BRONZE Layer │ ← Spark + Delta Lake
│ (Raw Data) │ • Kafka → Delta
└────────────────┘ • ACID writes
HN API → Kafka Producer → Kafka
        ↓
┌────────────────┐
│  BRONZE Layer  │
│   (Raw Data)   │ ──┐
└────────────────┘   │
        ↓            │  ← Spark + Delta Lake
┌────────────────┐   │  • HTML cleaning
│  SILVER Layer  │ ←─┘  • aggregation
│  (Clean Data)  │
└────────────────┘
```
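The "HTML cleaning" step between the bronze and silver layers can be sketched as a small text-normalization function. This is a minimal illustration, not the project's actual Spark code: it assumes HN item `text` fields contain HTML tags and entities, and uses only the Python standard library (the function name `clean_html` is hypothetical):

```python
import html
import re

def clean_html(raw: str) -> str:
    """Hypothetical silver-layer cleaning step: strip HTML tags and
    entities from an HN item's text field, then normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw or "")  # replace each tag with a space
    text = html.unescape(text)                 # decode entities like &amp;
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

print(clean_html('Hello &amp; welcome.<p>See <a href="x">this</a>'))
# → Hello & welcome. See this
```

In the real pipeline a function like this would typically be registered as a Spark UDF (or expressed with built-in column functions) and applied while streaming bronze Delta tables into silver ones.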

---
2 changes: 0 additions & 2 deletions bronze/__init__.py

This file was deleted.

52 changes: 0 additions & 52 deletions bronze/main.py

This file was deleted.

184 changes: 0 additions & 184 deletions bronze/spark_loader.py

This file was deleted.
