
aymaneo/HackerNews


Hacker News

Quick Start

The Docker Compose file defines the garage-meta and garage-data volumes as external volumes, so they need to be created manually once before starting the stack:

docker volume create garage-meta
docker volume create garage-data
docker compose build hn-producer
docker compose up -d

Before launching a notebook, create your access key, secret key, and buckets in the Garage UI.


Garage UI

Open in browser: http://localhost:3909/

Spark UI

Open in browser: http://localhost:8080/ui

Kafka UI

Open in browser: http://localhost:8082

  • View topics: hn-stories, hn-comments

Query Delta Lake (Jupyter Notebook)

jupyter notebook explore_data.ipynb

🏗️ Architecture

HN API → Kafka Producer → Kafka Topics
                             ↓
                    ┌────────────────┐
                    │  BRONZE Layer  │  ← Spark + Delta Lake
                    │  (Raw Data)    │     • Kafka → Delta
                    └────────────────┘     • ACID writes
                             ↓
                    ┌────────────────┐
                    │  SILVER Layer  │  ← Spark + Delta Lake
                    │  (Clean Data)  │     • HTML cleaning
                    └────────────────┘     • Quality scoring

Data Schema

Bronze Layer (Raw from Kafka)

Stories: id, by, title, url, score, descendants, time, type, text, kids, _kafka_offset, _kafka_partition, _bronze_ingested_at

Comments: id, by, parent, story_id, text, time, type, kids, deleted, dead, _kafka_offset, _kafka_partition, _bronze_ingested_at
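The three underscore-prefixed columns are metadata added on top of the raw Hacker News fields. As a rough illustration of that step, here is a minimal plain-Python sketch. It is not the project's Spark job: the function name `to_bronze_record`, the assumption that message values are UTF-8 JSON, and the ISO-8601 string used for `_bronze_ingested_at` are all hypothetical.

```python
import json
from datetime import datetime, timezone


def to_bronze_record(value: bytes, offset: int, partition: int) -> dict:
    """Parse one Kafka message value and attach the Bronze metadata columns.

    Hypothetical helper: assumes the producer writes each item as UTF-8 JSON.
    """
    record = json.loads(value)
    record["_kafka_offset"] = offset
    record["_kafka_partition"] = partition
    # Assumed representation: ingestion time as an ISO-8601 UTC string.
    record["_bronze_ingested_at"] = datetime.now(timezone.utc).isoformat()
    return record


# Example with a made-up story payload.
message = json.dumps(
    {"id": 1, "by": "alice", "title": "Hello", "type": "story"}
).encode("utf-8")
row = to_bronze_record(message, offset=42, partition=0)
print(row["id"], row["_kafka_offset"], row["_kafka_partition"])
```

Keeping the offset and partition next to each row makes it possible to trace a Bronze record back to the exact Kafka message it came from.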

Silver Layer (Cleaned)

Stories: id, author, title, url, score, comment_count, timestamp, text_raw, text_clean, has_url, has_text, type

Comments: id, author, story_id, parent, timestamp, text_raw, text_clean, has_text, word_count, char_count, has_replies, is_deleted, is_dead, quality_score, type
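Comparing the two column lists suggests how a Bronze comment might map to a Silver one (for example `by` to `author`, `text` to `text_raw` and `text_clean`). The sketch below shows one plausible version of that mapping in plain Python. The helper names, the regex-based HTML cleaning, and the conversion of `time` from Unix seconds to a UTC datetime are assumptions, not the repository's actual Spark code. `quality_score` is left out because its formula is not described here.

```python
import html
import re
from datetime import datetime, timezone

BLOCK_TAG_RE = re.compile(r"<p>|<br\s*/?>", re.IGNORECASE)
ANY_TAG_RE = re.compile(r"<[^>]+>")


def clean_html(text):
    """Strip tags, decode HTML entities, and collapse whitespace (assumed rules)."""
    if not text:
        return ""
    text = BLOCK_TAG_RE.sub(" ", text)   # paragraph breaks become spaces
    text = ANY_TAG_RE.sub("", text)      # drop remaining inline tags
    text = html.unescape(text)           # decode entities such as &amp;
    return " ".join(text.split())


def to_silver_comment(bronze: dict) -> dict:
    """Hypothetical Bronze -> Silver mapping for one comment record."""
    text_raw = bronze.get("text") or ""
    text_clean = clean_html(text_raw)
    return {
        "id": bronze["id"],
        "author": bronze.get("by"),
        "story_id": bronze.get("story_id"),
        "parent": bronze.get("parent"),
        # Assumes `time` is Unix seconds, as in the Hacker News API.
        "timestamp": datetime.fromtimestamp(bronze["time"], tz=timezone.utc),
        "text_raw": text_raw,
        "text_clean": text_clean,
        "has_text": bool(text_clean),
        "word_count": len(text_clean.split()),
        "char_count": len(text_clean),
        "has_replies": bool(bronze.get("kids")),
        "is_deleted": bool(bronze.get("deleted")),
        "is_dead": bool(bronze.get("dead")),
        "type": bronze.get("type"),
    }


# Example with a made-up Bronze comment.
sample = {
    "id": 10, "by": "bob", "parent": 1, "story_id": 1, "time": 0,
    "text": "Nice &amp; <i>fast</i>.<p>Second para",
    "type": "comment", "kids": [11],
}
silver = to_silver_comment(sample)
print(silver["text_clean"], silver["word_count"], silver["has_replies"])
```

Tags are removed before entities are decoded so that an escaped `&lt;` in a comment is not mistaken for markup.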
