NguyenDucAnforwork/Big-Data-Project

NYC Taxi Trip Streaming Pipeline

Our Dashboard for the NYC Taxi Trip Dataset

End-to-end real-time data pipeline: Kafka → Spark Streaming → Cassandra → Grafana

Architecture

Architecture Diagram

  • Kafka: Message broker for streaming taxi trip data (KRaft mode)
  • Spark Structured Streaming: Window aggregations with watermarking
  • Cassandra: Time-series storage for aggregated metrics
  • Grafana: Real-time dashboards and visualizations
  • Hadoop HDFS: Enables long-term storage, historical analysis, and recomputation
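To make the "window aggregations with watermarking" bullet concrete, here is a minimal plain-Python sketch of the idea. It is not the project's Spark job and does not use the Spark API. The function names, the (event_time, vendor_id, fare) tuple layout, and the 10-minute lateness threshold are assumptions for illustration; only the 5-minute window size comes from the Data Flow section. The sketch also moves the watermark on every event, whereas Spark advances it between micro-batches.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW_SIZE = timedelta(minutes=5)        # window size mentioned in Data Flow
ALLOWED_LATENESS = timedelta(minutes=10)  # assumed value, not from the project
EPOCH = datetime(1970, 1, 1)


def window_start(event_time, size=WINDOW_SIZE):
    """Round an event time down to the start of its tumbling window."""
    return event_time - (event_time - EPOCH) % size


def aggregate(events, size=WINDOW_SIZE, lateness=ALLOWED_LATENESS):
    """Count trips and sum fares per (window start, vendor).

    `events` is an iterable of (event_time, vendor_id, fare) tuples in
    arrival order. An event is dropped when its window has already closed,
    i.e. the window end is at or before the current watermark.
    """
    totals = defaultdict(lambda: {"trips": 0, "fare": 0.0})
    max_event_time = None
    dropped = 0
    for event_time, vendor_id, fare in events:
        # The watermark trails the latest event time seen so far.
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - lateness
        start = window_start(event_time, size)
        if start + size <= watermark:
            dropped += 1  # too late: the window is treated as finalized
            continue
        bucket = totals[(start, vendor_id)]
        bucket["trips"] += 1
        bucket["fare"] += fare
    return dict(totals), dropped
```

Calling aggregate() on a list of events returns the per-window totals plus the number of late events that were discarded, which is the trade-off a watermark makes: bounded state in exchange for dropping data that arrives too late.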

Prerequisites

  • Docker Desktop installed and running

  • Taxi data: download yellow_tripdata_2025-08.parquet from NYC TLC and place it in the data/ folder, so that the path is data/yellow_tripdata_2025-08.parquet

  • Install Ubuntu under WSL

  • Open Docker Desktop -> Settings -> Enable WSL Integration -> Apply & Restart

From this point on, run every command inside your Ubuntu terminal.

  • Check that Docker is running:
docker version
  • Install astral uv for dependency management:
curl -LsSf https://astral.sh/uv/install.sh | bash
source ~/.bashrc
  • Install Java 17:
sudo apt install -y openjdk-17-jdk
java -version   # verify
  • Clone this repository and sync dependencies:
git clone <link>
cd <folder>
uv sync

Quick Start

1. First-Time Setup:

# Start services
docker compose up -d

The pipeline should now be up and running.

2. Check Logs

# Kafka broker logs
docker logs -f broker

# Spark logs
docker logs -f spark

# Producer logs
docker logs -f producer

3. Access Services

  • Grafana Dashboard: http://localhost:3000 (admin/admin)
  • Cassandra: docker exec -it cassandra cqlsh
  • Kafka: localhost:8080

4. Debug Commands

# Shut down containers and remove volumes
docker compose down -v

# Remove checkpoint files of Spark
rm -rf /tmp/spark_checkpoints

# If the broker refuses to start: take ownership of Kafka's data directory, then delete it
sudo chown -R $USER:$USER kafka/kafka_data/
rm -rf kafka/kafka_data

# LEGACY: Manually initialize Cassandra schema
docker exec -i cassandra cqlsh < cassandra/init.cql

Data Flow

  1. The producer reads the Parquet file and streams each trip as a JSON message to the Kafka topic taxi-trips
  2. Spark Streaming consumes the messages and performs windowed aggregations (5-minute and hourly) with watermarking
  3. Cassandra stores raw trips and aggregated metrics in the taxi_streaming keyspace
  4. Grafana visualizes the metrics from Cassandra in real time
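Step 1 can be sketched as follows: one trip record becomes one JSON message. This is an illustration only, since the README does not show the producer's actual message layout. The to_message helper and the sample row are hypothetical; the column names are taken from the public TLC yellow-taxi schema. Sending the bytes to Kafka is left out because the client library used by the producer is not stated here.

```python
import json
from datetime import datetime

TOPIC = "taxi-trips"  # topic name from the Data Flow list


def to_message(row):
    """Encode one trip record (a dict) as UTF-8 JSON bytes.

    Datetime values are written as ISO 8601 strings because the json
    module cannot serialize them directly.
    """
    def encode(value):
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"cannot serialize {type(value).__name__}")

    return json.dumps(row, default=encode).encode("utf-8")


# Hypothetical row using column names from the TLC yellow-taxi schema.
row = {
    "VendorID": 1,
    "tpep_pickup_datetime": datetime(2025, 8, 1, 12, 1, 30),
    "trip_distance": 2.4,
    "fare_amount": 14.2,
}
message = to_message(row)
print(TOPIC, message.decode("utf-8"))
```

Keeping timestamps as ISO 8601 strings makes the event time easy to parse on the consumer side, which matters because the windowed aggregations in step 2 are keyed on it.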

Project Structure

kafka/               # Kafka producer/consumer
spark/               # Spark Structured Streaming job
cassandra/           # Schema initialization (CQL)
grafana/             # Dashboard provisioning
schemas/             # Avro schema (reference)
data/                # Input data (Parquet)

Monitoring

Query Cassandra data:

USE taxi_streaming;
SELECT * FROM trip_aggregations_5min WHERE vendor_id = 1 LIMIT 10;
SELECT * FROM trip_aggregations_hourly LIMIT 5;

View Kafka topics:

docker exec -it broker /opt/kafka/bin/kafka-topics.sh --list --bootstrap-server broker:9092
