End-to-end real-time data pipeline: Kafka → Spark Streaming → Cassandra → Grafana
- Kafka: Message broker for streaming taxi trip data (KRaft mode)
- Spark Structured Streaming: Window aggregations with watermarking
- Cassandra: Time-series storage for aggregated metrics
- Grafana: Real-time dashboards and visualizations
- Hadoop HDFS: Enables long-term storage, historical analysis, and recomputation
-
Docker Desktop installed and running
-
Taxi data:
data/yellow_tripdata_2025-08.parquet(Download from NYC TLC) and put parquet file in/datafolder -
Install Ubuntu
-
Open Docker Desktop -> Settings -> Enable WSL Integration -> Apply & Restart
From this point, pull up your Ubuntu instance and get it running.
- Check docker running :
docker version- Install astral uv for dependency management:
curl -LsSf https://astral.sh/uv/install.sh | bash
source ~/.bashrc- Install Java 17:
sudo apt install -y openjdk-17-jdk
java -version # verify- Clone this folder & sync dependencies
git clone <link>
cd <folder>
uv sync# Start services
docker compose up -dPipeline should now be up and running!
# Kafka Broker logs
docker logs -f brokerCheck spark logs
docker logs -f sparkCheck producer logs
docker logs -f producer- Grafana Dashboard: http://localhost:3000 (admin/admin)
- Cassandra:
docker exec -it cassandra cqlsh - Kafka: localhost:8080
# Shut down containers and remove volumns
docker compose down -v
# Remove checkpoint files of Spark
rm -rf /tmp/spark_checkpoints
# That one time Broker had depression
rm -rf kafka/kafka_data
sudo chown -R $USER:$USER kafka/kafka_data/
# LEGACY: Manually initialize Cassandra schema
docker exec -i cassandra cqlsh < cassandra/init.cql- Producer reads Parquet file and streams JSON messages to Kafka topic
taxi-trips - Spark Streaming consumes messages, performs windowed aggregations (5-min and hourly), applies watermarking
- Cassandra stores raw trips and aggregated metrics in
taxi_streamingkeyspace - Grafana visualizes metrics from Cassandra in real-time
kafka/ # Kafka producer/consumer
spark/ # Spark Structured Streaming job
cassandra/ # Schema initialization (CQL)
grafana/ # Dashboard provisioning
schemas/ # Avro schema (reference)
data/ # Input data (Parquet)
Query Cassandra data:
USE taxi_streaming;
SELECT * FROM trip_aggregations_5min WHERE vendor_id = 1 LIMIT 10;
SELECT * FROM trip_aggregations_hourly LIMIT 5;View Kafka topics:
docker exec -it broker opt/kafka/bin/kafka-topics.sh --list --bootstrap-server broker:9092

