How to run this project with a local Kubernetes development environment (Kind - Kubernetes IN Docker)
Install Kind and kubectl if you haven't already. This deploy script creates the cluster, builds images, and applies manifests.

    ./scripts/deploy.sh

Then in the same or a different terminal, forward the Producer's port to the localhost port:

    kubectl -n big-data port-forward svc/producer-gateway 8000:8000

In a separate terminal, run the data generator script:

    uv run python generator/generator.py

Services are exposed via NodePort and need to be forwarded to the localhost port.

- Grafana: http://localhost:3000

      kubectl -n big-data port-forward svc/grafana 3000:3000

  Username: `admin`

  Password: `admin`
- Kafka UI: http://localhost:8080

      kubectl -n big-data port-forward svc/kafka-ui 8080:8080

kubectl is the command-line tool for talking to the Kubernetes API. The syntax is always:

    kubectl [command] [TYPE] [NAME] [flags]

| Command | Description | Example |
|---|---|---|
| `get` | List resources. | `kubectl get pods -n big-data` |
| `describe` | Show detailed config and events (errors). | `kubectl describe pod cassandra-0 -n big-data` |
| `logs` | Print stdout/stderr of a container. | `kubectl logs -f deploy/spark-streaming -n big-data` |
| `exec` | Run a command inside a running container. | `kubectl exec -it cassandra-0 -n big-data -- cqlsh` |
| `delete` | Delete a resource. | `kubectl delete pod producer-xyz -n big-data` |

Key Flags:

- `-n big-data`: Specifies the Namespace. If you forget this, `kubectl` looks in the `default` namespace and won't find your apps.
- `-f`: Follow the logs (stream them).
- `-w`: Watch for changes (live updates to the list).

Reset: Restart the pipeline with fresh data without destroying the cluster.

    ./scripts/reset.sh

What this does:

- Force-stops all pods (instant).
- Wipes all data (PVCs).
- Restarts everything automatically.

Stop (Freeze State): Stop the container, keep the data.

    ./scripts/stop.sh

Resume (Unfreeze): Start the container again; all pods restart.

    ./scripts/start.sh

Teardown: Delete the cluster (Kind container).

    kind delete cluster --name big-data

- What it is: Docker Compose is a tool for running multi-container Docker applications on a single host.
- How it works: It tells the Docker Daemon: "Start these 6 containers and network them together."
- Limitation: It doesn't handle "node failure" (because there's only one node). Scaling is manual. Service discovery is simple DNS.
- What it is: Kind (Kubernetes IN Docker) runs a whole Kubernetes cluster inside Docker containers.
- The Difference: Kubernetes is an Orchestrator. It doesn't just "run containers". It manages State.
- Self-Healing: If a Spark pod crashes, Kubernetes restarts it automatically.
- Scheduling: It decides where to run pods (simulated nodes in Kind).
- Storage: It attaches persistent disks (PVCs) to pods, regardless of where they run.
Kubernetes uses YAML "Manifests" to define the desired state. Here is a line-by-line breakdown of our project structure:
- `Namespace`: `big-data`. This is a virtual cluster. It isolates our resources (Pods, Services, PVCs) from system components.
- `ConfigMap`: Key-value pairs for configuration.
  - `cassandra-init`: Stores the contents of `init.cql`. This allows us to inject the schema script as a file without rebuilding the image.
  - `grafana-datasources`: Stores the `datasource.yaml` so Grafana knows how to talk to Cassandra on boot.
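
As a rough illustration of the two object kinds described above, a manifest along these lines would create the namespace and the schema ConfigMap. Only the names `big-data`, `cassandra-init`, and `init.cql` come from this document; the CQL inside is a hypothetical placeholder, since the real script is not shown here.

```yaml
# Sketch only: object names come from the text; the CQL body is assumed.
apiVersion: v1
kind: Namespace
metadata:
  name: big-data
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cassandra-init
  namespace: big-data
data:
  # Each key becomes a file name when the ConfigMap is mounted as a volume.
  init.cql: |
    -- Hypothetical schema; replace with the project's real init.cql.
    CREATE KEYSPACE IF NOT EXISTS example_keyspace
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
```

Because the script lives in the ConfigMap rather than in the image, changing the schema only requires re-applying the manifest.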
This file defines StatefulSets and Services.
- `StatefulSet` vs `Deployment`:
  - We use `StatefulSet` because Kafka and Cassandra need Stable Network Identities (`cassandra-0`) and Stable Storage. If `cassandra-0` dies, it must come back with the same name and the same data volume.
- Field Breakdown:
  - `serviceName`: Links the Pods to the Headless Service for DNS discovery.
  - `volumeClaimTemplates`: Automatically creates a PersistentVolumeClaim (PVC) for each replica. This creates the "virtual hard drive" (`kafka-data`).
  - `readinessProbe`: Uses `cqlsh` to ask "Are you ready?". If this fails (non-zero exit), Kubernetes won't send traffic to the pod.
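
The three fields above fit together roughly as in the following sketch for Cassandra. This is not the project's actual manifest: the image tag, probe command and timings, volume name `cassandra-data`, and storage size are all assumptions made for illustration.

```yaml
# Headless Service: gives each pod a stable DNS name such as
# cassandra-0.cassandra.big-data.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: cassandra
  namespace: big-data
spec:
  clusterIP: None            # "None" is what makes the Service headless
  selector:
    app: cassandra
  ports:
    - name: cql
      port: 9042
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
  namespace: big-data
spec:
  serviceName: cassandra     # must match the headless Service name
  replicas: 1
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
        - name: cassandra
          image: cassandra:4.1          # assumed tag
          ports:
            - containerPort: 9042
          readinessProbe:
            exec:
              # A non-zero exit from cqlsh marks the pod as not ready.
              command: ["cqlsh", "-e", "DESCRIBE KEYSPACES"]
            initialDelaySeconds: 30     # assumed timings
            periodSeconds: 10
          volumeMounts:
            - name: cassandra-data
              mountPath: /var/lib/cassandra
  volumeClaimTemplates:
    - metadata:
        name: cassandra-data            # assumed; the text only names kafka-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi                # assumed size
```

With this template, Kubernetes names the generated claim `cassandra-data-cassandra-0`, so a recreated `cassandra-0` pod reattaches to the same volume.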
- `Job`: A Pod that runs once to completion.
- How it works:
  - It mounts the `cassandra-init` ConfigMap as a file at `/cassandra/init.cql`.
  - It runs `cqlsh` to execute that script.
  - `restartPolicy: OnFailure`: If the script fails (e.g., connection refused), K8s retries it, up to the Job's `backoffLimit`.
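
A minimal sketch of such a Job, under stated assumptions: the Job name, the image tag, and the `cassandra` host name passed to `cqlsh` are hypothetical, while the ConfigMap name, mount path, and restart policy follow the description above.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: cassandra-init-job       # hypothetical name
  namespace: big-data
spec:
  backoffLimit: 6                # Kubernetes default, written out for clarity
  template:
    spec:
      restartPolicy: OnFailure   # restart the container if cqlsh exits non-zero
      containers:
        - name: init
          image: cassandra:4.1   # assumed; any image that ships cqlsh would do
          # "cassandra" is an assumed Service host name; 9042 is the default CQL port.
          command: ["cqlsh", "cassandra", "9042", "-f", "/cassandra/init.cql"]
          volumeMounts:
            - name: init-script
              mountPath: /cassandra
      volumes:
        - name: init-script
          configMap:
            name: cassandra-init # the init.cql key appears as /cassandra/init.cql
```

Retries are bounded by `backoffLimit`, so a Job that can never reach Cassandra eventually ends up marked as failed rather than looping forever.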
This file defines Deployments and PersistentVolumeClaims.
- `Deployment`: Used for stateless apps.
  - `imagePullPolicy: Never`: Crucial for local dev. It tells K8s "Use the image I built locally, don't try to download from the internet."
- `PersistentVolumeClaim` (PVC):
  - `spark-checkpoints-pvc`: A request for 1Gi of storage. Spark needs this to save its offset positions. If the pod crashes, it reads this PVC to resume processing exactly where it left off (Fault Tolerance).
- `NodePort` Service:
  - `type: NodePort`: Opens a specific port on the "Node" (the Kind container).
  - `nodePort: 30300`: We manually pinned this port so we know where to access it. (Though port-forwarding is superior for local dev).
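
The pieces above could look roughly like this when written out. Treat it as a sketch: the image tag and the `/checkpoints` mount path are hypothetical, and attaching `nodePort: 30300` to the Grafana Service is an assumption, since the text does not say which Service is pinned to that port.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-checkpoints-pvc
  namespace: big-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-streaming
  namespace: big-data
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-streaming
  template:
    metadata:
      labels:
        app: spark-streaming
    spec:
      containers:
        - name: spark-streaming
          image: spark-streaming:local   # hypothetical locally built tag
          imagePullPolicy: Never         # only use an image already on the node
          volumeMounts:
            - name: checkpoints
              mountPath: /checkpoints    # assumed checkpoint directory
      volumes:
        - name: checkpoints
          persistentVolumeClaim:
            claimName: spark-checkpoints-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: grafana                  # assumed owner of the pinned port
  namespace: big-data
spec:
  type: NodePort
  selector:
    app: grafana
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 30300            # must lie in the default 30000-32767 range
```

With `imagePullPolicy: Never`, the pod fails to start if the image was not loaded into the Kind node beforehand, which is why the build-and-load step has to happen before the manifests are applied.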
We configured this cluster for Local Development, but here is how it compares to Production Standards:
- Our Setup: `replicas: 1`. Single point of failure, but efficient for local dev.
- Standard: `replicas: 3` (minimum).
  - Kafka: 3 Brokers to allow voting/quorum.
  - Cassandra: 3 Nodes (Ring architecture).
  - Spark: Multiple Executor pods scaling horizontally based on load (HPA).
- Our Setup: `standard` StorageClass (HostPath in Kind).
- Standard: Cloud Storage (AWS EBS, Google Persistent Disk).
  - For Kafka/Cassandra: Use fast SSD-backed PVCs (`io1` or `gp3` on AWS).
  - For Spark Checkpoints: Use Object Storage (S3/GCS) instead of block storage, as checkpoints are write-heavy and don't require block-level performance.
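
For the SSD-backed option on AWS, a StorageClass along the following lines is one possibility. It assumes the AWS EBS CSI driver is installed in the cluster, and the class name `fast-ssd` is hypothetical.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd                          # hypothetical name
provisioner: ebs.csi.aws.com              # AWS EBS CSI driver (assumed installed)
parameters:
  type: gp3                               # SSD-backed EBS volume type
volumeBindingMode: WaitForFirstConsumer   # create the volume in the pod's zone
allowVolumeExpansion: true
```

A StatefulSet would then opt in by setting `storageClassName: fast-ssd` inside its `volumeClaimTemplates` entry.
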
- Our Setup: No limits defined (uses whatever your laptop handles).
- Standard: Explicit `resources.requests` and `limits`. This prevents a rogue query from crashing the entire node.

      resources:
        requests:
          memory: "4Gi"   # Guaranteed RAM
          cpu: "2"        # Guaranteed CPU cores