diff --git a/spark-on-docker-swarm/README.md b/spark-on-docker-swarm/README.md
index 8661ef7..21b8ccc 100644
--- a/spark-on-docker-swarm/README.md
+++ b/spark-on-docker-swarm/README.md
@@ -1,12 +1,12 @@
 # Deploy Stand Alone Spark Cluster on Docker Swarm
-This project brings up a simple Apache Spark stand alone cluster in a Docker swarm. It will also launch and make available a Jupyter PySpark notebook that is connected to the Spark cluster. The cluster has [`matplotlib`](https://matplotlib.org) and [`pandas`](https://pandas.pydata.org) preinstalled for you PySpark on Jupyter joys.
+This project brings up a simple Apache Spark stand alone cluster in a Docker swarm. It will also launch and make available a Jupyter PySpark notebook that is connected to the Spark cluster. The cluster has [`matplotlib`](https://matplotlib.org) and [`pandas`](https://pandas.pydata.org) preinstalled for your PySpark on Jupyter joys.
 
 ## Usage
 
 First, edit the following items as needed for your swarm:
 
 1. `configured-sparknode -> spark-conf -> spark-env.sh`: adjust the environment variables as appropriate for your cluster's nodes, most notably `SPARK_WORKER_MEMORY` and `SPARK_WORKER_CORES`. Leave 1-2 cores and at least 10% of RAM for other processes.
-2. `configured-sparknode -> spark-conf -> spark-env.sh`: Adjust the memory and core settings for the executors and driver. Each executor should have about 5 cores (if possible), and should be a whole divisor into `SPARK_WORKER_CORES`. Spark will launch as many executors as `SPARK_WORKER_CORES` divided by `spark.executor.cores`. reserver about 7-8% of `SPARK_WORKER_MEMORY` for overhead when setting `spark.executor.memory`.
+2. `configured-sparknode -> spark-conf -> spark-env.sh`: Adjust the memory and core settings for the executors and driver. Each executor should have about 5 cores (if possible), and should be a whole divisor into `SPARK_WORKER_CORES`. Spark will launch as many executors as `SPARK_WORKER_CORES` divided by `spark.executor.cores`. Reserve about 7-8% of `SPARK_WORKER_MEMORY` for overhead when setting `spark.executor.memory`.
 3. `build-images.sh`: Adjust the IP address for your local Docker registry. You can use a domain name if all nodes in your swarm can resolve it. This is needed as it allows all nodes in the swarm to pull the locally built Docker images.
 4. `spark-deploy.yml`: Adjust all image names for the updated local Docker registry address you used in the prior step. Also, adjust the resource limits for each of the services. Setting a `cpus` limit here that is smaller than the number of cores on your node has the effect of giving your process a fraction of each core's capacity. You might consider doing this if your swarm hosts other services or does not handle long term 100% CPU load well (e.g., overheats). Also adjust the `replicas` count for the `spark-worker` service to be equal to the number of nodes in your swarm (or less).
 
diff --git a/spark-on-docker-swarm/configured-spark-node/spark-conf/spark-env.sh b/spark-on-docker-swarm/configured-spark-node/spark-conf/spark-env.sh
index 766da90..79dee53 100644
--- a/spark-on-docker-swarm/configured-spark-node/spark-conf/spark-env.sh
+++ b/spark-on-docker-swarm/configured-spark-node/spark-conf/spark-env.sh
@@ -14,5 +14,5 @@ SPARK_WORKER_WEBUI_PORT=8081
 # which python the spark cluster should use for pyspark
 PYSPARK_PYTHON=python3
 
-# hash seed so all node has numbers consistently
+# hash seed so all nodes hash numbers consistently
 PYTHONHASHSEED=8675309
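
For reference, a minimal sketch of the sizing arithmetic described in item 2 of the README. The node size (12 cores, 64 GB RAM) and every value below are illustrative assumptions, not settings from this repo, and the `spark.executor.*` properties may be set in a different file (e.g. `spark-defaults.conf`) than the worker environment variables.

```sh
# Hypothetical worker node: 12 cores, 64 GB RAM (assumed for illustration only).

# spark-env.sh -- leave 1-2 cores and at least 10% of RAM for other processes
SPARK_WORKER_CORES=10      # 12 cores minus 2 reserved
SPARK_WORKER_MEMORY=56g    # roughly 64 GB minus a bit over 10%

# Executor sizing (Spark properties, shown as comments only to illustrate the math):
#   executors per worker  = SPARK_WORKER_CORES / spark.executor.cores = 10 / 5 = 2
#   spark.executor.memory ~ (56g minus ~7-8% overhead) / 2 executors  ~ 25g
# spark.executor.cores   5
# spark.executor.memory  25g
```

With these assumed numbers, each worker runs two executors of 5 cores and ~25 GB each, keeping `spark.executor.cores` a whole divisor of `SPARK_WORKER_CORES` as the README recommends.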