added persistence for jupyter notebooks
michaelkamprath committed Sep 12, 2019
1 parent 34dd5fb commit ebbb894
Showing 3 changed files with 9 additions and 6 deletions.
7 changes: 3 additions & 4 deletions spark-on-docker-swarm/README.md
@@ -10,18 +10,17 @@ First, edit the following items as needed for your swarm:
3. `build-images.sh`: Adjust the IP address for your local Docker registry. You can use a domain name if all nodes in your swarm can resolve it. This is needed as it allows all nodes in the swarm to pull the locally built Docker images.
4. `deploy-spark-swarm.yml`: Adjust all image names for the updated local Docker registry address you used in the prior step. Also, adjust the resource limits for each of the services. Setting a `cpus` limit here that is smaller than the number of cores on your node has the effect of giving your process only a fraction of each core's capacity. You might consider doing this if your swarm hosts other services or does not handle long-term 100% CPU load well (e.g., overheats). Also adjust the `replicas` count for the `spark-worker` service to be equal to the number of nodes in your swarm (or fewer).

Then, to start up the Spark cluster in your Docker swarm, `cd` into this project's directory and:
This setup depends on having a GlusterFS volume mounted at `/mnt/gfs` on all nodes, with a `/mnt/gfs/jupyter-notbooks` directory existing on it. Then, to start up the Spark cluster in your Docker swarm, `cd` into this project's directory and:
```
./build-images.sh
docker stack deploy -c deploy-spark-swarm.yml spark
```
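
Before running the commands above, make sure the notebooks directory exists on the GlusterFS volume. A minimal sketch, assuming `/mnt/gfs` is already a GlusterFS mount shared by all nodes (so creating the directory once, from any node, should be enough):
```
mkdir -p /mnt/gfs/jupyter-notbooks
```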

Then point your development computer's browser at `http://swarm-public-ip:7777/` to load the Jupyter notebook.
Point your development computer's browser at `http://swarm-public-ip:7777/` to load the Jupyter notebook.
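
If the page does not load, a quick reachability check from the development machine can help narrow things down (a sketch; `swarm-public-ip` is a placeholder for your swarm's public address):
```
curl -I http://swarm-public-ip:7777/
```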

## TODO
This cluster is a work in progress. Currently, the following items are missing:
* Persistence for Jupyter notebooks. Once you bring down the cluster, all notebooks you made are deleted.
* A distributed file system, such as HDFS or QFS. Currently, there is no way to ingest data into the cluster except through network transfers, such as a `curl` call set up in a Jupyter notebook.

## Acknowledgements
The docker configuration leverages the [`gettyimages/spark`](https://hub.docker.com/r/gettyimages/spark/) Docker image as a starting point.
The docker configuration leverages the [`gettyimages/spark`](https://hub.docker.com/r/gettyimages/spark/) Docker image as a starting point.
7 changes: 6 additions & 1 deletion spark-on-docker-swarm/deploy-spark-swarm.yml
@@ -1,4 +1,4 @@
version: '3'
version: '3.4'
services:
spark-master:
image: master:5000/configured-spark-node:latest
@@ -73,6 +73,10 @@ services:
ports:
- 7777:7777
- 4040:4040
volumes:
- type: bind
source: /mnt/gfs/jupyter-notbooks
target: /home/jupyter/notebooks
deploy:
resources:
limits:
@@ -81,3 +85,4 @@ services:

networks:
spark-network:
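
The bind mount added above writes notebooks to the GlusterFS path on the host rather than inside the container, so they survive bringing the stack down; the long-form `volumes` syntax used here requires compose file format 3.2 or later, which is presumably why `version` was bumped to `3.4`. A rough way to sanity-check persistence after saving a notebook, reusing the commands from the README (run the `ls` on a node that has `/mnt/gfs` mounted):
```
docker stack rm spark
docker stack deploy -c deploy-spark-swarm.yml spark
ls /mnt/gfs/jupyter-notbooks
```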

1 change: 0 additions & 1 deletion spark-on-docker-swarm/spark-jupyter-notebook/Dockerfile
@@ -2,7 +2,6 @@ FROM configured-spark-node:latest

RUN apt-get install -y g++
RUN pip3 install jupyter
RUN mkdir -p /home/jupyter/notebooks
RUN mkdir -p /home/jupyter/runtime

COPY start-jupyter.sh /
