added persistence for jupyter notebooks
michaelkamprath committed Sep 12, 2019
1 parent 34dd5fb commit ebbb894
Showing 3 changed files with 9 additions and 6 deletions.
7 changes: 3 additions & 4 deletions spark-on-docker-swarm/README.md
@@ -10,18 +10,17 @@ First, edit the following items as needed for your swarm:
3. `build-images.sh`: Adjust the IP address for your local Docker registry. You can use a domain name if all nodes in your swarm can resolve it. This is needed as it allows all nodes in the swarm to pull the locally built Docker images.
4. `deploy-spark-swarm.yml`: Adjust all image names for the updated local Docker registry address you used in the prior step. Also, adjust the resource limits for each of the services. Setting a `cpus` limit here that is smaller than the number of cores on your node has the effect of giving your process only a fraction of each core's capacity. You might consider doing this if your swarm hosts other services or does not handle long-term 100% CPU load well (e.g., overheats). Also adjust the `replicas` count for the `spark-worker` service to be equal to the number of nodes in your swarm (or fewer).

Then, to start up the Spark cluster in your Docker swarm, `cd` into this project's directory and:
This setup depends on having a GlusterFS volume mounted at `/mnt/gfs` on all nodes, with a `/mnt/gfs/jupyter-notbooks` directory existing on it. Then, to start up the Spark cluster in your Docker swarm, `cd` into this project's directory and:
```
./build-images.sh
docker stack deploy -c deploy-spark-swarm.yml spark
```
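
Before running the commands above, make sure the notebooks directory exists on the GlusterFS volume. A minimal sketch, assuming `/mnt/gfs` is already a GlusterFS mount shared by all nodes (so creating the directory once, from any node, should be enough):
```
mkdir -p /mnt/gfs/jupyter-notbooks
```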

Then point your development computer's browser at `http://swarm-public-ip:7777/` to load the Jupyter notebook.
Point your development computer's browser at `http://swarm-public-ip:7777/` to load the Jupyter notebook.
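
If the page does not load, a quick reachability check from the development machine can help narrow things down (a sketch; `swarm-public-ip` is a placeholder for your swarm's public address):
```
curl -I http://swarm-public-ip:7777/
```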

## TODO
This cluster is a work in progress. Currently, the following items are missing:
* Persistence for Jupyter notebooks. Once you bring down the cluster, all notebooks you made are deleted.
* A distributed file system, such as HDFS or QFS. Currently, there is no way to ingest data into the cluster except through network transfers, such as a `curl` call set up in a Jupyter notebook.

## Acknowledgements
The docker configuration leverages the [`gettyimages/spark`](https://hub.docker.com/r/gettyimages/spark/) Docker image as a starting point.
The docker configuration leverages the [`gettyimages/spark`](https://hub.docker.com/r/gettyimages/spark/) Docker image as a starting point.
7 changes: 6 additions & 1 deletion spark-on-docker-swarm/deploy-spark-swarm.yml
@@ -1,4 +1,4 @@
version: '3'
version: '3.4'
services:
spark-master:
image: master:5000/configured-spark-node:latest
@@ -73,6 +73,10 @@ services:
ports:
- 7777:7777
- 4040:4040
volumes:
- type: bind
source: /mnt/gfs/jupyter-notbooks
target: /home/jupyter/notebooks
deploy:
resources:
limits:
@@ -81,3 +85,4 @@ services:

networks:
spark-network:
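
The bind mount added above writes notebooks to the GlusterFS path on the host rather than inside the container, so they survive bringing the stack down; the long-form `volumes` syntax used here requires compose file format 3.2 or later, which is presumably why `version` was bumped to `3.4`. A rough way to sanity-check persistence after saving a notebook, reusing the commands from the README (run the `ls` on a node that has `/mnt/gfs` mounted):
```
docker stack rm spark
docker stack deploy -c deploy-spark-swarm.yml spark
ls /mnt/gfs/jupyter-notbooks
```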

1 change: 0 additions & 1 deletion spark-on-docker-swarm/spark-jupyter-notebook/Dockerfile
@@ -2,7 +2,6 @@ FROM configured-spark-node:latest

RUN apt-get install -y g++
RUN pip3 install jupyter
RUN mkdir -p /home/jupyter/notebooks
RUN mkdir -p /home/jupyter/runtime

COPY start-jupyter.sh /
