
Commit 6163821
rearranged directory and fixed typos
michaelkamprath committed Sep 22, 2019
1 parent 37e19e4 commit 6163821
Showing 8 changed files with 5 additions and 9 deletions.
@@ -10,10 +10,10 @@ First, edit the following items as needed for your swarm:
3. `build-images.sh`: Adjust the IP address for your local Docker registry that all nodes in your cluster can access. You can use a domain name if all nodes in your swarm can resolve it. This is needed as it allows all nodes in the swarm to pull the locally built Docker images.
4. `spark-deploy.yml`: Adjust all image names for the updated local Docker registry address you used in the prior step. Also, adjust the resource limits for each of the services. Setting a `cpus` limit here that is smaller than the number of cores on your node has the effect of giving your process a fraction of each core's capacity. You might consider doing this if your swarm hosts other services or does not handle long term 100% CPU load well (e.g., overheats). Also adjust the `replicas` count for the `spark-worker` service to be equal to the number of nodes in your swarm (or less).
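As a rough illustration of the step 4 adjustments, the `deploy:` block for the `spark-worker` service might end up looking like the sketch below. The registry address `10.0.0.10:5000`, the CPU and memory numbers, and the replica count are placeholders to adapt to your own swarm, not values taken from this repository.

```
spark-worker:
  image: 10.0.0.10:5000/configured-spark-node:latest  # placeholder local registry address
  deploy:
    replicas: 3             # match the number of worker nodes in your swarm (or fewer)
    resources:
      limits:
        cpus: "2.0"         # e.g. 2 of a 4-core node, leaving headroom for other services
        memory: 4G
```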

-This set up depends on have a GlusterFS volume mounted at `/mnt/gfs` on all nodes and the directories exist on it:
+This set up depends on have a GlusterFS volume mounted at `/mnt/gfs` on all nodes and the following directories exist on it:

-* `/mnt/gfs/jupyter-notbooks`
-* `/mnt/gfs/data`
+* `/mnt/gfs/jupyter-notbooks` - used to persist the Jupyter notebooks.
+* `/mnt/gfs/data` - This is where data to analyze with spark gets placed.
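If these directories do not exist yet, they can be created once from any node that has the GlusterFS volume mounted, since GlusterFS makes them visible to the rest of the swarm. A minimal sketch, using the `jupyter-notebooks` spelling that the updated `spark-deploy.yml` below actually binds:

```
# run on any node with the GlusterFS volume mounted at /mnt/gfs
mkdir -p /mnt/gfs/jupyter-notebooks /mnt/gfs/data
```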

Then, to start up the Spark cluster in your Docker swarm, `cd` into this project's directory and:
```
@@ -23,9 +23,5 @@
docker stack deploy -c deploy-spark-swarm.yml spark

Point your development computer's browser at `http://swarm-public-ip:7777/` to load the Jupyter notebook.
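If the notebook does not come up, the standard swarm commands will show whether the services started; `spark` here is the stack name used in the deploy command above.

```
# list the stack's services and how many replicas of each are running
docker stack services spark
# show where each task was scheduled and whether any have failed
docker stack ps spark
```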

-## TODO
-This cluster is a work in progress. Currently, the following items are missing:
-* A distributed file system, such as HDFS or QFS. Currently there is no way to ingest data into the cluster except through network transfers, such as through `curl`, set up in a Jupyter notebook.
-
## Acknowledgements
The docker configuration leverages the [`gettyimages/spark`](https://hub.docker.com/r/gettyimages/spark/) Docker image as a starting point.
@@ -2,7 +2,7 @@

set -e

-#build images
+# build images
docker build -t configured-spark-node:latest ./configured-spark-node
docker build -t spark-jupyter-notebook:latest ./spark-jupyter-notebook

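The rest of `build-images.sh` (collapsed in this diff) presumably handles the registry address from step 3: the freshly built images have to be tagged with that address and pushed so every node in the swarm can pull them. A rough sketch of the usual pattern, with `10.0.0.10:5000` as a stand-in for your registry, not the repository's exact script:

```
# illustrative only; replace 10.0.0.10:5000 with your registry's address or domain name
docker tag configured-spark-node:latest 10.0.0.10:5000/configured-spark-node:latest
docker push 10.0.0.10:5000/configured-spark-node:latest
docker tag spark-jupyter-notebook:latest 10.0.0.10:5000/spark-jupyter-notebook:latest
docker push 10.0.0.10:5000/spark-jupyter-notebook:latest
```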
@@ -83,7 +83,7 @@ services:
- 4040:4040
volumes:
- type: bind
-source: /mnt/gfs/jupyter-notbooks
+source: /mnt/gfs/jupyter-notebooks
target: /home/jupyter/notebooks
- type: bind
source: /mnt/gfs/data
