Merge pull request #1 from DIYBigData/spark-qfs-swarm
Spark + QFS Docker Swarm Stack
michaelkamprath authored Oct 6, 2019
2 parents d40ef55 + 8edf41e commit f3902ed
Showing 18 changed files with 422 additions and 0 deletions.
42 changes: 42 additions & 0 deletions spark-qfs-swarm/README.md
@@ -0,0 +1,42 @@
# Deploy Standalone Spark Cluster with QFS on Docker Swarm
This project deploys a standalone Spark cluster onto a Docker Swarm, using the [Quantcast File System](https://github.com/quantcast/qfs) (QFS) as the cluster's distributed file system. Why QFS? Why not. This configuration also launches a Jupyter PySpark notebook server connected to the Spark cluster, with [`matplotlib`](https://matplotlib.org) and [`pandas`](https://pandas.pydata.org) preinstalled for your PySpark-on-Jupyter joys.

## Usage
First, edit the following items as needed for your swarm:

1. `worker-node -> spark-conf -> spark-env.sh`: adjust the environment variables as appropriate for your cluster's nodes, most notably `SPARK_WORKER_MEMORY` and `SPARK_WORKER_CORES`. Leave 1-2 cores and at least 10% of RAM for other processes.
2. `worker-node -> spark-conf -> spark-defaults.conf`: Adjust the memory and core settings for the executors and the driver. Each executor should get about 5 cores (if possible), and `spark.executor.cores` should divide evenly into `SPARK_WORKER_CORES`, since Spark will launch `SPARK_WORKER_CORES / spark.executor.cores` executors on each worker. Reserve about 7-8% of `SPARK_WORKER_MEMORY` for overhead when setting `spark.executor.memory` (see the sizing sketch after this list).
3. `build-images.sh`: Adjust the IP address of your local Docker registry to one that all nodes in your cluster can access; a domain name works too, provided every node in your swarm can resolve it. The registry is what allows all nodes in the swarm to pull the locally built Docker images.
4. `deploy-spark-qfs-swarm.yml`: Adjust all image names to use the local Docker registry address you set in the prior step. Also adjust the resource limits for each of the services. Setting a `cpus` limit smaller than the number of cores on your node gives the process a corresponding fraction of each core's capacity; consider this if your swarm hosts other services or does not handle long-term 100% CPU load well (e.g., overheats). Note that the `worker-node` service deploys in `global` mode, placing one worker on every node in the swarm; if you want fewer workers, change it to replicated mode and set a `replicas` count.
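
For example, here is a hypothetical sizing for a node with 12 cores and 64 GB of RAM (the values are illustrative, not this repository's defaults):
```
# worker-node/spark-conf/spark-env.sh -- illustrative values for a 12-core, 64 GB node
SPARK_WORKER_CORES=10      # leave 2 cores for other processes
SPARK_WORKER_MEMORY=56g    # leave ~10% of RAM for other processes

# worker-node/spark-conf/spark-defaults.conf -- illustrative values
# 10 worker cores / 5 cores per executor = 2 executors per worker
# 56g, less a 7-8% overhead reserve, split 2 ways = ~26g per executor
spark.executor.cores    5
spark.executor.memory   26g
```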

This setup depends on having a GlusterFS volume mounted at `/mnt/gfs` on all nodes, and on the following directories existing on each node:

* `/mnt/gfs/jupyter-notebooks` - used to persist the Jupyter notebooks (shared across nodes via GlusterFS)
* `/mnt/gfs/data` - a shared data directory that gets mounted into the Jupyter server at `/data`
* `/mnt/data/qfs/logs` - where QFS will store its logs
* `/mnt/data/qfs/chunk` - where the QFS chunk servers will store their data
* `/mnt/data/qfs/checkpoint` - where the QFS metaserver will store the filesystem checkpoints
* `/mnt/data/spark` - the local working directory for Spark

You can adjust these as you see fit, but be sure to update the mounts specified in `deploy-spark-qfs-swarm.yml`.
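
A minimal sketch to create them (run on every node; assumes the GlusterFS volume is already mounted at `/mnt/gfs`, so the `/mnt/gfs` directories only need creating once):
```
mkdir -p /mnt/gfs/jupyter-notebooks /mnt/gfs/data
mkdir -p /mnt/data/qfs/logs /mnt/data/qfs/chunk /mnt/data/qfs/checkpoint
mkdir -p /mnt/data/spark
```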

Then, to start up the Spark cluster on your Docker swarm, `cd` into this project's directory and run:
```
./build-images.sh
docker stack deploy -c deploy-spark-qfs-swarm.yml spark
```
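
Once the stack is deployed, you can sanity-check that all services came up using standard Docker Swarm commands (service names are prefixed with the stack name, e.g., `spark_spark-master`):
```
docker stack services spark
docker service logs spark_spark-master
```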

Point your development computer's browser at `http://swarm-public-ip:7777/` to load the Jupyter notebook.

### Working with QFS
To launch a Docker container to give you command line access to QFS, use the following command:
```
docker run -it --network="spark_cluster_network" master:5000/qfs-master:latest /bin/bash
```
Note that you must attach the container to the network that the Spark cluster's services are using (`spark_cluster_network` when the stack is deployed under the name `spark`). From this command prompt, the following commands are pre-configured to connect to the QFS instance:

* `qfs` - enables most Linux-style file operations on the QFS instance.
* `cptoqfs` - copies files from the local file system (in the Docker container) to the QFS instance.
* `cpfromqfs` - copies files from the QFS instance to the local file system (in the Docker container).
* `qfsshell` - a useful shell-style interface to the QFS instance.

You might consider adding a volume mount to the `docker run` command so that the Docker container can access data from your local file system.
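
For example, a sketch that mounts a hypothetical host directory `/path/on/host` into the container and copies a file into QFS (the host path and file names are placeholders; check `cptoqfs -h` for the exact flags on your QFS version):
```
docker run -it --network="spark_cluster_network" \
    -v /path/on/host:/local-data \
    master:5000/qfs-master:latest /bin/bash

# then, at the container's prompt:
qfs -ls /                                            # list the QFS root
cptoqfs -d /local-data/my-data.csv -k /my-data.csv   # copy a local file into QFS
```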
21 changes: 21 additions & 0 deletions spark-qfs-swarm/build-images.sh
@@ -0,0 +1,21 @@
#!/bin/bash

set -e

# build images
docker build -t worker-node:latest ./worker-node
docker build -t qfs-master:latest ./qfs-master
docker build -t spark-master:latest ./spark-master
docker build -t jupyter-server:latest ./jupyter-server

# tag images for the local registry
docker tag worker-node:latest master:5000/worker-node:latest
docker tag qfs-master:latest master:5000/qfs-master:latest
docker tag spark-master:latest master:5000/spark-master:latest
docker tag jupyter-server:latest master:5000/jupyter-server:latest

# push the images to the local registry
docker push master:5000/worker-node:latest
docker push master:5000/qfs-master:latest
docker push master:5000/spark-master:latest
docker push master:5000/jupyter-server:latest
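
The pushes above assume a Docker registry is reachable at `master:5000`. If you do not have one yet, a minimal sketch for running Docker's standard registry image as a swarm service (assuming your manager node resolves as `master`):
```
docker service create --name registry --publish published=5000,target=5000 registry:2
```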
104 changes: 104 additions & 0 deletions spark-qfs-swarm/deploy-spark-qfs-swarm.yml
@@ -0,0 +1,104 @@
version: '3.4'
services:
  qfs-master:
    image: master:5000/qfs-master:latest
    hostname: qfs-master
    networks:
      - cluster_network
    ports:
      - 20000:20000
      - 30000:30000
      - 20050:20050
    volumes:
      - type: bind
        source: /mnt/data/qfs
        target: /data/qfs
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 2g
      placement:
        constraints:
          - node.role == manager
  spark-master:
    image: master:5000/spark-master:latest
    hostname: spark-master
    environment:
      - SPARK_PUBLIC_DNS=10.1.1.1
      - SPARK_LOG_DIR=/data/spark/logs
    networks:
      - cluster_network
    ports:
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - type: bind
        source: /mnt/data/spark
        target: /data/spark
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 6g
  jupyter-server:
    image: master:5000/jupyter-server:latest
    hostname: jupyter-server
    environment:
      - SPARK_PUBLIC_DNS=10.1.1.1
      - SPARK_LOG_DIR=/data/spark/logs
    depends_on:
      - spark-master
      - qfs-master
      - worker-node
    networks:
      - cluster_network
    ports:
      - 7777:7777
      - 4040:4040
    volumes:
      - type: bind
        source: /mnt/gfs/jupyter-notebooks
        target: /home/jupyter/notebooks
      - type: bind
        source: /mnt/gfs/data
        target: /data
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 6g
  worker-node:
    image: master:5000/worker-node:latest
    hostname: worker
    environment:
      - SPARK_PUBLIC_DNS=10.1.1.1
      - SPARK_LOG_DIR=/data/spark/logs
    depends_on:
      - qfs-master
      - spark-master
    networks:
      - cluster_network
    ports:
      - 8081:8081
    volumes:
      - type: bind
        source: /mnt/data/qfs
        target: /data/qfs
      - type: bind
        source: /mnt/data/spark
        target: /data/spark
    deploy:
      mode: global
      resources:
        limits:
          cpus: "6.0"
          memory: 56g
networks:
  cluster_network:
    attachable: true
    ipam:
      driver: default
      config:
        - subnet: 10.20.30.0/24
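
Note that Docker Swarm prefixes the network name with the stack name, so deploying this file as `spark` yields a network named `spark_cluster_network`; because the network is `attachable`, standalone containers such as the README's `docker run` example can join it. A quick way to confirm it exists after deployment:
```
docker network ls --filter name=spark_cluster_network
```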
9 changes: 9 additions & 0 deletions spark-qfs-swarm/jupyter-server/Dockerfile
@@ -0,0 +1,9 @@
FROM worker-node:latest

# the parent image clears the apt lists, so update before installing
RUN apt-get update \
    && apt-get install -y g++ \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
RUN pip3 install jupyter
RUN mkdir -p /home/jupyter/runtime

COPY start-jupyter.sh /

CMD ["/bin/bash", "/start-jupyter.sh"]
3 changes: 3 additions & 0 deletions spark-qfs-swarm/jupyter-server/start-jupyter.sh
@@ -0,0 +1,3 @@
#!/bin/bash

XDG_RUNTIME_DIR=/home/jupyter/runtime \
    PYSPARK_DRIVER_PYTHON=jupyter \
    PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777 --notebook-dir=/home/jupyter/notebooks --ip=* --allow-root --NotebookApp.token='' --NotebookApp.password=''" \
    $SPARK_HOME/bin/pyspark --master spark://spark-master:7077
28 changes: 28 additions & 0 deletions spark-qfs-swarm/qfs-master/Dockerfile
@@ -0,0 +1,28 @@
FROM worker-node:latest

#
# Expected volumes:
# /data/qfs - this is where QFS will store its data
#
# This instance should run on the swarm's manager node so that its configuration persists
#

# need python 2 for webserver

RUN apt-get update \
&& apt-get install -y python2.7 less wget \
&& ln -s /usr/bin/python2.7 /usr/bin/python2 \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

# set configuration
COPY ./qfs-conf/* $QFS_HOME/conf/

# create some useful bash aliases for when at bash shell prompt of this image
RUN echo 'alias qfs="qfs -fs qfs://qfs-master:20000"' >> ~/.bashrc \
&& echo 'alias cptoqfs="cptoqfs -s qfs-master -p 20000"' >> ~/.bashrc \
&& echo 'alias cpfromqfs="cpfromqfs -s qfs-master -p 20000"' >> ~/.bashrc \
&& echo 'alias qfsshell="qfsshell -s qfs-master -p 20000"' >> ~/.bashrc

COPY start-qfs-master.sh /
CMD ["/bin/bash", "/start-qfs-master.sh"]
12 changes: 12 additions & 0 deletions spark-qfs-swarm/qfs-master/qfs-conf/Metaserver.prp
@@ -0,0 +1,12 @@
metaServer.clientPort = 20000
metaServer.chunkServerPort = 30000
metaServer.createEmptyFs = 1
metaServer.logDir = /data/qfs/logs
metaServer.cpDir = /data/qfs/checkpoint
metaServer.recoveryInterval = 30
metaServer.clusterKey = qfs-personal-compute-cluster
metaServer.msgLogWriter.logLevel = INFO
chunkServer.msgLogWriter.logLevel = NOTICE
metaServer.rootDirMode = 0777
metaServer.rootDirGroup = 1000
metaServer.rootDirUser = 1000
Empty file.
7 changes: 7 additions & 0 deletions spark-qfs-swarm/qfs-master/qfs-conf/webUI.cfg
@@ -0,0 +1,7 @@
[webserver]
webServer.metaserverHost = qfs-master
webServer.metaserverPort = 20000
webServer.port = 20050
webServer.docRoot = $QFS_HOME/webui/files/
webServer.host = 0.0.0.0
webserver.allmachinesfn = /dev/null
9 changes: 9 additions & 0 deletions spark-qfs-swarm/qfs-master/start-qfs-master.sh
@@ -0,0 +1,9 @@
#!/bin/bash

$QFS_HOME/bin/metaserver $QFS_HOME/conf/Metaserver.prp &> $QFS_LOGS_DIR/metaserver.log &

python2 $QFS_HOME/webui/qfsstatus.py $QFS_HOME/conf/webUI.cfg &> $QFS_LOGS_DIR/webui.log &

# now do nothing and do not exit
while true; do sleep 3600; done

9 changes: 9 additions & 0 deletions spark-qfs-swarm/spark-master/Dockerfile
@@ -0,0 +1,9 @@
FROM worker-node:latest

#
# Expected volumes:
# /data/spark - this is the spark working directory
#

COPY start-spark-master.sh /
CMD ["/bin/bash", "/start-spark-master.sh"]
7 changes: 7 additions & 0 deletions spark-qfs-swarm/spark-master/start-spark-master.sh
@@ -0,0 +1,7 @@
#!/bin/bash

# start Spark master
$SPARK_HOME/sbin/start-master.sh

# now do nothing and do not exit
while true; do sleep 3600; done
88 changes: 88 additions & 0 deletions spark-qfs-swarm/worker-node/Dockerfile
@@ -0,0 +1,88 @@
FROM debian:stretch
MAINTAINER Michael Kamprath "https://github.com/michaelkamprath"
#
# Base image for an Apache Spark standalone cluster with QFS
#
# Inspired by https://hub.docker.com/r/gettyimages/spark/dockerfile
#
#
# Expected volumes:
# /data/qfs - this is where QFS will store its data
# /data/spark - this is the spark working directory
#
# Expected service names:
# qfs-master - the service where the QFS metaserver runs
# spark-master - the service where the spark master runs
#

RUN apt-get update \
&& apt-get install -y locales \
&& dpkg-reconfigure -f noninteractive locales \
&& locale-gen C.UTF-8 \
&& /usr/sbin/update-locale LANG=C.UTF-8 \
&& echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen \
&& locale-gen \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8

RUN apt-get update \
&& apt-get install -y curl unzip \
python3 python3-setuptools \
libboost-regex-dev \
&& ln -s /usr/bin/python3 /usr/bin/python \
&& easy_install3 pip py4j \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

ENV PYTHONIOENCODING UTF-8
ENV PIP_DISABLE_PIP_VERSION_CHECK 1

# JAVA
RUN apt-get update \
&& apt-get install -y openjdk-8-jre \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

# QFS
ENV QFS_VERSION 2.1.2
ENV HADOOP_VERSION 2.7.2
ENV QFS_PACKAGE qfs-debian-9-${QFS_VERSION}-x86_64
ENV QFS_HOME /usr/qfs-${QFS_VERSION}
ENV QFS_LOGS_DIR /data/qfs/logs
ENV LD_LIBRARY_PATH ${QFS_HOME}/lib
RUN curl -sL --retry 3 \
"https://s3.amazonaws.com/quantcast-qfs/qfs-debian-9-${QFS_VERSION}-x86_64.tgz" \
| gunzip \
| tar x -C /usr/ \
&& mv /usr/$QFS_PACKAGE $QFS_HOME \
&& chown -R root:root $QFS_HOME
COPY ./qfs-conf/* $QFS_HOME/conf/
ENV PATH $PATH:${QFS_HOME}/bin:${QFS_HOME}/bin/tools

# SPARK
ENV SPARK_VERSION 2.4.4
ENV SPARK_PACKAGE spark-${SPARK_VERSION}-bin-hadoop2.7
ENV SPARK_HOME /usr/spark-${SPARK_VERSION}
ENV SPARK_DIST_CLASSPATH="$QFS_HOME/lib/hadoop-$HADOOP_VERSION-qfs-$QFS_VERSION.jar:$QFS_HOME/lib/qfs-access-$QFS_VERSION.jar"
ENV HADOOP_CONF_DIR=${SPARK_HOME}/conf/
ENV PATH $PATH:${SPARK_HOME}/bin
RUN curl -sL --retry 3 \
"https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}.tgz" \
| gunzip \
| tar x -C /usr/ \
&& mv /usr/$SPARK_PACKAGE $SPARK_HOME \
&& chown -R root:root $SPARK_HOME
COPY ./spark-conf/* $SPARK_HOME/conf/

# add python libraries useful in PySpark
RUN pip3 install matplotlib pandas

# set up command
WORKDIR /root
COPY start-worker-node.sh /
CMD ["/bin/bash", "/start-worker-node.sh"]
10 changes: 10 additions & 0 deletions spark-qfs-swarm/worker-node/qfs-conf/Chunkserver.prp
@@ -0,0 +1,10 @@
chunkServer.metaServer.hostname = qfs-master
chunkServer.metaServer.port = 30000
chunkServer.clientPort = 22000
chunkServer.chunkDir = /data/qfs/chunk
chunkServer.clusterKey = qfs-personal-compute-cluster
chunkServer.stdout = /dev/null
chunkServer.stderr = /dev/null
chunkServer.ioBufferPool.partitionBufferCount = 65536
chunkServer.msgLogWriter.logLevel = INFO
chunkServer.diskQueue.threadCount = 4
23 changes: 23 additions & 0 deletions spark-qfs-swarm/worker-node/spark-conf/core-site.xml
@@ -0,0 +1,23 @@
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Setting for QFS-->

<configuration>
  <property>
    <name>fs.qfs.impl</name>
    <value>com.quantcast.qfs.hadoop.QuantcastFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>qfs://qfs-master:20000</value>
  </property>
  <property>
    <name>fs.qfs.metaServerHost</name>
    <value>qfs-master</value>
  </property>
  <property>
    <name>fs.qfs.metaServerPort</name>
    <value>20000</value>
  </property>
</configuration>
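
With `fs.defaultFS` pointing at QFS, Spark reads and writes `qfs://` paths as its default filesystem. A hypothetical smoke test, run from inside one of the cluster's containers (the path and values are illustrative):
```
$SPARK_HOME/bin/pyspark --master spark://spark-master:7077 <<'EOF'
df = spark.range(100)
df.write.mode("overwrite").parquet("qfs://qfs-master:20000/tmp/range_test")
print(spark.read.parquet("qfs://qfs-master:20000/tmp/range_test").count())
EOF
```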