added matplotlib and pandas to build
michaelkamprath committed Sep 9, 2019
1 parent 44a7438 commit 1d2966d
Showing 2 changed files with 6 additions and 3 deletions.
3 changes: 1 addition & 2 deletions spark-on-docker-swarm/README.md
@@ -1,6 +1,6 @@
# Deploy Stand Alone Spark Cluster on Docker Swarm

-This project brings up a simple Apache Spark stand alone cluster in a Docker swarm. It will also launch and make available a Jupyter PySpark notebook that is connected to the Spark cluster.
+This project brings up a simple Apache Spark stand alone cluster in a Docker swarm. It will also launch and make available a Jupyter PySpark notebook that is connected to the Spark cluster. The cluster has [`matplotlib`](https://matplotlib.org) and [`pandas`](https://pandas.pydata.org) preinstalled for your PySpark on Jupyter joys.

## Usage
First, edit the following items as needed for your swarm:
@@ -22,7 +22,6 @@ Then point your development computer's browser at `http://swarm-public-ip:7777/`
This cluster is a work in progress. Currently, the following items are missing:
* Persistence for Jupyter notebooks. Once you bring down the cluster, all notebooks you made are deleted.
* A distributed file system, such as HDFS or QFS. Currently there is no way to ingest data into the cluster except through network transfers, such as through `curl`, set up in a Jupyter notebook.
-* Robust set of Python libraries. This build is currently missing things like [`matplotlib`](https://matplotlib.org) and [`pandas`](https://pandas.pydata.org).

## Acknowledgements
The docker configuration leverages the [`gettyimages/spark`](https://hub.docker.com/r/gettyimages/spark/) Docker image as a starting point.
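For context on what the updated README now promises, here is a minimal sketch of a notebook cell that the image should support after this commit. It is illustrative only: it assumes the Jupyter PySpark container exposes a ready-made `spark` session, and the DataFrame and column names (`df`, `id_squared`) are made up for the example.

```python
# Minimal sketch: exercise pandas and matplotlib from the PySpark notebook.
# Assumes the Jupyter container provides a ready-made SparkSession as `spark`.
import matplotlib.pyplot as plt

# Build a tiny Spark DataFrame, then pull it to the driver as a pandas DataFrame.
df = spark.range(0, 100).selectExpr("id", "id * id AS id_squared")
pdf = df.toPandas()

# Plot with matplotlib; in Jupyter the figure renders inline.
pdf.plot(x="id", y="id_squared", kind="line")
plt.title("id vs. id squared")
plt.show()
```

Because `toPandas()` collects the entire DataFrame onto the driver, this pattern only makes sense for small or already-aggregated results.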
6 changes: 5 additions & 1 deletion spark-on-docker-swarm/configured-spark-node/Dockerfile
@@ -1,7 +1,11 @@
FROM gettyimages/spark

+# add python libraries useful in PySpark
+RUN python3 -mpip install matplotlib \
+    && pip3 install pandas
+
# copy desired configuration to the spark conf
-COPy ./spark-conf/* $SPARK_HOME/conf/
+COPY ./spark-conf/* $SPARK_HOME/conf/

# same default command as the FROM image
CMD ["bin/spark-class", "org.apache.spark.deploy.master.Master"]
