Commit 34a441b

Document Kafka and Tensorflow on Spark install
1 parent 8e9bcb3 commit 34a441b

2 files changed, +154 -0 lines changed


_drafts/kafka.md

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
---
layout: post
title: "Kafka"
author: Josh Arnold
categories: kafka
---

_drafts/tensorflow-on-spark.md

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
---
layout: post
title: "TensorFlow on Spark"
author: Josh Arnold
categories: spark machine-learning deep-learning tensorflow
---

TensorFlow on Spark
-------------------

As per Dan Fabbri's suggestion, we installed TensorFlow on Spark, which
provides model-level parallelism for TensorFlow. We followed the instructions
from [Yahoo](https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN).

### Install Python 2.7

This version of Python is to be copied to HDFS.

```bash
# download and extract Python 2.7
export PYTHON_ROOT=~/Python
curl -O https://www.python.org/ftp/python/2.7.12/Python-2.7.12.tgz
tar -xvf Python-2.7.12.tgz
rm Python-2.7.12.tgz

# compile into local PYTHON_ROOT
pushd Python-2.7.12
./configure --prefix="${PYTHON_ROOT}" --enable-unicode=ucs4
make
make install
popd
rm -rf Python-2.7.12

# install pip
pushd "${PYTHON_ROOT}"
curl -O https://bootstrap.pypa.io/get-pip.py
bin/python get-pip.py
rm get-pip.py

# install tensorflow (and any custom dependencies)
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
# This next step was necessary because the install wasn't finding javac
export PATH=$PATH:/usr/java/jdk1.7.0_67-cloudera/bin
${PYTHON_ROOT}/bin/pip install pydoop
# Note: add any extra dependencies here
popd
```

### Install and compile TensorFlow w/ RDMA Support

The instructions recommend installing TensorFlow from source, but this
requires installing bazel, which I expect to be a major pain. Instead,
for now, I've installed via `pip`:

```bash
${PYTHON_ROOT}/bin/pip install tensorflow
```
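
Since this skips the from-source RDMA build, it's worth confirming that the
pip-installed TensorFlow at least imports cleanly from the relocatable Python.
This is a quick sanity check of my own, not part of the Yahoo instructions:

```bash
# confirm TensorFlow imports from the Python that will be shipped to HDFS
${PYTHON_ROOT}/bin/python -c "import tensorflow as tf; print(tf.__version__)"
```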
### Install and compile Hadoop InputFormat/OutputFormat for TFRecords

[TFRecords](https://github.com/tensorflow/ecosystem/tree/master/hadoop)

Here are the original instructions:

```bash
git clone https://github.com/tensorflow/ecosystem.git
# follow build instructions to generate tensorflow-hadoop-1.0-SNAPSHOT.jar
# copy jar to HDFS for easier reference
hadoop fs -put tensorflow-hadoop-1.0-SNAPSHOT.jar
```

Building the jar is fairly involved and requires its own set of installs.

#### protoc 3.1.0

[Google's data interchange format](https://developers.google.com/protocol-buffers/)
[(with source code)](https://github.com/google/protobuf/tree/master/src)

> Protocol buffers are Google's language-neutral, platform-neutral,
> extensible mechanism for serializing structured data – think XML,
> but smaller, faster, and simpler. You define how you want your data to be
> structured once, then you can use special generated source code to easily
> write and read your structured data to and from a variety of data streams
> and using a variety of languages.

```bash
wget https://github.com/google/protobuf/archive/v3.1.0.tar.gz
```

I had to `yum install` `autoconf`, `automake`, and `libtool`. In the future,
we should make sure to include these in cfengine.
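
For reference, that install is a one-liner; the package names are taken from
the sentence above:

```bash
sudo yum install -y autoconf automake libtool
```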

Then build and install protoc:

```bash
# extract and enter the source tree (v3.1.0.tar.gz unpacks to protobuf-3.1.0)
tar -xvzf v3.1.0.tar.gz
pushd protobuf-3.1.0
./autogen.sh
./configure
make
make check
sudo make install
sudo ldconfig # refresh shared library cache
popd
```

Note that `make check` passed all 7 tests.
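
As a quick check of my own, the new compiler should now be on the path and
report its version:

```bash
protoc --version  # expected to print: libprotoc 3.1.0
```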

#### Apache Maven

```bash
wget
tar -xvzf apache-
```

```bash
# build from within the ecosystem/hadoop directory cloned above
$MAVEN_HOME/bin/mvn clean package
$MAVEN_HOME/bin/mvn install
```

This will generate the jar at `ecosystem/hadoop/target/tensorflow-hadoop-1.0-SNAPSHOT.jar`.

```bash
hadoop fs -put tensorflow-hadoop-1.0-SNAPSHOT.jar
```
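
To confirm the jar landed in the HDFS home directory (my addition, not part of
the original instructions):

```bash
hadoop fs -ls tensorflow-hadoop-1.0-SNAPSHOT.jar
```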

### Create a Python w/ TensorFlow zip package for Spark

```bash
pushd "${PYTHON_ROOT}"
zip -r Python.zip *
popd
```

Copy this Python distribution into HDFS:

```bash
hadoop fs -put ${PYTHON_ROOT}/Python.zip
```

### Install TensorFlowOnSpark

Next, clone the TensorFlowOnSpark repo and build a zip package for Spark:

```bash
git clone git@github.com:yahoo/TensorFlowOnSpark.git
pushd TensorFlowOnSpark/src
zip -r ../tfspark.zip *
popd
```
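
With the Python distribution, the `tfspark.zip` package, and the TFRecords jar
staged on HDFS, a job can be submitted to YARN. The sketch below only shows
how the pieces fit together: `${SPARK_HOME}`, the HDFS paths, the resource
sizes, and `your_tf_on_spark_script.py` are placeholders, and the exact flags
for a real training run (e.g. the MNIST example) are in the Yahoo wiki linked
above.

```bash
# Rough sketch of a TensorFlowOnSpark submission on YARN; paths and sizes are placeholders.
${SPARK_HOME}/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4G \
  --archives hdfs:///user/${USER}/Python.zip#Python \
  --py-files TensorFlowOnSpark/tfspark.zip \
  --jars hdfs:///user/${USER}/tensorflow-hadoop-1.0-SNAPSHOT.jar \
  --conf spark.executorEnv.PYSPARK_PYTHON=Python/bin/python \
  your_tf_on_spark_script.py
```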
