Commit 112fd0f
Add post on jupyter notebooks
1 parent 34a441b
14 files changed: +520 -186

_config.yml (+5 -1)

@@ -23,8 +23,12 @@ twitter_username: ACCREVandy
 github_username: bigdata-vandy
 slack_channel: ACCRE-Forum
 
+# Pages
+#include: ['_pages']
+
+permalink: "/:title/"
 # Build settings
-markdown: kramdown
+markdown: kramdown
 theme: minima
 gems:
   - jekyll-feed
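
A quick way to sanity-check the new `permalink` setting is to build the site locally. This is only a sketch: it assumes Bundler and the theme dependencies are already installed, and the post filename is illustrative rather than one from this commit.

```bash
# Serve the site (including _drafts/) and confirm that a post such as
# _posts/2017-01-01-using-hue.md (illustrative name) now renders at
# http://localhost:4000/using-hue/ instead of the default
# /:categories/:year/:month/:day/:title.html style URL.
bundle exec jekyll serve --drafts
```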

_drafts/hue-notebooks.md (+15)

@@ -0,0 +1,15 @@
+---
+layout: post
+title: "Creating Notebooks with Hue"
+author: Josh Arnold
+categories: hue
+---
+
+* TOC
+{:toc}
+
+
+It's a bad idea
+---------------
+* A
+* b

_drafts/tensorflow-on-spark.md (+8 -141)

@@ -1,147 +1,14 @@
 ---
 layout: post
-title: "Tensorflow on Spark"
+title: "Using TensorflowOnSpark with GPFS on the ACCRE Cluster"
 author: Josh Arnold
-categories: spark machine-learning deep-learning tensorflow
+categories: spark tensorflow slurm
 ---
 
-TensorFlow on Spark
--------------------
+## Motivation
+[TensorflowOnSpark][tfos]
 
-As per Dan Fabbri's suggestion, we installed TensforFlow on Spark, which
-provides model-level parallelism for Tensorflow. Following the instructions
-from [Yahoo](https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN).
-
-### Install Python 2.7
-
-This version of python is to be copied to HDFS.
-
-# download and extract Python 2.7
-export PYTHON_ROOT=~/Python
-curl -O https://www.python.org/ftp/python/2.7.12/Python-2.7.12.tgz
-tar -xvf Python-2.7.12.tgz
-rm Python-2.7.12.tgz
-
-# compile into local PYTHON_ROOT
-pushd Python-2.7.12
-./configure --prefix="${PYTHON_ROOT}" --enable-unicode=ucs4
-make
-make install
-popd
-rm -rf Python-2.7.12
-
-# install pip
-pushd "${PYTHON_ROOT}"
-curl -O https://bootstrap.pypa.io/get-pip.py
-bin/python get-pip.py
-rm get-pip.py
-
-# install tensorflow (and any custom dependencies)
-export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
-# This next step was necessary because the install wasn't finding
-# javac
-export PATH=$PATH:/usr/java/jdk1.7.0_67-cloudera/bin
-${PYTHON_ROOT}/bin/pip install pydoop
-# Note: add any extra dependencies here
-popd
-
-### Install and compile TensorFlow w/ RDMA Support
-
-The instructions recommend installing tensorflow from source, but this
-requires installing bazel, which I expect to be a major pain. Instead,
-for now, I've installed via `pip`:
-
-```bash
-${PYTHON_ROOT}/bin/pip install tensorflow
-```
-
-### Install and compile Hadoop InputFormat/OutputFormat for TFRecords
-
-[TFRecords](https://github.com/tensorflow/ecosystem/tree/master/hadoop)
-
-Here are the original instructions:
-
-```bash
-git clone https://github.com/tensorflow/ecosystem.git
-# follow build instructions to generate tensorflow-hadoop-1.0-SNAPSHOT.jar
-# copy jar to HDFS for easier reference
-hadoop fs -put tensorflow-hadoop-1.0-SNAPSHOT.jar
-```
-
-Building the jar is fairly involved and requires its own branch of installs.
-
-#### protoc 3.1.0
-
-[Google's data interchange format](https://developers.google.com/protocol-buffers/)
-[(with source code)](https://github.com/google/protobuf/tree/master/src)
-
-> Protocol buffers are Google's language-neutral, platform-neutral,
-> extensible mechanism for serializing structured data – think XML,
-> but smaller,
-> faster, and simpler. You define how you want your data to be structured
-> once, then you can use special generated source code to easily write and
-> read your structured data to and from a variety of data streams and using a
-> variety of languages.
-
-```bash
-wget https://github.com/google/protobuf/archive/v3.1.0.tar.gz
-```
-
-I had to yum install `autoconf`, `automake`, and `libtool`. In the future,
-we should make sure to include these in cfengine.
-
-
-Then
-
-```bash
-./autogen.sh
-./configure
-make
-make check
-sudo make install
-sudo ldconfig # refresh shared library cache
-```
-
-Note that `make check` passed all 7 tests.
-
-#### Apache Maven
-
-```bash
-wget
-tar -xvzf apache-
-```
-
-```bash
-$MAVEN_HOME/mvn clean package
-$MAVEN_HOME/mvn install
-```
-
-This will generate the jar in the directory `ecosystem/hadoop/target/tensorflow-hadoop-1.0-SNAPSHOT.jar`.
-
-```bash
-hadoop fs -put tensorflow-hadoop-1.0-SNAPSHOT.jar
-```
-
-### Create a Python w/ TensorFlow zip package for Spark
-
-```bash
-pushd "${PYTHON_ROOT}"
-zip -r Python.zip *
-popd
-```
-
-Copy this Python distribution into HDFS:
-```bash
-hadoop fs -put ${PYTHON_ROOT}/Python.zip
-```
-
-### Install TensorFlowOnSpark
-
-Next, clone this repo and build a zip package for Spark:
-
-```bash
-git clone git@github.com:yahoo/TensorFlowOnSpark.git
-pushd TensorFlowOnSpark/src
-zip -r ../tfspark.zip *
-popd
-```
+[tfos]: http://yahoohadoop.tumblr.com/post/157196317141/open-sourcing-tensorflowonspark-distributed-deep
+[accre-hpc]: https://github.com/accre/BigData/tree/master/spark-slurm
+[accre-gh-ss]: https://github.com/accre/BigData/tree/master/spark-slurm
+[slurm-srun]: https://slurm.schedmd.com/srun.html

_drafts/using-hue.md (+61 -32)

@@ -2,24 +2,74 @@
 layout: post
 title: "Using Hue on the Big Data Cluster"
 author: Josh Arnold
-categories: hue cloudera spark
+categories: hue
 ---
 
+* TOC
+{:toc}
+
 # Requirements
-The bigdata cluster is available for use by the Vanderbilt community.
-Users need
-a valid Vanderbilt ID and password to log on, and once they've logged on,
-should [contact ACCRE](http://www.accre.vanderbilt.edu/?page_id=367)
-about getting permission. Once approved, users will be
-able to connect to `bigdata.accre.vanderbilt.edu` with `ssh`.
 
+The bigdata cluster is available for use by the Vanderbilt community.
+Users should [contact ACCRE](http://www.accre.vanderbilt.edu/?page_id=367)
+to get access to the cluster.
 
 # Logging on to the Cluster via Hue
-Navigate to `bigdata.accre.vanderbilt.edu:8888` in your web browser.
+Once approved, users will be
+able to connect to `bigdata.accre.vanderbilt.edu` via `ssh`, but Cloudera
+Manager also provides a WebUI, called Hue, for interacting with the cluster.
+To access Hue, simply navigate to `bigdata.accre.vanderbilt.edu:8888` in your
+web browser and enter your credentials.
+
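
As a point of reference for the `ssh` route mentioned above, logging in is a one-liner; `<vunetid>` below is a placeholder for your VUnetID:

```bash
# Log in to the Big Data cluster gateway over SSH (replace <vunetid> with your VUnetID)
ssh <vunetid>@bigdata.accre.vanderbilt.edu
```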
+# Using the HDFS file browser
+
+If you've used the web UIs for Dropbox, Google Drive, etc., then this step
+is a piece of cake. The File Browser is accessed from the
+dog-eared-piece-of-paper icon near the top right of the screen. In the file
+browser, you're able to navigate the directory structure of HDFS and even
+view the contents of text files.
+
+When a new user logs into Hue, Hue creates an HDFS directory for that user
+at `/user/<vunetid>`, which becomes that user's home directory.
+*Note that, by default, logging in to Hue creates a new user's home directory
+with read and execute permissions enabled for the world!*
+
+Files can be uploaded to your directories using the drag-and-drop mechanism; however,
+the file size for transferring through the WebUI is capped at around 50GB,
+so other tools like `scp` or `rsync` are necessary for moving large files
+onto the cluster.
+
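
A minimal sketch of both points above, tightening the world-readable default and loading a large file without the WebUI; the file name and paths are placeholders:

```bash
# From an ssh session on the cluster: make your HDFS home directory private
# (it is world-readable/executable by default)
hdfs dfs -chmod -R 700 /user/<vunetid>

# From your local machine: copy a file too large for the WebUI to the gateway
scp big-dataset.csv <vunetid>@bigdata.accre.vanderbilt.edu:~/

# Back on the cluster: load the file into HDFS
hdfs dfs -put ~/big-dataset.csv /user/<vunetid>/
```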
+In addition to your own data, ACCRE hosts some publicly available datasets
+at `/data/`:
+
+Directory             | Description
+--------------------- | -----------
+babs                  | Bay Area bikeshare data
+capitalbikeshare data | DC area bikeshare data
+citibike-tripdata     | NYC bikeshare data
+google-ngrams         | n-grams collected from Google Books
+nyc-tlc               | NYC taxi trip data
+stack-archives        | historic posts from StackOverflow, et al.
+
+If you know of other datasets that may appeal to the Vanderbilt community at
+large, just let us know!
 
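
To poke around these shared datasets from a shell (the File Browser works just as well), a listing is enough; the subdirectory chosen below is only an example:

```bash
# List the shared datasets hosted on HDFS
hdfs dfs -ls /data/
# Peek at one of them, e.g. the Google n-grams collection
hdfs dfs -ls /data/google-ngrams | head
```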

-# Overview of Cloudera Services
+# Building an application
 
-The Hadoop ecosystem is thriving, and Cloudera implememnts many of these
+Hue uses Oozie to compose workflows on the cluster; to access it, you'll
+need to follow the tabs `Workflows -> Editors -> Workflows`. From here, click
+the `+ Create` button, and you'll arrive at the workflow composer screen. You
+can drag and drop an application into your workflow, for instance a Spark job.
+Here you can specify the jar file (which, conveniently, you can generate from
+our [GitHub repo][spark-wc-gh]) as well as options and inputs.
+If you want to interactively select your input and output files each time you
+execute the job, you can use the special keywords `${input}` and `${output}`,
+a nice feature for generalizing your workflows.
+
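
For intuition about what the Spark action in such a workflow ends up running, a rough command-line analogue is sketched below. The class name, jar path, and HDFS paths are illustrative placeholders rather than values taken from the repo, and `${input}`/`${output}` are the Oozie parameters Hue prompts you for at submission time.

```bash
# Approximate command-line equivalent of the Hue/Oozie Spark action described above.
# example.WordCount, the jar path, and the HDFS paths are illustrative placeholders;
# in the workflow editor, the last two arguments would be ${input} and ${output}.
spark-submit --master yarn --class example.WordCount \
    /path/to/spark-wordcount.jar \
    hdfs:///user/<vunetid>/input.txt \
    hdfs:///user/<vunetid>/wordcount-out
```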
+# Overview of Cloudera services
+
+The Hadoop ecosystem is rich with applications,
+and Cloudera implements many of these
 technologies out of the box.
 
 | Cloudera Services | Description
@@ -36,25 +86,4 @@ technologies out of the box.
 | Pig | High-level language for expressing data analysis programs
 | Solr | Text search engine supporting free form queries
 
-In general
-
-# Using the HDFS File Browser
-If you've used the web UIs for Dropbox, Google Drive, etc., then this step
-is a piece of cake. The File Browser is accessed from the
-dog-eared-piece-of-paper icon near the top right of the screen.
-
-*Note: by default, logging in to Hue creates a new user's home directory
-at /user/username with read and execute permissions enabled for the world!*
-
-The file size for transferring through the WebUI is capped at 50GB ??.
-
-# MapReduce
-
-The origins of Big Data as we know it today start with MapReduce.
-MapReduce 1 was designed to move computation to the data.
-
-But no mechanism for caching...
-
-# Spark
-Enter Spark
-
+[spark-wc-gh]: https://github.com/bigdata-vandy/spark-wordcount

_drafts/zeppelin-notebooks.md (+13)

@@ -0,0 +1,13 @@
+---
+layout: post
+title: "Creating Notebooks with Apache Zeppelin"
+author: Josh Arnold
+categories: zeppelin
+---
+
+* TOC
+{:toc}
+
+# Getting Started
+
+[

_includes/header.html (+30)

@@ -0,0 +1,30 @@
+<header class="site-header" role="banner">
+
+  <div class="wrapper">
+
+    <a class="site-title" href="{{ "/" | relative_url}}">{{ site.title | escape }}</a>
+
+    <nav class="site-nav">
+      <span class="menu-icon">
+        <svg viewBox="0 0 18 15" width="18px" height="15px">
+          <path fill="#424242" d="M18,1.484c0,0.82-0.665,1.484-1.484,1.484H1.484C0.665,2.969,0,2.304,0,1.484l0,0C0,0.665,0.665,0,1.484,0 h15.031C17.335,0,18,0.665,18,1.484L18,1.484z"/>
+          <path fill="#424242" d="M18,7.516C18,8.335,17.335,9,16.516,9H1.484C0.665,9,0,8.335,0,7.516l0,0c0-0.82,0.665-1.484,1.484-1.484 h15.031C17.335,6.031,18,6.696,18,7.516L18,7.516z"/>
+          <path fill="#424242" d="M18,13.516C18,14.335,17.335,15,16.516,15H1.484C0.665,15,0,14.335,0,13.516l0,0 c0-0.82,0.665-1.484,1.484-1.484h15.031C17.335,12.031,18,12.696,18,13.516L18,13.516z"/>
+        </svg>
+      </span>
+
+      <div class="trigger">
+        {% comment %}
+        {% include navigation.html %}
+        {% endcomment %}
+        {% for my_page in site.pages %}
+          {% if my_page.title %}
+            <a class="page-link" href="{{ my_page.url | relative_url }}">{{ my_page.title | escape }}</a>
+          {% endif %}
+        {% endfor %}
+      </div>
+    </nav>
+
+  </div>
+
+</header>

_includes/navigation.html (+24)

@@ -0,0 +1,24 @@
+{% capture html %}
+<ul>
+  {% if include.context == "/" %}
+    <li class="{% if page.url == "/" %}active{% endif %}">
+      <a href="{{ site.baseurl }}/">{{ site.title }}</a>
+    </li>
+  {% endif %}
+
+  {% assign entries = site.pages | sort: "path" %}
+  {% for entry in entries %}
+
+    {% capture slug %}{{ entry.url | split: "/" | last }}{% endcapture %}
+    {% capture current %}{{ entry.url | remove: slug | remove: "//" | append: "/" }}{% endcapture %}
+
+    {% if current == include.context %}
+      <li class="{% if page.url contains entry.url %}active{% endif %}">
+        <a href="{{ site.baseurl }}{{ entry.url }}">{{ entry.title }}</a>
+        {% include navigation.html context=entry.url %}
+      </li>
+    {% endif %}
+
+  {% endfor %}
+</ul>
+{% endcapture %}{{ html | strip_newlines | replace:' ','' | replace:' ','' | replace:' ',' ' }}

_layouts/page.html (+4 -1)

@@ -6,7 +6,10 @@
   <header class="post-header">
     <h1 class="post-title">{{ page.title | escape }}</h1>
   </header>
-
+  {% comment %}
+  {% include navigation.html context="/"%}
+  {% endcomment %}
+
   <div class="post-content">
     {{ content }}
   </div>

_pages/foo/bar/foo.md (+10)

@@ -0,0 +1,10 @@
+---
+layout: page
+title: "Test page"
+permalink: "/foo/bar/"
+categories: page
+tags: test
+---
+
+Some text.
+
