Commit 112fd0f
Add post on jupyter notebooks
1 parent 34a441b
14 files changed: +520 -186

_config.yml (+5 -1)

@@ -23,8 +23,12 @@ twitter_username: ACCREVandy
 github_username: bigdata-vandy
 slack_channel: ACCRE-Forum
 
+# Pages
+#include: ['_pages']
+
+permalink: "/:title/"
 # Build settings
-markdown: kramdown
+markdown: kramdown
 theme: minima
 gems:
   - jekyll-feed
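
A quick way to sanity-check the new `permalink` setting is to build the site locally. This is only a sketch: it assumes Bundler and the theme dependencies are already installed, and the post filename is illustrative rather than one from this commit.

```bash
# Serve the site (including _drafts/) and confirm that a post such as
# _posts/2017-01-01-using-hue.md (illustrative name) now renders at
# http://localhost:4000/using-hue/ instead of the default
# /:categories/:year/:month/:day/:title.html style URL.
bundle exec jekyll serve --drafts
```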

_drafts/hue-notebooks.md (+15)

@@ -0,0 +1,15 @@
+---
+layout: post
+title: "Creating Notebooks with Hue"
+author: Josh Arnold
+categories: hue
+---
+
+* TOC
+{:toc}
+
+
+It's a bad idea
+---------------
+* A
+* b

_drafts/tensorflow-on-spark.md (+8 -141)

@@ -1,147 +1,14 @@
 ---
 layout: post
-title: "Tensorflow on Spark"
+title: "Using TensorflowOnSpark with GPFS on the ACCRE Cluster"
 author: Josh Arnold
-categories: spark machine-learning deep-learning tensorflow
+categories: spark tensorflow slurm
 ---
 
-TensorFlow on Spark
--------------------
+## Motivation
+[TensorflowOnSpark][tfos]
 
-As per Dan Fabbri's suggestion, we installed TensforFlow on Spark, which
-provides model-level parallelism for Tensorflow. Following the instructions
-from [Yahoo](https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN).
-
-### Install Python 2.7
-
-This version of python is to be copied to HDFS.
-
-# download and extract Python 2.7
-export PYTHON_ROOT=~/Python
-curl -O https://www.python.org/ftp/python/2.7.12/Python-2.7.12.tgz
-tar -xvf Python-2.7.12.tgz
-rm Python-2.7.12.tgz
-
-# compile into local PYTHON_ROOT
-pushd Python-2.7.12
-./configure --prefix="${PYTHON_ROOT}" --enable-unicode=ucs4
-make
-make install
-popd
-rm -rf Python-2.7.12
-
-# install pip
-pushd "${PYTHON_ROOT}"
-curl -O https://bootstrap.pypa.io/get-pip.py
-bin/python get-pip.py
-rm get-pip.py
-
-# install tensorflow (and any custom dependencies)
-export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
-# This next step was necessary because the install wasn't finding
-# javac
-export PATH=$PATH:/usr/java/jdk1.7.0_67-cloudera/bin
-${PYTHON_ROOT}/bin/pip install pydoop
-# Note: add any extra dependencies here
-popd
-
-### Install and compile TensorFlow w/ RDMA Support
-
-The instructions recommend installing tensorflow from source, but this
-requires installing bazel, which I expect to be a major pain. Instead,
-for now, I've installed via `pip`:
-
-```bash
-${PYTHON_ROOT}/bin/pip install tensorflow
-```
-
-### Install and compile Hadoop InputFormat/OutputFormat for TFRecords
-
-[TFRecords](https://github.com/tensorflow/ecosystem/tree/master/hadoop)
-
-Here are the original instructions:
-
-```bash
-git clone https://github.com/tensorflow/ecosystem.git
-# follow build instructions to generate tensorflow-hadoop-1.0-SNAPSHOT.jar
-# copy jar to HDFS for easier reference
-hadoop fs -put tensorflow-hadoop-1.0-SNAPSHOT.jar
-```
-
-Building the jar is fairly involved and requires its own branch of installs.
-
-#### protoc 3.1.0
-
-[Google's data interchange format](https://developers.google.com/protocol-buffers/)
-[(with source code)](https://github.com/google/protobuf/tree/master/src)
-
-> Protocol buffers are Google's language-neutral, platform-neutral,
-> extensible mechanism for serializing structured data – think XML,
-> but smaller,
-> faster, and simpler. You define how you want your data to be structured
-> once, then you can use special generated source code to easily write and
-> read your structured data to and from a variety of data streams and using a
-> variety of languages.
-
-```bash
-wget https://github.com/google/protobuf/archive/v3.1.0.tar.gz
-```
-
-I had to yum install `autoconf`, `automake`, and `libtool`. In the future,
-we should make sure to include these in cfengine.
-
-
-Then
-
-```bash
-./autogen.sh
-./configure
-make
-make check
-sudo make install
-sudo ldconfig # refresh shared library cache
-```
-
-Note that `make check` passed all 7 tests.
-
-#### Apache Maven
-
-```bash
-wget
-tar -xvzf apache-
-```
-
-```bash
-$MAVEN_HOME/mvn clean package
-$MAVEN_HOME/mvn install
-```
-
-This will generate the jar in the directory `ecosystem/hadoop/target/tensorflow-hadoop-1.0-SNAPSHOT.jar`.
-
-```bash
-hadoop fs -put tensorflow-hadoop-1.0-SNAPSHOT.jar
-```
-
-### Create a Python w/ TensorFlow zip package for Spark
-
-```bash
-pushd "${PYTHON_ROOT}"
-zip -r Python.zip *
-popd
-```
-
-Copy this Python distribution into HDFS:
-```bash
-hadoop fs -put ${PYTHON_ROOT}/Python.zip
-```
-
-### Install TensorFlowOnSpark
-
-Next, clone this repo and build a zip package for Spark:
-
-```bash
-git clone git@github.com:yahoo/TensorFlowOnSpark.git
-pushd TensorFlowOnSpark/src
-zip -r ../tfspark.zip *
-popd
-```
+[tfos]: http://yahoohadoop.tumblr.com/post/157196317141/open-sourcing-tensorflowonspark-distributed-deep
+[accre-hpc]: https://github.com/accre/BigData/tree/master/spark-slurm
+[accre-gh-ss]: https://github.com/accre/BigData/tree/master/spark-slurm
+[slurm-srun]: https://slurm.schedmd.com/srun.html

_drafts/using-hue.md (+61 -32)

@@ -2,24 +2,74 @@
 layout: post
 title: "Using Hue on the Big Data Cluster"
 author: Josh Arnold
-categories: hue cloudera spark
+categories: hue
 ---
 
+* TOC
+{:toc}
+
 # Requirements
-The bigdata cluster is available for use by the Vanderbilt community.
-Users need
-a valid Vanderbilt ID and password to log on, and once they've logged on,
-should [contact ACCRE](http://www.accre.vanderbilt.edu/?page_id=367)
-about getting permission. Once approved, users will be
-able to connect to `bigdata.accre.vanderbilt.edu` with `ssh`.
 
+The bigdata cluster is available for use by the Vanderbilt community.
+Users should [contact ACCRE](http://www.accre.vanderbilt.edu/?page_id=367)
+to get access to the cluster.
 
 # Logging on to the Cluster via Hue
-Navigate to `bigdata.accre.vanderbilt.edu:8888` in your web browser.
+Once approved, users will be
+able to connect to `bigdata.accre.vanderbilt.edu` via `ssh`, but Cloudera
+Manager also provides a WebUI, called Hue, for interacting with the cluster.
+To access Hue, simply navigate to `bigdata.accre.vanderbilt.edu:8888` in your
+web browser and enter your credentials.
+
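
As a point of reference for the `ssh` route mentioned above, logging in is a one-liner; `<vunetid>` below is a placeholder for your VUnetID:

```bash
# Log in to the Big Data cluster gateway over SSH (replace <vunetid> with your VUnetID)
ssh <vunetid>@bigdata.accre.vanderbilt.edu
```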
+# Using the HDFS file browser
+
+If you've used the web UIs for Dropbox, Google Drive, etc., then this step
+is a piece of cake. The File Browser is accessed from the
+dog-eared-piece-of-paper icon near the top right of the screen. In the file
+browser, you're able to navigate the directory structure of HDFS and even
+view the contents of text files.
+
+When a new user logs into Hue, Hue creates an HDFS directory for that user
+at `/user/<vunetid>`, which becomes that user's home directory.
+*Note that, by default, logging in to Hue creates a new user's home directory
+with read and execute permissions enabled for the world!*
+
+Files can be uploaded to your directories using the drag-and-drop mechanism; however,
+the file size for transferring through the WebUI is capped at around 50GB,
+so other tools like `scp` or `rsync` are necessary for moving large files
+onto the cluster.
+
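
A minimal sketch of both points above, tightening the world-readable default and loading a large file without the WebUI; the file name and paths are placeholders:

```bash
# From an ssh session on the cluster: make your HDFS home directory private
# (it is world-readable/executable by default)
hdfs dfs -chmod -R 700 /user/<vunetid>

# From your local machine: copy a file too large for the WebUI to the gateway
scp big-dataset.csv <vunetid>@bigdata.accre.vanderbilt.edu:~/

# Back on the cluster: load the file into HDFS
hdfs dfs -put ~/big-dataset.csv /user/<vunetid>/
```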
+In addition to your own data, ACCRE hosts some publicly available datasets
+at `/data/`:
+
+Directory             | Description
+--------------------- | -----------
+babs                  | Bay Area bikeshare data
+capitalbikeshare data | DC area bikeshare data
+citibike-tripdata     | NYC bikeshare data
+google-ngrams         | n-grams collected from Google Books
+nyc-tlc               | NYC taxi trip data
+stack-archives        | historic posts from StackOverflow, et al.
+
+If you know of other datasets that may appeal to the Vanderbilt community at
+large, just let us know!
 
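
To poke around these shared datasets from a shell (the File Browser works just as well), a listing is enough; the subdirectory chosen below is only an example:

```bash
# List the shared datasets hosted on HDFS
hdfs dfs -ls /data/
# Peek at one of them, e.g. the Google n-grams collection
hdfs dfs -ls /data/google-ngrams | head
```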

-# Overview of Cloudera Services
+# Building an application
 
-The Hadoop ecosystem is thriving, and Cloudera implememnts many of these
+Hue uses Oozie to compose workflows on the cluster; to access it, you'll
+need to follow the tabs `Workflows -> Editors -> Workflows`. From here, click
+the `+ Create` button, and you'll arrive at the workflow composer screen. You
+can drag and drop an application into your workflow, for instance a Spark job.
+Here you can specify the jar file (which, conveniently, you can generate from
+our [GitHub repo][spark-wc-gh]) as well as options and inputs.
+If you want to interactively select your input and output files each time you
+execute the job, you can use the special keywords `${input}` and `${output}`,
+a nice feature for generalizing your workflows.
+
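
For intuition about what the Spark action in such a workflow ends up running, a rough command-line analogue is sketched below. The class name, jar path, and HDFS paths are illustrative placeholders rather than values taken from the repo, and `${input}`/`${output}` are the Oozie parameters Hue prompts you for at submission time.

```bash
# Approximate command-line equivalent of the Hue/Oozie Spark action described above.
# example.WordCount, the jar path, and the HDFS paths are illustrative placeholders;
# in the workflow editor, the last two arguments would be ${input} and ${output}.
spark-submit --master yarn --class example.WordCount \
    /path/to/spark-wordcount.jar \
    hdfs:///user/<vunetid>/input.txt \
    hdfs:///user/<vunetid>/wordcount-out
```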
+# Overview of Cloudera services
+
+The Hadoop ecosystem is rich with applications,
+and Cloudera implements many of these
 technologies out of the box.
 
 | Cloudera Services | Description
@@ -36,25 +86,4 @@ technologies out of the box.
 | Pig | High-level language for expressing data analysis programs
 | Solr | Text search engine supporting free form queries
 
-In general
-
-# Using the HDFS File Browser
-If you've used the web UIs for Dropbox, Google Drive, etc., then this step
-is a piece of cake. The File Browser is accessed from the
-dog-eared-piece-of-paper icon near the top right of the screen.
-
-*Note: by default, logging in to Hue creates a new user's home directory
-at /user/username with read and execute permissions enabled for the world!*
-
-The file size for transferring through the WebUI is capped at 50GB ??.
-
-# MapReduce
-
-The origins of Big Data as we know it today start with MapReduce.
-MapReduce 1 was designed to move computation to the data.
-
-But no mechanism for caching...
-
-# Spark
-Enter Spark
-
+[spark-wc-gh]: https://github.com/bigdata-vandy/spark-wordcount

_drafts/zeppelin-notebooks.md (+13)

@@ -0,0 +1,13 @@
+---
+layout: post
+title: "Creating Notebooks with Apache Zeppelin"
+author: Josh Arnold
+categories: zeppelin
+---
+
+* TOC
+{:toc}
+
+# Getting Started
+
+[

_includes/header.html (+30)

@@ -0,0 +1,30 @@
+<header class="site-header" role="banner">
+
+  <div class="wrapper">
+
+    <a class="site-title" href="{{ "/" | relative_url}}">{{ site.title | escape }}</a>
+
+    <nav class="site-nav">
+      <span class="menu-icon">
+        <svg viewBox="0 0 18 15" width="18px" height="15px">
+          <path fill="#424242" d="M18,1.484c0,0.82-0.665,1.484-1.484,1.484H1.484C0.665,2.969,0,2.304,0,1.484l0,0C0,0.665,0.665,0,1.484,0 h15.031C17.335,0,18,0.665,18,1.484L18,1.484z"/>
+          <path fill="#424242" d="M18,7.516C18,8.335,17.335,9,16.516,9H1.484C0.665,9,0,8.335,0,7.516l0,0c0-0.82,0.665-1.484,1.484-1.484 h15.031C17.335,6.031,18,6.696,18,7.516L18,7.516z"/>
+          <path fill="#424242" d="M18,13.516C18,14.335,17.335,15,16.516,15H1.484C0.665,15,0,14.335,0,13.516l0,0 c0-0.82,0.665-1.484,1.484-1.484h15.031C17.335,12.031,18,12.696,18,13.516L18,13.516z"/>
+        </svg>
+      </span>
+
+      <div class="trigger">
+        {% comment %}
+        {% include navigation.html %}
+        {% endcomment %}
+        {% for my_page in site.pages %}
+          {% if my_page.title %}
+            <a class="page-link" href="{{ my_page.url | relative_url }}">{{ my_page.title | escape }}</a>
+          {% endif %}
+        {% endfor %}
+      </div>
+    </nav>
+
+  </div>
+
+</header>

_includes/navigation.html (+24)

@@ -0,0 +1,24 @@
+{% capture html %}
+<ul>
+  {% if include.context == "/" %}
+    <li class="{% if page.url == "/" %}active{% endif %}">
+      <a href="{{ site.baseurl }}/">{{ site.title }}</a>
+    </li>
+  {% endif %}
+
+  {% assign entries = site.pages | sort: "path" %}
+  {% for entry in entries %}
+
+    {% capture slug %}{{ entry.url | split: "/" | last }}{% endcapture %}
+    {% capture current %}{{ entry.url | remove: slug | remove: "//" | append: "/" }}{% endcapture %}
+
+    {% if current == include.context %}
+      <li class="{% if page.url contains entry.url %}active{% endif %}">
+        <a href="{{ site.baseurl }}{{ entry.url }}">{{ entry.title }}</a>
+        {% include navigation.html context=entry.url %}
+      </li>
+    {% endif %}
+
+  {% endfor %}
+</ul>
+{% endcapture %}{{ html | strip_newlines | replace:' ','' | replace:' ','' | replace:' ',' ' }}

_layouts/page.html (+4 -1)

@@ -6,7 +6,10 @@
   <header class="post-header">
     <h1 class="post-title">{{ page.title | escape }}</h1>
   </header>
-
+  {% comment %}
+  {% include navigation.html context="/"%}
+  {% endcomment %}
+
   <div class="post-content">
     {{ content }}
   </div>

_pages/foo/bar/foo.md (+10)

@@ -0,0 +1,10 @@
+---
+layout: page
+title: "Test page"
+permalink: "/foo/bar/"
+categories: page
+tags: test
+---
+
+Some text.
+
