@@ -5,34 +5,70 @@ author: Josh Arnold
categories: hadoop io
---

- ## Getting Started
- For instructions on obtaining access to the Vanderbilt Big Data Cluster,
- see [this blog post](page)
-
- ## Using the Hadoop FileSystem
+ ## Using the Hadoop Distributed File System
To analyze big data you need big data. That is, you need to have data stored
- on disk.
+ on disk. The *de facto* standard for storing big data in a resilient, distributed
+ manner is Apache's Hadoop Distributed File System ([HDFS][apache-hadoop]).
+ This post walks through different methods of storing data in HDFS on the
+ [ACCRE BigData Cluster]({{ site.baseurl }}{% link _posts/2017-02-02-intro-to-the-cluster.md %}), and along the way, we'll introduce some basic
+ [Hadoop File System shell commands][hadoop-commands].

+ ### Local to HDFS

- ### Local <--> HDFS
+ Probably the most common workflow for new users is to `scp` some data to
+ `bigdata.accre.vanderbilt.edu` and then move that data to HDFS. The command
+ for doing that is:
```bash
hadoop fs -copyFromLocal \
file:///scratch/$USER/some/data hdfs:///user/$USER/some/dir
```

+ or, equivalently:
+
+ ```bash
+ hadoop fs -copyFromLocal some/data some/dir
+ ```
+
+ The second option highlights the use of paths relative to the user's home
+ directory in both the local and the Hadoop file systems.
+
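+ If you want to confirm where a relative path ends up, listing it with both forms
+ should show the same contents (a quick sketch reusing the example paths above):
+
+ ```bash
+ # relative HDFS paths resolve against the user's HDFS home, /user/$USER
+ hadoop fs -ls some/dir
+ hadoop fs -ls hdfs:///user/$USER/some/dir
+ ```
+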
+ We also have the option to use `-moveFromLocal`, which deletes
+ the local source file once it is copied to HDFS. This command is useful if
+ you have many large files that you don't want hanging around on the native
+ file system on the cluster. One solution, run from your own machine, is to
+ combine an `scp` command with a remote `ssh` command:
+
+ ```bash
+ for f in *.txt; do
+   # copy each file to the user's home directory on the cluster
+   scp $f bigdata:$f;
+   # then move it off the native file system and into HDFS
+   ssh bigdata "hadoop fs -moveFromLocal $f $f";
+ done
+ ```
+
+ ### HDFS to Local
+
+ Copying from HDFS to a local drive works in very much the same way, using the
+ analogous `hadoop fs` commands `-copyToLocal` and `-moveToLocal`.
+
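+ As a quick sketch (again reusing the example paths from above), pulling a
+ directory back out of HDFS looks like:
+
+ ```bash
+ # copy a directory from HDFS back to local scratch space on the cluster
+ hadoop fs -copyToLocal hdfs:///user/$USER/some/dir file:///scratch/$USER/some/dir
+ ```
+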
+ ### Intra-HDFS
+
- ## DistCp

### HDFS <--> HDFS

### AWS <--> HDFS
- A note on AWS credentials.
- Using the command line tool
+ To use Amazon Web Services (AWS), a user needs to have credentials. Getting
+ credentialed is a slightly tedious but well-documented process that warrants
+ no further explanation here. Instead, I assume that you have credentials
+ stored in the file `~/.aws/credentials` on node `abd740`.
+
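+ For reference, this is the standard AWS CLI credentials file; a typical layout
+ (with placeholder values) looks something like:
+
+ ```bash
+ $ cat ~/.aws/credentials
+ [default]
+ aws_access_key_id = <your-access-key-id>
+ aws_secret_access_key = <your-secret-access-key>
+ ```
+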
+ Your AWS credentials need to be passed as command-line arguments to `distcp`,
+ and I've found that a convenient and somewhat conventional way is to simply
+ set the credentials as environment variables.
I've factored out setting these credentials into its own script, since
setting these environment variables comes up fairly often:
@@ -56,4 +92,8 @@ s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/ \
hdfs:///user/$USER/eng-us-all
```

- Differences in `s3n` v `s3a`.
+ *Note that `s3`, `s3n`, and `s3a` are all distinct specifications, and you should modify
+ your Java `-D` options according to the data source.*
+
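+ As an illustration only (a sketch, assuming your keys are exported in the common
+ `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables), the same
+ copy over the `s3a` connector would change both the URI scheme and the property names:
+
+ ```bash
+ # s3a reads credentials from fs.s3a.* properties, which differ from s3n's
+ hadoop distcp \
+   -Dfs.s3a.access.key=$AWS_ACCESS_KEY_ID \
+   -Dfs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \
+   s3a://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/ \
+   hdfs:///user/$USER/eng-us-all
+ ```
+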
+ [apache-hadoop]: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
+ [hadoop-commands]: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html