Commit eab28bf

Working on hadoop-distcp draft.
1 parent 6d7b36f commit eab28bf

File tree

3 files changed

+55
-19
lines changed


_config.yml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ baseurl: "" # the subpath of your site, e.g. /blog
 url: "" # the base hostname & protocol for your site, e.g. http://example.com
 twitter_username: ACCREVandy
 github_username: bigdata-vandy
-slack_channel:
+slack_channel: ACCRE-Forum
 
 # Build settings
 markdown: kramdown

_drafts/hadoop-distcp.md

Lines changed: 51 additions & 11 deletions
author: Josh Arnold
categories: hadoop io
---

## Using the Hadoop Distributed File System

To analyze big data you need big data. That is, you need to have data stored
on disk. The *de facto* standard for storing big data in a resilient, distributed
manner is Apache's Hadoop Distributed File System ([HDFS][apache-hadoop]).
This post walks through different methods of storing data in HDFS on the
[ACCRE BigData Cluster]({{ site.baseurl }}{% link _posts/2017-02-02-intro-to-the-cluster.md %}),
and along the way, we'll introduce some basic
[Hadoop File System shell commands][hadoop-commands].

### Local to HDFS

Probably the most common workflow for new users is to `scp` some data to
`bigdata.accre.vanderbilt.edu` and then move that data to HDFS. The command
for doing that is:

```bash
hadoop fs -copyFromLocal \
  file:///scratch/$USER/some/data hdfs:///user/$USER/some/dir
```

or, equivalently:

```bash
hadoop fs -copyFromLocal some/data some/dir
```

The second option highlights the use of paths relative to the user's home
directory in both the local and the Hadoop file systems.
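
For example (an illustrative check; the `hadoop` calls are guarded so the snippet is harmless on a machine without the client), the relative and absolute forms address the same HDFS directory:

```shell
# These two listings name the same directory: relative HDFS paths
# are resolved against /user/$USER.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -ls some/dir
  hadoop fs -ls hdfs:///user/$USER/some/dir
fi
```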

We also have the option to use `-moveFromLocal`, which deletes
the local source file once it is copied to HDFS. This command is useful if
you have many large files that you don't want hanging around on the native
file system on the cluster. One solution is to combine an `scp` command with a
remote `ssh` command:

```bash
for f in *.txt; do
  scp "$f" bigdata:"$f"
  ssh bigdata "hadoop fs -moveFromLocal $f $f"
done
```

### HDFS to Local

Copying from HDFS to a local drive works in much the same way, with the
analogous `hadoop fs` commands `-copyToLocal` and `-moveToLocal`.
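
As an illustrative sketch (the paths are made up, and the `hadoop` call is guarded so the snippet is harmless without the client), pulling a directory back out of HDFS looks like:

```shell
# -copyToLocal mirrors -copyFromLocal: HDFS source first, local
# destination second. The HDFS copy is left in place.
src="hdfs:///user/$USER/some/dir"
dst="/scratch/$USER/some/dir"
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -copyToLocal "$src" "$dst"
fi
```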

### Intra-HDFS

### HDFS <--> HDFS
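
This section is still a stub in the draft; as a hedged sketch of what belongs here (illustrative paths, guarded `hadoop` call), `distcp` copies between HDFS locations by launching a MapReduce job, which is what makes it preferable to a plain `hadoop fs -cp` for large directory trees:

```shell
# Parallel intra-cluster copy; each map task copies a subset of the files.
src="hdfs:///user/$USER/some/dir"
dst="hdfs:///user/$USER/some/dir-backup"
if command -v hadoop >/dev/null 2>&1; then
  hadoop distcp "$src" "$dst"
fi
```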

### AWS <--> HDFS

To use Amazon Web Services (AWS), a user needs to have credentials. Getting
credentialed is a slightly tedious but well-documented process that warrants
no further explanation here. Instead, I assume that you have credentials
stored in the file `~/.aws/credentials` on node `abd740`.

Your AWS credentials need to be passed as command-line arguments to distcp,
and I've found that a convenient and somewhat conventional way is to simply
set the credentials as environment variables.
I've factored out setting these credentials into its own script, since
setting these environment variables comes up fairly often:
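
The script itself falls in a part of the diff that isn't shown here; a minimal sketch of such a script, assuming the standard `~/.aws/credentials` INI layout (the filename `set-aws-creds.sh` and the parsing are illustrative, not the author's actual code):

```shell
#!/usr/bin/env bash
# set-aws-creds.sh (illustrative): export the default profile's keys
# from ~/.aws/credentials so a later `hadoop distcp` can use them.
creds="$HOME/.aws/credentials"
export AWS_ACCESS_KEY_ID=$(awk -F' *= *' '/^aws_access_key_id/ {print $2; exit}' "$creds")
export AWS_SECRET_ACCESS_KEY=$(awk -F' *= *' '/^aws_secret_access_key/ {print $2; exit}' "$creds")
```

Sourcing it (`. set-aws-creds.sh`) rather than executing it keeps the exported variables in the current shell.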

```bash
# ...(start of the distcp command elided in this hunk)...
s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/ \
hdfs:///user/$USER/eng-us-all
```

*Note that `s3`, `s3n`, and `s3a` are all distinct specifications, and you should modify
your Java `-D` options according to the data source.*
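
To make that note concrete (the bucket and paths are placeholders; the property names are the ones the Hadoop `s3n` and `s3a` connectors read, and the calls are guarded so the snippet is safe without a hadoop client):

```shell
if command -v hadoop >/dev/null 2>&1; then
  # s3n reads camelCase credential properties.
  hadoop distcp \
    -Dfs.s3n.awsAccessKeyId="$AWS_ACCESS_KEY_ID" \
    -Dfs.s3n.awsSecretAccessKey="$AWS_SECRET_ACCESS_KEY" \
    s3n://some-bucket/some/path hdfs:///user/$USER/some/path

  # s3a reads dotted property names instead.
  hadoop distcp \
    -Dfs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
    -Dfs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
    s3a://some-bucket/some/path hdfs:///user/$USER/some/path
fi
```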
[apache-hadoop]: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
[hadoop-commands]: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html

_includes/icon-slack.svg

Lines changed: 3 additions & 7 deletions

0 commit comments
