@@ -5,34 +5,70 @@ author: Josh Arnold
categories: hadoop io
---

- ## Getting Started
- For instructions on obtaining access to the Vanderbilt Big Data Cluster,
- see [this blog post](page)
-
- ## Using the Hadoop FileSystem
+ ## Using the Hadoop Distributed File System
To analyze big data you need big data. That is, you need to have data stored
- on disk.
+ on disk. The *de facto* standard for storing big data in a resilient, distributed
+ manner is Apache's Hadoop Distributed File System ([HDFS][apache-hadoop]).
+ This post walks through different methods of storing data in HDFS on the
+ [ACCRE BigData Cluster]({{ site.baseurl }}{% link _posts/2017-02-02-intro-to-the-cluster.md %}), and along the way, we'll introduce some basic
+ [Hadoop File System shell commands][hadoop-commands].

+ ### Local to HDFS

- ### Local <--> HDFS
+ Probably the most common workflow for new users is to `scp` some data to
+ `bigdata.accre.vanderbilt.edu` and then move that data to HDFS. The command
+ for doing that is:
```bash
hadoop fs -copyFromLocal \
file:///scratch/$USER/some/data hdfs:///user/$USER/some/dir
```

+ or, equivalently:
+
+ ```bash
+ hadoop fs -copyFromLocal some/data some/dir
+ ```
+
+ The second option highlights the use of paths relative to the user's home
+ directory in both the local and the Hadoop file systems.
+
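+ If you want to confirm where a relative path ends up, listing it with both forms
+ should show the same contents (a quick sketch reusing the example paths above):
+
+ ```bash
+ # relative HDFS paths resolve against the user's HDFS home, /user/$USER
+ hadoop fs -ls some/dir
+ hadoop fs -ls hdfs:///user/$USER/some/dir
+ ```
+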
+ We also have the option to use `-moveFromLocal`, which deletes
+ the local source file once it is copied to HDFS. This command is useful if
+ you have many large files that you don't want hanging around on the native
+ file system on the cluster. One solution, run from your own machine, is to
+ combine an `scp` command with a remote `ssh` command:
+
+ ```bash
+ for f in *.txt; do
+   # copy each file to the user's home directory on the cluster
+   scp $f bigdata:$f;
+   # then move it off the native file system and into HDFS
+   ssh bigdata "hadoop fs -moveFromLocal $f $f";
+ done
+ ```
+
+ ### HDFS to Local
+
+ Copying from HDFS to a local drive works in very much the same way, using the
+ analogous `hadoop fs` commands `-copyToLocal` and `-moveToLocal`.
+
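+ As a quick sketch (again reusing the example paths from above), pulling a
+ directory back out of HDFS looks like:
+
+ ```bash
+ # copy a directory from HDFS back to local scratch space on the cluster
+ hadoop fs -copyToLocal hdfs:///user/$USER/some/dir file:///scratch/$USER/some/dir
+ ```
+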
+ ### Intra-HDFS
+
- ## DistCp

### HDFS <--> HDFS

### AWS <--> HDFS
- A note on AWS credentials.
- Using the command line tool
+ To use Amazon Web Services (AWS), a user needs to have credentials. Getting
+ credentialed is a slightly tedious but well-documented process that warrants
+ no further explanation here. Instead, I assume that you have credentials
+ stored in the file `~/.aws/credentials` on node `abd740`.
+
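+ For reference, this is the standard AWS CLI credentials file; a typical layout
+ (with placeholder values) looks something like:
+
+ ```bash
+ $ cat ~/.aws/credentials
+ [default]
+ aws_access_key_id = <your-access-key-id>
+ aws_secret_access_key = <your-secret-access-key>
+ ```
+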
+ Your AWS credentials need to be passed as command-line arguments to `distcp`,
+ and I've found that a convenient and somewhat conventional way is to simply
+ set the credentials as environment variables.
I've factored out setting these credentials into its own script, since
setting these environment variables comes up fairly often:
@@ -56,4 +92,8 @@ s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/ \
hdfs:///user/$USER/eng-us-all
```

- Differences in `s3n` v `s3a`.
+ *Note that `s3`, `s3n`, and `s3a` are all distinct specifications, and you should modify
+ your Java `-D` options according to the data source.*
+
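+ As an illustration only (a sketch, assuming your keys are exported in the common
+ `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables), the same
+ copy over the `s3a` connector would change both the URI scheme and the property names:
+
+ ```bash
+ # s3a reads credentials from fs.s3a.* properties, which differ from s3n's
+ hadoop distcp \
+   -Dfs.s3a.access.key=$AWS_ACCESS_KEY_ID \
+   -Dfs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \
+   s3a://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/ \
+   hdfs:///user/$USER/eng-us-all
+ ```
+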
+ [apache-hadoop]: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
+ [hadoop-commands]: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html