---
layout: post
title: "Moving Data with the Hadoop Distributed File System"
author: Josh Arnold
date: 2017-02-15 11:28:00 -0600
categories: hadoop io
---

To analyze big data you need big data. That is, you need to have data stored
on disk. The *de facto* standard for storing big data in a resilient, distributed
manner is Apache's Hadoop Distributed File System ([HDFS][apache-hadoop]).
This post walks through different methods of storing data in HDFS on the
[ACCRE BigData Cluster]({{ site.baseurl }}{% link _posts/2017-02-02-intro-to-the-cluster.md %}), and along the way, we'll introduce some basic
[Hadoop File System shell commands][hadoop-commands].

## Intra-HDFS

### Local to HDFS

```bash
hadoop fs -copyFromLocal some/data some/dir
```

The second option highlights the use of paths relative to the user's home
directory in both the local and the Hadoop file systems.
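
For example, here's a minimal sketch of that equivalence, assuming your shell is
currently sitting in your local home directory and that your HDFS home is the
usual `/user/$USER`:

```bash
# Run from your local home directory, these two commands do the same thing:
# the relative HDFS path resolves against your HDFS home, /user/$USER.
hadoop fs -copyFromLocal $HOME/some/data /user/$USER/some/dir
hadoop fs -copyFromLocal some/data some/dir
```
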
We also have the option to use `-moveFromLocal` which will delete
the local source file once it is copied to HDFS. This command is useful if

Copying from HDFS to a local drive works in very much the same way with the
analogous `hadoop fs` commands `-copyToLocal` and `-moveToLocal`.
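
For instance, a quick sketch with a made-up file name:

```bash
# Copy a result file from HDFS back to the current local directory
# (the HDFS path and file name here are hypothetical).
hadoop fs -copyToLocal some/dir/results.csv .
```
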

### Moving data on HDFS

The `hadoop fs` commands also have analogues for the \*nix commands `mv`, `cp`,
`mkdir`, `rm`, `rmdir`, `ls`, `chmod`, `chown` and many others whose use is
very similar to the \*nix versions.
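
For example, here's a short sketch (the directory and file names are made up):

```bash
# Create a directory tree on HDFS, move a file into it, inspect it,
# tighten its permissions, and remove an old directory.
hadoop fs -mkdir -p archive/2017
hadoop fs -mv some/dir/results.csv archive/2017/
hadoop fs -ls archive/2017
hadoop fs -chmod 750 archive/2017
hadoop fs -rm -r old/dir
```
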

## Inter-HDFS

In the intra-HDFS case, all the distributed files have to gather at a single node
at some point along the way, a *many-to-one* or *one-to-many* model if you will.
But moving data between HDFS clusters can be greatly accelerated since
HDFS file blocks only reside on (typically) 3 different nodes within a cluster;
thus, this model is "few-to-few", and Hadoop provides the `DistCp` ("distributed copy")
utility for just such applications.

### HDFS to HDFS

Passing data from one HDFS cluster to the next is fairly vanilla:

```bash
hadoop distcp hdfs://another-hdfs-host:8020/foo/bar \
hdfs://abd740:8020/bar/foo
```

This could be useful if you have collaborators running a Hadoop cluster who'd
like to share their data with you.

### AWS S3 to HDFS

Copying to and from Amazon's S3 (Simple Storage Service) storage is
also supported by `distcp`.
To use AWS (Amazon Web Services), a user needs to have credentials. Getting
credentialed is a slightly tedious but well-documented process that warrants
no further explanation here. Instead, I assume that you have credentials
stored in the file `~/.aws/credentials` on node `abd740`.
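
That file follows the standard AWS credentials layout, something like the
following (the values shown are placeholders):

```
[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxX
```
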
Your AWS credentials need to be passed as command-line arguments to distcp,
and I've found that a convenient and somewhat conventional way is to simply
set the credentials as environment variables.
I've factored out setting these credentials into its own script, since
setting these environment variables comes up fairly often:

```bash
#!/bin/bash
# ~/.aws/set_credentials.sh
# Turn each "key = value" line of ~/.aws/credentials into an exported,
# upper-cased environment variable, e.g. aws_access_key_id -> AWS_ACCESS_KEY_ID.

export $(cat ~/.aws/credentials | grep -v "^\[" | awk '{print toupper($1)$2$3}')
```
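
With that in place, the credentials can be pulled into the current shell
whenever they're needed; a quick sketch, assuming the file name in the comment
above:

```bash
# Source the script so the exports land in the current shell,
# then sanity-check one of the variables.
. ~/.aws/set_credentials.sh
echo $AWS_ACCESS_KEY_ID
```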

I also store my distcp command in a script:

```bash
#!/bin/bash
# Copy the public ngrams dataset from S3 into my HDFS home directory.
# The AWS credentials (set by sourcing ~/.aws/set_credentials.sh) are passed
# to distcp as java -D options; the property names below are the s3n ones.
hadoop distcp \
-Dfs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY_ID \
-Dfs.s3n.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY \
s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/ \
hdfs:///user/$USER/eng-us-all
```

It's really that simple; however, note that `s3`, `s3n`, `s3a` are all distinct
specifications, and you should modify
your java `-D` options according to the data source.
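
For example, with an `s3a://` source the same credentials would be handed over
using the `s3a` property names; a sketch with a made-up bucket and paths:

```bash
hadoop distcp \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY_ID \
-Dfs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \
s3a://some-bucket/some/data \
hdfs:///user/$USER/some/data
```
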
[apache-hadoop]: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
[hadoop-commands]: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html