Commit 8e9bcb3 (1 parent: eab28bf)

Add moving data post and adjust Slack icon svg.

Made the Slack icon bigger by changing the viewBox attribute; the graphic had a lot of negative space around it.

3 files changed: +62 -12 lines changed


_drafts/hosted-datasets.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+---
+layout: post
+title: "Datasets Hosted on the Big Data Cluster"
+author: Josh Arnold
+categories: hadoop hdfs data
+---
+
+To encourage use of our Big Data cluster, we at ACCRE have gathered various
+large datasets and made them publicly available at `hdfs:///data`.
+
+## Stack Archives
+
+
+## New York City Taxi LC
+
+
+## Human Microbiome Project
+
+## Conclusions
+
+If you have a fairly generic dataset that you'd like us to grab, or if you have
+your own data that you'd like to share on our cluster, reach out to us at the
+links below!
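
A minimal sketch of how the hosted data could be browsed, assuming a configured `hadoop` client on the cluster; the commands are standard `hadoop fs` subcommands and are shown only as an illustration:

```bash
# browse the publicly available datasets
hadoop fs -ls hdfs:///data

# check how large a given dataset is before copying it anywhere
hadoop fs -du -h hdfs:///data
```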

_includes/icon-slack.svg

Lines changed: 1 addition & 1 deletion

_drafts/hadoop-distcp.md renamed to _posts/2017-02-15-using-hdfs.md

Lines changed: 38 additions & 11 deletions
@@ -1,18 +1,19 @@
 ---
 layout: post
-title: "Moving Data on the Cluster"
+title: "Moving Data with the Hadoop Distributed File System"
 author: Josh Arnold
+date: 2017-02-15 11:28:00 -0600
 categories: hadoop io
 ---
 
-## Using Hadoop Distributed File System
 To analyze big data you need big data. That is, you need to have data stored
-on disk. The *de facto* standard for storing big data in resilient, distributed
+on disk. The *de facto* standard for storing big data in a resilient, distributed
 manner is Apache's Hadoop Distributed File System ([HDFS][apache-hadoop]).
 This post walks through different methods of storing data in HDFS on the
 [ACCRE BigData Cluster]({{ site.baseurl }}{% link _posts/2017-02-02-intro-to-the-cluster.md %}), and along the way, we'll introduce some basic
 [Hadoop File System shell commands][hadoop-commands].
 
+## Intra-HDFS
 
 ### Local to HDFS
 
@@ -32,7 +33,7 @@ hadoop fs -copyFromLocal some/data some/dir
 ```
 
 The second option highlights the use of paths relative to the user's home
-directory in both the local and the hadoop file systems.
+directory in both the local and the Hadoop file systems.
 
 We also have the option to use `-moveFromLocal` which will delete
 the local source file once it is copied to HDFS. This command is useful if
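
To make the relative-versus-absolute path behavior concrete, here is a minimal sketch; the file and directory names are hypothetical, and `-mkdir`, `-copyFromLocal`, and `-moveFromLocal` are standard `hadoop fs` subcommands:

```bash
# relative HDFS paths resolve against the user's HDFS home directory,
# so this copies ./results.csv into /user/$USER/data/ on HDFS
hadoop fs -mkdir -p data
hadoop fs -copyFromLocal results.csv data/

# same copy, but the local file is removed once it lands in HDFS
hadoop fs -moveFromLocal results.csv data/
```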
@@ -52,33 +53,57 @@ done
 Copying from HDFS to a local drive works in very much the same way with the
 analogous `hadoop fs` commands `-copyToLocal` and `-moveToLocal`.
 
-### Intra-HDFS
+### Moving data on HDFS
 
+The `hadoop fs` commands also have analogues for the \*nix commands `mv`, `cp`,
+`mkdir`, `rm`, `rmdir`, `ls`, `chmod`, `chown`, and many others whose use is
+very similar to the \*nix versions.
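
A few of those analogues in action, as an illustrative sketch with made-up paths:

```bash
hadoop fs -mkdir -p projects/archive                 # like mkdir -p
hadoop fs -cp data/results.csv projects/             # like cp
hadoop fs -mv projects/results.csv projects/archive/ # like mv
hadoop fs -ls -R projects                            # like ls -R
hadoop fs -chmod 750 projects                        # like chmod
hadoop fs -rm -r projects/archive                    # like rm -r
```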
 
+## Inter-HDFS
 
-### HDFS <--> HDFS
+In the intra-HDFS case, all the distributed files have to gather at a single node
+at some point along the way, a *many-to-one* or *one-to-many* model if you will.
+But moving data between HDFS clusters can be greatly accelerated since
+HDFS file blocks only reside on (typically) 3 different nodes within a cluster;
+thus, this model is "few-to-few", and Hadoop provides the `DistCp` ("distributed copy")
+utility for just such applications.
 
+### HDFS to HDFS
 
-### AWS <--> HDFS
+Passing data from one HDFS cluster to the next is fairly vanilla:
 
-To use Amazon Web Services (AWS), a user needs to have credentials. Getting
+```bash
+hadoop distcp hdfs://another-hdfs-host:8020/foo/bar \
+hdfs://abd740:8020/bar/foo
+```
+
+This could be useful if you have collaborators running a Hadoop cluster who'd
+like to share their data with you.
+
+### AWS S3 to HDFS
+
+Copying to and from Amazon's S3 (Simple Storage Service) storage is
+also supported by `distcp`.
+To use AWS (Amazon Web Services), a user needs to have credentials. Getting
 credentialed is a slightly tedious but well-documented process that warrants
 no further explanation here. Instead, I assume that you have credentials
 stored in the file `~/.aws/credentials` on node `abd740`.
 
 Your AWS credentials need to be passed as command-line arguments to distcp,
 and I've found that a convenient and somewhat conventional way is to simply
-set the credentials as
+set the credentials as environment variables.
 I've factored out setting these credentials into its own script, since
 setting these environment variables comes up fairly often:
 
 ```bash
 #!/bin/bash
+# ~/.aws/set_credentials.sh
 
 export $(cat ~/.aws/credentials | grep -v "^\[" | awk '{print toupper($1)$2$3 }')
 
 ```
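
For readers unfamiliar with that `export` one-liner, here is roughly what it expands to for a typical `~/.aws/credentials` file; the key values below are placeholders, not real credentials:

```bash
# A typical ~/.aws/credentials file looks like:
#   [default]
#   aws_access_key_id = AKIAIOSFODNN7EXAMPLE
#   aws_secret_access_key = abcdefghijklmnopqrstuvwxyz0123456789EXAMPLE
#
# grep -v "^\[" drops the [default] header, and awk upper-cases the key name
# and squeezes out the " = ", so the one-liner is equivalent to:
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=abcdefghijklmnopqrstuvwxyz0123456789EXAMPLE
```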
 
+I also store my distcp command in a script:
 
 ```bash
 #!/bin/bash
@@ -92,8 +117,10 @@ s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/ \
 hdfs:///user/$USER/eng-us-all
 ```
 
-*Note that `s3`, `s3n`, `s3a` are all distinct specifications, and you should modify
-your java `-D` options according to the data source.*
+It's really that simple; however, note that `s3`, `s3n`, `s3a` are all distinct
+specifications, and you should modify
+your Java `-D` options according to the data source.
+
 
 [apache-hadoop]: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
 [hadoop-commands]: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html
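
Because `s3`, `s3n`, and `s3a` expect different property names, here is an illustrative sketch of the corresponding `-D` options; the bucket and paths are hypothetical placeholders, and this is a generic example rather than the exact script referenced above:

```bash
# s3n:// sources read these property names
hadoop distcp \
  -Dfs.s3n.awsAccessKeyId="$AWS_ACCESS_KEY_ID" \
  -Dfs.s3n.awsSecretAccessKey="$AWS_SECRET_ACCESS_KEY" \
  s3n://some-bucket/some/dataset hdfs:///user/$USER/some-dataset

# s3a:// sources use different ones
hadoop distcp \
  -Dfs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  -Dfs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
  s3a://some-bucket/some/dataset hdfs:///user/$USER/some-dataset
```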
