---
layout: post
title: "Using Hue on the Big Data Cluster"
author: Josh Arnold
categories: hue
---

* TOC
{:toc}

# Requirements

The bigdata cluster is available for use by the Vanderbilt community.
Users should [contact ACCRE](http://www.accre.vanderbilt.edu/?page_id=367)
to get access to the cluster.

# Logging on to the Cluster via Hue

Once approved, users will be able to connect to `bigdata.accre.vanderbilt.edu`
via `ssh`, but Cloudera Manager also provides a WebUI for interacting with the
cluster, called Hue. To access Hue, simply navigate to
`bigdata.accre.vanderbilt.edu:8888` in your web browser and enter your
credentials.
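
For command-line access, the `ssh` connection looks something like this
(replace `vunetid` with your own VUNetID):

```bash
# Log in to the bigdata cluster with your Vanderbilt credentials
ssh vunetid@bigdata.accre.vanderbilt.edu
```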

# Using the HDFS file browser

If you've used the web UIs for Dropbox, Google Drive, etc., then this step
is a piece of cake. The File Browser is accessed from the
dog-eared-piece-of-paper icon near the top right of the screen. In the file
browser, you're able to navigate the directory structure of HDFS and even
view the contents of text files.

When a new user logs into Hue, Hue creates an HDFS directory for that user
at `/user/<vunetid>`, which becomes that user's home directory.
*Note that, by default, logging in to Hue creates a new user's home directory
with read and execute permissions enabled for the world!*
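
If you'd rather not share, you can tighten those permissions yourself from a
terminal session on the cluster. A minimal sketch, assuming you want
owner-only access:

```bash
# Inspect the current permissions on HDFS home directories
hdfs dfs -ls /user

# Remove group and world access; the owner keeps read/write/execute
hdfs dfs -chmod 700 /user/<vunetid>
```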

Files can be uploaded to your directories using the drag-and-drop mechanism;
however, the file size for transfers through the WebUI is capped at around
50GB, so other tools like `scp` or `rsync` are necessary for moving large
files onto the cluster.
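
A common pattern is to copy the data to the cluster with `scp` and then load
it into HDFS; a minimal sketch, with a hypothetical file name standing in for
your own data:

```bash
# Copy a large archive from your local machine to the cluster
scp big-dataset.tar.gz vunetid@bigdata.accre.vanderbilt.edu:~/

# Then, from an ssh session on the cluster, load it into your HDFS home
hdfs dfs -put ~/big-dataset.tar.gz /user/<vunetid>/
```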

In addition to your own data, ACCRE hosts some publicly available datasets
at `/data/`:

Directory             | Description
--------------------- | -----------
babs                  | Bay Area bikeshare data
capitalbikeshare-data | DC area bikeshare data
citibike-tripdata     | NYC bikeshare data
google-ngrams         | n-grams collected from Google Books
nyc-tlc               | NYC taxi trip data
stack-archives        | historic posts from StackOverflow, et al.
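
These datasets live in HDFS like any other files, so you can explore them
from the File Browser or from the command line; for example:

```bash
# List the public datasets hosted on the cluster
hdfs dfs -ls /data/

# Drill into a single dataset, e.g. the NYC taxi trip data
hdfs dfs -ls /data/nyc-tlc
```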

If you know of other datasets that may appeal to the Vanderbilt community at
large, just let us know!

# Building an application

Hue uses Oozie to compose workflows on the cluster; to access it, you'll
need to follow the tabs `Workflows -> Editors -> Workflows`. From here, click
the `+ Create` button, and you'll arrive at the workflow composer screen. You
can drag and drop an application into your workflow, for instance a Spark job.
Here you can specify the jar file (which, conveniently, you can generate from
our [GitHub repo][spark-wc-gh]) along with its options and inputs.
If you want to interactively select your input and output files each time you
execute the job, you can use the special keywords `${input}` and `${output}`,
which is a nice feature for generalizing your workflows.
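
For reference, the Spark job such a workflow launches is roughly what you'd
get from `spark-submit` on the command line. A minimal sketch, where the
class name and the input/output paths are hypothetical stand-ins for
`${input}` and `${output}`:

```bash
# Run the word-count jar on YARN with explicit input and output paths
spark-submit \
  --master yarn \
  --class JavaWordCount \
  spark-wordcount.jar \
  /user/<vunetid>/input.txt \
  /user/<vunetid>/wordcount-output
```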

# Overview of Cloudera services

The Hadoop ecosystem is rich with applications, and Cloudera implements many
of these technologies out of the box.

| Cloudera Services | Description
| ----------------- | -----------
| Pig | High-level language for expressing data analysis programs
| Solr | Text search engine supporting free form queries

[spark-wc-gh]: https://github.com/bigdata-vandy/spark-wordcount