321 changes: 37 additions & 284 deletions README.md
@@ -1,308 +1,61 @@
# Hadoop and Yarn Setup

## 1. Set passwordless login
### Prerequisites:
1. Java setup should be complete, and `JAVA_HOME` should be set as an environment variable.
2. Make sure the nodes are set up for password-less SSH both ways (master->slaves).
3. Since the scripts rely heavily on environment variables, make sure to comment out the portion following this statement in your ~/.bashrc:
`If not running interactively, don't do anything`
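
The first two prerequisites can be checked from a shell before going further. The following is a minimal sketch, not part of the repository: `check_java_home` and `check_ssh` are hypothetical helper names, and the 5-second timeout is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Sketch: check the prerequisites before starting the installation.
# check_java_home and check_ssh are hypothetical helper names.

# Succeeds only if JAVA_HOME is set and contains an executable bin/java.
check_java_home() {
    [ -n "${JAVA_HOME:-}" ] && [ -x "${JAVA_HOME}/bin/java" ]
}

# Succeeds only if key-based login to host $1 works. BatchMode=yes makes
# ssh fail instead of prompting for a password.
check_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null
}

if check_java_home; then
    echo "JAVA_HOME looks usable: ${JAVA_HOME}"
else
    echo "JAVA_HOME is unset or has no bin/java"
fi
```

Once the keys are in place, `check_ssh user@host` can be called for each node; it returns non-zero when a password would still be required.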

To create user
```
sudo adduser testuser
sudo adduser testuser sudo
```
### Installations:

For local host
* To automate the Hadoop installation, follow these steps:

```
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
```
For other hosts

```
ssh-copy-id -i ~/.ssh/id_rsa.pub user@host
ssh user@host
```
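
With more than a couple of nodes, the two commands above can be wrapped in a loop. This is a sketch under assumptions: the `HOSTS` list and its `user@host1`-style entries are placeholders, and `run` and `setup_passwordless_ssh` are hypothetical helper names. It defaults to a dry run that only prints the commands.

```bash
# Sketch: push the public key to several hosts, then verify each login.
# HOSTS holds placeholder names; DRY_RUN=1 prints the commands instead of
# running them.
HOSTS="${HOSTS:-user@host1 user@host2}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_passwordless_ssh() {
    local host
    for host in $HOSTS; do
        run ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "$host"
        # BatchMode=yes makes ssh fail rather than prompt for a password.
        run ssh -o BatchMode=yes "$host" true
    done
}

setup_passwordless_ssh
```

Setting `DRY_RUN=0` and a real `HOSTS` list runs the commands for real; `ssh-copy-id` will still ask for each host's password once.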
## 2. Download and install hadoop

http://hadoop.apache.org/releases.html#Download

```
#Choose the right mirror, below link is for US machines.
wget http://www-us.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar xf hadoop-2.7.3.tar.gz --gzip
export HADOOP_HOME=$HOME/hadoop-2.7.3
```

## 3. Update slaves file

Add the data nodes only; do not add the master node.
```bash
vi $HADOOP_HOME/etc/hadoop/slaves
user@host1
user@host2
```
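
The same file can be written without opening an editor, which is handy when scripting the setup. A minimal sketch, assuming the same placeholder entries as above; `write_slaves` is a hypothetical helper name, and the demonstration writes to a scratch file rather than the real configuration.

```bash
# Sketch: write the slaves file non-interactively.
# write_slaves FILE HOST... writes one data node per line (master excluded).
write_slaves() {
    local file="$1"
    shift
    printf '%s\n' "$@" > "$file"
}

# Demonstration against a scratch file. On a real master node the target
# would be "$HADOOP_HOME/etc/hadoop/slaves".
demo_file="$(mktemp)"
write_slaves "$demo_file" user@host1 user@host2
cat "$demo_file"
```

Note that this overwrites the target file, so any existing entries need to be passed again.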

## 4. Hadoop utils setup
```
git clone https://github.com/kmadhugit/hadoop-cluster-utils.git
cd hadoop-cluster-utils
vi add-this-to-dot-profile.sh #update correct path to env variables.
. add-this-to-dot-profile.sh
```

Check whether the cluster scripts are working:

```
AN hostname
```

Update .bashrc

1. Delete the following check.
```
# If not running interactively, don't do anything
case $- in
*i*) ;;
*) return;;
esac
```

2. Read add-this-to-dot-profile.sh at the end of .bashrc

```
vi $HOME/.bashrc
Gi
:r $HOME/hadoop-cluster-utils/add-this-to-dot-profile.sh
G
set -o vi
```

3. Copy .bashrc to all other data nodes
```bash
git clone https://github.com/kmadhugit/hadoop-cluster-utils

```
CP $HOME/.bashrc $HOME
```


## 5. Install Hadoop on all nodes
```
CP $HOME/hadoop-2.7.3.tar.gz $HOME
DN "tar xf hadoop-2.7.3.tar.gz --gzip"
```

## 6. HDFS configuration

You need to modify 2 config files for HDFS

1. core-site.xml #Modify the Hostname for the Name node
```
cd $HOME/hadoop-cluster-utils/conf
cp core-site.xml.template core-site.xml
vi core-site.xml
cp core-site.xml $HADOOP_HOME/etc/hadoop
CP core-site.xml $HADOOP_HOME/etc/hadoop
cd hadoop-cluster-utils
```

2. hdfs-site.xml
* Configuration

Create a local directory on the name node for metadata:

``` mkdir -p /data/user/hdfs-meta-data ```

Create a local directory on all data nodes for HDFS data:

``` DN "mkdir -p /data/user/hdfs-data" ```
1. To configure `hadoop-cluster-utils`, run `./autogen.sh` which will create `config.sh` with appropriate field values.
2. You can enter the `Spark` and `Hadoop` versions interactively while running `./autogen.sh`.
3. Before executing `./setup.sh`, you can verify or edit `config.sh`.

Update the directory path:
```
cd $HOME/hadoop-cluster-utils/conf
cp hdfs-site.xml.template hdfs-site.xml
vi hdfs-site.xml #update dir path
```
Copy the files to all nodes
* Ensure that the following java process is running in master. If not, check the log files

```bash
checkall.sh
```
cp hdfs-site.xml $HADOOP_HOME/etc/hadoop
CP hdfs-site.xml $HADOOP_HOME/etc/hadoop
```

3. Start HDFS as fresh FS

```
$HADOOP_HOME/bin/hdfs namenode -format mycluster
start-hdfs.sh
AN jps
# use stop-hdfs.sh for stopping
```

4. Start HDFS on existing cluster data
Change the ownership of the existing data directories to your own user so that the already created data can be used.

```
AN "sudo chown user:user /data/user/hdfs-meta-data"
AN "sudo chown user:user /data/user/hdfs-data"
start-hdfs.sh
AN jps
```

Ensure that the following java process is running in master. If not, check the log files

```
NameNode
```
Ensure that the following java process is running in slaves. If not, check the log files
```
DataNode
```

5. HDFS web address

```
http://localhost:50070
```

## 7. Yarn configuration
Invoke `checkall.sh` to ensure all services are started on the master and slaves.

You need to modify 2 config files for Yarn

1. capacity-scheduler.xml #Modify resource-calculator property to DominantResourceCalculator

```bash
vi $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml
```
```xml
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```
2. yarn-site.xml # Modify the properties as per the description provided in the template

```
cd $HOME/hadoop-cluster-utils/conf
cp yarn-site.xml.template yarn-site.xml
vi yarn-site.xml
cp yarn-site.xml $HADOOP_HOME/etc/hadoop
CP yarn-site.xml $HADOOP_HOME/etc/hadoop
AN jps
```
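
After editing `capacity-scheduler.xml` as described in item 1 above, the change can be confirmed with a quick text check. This is a sketch only: `has_dominant_calculator` is a hypothetical helper name, and it assumes the `<value>` element sits on the line directly after the matching `<name>` element, as in the snippet shown for that item.

```bash
# Sketch: confirm that capacity-scheduler.xml selects DominantResourceCalculator.
# Assumes <value> is on the line right after the matching <name>.
has_dominant_calculator() {
    grep -A1 '<name>yarn.scheduler.capacity.resource-calculator</name>' "$1" \
        | grep -q 'DominantResourceCalculator</value>'
}

# Demonstration on a scratch copy of the property.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
EOF

if has_dominant_calculator "$sample"; then
    echo "DominantResourceCalculator is configured"
fi
```

On a real node the argument would be `$HADOOP_HOME/etc/hadoop/capacity-scheduler.xml`. A proper XML parser would be more robust if the file is formatted differently.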

Ensure that the following Java processes are started on the master. If not, check the log files

```
NameNode
JobHistoryServer
ResourceManager
```
Ensure that the following java process is started in slaves. If not, check the log files
Ensure that the following java process is running in slaves. If not, check the hadoop log files
```
DataNode
NodeManager
```

3. Start Yarn
```
start-yarn.sh
AN jps
```

4. Resource Manager and Node Manager web addresses
```
Resource Manager : http://localhost:8088/cluster
Node Manager : http://datanode:8042/node (For each node)
```

## 8. Useful scripts

```
> stop-all.sh #stop HDFS and Yarn
> start-all.sh #start HDFS and Yarn
> CP <localpath to file> <remotepath to dir> #Copy file from name node to all slaves
> AN <command> #execute a given command in all nodes including master
> DN <command> #execute a given command in all nodes excluding master
```
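
To make the difference between `AN` and `DN` concrete, here is a hypothetical sketch of how such helpers could be built on top of `ssh`. It is not the repository's implementation: the function names `run_on`, `all_nodes` and `data_nodes`, and the `MASTER` and `SLAVES` host lists, are assumptions for illustration. It defaults to a dry run that prints the `ssh` command lines.

```bash
# Hypothetical sketch of AN/DN-style helpers; NOT the repository's code.
# MASTER and SLAVES are placeholder host lists. With DRY_RUN=1 (the default
# here) the ssh command lines are printed rather than executed.
MASTER="${MASTER:-user@master}"
SLAVES="${SLAVES:-user@host1 user@host2}"
DRY_RUN="${DRY_RUN:-1}"

# Run (or print) one command on one host.
run_on() {
    local host="$1"
    shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "ssh $host $*"
    else
        ssh "$host" "$*"
    fi
}

# all_nodes: master plus every slave (like AN).
all_nodes() {
    local h
    for h in $MASTER $SLAVES; do run_on "$h" "$@"; done
}

# data_nodes: slaves only (like DN).
data_nodes() {
    local h
    for h in $SLAVES; do run_on "$h" "$@"; done
}

all_nodes hostname
data_nodes "mkdir -p /data/user/hdfs-data"
```

The only difference between the two loops is whether the master is included in the host list, which matches the descriptions of `AN` and `DN` above.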

## 9. Spark Installation.

### a. Download Binary

```
# Downloads page: http://spark.apache.org/downloads.html
# Choose the right mirror; the link below is for US machines.
wget http://www-us.apache.org/dist/spark/spark-2.0.1/spark-2.0.1-bin-hadoop2.7.tgz
tar -xzvf spark-2.0.1-bin-hadoop2.7.tgz
```

### b. Build it yourself

```
git clone https://github.com/apache/spark.git
cd spark
git checkout -b v2.0.1 v2.0.1
export MAVEN_OPTS="-Xmx32G -XX:MaxPermSize=8G -XX:ReservedCodeCacheSize=2G"
./build/mvn -T40 -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver -DskipTests -Dmaven.javadoc.skip=true install
```

### c. Test (pre-built spark version)
```
#Add in ~/.bashrc
export SPARK_HOME=$HOME/spark-2.0.1-bin-hadoop2.7

. ~/.bashrc

${SPARK_HOME}/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --driver-memory 1024M --num-executors 2 --executor-memory 1g --executor-cores 1 ${SPARK_HOME}/examples/jars/spark-examples_2.11-2.0.1.jar 10
```

### d. Test (manual spark build)

```
#Add in ~/.bashrc
export SPARK_HOME=$HOME/spark

. ~/.bashrc

$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --driver-memory 1024M --num-executors 2 --executor-memory 1g --executor-cores 1 $SPARK_HOME/examples/target/scala-2.11/jars/spark-examples_2.11-2.0.1.jar

```

### e. Enable EventLogging & additional settings by adding the following content to $SPARK_HOME/conf/spark-defaults.conf
```
spark.eventLog.enabled true
spark.eventLog.dir /tmp/spark-events
spark.eventLog.compress true
spark.history.fs.logDirectory /tmp/spark-events
spark.serializer org.apache.spark.serializer.KryoSerializer
```

### f. Start/Stop All Services.

The scripts below start and stop the following services automatically:

- namenode daemon (only on hdfs master)
- datanode daemon (on all slave nodes)
- resource manager daemon (only on yarn master)
- node manager daemon (on all slave nodes)
- job history server (only on yarn master)
- Spark history server (on yarn master)

```
# Start

start-all.sh
* HDFS, Resource Manager and Node Manager web Address

```
HDFS web address : http://localhost:50070
Resource Manager : http://localhost:8088/cluster
Node Manager : http://datanode:8042/node (For each node)
```

# Stop
* Useful scripts

stop-all.sh
```

## 10. Spark command line options for Yarn Scheduler.


| Option | Description |
|--------|-------------|
| --num-executors | Total number of executor JVMs to spawn across Yarn Cluster |
| --executor-cores | Number of cores for each executor JVM |
| --executor-memory | Memory to be allocated for each executor JVM, e.g. 1024M or 1G |
| --driver-memory | Memory to be allocated for the driver JVM |
| --driver-cores | Total number of vcores for the driver JVM |
| | Total vcores = num-executors * executor-cores + driver-cores |
| | Total memory = num-executors * executor-memory + driver-memory |
| --driver-java-options | Extra Java options to pass to the driver JVM; useful in local mode for profiling |
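
As a worked example of the two totals in the table, the sketch below plugs in the values from the SparkPi commands in this README (2 executors, 1 core and 1 GB each, 1024M for the driver). The value `driver_cores=1` is an assumption, since those commands do not set `--driver-cores`.

```bash
# Worked example of the totals from the table above.
# Values follow the SparkPi command in this README; driver_cores=1 is assumed.
num_executors=2
executor_cores=1
executor_memory_mb=1024
driver_memory_mb=1024
driver_cores=1

total_vcores=$(( num_executors * executor_cores + driver_cores ))
total_memory_mb=$(( num_executors * executor_memory_mb + driver_memory_mb ))

echo "Total vcores: ${total_vcores}"        # 2 * 1 + 1 = 3
echo "Total memory: ${total_memory_mb} MB"  # 2 * 1024 + 1024 = 3072
```

These figures cover only what is requested on the command line; they leave out the per-container memory overhead that Spark adds on Yarn, so the cluster needs somewhat more memory than the computed total.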

-----------------------------------------------------------------
```
> stop-all.sh #stop HDFS and Yarn
> start-all.sh #start HDFS and Yarn
> CP <localpath to file> <remotepath to dir> #Copy file from name node to all slaves
> AN <command> #execute a given command in all nodes including master
> DN <command> #execute a given command in all nodes excluding master
> checkall.sh #ensure all services are started on the Master & slaves
```
24 changes: 0 additions & 24 deletions add-this-to-dot-profile.sh

This file was deleted.
