# Hadoop and Yarn Setup

## 1. Set passwordless login

Create the same user account on every node and set up key-based SSH from the master to the other nodes.

For each of the other hosts, copy the public key across and verify that you can log in without a password:
```
ssh-copy-id -i ~/.ssh/id_rsa.pub user@host
ssh user@host
```
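The `ssh-copy-id` step assumes you already have a key pair on the master; if not, a minimal sketch (assuming OpenSSH) to create one is:
```bash
# generate an RSA key pair; press Enter at the prompts for the default path and an empty passphrase
ssh-keygen -t rsa
```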
## 2. Download and install Hadoop

Download a release from http://hadoop.apache.org/releases.html#Download

### Prerequisites
1. Java must be installed and `JAVA_HOME` set in `~/.bashrc` (environment variable).
2. Make sure the nodes are set up for password-less SSH in both directions (master->slaves).
3. Since the scripts rely heavily on environment variables, make sure to comment out the portion of `~/.bashrc` that follows the line `If not running interactively, don't do anything` (see the .bashrc update in section 4).

```
#Choose the right mirror, the link below is for US machines.
wget http://www-us.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar xf hadoop-2.7.3.tar.gz --gzip
export HADOOP_HOME=$HOME/hadoop-2.7.3
```
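Before continuing, it can help to confirm the prerequisites and the unpacked tree (a minimal check, assuming the paths used above):
```bash
echo $JAVA_HOME                   # should point at your JDK install
$HADOOP_HOME/bin/hadoop version   # should report Hadoop 2.7.3
```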

## 3. Update slaves file

Add the data nodes; do not add the master node.
```bash
vi $HADOOP_HOME/etc/hadoop/slaves
user@host1
user@host2
```
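To confirm that password-less SSH works for every entry in the slaves file (a small sketch, assuming one `user@host` per line as above):
```bash
# attempt a non-interactive login to each slave and print its hostname
while read -r node; do
  ssh -o BatchMode=yes "$node" hostname
done < "$HADOOP_HOME/etc/hadoop/slaves"
```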

## 4. Hadoop utils setup
```
git clone https://github.com/kmadhugit/hadoop-cluster-utils.git
cd hadoop-cluster-utils
vi add-this-to-dot-profile.sh #update the environment variable paths for your setup
. add-this-to-dot-profile.sh
```

Check whether the cluster scripts are working (`AN` runs a command on all nodes; see section 8 for the full list of helper scripts):

```
AN hostname
```

Update .bashrc

1. Delete/comment the following check:
```
# If not running interactively, don't do anything
case $- in
    *i*) ;;
      *) return;;
esac
```
2. Read `add-this-to-dot-profile.sh` into the end of `.bashrc` (or use the non-interactive alternative shown after this list):
```
vi $HOME/.bashrc
G
:r $HOME/hadoop-cluster-utils/add-this-to-dot-profile.sh
```
3. Copy `.bashrc` to all other data nodes:
```
CP $HOME/.bashrc $HOME
```
4. Install curl (`sudo apt-get install curl`) and wget (`sudo apt-get install wget`).
5. The same user account must exist on the master and all slave nodes for a multi-node installation.
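If you prefer not to edit the file in vi, a non-interactive alternative (a sketch assuming the default paths above) is to append the profile snippet directly and reload it:
```bash
# append the cluster-utils environment setup to .bashrc and re-source it
cat $HOME/hadoop-cluster-utils/add-this-to-dot-profile.sh >> $HOME/.bashrc
. $HOME/.bashrc
```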


## 5. Install Hadoop on all nodes
```
CP $HOME/hadoop-2.7.3.tar.gz $HOME
DN "tar xf hadoop-2.7.3.tar.gz --gzip"
```
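To confirm the extraction succeeded everywhere, a quick check (relying on the `AN` helper described in section 8) is:
```bash
# print the Hadoop version on every node, including the master
AN "$HOME/hadoop-2.7.3/bin/hadoop version"
```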

## 6. HDFS configuration

### Automated setup (alternative)

The `hadoop-cluster-utils` scripts can also automate the Hadoop installation and configuration. To follow that path:

```bash
git clone https://github.com/kmadhugit/hadoop-cluster-utils.git
cd hadoop-cluster-utils
```

1. To configure `hadoop-cluster-utils`, run `./autogen.sh`, which creates `config.sh` with appropriate field values.
2. SLAVEIPs can be entered interactively while running `./autogen.sh` (comma separated if there is more than one).
3. By default, `Spark-2.0.1` and `Hadoop-2.7.1` are available for installation.
4. The default port values and the `spark` and `hadoop` versions can be edited in `config.sh`.
5. Before executing `./setup.sh`, verify or edit `config.sh`.
6. Once the setup script has completed, source `~/.bashrc`.

Then run `checkall.sh` and ensure the expected Java processes are running on the master and the slaves; if not, check the log files:

```bash
checkall.sh
```

### Manual HDFS configuration

You need to modify two config files for HDFS.

1. core-site.xml #Modify the hostname for the name node, then copy it to the master's config dir and to all slaves
```
cd $HOME/hadoop-cluster-utils/conf
cp core-site.xml.template core-site.xml
vi core-site.xml
cp core-site.xml $HADOOP_HOME/etc/hadoop
CP core-site.xml $HADOOP_HOME/etc/hadoop
```

2. hdfs-site.xml

Create a local directory on the name node for the meta-data:
```
mkdir -p /data/user/hdfs-meta-data
```

Create a local directory on all data nodes for the HDFS data:
```
DN "mkdir -p /data/user/hdfs-data"
```

Update the directory paths in hdfs-site.xml, then copy it to the master's config dir and to all slaves:
```
cd $HOME/hadoop-cluster-utils/conf
cp hdfs-site.xml.template hdfs-site.xml
vi hdfs-site.xml #update dir path
cp hdfs-site.xml $HADOOP_HOME/etc/hadoop
CP hdfs-site.xml $HADOOP_HOME/etc/hadoop
```

3. Start HDFS as a fresh filesystem

```
$HADOOP_PREFIX/bin/hdfs namenode -format mycluster
start-hdfs.sh
AN jps
# use stop-hdfs.sh for stopping
```
Invoke `checkall.sh` to ensure all services are started on the master and slaves.

4. Start HDFS on existing cluster data
You need to change the ownership of the already-created data to your own user before reusing it:

```
AN "sudo chown user:user /data/hdfs-meta-data"
AN "sudo chown user:user /data/hdfs-data"
start-hdfs.sh
AN jps
```

Ensure that the following Java process is running on the master. If not, check the log files:

```
NameNode
```
Ensure that the following Java process is running on the slaves. If not, check the log files:
```
DataNode
```

5. HDFS web address

```
http://localhost:50070
```
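Once the NameNode and DataNodes are up, a quick smoke test of the filesystem (a minimal sketch, reusing the `$HADOOP_PREFIX` path from the format step) is:
```bash
$HADOOP_PREFIX/bin/hdfs dfsadmin -report            # list live datanodes and their capacity
$HADOOP_PREFIX/bin/hdfs dfs -mkdir -p /user/$USER   # create a home directory in HDFS
$HADOOP_PREFIX/bin/hdfs dfs -ls /                   # list the filesystem root
```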

## 7. Yarn configuration

You need to modify two config files for Yarn.

1. capacity-scheduler.xml #Modify resource-calculator property to DominantResourceCalculator

```bash
vi $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml
```
```xml
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```
2. yarn-site.xml #Modify the properties as per the description provided in the template, then copy it to the master's config dir and to all slaves

```
cd $HOME/hadoop-cluster-utils/conf
cp yarn-site.xml.template yarn-site.xml
vi yarn-site.xml
cp yarn-site.xml $HADOOP_HOME/etc/hadoop
CP yarn-site.xml $HADOOP_HOME/etc/hadoop
```

3. Start Yarn
```
start-yarn.sh
AN jps
```

Ensure that the following Java processes are started on the master. If not, check the log files:

```
JobHistoryServer
ResourceManager
```
Ensure that the following Java processes are running on the slaves. If not, check the Hadoop log files:
```
DataNode
NodeManager
```

4. Resource Manager and Node Manager web addresses
```
Resource Manager : http://localhost:8088/cluster
Node Manager : http://datanode:8042/node (For each node)
```
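To verify that Yarn can actually schedule work, you can submit the bundled MapReduce example (a sketch assuming the stock Hadoop 2.7.3 tarball layout):
```bash
# estimate pi with 2 map tasks of 10 samples each, run through the Yarn scheduler
$HADOOP_HOME/bin/yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 2 10
```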

## 8. Useful scripts

```
> stop-all.sh #stop HDFS and Yarn
> start-all.sh #start HDFS and Yarn
> CP <localpath to file> <remotepath to dir> #Copy a file from the name node to all slaves
> AN <command> #execute a given command on all nodes including the master
> DN <command> #execute a given command on all nodes excluding the master
> checkall.sh #ensure all services are started on the master & slaves
```

## 9. Spark installation

### a. Download Binary

```
http://spark.apache.org/downloads.html
#Choose the right mirror, the link below is for US machines.
wget http://www-us.apache.org/dist/spark/spark-2.0.1/spark-2.0.1-bin-hadoop2.7.tgz
tar -xzvf spark-2.0.1-bin-hadoop2.7.tgz
```

### b. Build it yourself

```
git clone https://github.com/apache/spark.git
cd spark
git checkout -b v2.0.1 v2.0.1
export MAVEN_OPTS="-Xmx32G -XX:MaxPermSize=8G -XX:ReservedCodeCacheSize=2G"
./build/mvn -T40 -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver -DskipTests -Dmaven.javadoc.skip=true install
```

### c. Test (pre-built spark version)
```
#Add in ~/.bashrc
export SPARK_HOME=$HOME/spark-2.0.1-bin-hadoop2.7

. ~/.bashrc

${SPARK_HOME}/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --driver-memory 1024M --num-executors 2 --executor-memory 1g --executor-cores 1 ${SPARK_HOME}/examples/jars/spark-examples_2.11-2.0.1.jar 10
```

### d. Test (manual spark build)

```
#Add in ~/.bashrc
export SPARK_HOME=$HOME/spark

. ~/.bashrc

$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --driver-memory 1024M --num-executors 2 --executor-memory 1g --executor-cores 1 /home/testuser/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.0.1.jar

```

### e. Enable EventLogging & additional settings by adding the following content to $SPARK_HOME/conf/spark-defaults.conf
```
spark.eventLog.enabled true
spark.eventLog.dir /tmp/spark-events
spark.eventLog.compress true
spark.history.fs.logDirectory /tmp/spark-events
spark.serializer org.apache.spark.serializer.KryoSerializer
```
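Note that Spark does not create the event log directory automatically; if `/tmp/spark-events` is missing, applications fail at submission time. Creating it up front on every node (a sketch using the `AN` helper) avoids that:
```bash
AN "mkdir -p /tmp/spark-events"
```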

### f. Start/stop all services

The scripts below start and stop the following services in an automated way:

- namenode daemon (only on hdfs master)
- datanode daemon (on all slave nodes)
- resource manager daemon (only on yarn master)
- node manager daemon (on all slave nodes)
- job history server (only on yarn master)
- Spark history server (on yarn master)

```
# Start
start-all.sh

# Stop
stop-all.sh
```

HDFS, Resource Manager, Node Manager and Spark web addresses:

```
HDFS web address : http://localhost:50070
Resource Manager : http://localhost:8088/cluster
Node Manager : http://datanode:8042/node (For each node)
Spark : http://localhost:8080 (Default)
```

## 10. Spark command line options for the Yarn scheduler


| Option | Description |
|--------|-------------|
| --num-executors | Total number of executor JVMs to spawn across the Yarn cluster |
| --executor-cores | Number of cores for each executor JVM |
| --executor-memory | Memory to be allocated to each executor JVM (e.g. 1024M or 1g) |
| --driver-memory | Memory to be allocated to the driver JVM |
| --driver-cores | Number of vcores for the driver JVM |
| | Total vcores = num-executors * executor-cores + driver-cores |
| | Total memory = num-executors * executor-memory + driver-memory |
| --driver-java-options | Extra JVM options to pass to the driver, useful in local mode for profiling |
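For example, the test command in section 9c (`--num-executors 2 --executor-cores 1 --executor-memory 1g --driver-memory 1024M`) asks Yarn for 2 * 1 + 1 = 3 vcores (with the default of one driver core) and roughly 2 * 1g + 1g = 3g of memory, plus Yarn's small per-container overhead.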
