diff --git a/README.md b/README.md
index e3058c3..5d2e180 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,95 @@
-# Spark-Benchmarks-Setup
+# Benchmark Automation
+This repository contains shell scripts to set up and execute the following Spark benchmarks:
+1. Spark-perf
+2. spark-bench
+3. HiBench
+4. Terasort
+
+Detailed steps to set up and run each benchmark are given below.
+
+# Spark-perf
+
+### Pre-requisites:
+1. Zip is installed on the master machine.
+2. Python is installed on the master machine.
+3. Hadoop and Spark setup is already completed using the scripts at https://github.com/kmadhugit/hadoop-cluster-utils.git, and both are running on the master and slave machines.
+4. WORKDIR is set as an environment variable.
+
+### Installations:
+* To automate the Spark-perf installation, follow the steps below:
+
+  ```bash
+  git clone https://github.com/kavanabhat/Spark-Benchmarks-Setup.git
+
+  cd Spark-Benchmarks-Setup/spark-perf-setup
+  ```
+
+### How to run the script ###
+ 1. To configure `spark-perf`, run `./install.sh`. It clones the spark-perf repository under `Spark-Benchmarks-Setup/spark-perf-setup/` and sets the Hadoop- and Spark-related variables in the config.py file for spark-perf.
+ 2. To run the benchmark, run `./runbench.sh`. It prompts for the type of test to run and for the scale factor, if you want to change it. Once all inputs are received, it executes the selected benchmarks.
+ 3. Output files for the benchmarks are stored in zip format at `Spark-Benchmarks-Setup/spark-perf-setup/wdir/spark-perf-results`, and logs at `Spark-Benchmarks-Setup/spark-perf-setup/wdir/spark-perf-logs`.
+
+# Spark-bench
+
+### Pre-requisites:
+1. Zip is installed on the master machine.
+2. Python is installed on the master machine.
+3. Hadoop and Spark setup is already completed using the scripts at https://github.com/kmadhugit/hadoop-cluster-utils.git, and both are running on the master and slave machines.
+4. If you want to run Spark workloads on Hive, then Hive needs to be installed and configured.
+5. The script installs git-extras, which needs the EPEL repo in case you are using Red Hat. The steps to add the EPEL repo are:
+ - rpm --import http://download.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-7
+ - yum-config-manager --add-repo https://dl.fedoraproject.org/pub/epel/7/ppc64le/
+
+### Overview ###
+This code eases the steps for installing spark-bench. Currently it installs spark-bench from branch 2.0.1 of the repo at https://github.com/MaheshIBM/spark-bench
+
+### How to run the script ###
+Clone the repo and run ./install.sh in Spark-Benchmarks-Setup/spark-bench-setup. The code is tested on Ubuntu 16.04.1 LTS and Red Hat Enterprise Linux Server release 7.2 (Maipo).
+After installation, use run_bench.sh to run a workload; for example, to run Terasort the command is *./run_bench.sh -cr Terasort*
+Use the -c flag to only create data, -r to only run, and -cr to create and run.
+
+# HiBench
+
+### Pre-requisites:
+1. Zip is installed on the master machine.
+2. Maven and Python are installed on the master machine.
+3. Hadoop and Spark setup is already completed using the scripts at https://github.com/kmadhugit/hadoop-cluster-utils.git, and both are running on the master and slave machines.
+4. Hive and MySQL setup is also completed using the scripts mentioned in point 3 above.
+5. Set the shell environment variable `WORKDIR` to the path where you want to clone/install the git repository of hibench-setup (e.g. export WORKDIR=/home/testuser).
+
+### Installations:
+* To automate the HiBench installation, follow the steps below:
+
+  ```bash
+  git clone https://github.com/kavanabhat/Spark-Benchmarks-Setup.git
+
+  cd Spark-Benchmarks-Setup/hibench-setup
+  ```
+
+### How to run the script ###
+
+ 1. To configure `HiBench`, run `./install.sh`. It clones the HiBench repository under `Spark-Benchmarks-Setup/hibench-setup/wdir`, sets the Hadoop- and Spark-related variables in the HiBench configuration files, and finally runs the HiBench build.
+ 2. If `./install.sh` installs Maven on a Red Hat machine, please execute `source ~/.bashrc` to export the updated Maven-related environment variables in your current login session.
+ 3. To run the benchmarks, run `./runbench.sh`. It asks which workloads to run. Please provide workload names in comma-separated format for multiple inputs (e.g. sql,micro), or "all" to run all workloads.
+ 4. Output files for the benchmarks are stored in zip format at `Spark-Benchmarks-Setup/hibench-setup/wdir/hibench-results`, and logs at `Spark-Benchmarks-Setup/hibench-setup/wdir/hibench-logs`.
+
+# Terasort
+
+### Pre-requisites:
+1. Zip is installed on the master machine.
+2. Maven is installed on the master machine.
+3. Hadoop and Spark setup is already completed using the scripts at https://github.com/kmadhugit/hadoop-cluster-utils.git, and both are running on the master and slave machines.
+
+
+### How to install:
+
+  ```bash
+  git clone https://github.com/kavanabhat/Spark-Benchmarks-Setup.git
+
+  cd Spark-Benchmarks-Setup/terasort-setup
+
+  ```
+### How to run the script ###
+ 1. To clone and build the `Terasort` code, run `./install.sh`. It clones the Terasort repository under `Spark-Benchmarks-Setup/terasort-setup/wdir`, sets the Hadoop- and Spark-related variables in the Terasort configuration files, and finally runs the build command for Terasort.
+ 2. To run Terasort, run `./runbench.sh`. Depending on the options selected, it first generates the data into HDFS (data/terasort_in), then sorts the data into HDFS (data/terasort_out); after that, the data is validated and the validation output is stored in HDFS at (data/terasort_validate).
+ 3. 
Output files for sorting/validation/data generation are stored in zip format at `Spark-Benchmarks-Setup/terasort-setup/wdir/terasort_results`
\ No newline at end of file
diff --git a/hibench-setup/install.sh b/hibench-setup/install.sh
index 0d652c9..298404d 100755
--- a/hibench-setup/install.sh
+++ b/hibench-setup/install.sh
@@ -204,6 +204,13 @@
 echo -e "Building HiBench redirecting logs to $log" | tee -a $log
 ${HIBENCH_WORK_DIR}/HiBench/bin/build-all.sh >> $log
 echo -e
+
+if [ -f ${HIBENCH_WORK_DIR}/HiBench/hadoopbench/mahout/target/apache-mahout-distribution-0.11.0.tar.gz ]
+then
+    cd ${HIBENCH_WORK_DIR}/HiBench/hadoopbench/mahout/target/
+    tar -xzf apache-mahout-distribution-0.11.0.tar.gz &>/dev/null
+    sed -i 's|level value="info"|level value="warn"|g' ${HIBENCH_WORK_DIR}/HiBench/hadoopbench/mahout/target/apache-mahout-distribution-0.11.0/conf/log4j.xml
+fi
 echo -e 'Please edit memory and executor related parameters like "hibench.yarn.executor.num","hibench.yarn.executor.cores","spark.executor.memory","spark.driver.memory" as per your requirement in '${HIBENCH_WORK_DIR}'/HiBench/conf/spark.conf file \n'
 if [ $is_redhat = 1 ] && [ $mvn_install -ne 0 ]
 then
diff --git a/hibench-setup/runbench.sh b/hibench-setup/runbench.sh
index 9207d7e..739bd8c 100755
--- a/hibench-setup/runbench.sh
+++ b/hibench-setup/runbench.sh
@@ -150,4 +150,4 @@ zip -r ${HIBENCH_WORK_DIR}/hibench_results/hibench_output_$current_time.zip ./*
 echo 'You can check results at location '${HIBENCH_WORK_DIR}'/hibench_results and logs at location '${HIBENCH_WORK_DIR}'/hibench_logs' | tee -a $log
 echo "Report file at ${HIBENCH_WORK_DIR}/hibench_results/hibench.report_${current_time}" | tee -a $log
 echo "Log file at ${HIBENCH_WORK_DIR}/hibench_logs/${last_file}" | tee -a $log
-echo "Zipped file at ${HIBENCH_WORK_DIR}/hibench_results/hibench_output_$current_time.zip" | tee - a log
+echo "Zipped file at ${HIBENCH_WORK_DIR}/hibench_results/hibench_output_$current_time.zip" | tee -a $log
diff --git a/spark-perf-setup/install.sh b/spark-perf-setup/install.sh
index fa9800b..c5082c3 100755
--- a/spark-perf-setup/install.sh
+++ b/spark-perf-setup/install.sh
@@ -24,18 +24,35 @@
 log=${PERFWORK_DIR}/spark_perf_logs/spark_perf_install_$current_time.log
 echo -e | tee -a $log
-#check for zip installed or not
-if [ ! -x /usr/bin/zip ] ; then
-    echo "zip is not installed on Master, so getting installed" | tee -a $log
-    sudo apt-get install -y zip >> $log
-fi
-
-if [ ! -x /usr/bin/python ]
+python -mplatform | grep -i redhat >/dev/null 2>&1
+# Not Red Hat: assume Debian/Ubuntu
+if [ $? -ne 0 ]
 then
-    echo "Python is not installed on Master, so installing Python" | tee -a $log
-    sudo apt-get install -y python >> $log
-fi
+    #check for zip installed or not
+    if [ ! -x /usr/bin/zip ] ; then
+        echo "zip is not installed on Master, so getting installed" | tee -a $log
+        sudo apt-get install -y zip &>> $log
+    fi
+    if [ ! -x /usr/bin/python ]
+    then
+        echo "Python is not installed on Master, so installing Python" | tee -a $log
+        sudo apt-get install -y python &>> $log
+    fi
+else
+# Red Hat
+    if [ ! -x /usr/bin/zip ]
+    then
+        echo "zip is not installed on Master, so getting installed" | tee -a $log
+        sudo yum -y install zip &>> $log
+    fi
+
+    if [ ! -x /usr/bin/python ]
+    then
+        echo "Python is not installed on Master, so installing Python" | tee -a $log
+        sudo yum -y install python &>> $log
+    fi
+fi
 echo -e 'Node server details for existing hadoop and spark setup'
 MASTER=`hostname`
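The distro-detection install pattern that this patch adds to `spark-perf-setup/install.sh` can be sketched as a standalone helper. This is a minimal sketch, not the repository's code: the function names (`pkg_install_cmd`, `ensure_installed`) and the log path are illustrative assumptions.

```shell
#!/bin/bash
# Sketch of the pattern from spark-perf-setup/install.sh: detect the distro
# once, then install a package only when its binary is missing.
# Function names and the log path are illustrative, not from the repo.

log=/tmp/spark_perf_install.log

# Echo the package-manager install command appropriate for this host.
pkg_install_cmd() {
    if python -mplatform 2>/dev/null | grep -qi redhat; then
        echo "sudo yum -y install"        # Red Hat family
    else
        echo "sudo apt-get install -y"    # assume Debian/Ubuntu otherwise
    fi
}

# Install a package only if its binary is absent from /usr/bin,
# logging output to $log as the original script does.
ensure_installed() {
    if [ ! -x "/usr/bin/$1" ]; then
        echo "$1 is not installed on Master, so installing $1" | tee -a "$log"
        $(pkg_install_cmd) "$1" >>"$log" 2>&1
    fi
}

# Show which package manager would be used on this host.
echo "Using: $(pkg_install_cmd)"
```

In the actual script the helper would be invoked once per prerequisite, e.g. `ensure_installed zip` and `ensure_installed python`, replacing the duplicated apt/yum branches.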