This repository was archived by the owner on Nov 23, 2017. It is now read-only.

Changes from all commits

31 commits (all by jamborta):
- fde24d2 adding hadoop 2.6 support (Sep 30, 2016)
- 87153f2 a few lines of description (Sep 30, 2016)
- a242d0b adding parameter to template file (Sep 30, 2016)
- 064abd9 validate minor version (Sep 30, 2016)
- c282e06 validate minor version (Sep 30, 2016)
- a68ca9b validate minor version (Sep 30, 2016)
- 14f0d75 validate minor version (Sep 30, 2016)
- a308f74 bugfix (Oct 1, 2016)
- 97cbb6b correct path for ephemeral hdfs (Oct 1, 2016)
- e46020c scala 2.11 for spark 2 (Oct 1, 2016)
- c34d93e document s3 changes (Oct 1, 2016)
- d8d4803 document s3 changes (Oct 1, 2016)
- c13a437 document s3 changes (Oct 1, 2016)
- 9e6920c Update README.md (Oct 1, 2016)
- 833f2de Update README.md (Oct 1, 2016)
- 750ede8 Update README.md (Oct 1, 2016)
- 1c34483 adding hadoop 2.7 (Oct 8, 2016)
- 46b6394 adding static variable VALID_HADOOP_MINOR_VERSIONS (Oct 12, 2016)
- ad525d7 disable tachyon for spark 2 and yarn (Oct 12, 2016)
- 03e70b7 typo in version (Oct 13, 2016)
- 653f338 separate case for each range of spark versions (Oct 19, 2016)
- 21e03d0 update hadoop dependency download path (Oct 24, 2016)
- 4a4f4a5 exhaustive checking of hadoop versions (Oct 24, 2016)
- d7e73bf typo in options (Oct 24, 2016)
- 332b90b return -1 for unknown hadoop version (Oct 24, 2016)
- 71c7047 return 1 for unknown hadoop version (Oct 24, 2016)
- a924690 safeguard for hadoop minor version 2.7 (Oct 24, 2016)
- e3ee4e2 safeguard for hadoop minor version 2.6 (Oct 24, 2016)
- db15dcc resolve conflicts (Oct 24, 2016)
- 246b888 update based on comments (Oct 24, 2016)
- 9b0f1d1 use latest hadoop maintenance version (Oct 31, 2016)
10 changes: 10 additions & 0 deletions README.md
@@ -197,6 +197,15 @@ EC2. These scripts are intended to be used by the default Spark AMI and are *not*
expected to work on other AMIs. If you wish to start a cluster using Spark,
please refer to http://spark-project.org/docs/latest/ec2-scripts.html

## Using S3 with Hadoop 2.6 or newer

Starting with Hadoop 2.6.0, the S3 filesystem connector has been moved into a separate library called `hadoop-aws`.

- To make the package available, add it as a build dependency: `libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.4"`.
- It can also be passed directly to `spark-submit`: `spark-submit --packages org.apache.hadoop:hadoop-aws:2.6.4 SimpleApp.py`.

On a related note, starting with Hadoop 2.6.0 it is recommended to use the `s3a` filesystem rather than `s3n`.
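
As a concrete invocation, here is a minimal sketch; the bucket, credentials, and `SimpleApp.py` are placeholders, and passing S3 credentials on the command line is shown only for brevity:

```bash
# Hypothetical example: fetch hadoop-aws at submit time and read through s3a.
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:2.6.4 \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY \
  SimpleApp.py
# Inside the job, use paths with the s3a scheme, e.g.
#   sc.textFile("s3a://your-bucket/path/to/data.txt")
```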

## spark-ec2 Internals

The Spark cluster setup is guided by the values set in `ec2-variables.sh`. `setup.sh`
@@ -237,3 +246,4 @@ after the templates have been configured. You can use the environment variables
get a list of slave hostnames and `/root/spark-ec2/copy-dir` to sync a directory across machines.

5. Modify `spark_ec2.py` to add your module to the list of enabled modules.

1 change: 1 addition & 0 deletions deploy.generic/root/spark-ec2/ec2-variables.sh
@@ -27,6 +27,7 @@ export MODULES="{{modules}}"
export SPARK_VERSION="{{spark_version}}"
export TACHYON_VERSION="{{tachyon_version}}"
export HADOOP_MAJOR_VERSION="{{hadoop_major_version}}"
export HADOOP_MINOR_VERSION="{{hadoop_minor_version}}"
export SWAP_MB="{{swap}}"
export SPARK_WORKER_INSTANCES="{{spark_worker_instances}}"
export SPARK_MASTER_OPTS="{{spark_master_opts}}"
1 change: 1 addition & 0 deletions deploy_templates.py
@@ -73,6 +73,7 @@
"spark_version": os.getenv("SPARK_VERSION"),
"tachyon_version": os.getenv("TACHYON_VERSION"),
"hadoop_major_version": os.getenv("HADOOP_MAJOR_VERSION"),
"hadoop_minor_version": os.getenv("HADOOP_MINOR_VERSION"),
"java_home": os.getenv("JAVA_HOME"),
"default_tachyon_mem": "%dMB" % tachyon_mb,
"system_ram_mb": "%d" % system_ram_mb,
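To make the templating step concrete: `deploy_templates.py` walks the template files and replaces each `{{placeholder}}` with the corresponding environment value. A rough sketch of the effect for the new variable (illustrative only; the real script handles many more variables and writes the rendered files in place):

```bash
# Illustrative sketch of the substitution performed by deploy_templates.py
# for the new variable; the path and value are examples.
export HADOOP_MINOR_VERSION="2.6"
sed "s|{{hadoop_minor_version}}|$HADOOP_MINOR_VERSION|g" \
    deploy.generic/root/spark-ec2/ec2-variables.sh
# The template line
#   export HADOOP_MINOR_VERSION="{{hadoop_minor_version}}"
# is rendered as:
#   export HADOOP_MINOR_VERSION="2.6"
```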
26 changes: 21 additions & 5 deletions ephemeral-hdfs/init.sh
@@ -30,11 +30,27 @@ case "$HADOOP_MAJOR_VERSION" in
cp /root/hadoop-native/* /root/ephemeral-hdfs/lib/native/
;;
yarn)
wget http://s3.amazonaws.com/spark-related-packages/hadoop-2.4.0.tar.gz
echo "Unpacking Hadoop"
tar xvzf hadoop-*.tar.gz > /tmp/spark-ec2_hadoop.log
rm hadoop-*.tar.gz
mv hadoop-2.4.0/ ephemeral-hdfs/
if [[ "$HADOOP_MINOR_VERSION" == "2.4" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/hadoop-2.4.0.tar.gz
echo "Unpacking Hadoop"
tar xvzf hadoop-*.tar.gz > /tmp/spark-ec2_hadoop.log
rm hadoop-*.tar.gz
mv hadoop-2.4.0/ ephemeral-hdfs/
elif [[ "$HADOOP_MINOR_VERSION" == "2.6" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/hadoop-2.6.5.tar.gz
echo "Unpacking Hadoop"
tar xvzf hadoop-*.tar.gz > /tmp/spark-ec2_hadoop.log
rm hadoop-*.tar.gz
mv hadoop-2.6.5/ ephemeral-hdfs/
elif [[ "$HADOOP_MINOR_VERSION" == "2.7" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/hadoop-2.7.3.tar.gz
echo "Unpacking Hadoop"
tar xvzf hadoop-*.tar.gz > /tmp/spark-ec2_hadoop.log
rm hadoop-*.tar.gz
mv hadoop-2.7.3/ ephemeral-hdfs/
else
echo "ERROR: Unknown Hadoop version"
fi

# Have single conf dir
rm -rf /root/ephemeral-hdfs/etc/hadoop/
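This file and `persistent-hdfs/init.sh` repeat the same download/unpack/move block for every supported version. A possible consolidation, sketched under the assumption that the pinned maintenance releases stay as above (`fetch_hadoop` is a hypothetical helper, not part of this PR):

```bash
# Hypothetical helper: resolve the minor version to the pinned maintenance
# release, then download, unpack, and move in one place.
fetch_hadoop() {
  local minor="$1" target="$2" full
  case "$minor" in
    2.4) full="2.4.0" ;;
    2.6) full="2.6.5" ;;
    2.7) full="2.7.3" ;;
    *) echo "ERROR: Unknown Hadoop version" >&2; return 1 ;;
  esac
  wget "http://s3.amazonaws.com/spark-related-packages/hadoop-$full.tar.gz"
  echo "Unpacking Hadoop"
  tar xvzf "hadoop-$full.tar.gz" > /tmp/spark-ec2_hadoop.log
  rm "hadoop-$full.tar.gz"
  mv "hadoop-$full/" "$target"
}

# Usage, e.g.: fetch_hadoop "$HADOOP_MINOR_VERSION" ephemeral-hdfs/
```

Called this way, the unpacked directory name and the downloaded tarball version also cannot drift apart.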
26 changes: 21 additions & 5 deletions persistent-hdfs/init.sh
@@ -29,11 +29,27 @@ case "$HADOOP_MAJOR_VERSION" in
cp /root/hadoop-native/* /root/persistent-hdfs/lib/native/
;;
yarn)
wget http://s3.amazonaws.com/spark-related-packages/hadoop-2.4.0.tar.gz
echo "Unpacking Hadoop"
tar xvzf hadoop-*.tar.gz > /tmp/spark-ec2_hadoop.log
rm hadoop-*.tar.gz
mv hadoop-2.4.0/ persistent-hdfs/
if [[ "$HADOOP_MINOR_VERSION" == "2.4" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/hadoop-2.4.0.tar.gz
echo "Unpacking Hadoop"
tar xvzf hadoop-*.tar.gz > /tmp/spark-ec2_hadoop.log
rm hadoop-*.tar.gz
mv hadoop-2.4.0/ persistent-hdfs/
elif [[ "$HADOOP_MINOR_VERSION" == "2.6" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/hadoop-2.6.5.tar.gz
echo "Unpacking Hadoop"
tar xvzf hadoop-*.tar.gz > /tmp/spark-ec2_hadoop.log
rm hadoop-*.tar.gz
mv hadoop-2.6.5/ persistent-hdfs/
elif [[ "$HADOOP_MINOR_VERSION" == "2.7" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/hadoop-2.7.3.tar.gz
echo "Unpacking Hadoop"
tar xvzf hadoop-*.tar.gz > /tmp/spark-ec2_hadoop.log
rm hadoop-*.tar.gz
mv hadoop-2.7.3/ persistent-hdfs/
else
echo "ERROR: Unknown Hadoop version"
fi

# Have single conf dir
rm -rf /root/persistent-hdfs/etc/hadoop/
7 changes: 6 additions & 1 deletion scala/init.sh
@@ -11,10 +11,15 @@ SCALA_VERSION="2.10.3"

if [[ "0.7.3 0.8.0 0.8.1" =~ $SPARK_VERSION ]]; then
SCALA_VERSION="2.9.3"
wget http://s3.amazonaws.com/spark-related-packages/scala-$SCALA_VERSION.tgz
Contributor: I'm not sure we need a Scala installation on the cluster anymore, as Spark should just work with a JRE. But it seems fine to have this if people find it useful.

Author: I've never tried Spark without Scala. Does even spark-shell not need Scala?

Contributor: Yes, recent Spark distributions include the Scala libraries that provide the shell and other support. But since this is useful regardless, let's keep it.

elif [[ "2.0.0" =~ $SPARK_VERSION ]]; then
SCALA_VERSION="2.11.8"
wget http://s3.amazonaws.com/spark-related-packages/scala-$SCALA_VERSION.tgz
else
wget http://s3.amazonaws.com/spark-related-packages/scala-$SCALA_VERSION.tgz
fi

echo "Unpacking Scala"
wget http://s3.amazonaws.com/spark-related-packages/scala-$SCALA_VERSION.tgz
tar xvzf scala-*.tgz > /tmp/spark-ec2_scala.log
rm scala-*.tgz
mv `ls -d scala-* | grep -v ec2` scala
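A note on the `=~` tests above: the unquoted right-hand side is treated as a regular expression and matches anywhere inside the left-hand string, so this is a containment check rather than an exact version comparison. A small standalone illustration:

```bash
# Illustrative only: partial versions also match, because =~ does a regex
# match anywhere in the string on the left.
for v in 0.8.1 0.8 2.0.0; do
  if [[ "0.7.3 0.8.0 0.8.1" =~ $v ]]; then
    echo "$v -> Scala 2.9.3 branch"
  else
    echo "$v -> default branch"
  fi
done
# Prints: 0.8.1 and 0.8 both hit the 2.9.3 branch; 2.0.0 does not.
```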
162 changes: 75 additions & 87 deletions spark/init.sh
@@ -24,119 +24,107 @@ then

# Pre-packaged spark version:
else
case "$SPARK_VERSION" in
case "$SPARK_VERSION" in
0.7.3)
if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-0.7.3-prebuilt-hadoop1.tgz
else
wget http://s3.amazonaws.com/spark-related-packages/spark-0.7.3-prebuilt-cdh4.tgz
fi
;;
0.8.0)
if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-0.8.0-incubating-bin-hadoop1.tgz
else
wget http://s3.amazonaws.com/spark-related-packages/spark-0.8.0-incubating-bin-cdh4.tgz
fi
;;
0.8.1)
if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-0.8.1-incubating-bin-hadoop1.tgz
else
wget http://s3.amazonaws.com/spark-related-packages/spark-0.8.1-incubating-bin-cdh4.tgz
fi
;;
0.9.0)
if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-0.9.0-incubating-bin-hadoop1.tgz
else
wget http://s3.amazonaws.com/spark-related-packages/spark-0.9.0-incubating-bin-cdh4.tgz
fi
;;
0.9.1)
if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-0.9.1-bin-hadoop1.tgz
else
wget http://s3.amazonaws.com/spark-related-packages/spark-0.9.1-bin-cdh4.tgz
fi
;;
0.9.2)
if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-0.9.2-bin-hadoop1.tgz
else
wget http://s3.amazonaws.com/spark-related-packages/spark-0.9.2-bin-cdh4.tgz
fi
;;
1.0.0)
if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.0-bin-hadoop1.tgz
else
wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.0-bin-cdh4.tgz
fi
;;
1.0.1)
if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.1-bin-hadoop1.tgz
else
wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.1-bin-cdh4.tgz
fi
;;
1.0.2)
if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.2-bin-hadoop1.tgz
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-prebuilt-hadoop1.tgz
elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-prebuilt-cdh4.tgz
else
wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.2-bin-cdh4.tgz
echo "ERROR: Unsupported Hadoop major version"
return 1
fi
;;
1.1.0)
if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-1.1.0-bin-hadoop1.tgz
;;
0\.8\.0|0\.8\.1|0\.9\.0)
if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-incubating-bin-hadoop1.tgz
elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-1.1.0-bin-cdh4.tgz
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-incubating-bin-cdh4.tgz
else
wget http://s3.amazonaws.com/spark-related-packages/spark-1.1.0-bin-hadoop2.4.tgz
echo "ERROR: Unsupported Hadoop major version"
return 1
fi
;;
1.1.1)
;;
# 0.9.1 - 1.0.2
0.9.1|1\.0\.[0-2])
if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-1.1.1-bin-hadoop1.tgz
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop1.tgz
elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-1.1.1-bin-cdh4.tgz
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-cdh4.tgz
else
wget http://s3.amazonaws.com/spark-related-packages/spark-1.1.1-bin-hadoop2.4.tgz
echo "ERROR: Unsupported Hadoop major version"
return 1
fi
;;
1.2.0)
;;
# 1.1.0 - 1.3.0
1\.[1-2]\.[0-9]*|1\.3\.0)
if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-hadoop1.tgz
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop1.tgz
elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-cdh4.tgz
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-cdh4.tgz
elif [[ "$HADOOP_MAJOR_VERSION" == "yarn" ]]; then
if [[ "$HADOOP_MINOR_VERSION" == "2.4" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
else
echo "ERROR: Unknown Hadoop minor version"
return 1
fi
else
wget http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-hadoop2.4.tgz
echo "ERROR: Unsupported Hadoop major version"
return 1
fi
;;
1.2.1)
;;
# 1.3.1 - 1.6.2
1\.[3-6]\.[0-2])
if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-1.2.1-bin-hadoop1.tgz
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop1.tgz
elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-1.2.1-bin-cdh4.tgz
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-cdh4.tgz
elif [[ "$HADOOP_MAJOR_VERSION" == "yarn" ]]; then
if [[ "$HADOOP_MINOR_VERSION" == "2.4" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
elif [[ "$HADOOP_MINOR_VERSION" == "2.6" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.6.tgz
else
echo "ERROR: Unknown Hadoop minor version"
return 1
fi
else
wget http://s3.amazonaws.com/spark-related-packages/spark-1.2.1-bin-hadoop2.4.tgz
echo "ERROR: Unsupported Hadoop major version"
return 1
fi
;;
*)
;;
# 2.0.0 - 2.0.1
2\.0\.[0-1]|2\.0\.0-preview)
if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop1.tgz
echo "ERROR: Unknown Hadoop major version"
return 1
elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-cdh4.tgz
echo "ERROR: Unknown Hadoop major version"
return 1
elif [[ "$HADOOP_MAJOR_VERSION" == "yarn" ]]; then
if [[ "$HADOOP_MINOR_VERSION" == "2.4" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
elif [[ "$HADOOP_MINOR_VERSION" == "2.6" ]]; then
Contributor: What about hadoop-2.7 here? Was it left out because only Spark > 2.0.0 supports Hadoop 2.7? In that case I think we can have two big case statements, one for Spark major versions < 2.0 and one for major versions >= 2.0. FWIW, the main goal is to avoid making this file too long.

Author: It was to cover 1.4 - 1.6.2, so no Hadoop 2.7.

Author: Would be happy to rewrite, though.

Contributor: Yeah, I was thinking that we could write two big case statements, one to handle 1.x and the other to handle 2.x (we can add sub-case statements within them for specific 1.x quirks, etc.).
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.6.tgz
elif [[ "$HADOOP_MINOR_VERSION" == "2.7" ]]; then
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.7.tgz
else
echo "ERROR: Unknown Hadoop version"
return 1
fi
else
wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
echo "ERROR: Unsupported Hadoop major version"
return 1
fi
;;
*)
if [ $? != 0 ]; then
echo "ERROR: Unknown Spark version"
return -1
return 1
fi
esac
;;
esac
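
For readers following the patterns above: bash `case` patterns are globs, so `[1-2]` matches a single character, a trailing `*` matches any suffix, and the first matching branch wins. A standalone illustration of how the version ranges dispatch:

```bash
# Illustrative only: how the version globs above dispatch.
for v in 1.1.0 1.2.1 1.3.0 1.3.1 1.6.2 2.0.1; do
  case "$v" in
    1\.[1-2]\.[0-9]*|1\.3\.0) echo "$v -> hadoop1/cdh4/yarn-2.4 branch" ;;
    1\.[3-6]\.[0-2])          echo "$v -> branch that adds yarn-2.6" ;;
    2\.0\.[0-1])              echo "$v -> yarn-only (2.4/2.6/2.7) branch" ;;
    *)                        echo "$v -> unknown" ;;
  esac
done
```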

echo "Unpacking Spark"
tar xvzf spark-*.tgz > /tmp/spark-ec2_spark.log