diff --git a/README.md b/README.md
index bdc7cee6..6cf70feb 100644
--- a/README.md
+++ b/README.md
@@ -6,8 +6,8 @@
 to launch, manage and shut down on Amazon EC2. It automatically
 sets up Apache Spark and [HDFS](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html)
 on the cluster for you. This guide describes
-how to use `spark-ec2` to launch clusters, how to run jobs on them, and how 
-to shut them down. It assumes you've already signed up for an EC2 account 
+how to use `spark-ec2` to launch clusters, how to run jobs on them, and how
+to shut them down. It assumes you've already signed up for an EC2 account
 on the [Amazon Web Services site](http://aws.amazon.com/).
 
 `spark-ec2` is designed to manage multiple named clusters. You can
@@ -69,13 +69,15 @@
 types, and the default type is `m3.large` (which has 2 cores and 7.5 GB RAM).
 Refer to the Amazon pages about [EC2 instance
 types](http://aws.amazon.com/ec2/instance-types) and [EC2
 pricing](http://aws.amazon.com/ec2/#pricing) for information about other
-instance types. 
+instance types.
 - `--region=<ec2-region>` specifies an EC2 region in which to launch
   instances. The default region is `us-east-1`.
 - `--zone=<ec2-zone>` can be used to specify an EC2 availability zone to launch
   instances in. Sometimes, you will get an error because there is not enough
   capacity in one zone, and you should try to launch in another.
+- `--ebs-root-vol-type=<vol-type>` can be used to specify the EBS
+  volume type to use. The default value is `gp2`.
 - `--ebs-vol-size=<GB>` will attach an EBS volume with a given amount
   of space to each node so that you can have a persistent HDFS cluster
   on your nodes across cluster restarts (see below).
@@ -145,7 +147,7 @@ export AWS_ACCESS_KEY_ID=ABCDEFG1234567890123
 You can edit `/root/spark/conf/spark-env.sh` on each machine to set Spark
 configuration options, such as JVM options. This file needs to be copied to
 **every machine** to reflect the change. The easiest way to
-do this is to use a script we provide called `copy-dir`. First edit your `spark-env.sh` file on the master, 
+do this is to use a script we provide called `copy-dir`. First edit your `spark-env.sh` file on the master,
 then run `~/spark-ec2/copy-dir /root/spark/conf` to RSYNC it to all the workers.
 
 The [configuration guide](configuration.html) describes the available configuration options.
@@ -195,20 +197,20 @@ In addition to using a single input file, you can also use a directory of files
 
 This repository contains the set of scripts used to setup a Spark cluster on EC2.
 These scripts are intended to be used by the default Spark AMI and is *not*
 expected to work on other AMIs. If you wish to start a cluster using Spark,
-please refer to http://spark-project.org/docs/latest/ec2-scripts.html 
+please refer to http://spark-project.org/docs/latest/ec2-scripts.html
 
 ## spark-ec2 Internals
 
 The Spark cluster setup is guided by the values set in `ec2-variables.sh`.`setup.sh`
 first performs basic operations like enabling ssh across machines, mounting ephemeral
 drives and also creates files named `/root/spark-ec2/masters`, and `/root/spark-ec2/slaves`.
-Following that every module listed in `MODULES` is initialized. 
+Following that every module listed in `MODULES` is initialized.
 
 To add a new module, you will need to do the following:
 
 1. Create a directory with the module's name.
 
-2. Optionally add a file named `init.sh`. This is called before templates are configured 
+2. Optionally add a file named `init.sh`. This is called before templates are configured
 and can be used to install any pre-requisites.
 
 3. Add any files that need to be configured based on the cluster setup to `templates/`.
diff --git a/spark_ec2.py b/spark_ec2.py
index 28d72f43..9c3ad288 100644
--- a/spark_ec2.py
+++ b/spark_ec2.py
@@ -249,12 +249,15 @@ def parse_args():
         "--resume", action="store_true", default=False,
         help="Resume installation on a previously launched cluster " +
              "(for debugging)")
+    parser.add_option(
+        "--ebs-root-vol-type", default="gp2",
+        help="Root EBS volume type (e.g. 'gp2', 'io1', 'st1', 'sc1', 'standard') (default: 'gp2')")
     parser.add_option(
         "--ebs-vol-size", metavar="SIZE", type="int", default=0,
         help="Size (in GB) of each EBS volume.")
     parser.add_option(
-        "--ebs-vol-type", default="standard",
-        help="EBS volume type (e.g. 'gp2', 'standard').")
+        "--ebs-vol-type", default="gp2",
+        help="EBS volume type (e.g. 'gp2', 'io1', 'st1', 'sc1', 'standard') (default: 'gp2')")
     parser.add_option(
         "--ebs-vol-num", type="int", default=1,
         help="Number of EBS volumes to attach to each node as /vol[x]. " +
@@ -588,9 +591,16 @@ def launch_cluster(conn, opts, cluster_name):
         print("Could not find AMI " + opts.ami, file=stderr)
         sys.exit(1)
 
-    # Create block device mapping so that we can add EBS volumes if asked to.
-    # The first drive is attached as /dev/sds, 2nd as /dev/sdt, ... /dev/sdz
+    # Create block device mapping so that we can configure and add EBS volumes if asked to.
     block_map = BlockDeviceMapping()
+    # add root ebs volume type
+    root_device = EBSBlockDeviceType()
+    root_device.volume_type = opts.ebs_root_vol_type
+    root_device.delete_on_termination = True
+    block_map['/dev/sda1'] = root_device
+
+    # add additional EBS volumes if asked to
+    # The first drive is attached as /dev/sds, 2nd as /dev/sdt, ... /dev/sdz
     if opts.ebs_vol_size > 0:
         for i in range(opts.ebs_vol_num):
             device = EBSBlockDeviceType()
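For anyone reviewing the block-device change, here is a minimal standalone sketch of the mapping that `launch_cluster` builds after this patch. It assumes boto 2.x; the `build_block_map` helper and the sample option values are illustrative only and are not part of the patch, and the extra-volume loop is reconstructed from the surrounding options and the `/dev/sds`, `/dev/sdt`, ... comment rather than copied verbatim from the script.

```python
# Illustrative sketch (not part of the patch): approximates the block device
# mapping spark_ec2.py builds after this change, assuming boto 2.x.
from __future__ import print_function

from boto.ec2.blockdevicemapping import BlockDeviceMapping, EBSBlockDeviceType


def build_block_map(ebs_root_vol_type, ebs_vol_size, ebs_vol_num, ebs_vol_type):
    block_map = BlockDeviceMapping()

    # Root volume: type comes from the new --ebs-root-vol-type option and the
    # volume is deleted when the instance terminates, as in the patch.
    root_device = EBSBlockDeviceType()
    root_device.volume_type = ebs_root_vol_type
    root_device.delete_on_termination = True
    block_map['/dev/sda1'] = root_device

    # Optional extra EBS volumes, attached as /dev/sds, /dev/sdt, ... /dev/sdz.
    if ebs_vol_size > 0:
        for i in range(ebs_vol_num):
            device = EBSBlockDeviceType()
            device.size = ebs_vol_size
            device.volume_type = ebs_vol_type
            device.delete_on_termination = True
            block_map['/dev/sd' + chr(ord('s') + i)] = device

    return block_map


if __name__ == '__main__':
    # Hypothetical values, roughly what
    #   --ebs-root-vol-type=gp2 --ebs-vol-size=100 --ebs-vol-num=2 --ebs-vol-type=gp2
    # would produce.
    bm = build_block_map('gp2', 100, 2, 'gp2')
    for name, dev in sorted(bm.items()):
        print(name, dev.volume_type, dev.size, dev.delete_on_termination)
```

Keeping `delete_on_termination = True` on `/dev/sda1` preserves the usual behaviour of removing the root volume when the cluster is destroyed; only the root volume's type becomes configurable.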