This repository was archived by the owner on Nov 28, 2025. It is now read-only.

Commit 415d10e

Author: Jason Lam
TensorFlow on YARN
1 parent 6b892a1 commit 415d10e


55 files changed: +12254 −0 lines

yarn/README.md

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
# TensorFlow launcher for Apache Hadoop YARN

This project implements a [TensorFlow](http://www.tensorflow.org/) session
launcher for [Apache Hadoop YARN](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html),
so that users can utilize resources in a YARN cluster. It supports both
local and distributed TensorFlow applications.

## Prerequisites

1. Apache Hadoop YARN
2. Zookeeper
3. Python 2.6+
4. TensorFlow + related packages
5. Docker [optional]

In particular, TensorFlow and its necessary packages must be either
pre-installed on the nodes of the YARN cluster or available as a Docker image
accessible from those nodes.
## Build

```sh
mvn clean package
```

## Configure

Configure the Apache Hadoop YARN cluster with Registry/Zookeeper enabled.
## Examples

Tasks are submitted using the `ytf-submit` script.

```
ytf-submit [OPTIONS] -r <cluster_requirement> <task_command>
```

`task_command` is the command to be executed for each task of the session.
Two environment variables, `DTF_TASK_JOB_NAME` and `DTF_TASK_INDEX`, will be
set before the task is executed. `cluster_requirement` is a comma-separated
list of job names and the number of instances for each job, in this format:
`<job_name1>:<num_tasks1>,<job_name2>:<num_tasks2>,...`.
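As a minimal illustration of the format, a launcher could parse it like this (`parse_cluster_requirement` is a hypothetical helper, not part of YARN-DTF):

```python
def parse_cluster_requirement(req):
    """Parse a requirement such as "ps:2,worker:4" into a
    dict mapping job name -> number of tasks."""
    jobs = {}
    for part in req.split(","):
        name, count = part.strip().split(":")
        jobs[name] = int(count)
    return jobs

print(parse_cluster_requirement("ps:2,worker:4"))  # {'ps': 2, 'worker': 4}
```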
### Simple task submission

Let's execute a session with 1 x *Parameter Server (ps)* and 4 x *Workers*.
Assume the task program, input data, and training output all reside in
`/home/user1/mnist`, which is accessible from every node.

```sh
$ ytf-submit -r "ps:1,worker:4" \
  'python /home/user1/mnist/mnist.py \
  --job_name ${DTF_TASK_JOB_NAME} --task_index ${DTF_TASK_INDEX} \
  --ps_hosts ${DTF_PS_HOSTS} --worker_hosts ${DTF_WORKER_HOSTS} \
  --data_dir /home/user1/mnist/data --train_dir /home/user1/mnist/train'
```
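The `mnist.py` here is user code, not part of this launcher. A hypothetical sketch of how such a script might turn the flags passed by `ytf-submit` into the cluster dict accepted by `tf.train.ClusterSpec` in TF 1.x (`build_cluster` and the two-job layout are assumptions for illustration):

```python
import argparse

def build_cluster(argv):
    # Hypothetical flag handling matching the flags passed in the example
    # above; the real mnist.py is not included in this README.
    parser = argparse.ArgumentParser()
    parser.add_argument("--job_name")              # from ${DTF_TASK_JOB_NAME}
    parser.add_argument("--task_index", type=int)  # from ${DTF_TASK_INDEX}
    parser.add_argument("--ps_hosts")              # from ${DTF_PS_HOSTS}
    parser.add_argument("--worker_hosts")          # from ${DTF_WORKER_HOSTS}
    parser.add_argument("--data_dir")
    parser.add_argument("--train_dir")
    args = parser.parse_args(argv)
    # The dict form accepted by tf.train.ClusterSpec.
    cluster = {"ps": args.ps_hosts.split(","),
               "worker": args.worker_hosts.split(",")}
    return cluster, args.job_name, args.task_index
```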
### Enabling TensorBoard

TensorBoard is enabled with `--tensorboard` or `-t`. The address of TensorBoard
is available in the **Tracking URL** section of the submitted application in
the Apache YARN Resource Manager web interface. To use TensorBoard, the output
path must be specified with `--output` or `-o`. The `DTF_OUTPUT_PATH`
environment variable will be set and can be used in `task_command`. Similarly,
an input path can be passed to `ytf-submit` and will be available as
`DTF_INPUT_PATH`.

```sh
$ ytf-submit --tensorboard \
  -i /home/user1/mnist/data -o /home/user1/mnist/train10 -r "ps:1,worker:2" \
  'python /home/user1/mnist/mnist.py \
  --job_name ${DTF_TASK_JOB_NAME} --task_index ${DTF_TASK_INDEX} \
  --ps_hosts ${DTF_PS_HOSTS} --worker_hosts ${DTF_WORKER_HOSTS} \
  --data_dir ${DTF_INPUT_PATH} --train_dir ${DTF_OUTPUT_PATH}'
```
### Passing the script file

The training code itself can be passed to `ytf-submit`. The code will be copied
to HDFS and will be available at execution time. The path to the training code
will be available in the `DTF_TASK_SCRIPT` environment variable.

```sh
$ ytf-submit --tensorboard \
  -i /home/user1/mnist/data -o /home/user1/mnist/train10 -r "ps:1,worker:2" \
  -s /home/user1/mnist/mnist.py \
  'python ${DTF_TASK_SCRIPT} \
  --job_name ${DTF_TASK_JOB_NAME} --task_index ${DTF_TASK_INDEX} \
  --ps_hosts ${DTF_PS_HOSTS} --worker_hosts ${DTF_WORKER_HOSTS} \
  --data_dir ${DTF_INPUT_PATH} --train_dir ${DTF_OUTPUT_PATH}'
```
### Using HDFS paths

Input and output paths can be HDFS paths.

```sh
$ ytf-submit --tensorboard \
  -i hdfs://users/user1/mnist/data -o hdfs://users/user1/mnist/train10 \
  -r "ps:1,worker:2" -s /home/user1/mnist/mnist.py \
  'python ${DTF_TASK_SCRIPT} \
  --job_name ${DTF_TASK_JOB_NAME} --task_index ${DTF_TASK_INDEX} \
  --ps_hosts ${DTF_PS_HOSTS} --worker_hosts ${DTF_WORKER_HOSTS} \
  --data_dir ${DTF_INPUT_PATH} --train_dir ${DTF_OUTPUT_PATH}'
```
### Using Docker

To execute the tasks as Docker containers, pass the Docker image name using
`--docker_image <image_name>`. The Docker image is required to be accessible on
the execution host. In addition to the variables in **TASK EXECUTION
ENVIRONMENT**, the following paths are mounted in the container:

- `HADOOP_HOME`, `HADOOP_CONF_DIR`, `JAVA_HOME`
- `DTF_INPUT_PATH` and `DTF_OUTPUT_PATH`, if they are not HDFS paths.
## TASK EXECUTION ENVIRONMENT

The user-specified `task_command` will be executed in a YARN container
allocated to the session. The following environment variables will be
set for the `task_command` to consume.

- `DTF_TASK_SCRIPT`:

  Name of the file which contains the content of the `script_file` specified
  during submission.

- `DTF_INPUT_PATH`:

  Input path specified during submission.

- `DTF_OUTPUT_PATH`:

  Output path specified during submission.

- `DTF_{JOBNAME}_HOSTS`:

  List of the hosts (and ports) allocated to the job with name `{JOBNAME}`.

  - Format: "host1:port1,host2:port2,..."

  The number of host:port entries in the list matches the count specified in
  `cluster_requirement`. For example, `DTF_PS_HOSTS` and `DTF_WORKER_HOSTS`
  would commonly be used for PS and WORKER jobs.

- `DTF_TASK_JOB_NAME`:

  Name of the job this task is assigned to. See also `DTF_TASK_INDEX`.

- `DTF_TASK_INDEX`:

  Index of this task within its job. The tuple of `DTF_TASK_JOB_NAME` and
  `DTF_TASK_INDEX` can be used to cross-reference `DTF_{JOBNAME}_HOSTS`,
  for example to get the dynamic port allocated to this task.
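Putting the variables above together, a task can locate its own dynamically allocated host:port. A sketch, assuming the environment described above (`task_endpoint` is a hypothetical helper, not provided by YARN-DTF):

```python
import os

def task_endpoint(environ=os.environ):
    """Cross-reference DTF_TASK_JOB_NAME and DTF_TASK_INDEX with
    DTF_{JOBNAME}_HOSTS to find the host:port allocated to this task."""
    job = environ["DTF_TASK_JOB_NAME"]
    index = int(environ["DTF_TASK_INDEX"])
    hosts = environ["DTF_%s_HOSTS" % job.upper()].split(",")
    host, port = hosts[index].split(":")
    return host, int(port)
```

For example, with `DTF_TASK_JOB_NAME=worker`, `DTF_TASK_INDEX=1`, and `DTF_WORKER_HOSTS=n1:4000,n2:4001`, this returns `("n2", 4001)`.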

yarn/bin/README.md

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
Submit Command Line
-------------------

```sh
% ytf-submit -h
NAME
    ytf-submit - Submit a TensorFlow session to Apache Hadoop YARN

    This tool submits a YARN application master, responsible for allocating
    the required resources and executing the corresponding tasks.

SYNOPSIS
    Usage: ./ytf-submit [OPTIONS] -r <cluster_requirement> <task_command>

DESCRIPTION
    task_command
        The command to be executed for each task of the session. The two
        environment variables DTF_TASK_JOB_NAME and DTF_TASK_INDEX will be
        set before the task is executed.
        See also TASK EXECUTION ENVIRONMENT

    -r, --cluster_requirement <requirement>
        Specify the cluster requirement for the session.
        Format: <job_name1>:<num_tasks1>,<job_name2>:<num_tasks2>,...
        Example: "ps:2,worker:4"
        See also TASK EXECUTION ENVIRONMENT

Additional options:

    -c, --task_vcores <vcores>
        General form to specify the number of vcores required by each task.
        DEFAULT=1

    -c, --task_vcores <job_name>:<vcores>
        **NOT IMPLEMENTED YET**
        Job-level form to specify the number of vcores required by tasks in a
        specific job. Overrides the "general" form.

    -c, --task_vcores <job_name>[<task_index>]:<vcores>
        **NOT IMPLEMENTED YET**
        Task-level form to specify the number of vcores required by a specific
        task. Overrides both the "job-level" and "general" forms.

    -m, --task_memory <memory>
        General form to specify the amount of memory required by each task,
        in MB. DEFAULT=8192

    -m, --task_memory <job_name>:<memory>
        **NOT IMPLEMENTED YET**
        Job-level form to specify the amount of memory required by tasks in a
        specific job. Overrides the "general" form.

    -m, --task_memory <job_name>[<task_index>]:<memory>
        **NOT IMPLEMENTED YET**
        Task-level form to specify the amount of memory required by a specific
        task. Overrides both the "job-level" and "general" forms.

    -i, --input <input_path>
        Input path. This variable is not interpreted by YARN-DTF at the
        moment; it serves as a convenience. Its value will be set as the
        environment variable {DTF_INPUT_PATH} in the tasks' execution
        environment.
        DEFAULT=

    -o, --output <output_path>
        Output path. This variable is not interpreted by YARN-DTF at the
        moment; it serves as a convenience. Its value will be set as the
        environment variable {DTF_OUTPUT_PATH} in the tasks' execution
        environment.

        However, when TensorBoard integration is enabled, this option becomes
        mandatory. See also the --tensorboard option.

    -s, --script <script_file>
        A local script file to be transferred to the tasks' execution
        environment, where a file named by the variable {DTF_TASK_SCRIPT} will
        contain the content of the script file. For example, if the script is
        a Python script, the execution command can be written as
        "python ${DTF_TASK_SCRIPT} ..."

    -t, --tensorboard
        Enable TensorBoard integration. When enabled, YARN-DTF will start an
        additional YARN container running TensorBoard with the output path
        specified in the --output option. DEFAULT=disabled

    --docker_image <image_name>
        Enable tasks to be executed as Docker containers. The Docker image is
        required to be accessible on the execution host. In addition to the
        variables in TASK EXECUTION ENVIRONMENT, the following paths are
        mounted in the container from the execution host:

        HADOOP_HOME, HADOOP_CONF_DIR, JAVA_HOME.
        DTF_INPUT_PATH and DTF_OUTPUT_PATH, if they are not HDFS paths.

    -q, --queue
        Specify which YARN queue to submit this session to.
        DEFAULT=default

    -n, --name
        Name of this session; will be used as the name of the YARN
        application. DEFAULT=TensorFlow

    --client
        **NOT IMPLEMENTED YET**
        Specify that an additional task should be started locally. This
        would be useful if user interaction is required.

        This task will have the same execution environment as the rest of the
        tasks, and will be assigned DTF_TASK_JOB_NAME=client and
        DTF_TASK_INDEX=0; however, it will not be part of the TensorFlow
        cluster, and dynamic port allocation will not apply.

TASK EXECUTION ENVIRONMENT

    The user-specified 'task_command' will be executed in a YARN container
    allocated to the session. The following environment variables will be
    set for the 'task_command' to consume.

    DTF_TASK_SCRIPT:
        Name of the file which contains the content of the 'script_file'
        specified during submission.

    DTF_INPUT_PATH:
        Input path specified during submission.

    DTF_OUTPUT_PATH:
        Output path specified during submission.

    DTF_{JOBNAME}_HOSTS:
        List of the hosts (and ports) allocated to the job with name
        {JOBNAME}.
        Format: "host1:port1,host2:port2,..."
        The number of host:port entries in the list matches the count
        specified in "cluster_requirement". For example, DTF_PS_HOSTS and
        DTF_WORKER_HOSTS would commonly be used for PS and WORKER jobs.

    DTF_TASK_JOB_NAME:
        Name of the job this task is assigned to. See also DTF_TASK_INDEX.

    DTF_TASK_INDEX:
        Index of this task within its job.
        The tuple of DTF_TASK_JOB_NAME and DTF_TASK_INDEX can be used to
        cross-reference DTF_{JOBNAME}_HOSTS, for example to get the dynamic
        port allocated to this task.
```
