TensorFlowOnYARN

TensorFlow on YARN (TOY) is a toolkit to enable Hadoop users an easy way to run TensorFlow applications in distributed pattern and accomplish tasks including model management and serving inference.

This project focuses on support of running Tensorflow on YARN, as part of Deep Learning on Hadoop (HDL) effort.
YARN-6043

Goals

Support all TensorFlow components on YARN, TensorFlow distributed cluster, TensorFlow serving, TensorBoard, etc.
Support multi-tenants with consideration of different types of users, such as devOp, data scientist and data engineer
Support running TensorFlow application in a short-time/long-running job manner of both between-graph mode and in-graph mode
Support model management to deploy and also support a service layer to handle upper layer's like Spark or web backend inference request easily
Minor or no changes required to run user’s existing TensorFlow application(can be written in all officially supported languages including Python, C++, Java and Go)

Note that current project is a prototype with limitation and is still under development

Architecture

Figure1. TOY Architecture

Features

Launch a TensorFlow cluster with specified number of worker and PS server
Replace python layer with java bridge layer to start server
Generate ClusterSpec dynamically
RPC support for client to get ClusterSpec from AM
Signal handling for graceful shutdown
Package TensorFlow runtime as a resource that can be distributed easily
Run in-graph TensorFlow application in client mode
TensorBoard support
Better handling of network port conflicts
Fault tolerance
Cluster mode based on Docker
Real-time logging support
Code refine and more tests

Quick Start

Prepare the build environment following the instructions from https://www.tensorflow.org/install/install_sources

Clone the TensorFlowOnYARN repository.

git clone --recursive https://github.com/Intel-bigdata/TensorFlowOnYARN

Build the assembly.
```
cd TensorFlowOnYARN/tensorflow-parent
mvn package -Pnative -Pdist
```
tensorflow-yarn-${VERSION}.tar.gz and tensorflow-yarn-${VERSION}.zip are built out in the tensorflow-parent/tensorflow-yarn-dist/target directory. Distribute the assembly to the client node of a YARN cluster and extract.

Run the between-graph mnist example.

cd tensorflow-yarn-${VERSION}
bin/ydl-tf launch --num_worker 2 --num_ps 2

This will launch a YARN application, which creates a tf.train.Server instance for each task. A ClusterSpec is printed on the console such that you can submit the training script to. e.g.

ClusterSpec: {"ps":["node1:22257","node2:22222"],"worker":["node3:22253","node2:22255"]}

python examples/between-graph/mnist_feed.py \
  --ps_hosts="ps0.hostname:ps0.port,ps1.hostname:ps1.port" \
  --worker_hosts="worker0.hostname:worker0.port,worker1.hostname:worker1.port" \
  --task_index=0

python examples/between-graph/mnist_feed.py \
  --ps_hosts="ps0.hostname:ps0.port,ps1.hostname:ps1.port" \
  --worker_hosts="worker0.hostname:worker0.port,worker1.hostname:worker1.port" \
  --task_index=1

To get ClusterSpec of an existing TensorFlow cluster launched by a previous YARN application.
```
bin/ydl-tf cluster --app_id <Application ID>
```
You may also use YARN commands through ydl-tf.

For example, to get running application list,
```
bin/ydl-tf application --list
```
or to kill an existing YARN application(TensorFlow cluster),
```
bin/ydl-tf kill --application <Application ID>
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!