GitHub - dijiekstra/TensorFlowOnYARN: Support TensorFlow on YARN

TensorFlowOnYARN

TensorFlow on YARN (TOY) is a toolkit that gives Hadoop users an easy way to run TensorFlow applications in a distributed fashion and to accomplish tasks such as model management and serving inference.

This project focuses on supporting TensorFlow on YARN, as part of the Deep Learning on Hadoop (HDL) effort.
YARN-6043

Goals

Support all TensorFlow components on YARN, such as distributed TensorFlow clusters, TensorFlow Serving, and TensorBoard.
Support multi-tenancy, taking into account different types of users, such as DevOps engineers, data scientists, and data engineers.
Support running TensorFlow applications as short-lived or long-running jobs, in both between-graph and in-graph modes.
Support model management for deployment, and provide a service layer that makes it easy to handle inference requests from upper layers such as Spark or a web backend.
Require minor or no changes to run users' existing TensorFlow applications (which can be written in any officially supported language, including Python, C++, Java, and Go).

Note that the current project is a prototype with limitations and is still under development.

Architecture

Figure 1. TOY Architecture

Features

Quick Start

Prepare the build environment by following the instructions at https://www.tensorflow.org/install/install_sources

Clone the TensorFlowOnYARN repository.

```
git clone --recursive https://github.com/Intel-bigdata/TensorFlowOnYARN
```

Build the assembly.
```
cd TensorFlowOnYARN/tensorflow-parent
mvn package -Pnative -Pdist
```
tensorflow-yarn-${VERSION}.tar.gz and tensorflow-yarn-${VERSION}.zip are generated in the tensorflow-parent/tensorflow-yarn-dist/target directory. Distribute the assembly to a client node of the YARN cluster and extract it.

Run the between-graph MNIST example.

```
cd tensorflow-yarn-${VERSION}
bin/ydl-tf launch --num_worker 2 --num_ps 2
```

This launches a YARN application, which creates a tf.train.Server instance for each task. The ClusterSpec is printed to the console so that you can submit the training script to the cluster, one invocation per worker task. For example:

```
ClusterSpec: {"ps":["node1:22257","node2:22222"],"worker":["node3:22253","node2:22255"]}
```

```
python examples/between-graph/mnist_feed.py \
  --ps_hosts="ps0.hostname:ps0.port,ps1.hostname:ps1.port" \
  --worker_hosts="worker0.hostname:worker0.port,worker1.hostname:worker1.port" \
  --task_index=0

python examples/between-graph/mnist_feed.py \
  --ps_hosts="ps0.hostname:ps0.port,ps1.hostname:ps1.port" \
  --worker_hosts="worker0.hostname:worker0.port,worker1.hostname:worker1.port" \
  --task_index=1
```
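
Filling in the host flags by hand is error-prone, so the printed ClusterSpec can be turned into the per-worker commands with a small script. The following is only a minimal sketch: the helper names `parse_cluster_spec` and `worker_commands` are hypothetical and not part of TOY, and it assumes the console line has exactly the `ClusterSpec: {...}` form shown above and that every worker takes the same `--ps_hosts`, `--worker_hosts`, and `--task_index` flags.

```python
import json
import shlex


def parse_cluster_spec(console_line):
    """Extract the JSON cluster definition from a 'ClusterSpec: {...}' line.

    Assumption: the line looks exactly like the console output shown above.
    """
    prefix = "ClusterSpec:"
    text = console_line.strip()
    if not text.startswith(prefix):
        raise ValueError("expected a line starting with 'ClusterSpec:'")
    return json.loads(text[len(prefix):])


def worker_commands(spec, script="examples/between-graph/mnist_feed.py"):
    """Build one launch command per worker task, in task_index order."""
    ps_hosts = ",".join(spec["ps"])
    worker_hosts = ",".join(spec["worker"])
    commands = []
    for task_index in range(len(spec["worker"])):
        args = [
            "python", script,
            "--ps_hosts=" + ps_hosts,
            "--worker_hosts=" + worker_hosts,
            "--task_index=" + str(task_index),
        ]
        # shlex.quote keeps each argument safe to paste into a shell.
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


if __name__ == "__main__":
    line = ('ClusterSpec: {"ps":["node1:22257","node2:22222"],'
            '"worker":["node3:22253","node2:22255"]}')
    for command in worker_commands(parse_cluster_spec(line)):
        print(command)
```

Each printed line is one worker invocation; run them on whichever hosts your setup uses for submitting training scripts.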

To get the ClusterSpec of an existing TensorFlow cluster launched by a previous YARN application:
```
bin/ydl-tf cluster --app_id <Application ID>
```
You may also use YARN commands through ydl-tf.

For example, to get the list of running applications,
```
bin/ydl-tf application --list
```
or to kill an existing YARN application (TensorFlow cluster),
```
bin/ydl-tf kill --application <Application ID>
```
