Add parameter server train & side-car eval on k8s #182
Open
selcukgun wants to merge 5 commits into tensorflow:master from selcukgun:master
Commits
a94b1c6 Add parameter server train & side-car eval on k8s (selcukgun)
48c11a3 Add inline distributed evaluation (selcukgun)
8b552c9 Rename parameter server training subdirectory (selcukgun)
d0e4a27 Reduce evaluation steps for epoch (selcukgun)
5eb91b4 Merge branch 'master' into master (selcukgun)
22 changes: 22 additions & 0 deletions
distribution_strategy/parameter_server_training/Dockerfile.resnet_cifar_ps_strategy
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| FROM tensorflow/tensorflow:nightly | ||
|
|
||
| RUN apt-get update && \ | ||
|     apt-get install -y python3 python3-pip | ||
|
|
||
| RUN pip3 install absl-py && \ | ||
| pip3 install portpicker | ||
|
|
||
| # Install git and vim | ||
| RUN apt-get update && \ | ||
|     apt-get install -y git && \ | ||
|     apt-get install -y vim | ||
|
|
||
| RUN git clone --single-branch --branch benchmark https://github.com/tensorflow/models.git && \ | ||
| mv models tensorflow_models && \ | ||
| git clone https://github.com/tensorflow/model-optimization.git && \ | ||
| mv model-optimization tensorflow_model_optimization | ||
|
|
||
| COPY resnet_cifar_ps_strategy.py / | ||
|
|
||
| ENV PYTHONPATH="${PYTHONPATH}:/:/tensorflow_models" | ||
| CMD ["python", "/resnet_cifar_ps_strategy.py"] |
173 changes: 173 additions & 0 deletions
distribution_strategy/parameter_server_training/README.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,173 @@ | ||
| # Parameter Server Training Using Distribution Strategies | ||
|
|
||
| This directory provides an example of running parameter server training with | ||
| Distribution Strategies. | ||
|
|
||
| Please first read the | ||
| [documentation](https://www.tensorflow.org/tutorials/distribute/parameter_server_training) | ||
| of Distribution Strategy for parameter server training. We also assume that readers | ||
| of this page are familiar with [Google Cloud](https://cloud.google.com/) and | ||
| its [Kubernetes Engine](https://cloud.google.com/kubernetes-engine/). | ||
|
|
||
| This directory contains the following files: | ||
|
|
||
| - kubernetes/template.yaml.jinja: Jinja template used for generating Kubernetes manifests | ||
| - kubernetes/render_template.py: script for rendering the Jinja template | ||
| - Dockerfile.resnet_cifar_ps_strategy: a Dockerfile for building the model image | ||
| - resnet_cifar_ps_strategy.py: script for running any type of parameter server training task, selected by the `TF_CONFIG` environment variable | ||
|
||
|
|
||
| ## Prerequisites | ||
|
|
||
| 1. First you need to have a Google Cloud project. Either create a new project or use an existing one. | ||
|
|
||
| 2. Install the | ||
| [gcloud command-line tools](https://cloud.google.com/functions/docs/quickstart) | ||
| on your system, log in, and set your project and zone. | ||
|
|
||
| 3. Install [Docker](https://docs.docker.com/get-docker/) for your system | ||
|
|
||
| 4. Install kubectl: | ||
|
|
||
| ```bash | ||
| gcloud components install kubectl | ||
| ``` | ||
| 5. Start a Kubernetes cluster, either with the `gcloud` command shown below or with the | ||
| [GKE](https://cloud.google.com/kubernetes-engine/) web UI. Using more CPUs or nodes may require increasing your CPU [quotas](https://cloud.google.com/compute/quotas#requesting_additional_quota). | ||
|
|
||
| ```bash | ||
| gcloud container clusters create <cluster_name> --zone=us-west1-a --num-nodes=6 --machine-type=e2-standard-4 | ||
| ``` | ||
|
|
||
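| 6. Set the context for `kubectl` so that it knows which cluster to use. On GKE the context name usually has the form `gke_<project_id>_<zone>_<cluster_name>`; `kubectl config get-contexts` lists the available names: | ||
|
|
||
| ```bash | ||
| kubectl config use-context <context_name> | ||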
| ``` | ||
|
|
||
| 7. Create a | ||
| [service account](https://cloud.google.com/compute/docs/access/service-accounts) | ||
| and download its key file in JSON format. Assign the Storage Admin role for | ||
| [Google Cloud Storage](https://cloud.google.com/storage/) to this service account: | ||
|
|
||
| ```bash | ||
| gcloud iam service-accounts create <service_account_id> --display-name="<display_name>" | ||
| ``` | ||
|
|
||
| ```bash | ||
| gcloud projects add-iam-policy-binding <project_id> \ | ||
| --member="serviceAccount:<service_account_id>@<project_id>.iam.gserviceaccount.com" \ | ||
| --role="roles/storage.admin" | ||
| ``` | ||
|
|
||
| 8. Create a Kubernetes secret from the JSON key file of your service account: | ||
|
|
||
| ```bash | ||
| kubectl create secret generic credential --from-file=key.json=<path_to_json_file> | ||
| ``` | ||
|
|
||
| 9. Enable the GCR ([Google Container Registry](https://cloud.google.com/container-registry)) service for your project using either the GCP web UI or the gcloud tool: | ||
|
|
||
| ```bash | ||
| gcloud services enable containerregistry.googleapis.com | ||
| ``` | ||
|
|
||
| 10. Configure Docker to authenticate with Container Registry: | ||
|
|
||
| ```bash | ||
| gcloud auth configure-docker | ||
| ``` | ||
| ## How to run the example | ||
|
|
||
| 1. Create three buckets for model data, checkpoints and training logs using either the GCP web UI or the gsutil tool (included with the gcloud tools installed above): | ||
|
|
||
| ```bash | ||
| gsutil mb gs://<bucket_name> | ||
| ``` | ||
| You will use these bucket names to set `data_dir`, `checkpoint_dir` and `train_log_dir` in step 4. | ||
|
|
||
|
|
||
| 2. Download the CIFAR-10 data and place it in your `data_dir` bucket. The [ResNet in TensorFlow](https://github.com/tensorflow/models/tree/r1.13.0/official/resnet#cifar-10) directory provides a script for obtaining the data; alternatively, use this [direct link](https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz) to download and extract it yourself. | ||
|
|
||
| ```bash | ||
| python cifar10_download_and_extract.py | ||
| ``` | ||
|
|
||
| Upload the contents of the cifar-10-batches-bin directory to your `data_dir` bucket: | ||
|
|
||
| ```bash | ||
| gsutil -m cp cifar-10-batches-bin/* gs://<your_data_dir>/ | ||
| ``` | ||
|
|
||
| 3. Now let's build the Docker image: | ||
|
|
||
| ```bash | ||
| docker build --no-cache -t resnet_cifar_ps_strategy:v1 -f Dockerfile.resnet_cifar_ps_strategy . | ||
| ``` | ||
|
|
||
| and push the image to | ||
| [Google Container Registry](https://cloud.google.com/container-registry/): | ||
|
|
||
| ```bash | ||
| docker tag resnet_cifar_ps_strategy:v1 gcr.io/<project_id>/resnet_cifar_ps_strategy:v1 | ||
| docker push gcr.io/<project_id>/resnet_cifar_ps_strategy:v1 | ||
| ``` | ||
|
|
||
| 4. Modify the variables in template.yaml.jinja. You may want to change `name`, | ||
| `image`, `train_log_dir`, `script` and `cmdline_args`. | ||
|
|
||
| * `name`: name your cluster, e.g. "my-parameter-server-example". | ||
| * `image`: the name of your docker image. | ||
| * `worker_replicas`: number of worker pods. | ||
| * `ps_replicas`: number of parameter server pods. | ||
| * `num_gpus_per_worker`: number of GPUs per worker (this does not apply to this example, since the parameter server distribution strategy does not have GPU support yet). | ||
| * `has_coordinator`: flag for creating coordinator job | ||
| * `has_eval`: flag for creating evaluator job (this is set to False in the default template in order to use inline distributed evaluation. Setting this flag to True enables side-car evaluation.) | ||
| * `has_tensorboard`: flag for creating tensorboard job | ||
| * `script`: the script in the docker image to run. | ||
| * `train_log_dir`: the location where training accuracy logs are written; the TensorBoard job reads from the same location. | ||
| * `cmdline_args`: the command line arguments passed to the `script`. | ||
| * `credential_secret_json`: the file name under which the service account key was stored in the Kubernetes secret (`key.json` in the prerequisites). | ||
| * `credential_secret_key`: the name of the Kubernetes secret used for storing | ||
| your service account key. | ||
| * `port`: the port for all tasks including tensorboard. | ||
| * `use_node_port`: flag for using NodePort as the service type. When this flag is set to `true`, the Jinja template generates an Ingress only for TensorBoard. Setting it to `false` uses LoadBalancer services for all pods, assigning each an external IP (which may be limited by your public IP address quota). | ||
|
|
||
| 5. Start the training and evaluation on the cluster. | ||
|
|
||
| You may want to verify the generated Kubernetes manifests by running the following: | ||
|
|
||
| ```bash | ||
| cd kubernetes | ||
| python render_template.py template.yaml.jinja | kubectl create -f - --dry-run=client | ||
| ``` | ||
|
|
||
| After making sure that the above command succeeds, you can start the cluster (removing the dry-run flag): | ||
|
|
||
| ```bash | ||
| python render_template.py template.yaml.jinja | kubectl create -f - | ||
| ``` | ||
| You'll see that your cluster has started training. You can inspect the logs of the | ||
| workers or use TensorBoard to watch your model training. | ||
|
|
||
| ```bash | ||
| kubectl get pods | ||
| ``` | ||
|
|
||
| ```bash | ||
| kubectl logs -f <pod_id> | ||
| ``` | ||
|
|
||
| 6. You can find the public IP address of the TensorBoard service on the Services & Ingress page of GKE, and open TensorBoard in your browser at http://<tensorboard_ip> (or http://<tensorboard_ip>:5000 if you have set `use_node_port` to `false`). | ||
|
|
||
| The training accuracy graph should look similar to the following: | ||
|
|
||
|  | ||
|
|
||
| 7. Destroy the cluster: | ||
|
|
||
| ```bash | ||
| gcloud container clusters delete <cluster_name> | ||
| ``` | ||
|
|
||
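The README above says that `resnet_cifar_ps_strategy.py` picks its task from the `TF_CONFIG` environment variable, but that script is not shown in this view. The following is only a minimal sketch of how such a dispatch could look, using the standard library alone. The helper names `parse_tf_config` and `role_for`, the role labels, and the `demo-` host names are hypothetical and are not taken from the PR.

```python
import json
import os


def parse_tf_config(raw=None):
    """Return (task_type, task_index, cluster) parsed from a TF_CONFIG string.

    Illustrative helper only; not taken from resnet_cifar_ps_strategy.py.
    """
    if raw is None:
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    task = config.get("task", {})
    # The template in this PR renders the index as a quoted string,
    # so it is normalized to an int here.
    return task.get("type"), int(task.get("index", 0)), config.get("cluster", {})


def role_for(task_type):
    """Map a task type to a coarse role name (hypothetical labels)."""
    if task_type in ("worker", "ps"):
        return "server"       # assumed: starts a server and waits for the coordinator
    if task_type == "evaluator":
        return "evaluator"    # assumed: side-car evaluation
    return "coordinator"      # "chief", or no TF_CONFIG at all


# Hand-written example value; the host names are made up for the demo.
example = json.dumps({
    "cluster": {
        "worker": ["demo-worker-0:5000", "demo-worker-1:5000"],
        "ps": ["demo-ps-0:5000"],
        "chief": ["demo-chief-0:5000"],
    },
    "task": {"type": "ps", "index": "0"},
})
task_type, task_index, cluster = parse_tf_config(example)
print(task_type, task_index, role_for(task_type), sorted(cluster))
```

Keeping the parsing separate from the role decision makes it easy to check a pod's configuration locally before building the image.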
Binary file added
BIN
+165 KB
distribution_strategy/parameter_server_training/images/tf-dist-ps-tensorboard.png
12 changes: 12 additions & 0 deletions
distribution_strategy/parameter_server_training/kubernetes/render_template.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| #!/usr/bin/env python | ||
|
|
||
| from __future__ import print_function | ||
|
|
||
| import jinja2 | ||
| import sys | ||
|
|
||
| if len(sys.argv) != 2: | ||
|     print("usage: {} [template-file]".format(sys.argv[0]), file=sys.stderr) | ||
|     sys.exit(1) | ||
| with open(sys.argv[1], "r") as f: | ||
|     print(jinja2.Template(f.read()).render()) |
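The output of `render_template.py` is a stream of manifests separated by `---`. As a rough sanity check before running `kubectl create`, one could count the top-level `kind:` entries and compare them with what the template variables imply: one Service and one Deployment per task, plus one Ingress when TensorBoard is enabled together with `use_node_port`. The sketch below uses only the standard library. The function names `count_kinds` and `expected_kinds` are hypothetical, and the sample string is a hand-written stand-in for rendered output.

```python
from collections import Counter


def count_kinds(rendered):
    """Count top-level `kind:` values in a multi-document manifest stream."""
    kinds = Counter()
    for line in rendered.splitlines():
        # Top-level keys are unindented, so nested keys are ignored.
        if line.startswith("kind:"):
            kinds[line.split(":", 1)[1].strip()] += 1
    return kinds


def expected_kinds(worker_replicas, ps_replicas, has_coordinator,
                   has_eval, has_tensorboard, use_node_port):
    """Object counts implied by the template variables."""
    tasks = (worker_replicas + ps_replicas + int(has_coordinator)
             + int(has_eval) + int(has_tensorboard))
    expected = Counter({"Service": tasks, "Deployment": tasks})
    if has_tensorboard and use_node_port:
        expected["Ingress"] = 1
    return expected


# Tiny hand-written stand-in for rendered output, used only for the demo.
sample = "kind: Service\napiVersion: v1\n---\nkind: Deployment\napiVersion: apps/v1\n---\n"
print(count_kinds(sample))
# Values mirror the defaults set at the top of template.yaml.jinja.
print(expected_kinds(5, 2, True, False, True, True))
```

In practice the rendered text would come from the script's standard output rather than from a literal string.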
150 changes: 150 additions & 0 deletions
distribution_strategy/parameter_server_training/kubernetes/template.yaml.jinja
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,150 @@ | ||
| {%- set name = "resnet-cifar-ps-strategy-example" -%} | ||
| {%- set image = "gcr.io/tensorflow-experimental/resnet_cifar_ps_strategy:v1" -%} | ||
| {%- set worker_replicas = 5 -%} | ||
| {%- set ps_replicas = 2 -%} | ||
| {%- set num_gpus_per_worker = 0 -%} | ||
| {%- set has_coordinator = True -%} | ||
| {%- set has_eval = False -%} | ||
| {%- set has_tensorboard = True -%} | ||
| {%- set script = "/resnet_cifar_ps_strategy.py" -%} | ||
| {%- set train_log_dir = "gs://cifar10-train-log/" -%} | ||
| {%- set cmdline_args = [ | ||
| "--data_dir=gs://cifar10-data/", | ||
| "--checkpoint_dir=gs://cifar10-ckpt/", | ||
| "--train_log_dir=" + train_log_dir | ||
| ] -%} | ||
| {%- set credential_secret_json = "key.json" -%} | ||
| {%- set credential_secret_key = "credential" -%} | ||
| {%- set port = 5000 -%} | ||
| {%- set use_node_port = True -%} | ||
|
|
||
|
|
||
| {%- set replicas = { | ||
| "worker": worker_replicas, | ||
| "ps": ps_replicas, | ||
| "chief": has_coordinator|int, | ||
| "evaluator": has_eval|int, | ||
| "tensorboard": has_tensorboard|int | ||
| } -%} | ||
|
|
||
| {%- macro worker_hosts() -%} | ||
| {% for i in range(worker_replicas) %} | ||
| \"{{ name }}-worker-{{ i }}:{{ port }}\"{%- if not loop.last -%},{%- endif -%} | ||
| {% endfor %} | ||
| {%- endmacro -%} | ||
|
|
||
| {%- macro ps_hosts() -%} | ||
| {% for i in range(ps_replicas) %} | ||
| \"{{ name }}-ps-{{ i }}:{{ port }}\"{%- if not loop.last -%},{%- endif -%} | ||
| {% endfor %} | ||
| {%- endmacro -%} | ||
|
|
||
| {%- macro tf_config(task_type, task_id) -%} | ||
| { | ||
| \"cluster\": { | ||
| \"worker\": [{{ worker_hosts() }}] | ||
| {%- if ps_replicas > 0 %}, | ||
| \"ps\": [{{ ps_hosts() }} | ||
| ]{% endif %} | ||
| {%- if has_coordinator %}, | ||
| \"chief\": [ | ||
| \"{{ name }}-chief-0:{{ port }}\" | ||
| ] | ||
| {%- endif %} | ||
| }, | ||
| \"task\": { | ||
| \"type\": \"{{ task_type }}\", | ||
| \"index\": \"{{ task_id }}\" | ||
| } | ||
| } | ||
| {%- endmacro -%} | ||
|
|
||
| {% for job in ["chief", "worker", "ps", "evaluator", "tensorboard"] -%} | ||
| {%- for i in range(replicas[job]) -%} | ||
| {% if job == "tensorboard" and use_node_port %} | ||
| kind: Ingress | ||
| apiVersion: networking.k8s.io/v1beta1 | ||
| metadata: | ||
|   name: tensorboard-ingress | ||
| spec: | ||
|   backend: | ||
|     serviceName: {{ name }}-{{ job }}-{{ i }} | ||
|     servicePort: {{ port }} | ||
| --- | ||
| {% endif -%} | ||
| kind: Service | ||
| apiVersion: v1 | ||
| metadata: | ||
|   name: {{ name }}-{{ job }}-{{ i }} | ||
| spec: | ||
|   type: {{ 'NodePort' if use_node_port else 'LoadBalancer' }} | ||
|   selector: | ||
|     name: {{ name }} | ||
|     job: {{ job }} | ||
|     task: "{{ i }}" | ||
|   ports: | ||
|   - port: {{ port }} | ||
| {%- if use_node_port %} | ||
|     targetPort: {{ port }} | ||
| {%- endif %} | ||
| --- | ||
| kind: Deployment | ||
| apiVersion: apps/v1 | ||
| metadata: | ||
|   name: {{ name }}-{{ job }}-{{ i }} | ||
| spec: | ||
|   replicas: 1 | ||
|   selector: | ||
|     matchLabels: | ||
|       name: {{ name }} | ||
|       job: {{ job }} | ||
|       task: "{{ i }}" | ||
|   template: | ||
|     metadata: | ||
|       labels: | ||
|         name: {{ name }} | ||
|         job: {{ job }} | ||
|         task: "{{ i }}" | ||
|     spec: | ||
|       containers: | ||
| {%- if job == "tensorboard" %} | ||
|       - name: tensorflow | ||
|         image: tensorflow/tensorflow | ||
| {%- else %} | ||
|       - name: tensorflow | ||
|         image: {{ image }} | ||
| {%- endif %} | ||
|         env: | ||
| {%- if job != "tensorboard" %} | ||
|         - name: TF_CONFIG | ||
|           value: "{{ tf_config(job, i) }}" | ||
| {%- endif %} | ||
|         - name: GOOGLE_APPLICATION_CREDENTIALS | ||
|           value: "/var/secrets/google/{{ credential_secret_json }}" | ||
|         ports: | ||
|         - containerPort: {{ port }} | ||
| {%- if job == "tensorboard" %} | ||
|         command: | ||
|         - "tensorboard" | ||
|         args: | ||
|         - "--logdir={{ train_log_dir }}" | ||
|         - "--port={{ port }}" | ||
|         - "--host=0.0.0.0" | ||
| {%- else %} | ||
|         command: | ||
|         - "python" | ||
|         - "{{ script }}" | ||
| {%- for cmdline_arg in cmdline_args %} | ||
|         - "{{ cmdline_arg }}" | ||
| {%- endfor -%} | ||
| {%- endif %} | ||
|         volumeMounts: | ||
|         - name: credential | ||
|           mountPath: /var/secrets/google | ||
|       volumes: | ||
|       - name: credential | ||
|         secret: | ||
|           secretName: {{ credential_secret_key }} | ||
| --- | ||
| {% endfor %} | ||
| {%- endfor -%} |
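The `tf_config` macro in the template assembles a JSON document from `name`, `port`, the replica counts and the task identity. As an aid for reading it, here is a minimal Python sketch that builds the same structure as a dictionary. The function name `build_tf_config` is hypothetical; the host naming (`<name>-<job>-<i>:<port>`) and the quoted task index follow what the template shows.

```python
import json


def build_tf_config(name, port, worker_replicas, ps_replicas,
                    has_coordinator, task_type, task_id):
    """Build the structure that the template's tf_config macro is meant to render."""
    cluster = {
        "worker": [f"{name}-worker-{i}:{port}" for i in range(worker_replicas)],
    }
    if ps_replicas > 0:
        cluster["ps"] = [f"{name}-ps-{i}:{port}" for i in range(ps_replicas)]
    if has_coordinator:
        cluster["chief"] = [f"{name}-chief-0:{port}"]
    # The template quotes the index, so it is kept as a string here too.
    return {"cluster": cluster, "task": {"type": task_type, "index": str(task_id)}}


# Values mirror the defaults set at the top of template.yaml.jinja.
config = build_tf_config("resnet-cifar-ps-strategy-example", 5000,
                         worker_replicas=5, ps_replicas=2,
                         has_coordinator=True, task_type="worker", task_id=3)
print(json.dumps(config, indent=2))
```

Comparing this output with the `TF_CONFIG` value in a rendered Deployment is one way to confirm that the macro's quoting and commas come out as intended.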
Redundant space
Done.