Conversation

@selcukgun

ResNet56 model (with custom training loop): variables are created on
parameter server jobs and updated by workers. Evaluation is done by a
dedicated job that uses the checkpoints saved during training
(side-car evaluation).

The model is trained on the CIFAR10 dataset.
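The side-car evaluation pattern mentioned above — a separate evaluator process that watches the checkpoint directory and evaluates each new checkpoint as it appears — can be sketched in a framework-agnostic way. (In TensorFlow this is typically done with `tf.train.checkpoints_iterator`; the polling loop below is only an illustration, and every name in it is hypothetical.)

```python
import os
import time

def latest_checkpoint(ckpt_dir):
    """Return the path of the newest checkpoint file in ckpt_dir, or None."""
    ckpts = [os.path.join(ckpt_dir, f) for f in os.listdir(ckpt_dir)
             if f.startswith("ckpt-")]
    return max(ckpts, key=os.path.getmtime) if ckpts else None

def sidecar_evaluate(ckpt_dir, evaluate_fn, poll_secs=1.0, max_polls=None):
    """Poll ckpt_dir and call evaluate_fn on every new checkpoint found."""
    seen, polls, results = None, 0, []
    while max_polls is None or polls < max_polls:
        ckpt = latest_checkpoint(ckpt_dir)
        if ckpt is not None and ckpt != seen:
            # In a real evaluator this would restore weights and run eval.
            results.append(evaluate_fn(ckpt))
            seen = ckpt
        else:
            time.sleep(poll_secs)
        polls += 1
    return results
```

Because the evaluator only reads checkpoints, it needs no coordination with the training cluster — which is what makes the job "side-car".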

@google-cla google-cla bot added the cla: yes label Feb 9, 2021
@yuefengz yuefengz requested review from ckkuang and yuefengz February 24, 2021 19:18
The Jinja template now turns off side-car evaluation by default so that only
the inline distributed evaluation added with this CL is used.
README updated.

Added efficiency wrappers that will be useful once GPU is supported with
ParameterServerStrategy.

Moved kubernetes jinja template and renderer script to dedicated
subdirectory.
@selcukgun selcukgun marked this pull request as ready for review February 25, 2021 06:12
Contributor

@yuefengz yuefengz left a comment

Thanks for your PR!

Please first read the
[documentation](https://www.tensorflow.org/tutorials/distribute/parameter_server_training)
of Distribution Strategy for parameter server training. We also assume that readers
of this page are familiar with [Google Cloud](https://cloud.google.com/) and
Contributor

Redundant space

Author

Done.

- kubernetes/template.yaml.jinja: jinja template used for generating Kubernetes manifests
- kubernetes/render_template.py: script for rendering the jinja template
- Dockerfile.resnet_cifar_ps_strategy: a docker file to build the model image
- resnet_cifar_ps_strategy.py: script for running any type of parameter server training task based on `TF_CONFIG` environment variable
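In parameter server training on Kubernetes, each pod receives a `TF_CONFIG` environment variable describing the cluster layout and the pod's own role, which is how a single entry-point script can run as chief, worker, or ps. A minimal sketch of the parsing (the names and branching here are illustrative — the actual script's logic may differ):

```python
import json
import os

def get_task(environ=os.environ):
    """Return (task_type, task_index, cluster_spec) parsed from TF_CONFIG."""
    tf_config = json.loads(environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    return (task.get("type", "chief"),
            task.get("index", 0),
            tf_config.get("cluster", {}))

# A worker pod in a 2-worker / 1-ps cluster might see something like:
example = {
    "cluster": {
        "chief": ["chief:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps": ["ps-0:2222"],
    },
    "task": {"type": "worker", "index": 1},
}
```

The Jinja template's job is to render one manifest per cluster entry, each with the matching `TF_CONFIG` value injected.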
Contributor

"any type of ..." seems too general, maybe just say "a ResNet example using Cifar dataset for parameter server training"

Author

Done.

BATCH_SIZE = 64
EVAL_BATCH_SIZE = 8

def create_in_process_cluster(num_workers, num_ps):
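An in-process cluster starts worker and ps servers inside the test process; the core of it is building a cluster spec of localhost addresses, one per task. A TensorFlow-free sketch of just that spec construction (the fixed `base_port` is made up — the real helper picks free ports with `portpicker.pick_unused_port()` and starts a `tf.distribute.Server` per address):

```python
def build_cluster_spec(num_workers, num_ps, base_port=15000):
    """Build a TF-style cluster spec dict of localhost addresses.

    Only the dict layout that ClusterSpec / TF_CONFIG expects is shown;
    server startup is omitted.
    """
    worker_ports = [base_port + i for i in range(num_workers)]
    ps_ports = [base_port + num_workers + i for i in range(num_ps)]
    return {
        "worker": ["localhost:%d" % p for p in worker_ports],
        "ps": ["localhost:%d" % p for p in ps_ports],
    }
```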
Contributor

Could you update the work_config part according to this tutorial? https://www.tensorflow.org/tutorials/distribute/parameter_server_training#in-process_cluster

Author

Added inter ops.

set up distributed training
"""

strategy = parameter_server_strategy_v2.ParameterServerStrategyV2(
Contributor

Let's use `tf.distribute.experimental.ParameterServerStrategy`.

Author

Done.

logging.info("Finished joining at epoch %d. Training accuracy: %f.",
epoch, train_accuracy.result())

for _ in range(STEPS_PER_EPOCH):
Contributor

Should evaluation use a different steps_per_epoch, since you have a different batch_size for evaluation?

Author

Good point. Introducing EVAL_STEPS_PER_EPOCH and setting it to 88 in the next patch shortly. This gives us a probability of 0.99 for a row in the dataset to be evaluated.
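A figure like this can be sanity-checked under an explicit sampling model. The helper below assumes each eval step draws a batch of `b` rows uniformly with replacement from an `N`-row dataset — the actual sharding and shuffling in this PR may differ, so the constants in the test are illustrative, not the PR's:

```python
def coverage_probability(n_rows, batch_size, num_steps):
    """P(a fixed row is drawn at least once), sampling with replacement."""
    miss_per_step = 1.0 - batch_size / n_rows
    return 1.0 - miss_per_step ** num_steps

def steps_for_coverage(n_rows, batch_size, target=0.99):
    """Smallest step count whose coverage probability reaches the target."""
    steps = 0
    while coverage_probability(n_rows, batch_size, steps) < target:
        steps += 1
    return steps
```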

logging.info("Finished joining at epoch %d. Training accuracy: %f.",
epoch, train_accuracy.result())

for _ in range(STEPS_PER_EPOCH):
Contributor

Could you add a comment here saying that we are running inline distributed evaluation, so an evaluator job is not necessary?

Author

Done.

Also addressed the following:
* Added inter_ops for workers
* Replaced parameter_server_strategy_v2.ParameterServerStrategyV2 with tf.distribute.experimental.ParameterServerStrategy
* Clarified resnet_cifar_ps_strategy.py description
* Indicated that a side-car evaluation job is not needed since we are running inline evaluation
* Removed redundant spaces

flags.DEFINE_string("data_dir", "gs://cifar10_data/",
"Directory for Resnet Cifar model input. Follow the "
"instruction here to get Cifar10 data: "
"https://github.com/tensorflow/models/tree/r1.13.0/official/resnet#cifar-10")
Contributor

redundant new line?

Author

Split the help argument into multiple lines for readability; the strings are displayed concatenated when the help cmdline arg is passed.
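This works because Python concatenates adjacent string literals at compile time, so a long help string can be split across source lines without changing its value:

```python
# Identical to writing the whole sentence as one literal.
help_text = ("Directory for Resnet Cifar model input. Follow the "
             "instruction here to get Cifar10 data: "
             "https://github.com/tensorflow/models/tree/r1.13.0/official/resnet#cifar-10")
```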

parse_record_fn=cifar_preprocessing.parse_record,
dtype=tf.float32,
drop_remainder=True)
eval_dataset_fn = lambda _: cifar_preprocessing.input_fn(
Contributor

Is the eval data shuffled? If not, could you add a comment and a TODO?

Contributor

Maybe you can just append a shuffle at the end of the dataset?

Author

input_fn already shuffles the training data using process_record_dataset: code link


# Since we are running inline evaluation below, a side-car evaluator job is not necessary.
for _ in range(EVAL_STEPS_PER_EPOCH):
coordinator.schedule(worker_eval_fn, args=(per_worker_eval_iterator,))
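`coordinator.schedule` queues the eval function for asynchronous execution on whichever worker is free, and joining blocks until every scheduled function has finished. The dispatch pattern (not the TensorFlow API itself) can be mimicked with a thread pool; everything below is an illustrative stand-in, not `ClusterCoordinator`:

```python
from concurrent.futures import ThreadPoolExecutor

class MiniCoordinator:
    """Toy stand-in for the ClusterCoordinator schedule/join pattern."""

    def __init__(self, num_workers):
        self._pool = ThreadPoolExecutor(max_workers=num_workers)
        self._futures = []

    def schedule(self, fn, args=()):
        # Returns immediately; fn runs later on some free "worker" thread.
        self._futures.append(self._pool.submit(fn, *args))

    def join(self):
        # Block until every scheduled function has finished.
        return [f.result() for f in self._futures]

def eval_step(batch_id):
    return batch_id * 2  # placeholder for per-batch eval work

coordinator = MiniCoordinator(num_workers=4)
for i in range(8):
    coordinator.schedule(eval_step, args=(i,))
results = coordinator.join()
```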
Member

We can probably build a similar API for DTensor async training. A major difficulty to sort out is what to do if worker_eval_fn (and/or replica_fn) is multi-mesh -- for example, if there is a summary op that needs to run on the CPU.
