Conversation

@selcukgun

ResNet56 model (with custom training loop): variables are created on
parameter server jobs and updated by workers. Evaluation is done by a
dedicated job that uses the checkpoints saved during training
(side-car evaluation).

The model is trained on the CIFAR10 dataset.
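The side-car evaluation pattern mentioned above — a separate evaluator process that watches the checkpoint directory and evaluates each new checkpoint as it appears — can be sketched in a framework-agnostic way. (In TensorFlow this is typically done with `tf.train.checkpoints_iterator`; the polling loop below is only an illustration, and every name in it is hypothetical.)

```python
import os
import time

def latest_checkpoint(ckpt_dir):
    """Return the path of the newest checkpoint file in ckpt_dir, or None."""
    ckpts = [os.path.join(ckpt_dir, f) for f in os.listdir(ckpt_dir)
             if f.startswith("ckpt-")]
    return max(ckpts, key=os.path.getmtime) if ckpts else None

def sidecar_evaluate(ckpt_dir, evaluate_fn, poll_secs=1.0, max_polls=None):
    """Poll ckpt_dir and call evaluate_fn on every new checkpoint found."""
    seen, polls, results = None, 0, []
    while max_polls is None or polls < max_polls:
        ckpt = latest_checkpoint(ckpt_dir)
        if ckpt is not None and ckpt != seen:
            # In a real evaluator this would restore weights and run eval.
            results.append(evaluate_fn(ckpt))
            seen = ckpt
        else:
            time.sleep(poll_secs)
        polls += 1
    return results
```

Because the evaluator only reads checkpoints, it needs no coordination with the training cluster — which is what makes the job "side-car".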

@google-cla google-cla bot added the cla: yes label Feb 9, 2021
@yuefengz yuefengz requested review from ckkuang and yuefengz February 24, 2021 19:18
The Jinja template now turns off side-car evaluation by default so that only
the inline distributed evaluation added with this CL is used.
README updated.

Added efficiency wrappers that will be useful once GPU is supported with
ParameterServerStrategy.

Moved kubernetes jinja template and renderer script to dedicated
subdirectory.
@selcukgun selcukgun marked this pull request as ready for review February 25, 2021 06:12
Contributor

@yuefengz yuefengz left a comment

Thanks for your PR!

Please first read the
[documentation](https://www.tensorflow.org/tutorials/distribute/parameter_server_training)
of Distribution Strategy for parameter server training. We also assume that readers
of this page are familiar with [Google Cloud](https://cloud.google.com/) and
Contributor

Redundant space

Author

Done.

- kubernetes/template.yaml.jinja: jinja template used for generating Kubernetes manifests
- kubernetes/render_template.py: script for rendering the jinja template
- Dockerfile.resnet_cifar_ps_strategy: a docker file to build the model image
- resnet_cifar_ps_strategy.py: script for running any type of parameter server training task based on `TF_CONFIG` environment variable
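In parameter server training on Kubernetes, each pod receives a `TF_CONFIG` environment variable describing the cluster layout and the pod's own role, which is how a single entry-point script can run as chief, worker, or ps. A minimal sketch of the parsing (the names and branching here are illustrative — the actual script's logic may differ):

```python
import json
import os

def get_task(environ=os.environ):
    """Return (task_type, task_index, cluster_spec) parsed from TF_CONFIG."""
    tf_config = json.loads(environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    return (task.get("type", "chief"),
            task.get("index", 0),
            tf_config.get("cluster", {}))

# A worker pod in a 2-worker / 1-ps cluster might see something like:
example = {
    "cluster": {
        "chief": ["chief:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps": ["ps-0:2222"],
    },
    "task": {"type": "worker", "index": 1},
}
```

The Jinja template's job is to render one manifest per cluster entry, each with the matching `TF_CONFIG` value injected.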
Contributor

"any type of ..." seems too general, maybe just say "a ResNet example using Cifar dataset for parameter server training"

Author

Done.

BATCH_SIZE = 64
EVAL_BATCH_SIZE = 8

def create_in_process_cluster(num_workers, num_ps):
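An in-process cluster starts worker and ps servers inside the test process; the core of it is building a cluster spec of localhost addresses, one per task. A TensorFlow-free sketch of just that spec construction (the fixed `base_port` is made up — the real helper picks free ports with `portpicker.pick_unused_port()` and starts a `tf.distribute.Server` per address):

```python
def build_cluster_spec(num_workers, num_ps, base_port=15000):
    """Build a TF-style cluster spec dict of localhost addresses.

    Only the dict layout that ClusterSpec / TF_CONFIG expects is shown;
    server startup is omitted.
    """
    worker_ports = [base_port + i for i in range(num_workers)]
    ps_ports = [base_port + num_workers + i for i in range(num_ps)]
    return {
        "worker": ["localhost:%d" % p for p in worker_ports],
        "ps": ["localhost:%d" % p for p in ps_ports],
    }
```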
Contributor

Could you update the work_config part according to this tutorial? https://www.tensorflow.org/tutorials/distribute/parameter_server_training#in-process_cluster

Author

Added inter ops.

set up distributed training
"""

strategy = parameter_server_strategy_v2.ParameterServerStrategyV2(
Contributor

Let's use `tf.distribute.experimental.ParameterServerStrategy`.

Author

Done.

logging.info("Finished joining at epoch %d. Training accuracy: %f.",
epoch, train_accuracy.result())

for _ in range(STEPS_PER_EPOCH):
Contributor

Should evaluation use a different steps_per_epoch, since you have a different batch_size for evaluation?

Author

Good point. Introducing EVAL_STEPS_PER_EPOCH and setting it to 88 in the next patch shortly. This gives us a probability of 0.99 for a row in the dataset to be evaluated.
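A figure like this can be sanity-checked under an explicit sampling model. The helper below assumes each eval step draws a batch of `b` rows uniformly with replacement from an `N`-row dataset — the actual sharding and shuffling in this PR may differ, so the constants in the test are illustrative, not the PR's:

```python
def coverage_probability(n_rows, batch_size, num_steps):
    """P(a fixed row is drawn at least once), sampling with replacement."""
    miss_per_step = 1.0 - batch_size / n_rows
    return 1.0 - miss_per_step ** num_steps

def steps_for_coverage(n_rows, batch_size, target=0.99):
    """Smallest step count whose coverage probability reaches the target."""
    steps = 0
    while coverage_probability(n_rows, batch_size, steps) < target:
        steps += 1
    return steps
```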

logging.info("Finished joining at epoch %d. Training accuracy: %f.",
epoch, train_accuracy.result())

for _ in range(STEPS_PER_EPOCH):
Contributor

Could you add a comment here saying that we are running inline distributed evaluation, so an evaluator job is not necessary?

Author

Done.

Also addressed the following:
* Added inter_ops for workers
* Replaced parameter_server_strategy_v2.ParameterServerStrategyV2 with tf.distribute.experimental.ParameterServerStrategy
* Clarified resnet_cifar_ps_strategy.py description
* Indicated that a side-car evaluation job is not needed since we are running inline evaluation
* Removed redundant spaces

flags.DEFINE_string("data_dir", "gs://cifar10_data/",
"Directory for Resnet Cifar model input. Follow the "
"instruction here to get Cifar10 data: "
"https://github.com/tensorflow/models/tree/r1.13.0/official/resnet#cifar-10")
Contributor

redundant new line?

Author

Split the help argument into multiple lines for readability; the strings are displayed concatenated when the help cmdline arg is passed.
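This works because Python concatenates adjacent string literals at compile time, so a long help string can be split across source lines without changing its value:

```python
# Identical to writing the whole sentence as one literal.
help_text = ("Directory for Resnet Cifar model input. Follow the "
             "instruction here to get Cifar10 data: "
             "https://github.com/tensorflow/models/tree/r1.13.0/official/resnet#cifar-10")
```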

parse_record_fn=cifar_preprocessing.parse_record,
dtype=tf.float32,
drop_remainder=True)
eval_dataset_fn = lambda _: cifar_preprocessing.input_fn(
Contributor

Is the eval data shuffled? If not, could you add a comment and a TODO?

Contributor

Maybe you can just append a shuffle at the end of the dataset?

Author

input_fn already shuffles the training data using process_record_dataset: code link


# Since we are running inline evaluation below, a side-car evaluator job is not necessary.
for _ in range(EVAL_STEPS_PER_EPOCH):
coordinator.schedule(worker_eval_fn, args=(per_worker_eval_iterator,))
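`coordinator.schedule` queues the eval function for asynchronous execution on whichever worker is free, and joining blocks until every scheduled function has finished. The dispatch pattern (not the TensorFlow API itself) can be mimicked with a thread pool; everything below is an illustrative stand-in, not `ClusterCoordinator`:

```python
from concurrent.futures import ThreadPoolExecutor

class MiniCoordinator:
    """Toy stand-in for the ClusterCoordinator schedule/join pattern."""

    def __init__(self, num_workers):
        self._pool = ThreadPoolExecutor(max_workers=num_workers)
        self._futures = []

    def schedule(self, fn, args=()):
        # Returns immediately; fn runs later on some free "worker" thread.
        self._futures.append(self._pool.submit(fn, *args))

    def join(self):
        # Block until every scheduled function has finished.
        return [f.result() for f in self._futures]

def eval_step(batch_id):
    return batch_id * 2  # placeholder for per-batch eval work

coordinator = MiniCoordinator(num_workers=4)
for i in range(8):
    coordinator.schedule(eval_step, args=(i,))
results = coordinator.join()
```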
Member

We can probably build a similar API for DTensor async training. A major difficulty to sort out is what to do if worker_eval_fn (and/or replica_fn) is multi-mesh -- for example, if there is a summary op that needs to run on the CPU.
