
Commit 70a22d6

texasmichelle authored and k8s-ci-robot committed

[GH Issue Summarization] Upgrade to kf v0.4.0-rc.2 (kubeflow#450)

* Update tfjob components to v1beta1
  * Remove old version of tensor2tensor component
* Combine UI into a single jsonnet file
* Upgrade GH issue summarization to kf v0.4.0-rc.2
  * Use latest ksonnet v0.13.1
  * Use latest seldon v1alpha2
  * Remove ksonnet app with full kubeflow platform & replace with components specific to this example
  * Remove outdated scripts
  * Add cluster creation links to Click-to-deploy & kfctl
  * Add warning not to use the Training with an Estimator guide
  * Replace commandline with bash for better syntax highlighting
  * Replace messy port-forwarding commands with svc/ambassador
  * Add modelUrl param to ui component
  * Modify teardown instructions to remove the deployment
  * Fix grammatical mistakes
* Rearrange tfjob instructions

1 parent 7990408 · commit 70a22d6

107 files changed: +385 −86534 lines

@@ -1,45 +1,52 @@

# Setup Kubeflow

In this section, you will set up Kubeflow on an existing Kubernetes cluster.

## Requirements

* A Kubernetes cluster
  * To create a cluster, follow the instructions on the
    [Set up Kubernetes](https://www.kubeflow.org/docs/started/getting-started/#set-up-kubernetes)
    section of the Kubeflow Getting Started guide. We recommend using a
    managed service such as Google Kubernetes Engine (GKE).
    [This link](https://www.kubeflow.org/docs/started/getting-started-gke/)
    guides you through the process of using either
    [Click-to-Deploy](https://deploy.kubeflow.cloud/#/deploy) (a web-based UI) or
    [`kfctl`](https://github.com/kubeflow/kubeflow/blob/master/scripts/kfctl.sh)
    (a CLI tool) to generate a GKE cluster with all Kubeflow components
    installed. Note that there is no need to complete the Deploy Kubeflow steps
    below if you use either of these two tools.
* The Kubernetes CLI `kubectl` pointing to the Kubernetes cluster
  * Make sure that you can run `kubectl get nodes` from your terminal
    successfully
* The ksonnet CLI [`ks`](https://ksonnet.io/#get-started), v0.9.2 or higher
  * In case you want to install a particular version of ksonnet, you can run

    ```bash
    export KS_VER=0.13.1
    export KS_BIN=ks_${KS_VER}_linux_amd64
    wget -O /tmp/${KS_BIN}.tar.gz https://github.com/ksonnet/ksonnet/releases/download/v${KS_VER}/${KS_BIN}.tar.gz
    mkdir -p ${HOME}/bin
    tar -xvf /tmp/${KS_BIN}.tar.gz -C ${HOME}/bin
    export PATH=$PATH:${HOME}/bin/${KS_BIN}
    ```
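Before running the snippet above verbatim, it can help to preview the release URL the two variables expand to (a minimal sketch; run it locally, then `wget` the printed URL):

```shell
# Preview the ksonnet release URL assembled from KS_VER/KS_BIN,
# so you can confirm the version and platform before downloading.
KS_VER=0.13.1
KS_BIN=ks_${KS_VER}_linux_amd64
echo "https://github.com/ksonnet/ksonnet/releases/download/v${KS_VER}/${KS_BIN}.tar.gz"
```

On macOS the release artifact name differs (e.g. `darwin` instead of `linux`), so adjust `KS_BIN` accordingly.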
## Kubeflow setup

Refer to the [guide](https://www.kubeflow.org/docs/started/getting-started/) for
detailed instructions on how to set up Kubeflow on your Kubernetes cluster.
Specifically, complete the following sections:

* [Deploy Kubeflow](https://www.kubeflow.org/docs/started/getting-started/)
  * The latest version that was tested with this walkthrough was v0.4.0-rc.2.
  * The [`kfctl`](https://github.com/kubeflow/kubeflow/blob/master/scripts/kfctl.sh)
    CLI tool can be used to install Kubeflow on an existing cluster. Follow
    [this guide](https://www.kubeflow.org/docs/started/getting-started/#kubeflow-quick-start)
    to use `kfctl` to generate a ksonnet app, create Kubeflow manifests, and
    install all default components onto an existing Kubernetes cluster. Note
    that you can likely skip this step if you used
    [Click-to-Deploy](https://deploy.kubeflow.cloud/#/deploy)
    or `kfctl` to generate your cluster.

* [Setup a persistent disk](https://www.kubeflow.org/docs/guides/advanced/)
@@ -49,9 +56,9 @@

* For this example, provision a `10GB` cluster-wide shared NFS mount with the
  name `github-issues-data`.

* After the NFS is ready, delete the `jupyter-0` pod so that it gets recreated and
  picks up the NFS mount. You can delete it by running
  `kubectl delete pod jupyter-0 -n=${NAMESPACE}`.

* [Bringing up a Notebook](https://www.kubeflow.org/docs/guides/components/jupyter/)
@@ -62,19 +69,44 @@

After completing that, you should have the following ready:

* A ksonnet app in a directory named `ks_app`
* An output similar to this for the `kubectl -n kubeflow get pods` command

  ```bash
  NAME                                                      READY  STATUS     RESTARTS   AGE
  ambassador-5cf8cd97d5-6qlpz                               1/1    Running    0          3m
  ambassador-5cf8cd97d5-rqzkx                               1/1    Running    0          3m
  ambassador-5cf8cd97d5-wz9hl                               1/1    Running    0          3m
  argo-ui-7c9c69d464-xpphz                                  1/1    Running    0          3m
  centraldashboard-6f47d694bd-7jfmw                         1/1    Running    0          3m
  cert-manager-5cb7b9fb67-qjd9p                             1/1    Running    0          3m
  cm-acme-http-solver-2jr47                                 1/1    Running    0          3m
  ingress-bootstrap-x6whr                                   1/1    Running    0          3m
  jupyter-0                                                 1/1    Running    0          3m
  jupyter-chasm                                             1/1    Running    0          49s
  katib-ui-54b4667bc6-cg4jk                                 1/1    Running    0          3m
  metacontroller-0                                          1/1    Running    0          3m
  minio-7bfcc6c7b9-qrshc                                    1/1    Running    0          3m
  ml-pipeline-b59b58dd6-bwm8t                               1/1    Running    0          3m
  ml-pipeline-persistenceagent-9ff99498c-v4k8f              1/1    Running    0          3m
  ml-pipeline-scheduledworkflow-78794fd86f-4tzxp            1/1    Running    0          3m
  ml-pipeline-ui-9884fd997-7jkdk                            1/1    Running    0          3m
  ml-pipelines-load-samples-668gj                           0/1    Completed  0          3m
  mysql-6f6b5f7b64-qgbkz                                    1/1    Running    0          3m
  pytorch-operator-6f87db67b7-nld5h                         1/1    Running    0          3m
  spartakus-volunteer-7c77dc796-7jgtd                       1/1    Running    0          3m
  studyjob-controller-68c6fc5bc8-jkc9q                      1/1    Running    0          3m
  tf-job-dashboard-5f986cf99d-kb6gp                         1/1    Running    0          3m
  tf-job-operator-v1beta1-5876c48976-q96nh                  1/1    Running    0          3m
  vizier-core-78f57695d6-5t8z7                              1/1    Running    0          3m
  vizier-core-rest-7d7dd7dbb8-dbr7n                         1/1    Running    0          3m
  vizier-db-777675b958-c46qh                                1/1    Running    0          3m
  vizier-suggestion-bayesianoptimization-7f46d8cb47-wlltt   1/1    Running    0          3m
  vizier-suggestion-grid-64c5f8bdf-2bznv                    1/1    Running    0          3m
  vizier-suggestion-hyperband-8546bf5885-54hr6              1/1    Running    0          3m
  vizier-suggestion-random-c4c8d8667-l96vs                  1/1    Running    0          3m
  whoami-app-7b575b555d-85nb8                               1/1    Running    0          3m
  workflow-controller-5c95f95f58-hprd5                      1/1    Running    0          3m
  ```

* A Jupyter Notebook accessible at http://127.0.0.1:8000
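Rather than eyeballing that listing, you can filter for pods that have not reached a healthy state (a hedged sketch; it assumes the default `kubectl get pods` column layout, with STATUS as the third field):

```shell
# Prints rows whose STATUS field is neither Running nor Completed;
# pipe `kubectl -n kubeflow get pods --no-headers` into it.
# No output means the deployment looks healthy.
unhealthy() { awk '$3 != "Running" && $3 != "Completed"'; }

# Example against a captured healthy line (prints nothing):
printf 'jupyter-0 1/1 Running 0 3m\n' | unhealthy
```

Typical usage on a live cluster: `kubectl -n kubeflow get pods --no-headers | unhealthy`.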
@@ -83,10 +115,14 @@

## Summary

* We created a ksonnet app for our Kubeflow deployment: `ks_app`.
* We deployed the default Kubeflow components to our Kubernetes cluster.
* We created a disk for storing our training data.
* We connected to JupyterHub and spawned a new Jupyter notebook.
* For additional details and self-paced learning scenarios related to this
  example, see the
  [Resources](https://www.kubeflow.org/docs/started/getting-started/#resources)
  section of the
  [Getting Started Guide](https://www.kubeflow.org/docs/started/getting-started/).

*Next*: [Training the model with a notebook](02_training_the_model.md)

**github_issue_summarization/02_distributed_training.md** (+14 −7)
@@ -1,23 +1,26 @@

# Distributed training using Estimator

Distributed training with Keras currently does not work. Do not follow this guide
until these issues have been resolved:

* [kubeflow/examples#280](https://github.com/kubeflow/examples/issues/280)
* [kubeflow/examples#196](https://github.com/kubeflow/examples/issues/196)

Requires TensorFlow 1.9 or later.
Requires a [StorageClass](https://kubernetes.io/docs/concepts/storage/storage-classes/) capable of creating ReadWriteMany persistent volumes.

On GKE you can follow the [GCFS documentation](https://master.kubeflow.org/docs/started/getting-started-gke/#using-gcfs-with-kubeflow) to enable it.

Estimator and Keras are both part of TensorFlow. These high-level APIs are designed
to make building models easier. In our distributed training example, we will show how both
APIs work together to help build models that are trainable in both single-node and
distributed fashion.

## Keras and Estimators

Code required to run this example can be found in the
[distributed](https://github.com/kubeflow/examples/tree/master/github_issue_summarization/distributed)
directory.

You can read more about Estimators [here](https://www.tensorflow.org/guide/estimators).
In our example we will leverage the `model_to_estimator` function, which allows us to turn an existing tf.keras model into an estimator, and therefore allow it to
@@ -93,3 +96,7 @@

## Model

After training is complete, our model can be found in the "model" PVC.
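Since the model sits on a PVC rather than in object storage, one way to browse it is to mount the claim into a throwaway pod (an illustrative manifest, not part of the example's ksonnet app; the pod name, image, and mount path are assumptions, while `claimName: model` matches the PVC above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-reader          # hypothetical helper pod
spec:
  containers:
  - name: shell
    image: busybox            # any small image with a shell works
    command: ["sleep", "3600"]
    volumeMounts:
    - name: model
      mountPath: /model
  volumes:
  - name: model
    persistentVolumeClaim:
      claimName: model        # the PVC that holds the trained model
```

Apply it with `kubectl -n ${NAMESPACE} apply -f model-reader.yaml`, then inspect the contents via `kubectl -n ${NAMESPACE} exec model-reader -- ls /model`.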

*Next*: [Serving the model](03_serving_the_model.md)

*Back*: [Setup a kubeflow cluster](01_setup_a_kubeflow_cluster.md)

**github_issue_summarization/02_training_the_model.md** (+9 −9)
@@ -1,14 +1,14 @@
1-
# Training the model
1+
# Training the model with a notebook
22

3-
By this point, you should have a Jupyter Notebook running at http://127.0.0.1:8000.
3+
By this point, you should have a Jupyter notebook running at http://127.0.0.1:8000.
44

55
## Download training files
66

7-
Open the Jupyter Notebook interface and create a new Terminal by clicking on
8-
menu, *New -> Terminal*. In the Terminal, clone this git repo by executing: `
7+
Open the Jupyter notebook interface and create a new Terminal by clicking on
8+
menu, *New -> Terminal*. In the Terminal, clone this git repo by executing:
99

10-
```commandline
11-
git clone https://github.com/kubeflow/examples.git`
10+
```bash
11+
git clone https://github.com/kubeflow/examples.git
1212
```
1313

1414
Now you should have all the code required to complete training in the `examples/github_issue_summarization/notebooks` folder. Navigate to this folder.
@@ -19,7 +19,7 @@

## Perform training

Open the `Training.ipynb` notebook. This contains a complete walk-through of
downloading the training data, preprocessing it, and training the model.

Run the `Training.ipynb` notebook, viewing the output at each step to confirm
@@ -44,9 +44,9 @@

```bash
kubectl --namespace=${NAMESPACE} cp ${PODNAME}:/home/jovyan/examples/github_issue_summarization/notebooks/body_pp.dpkl .
kubectl --namespace=${NAMESPACE} cp ${PODNAME}:/home/jovyan/examples/github_issue_summarization/notebooks/title_pp.dpkl .
```
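After the `kubectl cp` commands, a quick sanity check that the artifacts landed locally and are non-empty (a minimal sketch; the file list matches the copies above):

```shell
# Report whether each copied artifact exists locally and is non-empty.
for f in body_pp.dpkl title_pp.dpkl; do
  if [ -s "$f" ]; then echo "ok: $f"; else echo "missing: $f"; fi
done
```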

_(Optional)_ You can also perform training with two alternate methods:
- [Training the model using TFJob](02_training_the_model_tfjob.md)
- [Distributed training using Estimator](02_distributed_training.md)

*Next*: [Serving the model](03_serving_the_model.md)

**github_issue_summarization/02_training_the_model_tfjob.md** (+39 −35)
@@ -1,32 +1,35 @@

# Training the model using TFJob

Kubeflow offers a TensorFlow job controller for Kubernetes. This allows you to run your distributed TensorFlow training
job on a Kubernetes cluster. For this training job, we will read our training
data from Google Cloud Storage (GCS) and write our output model
back to GCS.

## Create the image for training

The [notebooks](notebooks) directory contains the necessary files to create an
image for training. The [train.py](notebooks/train.py) file contains the
training code. Here is how you can create an image and push it to Google
Container Registry (GCR):

```bash
cd notebooks/
make PROJECT=${PROJECT} set-image
```
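`set-image` builds and pushes the training image; the resulting reference follows the usual GCR naming scheme (a sketch; the image name and tag below are hypothetical — check the Makefile in `notebooks/` for the real values):

```shell
# Compose the fully qualified GCR reference for the pushed image.
PROJECT=kubeflow-example-project  # your GCP project
IMAGE=issue-summarization         # hypothetical image name; see the Makefile
TAG=latest                        # hypothetical tag; see the Makefile
echo "gcr.io/${PROJECT}/${IMAGE}:${TAG}"
```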
## Train Using PVC

If you don't have access to GCS or do not wish to use GCS, you
can use a Persistent Volume Claim (PVC) to store the data and model.

Note: your cluster must have a default storage class defined for this to work.
Create a PVC:

```bash
ks apply --env=${KF_ENV} -c data-pvc
```

Run the job to download the data to the PVC:

```bash
ks apply --env=${KF_ENV} -c data-downloader
```
3841
ks apply --env=${KF_ENV} -c tfjob-pvc
3942
```
4043

41-
The resulting model will be stored on PVC so to access it you will
42-
need to run a pod and attach the PVC. For serving you can just
43-
attach it the pod serving the model.
44+
The resulting model will be stored on the PVC, so to access it you will
45+
need to run a pod and attach the PVC. For serving, you can just
46+
attach it to the pod serving the model.

## Training Using GCS

If you are using GCS, you can train using GCS to store the input
and the resulting model.

### GCS service account

* Create a service account that will be used to read and write data from the GCS bucket.

* Give the service account the `roles/storage.admin` role so that it can access GCS buckets.

* Download its key as a JSON file and create a secret named `user-gcp-sa` with the key `user-gcp-sa.json`

```bash
SERVICE_ACCOUNT=github-issue-summarization
PROJECT=kubeflow-example-project # The GCP Project name
gcloud iam service-accounts --project=${PROJECT} create ${SERVICE_ACCOUNT} \
```
@@ -74,12 +77,12 @@

### Run the TFJob using your image

[ks_app](ks_app) contains a ksonnet app to deploy the TFJob.

Set the appropriate params for the tfjob component:

```bash
cd ks_app
ks param set tfjob namespace ${NAMESPACE} --env=${KF_ENV}

# The image pushed in the previous step
```
@@ -97,30 +100,31 @@

Deploy the app:

```bash
ks apply ${KF_ENV} -c tfjob
```

After a while you should see a new pod with the label `tf_job_name=tf-job-issue-summarization`:

```bash
kubectl get pods -n=${NAMESPACE} tfjob-issue-summarization-master-0
```

You can view the training logs using:

```bash
kubectl logs -f -n=${NAMESPACE} tfjob-issue-summarization-master-0
```

You can view the logs of the tf-job operator using:

```bash
kubectl logs -f -n=${NAMESPACE} $(kubectl get pods -n=${NAMESPACE} -lname=tf-job-operator -o=jsonpath='{.items[0].metadata.name}')
```

_(Optional)_ You can also perform training with two alternate methods:
- [Training the model with a notebook](02_training_the_model.md)
- [Distributed training using Estimator](02_distributed_training.md)

*Next*: [Serving the model](03_serving_the_model.md)
