[GH Issue Summarization] Upgrade to kf v0.4.0-rc.2 (kubeflow#450)
* Update tfjob components to v1beta1
Remove old version of tensor2tensor component
* Combine UI into a single jsonnet file
* Upgrade GH issue summarization to kf v0.4.0-rc.2
Use latest ksonnet v0.13.1
Use latest seldon v1alpha2
Remove ksonnet app with full kubeflow platform & replace with components specific to this example.
Remove outdated scripts
Add cluster creation links to Click-to-deploy & kfctl
Add warning not to use the Training with an Estimator guide
Replace commandline with bash for better syntax highlighting
Replace messy port-forwarding commands with svc/ambassador
Add modelUrl param to ui component
Modify teardown instructions to remove the deployment
Fix grammatical mistakes
* Rearrange tfjob instructions
* The [ks-kubeflow](https://github.com/kubeflow/examples/tree/master/github_issue_summarization/ks-kubeflow) directory can be used instead of creating a ksonnet app from scratch.

* If you run into [API rate limiting errors](https://github.com/ksonnet/ksonnet/blob/master/docs/troubleshooting.md#github-rate-limiting-errors), ensure you have a `${GITHUB_TOKEN}` environment variable set.

* If you run into [RBAC permissions issues](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md#rbac-clusters) running `ks apply` commands, be sure you have created a `cluster-admin` ClusterRoleBinding for your username.
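Both workarounds can be sketched as follows; the binding name is illustrative, and the `gcloud` lookup assumes you authenticate to the cluster with your GCP account:

```shell
# Avoid ksonnet's anonymous GitHub rate limit.
export GITHUB_TOKEN=<your-github-token>   # personal access token; no scopes needed

# Grant your user cluster-admin so `ks apply` can create RBAC-managed resources.
kubectl create clusterrolebinding cluster-admin-binding \
  --clusterrole=cluster-admin \
  --user=$(gcloud config get-value account)
```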
* We created a ksonnet app for our kubeflow deployment: `ks_app`.
* We deployed the default Kubeflow components to our Kubernetes cluster.
* We created a disk for storing our training data.
* We connected to JupyterHub and spawned a new Jupyter notebook.
* For additional details and self-paced learning scenarios related to this example, check the `Resources` section of the [getting started guide](https://www.kubeflow.org/docs/started/getting-started/).

*Next*: [Training the model](02_training_the_model.md)
Requires a [StorageClass](https://kubernetes.io/docs/concepts/storage/storage-classes/) capable of creating ReadWriteMany persistent volumes.
On GKE you can follow the [GCFS documentation](https://master.kubeflow.org/docs/started/getting-started-gke/#using-gcfs-with-kubeflow) to enable it.
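The ReadWriteMany requirement amounts to a claim along these lines; the claim name, class name, and size below are illustrative, not taken from the repo:

```yaml
# Hypothetical PVC requesting a ReadWriteMany (RWX) volume. The
# storageClassName must point at a class that supports RWX, e.g. one
# backed by GCFS/Filestore on GKE or an NFS provisioner elsewhere.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-training-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client   # assumed class name; cluster-specific
  resources:
    requests:
      storage: 10Gi
```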
Estimator and Keras are both part of TensorFlow. These high-level APIs are designed to make building models easier. In our distributed training example, we will show how both APIs work together to help build models that are trainable in both single-node and distributed settings.

## Keras and Estimators

Code required to run this example can be found in the [distributed](https://github.com/kubeflow/examples/tree/master/github_issue_summarization/distributed) directory.
From `github_issue_summarization/02_training_the_model_tfjob.md`:
# Training the model using TFJob
Kubeflow offers a TensorFlow job controller for Kubernetes. This allows you to run your distributed TensorFlow training job on a Kubernetes cluster. For this training job, we will read our training data from Google Cloud Storage (GCS) and write our output model back to GCS.
## Create the image for training
The [notebooks](notebooks) directory contains the necessary files to create an image for training. The [train.py](notebooks/train.py) file contains the training code. Here is how you can create an image and push it to Google Container Registry (GCR):

```bash
cd notebooks/
make PROJECT=${PROJECT} set-image
```
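If you prefer not to use the Makefile, the equivalent steps are roughly as follows; the image name and tag are assumptions, so check the Makefile for the authoritative recipe:

```shell
# Build the training image from the notebooks/ directory and push it to GCR.
PROJECT=my-gcp-project                                      # hypothetical project id
IMAGE=gcr.io/${PROJECT}/github-issue-summarization:latest   # assumed image name
docker build -t ${IMAGE} .
docker push ${IMAGE}
```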
## Train Using PVC

If you don't have access to GCS or do not wish to use GCS, you can use a Persistent Volume Claim (PVC) to store the data and model.

Note: your cluster must have a default storage class defined for this to work.

Create a PVC:

```
ks apply --env=${KF_ENV} -c data-pvc
```
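After applying the component, you can confirm the claim bound; the PVC name `data-pvc` is an assumption based on the component name:

```shell
# STATUS should show Bound once a volume has been provisioned.
kubectl get pvc data-pvc
```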
Run the job to download the data to the PVC:

```
ks apply --env=${KF_ENV} -c data-downloader
```
Submit the training job:

```
ks apply --env=${KF_ENV} -c tfjob-pvc
```

The resulting model will be stored on the PVC, so to access it you will need to run a pod and attach the PVC. For serving, you can just attach it to the pod serving the model.
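Once submitted, progress can be watched with standard tooling; the TFJob name is assumed to match the component name:

```shell
kubectl get tfjobs                  # list TFJob custom resources
kubectl describe tfjob tfjob-pvc    # name assumed to match the component
kubectl logs -f <master-pod-name>   # follow training output from the master pod
```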
## Training Using GCS

If you are using GCS, you can train using GCS to store the input and the resulting model.

### GCS service account

* Create a service account that will be used to read and write data from the GCS bucket.

* Give the service account the `roles/storage.admin` role so that it can access GCS buckets.

* Download its key as a JSON file and create a secret named `user-gcp-sa` with the key `user-gcp-sa.json`:
```bash
SERVICE_ACCOUNT=github-issue-summarization
PROJECT=kubeflow-example-project # The GCP project name
gcloud iam service-accounts --project=${PROJECT} create ${SERVICE_ACCOUNT} \
```
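The key-download and secret-creation steps described in the bullets above can be sketched as follows; the key file name on disk is an assumption:

```shell
# Download a key for the service account (file name assumed).
gcloud iam service-accounts keys create user-gcp-sa.json \
  --iam-account=${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com

# Create the secret the docs describe, with the key user-gcp-sa.json.
kubectl create secret generic user-gcp-sa \
  --from-file=user-gcp-sa.json=user-gcp-sa.json
```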