MXNet on Amazon EKS

This document assumes that you have a GPU-enabled Amazon EKS cluster up and running.

MNIST training using MXNet on EKS

In this sample, we'll use the MNIST database of handwritten digits to train a model that recognizes handwritten digits.

Build a Docker image that contains the MNIST source code and its dependencies, using the Dockerfile at `mxnet/mnist/Dockerfile`:
```
docker image build mxnet/mnist -t <tag_for_image>
```
This generates a Docker image that contains everything needed to run the MNIST training. You can push the generated image to a personal repository on Docker Hub. For convenience, a prebuilt image is already available on Docker Hub as `rgaut/deeplearning-mxnet:with_mxnet`.
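
Pushing the image to a personal repository can be sketched as the small helper below. It only wraps the standard `docker tag` and `docker push` commands; the function name `push_image` and the repository name in the usage note are hypothetical, and the sketch assumes you have already authenticated with `docker login`.

```shell
# Tag a locally built image for a Docker Hub repository and push it.
# Usage: push_image <local_tag> <dockerhub_user>/<repo>:<tag>
# Assumes `docker login` has already been run for the target account.
push_image() {
  local_tag="$1"
  remote_ref="$2"
  # Only push if tagging succeeded.
  docker tag "$local_tag" "$remote_ref" && docker push "$remote_ref"
}
```

For example, `push_image <tag_for_image> example-user/mxnet-mnist:latest` would publish the image built above under the placeholder repository `example-user/mxnet-mnist`. If you push your own image, the pod manifest would need to reference it in place of the prebuilt one.
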

Create a pod that uses this Docker image and runs the MNIST training. The pod manifest is available at `samples/mxnet/mnist/mxnet.yaml`:
```
kubectl create -f samples/mxnet/mnist/mxnet.yaml
```
To use a GPU for training, run the following command:
```
kubectl create -f samples/mxnet/mnist/mxnet-gpu.yaml
```
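
The contents of `mxnet-gpu.yaml` are not reproduced in this document. As a rough illustration of how a pod usually requests a GPU in Kubernetes, the sketch below prints a minimal manifest with an `nvidia.com/gpu` resource limit. The function name `gpu_pod_manifest`, the `restartPolicy`, and the way the training command is passed in are all assumptions made for this example and may differ from the actual sample file.

```shell
# Print a minimal pod manifest that requests one NVIDIA GPU.
# Usage: gpu_pod_manifest <pod_name> <image> <training_command>
# Hypothetical helper: the real mxnet-gpu.yaml may be structured differently.
gpu_pod_manifest() {
  name="$1"
  image="$2"
  train_cmd="$3"   # assumed to contain no double quotes
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: Never
  containers:
  - name: ${name}
    image: ${image}
    command: ["/bin/sh", "-c", "${train_cmd}"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}
```

The output can be piped to `kubectl create -f -`. The `nvidia.com/gpu` limit only takes effect on nodes where the NVIDIA device plugin is installed, which is what a GPU-enabled cluster is assumed to provide.
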
At this point the pod is running and training has started. You can check the status of the pod by running `kubectl get pod mxnet`.
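
If you would rather wait for the pod than check it by hand, a polling loop along the following lines is one option. It reads the pod's `.status.phase` field through `kubectl get pod -o jsonpath`; the function name `wait_for_pod` and the default retry count and delay are arbitrary choices for this sketch.

```shell
# Poll a pod until it leaves the Pending phase, then print the phase reached.
# Usage: wait_for_pod <pod_name> [max_attempts] [sleep_seconds]
# Returns 0 for Running or Succeeded, 1 for Failed or on timeout.
wait_for_pod() {
  pod="$1"
  attempts="${2:-60}"
  delay="${3:-5}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # .status.phase is one of Pending, Running, Succeeded, Failed, Unknown.
    phase="$(kubectl get pod "$pod" -o jsonpath='{.status.phase}')"
    case "$phase" in
      Running|Succeeded) echo "$phase"; return 0 ;;
      Failed) echo "$phase"; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timeout"
  return 1
}
```

Calling `wait_for_pod mxnet` prints the phase once the pod has started. `Succeeded` is treated as success because a short training run may already have completed by the time the pod is polled.
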

Check the training progress:

 kubectl logs mxnet
 INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='10', model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=20, num_examples=60000, num_layers=None, optimizer='sgd', test_io=0, top_k=0, wd=0.0001)
 DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): yann.lecun.com
 DEBUG:urllib3.connectionpool:http://yann.lecun.com:80 "GET /exdb/mnist/train-labels-idx1-ubyte.gz HTTP/1.1" 200 28881

 . . .

 INFO:root:Epoch[19] Batch [100] Speed: 59255.44 samples/sec accuracy=1.000000
 INFO:root:Epoch[19] Batch [200] Speed: 59155.16 samples/sec accuracy=0.999375
 INFO:root:Epoch[19] Batch [300] Speed: 59269.18 samples/sec accuracy=0.999687
 INFO:root:Epoch[19] Batch [400] Speed: 59127.79 samples/sec accuracy=0.999687
 INFO:root:Epoch[19] Batch [500] Speed: 59136.00 samples/sec accuracy=0.999687
 INFO:root:Epoch[19] Batch [600] Speed: 59191.81 samples/sec accuracy=0.999531
 INFO:root:Epoch[19] Batch [700] Speed: 59202.25 samples/sec accuracy=0.999687
 INFO:root:Epoch[19] Batch [800] Speed: 59169.37 samples/sec accuracy=1.000000
 INFO:root:Epoch[19] Batch [900] Speed: 59283.97 samples/sec accuracy=0.999844
 INFO:root:Epoch[19] Train-accuracy=0.999155
 INFO:root:Epoch[19] Time cost=1.016
 INFO:root:Epoch[19] Validation-accuracy=0.981688
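
To pull just the per-epoch validation accuracy out of a long log, a small filter like the one below may help. It assumes the log lines keep the `Epoch[N] Validation-accuracy=X` shape shown above; the function name `validation_accuracy` is made up for this sketch.

```shell
# Read training log lines on stdin and print "<epoch> <validation accuracy>"
# for every line of the form "Epoch[N] Validation-accuracy=X".
# All other lines are ignored.
validation_accuracy() {
  sed -n 's/.*Epoch\[\([0-9][0-9]*\)] Validation-accuracy=\([0-9.][0-9.]*\).*/\1 \2/p'
}
```

A typical use would be `kubectl logs mxnet | validation_accuracy`, which prints one line per completed epoch.
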
