Training Pipeline

This document gives a thorough introduction to the training pipeline used for our participation in the NIPS Adversarial Vision Challenge.

Choosing the Infrastructure: Google Cloud Platform

Google Cloud Platform (GCP) was chosen as the underlying infrastructure (#9). In the beginning there was some discussion about whether to use Google's ML Engine (Machine Learning Engine), but it turned out that for our training purposes ML Engine was

  1. too complicated,
  2. bound to the initial conditions of GCP, and
  3. offering less freedom in development.

Therefore, the goal was set to build a bare-metal solution with its own training pipeline, using dedicated GPUs and VMs on Google Compute Engine (GCE).

Overview: Training Pipeline

The training pipeline is a composition of different DevOps (Development and Operations) tools and GCP infrastructure products. The main idea behind this was to minimize manual intervention and to automate the gathering of results.

In general the training pipeline is built on GitHub and GCP. While GitHub serves the project's source code, GCP is used to compile and package the source code and to run training sessions. The training pipeline thus follows the idea of using git tags to initiate the whole training process: developers can tag a commit at any point of development, which starts the build pipeline. On GCP several steps are then executed, such as building and pushing a Docker image, creating and configuring a VM instance, deploying the image and finally starting the training.

Build Trigger

First of all, a build trigger has to be set up on GCP. A build trigger is assigned to a specific git repository and executes a build when user-specified conditions are met:

  • a tag or commit becomes available in a given branch
  • the tag or branch matches a specific naming scheme

Furthermore, the user specifies the path to the build configuration, which defines the steps to be executed in the pipeline. When a build trigger is set up, GCP registers a webhook on GitHub in order to be notified about any changes in the repository.

In the current training pipeline, the following build trigger executes a build:

  • any given tag in all branches

From this it follows that whenever a developer tags a commit, a build is initiated. As mentioned before, the build configuration is the central place to chain steps together and define the pipeline; it is located in the repository. Furthermore, GCP mirrors the upstream repository on GCP to reduce the initial overhead of starting a build.
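
For example, assuming a model living in models/resnet-baseline/ (a hypothetical name), a training run could be started as follows:

# Tag the current commit; the tag name encodes the model folder and a version.
git tag resnet-baseline-0.3

# Pushing the tag notifies GCP through the GitHub webhook and starts the build.
git push origin resnet-baseline-0.3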

Build Configuration

Like all other deployment-related (i.e. training-pipeline-related) files, the build configuration for the training pipeline is located under deployment/ and is called deployment.json. Many CI/CD tools such as Jenkins or Travis use their own specification for the build configuration, and so does GCP.

At the time of writing, the following build configuration is in use:

{
    "steps": [
        {
            "name": "ubuntu",
            "args": [
                "bash",
                "deployment/build_image.sh",
                "$PROJECT_ID",
                "$TAG_NAME",
                "nips-2018-adversarial-vision-challenge-data"
            ]
        },
        {
            "name":"gcr.io/cloud-builders/gcloud",
            "args": [
                "kms",
                "decrypt",
                "--ciphertext-file=deployment/cloudbuild-service-account.json.enc",
                "--plaintext-file=deployment/cloudbuild-service-account.json",
                "--location=global",
                "--keyring=nips-2018-challenge-keyring",
                "--key=nips-2018-challenge-key"
            ]
        },
        {
            "name":"gcr.io/cloud-builders/gcloud",
            "args": [
                "kms",
                "decrypt",
                "--ciphertext-file=deployment/nips-cloudbuilder.enc",
                "--plaintext-file=deployment/nips-cloudbuilder",
                "--location=global",
                "--keyring=nips-2018-challenge-keyring",
                "--key=nips-2018-challenge-key"
            ]
        },
        {
            "name": "ubuntu",
            "args": [
                "bash",
                "deployment/create_gce_instance.sh",
                "$PROJECT_ID",
                "$TAG_NAME"
            ] 
        }
    ],
    "timeout": "1500s"
}

Each step consists of a name and a list of arguments args. In GCP the name specifies the toolset to be used. For example, the first build step uses Ubuntu as its toolset, which makes it possible to run Linux commands.

While all build steps are run sequentially on the same VM, each build step is strictly isolated from the others. This means, for example, that one is not able to set and share environment variables across build steps. The file system, however, is shared: GCP clones the repository into the VM and steps can access files modified by previous steps.

Furthermore there are global environment variables set by GCP, which can be directly consumed in steps, such as:

  • PROJECT_ID - the GCP project id
  • TAG_NAME - the pushed tag

If any of the steps returns an exit code other than zero, the build fails and stops immediately.

Step 1: Building the Docker Image

In the first step, ubuntu is used to execute the bash script that builds the Docker image. This Docker image later contains all tools, configuration and the model required to start and trace the training process. The following arguments are passed to the build step:

  • the name of the bucket (here: nips-2018-adversarial-vision-challenge-data), which is later read by the model to output summaries and logs
  • PROJECT_ID - used in the bash script to construct the Google Container Registry URI
  • TAG_NAME - used to locate the Dockerfile, to name and tag the Docker image, to name the training instance on Google Compute Engine and to name the output folder for summaries and logs on Google Cloud Storage

The TAG_NAME must have the following structure:

<TAG_NAME> ::= <FOLDER_NAME>-<VERSION>
<VERSION> ::= [0-9.]*$
<FOLDER_NAME> ::= [a-zA-Z-]*
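
For illustration, a hypothetical tag resnet-baseline-0.3 splits at the last dash into folder name and version; a minimal bash sketch:

TAG_NAME="resnet-baseline-0.3"

# Version: everything after the last dash (digits and dots only).
VERSION="${TAG_NAME##*-}"      # -> 0.3

# Folder name: everything before the last dash, i.e. the directory under models/.
FOLDER_NAME="${TAG_NAME%-*}"   # -> resnet-baseline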

Since a Docker image can only be built from a Dockerfile, one has to provide the Dockerfile accordingly. In our setup each model has its own Dockerfile, which should look more or less like this:

FROM gcr.io/nips-2018-207619/nips-tensorflow-base-image:latest

ARG MODEL_ID_ARG
ARG BUCKET_NAME_ARG

# persist the args across images
ENV MODEL_ID ${MODEL_ID_ARG}
ENV BUCKET_NAME ${BUCKET_NAME_ARG}

WORKDIR /opt/app

# install python requirements
COPY requirements.txt /opt/app
RUN pip install -r requirements.txt

# copy python project
COPY . /opt/app

# mount bucket and start training
RUN chmod +x ./start.sh

ENTRYPOINT ["/bin/bash", "-c", "./start.sh"]

The Dockerfile is located using the FOLDER_NAME specified in the tag, which corresponds to the model's folder under models/. A closer look at the Dockerfile shows that at this step the model is copied into the Docker image and the entrypoint is set to a bash script called start.sh. Everything else, such as the CUDA Toolkit, TensorFlow-GPU, access to Google Cloud Storage or Tiny-Image-Net, has already been baked into nips-tensorflow-base-image:latest, leading to high cohesion and low coupling.
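
start.sh itself is not shown in this document. A minimal sketch of what such a script could look like, assuming the bucket is mounted with gcsfuse and train.py is the model's entry point (both are assumptions, not the actual implementation):

#!/bin/bash
set -e

# Mount the GCS bucket so summaries and logs end up in central storage
# (the availability of gcsfuse in the base image is an assumption).
mkdir -p "/mnt/${BUCKET_NAME}"
gcsfuse "${BUCKET_NAME}" "/mnt/${BUCKET_NAME}"

# Start the training; train.py and its flags are placeholders.
python train.py --model-id "${MODEL_ID}" --bucket "${BUCKET_NAME}"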

After a successful build, the Docker image is pushed to GCR as gcr.io/PROJECT_ID/FOLDER_NAME:VERSION.
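
deployment/build_image.sh is not reproduced here; a rough sketch of the build and push it performs, following the argument order from the build configuration (the exact docker invocation and the MODEL_ID value are assumptions):

#!/bin/bash
set -e

PROJECT_ID="$1"     # e.g. nips-2018-207619
TAG_NAME="$2"       # e.g. resnet-baseline-0.3
BUCKET_NAME="$3"    # nips-2018-adversarial-vision-challenge-data

# Split the tag into folder name and version as described above.
VERSION="${TAG_NAME##*-}"
FOLDER_NAME="${TAG_NAME%-*}"
IMAGE="gcr.io/${PROJECT_ID}/${FOLDER_NAME}:${VERSION}"

# Build the image from the model's own Dockerfile, passing the build args
# consumed by ARG MODEL_ID_ARG / ARG BUCKET_NAME_ARG.
docker build \
  --build-arg MODEL_ID_ARG="${TAG_NAME}" \
  --build-arg BUCKET_NAME_ARG="${BUCKET_NAME}" \
  -t "${IMAGE}" "models/${FOLDER_NAME}/"

# Push the image to Google Container Registry.
docker push "${IMAGE}"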

The image can now be consumed on any VM with a GPU that has the following preinstalled:

  • Docker - preconfigured for GCR
  • NVIDIA Driver - for accessing the GPU
  • NVIDIA Docker Runtime - for making GPU accessible through containers
  • CUDA Toolkit - required by TensorFlow

Since configuring such a VM inside the build pipeline would take a serious amount of time, we baked our own Ubuntu image with those tools preinstalled. It is called nips-2018-adversarial-visionchallenge-base-image-large and is accessible from Google Compute Engine > Images.
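
On such a machine the freshly pushed image can then be started roughly as follows (the image name reuses the hypothetical tag from above; --runtime=nvidia refers to the NVIDIA Docker runtime from the list):

# Run the container with the NVIDIA runtime so the GPU is visible inside it;
# the image's entrypoint starts the training via start.sh.
docker run --runtime=nvidia --rm gcr.io/nips-2018-207619/resnet-baseline:0.3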

Steps 2 and 3: Decrypting GCP Credentials

While pushing Docker images to GCR does not require additional permissions from the pipeline, access to Google Compute Engine does. Since the Docker image is to be deployed on its own VM instance with a dedicated GPU, access to the GCE API is required.

In GCP one is able to generate so-called Service Accounts for specific APIs and to define explicit permissions for each Service Account. Furthermore, one can add project-wide SSH keys, which are inherited by every instance created in the GCP project.

In these steps cloudbuild-service-account.json.enc (for accessing GCE from Terraform) and nips-cloudbuilder.enc (for accessing the created GCE instance) are decrypted. Since both files are only committed to the repository in their encrypted state, this does not represent a security issue.
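
For reference, the encrypted files committed to the repository can be produced with the mirror command of the decrypt steps above, shown here for the service-account key:

gcloud kms encrypt \
  --plaintext-file=deployment/cloudbuild-service-account.json \
  --ciphertext-file=deployment/cloudbuild-service-account.json.enc \
  --location=global \
  --keyring=nips-2018-challenge-keyring \
  --key=nips-2018-challenge-key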

Step 4: Creating the GCE Instance and Starting the Training Process

This step uses ubuntu and executes a bash script (deployment/create_gce_instance.sh), which internally runs a series of commands and tools (a condensed sketch of the script follows below):

  1. Installs Terraform on the VM of the build pipeline.
  2. Executes a Terraform resource plan (deployment/terraform.tf), which deploys a VM instance on GCP with 6 GB of RAM, a dedicated NVIDIA Tesla K80, 2 cores and the base image nips-2018-adversarial-visionchallenge-base-image-large. The GCE instance is named <FOLDER_NAME>-<VERSION>, where every . is replaced by - due to GCP naming restrictions.
  3. Once the VM instance is created, Ansible is installed in the build pipeline.
  4. Afterwards deployment/configure_gce_instance.sh is run with the IP of the freshly created VM along with the name of the previously pushed Docker image.
  5. Ansible then logs in to GCR on the target VM instance and runs the Docker image, which starts the training.

Terraform: a tool for interacting with infrastructure APIs such as GCP's in order to create, delete and manage infrastructure resources. Resource plans define the infrastructure to be provisioned as code.

Ansible: a tool for configuring a whole fleet of machines on the application level at once. So-called playbooks define a series of commands to be run on the target machines.
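
A condensed sketch of deployment/create_gce_instance.sh under these assumptions (the Terraform version, variable names and output names are illustrative, not the actual script):

#!/bin/bash
set -e

PROJECT_ID="$1"
TAG_NAME="$2"

# GCE instance names may not contain dots, so they are replaced with dashes.
INSTANCE_NAME="${TAG_NAME//./-}"

# 1. Install Terraform inside the build step's VM.
apt-get update && apt-get install -y curl unzip
curl -sLo terraform.zip https://releases.hashicorp.com/terraform/0.11.8/terraform_0.11.8_linux_amd64.zip
unzip terraform.zip -d /usr/local/bin

# 2. Create the GPU instance from the resource plan.
cd deployment
terraform init
terraform apply -auto-approve -var "instance_name=${INSTANCE_NAME}"

# 3. Install Ansible in the build pipeline.
apt-get install -y ansible

# 4. Configure the new instance and start the training
#    (the Terraform output name is an assumption).
INSTANCE_IP=$(terraform output instance_ip)
bash configure_gce_instance.sh "${INSTANCE_IP}" "gcr.io/${PROJECT_ID}/${TAG_NAME%-*}:${TAG_NAME##*-}"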

Training Process and TensorBoard

Since the model can read the bucket name from the environment variable passed into the Docker image, it is capable of outputting all logs directly to a centralised GCS bucket during training. Logs of each training run can be found under nips-2018-adversarial-vision-challenge-data/model_data/MODEL_NAME-VERSION. Currently no TensorBoard runs inside the training container, because we have a running instance at tensorboard.buzzle.io:6006. This instance has access to the bucket and can read all previous as well as multiple parallel training runs at once.
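
Since TensorBoard can read event files directly from GCS, the central instance only needs to be pointed at the bucket; a sketch of how it might be started (port and path follow the conventions above):

tensorboard --logdir gs://nips-2018-adversarial-vision-challenge-data/model_data --port 6006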

Once training is finished, the container quits. Currently there is no mechanism to remove the VM instance automatically! This means that one has to delete the VM manually to prevent further costs! This issue will be addressed as soon as possible.
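
Until automatic teardown is implemented, the instance can be deleted manually, for example (instance name and zone are only examples):

gcloud compute instances delete resnet-baseline-0-3 --zone europe-west1-b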