
Easy CI/CD of GPU applications on Google Cloud

Including bare metal, using GitLab and Kubernetes

Are you a data scientist who wants to focus on modelling and coding rather than on setting up a GPU cluster? Then this blog might be interesting for you. We developed an automated pipeline using GitLab and Kubernetes that can run code in two GPU environments, GCP and bare metal, with no need to worry about drivers or Kubernetes cluster creation and deletion. The only thing you have to do is push your code and it runs on a GPU!

Source code for both the custom Docker images and the Kubernetes object definitions can be found here and here, respectively.

Why GPUs?

It is well known that Moore’s law (the number of transistors in a dense integrated circuit doubles about every two years) started to break down around 2010. Several reasons are involved, but it is mainly due to physical limitations. As a consequence, GPUs have overtaken CPUs with regard to mathematical computations, since GPUs are made of thousands of cores.

At the GPU Technology Conference (GTC) in Munich (we were there!), Jensen Huang, CEO of NVIDIA, highlighted that a huge increase in GPU usage is to be expected for ML/AI workloads in the coming years. Chip manufacturers are looking for new markets, probably due to the expected decline in cryptocurrency computation needs, and data (processing) seems to be the new oil.

The next question might be: why are we not massively using GPUs for data science models? They seem economically more attractive than CPUs, as they speed up computations and are cheaper. Well, GPU-based computations are harder to implement than CPU-based ones. They also run into memory issues, and don’t forget that the speed-up comes from parallel computation (thousands of cores) and not all algorithms are easy to parallelize. Moreover, there are far fewer GPU experts than CPU ones (although this might change in the near future).

But if I have to highlight specific problems, it’s all about driver installation and environment configuration. To be honest, it’s hard to get a GPU working. Moreover, drivers, software and even hardware are continuously being upgraded, and a software update can double the speed of your process even on the same hardware. Because of that, updates should be applied quickly. For a humble data scientist, this is a lot of hassle and not fun at all. However, dockerized applications and Kubernetes can help to automate some of these tedious tasks. Kubernetes and Docker are technologies that are literally exploding. We’ve written a lot about them in previous blogs: here, and here we discussed some applications of these technologies.

CI/CD pipeline with GitLab

At Anchormen, we strongly believe in continuous integration (CI) / continuous delivery (CD) pipelines. For simplicity’s sake, we will use Docker and Kubernetes for CD, combined with GitLab CI.

Some months ago (see here), GitLab launched an integration with Google Kubernetes Engine (GKE), claiming that you can “…set up CI/CD and Kubernetes deployment in just a few clicks…”. After some trials we realized that they are right; integration with GKE is easy!

As a follow-up, we implemented a similar pipeline with the added bonus of GPUs. However, we did not use the native integration, in order to stay platform agnostic.

As a (lazy) data scientist, I have the following requirements for the CI/CD pipeline:

  1. I just want to push my code to the repository and it should run. Other hassles such as driver installation and infrastructure setup should be automated.
  2. Before running the code on an expensive GPU cluster, it should be tested on a CPU, just to be sure that the model has no typos and that parameter optimization is working properly.
  3. To avoid any extra expenditure, the GPU cloud cluster should be created automatically and deleted as soon as the model calculations are finished.
  4. The environment should be reproducible and exportable, meaning that it should run on a CPU, a cloud GPU or a bare-metal GPU.
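
These requirements map naturally onto a two-stage GitLab pipeline. As a rough preview, the .gitlab-ci.yml developed in the rest of this post has approximately the following shape (a condensed sketch only, not the actual file; flags and several steps, such as authentication and waiting for the job, are abbreviated, and the complete scripts are discussed step by step below):

stages:
  - test
  - deploy

# Cheap sanity check: run the model for a single epoch on a CPU-only GitLab runner
test:
  stage: test
  image: tensorflow/tensorflow:1.10.0-py3
  script:
    - python code-to-run.py 1

# Full run on a GPU: create a GPU cluster, launch the model as a Kubernetes job,
# collect the logs and delete the cluster again (project/zone flags omitted here)
deploy:
  stage: deploy
  image: google/cloud-sdk
  script:
    - gcloud container clusters create ${GCLOUD_CLUSTER_NAME} --accelerator "type=nvidia-tesla-v100,count=1"
    - kubectl create -f gke-gpu-gitlab-job.yaml
    - kubectl logs --selector=job-name=my-gpu-job
    - gcloud --quiet container clusters delete ${GCLOUD_CLUSTER_NAME}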

To test the last requirement, we deployed our solution in three different scenarios: a CPU-based GitLab runner, a GPU on GKE and a bare-metal GPU. To keep it simple, we first used predefined Docker images with Tensorflow. Then, a customized image was built, which included Keras and R.

MNIST Model on simple Tensorflow CPU and GPU images on GCP

A model for the MNIST dataset is the “hello world” of Deep Learning. For testing purposes, we used this data set and a predefined model as described in the Tensorflow tutorial. For the CPU, we used the Tensorflow 1.10 image from Docker Hub. For the GPU, we used the tensorflow:18.10-py3 image from the NVIDIA GPU Cloud (registration is required, see here).

All the files can be found here and they are described below:

1. The .gitlab-ci.yml file is used by the GitLab Runner to manage the job. This file is self-explanatory and defines two stages, namely test and deploy:

stages:
- test
- deploy

a) in the test stage, code-to-run.py runs for only 1 epoch using the tensorflow/tensorflow:1.10.0-py3 image on the GitLab runner, using a CPU. This stage catches issues that can be solved before the code is launched in a GPU environment. Of course, not all issues can be found in this step; however, it’s good practice to test the code before sending it to the GPU server.

# Test in a CPU, only 1 epoch
test:
  stage: test
  image: tensorflow/tensorflow:1.10.0-py3
  script:
    - python code-to-run.py 1

b) in the deploy stage, kubectl is installed in the google/cloud-sdk image, which contains the gcloud tool required to control the Google Cloud environment.

# Deploy in a GPU and run all epochs
deploy:
  image: google/cloud-sdk
  stage: deploy
  script:
    # Install kubectl from https://gitlab.com/gitlab-examples/kubernetes-deploy
    - curl -L -o /usr/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/latest.txt)/bin/linux/amd64/kubectl; chmod +x /usr/bin/kubectl
    - kubectl version --client

Then, a cluster with a GPU is created on Google Cloud; its credentials are downloaded and the connection with the Kubernetes cluster is tested.

    # Authorization in gcloud, see https://medium.com/@gaforres/publishing-google-cloud-container-registry-images-from-gitlab-ci-23c45356ff0e
    - echo ${GCLOUD_SERVICE_KEY} > gcloud-service-key.json
    - gcloud auth activate-service-account --key-file gcloud-service-key.json || true
    # In case the cluster was already initiated, it should be deleted
    - gcloud --quiet container clusters delete ${GCLOUD_CLUSTER_NAME} --zone ${GCLOUD_ZONE} --project ${GCLOUD_PROJECT} || true
    # Create a cluster on Google Cloud with a GPU
    - |
      gcloud container clusters create ${GCLOUD_CLUSTER_NAME} \
        --project ${GCLOUD_PROJECT} \
        --machine-type "n1-highmem-2" \
        --accelerator "type=nvidia-tesla-v100,count=1" \
        --image-type "UBUNTU" \
        --num-nodes "1" \
        --zone ${GCLOUD_ZONE}
    - gcloud container clusters get-credentials ${GCLOUD_CLUSTER_NAME} --zone ${GCLOUD_ZONE} --project ${GCLOUD_PROJECT}
    - kubectl cluster-info || true

In order to be able to pull images from the NVIDIA registry, a secret containing the connection details is created.

    # Secrets
    - kubectl create secret docker-registry regsecret --docker-server="nvcr.io" --docker-username='$oauthtoken' --docker-password=${DOCKER_REGISTRY_PASSWORD} --docker-email="email@email.com" --namespace=default || true

The code is accessible to the pods as a configmap.

    # Code as configmap
    - kubectl create configmap code-to-run --from-file=code-to-run.py --namespace=default

The NVIDIA drivers are then installed on all the nodes with NVIDIA GPUs, as explained here (see below for more information).

    # Install drivers in all the nodes with Nvidia GPUs
    - kubectl apply -f daemonset.yaml --namespace=kube-system # https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/ubuntu/daemonset.yaml

The code runs as a job so that the GPU resource(s) are released as soon as it is finished.

    # Run the code as a job
    - kubectl create -f gke-gpu-gitlab-job.yaml --namespace=default

When the job is complete, results are printed and the GPU server is deleted.

    # Wait until the code is finished
    - until kubectl get jobs my-gpu-job --namespace=default -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' | grep True ; do sleep 5 ; echo "job in progress"; done;
    # Get results
    - kubectl logs $(kubectl get pods --selector=job-name=my-gpu-job --output=jsonpath={.items..metadata.name} --namespace=default) --namespace=default
    # Shut down the gcloud cluster
    - gcloud --quiet container clusters delete ${GCLOUD_CLUSTER_NAME} --zone ${GCLOUD_ZONE} --project ${GCLOUD_PROJECT} || true

2. The code-to-run.py file contains the Python code to be run.

# From https://www.tensorflow.org/tutorials/
import sys
import tensorflow as tf

epochs = int(sys.argv[1]) if len(sys.argv) > 1 else 5
print(epochs)

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=epochs)
model.evaluate(x_test, y_test)

3. The gke-gpu-gitlab-job.yaml file contains a job description for Kubernetes; see a detailed explanation here. It’s important to highlight that:

a) the NVIDIA private registry was configured using an imagePullSecrets entry (see here).
b) the code was mounted as a volume using a configmap (see here for additional details).
c) in order to use the GPU, we have to request it as a resource; check how you can do that here.

apiVersion: batch/v1
kind: Job
metadata:
  name: my-gpu-job
spec:
  template:
    spec:
      imagePullSecrets:
        - name: regsecret
      containers:
        - name: my-gpu-container
          image: nvcr.io/nvidia/tensorflow:18.10-py3
          command: ["/bin/bash", "-c", "nvidia-smi; nvcc --version; cd /code-to-run && python code-to-run.py"]
          volumeMounts:
            - name: code-to-run
              mountPath: /code-to-run
          # At least 1 gpu is required
          resources:
            limits:
              nvidia.com/gpu: 1
      volumes:
        - name: code-to-run
          configMap:
            name: code-to-run
      restartPolicy: Never

4. daemonset.yaml: the GPU nodes install the drivers using this daemonset. Moreover, any new node with GPUs added to the cluster will be configured automatically (see here for more details).
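
For illustration, such a driver-installer DaemonSet has roughly the following shape. This is a heavily simplified sketch: the installer image name is a placeholder and the real manifest mounts several additional host directories, so in practice the manifest from the GoogleCloudPlatform repository should be used as-is.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-driver-installer
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-driver-installer
  template:
    metadata:
      labels:
        name: nvidia-driver-installer
    spec:
      # Only schedule on GKE nodes that have an NVIDIA GPU attached
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-v100
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      initContainers:
        # Privileged init container that installs the driver on the host and then exits
        - name: nvidia-driver-installer
          image: DRIVER_INSTALLER_IMAGE # placeholder for the Ubuntu driver-installer image
          securityContext:
            privileged: true
          volumeMounts:
            - name: dev
              mountPath: /dev
      containers:
        # Keeps the pod alive once the installation has finished
        - name: pause
          image: gcr.io/google-containers/pause:2.0
      volumes:
        - name: dev
          hostPath:
            path: /dev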

The execution of the model worked smoothly in both environments (CPU and GPU), except for two issues:

  • On GKE, Ubuntu was used as the image type instead of Container-Optimized OS (COS), since the NVIDIA registry certificate was rejected as being signed by an unknown authority. There is probably a way to fix this, but no additional time was spent on it because selecting Ubuntu as the host image type solved the problem.
  • Creating the daemonset directly from the URL failed. The yaml file is included in the repository to avoid this issue.

To summarize, 1 epoch of the MNIST model was run on the CPU for testing and 5 epochs were run on the GPU. The exact same code was used for both scenarios, demonstrating the reproducibility and exportability of the proposed approach.

In general, if the model doesn’t run properly on the CPU, it won’t run in the GPU cloud scenario either, saving you time and money. Moreover, since every step of the process is automated in the CI/CD pipeline, any modification to the code is executed in both scenarios just by pushing the new code to the repository. Finally, the GPU cluster is deleted automatically after the job is completed, whenever that happens; keep in mind that this might take hours, days or weeks in a real scenario.

MNIST Model on bare-metal GPU

The pipeline is similar to the previous one; the only difference is how to get credentials for the bare-metal GPU. To install a Kubernetes cluster on a bare-metal GPU machine, we followed the NVIDIA instructions, here.

All the files can be found here and they are described below:

1. The .gitlab-ci.yml file is used by the GitLab Runner to manage the job. This file is self-explanatory and defines two stages, namely test and deploy:

a) in the test stage, code-to-run.py runs for only 1 epoch using the tensorflow/tensorflow:1.10.0-py3 image, as before.

b) in the deploy stage, we used the same image as before (google/cloud-sdk) with kubectl, for simplicity’s sake; any other image with kubectl installed can be used. To get the credentials, we used a similar strategy as in one of our previous blogs. We decided to create a namespace with the same name as the repository (the CI_PROJECT_NAME variable), since additional projects could be run on the same machine. Be aware that the default service account of this namespace needs enough privileges to create and delete jobs, secrets and configmaps (see here for additional information); a sketch of a Role/RoleBinding granting these rights is shown after this list. The rest of the code is the same as in the previous section, with the exception of the creation/deletion of the cluster.

# Deploy in a GPU and run all epochs
deploy:
  image: google/cloud-sdk
  stage: deploy
  script:
    # Install kubectl from https://gitlab.com/gitlab-examples/kubernetes-deploy
    - curl -L -o /usr/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/latest.txt)/bin/linux/amd64/kubectl && chmod +x /usr/bin/kubectl
    - kubectl version --client
    - echo "${KUBE_CA_PEM}" > kube_ca.pem
    - kubectl config set-cluster default-cluster --server=${KUBE_URL} --certificate-authority="$(pwd)/kube_ca.pem"
    - kubectl config set-credentials default-admin --token=${KUBE_TOKEN}
    - kubectl config set-context default-system --cluster=default-cluster --user=default-admin --namespace=${CI_PROJECT_NAME}
    - kubectl config use-context default-system
    - kubectl cluster-info || true
    # Secrets
    - kubectl delete secret regsecret --namespace=${CI_PROJECT_NAME} || true
    - kubectl create secret docker-registry regsecret --docker-server="nvcr.io" --docker-username='$oauthtoken' --docker-password=${DOCKER_REGISTRY_PASSWORD} --docker-email="email@email.com" --namespace=${CI_PROJECT_NAME} || true
    # Code as configmap
    - kubectl delete configmap code-to-run --namespace=${CI_PROJECT_NAME} || true
    - kubectl create configmap code-to-run --from-file=code-to-run.py --namespace=${CI_PROJECT_NAME}
    # Run the code as a job
    - kubectl delete -f gke-gpu-gitlab-job.yaml --namespace=${CI_PROJECT_NAME} || true
    - kubectl create -f gke-gpu-gitlab-job.yaml --namespace=${CI_PROJECT_NAME}
    # Wait until the code is finished
    - until kubectl get jobs my-gpu-job --namespace=${CI_PROJECT_NAME} -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' | grep True ; do sleep 5 ; echo "job in progress"; done;
    # Get results
    - kubectl logs $(kubectl get pods --selector=job-name=my-gpu-job --output=jsonpath={.items..metadata.name} --namespace=${CI_PROJECT_NAME}) --namespace=${CI_PROJECT_NAME}

c) the code-to-run.py file contains the Python code to be run, as before.
d) gke-gpu-gitlab-job.yaml contains a job description for Kubernetes, as before.
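
For reference, the privileges mentioned in step b) could be granted with a Role and RoleBinding along the following lines. This is a minimal sketch under our own assumptions: the namespace name (my-project), the role name and the exact list of verbs are illustrative, so adapt them to your cluster policies.

# Rights needed by the pipeline in the project namespace (assumed to be my-project)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gitlab-ci-deployer
  namespace: my-project
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "delete", "get", "list", "watch"]
  - apiGroups: [""]
    resources: ["secrets", "configmaps"]
    verbs: ["create", "delete", "get", "list"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list"]
---
# Bind the role to the default service account of that namespace,
# which is the account used by the kubectl commands in the deploy stage
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gitlab-ci-deployer
  namespace: my-project
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gitlab-ci-deployer
subjects:
  - kind: ServiceAccount
    name: default
    namespace: my-project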

MNIST Model on more complex Tensorflow CPU and GPU images including Keras and R on GCP

We then carried out a more complex example. The code is in R and needs the Keras library, so both R and Keras were installed on top of the Tensorflow image. This example shows how to modify pre-existing Docker images to install the dependencies that we need, using an automated CI/CD pipeline. Moreover, we used only one Dockerfile for both the CPU and GPU environments, to avoid maintaining two separate Dockerfiles.

All the files can be found here and they are described below:

1. The .gitlab-ci.yml file is used by the GitLab Runner to manage the job. This file is self-explanatory and defines three stages – build, test and deploy:

a) in the build stage, the images are generated and pushed to Docker Hub. Since we want to maintain only one Dockerfile for both the CPU and GPU environments, only the base image differs. For the CPU environment, tensorflow/tensorflow:1.10.0-py3 was used; for the GPU environment, we used the nvcr.io/nvidia/tensorflow:18.10-py3 image. Since the NVIDIA registry is private, we need to log in before pulling the Tensorflow image for the GPU. After both images were built, they were pushed to the Docker Hub repository. The code was NOT included in the images; it is mounted as a configmap in the deploy stage. In this way, we keep our code private.

# Build images for CPU and GPU of tensorflow using the same Dockerfile by just changing the FROM field
image_build:
  stage: build
  image: docker:stable
  script:
    # Git is needed to reset the Dockerfile
    - apk update && apk add --no-cache bash git
    # Initial images to install R and Keras
    - export IMAGE_TENSOR_CPU='tensorflow/tensorflow:1.10.0-py3'
    - export IMAGE_TENSOR_GPU='nvcr.io/nvidia/tensorflow:18.10-py3'
    # The name of the branch will be used as version name, except master, which will be version latest
    - export VERSION=$(echo $CI_COMMIT_REF_NAME | sed 's,.*/,,g')
    - |
      if [ "$VERSION" == "master" ] ; then
        export VERSION=latest
      fi
    # Login to Docker Hub to push the CPU image
    - docker info
    - docker login -u ${DOCKER_HUB_USER} -p ${DOCKER_HUB_PASSWORD}
    - sed -i "s#IMAGE_NAME#${IMAGE_TENSOR_CPU}#g" Dockerfile
    - docker build -t ${DOCKER_HUB_USER}/tensorflow-r:$VERSION .
    - docker push ${DOCKER_HUB_USER}/tensorflow-r:$VERSION
    # Login to the NVIDIA registry to pull a tensorflow image for GPU
    - docker login -u '$oauthtoken' -p ${DOCKER_REGISTRY_PASSWORD} nvcr.io
    - git checkout -- Dockerfile
    # Change the initial image for GPU compatibility, keeping the rest of the Dockerfile
    - sed -i "s#IMAGE_NAME#${IMAGE_TENSOR_GPU}#g" Dockerfile
    - docker build -t ${DOCKER_HUB_USER}/tensorflow-nvidia-r:$VERSION .
    # Login to Docker Hub to push the GPU image
    - docker login -u ${DOCKER_HUB_USER} -p ${DOCKER_HUB_PASSWORD}
    - docker push ${DOCKER_HUB_USER}/tensorflow-nvidia-r:$VERSION

b) in the test stage, code-to-run.R runs for only 1 epoch using the tensorflow-r image built in the previous stage.

# Test in a CPU, only 1 epoch
test:
  stage: test
  image: ${DOCKER_HUB_USER}/tensorflow-r
  script:
    - Rscript code-to-run.R 1

c) the deploy stage is the same as in the simple Tensorflow example on GCP (see above).

2. The code-to-run.R file contains the R code to be run.

args <- commandArgs(trailingOnly = TRUE)
# From https://www.analyticsvidhya.com/blog/2017/06/getting-started-with-deep-learning-using-keras-in-r/
# loading keras library
library(keras)

# loading the keras inbuilt mnist dataset
data <- dataset_mnist()

# separating train and test file
train_x <- data$train$x
train_y <- data$train$y
test_x <- data$test$x
test_y <- data$test$y
rm(data)

# converting a 2D array into a 1D array for feeding into the MLP and normalising the matrix
train_x <- array(train_x, dim = c(dim(train_x)[1], prod(dim(train_x)[-1]))) / 255
test_x <- array(test_x, dim = c(dim(test_x)[1], prod(dim(test_x)[-1]))) / 255

# converting the target variable to one-hot encoded vectors using the keras inbuilt function
train_y <- to_categorical(train_y, 10)
test_y <- to_categorical(test_y, 10)

# defining a keras sequential model with 1 input layer [784 neurons],
# 1 hidden layer [784 neurons] with dropout rate 0.4 and
# 1 output layer [10 neurons], i.e. the number of digits from 0 to 9
model <- keras_model_sequential()
model %>%
  layer_dense(units = 784, input_shape = 784) %>%
  layer_dropout(rate = 0.4) %>%
  layer_activation(activation = 'relu') %>%
  layer_dense(units = 10) %>%
  layer_activation(activation = 'softmax')

# compiling the defined model with metric = accuracy and optimiser as adam
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)

# fitting the model on the training dataset
# (the number of epochs is passed as the first command line argument)
model %>% fit(train_x, train_y, epochs = as.integer(args[1]))

# evaluating the model on the cross validation dataset
loss_and_metrics <- model %>% evaluate(test_x, test_y)
loss_and_metrics

3. The gke-gpu-gitlab-job.yaml file contains a job description for Kubernetes.

apiVersion: batch/v1
kind: Job
metadata:
  name: my-gpu-job
spec:
  template:
    spec:
      imagePullSecrets:
        - name: regsecret
      containers:
        - name: my-gpu-container
          image: angelsevillacamins/tensorflow-nvidia-r
          command: ["/bin/bash", "-c", "nvidia-smi; nvcc --version; cd /code-to-run && Rscript code-to-run.R 5"]
          volumeMounts:
            - name: code-to-run
              mountPath: /code-to-run
          # At least 1 gpu is required
          resources:
            limits:
              nvidia.com/gpu: 1
      volumes:
        - name: code-to-run
          configMap:
            name: code-to-run
      restartPolicy: Never

4. The daemonset is the same as in the simple Tensorflow example on GCP (see above).

5. The Dockerfile generates a new image with R and Keras, starting from a different base image depending on the environment. Additional packages/libraries are installed as well. The R version was fixed to 3.5.1, whereas the latest versions of the tensorflow and keras R libraries were installed.

GitLab CI/CD and Configuration Variables

The main variables used are passwords and connection details for the NVIDIA registry, Docker Hub and GCP, saved as GitLab CI/CD secret variables. They are then passed into the Kubernetes cluster as secrets via environment variables. Therefore, these sensitive values are stored only in GitLab.

Configuration parameters in the GitLab repository for the simple Tensorflow CPU and GPU images on GCP

      • [DOCKER_REGISTRY_PASSWORD]: For the GPU environment, images are pulled from the NVIDIA registry (nvcr.io). In order to access the NVIDIA registry, a valid password should be included in this variable (check here).
      • [GCLOUD_CLUSTER_NAME]: Name of the GPU cluster to be created with gcloud, for example my-gpu-cluster.
      • [GCLOUD_PROJECT]: Name of the project in which the GPU cluster should be created, for example plated-complex-212323.
      • [GCLOUD_SERVICE_KEY]: JSON Service Key Secret for login to GCP using gcloud (see here).
      • [GCLOUD_ZONE]: GCP zone where the GPU cluster should be created, for example us-central1-a.

Configuration parameters in the GitLab repository for the simple Tensorflow CPU and GPU images on bare metal

      • [DOCKER_REGISTRY_PASSWORD]: For the GPU environment, images are pulled from the NVIDIA registry (nvcr.io). In order to access the NVIDIA registry, a valid password should be included in this variable (check here).
      • [KUBE_CA_PEM]: Kubernetes CA Certificate to access the bare-metal GPU cluster. See our previous blog here for additional details.
      • [KUBE_TOKEN]: Service Account Token to access the bare-metal GPU cluster. See our previous blog here for additional details.
      • [KUBE_URL]: URL of the bare-metal GPU cluster. See our previous blog here for additional details.

Configuration parameters in the GitLab repository for the more complex Tensorflow CPU and GPU images including Keras and R on GCP

The same as for the simple Tensorflow CPU and GPU images on GCP, plus the following:

      • [DOCKER_HUB_PASSWORD]: In the build stage, images are pushed to Docker Hub. Fill in the password of the Docker Hub account to which the images should be pushed.
      • [DOCKER_HUB_USER]: Fill in the user of the Docker Hub account to which the images should be pushed.

Conclusion

In this blog, we set up a pipeline that automatically runs a Deep Learning model in Python for the MNIST data set on different GPU environments. Neither a manual installation of the NVIDIA drivers nor a manual Kubernetes cluster setup was needed. From the perspective of a data scientist, only the code changes have to be pushed to the repository; every other step related to the GPU infrastructure is automated. Moreover, the code is tested on a CPU before being launched on the GPU. Two GPU scenarios were tested successfully: Google Kubernetes Engine and bare metal. Furthermore, user-defined images including R and Keras were built, and a model for the MNIST data set written in R was also tested successfully. The code used for this blog is available here, and the custom Docker images can be found on Docker Hub (here).

As also stated in our previous blog, we skipped Kubernetes container readiness and liveness probes, as well as securing access to the Kubernetes cluster and the provisioned services. Take these into consideration if you use this approach in a production environment.
