Deep learning on Pyspark

Deep Learning on Pyspark

Using Tensorflow, Docker, GPU acceleration, Pyspark

We will attempt to benchmark Deep learning use Tensorflow on Pyspark. The solution will attempt to simplify the library setup complexity of having Tensorflow, Pyspark and Nvidia GPU to work together. The document will also benchmark Tensorflow on GPU vs CPU only. All the project files can be found on here.


Why deep learning on GPU?

Deep learning is considerably faster on GPU than simply on CPU. This is because GPU hardware units are extremely optimized toward parallel calculations. With most modern GPUs having more than a thousand core, it promises a platform that handles the parallel nature of Neural Networks faster than CPU’s can.


What are the challenges presented by GPU processing?

GPU’s are generally more complicated to setup on machines than CPU’s. An Nvidia driver must be installed and it has to fit the hardware. Then, Cuda library must be added as well; Cuda are the libraries that support GPU calculations. It handles the parallelization required for running calculations on hundreds of cores on one machine. After that, the Deep learning platform must be able to utilize the GPU processing power.


Distributed Deep Learning

While using GPU’s will certainly speed up learning, it is not linearly scalable. At some point we cannot easily add one more GPU to our machine. That is a problem Distributed processing has been working on for quite some time.

The problem with Deep Learning is that most Deep learning algorithms are linear in nature. Meaning we cannot easily distribute learning on multiple machines that will result in performance gain.

A good example of Deep learning algorithms are Neural Networks. They take a complex input, such as an image or an audio recording, and then apply mathematical transforms on these signals. The output of this transform is a vector of numbers that is easier to manipulate by other ML algorithms. Neural Networks rely on early assumption parameters called Hyper Parameters. Those Hyper Parameters greatly increase the prediction power of our model. In order to gain better results from any Neural Network one must experiment with different Hyper parameters, and that is a clearly distributable problem.

The idea is to have Pyspark Train multiple models at the same time on different machines, each with a slightly different initial Hyper parameters. Then at the end we can get the accuracy levels of all the models and select the best one.

If such a solution was to be done on one machine, it would take considerably longer to re-train as it is done sequentially. However on distributed machines, we only wait for models to train once at the same time.


Pyspark and TensorFlow

For this solution, we will use TensorFlow as the Deep Learning Framework and run it on top of Pyspark as a distributed Framework. The code is repurposed from Databricks blog.


Using Docker

In order to simplify the hardware and software dependencies for this solution, we will use Docker images. A docker image will contain all the libraries needed for the solution to work, and such facilitating scaled deployment on clusters.

This Docker file sets up a Docker image that will contain both Pyspark and Tensorflow ready to work together.

FROM is to be used incase of GPU acceleration


Using Nvidia-Docker

Nvidia docker is an Nvidia provided solution that attempts to simplify GPU accelerated Docker Images. It comes pre-installed with Cuda libraries and acts as an abstraction to the hardware. However, it is important to note that the Docker still requires Nvidia driver to be installed; making the abstraction not fully complete.

This supports all the stable releases of docker-ce

Those are the steps taken to install Docker Nvidia and Nvidia driver on Ubuntu 17.04 machine.


What will the application do?

The application will train a Nerual Network against a labeled dataset to detect images. We create a spark RDD that contains 50 potential models. Then, we use function to train all 50 models but with randomized Hyper Parameters. The script will broadcast the training and testing data to all workers before training. We then get a result RDD that contains all the accuracy of all the models with their used Hyper Parameters for us to select the best one.


Benchmark test description

It is quite difficult to attempt to benchmark CPU processing vs GPU processing because they essentially use different hardware. So we will use an 8 Cores machine with 52 GB memory versus a 1 core machine with 20 GB memory and 1 x NVIDIA Tesla K80. The main factor for us will be time it took and cost. The machines are VM instances on Google Cloud Platform. For this benchmarking, we will not run on a cluster, this will run only on a master with no slaves.

We will run the test twice – once with a small training set, and then with a relatively large one. Our metric will be time taken and exactly the same script will run in both configurations.


CPU Benchmark

The machine costs $0.332 per hour.

We first run the script using 10,000 images for training and 3,000 for testing. Here are the results:

The elapsed time is 1067 seconds which is equal to 18 minutes approx. With this we can calculate the cost which is equal to $0.0996.


Now we run against a 60,000 images set with test set of 10,000. Here are the results from the second test:

The elapsed time is 7861 seconds which is 2.18 hours; cost is $0.723.


GPU Benchmark

The machine costs $0.538 per hour.

We first run the script using 10,000 images for training and 3,000 for testing. The results are:

Elapsed time is 204 seconds which is 0,0566667 hours; cost is $0.0305.


Now we run against a 60,000 images set with test set of 10,000:

Elapsed time is 851 seconds which is 0,236389 hours; cost is $0.127177282.


Benchmark Comparison

Small batch of train data (10k images):

  CPU Machine GPU Machine Metric ratio
Cost per hour $0.332 $0.538
Time taken in seconds 1067 204 GPU is 80% faster
Cost of computation $0.0996 $0.0305 GPU is 70% cheaper
Best Model error 5.966% 5.966%
Worst Model error 41% 41%

When using a small batch of train data around 10k images, the GPU is not only 80% faster, but also 70% cheaper. The model accuracy varied widely with a smaller set of data. From a max error of 41% to 5.96%. Thanks to the automatically generated base learning rate, we can still have a decent prediction model even with less data.

Notice that some models failed to successfully converge ending with 100% error on testing. That can happen while training Neural Networks and the script handles that situation by stopping.

Large batch of train data (60k images):

  CPU Machine GPU Machine Metric ratio
Cost per hour $0.332 $0.538
Time taken in seconds 7861 851 GPU is 90% faster
Cost of computation $0.723 $0.127177282 GPU is 83% cheaper
Best Model error 1.20% 1.20%
Worst Model error 4.79% 4.79%


When using larger data to train with, the gap between the performance of GPU vs CPU widens even more. Now the GPU is 90% faster and 83% cheaper. The trend is clear – with more data, GPU will outpace CPU-only training.

The model accuracy did not vary as much as with the small data, but we managed a 4% improvement between the best to worst model error. The final accuracy of 98.8% is very good as well.



We have seen through the benchmark results that GPU has a huge effect on both the speed and the cost of the model training. It effectively reduces the stigma following deep learning that it must take increasingly long time to process and train. However with the ability to easily deploy a cluster with GPU acceleration ready, then simultaneously train multiple models at the same time using Pyspark is an extremely cost effective method. With this Deep learning can be applied to more real-world problems with ease.

Like this article and want to stay updated of more news and events?
Then sign up for our newsletter!

Don't miss out!

Subscribe to our newsletter and stay up to date with our latest articles and events!

Subscribe now

Newsletter Subscription