GPU Accelerated Deep Learning

The buzz around Deep Learning often leads laypeople to think that it is a newly invented technology, so it comes as a shock to them to learn that the foundations of Deep Learning were laid as far back as the 1940s and 1950s. Deep learning has a long history: many of the popular deep neural network architectures and theories were proposed during the latter half of the 20th century. If that is the case, you may ask why the deep learning revolution is taking place now and not a few decades ago.
The short answer is that the hardware and compute power required to train large neural networks efficiently did not exist at the time, so the theories mostly stayed on paper without practical support. In fact, there was a time when researchers working on neural networks were not taken seriously by the machine learning research community. Dedicated researchers continued their work on neural networks, but the field remained a largely impractical body of theory until the latter half of the 2000s, when the hardware revolution started to pick up.


Brief History of Early Use of GPU in Deep Learning

Fig-1 NVIDIA's first GPU, the GeForce 256, in 1999 (Source)

NVIDIA launched the first commercial GPU, the GeForce 256, in 1999, and during the 2000s it positioned itself as the leading innovator of GPU technology for the graphics industry. The GPU, which stands for Graphics Processing Unit, gained popularity among gamers because its parallel processing could render game frames much faster than a CPU, giving a seamless gaming experience. In 2007, NVIDIA released CUDA (Compute Unified Device Architecture), a framework whose API let software programmers perform General-Purpose computing on GPUs (GPGPU) using NVIDIA hardware.

Beyond the traditional use in graphics processing, CUDA allowed engineers and scientists to use GPUs for other work that requires parallel computing, especially tasks that are embarrassingly parallel, meaning they can be split into independent pieces with little or no coordination between them. If you understand the mathematics of neural networks, you will recognize that their matrix operations fall into this embarrassingly parallel category, which makes them a good candidate for GPGPU.

Fig-2 The matrix calculations of a neural network are embarrassingly parallel (Source)
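To make the idea of independent work concrete, here is a minimal Python sketch of a single dense layer computed as a matrix-vector product. The function names, weights, and inputs are made up for illustration. Each output value depends only on one row of the weight matrix and the shared input vector, so the rows can be handed to separate workers in any order. Python threads are used here only to show that independence; they are not a stand-in for GPU performance.

```python
from concurrent.futures import ThreadPoolExecutor


def dot(row, vec):
    """Dot product of one weight row with the input vector."""
    return sum(w * x for w, x in zip(row, vec))


def layer_sequential(weights, inputs):
    """Compute every output neuron one after another."""
    return [dot(row, inputs) for row in weights]


def layer_parallel(weights, inputs, workers=4):
    """Compute each output neuron as an independent task.

    No task reads another task's result, which is what makes the
    computation embarrassingly parallel.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the rows in its results.
        return list(pool.map(lambda row: dot(row, inputs), weights))


# Hypothetical 4x3 weight matrix and 3-element input vector.
weights = [
    [1.0, 2.0, 3.0],
    [0.5, 0.5, 0.5],
    [-1.0, 0.0, 1.0],
    [2.0, -2.0, 2.0],
]
inputs = [1.0, 2.0, 3.0]

sequential_out = layer_sequential(weights, inputs)
parallel_out = layer_parallel(weights, inputs)
print(sequential_out == parallel_out)
```

On a GPU the same pattern is applied at a much finer grain, with many hardware threads each responsible for a small part of the output.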

Kumar Chellapilla’s CNN implementation on a GPU in 2006 is the earliest known attempt to use GPUs for Deep Learning. Stanford professor and Coursera co-founder Andrew Ng is also known as one of the early proponents of using GPUs for deep neural networks, since 2008, and a few other researchers started actively experimenting with GPUs around 2008-2009 with the help of CUDA. However, it was AlexNet, the image classification model that won the 2012 ImageNet Challenge, that proved to be the landmark GPU-accelerated Deep Learning model. It was definitely not the first use of GPUs in deep learning, but the grand stage on which it won gave it cult status and mainstream media attention, triggering the Deep Learning revolution.

GPU vs CPU Architecture

Fig-3 GPU vs CPU Architecture

Let us compare the architectures of the CPU and the GPU to understand why the GPU is better suited to neural network operations.

The first visible difference is that a CPU has only a few cores to perform arithmetic operations, whereas a GPU can have hundreds or even thousands of them. To put this in perspective, a typical well-performing CPU has 8 cores and a powerful CPU such as the Intel Core i9-10980XE has 18 cores, while the monstrous NVIDIA GeForce GTX TITAN Z GPU has 5760 CUDA cores. Such a high number of cores allows a GPU to perform parallel computation far more efficiently and deliver higher throughput than a CPU. A GPU also has higher memory bandwidth than a CPU, which allows it to move a large amount of data to and from memory at once.

Thanks to its high memory bandwidth and parallelism, a GPU can load very large chunks of a neural network's matrices at once and compute on them in parallel. A CPU, on the other hand, would load the numbers largely sequentially, with very limited parallelism from its few cores. This is why, for large deep neural networks with large matrix operations, GPUs can easily outperform CPUs during training.

It should be noted that having so many cores does not make a GPU better than a CPU for all operations. Any operation that cannot be broken down for parallelization is served faster by a CPU because of its low latency. A CPU will therefore complete a chain of sequential floating-point operations faster than a GPU.

The Marvel of Tensor Core

Fig-4 Volta Tensor Core Performance (Source)

With the widespread adoption of GPUs for deep learning, NVIDIA launched the Tesla V100 GPU in 2017, built on the new Volta architecture, which had dedicated cores called Tensor Cores to support specific tensor operations for neural networks. NVIDIA claimed that Volta Tensor Cores could deliver up to 12 times higher throughput than its predecessor's regular CUDA cores.

The fundamental idea was that a Tensor Core is specialized to multiply two 4×4 FP16 matrices and add a 4×4 FP16 or FP32 matrix to the product (FP stands for Floating Point). Such matrix operations are very common in neural networks, hence the advantage of a dedicated Tensor Core optimized to execute them faster than a traditional CUDA core.

Fig-5 Matrix Operation supported by Tensor Core (Source)
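The Tensor Core operation described above can be written as D = A × B + C on 4×4 matrices. The following Python sketch emulates it numerically with the standard library only, assuming FP16 inputs for A and B and FP32 accumulation. The helper names are hypothetical, and the rounding at every step is a simplification chosen for this example. It illustrates the arithmetic only and does not reflect how the hardware is implemented or programmed.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fused_multiply_add_4x4(a, b, c):
    """Emulate D = A x B + C for 4x4 matrices.

    Assumption for this sketch: A and B are rounded to FP16 first,
    and the running sum is kept in FP32, starting from C.
    """
    n = 4
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                acc = to_fp32(acc + a16[i][k] * b16[k][j])
            d[i][j] = acc
    return d


# Made-up inputs: A is the identity, so D should equal B + C.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
c = [[1.0] * 4 for _ in range(4)]

d = fused_multiply_add_4x4(identity, b, c)
for row in d:
    print(row)

# FP16 cannot represent 0.1 exactly, which is the precision trade-off.
print(to_fp16(0.1))
```

Keeping the accumulator in FP32 limits the rounding error that would otherwise build up when many FP16 products are summed.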

The second-generation Tensor Cores in the Turing architecture added support for INT8 and INT4 precision. In 2020, NVIDIA announced the third-generation A100 Tensor Core GPU, based on the Ampere architecture, with support for FP64 and a new precision called TensorFloat-32 (TF32), which works like FP32 and, according to NVIDIA, can deliver up to 20 times more speed without code changes.

Fig-6 Turing Tensor Core Performance (Source)

CUDA and cuDNN for Deep Learning

So far our discussion has focused on the hardware aspect of GPUs; let us now look at how programmers can leverage NVIDIA GPUs for Deep Learning. We touched upon CUDA in an earlier section, where we described it as an API for performing general-purpose computing on GPUs. CUDA has native support for programming languages such as C/C++ and Fortran, along with third-party wrapper support for other languages such as Python, R, MATLAB, and Java.

But mind you, CUDA was launched as a low-level, general-purpose platform, and although the deep learning community started using it, dealing with CUDA's low-level complexity instead of focusing on neural networks was an uphill task. Hence, in 2014 NVIDIA released cuDNN, a dedicated CUDA-based library for Deep Learning that provides functions for primitive neural network operations such as convolution, pooling, and their backward (backpropagation) passes.

Soon, all the well-known deep learning libraries and tools such as PyTorch, TensorFlow, MATLAB, and MXNet started incorporating cuDNN into their frameworks to give their users more seamless CUDA support.

Fig-7 Deep Learning Libraries supporting CUDA

Approach for GPU Acceleration

A GPU can accelerate the Deep Learning pipeline only if it is used judiciously; otherwise it can create a bottleneck. This usually happens when users try to push all of their code to the GPU without considering whether those operations can actually run in parallel there.

As a rule of thumb, only compute-intensive code that can be executed in parallel should be pushed to the GPU, and the remaining sequential code should run on the CPU. For example, code for data cleaning and preprocessing should be executed on the CPU, while the code for neural network training should run on the GPU; only then will you see an overall boost in performance.

Fig-8 GPU Acceleration (Source)
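A rough way to reason about this rule of thumb is an Amdahl's-law style estimate, which is not part of the original discussion and is used here only as an illustration. The sketch below assumes that a fraction of the pipeline's runtime can be parallelized on the GPU, that this part runs a given number of times faster there, and that moving data to and from the GPU costs some extra share of the original runtime. The function name, its parameters, and all the numbers are hypothetical.

```python
def overall_speedup(parallel_fraction, gpu_speedup, transfer_overhead=0.0):
    """Estimate the end-to-end speedup of a pipeline (Amdahl-style).

    parallel_fraction: share of the original runtime that is offloaded
        to the GPU (between 0 and 1).
    gpu_speedup: how many times faster that share runs on the GPU.
    transfer_overhead: extra time spent copying data to and from the
        GPU, expressed as a share of the original runtime.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if gpu_speedup <= 0:
        raise ValueError("gpu_speedup must be positive")
    if transfer_overhead < 0:
        raise ValueError("transfer_overhead must not be negative")

    serial_part = 1.0 - parallel_fraction
    gpu_part = parallel_fraction / gpu_speedup
    return 1.0 / (serial_part + gpu_part + transfer_overhead)


# Made-up scenario 1: training dominates (90% parallelizable work).
training_heavy = overall_speedup(0.9, gpu_speedup=10)

# Made-up scenario 2: mostly sequential preprocessing pushed to the GPU,
# with noticeable data transfer cost. The result drops below 1, meaning
# the pipeline got slower overall.
preprocessing_heavy = overall_speedup(0.2, gpu_speedup=10, transfer_overhead=0.3)

print(round(training_heavy, 2), round(preprocessing_heavy, 2))
```

Under these assumed numbers, the benefit comes almost entirely from the share of work that really is parallel, which is why sequential steps are better left on the CPU.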

GPU vs CPU Benchmarking for Deep Learning

Data scientist Max Woolf ran an experiment in which he trained TensorFlow neural network models on multiple datasets, on a GPU and on several different CPUs on Google Cloud. His results show the expected speed advantage of the GPU over CPUs for deep learning.

Fig-9 ANN trained on MNIST Dataset
Fig-10 CNN trained on MNIST Dataset
Fig-11 Deep CNN+ANN trained on CIFAR-10 Image Dataset


In this article, we learned how the GPU played an important role in reviving the ML community's interest in neural networks and bringing Deep Learning into the mainstream. Although GPUs are helping researchers and big companies do wonders with Deep Learning, they are quite costly and beyond the reach of most hobbyists. There is, however, a free tier of GPUs on Google Colab, with limited availability, on which beginners can try their hand.
