Scale Up Deep Learning in the Cloud

Deep learning is typically a long and costly endeavour, especially when it comes to training models. There are many factors that impact the process, but processing power, in particular, can make or break your pipeline. Today, many developers leverage graphics processing units (GPUs). Learn how you can scale up deep learning in the cloud. 

GPUs enable you to run many compute operations in parallel. This capability can significantly reduce your model training time. While on-premises GPUs aren’t an option for everyone, there is an increasing number of cloud-based GPU options you can take advantage of.

Who Is Using Deep Learning in the Cloud?

As of Q4 2019, there were 13.3M developers working on data science, machine learning, and AI development worldwide, up from 12.2M a year earlier, based on the findings from the Developer Economics Q4 2019 survey.

While many of these developers are working on more accessible and budget-friendly machine learning (ML) projects, deep learning implementations are also gaining traction. The obvious organizations to look to for this are the cloud providers themselves and other industry giants, including Google, Facebook, and Microsoft. Others include Snapchat, Fermilab, Disney, and Carnegie Mellon University.

However, deep learning in the cloud is also proving beneficial for many smaller organizations that would otherwise not have access to the technology. As larger organizations have increased their adoption, the breadth and availability of services have increased and the cost has gone down. This has paved the way for deep learning models to be used in everything from mobile games to credit check evaluation.

Benefits of Deep Learning in the Cloud

Depending on the scale of your operations, implementing deep learning in the cloud can provide a number of benefits. This is particularly true for teams looking to adopt machine learning operations (MLOps) since pipelines and tooling are often already in the cloud. 

Increased scalability

One of the greatest benefits of using cloud resources for deep learning is scalability. On-premises deployments are limited by local hardware, and scaling them can take significant time. In the cloud, however, you can scale as needed, temporarily provisioning hardware for particularly compute-heavy tasks and scaling down at other times.
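As a rough illustration of scaling on demand, the sketch below estimates how many GPU instances to run for a queue of training jobs and drops back to zero when the queue is empty. The function name, the `gpus_per_instance` default, and the `max_instances` cap are assumptions made for illustration; they do not correspond to any provider's API.

```python
import math


def instances_needed(queued_jobs, gpus_per_job, gpus_per_instance=4, max_instances=10):
    """Return how many GPU instances to run for the current queue (hypothetical policy)."""
    if queued_jobs <= 0:
        return 0  # nothing to train, so scale down completely
    total_gpus = queued_jobs * gpus_per_job
    wanted = math.ceil(total_gpus / gpus_per_instance)
    return min(wanted, max_instances)  # never exceed the assumed budget cap


if __name__ == "__main__":
    for jobs in (0, 3, 100):
        print(jobs, "queued jobs ->", instances_needed(jobs, gpus_per_job=2), "instances")
```

In practice a policy like this would feed whatever autoscaling mechanism your provider offers; the arithmetic is the part that carries over.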

Additionally, cloud resources can provide scalability for hybrid workloads by providing burst capabilities as needed. This enables organizations to extend the value of their on-premises resources while still granting access to more performance.
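The burst pattern can be sketched as a simple split: fill local hardware first, then send the overflow to the cloud. This is a minimal sketch with hypothetical names, assuming the requirement is expressed as a non-negative GPU count.

```python
def split_workload(required_gpus, on_prem_gpus):
    """Split a GPU requirement between local hardware and cloud burst capacity."""
    local = min(required_gpus, on_prem_gpus)       # use what you already own first
    burst = max(0, required_gpus - on_prem_gpus)   # overflow goes to the cloud
    return {"on_prem": local, "cloud_burst": burst}


if __name__ == "__main__":
    print(split_workload(required_gpus=12, on_prem_gpus=8))
    print(split_workload(required_gpus=5, on_prem_gpus=8))
```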

Provider support for tooling

All major cloud providers offer some level of built-in support for existing ML and deep learning tools, including TensorFlow and PyTorch. This enables teams to continue working with the tools they are familiar with without limitations created by OS or infrastructure. 
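One practical consequence is that the same training script can run on a laptop or on a cloud GPU instance. The sketch below shows one common way to do this in PyTorch, assuming PyTorch may or may not be installed; the `pick_device` helper is a hypothetical name, and it falls back to the CPU when no CUDA device (or no PyTorch) is found.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # assumed to be installed on the training machine
    except ImportError:
        return "cpu"  # no PyTorch here, so nothing can run on a GPU
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    device = pick_device()
    print("Training would run on:", device)
    # With PyTorch installed, a model is then moved with model.to(device).
```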

Additionally, some providers offer enhancements for these frameworks, such as pre-built notebooks for faster deployment. These enhancements enable teams to leverage provider tooling or resources to make implementation processes more efficient.

Reduced barrier to entry

Machine learning in general, and deep learning in particular, can require significant expertise and resources to implement. Cloud providers can help lower these barriers by offering pre-built services for developing, training, and testing models. Some providers even offer ML as a service, enabling teams without ML developers to leverage the technologies available.

Additionally, cloud resources can provide an easier entry point for deep learning operations. With cloud resources, you can test out methods and processes before making significant investments in hardware or tooling. You can also start small, and low risk, with cloud resources and scale up to on-premises investments once you better understand your hardware needs. 
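To decide when an on-premises investment starts to pay off, a simple break-even estimate can help. The sketch below is illustrative only: the function name is hypothetical, and the prices in the usage example are made-up numbers, not quotes from any provider.

```python
def break_even_hours(hardware_cost, cloud_rate_per_hour, on_prem_rate_per_hour=0.0):
    """Hours of GPU use after which buying hardware becomes cheaper than renting."""
    saving_per_hour = cloud_rate_per_hour - on_prem_rate_per_hour
    if saving_per_hour <= 0:
        return None  # renting never costs more per hour, so there is no break-even
    return hardware_cost / saving_per_hour


if __name__ == "__main__":
    # Hypothetical figures: a 10,000 purchase, 2.50/hour in the cloud,
    # and 0.50/hour for power and maintenance on-premises.
    print(break_even_hours(10_000, 2.50, 0.50))
```

If your expected usage is well below the break-even figure, staying in the cloud is likely the lower-risk choice.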

GPUs in the Cloud

As cloud providers increase their support and options for deep learning implementations, organisations are beginning to take notice. While there are specialised providers available, the big three are where many organisations, especially those just getting started, should look.


Microsoft Azure

Azure provides several choices for GPU-based instances. All of these instances are designed for high computation tasks, including deep learning, simulations, and visualisations.

In Azure, you can choose from three instance series:

  • NC-series: optimised for compute and network-intensive workloads. These instances can support OpenCL and CUDA-based applications and simulations. GPUs available include the NVIDIA Tesla K80, P100, and V100; host CPUs are Intel Haswell or Broadwell, depending on the generation.
  • NV-series: optimised for visualisations, encoding, streaming, and virtual desktop infrastructures (VDI). These instances support OpenGL and DirectX. GPUs available include the AMD Radeon Instinct MI25 and NVIDIA Tesla M60.
  • ND-series: optimised for deep learning training scenarios and inference. GPUs available include the NVIDIA Tesla P40 and V100; host CPUs are Intel Broadwell or Skylake, depending on the generation.
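The three series above map naturally to a lookup by workload type. The sketch below encodes that mapping; the workload keys and the `suggest_series` helper are hypothetical names chosen for illustration, not part of any Azure SDK.

```python
# Workload categories (hypothetical keys) mapped to the Azure series described above.
AZURE_GPU_SERIES = {
    "compute": "NC-series",
    "visualisation": "NV-series",
    "deep_learning": "ND-series",
}


def suggest_series(workload):
    """Return the Azure GPU series suggested for a workload category."""
    try:
        return AZURE_GPU_SERIES[workload]
    except KeyError:
        known = ", ".join(sorted(AZURE_GPU_SERIES))
        raise ValueError(f"unknown workload {workload!r}; expected one of: {known}")


if __name__ == "__main__":
    print(suggest_series("deep_learning"))
```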


Amazon Web Services (AWS)

AWS provides four instance options, available in multiple sizes. These include EC2 P2, P3, G3, and G4 instances. With these instances, you can choose to access NVIDIA Tesla M60, T4 Tensor Core, K80, or V100 GPUs and can include up to 16 GPUs per instance.

With AWS, you also have the option of using Amazon Elastic Graphics. This service enables you to connect your EC2 instances to a variety of low-cost GPUs. You can attach GPUs to any instance that is compatible for greater workload flexibility. The Elastic Graphics service also provides up to 8GB of memory and supports OpenGL 4.3.
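The EC2 families and GPU models above pair up one to one, which the following sketch makes explicit. The dictionary reflects the commonly documented pairing (P2 with K80, P3 with V100, G3 with M60, G4 with T4); the helper name is hypothetical, and you should confirm current offerings with AWS before relying on it.

```python
# EC2 GPU instance family -> GPU model it provides.
AWS_GPU_BY_FAMILY = {
    "P2": "NVIDIA Tesla K80",
    "P3": "NVIDIA Tesla V100",
    "G3": "NVIDIA Tesla M60",
    "G4": "NVIDIA T4 Tensor Core",
}


def families_with_gpu(keyword):
    """Return the instance families whose GPU name contains the keyword."""
    needle = keyword.lower()
    return sorted(
        family for family, gpu in AWS_GPU_BY_FAMILY.items() if needle in gpu.lower()
    )


if __name__ == "__main__":
    print(families_with_gpu("v100"))
```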

Google Cloud

Although Google Cloud doesn’t offer dedicated instances with GPUs, it does enable you to connect GPUs to existing instances. This works with standard instances and Google Kubernetes Engine (GKE) instances. It also enables you to deploy node pools including GPUs. Support is available for NVIDIA Tesla V100, P4, T4, K80, and P100 GPUs.
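On GKE, a workload typically asks for a GPU through the `nvidia.com/gpu` resource limit and can target a GPU node pool with the `cloud.google.com/gke-accelerator` node label. The sketch below builds such a pod manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The pod name, image, and helper function are hypothetical placeholders.

```python
import json


def gpu_pod_manifest(name, image, gpu_type="nvidia-tesla-t4", gpu_count=1):
    """Build a Kubernetes pod manifest that requests GPUs on a GKE node pool."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Schedule only onto nodes that carry the requested accelerator type.
            "nodeSelector": {"cloud.google.com/gke-accelerator": gpu_type},
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are requested through resource limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                }
            ],
        },
    }


# Hypothetical pod name and image, used only for illustration.
manifest = gpu_pod_manifest("trainer", "example.com/trainer:latest")

if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
```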

Another option in Google Cloud is access to Tensor Processing Units (TPUs). These units are custom chips (ASICs) that Google built for machine learning workloads rather than GPUs. TPUs are designed to quickly perform matrix multiplication and can provide performance similar to Tensor Core enabled Tesla V100 instances. Currently, PyTorch provides partial support for TPUs.
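Matrix multiplication, the operation TPUs are designed to accelerate, can be sketched in plain Python to show where the work comes from. This is an illustrative reference implementation, not how an accelerator computes it: multiplying an m×k matrix by a k×n matrix takes m·k·n multiply-accumulate steps, and accelerators gain their speed by performing many of those steps at once.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions must match")
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def multiply_accumulate_count(rows, inner, cols):
    """Number of multiply-accumulate steps in a (rows x inner) @ (inner x cols) product."""
    return rows * inner * cols


if __name__ == "__main__":
    print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    print(multiply_accumulate_count(2, 2, 2))
```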


Conclusion

There are a number of benefits to using cloud-based GPUs. Perhaps the most popular advantage is the scalability of the cloud. Instead of being limited to local hardware, you can quickly scale up or down without incurring on-premises overhead. You can also leverage cloud vendor support and integrate with popular frameworks such as PyTorch and TensorFlow.

Another popular benefit of the cloud is that many vendors offer resources that can save significant time. For example, you can use cloud AutoML tools to speed up some of your processes and test out methods without investing too much time or money. This also reduces risk, because you can test your hypotheses before committing resources. In short, cloud GPUs enable you to gain a higher level of scalability, save time, and avoid on-premises overhead.

Author Bio: Farhan Munir

With over 12 years of experience in the technical domain, I have witnessed the evolution of many web technologies, as well as the rise of the digital economy. I consider myself a life-long learner, and I love experimenting with new technologies. I embrace challenges with enthusiasm and an outside-the-box mindset. I feel it is important to share your experiences with the rest of the world, in order to pass on knowledge or let other people learn from your mistakes or successes. In my spare time, I like to travel and photograph the world.
