Scale Up Deep Learning in the Cloud

Deep learning is typically a long and costly endeavour, especially when it comes to training models. There are many factors that impact the process, but processing power, in particular, can make or break your pipeline. Today, many developers leverage graphics processing units (GPUs). Learn how you can scale up deep learning in the cloud. 

GPUs enable you to run simultaneous compute operations. This capability can significantly speed up your model training time. While on-premise GPUs aren’t an option for everyone, there is an increasing number of cloud-based GPUs options you can take advantage of.

Who Is Using Deep Learning in the Cloud?

As of Q4 2019, there are 13.3M developers working on data science, machine learning and AI development worldwide, up from 12.2M a year ago,  based on the findings from the Developer Economics Q4 2019 survey.

Click here to help us update the figure for 2020 – take part in our latest Developer Economics Q2 2020 survey live now.

Developer Economics survey - speak out and win prizes.

While many of these developers are working on more accessible, and budget friendly, machine learning (ML) projects, deep learning implementations are also gaining traction. The obvious organizations to look to for this are the cloud providers themselves and other industry giants, including Google, Facebook, and Microsoft. Others include Snapchat, Fermilab, Disney, and Carnegie Mellon University.

However, deep learning in the cloud is also proving beneficial for many smaller organizations that would otherwise not have access to the technology. As larger organizations have increased their adoption, the breadth and availability of services has increased and the cost has gone down. This has paved the way for deep learning models to be used in everything from mobile games to evaluating credit-checks.

Benefits of Deep Learning in the Cloud

Depending on the scale of your operations, implementing deep learning in the cloud can provide a number of benefits. This is particularly true for teams looking to adopt machine learning operations (MLOps) since pipelines and tooling are often already in the cloud. 

Increased scalability

One of the greatest benefits of using cloud resources for deep learning is the scalability that is possible. On-premises deployments are limited by local hardware and scaling can take significant time. In the cloud, however, you can scale as needed, temporarily provisioning hardware for particularly compute heavy tasks and scaling down during other times.

Additionally, cloud resources can provide scalability for hybrid workloads by providing burst capabilities as needed. This enables organizations to extend the value of their on-premises resources while still granting access to more performance.

Provider support for tooling

All major cloud providers offer some level of built-in support for existing ML and deep learning tools, including TensorFlow and PyTorch. This enables teams to continue working with the tools they are familiar with without limitations created by OS or infrastructure. 

Additionally, some providers offer enhancements for these frameworks. For example, pre-crafted notebooks for faster deployment. These enhancements enable teams to leverage provider tooling or resources to make implementation processes more efficient. 

Reduced barrier to entry

Machine learning in general and deep learning in particular, can require significant expertise and resources to implement. Cloud providers can help lower these barriers by offering pre-built services for developing, training, testing models. Some providers even offer ML as a service, enabling teams without ML developers to leverage the technologies available. 

Additionally, cloud resources can provide an easier entry point for deep learning operations. With cloud resources, you can test out methods and processes before making significant investments in hardware or tooling. You can also start small, and low risk, with cloud resources and scale up to on-premises investments once you better understand your hardware needs. 

GPUs in the Cloud

As cloud providers increase their support and options for deep learning implementations, organisations are beginning to take notice. While there are specialised providers available, the big three are where many organisations, especially those just getting started, should look.


Azure provides several choices for GPU-based instances. All of these instances are designed for high computation tasks, including deep learning, simulations, and visualisations.

In Azure, you can choose from three instance series:

  • NC-series—optimised for compute and network-intensive workloads. These instances can support OpenCL and CUDA-based applications and simulations. GPUs available include the NVIDIA Tesla V100, the Intel Broadwell, and the Intel HaswellGPUs.
  • NV-series—optimised for visualisations, encoding, streaming and virtual desktop infrastructures (VDI). These instances support OpenGL and DirectX. GPUs available include the AMD Radeon Instinct MI25 and NVIDIA Tesla M60 GPUs. 
  • ND-series—optimised for deep learning training scenarios and inference. GPUs available include the NVIDIA Tesla P40, Intel Skylake, and Intel Broadwell GPUs. 


AWS provides four instance options, available in multiple sizes. These include EC2 P2, P3, G3, and G4 instances. With these instances, you can choose to access NVIDIA Tesla M60, T4 Tensor, K80, or V100 GPUs and can include up to 16 GPUs per instance.

With AWS, you also have the option of using Amazon Elastic Graphics. This service enables you to connect your EC2 instances to a variety of low-cost GPUs. You can attach GPUs to any instance that is compatible for greater workload flexibility. The Elastic Graphics service also provides up to 8GB of memory and supports OpenGL 4.3.

Google Cloud

Although Google Cloud doesn’t offer dedicated instances with GPUs, it does enable you to connect GPUs to existing instances. This works with standard instances and Google Kubernetes Engine (GKE) instances. It also enables you to deploy node pools including GPUs. Support is available for NVIDIA Tesla V100, P4, T4, K80, and P100 GPUs.

Another option in Google Cloud is access to TensorFlow processing units (TPUs). These units are made of multiple GPUs. TPUs are designed to quickly perform matrix multiplication and can provide performance similar to Tensor Core enabled Tesla V100 instances. Currently, PyTorch provides partial support for TPUs.


There are a number of benefits to using cloud-based GPUs. Perhaps the most popular advantage is the scalability of the cloud. Instead of being limited to local hardware, you can quickly scale up or down without incurring on-prem overhead. You can also leverage cloud vendor support, and integrate with popular frameworks such as PyTorch and TensorFlow. 

Another popular benefit of cloud is that many vendors offer resources that can significantly save time. For example, you can use cloud AutoML tools to speed up some of your processes, and test out methods without investing too much time and costs. In this case, you also reduce risk by testing out your hypothesis. In short, cloud GPUs enable you to gain a higher level of scalability, save time, and avoid on-prem overhead.

Author Bio: Farhan Munir

With over 12 years of experience in the technical domain, I have witnessed the evolution of many web technologies, as well as the rise of the digital economy. I consider myself a life-long learner, and I love experimenting with new technologies. I embrace challenges with enthusiasm and outside-of-the-box mindset. I feel it is important to share your experiences with the rest of the world – in order to pass on the knowledge or let other folks learn from your mistakes or successes. In my spare time, I like to travel and photograph the world. YouTube

Are you using cloud GPUs in your development? Take our survey and share your experiences.


The TechGig Engineering Tech Stack

TechGig is a technology platform to Attract, Engage and Hire top tech talent. Companies can source talent from a growing community of 4 million developers and leverage the power of automated real-time assessments and data-driven decision support.

Deciding Tech Stack is the most important decision you need to take while creating any Tech Product. The important criterion we have used while deciding technology stack are developer ecosystem, agility, scalability, speed and performance. Every product has unique requirements and you need to select technology stack matching your requirements. Blindly copying others’ technology decisions is not the right way to go. With the growth of users and features sometimes, you might need to replace old tech pieces and adopt new ones. While designing your architecture you should design in such a way that replacing some part of your older stack should not lead to a major rewrite.

TechGig engineering team believes in Open Source technologies. Our complete stack is made up of open-source software. For different layers, we have used different technologies specific to the requirement.

Front End

Our Front End is served by LAMP stack and client-side is written in jQuery and AngularJS which is most commonly used stack and is appreciated by the developer community for its agility and scalability.

Full-Text Search

Full-text search is an important requirement for TechGig. For this, we are using Solr which supports real-time indexing of content, faceted search, dynamic clustering and scalability and fault tolerance.


TechGig supports 54+ languages in its code evaluation engine. To scale it in real-time and support these many languages and environments we are heavily using containers. Docker is the container technology which provides lightweight virtualization with almost no overhead.

Data Store

A lot of data within the TechGig ecosystem is not relational in nature like metadata of various types of evaluations, events and analytics data. To support these kinds of data we are using MongoDB as NoSQL technology. MongoDB is easy to use, highly available, highly scalable and high-performance document-oriented database which fits well within our requirements for evolving data and real-time analytics.


TechGig application follows loosely coupled architecture to manage complexity, modularity and stability of code. Messaging is an important aspect of loosely coupled architecture. TechGig uses Kafka as the message queue for inter-communication between different modules of application like code evaluation front end, code evaluation engine, content indexing, real-time analytics, recommendation engine etc.

Caching Tech Stack Requirements

Caching is an important requirement for any application to enhance performance and response time. TechGig uses Redis to support data caching and offloading database load. TechGig serves user profile data, session data, stats and a lot of long-lived data from Redis caching layer. Other static data like JavaScript, CSS and images are served from Akamai CDN.

Backend and Analytics

A lot of heavy lifting components like face detection, face recognition recommendation services, plagiarism detection, content classification, bulk mailing services etc. are written in java and python using OpenCV, NLTK and TensorFlow frameworks.

Analytics is very important for better user experience and informed decision making. TechGig uses ELK stack for data ingestion and exploration and visualization. All event logs and behavioural data is ingested and visualized using ELK stack.

Development and Deployment

Apart from the technology stack, the selection of Software development tools is also very critical. This includes IDE, build tools, source control, requirement and bug tracking, CI/CD pipeline and testing tools. TechGig team uses, VSCode as IDE as of this lightweight and very fast. For source control TechGig uses BitBucket and for the requirement and Bug tracking, Atlassian Jira is used. Bitbucket and Jira go along so well, and both are Atlassian tools. TechGig uses Jenkins for its CI/CD pipeline as it is very easy to manage, requires very little maintenance and has a rich plugin ecosystem. For automated testing we use Selenium as it is open source, supports multiple browser testing and parallel test executions.

Site Reliability

After all Site Reliability monitoring is also very important. If some code deployed in production misbehaves or breaks, the SRE team should get alert and immediate rectification should be done. TechGig team has its own set of inhouse homegrown tools for monitoring and alert.

Application Security

Bad actors consistently try to steal the user’s personal, financial data. To protect users and data, we have a comprehensive application security program in which from requirements gathering to production deployment in all steps security is embedded. We use Static Code analysis to identify security gaps within the code. To identify security gaps in production live application we use Dynamic security assessments. Apart from this to identify business logic vulnerabilities extensive manual security assessments are done periodically.

Apart from the above-mentioned pieces, there are a lot of other components which contribute to TechGig tech stack. The technologies are subject to change depending on new requirements and challenges. What makes our work interesting is the problem statement we are solving i.e. help the developer community to learn to compete and grow.

About the author

Ram Awasthi is the Head-Technology of TechGig & TimesJobs and VP- Technology, Times Internet.