Categories
Community

How Does OCR Help in Text Extraction From Multiple Images at Once?

OCR (Optical Character Recognition) tools use machine learning algorithms to extract characters from digital images and scanned files.

This technology enables many individuals and industries to streamline their workflow by digitizing data for easy access and storage. 

Advanced OCR tools also come with batch-processing capabilities, so they can extract text from multiple images at once. This feature allows companies to create large datasets that they can later use to make well-informed decisions.

In this post, we will discuss how OCR helps in extracting text from multiple images at once. We will also look at how an online tool can be used for day-to-day tasks.


How OCR Works: The Basics

We will start off by highlighting the basics of OCR and how it works to extract text from multiple images at once.

1. Image processing

Images are cleaned and prepared for the text recognition process. The OCR engine binarizes the image (converts it to black and white), reduces noise, corrects skew, and then detects the edges of characters so they are clearly captured.
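
To make this concrete, here is a minimal preprocessing sketch using the open-source OpenCV library; the file names, filter size, and kernel are placeholder choices for illustration rather than settings from any particular OCR product.

```python
import cv2

img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)  # placeholder file name

# Binarize with Otsu's method so the text becomes pure black on white.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Remove salt-and-pepper noise with a small median filter.
denoised = cv2.medianBlur(binary, 3)

# A light morphological opening clears remaining speckles without
# eroding character strokes too much.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
cleaned = cv2.morphologyEx(denoised, cv2.MORPH_OPEN, kernel)

cv2.imwrite("scan_clean.png", cleaned)
```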

2. Text Detection

After preprocessing, the OCR engine detects areas of the image that likely contain text. These candidate regions are found by looking for a gradient in brightness between the text and the background color, and are then processed further. For this step, algorithms such as convolutional neural networks (CNNs) can be used to detect text regions.

3. Character Segmentation

In this step, the OCR engine breaks the detected text regions into individual lines and characters. Some systems use connected component analysis to find characters, while others rely on contours.

However, the challenge here is to correctly distinguish between letters that touch each other or are spaced irregularly.
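
A rough sketch of the connected-component approach with OpenCV is shown below; the input file, the size filter, and the assumed line height are all placeholders that would need tuning for real documents.

```python
import cv2

bw = cv2.imread("scan_clean.png", cv2.IMREAD_GRAYSCALE)  # cleaned, binarized page
inverted = cv2.bitwise_not(bw)  # connected components expect a white foreground

num, labels, stats, centroids = cv2.connectedComponentsWithStats(inverted, connectivity=8)

boxes = []
for i in range(1, num):            # label 0 is the background
    x, y, w, h, area = stats[i]
    if 10 <= area <= 5000:         # crude size filter; tune per image resolution
        boxes.append((x, y, w, h))

# Approximate reading order: top-to-bottom by line, then left-to-right.
boxes.sort(key=lambda b: (b[1] // 40, b[0]))  # 40 px is an assumed line height
print(f"{len(boxes)} candidate character boxes")
```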

4. Pattern Recognition (Character Recognition)

This is the heart of the OCR process and can happen in two primary ways:

Template Matching: The engine compares each detected character to a database of known patterns. It works best when the font and size do not vary, but it struggles with different fonts or style variations.

Feature Extraction: This approach extracts distinctive features of each character (lines, curves, and intersections) and applies algorithms such as k-nearest neighbors (KNN) or neural networks to recognize the text.
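
As a toy illustration of the feature-based idea, the snippet below trains a k-nearest-neighbours classifier on scikit-learn's bundled 8x8 digit images. A production OCR engine uses far richer features, so treat this purely as a sketch of the principle of comparing new characters to known examples.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(f"Held-out accuracy: {knn.score(X_test, y_test):.3f}")
```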

5. Post-Processing

Once the characters are recognized, post-processing corrects errors and improves accuracy. For example, the system can use a dictionary to fix misrecognized words or apply NLP models to predict and fix common OCR mistakes such as reading “rn” as “m.”
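
A toy version of dictionary-based correction might look like the following; the word list and similarity cutoff are invented for illustration.

```python
import difflib

vocabulary = ["modern", "optical", "character", "recognition", "engine"]

def correct(word: str) -> str:
    # Snap each OCR'd word to its closest dictionary entry, if any is close enough.
    match = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.7)
    return match[0] if match else word

print([correct(w) for w in "rnodern optical recognltion".split()])
# -> ['modern', 'optical', 'recognition']
```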

Ways OCR Helps in Getting Text from Multiple Images

When OCR technology was first commercialized, it had a lot of limitations. First, the software and tools were mostly licensed and paid. Second, you could not process many images at once; conversion happened one image at a time and was very time-consuming.

We have come a long way since then. The OCR tools we have today are much faster and more robust than in the past, and we will look at one of them in this article. We will also see how the advanced tech supports batch-processing capabilities while maintaining accuracy.

1. Batch-Processing

Newer OCR tools allow users to upload many image files at once for conversion. This is called batch processing, and it allows companies with large volumes of data to quickly digitize their physical documentation.
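
For developers, the same idea can be scripted against an open-source engine. The sketch below batch-converts a folder of images with Tesseract through the pytesseract wrapper; the folder name and output file are assumptions.

```python
from pathlib import Path

import pytesseract
from PIL import Image

results = {}
for path in sorted(Path("scans").glob("*.png")):  # placeholder input folder
    results[path.name] = pytesseract.image_to_string(Image.open(path))

with open("extracted.txt", "w", encoding="utf-8") as out:
    for name, text in results.items():
        out.write(f"--- {name} ---\n{text}\n")
```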

A tool worth mentioning here is Imagetotext.io. It helps users process 50 images at once with lightning-fast speed and high accuracy.

The OCR tool has a very minimalist user interface, which keeps the learning curve gentle. To use it, we simply dragged and dropped the image files into the interface and received the extracted text.

The text was extracted immediately for all three files we uploaded to this OCR tool. To process more images than that, purchasing the premium package with its additional features might be the way to go.

This demonstrates that imagetotext.io has batch-processing capabilities accurate enough to handle a large set of documents.

2. Multi-Format Image Support

Beyond batch processing, advanced OCR tools (like the one we just mentioned) also support multiple file formats. These include:

  • PNG
  • JPG
  • JPEG
  • WEBP
  • BMP
  • TIFF
  • And more …

This support for a vast range of image formats makes OCR technology perfect for different use cases. A person working on freelance projects can directly fetch an image from the internet using its URL to convert it to editable text.

Similarly, an organization whose teams work with complex imagery in TIFF format can get the text in editable form using the tool we discussed. All of this expands what individuals and teams can do, immensely boosting their productivity.

3. Maintaining High Accuracy

As the earlier demonstration showed, modern OCR tools are capable of maintaining high accuracy while batch-processing images.

This feature makes the technology crucial for eliminating errors associated with manual data entry, so the information that reaches databases is accurate and dependable.

Besides that, a pristine text extraction process ensures that there is no loss of data, making the knowledge bases comprehensive.

However, it is never a bad idea to cross-check the extracted text so that you can avoid the rare slip-ups that these tools can sometimes make.

4. Layout Preservation

ML algorithms have advanced so quickly that OCR tools can now preserve the layout of text almost every time. As an example, consider an image containing advanced mathematical text.

Putting the OCR tool mentioned above to the test with such an image shows whether it can retain the layout (mathematical symbols, etc.) during the extraction process.

Indeed, achieving such levels of layout preservation is a testament to how rapidly OCR technology is advancing. Thus, no matter how many images you put in for the process, there will be no variation in the textual formatting of the extracted data.

5. Integration with Other Tools

Modern OCR tools can integrate with a lot of other useful tools. These may include translators, transcribers, and so on.

Thus, image-to-text conversion isn’t just limited to digitizing information; the technology can also be used to build one-stop solutions for users where everything is done accurately and rapidly.

Companies can also leverage OCR to add accessibility features like TTS (Text-to-Speech) to their platforms. This can allow visually impaired individuals to navigate websites conveniently, thus adding inclusivity to the user experience (UX).

There are many more ways OCR helps to extract text and integrates with other applications or APIs. We’ve mentioned a few of them in this post to give you an idea of how this technology can help scale up your business.

Technical Details for Developers

For a developer, building or integrating OCR requires understanding some of the following aspects.

1. Image Preprocessing Techniques

  • Binarization: Converting grayscale images to binary makes it easier for OCR engines to detect text; algorithms such as Otsu’s method or adaptive thresholding are commonly used for this.
  • Noise Reduction: Median filtering and morphological operations (such as dilation and erosion) clean the image by removing irrelevant noise.
  • Skew Detection and Correction: A popular way to detect skew in scanned images and rotate them back to a horizontal orientation is the Hough transform (see the sketch after this list).
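
Putting these three techniques together, a rough OpenCV sketch could look like this; the file names and thresholds are illustrative, and the sign of the rotation correction may need flipping depending on your OpenCV version and line orientation.

```python
import cv2
import numpy as np

gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)  # placeholder file name

# Binarization (Otsu) and noise reduction (median filter).
_, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
bw = cv2.medianBlur(bw, 3)

# Skew estimation with the Hough transform: detect long, roughly horizontal
# lines on the edge map and take the median deviation from the horizontal.
edges = cv2.Canny(bw, 50, 150)
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                        minLineLength=gray.shape[1] // 2, maxLineGap=20)
if lines is not None:
    angles = [np.degrees(np.arctan2(y2 - y1, x2 - x1))
              for x1, y1, x2, y2 in lines[:, 0]]
    skew = float(np.median(angles))
    h, w = bw.shape
    # Rotate by the estimated skew; flip the sign if the result looks worse.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), skew, 1.0)
    bw = cv2.warpAffine(bw, M, (w, h), flags=cv2.INTER_CUBIC,
                        borderMode=cv2.BORDER_REPLICATE)

cv2.imwrite("page_deskewed.png", bw)
```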

2. Machine Learning and Deep Learning Techniques

Modern OCR systems often use deep learning models like CNNs for recognizing characters, words, and even handwritten text. Tools such as Tesseract (an open-source OCR engine) have integrated LSTM (Long Short-Term Memory) neural networks to deal better with complex text layouts, resulting in higher recognition accuracy.
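
If you are experimenting with Tesseract from Python, a minimal call through the pytesseract wrapper might look like this; the image path is a placeholder, while --oem 1 and --psm 6 are standard Tesseract flags for selecting the LSTM engine and assuming a single uniform block of text.

```python
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(
    Image.open("receipt.png"),   # placeholder image path
    lang="eng",
    config="--oem 1 --psm 6",    # LSTM engine, single uniform text block
)
print(text)
```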

3. Handling Different Languages and Scripts

OCR must be adaptable to different languages, fonts, and character sets. Typically, models are trained on several datasets covering common English words, English named entities (e.g., @realDonaldTrump), Chinese characters, Japanese characters, Arabic, and other right-to-left scripts.

OCR systems can be fine-tuned with language-specific datasets to increase accuracy, but this requires a solid understanding of, and working experience with, APIs and model training.

4. Accuracy Improvements

To enhance accuracy, OCR systems can be fine-tuned with domain-specific datasets. Training custom models for industry-specific fonts or handwriting styles is vital for OCR use cases like reading financial forms, invoices, or legal documents.

Many developers add OCR by using the Google Cloud Vision API, AWS Textract, or Microsoft Azure Cognitive Services.
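
As one hedged example, text detection with Google Cloud Vision’s official Python client roughly follows the pattern below; it assumes the google-cloud-vision package is installed, credentials are already configured in the environment, and the file name is a placeholder.

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("invoice.png", "rb") as f:          # placeholder file name
    image = vision.Image(content=f.read())

response = client.text_detection(image=image)
if response.text_annotations:
    # The first annotation holds the full detected text block.
    print(response.text_annotations[0].description)
```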

5. Real-Time OCR

For mobile or camera-based applications, real-time OCR adds another layer of complexity, requiring efficient algorithms that work on lower-quality images and in varied lighting conditions. Developing applications under such conditions requires developers to optimize for processing times and to cope with lower resolution or motion blur.

Conclusion

OCR tools use machine learning algorithms to extract text from images, enabling individuals and industries to quickly digitize data for easy access. 

Advanced OCR tools support batch processing, allowing for the extraction of text from multiple images simultaneously. 

This technology maintains high accuracy, preserves layouts, and can integrate with other useful tools, making it a valuable asset for enhancing productivity and accessibility in various large-scale applications.

Categories
Community

The Convergence of Linear Algebra and Machine Learning

Machine learning has grown exponentially over the past decade, transforming industries and everyday life. At the heart of many machine learning algorithms lies a fundamental branch of mathematics: linear algebra. Understanding the intersection of linear algebra and machine learning is crucial for developers and data scientists aiming to harness the full potential of AI technologies. This blog post explores how linear algebra underpins key machine learning concepts and techniques, providing a robust framework for algorithm development and data manipulation.

The Foundations of Linear Algebra

Linear algebra is the branch of mathematics concerning vector spaces and linear mappings between them. It includes the study of vectors, matrices, and systems of linear equations. These elements form the backbone of many computational techniques used in machine learning.

Vectors are fundamental objects in linear algebra, representing quantities that have both magnitude and direction. In machine learning, data points are often represented as vectors, where each element of the vector corresponds to a feature of the data point. For instance, a data point in a dataset of house prices might be represented by a vector whose elements include the size of the house, the number of bedrooms, and the year it was built.

Matrices are arrays of numbers arranged in rows and columns, used to represent and manipulate data. In machine learning, matrices are essential for organizing datasets and performing operations such as transformations and projections. For example, a dataset of multiple data points can be represented as a matrix, where each row corresponds to a data point and each column corresponds to a feature.

Enhancing Data Preprocessing with Linear Algebra

Data preprocessing is a critical step in the machine learning pipeline, ensuring that raw data is transformed into a suitable format for model training. Linear algebra plays a pivotal role in several preprocessing techniques, making the data preparation process more efficient and effective.

Normalization and Standardization

Normalization: This technique rescales the features of a dataset so that they fall within a specific range, typically [0, 1]. Normalization ensures that no single feature dominates the learning process due to its scale. The process involves applying linear transformations to the data matrix, adjusting each element based on the minimum and maximum values of the corresponding feature.

Standardization: Standardization transforms data to have a mean of zero and a standard deviation of one. This technique is particularly useful when features have different units and scales. Standardization is achieved using matrix operations to subtract the mean and divide by the standard deviation for each feature, resulting in a standardized data matrix.
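
As a small sketch with invented house-style numbers, both operations reduce to one-line matrix expressions in NumPy:

```python
import numpy as np

# Rows: houses; columns: size (sq ft), bedrooms, year built (illustrative values).
X = np.array([[1200.0, 3, 1995],
              [2400.0, 4, 2010],
              [ 850.0, 2, 1978]])

# Normalization: rescale each column (feature) to the [0, 1] range.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: zero mean and unit standard deviation per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm.round(3))
print(X_std.round(3))
```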

Dimensionality Reduction

Principal Component Analysis (PCA): PCA is a popular technique for reducing the number of features in a dataset while preserving as much variance as possible. This method uses eigenvalues and eigenvectors, key concepts in linear algebra, to identify the principal components that capture the most significant variations in the data. By projecting the data onto these principal components, PCA reduces the dimensionality of the dataset, making it more manageable and less prone to overfitting.
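
A bare-bones PCA on synthetic data, using nothing more than the eigendecomposition of the covariance matrix, might look like this; the dataset and the choice of two components are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))           # toy dataset: 200 samples, 5 features
Xc = X - X.mean(axis=0)                 # center the data

cov = np.cov(Xc, rowvar=False)          # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: suited to symmetric matrices

# Sort components by explained variance (largest eigenvalue first).
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]      # keep the top 2 principal components

X_reduced = Xc @ components             # 200x2 projection
explained = eigvals[order[:2]] / eigvals.sum()
print(X_reduced.shape, explained.round(3))
```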

Feature Extraction and Transformation

Singular Value Decomposition (SVD): SVD decomposes a data matrix into three other matrices, highlighting the underlying structure of the data. This technique is particularly useful for tasks like noise reduction and feature extraction. By applying SVD, one can transform the original features into a new set of features that are more informative and less redundant.
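
Here is a minimal NumPy sketch of an SVD-based low-rank approximation and feature extraction on a random matrix; the rank k is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 20))          # toy data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5                                   # keep the 5 largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # low-rank (denoised) approximation

# The scaled left singular vectors can serve as k new, less redundant features.
features = U[:, :k] * s[:k]
print(A_k.shape, features.shape)
```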

Fourier Transform: In signal processing and time-series analysis, the Fourier transform converts data from the time domain to the frequency domain. This transformation helps in identifying patterns and trends that are not apparent in the original data. Linear algebra provides the framework for performing and understanding these transformations, facilitating more effective data preprocessing.

By leveraging these linear algebra techniques, data preprocessing becomes more robust, ensuring that the data fed into machine learning models is clean, standardized, and optimally structured. This enhances the model’s performance and accuracy, leading to more reliable predictions and insights.

Linear Algebra in Model Training

Linear algebra is also fundamental in the training phase of machine learning models. Many learning algorithms rely on solving systems of linear equations or optimizing linear functions.

In linear regression, one of the simplest and most widely used algorithms, the goal is to find the best-fitting line through a set of data points. This involves solving a system of linear equations to minimize the sum of squared differences between the predicted and actual values. The solution can be efficiently found using matrix operations such as matrix inversion and multiplication.
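
A compact sketch of this on synthetic data: build the design matrix, then solve the normal equations with matrix operations (the coefficients and noise level below are invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))                       # 50 points, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 4.0 + rng.normal(scale=0.1, size=50)

Xb = np.hstack([np.ones((50, 1)), X])              # prepend a bias column
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)           # solve, rather than invert explicitly
print(w.round(2))                                  # approx. [4.0, 2.0, -1.0, 0.5]
```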

Neural networks, which power deep learning, also heavily depend on linear algebra. The layers in a neural network are essentially a series of linear transformations followed by non-linear activation functions. During the training process, backpropagation is used to update the weights of the network. This involves computing gradients, which are derived using matrix calculus, a subset of linear algebra.

Evaluating Models with Linear Algebra Techniques

Effective model evaluation is crucial for ensuring that machine learning algorithms perform well on new, unseen data. Linear algebra provides the tools necessary for thorough and accurate evaluation.

Mean Squared Error (MSE)

Calculation: MSE is a common metric used to evaluate the accuracy of regression models. It quantifies the average squared disparity between predicted and actual values. By representing predictions and actual values as vectors, MSE can be calculated using vector operations to find the difference, squaring each element, and averaging the results.

Interpretation: A lower MSE indicates a model with better predictive accuracy. Linear algebra simplifies this process, making it easy to implement and interpret.
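
In vector form, the whole computation takes a couple of lines; the values below are made up.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 7.3])

# Difference, element-wise square, then average.
mse = np.mean((y_true - y_pred) ** 2)
print(round(mse, 4))  # 0.135
```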

Confusion Matrix

Structure: For classification problems, a confusion matrix provides a detailed breakdown of a model’s performance. It includes true positives, false positives, true negatives, and false negatives, organized in a matrix format.

Usage: Linear algebra operations facilitate the construction and analysis of confusion matrices, helping to compute derived metrics like precision, recall, and F1 score. These metrics offer insights into different aspects of model performance, such as accuracy and robustness.
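
For a binary classifier, precision, recall, and F1 fall straight out of the matrix entries; the counts below are illustrative, with the matrix laid out as [[TN, FP], [FN, TP]].

```python
import numpy as np

cm = np.array([[50, 10],
               [ 5, 35]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.778 0.875 0.824
```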

Eigenvalues and Eigenvectors

Principal Component Analysis (PCA): In evaluating models, PCA can be used to understand feature importance and variability. Eigenvalues indicate the amount of variance captured by each principal component, while eigenvectors define the directions of these components. This analysis helps in identifying the most significant features contributing to model predictions.

By incorporating these linear algebra-based techniques, model evaluation becomes more comprehensive and insightful, ensuring the development of robust and reliable machine learning systems.

Advanced Applications of Linear Algebra in Machine Learning

Beyond the basics, linear algebra enables more advanced machine learning applications. Singular Value Decomposition (SVD) is a powerful linear algebra technique used in recommendation systems and latent semantic analysis. SVD decomposes a matrix into three other matrices, revealing the underlying structure of the data.

Another advanced application is in the field of convolutional neural networks (CNNs), which are used for image recognition and processing. The convolution operations performed in CNNs are fundamentally matrix multiplications, where filters (small matrices) are applied to input data to extract features.

Conclusion

The intersection of linear algebra and machine learning is both profound and essential. Linear algebra provides the mathematical foundation for many machine learning algorithms and techniques, from data preprocessing and model training to evaluation and advanced applications. By mastering linear algebra, developers and data scientists can gain deeper insights into how machine learning models work and how to optimize them for better performance. As the field of machine learning continues to evolve, the role of linear algebra will remain pivotal, driving innovation and enabling the development of more sophisticated AI systems.

Categories
Tips

Web Design in the Age of Machine Learning: Automating Tasks and Personalizing Content

It is no secret that combining web design with machine learning has opened up many new options. This article will discuss how machine learning is being added to web design and which tools and trends can be utilized.

Machine learning, which is part of artificial intelligence, has made a significant impact on web design. It helps do repetitive tasks automatically and adjusts content to fit each person, making websites better for everyone.

Understanding how machine learning works is critical to seeing how it changes web design. Machine learning makes websites work better for users by using patterns in data.

So, let’s move forward and understand the role of machine learning in modern web design in detail.

The Role of Machine Learning in Web Design

  • What is machine learning?

In traditional programming, a computer uses data and a program to produce an outcome. In machine learning, the computer uses data and known outcomes to produce a program (a model) that can then be used like a regular program.

Even though machine learning is a part of artificial intelligence, they are not the same. In machine learning, machines learn to improve at tasks without being explicitly programmed, while artificial intelligence more broadly aims to make machines think and make decisions like humans.

  • How does machine learning improve web design?

Machine learning helps web developers make websites better for users by customizing them using visitors’ information and actions. For instance, machine learning models can suggest items or content that fit users’ past actions and likes. Many streaming services already use this. That cool song you found in your “recommended” list? Machine learning likely put it there.

  • Examples of Machine Learning Applications in Web Development

The first example concerns content generation and improvement. Machine learning helps improve content by suggesting ways to enhance SEO, checking how users engage, and making short summaries.


The second example is about customizing QR codes. With innovative computer methods, websites can create a QR code for a URL. It will change based on how users act, where they are, or what they like. This makes QR codes more personal and helps track users’ actions.

The last area where machine learning is helpful is predicting how users act. Websites use machine learning models to estimate how users might behave, for example what they will buy or click on next.

Automating Tasks in Web Design Using Machine Learning

Machine learning makes designing websites easier by doing repetitive jobs, letting designers use their time better. It can suggest colors, fonts, and layouts based on user preferences and current design trends. This helps designers start quickly and build from there.

Machine learning helps with coding and making websites, too. It looks at existing code, learns from it, and suggests or even creates parts of new code. This helps coders write better code and find mistakes faster, improving the final website.

  • Popular tools and practices

There are tools and technologies that use machine learning for automating tasks. Some have AI helpers for design or code that work with popular coding programs. TensorFlow and PyTorch are frameworks for building custom machine-learning tools for web design jobs. Another use is chat APIs for faster responses and enhanced customer service. Several chat products use machine learning for intelligent features like natural language processing (NLP) to understand and respond to user messages. There are many chat APIs that can be helpful for a web designer, among them Twilio, Sendbird, and the Sendbird alternative Sceyt.

Personalizing Content with Machine Learning

Making content personal is essential for web design. It helps websites connect better with users. Machine learning is a big help in doing this.

  • Importance of Personalization in Web Design:

Personalizing content matters a lot in web design. When websites show things that users like or are interested in, it keeps them engaged and interested in coming back.

  • How Machine Learning Enables Content Personalization

Machine learning helps in making content personal. It looks at what users do on a website, like what they click on or read, and then suggests similar things. This makes the website more tailored to each user’s preferences.

  • Implementing Personalization Algorithms in Web Design:

Special computer programs (algorithms) are used to make websites personalized. These programs analyze user behavior and suggest content that users might like. Web designers use these algorithms to make the website more exciting and appealing.

  • Tools for Personalization in Web Design

Various tools that use machine learning to personalize content and aid in web design are available. Tools like Canva provide easy-to-use interfaces for creating personalized graphics, allowing designers to tailor visual content based on user preferences and trends. Other machine learning-powered platforms, such as Adobe Sensei or Figma, offer features that analyze user data to suggest design elements, layouts, and styles that resonate with the target audience. Many Canva alternatives are there to help any web designer personalize their website’s visual content.

Conclusion

In the world of web design, machine learning is changing how designers make websites. It helps by doing repetitive tasks and making content personal for each person. This makes designing websites easier and makes users’ experiences better. Tools like Canva and Fotor use machine learning to suggest designs and create personalized content. As web design keeps growing, using machine learning tools will keep being important. They’ll help make sure websites are not just useful but also interesting and appealing to different people. Machine learning is shaping the future of how we experience the internet.

Categories
Analysis

Machine learning developers and their data

The data science (DS), machine learning (ML), and artificial intelligence (AI) field is adapting and expanding. From the ubiquity of data science in driving business insights, to AI’s facial recognition and autonomous vehicles, data is fast becoming the currency of this century.  This post will help you to learn more about this data and the profile of the developers who work with it. 

The findings shared in this post are based on our Developer Economics 20th edition survey, which ran from December 2020 to February 2021 and reached 19,000 developers. 

Before you dive into the data, our new global developer survey is live now. We have a set of questions for machine learning and AI developers. Check it out and take part for a chance to have your say about the most important development trends and win prizes.

It takes all types

The different types of ML/AI/DS data and their applications

We asked developers in ML, AI, and DS what types of data they work with. We distinguish between unstructured data — images, video, text, and audio — and structured tabular data. The latter group includes tabular data that they may simulate themselves.

With 68% of ML/AI/DS developers using unstructured text data, it is the most common type of data these developers work with; however, developers frequently work with multiple types of data. Audio is the most frequently combined data type: 75-76% of those that use audio data also use images, video, or text. 

“Unstructured text is the most popular data type, even more popular than tabular data”

Given the most popular applications of audio data are text-to-speech generation (47%) and speech recognition (46%), the overlaps with video and text data are clear. Image data, like audio, overlaps heavily with video data: 78% of those using video data also use image data. The reverse dependence isn’t as strong: only 52% of those using image data are also video data users. The top two applications of both these data types are the same: image classification and facial recognition. These are two key application fields driving the next generation of intelligent devices: improving augmented reality in games and underpinning self-driving cars, in home robotics, home security surveillance, and medical imaging technology. 

The types of data ML/AI/DS developers work with

69% of  ML/AI/DS developers using tabular data also use unstructured text data

With 59% usage, tabular data is the second most popular type of data. 92% of the tabular data ML/DS/AI developers use is observed, while the other 8% is simulated. The two most common use cases for simulated data are workforce planning — 39% of developers who use simulation do this — and resource allocation, also at 39%.

Structured tabular data is least likely to be combined with other types of data. Although uncommon to combine this type of data with audio or video data, 69% do combine tabular data with unstructured text data. The top application of both tabular data and unstructured text is the analysis and prediction of customer behaviour. This is the sort of analysis often done on the data nuggets we leave behind when searching on retail websites — these are key inputs to algorithms for natural language and recommender systems.

Keeping it strictly professional?

The professional status of ML/AI/DS developers

The professional / hobbyist / student mix in the ML/AI/DS ecosystem

ML/AI/DS developers engage in their fields at different levels. Some are professionals, others students or hobbyists, and some are a combination of the above. The majority (53%) of all ML/DS/AI developers are professionals — although they might not be so exclusively. 

Of all the data types, audio data has the highest proportion of professional ML/DS/AI developers. 64% of ML/AI/DS developers who use this type of data classified themselves as a professional; and the majority (50%) of these professionals are applying audio data to text-to-speech generation. The high proportion of professionals in this field might be a byproduct of co-influencing variables: audio data is the data type most frequently combined with other types, and professionals are more likely to engage with many different types of data. 

Data types popular with students include image, tabular, and text data. Between 18-19% of developers who work with these types of data are students. There are many well-known datasets of these types of data freely available. With this data in hand, students also favour certain research areas. 

Image classification, for example, is popular with developers who are exclusively students: 72% of those students who use image data use it for this application, in contrast to just 68% of exclusive professionals that do. In applying unstructured text data, 38% of exclusive students are working in Natural Language Processing (NLP), while 32% of exclusive professionals are. As these students mature to professionals, they will enter industry with these specialised skills and we expect to see an increase in the practical applications of these fields, e.g. chatbots for NLP. 

“65% of students and 54% of professionals rely on one or two types of data”

Besides differences in application areas, students, hobbyists, and professionals engage with varying types of data. 65% of those who are exclusively students use one or two types of data, while 61% of exclusively hobbyists and only 54% of exclusively professionals use one or two types. Developers who are exclusively professionals are the most likely to be relying on many different types of data: 23% use four or five types of data. In contrast, 19% of exclusively hobbyists and 15% of exclusively students use four to five types. Level of experience, application, and availability of datasets all play a role in which types of data an ML/AI/DS developer uses. The size of these datasets is the topic of the next section.

Is all data ‘big’?

The size of ML/AI/DS developers’ structured and unstructured training data

The hype around big data has left many with the impression that all developers in ML/AI/DS work with extremely large datasets. We asked ML/AI/DS developers how large their structured and unstructured training datasets are. The size of structured tabular data is measured in rows, while the size of unstructured data — video, audio, text — is measured in disc size. Our research shows that very large datasets aren’t perhaps as ubiquitous as one might expect. 

“14% of ML/AI/DS developers use structured training datasets with less than 1,000 rows of data, while the same proportion of developers use data with more than 500,000 rows” 

The most common size band is 1K – 20K rows of data, with 25% of ML/AI/DS developers using structured training datasets of this size. This differs by application type. For example, 22% of those working in simulation typically work with 20K – 50K rows of data; while 21% of those working with optimisation tools work with 50K – 100K rows of data. 

Dataset size also varies by professional status. Only 11% of exclusively professional developers use structured training datasets with up to 20K rows, while 43% of exclusively hobbyists and 54% of exclusively students use these small datasets. This may have to do with access to these datasets — many companies generate large quantities of data as a byproduct or direct result of their business processes, while students and hobbyists have access to smaller, open-source datasets or those collected via their learning institutions. 

A further consideration is, who has access to the infrastructure capable of processing large datasets? For example, those who are exclusively students might not be able to afford the hardware to process large volumes of data. 

Non-tabular data is a useful measure for comparisons within categories: for example, 18% of image datasets are between 50MB-500MB, while only 8% are more than 1TB in size. The measure doesn’t, however, allow for cross-type comparisons, since different types of data take up different amounts of space. For example, 50MB of video data corresponds to a considerably shorter length of time than 50MB of audio data.

The size of unstructured training data ML/AI/DS developers work with

The categorisation of the different data sizes was designed to take into account the steps in required processing power. For most ML/AI/DS developers, we expect that a 1-25GB dataset could be handled with powerful, but not specialised, hardware. Depending on the language and modelling method used, 25GB on disc relates to the approximate upper bound in memory size that this type of hardware could support. 

We see that 26% of ML/AI/DS developers using text data and 41% using video data will require specialised hardware to manage their training. This need for specialised hardware acts as a barrier to entry: analysing such large datasets is out of reach without the backing of deep pockets to pay for cloud-based technology support or infrastructure purchases.

Want to know more? This blog post explores where ML developers run their app or project’s code, and how it differs based on how they are involved in machine learning/AI, what they’re using it for, as well as which algorithms and frameworks they’re using.

Categories
Tips

Where do ML developers run their code?

In this blog post we’ll explore where ML developers run their app or project’s code, and how it differs based on how they are involved in machine learning/AI, what they’re using it for, as well as which algorithms and frameworks they’re using.

Machine learning (ML) powers an increasing number of applications and services which we use daily. For some organisations and data scientists, it is not just about generating business insights or training predictive models anymore. Indeed, the emphasis has shifted from pure model development to real-world production scenarios that are concerned with issues such as inference performance, scaling, load balancing, training time, reproducibility, and visibility. Those require computation power, which in the past has been a huge hindrance for machine learning developers.

A shift from running code on laptop & desktop computers to cloud computing solutions

The share of ML developers who write their app or project’s code locally on laptop or desktop computers, has dropped from 61% to 56% between the mid and end of 2019. Although the five percentage points drop is significant, the majority of developers continue to run their code locally. Unsurprisingly, amateurs are more likely to do so than professional ML developers (65% vs 51%).

By contrast, in the same period, we observe a slight increase in the share of developers who deploy their code on public clouds or mainframe computers. In this survey wave, we introduced multi cloud as a new possible answer to the question: “Where does your app/project’s code run?” in order to identify developers who are using multiple public clouds for a single project.

As it turns out, 19% of ML developers use multi cloud solutions (see this multi-cloud cheat sheet here) to deploy their code. It is likely that, by introducing this new option, we underestimate the real increase in public cloud usage for running code; some respondents may have selected multi cloud in place of public cloud. That said, it has become increasingly easy and inexpensive to spin up a number of instances and run ML models on rented cloud infrastructures. In fact, most of the leading cloud hosting solutions provide free Jupyter notebook environments that require no setup and run entirely in the cloud. Google Colab, for example, comes preinstalled with most of the machine learning libraries and acts as a perfect place where you can plug and play to build machine learning solutions where dependencies and compute are not an issue.

While amateurs are less likely to leverage cloud computing infrastructures than professional developers, they are as likely as professionals to run their code on hardware other than CPU. As we’ll see in more depth later, over a third of machine learning enthusiasts who train deep learning models on large datasets use hardware architectures such as GPU and TPU to run their resource intensive code.

Developers working with big data & deep learning frameworks are more likely to deploy their code on hybrid and multi clouds

Developers who do ML/AI research are more likely to run code locally on their computers (60%) than other ML developers (54%); mostly because they tend to work with smaller datasets. On the other hand, developers in charge of deploying models built by members of their team or developers who build machine learning frameworks are more likely to run code on cloud hosting solutions.

Teachers of ML/AI or data science topics are also more likely than average to use cloud solutions, more specifically hybrid or multi clouds. It should be noted that a high share of developers teaching ML/AI are also involved in a different way in data science and ML/AI. For example, 41% consume 3rd party APIs and 37% train & deploy ML algorithms in their apps or projects. They are not necessarily using hybrid and multi cloud architectures as part of their teaching activity.

The type of ML frameworks or libraries which ML developers use is another indicator of running code on cloud computing architectures. Developers who are currently using big data frameworks such as Hadoop, and particularly Apache Spark, are more likely to use public and hybrid clouds. Spark developers also make heavier use of private clouds to deploy their code (40% vs 31% of other ML developers) and on-premise servers (36% vs 30%).

Deep learning developers are more likely to run their code on cloud instances or on-premise servers than developers using other machine learning frameworks/libraries such as the popular Scikit-learn python library. 

There is, however, a clear distinction between developers using Keras and TensorFlow – the popular and most accessible deep learning libraries for python – compared to those using Torch, DeepLearning4j or Caffe. The former are less likely to run their code on anything other than their laptop or desktop computers, while the latter are significantly more likely to make use of hybrid and multi clouds, on-premise servers and mainframes. These differences stem mostly from developers’ experience in machine learning development; for example, only 19% of TensorFlow users have over 3 years of experience as compared to 25% and 35% of Torch and DeepLearning4j developers respectively. Torch is definitely best suited to ML developers who care about efficiency, thanks to an easy and fast scripting language, LuaJIT, and an underlying C/CUDA implementation.

Hardware architectures are used more heavily by ML developers working with speech recognition, network security, robot locomotion and bioengineering. Those developers are also more likely to use advanced algorithms such as Generative Adversarial Networks and work on large datasets, hence the need for additional computer power. Similarly, developers who are currently using C++ machine learning libraries make heavier use of hardware architectures other than CPU (38% vs 31% of other developers) and mainframes,  presumably because they too care about performance.

Finally, there is a clear correlation between where ML developers’ code runs and which stage(s) of the machine learning/data science workflow they are involved in. ML developers involved in data ingestion are more likely to run their code on private clouds and on-premise servers, while those involved in model deployment make heavier use of public clouds to deploy their machine learning solutions. 31% of developers involved across all stages of the machine learning workflow – end to end – run code on self hosted solutions, as compared to 26% of developers who are not. They are also more likely to run their code on public and hybrid clouds. 

By contrast, developers involved in data visualisation or data exploration tend to run their code in local environments (62% and 60% respectively), even more so than ML developers involved in other stages of the data science workflow (54%).

Developer Economics 18th edition reached 17,000+ respondents from 159 countries around the world. As such, the Developer Economics series continues to be the most global independent research on mobile, desktop, industrial IoT, consumer electronics, 3rd party ecosystems, cloud, web, game, AR/VR and machine learning developers and data scientists combined ever conducted. You can read the full free report here.

If you are a Machine Learning programmer or Data Scientist, join our community and voice your opinion in our current survey to shape the next State of the Developer nation report.

Categories
Business

Data scientists need to make sense of the big picture, rather than the big data

The web echoes with cries for help with learning data science. “How do I get started?”. “Which are the must-know algorithms?”. “Can someone point me to best resources for deep learning?”. In response, a bustling ecosystem has sprung to life around learning resources of all shapes and sizes. Are the skills to unlock the deepest secrets of deep learning what emerging data scientists truly need though? Our research has consistently shown that only a minority of data scientists are in need of highly performing predictive models, while most would benefit from learning how to decide whether to build an algorithm or not and how to make sense of it, rather than how to actually build one.  

Categories
Business

Infographic: What are developers up to in the State of the Developer Nation 15th Edition?

Did you get a free copy of our latest State of the Developer Nation 15th edition? If you haven’t yet, you should! It highlights the most interesting findings from our Developer Economics survey, which ran this summer in May-June and reached more than 20,500 developers in 167 countries.

What’s new in the State of the Developer Nation 15 edition?


We asked developers, among other things, what kind of skills they’d like to learn or improve in 2019. We compared developer interest in twelve different skill sets, spanning from data science and machine learning to business/marketing skills to cloud-native development, DevOps, and hardware-level coding. The results were somewhat surprising. Data science and machine learning will be the most highly sought after skills in the next year – 45% of developers want to gain expertise in these fields. 33% of developers want to learn UI design, 25% cloud-native development. Other common tech skills, such as learning a new programming language, rank lower.

When it comes to programming language communities, JavaScript still reigns as the most popular language, with over 10M users globally. Python has reached 7M active developers and is climbing up the ranks.  62% of machine learning developers and data scientists now use Python.

Big data has been hyped for several years. In addition, a race has begun to design processors capable of crunching large sets of often unstructured data and to produce real-time predictions. The question is, to how many in the rapidly growing Data Science and Machine Learning (ML) community are large datasets and real-time predictions relevant? Scroll down to find all the highlights in the infographic!

Don’t forget to share the infographic & download the full report!

The Developer Economics 17th Edition is now LIVE. Take the survey and shape tomorrow’s trends.


Liked it? Take the survey and share with us your ideas for the future of development.

Categories
Languages

What is the best programming language for Machine Learning?

Q&A sites and data science forums are buzzing with the same questions over and over again: I’m new in data science, what language should I learn? What’s the best machine learning language?


There’s an abundance of articles attempting to answer these questions, either based on personal experience or on job offer data. There’s far more activity in machine learning than job offers in the West can capture, however, and while peer opinions are of course very valuable, they are often conflicting and as such may confuse the novices. We turned instead to our hard data from 2,000+ data scientists and machine learning developers who responded to our latest survey about which languages they use and what projects they’re working on – along with many other interesting things about their machine learning activities and training. Then, being data scientists ourselves, we couldn’t help but run a few models to see which are the most important factors correlated to language selection. We compared the top-5 languages and the results show that there is no simple answer to the “which language?” question. It depends on what you’re trying to build, what your background is, and why you got involved in machine learning in the first place.

Which machine learning language is the most popular overall?

First, let’s look at the overall popularity of machine learning languages. Python leads the pack, with 57% of data scientists and machine learning developers using it and 33% prioritising it for development. Little wonder, given all the evolution in the deep learning Python frameworks over the past 2 years, including the release of TensorFlow and a wide selection of other libraries. Python is often compared to R, but they are nowhere near comparable in terms of popularity: R comes fourth in overall usage (31%) and fifth in prioritisation (5%). R is in fact the language with the lowest prioritisation-to-usage ratio among the five, with only 17% of developers who use it prioritising it. This means that in most cases R is a complementary language, not a first choice. The same ratio for Python is at 58%, the highest by far among the five languages, a clear indication that the usage trends of Python are the exact opposite to those of R. Not only is Python the most widely used language, it is also the primary choice for the majority of its users. C/C++ is a distant second to Python, both in usage (44%) and prioritisation (19%). Java follows C/C++ very closely, while JavaScript comes fifth in usage, although with a slightly better prioritisation performance than R (7%). We asked our respondents about other languages used in machine learning, including the usual suspects of Julia, Scala, Ruby, Octave, MATLAB and SAS, but they all fall below the 5% mark of prioritisation and below 26% of usage. We therefore focused our attention on the top-5 languages.

Python is prioritised in applications where Java is not.

Our data reveals that the most decisive factor when selecting a language for machine learning is the type of project you’ll be working on – your application area. In our survey we asked developers about 17 different application areas while also providing our respondents with the opportunity to tell us that they’re still exploring options, not actively working on any area. Here we present the top and bottom three areas per language: the ones where developers prioritise each language the most and the least.

Machine learning scientists working on sentiment analysis prioritise Python (44%) and R (11%) more and JavaScript (2%) and Java (15%) less than developers working on other areas. In contrast, Java is prioritised more by those working on network security / cyber attacks and fraud detection, the two areas where Python is the least prioritised. Network security and fraud detection algorithms are built or consumed mostly in large organisations – and especially in financial institutions – where Java is a favourite of most internal development teams. In areas that are less enterprise-focused, such as natural language processing (NLP) and sentiment analysis, developers opt for Python which offers an easier and faster way to build highly performing algorithms, due to the extensive collection of specialised libraries that come with it.

Artificial Intelligence (AI) in games (29%) and robot locomotion (27%) are the two areas where C/C++ is favoured the most, given the level of control, high performance and efficiency required. Here a lower level programming language such as C/C++ that comes with highly sophisticated AI libraries is a natural choice, while R, designed for statistical analysis and visualisations, is deemed mostly irrelevant. AI in games (3%) and robot locomotion (1%) are the two areas where R is prioritised the least, followed by speech recognition where the case is similar.

Other than in sentiment analysis, R is also relatively highly prioritised – as compared to other application areas – in bioengineering and bioinformatics (11%), an area where both Java and JavaScript are not favoured. Given the long-standing use of R in biomedical statistics, both inside and outside academia, it’s no surprise that it’s one of the areas where it’s used the most. Finally, our data shows that developers new to data science and machine learning who are still exploring options prioritise JavaScript more than others (11%) and Java less than others (13%). These are in many cases developers who are experimenting with machine learning through the use of a 3rd-party machine learning API in a web application.


Professional background is pivotal in selecting a machine learning language.

Second to the application area, the professional background is also pivotal in selecting a machine learning language: the developers prioritising  the top-five languages more than others come from five different backgrounds. Python is prioritised the most by those for whom data science is the first profession or field of study (38%). This indicates that Python has by now become an integral part of data science – it has evolved into the native language of data scientists. The same can not be said for R, which is mostly prioritised by data analysts and statisticians (14%), as the language was initially created for them, replacing S.

Front-end web developers extend their use of JavaScript to machine learning, 16% prioritising it for that purpose, while staying clear of the cumbersome C/C++ (8%). At the exact opposite stand embedded computing hardware / electronics engineers who go for C/C++ more than others, while avoiding JavaScript, Java and R more than others. Given their investment in mastering C/C++ in their engineering life, it would make no sense to settle for a language that would compromise their level of control over their application. Embedded computing hardware engineers are also the most likely to be working on near-the-hardware machine learning projects, such as IoT edge analytics projects, where hardware may force their language selection. Our data confirms that their involvement is significantly above average in industrial maintenance, image classification and robot locomotion projects among others.

For Java, it’s the front-end desktop application developers who prioritise it more than others (21%), which is also in line with its use mostly in enterprise-focused applications as noted earlier. Enterprise developers tend to use Java in all projects, including machine learning. The company directive in this case is also evident from the third factor that is strongly correlated to language prioritisation – the reason to get into machine learning. Java is prioritised the most (27%) by developers who got into machine learning because their boss or company asked them to. It is the least preferred (14%) by those who got into the field just because they were curious to see what all the fuss was about – Java is not a language that you normally learn just for fun! It is Python that the curious prioritise more than others (38%), another indication that Python is recognised as the main language that one needs to experiment with to find out what machine learning is all about.

It seems that some universities teaching data science courses still need to catch up with this notion though. Developers who say that they got into machine learning because data science is/was part of their university degree are the least likely to prioritise Python (26%) and the most likely to prioritise R (7%) as compared to others. There is evidently still a favourable bias towards R within statistics circles in academia – where it was born – but as data science and machine learning gravitate more towards computing, the trend is fading away. Those with university training in data science may favour it more than others, but in absolute terms it’s still only a small fraction of that group too that will go for R first.

C/C++ is prioritised more by those who want to enhance their existing apps/projects with machine learning (20%) and less by those who hope to build new highly competitive apps based on machine learning (14%). This pattern points again to C/C++ being mostly used in engineering projects and IoT or AR/VR apps, most likely already written in C/C++, to which ML-supported functionality is being added. When building a new app from scratch – especially one using NLP for chatbots – there’s no particular reason to use C/C++, while there are plenty of reasons to opt for languages that offer highly-specialised libraries, such as Python. These languages can more quickly and easily yield highly-performing algorithms that may offer a competitive advantage in new ML-centric apps.

Finally, contractors who got into machine learning to increase their chances of securing highly-profitable projects prioritise JavaScript more than others (8%). These are probably JavaScript developers building web applications to which they are adding a machine learning API. An example would be visualising the results of a machine learning algorithm on a web-based dashboard.

There is no such thing as a ‘best language for machine learning’.

Our data shows that popularity is not a good yardstick to use when selecting a programming language for machine learning and data science. There is no such thing as a ‘best language for machine learning’ and it all depends on what you want to build, where you’re coming from and why you got involved in machine learning. In most cases developers port the language they were already using into machine learning, especially if they are to use it in projects adjacent to their previous work – such as engineering projects for C/C++ developers or web visualisations for JavaScript developers.

If your first ever contact with programming is through machine learning, then your peers in our survey point to Python as the best option, given its wealth of libraries and ease of use. If, on the other hand, you’re dreaming of a job in an enterprise environment, be prepared to use Java. Whatever the case, these are exciting times for machine learning and the journey is guaranteed to be a mind-blowing one, irrespective of the language you opt for. Enjoy the ride!