Machine learning developers and their data

The data science (DS), machine learning (ML), and artificial intelligence (AI) field is adapting and expanding. From the ubiquity of data science in driving business insights to AI’s facial recognition and autonomous vehicles, data is fast becoming the currency of this century. This post will help you learn more about this data and the profile of the developers who work with it.

The findings shared in this post are based on our Developer Economics 20th edition survey, which ran from December 2020 to February 2021 and reached 19,000 developers. 

It takes all types

The different types of ML/AI/DS data and their applications

We ask developers in ML, AI, and DS what types of data they work with. We distinguish between unstructured data — images, video, text, and audio — and structured tabular data. The latter group includes tabular data that they may simulate themselves. 

With 68% of ML/AI/DS developers using unstructured text data, it is the most common type of data these developers work with; however, developers frequently work with multiple types of data. Audio is the data type most frequently combined with others: 75-76% of those who use audio data also use images, video, or text.

“Unstructured text is the most popular data type, even more popular than tabular data”

Given that the most popular applications of audio data are text-to-speech generation (47%) and speech recognition (46%), the overlaps with video and text data are unsurprising. Image data, like audio, overlaps heavily with video data: 78% of those using video data also use image data. The reverse dependence isn’t as strong: only 52% of those using image data also use video data. The top two applications of both these data types are the same: image classification and facial recognition. These are two key application fields driving the next generation of intelligent devices: improving augmented reality in games and underpinning self-driving cars, home robotics, home security surveillance, and medical imaging technology.
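The asymmetry between the 78% and 52% figures arises because each percentage is calculated over a different base group. The sketch below illustrates this with synthetic respondents (made-up counts chosen to reproduce the two percentages, not survey data); the function name `overlap_share` is hypothetical.

```python
def overlap_share(respondents, base, other):
    """Share of respondents using `base` who also use `other`."""
    base_users = [r for r in respondents if base in r]
    if not base_users:
        raise ValueError(f"no respondents use {base!r}")
    both = sum(1 for r in base_users if other in r)
    return both / len(base_users)


# Synthetic respondents (assumption, for illustration only):
# 78 use both image and video, 22 use video only, 72 use image only.
respondents = (
    [{"image", "video"}] * 78
    + [{"video"}] * 22
    + [{"image"}] * 72
)

video_to_image = overlap_share(respondents, "video", "image")  # 78 / 100
image_to_video = overlap_share(respondents, "image", "video")  # 78 / 150

print(f"Video users who also use images: {video_to_image:.0%}")
print(f"Image users who also use video:  {image_to_video:.0%}")

# The ratio of the two shares equals the ratio of the group sizes.
print(f"Image users per video user: {video_to_image / image_to_video:.2f}")
```

Because the same overlapping group is divided by two different totals, the two published percentages together imply that image users outnumber video users by roughly 0.78 / 0.52 = 1.5 to 1.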

The types of data ML/AI/DS developers work with

69% of ML/AI/DS developers using tabular data also use unstructured text data

With 59% usage, tabular data is the second most popular type of data. 92% of the tabular data ML/DS/AI developers use is observed, while the other 8% is simulated. The two most common use cases for simulated data are workforce planning and resource allocation, each cited by 39% of developers who use simulation.

Structured tabular data is the least likely to be combined with other types of data. Although it is uncommon to combine this type of data with audio or video data, 69% of tabular data users also work with unstructured text data. The top application of both tabular data and unstructured text is the analysis and prediction of customer behaviour. This is the sort of analysis often done on the data nuggets we leave behind when searching on retail websites; these are key inputs to algorithms for natural language and recommender systems.

Keeping it strictly professional?

The professional status of ML/AI/DS developers

The professional / hobbyist / student mix in the ML/AI/DS ecosystem

ML/AI/DS developers engage in their fields at different levels. Some are professionals, others students or hobbyists, and some are a combination of the above. The majority (53%) of all ML/DS/AI developers are professionals — although they might not be so exclusively. 

Of all the data types, audio data has the highest proportion of professional ML/DS/AI developers. 64% of ML/AI/DS developers who use this type of data classified themselves as professionals, and half (50%) of these professionals are applying audio data to text-to-speech generation. The high proportion of professionals in this field might be a byproduct of co-influencing variables: audio data is the data type most frequently combined with other types, and professionals are more likely to engage with many different types of data.

Data types popular with students include image, tabular, and text data. Between 18% and 19% of developers who work with these types of data are students. There are many well-known, freely available datasets of these types. With this data in hand, students also favour certain research areas.

Image classification, for example, is popular with developers who are exclusively students: 72% of those students who use image data use it for this application, compared with 68% of exclusive professionals. In applying unstructured text data, 38% of exclusive students are working in Natural Language Processing (NLP), compared with 32% of exclusive professionals. As these students mature into professionals, they will enter industry with these specialised skills, and we expect to see an increase in the practical applications of these fields, e.g. chatbots for NLP.

“65% of students and 54% of professionals rely on one or two types of data”

Besides differences in application areas, students, hobbyists, and professionals engage with varying types of data. 65% of those who are exclusively students use one or two types of data, compared with 61% of those who are exclusively hobbyists and only 54% of those who are exclusively professionals. Developers who are exclusively professionals are the most likely to rely on many different types of data: 23% use four or five types. In contrast, 19% of exclusive hobbyists and 15% of exclusive students use four or five types. Level of experience, application, and availability of datasets all play a role in which types of data an ML/AI/DS developer uses. The size of these datasets is the topic of the next section.

Is all data ‘big’?

The size of ML/AI/DS developers’ structured and unstructured training data

The hype around big data has left many with the impression that all developers in ML/AI/DS work with extremely large datasets. We asked ML/AI/DS developers how large their structured and unstructured training datasets are. The size of structured tabular data is measured in rows, while the size of unstructured data (video, audio, text) is measured by its size on disc. Our research shows that very large datasets are perhaps not as ubiquitous as one might expect.

“14% of ML/AI/DS developers use structured training datasets with less than 1,000 rows of data, while the same proportion of developers use data with more than 500,000 rows” 

The most common size band is 1K – 20K rows of data, with 25% of ML/AI/DS developers using structured training datasets of this size. This differs by application type. For example, 22% of those working in simulation typically work with 20K – 50K rows of data, while 21% of those working with optimisation tools work with 50K – 100K rows of data.

Dataset size also varies by professional status. Only 11% of exclusively professional developers use structured training datasets with up to 20K rows, while 43% of exclusively hobbyists and 54% of exclusively students use these small datasets. This may have to do with access to these datasets — many companies generate large quantities of data as a byproduct or direct result of their business processes, while students and hobbyists have access to smaller, open-source datasets or those collected via their learning institutions. 

A further consideration is who has access to infrastructure capable of processing large datasets. For example, those who are exclusively students might not be able to afford the hardware to process large volumes of data.

For non-tabular data, size on disc is a useful measure for comparisons within categories: for example, 18% of image datasets are between 50MB and 500MB, while only 8% are more than 1TB in size. The measure doesn’t, however, allow for cross-type comparisons, since different types of data take up different amounts of space. For example, 50MB of video data represents a considerably shorter running time than 50MB of audio data.
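To give a feel for why disc size does not compare well across data types, here is a rough back-of-the-envelope calculation. The bitrates are illustrative assumptions (typical orders of magnitude for compressed audio and video), not figures from the survey, and the function name `playback_seconds` is hypothetical.

```python
def playback_seconds(size_mb: float, bitrate_kbps: float) -> float:
    """Approximate running time of a file of `size_mb` megabytes
    encoded at a constant `bitrate_kbps` (kilobits per second)."""
    size_kilobits = size_mb * 1000 * 8  # decimal megabytes -> kilobits
    return size_kilobits / bitrate_kbps


# Assumed bitrates, for illustration only.
AUDIO_KBPS = 128    # compressed audio
VIDEO_KBPS = 5000   # compressed video

audio_seconds = playback_seconds(50, AUDIO_KBPS)
video_seconds = playback_seconds(50, VIDEO_KBPS)

print(f"50MB of audio: about {audio_seconds / 60:.0f} minutes")
print(f"50MB of video: about {video_seconds:.0f} seconds")
```

Under these assumptions, the same 50MB holds close to an hour of audio but well under two minutes of video, which is why size bands are best compared within a single data type.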

The size of unstructured training data ML/AI/DS developers work with

The categorisation of the different data sizes was designed to take into account the steps in required processing power. For most ML/AI/DS developers, we expect that a 1-25GB dataset could be handled with powerful, but not specialised, hardware. Depending on the language and modelling method used, 25GB on disc corresponds to the approximate upper bound in memory size that this type of hardware could support.

We see that 26% of ML/AI/DS developers using text data and 41% of those using video data will require specialised hardware to manage their training. This need for specialised hardware acts as a barrier to entry: analysing datasets of this size is out of reach without deep pockets to fund cloud-based services or infrastructure purchases.

Want to know more? This blog post explores where ML developers run their app or project’s code, and how it differs based on how they are involved in machine learning/AI, what they’re using it for, as well as which algorithms and frameworks they’re using.


What do developers value in open source?

Open-source software (OSS) is used by 92% of developers, so what exactly do they value in it? We find that developers value OSS’s ability to supersede any single contributor and live on almost eternally. We highlight some uncertainty around OSS’s future by showing trends from geographic regions and sectors. The findings shared in this post are based on the Developer Economics survey 19th edition which ran during June-August 2020 and reached more than 17,000 developers in 159 countries.

What exactly do developers value in open-source?

Open-source software (OSS) is ubiquitous in the global developer community. As our data shows, OSS is used by 92% of developers. A question that comes to mind is: what exactly do developers value in OSS? In the chart below, we show which statements about OSS developers agree with, broken down by professional and non-professional developers, and enterprise and non-enterprise developers. The overarching theme in what developers value in OSS is its ability to be eternal. “To collaborate with the community, building software that outlasts even its originator” encapsulates the two statements with the greatest agreement.

The overall cost and wanting to avoid vendor lock-in/lock-out are important aspects that professional and enterprise developers in particular value in OSS, while non-enterprise developers value forking product derivatives and debugging more than the other groups. Non-professional developers do not value the overall costs element, perhaps because they have not experienced the costs involved in closed source software, whereas many professional developers have. Another aspect that non-professional developers value significantly less is avoiding vendor lock-in. This also suggests that these developers have not experienced the limitations of closed source software yet.

Appreciation of the overall costs of OSS is also strongly linked with years of developer experience: only 24% of developers with less than one year of experience agree that low cost is an asset of OSS. In contrast, this rises to 34% of developers with between three and five years of experience, and 43% of developers with six or more years. Typically, as developers gain experience, they begin to work in different sectors, often crossing over between them. At this point, the flexibility that OSS offers may become crucial.

Finally, we also see a greater proportion of non-professional developers not using OSS compared to others. This is also reflected indirectly in each of the other statements; we see that non-professional developers agree with every statement less than professional developers. This suggests that, to be truly appreciative of the benefits of OSS, you may have had to engage with it seriously, in the way professional developers do.

Where OSS is written is changing

At present, the culture of OSS is particularly strong among developers in Western Europe and Israel, where not a single statement is valued below the average. In contrast, developers in North America, who have driven the OSS movement up until now, value contributing and interacting with the community less than average. This could suggest a cooling off of North American OSS development and a maturing of this ecosystem.

On average, East Asian developers seem to be more disengaged from the OSS movement than developers from other regions. Only 88% of developers in this region use OSS, compared to 92% globally. In general, developers in this region also value fewer aspects of OSS. In particular, their extremely low appreciation of the continuous support for the technology, compared to others, suggests that developers in this region are apprehensive about the longevity of OSS, which partially undermines its main benefit. This apprehension is also reflected in the relatively low agreement associated with contributing.

According to our data, South Asian developers value contributing to OSS significantly more than others. In addition, South Asia is the region with the largest proportion of developers who value collaborating and interacting with the community. This combination positions the region to be among the drivers of the next wave of OSS development. In the Middle East and Africa region, some key advantages of OSS, such as avoiding vendor lock-in and the overall low cost, have not yet resonated with developers. This is despite the fact that, at least for Africa, income per capita is low compared to global averages. The explanation lies partly in this region’s proportion of professional developers and the experience of its developers.

The Middle East and Africa, as well as South America, have roughly the same proportion of professional developers, 60.7%, in contrast to North America or Western Europe and Israel, where more than 80% of developers are professional. Non-professionals value OSS less. Similarly, developers in the Middle East and Africa are also the least experienced, on average, and years of experience in particular is linked with appreciating the low cost of OSS.

Some sectors embrace OSS while others don’t

Emergent sectors such as augmented reality (AR) and virtual reality (VR) stand to benefit greatly from OSS as a means of defining a common standard and exchanging ideas. Yet we find that developers working in these two fields do not value forking or creating product derivatives, nor even collaboration in the case of VR, as much as developers from other fields do on average. This could be partially explained by their lower-than-average agreement with the need for continuous support for a technology. When developers do not value this characteristic, it is unlikely that they are working with the mindset that would ensure long-term OSS growth and desirability.

On the other hand, developers who are building apps and extensions for third-party ecosystems value contributing and forking more, on average, than developers in other sectors. Similarly, the very successful Node.js runtime has facilitated other extensions, and developers working in backend services place a high value on the continuous support of OSS projects. At present, despite the large percentage of developers who use open source software, it is only in certain circumstances that the majority of developers value OSS for any given reason. Perhaps this suggests that OSS has become an expectation rather than being perceived as a gift from society at large to society at large. Observing how developers value OSS in the future would be a good litmus test for the health of open source projects. For now, though, there are encouraging blooms, in South Asia for example, but also sectors of scepticism, such as AR/VR.

