Categories
Analysis

Machine learning developers and their data

The data science (DS), machine learning (ML), and artificial intelligence (AI) field is adapting and expanding. From the ubiquity of data science in driving business insights, to AI’s facial recognition and autonomous vehicles, data is fast becoming the currency of this century.  This post will help you to learn more about this data and the profile of the developers who work with it. 

The findings shared in this post are based on our Developer Economics 20th edition survey, which ran from December 2020 to February 2021 and reached 19,000 developers. 

Before you dive into the data, our new global developer survey is live now. We have a set of questions for machine learning and AI developers. Check it out and take part for a chance to have your say about the most important development trends and win prizes.

It takes all types

The different types of ML/AI/DS data and their applications

We ask developers in ML, AI, and DS what types of data they work with. We distinguish between unstructured data — images, video, text, and audio — and structured tabular data. The latter group includes tabular data that they may simulate themselves. 

With 68% of ML/AI/DS developers using unstructured text data, it is the most common type of data these developers work with; however, developers frequently work with multiple types of data. Audio is the most frequently combined data type: 75-76% of those that use audio data also use images, video, or text. 

“Unstructured text is the most popular data type, even more popular than tabular data”

Given that the most popular applications of audio data are text-to-speech generation (47%) and speech recognition (46%), the overlaps with video and text data are clear. Image data, like audio, overlaps heavily with video data: 78% of those using video data also use image data. The reverse dependence isn’t as strong: only 52% of those using image data are also video data users. The top two applications of both these data types are the same: image classification and facial recognition. These are two key application fields driving the next generation of intelligent devices: they improve augmented reality in games and underpin self-driving cars, home robotics, home security surveillance, and medical imaging technology.
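The asymmetry between the two overlap figures says something about the relative size of the two groups. The short sketch below works through that arithmetic using only the 78% and 52% shares quoted above; the variable names are illustrative, and the result is a derived estimate rather than a figure reported by the survey.

```python
# Shares quoted in the text above.
p_image_given_video = 0.78  # share of video users who also use image data
p_video_given_image = 0.52  # share of image users who also use video data

# The group using both types is the same whichever way it is counted:
#   p_image_given_video * video_users == p_video_given_image * image_users
# Rearranging gives the ratio of image users to video users.
image_to_video_ratio = p_image_given_video / p_video_given_image

print(f"Image users per video user: {image_to_video_ratio:.2f}")
```

Under these figures, image data users outnumber video data users by roughly three to two, which is why the same overlap group makes up a smaller share of image users.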

The types of data ML/AI/DS developers work with

69% of ML/AI/DS developers using tabular data also use unstructured text data

With 59% usage, tabular data is the second most popular type of data. 92% of the tabular data ML/DS/AI developers use is observed, while the other 8% is simulated. The two most common use cases for simulated data are workforce planning (39% of developers who use simulation do this) and resource allocation, also at 39%.

Structured tabular data is the least likely to be combined with other types of data. Although it is uncommon to combine this type of data with audio or video data, 69% of developers do combine tabular data with unstructured text data. The top application of both tabular data and unstructured text is the analysis and prediction of customer behaviour. This is the sort of analysis often done on the data nuggets we leave behind when searching on retail websites; these are key inputs to algorithms for natural language and recommender systems.

Keeping it strictly professional?

The professional status of ML/AI/DS developers

The professional / hobbyist / student mix in the ML/AI/DS ecosystem

ML/AI/DS developers engage in their fields at different levels. Some are professionals, others students or hobbyists, and some are a combination of the above. The majority (53%) of all ML/DS/AI developers are professionals — although they might not be so exclusively. 

Of all the data types, audio data has the highest proportion of professional ML/DS/AI developers. 64% of ML/AI/DS developers who use this type of data classified themselves as professionals, and half (50%) of these professionals are applying audio data to text-to-speech generation. The high proportion of professionals in this field might be a byproduct of co-influencing variables: audio data is the data type most frequently combined with other types, and professionals are more likely to engage with many different types of data.

Data types popular with students include image, tabular, and text data. Between 18% and 19% of developers who work with these types of data are students. There are many well-known datasets of these types of data freely available. With this data in hand, students also favour certain research areas.

Image classification, for example, is popular with developers who are exclusively students: 72% of those students who use image data use it for this application, in contrast to just 68% of exclusive professionals that do. In applying unstructured text data, 38% of exclusive students are working in Natural Language Processing (NLP), while 32% of exclusive professionals are. As these students mature to professionals, they will enter industry with these specialised skills and we expect to see an increase in the practical applications of these fields, e.g. chatbots for NLP. 

“65% of students and 54% of professionals rely on one or two types of data”

Besides differences in application areas, students, hobbyists, and professionals engage with varying types of data. 65% of those who are exclusively students use one or two types of data, while 61% of exclusively hobbyists and only 54% of exclusively professionals use one or two types. Developers who are exclusively professionals are the most likely to be relying on many different types of data: 23% use four or five types of data. In contrast, 19% of exclusively hobbyists and 15% of exclusively students use four to five types. Level of experience, application, and availability of datasets all play a role in which types of data an ML/AI/DS developer uses. The size of these datasets is the topic of the next section.

Is all data ‘big’?

The size of ML/AI/DS developers’ structured and unstructured training data

The hype around big data has left many with the impression that all developers in ML/AI/DS work with extremely large datasets. We asked ML/AI/DS developers how large their structured and unstructured training datasets are. The size of structured tabular data is measured in rows, while the size of unstructured data — video, audio, text — is measured in disc size. Our research shows that very large datasets aren’t perhaps as ubiquitous as one might expect. 

“14% of ML/AI/DS developers use structured training datasets with fewer than 1,000 rows of data, while the same proportion of developers use data with more than 500,000 rows”

The most common size band is 1K – 20K rows of data, with 25% of ML/AI/DS developers using structured training datasets of this size. This differs by application type. For example, 22% of those working in simulation typically work with 20K – 50K rows of data; while 21% of those working with optimisation tools work with 50K – 100K rows of data. 

Dataset size also varies by professional status. Only 11% of exclusively professional developers use structured training datasets with up to 20K rows, while 43% of exclusively hobbyists and 54% of exclusively students use these small datasets. This may have to do with access to these datasets — many companies generate large quantities of data as a byproduct or direct result of their business processes, while students and hobbyists have access to smaller, open-source datasets or those collected via their learning institutions. 

A further consideration is, who has access to the infrastructure capable of processing large datasets? For example, those who are exclusively students might not be able to afford the hardware to process large volumes of data. 

For non-tabular data, size on disc is a useful measure for comparisons within categories: for example, 18% of image datasets are between 50MB and 500MB, while only 8% are more than 1TB in size. The measure doesn’t, however, allow for cross-type comparisons, since different types of data take up different amounts of space. For example, 50MB of video data represents a considerably shorter running time than 50MB of audio data.
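To make the video-versus-audio comparison concrete, here is a minimal sketch that converts a file size into running time at a constant bitrate. The two bitrates (128 kbit/s for audio, 5,000 kbit/s for video) are illustrative assumptions chosen for this example, not figures from the survey, and real files vary widely with codec and quality settings.

```python
def duration_seconds(size_mb: float, bitrate_kbps: float) -> float:
    """Running time of a media file of size_mb megabytes at a constant bitrate."""
    size_bits = size_mb * 1_000_000 * 8  # decimal megabytes converted to bits
    return size_bits / (bitrate_kbps * 1_000)


AUDIO_KBPS = 128    # assumed bitrate for compressed audio
VIDEO_KBPS = 5_000  # assumed bitrate for compressed video

audio_s = duration_seconds(50, AUDIO_KBPS)
video_s = duration_seconds(50, VIDEO_KBPS)

print(f"50MB of audio: about {audio_s / 60:.0f} minutes")
print(f"50MB of video: about {video_s:.0f} seconds")
```

Under these assumed bitrates, the same 50MB holds close to an hour of audio but little more than a minute of video, which is why size on disc cannot be compared across data types.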

The size of unstructured training data ML/AI/DS developers work with

The categorisation of the different data sizes was designed to take into account the steps in required processing power. For most ML/AI/DS developers, we expect that a 1-25GB dataset could be handled with powerful, but not specialised, hardware. Depending on the language and modelling method used, 25GB on disc relates to the approximate upper bound in memory size that this type of hardware could support. 
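The link between size on disc and memory can be sketched as a simple check. The function name, the 32GB memory figure, and the expansion factors below are hypothetical values chosen for illustration; only the 25GB dataset size comes from the text above.

```python
def fits_in_memory(disk_gb: float, ram_gb: float, expansion: float = 1.0) -> bool:
    """Return True if a dataset of disk_gb is expected to fit in ram_gb of memory.

    expansion is the assumed ratio of in-memory size to on-disc size, which
    depends on the language and modelling method used.
    """
    return disk_gb * expansion <= ram_gb


RAM_GB = 32  # hypothetical workstation memory, not a survey figure

same_size = fits_in_memory(25, RAM_GB)               # loaded at its on-disc size
doubled = fits_in_memory(25, RAM_GB, expansion=2.0)  # loaded at twice its on-disc size

print(same_size, doubled)
```

With these assumptions, a 25GB dataset fits when it loads at its on-disc size but not when the chosen tooling doubles its footprint, which is one way to read the statement that the boundary depends on language and modelling method.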

We see that 26% of ML/AI/DS developers using text data and 41% using video data will require specialised hardware to manage their training. This need for specialised hardware acts as a barrier to entry: analysing such large datasets is out of reach without deep pockets to fund cloud-based technology support or infrastructure purchases.

Want to know more? This blog post explores where ML developers run their app or project’s code, and how it differs based on how they are involved in machine learning/AI, what they’re using it for, as well as which algorithms and frameworks they’re using.


The lasting effects of COVID-19 on how developers work and learn

The COVID-19 pandemic has fundamentally changed the way people work and learn across industries, and developers are no exception. Although, in the grand scheme of things, our data indicates that many developers have weathered the repercussions of an unprecedented crisis well, there is much more to tell.

The findings shared in this post are based on our Developer Economics 20th edition survey, which ran from December 2020 to February 2021 and reached 19,000 developers. Previously we reviewed how developers’ needs were changing due to COVID-19. Now we’ve taken a deep dive into our latest survey data to find which developer groups and regional communities were affected the most by the pandemic and in what ways. 

Before you dive into the data, our new global developer survey is live now. We have updated it with questions relevant for developers in 2021 – check it out and take part for a chance to leave your mark on the upcoming trends and win prizes.

How the COVID-19 pandemic affected the way developers work or study

In our survey, we asked developers to assess the impact of the COVID-19 pandemic on the way they work or learn. In perhaps the most salient category, 7% of developers said that they had lost their job in the aftermath of the pandemic and 9% had dropped out of their studies. Notwithstanding the severity of becoming unemployed in times of crisis, the IT sector is still seen as one of the least impacted sectors in terms of hiring during the global pandemic, with an almost unwavering demand for professionals in software and hardware segments.


Which developers were impacted the most?

Arguably, not every developer has been affected by their company’s decision to enforce remote working to the same extent. For example, does a working parent who juggles the daily demands of commutes and childcare embrace the opportunity of working remotely as a means of having a better work-life balance? On the other hand, if asked, how many pre-pandemic graduates would have, at the time of their study, agreed to go fully remote and potentially miss out on exploring the rich social life that a university offers to young people? Let’s take a closer look at some of our distinct developer groups to understand which factors have been having the greatest impact on ways of working and learning. 

When looking at how the pandemic affected developers of different experience levels, we find that the more experienced developers were also more affected in the way they work. For instance, 40% of developers with less than one year of work experience say they were unaffected in their ways of working, compared to 35% of developers with six or more years of experience. While the gap is not particularly large, junior developers appear to have switched to remote working to a lesser extent. This may be, in part, due to younger people and new hires wanting to go to the office, get to know their colleagues, and connect with their peers.

Developers working for large organisations were the most likely to go fully-remote during the pandemic

Next, we evaluated COVID-19’s impact with respect to company size. We find that developers working for larger companies were clearly more affected in their ways of working by the pandemic. While 42% of developers in small companies with between two and 50 employees say they were not affected, the number plummets to less than 30% for developers in companies of over 50 employees. Large enterprises with more than 5,000 employees have been battling the repercussions of the pandemic at the frontlines: 51% of developers here went fully remote, compared to just 29% of developers in companies with between two and 50 employees.

Developers in small companies were less affected by the pandemic.

Note that, except for small companies, switching to fully-remote working was the most likely outcome in our survey. There are good reasons for this: large business organisations are naturally more risk-averse and commonly need large contiguous office spaces that have to be fully closed for all of their employees to effectively contain the spread of the virus. On the other hand, many small companies had more remote-friendly organisational structures to begin with. In particular, start-ups have been known to promote a remote-first culture due to the apparent benefits of lower seed capital and broader options to recruit and pool talent.

Twice as many developers in Western Europe compared to East Asia went fully remote.

Switching to remote working has been more common in Western regions.

Perhaps it comes as no surprise when looking at different regions to find a substantial gap between the East and the West. Our data shows that the Western regions, such as Western Europe and the Americas, have more readily facilitated remote working for their employees. For instance, 41% of Western European developers went fully remote, as opposed to only 20% in East Asia. This could be due to a combination of different factors. For example, pundits argue that many Asian countries score low on the technological infrastructure deemed necessary to adapt to remote working conditions, citing poor home-office equipment or internet connectivity that is sensitive to traffic surges. Yet social factors may also have played a role, as larger average household sizes and smaller apartments in emerging regions pose roadblocks of their own for employees trying to balance their work and home life.

The pandemic took a heavier toll on young learners.

Lastly, we looked at COVID-19’s impact on learners. 39% of 18- to 24-year-olds stated that the way they study has not been affected during the pandemic. In contrast, among students aged 25 and over, 50% or more were not affected. An interesting trend emerges here: younger learners in particular had to adapt by becoming partly or fully remote. Our data offers one possible explanation: younger learners are more likely to be enrolled in a formal degree program than older learners, who are more likely to be self-taught and to be found burning the midnight oil with online courses and boot camps that have traditionally fostered remote ways of studying.