Data scientists need to make sense of the big picture

The web echoes with cries for help with learning data science. “How do I get started?” “Which are the must-know algorithms?” “Can someone point me to the best resources for deep learning?” In response, a bustling ecosystem of learning resources of all shapes and sizes has sprung to life. But are the skills to unlock the deepest secrets of deep learning what emerging data scientists truly need? Our research has consistently shown that only a minority of data scientists need high-performing predictive models, while most would benefit from learning how to decide whether to build an algorithm at all and how to make sense of it, rather than how to actually build one.

In our latest State of the Developer Nation report, we put the demand for data science skills into perspective by comparing it to other skills that developers are after. Based on the responses of more than 17,000 developers globally, data science related skills are indeed the most sought-after: 45% of developers report that they want to learn data science and machine learning in the next year, and 22% say they want to acquire data engineering skills.

Data science and machine learning grew in popularity thanks to advances in cloud computing and deep learning, and to all things connected generating tons of data. Could one, therefore, claim that all those interested in data science are after the gems hidden in big data, and the gains of predicting outcomes, often in real time? What part of the fast-growing data science community is, in fact, digging into large datasets to discern signal from noise, and how many are generating predictions to feed smart systems, apps, or products? To answer these questions, we asked data scientists in our fifteenth Developer Economics survey about the size (length and width) of their training sets, whether they use models to generate predictions and, if so, of what type (batch / real-time) and volume. Their answers were not what you might expect.

To begin with, 29% said they don’t train models at all, and that’s after excluding those who said they only deploy models or build frameworks. As if that were not enough, 38% report working on datasets no bigger than 20,000 rows, and only 2% crunch more than 50M records. If you’re thinking that this may be some kind of glitch in our data, think again. We have asked the exact same question in past surveys, and the results point to a stable pattern rather than to inconsistent noise – true for both professionals and amateurs.

The predictive aspect of data science doesn’t look very bright either: the majority (58%) of data scientists don’t generate model-based predictions, and only 21% work on real-time predictions.

There are, of course, variations on the above theme, depending among other things on the type of project one is working on, but even in the areas that typically involve the analysis of large transactional datasets (such as the analysis of user behaviour, fraud detection, and recommendation systems) more than 40% of data scientists report using datasets of up to 20,000 records.

The underlying reasons extend deeper than a mere lack of interest in, or need for, larger datasets and real-time predictions, and point to the key challenges that data scientists face: the availability of clean and annotated data, and access to non-proprietary datasets. Whatever the reasons, however, the fact remains that the majority of data scientists, whether novices or seasoned professionals, do not seem to need high-performing predictive models, contrary to popular belief.

Add to the above the fact that automated machine learning methods, which shoulder all the dirty work of model selection, are rapidly gaining ground, and you’re left wondering whether mastering machine learning algorithms and architectures is needed at all. Efficient Neural Architecture Search (ENAS) functionality is spreading beyond paid solutions such as Google’s AutoML and into open source – the natural habitat of machine learning – through libraries and frameworks such as Keras (AutoKeras), PyTorch, and scikit-learn (auto-sklearn). You don’t have to be a Deep Learning Master, or in command of a substantial budget, to find the best architecture for your data – if you need one at all, that is.
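To make "automated model selection" a little more concrete, here is a toy sketch in plain Python. It is an illustration under stated assumptions, not how AutoKeras, auto-sklearn, or Google’s AutoML work internally: the data is synthetic, the three candidate models are deliberately trivial, and every name in it (`select_model`, `CANDIDATES`, and so on) is hypothetical. The idea it shows is simply that a program can fit several candidates, score each on held-out data, and pick the winner without a human choosing the algorithm.

```python
import random
import statistics


def fit_mean(xs, ys):
    """Baseline candidate: always predict the mean of the training targets."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Candidate: one-feature least-squares line, computed in closed form."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_nearest(xs, ys):
    """Candidate: 1-nearest-neighbour lookup on the single feature."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return statistics.fmean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])


def select_model(candidates, xs, ys, holdout=0.25, seed=0):
    """Fit every candidate on a training split and rank them on a holdout split."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * (1 - holdout))
    train, valid = indices[:cut], indices[cut:]
    train_x, train_y = [xs[i] for i in train], [ys[i] for i in train]
    valid_x, valid_y = [xs[i] for i in valid], [ys[i] for i in valid]
    scores = {
        name: mse(fit(train_x, train_y), valid_x, valid_y)
        for name, fit in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical candidate pool; real AutoML tools search far larger spaces.
CANDIDATES = {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}

# Synthetic data (an assumption for illustration): y = 3x + 2 plus small noise.
noise = random.Random(42)
xs = [float(x) for x in range(40)]
ys = [3.0 * x + 2.0 + noise.gauss(0.0, 0.5) for x in xs]

best, scores = select_model(CANDIDATES, xs, ys)
print("selected:", best)
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"  {name}: holdout MSE = {score:.3f}")
```

The selection loop itself is only a few lines, which is part of the point: once the search is automated, the harder questions are whether the data warrants a model at all and what the winning model's output actually means.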

Given all the above, what skills should aspiring data scientists acquire at the end of the day? In our view, data science mastery boils down to the ability to make sense of it all. Given the abundance of courses and the rise of AutoML, data science expertise won’t be (and already isn’t) the privilege of an elite. To use a popular expression, data science and machine learning are rapidly being democratised.

What would make you stand out from the crowd? First, the analytical mindset that will allow you to translate a business problem into a data project, or a market need into a data product. Then, the skill to recognise and face data challenges in good time – and to design your data infrastructure accordingly. But most importantly, when all is optimised and analysed and modelled, the ability to make sense of the outcome, to answer the “so what?” and the “does it make sense?” questions, and to loop back to the business (or even to your own pet project) with an answer in human, not ML, language. These more qualitative – or soft – skills on top of a solid technical background will be, in our view, the key differentiators in a market that is becoming increasingly crowded.

Want to tell us about your skills and experience in data science and machine learning? Join our Developer Economics community and take part in our surveys where you can win exclusive prizes and perks!

By Christina Voskoglou
