Debunking the myth of the citizen data scientist


Photo by Hunters Race on Unsplash

Gartner analysts coined the term “citizen data scientist” to refer to “a person who creates and generates models that leverage predictive or prescriptive analytics, but whose primary function is outside the realm of statistics and analytics.” In some circles, this role has been widely promoted as the way to help organizations accelerate their ML/AI journey.

But as the saying goes, caveat emptor. While there are some benefits to having citizen data scientists, they’re not a silver bullet — and they certainly don’t replace real data scientists.

Organizations in almost every industry are actively working to see how they can leverage AI and ML to accelerate their business and achieve better results. Yet on average, only 54% of AI models move from pilot to production, according to a new survey from Gartner. There are several reasons behind this; for some companies, the lack of a qualified data science team is one of them.

Qualified professionals cost money: you need to hire the right people with the right skills. So the idea of a citizen data scientist is appealing. It is the notion that you could ask someone to quickly build a model using the toolsets provided to them, without a solid understanding of the data, its background, or how to clean it and choose the right features.

This scenario is attractive but highly unlikely. Someone without that experience and context-specific understanding can’t just build a model. The result won’t be good because they don’t know what they have; they are just throwing things into the system.

Additionally, we have seen a number of examples where even well-trained data scientists have (inadvertently) contributed to bias and drift issues with ML models. Think of Zillow’s failed iBuying algorithms or Facebook’s terrible photo mislabeling incident. In other words, if even experts can be wrong, how can we reasonably expect novices to succeed?

If you go this route, you’re going to end up with a classic “garbage in, garbage out” problem. Suppose your organization is a bank and you are a business analyst with access to a data warehouse. You get access to people’s income level, demographics, location, address, and more. The idea that some tech vendors put forward is that you can just take that data and feed it into their tool, and it will choose the right algorithm for you and give you the right prediction.

But what often happens in this process is that no one is involved to make sure the data is correct. Someone needs to do the data cleansing and feature engineering, and that requires understanding what you have. For example, a loan application has different elements: a mailing address, a telephone number, and so on. When doing feature engineering, every piece of information can be a feature.
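As a rough illustration of the cleansing step for the loan application example, here is a minimal Python sketch. The field names, the sample records, and the rule that "-1" or a blank stands for a missing income are all assumptions made up for this example, not details from any real bank system or vendor tool.

```python
import re

# Hypothetical raw loan applications; every field name and value is invented.
RAW_APPLICATIONS = [
    {"income": "52000", "phone": "(555) 010-2000", "mailing_address": " 12 Oak St "},
    {"income": "-1", "phone": "555.010.3000", "mailing_address": "9 Elm Ave"},  # -1 assumed to mean "missing"
    {"income": "", "phone": "5550104000", "mailing_address": ""},               # blank income
    {"income": "87,500", "phone": "555 010 5000", "mailing_address": "4 Pine Rd"},
]


def parse_income(raw):
    """Return income as a float, or None if it is blank, malformed, or not positive."""
    text = raw.replace(",", "").strip()
    try:
        value = float(text)
    except ValueError:
        return None
    return value if value > 0 else None


def clean_applications(records):
    """Drop records without a usable income and normalize the remaining fields."""
    cleaned = []
    for record in records:
        income = parse_income(record["income"])
        if income is None:
            continue  # an unusable income would mislead any model trained on it
        cleaned.append({
            "income": income,
            "phone": re.sub(r"\D", "", record["phone"]),  # keep digits only
            "mailing_address": record["mailing_address"].strip(),
        })
    return cleaned


CLEANED = clean_applications(RAW_APPLICATIONS)
```

Even in this toy version, deciding that "-1" means "missing" rather than a real value is a judgment that depends on knowing how the data was collected, which is the kind of context the article argues a tool cannot supply on its own.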

Typically, what a data scientist needs to do is figure out which elements carry the most weight in predicting the outcome. And a citizen data scientist is unlikely to be able to do that without a lot of training and hard work.
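One very crude way to picture "weight" is to compare how strongly each candidate feature correlates with the outcome. The sketch below uses Pearson correlation on a handful of invented applicants; the feature names and numbers are hypothetical, and correlation is only a stand-in here, since real feature selection also has to account for nonlinear effects, interactions, and leakage.

```python
import math

# Hypothetical applicants; every value below is invented for illustration.
FEATURES = {
    "income_k": [30, 45, 60, 80, 100, 120],   # income in thousands
    "phone_last_digit": [2, 9, 8, 4, 1, 5],   # a field with no plausible predictive value
}
REPAID = [0, 0, 1, 0, 1, 1]                   # 1 = loan repaid, 0 = default


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Correlation of each candidate feature with the outcome.
WEIGHTS = {name: pearson(values, REPAID) for name, values in FEATURES.items()}

# Features ordered from strongest to weakest absolute correlation.
RANKED = sorted(WEIGHTS, key=lambda name: abs(WEIGHTS[name]), reverse=True)
```

On this made-up data, income ranks well above the phone digit. With only six rows, though, a meaningless field could just as easily look predictive by chance, which is why someone who understands the data has to review the ranking rather than accept it.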

If that person doesn’t understand the data and puts it into a tool without proper data cleansing, your input is garbage and the system will spit out garbage due to a lack of understanding of the data. The tool alone cannot improve the data.
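To make the "garbage in, garbage out" point concrete, here is a small Python sketch. It assumes, purely for illustration, that a source system writes 999999 whenever an income is unknown; the figures are invented.

```python
from statistics import mean

UNKNOWN = 999999  # hypothetical placeholder a source system writes for "unknown"
REPORTED_INCOMES = [52000, 61000, UNKNOWN, 48000, UNKNOWN]

# Average taken straight from the warehouse, placeholders included.
RAW_AVERAGE = mean(REPORTED_INCOMES)

# Average after someone who knows the data removes the placeholders.
CLEAN_AVERAGE = mean([value for value in REPORTED_INCOMES if value != UNKNOWN])
```

The raw average is several times larger than the cleaned one, and nothing in the numbers themselves says which is right. A tool fed the raw column would treat the placeholders as real incomes.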

Even with advanced AI/ML tools, you still need trained data scientists who can organize data and determine what is good and what is bad. You need people who know how to do feature engineering. Otherwise, you’re going to end up with models or algorithms that fail you.

A citizen data scientist is a great aspiration, but it is not a panacea, and it does not replace trained data scientists. To put it another way, how would you feel if you were on a plane and were told that a “citizen pilot” (or even a student pilot or avid flight simulator user) would be flying it?

It is not so simple. And again, that doesn’t mean the citizen data scientist isn’t a viable concept. It’s just important to understand that these roles should be complements, not substitutes, for data scientists.

It’s tempting to think that you can use technology to disrupt or replace the need for certain skilled roles, especially since many industries are struggling with skills gaps. However, right now, the technology is not at a level where AI/ML projects can simply be handed over to citizen data scientists.

Despite all the talk of no-code or low-code software development, the industry has come to understand how realistic this approach is and where it works. It does not work for serious, mission-critical software. So it’s even wackier to have only citizen data scientists running your AI/ML.

A deep understanding of your data is critical to AI/ML success. This is what professional data scientists provide, along with the necessary contextual information that determines which data is good and useful and which is not. While it is certainly positive to bring new voices and ideas to the field, it does not eliminate the need for professional and skilled data scientists. You can’t cut costs by not having professional data scientists in place.

Victor Thu is president of Datatron. Throughout his career, Victor has specialized in product marketing, go-to-market, and product management in C-level and director positions for companies such as Petuum, VMware, and Citrix.

Sean N. Ayres