An honest revelation from a data scientist

Time passes. It seems like only yesterday that I graduated from college and entered the big leagues as a novice statistician. Yet, six years and many mistakes later, I have evolved from junior analyst to data science consultant. Much has changed in the world of data science learning in these six years, but some misconceptions still persist.

In this article, I will try to address these misconceptions and attempt to paint a realistic picture of the data science world.

Data science myths:

Most online data science courses and articles do an amazing job of giving a brief overview of the technical aspects of data science. But they also plant some well-known data science myths deep in the learner's mind as if they were reality. It's time to burst those bubbles once and for all.

#1 Data science is about building models.

If you spend enough time with data science content online, you will inevitably come across terms like machine learning, artificial intelligence, neural networks, and data modeling. Unfortunately, the internet tends to overhype these keywords. In reality, data science requires understanding the data well, identifying patterns, and creating signals that support those patterns. Real-world data is messy and unstructured. It takes a lot of work just to bring the data up to the standard that any notable model requires, long before any modeling begins. In my early days, I hardly worked on any modeling at all. Instead, I spent my time finding, validating and cleaning data, which is the mundane and unattractive part of data science. Only later did I understand why these mundane tasks matter so much: they are the most crucial part of any data science solution.

#2 We have almighty algorithms that can do it all.

In reality, every algorithm has its own set of advantages and disadvantages. You have to balance these trade-offs carefully to get the most out of it. A thorough understanding of an algorithm's background, assumptions, and operation helps you assess whether it applies in a particular situation. Additionally, careful hyper-parameter tuning of even the most basic algorithms can give better and more stable results than the default configuration of high-end algorithms. I learned the hard way that it's better to stick with a particular algorithm and extract the best from it than to bombard the data with every algorithm known to humankind.
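To make the tuning point concrete, here is a minimal sketch of what "extracting the best" from a basic algorithm can look like. It tunes the single hyper-parameter k of a hand-written k-nearest-neighbours classifier using leave-one-out cross-validation. The data set, the function names and the candidate values of k are all hypothetical and chosen only for illustration. The sketch does not compare against any high-end algorithm; it only shows the tuning loop itself.

```python
from collections import Counter

# Hypothetical toy data: (feature, label) pairs. Values are illustrative only.
DATA = [
    (1.0, "a"), (1.5, "a"), (2.0, "a"), (2.5, "b"), (3.0, "a"),
    (6.0, "b"), (6.5, "b"), (7.0, "a"), (7.5, "b"), (8.0, "b"),
]


def knn_predict(train, x, k):
    """Predict the label of x by majority vote among its k nearest neighbours."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy: hold out each point once, predict it from the rest."""
    hits = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += knn_predict(train, x, k) == label
    return hits / len(data)


def tune_k(data, candidates):
    """Return (best_k, scores); scores maps each candidate k to its accuracy."""
    scores = {k: loo_accuracy(data, k) for k in candidates}
    best_k = max(candidates, key=lambda k: scores[k])
    return best_k, scores


# k=1 plays the role of the untuned "standard version" here (an assumption).
best_k, scores = tune_k(DATA, candidates=[1, 3, 5])
print(best_k, scores)
```

On this toy data the tuned k scores higher than k=1, but the numbers carry no meaning beyond the illustration. In a real project the same loop would usually come from a library and run on properly held-out data.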

#3 Data science is a one-man army.

Online courses offer "realistic live projects," but they fail to teach a skill that is key to any data science project: collaboration. Typically, a data science team will consist of: i) a senior data scientist who provides general guidance and manages the progress of the project, ii) a couple of senior data scientists who work on the complex elements of data pattern recognition and solution design, iii) a group of junior data scientists who are still on the learning curve, and iv) data engineers who work on getting the data into the right format. You will need to communicate regularly with your team about what you are doing, how you are doing it, and what the results are. Your work will be evaluated and reviewed by seniors. You'll have to work on a range of different tasks, whether it's cleaning data or finding patterns.

#4 There is one standard approach.

Unfortunately, each solution is different. The approach you will work on depends on how the solution is designed, and solution design varies widely because experienced data scientists differ in their skills and in their understanding of data science approaches.

To date, there is no standard operating procedure (SOP) for how a data science project should be approached or how its intermediate processes should be managed. Even the internet has no clear answer, and every other website offers a very different approach to the same problem. The lack of SOPs, combined with the variability in how data scientists interpret a problem, makes teamwork extremely difficult. Different skill levels make the team effort disjointed, and the success of the project ends up relying solely on the abilities of the most experienced and talented members.

Problems due to the lack of standard operating procedures:

The lack of standardization is fast becoming one of the main obstacles in data science. There are a few basic steps you need to follow for every solution, and you will inevitably run into issues with these steps because there is no standard procedure. Some data science professionals have devised their own ways of handling them based on experience, but not everyone has access to that knowledge. So a lot of time is spent online looking for solutions on StackOverflow and similar platforms. Not every solution presented online is relevant either, and it takes a lot of trial and error to find the one that fits.

An internal survey was conducted to measure the approximate time that data science practitioners from various Wipro teams allocate to the different stages of the data science workflow. The survey results were compiled and averaged to even out the skill differences between practitioners. The results are presented in the table below.

Sl. No. | Stage      | Description                                                                                 | Time (minutes)
--------|------------|---------------------------------------------------------------------------------------------|---------------
1       | Explore    | Exploring the best approach to the ML model                                                 | 235
2       | Adjust     | Test if the approach matches the problem                                                    | 80
3       | Enforce    | Implement the best approach to the problem                                                  | 80
4       | Understand | In-depth understanding of data and building functionality using data management techniques | 50
5       | Model      | Create and validate an ML model                                                             | 50
6       | Production | Productionize the ML model using MLOps                                                      | 30
7       | Research   | Ideas for new use cases, brainstorming, reading about new technologies                      | 0

Table 1: Average time a data scientist allocates to each of the redefined stages of the ML process in an 8.75-hour working day
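As a quick sanity check on Table 1, the short sketch below adds up the stage times and expresses each stage as a share of the day. The minute values are copied from the table; the variable and function names are my own and purely illustrative.

```python
# Average minutes per stage, copied from Table 1.
STAGE_MINUTES = {
    "Explore": 235,
    "Adjust": 80,
    "Enforce": 80,
    "Understand": 50,
    "Model": 50,
    "Production": 30,
    "Research": 0,
}


def stage_shares(minutes_by_stage):
    """Return each stage's fraction of the total time across all stages."""
    total = sum(minutes_by_stage.values())
    return {stage: minutes / total for stage, minutes in minutes_by_stage.items()}


total_minutes = sum(STAGE_MINUTES.values())
total_hours = total_minutes / 60
shares = stage_shares(STAGE_MINUTES)

# Print one line per stage, then the total in minutes and hours.
for stage, share in shares.items():
    print(f"{stage:<11} {STAGE_MINUTES[stage]:>4} min  {share:6.1%}")
print(f"Total: {total_minutes} min = {total_hours} h")
```

The stages add up to 525 minutes, which matches the 8.75-hour day in the caption. "Explore" alone takes roughly 45 percent of that time, while "Research" gets none.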

It has been observed that while the time spent on "Understand", "Model" and "Production" remains more or less the same for every data scientist, the decisive difference lies in the time spent on the "Explore" stage. This stage distinguishes a novice from a master data scientist. It highlights the huge skill gap between data scientists and the impact of that gap on the implementation of the whole project. Beyond that, data scientists today have almost no working minutes left for learning about new technologies and brainstorming new ideas, so they inevitably extend their working hours to compensate. In addition, they cut into the time allocated to "Understand" and "Model", which hampers the quality and stability of the model.

Conclusion

The standardization of ML processes is the need of the hour for every data scientist. The industry has belatedly started to understand the dangers of a people-dependent approach to data science, as opposed to a process-dependent one. A well-equipped, standardized ML process can reduce the burden on data scientists and allow them to make better use of their resources. A standardized procedure will also contribute to the democratization of the ML modeling framework and help create ML models of a higher standard.

Sean N. Ayres