What IBM looks for in a data scientist


Job seekers sometimes ask how IBM defines “data scientist.” This is an important question as more would-be scientists compete for attention in an increasingly lucrative job market.

The first step is to distinguish between what we consider true data scientists and other professionals working in adjacent roles (e.g., data engineers, business analysts, and AI application developers). To make this distinction, let’s first define what we mean by data science.

At its core, data science applies the scientific method to solving business problems.

To expand the definition further: we solve these business problems by using artificial intelligence to make predictions, prescribe actions, and optimize processes.

This definition shows that to realize the true potential of data science, we need data scientists with very specific experience and skills:

1. Training as a scientist, with a master’s or doctoral degree
2. Expertise in machine learning and statistics, with a focus on decision optimization
3. Expertise in R, Python or Scala
4. Ability to transform and manage large datasets
5. Proven ability to apply the above skills to real business problems
6. Ability to evaluate model performance and adjust accordingly

Let’s examine these qualifications in the context of our definition of data science.

1. Training as a scientist, with an MSc or Ph.D.

It’s less about the degree itself and more about what you learn while earning it. In short, you learn the scientific method, which begins with the ability to take a complex but abstract problem and break it down into a set of testable hypotheses. It continues with designing experiments to test those hypotheses and analyzing the results to see whether the hypotheses are confirmed or contradicted. A determined individual can acquire these skills outside of academia or through the right combination of online training and practice, so there is some flexibility around holding an actual degree, but direct experience of applying the scientific method is essential.
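As a rough illustration of turning a business question into a testable hypothesis, the sketch below checks the hypothesis "the new checkout page raises average order value" with a permutation test. The data, the function name `permutation_test`, and the choice of a permutation test are all assumptions made for this example, not a method prescribed by the authors.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test(control, treatment, n_resamples=5000, seed=0):
    """Return the observed difference in means and a one-sided p-value.

    Null hypothesis: group labels do not matter, so shuffling them
    should produce differences as large as the observed one fairly often.
    """
    rng = random.Random(seed)
    observed = _mean(treatment) - _mean(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = _mean(pooled[n_control:]) - _mean(pooled[:n_control])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Hypothetical order values from an A/B experiment (illustrative only).
control = [20.1, 19.8, 21.0, 20.5, 19.9, 20.3, 20.7, 19.6]
treatment = [21.2, 21.9, 20.8, 22.1, 21.5, 21.0, 22.4, 21.7]

observed, p_value = permutation_test(control, treatment)
print(f"observed lift: {observed:.2f}, p-value: {p_value:.4f}")
```

A small p-value means the shuffled labels rarely reproduce the observed lift, which contradicts the "no effect" hypothesis; a large one means the experiment gives no grounds to prefer the new page.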

Another benefit of a graduate degree is the rigorous peer review process and publication requirements imposed by degree programs. To be published, candidates must present their work in a way that allows others to review and reproduce it. They must also provide evidence that the results are valid and the methods are sound. This requires a thorough understanding of the difference between probabilistic and deterministic factors, as well as the value and the pitfalls of correlation. It is possible to grasp these concepts in the abstract, but there is no substitute for the positive and negative reinforcement that comes from mentors, or from having work accepted or rejected by reviewers.

2. Expertise in machine learning and statistics, with a focus on decision optimization

Applying the scientific method to business problems allows us to make better decisions by predicting what will happen next. These predictions are the product of artificial intelligence and more specifically machine learning. For a true data scientist, basic technical skills in machine learning and statistics are simply non-negotiable.

Additionally, decision optimization (aka operations research) is a rapidly growing aspect of data science. Indeed, the goal of data science is to help make better decisions by probabilistically estimating what is likely to happen in the future. The judicious application of decision optimization enables data scientists to prescribe or determine the next best action to achieve the best business results.
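To make the step from prediction to prescription concrete, here is a minimal sketch that takes a probabilistic demand forecast and picks the order quantity with the highest expected profit. The forecast, prices, and function names are hypothetical, and the brute-force search stands in for the linear or mixed-integer solvers that real decision optimization work would normally use.

```python
def expected_profit(order_qty, demand_forecast, unit_cost, unit_price):
    """Expected profit of one order quantity under a discrete demand forecast.

    demand_forecast maps a demand level to its predicted probability.
    Units beyond actual demand are assumed to be unsold and worthless.
    """
    return sum(
        prob * (unit_price * min(order_qty, demand) - unit_cost * order_qty)
        for demand, prob in demand_forecast.items()
    )


def best_order_quantity(demand_forecast, unit_cost, unit_price):
    """Exhaustively search every quantity up to the highest forecast demand."""
    candidates = range(0, max(demand_forecast) + 1)
    return max(
        candidates,
        key=lambda qty: expected_profit(qty, demand_forecast, unit_cost, unit_price),
    )


# Hypothetical output of a predictive model: demand level -> probability.
forecast = {80: 0.2, 100: 0.5, 120: 0.3}

best_qty = best_order_quantity(forecast, unit_cost=6.0, unit_price=10.0)
print(best_qty, expected_profit(best_qty, forecast, 6.0, 10.0))
```

The prediction alone says demand will most likely be 100 units; the optimization step is what turns that forecast, together with costs and prices, into a recommended action.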

3. Expertise in R, Python or Scala

Being a data scientist doesn’t require you to be as good at programming as professional developers, but the ability to create and run code that supports the data science process is mandatory – and that includes the ability to use statistical and machine learning packages in one of the popular data science languages.

Python, R, and Scala are the fastest-growing languages for data science. Julia is another language entering the space, though it isn’t fully mature yet. Like Python, R, and Scala, Julia’s core is open source. It’s important to note, however, that the reason to use these languages isn’t that they’re free; it’s the innovation they enable and the freedom to take them where you want to go.

4. Ability to transform and manage large datasets

The fourth skill is sometimes called big data. Here, the ability to use distributed data processing frameworks like Apache Spark is key. The true data scientist will know how to bring together datasets from multiple sources and multiple data types with the help of their data science team. The data itself can be a combination of structured, semi-structured, and unstructured data living across multiple clouds.

The data management process consists of finding and collecting the data, exploring the data, transforming the data, identifying the features (the data elements that matter to the prediction), engineering the features, and making the data accessible to the model for training. A priority for any data scientist will be to streamline this process, which can easily consume up to 80% of their time.
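The stages listed above can be sketched at toy scale in plain Python. This is only an illustration under stated assumptions: the two in-memory sources, the column names, and the functions `collect`, `transform`, and `build_features` are all hypothetical, and at real scale the same steps would typically run on a distributed framework such as Apache Spark.

```python
import csv
import io
import json

# Hypothetical source 1: structured order data (CSV). One amount is missing.
ORDERS_CSV = """customer_id,amount
c1,20.0
c1,35.5
c2,12.0
c3,
c4,5.0
"""

# Hypothetical source 2: semi-structured customer profiles (JSON).
PROFILES_JSON = '[{"id": "c1", "segment": "retail"}, {"id": "c2", "segment": "business"}]'


def collect():
    """Find and collect the raw data from both sources."""
    orders = list(csv.DictReader(io.StringIO(ORDERS_CSV)))
    profiles = json.loads(PROFILES_JSON)
    return orders, profiles


def transform(orders):
    """Drop rows with a missing amount and convert amounts to floats."""
    return [
        {"customer_id": row["customer_id"], "amount": float(row["amount"])}
        for row in orders
        if row["amount"]
    ]


def build_features(orders, profiles):
    """Join the sources and derive one feature row per customer."""
    segment_by_id = {profile["id"]: profile["segment"] for profile in profiles}
    stats_by_customer = {}
    for row in orders:
        stats = stats_by_customer.setdefault(
            row["customer_id"], {"order_count": 0, "total_spend": 0.0}
        )
        stats["order_count"] += 1
        stats["total_spend"] += row["amount"]
    return [
        {"customer_id": cid, "segment": segment_by_id.get(cid, "unknown"), **stats}
        for cid, stats in sorted(stats_by_customer.items())
    ]


orders, profiles = collect()
feature_rows = build_features(transform(orders), profiles)
for row in feature_rows:
    print(row)
```

Even in this small sketch, most of the code deals with missing values, type conversion, and joins rather than modeling, which is consistent with how much time the process tends to consume.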

5. Proven ability to apply the above skills to real business problems

Fifth on the list is a set of soft skills. It is the ability to communicate with non-data scientists to ensure that data science teams have the data resources they need and that they apply data science to the right business problems. Mastering this skill also means ensuring that the results of data science projects – for example, predictions about the likely evolution of the business – are fully understood and actionable by business people. This requires good storytelling skills and, in particular, the ability to match mathematical concepts to common sense.

6. Ability to evaluate model performance and adjust accordingly

For some, this sixth skill set is an aspect of the second skill set: machine learning expertise in general. We wanted to call it out separately because, too often, it’s what separates a good data scientist from a dangerous one. Data scientists who lack this skill can easily believe that they have created and deployed effective models when, in fact, their models are grossly over-fitting the available training data.
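A minimal sketch of the check that catches this failure: compare a model’s error on the data it was trained on with its error on held-out data. The synthetic data, the two toy models, and all names below are assumptions chosen for illustration; a one-nearest-neighbor "memorizer" is used simply because it overfits in an easy-to-see way.

```python
import random


def mse(model, data):
    """Mean squared error of a model (a callable x -> y) on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def fit_line(data):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(data):
    """One-nearest-neighbor model: return the y of the closest training x."""
    return lambda x: min(data, key=lambda point: abs(point[0] - x))[1]


# Synthetic data (illustrative only): a linear signal plus Gaussian noise.
rng = random.Random(42)
points = []
for _ in range(300):
    x = rng.uniform(0, 10)
    points.append((x, 2.0 * x + rng.gauss(0, 1)))

train, holdout = points[:200], points[200:]

line = fit_line(train)
memorizer = fit_memorizer(train)

line_train, line_holdout = mse(line, train), mse(line, holdout)
knn_train, knn_holdout = mse(memorizer, train), mse(memorizer, holdout)

print(f"line:      train={line_train:.2f}  holdout={line_holdout:.2f}")
print(f"memorizer: train={knn_train:.2f}  holdout={knn_holdout:.2f}")
```

The memorizer reproduces its training data perfectly, so judged on training error alone it looks like the better model. Only the holdout error reveals that it has learned the noise rather than the signal.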

Be a real data scientist

If you want to be a true data scientist, as opposed to an aspiring one or a data scientist in title only, we encourage you to master each of these six skills. A data scientist is fundamentally different from a business analyst or data analyst, who often serves as a product owner within a data science team, with the important role of providing subject matter expertise to the data scientists themselves.

That’s not to say business analysts, data analysts, and others can’t make the transition to becoming true data scientists, but understand that it takes time, commitment, mentorship, and applying yourself again and again to real and difficult problems.

Seth Dobrin is vice president and chief data officer at IBM Analytics.

Jean-François Puget is an IBM engineer emeritus in machine learning and optimization.


Sean N. Ayres