Debunking Data Myths and Misconceptions with Dun & Bradstreet Chief Data Scientist Anthony Scriffignano

With the endless amounts of data currently available and growing at a staggering rate, the opportunities for modern businesses to glean valuable information and the resulting possibilities are constantly evolving. As a result, the conversation around data and its uses must also evolve. Just a few years ago, the main topics of interest to data industry players focused on big data, data localization, unstructured data, and predictive analytics. More recently, the focus has shifted to topics such as data manipulation, personal data privacy and aspects of data bias. As data continues to gain in importance, so does the list of questions and concerns associated with its use.

At the Data for AI community’s April event, Dun & Bradstreet Chief Data Scientist and Senior Vice President Anthony Scriffignano spoke about myths and misconceptions about data and how organizations must push for more responsible use. Scriffignano is an internationally recognized data scientist with over 35 years of experience in multiple industries, and as Chief Data Scientist at Dun & Bradstreet he is deeply involved in the worlds of finance, government and business. At the event, he shared insights into the use and application of data at companies like Dun & Bradstreet.

Decisions about the use of data

Data is a unique resource in many ways. Unlike natural resources, which have limited availability and lifespan, data has no such limits: it keeps increasing in quantity and availability. Data continues to grow and even compound as it is used. In fact, using or generating data creates metadata, or data about data. As such, the use of data raises unique considerations, because its growth, availability, and handling needs become increasingly difficult to manage.

Scriffignano explains that in any scenario, there are two types of data: data on hand that is readily available for use, and discoverable data that could be acquired to further inform a decision. The first question when using data should be whether there is enough data on hand to make the decision at all. The answer depends on the type of question the data is supposed to answer and whether the available data is representative of reality.

However, there is also a third type of data: existing but incomplete data. There are often many unknowns in data, but despite these unknowns, organizations are called upon to make decisions. For example, the advances in science and knowledge needed to make the first moon landing possible were incredible, but there was one factor scientists couldn’t estimate: the “squishiness” of lunar regolith, or moon dust. Because the lunar module had to land on a surface with unknown properties, it was given giant hemispherical feet to prevent it from tipping over. By considering incomplete data and its potential effects on the outcome of a situation, one can take into account a wider range of possibilities and make a sound decision even when the available data does not provide a complete picture.

Common myths and inconvenient truths

Just a few decades ago, devices like fitness monitors, GPS, or AI-powered recommendation engines would have looked like science fiction to many. However, thanks to the power of data and advanced analytical methods, they are all possible today. Indeed, many of us use these devices on a daily basis without giving much thought to how they work or to the amount of data they collect and use.

We take for granted many of the modern conveniences that data makes possible. However, when working with data, Scriffignano explains, it is the user’s responsibility to stop and consider the context of the information and its true meaning. By thinking about how data evolves and how this changes the conclusions drawn from it, one can avoid falling victim to common myths and misconceptions surrounding data and its use.

One of the most common myths is that more data makes a better picture. As the human race generates and accumulates incredible amounts of data, identifying the data of interest becomes a search for a needle in an ever-growing haystack. Important data gets harder to find, and the potential for errors, bias or noise to be amplified grows. More often than not, blindly collecting more and more data will turn a “data lake” into a “data marsh”.
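One way to see how noise gets amplified is a small simulation. The sketch below is an illustration only, not something presented at the event: it assumes a purely random target and a growing number of equally random, unrelated columns, and reports the strongest correlation found by chance. The function names and parameters are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_columns, n_rows=30, seed=0):
    """Largest absolute correlation between a random target and
    n_columns random columns that have nothing to do with it."""
    rng = random.Random(seed)
    # Assumption: the target is pure noise, so any correlation is spurious.
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_columns):
        column = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(column, target)))
    return best


if __name__ == "__main__":
    for n_columns in (5, 50, 500):
        strongest = strongest_spurious_correlation(n_columns)
        print(f"{n_columns:>4} unrelated columns -> strongest |r| = {strongest:.2f}")
```

Because the same seed is reused, each larger run contains the columns of the smaller one, so the strongest chance correlation can only stay the same or increase as columns are added. None of these columns carries any real signal.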

Another myth is that by using data, AI and machine learning will uncover answers or hidden truths. In reality, AI and machine learning algorithms cannot assess the veracity of the data provided to them. For example, consider a machine learning algorithm trained on images of seagulls landing in a parking lot. If given five examples of birds landing in consecutive parking spots, the algorithm could conclude that the next seagull to arrive will land in the next available parking spot. Of course, common sense rejects this idea as ridiculous. Seagulls don’t intentionally land in a specific sequence of parking spots; they land at random, and the result merely looks like a pattern. Scriffignano provides examples of situations that could lead an AI to make silly assumptions like this, noting that automating the process gives these types of errors the potential to increase in severity.
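The seagull example can be sketched in a few lines. This is a toy illustration under stated assumptions, not an algorithm described by Scriffignano: the lot has 20 numbered spots, the first five birds happened to take spots 0 through 4, and each new bird actually picks a free spot uniformly at random. The predictor and its name are hypothetical.

```python
import random


def naive_next_spot(observed):
    """Extrapolate the last step: after spots 2, 3, 4 this predicts 5."""
    step = observed[-1] - observed[-2]
    return observed[-1] + step


def naive_hit_rate(n_spots=20, trials=10_000, seed=0):
    """Fraction of trials in which the extrapolated spot is the one
    a randomly landing bird actually chooses."""
    rng = random.Random(seed)
    observed = [0, 1, 2, 3, 4]  # five landings that look like a pattern
    predicted = naive_next_spot(observed)
    # Assumption: the next bird picks uniformly among the free spots.
    free_spots = [s for s in range(n_spots) if s not in observed]
    hits = sum(rng.choice(free_spots) == predicted for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    print("predicted next spot:", naive_next_spot([0, 1, 2, 3, 4]))
    print("share of correct predictions:", naive_hit_rate())
```

Under these assumptions the confident-looking prediction is right only about one time in fifteen, which is no better than guessing any other free spot.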

These examples of common misconceptions highlight the fact that as data becomes more pervasive and important, there needs to be more and more emphasis on thoughtful and intelligent use of data. While data can be a powerful tool and provide valuable information, improper use of data, whether done with malicious intent or accidentally out of ignorance, can have harmful consequences.

The Evolving Future of Data

Discussions around data, such as the ones that take place each month in the Data for AI online community, show the complexity of these issues. In previous waves of technology, businesses primarily needed people who were skilled in administering databases, loading data, and using programming languages like Python and R to manipulate and move data in-house. While these are still essential skills, organizations now also need professionals who understand concepts like permitted use, intellectual property, AI ethics, and more.

As new purposes for data continue to develop, the dialogue is evolving to focus more on the responsible use of data. Inequalities, AI biases, conflicting data manipulation, data rights and additional threats need to be kept at the forefront of the conversation. “Do not ignore these things; don’t make embarrassing truths out of them,” warns Scriffignano. Ask the tough questions, including: What right do I have to use this data? Where did I get this data? How do I know it’s true? All of these questions are important to consider before using data to inform a decision.

Even with large amounts of data available and the limitless potential within it, it’s still important to remember to stop and think. Data is valuable, but to get context you have to look beyond the data itself. It’s not about how much data you have; it’s not about how much data you create; it’s about the meaning you make of it.

This article was written in collaboration with David Pu, Johns Hopkins University.

Sean N. Ayres