6 classics every data scientist should read in 2022 | by Arthur Mello

Photo by Ed Robertson on Unsplash

This year, be sure to read books that have stood the test of time

According to UNESCO, 2.2 million books were published in 2011. Even if we focus only on data science, it is impossible to keep track of all the new books that come out each year. Instead, I suggest you filter out the noise and focus on the things that really matter.

A good way to do this is to stick to the classics. Even though they were written years (or even decades) ago, they remain relevant for a reason. This generally does not apply to books about specific tools or technologies, which go out of date quickly in a field that changes so fast. The advice applies only to books about general principles, mathematics, etc.

Here’s a list of a few books that are considered classics and might help you brush up on the basics or expand your circle of skills into new areas.

Picture from Amazon

John Tukey’s “Exploratory Data Analysis” was published in 1977 and runs to nearly 700 pages. It’s definitely not an easy read, so it might take you a while to finish. This is often the case with classics, so fear not.

People often already have their own script for performing exploratory data analysis (EDA): they look at the means of variables, maximum and minimum values, distributions, missing values, etc. Often, however, people seem to do these things just for the sake of it, without knowing exactly what they are looking for.

Even though the book is quite outdated (to the point where it talks about making graphs by hand, without a computer), it can give you a structure for your data analyses, a better understanding of the methods you already use, or even introduce you to methods you are not yet familiar with.
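As a rough illustration of the kind of routine EDA script described above, here is a minimal Python sketch using only the standard library. The `summarize` helper and the `ages` sample are made up for this example, and the quartiles use Python's default method, which may differ slightly from the hinges Tukey computes by hand.

```python
import math
import statistics


def summarize(values):
    """Summarize one numeric column; None and NaN are treated as missing."""
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    # Quartiles via the standard library's default ("exclusive") method.
    q1, median, q3 = statistics.quantiles(present, n=4)
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(present),
        "mean": statistics.fmean(present),
    }


# Hypothetical sample column with two missing entries.
ages = [23, 35, None, 41, 29, 35, float("nan"), 52]
summary = summarize(ages)
for name, value in summary.items():
    print(f"{name:>8}: {value}")
```

Producing these numbers is the easy part. Knowing which question each one is supposed to answer is what a script cannot do for you.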

Picture from Amazon

If you’re not familiar with causal inference yet, don’t worry, you’re not alone. Many methods used by data scientists rely not on causation but on correlation, partly because the idea of causation itself is somewhat abstract and difficult to define. Causal inference can be very useful, however, because in many business problems we are actually interested in what causes what, not just in what is correlated.

Judea Pearl has done a remarkable job of laying the foundation for much of what is being done today in the field of causal inference, and of making the subject accessible to non-mathematicians as well.

In “Causality” (his first book on the subject), he moves from introductory concepts, such as confounding variables and counterfactuals, to more mathematical details, such as Bayesian methods.
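To make the idea of a confounding variable concrete, here is a small simulation sketch in Python. It is not taken from Pearl's book: the variables `x`, `y` and `z` are hypothetical, and the "adjustment" step only works this neatly because we generated the data ourselves and know exactly how `z` enters.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 10_000

# z is the confounder: it drives both x and y, but x never affects y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

corr_raw = pearson(x, y)

# Remove the confounder's contribution (known here because we built the data).
x_adj = [xi - zi for xi, zi in zip(x, z)]
y_adj = [yi - zi for yi, zi in zip(y, z)]
corr_adjusted = pearson(x_adj, y_adj)

print(f"correlation of x and y:          {corr_raw:.2f}")
print(f"after removing the confounder z: {corr_adjusted:.2f}")
```

In theory the first correlation is 0.5 and the second is 0, so the printed values should land close to those. With real data the confounder is rarely known this cleanly, which is where the more formal tools come in.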

If you are completely new to the subject, I suggest you first read “The Book of Why”, by the same author. It offers a gentler introduction and builds an intuition that will make the more complicated details easier to understand later. By reading both, you will have added a completely new and powerful tool to your arsenal.

If you are still not convinced, try reading this article first:

This will give you a better idea of the topic and help you decide whether you’re interested before you commit to buying the books.

Picture from Amazon

AI has been one of the hottest topics for a while (until it’s superseded by NFTs, the Metaverse, or the next fad), but if you want to be sure to go beyond superficial news, you should read something with substance. And believe me, this book has it.

In just 777 pages, Hofstadter’s “Gödel, Escher, Bach” covers a lot of ground, from math and logic to music and art, so it’s hard to sum up in a few sentences.

One of the concepts discussed is emergence: how wholes emerge from parts. Think of how consciousness emerges from neurons, how cells form life and how, in the future, human-like intelligence may emerge from computer circuits.

Don’t be fooled by how “new age” this all sounds. It’s totally legit: Hofstadter has a PhD in physics and the book won a Pulitzer Prize. Reading it will take time and effort, but it will also take your thinking to some weird places it has never been before.

Picture from Amazon

Edward Tufte’s “The Visual Display of Quantitative Information” is the book that defined much of what we now think of as “data visualization rules”, and it was listed by Amazon.com as one of the “100 Best Non-Fiction Books of the 20th century”.

Tufte is essentially a minimalist who preaches adding elements to a graphic sparingly. One of his most famous concepts is the “data-ink ratio”: how much information are you conveying, given the amount of ink in your graph? Everything you add to a chart should be relevant to your audience.
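For reference, the data-ink ratio is usually written as a simple fraction. The notation below is a paraphrase of Tufte's definition rather than a quotation from the book:

```latex
\[
\text{data-ink ratio}
  = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
  = 1 - \text{proportion of the graphic that can be erased without losing data-information}
\]
```

A ratio close to 1 means almost every mark on the chart carries information. Gridlines, borders and decoration push it down.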

The book is filled with great examples of good (and bad) dataviz. It can get a bit wordy at times, but it’s still a light read, and definitely one to keep and use as a reference when building important charts.

Picture from Amazon

By far the shortest book on this list, “How to Lie with Statistics” will give you a break from all that heavy reading. It’s light and fun to read, yet still relevant. It will help you develop an intuition for the sources of bias in your analyses. Despite its dubious title, the book is about understanding and avoiding these biases, not using them to your advantage.

Some examples include mistaking correlation for causation, reading a graph whose y-axis does not start at 0, poor sampling, and using percentages to hide the real numbers (when they are too small, for example).
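Two of these tricks are easy to put into numbers. The Python sketch below is only an illustration: the helper names and the figures are hypothetical, not examples from the book.

```python
def drawn_ratio(small, large, axis_start=0.0):
    """How many times taller the larger bar looks when the y-axis starts at axis_start."""
    return (large - axis_start) / (small - axis_start)


def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return 100.0 * (after - before) / before


# Hypothetical sales figures for two products.
honest = drawn_ratio(50, 52)                    # axis starts at 0
truncated = drawn_ratio(50, 52, axis_start=49)  # axis starts at 49

# Hypothetical case counts: a tiny absolute change behind a big percentage.
before, after = 2, 4
change = percent_change(before, after)

print(f"bar height ratio, axis from 0:  {honest:.2f}")
print(f"bar height ratio, axis from 49: {truncated:.2f}")
print(f"{before} -> {after} cases is a {change:.0f}% increase, "
      f"but only {after - before} more cases")
```

The same 4% difference looks like a threefold gap once the axis starts at 49, and a “100% increase” sounds alarming until you see that it means two extra cases.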

Some of the examples are quite old, but most of the concepts apply to this day, so I highly recommend reading this book at some point in your life (even if you’re not a data scientist).

Picture from Amazon

If you have to choose one book from this list, this is it. It reviews the most widely used methods in data science, for supervised and unsupervised learning, with a high level of detail. If you’re new to data science, I don’t suggest starting with this, as it might be too much content to digest. If you already know the basics, this book will help you fill some gaps in your understanding and go beyond them.

Given its length and the amount of content presented, it can also be used as a reference only, although I still suggest that you read it in full: sometimes you don’t even know you need something until you have seen it.

Also, this one is language-agnostic, so whether you use Python or R, you can always follow along.

Sean N. Ayres