Top 6 Statistical Concepts Every Data Scientist Should Know | by Anamika Singh | May 2022

Vocabulary is a structural basis on which one can express oneself, and statistics are the structural basis on which data scientists can express data.

Data science is a deep field and statistics is one of its foundations. It would be extremely difficult for professionals in this field to understand or analyze data without a basic understanding of statistics.

According to Josh Wills (former head of data engineering at Slack), data scientists are always expected to be better at statistics than any other programmer and better at programming than any statistician.

Just as your body's core keeps you balanced and moving, statistics is central to machine learning algorithms, which capture patterns in data and translate them into actionable insights. Let's look at why statistics matters before discussing its concepts.

It stands to reason that society cannot operate effectively on instinct or trial and error alone, and the correct interpretation of data is essential in business. Decisions based on data tend to perform better than those based on intuition or guesswork.

Learning how to use statistics in your data science projects will provide you with more than just a qualification.

Once you have mastered the language of statistics and some of the procedures used to make sense of your projects, you will have the knowledge and understanding to deal with the information you encounter in your daily life as a data scientist.

Understanding fundamental statistical concepts, including regression, distributions, maximum likelihood, priors, posteriors, conditional probabilities, Bayes' theorem, and machine learning fundamentals, is essential.

Simply put, an aspiring data scientist must study and learn statistics and its concepts.

There are thousands of statistical concepts, but employers focus on a handful of them. Here are the main concepts you should be thorough with:

Descriptive statistics describe the basic characteristics of a data set. Together with simple graphical analysis, they give a clear summary of the available data, which may represent either an entire population or a sample of it. They present quantitative descriptions in a manageable form and are useful for sensibly simplifying large amounts of data.

Descriptive statistics are the way to get a high-level description of any data set. Some of the most common ones include:

  • Mean: the central value, commonly called the arithmetic mean.
  • Mode: the value that appears most often in a data set.
  • Median: the middle value of the ordered set, which divides it exactly in half.

👉 Mean

The mean (also called “expected value” or “average”) is the sum of the values divided by the number of values.

👉 Median

List the values in ascending (or descending) order. The median is the point that divides the data in half. If there are two middle numbers, the median is the average of the two. For example, if the two middle values are 4 and 5, the median is 4.5.

👉 Mode

The mode is the most frequent value or values in the data set. For example, if 3 appears more often than any other value, the mode is 3.
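
The three measures of central tendency described above can be computed with Python's standard `statistics` module. This is a minimal sketch: the data set `values` is hypothetical, chosen only for illustration, and is not the data from the article's lost example.

```python
import statistics

# Hypothetical data set, chosen only for illustration (not from the article).
values = [1, 3, 3, 4, 5, 6, 7, 9]

# Mean: sum of the values divided by the number of values.
mean_value = statistics.mean(values)

# Median: middle of the ordered values; with an even count,
# the average of the two middle values (here 4 and 5).
median_value = statistics.median(values)

# Mode: the most frequent value.
mode_value = statistics.mode(values)

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

If several values tie for most frequent, `statistics.multimode` returns all of them as a list.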

👉 Variance

Variance measures the dispersion of a data set around the mean. To calculate the variance, subtract the mean from each value and square each difference. Finally, calculate the average of the resulting numbers.

👉 Standard Deviation

The standard deviation measures the overall spread in the same units as the data and is calculated by taking the square root of the variance.
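
The variance and standard deviation steps described above can be sketched directly in Python. The data set is hypothetical. Note that averaging the squared differences over all values gives the population variance; the sample variance divides by one less than the number of values.

```python
import math

# Hypothetical data set, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: compute the mean.
mean_value = sum(values) / len(values)

# Step 2: subtract the mean from each value and square the difference.
squared_diffs = [(v - mean_value) ** 2 for v in values]

# Step 3: average the squared differences (population variance).
variance = sum(squared_diffs) / len(values)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print("variance:", variance)
print("standard deviation:", std_dev)
```

The standard library offers the same calculations as `statistics.pvariance` and `statistics.pstdev` (population) and `statistics.variance` and `statistics.stdev` (sample).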

Oversampling is used when the currently available data is not enough. There are established techniques for mimicking a natural sample, such as the Synthetic Minority Oversampling Technique (SMOTE). Downsampling is used when part of the data is overrepresented. Downsampling techniques focus on finding overlapping and redundant data so that only part of the data is used.

These techniques are used in classification problems, where the data set is sometimes skewed toward one class. For example, consider 1000 samples for class 1 but only 200 for class 2. Machine learning (ML) techniques are used to model the data and make predictions, and two pre-processing options are very useful when training them on such data:

  • Downsampling
  • Oversampling

Downsampling means selecting only some of the data from the majority class, as many records as there are in the minority class. The class distribution is then balanced: the data set is leveled by using fewer samples.

Oversampling means replicating the minority class so that it has the same number of records as the majority class. The data set is then leveled without collecting additional data.

In the example above, the imbalance can be resolved in two ways. With downsampling, one selects only 200 records from class 1 to match the 200 in class 2. With oversampling, one replicates the 200 examples of class 2 up to 1000 so that both classes have 1000 examples each. Training on a balanced set usually helps the model learn both classes.
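
Both options for the 1000 versus 200 example can be sketched with Python's standard `random` module. This is an illustration under stated assumptions: the records are placeholder tuples, the seed is arbitrary, and the oversampling shown is plain random replication with replacement, not SMOTE (which synthesizes new points instead of copying existing ones).

```python
import random

random.seed(0)  # arbitrary fixed seed so the sketch is reproducible

# Hypothetical imbalanced data set: 1000 records of class 1, 200 of class 2.
class_1 = [("class_1", i) for i in range(1000)]
class_2 = [("class_2", i) for i in range(200)]

# Downsampling: keep a random subset of the majority class (no repeats)
# that is as large as the minority class.
class_1_down = random.sample(class_1, k=len(class_2))
downsampled = class_1_down + class_2

# Oversampling: draw from the minority class with replacement
# until it is as large as the majority class.
class_2_over = random.choices(class_2, k=len(class_1))
oversampled = class_1 + class_2_over

print("downsampled class sizes:", len(class_1_down), len(class_2))
print("oversampled class sizes:", len(class_1), len(class_2_over))
```

Downsampling discards information from the majority class, while oversampling by replication repeats minority records, so which option fits depends on how much data there is to spare.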

Probability is the foundation for reasoning about uncertainty. It is the measure of how likely an event is to occur in a random experiment.

For example: what are the odds that Team A will win the football match against Team B? To answer this, you might ask 100 people for their votes; that number is the sample size. Based on these votes, one can estimate which team is more likely to win the match.

This example also involves a very important concept known as sampling: identifying the right group of people to ask. In short, probability is the chance that an event does or does not happen. Depending on the scenario, one can build different solutions around it.
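
The football example amounts to estimating a probability as a relative frequency in a sample. A minimal sketch, assuming made-up vote counts (58 for Team A, 42 for Team B) that are not from the article:

```python
# Hypothetical poll: 100 sampled people each name the team they expect to win.
votes = ["A"] * 58 + ["B"] * 42  # made-up counts, for illustration only

sample_size = len(votes)

# Estimated probability = favorable votes / total votes.
p_team_a = votes.count("A") / sample_size
p_team_b = 1 - p_team_a  # the two outcomes are treated as the only options

print("sample size:", sample_size)
print("estimated P(Team A wins):", p_team_a)
print("estimated P(Team B wins):", round(p_team_b, 2))
```

The estimate is only as good as the sample: if the 100 people are all supporters of one team, the relative frequency says little about the actual match.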

👉 Uniform distribution

The uniform distribution has a single value that occurs only within a particular range, while everything outside the range is just 0. It can be thought of as an “on or off” distribution, representing a categorical variable with two categories: 0 or the value. A categorical variable can have more than two values, but the same idea still applies if one visualizes it as a piecewise function of multiple uniform distributions.

👉 Normal distribution

The normal distribution, also known as the Gaussian distribution, is defined by its mean and standard deviation. The mean shifts the distribution along the axis, while the standard deviation controls the spread. With a Gaussian distribution, we therefore know both the average value of the data set and how the data is spread around it.

👉 Poisson distribution

The Poisson distribution, which describes counts of events, resembles the Normal but with added skewness. When the skewness is low, it has a relatively even spread in both directions, just like the Normal. When the skewness is high, the data spreads differently in different directions.
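
The three distributions above can be written down as short functions using only the standard `math` module. This is a sketch using the textbook formulas; the function and parameter names are illustrative choices, not from the article. The last helper uses the standard result that a Poisson distribution with rate λ has skewness 1/√λ, which is why it looks more like the Normal as the rate grows.

```python
import math

def uniform_pdf(x, low, high):
    """Uniform density on [low, high]: one constant value inside, 0 outside."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

def normal_pdf(x, mean, std_dev):
    """Normal (Gaussian) density: mean sets the location, std_dev the spread."""
    z = (x - mean) / std_dev
    return math.exp(-0.5 * z * z) / (std_dev * math.sqrt(2 * math.pi))

def poisson_pmf(k, rate):
    """Probability of observing exactly k events when `rate` are expected."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

def poisson_skewness(rate):
    """Skewness of a Poisson distribution: shrinks as the rate grows."""
    return 1.0 / math.sqrt(rate)

print(uniform_pdf(0.5, 0, 2), uniform_pdf(3, 0, 2))
print(normal_pdf(0, mean=0, std_dev=1))
print([round(poisson_pmf(k, rate=3.0), 3) for k in range(6)])
print(poisson_skewness(1.0), poisson_skewness(100.0))
```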

Regression is a method used to determine the relationship between one or more independent variables and a dependent variable. Two common types are:

👉 Linear Regression

Linear regression fits a model that describes the relationship between a numerical response variable and one or more predictor variables.
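
For a single predictor, a linear regression can be fitted in closed form with ordinary least squares. The sketch below assumes hypothetical data that lies exactly on a straight line, so the recovered slope and intercept are easy to check; `fit_line` is an illustrative helper name, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)
```

With several predictors the same idea is expressed with matrices, which is what numerical libraries do internally.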

👉 Logistic regression

Logistic regression fits a model that describes the relationship between a binary response variable and one or more predictor variables.
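
A logistic regression with one predictor can be sketched with a sigmoid function and plain gradient descent on the log-loss. Everything here is an assumption for illustration: the data (hours studied versus pass or fail) is made up, and the learning rate and step count are arbitrary choices that happen to work for this small example.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current weights.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical data: hours studied and whether the exam was passed (1) or not (0).
# The classes overlap slightly, so the fitted weights stay finite.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(hours, passed)
print("P(pass | 1 hour): ", round(sigmoid(w * 1 + b), 3))
print("P(pass | 8 hours):", round(sigmoid(w * 8 + b), 3))
```

The output of the model is a probability, which is what distinguishes it from linear regression: the binary response is recovered by applying a threshold such as 0.5.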

Bias is the act of deliberately or unintentionally favoring one class or outcome over other groups or potential outcomes in the chosen data set. In statistical terms, it means that a model or sample is not representative of the complete population, and it must be minimized to achieve the desired result.

The three most common types of bias are:

👉 Selection bias

Selection bias occurs when the group of data chosen for statistical analysis is not selected at random, so that the data is not representative of the entire population.
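
The effect of selection bias on an estimate can be shown with a small simulation. This sketch assumes a hypothetical population of the values 1 to 100 and compares a random sample with a sample selected non-randomly (only the largest values); the seed is arbitrary.

```python
import random
import statistics

random.seed(42)  # arbitrary fixed seed so the sketch is reproducible

# Hypothetical population: the values 1 to 100.
population = list(range(1, 101))
population_mean = statistics.mean(population)

# Random selection: every member has the same chance of being chosen.
random_sample = random.sample(population, k=20)

# Biased selection: only the 20 largest values are chosen.
biased_sample = sorted(population)[-20:]

random_mean = statistics.mean(random_sample)
biased_mean = statistics.mean(biased_sample)

print("population mean:   ", population_mean)
print("random sample mean:", random_mean)
print("biased sample mean:", biased_mean)
```

The random sample lands near the population mean, while the biased sample overstates it by a wide margin, even though both contain the same number of records.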

👉 Confirmation bias

Confirmation bias occurs when the person performing the statistical analysis has a predefined hypothesis and tends to favor results that support it.

👉 Time interval bias

Time interval bias is caused intentionally by specifying a certain time range to favor a particular outcome.

A p-value is a number produced by a statistical test. It represents the probability of observing results at least as extreme as the ones measured if the null hypothesis is true.

P-values are used in hypothesis tests to help determine whether the null hypothesis should be rejected. The lower the p-value, the stronger the evidence against the null hypothesis.

For example, suppose a t-test compares the lifespans of mice fed two different diets. If the mice live equally long on either diet, the test statistic will be very close to the value expected under the null hypothesis (that there is no difference between the groups), and the resulting p-value will be very close to 1. It is unlikely to reach exactly 1 because the groups will not be perfectly equal in real life.

Conversely, if there is a difference in average lifespan between the two groups, the test statistic moves away from the value expected under the null hypothesis and the p-value decreases. The p-value will never reach 0, however, because there is always a chance, even if an incredibly small one, that the pattern in the data occurred randomly.
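
The behavior of the p-value in the mouse example can be illustrated with a small exact permutation test. Note the swap: the text above refers to a t-test, while this sketch uses a permutation test on the difference of means because it needs only the standard library. The lifespans are made up, and `permutation_p_value` is an illustrative helper name.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference of group means.

    Returns the share of all possible regroupings whose absolute mean
    difference is at least as large as the observed one.
    """
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    indices = range(len(pooled))
    at_least_as_extreme = 0
    total = 0
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        new_a = [pooled[i] for i in chosen]
        new_b = [pooled[i] for i in indices if i not in chosen_set]
        # Small tolerance guards against floating-point rounding.
        if abs(mean(new_a) - mean(new_b)) >= observed - 1e-12:
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total

# Hypothetical lifespans in months (made up for illustration).
diet_a = [20, 21, 22, 23, 24]
diet_b_same = [20, 21, 22, 23, 24]       # no difference between diets
diet_b_longer = [30, 31, 32, 33, 34]     # clearly longer lifespans

p_same = permutation_p_value(diet_a, diet_b_same)
p_different = permutation_p_value(diet_a, diet_b_longer)

print("p-value, identical groups:", p_same)
print("p-value, different groups:", p_different)
```

Because the observed grouping is itself one of the possible regroupings, the result is always greater than 0, which mirrors the point that a p-value can become very small but never reaches 0.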

Learning statistical concepts will help you become a professional data scientist. Statistics is the cornerstone of data science. It is also useful for solving complex real-world problems, because it lets data scientists extract useful insights and trends from data through mathematical calculation. These concepts will help you excel in your data science career.

Sean N. Ayres