Key data distributions a data scientist should know

Statistics is the foundation of data science. Anyone wishing to pursue a career in data science needs to master the concepts of statistics and understand how they can be applied in business. The different distributions of data and their properties are one such area of statistics where a data scientist must have complete clarity.

Let’s look at some of the most common distributions encountered by a data scientist during their career.

Normal distribution

In a normal distribution, the data is arranged so that most values form a cluster in the middle and decrease symmetrically towards either extreme. It is also called the Gaussian distribution. It appears as a bell curve when graphed. In a standard normal distribution, the mean is zero and the standard deviation is 1; like every normal distribution, it has zero skew. The mean, median, and mode are all the same in a normal distribution.

In a normal distribution, the midpoint (the mean) has the maximum frequency. A fixed proportion of the area under the curve lies within any given distance of the mean, measured in standard deviation units: about 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three.

Values from a normal distribution are often expressed as standard scores, or Z-scores. A Z-score tells how far a raw score is from the mean in terms of standard deviations: z = (x - mean) / standard deviation.
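As a minimal sketch of the two ideas above, the following Python snippet uses the standard library's NormalDist to compute the area under the standard normal curve within one, two, and three standard deviations, and converts a raw score into a Z-score. The helper names (area_within, z_score) and the exam-score figures are made up for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)


def area_within(k: float) -> float:
    """Proportion of the area under the curve within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


def z_score(raw: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations a raw score lies from the mean."""
    return (raw - mean) / std_dev


# Hypothetical exam scores with mean 70 and standard deviation 10.
z = z_score(85.0, mean=70.0, std_dev=10.0)

# Area within 1, 2 and 3 standard deviations of the mean.
coverage = {k: round(area_within(k), 4) for k in (1, 2, 3)}

print(z)
print(coverage)
```

The same proportions hold for any normal distribution once its values are converted to Z-scores, which is why the standard normal serves as a common reference.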

Bernoulli distribution

In a Bernoulli distribution, the random variable has two possible values. (A random variable is a variable whose value depends on the outcome of an experiment. Random variables are of two types: discrete and continuous.)

A Bernoulli distribution is a discrete distribution. It has two possible outcomes and only one trial (called a Bernoulli trial). A Bernoulli trial is one of the simplest experiments conducted in statistics. It has two possible outcomes: success and failure. Examples of Bernoulli trials include a coin toss (heads or tails) or a dice roll judged on a single condition, such as whether or not a six comes up. The probabilities of the two mutually exclusive outcomes must sum to one.

The two possible outcomes in the Bernoulli distribution are indicated by n=0 and n=1. Here, n=1, indicating success, has probability p, and n=0, indicating failure, has probability 1-p (0 < p < 1).
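The definition above can be sketched in a few lines of Python. The function names (bernoulli_pmf, bernoulli_trial), the value p = 0.3, and the seed are illustrative assumptions; the simulation simply checks that the observed success rate settles near p.

```python
import random


def bernoulli_pmf(n: int, p: float) -> float:
    """Probability of outcome n, where n=1 is success and n=0 is failure."""
    if n not in (0, 1):
        raise ValueError("a Bernoulli variable only takes the values 0 and 1")
    return p if n == 1 else 1.0 - p


def bernoulli_trial(p: float, rng: random.Random) -> int:
    """Run one trial: return 1 (success) with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0


p = 0.3                    # assumed success probability
rng = random.Random(42)    # fixed seed so the run is repeatable

samples = [bernoulli_trial(p, rng) for _ in range(100_000)]
success_rate = sum(samples) / len(samples)

# The two outcome probabilities sum to one.
total_probability = bernoulli_pmf(0, p) + bernoulli_pmf(1, p)

print(success_rate, total_probability)
```

The mean of a Bernoulli variable is p and its variance is p(1-p), so the simulated success rate should land close to 0.3 here.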

Uniform distribution

The uniform distribution is one of the simplest statistical distributions to understand. It is a probability distribution in which all possible outcomes are equally likely to occur. Graphically, we can consider it as a horizontal straight line. Uniform distributions are of two types – discrete and continuous.

A discrete uniform distribution has a finite number of outcomes, each with the same probability, while a continuous uniform distribution has infinitely many possible outcomes within an interval, with every sub-interval of the same length being equally probable.
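To make the two types concrete, here is a small Python sketch: a fair six-sided die as a discrete uniform distribution, and a continuous uniform distribution on an interval [a, b] with constant density 1/(b-a). The die, the interval [2, 6], and the function names are assumptions chosen for illustration.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable

# Discrete uniform: a fair six-sided die, each face has probability 1/6.
faces = [1, 2, 3, 4, 5, 6]
discrete_pmf = {face: 1 / len(faces) for face in faces}
rolls = Counter(rng.choice(faces) for _ in range(60_000))


def uniform_pdf(x: float, a: float, b: float) -> float:
    """Density of the continuous uniform distribution on [a, b]: a flat line."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def uniform_cdf(x: float, a: float, b: float) -> float:
    """Probability that a continuous uniform variable on [a, b] is at most x."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


# Each face should appear roughly 10,000 times out of 60,000 rolls.
print(dict(sorted(rolls.items())))

# On [2, 6] the density is 0.25 everywhere inside the interval.
print(uniform_pdf(3.0, 2.0, 6.0), uniform_cdf(3.0, 2.0, 6.0))
```

In the continuous case the probability of any single exact value is zero; only intervals carry probability, which is why the density rather than a per-outcome probability is the flat line.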

Poisson distribution

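A Poisson distribution is a probability distribution that shows how many times an event is likely to occur over a fixed interval of time or space, assuming events occur independently at a constant average rate. It is named after the French mathematician Siméon Denis Poisson. It is a discrete distribution: the variable takes only non-negative integer values (0, 1, 2, ...). It arises as a limiting case of the binomial distribution when the number of trials grows large and the success probability shrinks so that the expected number of successes stays fixed.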
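The link to the binomial distribution can be checked numerically. The sketch below, with an assumed average rate of 3 events per interval, compares the Poisson probabilities with binomial probabilities for n trials and success probability 3/n, and shows the gap shrinking as n grows. The function and variable names are illustrative.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average rate is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


lam = 3.0  # assumed average number of events per interval

# Largest difference between the two distributions over k = 0..10,
# for an increasing number of trials n with p = lam / n.
gaps = {
    n: max(abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam)) for k in range(11))
    for n in (10, 100, 10_000)
}

for n, gap in gaps.items():
    print(f"n={n:>6}  max gap={gap:.6f}")
```

Keeping n times p fixed at the rate while n increases is what makes the binomial probabilities settle onto the Poisson ones.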

T-distribution

The t-distribution is a continuous distribution that resembles the normal distribution and is used primarily when the sample size is small and the population standard deviation is unknown. It is also known as Student’s t-distribution. It is bell-shaped and symmetric, with zero mean (for more than one degree of freedom). Its shape changes with the degrees of freedom. It has heavier tails, and hence greater dispersion, than the standard normal distribution. As the degrees of freedom increase, the distribution approaches the standard normal distribution.

The Student’s t-distribution ranges from –∞ to ∞. Some important applications of the t-distribution include testing a hypothesis about a population mean, testing the difference between two means from independent samples, and testing the difference between two means from dependent (paired) samples.
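The following Python sketch illustrates these properties using only the standard library: it evaluates the t-distribution density from its usual closed form, compares the tails with the standard normal for a few degrees of freedom, and computes a one-sample t statistic. The sample values and the hypothesised mean of 5.0 are made up for illustration, and the helper names are assumptions.

```python
import math
from statistics import NormalDist, mean, stdev


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def one_sample_t(sample: list[float], mu0: float) -> float:
    """t statistic for testing whether the population mean equals mu0."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)  # stdev uses n - 1
    return (mean(sample) - mu0) / standard_error


std_normal = NormalDist()

# Density three units from the centre: larger for small degrees of freedom
# (heavier tails), approaching the normal value as df grows.
for df in (2, 30, 1000):
    print(df, t_pdf(3.0, df), std_normal.pdf(3.0))

# Hypothetical small sample, tested against an assumed population mean of 5.0.
sample = [5.1, 4.9, 5.3, 5.0, 5.2]
t_stat = one_sample_t(sample, mu0=5.0)  # compared against t with n - 1 = 4 df
print(t_stat)
```

In practice the t statistic is compared with critical values of the t-distribution with n - 1 degrees of freedom; a statistics library would normally supply those values and the corresponding p-value.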

Lognormal distribution

A lognormal distribution is a probability distribution of a random variable whose logarithm is normally distributed. A lognormally distributed random variable therefore takes only positive real values.
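A short simulation makes the definition tangible. The sketch below draws lognormal samples with Python's random.lognormvariate, takes their logarithms, and checks that the logs behave like a normal sample with the chosen parameters. The parameter values (mu = 0.5, sigma = 0.8), the seed, and the sample size are arbitrary assumptions.

```python
import math
import random
import statistics

rng = random.Random(7)   # fixed seed so the run is repeatable
mu, sigma = 0.5, 0.8     # mean and standard deviation of the underlying normal

# Draw lognormal samples; every value is strictly positive.
samples = [rng.lognormvariate(mu, sigma) for _ in range(50_000)]

# Taking logarithms recovers a normally distributed sample.
logs = [math.log(x) for x in samples]
log_mean = statistics.mean(logs)
log_std = statistics.stdev(logs)

# The lognormal itself is right-skewed: its mean exceeds its median.
sample_mean = statistics.mean(samples)
sample_median = statistics.median(samples)
expected_mean = math.exp(mu + sigma**2 / 2)

print(min(samples) > 0, log_mean, log_std)
print(sample_mean, sample_median, expected_mean)
```

Note that mu and sigma describe the logarithm of the variable, not the variable itself, which is a common source of confusion when fitting lognormal models.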

Sean N. Ayres