Basic Distributions that every data scientist should know

SanDeep DuBey
7 min read · Aug 20, 2020

Hey guys, let's discuss the basic distributions that every data scientist or machine learning engineer should know, because they help you analyze data better. I am sharing here what I learned from internet articles and some books, using the best-fitting examples I could find.

Let's start with an example that shows why we are learning distributions and how they can solve real problems:

Suppose you are a teacher at a school. You gave your class a weekly test, and after checking all the papers you graded all the students. You then gave the papers to the school's data entry person to create a spreadsheet containing the marks of all the students. But that person stored only the marks, not the corresponding students, and also made a mistake: some entries are missing. We have no idea which students' marks are missing, so let's find a way to solve this problem.

One way to solve this problem is to visualize the data and check the trend in it:

The frequency distribution of the marks forms a smooth curve, but there is an anomaly: the frequency is unusually low in one range. So a reasonable guess is that the missing marks fall in that range, and we can estimate them by filling in the dent in the graph. In other words, smoothing is used here to estimate the missing marks.

This is how you would try to solve real-life problems using distributions. Hundreds of distributions have been described in mathematics, physics and other fields, but as machine learning practitioners we should know the basic ones discussed here. Before heading to the distributions, there is one prerequisite: the different types of random variables in statistics. I will only cover the basics here; for more detail you can read my other article, Different type of Random variable. For now, just remember one thing:

Discrete data can take only particular, separate values, while continuous data can take any value within some range.

Types of Distributions :

Normal Distribution

The Normal Distribution, also known as the Gaussian Distribution, is one of the most frequently observed patterns in nature. You might have seen this shape often:

A normal distribution has the following characteristics:

  1. It is symmetric about the line x = µ, which means the curve to the left of it is a mirror image of the curve to the right.
  2. Mean, median and mode coincide.
  3. The density falls off exponentially fast (in the squared distance from the mean) as you move away from the mean.
  4. The total area under the curve is 1.

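Representation : X ~ N(µ, σ), where the parameters µ and σ are the mean and the standard deviation of the normal distribution (many texts write N(µ, σ²), with the variance as the second parameter).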

PDF for Normal Distribution is : f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²))

X ~ N(0,1) is known as Standard Normal Distribution

One of the most important properties of this theoretical model is the 68–95–99.7 rule. You can see the picture below to understand this:

About 68% of the data lies within 1 SD of the mean, about 95% within 2 SD, and about 99.7% within 3 SD.
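The rule can be checked numerically. The following is a minimal sketch that uses `statistics.NormalDist` from the Python standard library (Python 3.8+); the helper name `prob_within` is my own and is only for illustration.

```python
from statistics import NormalDist


def prob_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal value lies within k SDs of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(f"within {k} SD: {prob_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```

The result does not depend on µ or σ, so `prob_within(1, mu=50, sigma=10)` gives the same 0.6827.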

The normal distribution is a very useful tool in ML, because many real-world datasets are approximately normally distributed and many methods assume it. So there are several methods for bringing a skewed distribution closer to normal, such as the Box-Cox transform and the log transform.
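As a rough illustration of the log transform, here is a small sketch using only the standard library. The data is simulated (log-normal samples from `random.lognormvariate`), and the `skewness` helper is my own, so treat it as a toy example rather than a recipe for real datasets.

```python
import math
import random
from statistics import fmean, pstdev


def skewness(values):
    """Sample skewness: the mean cubed z-score (about 0 for symmetric data)."""
    m = fmean(values)
    s = pstdev(values)
    return fmean([((v - m) / s) ** 3 for v in values])


rng = random.Random(0)  # fixed seed so the run is repeatable
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]  # right-skewed data
logged = [math.log(v) for v in raw]  # log transform

print("skewness before:", round(skewness(raw), 2))
print("skewness after: ", round(skewness(logged), 2))
```

The raw data has a long right tail and a large positive skewness, while the log-transformed values are close to symmetric. Box-Cox is a more general family of power transforms that includes the log as a special case.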

Let's see how variance changes the shape of the PDF:

As you can see, increasing the variance lowers the height of the graph. The peak height is 1/(σ√(2π)), so it is inversely proportional to the standard deviation σ, that is, to the square root of the variance. Variance also changes the shape of the CDF:

As the variance increases, the CDF becomes flatter.
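To see the effect of spread in numbers, here is a short sketch with `statistics.NormalDist`. The chosen σ values are arbitrary and `peak_height` is an illustrative helper of mine.

```python
from statistics import NormalDist


def peak_height(sigma):
    """Height of the normal PDF at its mean, which equals 1 / (sigma * sqrt(2 * pi))."""
    return NormalDist(0.0, sigma).pdf(0.0)


for sigma in (0.5, 1.0, 2.0):
    dist = NormalDist(0.0, sigma)
    # The CDF always equals 0.5 at the mean; a larger sigma makes it rise more slowly.
    print(f"sigma={sigma}: peak={peak_height(sigma):.4f}, CDF(1)={dist.cdf(1.0):.4f}")
```

Doubling σ halves the peak, so `peak_height(sigma) * sigma` stays constant. The CDF value at x = 1 moves closer to 0.5 as σ grows, which is the flattening described above.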

All Quantities

Uniform Distribution

As the name says, a uniform distribution describes data in which every outcome is equally likely. If the data is discrete it is called a Discrete Uniform Distribution, and if the data is continuous it is called a Continuous Uniform Distribution. Let's see an example to understand this better:

Throwing a die with faces (1,2,3,4,5,6): all numbers are equally probable, so

P(any number) = 1/6, the same for every face. Data in which every outcome has the same probability follows a Uniform Distribution.

Discrete Uniform Distribution
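The die example can be sketched in a few lines: the exact PMF assigns 1/6 to every face, and a simulation with a fixed seed gives relative frequencies close to that. The variable names and the number of rolls are arbitrary choices of mine.

```python
import random
from collections import Counter
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Exact PMF of a fair die: every face has probability 1/6.
pmf = {face: Fraction(1, len(faces)) for face in faces}

# Simulated rolls; the observed frequencies should be close to 1/6 each.
rng = random.Random(42)
rolls = [rng.choice(faces) for _ in range(60_000)]
counts = Counter(rolls)
freq = {face: counts[face] / len(rolls) for face in faces}

for face in faces:
    print(face, pmf[face], round(freq[face], 4))
```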

For the Continuous Uniform Distribution, the PDF looks like this:

Continuous Uniform Distribution

Parameters : a, b (min and max value)

PDF : f(x) = 1/(b − a) for a ≤ x ≤ b, and f(x) = 0 otherwise

Mean and Median : (a + b)/2

All Quantities
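For the continuous case, here is a minimal sketch of the PDF, f(x) = 1/(b − a) on [a, b], and the mean, (a + b)/2. The interval [2, 10] and the function names are made up for illustration.

```python
import random


def uniform_pdf(x, a, b):
    """PDF of the continuous uniform distribution on [a, b]."""
    if b <= a:
        raise ValueError("b must be greater than a")
    return 1.0 / (b - a) if a <= x <= b else 0.0


def uniform_mean(a, b):
    """Mean (and median) of the continuous uniform distribution."""
    return (a + b) / 2


a, b = 2.0, 10.0
rng = random.Random(1)
samples = [rng.uniform(a, b) for _ in range(50_000)]
sample_mean = sum(samples) / len(samples)

print(uniform_pdf(5.0, a, b))   # constant density 1 / (b - a) inside the interval
print(uniform_mean(a, b))       # theoretical mean (a + b) / 2
print(round(sample_mean, 2))    # simulated mean, close to the theoretical one
```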

Bernoulli Distribution

The Bernoulli distribution is the probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q = 1 − p. For example, at the beginning of any cricket match, how do you decide who is going to bat or bowl? A toss! It all depends on whether you win or lose the toss, right? Let's say if the toss results in a head, you win. Else, you lose. There's no midway.

Here, the occurrence of a head denotes success, and the occurrence of a tail denotes failure. Probability of getting a head = 0.5 = probability of getting a tail, since the coin is fair and there are only two possible outcomes.

PMF for Bernoulli Distribution is : P(X = x) = p^x (1 − p)^(1 − x), for x ∈ {0, 1}

All Quantities
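The coin toss can be sketched as a Bernoulli trial. This is a small illustration with a fixed random seed; `bernoulli_pmf` and `bernoulli_trial` are hypothetical helper names, not functions from any library.

```python
import random


def bernoulli_pmf(x, p):
    """P(X = x) for a Bernoulli variable: p for x = 1, 1 - p for x = 0."""
    if x not in (0, 1):
        return 0.0
    return p if x == 1 else 1.0 - p


def bernoulli_trial(p, rng):
    """Return 1 (success) with probability p, otherwise 0 (failure)."""
    return 1 if rng.random() < p else 0


rng = random.Random(7)
tosses = [bernoulli_trial(0.5, rng) for _ in range(20_000)]
heads_share = sum(tosses) / len(tosses)

print(bernoulli_pmf(1, 0.5), bernoulli_pmf(0, 0.5))
print(round(heads_share, 3))  # close to p = 0.5 for a fair coin
```

The mean of a Bernoulli variable is p and its variance is p(1 − p), so the share of heads in a long run of tosses settles near p.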

Binomial Distribution

Bernoulli and Binomial are two closely linked distributions. Let's take an example to understand this: suppose I want to create a new random variable

Y = Number of times I get a head when I toss my fair coin n times

Let n = 10 and let a head denote success; then Y can take the values

Y = {0,1,2,3,4,5,6,7,8,9,10}

In this example each toss has only two possible outcomes, where a head denotes success and a tail denotes failure. Because the coin is fair, the probability of getting a head is p = 0.5, and the probability of failure is 1 − p = 0.5. The two outcomes need not be equally likely: if p = 0.2 then q will be 0.8. For example, a biased coin has a higher probability of landing on one side than on the other.

Representation : Y ~ Bin(n,p)

p = probability of success in the Bernoulli trial that generates this binomial

n = number of trials

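In simple language, adding up the outcomes of n independent Bernoulli trials gives us the binomial distribution.

PMF for Binomial Distribution is : P(Y = k) = C(n, k) p^k (1 − p)^(n − k), for k = 0, 1, ..., n, where C(n, k) is the number of ways to choose k successes out of n trials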

Relation between Bernoulli and Binomial Distribution

Both distributions have the same two outcomes, success and failure. n is the extra parameter in the binomial distribution, which means that with a single trial (n = 1) the binomial distribution reduces to the Bernoulli distribution. In both, the trials are independent.
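The binomial PMF, P(Y = k) = C(n, k) p^k (1 − p)^(n − k), and its link to the Bernoulli distribution can be sketched as follows. The function name `binomial_pmf` is my own; `math.comb` needs Python 3.8+.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(Y = k): probability of exactly k successes in n independent trials."""
    if not 0 <= k <= n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.5  # ten tosses of a fair coin
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(f"P(Y = 5) = {pmf[5]:.4f}")  # P(Y = 5) = 0.2461
print(f"total    = {sum(pmf):.4f}")  # total    = 1.0000

# With a single trial (n = 1) the binomial PMF is just the Bernoulli PMF.
print(binomial_pmf(1, 1, 0.2), binomial_pmf(0, 1, 0.2))
```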

Poisson Distribution

Before starting with the Poisson distribution, let's get some intuition behind it:

Suppose we are counting the number of occurrences of an event in a given unit of time, distance, area or volume. For example:

Number of car accidents in a day

Then the number of events is a random variable that may or may not have the Poisson distribution, depending on the specifics of the situation. A Poisson random variable is a count of the number of occurrences of an event. Suppose events occur independently: in other words, knowing when one event happens gives no information about when another event will occur. Suppose also that the probability that an event occurs in a given length of time does not change through time; in other words, the theoretical rate at which events occur is constant. We might say that the events occur randomly and independently. If these conditions hold, then the random variable X, which represents the number of events in a fixed unit of time, has the Poisson distribution.

PMF for Poisson Distribution : P(X = k) = λ^k e^(−λ) / k!, for k = 0, 1, 2, ..., where λ is the average number of events per unit of time

All Quantities
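The Poisson PMF, P(X = k) = λ^k e^(−λ) / k!, can be sketched as below. The rate of 2 accidents per day is an invented number for illustration, and `poisson_pmf` is a hypothetical helper name.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson count with average rate lam per unit of time."""
    if k < 0:
        return 0.0
    return lam**k * exp(-lam) / factorial(k)


lam = 2.0  # assumed average: 2 car accidents per day
for k in range(5):
    print(f"P(X = {k}) = {poisson_pmf(k, lam):.4f}")

# Summing far enough into the tail, the probabilities add up to 1
# and the mean of the distribution equals lam.
total = sum(poisson_pmf(k, lam) for k in range(51))
mean = sum(k * poisson_pmf(k, lam) for k in range(51))
print(round(total, 6), round(mean, 6))
```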

Exponential Distribution

The exponential distribution is closely related to the Poisson distribution: it describes the time between events in a Poisson process. You can think of it as the flip side of the Poisson: the Poisson counts events per unit of time, while the exponential measures time per event. Let's take some examples:

Poisson Distribution : Number of cars passing a tollgate in one hour (events per single unit of time)

Exponential : Number of hours between car arrivals (time per single event)

PDF :

f(x) = λe^(−λx) for x ≥ 0, and f(x) = 0 for x < 0
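The link between the two distributions can be sketched with a small simulation of the tollgate example. The rate of 4 cars per hour is an assumption of mine: the gaps between cars are drawn from an exponential distribution with `random.expovariate`, and the resulting counts per hour average out to the same rate.

```python
import math
import random


def exponential_pdf(x, lam):
    """PDF of the exponential distribution with rate lam."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0


def exponential_cdf(x, lam):
    """P(waiting time <= x) for the exponential distribution with rate lam."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0


lam = 4.0  # assumed rate: 4 cars per hour
rng = random.Random(3)

# Time between arrivals is exponential, with mean 1 / lam hours.
gaps = [rng.expovariate(lam) for _ in range(40_000)]
mean_gap = sum(gaps) / len(gaps)

# Build arrival times from exponential gaps and count the arrivals in each hour.
hours = 10_000
arrivals_per_hour = [0] * hours
t = rng.expovariate(lam)
while t < hours:
    arrivals_per_hour[int(t)] += 1
    t += rng.expovariate(lam)
mean_count = sum(arrivals_per_hour) / hours

print(round(mean_gap, 3))    # close to 1 / lam = 0.25 hours
print(round(mean_count, 2))  # close to lam = 4 cars per hour
```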

Important Note :

All the distributions I have written about here are very basic, and I discussed only the important parts. These concepts are enough to get started with analyzing data if you come from a computer science background. The main thing I want to discuss here is some confusing concepts that I ran into while learning these. Let me go point by point; I hope it helps you:

  1. You may have noticed that I use PDF in some places and PMF in others. Both play the same role; the only difference is the type of data being analyzed. A PMF (Probability Mass Function) describes discrete data and a PDF (Probability Density Function) describes continuous data.
  2. The distributions discussed here fall into two categories: discrete distributions and continuous distributions.
  3. The Bernoulli distribution can be represented by the binomial PMF by setting n = 1.
  4. To check whether some data is normally distributed we can use a Q-Q plot, which plots the quantiles of the data being tested against the quantiles of a normal distribution. If the points lie approximately on a straight line, the data is approximately normal; otherwise it is not.
  5. There are many practical uses of these distributions. One that I studied with great interest is the log-normal distribution, which helps to work out how high a dam should be to reduce the risk of accidents due to heavy rainfall.
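The Q-Q check can be sketched without any plotting library. This example computes the Q-Q points and measures how straight they are with a correlation coefficient. The two samples are simulated, the plotting positions (i − 0.5)/n are one common convention among several, and the helper names are my own.

```python
import random
from statistics import NormalDist, fmean


def qq_points(sample):
    """Pair each sorted sample value with the matching standard normal quantile."""
    n = len(sample)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(sample)))


def qq_correlation(points):
    """Pearson correlation of the Q-Q points; close to 1 means a straight line."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(11)
normal_sample = [rng.gauss(170, 8) for _ in range(2000)]        # normal data
skewed_sample = [rng.expovariate(1.0) for _ in range(2000)]     # skewed data

r_normal = qq_correlation(qq_points(normal_sample))
r_skewed = qq_correlation(qq_points(skewed_sample))
print(round(r_normal, 4), round(r_skewed, 4))
```

The normal sample gives a correlation very close to 1, while the skewed sample bends away from the line and scores noticeably lower. In practice you would usually look at the plot itself rather than only this number.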

THANKS
