Exploratory Data Analysis
A dive into the core of Exploratory Data Analysis.
While thinking about what topic to pick for my first blog post, I settled on EDA, the initial investigation we do before applying any algorithm. This should be a 7–8 minute read, written so that any beginner can follow it. You may find better articles elsewhere, but if you want to know the whats and whys of EDA, this article is for you. So let's start:
What is EDA?
EDA (Exploratory Data Analysis) is the task of analyzing our data using simple tools from statistics, plotting, and linear algebra. In simple language, EDA is like the investigation done before starting work on any case. It maximizes the insight we get from the data. Some important ways EDA helps:
- To choose a suitable algorithm for a problem
- To define feature variables that can be used for machine learning
- To detect outliers and anomalies, and much more
To share my understanding and techniques, I will take a simple classification problem on the Iris data set. Iris is one of the simplest data sets available on the internet, and you can download it from many sources (a Google search or Kaggle will do).
Exploratory data analysis is generally cross-classified in two ways. First, each method is either graphical or non-graphical. Second, each method is either:
a) Univariate analysis (analyzing one variable at a time)
b) Multivariate analysis (analyzing two or more variables together)
Let's understand these using a simple classification problem. First,
Objective of the Problem
The Iris data set gives four features of the Iris flower (sepal length, sepal width, petal length and petal width), which we use to classify the species. There are three species of Iris: Setosa, Virginica and Versicolor. The data set also contains a class label that gives the species of each flower. So the main objective is: given a flower, classify it as Setosa, Virginica or Versicolor.
View of the Iris data set:
You can see all variables here:
Scatter Plot
A scatter plot is a two-dimensional data visualization that uses dots to represent the values obtained for two different variables — one plotted along the x-axis and the other plotted along the y-axis.
To start with, I imported the necessary libraries (for this example pandas, numpy, matplotlib and seaborn) and loaded the data set.
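A scatter plot of this kind can be sketched as follows. This is a minimal illustration and not the original code from this post: it uses only matplotlib and a handful of hard-coded rows instead of the full 150-row data set, and the output file name `iris_scatter.png` is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt

# A few hard-coded rows for illustration: (sepal_length, sepal_width, species).
# In practice you would load the full data set from your own copy.
rows = [
    (5.1, 3.5, "setosa"), (4.9, 3.0, "setosa"), (4.7, 3.2, "setosa"),
    (7.0, 3.2, "versicolor"), (6.4, 3.2, "versicolor"), (6.9, 3.1, "versicolor"),
    (6.3, 3.3, "virginica"), (5.8, 2.7, "virginica"), (7.1, 3.0, "virginica"),
]

fig, ax = plt.subplots()
for species in ("setosa", "versicolor", "virginica"):
    xs = [r[0] for r in rows if r[2] == species]
    ys = [r[1] for r in rows if r[2] == species]
    ax.scatter(xs, ys, label=species)  # each call gets the next colour in the cycle

ax.set_xlabel("sepal_length")
ax.set_ylabel("sepal_width")
ax.legend()
fig.savefig("iris_scatter.png")  # hypothetical output file name
```

Plotting each species in its own `scatter` call is what gives every class its own colour and legend entry.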
As you can see, I have used two features, Sepal_Length and Sepal_Width, and we can clearly separate Setosa (blue) from the other two species with a simple straight line. We can put any two features on the X and Y axes and pick the pair that visualizes the problem best. The question is: do we have to draw multiple scatter plots to find the differences? With four features there are 4C2 = 6 possible scatter plots.
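To see where the six comes from, the feature pairs can be listed with a short sketch. The column names below are assumed to match the ones used in this post.

```python
from itertools import combinations

# Assumed column names for the four Iris features.
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Every unordered pair of distinct features is one candidate scatter plot.
pairs = list(combinations(features, 2))

for x_name, y_name in pairs:
    print(f"{x_name} vs {y_name}")
print("number of scatter plots:", len(pairs))
```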
Now a question that may strike your mind is: "Can we plot a 3D graph here?" The answer is yes, but we don't usually do that because it is hard to analyze three variables together on a flat screen. The Iris data set has four features, and we cannot visualize 4D directly.
There are several ways to deal with this:
A pair plot works best when we have around 4–6 variables. Beyond that we use dimensionality-reduction techniques such as PCA (Principal Component Analysis), t-SNE, etc.
Pairplot
Observations:
- We can see different possibilities here, but in my opinion petal_length and petal_width are the most useful features for identifying the flower types.
- Setosa can be easily identified (it is linearly separable), while Virginica and Versicolor are almost separable, with some overlap.
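The idea behind a pair plot can be sketched with plain matplotlib: one panel per pair of features, with a histogram on the diagonal. This is only an illustration on a few hard-coded rows, not the code behind the figure above. The file name is an assumption. In practice seaborn's `pairplot` function draws this kind of grid from a DataFrame in one call.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# A few hard-coded rows for illustration: four measurements plus the species.
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"), (4.9, 3.0, 1.4, 0.2, "setosa"),
    (5.4, 3.9, 1.7, 0.4, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"), (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.9, 3.1, 4.9, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"), (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.1, 3.0, 5.9, 2.1, "virginica"),
]
species_names = ["setosa", "versicolor", "virginica"]

n = len(features)
fig, axes = plt.subplots(n, n, figsize=(10, 10))
for i in range(n):          # row index selects the y feature
    for j in range(n):      # column index selects the x feature
        ax = axes[i][j]
        for name in species_names:
            xs = [r[j] for r in rows if r[4] == name]
            ys = [r[i] for r in rows if r[4] == name]
            if i == j:
                ax.hist(xs, alpha=0.5, label=name)    # diagonal: one feature's distribution
            else:
                ax.scatter(xs, ys, s=15, label=name)  # off-diagonal: feature j vs feature i
        if i == n - 1:
            ax.set_xlabel(features[j])  # label only the bottom row
        if j == 0:
            ax.set_ylabel(features[i])  # label only the left column

axes[0][0].legend()
fig.savefig("iris_pairplot.png")  # hypothetical output file name
```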
Let's see some examples of univariate analysis:
Bar Chart
A bar chart represents categorical data, with rectangular bars having lengths proportional to the values that they represent. For example, we can use the iris dataset to observe the average petal and sepal lengths/widths of all the different species.
Observations:
- Virginica has the highest petal length, petal width and sepal length, followed by Versicolor and then Setosa.
- Sepal width deviates from this trend: Setosa is highest, followed by Virginica and then Versicolor.
Box Plot
A box-and-whisker plot, also called a box plot, displays the five-number summary of a set of data. The five-number summary is the minimum, first quartile, median, third quartile, and maximum. If these terms are unfamiliar, I will cover them in a separate post.
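The five numbers behind a box plot can be computed with Python's standard library. This is a small sketch on made-up values, not the data plotted in this post. The helper name `five_number_summary` is my own. Note that different tools use slightly different quartile conventions, so the quartiles may differ a little from what a plotting library draws.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # n=4 asks for the three cut points that split the data into quarters.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical petal lengths in cm, chosen only for illustration.
petal_lengths = [1.3, 1.4, 1.4, 1.5, 1.7, 4.5, 4.7, 4.9, 5.1]

summary = five_number_summary(petal_lengths)
for label, value in zip(["min", "Q1", "median", "Q3", "max"], summary):
    print(f"{label}: {value}")
```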
That was a brief look at what Exploratory Data Analysis is. Now let's see some statistical tools:
Mean, Variance and Standard Deviation
Mean
Standard definition: The mean (average) of a data set is found by adding all numbers in the data set and then dividing by the number of values in the set. The median is the middle value when a data set is ordered from least to greatest. The mode is the number that occurs most often in a data set. The mean tells us about central tendency. That is the standard definition found on the internet; now for my own view:
Formulae:
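In symbols, for a data set of n values x_1, ..., x_n, the definition of the mean given above is usually written as follows (the notation is my choice):

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```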
We can use the mean to compare values. For example:
Here we can see that the petal_length value is small compared to the next two values.
Problem with the Mean:
Adding an outlier can pull the mean away from its typical value. For example, let's add 50 as an outlier to the data:
After adding the outlier, the resulting average is 2.4156…
An outlier is a value far outside the expected range, which may have been introduced by mistake or measurement error.
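The effect is easy to reproduce on a tiny example. The values below are made up for illustration, so the numbers differ from the 2.4156 above, which comes from the data used in this post.

```python
import statistics

# Hypothetical measurements in cm, chosen only to illustrate the effect.
values = [1.4, 1.4, 1.3, 1.5, 1.4]
mean_before = statistics.mean(values)

# Append a single extreme value, as in the text above.
with_outlier = values + [50]
mean_after = statistics.mean(with_outlier)

print(f"mean without outlier: {mean_before:.2f}")
print(f"mean with outlier:    {mean_after:.2f}")
```

With only five points, one outlier moves the mean from about 1.4 to about 9.5. The more points there are, the smaller the shift, but the mean is still pulled towards the outlier.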
Before going further, let's see what spread is: spread tells us how widely the data is dispersed.
The spread of Setosa is small compared to the other two. The graph shows the spread nicely, but we can also use a numerical approach to measure it.
Variance
Variance is the average of the squared distances from each point to the mean.
Standard Deviation
Mathematically, spread is measured by the standard deviation, which is the square root of the variance. It tells us roughly how far, on average, points deviate from the mean.
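Both definitions translate directly into code. The sketch below uses made-up numbers and a helper name of my own (`population_variance`). It divides by n, which matches the "average of squared distances" definition given here. Many tools report the sample variance instead, which divides by n - 1 (`statistics.variance` in Python).

```python
import math
import statistics

def population_variance(values):
    """Average of the squared distances from each point to the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

# Hypothetical values, chosen so the results come out as round numbers.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

variance = population_variance(values)
std_dev = math.sqrt(variance)  # standard deviation is the square root of variance

print("variance:", variance)
print("standard deviation:", std_dev)

# The standard library gives the same answers.
print("pvariance:", statistics.pvariance(values))
print("pstdev:", statistics.pstdev(values))
```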
With both the mean and the standard deviation, the problem was the outlier: after adding it, both statistics get distorted. The median, percentiles and quantiles are ways of dealing with this problem.
I will discuss all of these in another blog post, as this one is getting long, and link to it from this article.
Lastly, to sum up: Exploratory Data Analysis is a philosophical and artistic approach to gauging every nuance of the data at the first encounter.