Logistic Regression

SanDeep DuBey
5 min read · May 16, 2021

Finding a plane that separates my data

Before starting our journey into logistic regression, let's understand the geometric intuition behind the technique:

Geometric Intuition

From a geometric perspective, logistic regression tries to find the line or hyperplane that best separates the data. It works with data that is perfectly or almost linearly separable. A binary classification dataset is called linearly separable if a line or plane can perfectly (or almost perfectly) separate the two classes.

Example of Linearly Separable data

We can derive logistic regression using one of three methods:

  1. Using Geometry
  2. Probabilistic Method
  3. Loss Function Method

I am covering only the geometric method here, since it would be hard to cover all three methods in one blog post, so I will paste a link at the end to help you understand the other two.

Derivation of Logistic Regression Equation using Geometry

Let's see the geometric intuition first:

Plane that separates linearly separable dataset

Let's do some mathematics, taking the diagram above as a reference:

We know that the distance from a point to the plane can be written like this:
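For reference, here is the standard point-to-plane distance, written as a small sketch under two assumptions that the rest of this derivation implicitly relies on: the plane passes through the origin (no intercept term), and W is the normal vector of the plane. The symbols w, x_i and d_i below correspond to W, Xi and the distance of the i-th point.

```latex
% Distance from point x_i to the plane w^T x = 0 (assumption: no intercept term)
d_i = \frac{w^{T} x_i}{\lVert w \rVert}

% If w is additionally assumed to be a unit vector (\lVert w \rVert = 1):
d_i = w^{T} x_i
```

The distance is positive for points on the side of the plane that W points towards and negative for points on the other side.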

According to the diagram mentioned above:

This term is generally known as the signed distance:

Yi * W^T * Xi

Let's look at four different cases of the signed distance:

Case 1 and Case 2: Here the classifier gives the correct class label (Yi and W^T * Xi have the same sign), which is why the signed distance is positive in these cases.

Case 3 and Case 4: Here the classifier gives the incorrect class label (Yi and W^T * Xi have opposite signs), which is why the signed distance is negative in these cases.

If you look at these cases carefully, you will observe that Yi * W^T * Xi > 0 means the point is correctly classified and Yi * W^T * Xi < 0 means it is incorrectly classified.
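The four cases can be checked with a few lines of Python. This is only a sketch: the plane normal, the points, and the +1 / -1 label encoding are made-up values chosen for illustration, and the helper names `dot` and `signed_distance` are hypothetical.

```python
# Assumptions for illustration: labels are +1 / -1, the plane passes through
# the origin, and w is a unit normal vector. All numbers here are made up.
w = (1.0, 0.0)

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def signed_distance(w, x, y):
    """Return y * w^T x: positive when the point lies on the side its label expects."""
    return y * dot(w, x)

# (point, true label) pairs covering the four sign combinations
cases = {
    "case 1 (y=+1, positive side)": ((2.0, 1.0), +1),
    "case 2 (y=-1, negative side)": ((-3.0, 0.5), -1),
    "case 3 (y=+1, negative side)": ((-1.5, 2.0), +1),
    "case 4 (y=-1, positive side)": ((4.0, -1.0), -1),
}

results = {name: signed_distance(w, x, y) for name, (x, y) in cases.items()}

for name, d in results.items():
    verdict = "correct" if d > 0 else "incorrect"
    print(f"{name}: signed distance = {d:+.1f} -> {verdict}")
```

Cases 1 and 2 come out positive and cases 3 and 4 negative, matching the rule above.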

So our optimization equation is:

W* = argmax_W Σ (Yi * W^T * Xi), where the sum runs over all data points

The W that gives the maximum value of this sum defines the best plane separating our points.

Sigmoid Function

But the picture is not complete yet. There is a problem with the optimization equation above, so let's discuss it and understand geometrically what goes wrong. I consider two cases here and calculate the sum of signed distances in each.

Case 1:

According to this intuition:

Sum of signed distances (Yi * W^T * Xi):

1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 - 100 = -90

Case 2:

Sum of signed distances:

1 + 2 + 3 + 4 + 5 - 1 - 2 - 3 - 4 - 5 + 1 = +1

As you might have guessed by now, our optimization function is not robust to outliers. Intuitively, π1 (Case 1) is a better plane than π2 (Case 2): going by the signed distances listed above, π1 classifies 10 of its 11 points correctly and misses only the single outlier, while π2 classifies only 6 of 11 correctly. Yet according to our optimization function π2 is better, because +1 > -90.
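The arithmetic behind the two cases is easy to reproduce. The sketch below uses only the signed distances listed above; the helper name `summarize` is hypothetical, and the "correct" count is simply the number of positive signed distances in each list.

```python
# Signed distances copied from the two cases above
case_1 = [1] * 10 + [-100]                        # ten correct points, one far outlier
case_2 = [1, 2, 3, 4, 5, -1, -2, -3, -4, -5, 1]

def summarize(signed_distances):
    """Return (sum of signed distances, number of correctly classified points)."""
    total = sum(signed_distances)
    correct = sum(1 for d in signed_distances if d > 0)
    return total, correct

for name, distances in [("case 1", case_1), ("case 2", case_2)]:
    total, correct = summarize(distances)
    print(f"{name}: sum = {total:+d}, correct = {correct} of {len(distances)}")
# case 1: sum = -90, correct = 10 of 11
# case 2: sum = +1, correct = 6 of 11
```

The plane with the higher sum is the one with fewer correctly classified points, which is exactly the outlier problem described here.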

There are various methods for removing outliers from a dataset, but none of them is guaranteed to remove every outlier, and, as we have seen above, even a single outlier can heavily distort our search for the best plane.

One idea is to squish the function, which means passing the signed distance through the sigmoid function instead of using it directly. The main idea is:

If the signed distance is small: use it as it is

If the signed distance is large: make it a smaller value

The sigmoid function squishes large values, so every output lies between 0 and 1.
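To see the squishing in action, here is a minimal sketch. It assumes the standard logistic form of the sigmoid, 1 / (1 + e^(-z)), which this post does not write out explicitly; the branch on the sign of z is only there to keep the exponent small.

```python
import math

def sigmoid(z):
    """Standard logistic sigmoid 1 / (1 + e^(-z)), computed in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # z < 0, so this cannot overflow
    return ez / (1.0 + ez)

# Small inputs change the output noticeably; large inputs are flattened
# towards 0 or 1, so one extreme value can no longer dominate a sum.
for z in [-100, -5, -1, 0, 1, 5, 100]:
    print(f"sigmoid({z:>4}) = {sigmoid(z):.4f}")
```

Near zero the output still moves with the input, while -100 and +100 land almost exactly on 0 and 1.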

Now you may ask why we use this particular function. There are several reasons to choose the sigmoid function over others:

  • It has a nice probabilistic interpretation. For example, if a point lies on the decision surface (d = 0), then intuitively its probability should be 1/2, since it could belong to either class, and indeed Sigma(0) = 1/2.
  • It is easy to differentiate.
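Both points can be made concrete with the standard definition of the sigmoid. The post does not spell this definition out, so treat the formula below as an assumption about which sigmoid is meant:

```latex
% Standard logistic sigmoid and its value on the decision surface
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma(0) = \frac{1}{1 + e^{0}} = \frac{1}{2}

% Its derivative can be written in terms of the function itself
\frac{d\sigma}{dz} = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
```

Because the derivative reuses the value of the function, it is cheap to compute once the sigmoid itself has been evaluated.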

Now our new optimization function is:

W* = argmax_W Σ Sigma(Yi * W^T * Xi), where the sum runs over all data points
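As a quick check of what the sigmoid changes, the sketch below re-scores the two outlier cases from earlier, once with the raw sum of signed distances and once with the sum of their sigmoids. It assumes the standard logistic sigmoid; the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Standard logistic sigmoid, computed in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Signed distances from the two cases discussed earlier
case_1 = [1] * 10 + [-100]
case_2 = [1, 2, 3, 4, 5, -1, -2, -3, -4, -5, 1]

def raw_objective(signed_distances):
    """Sum of signed distances (the original objective)."""
    return sum(signed_distances)

def squished_objective(signed_distances):
    """Sum of sigmoid(signed distance) (the new objective)."""
    return sum(sigmoid(d) for d in signed_distances)

for name, ds in [("case 1", case_1), ("case 2", case_2)]:
    print(f"{name}: raw = {raw_objective(ds):+d}, squished = {squished_objective(ds):.2f}")
# case 1: raw = -90, squished = 7.31
# case 2: raw = +1, squished = 5.73
```

With the raw sum the second plane wins, but after squishing the first plane scores higher, because the single outlier can cost it at most 1.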

Let's transform this function:

A monotonically increasing function is one whose value increases as its input increases. Put simply:

If g(X1) > g(X2) whenever X1 > X2, then g(X) is said to be a monotonically increasing function.

Monotonic Function

If g(x) is a monotonically increasing function:

argmin f(x) = argmin g(f(x))

argmax f(x) = argmax g(f(x))
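This property can be checked numerically on a small grid. In the sketch below, f is an arbitrary positive function picked for illustration and g is the natural log, which is monotonically increasing on positive inputs; neither choice comes from the post.

```python
import math

# Grid of candidate x values: 0.1, 0.2, ..., 5.0
xs = [i / 10 for i in range(1, 51)]

def f(x):
    """Arbitrary positive example function; it peaks at x = 1."""
    return 1.0 + x * math.exp(-x)

def g(v):
    """A monotonically increasing function (natural log, defined for v > 0)."""
    return math.log(v)

argmax_f = max(xs, key=f)
argmax_gf = max(xs, key=lambda x: g(f(x)))
argmin_f = min(xs, key=f)
argmin_gf = min(xs, key=lambda x: g(f(x)))

print(argmax_f, argmax_gf)  # both 1.0
print(argmin_f, argmin_gf)  # both 5.0
```

Applying g changes the values being compared but not where the largest and smallest of them occur.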

We also use the property argmax(-f(x)) = argmin(f(x)).
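Putting these two properties together, the transformation can be sketched as follows. This is a reconstruction under assumptions that are not stated in the post: g is taken to be the natural log, and the sigmoid is the standard logistic function. Note that the log is applied to each term of the sum rather than to the sum as a whole, so that step is a modelling choice that the monotonicity argument motivates rather than strictly proves.

```latex
\begin{align*}
w^{*} &= \arg\max_{w} \sum_{i=1}^{n} \sigma\!\left(y_i\, w^{T} x_i\right)
       = \arg\max_{w} \sum_{i=1}^{n} \frac{1}{1 + e^{-y_i w^{T} x_i}} \\
% apply g = \log to each term of the sum
      &\longrightarrow \arg\max_{w} \sum_{i=1}^{n} \log \frac{1}{1 + e^{-y_i w^{T} x_i}}
       = \arg\max_{w} \sum_{i=1}^{n} -\log\!\left(1 + e^{-y_i w^{T} x_i}\right) \\
% use \arg\max(-f) = \arg\min(f)
      &= \arg\min_{w} \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i w^{T} x_i}\right)
\end{align*}
```

The last line is the usual form of the logistic loss: each correctly classified point far from the plane contributes almost nothing, and each misclassified point contributes a penalty that grows with its distance.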

Final Optimization Equation

Thanks for reading! I hope this helped you understand some of the mathematical intuition behind logistic regression.
