# Proof that you can use absolute value error for logistic regression even though the y-values are only 0 and 1.

Alex Roberts, PhD

2 years ago | 2 min read

Usually, the motivation for treating logistic regression as a likelihood problem is not really discussed. We are just told that we have to do it this way. However, the real reason behind it is error saturation: with standard error measures, a single point can never contribute an error of more than 1.0, because the data occupy a limited range in the y-direction (either 0 or 1). This is bad, and you can see it with a thought experiment. Suppose the data are linearly separable; then the fitted logistic curve will be infinitely steep at z=0. Now add another data point that disagrees: its error is going to be 1.0 no matter where it is put. In reality, though, the further the point lies in the wrong direction, the worse the fit should be judged. So there is a mismatch.
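To make the saturation concrete, here is a minimal sketch of the thought experiment. It assumes a point with true label 0 placed on the wrong side (z > 0) and uses a large but finite `steepness` value as a stand-in for the infinitely steep curve; both choices are illustrative assumptions, not something fixed by the argument above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical stand-in for the "infinitely steep" curve of separable data.
steepness = 50.0

# A point with true label 0, placed further and further on the wrong side.
z_wrong = np.array([0.5, 2.0, 10.0])
p = sigmoid(steepness * z_wrong)

# Absolute error |y_true - p| with y_true = 0: it cannot exceed 1.
abs_error = np.abs(0.0 - p)

for z, err in zip(z_wrong, abs_error):
    print(f"z = {z:5.1f}   absolute error = {err:.6f}")
```

However far the point is pushed, the absolute error stays pinned at essentially 1, which is the mismatch described above.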

But what if I told you that there is a traditional fit curve, derived from logistic regression, such that after some data manipulation you fit the data with a line and the error is exactly the distance from the point to the line? In other words, you can calculate the probability based on an exponential distribution of distance from the squiggle. At first this seems impossible. After all, the error is just the negative log of one minus the absolute difference between the predicted and observed y-values, which grows approximately linearly after the z=0 point (the reason for the log is that while likelihoods multiply, the logs of likelihoods add, just like regular errors):
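The linear growth can be written out in one line. This is a sketch for a point with true label 0, where z > 0 is the wrong side; the symbols $\ell_0$ and $\sigma$ are notation introduced here, not taken from the original post.

```latex
\ell_0(z) = -\log\bigl(1 - \sigma(z)\bigr) = \log\bigl(1 + e^{z}\bigr) = z + \log\bigl(1 + e^{-z}\bigr),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

For large positive z the last term vanishes, so the error is roughly z itself, while at z=0 it equals log 2. The case of true label 1 is the mirror image, with z replaced by -z.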

The solution is to move all points with z>0 to the z=0 point, but to keep the correct z-value for the function itself. Then the distance takes into account not only the y-distance, but the z-distance as well. We can see that this is going to work by finding the remaining y-distance required to give the correct error, assuming a Laplace distribution:
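One possible reading of this step, offered as an assumption rather than as the exact construction intended here: take a point with true label 0, write its log-loss as $\ell_0(z) = \log(1 + e^{z})$, treat "distance" as the Euclidean diagonal from the moved point at z=0 to the curve at z, and use a Laplace distribution with unit scale, so that the negative log-density is the distance plus a constant. The required remaining y-distance $\Delta y$ then follows from:

```latex
\sqrt{z^2 + \Delta y(z)^2} = \ell_0(z)
\quad\Longrightarrow\quad
\Delta y(z) = \sqrt{\ell_0(z)^2 - z^2}, \qquad z > 0
```

The square root is always real because $\ell_0(z) = z + \log(1 + e^{-z}) > z$ for every z.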

Putting both parts together (since at z=0 the y-distance and the diagonal distance are the same, the function is continuous; in fact, it is differentiable), we get this curve:
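The continuity and differentiability claim can be checked numerically. The sketch below rests on assumptions that are mine, not the post's: true label 0, log-loss log(1 + e^z) as the plain y-distance on the correct side (z <= 0), and on the wrong side (z > 0) a Euclidean diagonal whose length must equal the log-loss. The function name `squiggle_y0` is hypothetical.

```python
import numpy as np

def squiggle_y0(z):
    """Assumed piecewise curve for true label 0 (see the stated assumptions)."""
    z = np.asarray(z, dtype=float)
    # Correct side: the error is the plain y-distance, log(1 + exp(z)).
    loss = np.logaddexp(0.0, z)
    # Wrong side: remaining y-distance sqrt(loss^2 - z^2), computed stably
    # via loss = z + r with r = log(1 + exp(-z)), so loss^2 - z^2 = r * (2z + r).
    zp = np.maximum(z, 0.0)
    r = np.logaddexp(0.0, -zp)
    resid = np.sqrt(r * (2.0 * zp + r))
    return np.where(z > 0, resid, loss)

eps = 1e-6
left, mid, right = squiggle_y0([-eps, 0.0, eps])

# One-sided finite-difference slopes at the join z = 0.
slope_left = (mid - left) / eps
slope_right = (right - mid) / eps

print(f"value at z=0: {mid:.6f}")
print(f"slope from the left:  {slope_left:.4f}")
print(f"slope from the right: {slope_right:.4f}")
```

Under these assumptions the two pieces meet at log 2 and share the same slope at z=0, which agrees with the claim that the joined curve is differentiable there.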

Logistic regression can then be thought of as fitting two squiggles (one for y_true=1 and one for y_true=0) to the data such that the sum of absolute distances to these squiggles is minimized, with the condition that all data points on the ‘wrong side’ are moved to z=0, although their location on the function is still at the respective z-value. Here are the two squiggles together:

Next time you use Logistic Regression to calculate probabilities, also remember that they are based on the assumptions of the model: in this case, that both misclassifications and one minus correct classifications follow the same exponential tail distribution away from the classification boundary. You might say that this is a Bayesian assumption on the model (see here for a frequentist modification):
