We get bombarded daily with exotic terms like regression, classification, clustering, neural networks, deep learning, SVMs, and so on. If you’re a curious person who is drawn to topics you don’t yet understand, and you enjoy the process of working them out, you’re in the right place.

Linear Regression is a supervised machine learning algorithm for predicting the value of a continuous variable. In simple language, Regression is all about catching the trend in a given data set, storing what we’ve learned about that data into a model, and then using that model to make predictions for new inputs.

Hypothesis: As its everyday meaning suggests, this is our prediction of what the final curve will look like once the regression is done. In the case of linear regression with one variable, it looks like:

h(x) = theta 0 + theta 1 * x

Here:

‘x’ is the independent variable on which the hypothesis depends. For example, ‘number of popsicles sold’ could be ‘x’, and ‘revenue’ could be the value our hypothesis is trying to predict.
‘theta 0’ is the bias term (the intercept).
‘theta 1’ is the weight (the slope) applied to ‘x’.
theta 0 and theta 1 together make up the parameter vector that defines our model.
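
A minimal Python sketch of this hypothesis is shown here. The function name `hypothesis` and the parameter values are illustrative assumptions, not values from a fitted model.

```python
def hypothesis(x, theta0, theta1):
    """Predict y for input x with a one-variable linear hypothesis."""
    return theta0 + theta1 * x


# Hypothetical parameters: a base revenue of 5.0 plus 2.0 per popsicle sold.
theta0, theta1 = 5.0, 2.0
predicted_revenue = hypothesis(10, theta0, theta1)
print(predicted_revenue)  # 5.0 + 2.0 * 10 = 25.0
```

With x = 0 the prediction is just theta 0, which is why it is called the bias or intercept.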

Linear Regression is generally classified into two types:

Simple Linear Regression
Multiple Linear Regression

Simple Linear Regression

In Simple Linear Regression, we try to find the relationship between a single independent variable (input) and a corresponding dependent variable (output). This can be expressed in the form of a straight line.

The equation of this line can be written as:

Y = β0 + β1*X + ε

X represents the input or independent variable.
Y represents the output or dependent variable.
β0 and β1 are two unknown constants that represent the intercept and coefficient (slope) respectively.
ε (Epsilon) is the error term.
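
To make the role of the error term concrete, here is a small sketch that generates Y values as an intercept, plus a slope times X, plus random noise. The function name `simulate`, the parameter values, and the choice of Gaussian noise are assumptions made for illustration only.

```python
import random


def simulate(xs, beta0, beta1, noise_sd, seed=0):
    """Generate Y = beta0 + beta1 * X + epsilon for each X in xs.

    epsilon is drawn from a normal distribution with mean 0 and
    standard deviation noise_sd (an assumption for this sketch).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [beta0 + beta1 * x + rng.gauss(0.0, noise_sd) for x in xs]


xs = [0, 1, 2, 3, 4]
exact = simulate(xs, beta0=1.0, beta1=2.0, noise_sd=0.0)  # no error term
noisy = simulate(xs, beta0=1.0, beta1=2.0, noise_sd=0.5)  # with error term
print(exact)  # points lie exactly on the line
print(noisy)  # points scatter around the line
```

With the noise set to zero every point falls on the line. With noise, the points scatter around it, and regression tries to recover the line from that scatter.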

The following is a sample graph of a Simple Linear Regression Model:

Applications of Simple Linear Regression include:

Predicting crop yields based on the amount of rainfall: Yield is a dependent variable while the amount of rainfall is the independent variable.
Predicting marks scored by students based on the number of hours studied (ideally): Marks scored is the dependent variable and the number of hours studied is the independent variable.
Predicting the salary of a person based on years of experience: Experience is the independent variable while salary is the dependent variable.

Multiple Linear Regression

In Multiple Linear Regression, we try to find the relationship between two or more independent variables (inputs) and the corresponding dependent variable (output). The independent variables can be continuous or categorical.

The equation that describes how the predicted values of y are related to p independent variables is called the Multiple Linear Regression equation:

y = β0 + β1*x1 + β2*x2 + ... + βp*xp
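
In code, a multiple linear regression prediction is an intercept plus a weighted sum of the inputs. The sketch below assumes the coefficients are already known. The function name `predict` and the numbers are hypothetical.

```python
def predict(features, intercept, coefficients):
    """Return intercept + sum of coefficient * feature over all p features."""
    if len(features) != len(coefficients):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical model with p = 2 independent variables.
intercept = 1.0
coefficients = [2.0, 3.0]
print(predict([4.0, 5.0], intercept, coefficients))  # 1 + 2*4 + 3*5 = 24.0
```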

Below is the graph for Multiple Linear Regression Model, applied on the iris data set:

Multiple linear regression analysis can help us in the following ways:

It helps us predict trends and future values. The multiple linear regression analysis can be used to get point estimates.
It can be used to forecast the effects or impacts of changes. That is, multiple linear regression analysis can help us understand how much the dependent variable will change when we change the independent variables.
It can be used to identify the strength of the effect that the independent variables have on a dependent variable.

Real-world example

We have a dataset that records the relationship between ‘number of hours studied’ and ‘marks obtained’. Many students have been observed, and their hours of study and marks have been recorded. This will be our training data. The goal is to design a model that can predict marks given the number of hours studied. Using the training data, we obtain the regression line that gives the minimum error. This linear equation is then used for any new data. That is, if we give the number of hours studied by a student as an input, our model should predict their marks with minimum error.

Y(pred) = b0 + b1*x

The values b0 and b1 must be chosen so that they minimize the error. If the sum of squared errors is taken as the metric to evaluate the model, then the goal is to obtain the line that minimizes that sum.

If we don’t square the errors, then the positive and negative errors will cancel each other out.
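
The effect of squaring can be checked with a short sketch. The data and helper names below are made up for illustration.

```python
def residuals(xs, ys, b0, b1):
    """Errors between the observed y values and the line b0 + b1 * x."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def sum_squared_error(xs, ys, b0, b1):
    """Sum of squared errors for the line b0 + b1 * x."""
    return sum(r ** 2 for r in residuals(xs, ys, b0, b1))


xs = [1, 2, 3]
ys = [2, 4, 6]  # these points lie exactly on y = 2x

# A flat line y = 4 is a poor fit, yet its raw errors (-2, 0, 2) sum to zero.
print(sum(residuals(xs, ys, 4, 0)))      # 0
print(sum_squared_error(xs, ys, 4, 0))   # 8

# The true line y = 2x has zero squared error.
print(sum_squared_error(xs, ys, 0, 2))   # 0
```

The raw errors make the flat line look perfect, while the squared errors correctly rank it as worse than the true line.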

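For a model with one predictor, the values that minimize the sum of squared errors are:

b1 = sum((x_i - mean(x)) * (y_i - mean(y))) / sum((x_i - mean(x))^2)
b0 = mean(y) - b1 * mean(x)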
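
The standard closed-form least-squares estimates for a single predictor can be sketched as follows. The function name `fit_simple` and the hours/marks numbers are hypothetical and chosen only to illustrate the calculation.

```python
def fit_simple(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors for y = b0 + b1*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical training data: hours studied and marks obtained.
hours = [1, 2, 3, 4, 5]
marks = [52, 58, 61, 67, 72]

b0, b1 = fit_simple(hours, marks)
print(b0, b1)          # roughly 47.3 and 4.9
print(b0 + b1 * 6)     # predicted marks for 6 hours of study
```

Since b1 is positive here, more hours studied corresponds to higher predicted marks.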

Exploring ‘b1’

If b1 > 0, then x (predictor) and y (target) have a positive relationship. That is, an increase in x will increase y.
If b1 < 0, then x (predictor) and y (target) have a negative relationship. That is, an increase in x will decrease y.

Exploring ‘b0’

If the data does not include x = 0, then a prediction that consists of b0 alone is meaningless. For example, suppose we have a dataset that relates height (x) and weight (y). Setting x = 0 (that is, a height of 0) leaves only the b0 term in the equation, which is meaningless because in reality height and weight can never be zero. This comes from applying the model outside the range of the data it was fit on.
If the data does include x = 0, then ‘b0’ is the average predicted value when x = 0. However, setting all the predictor variables to zero is often impossible.
The b0 term guarantees that the residuals have a mean of zero. Without a ‘b0’ term, the regression line is forced to pass through the origin, and both the regression coefficient and the predictions will be biased.
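
The claim about residuals can be checked numerically. The sketch below fits the same hypothetical hours/marks data twice, once with an intercept and once forced through the origin, and compares the mean residual. All names and numbers are assumptions for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def fit_with_intercept(xs, ys):
    """Least-squares fit of y = b0 + b1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def fit_through_origin(xs, ys):
    """Least-squares fit of y = b1 * x, with b0 fixed at 0."""
    b1 = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return 0.0, b1


def mean_residual(xs, ys, b0, b1):
    return mean([y - (b0 + b1 * x) for x, y in zip(xs, ys)])


hours = [1, 2, 3, 4, 5]
marks = [52, 58, 61, 67, 72]  # hypothetical data

with_b0 = mean_residual(hours, marks, *fit_with_intercept(hours, marks))
without_b0 = mean_residual(hours, marks, *fit_through_origin(hours, marks))
print(with_b0)     # zero up to floating-point rounding
print(without_b0)  # clearly non-zero: the line is pulled away from the data
```

The fit through the origin also produces a much steeper slope than the fit with an intercept, which is the bias in the coefficient mentioned above.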

Coefficients from the normal equation

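Apart from the above equation, the coefficients of the model can also be calculated from the normal equation:

Theta = (X^T X)^-1 X^T y

Here X is the input matrix (with a column of ones for the constant term) and y is the vector of target values.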

Theta contains the coefficients of all predictors, including the constant term ‘b0’. The normal equation requires inverting a matrix built from the input matrix (the product of its transpose with itself), and the size of that matrix grows with the number of features. The cost of the computation therefore rises as the number of features increases, and it gets very slow when the number of features grows large.
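
Below is a dependency-free sketch of the normal equation. Instead of forming the inverse explicitly, it solves the equivalent linear system (X^T X) Theta = X^T y by Gauss-Jordan elimination. All function names and data are assumptions for illustration. In practice a linear algebra library such as NumPy would normally handle these steps.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def normal_equation(features, targets):
    """Return Theta = [b0, b1, ..., bp] for rows of features and targets."""
    X = [[1.0] + list(row) for row in features]  # leading 1 carries b0
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(x * y for x, y in zip(row, targets)) for row in Xt]
    return solve(XtX, Xty)


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 with no noise.
features = [[0, 0], [1, 0], [0, 1], [1, 1]]
targets = [1, 3, 4, 6]
theta = normal_equation(features, targets)
print(theta)  # close to [1.0, 2.0, 3.0]
```

The matrix X^T X has one row and one column per coefficient, so the elimination step does more work as features are added, which is the slowdown described above.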