Linear regression simplified - Machine learning basics

Sai Kiran
5 min read · Feb 15, 2021

Anyone starting out in Machine learning will begin with Linear regression, the “Hello world” of ML. It is a pretty simple algorithm and one of the basics everyone should master before advancing to more complex algorithms.

What is Regression?

According to Wikipedia, Regression is a set of statistical processes for estimating the relationships between a dependent (response) variable and one or more independent (predictor) variables.

Generally, in Machine learning, we refer to a problem as a Regression problem if the response variable we are trying to find (target variable) is a real or continuous value. Ex: Price ($50), Weight (73.2 Kg), Salary ($80,000).

Linear Regression

The most common type of Regression is Linear regression. In Linear regression, we aim to find a linear relationship between the response and predictor variable(s).

Another main type of Regression is Polynomial regression, where we try to find the best-fitting curve for the given data.

If the number of predictor variables in Linear regression is one, we call it Simple Linear regression, and if there are more than one, we call it Multiple Linear regression. In Simple Linear regression, we are trying to fit the line that best describes the relationship between the two variables.

Background

In the Linear regression context, you will hear everyone talking about fitting a linear model. What does the model actually mean? Here the model is nothing but the equation of the line we are trying to find so that it fits the given data as closely as possible.

The equation of any straight line can be represented as follows:

y = mx + c

In the above equation, the slope ‘m’ and the y-intercept ‘c’ define the line. The slope ‘m’ indicates how the ‘y’ value changes with a change in the ‘x’ value, and the y-intercept ‘c’ indicates where the line crosses the y-axis (its distance from the origin along the y-axis). The slope ‘m’ and y-intercept ‘c’ are referred to as the coefficients.

If we know the coefficients (‘m’ and ‘c’) and are given a new value of ‘x’, we can easily find the corresponding ‘y’ value by substituting the values of ‘m’, ‘x’, and ‘c’ into the above equation. Thus our goal in Linear regression is to find these coefficients.
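For example, if the fitted line has m = 2 and c = 1 (numbers chosen purely for illustration), then for a new value x = 3 the predicted y is 2 × 3 + 1 = 7.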

Line fitting

Now let us see how to actually fit the line for the given set of data.

There are two different approaches for line fitting: i) the Least-squares approach and ii) the Gradient descent method.

i) Least-squares approach:

This approach aims to fit a linear model by minimizing the sum of squares of residuals, where residuals are vertical distances between predicted values (ŷ) and actual values (y).

The best-fit line always passes through the point of means (x̄, ȳ). Solving the above optimization problem gives the following formulas for the coefficients:

m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
c = ȳ − m·x̄
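As a minimal sketch, these formulas can be computed directly with NumPy; the toy arrays below are made up purely for illustration.

import numpy as np

# toy data, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# sample means
x_mean, y_mean = x.mean(), y.mean()

# least-squares estimates of the slope 'm' and intercept 'c'
m_hat = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
c_hat = y_mean - m_hat * x_mean
print(m_hat, c_hat)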

ii) Gradient descent method:

In this approach, the loss function is the Mean Squared Error (MSE), J(m, c) = (1/n) Σ (yᵢ − (m·xᵢ + c))², and our goal is to minimize this loss function.

The Gradient descent method is an iterative process, where we start with some random guess for the values of the coefficients and then update them at each step until we reach the global minimum of the loss function.

The gradient of the function at a point indicates the direction of the steepest ascent. We move opposite to the gradient direction in each step to reach the global minimum. The gradient can be computed by taking the partial derivatives of the loss function. The distance by which we move each time is called the step size and is determined by the learning rate (α).

In each step, we update the values of ‘m’ and ‘c’ as follows:

m = m − α · (∂J/∂m)
c = c − α · (∂J/∂c)

We repeat this process until the value of the loss function (J) stops decreasing, i.e., until it reaches its minimum for the computed values of ‘m’ and ‘c’.
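Here is a minimal sketch of this update loop in plain NumPy; the toy data, learning rate, and number of iterations are illustrative choices, not part of the original derivation.

import numpy as np

# toy data, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

m, c = 0.0, 0.0   # initial guess for the coefficients
alpha = 0.01      # learning rate
n = len(x)

for _ in range(5000):
    y_hat = m * x + c
    # partial derivatives of the MSE loss with respect to 'm' and 'c'
    grad_m = (-2.0 / n) * np.sum(x * (y - y_hat))
    grad_c = (-2.0 / n) * np.sum(y - y_hat)
    # move opposite to the gradient direction
    m -= alpha * grad_m
    c -= alpha * grad_c

print(m, c)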

Goodness measures

After fitting the model we have to check how good the fit is. There are numerous metrics to evaluate the fitted model, such as Mean squared error, Mean absolute error, R-squared, and Adjusted R-squared. Among these, R-squared is the most extensively used.

The R-squared value indicates the proportional reduction in prediction error in the response variable (y) when we use regression instead of a baseline model. We can compute it from the Residual sum of squares (RSS) of our fitted regression model and the Total sum of squares (TSS) of the baseline model, where the baseline model simply predicts the mean of y (ȳ) for every observation:

R² = 1 − RSS / TSS

An R-squared value closer to 1 indicates that the model is a good fit for the given data.

One interesting property to observe here is that, in Simple linear regression, the R-squared value computed from the above formula and the square of the correlation coefficient (r²) are the same.
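A small sketch of this computation, where y_hat stands for the predictions of a fitted line and the numbers are made up for illustration:

import numpy as np

# actual values and predictions from a fitted line (illustrative numbers)
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

rss = np.sum((y - y_hat) ** 2)      # residual sum of squares of the fitted model
tss = np.sum((y - y.mean()) ** 2)   # total sum of squares of the baseline (mean) model
r_squared = 1 - rss / tss
print(r_squared)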

Python implementation

Now let us implement Simple linear regression on a sample dataset in Python using the LinearRegression class from the scikit-learn library.

First, import all the required libraries and functions as follows.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
%matplotlib inline

Now let us generate some random data such that it has a linear relationship.

x = np.arange(20)
w = np.array([2, 0.11])
y = w[0] + w[1]*x
y = y + np.random.normal(0, 0.25, x.shape[0])
plt.scatter(x, y)
plt.show()

Now let us fit a linear model to the above data by creating a LinearRegression() object from scikit-learn and calling its fit() method. Then we can predict the ŷ values (y_pred) using the predict() method.

x = x.reshape(-1, 1)  # scikit-learn expects x as a 2-D array of shape (n_samples, n_features)
y = y.reshape(-1, 1)
model = LinearRegression().fit(x, y)
y_pred = model.predict(x)
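
Since our goal was to recover the coefficients, we can also inspect the slope and intercept the fitted model estimated via its coef_ and intercept_ attributes; they should come out close to the values (0.11 and 2) we used to generate the data.

print(model.coef_)       # estimated slope 'm', close to 0.11
print(model.intercept_)  # estimated y-intercept 'c', close to 2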

Now we can plot the final regression line as below.

plt.plot(x, y_pred, color = 'red')
plt.scatter(x,y)
plt.show()

Thanks for reading.

Happy learning. Spread the smile :)
