Objective:
- Predict a continuous variable $Y$ that is a linear function of several continuous variables $x$.
Model structure:
- $Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$, linear in the parameters
Assumption:
- $Y$ follows a normal distribution; the errors are independent and have constant variance. The data $X$ are treated as fixed.
Parameter estimate:
- $\beta_0$ as the intercept and $\beta_1, \ldots, \beta_p$ as the slopes
Model selection:
- feature selection
Model fit:
- residual analysis
- F-statistic
Multiple Linear Regression Model
Multiple linear regression is a linear model with more than one predictor variable. These predictor variables are called independent variables, and the variable being predicted is called the dependent variable.
The formula for the multiple linear regression model is:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i$$

for $i = 1, \ldots, n$, where $n$ is the number of observations and $p$ is the number of independent variables.
Hence the matrix notation for multiple linear regression is:

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

which can also be written compactly as:

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
Using least squares, we need to minimise the function

$$S(\boldsymbol{\beta}) = \sum_{i=1}^n \left(y_i - \mathbf{x}_i^T\boldsymbol{\beta}\right)^2 = \|\mathbf{y} - X\boldsymbol{\beta}\|^2$$

to find the 'best' estimate $\hat{\boldsymbol{\beta}}$. One way is to use the geometric relation between $\mathbf{y}$ and the column space of $X$: the fitted vector $\hat{\mathbf{y}}$ is the projection of $\mathbf{y}$ onto the column space of $X$, so the residuals are orthogonal to the columns of $X$:

$$X^T(\mathbf{y} - X\hat{\boldsymbol{\beta}}) = \mathbf{0}$$

Then we can define

$$\hat{\boldsymbol{\beta}} = (X^TX)^{-1}X^T\mathbf{y}$$

Alternatively, we can use least squares as our loss (error) function, which minimises the Euclidean distance between the predicted $\hat{\mathbf{y}}$ and the actual $\mathbf{y}$:

$$L(\boldsymbol{\beta}) = \|\mathbf{y} - X\boldsymbol{\beta}\|_2^2$$

To find the minimum of the loss function, we differentiate with respect to $\boldsymbol{\beta}$ and set the derivative to zero:

$$\frac{\partial L}{\partial \boldsymbol{\beta}} = -2X^T(\mathbf{y} - X\boldsymbol{\beta}) = \mathbf{0}$$

We still get the same result:

$$\hat{\boldsymbol{\beta}} = (X^TX)^{-1}X^T\mathbf{y}$$
This of course works only if the inverse exists. If the inverse does not exist, the normal equations can still be solved, but the solution may not be unique.
For the fitted values $\hat{\mathbf{y}}$, we can plug in the estimate $\hat{\boldsymbol{\beta}}$:

$$\hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}} = X(X^TX)^{-1}X^T\mathbf{y} = H\mathbf{y}$$

The matrix $H = X(X^TX)^{-1}X^T$ (the hat matrix) is an $n \times n$ projection matrix; it maps the observed values $\mathbf{y}$ onto the fitted values $\hat{\mathbf{y}}$.

The residuals can then be written as

$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (I - H)\mathbf{y}$$
Here is a comparison between my own implementation built with numpy and the linear regression implementation in scikit-learn.
```python
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
```
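The original listing is truncated here, so the following is a minimal sketch of how such a comparison could look. The response vector y (generated as y = x1 + 2·x2 + 3) and all variable names are my own assumptions rather than the original code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Design matrix from above and an assumed response y = 1*x1 + 2*x2 + 3
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = X @ np.array([1, 2]) + 3

# --- Own implementation: solve the normal equations ---
# Add a column of ones for the intercept, then beta_hat = (X'X)^{-1} X'y
X1 = np.column_stack([np.ones(len(X)), X])
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)
y_hat_own = X1 @ beta_hat

# --- scikit-learn ---
reg = LinearRegression().fit(X, y)
y_hat_sklearn = reg.predict(X)

print("own coefficients:     ", beta_hat)              # [3. 1. 2.]
print("sklearn coefficients: ", reg.intercept_, reg.coef_)
print("predictions agree:    ", np.allclose(y_hat_own, y_hat_sklearn))
```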
Compared with the scikit-learn implementation, both produce the correct predictions; other aspects of performance can be considered later…
Leverage and Influence
To assess leverage and influence of the observations, we compute the leverage values $h_{ii}$ (the diagonal elements of the hat matrix $H$) and Cook's D-values. An easy way to compute the hat matrix is to use the influence()
function in R. This function returns a list with the following components:
- $hat is the diagonal of the hat matrix.
- The rows of $coefficients contain the differences $\hat{\beta} - \hat{\beta}_{(i)}$ between the full parameter estimate and the estimate when observation $i$ is omitted.
- $sigma contains the estimated values of $\sigma$ from the model with observation $i$ omitted.
The function cooks.distance() can be used to compute Cook's D-values $D_1^2, \ldots, D_n^2$.
To look for potential outliers, we find the observations with the largest residuals, leverages and D-values.
Both rows are identical, so the two ways to compute $\hat{\beta} - \hat{\beta}_{(21)}$ give the same result. Finally, Cook's D-value $D_{21}^2$ is defined to be

$$D_{21}^2 = \frac{(\hat{\beta} - \hat{\beta}_{(21)})^T X^T X (\hat{\beta} - \hat{\beta}_{(21)})}{(p + 1)\hat{\sigma}^2}$$
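The quantities above are reported by R's influence() and cooks.distance(); as a rough cross-check, here is a numpy sketch that computes the same leverages $h_{ii}$ and Cook's D-values directly from the formulas, reusing the small example from earlier (the perturbation of y and the variable names are my own assumptions).

```python
import numpy as np

# Reuse the small example; y is the assumed response from before
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = X @ np.array([1, 2]) + 3.0
y[3] += 0.5                                    # perturb one observation so residuals are non-zero

n, p = X.shape
X1 = np.column_stack([np.ones(n), X])          # add intercept column

# Hat matrix and its diagonal (leverages)
H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
h = np.diag(H)

# Residuals and residual variance estimate (p + 1 fitted parameters)
e = y - H @ y
sigma2_hat = (e @ e) / (n - p - 1)

# Cook's D-values: D_i = e_i^2 * h_ii / ((p + 1) * sigma^2 * (1 - h_ii)^2)
D = e**2 * h / ((p + 1) * sigma2_hat * (1 - h)**2)

print("leverages h_ii:", h.round(3))
print("Cook's D:      ", D.round(3))
```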
Model Selection
To build up a model stepwise, we include variables one by one until all F-values are below a pre-specified value $F_{\text{IN}}$. While there are variables with $F_j \geq F_{\text{IN}}$, we add a new variable. There are two choices:
a) We can either add the variable which leads to the largest $R^2$-value; or
b) we can add the variable with the largest F-value.
This is forward stepwise selection; backward stepwise selection works in the opposite direction, starting from the full model and removing variables with small F-values. A small sketch of the forward procedure is given below.
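As a rough illustration of the forward step described above, here is a numpy sketch that adds, at each step, the candidate variable with the largest partial F-value as long as it exceeds a chosen threshold F_IN; the toy data, the threshold value, and the helper names are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: only the first two of five candidate columns matter
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

def rss(X_sub, y):
    """Residual sum of squares of a least-squares fit with intercept."""
    X1 = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    e = y - X1 @ beta
    return e @ e

F_IN = 4.0                                        # assumed entry threshold
selected, remaining = [], list(range(p))

while remaining:
    rss_current = rss(X[:, selected], y) if selected else np.sum((y - y.mean())**2)
    # Partial F-value for adding each remaining candidate j
    F = {}
    for j in remaining:
        rss_new = rss(X[:, selected + [j]], y)
        df_resid = n - (len(selected) + 2)        # intercept + selected + candidate
        F[j] = (rss_current - rss_new) / (rss_new / df_resid)
    best = max(F, key=F.get)
    if F[best] < F_IN:
        break                                     # no candidate passes the threshold
    selected.append(best)
    remaining.remove(best)

print("selected columns:", selected)              # expected: [0, 1] for this toy data
```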
Robustness
The kinds of questions typically asked in the robustness literature are:
- Is the procedure sensitive to small departures from the model?
- To first order, what is the sensitivity?
- How wrong can the model be before the procedure produces garbage?
The first issue is that of qualitative robustness; the second is quantitative robustness; the third is the “breakdown point”.
Resistance and Breakdown Point
A statistic is resistant if arbitrary changes to a small part of the data (for example, a few outliers) do not change the result much.
Suppose we are allowed to change the values of the observations in the sample. What is the smallest fraction we would need to change to make the estimator take an arbitrary value? The answer is the breakdown point of the estimator. For example, the sample mean has breakdown point $0$ (in the limit), since a single observation can drag it to any value, while the median has breakdown point $1/2$.
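A tiny numerical illustration of this (my own example, not from the notes): corrupting a single observation can move the sample mean arbitrarily far, while the median barely moves.

```python
import numpy as np

x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
x_corrupt = x.copy()
x_corrupt[-1] = 1e6                        # replace one observation with an extreme outlier

print(np.mean(x), np.mean(x_corrupt))      # 4.0 vs ~200002.8 -> the mean breaks down
print(np.median(x), np.median(x_corrupt))  # 4.0 vs 4.0       -> the median is resistant
```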
M-Estimation
Linear least-squares estimates can behave badly when the error distribution is not normal, particularly when the errors are heavy-tailed. One remedy is to remove influential observations from the least-squares fit. Another approach, termed robust regression, is to use a fitting criterion that is not as vulnerable as least squares to unusual data.
This class of estimators can be regarded as a generalization of maximum-likelihood estimation, hence the term “M”-estimation.[2]
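As a concrete illustration (not taken from [2]; the function names, the tuning constant c = 1.345, and the toy data are my own choices), here is a minimal sketch of an M-estimator for linear regression using Huber weights and iteratively reweighted least squares (IRLS).

```python
import numpy as np

def huber_weights(r, c=1.345):
    """Huber weight function: 1 for small residuals, c/|r| for large ones."""
    r = np.where(r == 0, 1e-12, r)                 # avoid division by zero
    return np.minimum(1.0, c / np.abs(r))

def m_estimate(X, y, c=1.345, n_iter=50, tol=1e-8):
    """M-estimation of regression coefficients via IRLS with Huber weights."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]   # start from ordinary least squares
    for _ in range(n_iter):
        r = y - X1 @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745   # robust scale (MAD)
        scale = max(scale, 1e-12)
        w = huber_weights(r / scale, c)
        # Weighted least-squares step
        W = np.diag(w)
        beta_new = np.linalg.solve(X1.T @ W @ X1, X1.T @ W @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta

# Toy data with a few gross outliers in y
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.3, size=100)
y[:5] += 50.0                                      # contaminate five observations

beta_ols = np.linalg.lstsq(np.column_stack([np.ones(100), X]), y, rcond=None)[0]
beta_m = m_estimate(X, y)
print("OLS:", beta_ols.round(2))                   # intercept inflated by the outliers
print("M-estimate:", beta_m.round(2))              # close to the true (1, 2, -1)
```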
References
[1] http://mezeylab.cb.bscb.cornell.edu/labmembers/documents/supplement 5 - multiple regression.pdf
[2] http://users.stat.umn.edu/~sandy/courses/8053/handouts/robust.pdf