Multiple Linear Regression

Luke

Introduction to Multiple Linear Regression

\(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik} + \epsilon_i\)
Assumptions:
- \(\epsilon_i \sim N(0, \sigma^2)\), \(Cov(\epsilon_i, \epsilon_j) = 0\)
- Strict: \(\epsilon_i \overset{i.i.d.}{\sim} N(0, \sigma^2)\)

Matrix form

In matrix notation we write:
- \[\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1k} \\ 1 & x_{21} & x_{22} & \ldots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{nk} \end{bmatrix} \cdot \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}\]
Or in short (with dimension): \(\mathbf{y}_{n \times 1} = \mathbf{X}_{n \times (k+1)}\mathbf{\beta}_{(k+1) \times 1} + \mathbf{\epsilon}_{n \times 1}\)
alternatively, just: \(\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}\)

Some properties

\(E(\mathbf{\epsilon}) = \mathbf{0}\)
\(Var(\mathbf{\epsilon}) = \sigma^2\mathbf{I}\)
Under standard and strict assumptions the random vector \(\mathbf{y}\) has expectation \(E(\mathbf{y}) = \mathbf{X}\mathbf{\beta}\) and variance \(Var(\mathbf{y}) = \sigma^2\mathbf{I}\)

Quick review of matrix algebra

Let \(\mathbf{A}\) be a \(k \times k\) matrix. The inverse of \(\mathbf{A}\), denoted by \(\mathbf{A}^{-1}\), is another \(k \times k\) matrix such that \(\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}\)
Let \(\mathbf{A}\) be a \(n \times k\) matrix. The transpose of \(\mathbf{A}\), denoted by \(\mathbf{A}^T\), is a \(k \times n\) matrix, whose columns are the rows of \(\mathbf{A}\), vice versa.
If \(\mathbf{A}\) is a \(n \times m\) matrix and \(\mathbf{B}\) is a \(m \times p\) matrix, then \((\mathbf{A}\mathbf{B})^T\) = \(\mathbf{B}^T\mathbf{A}^T\)

Quick review of matrix algebra (cont.)

Let \(\mathbf{A}\) be a \(k \times k\) matrix
- \(\mathbf{A}\) is symmetric if \(\mathbf{A} = \mathbf{A}^T\)
- \(\mathbf{A}\) is idempotent if \(\mathbf{A}^2 = \mathbf{A}\)
- \(\mathbf{A}\) is orthonormal if \(\mathbf{A}^T\mathbf{A} = \mathbf{I}\)
Let \(\mathbf{y}\) be a \(k \times 1\) vector. The expression: \[\mathbf{y}^T\mathbf{A}\mathbf{y} = \sum_{i=1}^k\sum_{j=1}^k \mathbf{y}_i \cdot \mathbf{A[ij]} \cdot \mathbf{y}_j\] is called a quadratic form. In this case, \(\mathbf{A}\) is the matrix of the quadratic form.

Matrix derivatives

Let \(\mathbf{A}\) be a \(k \times k\) matrix, \(\mathbf{a}\) be a \(k \times 1\) vector, and \(\mathbf{y}\) be a \(k \times 1\) vector of variables.

if z = \(\mathbf{a}^T\mathbf{y}\), then \(\frac{\partial z}{\partial \mathbf{y}} = \mathbf{a}\)
if z = \(\mathbf{y}^T\mathbf{y}\), then \(\frac{\partial z}{\partial \mathbf{y}} = 2\mathbf{y}\)
if z = \(\mathbf{a}^T\mathbf{A}\mathbf{y}\), then \(\frac{\partial z}{\partial \mathbf{y}} = \mathbf{A}^T\mathbf{a}\)

Matrix derivatives (cont.)

Let \(\mathbf{A}\) be a \(k \times k\) matrix, \(\mathbf{a}\) be a \(k \times 1\) vector, and \(\mathbf{y}\) be a \(k \times 1\) vector of variables.

if z = \(\mathbf{y}^T\mathbf{A}\mathbf{y}\), then \(\frac{\partial z}{\partial \mathbf{y}} = \mathbf{Ay} + \mathbf{A}^T\mathbf{y}\)
if z = \(\mathbf{y}^T\mathbf{A}\mathbf{y}\) with \(\mathbf{A}\) symmetric, then \(\frac{\partial z}{\partial \mathbf{y}} = 2\mathbf{Ay}\)

Least Squares Estimation

\(Q = \sum_{j=1}^n [y_i - (\beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik})]^2\)
in matrix form \(Q = (\mathbf{y} - \mathbf{X}\mathbf{\beta})^T(\mathbf{y} - \mathbf{X}\mathbf{\beta})\)
minimize SEE with the score equation \(\frac{\partial Q}{\partial \mathbf{\beta}} = 0\) this yields \[\hat{\mathbf{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

Estimating \(\sigma^2\)

\(MSE = \frac{SSE}{n-k-1}\)
\(SSE = (\mathbf{y} - \mathbf{X}\mathbf{\hat{\beta}})^T(\mathbf{y} - \mathbf{X}\mathbf{\hat{\beta}})\)

Fitted values and residuals

\(\hat{\mathbf{y}} = \mathbf{X}\mathbf{\hat{\beta}}\)
\(\hat{\mathbf{e}} = \mathbf{y} - \mathbf{X}\mathbf{\hat{\beta}}\)

Hat matrix

With the hat matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) we have:

\(\hat{\mathbf{y}} = \mathbf{Hy}\)

Similarly, the residuals can be expressed as:

\(\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{y}\)

Essential linear algebra for inference

Let \(\mathbf{A}\) be a \(k \times k\) matrix, \(\mathbf{y}\) be a \(k \times 1\) random vector with mean \(\mu\) and variance-covariance matrix \(\Sigma\), then:

\(Var(\mathbf{A}\mathbf{y}) = \mathbf{A}\Sigma\mathbf{A}^T\)
\(E(\mathbf{A}\mathbf{y}) = \mathbf{A}E(\mathbf{y})\)
\(E(\hat{\mathbf{\beta}}) = \mathbf{\beta}\)
\(Var(\hat{\mathbf{\beta}}) = \sigma^2\cdot(\mathbf{X}^T\mathbf{X})^{-1}\)
\(\widehat{Var(\hat{\mathbf{\beta}})} = MSE\cdot(\mathbf{X}^T\mathbf{X})^{-1}\)

Confidence intervals of \(\beta_j\)

The distribution of \(\hat{\beta}_j\) are: \[\frac{\hat{\beta}_j - \beta_j}{\sqrt{\widehat{Var(\hat{\beta}_j})}} \sim t_{n-k-1}\]
Thus the \(100\cdot(1-\alpha)\%\) confidence interval for \(\beta_j\) is: \[\hat{\beta}_j \pm t_{n-k-1, 1-\alpha/2}\cdot\sqrt{\widehat{Var(\hat{\beta}_j)}}\]

General F-test

For comparing two nested models \(M_{full}\) and \(M_{reduced}\) we can use the F-test:

\[F = \frac{(SSE_{reduced} - SSE_{full})/(df_{E}(full) - df_{E}(reduced))}{SSE_{full}/df_{full}}\]

\[F = \frac{(SSR_{full} - SSR_{reduced})/(df_{R}(full) - df_{R}(reduced))}{SSE_{full}/df_{full}}\]

The two formulas are equivalent.