Introduction to Multiple Linear Regression
- \(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik} + \epsilon_i\)
- Assumptions:
- \(\epsilon_i \sim N(0, \sigma^2)\), \(Cov(\epsilon_i, \epsilon_j) = 0\)
- Strict: \(\epsilon_i \overset{i.i.d.}{\sim} N(0, \sigma^2)\)
Matrix form
- In matrix notation we write:
- \[\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1k} \\ 1 & x_{21} & x_{22} & \ldots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{nk} \end{bmatrix} \cdot \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}\]
- Or in short (with dimension): \(\mathbf{y}_{n \times 1} = \mathbf{X}_{n \times (k+1)}\mathbf{\beta}_{(k+1) \times 1} + \mathbf{\epsilon}_{n \times 1}\)
- alternatively, just: \(\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}\)
Some properties
- \(E(\mathbf{\epsilon}) = \mathbf{0}\)
- \(Var(\mathbf{\epsilon}) = \sigma^2\mathbf{I}\)
- Under standard and strict assumptions the random vector \(\mathbf{y}\) has expectation \(E(\mathbf{y}) = \mathbf{X}\mathbf{\beta}\) and variance \(Var(\mathbf{y}) = \sigma^2\mathbf{I}\)
Quick review of matrix algebra
- Let \(\mathbf{A}\) be a \(k \times k\) matrix. The inverse of \(\mathbf{A}\), denoted by \(\mathbf{A}^{-1}\), is another \(k \times k\) matrix such that \(\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}\)
- Let \(\mathbf{A}\) be a \(n \times k\) matrix. The transpose of \(\mathbf{A}\), denoted by \(\mathbf{A}^T\), is a \(k \times n\) matrix, whose columns are the rows of \(\mathbf{A}\), vice versa.
- If \(\mathbf{A}\) is a \(n \times m\) matrix and \(\mathbf{B}\) is a \(m \times p\) matrix, then \((\mathbf{A}\mathbf{B})^T\) = \(\mathbf{B}^T\mathbf{A}^T\)
Quick review of matrix algebra (cont.)
- Let \(\mathbf{A}\) be a \(k \times k\) matrix
- \(\mathbf{A}\) is symmetric if \(\mathbf{A} = \mathbf{A}^T\)
- \(\mathbf{A}\) is idempotent if \(\mathbf{A}^2 = \mathbf{A}\)
- \(\mathbf{A}\) is orthonormal if \(\mathbf{A}^T\mathbf{A} = \mathbf{I}\)
- \(\mathbf{A}\) is symmetric if \(\mathbf{A} = \mathbf{A}^T\)
- Let \(\mathbf{y}\) be a \(k \times 1\) vector. The expression: \[\mathbf{y}^T\mathbf{A}\mathbf{y} = \sum_{i=1}^k\sum_{j=1}^k \mathbf{y}_i \cdot \mathbf{A[ij]} \cdot \mathbf{y}_j\] is called a quadratic form. In this case, \(\mathbf{A}\) is the matrix of the quadratic form.
Matrix derivatives
Let \(\mathbf{A}\) be a \(k \times k\) matrix, \(\mathbf{a}\) be a \(k \times 1\) vector, and \(\mathbf{y}\) be a \(k \times 1\) vector of variables.
- if z = \(\mathbf{a}^T\mathbf{y}\), then \(\frac{\partial z}{\partial \mathbf{y}} = \mathbf{a}\)
- if z = \(\mathbf{y}^T\mathbf{y}\), then \(\frac{\partial z}{\partial \mathbf{y}} = 2\mathbf{y}\)
- if z = \(\mathbf{a}^T\mathbf{A}\mathbf{y}\), then \(\frac{\partial z}{\partial \mathbf{y}} = \mathbf{A}^T\mathbf{a}\)
Matrix derivatives (cont.)
Let \(\mathbf{A}\) be a \(k \times k\) matrix, \(\mathbf{a}\) be a \(k \times 1\) vector, and \(\mathbf{y}\) be a \(k \times 1\) vector of variables.
- if z = \(\mathbf{y}^T\mathbf{A}\mathbf{y}\), then \(\frac{\partial z}{\partial \mathbf{y}} = \mathbf{Ay} + \mathbf{A}^T\mathbf{y}\)
- if z = \(\mathbf{y}^T\mathbf{A}\mathbf{y}\) with \(\mathbf{A}\) symmetric, then \(\frac{\partial z}{\partial \mathbf{y}} = 2\mathbf{Ay}\)
Least Squares Estimation
- \(Q = \sum_{j=1}^n [y_i - (\beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik})]^2\)
- in matrix form \(Q = (\mathbf{y} - \mathbf{X}\mathbf{\beta})^T(\mathbf{y} - \mathbf{X}\mathbf{\beta})\)
- minimize SEE with the score equation \(\frac{\partial Q}{\partial \mathbf{\beta}} = 0\) this yields \[\hat{\mathbf{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
Estimating \(\sigma^2\)
- \(MSE = \frac{SSE}{n-k-1}\)
- \(SSE = (\mathbf{y} - \mathbf{X}\mathbf{\hat{\beta}})^T(\mathbf{y} - \mathbf{X}\mathbf{\hat{\beta}})\)
Fitted values and residuals
- \(\hat{\mathbf{y}} = \mathbf{X}\mathbf{\hat{\beta}}\)
- \(\hat{\mathbf{e}} = \mathbf{y} - \mathbf{X}\mathbf{\hat{\beta}}\)
Hat matrix
With the hat matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) we have:
- \(\hat{\mathbf{y}} = \mathbf{Hy}\)
Similarly, the residuals can be expressed as:
- \(\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{y}\)
Essential linear algebra for inference
Let \(\mathbf{A}\) be a \(k \times k\) matrix, \(\mathbf{y}\) be a \(k \times 1\) random vector with mean \(\mu\) and variance-covariance matrix \(\Sigma\), then:
- \(Var(\mathbf{A}\mathbf{y}) = \mathbf{A}\Sigma\mathbf{A}^T\)
- \(E(\mathbf{A}\mathbf{y}) = \mathbf{A}E(\mathbf{y})\)
- \(E(\hat{\mathbf{\beta}}) = \mathbf{\beta}\)
- \(Var(\hat{\mathbf{\beta}}) = \sigma^2\cdot(\mathbf{X}^T\mathbf{X})^{-1}\)
- \(\widehat{Var(\hat{\mathbf{\beta}})} = MSE\cdot(\mathbf{X}^T\mathbf{X})^{-1}\)
Confidence intervals of \(\beta_j\)
- The distribution of \(\hat{\beta}_j\) are: \[\frac{\hat{\beta}_j - \beta_j}{\sqrt{\widehat{Var(\hat{\beta}_j})}} \sim t_{n-k-1}\]
- Thus the \(100\cdot(1-\alpha)\%\) confidence interval for \(\beta_j\) is: \[\hat{\beta}_j \pm t_{n-k-1, 1-\alpha/2}\cdot\sqrt{\widehat{Var(\hat{\beta}_j)}}\]
General F-test
For comparing two nested models \(M_{full}\) and \(M_{reduced}\) we can use the F-test:
\[F = \frac{(SSE_{reduced} - SSE_{full})/(df_{E}(full) - df_{E}(reduced))}{SSE_{full}/df_{full}}\]
\[F = \frac{(SSR_{full} - SSR_{reduced})/(df_{R}(full) - df_{R}(reduced))}{SSE_{full}/df_{full}}\]
The two formulas are equivalent.