Sum of Squares
- \(SST = \sum_{i=1}^n (y_i - \bar{y})^2\)
- Total variation in the response variable
- \(SSE = \sum_{i=1}^n (y_i - \hat{y_i})^2 = \sum_{i=1}^n e_i^2\)
- Unexplained variation in the response variable
- \(SSReg = \sum_{i=1}^n (\hat{y_i} - \bar{y})^2\)
- Variation in the response variable explained by the model
- \(SST = SSReg + SSE\)
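The decomposition can be checked numerically. The sketch below fits a least-squares line to a small made-up data set (the values in `x` and `y` are assumptions chosen only for illustration) and compares SST with SSReg + SSE.

```python
# Hypothetical toy data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates for simple linear regression.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# The three sums of squares.
sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
ssreg = sum((fi - y_bar) ** 2 for fi in y_hat)

print(f"SST = {sst:.4f}, SSReg = {ssreg:.4f}, SSE = {sse:.4f}")
print(f"SSReg + SSE = {ssreg + sse:.4f}")
```

The identity holds up to floating-point rounding only when the fitted values come from a least-squares fit that includes an intercept, as here.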
Correlation and r²
- \(Corr(X, Y) = \hat{\rho} = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}\)
- \(r^2 = \frac{SSReg}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}\)
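For simple linear regression, \(r^2\) computed from the sums of squares should agree with the square of the sample correlation. A minimal sketch, assuming a hypothetical helper `r_and_r2` and made-up data:

```python
import math


def r_and_r2(x, y):
    """Return (sample correlation, r^2 = 1 - SSE/SST) for a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)  # this is SST
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    r = s_xy / math.sqrt(s_xx * s_yy)

    # Fit the line and compute SSE from the residuals.
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return r, 1.0 - sse / s_yy


# Hypothetical toy data, chosen only for illustration.
r, r2 = r_and_r2([1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 7.8, 10.1])
print(f"r = {r:.4f}, r squared = {r ** 2:.4f}, 1 - SSE/SST = {r2:.4f}")
```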
ANOVA and Regression
In general:
- \(H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0\)
- \(H_1: \text{At least one } \beta_i \neq 0\)
For simple linear regression:
- \(H_0: \beta_1 = 0\)
- \(H_1: \beta_1 \neq 0\)
ANOVA table
| Source | Sum of Squares | Degrees of Freedom | Mean Square | F |
|---|---|---|---|---|
| Regression | \(SSReg = \sum^n_{i=1}(\hat{y}_i - \bar{y})^2\) | \(k\) | \(MSReg = \frac{SSReg}{k}\) | \(F = \frac{MSReg}{MSE}\) |
| Residual | \(SSE = \sum^n_{i=1}(y_i-\hat{y}_i)^2\) | \(n - k - 1\) | \(MSE = \frac{SSE}{n - k - 1} = s^2\) | |
| Total | \(SST = \sum^n_{i=1}(y_i - \bar{y})^2\) | \(n - 1\) | | |
Note: For simple linear regression, \(k = 1\)
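The table entries can be computed directly from the data. The sketch below does this for simple linear regression (\(k = 1\)); the function name `anova_slr` and the data are assumptions made for illustration.

```python
def anova_slr(x, y):
    """Return the ANOVA table entries for a simple linear regression (k = 1)."""
    n, k = len(x), 1
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]

    ssreg = sum((fi - y_bar) ** 2 for fi in y_hat)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    msreg = ssreg / k
    mse = sse / (n - k - 1)
    return {
        "regression": {"ss": ssreg, "df": k, "ms": msreg, "f": msreg / mse},
        "residual": {"ss": sse, "df": n - k - 1, "ms": mse},
        "total": {"ss": sst, "df": n - 1},
    }


# Hypothetical toy data, chosen only for illustration.
table = anova_slr([1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 7.8, 10.1])
for source, row in table.items():
    cells = ", ".join(f"{name} = {value:.4g}" for name, value in row.items())
    print(f"{source:>10}: {cells}")
```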
F-test
- Why does ANOVA use the F-statistic?
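- Under \(H_0\): \(\frac{\sum_{i=1}^n (y_i - \bar{y})^2}{\sigma^2} \sim \chi^2_{n-1}\)
- In other words, under \(H_0\): \(\frac{SST}{\sigma^2} \sim \chi^2_{n-1}\)
- We also know that \(\frac{SSE}{\sigma^2} = \frac{(n-k-1)MSE}{\sigma^2} \sim \chi^2_{n-k-1}\) (for SLR: \(\chi^2_{n-2}\)), and that under \(H_0\): \(\frac{SSReg}{\sigma^2} \sim \chi^2_k\)
- Since SSE and SSReg are independent, under \(H_0\) the ratio \(\frac{MSReg}{MSE}\) of the two chi-square variables, each divided by its degrees of freedom, follows an F-distribution with \(k\) and \(n-k-1\) degrees of freedom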
F-test for SLR
\(SSReg = \hat{\beta}_1^2 S_{xx}\)
\(E[\hat{\beta}_1^2] = Var(\hat{\beta}_1) + (E[\hat{\beta}_1])^2 = \frac{\sigma^2}{S_{xx}} + \beta_1^2\)
\(E(MSReg) = E(SSReg) = E(\hat{\beta}_1^2 S_{xx}) = \sigma^2 + \beta_1^2 S_{xx}\) (since \(k = 1\), \(MSReg = SSReg\))
\(E(MSE) = \sigma^2\)
When \(\beta_1 = 0\), \(E(MSReg) = E(MSE)\) -> same means, so F tends to be close to 1
When \(\beta_1 \neq 0\), \(E(MSReg) > E(MSE)\) -> F tends to be larger than 1
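The two expectations can be checked by simulation. The sketch below repeatedly generates data from a simple linear model and averages MSReg (computed as \(\hat{\beta}_1^2 S_{xx}\)) and MSE. All settings (`beta0`, `beta1`, `sigma`, the design points, the number of replications) are assumptions chosen for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Assumed simulation settings (not taken from the notes).
beta0, beta1, sigma = 1.0, 0.5, 1.0
x = [float(i) for i in range(1, 11)]
n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

reps = 10_000
sum_msreg = 0.0
sum_mse = 0.0
for _ in range(reps):
    # Draw a fresh response vector with normal errors.
    y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]
    y_bar = sum(y) / n
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sum_msreg += b1 ** 2 * s_xx  # MSReg = SSReg / 1
    sum_mse += sse / (n - 2)     # MSE = SSE / (n - 2)

avg_msreg = sum_msreg / reps
avg_mse = sum_mse / reps
theory_msreg = sigma ** 2 + beta1 ** 2 * s_xx
theory_mse = sigma ** 2
print(f"average MSReg = {avg_msreg:.3f}, theory = {theory_msreg:.3f}")
print(f"average MSE   = {avg_mse:.3f}, theory = {theory_mse:.3f}")
```

Setting `beta1 = 0.0` in the sketch should bring the two averages close to each other, which is the situation described for \(H_0\).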
F-test for SLR (cont.)
Reject \(H_0\) if \(F > F_{1-\alpha; 1, n-2}\)
Do not reject \(H_0\) if \(F \leq F_{1-\alpha; 1, n-2}\)
Note that \(F = t^2\) for simple linear regression
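The relation \(F = t^2\) can be verified on any data set. A minimal sketch with made-up data, where \(t\) is the usual statistic for \(H_0: \beta_1 = 0\), namely \(\hat{\beta}_1 / (s / \sqrt{S_{xx}})\):

```python
import math

# Hypothetical toy data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

s = math.sqrt(sse / (n - 2))                   # residual standard error
t_stat = b1 / (s / math.sqrt(s_xx))            # t-statistic for H0: beta_1 = 0
f_stat = (b1 ** 2 * s_xx) / (sse / (n - 2))    # F = MSReg / MSE

print(f"t = {t_stat:.4f}, t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
```

To complete the test, `f_stat` would be compared with the critical value \(F_{1-\alpha; 1, n-2}\) from a table or, if SciPy is available, from `scipy.stats.f.ppf(1 - alpha, 1, n - 2)`.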
Leverage points
Leverage point: a point whose \(x\) value is far from the other \(x\) values
Bad leverage point: a leverage point whose \(y\) value is an outlier
We want a rule to help identify \(x_i\) that are leverage points
- Should account for the distance of \(x_i\) from the bulk of \(x\)’s
- Should consider the extent to which \(x_i\) influences the fitted regression line
Leverage points (cont.)
- We can show that \(\hat{y}_i = \sum_{j=1}^n h_{ij} y_j\)
- with \(h_{ij} = \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{\sum_{k=1}^n (x_k - \bar{x})^2}\)
- We can also show that \(\sum_{j=1}^n h_{ij} = 1\)
- Thus, \(\hat{y}_i = h_{ii}y_i + \sum_{j \neq i} h_{ij} y_j\)
- \(h_{ii}\) is the leverage of the \(i\)th observation and shows how \(y_i\) influences \(\hat{y}_i\)
Leverage points (cont.)
- For SLR, \(Average(h_{ii}) = \frac{2}{n}\)
- We classify \(x_i\) as a leverage point if \[h_{ii} > 2 \times Average(h_{ii}) = \frac{4}{n}\]
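The rule can be applied with a few lines of code. The sketch below computes \(h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\) for each observation and flags those above \(\frac{4}{n}\). The helper name `leverages` and the data (which contain one deliberately distant \(x\) value) are assumptions for illustration.

```python
def leverages(x):
    """Return the leverage h_ii of each observation in a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]


# Hypothetical x values; the last one lies far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0]
h = leverages(x)
threshold = 4.0 / len(x)

# Indices of observations classified as leverage points.
flagged = [i for i, hi in enumerate(h) if hi > threshold]

print(f"sum of leverages = {sum(h):.4f}")  # equals 2 for SLR, so the average is 2/n
print(f"threshold 4/n = {threshold:.2f}")
print(f"flagged indices = {flagged}")      # only the point at x = 20 exceeds 4/n
```

Whether a flagged point is a bad leverage point cannot be decided from \(h_{ii}\) alone; that also requires looking at its \(y\) value.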
Data transformations
- What to do when the assumptions of linear regression are violated?
- switch to a different model
- transform the data (x, y, or both)
- If the residuals are roughly normal with constant variance but the relation between \(x\) and \(y\) is not linear, we typically transform \(x\)
- If the variance of the residuals is not constant but we have a linear relation between \(x\) and \(y\), we focus on stabilizing the variance of the residuals by transforming \(y\)
- When we have neither a linear relation nor constant variance, we may need to transform both \(x\) and \(y\), starting with \(y\)
Log transformation of y only
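- When our model is \(\log(y) = \beta_0 + \beta_1 x + \epsilon\), we need to take care with the interpretation of \(\beta_1\)
- A one-unit change in \(x\) multiplies the median of \(y\) by \(\exp(\beta_1)\), so \(100(\exp(\beta_1) - 1)\) is the percentage change in the median of \(y\) (approximately \(100\beta_1\) when \(\beta_1\) is small)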
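In a model for \(\log(y)\) with symmetric errors, changing \(x\) by \(\Delta x\) multiplies the median of \(y\) by \(\exp(\beta_1 \Delta x)\). The sketch below turns that factor into a percentage change and compares it with the rough approximation \(100\beta_1\). The function name and the example slopes are assumptions for illustration.

```python
import math


def percent_change_in_median(beta1, delta_x=1.0):
    """Percentage change in the median of y when x changes by delta_x,
    for the model log(y) = beta0 + beta1 * x + error."""
    return 100.0 * (math.exp(beta1 * delta_x) - 1.0)


# Hypothetical slope values, chosen only for illustration.
for beta1 in (0.01, 0.05, 0.5):
    exact = percent_change_in_median(beta1)
    approx = 100.0 * beta1
    print(f"beta1 = {beta1}: exact = {exact:.2f}%, approximation = {approx:.2f}%")
```

The approximation is close for small slopes and degrades as \(\beta_1\) grows, which is why the exact form is safer to report.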
Further notes on Residuals
- \(e_i\) are estimates of \(\epsilon_i\)
- \(\epsilon_i\) have constant variance \(\sigma^2\)
- BUT: \(e_i\) do not necessarily have constant variance, since \(Var(e_i) = \sigma^2(1 - h_{ii})\)
- That is why we should always perform our checks with the standardized residuals \(\frac{e_i}{s\sqrt{1-h_{ii}}}\)
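A minimal sketch of the computation for simple linear regression, using \(h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\) and \(s = \sqrt{SSE/(n-2)}\). The data are made up for illustration.

```python
import math

# Hypothetical toy data; the last x value is far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1, 6.8, 8.3, 8.9, 12.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Raw residuals, leverages, and the residual standard error.
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
h = [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]
s = math.sqrt(sum(ei ** 2 for ei in e) / (n - 2))

# Standardized residuals: each raw residual divided by its estimated standard deviation.
std_res = [ei / (s * math.sqrt(1.0 - hi)) for ei, hi in zip(e, h)]

for xi, ei, hi, ri in zip(x, e, h, std_res):
    print(f"x = {xi:5.1f}  e = {ei:7.3f}  h = {hi:.3f}  standardized = {ri:7.3f}")
```

Observations with high leverage have their raw residuals scaled up the most, because the fitted line is pulled toward them.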
Supplement: PDF of Normal Distribution
\[f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)\]
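The density can be evaluated directly from the formula. A minimal sketch, with `normal_pdf` as a hypothetical helper name and \(\mu = 0\), \(\sigma = 1\) as default arguments:

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


print(f"{normal_pdf(0.0):.6f}")            # standard normal density at its mean
print(f"{normal_pdf(12.0, 10.0, 2.0):.6f}")  # one standard deviation above the mean
```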