Supplementary notes for Simple Linear Regression

Luke

Sum of Squares

  • \(SST = \sum_{i=1}^n (y_i - \bar{y})^2\)
    • Total variation in the response variable
  • \(SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n e_i^2\)
    • Unexplained variation in the response variable
  • \(SSReg = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\)
    • Variation in the response variable explained by the model
  • \(SST = SSReg + SSE\)
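
As a quick numeric check of this decomposition, here is a minimal sketch (assuming NumPy is available; the dataset is synthetic and made up purely for illustration):

```python
# Verify SST = SSReg + SSE on a small synthetic dataset.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Fit the least-squares line y_hat = b0 + b1 * x
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)        # total variation
sse = np.sum((y - y_hat) ** 2)           # unexplained variation
ssreg = np.sum((y_hat - y.mean()) ** 2)  # explained variation

print(sst, ssreg + sse)  # the two numbers agree up to rounding
```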

Correlation and r²

  • \(Corr(X, Y) = \hat{\rho} = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}\)
  • \(r^2 = \frac{SSReg}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}\)
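
A minimal sketch (again assuming NumPy, with made-up data) confirming that \(r^2\) computed from the sums of squares matches the squared sample correlation:

```python
# r^2 from the sums of squares equals the squared correlation of x and y.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 0.8 * x + rng.normal(size=50)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
r2_from_ss = 1 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)
r2_from_corr = np.corrcoef(x, y)[0, 1] ** 2

print(r2_from_ss, r2_from_corr)  # identical up to rounding
```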

ANOVA and Regression

In general:

  • \(H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0\)
  • \(H_1: \text{At least one } \beta_i \neq 0\)

For simple linear regression:

  • \(H_0: \beta_1 = 0\)
  • \(H_1: \beta_1 \neq 0\)

ANOVA table

| Source     | Sum of Squares                                  | Degrees of Freedom | Mean Square                           | F                         |
|------------|-------------------------------------------------|--------------------|---------------------------------------|---------------------------|
| Regression | \(SSReg = \sum^n_{i=1}(\hat{y}_i - \bar{y})^2\) | \(k\)              | \(MSReg = \frac{SSReg}{k}\)           | \(F = \frac{MSReg}{MSE}\) |
| Residual   | \(SSE = \sum^n_{i=1}(y_i-\hat{y}_i)^2\)         | \(n - k - 1\)      | \(MSE = \frac{SSE}{n - k - 1} = s^2\) |                           |
| Total      | \(SST = \sum^n_{i=1}(y_i - \bar{y})^2\)         | \(n - 1\)          |                                       |                           |

Note: For simple linear regression, \(k = 1\)
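
The following minimal sketch (assuming NumPy and SciPy; the data are synthetic) fills in the ANOVA table quantities and the upper-tail p-value for \(k = 1\):

```python
# Compute the ANOVA table entries for a simple linear regression.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 40, 1
x = rng.uniform(0, 10, size=n)
y = 3.0 + 0.6 * x + rng.normal(size=n)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

ssreg = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
msreg = ssreg / k
mse = sse / (n - k - 1)
F = msreg / mse
p = stats.f.sf(F, k, n - k - 1)  # upper-tail p-value
print(F, p)
```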

F-test

  • Why does ANOVA use the F-statistic?
  • Under \(H_0\), \(\frac{\sum_{i=1}^n (y_i - \bar{y})^2}{\sigma^2} \sim \chi^2_{n-1}\)
  • In other words, under \(H_0\): \(\frac{SST}{\sigma^2} \sim \chi^2_{n-1}\)
  • We also know that \(\frac{(n-2)MSE}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-2}\)
  • Since \(SST = SSReg + SSE\), it follows (by Cochran's theorem) that under \(H_0\), \(\frac{SSReg}{\sigma^2} \sim \chi^2_k\) and is independent of \(SSE\)
  • The ratio of two independent \(\chi^2\) variables, each divided by its degrees of freedom, follows an F-distribution, so under \(H_0\), \(F = \frac{MSReg}{MSE} \sim F_{k,\, n-k-1}\)
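
A small simulation sketch (assuming NumPy and SciPy; the sample size, seed, and number of replications are arbitrary choices) illustrating that \(\frac{MSReg}{MSE}\) behaves like an \(F_{1,\,n-2}\) variable under \(H_0\):

```python
# Under H0 (beta_1 = 0), compare an empirical quantile of MSReg/MSE
# to the theoretical F(1, n-2) quantile.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 25, 5000
x = np.linspace(0, 1, n)
fs = np.empty(reps)
for i in range(reps):
    y = rng.normal(size=n)           # beta_1 = 0: y unrelated to x
    b1, b0 = np.polyfit(x, y, 1)
    y_hat = b0 + b1 * x
    ssreg = np.sum((y_hat - y.mean()) ** 2)
    mse = np.sum((y - y_hat) ** 2) / (n - 2)
    fs[i] = ssreg / mse              # MSReg = SSReg / 1

print(np.quantile(fs, 0.95), stats.f.ppf(0.95, 1, n - 2))  # close
```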

F-test for SLR

  • \(SSReg = \hat{\beta}_1^2 S_{xx}\)

  • \(E[\hat{\beta}_1^2] = Var(\hat{\beta}_1) + \left(E[\hat{\beta}_1]\right)^2 = \frac{\sigma^2}{S_{xx}} + \beta_1^2\)

  • \(E(MSReg) = E(SSReg) = E(\hat{\beta}_1^2 S_{xx}) = \sigma^2 + \beta_1^2 S_{xx}\)

  • \(E(MSE) = \sigma^2\)

  • When \(\beta_1 = 0\), \(E(MSReg) = E(MSE)\), so the two mean squares agree on average and we expect \(F \approx 1\)

  • When \(\beta_1 \neq 0\), \(E(MSReg) > E(MSE)\), so we expect \(F > 1\)
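
A simulation sketch (NumPy assumed; the parameter values are made up) checking \(E(MSReg) = \sigma^2 + \beta_1^2 S_{xx}\) numerically:

```python
# Average MSReg over many simulated datasets and compare to
# sigma^2 + beta_1^2 * S_xx.
import numpy as np

rng = np.random.default_rng(4)
n, reps, beta1, sigma = 30, 5000, 0.4, 1.5
x = np.linspace(0, 10, n)
sxx = np.sum((x - x.mean()) ** 2)

msregs = np.empty(reps)
for i in range(reps):
    y = 1.0 + beta1 * x + rng.normal(scale=sigma, size=n)
    b1, b0 = np.polyfit(x, y, 1)
    y_hat = b0 + b1 * x
    msregs[i] = np.sum((y_hat - y.mean()) ** 2)  # k = 1: MSReg = SSReg

print(msregs.mean(), sigma**2 + beta1**2 * sxx)  # close for large reps
```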

F-test for SLR (cont.)

  • Reject \(H_0\) if \(F > F_{1-\alpha; 1, n-2}\)

  • Do not reject \(H_0\) if \(F \leq F_{1-\alpha; 1, n-2}\)

  • Note that \(F = t^2\) for simple linear regression, where \(t\) is the t-statistic for testing \(H_0: \beta_1 = 0\)
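
A minimal sketch (assuming SciPy's `scipy.stats.linregress`; synthetic data) verifying \(F = t^2\):

```python
# The ANOVA F statistic equals the square of the slope's t statistic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=30)
y = 2.0 + 0.3 * x + rng.normal(size=30)

fit = stats.linregress(x, y)
t = fit.slope / fit.stderr           # t statistic for H0: beta_1 = 0

y_hat = fit.intercept + fit.slope * x
ssreg = np.sum((y_hat - y.mean()) ** 2)
mse = np.sum((y - y_hat) ** 2) / (len(x) - 2)
F = ssreg / mse
print(F, t**2)                       # agree up to rounding
```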

Leverage points

  • Leverage point: a point whose \(x\) value is far from the bulk of the \(x\) values

  • Bad leverage point: a leverage point whose \(y\) value is an outlier

  • We want a rule to help identify \(x_i\) that are leverage points

    • Should account for the distance of \(x_i\) from the bulk of \(x\)’s
    • Consider the extent to which it influences the fitted regression line

Leverage points (cont.)

  • We can show that \(\hat{y}_i = \sum_{j=1}^n h_{ij} y_j\)
  • with \(h_{ij} = \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{\sum_{k=1}^n (x_k - \bar{x})^2}\)
  • We can also show that \(\sum_{j=1}^n h_{ij} = 1\)
  • Thus, \(\hat{y}_i = h_{ii}y_i + \sum_{j \neq i} h_{ij} y_j\)
  • \(h_{ii}\) is the leverage of the \(i\)th observation; it measures how strongly \(y_i\) influences its own fitted value \(\hat{y}_i\)

Leverage points (cont.)

  • For SLR, \(Average(h_{ii}) = \frac{2}{n}\)
  • We classify \(x_i\) as a leverage point if \[h_{ii} > 2 \times Average(h_{ii}) = \frac{4}{n}\]
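
A minimal sketch (NumPy assumed; the data, including one deliberately distant \(x\) value, are made up) computing the leverages and applying the \(h_{ii} > 4/n\) rule:

```python
# Leverages for SLR and the h_ii > 4/n flagging rule.
import numpy as np

rng = np.random.default_rng(6)
x = np.append(rng.uniform(0, 10, size=19), 30.0)  # one distant x value
n = x.size

h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
print(h.mean())                 # equals 2/n for SLR
print(np.where(h > 4 / n)[0])   # indices flagged as leverage points
```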

Data transformations

  • What to do when the assumptions of linear regression are violated?
    • switch to a different model
    • transform the data (x, y, or both)
  • If the residuals are approximately normal with constant variance but the relation between \(x\) and \(y\) is nonlinear, we typically transform \(x\)
  • If the variance of the residuals is not constant but we have a linear relation between \(x\) and \(y\), we focus on stabilizing the variance of the residuals by transforming \(y\)
  • When we have neither a linear relation nor constant variance, we may need to transform both \(x\) and \(y\), starting with \(y\)

Log transformation of y only

  • When our model is \(\log(y) = \beta_0 + \beta_1 x + \epsilon\), we need to take care with the interpretation of \(\beta_1\)
  • A one-unit increase in \(x\) multiplies the median of \(y\) by \(e^{\beta_1}\); for small \(\beta_1\), this is approximately a \(100 \beta_1 \%\) change in the median of \(y\)
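
A minimal sketch (NumPy assumed; synthetic data) illustrating this interpretation: the ratio of fitted medians at \(x + 1\) and \(x\) equals \(e^{\hat{\beta}_1}\):

```python
# Fit log(y) = b0 + b1*x and check the multiplicative interpretation.
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 5, size=200)
y = np.exp(0.5 + 0.2 * x + rng.normal(scale=0.3, size=200))

b1, b0 = np.polyfit(x, np.log(y), 1)
median_at = lambda x0: np.exp(b0 + b1 * x0)  # fitted median of y at x0
print(median_at(3.0) / median_at(2.0), np.exp(b1))  # same ratio
```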

Further notes on Residuals

  • \(e_i\) are estimates of \(\epsilon_i\)
  • \(\epsilon_i\) have constant variance \(\sigma^2\)
  • BUT: \(e_i\) do not necessarily have constant variance
  • That is why we should always perform our checks with the standardized residuals \(\frac{e_i}{s\sqrt{1-h_{ii}}}\)
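
A minimal sketch (NumPy assumed; synthetic data) computing the standardized residuals:

```python
# Standardized residuals e_i / (s * sqrt(1 - h_ii)) for SLR.
import numpy as np

rng = np.random.default_rng(8)
n = 30
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)                      # raw residuals
s = np.sqrt(np.sum(e**2) / (n - 2))        # s = sqrt(MSE)
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
r = e / (s * np.sqrt(1 - h))               # standardized residuals
print(r[:5])
```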

Supplement: PDF of Normal Distribution

\[f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)\]
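
A one-off sketch (assuming NumPy and SciPy) checking this formula against `scipy.stats.norm.pdf`:

```python
# Evaluate the normal pdf by hand and compare to SciPy.
import numpy as np
from scipy import stats

mu, sigma, x = 1.0, 2.0, 0.5
by_hand = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
print(by_hand, stats.norm.pdf(x, loc=mu, scale=sigma))  # agree
```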