Supplementary notes for Simple Linear Regression

Luke

Sum of Squares

  • \(SST = \sum_{i=1}^n (y_i - \bar{y})^2\)
    • Total variation in the response variable
  • \(SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n e_i^2\)
    • Unexplained variation in the response variable
  • \(SSReg = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\)
    • Variation in the response variable explained by the model
  • \(SST = SSReg + SSE\)
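
As a quick numeric check of this decomposition, here is a minimal sketch (assuming NumPy is available; the dataset is synthetic and made up purely for illustration):

```python
# Verify SST = SSReg + SSE on a small synthetic dataset.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Fit the least-squares line y_hat = b0 + b1 * x
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)        # total variation
sse = np.sum((y - y_hat) ** 2)           # unexplained variation
ssreg = np.sum((y_hat - y.mean()) ** 2)  # explained variation

print(sst, ssreg + sse)  # the two numbers agree up to rounding
```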

Correlation and r²

  • \(Corr(X, Y) = \hat{\rho} = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}\)
  • \(r^2 = \frac{SSReg}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}\)
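
A minimal sketch (again assuming NumPy, with made-up data) confirming that \(r^2\) computed from the sums of squares matches the squared sample correlation:

```python
# r^2 from the sums of squares equals the squared correlation of x and y.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 0.8 * x + rng.normal(size=50)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
r2_from_ss = 1 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)
r2_from_corr = np.corrcoef(x, y)[0, 1] ** 2

print(r2_from_ss, r2_from_corr)  # identical up to rounding
```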

ANOVA and Regression

In general:

  • \(H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0\)
  • \(H_1: \text{At least one } \beta_i \neq 0\)

For simple linear regression:

  • \(H_0: \beta_1 = 0\)
  • \(H_1: \beta_1 \neq 0\)

ANOVA table

| Source     | Sum of Squares                                  | Degrees of Freedom | Mean Square                           | F                         |
|------------|-------------------------------------------------|--------------------|---------------------------------------|---------------------------|
| Regression | \(SSReg = \sum^n_{i=1}(\hat{y}_i - \bar{y})^2\) | \(k\)              | \(MSReg = \frac{SSReg}{k}\)           | \(F = \frac{MSReg}{MSE}\) |
| Residual   | \(SSE = \sum^n_{i=1}(y_i-\hat{y}_i)^2\)         | \(n - k - 1\)      | \(MSE = \frac{SSE}{n - k - 1} = s^2\) |                           |
| Total      | \(SST = \sum^n_{i=1}(y_i - \bar{y})^2\)         | \(n - 1\)          |                                       |                           |

Note: For simple linear regression, \(k = 1\)
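
The following minimal sketch (assuming NumPy and SciPy; the data are synthetic) fills in the ANOVA table quantities and the upper-tail p-value for \(k = 1\):

```python
# Compute the ANOVA table entries for a simple linear regression.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 40, 1
x = rng.uniform(0, 10, size=n)
y = 3.0 + 0.6 * x + rng.normal(size=n)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

ssreg = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
msreg = ssreg / k
mse = sse / (n - k - 1)
F = msreg / mse
p = stats.f.sf(F, k, n - k - 1)  # upper-tail p-value
print(F, p)
```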

F-test

  • Why does ANOVA use the F-statistic?
  • Under \(H_0\), \(\frac{\sum_{i=1}^n (y_i - \bar{y})^2}{\sigma^2} \sim \chi^2_{n-1}\)
  • In other words, under \(H_0\): \(\frac{SST}{\sigma^2} \sim \chi^2_{n-1}\)
  • We also know that \(\frac{(n-2)MSE}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-2}\)
  • Since \(SST = SSReg + SSE\), it follows (by Cochran's theorem) that under \(H_0\), \(\frac{SSReg}{\sigma^2} \sim \chi^2_k\) and is independent of \(SSE\)
  • The ratio of two independent \(\chi^2\) variables, each divided by its degrees of freedom, follows an F-distribution, so under \(H_0\), \(F = \frac{MSReg}{MSE} \sim F_{k,\, n-k-1}\)
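
A small simulation sketch (assuming NumPy and SciPy; the sample size, seed, and number of replications are arbitrary choices) illustrating that \(\frac{MSReg}{MSE}\) behaves like an \(F_{1,\,n-2}\) variable under \(H_0\):

```python
# Under H0 (beta_1 = 0), compare an empirical quantile of MSReg/MSE
# to the theoretical F(1, n-2) quantile.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 25, 5000
x = np.linspace(0, 1, n)
fs = np.empty(reps)
for i in range(reps):
    y = rng.normal(size=n)           # beta_1 = 0: y unrelated to x
    b1, b0 = np.polyfit(x, y, 1)
    y_hat = b0 + b1 * x
    ssreg = np.sum((y_hat - y.mean()) ** 2)
    mse = np.sum((y - y_hat) ** 2) / (n - 2)
    fs[i] = ssreg / mse              # MSReg = SSReg / 1

print(np.quantile(fs, 0.95), stats.f.ppf(0.95, 1, n - 2))  # close
```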

F-test for SLR

  • \(SSReg = \hat{\beta}_1^2 S_{xx}\)

  • \(E[\hat{\beta}_1^2] = Var(\hat{\beta}_1) + \left(E[\hat{\beta}_1]\right)^2 = \frac{\sigma^2}{S_{xx}} + \beta_1^2\)

  • \(E(MSReg) = E(SSReg) = E(\hat{\beta}_1^2 S_{xx}) = \sigma^2 + \beta_1^2 S_{xx}\)

  • \(E(MSE) = \sigma^2\)

  • When \(\beta_1 = 0\), \(E(MSReg) = E(MSE)\), so the two mean squares agree on average and we expect \(F \approx 1\)

  • When \(\beta_1 \neq 0\), \(E(MSReg) > E(MSE)\), so we expect \(F > 1\)
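
A simulation sketch (NumPy assumed; the parameter values are made up) checking \(E(MSReg) = \sigma^2 + \beta_1^2 S_{xx}\) numerically:

```python
# Average MSReg over many simulated datasets and compare to
# sigma^2 + beta_1^2 * S_xx.
import numpy as np

rng = np.random.default_rng(4)
n, reps, beta1, sigma = 30, 5000, 0.4, 1.5
x = np.linspace(0, 10, n)
sxx = np.sum((x - x.mean()) ** 2)

msregs = np.empty(reps)
for i in range(reps):
    y = 1.0 + beta1 * x + rng.normal(scale=sigma, size=n)
    b1, b0 = np.polyfit(x, y, 1)
    y_hat = b0 + b1 * x
    msregs[i] = np.sum((y_hat - y.mean()) ** 2)  # k = 1: MSReg = SSReg

print(msregs.mean(), sigma**2 + beta1**2 * sxx)  # close for large reps
```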

F-test for SLR (cont.)

  • Reject \(H_0\) if \(F > F_{1-\alpha; 1, n-2}\)

  • Do not reject \(H_0\) if \(F \leq F_{1-\alpha; 1, n-2}\)

  • Note that \(F = t^2\) for simple linear regression, where \(t\) is the t-statistic for testing \(H_0: \beta_1 = 0\)
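
A minimal sketch (assuming SciPy's `scipy.stats.linregress`; synthetic data) verifying \(F = t^2\):

```python
# The ANOVA F statistic equals the square of the slope's t statistic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=30)
y = 2.0 + 0.3 * x + rng.normal(size=30)

fit = stats.linregress(x, y)
t = fit.slope / fit.stderr           # t statistic for H0: beta_1 = 0

y_hat = fit.intercept + fit.slope * x
ssreg = np.sum((y_hat - y.mean()) ** 2)
mse = np.sum((y - y_hat) ** 2) / (len(x) - 2)
F = ssreg / mse
print(F, t**2)                       # agree up to rounding
```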

Leverage points

  • Leverage point: a point whose \(x\) value is far from the bulk of the \(x\) values

  • Bad leverage point: a leverage point whose \(y\) value is an outlier

  • We want a rule to help identify \(x_i\) that are leverage points

    • Should account for the distance of \(x_i\) from the bulk of \(x\)’s
    • Consider the extent to which it influences the fitted regression line

Leverage points (cont.)

  • We can show that \(\hat{y}_i = \sum_{j=1}^n h_{ij} y_j\)
  • with \(h_{ij} = \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{\sum_{k=1}^n (x_k - \bar{x})^2}\)
  • We can also show that \(\sum_{j=1}^n h_{ij} = 1\)
  • Thus, \(\hat{y}_i = h_{ii}y_i + \sum_{j \neq i} h_{ij} y_j\)
  • \(h_{ii}\) is the leverage of the \(i\)th observation; it measures how strongly \(y_i\) influences its own fitted value \(\hat{y}_i\)

Leverage points (cont.)

  • For SLR, \(Average(h_{ii}) = \frac{2}{n}\)
  • We classify \(x_i\) as a leverage point if \[h_{ii} > 2 \times Average(h_{ii}) = \frac{4}{n}\]
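
A minimal sketch (NumPy assumed; the data, including one deliberately distant \(x\) value, are made up) computing the leverages and applying the \(h_{ii} > 4/n\) rule:

```python
# Leverages for SLR and the h_ii > 4/n flagging rule.
import numpy as np

rng = np.random.default_rng(6)
x = np.append(rng.uniform(0, 10, size=19), 30.0)  # one distant x value
n = x.size

h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
print(h.mean())                 # equals 2/n for SLR
print(np.where(h > 4 / n)[0])   # indices flagged as leverage points
```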

Data transformations

  • What to do when the assumptions of linear regression are violated?
    • switch to a different model
    • transform the data (x, y, or both)
  • If the residuals are approximately normal with constant variance but the relation between \(x\) and \(y\) is nonlinear, we typically transform \(x\)
  • If the variance of the residuals is not constant but we have a linear relation between \(x\) and \(y\), we focus on stabilizing the variance of the residuals by transforming \(y\)
  • When we have neither a linear relation nor constant variance, we may need to transform both \(x\) and \(y\), starting with \(y\)

Log transformation of y only

  • When our model is \(\log(y) = \beta_0 + \beta_1 x + \epsilon\), we need to take care with the interpretation of \(\beta_1\)
  • A one-unit increase in \(x\) multiplies the median of \(y\) by \(e^{\beta_1}\); for small \(\beta_1\), this is approximately a \(100 \beta_1 \%\) change in the median of \(y\)
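
A minimal sketch (NumPy assumed; synthetic data) illustrating this interpretation: the ratio of fitted medians at \(x + 1\) and \(x\) equals \(e^{\hat{\beta}_1}\):

```python
# Fit log(y) = b0 + b1*x and check the multiplicative interpretation.
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 5, size=200)
y = np.exp(0.5 + 0.2 * x + rng.normal(scale=0.3, size=200))

b1, b0 = np.polyfit(x, np.log(y), 1)
median_at = lambda x0: np.exp(b0 + b1 * x0)  # fitted median of y at x0
print(median_at(3.0) / median_at(2.0), np.exp(b1))  # same ratio
```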

Further notes on Residuals

  • \(e_i\) are estimates of \(\epsilon_i\)
  • \(\epsilon_i\) have constant variance \(\sigma^2\)
  • BUT: \(e_i\) do not necessarily have constant variance
  • That is why we should always perform our checks with the standardized residuals \(\frac{e_i}{s\sqrt{1-h_{ii}}}\)
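
A minimal sketch (NumPy assumed; synthetic data) computing the standardized residuals:

```python
# Standardized residuals e_i / (s * sqrt(1 - h_ii)) for SLR.
import numpy as np

rng = np.random.default_rng(8)
n = 30
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)                      # raw residuals
s = np.sqrt(np.sum(e**2) / (n - 2))        # s = sqrt(MSE)
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
r = e / (s * np.sqrt(1 - h))               # standardized residuals
print(r[:5])
```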

Supplement: PDF of Normal Distribution

\[f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)\]
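
A one-off sketch (assuming NumPy and SciPy) checking this formula against `scipy.stats.norm.pdf`:

```python
# Evaluate the normal pdf by hand and compare to SciPy.
import numpy as np
from scipy import stats

mu, sigma, x = 1.0, 2.0, 0.5
by_hand = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
print(by_hand, stats.norm.pdf(x, loc=mu, scale=sigma))  # agree
```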