Math Stats: Simple Linear Regression

Luke

Simple Linear Regression definitions

  • Deterministic regression equation: \(y = \beta_0 + \beta_1x\) (no error term)
  • Stochastic regression equation: \(y = \beta_0 + \beta_1x + \epsilon\)
  • Simple regression line: \(E(y|x) = \beta_0 + \beta_1x\)
  • Estimated regression line: \(\hat{y} = \hat{\beta_0} + \hat{\beta_1}x\)
  • Estimated regression model: \(y_i = \hat{\beta_0} + \hat{\beta_1}x_i + e_i\)

Assumptions of SLR

  • Expectation of errors: \(E(\epsilon) = 0\)
  • Variance of errors: \(Var(\epsilon) = \sigma^2\)
  • Uncorrelated errors: \(Cov(\epsilon_i, \epsilon_j) = 0\) for \(i \neq j\)
  • [Stricter] Normality of errors: \(\epsilon \overset{i.i.d.}{\sim} N(0, \sigma^2)\)

SLR with least squares

  • Fitted values: \(\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i\)
  • Residuals: \(e_i = y_i - \hat{y_i}\)
  • SSE/RSS: \(\sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y_i})^2\)
    • can be rewritten as \(SSE = S_{yy} - \frac{S_{xy}^2}{S_{xx}}\)

SLR equations

  • \(\hat{\beta_1} = \frac{S_{xy}}{S_{xx}}\)
  • \(\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\)
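
These estimates come from minimizing the SSE in the two coefficients; a compact sketch of the standard derivation, which also yields the SSE identity above:

\[
\frac{\partial}{\partial b_0} \sum_{i=1}^n (y_i - b_0 - b_1x_i)^2 = 0 \quad\Rightarrow\quad \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}
\]
\[
\frac{\partial}{\partial b_1} \sum_{i=1}^n (y_i - b_0 - b_1x_i)^2 = 0 \quad\Rightarrow\quad \hat{\beta_1} = \frac{S_{xy}}{S_{xx}}
\]

Substituting back gives \(SSE = S_{yy} - 2\hat{\beta_1}S_{xy} + \hat{\beta_1}^2 S_{xx} = S_{yy} - \frac{S_{xy}^2}{S_{xx}}\).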

Properties of SLR

  • \(E(y_i) = \beta_0 + \beta_1x_i\)
  • \(Var(y_i) = Var(\epsilon_i) = \sigma^2\)
  • \(Cov(y_i, y_j) = 0\) for \(i \neq j\)
  • Under the normality assumption, \(y_i \sim N(\beta_0 + \beta_1x_i, \sigma^2)\)

Properties of regression line

  • \(\sum_{i=1}^n e_i = 0\)
  • \(\sum_{i=1}^n e_i^2\) is a minimum
  • \(\sum_{i=1}^n y_i = \sum_{i=1}^n \hat{y_i}\)
  • \(\sum_{i=1}^n x_ie_i = 0\)
  • \(\sum_{i=1}^n \hat{y_i}e_i = 0\)
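
A quick numerical check of these identities on simulated data (a minimal sketch; the data and seed are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, size=x.size)  # hypothetical data

# Least-squares fit
Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x
e = y - yhat

print(np.isclose(e.sum(), 0))            # residuals sum to zero
print(np.isclose(y.sum(), yhat.sum()))   # fitted values preserve the total
print(np.isclose((x * e).sum(), 0))      # residuals orthogonal to x
print(np.isclose((yhat * e).sum(), 0))   # residuals orthogonal to fitted values
```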

How to estimate \(\sigma^2\)?

  • \(s^2 = \hat{\sigma}^2 = \frac{SSE}{n-2} = \frac{\sum_{i=1}^n e_i^2}{n-2}\)
  • The divisor is \(n-2\) because two parameters, \(\hat{\beta_0}\) and \(\hat{\beta_1}\), are estimated from the data to obtain each \(\hat{y_i}\)
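
A minimal sketch of the two equivalent ways to compute \(s^2\) (directly from residuals, and via the \(SSE = S_{yy} - S_{xy}^2/S_{xx}\) identity), on invented data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 25)
y = 1.0 + 0.5 * x + rng.normal(0, 2.0, size=x.size)  # hypothetical data

n = x.size
Sxx = np.sum(x**2) - n * x.mean() ** 2
Syy = np.sum(y**2) - n * y.mean() ** 2
Sxy = np.sum(x * y) - n * x.mean() * y.mean()

b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)

s2_direct = np.sum(e**2) / (n - 2)            # SSE / (n - 2)
s2_identity = (Syy - Sxy**2 / Sxx) / (n - 2)  # via the SSE identity
print(s2_direct, s2_identity)                 # agree up to floating point
```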

The six number summary

  • \(S_{xx} = \sum_{i=1}^n x_i^2 - n\bar{x}^2\)
  • \(S_{xy} = \sum_{i=1}^n x_iy_i - n\bar{x}\bar{y}\)
  • \(S_{yy} = \sum_{i=1}^n y_i^2 - n\bar{y}^2\)
  • Therefore, knowing \(n\), \(\bar{x}\), \(\bar{y}\), \(\sum_{i=1}^n x_i^2\), \(\sum_{i=1}^n y_i^2\), \(\sum_{i=1}^n x_iy_i\) is enough to calculate \(\hat{\beta_0}\) and \(\hat{\beta_1}\)
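
To illustrate, a sketch that recovers \(\hat{\beta_0}\) and \(\hat{\beta_1}\) from the six summary numbers alone, with no raw data (the numbers are invented for the example):

```python
# Hypothetical six-number summary
n = 20
xbar, ybar = 4.0, 11.5
sum_x2, sum_y2, sum_xy = 390.0, 2900.0, 1010.0

Sxx = sum_x2 - n * xbar**2       # 390 - 320 = 70
Sxy = sum_xy - n * xbar * ybar   # 1010 - 920 = 90
b1 = Sxy / Sxx                   # ~1.286
b0 = ybar - b1 * xbar            # ~6.357
print(b0, b1)
```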

Confidence intervals

  • Confidence intervals describe the uncertainty in the parameter estimates
  • The confidence interval is a range of values that is likely to contain the true value of the parameter
  • CI = statistic \(\pm\) (multiplier \(\times\) se(statistic))
  • The multiplier is determined by the confidence level and the distribution of the statistic
  • The standard error of the statistic is a measure of its variability: \(SE(estimator) = \sqrt{Var(estimator)}\)

Confidence intervals for \(\beta_0\) and \(\beta_1\)

  • \(\hat{\beta_0}\): the estimated mean response when \(x = 0\)
  • \(\hat{\beta_1}\): the estimated change in \(\hat{y}\) for a one-unit increase in \(x\)
  • We need to know \(E(\hat{\beta_k})\) and \(Var(\hat{\beta_k})\) to construct CIs where \(k = 0, 1\)
  • To derive the distribution, we rely on \(\epsilon \overset{i.i.d.}{\sim} N(0, \sigma^2)\)
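
A compact sketch of the standard route: write \(\hat{\beta_1}\) as a linear combination of the \(y_i\), then take expectations and variances under the assumptions above.

\[
\hat{\beta_1} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{S_{xx}} = \sum_{i=1}^n c_iy_i, \qquad c_i = \frac{x_i - \bar{x}}{S_{xx}}
\]

Since \(\sum_i c_i = 0\) and \(\sum_i c_ix_i = 1\),

\[
E(\hat{\beta_1}) = \sum_i c_i(\beta_0 + \beta_1x_i) = \beta_1, \qquad Var(\hat{\beta_1}) = \sigma^2\sum_i c_i^2 = \frac{\sigma^2}{S_{xx}}
\]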

Distribution and CI of \(\hat{\beta_1}\), given \(\sigma^2\)

  • \[\frac{\hat{\beta_1} - \beta_1}{\sqrt{Var(\hat{\beta_1})}} = \frac{\hat{\beta_1} - \beta_1}{\sqrt{\frac{\sigma^2}{S_{xx}}}} \sim N(0,1)\]
  • Therefore the 100(1-\(\alpha\))% CI for \(\beta_1\) is \[\hat{\beta_1} \pm z_{1-\alpha/2}\frac{\sigma}{\sqrt{S_{xx}}} \]

Distribution and CI of \(\hat{\beta_1}\) when \(\sigma^2\) is unknown

  • When \(\sigma^2\) is unknown and estimated by \(s^2\), the standardized estimate \(\frac{\hat{\beta_1} - \beta_1}{s/\sqrt{S_{xx}}}\) follows a t-distribution with \(n-2\) degrees of freedom
  • The 100(1-\(\alpha\))% CI for \(\beta_1\) is thus \[\hat{\beta_1} \pm t_{1-\alpha/2;\, n-2}\frac{s}{\sqrt{S_{xx}}} \]
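
A minimal sketch of this interval in Python, using scipy's t quantile (the data, seed, and confidence level are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, size=x.size)  # hypothetical data

n = x.size
Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)
s = np.sqrt(np.sum(e**2) / (n - 2))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
half_width = t_crit * s / np.sqrt(Sxx)
print(b1 - half_width, b1 + half_width)  # 95% CI for beta_1
```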

Distribution and CI of \(\hat{\beta_0}\), given \(\sigma^2\)

  • \[\hat{\beta_0} \sim N(\beta_0, \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right])\]
  • The 100(1-\(\alpha\))% CI for \(\beta_0\) is \[\hat{\beta_0} \pm z_{1-\alpha/2}\times \sigma\sqrt{\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]}\]

Distribution and CI of \(\hat{\beta_0}\) when \(\sigma^2\) is unknown

  • When \(\sigma^2\) is unknown and estimated by \(s^2\), the standardized estimate \(\frac{\hat{\beta_0} - \beta_0}{s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}\) follows a t-distribution with \(n-2\) degrees of freedom
  • The 100(1-\(\alpha\))% CI for \(\beta_0\) is thus \[\hat{\beta_0} \pm t_{1-\alpha/2;\, n-2}\times s\sqrt{\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]}\]
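
The same pattern gives the \(\beta_0\) interval; as a sanity check, statsmodels (if it is available) reports both intervals through conf_int. A sketch on invented data:

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 30)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, size=x.size)  # hypothetical data

n = x.size
Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

t_crit = stats.t.ppf(0.975, df=n - 2)
hw0 = t_crit * s * np.sqrt(1 / n + x.mean() ** 2 / Sxx)  # beta_0 half-width
print(b0 - hw0, b0 + hw0)

# Cross-check against statsmodels OLS
res = sm.OLS(y, sm.add_constant(x)).fit()
print(res.conf_int(alpha=0.05))  # rows: intercept, slope
```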

Hypothesis testing

  • State \(H_0\) and \(H_1\)
  • Choose a test statistic
    • This describes the degree to which the sample deviates from the null hypothesis
  • Determine the distribution of the test statistic under \(H_0\)
  • Calculate the critical region or p-value

Example: Two-sided test for \(\beta_1\)

  • \(H_0: \beta_1 = 0\)
  • \(H_1: \beta_1 \neq 0\)
  • Test statistic: \(t = \frac{\hat{\beta_1} - \beta_{1,0}}{\sqrt{\widehat{Var}(\hat{\beta_1})}} = \frac{\hat{\beta_1} - 0}{s/\sqrt{S_{xx}}}\), where \(\beta_{1,0}\) is the value of \(\beta_1\) under \(H_0\)
  • The distribution of the test statistic is \(t_{n-2}\)
  • Critical region: \(|t| > t_{1-\alpha/2;\, n-2}\); reject \(H_0\) if \(t\) falls in the critical region
  • P-value: \(2P(t_{n-2} > |t|)\); reject \(H_0\) if the p-value is less than \(\alpha\)
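
A sketch of this two-sided test on simulated data (everything except the form of the test is invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.4 * x + rng.normal(0, 2.0, size=x.size)  # hypothetical data

n = x.size
Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

t_stat = (b1 - 0) / (s / np.sqrt(Sxx))           # H0: beta_1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
alpha = 0.05
print(t_stat, p_value, p_value < alpha)          # reject H0 if p < alpha
```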

Prediction

  • When using SLR to predict, there are two sources of uncertainty
    • Uncertainty in the estimated regression line (\(\hat{\beta_0}\) and \(\hat{\beta_1}\))
    • Uncertainty from the error term \(\epsilon\) itself (the scatter of individual observations around the line)

Mean response prediction

  • Predicting where the mean of “many” points will be at a given \(x_0\)
  • \(\hat{\mu}_{y|x_0} = \hat{\beta_0} + \hat{\beta_1}x_0\) estimates \(E(y|x_0)\)
  • When \(\sigma^2\) is known: \[\hat{\mu}_{y|x_0} \sim N\left(E(y|x_0), \sigma^2\left[\frac{1}{n}+\frac{(x_0 - \bar{x})^2}{S_{xx}}\right]\right)\]
  • When \(\sigma^2\) is unknown and estimated by \(s^2\), the standardized estimate follows a t-distribution: \[\frac{\hat{\mu}_{y|x_0} - E(y|x_0)}{s\sqrt{\frac{1}{n}+\frac{(x_0 - \bar{x})^2}{S_{xx}}}} \sim t_{n-2}\]

Confidence interval for mean response

  • The range in which the mean of “many” points will lie at a given \(x_0\)
  • Given \(x_0\) and unknown variance, the 100(1-\(\alpha\))% CI for \(E(y|x_0)\) is
  • \[\hat{\mu}_{y|x_0} \pm t_{1-\alpha/2; n-2}\times s\sqrt{\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}\]
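
A sketch computing this interval at a chosen \(x_0\) (the data and \(x_0\) are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 30)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, size=x.size)  # hypothetical data

n = x.size
Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

x0 = 7.0                          # hypothetical new x value
mu_hat = b0 + b1 * x0             # estimated mean response at x0
se_mean = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)
t_crit = stats.t.ppf(0.975, df=n - 2)
print(mu_hat - t_crit * se_mean, mu_hat + t_crit * se_mean)
```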

Prediction interval

  • The range in which a single point will lie at a given \(x_0\)
  • Given \(x_0\) and unknown variance, the 100(1-\(\alpha\))% PI for a new observation \(y_0\) is
  • \[\hat{y}_0 \pm t_{1-\alpha/2; n-2}\times s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}\]
  • Note that the PI is wider than the CI: the extra 1 under the square root accounts for the variance of a single new observation
  • Both intervals widen as \(x_0\) moves away from \(\bar{x}\), through the \(\frac{(x_0 - \bar{x})^2}{S_{xx}}\) term; a comparison is sketched below
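
A sketch contrasting the PI with the mean-response CI at several \(x_0\) values, showing both the extra width from the leading 1 and the widening away from \(\bar{x}\) (data invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = np.linspace(0, 10, 30)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, size=x.size)  # hypothetical data

n = x.size
Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
t_crit = stats.t.ppf(0.975, df=n - 2)

for x0 in [x.mean(), x.mean() + 3, x.mean() + 6]:
    lever = 1 / n + (x0 - x.mean()) ** 2 / Sxx
    ci_hw = t_crit * s * np.sqrt(lever)      # mean-response CI half-width
    pi_hw = t_crit * s * np.sqrt(1 + lever)  # prediction interval half-width
    print(f"x0={x0:.1f}  CI half-width={ci_hw:.3f}  PI half-width={pi_hw:.3f}")
```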