Math Stats: Simple Linear Regression
Simple Linear Regression definitions
- Deterministic regression equation: \(E(y|x) = \beta_0 + \beta_1x\)
- Stochastic regression equation: \(y = \beta_0 + \beta_1x + \epsilon\)
- Simple regression line: \(E(y|x) = \beta_0 + \beta_1x\)
- Estimated regression line: \(\hat{y} = \hat{\beta_0} + \hat{\beta_1}x\)
- Estimated regression model: \(\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i + e_i\)
Assumptions of SLR
- Expectation of errors: \(E(\epsilon) = 0\)
- Variance of errors: \(Var(\epsilon) = \sigma^2\)
- Independence of errors: \(Cov(\epsilon_i, \epsilon_j) = 0\)
- [Stricter] Normality of errors: \(\epsilon \overset{i.i.d.}{\sim} N(0, \sigma^2)\)
SLR with least squares
- Fitted values: \(\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i\)
- Residuals: \(e_i = y_i - \hat{y_i}\)
- SSE/RSS: \(\sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y_i})^2\)
- can be rewritten as \(SSE = S_{yy} - \frac{S_{xy}^2}{Sxx}\)
SLR equations
- \(\hat{\beta_1} = \frac{S_{xy}}{S_{xx}}\)
- \(\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\)
Properties of SLR
- \(E(y_i) = \beta_0 + \beta_1x_i\)
- \(Var(y_i) = Var(\epsilon_i) = \sigma^2\)
- \(Cov(y_i, y_j) = 0\) for \(i \neq j\)
- \(y_i \sim N(\beta_0 + \beta_1x_i, \sigma^2)\)
- \(Cov(y_i, y_j) = 0\) for \(i \neq j\)
Properties of regression line
- \(\sum_{i=1}^n e_i = 0\)
- \(\sum_{i=1}^n e_i^2\) is a minimum
- \(\sum_{i=1}^n y_i = \sum_{i=1}^n \hat{y_i}\)
- \(\sum_{i=1}^n x_ie_i = 0\)
- \(\sum_{i=1}^n \hat{y_i}e_i = 0\)
How to estimate \(\sigma^2\)?
- \(s^2 = \hat{\sigma}^2 = \frac{SSE}{n-2} = \frac{\sum_{i=1}^n e_i^2}{n-2}\)
- This is because to get \(\hat{y_i}\), we use 2 parameters \(\hat{\beta_0}\) and \(\hat{\beta_1}\)
The six number summary
- \(S_{xx} = \sum_{i=1}^n x_i^2 - n\bar{x}^2\)
- \(S_{xy} = \sum_{i=1}^n x_iy_i - n\bar{x}\bar{y}\)
- \(S_{yy} = \sum_{i=1}^n y_i^2 - n\bar{y}^2\)
- Therefore, knowing \(n\), \(\bar{x}\), \(\bar{y}\), \(\sum_{i=1}^n x_i^2\), \(\sum_{i=1}^n y_i^2\), \(\sum_{i=1}^n x_iy_i\) is enough to calculate \(\hat{\beta_0}\) and \(\hat{\beta_1}\)
Confidence intervals
- Confidence intervals are used to describe the variance in the parameter estimates
- The confidence interval is a range of values that is likely to contain the true value of the parameter
- CI = statistic \(\pm\) (multiplier \(\times\) se(statistic))
- The multiplier is determined by the confidence level and the distribution of the statistic
- The standard error of the statistic is a measure of the variability of the statistic \(SE(estimator) = \sqrt{Var(estimator)}\)
Confidence intervals for \(\beta_0\) and \(\beta_1\)
- \(\hat{\beta_0}\) = estimate when \(x = 0\)
- \(\hat{\beta_1}\) = change in \(\hat{y}\) for a 1 unit change in \(x\)
- We need to know \(E(\hat{\beta_k})\) and \(Var(\hat{\beta_k})\) to construct CIs where \(k = 0, 1\)
- To derive the distribution, we rely on \(\epsilon \overset{i.i.d.}{\sim} N(0, \sigma^2)\)
Distribution and CI of \(\hat{\beta_1}\), given \(\sigma^2\)
- \[\frac{\hat{\beta_1} - \beta_1}{\sqrt{Var(\hat{\beta_1})}} = \frac{\hat{\beta_1} - \beta_1}{\sqrt{\frac{\sigma^2}{S_{xx}}}} \sim N(0,1)\]
- Therefore the 100(1-\(\alpha\))% CI for \(\beta_1\) is \[\hat{\beta_1} \pm z_{1-\alpha/2}\frac{\sigma}{\sqrt{\sum_i^N (x_i - \bar{x})^2}} \]
Distribution and CI of \(\hat{\beta_1}\) when \(\sigma^2\) is unknown
- It can be shown that the when \(\sigma^2\) is unknown (and needs to be estimated with \(s^2\)), the distribution of \(\hat{\beta_1}\) is a t-distribution with \(n-2\) degrees of freedom
- The 100(1-\(\alpha\))% CI for \(\beta_1\) thus is \[\hat{\beta_1} \pm t_{1-\alpha/2; n-2}\frac{s}{\sqrt{\sum_i^N (x_i - \bar{x})^2}} \]
Distribution and CI of \(\hat{\beta_0}\), given \(\sigma^2\)
- \[\hat{\beta_0} \sim N(\beta_0, \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right])\]
- The 100(1-\(\alpha\))% CI for \(\beta_0\) is \[\hat{\beta_0} \pm z_{1-\alpha/2}\times \sigma\sqrt{\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]}\]
Distribution and CI of \(\hat{\beta_0}\) when \(\sigma^2\) is unknown
- It can be shown that the when \(\sigma^2\) is unknown (and needs to be estimated with \(s^2\)), the distribution of \(\hat{\beta_0}\) is a t-distribution with \(n-2\) degrees of freedom
- The 100(1-\(\alpha\))% CI for \(\beta_0\) thus is \[\hat{\beta_0} \pm t_{1-\alpha/2; n-2}\times s\sqrt{\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]}\]
Hypothesis testing
- State \(H_0\) and \(H_1\)
- Choose a test statistic
- This describes the degree to which the sample deviates from the null hypothesis
- Determine the distribution of the test statistic under \(H_0\)
- Calculate the critical region or p-value
Example: Two-sided test for \(\beta_1\)
- \(H_0: \beta_1 = 0\)
- \(H_1: \beta_1 \neq 0\)
- Test statistic: \(t = \frac{\hat{\beta_1} - \hat{\beta_{1}}_{(under H_0)}}{\sqrt{\widehat{Var(\hat{\beta_1})}}}\)
- The distribution of the test statistic is \(t_{n-2}\)
- Critical region: \(|t| > t_{1-\alpha/2; n-2}\) - Reject if \(t\) falls in the critical region
- P-value: \(2P(t_{n-2} > |t|)\) - Reject if p-value < \(\alpha\)
Prediction
- When using SLR to predict, there are 2 sources of uncertainty
- Uncertainty in the estimates of the regression line
- Uncertainty in the estimate of the error
Mean response prediction
- Predicting where the mean of “many” points will be at a given \(x_0\)
- when \(\sigma^2\) is known:
- \(\hat{\mu}_{y|x_0} \sim N\left(E(y|x_0), s^2[\frac{1}{n}+\frac{(x_0 - \bar{x})^2}{S_{xx}}]\right)\)
- when \(\sigma^2\) is unknown:
- \[\hat{\mu}_{y|x_0} \sim t_{n-2}\left(E(y|x_0), s^2[\frac{1}{n}+\frac{(x_0 - \bar{x})^2}{S_{xx}}]\right)\]
Confidence interval for mean response
- The range in which the mean of “many” points will lie at a given \(x_0\)
- Given \(x_0\) and unknown variance, the 100(1-\(\alpha\))% CI for \(E(y|x_0)\) is
- \[\hat{\mu}_{y|x_0} \pm t_{1-\alpha/2; n-2}\times s\sqrt{\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}\]
Prediction interval
- The range in which a single point will lie at a given \(x_0\)
- Given \(x_0\) and unknown variance, the 100(1-\(\alpha\))% PI for \(\hat{y}_0\) is
- \[\hat{y}_0 \pm t_{1-\alpha/2; n-2}\times s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}\]
- Note that the PI is wider than the CI because it accounts for the uncertainty in the error term
- The PI gets wider with the distance from \(\bar{x}\)