A regression model relates a response variable y to a set of explanatory variables x. Assuming that we have access to n pairs of values (x_j, y_j), 1 \leq j \leq n, of these variables, the regression model is assumed to take the form y_j = f(x_j,\beta) + \varepsilon_j \quad ; \quad 1\leq j \leq n
where f is a structural model which depends on a p-vector of parameters \beta. We will assume that the residuals (\varepsilon_j) are independent random variables with mean 0 and variance \sigma^2:
\mathbb{E}(\varepsilon_j) = 0 \quad ; \quad \mathbb{E}(\varepsilon^2_j) = \sigma^2 \quad ; \quad \mathbb{E}(\varepsilon_j \varepsilon_k) = 0 \ (j \neq k)
Reproducing lm output
On this page, we manually reproduce the output of the degree-1 polynomial fit presented on the main course page:
data(cars)
lm1 <- lm(dist ~ 1 + speed, data = cars)
summary(lm1)
Call:
lm(formula = dist ~ 1 + speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
1 Ordinary least squares
1.1 Least squares estimator
A method for automatically choosing the “best parameters” \beta consists in minimizing the sum of squared errors of prediction, i.e. the residual sum of squares (RSS): {\rm RSS}(\beta) = \sum_{j=1}^n \left(y_j - f(x_j,\beta)\right)^2.
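As an illustration (a sketch, not part of the original code: the objective function rss and the use of optim are our own), the RSS of the cars example can be minimized numerically; the minimizer is close to the coefficients returned by lm:

```r
# Sketch: minimize the RSS numerically for the cars example
# (rss and the call to optim are illustrative, not from the original page)
y <- cars$dist
x <- cars$speed
rss <- function(beta) sum((y - beta[1] - beta[2] * x)^2)
optim(c(0, 0), rss)$par   # close to coef(lm1): -17.58, 3.93
```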
The standard error of each component of \hat\beta is then defined as the square root of the corresponding diagonal element of the variance-covariance matrix V=\mathbb{V}(\hat\beta):
{\rm se}(\hat\beta_k) = \sqrt{V_{k k}}
Reproducing lm output
se_beta <- sqrt(diag(vcov_beta))
se_beta
[1] 6.7584402 0.4155128
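The object vcov_beta used above is computed earlier on the page; as a sketch, one way to construct it by hand (assuming the unbiased variance estimate \hat\sigma^2 = {\rm RSS}/(n-p)) is:

```r
# Sketch: build the variance-covariance matrix of beta-hat by hand
X <- model.matrix(lm1)                         # design matrix (intercept, speed)
n <- nrow(X); p <- ncol(X)
sigma2_hat <- sum(residuals(lm1)^2) / (n - p)  # unbiased estimate (~ 15.38^2)
vcov_beta  <- sigma2_hat * solve(t(X) %*% X)   # matches vcov(lm1)
```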
2 Statistical inference and diagnostics
Suppose now that the residuals (\varepsilon_j) are independent and normally distributed with mean 0 and variance \sigma^2:
\varepsilon_j \sim^{\mathrm{iid}} \mathcal{N}(0 \ , \ \sigma^2).
2.1 Statistical tests for the model parameters
In this case, \hat{\beta} is also normally distributed:
\hat{\beta} \sim \mathcal{N}(\beta \ , \ \sigma^2 (X^\prime X)^{-1})
and, for k=1, 2, \ldots , p, t_k = \frac{\hat{\beta}_k - \beta_k}{{\rm se}(\hat{\beta}_k)} follows a t-distribution with n-p degrees of freedom.
For each component \beta_k of \beta, we can then perform a t-test (known as the Wald test) to test
H_{k,0} : ``\beta_k = 0" \quad \text{versus} \quad H_{k,1}: ``\beta_k \neq 0"
Indeed, under the null hypothesis H_{k,0}, t_{{\rm stat}, k} = {\hat{\beta}_k}/{{\rm se}(\hat{\beta}_k)} follows a t-distribution with n-p degrees of freedom.
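As a sketch (the objects beta_hat, se_beta and the use of pt below are ours), the t values and p-values displayed in the Coefficients table can be recomputed as follows:

```r
# Sketch: recompute the t statistics and two-sided p-values of summary(lm1)
beta_hat <- coef(lm1)
se_beta  <- sqrt(diag(vcov(lm1)))
t_stat   <- beta_hat / se_beta                           # -2.601 and 9.464
p_value  <- 2 * pt(-abs(t_stat), df = df.residual(lm1))  # 0.0123 and 1.49e-12
cbind(t_stat, p_value)
```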
Using the fact that t_k follows a t-distribution with n-p degrees of freedom, we can build a confidence interval for \beta_k of level 1-\alpha:
{\rm CI}_{1-\alpha}(\beta_k) = [\hat{\beta}_k + qt_{\alpha/2, n-p}\ {\rm se}(\hat{\beta}_k) \ , \ \hat{\beta}_k + qt_{1-\alpha/2, n-p}\ {\rm se}(\hat{\beta}_k)] where qt_{\alpha/2, n-p} and qt_{1-\alpha/2, n-p} are the quantiles of order \alpha/2 and 1-\alpha/2 of a t-distribution with n-p df.
Indeed, we can easily check that \mathbb{P}\left({\rm CI}_{1-\alpha}(\beta_k) \ni \beta_k\right) = 1-\alpha.
The function confint computes such confidence intervals for \beta (default level: 95%).
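As a sketch, the intervals returned by confint can also be reconstructed from the formula above (the objects beta_hat, se_beta and df below are ours):

```r
# Confidence intervals computed by confint ...
confint(lm1, level = 0.95)
# ... and reconstructed from the formula above
beta_hat <- coef(lm1)
se_beta  <- sqrt(diag(vcov(lm1)))
df       <- df.residual(lm1)                 # n - p = 48
cbind(beta_hat + qt(0.025, df) * se_beta,
      beta_hat + qt(0.975, df) * se_beta)
```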
The multiple R-squared R^2 is the proportion of variation in the response variable that is explained by the model. Using the decomposition
\|y - \bar{y}\|^2 = \|X\hat{\beta} - \bar{y}\|^2 + \|y - X\hat{\beta} \|^2
it can be written as R^2 = \|X\hat{\beta} - \bar{y}\|^2 / \|y - \bar{y}\|^2 = 1 - \|y - X\hat{\beta}\|^2 / \|y - \bar{y}\|^2.
By construction, adding more predictors to the model, i.e. increasing the degree of the polynomial, always increases the R-squared value. The adjusted R-squared penalizes this effect by normalizing each term by its associated degrees of freedom.
The R-squared is a purely descriptive statistic. The adjusted R-squared should preferably be used to compare the explanatory power of models built from the same dataset.
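A sketch reproducing the two values displayed by summary(lm1) (the intermediate objects rss, tss, r2 and adj_r2 are ours):

```r
# Sketch: reproduce Multiple R-squared and Adjusted R-squared
y   <- cars$dist
n   <- length(y); p <- length(coef(lm1))
rss <- sum(residuals(lm1)^2)
tss <- sum((y - mean(y))^2)
r2     <- 1 - rss / tss                           # 0.6511
adj_r2 <- 1 - (rss / (n - p)) / (tss / (n - 1))   # 0.6438
c(r2, adj_r2)
```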
The test statistic for testing H_0 against H_1 is
\begin{aligned}
F_{\rm stat} &= \frac{\|X\hat{\beta} - \bar{y}\|^2/(p-1)}{\|y - X\hat{\beta} \|^2/(n-p)}
\end{aligned}
Let us show that, under the null hypothesis, the test statistic F_{\rm stat} has an F distribution with (p-1,n-p) degrees of freedom:
By construction, \|y - X\hat{\beta} \|^2/\sigma^2 has a \chi^2 distribution with n-p df. On the other hand, under H_0, y_j=\beta_0 + \varepsilon_j with \varepsilon_j \sim^{\mathrm{iid}} \mathcal{N}(0,\sigma^2), and \|y - \bar{y}\|^2/\sigma^2 has a \chi^2 distribution with n-1 df.
Using the fact that
\|y - \bar{y}\|^2 = \|X\hat{\beta} - \bar{y}\|^2 + \|y - X\hat{\beta} \|^2
We deduce that, under H_0, \|X\hat{\beta} - \bar{y}\|^2/\sigma^2 has a \chi^2 distribution with (n-1) - (n-p) = p-1 df, which leads to the conclusion since the ratio of two independent \chi^2 variables scaled by their df satisfies (\chi^2(\nu_1)/\nu_1)/(\chi^2(\nu_2)/\nu_2) \sim F(\nu_1,\nu_2).
The p-value of the F-test is therefore \text{p-value(F-test)} = \mathbb{P}(F_{p-1,n-p} > F_{\rm stat}) = 1- \mathbb{P}(F_{p-1,n-p} \leq F_{\rm stat}).
::: {.callout-note}
#### Reproducing lm output
1-pf(F_stat, d -1, n - d)
[1] 1.489919e-12
:::
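For completeness, here is a sketch that recomputes F_{\rm stat} itself from the decomposition above (the statistic is rebuilt from lm1 rather than taken from the earlier code, and p denotes the number of coefficients):

```r
# Sketch: recompute the F statistic reported by summary(lm1)
y <- cars$dist
n <- length(y); p <- length(coef(lm1))
f_hat  <- fitted(lm1)
F_stat <- (sum((f_hat - mean(y))^2) / (p - 1)) /
          (sum((y - f_hat)^2) / (n - p))           # 89.57
1 - pf(F_stat, p - 1, n - p)                       # 1.49e-12
```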
Remark: the t-test and the F-test are equivalent for linear models with a single predictor. In the case of polynomial regression of degree d = 1, both tests can equally be used for testing whether \beta_1=0. Indeed,
\begin{aligned}
F_{\rm stat} &= \frac{\|\hat{\beta}_0 + \hat{\beta}_1 x - \bar{y}\|^2}{\|y - \hat{\beta}_0 - \hat{\beta}_1 x\|^2/(n-2)} \\[1.5ex]
&= \frac{\hat{\beta}_1^2 \|x - \bar{x}\|^2}{\hat{\sigma}^2} = \frac{\hat{\beta}_1^2}{se^2(\hat{\beta}_1)} = t_{\rm stat}^2
\end{aligned}
Furthermore, if t_{\rm stat} has a t distribution with n-2 df, then t_{\rm stat}^2 has an F distribution with (1,n-2) df. Both p-values are therefore equal.
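This equivalence is easy to check numerically (a sketch based on the summary of lm1):

```r
# Sketch: check that t_stat^2 equals F_stat for the degree-1 model
s <- summary(lm1)
t_stat <- s$coefficients["speed", "t value"]
c(t_stat^2, s$fstatistic["value"])   # both are 89.57
```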
2.5 Confidence and prediction intervals
For given values x^{\mathrm{new}} of the explanatory variable, we can use the fitted model to estimate the predicted response f^{\mathrm{new}}=f(x^{\mathrm{new}}). This estimate is defined as \hat{f^{\mathrm{new}}} = f(x^{\mathrm{new}},\hat{\beta}).
\hat{f^{\mathrm{new}}} is a random variable since it is a function of the observed y. We can compute a confidence interval for f^{\mathrm{new}} with the function predict(interval = "confidence"). Indeed, since \hat{f^{\mathrm{new}}}= x^{\mathrm{new}} \hat{\beta},
\hat{f^{\mathrm{new}}} \sim \mathcal{N} (f^{\mathrm{new}} , {\rm Var}(\hat{f^{\mathrm{new}}} ) )
An estimate \widehat{{\rm Var}(\hat{f^{\mathrm{new}}})} of {\rm Var}(\hat{f^{\mathrm{new}}}) is obtained by using \hat\sigma^2 instead of \sigma^2.
Then,
\begin{aligned}
\left({{\rm Var}(\hat{f^{\mathrm{new}}})} \right)^{-1/2}(\hat{f^{\mathrm{new}}} - f^{\mathrm{new}}) &\sim \mathcal{N}(0, {\rm Id}_{n^{\mathrm{new}}} ) \\
\left(\widehat{{\rm Var}(\hat{f^{\mathrm{new}}})} \right)^{-1/2}(\hat{f^{\mathrm{new}}} - f^{\mathrm{new}}) &\sim t_{n^{\mathrm{new}},n-p}
\end{aligned}
where t_{n^{\mathrm{new}},n-p} is the multivariate t distribution with n-p degrees of freedom (the components of this n^{\mathrm{new}}-vector are independent and follow a t distribution with n-p df).
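For instance (a sketch; the new speed values in new_data are arbitrary choices of ours):

```r
# Sketch: 95% confidence interval for f(x_new) with predict()
new_data <- data.frame(speed = c(10, 20))   # arbitrary new values
predict(lm1, newdata = new_data, interval = "confidence", level = 0.95)
```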
Consider now a vector of new measured values y^{\mathrm{new}}. We can again use the predict(interval = "prediction") function for computing a prediction interval for y^{\mathrm{new}}. By definition of the model,
y^{\mathrm{new}} \sim \mathcal{N} (f^{\mathrm{new}} , \sigma^2 \, {\rm Id}_{n^{\mathrm{new}}} )
Then, if we want to compute a prediction interval for y^{\mathrm{new}}, we must take into account both the variability of y^{\mathrm{new}} around f^{\mathrm{new}} and the uncertainty on f^{\mathrm{new}}, which is unknown: the interval is therefore based on the total variance \hat{\sigma}^2\,{\rm Id}_{n^{\mathrm{new}}} + \widehat{{\rm Var}(\hat{f^{\mathrm{new}}})}.
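The corresponding call (a sketch, with the same arbitrary new values as above) is:

```r
# Sketch: 95% prediction interval for y_new with predict()
new_data <- data.frame(speed = c(10, 20))
predict(lm1, newdata = new_data, interval = "prediction", level = 0.95)
```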
Again, we assume in this part that the residual errors are independent and identically distributed (i.i.d.), with a normal distribution, mean 0 and variance \sigma^2.
3.1 ANOVA
Consider two nested linear models with, respectively, p_1 and p_2 > p_1 coefficients, and let \hat{y}_1 and \hat{y}_2 be the corresponding vectors of fitted values. Cochran's theorem then shows that, under the null hypothesis that the smaller model holds, the test statistic
F_{\rm stat} = \frac{\|\hat{y}_2 - \hat{y}_1\|^2/(p_2-p_1)}{\|y - \hat{y}_2\|^2/(n-p_2)}
follows an F distribution with (p_2-p_1 , n-p_2) degrees of freedom.
This ANOVA test can be performed by means of the anova function.
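For instance, comparing the degree-1 fit lm1 with a degree-2 polynomial (a sketch; the model lm2 is our own notation):

```r
# Sketch: F-test between two nested polynomial models with anova()
lm2 <- lm(dist ~ 1 + speed + I(speed^2), data = cars)
anova(lm1, lm2)
```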
3.2 Likelihood ratio test
When two models are nested, we can compare them by performing a likelihood ratio test (LRT).
Let \log\ell_1 and \log\ell_2 be the log-likelihood functions of models \mathcal{M}_1 and \mathcal{M}_2. Then, for large n, the distribution of the test statistic
\begin{aligned}
LRT_{\rm stat} &= 2\left(\log\ell_2(\hat\theta_2) - \log\ell_1(\hat\theta_1)\right) \\
&= n \log \left( \frac{\sum_{j=1}^n(y_j - f_1(x_j,\hat{\beta_1}))^2}{\sum_{j=1}^n(y_j - f_2(x_j,\hat{\beta_2}))^2} \right)
\end{aligned}
can be approximated by a \chi^2 distribution with p_2-p_1=d_2-d_1 df.
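A sketch of this test for lm1 versus the degree-2 model introduced in the ANOVA example (the use of logLik and pchisq is ours; for lm objects, logLik is based on the ML estimate of \sigma^2, so the two expressions of LRT_{\rm stat} above coincide):

```r
# Sketch: likelihood ratio test between two nested polynomial models
lm2 <- lm(dist ~ 1 + speed + I(speed^2), data = cars)
lrt_stat <- 2 * (as.numeric(logLik(lm2)) - as.numeric(logLik(lm1)))
# equivalently: n * log(sum(residuals(lm1)^2) / sum(residuals(lm2)^2))
1 - pchisq(lrt_stat, df = 1)     # df = p2 - p1 = 1
```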
3.3 Deviance
The deviance of a given regression model, for a given set of observations y, is a measure of goodness of fit defined in R as:
D = \sum_{j=1}^n(y_j - f(x_j,\hat{\beta}))^2
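We can check this directly (a sketch):

```r
# Sketch: for a linear model, the deviance equals the residual sum of squares
c(deviance(lm1), sum(residuals(lm1)^2))
```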
3.4 Information criteria
The functions AIC and BIC compute the Akaike information criterion and the Bayesian information criterion. Both are penalized versions of minus twice the log-likelihood, defined by:
\begin{aligned}
AIC &= -2\log\ell(\hat{\theta}) + 2P \\
BIC &= -2\log\ell(\hat{\theta}) + \log(n)P
\end{aligned}
where P is the number of parameters of the model, i.e. the length of \theta.
On one hand, -2\log\ell(\hat{\theta}) decreases when P increases. On the other hand, the penalization term (2P or \log(n)P) increases with P. The objective of these criteria is to propose a model with an optimal compromise between the goodness of fit (measured by the log-likelihood) and the complexity of the model (measured by the number of parameters P).
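A sketch comparing the built-in functions with these definitions (note that, for an lm fit, the number of parameters P counts \sigma^2 as well, i.e. P = p + 1 = 3 here):

```r
# Sketch: recompute AIC and BIC from the log-likelihood
ll <- logLik(lm1)
P  <- attr(ll, "df")          # number of parameters (beta and sigma^2), here 3
n  <- nobs(lm1)
c(AIC(lm1), -2 * as.numeric(ll) + 2 * P)
c(BIC(lm1), -2 * as.numeric(ll) + log(n) * P)
```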
Appendix - Maximum likelihood approach
If we assume that (\varepsilon_j, 1 \leq j \leq n) is a sequence of independent and normally distributed random variables with mean 0 and variance \sigma^2:
\varepsilon_j \sim^{\mathrm{iid}} \mathcal{N}(0, \sigma^2),
then the y_j are also independent and normally distributed:
y_j \sim \mathcal{N}(f(x_j, \beta),\sigma^2). The vector y=(y_1,y_2,\ldots,y_n) is therefore a Gaussian vector whose probability density function (pdf) depends on a vector of parameters \theta=(\beta,\sigma^2):
For a given vector of observations y, the likelihood \ell is the function of the parameter \theta=(\beta,\sigma^2) defined as:
\ell(\theta) = \mathbb{P}(y ; \theta)
The log-likelihood is therefore
\log\ell(\theta) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{j=1}^n(y_j - f(x_j,\beta))^2
3.6 Maximum likelihood estimator
Assume that \theta takes its values in a subset \Theta of \mathbb{R}^{p+1}. Then, the Maximum Likelihood (ML) estimator of \theta is a function of y that maximizes the likelihood function: \hat{\theta} = \arg\max_{\theta \in \Theta} \ell(\theta).
Finally, the log-likelihood computed with \hat{\theta}=(\hat{\beta},\hat{\sigma}^2) reduces to
\log\ell(\hat{\theta}) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\left(\frac{1}{n}\sum_{j=1}^n(y_j - f(x_j,\hat{\beta}))^2\right) -\frac{n}{2}
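This value can be checked against the logLik function (a sketch; sigma2_mle denotes the ML estimate {\rm RSS}/n):

```r
# Sketch: log-likelihood at the MLE, compared with logLik()
n <- nobs(lm1)
sigma2_mle <- sum(residuals(lm1)^2) / n    # ML estimate of sigma^2
c(-n/2 * log(2 * pi) - n/2 * log(sigma2_mle) - n/2,
  as.numeric(logLik(lm1)))
```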
3.7 The Fisher Information matrix
3.8 Some general definitions
The partial derivative of the log-likelihood with respect to \theta is called the score. Under general regularity conditions, the expected value of the score is 0. Indeed, it is easy to show that
\mathbb{E}\left(\frac{\partial}{\partial \theta} \log\mathbb{P}(y;\theta^\star)\right) = 0
where \theta^\star is the ``true'' unknown value of \theta, i.e. such that the observations y were generated with the model \mathbb{P}(\cdot;\theta^\star).
The variance of the score is called the Fisher information matrix (FIM):
I_n(\theta^\star) = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log\mathbb{P}(y;\theta^\star)\right)\left(\frac{\partial}{\partial \theta} \log\mathbb{P}(y;\theta^\star)\right)^\prime\right].
Furthermore, it can be shown that, if \log\ell is twice differentiable with respect to \theta, I_n(\theta^\star) = -\mathbb{E}\left(\frac{\partial^2}{\partial \theta \partial \theta^\prime} \log\mathbb{P}(y;\theta^\star)\right).
The following central limit theorem (CLT) holds under certain regularity conditions:
I_n(\theta^\star)^{\frac{1}{2}}(\hat{\theta}-\theta^\star) \xrightarrow{n\to \infty} {\mathcal N}(0,{\rm Id}) .
This theorem shows that under relevant hypotheses, the estimator \hat{\theta} is consistent and converges to \theta^\star at rate \sqrt{n} since I_n=\mathcal{O}(n).
The normalizing term I_n(\theta^\star)^{-1} is unknown since it depends on the unknown parameter \theta^\star. We can use instead the observed Fisher information:
\begin{aligned}
I_{y}(\hat{\theta}) &= - \frac{\partial^2}{\partial \theta^2} \log\ell(\hat{\theta}) \\
&=-\sum_{i=1}^n \frac{\partial^2}{\partial \theta^2} \log \mathbb{P}(y_i ; \hat{\theta}).
\end{aligned}
We can then approximate the distribution of \hat{\theta} by a normal distribution with mean \theta^\star and variance-covariance matrix I_y(\hat{\theta})^{-1}:
\hat{\theta} \approx {\mathcal N}(\theta^\star , I_y(\hat{\theta})^{-1}) .
The square roots of the diagonal elements of I_y(\hat{\theta})^{-1} are called the standard errors (s.e.) of the elements of \hat{\theta}.
3.10 The FIM for a regression model
We have seen that, for a regression model,
\begin{aligned}
\log\ell(\theta) &= \log\ell(\beta,\sigma^2) \\
&= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{j=1}^n(y_j - f(x_j,\beta))^2
\end{aligned}
By definition,
Because of the block structure of I_n(\theta^\star), the variance-covariance matrix of \hat{\beta} can be estimated by I^{-1}_y(\hat{\beta}) where
\begin{aligned}
I_y(\hat{\beta}) &= - \frac{\partial^2}{\partial \beta \partial \beta^\prime} \log\ell(\hat{\beta},\hat{\sigma}^2) \\
&= \frac{1}{2\hat\sigma^2}\frac{\partial^2}{\partial \beta \partial \beta^\prime} \left(\sum_{j=1}^n(y_j - f(x_j,\hat\beta))^2 \right) \\
&= \frac{1}{\hat\sigma^2} \sum_{j=1}^n \left(
\left(\frac{\partial}{\partial \beta}f(x_j,\hat\beta)\right)\left(\frac{\partial}{\partial \beta}f(x_j,\hat\beta)\right)^\prime - \frac{\partial^2}{\partial \beta \partial \beta^\prime}f(x_j,\hat\beta)\,\left(y_j - f(x_j,\hat\beta)\right)
\right)
\end{aligned}
Remark: In the case of a linear model y=X\beta+e, we find that I_y(\hat{\beta}) = (X^\prime X)/\hat\sigma^2.
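A sketch checking this identity on the cars example (here \hat\sigma^2 is the ML estimate {\rm RSS}/n, so the resulting variance estimate of \hat\beta differs from vcov(lm1), which uses {\rm RSS}/(n-p), by the factor n/(n-p)):

```r
# Sketch: observed Fisher information block for beta in the linear case
X <- model.matrix(lm1)
sigma2_mle <- sum(residuals(lm1)^2) / nobs(lm1)   # ML estimate of sigma^2
I_beta <- t(X) %*% X / sigma2_mle
solve(I_beta)                                     # ML-based estimate of Var(beta-hat)
```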
The variance of \hat{\sigma}^2 is estimated by I^{-1}_y(\hat{\sigma}^2) where
\begin{aligned}
I_y(\hat{\sigma}^2) &= -\frac{\partial^2}{\partial (\sigma^2)^2 } \log\ell(\hat{\beta},\hat{\sigma}^2) \\
&= -\frac{n}{2\hat\sigma^4} + \frac{1}{\hat\sigma^6}\sum_{j=1}^n(y_j - f(x_j,\hat\beta))^2 \\
&= \frac{n}{2\hat\sigma^4}
\end{aligned}