Nonlinear Regression

Lecture Notes

Preliminary

Only functions from base R and stats (preloaded) are required, plus packages from the tidyverse for data representation and manipulation; the parallel package (shipped with R) is used for the bootstrap computations. You could also try the package broom, which standardizes the output of built-in R functions for statistical modelling.

library(tidyverse)
library(parallel)
theme_set(theme_bw())

1 Introduction

The faithful data (provided by the R base package datasets) consist of the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

Let us see what these data look like.

faithful_plot <- 
  faithful %>% 
  ggplot() + aes(x = eruptions, y = waiting) + 
  geom_point(size=2, colour="#993399") +
  ylab("Waiting time to next eruption (mn)") + xlab("duration of the eruption (mn)")
faithful_plot

We aim to fit a model to this data that describes the relationship between duration and waiting time.

If we try to fit a polynomial model, we can check that a polynomial of degree 4 is selected as the “best” polynomial model.

poly4 <- lm(waiting ~poly(eruptions, 4), data = faithful)
faithful_plot <- 
  faithful_plot + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 4), se = FALSE, colour="#339900")
faithful_plot

Even if this model is the “best” polynomial model, we may have serious doubts about its ability to predict waiting times for durations outside the observed range of durations.

Furthermore, the parameters of the model, i.e. the polynomial coefficients, have no obvious physical interpretation. Using a polynomial model here, we are therefore not seeking to build a structural model $f$ that approximates a physical phenomenon, but merely to relate the variability in the observations to the explanatory variables $x$, $x^2$, … We therefore need to consider other types of models: i) models that do not necessarily assume linear relationships between the response variable and the explanatory variables, and ii) models whose parameters have some physical interpretation.

A logistic function (or logistic curve) is a common “S” shape (sigmoid curve), with equation:

$$f_1(x) = \frac{A}{1+\exp(-\gamma(x-\tau))}$$

Here, $A$ is the limiting value (when $x \to \infty$), $\gamma$ measures the steepness of the curve and $\tau$ is the $x$-value of the sigmoid’s midpoint.
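
To get a feel for the role of each parameter, we can plot $f_1$ for a few arbitrary parameter values (here $A = 90$ and $\tau = 3$, with several values of $\gamma$):

logistic <- function(x, A, gamma, tau) A / (1 + exp(-gamma * (x - tau)))
curves <- expand.grid(x = seq(0, 6, by = 0.05), gamma = c(0.5, 1, 3))
curves$f <- logistic(curves$x, A = 90, gamma = curves$gamma, tau = 3)
ggplot(curves) + aes(x = x, y = f, colour = factor(gamma)) +
  geom_line(linewidth = 1) + labs(colour = "gamma")  # A sets the plateau, tau the midpoint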

This model is a nonlinear model in the sense that the regression function $f_1$ is a nonlinear function of the parameters. We can fit this model to our data using the nls function.

nlm1 <- nls(waiting ~  A / ( 1 + exp(- gamma * (eruptions -tau))), faithful, start = c(A=70, gamma=2, tau=1))
summary(nlm1)

Formula: waiting ~ A/(1 + exp(-gamma * (eruptions - tau)))

Parameters:
      Estimate Std. Error t value Pr(>|t|)    
A      93.1097     4.5080  20.654  < 2e-16 ***
gamma   0.6394     0.1022   6.254 1.57e-09 ***
tau     1.4623     0.1092  13.391  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.763 on 269 degrees of freedom

Number of iterations to convergence: 9 
Achieved convergence tolerance: 6.211e-06

We will see in the next sections what these quantities mean and how they are computed.

An extension of the logistic function assumes a minimum waiting time $S$ between eruptions:

$$f_2(x) = S + \frac{A-S}{1+\exp(-\gamma(x-\tau))}$$

We can again use nls to fit this nonlinear model to the data:

nlm2 <- nls(waiting ~ (A-S) / ( 1 + exp(-gamma * (eruptions - tau)) ) + S, faithful, start = c(A=90, gamma=2, tau=2, S=50))
summary(nlm2)

Formula: waiting ~ (A - S)/(1 + exp(-gamma * (eruptions - tau))) + S

Parameters:
      Estimate Std. Error t value Pr(>|t|)    
A      82.4659     0.9973  82.689  < 2e-16 ***
gamma   2.2539     0.4355   5.175 4.47e-07 ***
tau     3.0553     0.1107  27.610  < 2e-16 ***
S      51.3221     1.8303  28.040  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.622 on 268 degrees of freedom

Number of iterations to convergence: 9 
Achieved convergence tolerance: 6.243e-06

We can now compute and plot the waiting times predicted by these two fitted models:

faithful_plot <- faithful_plot + 
  geom_smooth(
    method  = "nls",  se = FALSE, color = "#E69F00",
    formula = y ~ A / ( 1 + exp(- gamma * (x-tau))), 
    method.args = list(start = c(A=70, gamma=2, tau=1))) +
  geom_smooth(
    method  = "nls",  se = FALSE, color = "#56B4E9", 
    formula = y ~ (A-S) / ( 1 + exp(-gamma * (x-tau)) ) + S, 
    method.args = list(start=c(A=90, gamma=2, tau=2, S=50))) 
faithful_plot

We will see in the next sections

  • how to fit a nonlinear model to the data. We will use the model $f_1$ and show how to retrieve the results of nlm1.
  • how to evaluate the capability of the model to describe the observed data,
  • how to compare possible models,
  • how to compute confidence intervals and prediction intervals for predicted values.

2 Fitting a nonlinear model

2.1 Estimation of the parameters of the model

2.1.1 Least squares estimation

In the model

$$y_j = f(x_j,\beta) + \varepsilon_j \quad ; \quad 1\leq j \leq n$$

the least squares (LS) estimator of $\beta$ minimizes the residual sum of squares (RSS):

$$\hat{\beta} = \arg\min_{\beta}\sum_{j=1}^n (y_j - f(x_j,\beta))^2$$

Here, there is no closed-form expression for $\hat{\beta}$. An optimization procedure is then used for computing $\hat{\beta}$.

We define the first model ${\cal M}_1$ as

$$y_j = f_1(x_j,\beta) + \varepsilon_j \quad ; \quad 1\leq j \leq n$$

where $f_1$ is the logistic function defined above and where $\beta = (A,\gamma,\tau)$. In R,

f_1  <- function(beta, x) {
  A <- beta[1]; gamma <- beta[2]; tau <- beta[3]
  A / ( 1 + exp(- gamma * (x-tau)))
}

Let us check that the nls function computes the nonlinear least-squares estimates of the parameters of the (nonlinear) model, that is, solves the above optimization problem in $\beta$. We first create a function that computes the residual sum of squares for a given vector of parameters $\beta$; this will be the objective (or cost) function from the optimization point of view:

rss_1 <- function(beta, x, y) sum( (y - f_1(beta, x) )^2 ) 

Then the LS estimate of $\beta$ can be computed using nlm (nonlinear minimization), which minimizes the residual sum of squares using a Newton-type algorithm.

optim_nlm1 <- nlm(rss_1, c(A = 90, gamma = 2, tau = 2), faithful$eruptions, faithful$waiting)
beta_hat <- setNames(optim_nlm1$estimate, names(coef(nlm1)))
beta_hat
         A      gamma        tau 
93.1100965  0.6393835  1.4622674 
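
As a sanity check, a general-purpose optimizer such as optim (here with its default Nelder-Mead method) should reach essentially the same minimum; this is only a cross-check, not what nls does internally:

optim_check <- optim(c(A = 90, gamma = 2, tau = 2), rss_1,
                     x = faithful$eruptions, y = faithful$waiting)
optim_check$par  # should agree with beta_hat up to the optimizer tolerance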

Assume now that the residual errors are random variables with mean 0 and variance $\sigma^2$:

$$\mathbb{E}[\varepsilon_j] = 0 \quad, \quad \mathbb{E}[\varepsilon_j^2] = \sigma^2 \quad, \quad 1 \leq j \leq n$$

Then, following the approach used for linear models, if $\beta$ is a vector of length $p$, there are $n-p$ degrees of freedom, and the residual error variance is defined as

$$\hat{\sigma}^2 = \frac{1}{n-p} \sum_{j=1}^n \left( y_j - f(x_j,\hat{\beta})\right)^2$$

and the so-called residual standard error is

$$\hat{\sigma} = \sqrt{\hat{\sigma}^2}$$

n  <- nrow(faithful)
p  <- length(beta_hat)
df <- (n - p)
sigma_hat <- sqrt(rss_1(beta_hat, faithful$eruptions, faithful$waiting)/df)
sigma_hat
[1] 5.762887
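
This value matches the residual standard error reported in the summary of nlm1; it can also be extracted directly with the sigma generic from stats:

sigma(nlm1)  # residual standard error, ≈ 5.763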

2.1.2 Maximum likelihood estimation

Assume now that the residual errors are independent and normally distributed:

$$\varepsilon_j \sim^{\mathrm{iid}} \mathcal{N}(0, \sigma^2).$$

Equivalently, we can rewrite the model as follows:

$$y_j = f(x_j,\beta) + \sigma \eta_j \quad ; \quad 1\leq j \leq n$$

where $\eta_j \sim^{\mathrm{iid}} \mathcal{N}(0,1)$.

The maximum likelihood (ML) estimator of $\beta$ then coincides with the least squares estimator:

$$\hat{\beta} = \arg\min_{\beta}\sum_{j=1}^n \left(y_j - f(x_j,\beta)\right)^2$$

and the ML estimators of $\sigma^2$ and $\sigma$ are

$$\hat{\sigma}^2_{\rm ml} = \frac{1}{n} \sum_{j=1}^n \left(y_j - f(x_j,\hat{\beta})\right)^2, \qquad \hat{\sigma}_{\rm ml} = \sqrt{\hat{\sigma}^2_{\rm ml}}$$

sigma_hat_ML <- sqrt(rss_1(beta_hat, faithful$eruptions, faithful$waiting) / n)
sigma_hat_ML
[1] 5.731018
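
Note that the two estimators of $\sigma$ differ only through their normalization ($n-p$ versus $n$); a one-line check:

sigma_hat * sqrt((n - p)/n)  # equals sigma_hat_ML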

2.2 Standard errors of the parameter estimates

Several methods exist for estimating the standard errors of the parameter estimates. In particular, the nls function uses a linear approximation of the model, but a likelihood approach or a parametric bootstrap may also provide estimates of these standard errors.

2.2.1 Linearization approach

An nls object has methods for several generic functions, including vcov, which computes the variance-covariance matrix of the estimated parameters $\hat{\beta}$.

vcov(nlm1)
               A        gamma          tau
A     20.3217751 -0.451414904  0.432032648
gamma -0.4514149  0.010452941 -0.008769221
tau    0.4320326 -0.008769221  0.011923449

The standard errors of the estimates are then the square roots of the diagonal elements of this matrix:

sqrt(diag(vcov(nlm1)))
        A     gamma       tau 
4.5079680 0.1022396 0.1091945 

The nls function linearizes the model for computing this variance-covariance matrix. Indeed, for any $\beta$ “close” to $\hat{\beta}$,

$$f(x_j , \beta) \simeq f(x_j , \hat{\beta}) + \nabla f(x_j , \hat{\beta})(\beta - \hat{\beta})$$

where $\nabla f(x_j , {\beta})$ is the gradient of $f(x_j , {\beta})$, i.e. the row vector of the first derivatives of $f(x_j ,{\beta})$ with respect to the $p$ components of $\beta$. Setting $z_j=y_j-f(x_j , \hat{\beta}) + \nabla f(x_j , \hat{\beta}) \hat{\beta}$ and $g_j=\nabla f(x_j , \hat{\beta})$, the original model can be approximated by the linear model

$$z_j = g_j \, \beta + \varepsilon_j.$$

Writing this model in matrix form $z=G\, \beta + \varepsilon$, where $g_j$ is the $j$th row of the matrix $G$, we can check that the LS estimator of $\beta$ for this model is the LS estimator $\hat{\beta}$ of the original model.

Proposition 1 (Equivalence of the two LS estimates) The LS estimator of the linearized model is equal to the LS estimator of the original model:

$$\begin{aligned} \hat{\beta} &= \arg\min_{\beta}\sum_{j=1}^n(y_j - f(x_j,\beta))^2 \\ & = \arg\min_{\beta}\sum_{j=1}^n(z_j - g_j\beta)^2 \end{aligned}$$

Proof. Let $\tilde{\beta}$ be the LS estimator of the linearized model. Then,

$$\begin{aligned} \tilde{\beta} &= \arg\min_{\beta} \, \left\|z - G\beta \right\|^2 \\ &= (G^\prime G)^{-1}G^\prime z \\ &= (G^\prime G)^{-1}G^\prime (y - f(x,\hat\beta) + G\hat\beta) \\ &= \hat\beta + (G^\prime G)^{-1}\nabla f(x,\hat\beta) (y - f(x,\hat\beta) ) \end{aligned}$$

By definition, $\hat\beta$ minimizes $U(\beta) = \| y -f(x,\beta) \|^2$. Then,

$$\nabla U(\hat\beta) = -2\nabla f(x,\hat\beta) (y - f(x,\hat\beta) )= 0$$

Thus, $\tilde\beta=\hat\beta$. $\Box$

Let us check this property numerically:

hx <- deriv(
  expr    = y ~ A / ( 1 + exp(- gamma * (x - tau))), 
  namevec = c("A", "gamma", "tau"), 
  function.arg = function(A, gamma, tau, x) { }
) 
fr <- hx(beta_hat[1], beta_hat[2], beta_hat[3], faithful$eruptions)
G  <- attr(fr, "gradient")
z  <- faithful$waiting - f_1(beta_hat, faithful$eruptions) + G %*% beta_hat
solve(crossprod(G)) %*% crossprod(G,z)
           [,1]
A     93.110121
gamma  0.639383
tau    1.462268

Since $\hat{\beta} =(G^\prime G)^{-1}G^\prime z$, the variance-covariance matrix of $\hat\beta$ can be approximated by

$$\mathbb{V}_{\rm lin}(\hat\beta) = \hat\sigma^2 (G^\prime G)^{-1}$$

V_lin <- sigma_hat^2 * solve(crossprod(G))
V_lin
               A        gamma          tau
A     20.3230488 -0.451427057  0.432076372
gamma -0.4514271  0.010452837 -0.008769862
tau    0.4320764 -0.008769862  0.011924743

and we can derive standard errors $({\rm se}_{\rm lin}(\hat{\beta}_k), 1 \leq k \leq p)$ for the parameter estimates,

se_lin <- sqrt(diag(V_lin))
se_lin
        A     gamma       tau 
4.5081092 0.1022391 0.1092005 
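
As expected, these standard errors match those returned by nls, which relies on the same linearization:

cbind(se_lin = se_lin, nls = sqrt(diag(vcov(nlm1))))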

2.2.2 Maximum likelihood approach

Let $I_y(\hat{\beta})$ be the observed Fisher information matrix at $\hat{\beta}$:

$$\begin{aligned} I_y({\hat\beta}) &= - \frac{\partial^2}{\partial \beta \partial \beta^\prime} \log \ell(\hat{\beta},\hat{\sigma}^2) \\ &= \frac{1}{2\hat\sigma^2}\frac{\partial^2}{\partial \beta \partial \beta^\prime} \left(\sum_{j=1}^n(y_j - f(x_j,\hat\beta))^2 \right). \end{aligned}$$

Then, the asymptotic normality of the ML estimator implies that the variance of $\hat{\beta}$ can be approximated by the inverse of $I_y(\hat{\beta})$:

$$\mathbb{V}_{\rm ml}(\hat{\beta}) = I_y(\hat{\beta})^{-1}.$$

The nlm function can return the Hessian of the objective function rss_1 at the minimum, i.e. the matrix of second derivatives $\partial^2/\partial \beta \partial \beta^\prime \sum_{j=1}^n(y_j - f(x_j,\hat\beta))^2$:

optim_nlm1 <- nlm(rss_1, c(90, 2, 2), faithful$eruptions, faithful$waiting, hessian = TRUE)
H <- optim_nlm1$hessian
H
           [,1]      [,2]        [,3]
[1,]   324.8939   10848.5   -3794.477
[2,] 10848.4973  379318.3 -114782.340
[3,] -3794.4768 -114782.3   58845.488

We then derive the Fisher information matrix and the variance $\mathbb{V}_{\rm ml}(\hat{\beta})$:

V_ml <- solve(H/(2*sigma_hat_ML^2))
V_ml
           [,1]         [,2]         [,3]
[1,] 17.4344834 -0.386666080  0.369991066
[2,] -0.3866661  0.008998208 -0.007381367
[3,]  0.3699911 -0.007381367  0.010576191

and

se_ml <- sqrt(diag(V_ml))
se_ml
[1] 4.17546206 0.09485888 0.10284061
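
The Hessian returned by nlm is a finite-difference approximation; the optimHess function from stats provides an independent numerical cross-check:

H_check <- optimHess(beta_hat, rss_1, x = faithful$eruptions, y = faithful$waiting)
max(abs(H_check - H))  # should be small relative to the entries of H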

Besides, using the fact that $\mathbb{V}(\chi^2(k))=2k$, we can show that $\mathbb{V}(\hat{\sigma}^2) \approx 2\sigma^4/n$. Then, by the delta method, the standard error of $\hat{\sigma}$ is approximately $\hat{\sigma}/\sqrt{2n}$.

sigma_hat/sqrt(2*n)
[1] 0.2470817

2.2.3 Parametric bootstrap

If we were able to repeat the same experiment under the same conditions, we would observe $y^{(1)} = (y^{(1)}_j, 1\leq j \leq n)$ and we would compute $\beta^{(1)}$, an estimate of $\beta$. Then, if we could repeat this experiment $L$ times, we would get $L$ estimates $\beta^{(1)}$, $\beta^{(2)}$, …, $\beta^{(L)}$ of $\beta$. This sequence of estimates $(\beta^{(\ell)}, 1 \leq \ell \leq L)$ would be a sample of random variables distributed as $\hat{\beta}$ and could therefore be used for estimating this distribution.

When such replicates are not available, parametric bootstrap (or Monte-Carlo simulation) is a way to mimic the repetition of an experiment.

For $\ell=1,2,\ldots, L$, we generate “observations” $(y^{(\ell)}_j, 1\leq j \leq n)$ with the model of interest, the original explanatory variables $(x_j, 1\leq j\leq n)$ and the estimated parameters $\hat\theta=(\hat\beta,\hat{\sigma}^2)$:

$$y^{(\ell)}_j = f(x_j,\hat{\beta}) + \hat{\sigma} \varepsilon^{(\ell)}_j \quad ; \quad 1\leq j \leq n$$

where $\varepsilon^{(\ell)}_j \sim^{\mathrm{iid}} \mathcal{N}(0,1)$. We then compute the LS/ML estimate of $\beta$:

$$\hat\beta^{(\ell)} = \arg\min_{\beta}\sum_{j=1}^n(y^{(\ell)}_j - f(x_j,\beta))^2$$

The variance-covariance matrix of $\hat\beta$ is then estimated by the empirical variance-covariance matrix of the $(\hat\beta^{(\ell)})$:

$$\mathbb{V}_{\rm mc}(\hat{\beta}) = \frac{1}{L-1}\sum_{\ell=1}^L ( \hat\beta^{(\ell)} - \bar{\beta} )( \hat\beta^{(\ell)} - \bar{ \beta} )^\prime$$

where $\bar{\beta} = (1/L) \sum_{\ell=1}^L \hat\beta^{(\ell)}$.

L <- 1000
y_hat_ref <- predict(nlm1)
beta_hat <- coef(nlm1)
x <- faithful$eruptions
betas_boot <- parallel::mclapply(1:L, function(b) {
  y_b <- y_hat_ref + sigma_hat * rnorm(length(y_hat_ref))
  coef(suppressWarnings(nls(y_b ~  f_1(beta, x), start = list(beta = beta_hat))))
}) %>% unlist() %>% matrix(ncol = L) %>% t()
V_mc <- cov(betas_boot)
V_mc
           [,1]        [,2]        [,3]
[1,] 27.0193082 -0.49290098  0.73934710
[2,] -0.4929010  0.01036602 -0.01164034
[3,]  0.7393471 -0.01164034  0.02408383
se_mc <- sqrt(diag(V_mc))
se_mc
[1] 5.1980100 0.1018137 0.1551896

Remark. It would be equivalent to directly compute the empirical standard deviation of each component of the sequence $(\hat\beta^{(\ell)})$:

apply(betas_boot, 2, sd)
[1] 5.1980100 0.1018137 0.1551896

One of the main advantages of this method is that it doesn't make any distributional assumption on $\hat\beta$, contrary to the maximum likelihood approach, whose asymptotic distribution is known to be normal with a known asymptotic variance. That asymptotic distribution is then used with a finite set of observations for approximating the distribution of the ML estimator, without knowing how good this approximation is. Similarly, the linearization approach relies on an approximation of the structural model, again without knowing how good this approximation is.

In this example, the ML approach seems to underestimate the standard errors of the estimates. On the other hand, the results obtained with the linearization approach are very similar to those obtained by Monte Carlo simulation.
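
To make this comparison easier, we can gather the three sets of standard errors computed above in a single table:

rbind(linearization = se_lin, max_likelihood = se_ml, bootstrap = se_mc)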

2.3 Statistical tests for the model parameters

The summary of model nlm1 includes several pieces of information about the model parameters:

summary(nlm1)$coefficient
        Estimate Std. Error   t value     Pr(>|t|)
A     93.1097350  4.5079680 20.654480 1.974732e-57
gamma  0.6393917  0.1022396  6.253854 1.567061e-09
tau    1.4622598  0.1091945 13.391326 1.110071e-31
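
As mentioned in the preliminary section, the broom package (assuming it is installed) returns the same table as a tibble, which is often more convenient for further manipulation:

broom::tidy(nlm1)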

Let $\beta = (\beta_k, 1\leq k \leq p)$ be the $p$-vector of parameters of the model. In the linearized model $z=G\beta+\varepsilon$,

$t_k = (\hat{\beta}_k - \beta_k)/{\rm se}(\hat{\beta}_k)$ follows a $t$-distribution with $n-p$ degrees of freedom. We can then perform a $t$-test to test if $\beta_k=0$.

The test statistic is $t_{{\rm stat}, k} = {\hat{\beta}_k}/{{\rm se}(\hat{\beta}_k)}$ and the $p$-value for this test is

$$p_k = 2\left(1 - \mathbb{P}(T_{n-p} \leq |t_{{\rm stat}, k}|) \right)$$

t_stat  <- beta_hat/se_lin
p_value <- 2*(1 - pt(abs(t_stat), n-p))
cbind(beta_hat, se_lin, t_stat, p_value) %>% round(4)
      beta_hat se_lin  t_stat p_value
A      93.1097 4.5081 20.6538       0
gamma   0.6394 0.1022  6.2539       0
tau     1.4623 0.1092 13.3906       0

2.4 Confidence intervals for the model parameters

2.4.1 Linearization approach

Using the linearized model $z = G\beta + \varepsilon$, we can compute a confidence interval for each component of $\beta$ as we do with any linear model:

$${\rm CI}_{{\rm lin}, 1-\alpha}(\beta_k) = [\hat{\beta}_k + qt_{\alpha/2, n-p}\ {\rm se}_{\rm lin}(\hat{\beta}_k) \ , \ \hat{\beta}_k + qt_{1-\alpha/2, n-p}\ {\rm se}_{\rm lin}(\hat{\beta}_k)]$$

where $qt_{p,\nu}$ is the quantile of order $p$ of a $t$-distribution with $\nu$ degrees of freedom.

level <- 0.95
alpha <- 1 - level
CI_linearized <- 
  cbind(
  beta_hat + qt(alpha/2, n-p) * se_lin,
  beta_hat + qt(1-alpha/2,n-p) * se_lin) %>% as.data.frame() %>%  
  setNames(c(paste0((1-level)/2*100,"%"),paste0((1+level)/2*100,"%")))
CI_linearized
            2.5%       97.5%
A     84.2340705 101.9853995
gamma  0.4381011   0.8406823
tau    1.2472635   1.6772560

2.4.2 Maximum likelihood approach

We can adopt the same approach with the ML estimate. Here, the standard errors $({\rm se}_{\rm ml}(\hat{\beta}_k), 1 \leq k \leq p)$ are derived from the asymptotic variance-covariance matrix $\mathbb{V}_{\rm ml}(\hat{\beta})$ of the parameter estimates.

$${\rm CI}_{{\rm ml},1-\alpha}(\beta_k) = [\hat{\beta}_k + qt_{\alpha/2, n-p}\ {\rm se}_{\rm ml}(\hat{\beta}_k) \ , \ \hat{\beta}_k + qt_{1-\alpha/2, n-p}\ {\rm se}_{\rm ml}(\hat{\beta}_k)]$$

CI_ML <- 
  cbind(
    beta_hat + qt(alpha/2,n-p) * se_ml,
    beta_hat + qt(1-alpha/2,n-p) * se_ml) %>% as.data.frame() %>%  
  setNames(c(paste0((1-level)/2*100,"%"),paste0((1+level)/2*100,"%")))
CI_ML
            2.5%       97.5%
A     84.8889935 101.3304765
gamma  0.4526314   0.8261519
tau    1.2597849   1.6647346

2.4.3 Parametric bootstrap

The sequence $(\hat{\beta}^{(\ell)}, 1 \leq \ell \leq L)$ obtained by Monte Carlo simulation can be used for computing an empirical confidence interval:

$${\rm CI}_{{\rm mc},1-\alpha}(\beta_k) = [\hat{\beta}_{k,\alpha/2} \ , \ \hat{\beta}_{k,1-\alpha/2} ]$$

where, for any $0 < p < 1$, $\hat{\beta}_{k,p}$ is the empirical quantile of order $p$ of $(\hat{\beta}^{(\ell)}_k, 1 \leq \ell \leq L)$:

CI_bootstrap <- 
  apply(as.matrix(betas_boot), 2, quantile, probs = c(alpha/2, 1-alpha/2)) %>%
  t() %>% as.data.frame() %>% 
  setNames(c(paste0((1-level)/2*100,"%"),paste0((1+level)/2*100,"%")))
CI_bootstrap
        2.5%       97.5%
1 86.7787669 106.9205636
2  0.4561864   0.8477302
3  1.3088312   1.9176370

Remark. These confidence intervals are slightly biased. We will use a linear model to explain where this bias comes from and show how to remove it.

Consider the linear model $y = X\beta + \sigma\varepsilon$, where $\beta$ has $p$ components. A confidence interval of level $1-\alpha$ for $\beta_k$ is

$${\rm CI}_{ 1-\alpha}(\beta_k) = [\hat{\beta}_k + qt_{\alpha/2, n-p}\ {\rm se}(\hat{\beta}_k) \ , \ \hat{\beta}_k + qt_{1-\alpha/2, n-p}\ {\rm se}(\hat{\beta}_k)]$$

where ${\rm se}(\hat{\beta}_k) = \hat\sigma \sqrt{(X^\prime X)^{-1}_{kk}}$.

On the other hand, for $\ell=1,2,\ldots,L$,

$$y^{(\ell)} = X\hat{\beta} + \hat{\sigma} \varepsilon^{(\ell)}$$

and

$$\hat{\beta}^{(\ell)} = \hat\beta + \hat{\sigma}(X^\prime X)^{-1}X^\prime\varepsilon^{(\ell)}$$

Thus, conditionally on the observations $y$, i.e. conditionally on $\hat\beta$, $\hat{\beta}_k^{(\ell)} \sim \mathcal{N}(\hat\beta_k \ , \ {\rm se}^2(\hat{\beta}_k))$. The empirical quantile $\hat{\beta}_{k,p}$ is then an estimator of the quantile of order $p$ of a normal distribution with mean $\hat\beta_k$ and variance ${\rm se}^2(\hat{\beta}_k)$. In other words, the confidence interval ${\rm CI}_{{\rm mc},1-\alpha}(\beta_k)$ is an estimator of the interval $[\hat{\beta}_k + q\mathcal{N}_{\alpha/2}\ {\rm se}(\hat{\beta}_k) \ , \ \hat{\beta}_k + q\mathcal{N}_{1-\alpha/2}\ {\rm se}(\hat{\beta}_k)]$, where $q\mathcal{N}_p$ is the quantile of order $p$ of a $\mathcal{N}(0,1)$ distribution.

We see that these quantiles of a normal distribution should be transformed into quantiles of a $t$-distribution with $n-p$ df.

An unbiased confidence interval for βk\beta_k is therefore

$${\rm CI}^\star_{{\rm mc},1-\alpha}(\beta_k) = [\hat{\beta}_k + \frac{qt_{\alpha/2, n-p}}{q\mathcal{N}_{\alpha/2}}(\hat{\beta}_{k,\alpha/2} - \hat{\beta}_k)\ , \ \hat{\beta}_k + \frac{qt_{1-\alpha/2, n-p}}{q\mathcal{N}_{1-\alpha/2}}(\hat{\beta}_{k,1-\alpha/2} - \hat{\beta}_k)]$$

The same correction can be used for nonlinear models:

rq <- qt(1-alpha/2,df)/qnorm(1-alpha/2)
beta_hat + rq*(CI_bootstrap - beta_hat)
        2.5%       97.5%
1 86.7501543 106.9829812
2  0.4553584   0.8486718
3  1.3081377   1.9196951

2.4.4 Profile likelihood

Function confint uses the profile likelihood method for computing confidence intervals for parameters in a fitted model.

CI_profiled <- confint(nlm1, level = level)
CI_profiled
            2.5%       97.5%
A     87.1321726 105.5760368
gamma  0.4625255   0.8324095
tau    1.3109469   1.8569351

Profile likelihood confidence intervals are based on the log-likelihood function.

Imagine that we want to compute a confidence interval for $\beta_1$. The profile likelihood of $\beta_1$ is defined by

$$\ell_p(\beta_1) = \max_{\beta_2, \ldots, \beta_p}\ell(\beta_1, \beta_2, \ldots, \beta_p )$$

$\ell_p(\beta_1)$ no longer depends on $\beta_2, \ldots, \beta_p$, since they have been profiled out.

As an example, let us compute and display the profile log-likelihood of $A$ for model nlm1:

values_A <- seq(86, 110, by = 0.1)
start    <- list(gamma = beta_hat[2], tau = beta_hat[3])
logLik_A <- map(values_A, ~
  nls(waiting ~ .x / (1 + exp(- gamma * (eruptions - tau))), faithful, start = start)
) %>% map(logLik) %>% map_dbl(as.numeric)
data.frame(A = values_A, logLik = logLik_A) %>% 
ggplot() + aes(x = A, y = logLik) + geom_line(color="blue", linewidth=1) + 
  geom_vline(xintercept = beta_hat[1], color="red", linetype = "longdash") + 
  scale_x_continuous(breaks = c(seq(85,110, by = 5), round(beta_hat[[1]],2)),"A")

Consider the test of $H_0: \beta_1 = \beta_1^\star$ against $H_1: \beta_1 \neq \beta_1^\star$. The likelihood ratio statistic is

$$LR_{\rm stat} = 2\left(\log(\ell(\hat\beta)) - \log(\ell_p(\beta_1^\star))\right)$$

where $\hat\beta$ is the value of $\beta$ that maximizes the likelihood $\ell(\beta)$ under $H_1$.

Under $H_0$, $LR_{\rm stat}$ follows a $\chi^2$ distribution with 1 df. Then, the test is significant (i.e. we reject $H_0$) if $LR_{\rm stat}> q\chi^2_{1,1-\alpha}$, where $q\chi^2_{1,1-\alpha}$ is the quantile of order $1-\alpha$ of a $\chi^2$ distribution with 1 df.

A “profile likelihood confidence interval” of level $1-\alpha$ for $\beta_1$ consists of those values $\beta_1^\star$ for which the test is not significant.

qlevel <- qchisq(level,1)

lp_A <- function(A, qlevel) {
  nlm1_A <- nls(waiting ~  A / ( 1 + exp(- gamma * (eruptions - tau))), faithful, 
      start = list(gamma = beta_hat[2], tau = beta_hat[3]))
  res <- as.numeric(2*(logLik(nlm1) - logLik(nlm1_A) ) - qlevel) 
  res
}

c1_A <- uniroot(lp_A, c(80,  beta_hat[1]), qlevel)$root
c2_A <- uniroot(lp_A, c(beta_hat[1], 110), qlevel)$root
dlogLik_A <- 2*(as.numeric(logLik(nlm1)) - logLik_A)
data.frame(A = values_A, dlogLik = dlogLik_A) %>% 
  ggplot() + geom_line(aes(values_A, dlogLik), linewidth=1, color="blue") + 
  geom_hline(yintercept=qlevel, color="red", linetype = "longdash") +
  geom_segment(aes(x = c1_A, xend = c1_A, y=-Inf, yend=qlevel),  color="red", linetype = "longdash")  +
  geom_segment(aes(x = c2_A, xend = c2_A, y=-Inf, yend=qlevel),  color="red", linetype = "longdash")  +
  scale_y_continuous(breaks = c(0,2,3,6,round(qlevel,2))) +
  scale_x_continuous(breaks = c(seq(85,100,by=5),round(c1_A,1),round(c2_A,1), 110),"A") 

Let us now compute the profile likelihood confidence intervals for $\gamma$ and $\tau$:

lp_gamma <- function(gamma) {
  nlm1_gamma <- nls(waiting ~  A/(1 + exp(-gamma*(eruptions-tau))), faithful,
                    start = list(A=beta_hat[1],tau=beta_hat[3]))
  as.numeric(2*(logLik(nlm1) - logLik(nlm1_gamma) ) - qlevel)
}
c1_gamma <- uniroot(lp_gamma, lower = 0.40, upper = beta_hat[2])$root
c2_gamma <- uniroot(lp_gamma, lower = beta_hat[2], upper=1)$root

lp_tau <- function(tau) {
  nlm1_tau <- nls(waiting ~ A/(1+exp(-gamma*(eruptions - tau))), faithful,
                  start=list(A=beta_hat[1],gamma=beta_hat[2]))
  as.numeric(2*(logLik(nlm1) - logLik(nlm1_tau) ) - qlevel)
}
c1_tau <- uniroot(lp_tau, lower=1.25, upper=beta_hat[3])$root
c2_tau <- uniroot(lp_tau, lower=beta_hat[3], upper=2.25)$root

CI_profiled_custom <- 
  rbind(A = c(c1_A , c2_A), gamma = c(c1_gamma, c2_gamma), tau = c(c1_tau, c2_tau)) %>% 
  as.data.frame() %>% 
  setNames(c(paste0((1-level)/2*100,"%"),paste0((1+level)/2*100,"%")))
CI_profiled_custom
            2.5%       97.5%
A     87.1523081 105.4511191
gamma  0.4636369   0.8310879
tau    1.3118040   1.8520538

Remark. The confint R function doesn't use a $\chi^2$ distribution with 1 df for the LRT statistic $LR_{\rm stat}$ (which is theoretically the right asymptotic distribution).

Instead, the square root of $LR_{\rm stat}$ is assumed to follow a half $t$-distribution with $n-p$ df. Then, the null hypothesis $H_0$ is rejected when $LR_{\rm stat}> qt_{1-\alpha/2,n-p}^2$.

qlevel <- qt(1-alpha/2,df)^2
c1R_A <- uniroot(lp_A, c(80,  beta_hat[1]), qlevel)$root
c2R_A <- uniroot(lp_A, c(beta_hat[1], 110), qlevel)$root
c(c1R_A, c2R_A)
[1]  87.13229 105.53713

The two tests - and hence the two confidence intervals - are equivalent for large $n$, since a $t$-distribution with $n$ df converges to a $\mathcal{N}(0,1)$ distribution when $n$ goes to infinity. Then, for any $0 < p < 1$,

$$(qt_{p,n})^2 \xrightarrow{n\to \infty} q\chi^2_{p,1}$$
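
A quick numerical illustration of this convergence:

n_values <- c(10, 100, 1000)
rbind(qt_squared = qt(0.975, n_values)^2, qchisq = qchisq(0.95, 1))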

3 Diagnostic plots

Let us plot

  1. the observed waiting times versus predicted waiting times,
  2. the residuals versus eruption times,
  3. the residuals versus predicted waiting times,
  4. the distribution of the residuals.
## unfortunately, no plotting function is defined for 'nls' object
residual_nlm1 <- resid(nlm1)/sd(resid(nlm1))
par(mfrow=c(2,2))
plot(predict(nlm1), faithful$waiting)
abline(a=0, b=1, lty=1, col="magenta")
plot(faithful$eruptions, residual_nlm1)
abline(a=0, b=0, lty=1, col="magenta")
plot(predict(nlm1), residual_nlm1)
abline(a=0, b=0, lty=1, col="magenta")
boxplot(residual_nlm1)
abline(a=0, b=0, lty=2, col="magenta")
abline(a=qnorm(0.25), b=0, lty=2, col="magenta")
abline(a=qnorm(0.75), b=0, lty=2, col="magenta")

On the one hand, the observations and predictions look randomly distributed around the line $y=x$. On the other hand, the residuals look evenly distributed around 0, with a constant variance. Furthermore, the distribution of the residuals appears to be symmetric, with quantiles close to those of a normal distribution.

Then, based on these graphs, we don't have any good reason for rejecting model nlm1… which doesn't mean we should keep it as our final model!
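
The near-normality of the residuals can also be checked a bit more formally, for instance with a normal Q-Q plot and a Shapiro-Wilk test (a complement to the plots above):

qqnorm(residual_nlm1)
qqline(residual_nlm1, col = "magenta")
shapiro.test(residual_nlm1)  # a small p-value would indicate a departure from normality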

4 Model comparison

We can produce the same diagnostic plots with model nlm2 and arrive at the same conclusion concerning this model.

residual_nlm2 <- resid(nlm2)/sd(resid(nlm2))
par(mfrow=c(2,2))
plot(predict(nlm2), faithful$waiting)
abline(a=0, b=1, lty=1, col="magenta")
plot(faithful$eruptions, residual_nlm2)
abline(a=0, b=0, lty=1, col="magenta")
plot(predict(nlm2), residual_nlm2)
abline(a=0, b=0, lty=1, col="magenta")
boxplot(residual_nlm2)
abline(a=0, b=0, lty=2, col="magenta")
abline(a=qnorm(0.25), b=0, lty=2, col="magenta")
abline(a=qnorm(0.75), b=0, lty=2, col="magenta")

Since nlm1 and nlm2 are two possible models for fitting our data, we need some criteria for comparing them. The statistical tests and the information criteria used for comparing linear models can also be used for comparing nonlinear models.

First, we can perform an ANOVA to test model nlm1 against model nlm2, since these two models are nested (nlm1 corresponds to nlm2 with $S=0$):

anova(nlm1, nlm2)
Analysis of Variance Table

Model 1: waiting ~ A/(1 + exp(-gamma * (eruptions - tau)))
Model 2: waiting ~ (A - S)/(1 + exp(-gamma * (eruptions - tau))) + S
  Res.Df Res.Sum Sq Df Sum Sq F value    Pr(>F)    
1    269     8933.7                                
2    268     8469.4  1  464.3  14.692 0.0001578 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let $RSS_1 = \|y- f_1(x,\hat\beta_1)\|^2$ and $RSS_2 = \|y- f_2(x,\hat\beta_2)\|^2$ be the residual sums of squares under, respectively, nlm1 and nlm2, and let $p_1$ and $p_2$ be the lengths of the vectors $\beta_1$ and $\beta_2$. Then,

$$F_{\rm stat} = \frac{(RSS_1 - RSS_2)/(p_2-p_1)}{RSS_2/(n-p_2)}$$

RSS1 <- sum(resid(nlm1)^2)
RSS2 <- sum(resid(nlm2)^2)
p1   <- length(coef(nlm1))
p2   <- length(coef(nlm2))
F.stat <- ( (RSS1-RSS2)/(p2-p1) ) / ( RSS2/(n-p2) )
c(RSS2, RSS1 - RSS2, F.stat, 1-pf(F.stat, p2-p1, n-p2))
[1] 8.469424e+03 4.642987e+02 1.469192e+01 1.577976e-04

Remark: since the model is nonlinear, we cannot decompose the residual sum of squares $RSS_1$ as we did with linear models. Indeed, here,

$$\|y- f_1(x,\hat\beta_1)\|^2 \neq \|f_1(x,\hat\beta_1) - f_2(x,\hat\beta_2)\|^2 + \|y- f_2(x,\hat\beta_2)\|^2$$

c(RSS1-RSS2, sum((predict(nlm1)-predict(nlm2))^2))
[1] 464.2987 520.3559

Another way to test nlm1 against nlm2 consists in testing if $S=0$ in model nlm2:

summary(nlm2)$coefficients
       Estimate Std. Error   t value      Pr(>|t|)
A     82.465891  0.9973073 82.688550 8.973936e-193
gamma  2.253936  0.4355265  5.175197  4.469535e-07
tau    3.055263  0.1106569 27.610221  2.420336e-80
S     51.322077  1.8302942 28.040343  1.110792e-81

Although both tests clearly prefer model nlm2, we can remark that the $t$-test and the $F$-test are not equivalent, since the models are nonlinear.

Information criteria such as AIC and BIC also prefer model nlm2:

as.matrix(AIC(nlm1, nlm2))
     df      AIC
nlm1  4 1729.668
nlm2  5 1717.152
as.matrix(BIC(nlm1, nlm2))
     df      BIC
nlm1  4 1744.092
nlm2  5 1735.181
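
These criteria are computed from the log-likelihood in the usual way; as a check, for nlm1 (the degrees of freedom count the three regression parameters plus the residual variance):

k <- p1 + 1   # estimated parameters: A, gamma, tau, plus sigma^2
c(AIC = -2*as.numeric(logLik(nlm1)) + 2*k,
  BIC = -2*as.numeric(logLik(nlm1)) + log(n)*k)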

5 Confidence intervals and prediction intervals

There are no ready-made functions to compute confidence intervals for predicted values and prediction intervals for new data. We will see how to obtain them by implementing two different methods.

5.1 The delta-method

For a given value $x_0$ of the explanatory variable $x$, we can use the model $f$ with the estimated parameter $\hat{\beta}$ and predict the response as $f(x_0,\hat{\beta})$.

Since $\hat{\beta}$ is a random vector with variance-covariance matrix $\mathbb{V}(\hat{\beta})$, $f(x_0,\hat{\beta})$ is also a random variable, and $f(x_0,\beta)$ can be approximated by a linear function of $\beta$ around $\hat{\beta}$:

$$f(x_0 , \beta) \simeq f(x_0 , \hat{\beta}) + \nabla f(x_0 , \hat{\beta})(\beta - \hat{\beta})$$

The so-called delta method then consists in using this approximation to approximate the variance of $f(x_0,\hat{\beta})$ by

$$\mathbb{V}(f(x_0,\hat{\beta})) \simeq \nabla f(x_0 , \hat{\beta}) \, \mathbb{V}(\hat{\beta}) \, \nabla f(x_0 , \hat{\beta})^\prime,$$

and we can now use this approximation for computing a $(1-\alpha)100\%$ confidence interval for each prediction $f(x_0,\beta)$:

$${\rm CI}_{{\rm lin}, 1-\alpha}= [f(x_0,\hat{\beta}) + qt_{\alpha/2, n-p}\ {\rm s.e.}(f(x_0,\hat{\beta})) \ , \ f(x_0,\hat{\beta}) + qt_{1-\alpha/2, n-p}\ {\rm s.e.}(f(x_0,\hat{\beta}))]$$

where ${\rm s.e.}(f(x_0,\hat{\beta}))$ is the standard error of $f(x_0,\hat{\beta})$, defined as

$${\rm s.e.}(f(x_0,\hat{\beta})) = \sqrt{\nabla f(x_0 , \hat{\beta}) \, \mathbb{V}(\hat{\beta}) \, \nabla f(x_0 , \hat{\beta})^\prime}$$

We can also compute a prediction interval for a future observation $y_0 = f(x_0,\beta) + \varepsilon_0$.

The prediction for $y_0$ is

$$\hat{y}_0 = f(x_0 , \hat{\beta}).$$

Then, the standard error for this prediction should take into account both the uncertainty on $f(x_0,\beta)$ and the variability of the residual error $\varepsilon_0$:

$${\rm s.e.}(\hat{y}_0) = \sqrt{\nabla f(x_0 , \hat{\beta}) \, \mathbb{V}(\hat{\beta}) \, \nabla f(x_0 , \hat{\beta})^\prime + \sigma^2}$$

Then,

$${\rm CI}_{{\rm lin}, 1-\alpha}(y_0) = [f(x_0,\hat{\beta}) + qt_{\alpha/2, n-p}\ {\rm s.e.}(\hat{y}_0) \ , \ f(x_0,\hat{\beta}) + qt_{1-\alpha/2, n-p}\ {\rm s.e.}(\hat{y}_0)]$$

As an example, let us compute the variance of $f_2(x_0, \hat{\beta}_2)$ for $x_0 = 1, 1.1, 1.2, \ldots, 5.9, 6$:

f_prime <- deriv(y ~ (A-S)/(1+exp(-gamma*(x-tau))) + S, c("A", "gamma", "tau", "S"), function(A, gamma, tau, S, x){} ) 
x_new <- seq(1, 6, by=0.1)
beta_hat <- coef(nlm2)
f_new    <- f_prime(beta_hat[1], beta_hat[2], beta_hat[3], beta_hat[4], x_new)
grad_new <- attr(f_new, "gradient")
GS <- rowSums((grad_new %*% vcov(nlm2)) * grad_new)

We can then derive a $95\%$ confidence interval for each $f_2(x_0,\beta)$:

alpha <- 0.05
delta_f <- sqrt(GS) * qt(1-alpha/2,  n - length(beta_hat))
df_delta <- data.frame(
  x = x_new, 
  f = f_new, 
  lwr_conf = f_new - delta_f, 
  upr_conf = f_new + delta_f
  )

and for each $y_0$:

delta_y <- sqrt(GS + sigma(nlm2)^2) * qt(1-alpha/2, n - length(beta_hat))
df_delta[c("lwr_pred","upr_pred")] <- cbind(f_new - delta_y,f_new + delta_y)

We can now plot these two intervals together with the data:

ggplot(faithful) + geom_point(aes(x = eruptions, y = waiting)) +
  geom_ribbon(data=df_delta, aes(x=x, ymin=lwr_pred, ymax=upr_pred), alpha=0.1, fill="blue") + 
  geom_ribbon(data=df_delta, aes(x=x, ymin=lwr_conf, ymax=upr_conf), alpha=0.2, fill="#339900") +
  geom_line(data=df_delta, aes(x=x, y=f), colour="#339900", linewidth=1)

5.2 Parametric bootstrap

As we have already seen, parametric bootstrap consists in simulating $L$ replicates of the experiment, by drawing random observations from the fitted model, i.e. using the estimated parameters $\hat\theta=(\hat\beta,\hat{\sigma}^2)$. Then, for each replicate,

  1. an estimate $\hat{\beta}^{(\ell)}$ of the vector of parameters $\beta$ is computed,
  2. a prediction $f(x_0,\hat{\beta}^{(\ell)})$ is computed for each value of $x_0$,
  3. a new observation $y_0^{(\ell)}$ is randomly drawn for each value of $x_0$.

We can then use the empirical quantiles of $(f(x_0,\hat{\beta}^{(\ell)}), 1 \leq \ell \leq L)$ and $(y_0^{(\ell)}, 1 \leq \ell \leq L)$ to compute a confidence interval for $f_0=f(x_0,\beta)$ and a prediction interval for $y_0$.

Let $f_{0,p}$ and $y_{0,p}$ be the empirical quantiles of order $p$ of $(f(x_0,\hat{\beta}^{(\ell)}), 1 \leq \ell \leq L)$ and $(y_0^{(\ell)}, 1 \leq \ell \leq L)$, respectively. Instead of using the empirical intervals

$$\begin{aligned} {\rm CI}_{{\rm mc},1-\alpha}(f_0) &= [f_{0,\alpha/2} \ , \ f_{0,1-\alpha/2} ] \\ {\rm CI}_{{\rm mc},1-\alpha}(y_0) &= [y_{0,\alpha/2} \ , \ y_{0,1-\alpha/2} ] \end{aligned}$$

we can define the confidence and prediction intervals using the correction previously introduced for obtaining unbiased intervals in the case of a linear model:

$$\begin{aligned} & {\rm CI}^\star_{{\rm mc},1-\alpha}(f_0) = \\ & [f(x_0,\hat{\beta}) + \frac{qt_{\alpha/2, n-p}}{q\mathcal{N}_{\alpha/2}}(f_{0,\alpha/2} - f(x_0,\hat{\beta}))\ , \ f(x_0,\hat{\beta}) + \frac{qt_{1-\alpha/2, n-p}}{q\mathcal{N}_{1-\alpha/2}}(f_{0,1-\alpha/2} - f(x_0,\hat{\beta}))] \\ & {\rm CI}^\star_{{\rm mc},1-\alpha}(y_0) = \\ & [f(x_0,\hat{\beta}) + \frac{qt_{\alpha/2, n-p}}{q\mathcal{N}_{\alpha/2}}(y_{0,\alpha/2} - f(x_0,\hat{\beta}))\ , \ f(x_0,\hat{\beta}) + \frac{qt_{1-\alpha/2, n-p}}{q\mathcal{N}_{1-\alpha/2}}(y_{0,1-\alpha/2} - f(x_0,\hat{\beta}))] \end{aligned}$$

Let us compute these confidence and prediction intervals by bootstrapping for the model $f_2$:

f_hat <- function(beta, x){ beta[4] + (beta[1]-beta[4])/(1+exp(-beta[2]*(x-beta[3]))) }
beta_hat  <- coef(nlm2)
y_hat_ref <- f_hat(beta_hat, x)
df_mc <- data.frame( x =x_new, f = f_new)

res <- parallel::mclapply(1:1000, function(b) {
  y_b <- y_hat_ref + sigma(nlm2) * rnorm(n)
  nlm2_b <- nls(y_b ~ f_hat(beta, x), start = list(beta = beta_hat))
  f_b <- predict(nlm2_b, newdata = df_mc)
  # one residual error per new x value, not a single shared one
  list(f_hat = f_b, y_hat = f_b + rnorm(length(f_b), 0, sigma(nlm2)))
})

df_mc[c("lwr_conf","upr_conf")] <- 
  map(res, "f_hat") %>% reduce(rbind) %>% 
  apply(2, quantile, c(alpha/2,1-alpha/2)) %>% t()
df_mc[c("lwr_pred","upr_pred")] <-
  map(res, "y_hat") %>% reduce(rbind) %>% 
  apply(2, quantile, c(alpha/2,1-alpha/2)) %>% t()
## removing bias
df_mc[,(3:6)] <- f_new + rq*(df_mc[,(3:6)] - f_new)
ggplot(faithful) + geom_point(aes(x = eruptions, y = waiting)) +
  geom_ribbon(data=df_mc, aes(x=x, ymin=lwr_pred, ymax=upr_pred), alpha=0.1, fill="blue") + 
  geom_ribbon(data=df_mc, aes(x=x, ymin=lwr_conf, ymax=upr_conf), alpha=0.2, fill="#339900") +
  geom_line(data=df_mc, aes(x=x, y=f), colour="#339900", linewidth=1)