The Gauss-Newton method for nonlinear regression: is it so difficult?

Like most biologists, I am fond of nonlinear regression. The ability to fit biologically realistic equations to experimental data and to use the model parameters for interpretation is really appealing! The fitting procedure itself is quite simple: we feed a dataset, an equation and reasonable initial guesses for the model parameters to our fitting programme and that's it, we get our estimates back! But what is actually happening in the meantime?

Some of us know that a sort of iterative algorithm is used; the most statistically proficient even know that it is the Gauss-Newton method, but very few biologists actually know how this algorithm works. Any attempt to learn more about it is quickly blocked by mathematical obstacles: matrices, vectors, derivatives and other fancy entities immediately get in the way of better understanding. This is a pity, because the Gauss-Newton method is very elegant in principle and straightforward to implement!

Let's imagine that we have just two experimental observations (I know this is unrealistically few, but I have to keep things simple!):

y <- c(1.3, 2)
x <- c(0.5, 2.5)

and we would like to use those observations to parameterise the following equation:

\[ Y = f(X, \theta) + \epsilon = exp(\theta X) + \epsilon \]

The usual way to visualise this problem is to look at Y as a nonlinear function of X:

# Figure 1. The common view of a nonlinear regression problem
plot(y ~ x, xlim = c(0, 3), ylim = c(1, 3))
curve(exp(0.2806 * x), add = TRUE)  # exponential curve at (approximately) the least squares estimate of theta

We are actually looking for the exponential curve that goes as close as possible to the observations in the above plot (least squares estimation). Unfortunately, this simple view does not help much in understanding how the estimation process works, and a major change of perspective is required.

A more useful perspective

When it comes to least squares estimation, it is very useful to look at the observations as a single point in space. In this case we have two observations, so we can represent them on a plane (that's why I kept this problem so simple!). In general, we need a space with n dimensions, where n is the number of observations.

Look at the graph below, where Y is positioned by using its values as coordinates (1.3 and 2).

# Figure 2. The geometry of nonlinear regression
# The expected response curve g(theta), drawn for theta ranging from -0.9 to 0.5
ey_x <- exp(seq(-0.9, 0.5, by = 0.1) * 0.5)
ey_y <- exp(seq(-0.9, 0.5, by = 0.1) * 2.5)
plot(ey_y ~ ey_x, type = "l", xlim = c(0.7, 1.9), ylim = c(0, 3.5), xlab = "", 
    ylab = "", col = "red")
points(1.3, 2, pch = 16)  # the two observations, plotted as a single point Y
text(1.35, 2, labels = "Y")
arrows(1.3, 2, exp(0.281 * 0.5), exp(0.281 * 2.5), lty = 2)  # the distance epsilon
text(1.25, 1.9, labels = expression(epsilon))
text(1.08, 2, labels = expression(bar(Y)[LS]))  # the least squares solution on the curve
text(1.4, 3.5, labels = expression(paste(bar(Y), " = g(", theta, ")")))

If we had no 'lack of fit' (\( \epsilon = 0 \)), the expected response (\( \bf{\bar{Y}} \)) would be represented by the coordinates:

\[ \bf{\bar{Y}} = g(\theta) = \left( \begin{array}{c} exp(\theta \cdot 0.5) \\ exp(\theta \cdot 2.5) \end{array} \right) \]

For instance, if \( \theta = 0 \), \( \bf{\bar{Y}} \) would lie at \( exp(0 \times 0.5) = 1 \) and \( exp(0 \times 2.5) = 1 \). If instead \( \theta = 0.5 \), \( \bf{\bar{Y}} \) would lie at \( exp(0.5 \times 0.5) = 1.28 \) and \( exp(0.5 \times 2.5) = 3.49 \). In other words, the expected response is a function of \( \theta \), which traces out the red curve in the figure above.
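
We can verify these coordinates with a quick calculation (just the arithmetic above, in R):

exp(0 * x)    # theta = 0: coordinates (1, 1)
exp(0.5 * x)  # theta = 0.5: coordinates of about (1.28, 3.49)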

Do not confuse this curve with the one we are fitting to the data (\( f(X, \theta) \)); they are obviously related, but \( g(\theta) \) gives us the coordinates to place \( \bf{\bar{Y}} \) in the response space, and it is calculated only for the values of X that were actually used in the experiment (look carefully at the two pictures above)!

On this red curve, we should be able to select the point closest to Y, that is \( \bf{\bar{Y}}_{LS} \). In other words, the distance \( \epsilon \) should be as small as possible. What \( \theta \) value do we need to achieve this aim?

As the red curve \( g(\theta) \) is not linear, a closed-form solution is almost never available. That's where the Gauss-Newton method comes in handy!
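
Before looking at the algorithm, it may help to state the goal in code: we are looking for the value of \( \theta \) that minimises the squared length of \( \epsilon \). The helper function below is just my own illustration of this objective:

rss <- function(theta) sum((y - exp(theta * x))^2)  # squared distance from Y to g(theta)
rss(0)    # 1.09
rss(0.5)  # about 2.22: theta = 0.5 is even further from Y than theta = 0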

The Gauss-Newton algorithm

The Gauss-Newton method is based on Taylor series. The basic idea is that, if the curve \( g(\theta) \) is close to linear, I can approximate it by a tangent line. Therefore, I do not need to know \( g(\theta) \) everywhere; I only need to be able to calculate \( g(\theta) \) at a given point \( \theta_1 \) and, above all, its first derivative at \( \theta_1 \). If so, I can write:

\[ g(\theta) \simeq P(\theta) = g({\theta_1}) + g'({\theta_1})(\theta - {\theta_1}) \]

where \( P(\theta) \) is the tangent line and \( g'(\theta_1) \) is the first derivative of g, calculated at the point \( \theta_1 \). This tangent line may be a good approximation to our nonlinear function in the close vicinity of \( \theta_1 \).
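
As a quick numerical check (a sketch of my own, not part of the original derivation), the tangent approximation \( P(\theta) \) built at \( \theta_1 = 0 \) tracks \( g(\theta) \) closely near \( \theta_1 \) and drifts away further out:

gfun   <- function(theta) exp(theta * x)      # g(theta): both coordinates at once
gderiv <- function(theta) exp(theta * x) * x  # first derivative of g with respect to theta
theta1 <- 0
P <- function(theta) gfun(theta1) + gderiv(theta1) * (theta - theta1)  # tangent line
rbind(g = gfun(0.1), P = P(0.1))  # nearly identical close to theta1
rbind(g = gfun(1), P = P(1))      # clearly different far from theta1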

The Gauss-Newton method uses the Taylor expansion recursively, as we will see now.

First move

1 - Let's try to guess a reasonable initial value for \( \theta \) (e.g. \( \theta_1 = 0 \)). We can use this to calculate the corresponding expected response \( \bf{\bar{Y}_1} \), which lies on the red curve and has the following coordinates:

theta1 <- 0
bar_y1 <- exp(theta1 * x)
bar_y1
## [1] 1 1

2 - How far is \( \bf{\bar{Y}_1} \) from \( \bf{Y} \)? Let's calculate the residuals (difference in coordinates):

(R1 <- y - bar_y1)
## [1] 0.3 1.0

3 - Keeping in mind the graph above, we have to build a new reference system, with the origin now placed at \( \bf{\bar{Y}_1} \). In this system we can represent the residuals \( \bf{R_1} \) not as a point but as a vector, i.e. an arrow that starts from the origin and ends at the point with coordinates (0.3, 1). The squared length of this arrow (vector) is the residual deviance for this first iteration and it is equal to \( (0.3)^2 + (1)^2 = 1.09 \). See the graph below.
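
A quick one-line check of this value in R:

sum(R1^2)  # 1.09: the residual deviance at this first iteration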

# Figure 3. The linear approximation to nonlinear least squares
# Same geometry as Figure 2, but with the origin translated to bar_y1 = (1, 1)
Y <- c(1.3, 2)
ey_x <- exp(seq(-0.9, 0.5, by = 0.1) * 0.5) - 1
ey_y <- exp(seq(-0.9, 0.5, by = 0.1) * 2.5) - 1
plot(0, type = "n", xlim = c(-0.2, 0.8), ylim = c(-1, 2.5), axes = FALSE, 
    xlab = "", ylab = "")
points(ey_y ~ ey_x, type = "l", col = "red")  # the translated curve g(theta)
axis(1, at = c(-0.2, 0.2, 0.4, 0.6, 0.8), labels = c(-0.2, 0.2, 0.4, 0.6, 0.8), 
    pos = c(0, 0))
axis(2, at = c(-1, -0.5, 0.5, 1, 1.5, 2, 2.5), labels = c(-1, -0.5, 0.5, 1, 
    1.5, 2, 2.5), pos = c(0, 0))
arrows(0, 0, 1.3 - 1, 2 - 1)  # the residual vector R1
text(0.35, 1, labels = "R1")
arrows(0, 0, 1.5 - 1, 3.5 - 1)  # the tangent vector V1
text(1.55 - 1, 3.5 - 1, labels = "V1")
arrows(1.3 - 1, 2 - 1, 1.22 - 1, 2.12 - 1, lty = 2)  # the remaining distance epsilon
text(1.3 - 1, 1.2, labels = expression(epsilon))

4 - It is now time to use our Taylor series and draw the tangent line to the red curve at the origin. If every point on the red curve has coordinates [\( f(\theta, 0.5) \), \( f(\theta, 2.5) \)], it follows that the direction of the tangent line at \( \theta_1 \) is given by the derivatives [\( f'(\theta_1, 0.5) \), \( f'(\theta_1, 2.5) \)]. Be careful: we have to calculate the derivative of \( f(\theta, X) = exp(\theta X) \) with respect to \( \theta \), because X is known to take the two values 0.5 and 2.5. We know that:

# How to calculate a derivative with R
D(expression(exp(theta * x)), "theta")
## exp(theta * x) * x

thus we can calculate \( V_1 \):

V1 <- exp(theta1 * x) * x
V1
## [1] 0.5 2.5

\( V_1 \) can be represented on our plane (see the figure above) as an arrow (a vector); it represents the tangent line, and every other point on this line can be obtained by multiplying both coordinates of \( V_1 \) by a factor \( \beta \) (i.e. we can shrink or stretch \( V_1 \)).

5 - The least squares solution we are looking for would be on the red curve, but we can approximate it by using the black tangent line. This problem is perfectly linear: we should shrink (or stretch, but not in this case) \( \bf{V_1} \) until the remaining residual \( \epsilon \) becomes perpendicular to it, i.e. until \( V_1^T \epsilon = 0 \). In other words, we can use the same approach as we use for linear models, where:

\[ \beta = (V_1^{T}V_1)^{-1} V_1^{T}R_1 = 0.40769 \]

With R:

beta <- solve(t(V1) %*% V1) %*% (t(V1) %*% R1)
beta
##        [,1]
## [1,] 0.4077

If we use \( \beta \) to update our earlier guess \( \theta_1 \), we should come closer to \( \bf{Y} \). Let's make another move.

Second move

1 - We select \( \theta_2 = \theta_1 + 0.40769 = 0.40769 \) and calculate the expected response \( \bf{\bar{Y}_2} \):

theta2 <- 0.40769
bar_y2 <- exp(theta2 * x)
bar_y2
## [1] 1.226 2.771

2 - We calculate the residuals and the residual deviance:

(R2 <- y - bar_y2)
## [1]  0.07389 -0.77105
sum(R2^2)
## [1] 0.6

We see that we have actually come closer to Y, as the residual deviance has dropped from 1.09 to 0.6.

3 - We reuse the Taylor expansion and linearise the problem again:

V2 <- exp(theta2 * x) * x
V2
## [1] 0.6131 6.9276

4 - We calculate a new linear least squares solution:

beta <- solve(t(V2) %*% V2) %*% (t(V2) %*% R2)
beta
##         [,1]
## [1,] -0.1095

5 - We update \( \theta \) and carry on:

theta3 <- theta2 + beta
theta3
##        [,1]
## [1,] 0.2982

Final move

If we carry on with this process, we discover that after six iterations the change in \( \theta \) becomes negligible, as does the change in the residual sum of squares. We have very likely reached convergence and can stop the process.
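
To see the whole sequence at once, here is a minimal hand-rolled sketch of the iteration (my own summary of the steps above, using the same data and starting value):

theta <- 0
for (i in 1:6) {
  R <- y - exp(theta * x)          # residuals at the current estimate
  cat(i, theta, sum(R^2), "\n")    # iteration, current theta, residual deviance
  V <- exp(theta * x) * x          # tangent vector (derivative of g)
  beta <- sum(V * R) / sum(V * V)  # linear least squares step (single parameter)
  theta <- theta + beta            # update the estimate
}
theta                              # approaches the final estimate of about 0.2807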

If we would like to check the above procedure with R, we can use the nls() function, which uses the Gauss-Newton method by default. By using the trace = T argument (which displays the residual sum of squares and the current parameter value at each iteration), we see quite a close agreement with our hand calculations:

mod <- nls(y ~ exp(theta * x), start = list(theta = 0), trace = T)
## 1.09 :  0
## 0.6 :  0.4077
## 0.03093 :  0.2982
## 0.0226 :  0.2809
## 0.0226 :  0.2807
## 0.0226 :  0.2807
coef(mod)
##  theta 
## 0.2807

A take-home message

I think it is a pity that biologists lack a basic understanding of this method and use it blindly. It is a very elegant solution to a mathematically daunting task! There is obviously much more to it than this, but we'll leave that to the mathematicians. So, what is a biologist's take-home message? The above method works well only when the linear approximation holds, i.e. when the parameters behave in a close-to-linear fashion. So we need to pay attention to the parameterisation we use: several models are biologically realistic, but they are built on highly nonlinear parameters!

Further readings

BATES DM, WATTS DG (1988). Nonlinear Regression Analysis and Its Applications. John Wiley and Sons, Inc., New York.

RATKOWSKY DA (1990). Handbook of Nonlinear Regression Models. Marcel Dekker Inc.
