TheoryConsider a set of m data points, is minimized, where the residuals (errors) ri are given by for The minimum value of S occurs when the gradient is zero. Since the model contains n parameters there are n gradient equations: In a non-linear system, the derivatives Here, k is an iteration number and the vector of increments, The Jacobian, J, is a function of constants, the independent variable and the parameters, so it changes from one iteration to the next. Thus, in terms of the linearized model, Substituting these expressions into the gradient equations, they become which, on rearrangement, become n simultaneous linear equations, the normal equations The normal equations are written in matrix notation as When the observations are not equally reliable, a weighted sum of squares may be minimized, Each element of the diagonal weight matrix, W should, ideally, be equal to the reciprocal of the variance of the measurement.1 The normal equations are then These equations form the basis for the Gauss-Newton algorithm for a non-linear least squares problem. Geometrical interpretationIn linear least squares the objective function, S, is a quadratic function of the parameters. When there is only one parameter The graph of S with respect to that parameter will be a parabola. With two or more parameters the contours of S with respect to any pair of parameters will be concentric ellipses (assuming that the normal equations matrix The more the parameter values differ from their optimal values, the more the contours deviate from elliptical shape. A consequence of this is that initial parameter estimates should be as close as practicable to their (unknown!) optimal values. It also explains how divergence can come about as the Gauss-Newton algorithm is convergent only when the objective function is approximately quadratic in the parameters. Initial parameter estimatesProblems of ill-conditioning and divergence can be ameliorated by finding initial parameter estimates that are near to the optimal values. A good way to do this is by computer simulation. Both the observed and calculated data are displayed on a screen. The parameters of the model are adjusted by hand until the agreement between observed and calculated data is reasonably good. Although this will be a subjective judgment, it is sufficient to find a good starting point for the non-linear refinement. ComputationGauss-Newton methodThe normal equations may be solved for where k is an iteration number. While this method may be adequate for simple models, it will fail if divergence occurs. Therefore protection against divergence is pretty much essential. Shift-cuttingIf divergence occurs, a simple expedient is to reduce the length of the shift vector, For example the length of the shift vector may be successively halved until the new value of the objective function is less than its value at the last iteration. The fraction, f could be optimized by a line search.2 As each trial value of f requires the objective function to be re-calculated it is not worth optimizing its value too stringently. When using shift-cutting, the direction of the shift vector remains unchanged. This limits the applicability of the method to situations where the direction of the shift vector is not very different from what it would be if the objective function were approximately quadratic in the parameters, Marquardt parameterIf divergence occurs and the direction of the shift vector is so far from its "ideal" direction that shift-cutting is not very effective, that is, the fraction, f required to avoid divergence is very small, the direction must be changed. This can achieved by using the Marquardt parameter.3 In this method the normal equations are modified where λ is the Marquardt parameter and I is an identity matrix. Increasing the value of λ has the effect of changing both the direction and the length of the shift vector. The shift vector is rotated towards the direction of steepest descent
Various strategies have been proposed for the determination of the Marquardt parameter. As with shift-cutting, it is wasteful to optimize this parameter too stringently. Rather, once a value has been found that brings about a reduction in the value of the objective function, that value of the parameter is carried to the next iteration, reduced if possible, or increased if need be. When reducing the value of the Marquardt parameter, there is a cut-off value below which it is safe to set it to zero, that is, to continue with the unmodified Gauss-Newton method. The cut-off value may be set equal to the smallest singular value of the Jacobian. 4 A bound for this value is given by QR decompositionThe minimum in the sum of squares can be found by a method that does not involve forming the normal equations. The residuals with the linearized model can be written as The Jacobian is subjected to an orthogonal decomposition; the QR decomposition will serve to illustrate the process. where Q is an orthogonal The residual vector is left-multiplied by This has no effect on the sum of squares since These equations are easily solved as R is upper triangular. Singular value decompositionA variant of the method of orthogonal decomposition involves singular value decomposition, in which R is diagonalized by further orthogonal transformations. where The relative simplicity of this expression is very useful in theoretical analysis of non-linear least squares. The application of singular value decomposition is discussed in detail in Lawson and Hanson. 4. Convergence criteriaThe common sense criterion for convergence is that the sum of squares does not decrease from one iteration to the next. However this criterion is often difficult to implement in practice, for various reasons. A useful convergence criterion is The value 0.0001 is somewhat arbitrary and may need to changed. In particular it may need to be increased when experimental errors are large. An alternative criterion is Again, the numerical value is somewhat arbitrary; 0.001 is equivalent to specifying that each parameter should be refined to 0.1% precision. This is reasonable when it is less than the largest relative standard deviation on the parameters. Parameter errors, confidence limits, residuals etc.For details concerning these topics see linear least squares#Parameter errors, correlation and confidence limits Multiple minimaMultiple minima can occur in a variety of circumstances some of which are:
where α is the height, γ is the position and β is the half-width at half height, there are two solutions for the half-width,
Not all multiple minima have equal values of the objective function. False minima, also known as local minima, occur when the objective function value is greater than its value at the so-called global minimum. For example, the model has a local minimum at To be certain that the minimum found is the global minimum, the refinement should be started with widely differing initial values of the parameters. When the same minimum is found regardless of starting point, it is likely to be the global minimum. When multiple minima exist there is an important consequence: the objective function will have a maximum value somewhere between two minima. The normal equations matrix is not positive definite at a maximum in the objective function, as the gradient is zero and no unique direction of descent exists. Refinement from a point (a set of parameter values) close to a maximum will be ill-conditioned and should be avoided as a starting point. For example, when fitting a Lorentzian the normal equations matrix is not positive definite when the half-width of the band is zero.7 Other methodsTransformation to a linear modelA non-linear model can sometimes be transformed into a linear one. For example, when the model is a simple exponential function, it can be transformed into a linear model by taking logarithms. The sum of squares becomes This procedure should be avoided if at all possible because it can give misleading results. This comes from the fact that whatever the experimental errors on y might be, the errors on log y are different. Therefore, when the transformed sum of squares is minimized different results will be obtained both for the parameter values and their calculated standard deviations. In practice, models like the exponential model can be fitted by least squares directly to the observed quantities in a spreadsheet, using an optimizer such as SOLVER in EXCEL. Another example is furnished by Michaelis-Menten kinetics, used to determine two parameters Vmax and Km. of 1/v against S is very sensitive to data error and it is strongly biased toward fitting the data in a particular range of the independent variable, S. Gradient methodsThere are many examples in the scientific literature where different methods have been used for non-linear data-fitting problems. Before outlining some of them it is important to recognize that the Gauss-Newton-Marquardt method is by far the most effective method.
There are models for which it is either very difficult or even impossible to derive analytical expressions for the elements of the Jacobian. Then, the numerical approximation is obtained by calculation of
The matrix H is known as the Hessian matrix. Although this model has better convergence properties near to the minimum, it is much worse when the parameters are far from their optimal values. Calculation of the Hessian adds to the complexity of the algorithm. This method is not in general use.
Direct search methodsDirect search methods depend on evaluations of the objective function at a variety of parameter values and do not use derivatives at all. They offer alternatives to the use of numerical derivatives in the Gauss-Newton method. However, only the conjugate gradient method has the quadratic convergence property that the Gauss-Newton method has. .
More detailed descriptions of these, and other, methods are available, in Numerical Recipes, together with computer code in various languages. See alsoReferences
| | |||||||||||||||||