The Nelder-Mead minimization algorithm

It is easy to find the minimum of {f(x,y) = x^2+16y^2} if you are human. For a computer this takes more work:

Search for the minimum of x^2+16y^2

The animation shows a simplified form of the Nelder-Mead algorithm: a simplex-based minimization algorithm that does not use any derivatives of {f}. Such algorithms are easy to come up with for functions of one variable, e.g., the bisection method. But how to minimize a function of two variables?

A natural way to look for the minimum is to slide along the graph in the direction opposite to {\nabla f}; this is the method of steepest descent. But for computational purposes we need a discrete process, not a continuous one. Instead of thinking of a point sliding down, think of a small tetrahedron tumbling down the graph of {f}; this is a discrete process of flips and flops. Each step replaces the triangle of contact with another triangle sharing one of its sides. The triangle is flipped in the direction away from the highest vertex.

This is already a reasonable minimization algorithm: begin with a triangle {T}; find the values of {f} at the vertices of {T}; reflect the triangle away from the highest value; if the reflected point {R} has a smaller value, move there; otherwise stop.

But there’s a problem: the size of the triangle never changes in this process. If {T} is large, we won’t know where the minimum is even if {T} eventually covers it. If {T} is small, it will be moving in tiny steps.

Perhaps, instead of stopping when reflection does not work anymore, we should reduce the size of {T}. It is natural to contract it toward the “best” vertex (the one with the smallest value of {f}), replacing two other vertices with the midpoints of corresponding sides. Then repeat. The stopping condition can be the values of {f} at all vertices becoming very close to one another.
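This reflect-or-shrink scheme can be sketched in Python (a minimal sketch of my own, not the author's code; the test function x^2+16y^2 is the one from the opening):

```python
import numpy as np

def reflect_or_shrink(f, T, tol=1e-6, max_iter=200):
    # Naive simplex descent: reflect the worst vertex away, and when
    # reflection fails, halve the triangle toward the best vertex.
    T = np.array(T, dtype=float)
    for _ in range(max_iter):
        vals = np.array([f(*v) for v in T])
        order = np.argsort(vals)            # sort vertices: best first
        T, vals = T[order], vals[order]
        if vals[-1] - vals[0] < tol:        # values nearly equal: stop
            break
        R = T[0] + T[1] - T[2]              # reflect the worst vertex
        if f(*R) < vals[-1]:
            T[2] = R                        # move away from the worst value
        else:
            T[1:] = (T[1:] + T[0]) / 2      # shrink toward the best vertex
    return T[0]

best = reflect_or_shrink(lambda x, y: x**2 + 16 * y**2,
                         [(1, 1), (2, 1), (1, 2)])
```

Run on {x^2+16y^2}, it creeps toward the origin, but slowly once the triangle has shrunk, exactly as described above.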

This looks clever, but the results are unspectacular. The algorithm is prone to converge to a non-stationary point where just by an accident the triangle attains a nearly horizontal position. The problem is that the triangle, while changing its size, does not change its shape to fit the geometry of the graph of {f}.

The Nelder-Mead algorithm adapts the shape of the triangle by including the possibility of stretching while flipping. Thus, the triangle can grow smaller and larger, moving faster when the path is clear, or becoming very thin to fit into a narrow passage. Here is a simplified description:

  • Begin with some triangle {T}.
  • Evaluate the function {f} at each vertex. Call the vertices {W,G,B} where {W} is the worst one (the largest value of {f}) and {B} is the best.
  • Reflect {W} about the midpoint of the good side {GB}. Let {R} be the reflected point.
  • If {f(R)<f(B)}, then we consider moving even further in the same direction: let {E} be the point obtained by extending the segment {WR} beyond {R} by half the length of {WR}. Choose between {R} and {E} based on where {f} is smaller, and make the chosen point a new vertex of our triangle, replacing {W}.
  • Else, do not reflect and instead shrink the triangle toward {B}.
  • Repeat, stopping when we either exceed the allowed number of iterations or all values of {f} at the vertices of the triangle become nearly equal.

(The full version of the Nelder-Mead algorithm also includes the comparison of {R} with {G}, and also involves trying a point inside the triangle.)

Rosenbrock’s function

This is Rosenbrock’s function {f(x,y)=100(x^2-y)^2 + (x-1)^2}, one of standard torture tests for minimization algorithms. Its graph has a narrow valley along the parabola {y=x^2}. At the bottom of the valley, the incline toward the minimum {(1,1)} is relatively small, compared to steep walls surrounding the valley. The steepest descent trajectory quickly reaches the valley but dramatically slows down there, moving in tiny zig-zagging steps.
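A quick sanity check of these claims (a Python sketch of my own; the function name is mine):

```python
def rosenbrock(x, y):
    # Rosenbrock's function f(x, y) = 100 (x^2 - y)^2 + (x - 1)^2
    return 100 * (x**2 - y)**2 + (x - 1)**2

assert rosenbrock(1, 1) == 0          # the global minimum
# On the valley floor y = x^2 the first term vanishes,
# leaving only the gentle incline (x - 1)^2.
assert rosenbrock(0.5, 0.25) == 0.25
# A small step off the valley floor: the first term dominates (steep walls).
assert rosenbrock(0.5, 0.35) > 1
```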

The algorithm described above gets within {0.001} of the minimum in 65 steps.

Minimizing Rosenbrock’s function

In conclusion, here is Scilab code implementing this algorithm.

function z = f(x,y)                        // Rosenbrock's function
    z = 100*(x.^2-y).^2 + (x-1).^2
endfunction

x = -0.4:0.1:1.6; y = -2:0.1:1.4           // viewing window
[X,Y] = meshgrid(x,y); contour(x,y,f(X,Y)',30)  // contour plot
plot([1],[1],'r+')                         // minimum point
tol = 10^(-6)
n = 0
T = [0, -1.5; 1.4, -1.5; 1.5, 0.5]         // initial triangle
for i = 1:3
    values(i) = f(T(i,1), T(i,2))
end
while %T
    xpoly(T(:,1),T(:,2),'lines',1)         // draw the triangle
    [values, index] = gsort(values)        // sort the values, worst first
    T = T(index,:)
    if values(1)-values(3) < tol           // close enough?
        mfprintf(6, "Minimum at (%.3f, %.3f)", T(3,1), T(3,2))
        break
    end
    R = T(2,:) + T(3,:) - T(1,:)           // reflected point
    fR = f(R(1), R(2))
    if fR < values(3)                      // better than the best so far?
        E = 1.5*T(2,:) + 1.5*T(3,:) - 2*T(1,:)  // extended point
        fE = f(E(1), E(2))
        if fE < fR
            T(1,:) = E; values(1) = fE     // pick extended
        else
            T(1,:) = R; values(1) = fR     // pick reflected
        end
    else
        for i = 1:2
            T(i,:) = (T(i,:)+T(3,:))/2     // shrink toward the best vertex
            values(i) = f(T(i,1), T(i,2))
        end
    end
    n = n+1
    if n >= 200
        disp('Failed to converge'); break  // too bad
    end
end

3 calculus 3 examples

The function {f(x,y)=\dfrac{xy}{x^2+y^2}} might be the world’s most popular example demonstrating that the existence of partial derivatives does not imply differentiability.


But in my opinion, it is somewhat extreme and potentially confusing, with discontinuity added to the mix. I prefer

\displaystyle  f(x,y)=\frac{xy}{\sqrt{x^2+y^2}}

pictured below.


This one is continuous. In fact, it is Lipschitz continuous because the first-order partials {f_x} and {f_y} are bounded. The restriction of {f} to the line {y=x} is {f(x,x)=x^2/\sqrt{2x^2} = |x|/\sqrt{2}}, which is a familiar single-variable example of a nondifferentiable function.
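A numeric check (a Python sketch of my own) of the restriction to the line {y=x}:

```python
import math

def f(x, y):
    # f(x, y) = x y / sqrt(x^2 + y^2), extended by f(0, 0) = 0
    return 0.0 if x == y == 0 else x * y / math.sqrt(x**2 + y**2)

# On the line y = x the function equals |x| / sqrt(2), which has a
# corner at the origin: no differentiability despite bounded partials.
for x in (0.5, 0.1, -0.1, -0.5):
    assert math.isclose(f(x, x), abs(x) / math.sqrt(2))

# Yet both partial derivatives at the origin exist and equal 0,
# since f vanishes identically on each coordinate axis.
assert f(0.001, 0) == 0 and f(0, 0.001) == 0
```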

To unify the analysis of such examples, let {f(x,y)=xy\,g(x^2+y^2)}. Then

\displaystyle    f_x = y g+ 2x^2yg'

With {g(t)=t^{-1/2}}, where {t=x^2+y^2}, we get

\displaystyle    f_x = O(t^{1/2}) t^{-1/2} + O(t^{3/2})t^{-3/2} = O(1),\quad t\rightarrow 0

By symmetry, {f_y} is bounded as well.

My favorite example from this family is more subtle, with a deceptively smooth graph:

Looks like xy

The formula is

\displaystyle    f(x,y)=xy\sqrt{-\log(x^2+y^2)}

Since {f} decays almost quadratically near the origin, it is differentiable at {(0,0)}. Indeed, the first order derivatives {f_x} and {f_y} are continuous, as one may observe using {g(t)=\sqrt{-\log t}} above.

And the second-order partials {f_{xx}} and {f_{yy}} are also continuous, if just barely. Indeed,

\displaystyle    f_{xx} = 6xy g'+ 4x^3yg''

Since the growth of {g} is sub-logarithmic, it follows that {g'(t)=o(t^{-1})} and {g''(t)=o(t^{-2})}. Hence,

\displaystyle    f_{xx} = O(t) o(t^{-1}) + O(t^{2}) o(t^{-2}) = o(1),\quad t\rightarrow 0

So, {f_{xx}(x,y)\rightarrow 0 = f_{xx}(0,0)} as {(x,y)\rightarrow (0,0)}. Even though the graph of {f_{xx}} looks quite similar to the first example in this post, this one is continuous. Can’t trust these plots.

Despite its appearance, f_{xx} is continuous

By symmetry, {f_{yy}} is continuous as well.

But the mixed partial {f_{xy}} does not exist at {(0,0)}, and tends to {+\infty} as {(x,y)\rightarrow (0,0)}. The first claim is obvious once we notice that {f_x(0,y)= y\, g(y^2)} and {g} blows up at {0}. The second one follows from

\displaystyle    f_{xy} = g + 2(x^2+y^2) g' + 4x^2y^2 g''

where {g\rightarrow\infty} while the other two terms tend to zero, as in the estimate for {f_{xx}}. Here is the graph of {f_{xy}}.

Up, up and away
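Both claims can be checked numerically with the formulas for {f_{xx}} and {f_{xy}} above. In this Python sketch (my own; the expressions for {g'} and {g''} are computed by hand from {g(t)=\sqrt{-\log t}}), {|f_{xx}|} shrinks along the line {y=x} while {f_{xy}} grows:

```python
import math

def g(t):   # g(t) = sqrt(-log t), defined for 0 < t < 1
    return math.sqrt(-math.log(t))

def g1(t):  # g'(t)
    return -1 / (2 * t * math.sqrt(-math.log(t)))

def g2(t):  # g''(t)
    L = -math.log(t)
    return 1 / (2 * t**2 * math.sqrt(L)) - 1 / (4 * t**2 * L**1.5)

def f_xx(x, y):  # f_xx = 6 x y g' + 4 x^3 y g''
    t = x*x + y*y
    return 6*x*y*g1(t) + 4*x**3*y*g2(t)

def f_xy(x, y):  # f_xy = g + 2 (x^2 + y^2) g' + 4 x^2 y^2 g''
    t = x*x + y*y
    return g(t) + 2*t*g1(t) + 4*x*x*y*y*g2(t)

# Along y = x: |f_xx| creeps toward 0 (only logarithmically slowly,
# hence "if just barely"), while f_xy grows without bound.
assert abs(f_xx(1e-8, 1e-8)) < abs(f_xx(1e-2, 1e-2))
assert f_xy(1e-8, 1e-8) > f_xy(1e-2, 1e-2) > 0
```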

This example is significant for the theory of partial differential equations, because it shows that a solution of the Poisson equation {f_{xx}+f_{yy} = h } with continuous {h} may fail to be in {C^2} (twice differentiable, with continuous derivatives). The expected gain of two derivatives does not materialize here.

The situation is rectified by upgrading the continuity condition to Hölder continuity. Then {f} indeed gains two derivatives: if {h\in C^\alpha} for some {\alpha\in (0,1)}, then {f\in C^{2,\alpha}}. In particular, the Hölder continuity of {f_{xx} } and {f_{yy} } implies the Hölder continuity of {f_{xy} }.

Recognizing a quadric surface from its traces

A simple “calculus” (but really analytic geometry) exercise is to find the traces of a given surface in coordinate planes. Just take the surface equation F(x,y,z)=0, set, for example, y=0 in it, maybe simplify and voilà—you have the trace in the xz-plane. Calculus textbook writers rely on quadric surfaces, in which case the intersections are (possibly degenerate) quadrics.

Can the process be reversed? That is, can we determine a quadric surface from its coordinate traces?

Let’s begin in one dimension lower: can we determine a quadric curve from its intersections with coordinate axes?

Of course not! Infinitely many different ellipses pass through the same four points on the coordinate axes.

There is an obvious algebraic reason too: adding xy to the equation has no effect on the traces. The correct way to go one dimension lower is to also go one degree lower: that is, determine a line from its intercepts. This, of course, is a famous problem in precalculus. The solution is known: a line is uniquely determined if and only if it does not pass through the origin.

Going back to the original problem, we notice that there is no hope of identifying the surface if the traces are empty, as they can easily be for an ellipsoid. However, even three nonempty traces do not always determine the surface. Indeed, imagine that the ellipses shown above lie in the plane z=1, and use them to build a paraboloid (or a cone) with a vertex at (0,0,0). All three traces are nonempty, and they do not depend on the shape of the ellipse.

Apparently, we must assume more: all three traces are nondegenerate quadrics, namely circles/ellipses, parabolas or hyperbolas. Can we recover the surface now? The answer is yes but the proof is not entirely trivial. It naturally splits into two cases, depending on whether the surface passes through the origin. (We can tell if it does by looking at any of the traces.)

Case 1: the surface does not pass through the origin. Then we look for its equation in the form p(x,y,z)=1 where p is a polynomial with zero constant term. Correspondingly, we write the equations of traces as f(x,y)=1, g(x,z)=1 and h(y,z)=1. Now, p-f vanishes on the xy-trace and at the origin, hence on the entire xy-plane. This means that p-f involves only the terms divisible by z: namely, xz, z^2, and yz. Of these, the first two are found by inspecting g and the last one by inspecting h. The process by which we obtain p from f,g,h can be described as merging the polynomials: we take the union of all monomials they contain. Something similar happens in calculus when students are asked to recover a function from its partial derivatives.
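The merging step of Case 1 can be illustrated with a small Python sketch (the quadric p here is a made-up example, not one from the text), representing polynomials as dictionaries from exponent triples to coefficients:

```python
# Monomials encoded as exponent triples (i, j, k) for x^i y^j z^k.
# A hypothetical quadric p(x, y, z) = 1 with zero constant term,
# hence not passing through the origin:
p = {(2, 0, 0): 1, (0, 2, 0): 2, (0, 0, 2): 3,
     (1, 1, 0): 1, (1, 0, 1): 1, (0, 1, 1): 1, (1, 0, 0): 1}

def trace(poly, axis):
    """Set one variable to zero: keep monomials with zero exponent on `axis`."""
    return {e: c for e, c in poly.items() if e[axis] == 0}

f = trace(p, 2)   # xy-trace: z = 0
g = trace(p, 1)   # xz-trace: y = 0
h = trace(p, 0)   # yz-trace: x = 0

# Merging: take the union of all monomials of f, g, h. Every monomial of a
# quadric with zero constant term involves at most two variables, so it
# survives in at least one trace, and the merge recovers p.
merged = {**f, **g, **h}
assert merged == p
```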

Case 2: the surface passes through the origin. Now we look for its equation in the form p(x,y,z)=0 where p still has zero constant term. The obvious issue is that 0 is not as useful for normalization as 1. Having written the trace equations as f(x,y)=0, g(x,z)=0, and h(y,z)=0, we still don’t know if we got their scaling right. The idea is then to insist that at the origin f_x=g_x, f_y=h_y, and g_z=h_z, which must be the case if f,g,h are scaled to fit together into p.

Case 2a: Two of the equations with derivatives are 0=0. This means that one of three functions, say f, has zero gradient at the origin. However, we ruled out this possibility by assuming that all traces are nondegenerate quadrics.

Case 2b: At most one of the equations is 0=0. The other two give us enough information to determine the ratios of coefficients with which f,g,h appear in p. Actually, we can just say that f appears with coefficient 1, and scale g,h accordingly. Now we merge f, g and h into p as in Case 1. QED

This is not as satisfactory as the identification of a line by its intercepts, because the condition (nondegenerate traces) is not necessary. For example, the right circular cylinder x^2+y^2=1 is determined by its traces, even though two of them are degenerate quadrics. Two questions:

  • Is there a clean necessary and sufficient condition for unique determination of a quadric surface by its traces?
  • What happens for hypersurfaces of degree n-1 in \mathbb R^n?

How much multivariable calculus can be done along curves?

Working with functions of two (or more) real variables is significantly harder than with functions of one variable. It is tempting to reduce the complexity by considering the restrictions of a multivariate function to lines passing through a point of interest. But standard counterexamples of Calculus III, such as \displaystyle f(x,y)=\frac{xy^2}{x^2+y^4}, f(0,0)=0, show that lines are not enough: this function f is not continuous at (0,0), even though its restriction to every line is continuous. It takes a parabola, such as x=y^2, to detect the discontinuity.
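A numerical illustration of this counterexample (a Python sketch of my own):

```python
import math

def f(x, y):
    # The standard counterexample f(x, y) = x y^2 / (x^2 + y^4), f(0, 0) = 0.
    return 0.0 if x == y == 0 else x * y**2 / (x**2 + y**4)

# Along the line y = 2x the values vanish: f(t, 2t) = 4t / (1 + 16 t^2) -> 0.
for t in (0.1, 0.01, 0.001):
    assert abs(f(t, 2 * t)) <= 4 * t

# But along the parabola x = y^2 the value is identically 1/2,
# so f is not continuous at the origin.
for y in (0.1, 0.01, 0.001):
    assert math.isclose(f(y**2, y), 0.5)
```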

Things look brighter if we do allow parabolas and other curves into consideration.

Continuity: f is continuous at a\in\mathbb R^n if and only if f\circ \gamma is continuous at 0 for every map \gamma\colon \mathbb R\to \mathbb R^n such that \gamma(0)=a and \gamma is continuous at 0.

Proof: If f is not continuous, we can find a sequence a_n\to a such that f(a_n)\not\to f(a), and run \gamma through these points, for example in a piecewise linear way.

Having been successful at the level of continuity, we can hope for a similar differentiability result:

Differentiability, take 1: f is differentiable at a\in\mathbb R^n if and only if f\circ \gamma is differentiable at 0 for every map \gamma\colon \mathbb R\to \mathbb R^n such that \gamma(0)=a and \gamma'(0) exists.

Alas, this is false. Take a continuous function g\colon S^{n-1}\to \mathbb R which preserves antipodes (i.e., g(-x)=-g(x)) and extend it to \mathbb R^n via f(tx)=tg(x). Consider \gamma as above, with a\in \mathbb R^n being the origin. If \gamma'(0)=0, then (f\circ \gamma)'(0)=0 because f is Lipschitz. If \gamma'(0)\ne 0, we can rescale the parameter so that \gamma'(0) is a unit vector. It is easy to see that \displaystyle \frac{f(\gamma(t))}{t}= \frac{f(\gamma(t))}{|\gamma(t)|\mathrm{sign}\,t}\, \frac{|\gamma(t)|}{|t|}\to g(\gamma'(0)), hence f\circ \gamma is differentiable at 0. However, f is not differentiable at a unless g happens to be the restriction of a linear map.

I can’t think of a way to detect the nonlinearity of directional derivative by probing f with curves. Apparently, it has to be imposed artificially.

Differentiability, take 2: f is differentiable at a\in\mathbb R^n if and only if there exists a linear map T such that (f\circ \gamma)'(0)=T\gamma'(0) for every map \gamma\colon \mathbb R\to \mathbb R^n such that \gamma(0)=a and \gamma'(0) exists.

Note that the only viable candidate for T is given by the partial derivatives, and those are computed along lines. Thus, we are able to determine the first-order differentiability of f using only the tools of single-variable calculus.

Proof goes along the same lines as for continuity, with extra care taken in forming \gamma.

  1. We may assume that T=0 by subtracting Tx from our function. Also assume a=0.
  2. Suppose f is not differentiable at 0. Pick a sequence v_k\to 0 such that |f(v_k)|\ge \epsilon |v_k| for all k.
  3. Passing to a subsequence, make sure that v_k/|v_k| tends to a unit vector v, and also that |v_{k+1}|\le 2^{-k}|v_k|.
  4. Connect the points v_k by line segments. Parametrize this piecewise-linear curve by arc length.
  5. The distance from v_{k+1} to v_k is bounded by |v_{k+1}|+|v_k|\le (1+2^{-k})|v_k| by the triangle inequality. Hence, the total length between 0 and v_k does not exceed \sum_{m\ge k}(1+2^{-m})|v_m| \le (1+c_k)|v_k|, where c_k\to 0 as k\to \infty.
  6. By 3, 4, and 5 the constructed curve \gamma has a one-sided derivative when it reaches 0. Shift the parameter so that \gamma(0)=0. Extend \gamma linearly to get two-sided derivative at 0.
  7. By assumption, |f(\gamma (t))|/|t|\to 0 as t\to 0. This contradicts 2 and 5.

Can one go further and detect second-order differentiability by probing f with paths? The trouble is that the second derivative is not a pointwise asymptotic condition: it requires the first derivative to exist in a neighborhood. The pointwise second derivative might be possible to detect, but I’m not sure… and it’s getting late.