Measuring nonlinearity and reducing it

How to measure the nonlinearity of a function {f\colon I\to \mathbb R} where {I\subset \mathbb R} is an interval? A natural way is to consider the smallest possible deviation from a line {y=kx+b}, that is {\inf_{k, b}\sup_{x\in I}|f(x)-kx-b|}. It turns out to be convenient to divide this by {|I|}, the length of the interval {I}. So, let {\displaystyle NL(f;I) = \frac{1}{|I|} \inf_{k, b}\sup_{x\in I}|f(x)-kx-b|}. (This is similar to β-numbers of Peter Jones, except the deviation from a line is measured only in the vertical direction.)

NL-def
NL(f; I) is the maximal vertical distance between red and black, divided by the length of I

Relation with derivatives

The definition of derivative immediately implies that if {f'(a)} exists, then {NL(f;I)\to 0} as {I} shrinks to {a} (that is, gets smaller while containing {a}). A typical construction of a nowhere differentiable continuous function is based on making {NL(f;I)} bounded from below; it is enough to do this for dyadic intervals, and that can be done by adding wiggly terms like {2^{-n}\mathrm{dist}\,(x, 2^{-n}\mathbb Z)}: see the blancmange curve.

The converse is false: if {NL(f; I)\to 0} as {I} shrinks to {a}, the function {f} may still fail to be differentiable at {a}. The reason is that the affine approximation may have different slopes at different scales. An example is {f(x)=x \sin \sqrt{-\log |x|}} in a neighborhood of {0}. Consider a small interval {[-\delta, \delta]}. The line {y = kx} with {k=\sin\sqrt{-\log \delta}} is a good approximation to {f} because {f(x)/x\approx k} on most of the interval except for a very small part near {0}, and on that part {f} is very close to {0} anyway.

Why the root of logarithm? Because {\sin \log |x|} has a fixed amount of change on a fixed proportion of  {[-\delta, \delta]}, independently of {\delta}. We need a function slower than the logarithm, so that as {\delta} decreases, there is a smaller amount of change on a larger part of the interval {[-\delta, \delta]}.

Nonlinearity of Lipschitz functions

Suppose {f} is a Lipschitz function, that is, there exists a constant {L} such that {|f(x)-f(y)|\le L|x-y|} for all {x, y\in I}. It’s easy to see that {NL(f;I)\le L/2}, by taking the mid-range approximation {y=\frac12 (\max_I f + \min_I f)}. But the sharp bound is {NL(f;I)\le L/4} whose proof is not as trivial. The sharpness is shown by {f(x)=|x|} with {I=[-1,1]}.

sharpness
With the maximum difference 1/2 and the length of interval 2, we get NL(f; [-1, 1]) = 1/4
Proof. Let {k} be the slope of the linear function that agrees with {f} at the endpoints of {I}. Subtracting this linear function from {f} gives us a Lipschitz function {g} such that {-L-k\le g'\le L-k} and {\int_I g'= 0}. Let {A = \int_I (g')^+ = \int_I (g')^-}. Chebyshev’s inequality gives lower bounds for the measures of the sets {g'>0} and {g'<0}: namely, {|g'>0|\ge A/(L-k)} and {|g'<0|\le A/(L+k)}. By adding these, we find that {|I| \ge 2LA/(L^2-k^2)\ge 2A/L}. Since {\max _I g - \min_I g \le A}, the mid-range approximation to {g} has error at most {A/2 \le |I|L/4}. Hence {NL(f; I) = NL(g; I) \le L/4}.

Reducing nonlinearity

Turns out, the graph of every Lipschitz function has relatively large almost-flat pieces.  That is, there are subintervals of nontrivial size where the measure of nonlinearity is much smaller than the Lipschitz constant. This result is a special (one-dimensional) case of Theorem 2.3 in Affine approximation of Lipschitz functions and nonlinear quotients by Bates, Johnson, Lindenstrauss, Preiss, and Schechtman.

Theorem AA (for “affine approximation”): For every {\epsilon>0} there exists {\delta>0} with the following property. If {f\colon I\to \mathbb R} is an {L}-Lipschitz function, then there exists an interval {J\subset I} with {|J|\ge \delta |I|} and {NL(f; J)\le \epsilon L}.

Theorem AA should not be confused with Rademacher’s theorem which says that a Lipschitz function is differentiable almost everywhere. The point here is a lower bound on the size of the interval {J}. Differentiability does not provide that. In fact, if we knew that {f} is smooth, or even a polynomial, the proof of Theorem AA would not become any easier.

Proof of Theorem AA

We may assume {I=[-1, 1]} and {L=1}. For {t\in (0, 2]} let {L(t) = \sup \{|f(x)-f(y)|/|x-y| \colon x, y\in I, \ |x-y|\ge t\}}. That is, {L(t)} is the restricted Lipschitz constant, one that applies for distances at least {t}. It is a decreasing function of {t}, and {L(0+)=1}.

Note that {|f(-1)-f(1)|\le 2L(1)} and that every value of {f} is within {2L(1)} of either {f(-1)} or {f(1)}. Hence, the oscillation of {f} on {I} is at most {6L(1)}. If {L(1) \le \epsilon/3}, then the constant mid-range approximation on {I} gives the desired conclusion, with {J=I}. From now on {L(1) > \epsilon/3}.

The sequence {L_k = L(4^{-k})} is increasing toward {L(0+)=1}, which implies {L_{k+1}\le (1+\epsilon) L_k} for some {k}. Pick an interval {[a, b]\subset I} that realizes {L_k}, that is {b-a\ge 4^{-k}} and {|f(b)-f(a)| = 4^{-k}L_k}. Without loss of generality {f(b)>f(a)} (otherwise consider {-f}). Let {J = [(3a+b)/4, (a+3b)/4]} be the middle half of {[a. b]}. Since each point of {J} is within distance {\ge 4^{-k-1}} of both {a} and {b}, it follows that {\displaystyle f(b) + L_{k+1}(x-b) \le f(x) \le f(a) + L_{k+1}(x-a) } for all {x \in J}.

proof
The green lines have slope L_{k+1} which is close to the slope L_k of the secant line through (a, f(a)) and (b, f(b)). The graph of f is pinched between these green lines, except near a or b.

So far we have pinched {f} between two affine functions of equal slope. Let us consider their difference:
{\displaystyle (f(a) + L_{k+1}(x-a)) - (f(b) + L_{k+1}(x-b)) = (L_{k+1}-L_k) (b-a)}. Recall that {L_{k+1}\le (1+\epsilon) L_k}, which gives a bound of {\epsilon L_k(b-a) \le 2\epsilon L |J|} for the difference. Approximating {f} by the average of the two affine functions we conclude that {NL(f;J)\le \epsilon L} as required.

It remains to consider the size of {J}, about which we only know {|J|\ge 4^{-k}/2} so far. Naturally, we want to take the smallest {k} such that {L_{k+1}\le (1+\epsilon) L_k} holds. Let {m} be this value; then {L_m > (1+\epsilon)^{m} L_0}. Here {L_m\le 1} and {L_0 = L(1)> \epsilon/3 }. The conclusion is that {(1+\epsilon)^m < 3/\epsilon}, hence {m< \log(3/\epsilon)/\log(1+\epsilon)}. This finally yields {\displaystyle \delta = 4^{-\log(3/\epsilon)/\log(1+\epsilon)}/2} as an acceptable choice, completing the proof of Theorem AA.

A large amount of work has been done on quantifying {\delta} in various contexts; for example Heat flow and quantitative differentiation by Hytönen and Naor.

Between a cubic spline and the line of regression

Given some data points {(x_i,y_i)} one can devise various functions that extract information from them. Such as

  • Regression line (considered in When the digits of {\pi} go to 11): it detects a linear trend {y=ax+b}, minimizing the sum of squares of residuals {y_i-(ax_i+b)}.
  • Natural cubic spline (considered in Connecting dots naturally), which passes through every data point while minimizing the amount of wiggling, measured by the square of its second derivative. Like this:
Natural cubic spline
Natural cubic spline

A smoothing spline is something in between the above extremes: it insists on neither being a line (i.e., having zero second derivative) nor on passing through given points (i.e., having zero residuals). Instead, it minimizes a weighted sum of both things: the integral of second derivative squared, and the sum of residuals squared. Like this plot, where the red points are given but the spline chooses to interpolate the green ones:

Smoothing spline
Smoothing spline

I’ll demonstrate a version of a smoothing spline that might not be exactly canonical, but is very easy to implement in Matlab or Scilab (which I prefer to use). As in Connecting dots naturally, assume the knots {a=x_0<x_1<\dots<x_n=b} equally spaced at distance {h=(b-a)/n}. The natural cubic spline is determined by the values of its second derivative at {x_1,\dots,x_{n-1}}, denoted {z_1,\dots,z_n}. As explained earlier, these can be found from the linear system

\displaystyle    \frac{h}{6}\begin{pmatrix} 4 & 1 & 0 & 0 & \ldots & 0 & 0 \\ 1 & 4 & 1 & 0 & \ldots & 0 & 0 \\ 0 & 1 & 4 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \ldots & \ldots \\ 0 & 0 & 0 & 0 & \ldots & 1 & 4 \end{pmatrix} \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_{n-1}\end{pmatrix}   = - \frac{1}{h} \begin{pmatrix} \Delta^2 y_1 \\ \Delta^2 y_2 \\ \vdots \\ \Delta^2 y_{n-1}\end{pmatrix}

where the column on the right contains the amounts by which the derivative of the piecewise linear interpolant through given points jumps at every knot. The notation {\Delta^2y } means the second difference of {(y_i)}, for example {\Delta^2y_1 =y_2-2y_1+y_0}.

A smoothing spline is also a natural spline, but for a different set of points {(x_i,\tilde y_i)}. One has to find {\tilde y_i} that minimize a weighted sum of {\sum_i (\tilde y_i-y_i)^2} and of {\int_a^b (f''(x))^2\,dx}. The latter integral is easily computed in terms of {z_i}: it is equal to {\frac{h}{3}\sum_{i=1}^{n}(z_i^2+z_{i-1}^2+z_iz_{i-1})}. Since this quadratic form is comparable to {\sum_{i=1}^{n}z_i^2}, I’ll work with the latter instead.

The idea is to set up an underdetermined system for {z_i} and {\tilde y_i-y_i}, and let Scilab find a least squares solution of that. Let’s introduce a weight parameter {\lambda>0} that will determine the relative importance of curvature vs residuals. It is convenient to let {\tilde y_i=y_i+\lambda h^2 \delta_i}, so that both {\delta_i} and {z_i} (second derivative) scale the same way. The terms {\delta_i} contributes to the linear system for {z_i}, since the right hand side now has {\tilde y_i} instead of {y_i}. This contribution is {- \lambda h \Delta^2 \delta}. Moving it to the left hand-side (since {\delta_i} are unknowns) we obtain the following system.

\displaystyle    \begin{pmatrix} A & | & B \end{pmatrix} \begin{pmatrix} z \\ \delta \end{pmatrix}   = - \frac{1}{h} \Delta^2 y

where {A} is the same tridiagonal matrix as above, and {B} is the rectangular Laplacian-type matrix

\displaystyle    B = \frac{\lambda}{h} \begin{pmatrix} -1 & 2 & -1 & 0 & \ldots & 0 & 0 \\ 0 & -1 & 2 & -1 & \ldots & 0 & 0 \\ 0 & 0 & -1 & 2 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \ldots & \ldots \\ 0 & 0 & 0 & 0 & \ldots & 2 & -1 \end{pmatrix}

All in all, the system has {2n } unknowns {z_1,\dots,z_{n-1},\delta_0,\dots,\delta_n}, and {(n-1)} equations, reflecting the continuity of first derivative at each interior knot. The lsq command of Scilab finds the solution with the least sum of squares of the unknowns, which is what we are after.

Time for some examples. {\lambda=0} and {\lambda=0.05} can be seen above. Here are more:

lambda = 0.1
lambda = 0.1
lambda = 1
lambda = 1

As {\lambda} increases, the spline approaches the regression line:

lambda = 10
lambda = 10
lambda=100
lambda=100

Finally, the Scilab code. It is almost the same as for natural spline; the difference is in five lines from B=... to newy=... The code after that is merely evaluation and plotting of the spline.

a = 0; b = 10
y = [3 1 4 1 5 9 2 6 5 3 5]
lambda = 0.1

n = length(y)-1
h = (b-a)/n
jumps = diff(y,2)/h
A = (h/6)*(diag(4*ones(1,n-1))+diag(ones(1,n-2),1)+diag(ones(1,n-2),-1))

B = lambda/h*(diag(-2*ones(1,n+1))+diag(ones(1,n),1)+diag(ones(1,n),-1))
C = [A, B(2:$-1,:)]
sol = lsq(C,-jumps')'
z = [0,sol(1:n-1),0]
newy = y + lambda*h^2*sol(n:$)

allx = []
spl = []
for i=1:n  
   xL = a+h*(i-1)
   xR = a+h*i
   x = linspace(xL,xR,100)
   linear = newy(i)*(xR-x)/h + newy(i+1)*(x-xL)/h
   correction = ((z(i+1)+2*z(i))*(xR-x)+(2*z(i+1)+z(i))*(x-xL)).*(xR-x).*(x-xL)/(6*h)
   allx = [allx, x]   
   spl = [spl, linear+correction] 
end

plot(allx, spl) 
plot(a:h:b, newy, 'g*')
plot(a:h:b, y, 'r*')

Trigonometric approximation and the Clenshaw-Curtis quadrature

Trying to approximate a generic continuous function on {[-1,1]} with the Fourier trigonometric series of the form {\sum_n (A_n\cos \pi nx+B_n\sin \pi nx)} is in general not very fruitful. Here’s such an approximation to {f(x)=\exp(x)}, with the sum over {n\le 4}:

Poor approximation to exp(x)
Poor approximation to exp(x)

It’s better to make a linear change of variable: consider {f(2x-1)} on the interval {[0,1]}, and use the formula for the cosine series. This results in {\exp(2x-1)}, which is approximated by the partial sum {\sum_{n=0}^4 A_n\cos \pi nx} of its cosine Fourier series as follows.

Better approximation after a change of variable
Better approximation after a change of variable

But one can do much better with a different, nonlinear change of variable. Consider {f(\cos x)} on the interval {[0,\pi]}, and again use the formula for the cosine series. This results in {\exp(\cos x)}, which is approximated by the partial sum {\sum_{n=0}^4 A_n\cos nx} of its cosine Fourier series as follows.

Excellent approximation after nonlinear change of variable
Excellent approximation after nonlinear change of variable

Yes, I can’t see any difference either: the error is less than {10^{-3}}.

The composition with cosine improves approximation because {f(\cos t)} is naturally a periodic function, with no jumps or corners in its graph. Fourier series, which are periodic by nature, follow such functions more easily.


A practical implication of this approximation of {f(\cos t)} is the Clenshaw-Curtis integration method. It can be expressed in one line:

\displaystyle    \int_{-1}^1 f(x)\,dx = \int_0^\pi f(\cos t)\sin t\,dt \approx \int_0^\pi \sum_{n=0}^N a_n \cos nt \sin t\,dt   = \sum_{n=0}^N \frac{(1+(-1)^n) a_n}{1-n^2}

The integral {\int_{-1}^1 f(x)\,dx} is approximated by summing {2a_{2k}/(1-4k^2)}, where {a_{2k}} are even-numbered cosine coefficients of {f(\cos t)}. In the example with {f(x)=\exp(x)} using just three coefficients yields

\displaystyle    \frac{2a_0}{1}+\frac{2a_2}{-3}+\frac{2a_4}{-15} \approx 2.350405

while the exact integral is {\approx 2.350402}.


At first this doesn’t look practical at all: after all, the Fourier coefficients are themselves found by integration. But since {f(\cos t)} is so close to a trigonometric polynomial, one can sample it at equally spaced points and apply the Fast Fourier transform to the result, quickly obtaining many coefficients at once. This is what the Clenshaw-Curtis quadrature does (at least in principle, the practical implementation may streamline these steps.)

B-splines and probability

If one picks two real numbers {X_1,X_2} from the interval {[0,1]} (independent, uniformly distributed), their sum {S_2=X_1+X_2} has the triangular distribution.

Also known as the hat function
Also known as the hat function

The sum {S_3} of three such numbers has a differentiable probability density function:

Piecewise quadratic, C1 smooth
Piecewise quadratic, C1 smooth

And the density of {S_4=X_1+X_2+X_3+X_4} is smoother still: the p.d.f. has two
continuous derivatives.

This begins to look normal
This begins to look normal

As the number of summands increases, these distributions converge to normal if they are translated and scaled properly. But I am not going to do that. Let’s keep the number of summands to four at most.

The p.d.f. of {S_n} is a piecewise polynomial of degree {n-1}. Indeed, for {S_1=X_1} the density is piecewise constant, and the formula

\displaystyle  S_n(x) = \int_{x-1}^x S_{n-1}(t)\,dt

provides the inductive step.

For each {n}, the translated copies of function {S_n} form a partition of unity:

\displaystyle  \sum_{k\in\mathbb Z}S_n(x-k)\equiv 1

The integral recurrence relation gives an easy proof of this:

\displaystyle  \sum_{k\in\mathbb Z}\int_{x-k-1}^{x-k} S_{n-1}(t)\,dt = \int_{\mathbb R} S_{n-1}(t)\,dt = 1

And here is the picture for the quadratic case:

Piecewise quadratic partition of unity
Piecewise quadratic partition of unity

A partition of unity can be used to approximate functions by piecewise polynomials: just multiply each partition element by the value of the function at the center of the corresponding interval, and add the results.

Doing this with {S_2} amounts to piecewise linear interpolation: the original function {f(x)=e^{-x/2}} is in blue, the weighted sum of hat functions in red.

Using the function exp(-x/2)
PL interpolation of exp(-x/2)

With {S_4} we get a smooth curve.

Approximating exp(-x/2) with a cubic B-spline
Approximating exp(-x/2) with a cubic B-spline

Unlike interpolating splines, this curve does not attempt to pass through the given points exactly. However, it has several advantages over interpolating splines:

  • Is easier to calculate; no linear system to solve;
  • Yields positive function for positive data;
  • Yields monotone function for monotone data

Linear approximation and differentiability

If a function {f\colon \mathbb R\rightarrow \mathbb R} is differentiable at {a\in \mathbb R}, then it admits good linear approximation at small scales. Precisely: for every {\epsilon>0} there is {\delta>0} and a linear function {\ell(x)} such that {|f(x)-\ell(x)|<\epsilon \,\delta} for all {|x|<\delta}. Having {\delta} multiplied by {\epsilon} means that the deviation from linearity is small compared to the (already small) scale {\delta} on which the function is considered.

For example, this is a linear approximation to {f(x)=e^x} near {0} at scale {\delta=0.1}.

Linear approximation to exponential function
Linear approximation to exponential function

As is done on this graph, we can always take {\ell} to be the secant line to the graph of {f} based on the endpoints of the interval of consideration. This is because if {L} is another line for which {|f(x)-L(x)|<\epsilon \,\delta} holds, then {|\ell-L|\le \epsilon \,\delta} at the endpoints, and therefore on all of the interval (the function {x\mapsto |\ell(x)-L(x)|} is convex).


Here is a non-differentiable function that obviously fails the linear approximation property at {0}.

Self-similar graph
Self-similar graph

(By the way, this post is mostly about me trying out SageMathCloud.) A nice thing about {f(x)=x\sin \log |x|} is self-similarity: {f(rx)=rf(x)} with the similarity factor {r=e^{2\pi}}. This implies that no matter how far we zoom in on the graph at {x=0}, the graph will not get any closer to linear.

I like {x\sin \log |x|} more than its famous, but not self-similar, cousin {x\sin(1/x)}, pictured below.

Standard example from intro to real analysis
Standard example from intro to real analysis

Interestingly, linear approximation property does not imply differentiability. The function {f(x)=x\sin \sqrt{-\log|x|}} has this property at {0}, but it lacks derivative there since {f(x)/x} does not have a limit as {x\rightarrow 0}. Here is how it looks.

Now with the square root!
Now with the square root!

Let’s look at the scale {\delta=0.1}

scale 0.01
scale 0.1

and compare to the scale {\delta=0.001}

scale 0.001
scale 0.001

Well, that was disappointing. Let’s use math instead. Fix {\epsilon>0} and consider the function {\phi(\delta)=\sqrt{-\log \delta}-\sqrt{-\log (\epsilon \delta)}}. Rewriting it as

\displaystyle    \frac{\log \epsilon}{\sqrt{-\log \delta}+\sqrt{-\log (\epsilon \delta)}}

shows {\phi(\delta)\rightarrow 0} as {\delta\rightarrow 0}. Choose {\delta} so that {|\phi(\delta)|<\epsilon} and define {\ell(x)=x\sqrt{-\log \delta}}. Then for {\epsilon \,\delta\le |x|< \delta} we have {|f(x)-\ell(x)|\le \epsilon |x|<\epsilon\,\delta}, and for {|x|<\epsilon \delta} the trivial bound {|f(x)-\ell(x)|\le |f(x)|+|\ell(x)|} suffices.

Thus, {f} can be well approximated by linear functions near {0}; it’s just that the linear function has to depend on the scale on which approximation is made: its slope {\sqrt{-\log \delta}} does not have a limit as {\delta\to0}.

The linear approximation property does not become apparent until extremely small scales. Here is {\delta = 10^{-30}}.

Scale 1e-30
Scale 1e-30

Squarish polynomials

For some reason I wanted to construct polynomials approximating this piecewise constant function {f}:

How to approximate this with polynomials?
So square

Of course approximation cannot be uniform, since the function is not continuous. But it can be achieved in the sense of convergence of graphs in the Hausdorff metric: their limit should be the “graph” shown above, with the vertical line included. In concrete terms, this means for every {\epsilon>0} there is {N} such that for {n\ge N} the polynomial {p_n} satisfies

\displaystyle  |p_n-f|\le \epsilon\quad \text{ on }\ [0,2]\setminus [1-\epsilon,1+\epsilon]

and also

\displaystyle  -\epsilon\le  p_n  \le 1+\epsilon\quad \text{ on }\ [1-\epsilon,1+\epsilon]

How to get such {p_n} explicitly? I started with the functions {f_m(x) = \exp(-x^m)} when {m} is large. The idea is that as {m\rightarrow\infty}, the limit of {\exp(-x^m)} is what is wanted: {1} when {x<1}, {0} when {x>1}. Also, for each {m} there is a Taylor polynomial {T_{m,n}} that approximates {f_m} uniformly on {[0,2]}. Since the Taylor series is alternating, it is not hard to find suitable {n}. Let’s shoot for {\epsilon=0.01} in the Taylor remainder and see where this leads:

  • Degree {7} polynomial for {\exp(-x)}
  • Degree {26} polynomial for {\exp(-x^2)}
  • Degree {69} polynomial for {\exp(-x^3)}
  • Degree {180} polynomial for {\exp(-x^4)}
  • Degree {440} polynomial for {\exp(-x^5)}

The results are unimpressive, though:

Taylor polynomials of exp(-x^m) are not so square
Taylor polynomials of exp(-x^m) are not so square

To get within {0.01} of the desired square-ness, we need {\exp(-1.01^m)<0.01}. This means {m\ge 463}. Then, to have the Taylor remainder bounded by {0.01} at {x=2}, we need {2^{463n}/n! < 0.01}. Instead of messing with Stirling’s formula, just observe that {2^{463n}/n!} does not even begin to decrease until {n} exceeds {2^{463}}, which is more than {10^{139}}. That’s a … high degree polynomial. I would not try to ask a computer algebra system to plot it.

Bernstein polynomials turn out to work better. On the interval {[0,2]} they are given by

\displaystyle    p_n(x) = 2^{-n} \sum_{k=0}^n f(2k/n) \binom{n}{k} x^k (2-x)^{n-k}

To avoid dealing with {f(1)}, it is better to use odd degrees. For comparison, I used the same or smaller degrees as above: {7, 25, 69, 179, 439}.

Squarish Bernstein polynomials
Squarish Bernstein polynomials

Looks good. But I don’t know of a way to estimate the degree of Bernstein polynomial required to obtain Hausdorff distance less than a given {\epsilon} (say, {0.01}) from the square function.