Measuring nonlinearity and reducing it

How to measure the nonlinearity of a function {f\colon I\to \mathbb R} where {I\subset \mathbb R} is an interval? A natural way is to consider the smallest possible deviation from a line {y=kx+b}, that is {\inf_{k, b}\sup_{x\in I}|f(x)-kx-b|}. It turns out to be convenient to divide this by {|I|}, the length of the interval {I}. So, let {\displaystyle NL(f;I) = \frac{1}{|I|} \inf_{k, b}\sup_{x\in I}|f(x)-kx-b|}. (This is similar to β-numbers of Peter Jones, except the deviation from a line is measured only in the vertical direction.)

[Figure NL-def: NL(f; I) is the maximal vertical distance between red and black, divided by the length of I.]
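
As a quick sanity check of the definition, here is a rough Scilab sketch (the test function, interval, and grids are my own choices, purely for illustration). For a fixed slope {k}, the best intercept is the mid-range of {f(x)-kx}, so it suffices to scan slopes; for {f(x)=x^2} on {[0,1]} the best affine approximation is {x-1/8}, and the output should be close to {0.125}.

// rough numerical estimate of NL(f; I) for f(x) = x^2 on I = [0, 1]:
// for each slope k, the best intercept is the mid-range of f(x) - k*x,
// so the sup-deviation from the best line of slope k is half of its oscillation
x = 0:0.001:1;
f = x.^2;                                // best affine approximation is x - 1/8
best = %inf;
for k = -2:0.001:2 do
    dev = (max(f - k*x) - min(f - k*x))/2;
    if dev < best then best = dev; end
end
mprintf("NL estimate: %f\n", best);      // |I| = 1, so no rescaling; expect about 0.125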

Relation with derivatives

The definition of derivative immediately implies that if {f'(a)} exists, then {NL(f;I)\to 0} as {I} shrinks to {a} (that is, gets smaller while containing {a}). A typical construction of a nowhere differentiable continuous function is based on making {NL(f;I)} bounded from below; it is enough to do this for dyadic intervals, and that can be done by adding wiggly terms like {2^{-n}\mathrm{dist}\,(2^n x, \mathbb Z)}: see the blancmange curve.

The converse is false: if {NL(f; I)\to 0} as {I} shrinks to {a}, the function {f} may still fail to be differentiable at {a}. The reason is that the affine approximation may have different slopes at different scales. An example is {f(x)=x \sin \sqrt{-\log |x|}} in a neighborhood of {0}. Consider a small interval {[-\delta, \delta]}. The line {y = kx} with {k=\sin\sqrt{-\log \delta}} is a good approximation to {f} because {f(x)/x\approx k} on most of the interval except for a very small part near {0}, and on that part {f} is very close to {0} anyway.

Why the square root of the logarithm? Because {\sin \log |x|} changes by a fixed amount on a fixed proportion of {[-\delta, \delta]}, independently of {\delta}. We need a function that grows more slowly than the logarithm, so that as {\delta} decreases, there is a smaller amount of change on a larger part of the interval {[-\delta, \delta]}.
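
Concretely, the best slope at scale {\delta} is about {k(\delta)=\sin\sqrt{-\log\delta}}, and this quantity keeps sweeping through {[-1,1]} as {\delta\to 0}: for instance, {k(\delta)=1} when {\delta=e^{-(2\pi n+\pi/2)^2}} and {k(\delta)=-1} when {\delta=e^{-(2\pi n+3\pi/2)^2}}. No single slope works at all scales, so {f'(0)} does not exist even though {NL(f;[-\delta,\delta])\to 0}.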

Nonlinearity of Lipschitz functions

Suppose {f} is a Lipschitz function, that is, there exists a constant {L} such that {|f(x)-f(y)|\le L|x-y|} for all {x, y\in I}. It’s easy to see that {NL(f;I)\le L/2}, by taking the mid-range approximation {y=\frac12 (\max_I f + \min_I f)}. But the sharp bound is {NL(f;I)\le L/4} whose proof is not as trivial. The sharpness is shown by {f(x)=|x|} with {I=[-1,1]}.

[Figure: with the maximum difference 1/2 and the length of interval 2, we get NL(f; [-1, 1]) = 1/4.]
Proof. Let {k} be the slope of the linear function that agrees with {f} at the endpoints of {I}. Subtracting this linear function from {f} gives us a Lipschitz function {g} such that {-L-k\le g'\le L-k} and {\int_I g'= 0}. Let {A = \int_I (g')^+ = \int_I (g')^-}. Chebyshev’s inequality gives lower bounds for the measures of the sets {g'>0} and {g'<0}: namely, {|g'>0|\ge A/(L-k)} and {|g'<0|\ge A/(L+k)}. By adding these, we find that {|I| \ge 2LA/(L^2-k^2)\ge 2A/L}. Since {\max _I g - \min_I g \le A}, the mid-range approximation to {g} has error at most {A/2 \le |I|L/4}. Hence {NL(f; I) = NL(g; I) \le L/4}.
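
For {f(x)=|x|} on {I=[-1,1]} (so {L=1}) every step of this argument is an equality: the line through the endpoints is the constant {1}, so {g(x)=|x|-1} and {g'=\mathrm{sign}\,x}; then {A=1}, the sets {\{g'>0\}} and {\{g'<0\}} each have measure exactly {1}, and the mid-range error is {A/2=1/2=|I|L/4}, giving {NL(f;[-1,1])=1/4} as in the picture above.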

Reducing nonlinearity

Turns out, the graph of every Lipschitz function has relatively large almost-flat pieces.  That is, there are subintervals of nontrivial size where the measure of nonlinearity is much smaller than the Lipschitz constant. This result is a special (one-dimensional) case of Theorem 2.3 in Affine approximation of Lipschitz functions and nonlinear quotients by Bates, Johnson, Lindenstrauss, Preiss, and Schechtman.

Theorem AA (for “affine approximation”): For every {\epsilon>0} there exists {\delta>0} with the following property. If {f\colon I\to \mathbb R} is an {L}-Lipschitz function, then there exists an interval {J\subset I} with {|J|\ge \delta |I|} and {NL(f; J)\le \epsilon L}.

Theorem AA should not be confused with Rademacher’s theorem which says that a Lipschitz function is differentiable almost everywhere. The point here is a lower bound on the size of the interval {J}. Differentiability does not provide that. In fact, if we knew that {f} is smooth, or even a polynomial, the proof of Theorem AA would not become any easier.

Proof of Theorem AA

We may assume {I=[-1, 1]} and {L=1}. For {t\in (0, 2]} let {L(t) = \sup \{|f(x)-f(y)|/|x-y| \colon x, y\in I, \ |x-y|\ge t\}}. That is, {L(t)} is the restricted Lipschitz constant, one that applies for distances at least {t}. It is a decreasing function of {t}, and {L(0+)=1}.

Note that {|f(-1)-f(1)|\le 2L(1)} and that every value of {f} is within {2L(1)} of either {f(-1)} or {f(1)}. Hence, the oscillation of {f} on {I} is at most {6L(1)}. If {L(1) \le \epsilon/3}, then the constant mid-range approximation on {I} gives the desired conclusion with {J=I}: its error is at most half the oscillation, hence at most {3L(1)\le\epsilon}, and dividing by {|I|=2} yields {NL(f;I)\le \epsilon/2\le \epsilon}. From now on, assume {L(1) > \epsilon/3}.

The sequence {L_k = L(4^{-k})} is increasing toward {L(0+)=1}, which implies {L_{k+1}\le (1+\epsilon) L_k} for some {k}. Pick an interval {[a, b]\subset I} that realizes {L_k}, that is, {b-a\ge 4^{-k}} and {|f(b)-f(a)| = L_k(b-a)}. Without loss of generality {f(b)>f(a)} (otherwise consider {-f}). Let {J = [(3a+b)/4, (a+3b)/4]} be the middle half of {[a, b]}. Since each point of {J} is at distance at least {4^{-k-1}} from both {a} and {b}, it follows that {\displaystyle f(b) + L_{k+1}(x-b) \le f(x) \le f(a) + L_{k+1}(x-a) } for all {x \in J}.

[Figure: the green lines have slope L_{k+1}, which is close to the slope L_k of the secant line through (a, f(a)) and (b, f(b)). The graph of f is pinched between these green lines, except near a or b.]

So far we have pinched {f} between two affine functions of equal slope. Let us consider their difference:
{\displaystyle (f(a) + L_{k+1}(x-a)) - (f(b) + L_{k+1}(x-b)) = (L_{k+1}-L_k) (b-a)}. Recall that {L_{k+1}\le (1+\epsilon) L_k}, which gives a bound of {\epsilon L_k(b-a) \le 2\epsilon L |J|} for the difference. Approximating {f} by the average of the two affine functions we conclude that {NL(f;J)\le \epsilon L} as required.

It remains to consider the size of {J}, about which we only know {|J|\ge 4^{-k}/2} so far. Naturally, we want to take the smallest {k} such that {L_{k+1}\le (1+\epsilon) L_k} holds. Let {m} be this value; then {L_m > (1+\epsilon)^{m} L_0}. Here {L_m\le 1} and {L_0 = L(1)> \epsilon/3 }. The conclusion is that {(1+\epsilon)^m < 3/\epsilon}, hence {m< \log(3/\epsilon)/\log(1+\epsilon)}. Since {|I|=2} and {|J|\ge 4^{-m}/2}, this finally yields {\displaystyle \delta = 4^{-\log(3/\epsilon)/\log(1+\epsilon)}/4} as an acceptable choice, completing the proof of Theorem AA.
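
To get a sense of scale: for {\epsilon=0.1} the bound on {m} is roughly {\log 30/\log 1.1\approx 36}, so the argument only guarantees {\delta} of order {4^{-36}\approx 2\cdot 10^{-22}}. The almost-flat piece exists, but the size estimate it comes with is extremely small.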

A large amount of work has been done on quantifying {\delta} in various contexts; for example Heat flow and quantitative differentiation by Hytönen and Naor.

The Alabern-Mateu-Verdera characterization of Sobolev spaces

I’ve been trying to understand how the Alabern-Mateu-Verdera characterization of Sobolev spaces works. Consider a function f\colon \mathbb R\to\mathbb R. Let f_t be the average of f on scale t>0; that is, \displaystyle f_t(x)=\frac{1}{2t}\int_{x-t}^{x+t}f(y)\,dy. The difference |f_t-f| measures the deviation of f from a line. AMV define the square function S_f as a weighted average of the squares of such deviations:

\displaystyle S_f^2(x)=\int_0^{\infty} \left|\frac{f_t(x)-f(x)}{t}\right|^2 \frac{dt}{t}

Since I’m mostly interested in local matters, I’ll use the truncated square function S_f^2(x)=\int_0^{1}\dots, which avoids large-scale considerations. If f is very nice (say, twice continuously differentiable), then |f_t-f| is of order t^2, and the integral converges with room to spare. For example, here is the Gaussian (f is in blue, S_f in red):

[Figure: the Gaussian f (blue) and its square function S_f (red).]

This looks suspicious. Clearly, S_f measures the size of the second derivative, not of the first. Yet, one of the Alabern-Mateu-Verdera theorems is for the Sobolev space of first order W^{1,p}: namely, f\in W^{1,p} \iff S_f\in L^p (for finite p). So, the degree of integrability of |f'| matches that of S_f, even though the functions look very different.
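
Here is where the second derivative comes from: for smooth f, a Taylor expansion gives

\displaystyle f_t(x)-f(x)=\frac{1}{2t}\int_{-t}^{t}\big(f(x+s)-f(x)\big)\,ds\approx \frac{1}{2t}\int_{-t}^{t}\Big(f'(x)s+\tfrac12 f''(x)s^2\Big)\,ds=\frac{f''(x)}{6}\,t^2

so, heuristically (when f'' varies slowly), the truncated square function is S_f(x)\approx \frac{|f''(x)|}{6}\left(\int_0^1 t\,dt\right)^{1/2}=\frac{|f''(x)|}{6\sqrt{2}}.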

For functions that are not very nice, S_f may be infinite for some values of x. For example, if the graph of f has a corner at some point, then |f_t-f| is of order t there, the integrand behaves like \frac{dt}{t}, and the integral defining S_f diverges. Take, for instance, the triangle f(x)=(1-|x|)^+:

[Figure: the triangle f (blue) and its square function S_f (red).]

The triangle function is Lipschitz, i.e., f\in W^{1,\infty}, but its S_f is not bounded. So, the AMV characterization f\in W^{1,p} \iff S_f\in L^p does not extend to p=\infty. However, the blow-up rate of S_f in this example is merely logarithmic (of order |\log|^{1/2} in the distance to the nearest corner, to be precise), which implies S_f\in L^p for all p<\infty, in accordance with the AMV theorem. Again, we notice that S_f and |f'| look rather unlike each other… except that S_f now resembles the absolute value of the Hilbert transform of f'.
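
Here is where the |\log|^{1/2} rate comes from. Consider the corner at 0 and a point x with 0<x\ll 1 (the other side is symmetric). For t<x the function is affine on [x-t,x+t], so f_t(x)=f(x). For x<t\le 1-x the interval [x-t,x+t] straddles the corner, and a direct computation gives f_t(x)-f(x)=-(t-x)^2/(2t)\approx -t/2. Hence, up to bounded terms,

\displaystyle S_f^2(x)\approx \int_x^{1}\left(\frac{t/2}{t}\right)^2\frac{dt}{t}=\frac{1}{4}\log\frac{1}{x},

so S_f(x)\approx \tfrac12 |\log x|^{1/2} near the corner.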

Here is the semicircle f(x)=\sqrt{1-x^2}:

At the endpoints of the arc |f'| blows up as (1-|x|)^{-1/2}, and therefore f\in W^{1,p} only when p<2. And indeed, near the endpoints the nonlinearity on scale t is about \sqrt{t}, which turns the integrand in the definition of S_f into \displaystyle \frac{dt}{t^2}. Hence, S_f^2(x)\sim \frac{1}{|x-1|} as x\to 1. We have S_f\in L^p iff p<2, as needed.

The last example, f(x)=(1-\sqrt{|x|})^+, has both a cusp and two corners, demonstrating the different rates at which S_f blows up.

[Figure: the cusp-and-corners example f (blue) and its square function S_f (red).]
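
By the same computations as above, near the cusp at 0 the term \sqrt{|x|} behaves like the semicircle near its endpoints, so S_f^2(x)\sim 1/|x| there (and f, like the semicircle, is in W^{1,p} exactly for p<2); near the corners at \pm 1 the blow-up is again merely of order |\log|^{1/2}.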

My Scilab code is probably not the most efficient one for this purpose. I calculated f_t-f using a multidiagonal matrix with -1 on the main diagonal and 1/(2k) on the nearest 2k diagonals.

// truncated square function S_f on a grid, for the cusp-and-corners example
step=0.01; scale=1;                       // grid spacing; the scales t go up to scale=1
x=[-3:step:3]; n=length(x);
s=zeros(n,1);                             // accumulates S_f^2 at the grid points
f=max(0,1-sqrt(abs(x)));                  // f(x) = (1-sqrt|x|)^+
for k=1:(scale/step) do                   // t = k*step
    avg=-diag(ones(n,1));                 // avg*f' approximates f_t - f:
    for j=1:k do                          // -1 on the main diagonal, 1/(2k) on 2k neighbors
        avg=avg+diag(ones(n-j,1)/(2*k),j)+diag(ones(n-j,1)/(2*k),-j);
    end
    s=s+(avg*f').^2/(step^2*k^3);         // add |(f_t-f)/t|^2 * dt/t with t=k*step, dt=step
end
s=s.^(1/2);                               // S_f
clf(); plot(x,f); plot(x,s,'red')

Nearest point projection, part II: beware of two-dimensional gifts

To avoid the complications from the preceding post, let’s assume that X is a uniformly convex Banach space: such a space is automatically reflexive, and therefore any closed subspace A has a well-defined nearest point projection P\colon X\to A.

Recalling that in a Hilbert space P is a linear operator (and a self-adjoint one at that), we might ask if our P is linear. Indeed, the invariance of distance under translations shows that P(x+a)=P(x)+a for all x\in X, a\in A. Consequently, all fibers P^{-1}(a) are translates of one another. The map P is also homogeneous: P(\lambda x)=\lambda P(x), which follows from the homogeneity of the norm. In particular, P^{-1}(0) is a two-sided cone: it’s closed under multiplication by scalars.

In the special case \dim (X/A)=1 we conclude that P^{-1}(0) is a line, and the direct sum decomposition X= A+P^{-1}(0) identifies P as a (possibly skewed) linear projection.

Well, one of the things that the geometry of Banach spaces teaches us is that 2-dimensional examples are often too simple to show what is really going on, while a 3-dimensional example may suffice. For instance, the \ell_1 and \ell_\infty norms define isometric spaces in 2 dimensions, but not in 3 or more.

So, let’s take X to be 3-dimensional with the \ell_p norm \|x\|^p=|x_1|^p+|x_2|^p+|x_3|^p, where 1<p<\infty. Let A=\lbrace x_1=x_2=x_3\rbrace, so that the codimension of A is 2. What is the set P^{-1}(0)? We know it is a ruled surface: with each point it contains the line through that point and the origin. More precisely, P(x)=0 exactly when the minimum of d(t):=|x_1-t|^p+|x_2-t|^p+|x_3-t|^p is attained at t=0. (The minimum point is unique, since the function is strictly convex.) Differentiation reveals that

\displaystyle P^{-1}(0)=\lbrace x\colon |x_1|^{p-2}x_1+|x_2|^{p-2}x_2+|x_3|^{p-2}x_3 = 0\rbrace

which is a plane only when p=2. Here is this surface for p=4, when the equation simplifies to x_1^3+x_2^3+x_3^3=0:

[Figure: a fiber of the nearest point projection; for p=4 it is the surface x_1^3+x_2^3+x_3^3=0.]

The entire 3-dimensional space is foliated by the translates of this surface in the direction of the vector (1,1,1).
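
To see the nonlinearity concretely, here is a rough Scilab sketch (the sample points x and y below are arbitrary choices of mine, purely for illustration). For p=4, the minimizer of d(t) is the unique root of \sum_i (x_i-t)^3=0, which bisection finds easily since the left-hand side is monotone in t; the output shows that P(x+y)\ne P(x)+P(y).

// nearest point projection onto the diagonal A = {(t,t,t)} in R^3 with the l^4 norm:
// P(x) = (t,t,t), where t solves sum_i (x_i - t)^3 = 0 (the optimality condition above)
function t = proj4(x)
    lo = min(x); hi = max(x);            // the minimizer of d(t) lies between min(x) and max(x)
    for i = 1:60 do                      // bisection on the increasing function sum((t-x).^3)
        t = (lo+hi)/2;
        if sum((t-x).^3) > 0 then hi = t; else lo = t; end
    end
endfunction

x = [1, 0, 0]; y = [0, 1, -1];
mprintf("P(x)   = %f\n", proj4(x));      // about 0.44
mprintf("P(y)   = %f\n", proj4(y));      // 0, by symmetry
mprintf("P(x+y) = %f\n", proj4(x+y));    // about 0.12, not P(x)+P(y): P is nonlinear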

The nearest point projection is likely to be the first nonlinear map one encounters in functional analysis. It is not even Lipschitz in general, although in decent spaces such as \ell_p for 1<p<\infty it is Hölder continuous (I think the optimal exponent is \frac{p\wedge 2}{p\vee 2}).

After a little thought, the nonlinearity of the nearest point projection is not so surprising: minimizing the distance amounts to solving an equation involving the gradient of the norm, and this gradient is nonlinear unless the square of the norm is a quadratic form, i.e., unless we are in a Hilbert space.
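
In the \ell_p example above this is explicit: P(x)=(t,t,t), where t is the unique solution of \sum_i |x_i-t|^{p-2}(x_i-t)=0, i.e., the zero of d'(t). The map u\mapsto |u|^{p-2}u is linear only for p=2, and this is exactly where the nonlinearity of P comes from.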