Measuring nonlinearity and reducing it

How to measure the nonlinearity of a function ${f\colon I\to \mathbb R}$ where ${I\subset \mathbb R}$ is an interval? A natural way is to consider the smallest possible deviation from a line ${y=kx+b}$, that is ${\inf_{k, b}\sup_{x\in I}|f(x)-kx-b|}$. It turns out to be convenient to divide this by ${|I|}$, the length of the interval ${I}$. So, let ${\displaystyle NL(f;I) = \frac{1}{|I|} \inf_{k, b}\sup_{x\in I}|f(x)-kx-b|}$. (This is similar to β-numbers of Peter Jones, except the deviation from a line is measured only in the vertical direction.)
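
As a quick illustration, here is a rough numerical sketch of ${NL(f;I)}$ in Scilab (the language of the code near the end of this post). The slope grid and step size are arbitrary choices; for a fixed slope ${k}$ the optimal ${b}$ is the mid-range value of ${f(x)-kx}$, so only ${k}$ needs to be scanned.

// Rough numerical estimate of NL(f; I) for f sampled on a grid x.
// For fixed k, the best b is the mid-range of f(x)-k*x, so we only
// scan k (over a coarse grid -- an arbitrary choice).
function v = NL(fvals, x)
    best = %inf;
    for k = -10:0.01:10
        g = fvals - k*x;
        best = min(best, (max(g) - min(g))/2);  // mid-range error
    end
    v = best / (max(x) - min(x));               // divide by |I|
endfunction

// Example: f(x) = x^2 on [0,1]; equioscillation gives NL = 1/8.
x = 0:0.001:1;
disp(NL(x.^2, x));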

Relation with derivatives

The definition of derivative immediately implies that if ${f'(a)}$ exists, then ${NL(f;I)\to 0}$ as ${I}$ shrinks to ${a}$ (that is, gets smaller while containing ${a}$). A typical construction of a nowhere differentiable continuous function is based on making ${NL(f;I)}$ bounded from below; it is enough to do this for dyadic intervals, and that can be done by adding wiggly terms like ${2^{-n}\mathrm{dist}\,(x, 2^{-n}\mathbb Z)}$: see the blancmange curve.
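
For a picture, here is a short Scilab sketch of the partial sums of the blancmange curve, using the normalization whose ${n}$-th term is ${\mathrm{dist}\,(x, 2^{-n}\mathbb Z)}$; the truncation at 12 terms is arbitrary.

// Partial sum of the blancmange curve: the n-th term is
// dist(x, 2^-n Z) = 2^-n * |2^n*x - round(2^n*x)|.
x = 0:0.0005:1;
f = zeros(x);
for n = 0:12
    y = 2^n * x;
    f = f + 2^(-n) * abs(y - round(y));
end
clf(); plot(x, f);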

The converse is false: if ${NL(f; I)\to 0}$ as ${I}$ shrinks to ${a}$, the function ${f}$ may still fail to be differentiable at ${a}$. The reason is that the affine approximation may have different slopes at different scales. An example is ${f(x)=x \sin \sqrt{-\log |x|}}$ in a neighborhood of ${0}$. Consider a small interval ${[-\delta, \delta]}$. The line ${y = kx}$ with ${k=\sin\sqrt{-\log \delta}}$ is a good approximation to ${f}$ because ${f(x)/x\approx k}$ on most of the interval except for a very small part near ${0}$, and on that part ${f}$ is very close to ${0}$ anyway.

Why the square root of the logarithm? Because ${\sin \log |x|}$ changes by a fixed amount on a fixed proportion of ${[-\delta, \delta]}$, independently of ${\delta}$. We need a function that grows more slowly than the logarithm, so that as ${\delta}$ decreases, there is a smaller amount of change on a larger part of the interval ${[-\delta, \delta]}$.
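
A minimal Scilab check makes the drift of slopes visible: the slope ${k(\delta)=\sin\sqrt{-\log \delta}}$ of the good affine approximation keeps oscillating as ${\delta\to 0}$, so no single slope works at all scales.

// Tabulate the slope k(delta) = sin(sqrt(-log(delta))) at scales
// delta = 10^-1, ..., 10^-10: it oscillates instead of converging.
for j = 1:10
    delta = 10^(-j);
    disp([delta, sin(sqrt(-log(delta)))]);
end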

Nonlinearity of Lipschitz functions

Suppose ${f}$ is a Lipschitz function, that is, there exists a constant ${L}$ such that ${|f(x)-f(y)|\le L|x-y|}$ for all ${x, y\in I}$. It’s easy to see that ${NL(f;I)\le L/2}$, by taking the mid-range approximation ${y=\frac12 (\max_I f + \min_I f)}$. But the sharp bound is ${NL(f;I)\le L/4}$, and its proof is not as trivial. The sharpness is shown by ${f(x)=|x|}$ with ${I=[-1,1]}$.

Proof. Let ${k}$ be the slope of the linear function that agrees with ${f}$ at the endpoints of ${I}$. Subtracting this linear function from ${f}$ gives us a Lipschitz function ${g}$ such that ${-L-k\le g'\le L-k}$ and ${\int_I g'= 0}$. Let ${A = \int_I (g')^+ = \int_I (g')^-}$. Chebyshev’s inequality gives lower bounds for the measures of the sets ${g'>0}$ and ${g'<0}$: namely, ${|g'>0|\ge A/(L-k)}$ and ${|g'<0|\ge A/(L+k)}$. By adding these, we find that ${|I| \ge 2LA/(L^2-k^2)\ge 2A/L}$. Since ${\max _I g - \min_I g \le A}$, the mid-range approximation to ${g}$ has error at most ${A/2 \le |I|L/4}$. Hence ${NL(f; I) = NL(g; I) \le L/4}$.
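
Mirroring the numerical sketch above, the sharpness example can be checked directly (the grids are again arbitrary choices):

// Check NL(|x|; [-1,1]) = 1/4 numerically; here L = 1.
x = -1:0.001:1;
f = abs(x);
best = %inf;
for k = -1:0.01:1
    g = f - k*x;
    best = min(best, (max(g) - min(g))/2);  // mid-range error
end
disp(best / 2);   // divide by |I| = 2; expect 0.25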

Reducing nonlinearity

It turns out that the graph of every Lipschitz function has relatively large almost-flat pieces. That is, there are subintervals of nontrivial size where the measure of nonlinearity is much smaller than the Lipschitz constant. This result is a special (one-dimensional) case of Theorem 2.3 in Affine approximation of Lipschitz functions and nonlinear quotients by Bates, Johnson, Lindenstrauss, Preiss, and Schechtman.

Theorem AA (for “affine approximation”): For every ${\epsilon>0}$ there exists ${\delta>0}$ with the following property. If ${f\colon I\to \mathbb R}$ is an ${L}$-Lipschitz function, then there exists an interval ${J\subset I}$ with ${|J|\ge \delta |I|}$ and ${NL(f; J)\le \epsilon L}$.

Theorem AA should not be confused with Rademacher’s theorem which says that a Lipschitz function is differentiable almost everywhere. The point here is a lower bound on the size of the interval ${J}$. Differentiability does not provide that. In fact, if we knew that ${f}$ is smooth, or even a polynomial, the proof of Theorem AA would not become any easier.

Proof of Theorem AA

We may assume ${I=[-1, 1]}$ and ${L=1}$. For ${t\in (0, 2]}$ let ${L(t) = \sup \{|f(x)-f(y)|/|x-y| \colon x, y\in I, \ |x-y|\ge t\}}$. That is, ${L(t)}$ is the restricted Lipschitz constant, one that applies for distances at least ${t}$. It is a decreasing function of ${t}$, and ${L(0+)=1}$.

Note that ${|f(-1)-f(1)|\le 2L(1)}$ and that every value of ${f}$ is within ${2L(1)}$ of either ${f(-1)}$ or ${f(1)}$, since every ${x\in I}$ is at distance at least ${1}$ from one of the endpoints. Hence, the oscillation of ${f}$ on ${I}$ is at most ${6L(1)}$. If ${L(1) \le \epsilon/3}$, then the constant mid-range approximation on ${I}$ gives the desired conclusion, with ${J=I}$. From now on, assume ${L(1) > \epsilon/3}$.

The sequence ${L_k = L(4^{-k})}$ is increasing toward ${L(0+)=1}$, which implies ${L_{k+1}\le (1+\epsilon) L_k}$ for some ${k}$ (otherwise ${L_k}$ would grow geometrically and eventually exceed ${1}$). Pick an interval ${[a, b]\subset I}$ that realizes ${L_k}$, that is, ${b-a\ge 4^{-k}}$ and ${|f(b)-f(a)| = L_k(b-a)}$. Without loss of generality ${f(b)>f(a)}$ (otherwise consider ${-f}$). Let ${J = [(3a+b)/4, (a+3b)/4]}$ be the middle half of ${[a, b]}$. Since each point of ${J}$ is at distance at least ${4^{-k-1}}$ from both ${a}$ and ${b}$, it follows that ${\displaystyle f(b) + L_{k+1}(x-b) \le f(x) \le f(a) + L_{k+1}(x-a) }$ for all ${x \in J}$.

So far we have pinched ${f}$ between two affine functions of equal slope. Let us consider their difference:
${\displaystyle (f(a) + L_{k+1}(x-a)) - (f(b) + L_{k+1}(x-b)) = (L_{k+1}-L_k) (b-a)}$. Recall that ${L_{k+1}\le (1+\epsilon) L_k}$, which gives a bound of ${\epsilon L_k(b-a) \le 2\epsilon L |J|}$ for the difference. Approximating ${f}$ by the average of the two affine functions we conclude that ${NL(f;J)\le \epsilon L}$ as required.

It remains to consider the size of ${J}$, about which we only know ${|J|\ge 4^{-k}/2}$ so far. Naturally, we want to take the smallest ${k}$ such that ${L_{k+1}\le (1+\epsilon) L_k}$ holds. Let ${m}$ be this value; then ${L_m > (1+\epsilon)^{m} L_0}$. Here ${L_m\le 1}$ and ${L_0 = L(1)> \epsilon/3 }$. The conclusion is that ${(1+\epsilon)^m < 3/\epsilon}$, hence ${m< \log(3/\epsilon)/\log(1+\epsilon)}$. Since ${|I|=2}$, this finally yields ${\displaystyle \delta = 4^{-\log(3/\epsilon)/\log(1+\epsilon)}/4}$ as an acceptable choice, completing the proof of Theorem AA.
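
To get a sense of how pessimistic this ${\delta}$ is, here is a quick Scilab tabulation (the sample values of ${\epsilon}$ are my choice):

// delta(eps) = 4^(-log(3/eps)/log(1+eps))/4 from the proof above.
for eps = [0.5, 0.1, 0.01]
    m = log(3/eps) / log(1+eps);
    disp([eps, 4^(-m)/4]);
end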

A large amount of work has been done on quantifying ${\delta}$ in various contexts; for example Heat flow and quantitative differentiation by Hytönen and Naor.

The Alabern-Mateu-Verdera characterization of Sobolev spaces

I’ve been trying to understand how the Alabern-Mateu-Verdera characterization of Sobolev spaces works. Consider a function $f\colon \mathbb R\to\mathbb R$. Let $f_t$ be the average of $f$ on scale $t>0$; that is, $\displaystyle f_t(x)=\frac{1}{2t}\int_{x-t}^{x+t}f(y)\,dy$. The difference $|f_t-f|$ measures the deviation of $f$ from a line at scale $t$: averaging preserves affine functions, so $f_t-f$ vanishes when $f$ is affine. AMV define the square function $S_f$ as a weighted average of the squares of such deviations:

$\displaystyle S_f^2(x)=\int_0^{\infty} \left|\frac{f_t(x)-f(x)}{t}\right|^2 \frac{dt}{t}$

Since I’m mostly interested in local matters, I’ll use the truncated square function $S_f^2(x)=\int_0^{1}\dots$ which avoids the large-scale considerations. If $f$ is very nice (say, twice continuously differentiable), then $|f_t-f|$ is of order $t^2$, and the integral converges with room to spare. For example, here is the Gaussian ($f$ is in blue, $S_f$ in red):
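
Why $t^2$? For twice continuously differentiable $f$, a Taylor expansion inside the average kills the first-order term:

$\displaystyle f_t(x)-f(x)=\frac{1}{2t}\int_{-t}^{t}\big(f(x+s)-f(x)\big)\,ds=\frac{t^2}{6}f''(x)+o(t^2)$

so the integrand in the definition of $S_f^2$ is of order $t\,dt$ near $t=0$.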

This looks suspicious. Clearly, $S_f$ measures the size of the second derivative, not of the first. Yet, one of the Alabern-Mateu-Verdera theorems is for the Sobolev space of first order $W^{1,p}$: namely, $f\in W^{1,p} \iff S_f\in L^p$ (for finite $p$). So, the degree of integrability of $|f'|$ matches that of $S_f$, even though the functions look very different.

For functions that are not very nice, $S_f$ may be infinite for some values of $x$. For example, if the graph of $f$ has a corner at some point, then $|f_t-f|$ is of order $t$ there, and the integral defining $S_f$ diverges like $\displaystyle \int \frac{dt}{t}$. Take, for example, the triangle $f(x)=(1-|x|)^+$:
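
For instance, at the corner $x=0$ the computation is elementary: for $0<t\le 1$,

$\displaystyle f_t(0)-f(0)=\frac{1}{2t}\int_{-t}^{t}(1-|y|)\,dy-1=-\frac{t}{2},$

so $|f_t(0)-f(0)|/t=1/2$ and $S_f^2(0)\ge \int_0^1 \frac{1}{4}\,\frac{dt}{t}=\infty$.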

The triangle function is Lipschitz, i.e., $f\in W^{1,\infty}$, but its $S_f$ is not bounded. So the AMV characterization $f\in W^{1,p} \iff S_f\in L^p$ does not extend to $p=\infty$. However, the blow-up rate of $S_f$ in this example is merely logarithmic (of order $|\log|x-x_0||^{1/2}$ near a corner at $x_0$), which implies $S_f\in L^p$ for all $p<\infty$, in accordance with the AMV theorem. Again, we notice that $S_f$ and $|f'|$ look rather unlike each other… except that $S_f$ now resembles the absolute value of the Hilbert transform of $f'$.

Here is the semicircle $f(x)=\sqrt{1-x^2}$:

At the endpoints of the arc $|f'|$ blows up as $(1-|x|)^{-1/2}$, and therefore $f\in W^{1,p}$ only when $p<2$. And indeed, near the endpoints the nonlinearity on scale $t$ is about $\sqrt{t}$, which turns the integrand in the definition of $S_f$ into $\displaystyle \frac{dt}{t^2}$. Hence, $S_f^2(x)\sim \frac{1}{|x-1|}$ as $x\to 1$. We have $S_f\in L^p$ iff $p<2$, as needed.

The last example, $f(x)=(1-\sqrt{|x|})^+$, has both a cusp and two corners, demonstrating the different rates at which $S_f$ blows up.

My Scilab code is probably not the most efficient one for this purpose. I calculated $f_t-f$ using a multidiagonal matrix with $-1$ on the main diagonal and $1/(2k)$ on the nearest $2k$ diagonals.
step = 0.01; scale = 1;
x = [-3:step:3]; n = length(x);
f = max(0, 1 - sqrt(abs(x)));        // the cusp-and-corners example
s = zeros(n, 1);                     // accumulates S_f^2 at the grid points
for k = 1:(scale/step) do            // scale t = k*step
    avg = -diag(ones(n,1));          // banded matrix: (avg*f')(i) = f_t(x_i) - f(x_i)
    for j = 1:k do
        avg = avg + diag(ones(n-j,1)/(2*k), j) + diag(ones(n-j,1)/(2*k), -j);
    end
    s = s + (avg*f').^2 / (step^2 * k^3);   // (f_t-f)^2/t^2 * dt/t with t = k*step, dt = step
end
s = s.^(1/2);
clf(); plot(x, f); plot(x, s, 'red');

Nearest point projection, part II: beware of two-dimensional gifts

To avoid the complications from the preceding post, let’s assume that $X$ is a uniformly convex Banach space: such a space is automatically reflexive, and therefore any closed subspace $A$ has a well-defined nearest point projection $P\colon X\to A$.

Recalling that in a Hilbert space $P$ is a linear operator (and a self-adjoint one at that), we might ask if our $P$ is linear. Indeed, the invariance of distance under translations shows that $P(x+a)=P(x)+a$ for all $x\in X, a\in A$. Consequently, all fibers $P^{-1}(a)$ are translates of one another. The map $P$ is also homogeneous: $P(\lambda x)=\lambda P(x)$, which follows from the homogeneity of the norm. In particular, $P^{-1}(0)$ is a two-sided cone: it’s closed under multiplication by scalars.
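
Spelled out, translation invariance is a change of variable in the minimization: for every $a'\in A$,

$\displaystyle \|(x+a)-(P(x)+a)\|=\|x-P(x)\|\le\|x-a'\|=\|(x+a)-(a'+a)\|,$

and since $a'\mapsto a'+a$ is a bijection of $A$, the point $P(x)+a$ is the nearest point of $A$ to $x+a$.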

In the special case $\dim (X/A)=1$ we conclude that $P^{-1}(0)$ is a line, and the direct sum decomposition $X= A+P^{-1}(0)$ identifies $P$ as a (possibly skewed) linear projection.

Well, one of the things that the geometry of Banach spaces teaches us is that 2-dimensional examples are often too simple to show what is really going on, while a 3-dimensional example may suffice. For instance, the $\ell_1$ and $\ell_\infty$ norms define isometric spaces in 2 dimensions, but not in 3 or more.

So, let’s take $X$ to be 3-dimensional with the $\ell_p$ norm $\|x\|^p=|x_1|^p+|x_2|^p+|x_3|^p$, where $1<p<\infty$. Let $A=\lbrace x_1=x_2=x_3\rbrace$, so that the codimension of $A$ is 2. What is the set $P^{-1}(0)$? We know it is a ruled surface: with each point it contains the line through that point and the origin. More precisely, $P(x)=0$ exactly when the minimum of $d(t):=|x_1-t|^p+|x_2-t|^p+|x_3-t|^p$ is attained at $t=0$. (The minimum point is unique, since the function is strictly convex.) Differentiation reveals that

$\displaystyle P^{-1}(0)=\lbrace x\colon |x_1|^{p-2}x_1+|x_2|^{p-2}x_2+|x_3|^{p-2}x_3 = 0\rbrace$

which is a plane only when $p=2$. Here is this surface for $p=4$, when the equation simplifies to $x_1^3+x_2^3+x_3^3=0$:
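
As a sanity check, here is a minimal Scilab verification that a point on this surface projects to the origin; the sample point and the grid are my choices.

// For p = 4, pick x with x1^3 + x2^3 + x3^3 = 0 and verify that
// t = 0 minimizes d(t) = |x1-t|^4 + |x2-t|^4 + |x3-t|^4.
p = 4;
x = [1, 1, -2^(1/3)];        // 1 + 1 - 2 = 0
t = -1:0.001:1;
d = zeros(t);
for i = 1:length(t)
    d(i) = sum(abs(x - t(i)).^p);
end
[dmin, imin] = min(d);
disp(t(imin));               // expect (approximately) 0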

The entire 3-dimensional space is foliated by the translates of this surface in the direction of the vector $(1,1,1)$.

The nearest point projection is likely to be the first nonlinear map one encounters in functional analysis. It is not even Lipschitz in general, although in decent spaces such as $\ell_p$ for $1<p<\infty$ it is Hölder continuous (I think the optimal exponent is $\frac{p\wedge 2}{p\vee 2}$).

On reflection, the nonlinearity of the nearest point projection is not so surprising: minimization of distance amounts to solving an equation involving the gradient of the norm, and this gradient is nonlinear unless the norm is a quadratic function, i.e., unless we are in a Hilbert space.