## Functions of bounded or vanishing nonlinearity

A natural way to measure the nonlinearity of a function ${f\colon I\to \mathbb R}$, where ${I\subset \mathbb R}$ is an interval, is the quantity ${\displaystyle NL(f;I) = \frac{1}{|I|} \inf_{k, r}\sup_{x\in I}|f(x)-kx-r|}$ which expresses the deviation of ${f}$ from a line, divided by the size of the interval ${I}$. This quantity was considered in “Measuring nonlinearity and reducing it”.

Let us write ${NL(f) = \sup_I NL(f; I)}$ where the supremum is taken over all intervals ${I}$ in the domain of definition of ${f}$. What functions have finite ${NL(f)}$? Every Lipschitz function does, as was noted previously: ${NL(f) \le \frac14 \mathrm{Lip}\,(f)}$. But the converse is not true: for example, ${NL(f)}$ is finite for the non-Lipschitz function ${f(x)=x\log|x|}$, where ${f(0)=0}$.

The function looks nice, but ${f(x)/x}$ is clearly unbounded. What makes ${NL(f)}$ finite? Note the scale-invariant feature of NL: for any ${t>0}$ the scaled function ${f_t(x) = t^{-1}f(tx)}$ satisfies ${NL(f_t)=NL(f)}$, and more precisely ${NL(f; tI) = NL(f_t; I)}$. On the other hand, our function has a curious scaling property ${f_t(x) = f(x) + x\log t}$ where the linear term ${x\log t}$ does not affect NL at all. This means that it suffices to bound ${NL(f; I)}$ for intervals ${I}$ of unit length. The plot of ${f}$ shows that not much deviation from the secant line happens on such intervals, so I will not bother with estimates.
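The scale-invariance can also be probed numerically. Here is a sketch in plain Python with NumPy/SciPy (not part of the original post) that approximates ${NL(f;I)}$ by a small linear program on a grid; for ${f(x)=x\log|x|}$ the values on ${[1,2]}$ and on the doubled interval ${[2,4]}$ come out equal, exactly as the scaling argument predicts.

```python
import numpy as np
from scipy.optimize import linprog

def nl(f, a, b, n=400):
    """Approximate NL(f; [a,b]) = (1/(b-a)) * inf_{k,r} sup |f(x) - k*x - r|
    by minimizing t subject to |f(x_i) - k*x_i - r| <= t on a grid."""
    x = np.linspace(a, b, n)
    y = f(x)
    ones = np.ones(n)
    # Variables (k, r, t); the two constraint blocks encode the absolute value.
    A_ub = np.vstack([np.column_stack([-x, -ones, -ones]),   #  y_i - k x_i - r <= t
                      np.column_stack([ x,  ones, -ones])])  # -y_i + k x_i + r <= t
    b_ub = np.concatenate([-y, y])
    res = linprog([0, 0, 1], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None), (None, None), (0, None)])
    return res.fun / (b - a)

f = lambda x: x * np.log(np.abs(x))
print(nl(f, 1, 2), nl(f, 2, 4))  # equal, since rescaling only adds a linear term
```

The Lipschitz bound ${NL(f)\le \frac14\mathrm{Lip}(f)}$ can be spot-checked the same way.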

The class of functions ${f}$ with ${NL(f)<\infty}$ is precisely the Zygmund class ${\Lambda^*}$ defined by the property ${|f(x-h)-2f(x)+f(x+h)| \le Mh}$ with ${M}$ independent of ${x, h}$. Indeed, since the second-order difference ${f(x-h)-2f(x)+f(x+h)}$ is unchanged by adding an affine function to ${f}$, we can replace ${f}$ by ${f(x)-kx-r}$ with suitable ${k, r}$ and use the triangle inequality to obtain

${\displaystyle |f(x-h)-2f(x)+f(x+h)| \le 4 \sup_I |f(x)-kx-r| = 8h\; NL(f; I)}$

where ${I=[x-h, x+h]}$. Conversely, suppose that ${f\in \Lambda^*}$. Given an interval ${I=[a, b]}$, subtract an affine function from ${f}$ to ensure ${f(a)=f(b)=0}$. We may assume ${|f|}$ attains its maximum on ${I}$ at a point ${\xi \le (a + b)/2}$. Applying the definition of ${\Lambda^*}$ with ${x = \xi}$ and ${h = \xi - a}$, we get ${|f(2\xi - a) - 2f(\xi )| \le M h}$, hence ${|f(\xi )| \le Mh}$. This shows ${NL(f; I)\le M/2}$. The upshot is that ${NL(f)}$ is equivalent to the Zygmund seminorm of ${f}$ (i.e., the smallest possible M in the definition of ${\Lambda^*}$).
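For a concrete feel, one can sample the second-difference ratio of ${f(x)=x\log|x|}$ numerically (a plain-Python sketch; the observed bound of roughly 1.76 is my own estimate, not a claim made in the text):

```python
import numpy as np

def zygmund_ratio(f, xs, hs):
    """Largest sampled value of |f(x-h) - 2 f(x) + f(x+h)| / h."""
    X, H = np.meshgrid(xs, hs)
    return np.max(np.abs(f(X - H) - 2 * f(X) + f(X + H)) / H)

# f(x) = x log|x| with f(0) = 0; the inner where keeps log away from 0.
f = lambda x: np.where(x == 0, 0.0, x * np.log(np.abs(np.where(x == 0, 1.0, x))))
r = zygmund_ratio(f, np.linspace(-3, 3, 601), np.linspace(1e-3, 3, 600))
print(r)  # stays bounded; the largest sampled ratio is about 1.76
```

Since the second difference of this ${f}$ is unchanged by rescaling (the linear term drops out), the ratio depends only on ${x/h}$, which is why a modest grid suffices.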

A function in ${\Lambda^*}$ may be nowhere differentiable: it is not difficult to construct ${f}$ so that ${NL(f;I)}$ is bounded between two positive constants. The situation is different for the small Zygmund class ${\lambda^*}$ whose definition requires that ${NL(f; I)\to 0}$ as ${|I|\to 0}$. A function ${f \in \lambda^*}$ is differentiable at any point of local extremum, since the condition ${NL(f; I)\to 0}$ forces its graph to be tangent to the horizontal line through the point of extremum. Given any two points ${a, b}$ we can subtract the secant line from ${f}$ and thus create a point of local extremum between ${a }$ and ${b}$. It follows that ${f}$ is differentiable on a dense set of points.

The definitions of ${\Lambda^* }$ and ${\lambda^*}$ apply equally well to complex-valued functions, or vector-valued functions. But there is a notable difference in the differentiability properties: a complex-valued function of class ${\lambda^*}$ may be nowhere differentiable [Ullrich, 1993]. Put another way, two real-valued functions in ${\lambda^*}$ need not have a common point of differentiability. This sort of thing does not often happen in analysis, where the existence of points of “good” behavior is usually based on the prevalence of such points in some sense, and therefore a finite collection of functions is expected to have common points of good behavior.

The key lemma in Ullrich’s paper provides a real-valued VMO function that has infinite limit at every point of a given ${F_\sigma}$ set ${E}$ of measure zero. Although this is a result of real analysis, the proof is complex-analytic in nature and involves a conformal mapping. It would be interesting to see a “real” proof of this lemma. Since the antiderivative of a VMO function belongs to ${\lambda^* }$, the lemma yields a function ${v \in \lambda^*}$ that is not differentiable at any point of ${E}$. Consider the lacunary series ${u(t) = \sum_{n=1}^\infty a_n 2^{-n} \cos (2^n t)}$. One theorem of Zygmund shows that ${u \in \lambda^*}$ when ${a_n\to 0}$, while another shows that ${u}$ is almost nowhere differentiable when ${\sum a_n^2 = \infty}$. It remains to apply the lemma to get a function ${v\in \lambda^*}$ that is not differentiable at any point where ${u}$ is differentiable.

## Lightness, hyperspace, and lower oscillation bounds

When does a map ${f\colon X\to Y}$ admit a lower “anti-continuity” bound like ${d_Y(f(a),f(b))\ge \lambda(d_X(a,b))}$ for some function ${\lambda\colon (0,\infty)\to (0, \infty)}$ and for all ${a\ne b}$? The answer is easy: ${f}$ must be injective and its inverse must be uniformly continuous. End of story.

But recalling what happened with diameters of connected sets last time, let’s focus on the inequality ${\textrm{diam}\, f(E)\ge \lambda (\textrm{diam}\, E)}$ for connected subsets ${E\subset X}$. If such ${\lambda}$ exists, the map ${f}$ has the LOB property, for “lower oscillation bound” (oscillation being the diameter of the image). The LOB property does not require ${f}$ to be injective. On the real line, ${f(x)=|x|}$ satisfies it with ${\lambda(\delta)=\delta/2}$: since it simply folds the line, the worst that can happen to the diameter of an interval is to be halved. Similarly, ${f(x)=x^2}$ admits a lower oscillation bound ${\lambda(\delta) = (\delta/2)^2}$. This one decays faster than linearly at 0, indicating some amount of squeezing going on. One may check that every polynomial has the LOB property as well.
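A quick spot check of the bound for ${f(x)=x^2}$ over random intervals (plain Python; the script and its helper name are mine, only the bound ${\lambda(\delta)=(\delta/2)^2}$ comes from the text):

```python
import random

def diam_image_sq(a, b):
    """Diameter of the image of the interval [a, b] under x -> x^2."""
    lo = 0.0 if a <= 0 <= b else min(a * a, b * b)
    return max(a * a, b * b) - lo

random.seed(1)
for _ in range(10000):
    a, b = sorted(random.uniform(-5, 5) for _ in range(2))
    # The image of an interval of length delta has diameter >= (delta/2)^2.
    assert diam_image_sq(a, b) >= ((b - a) / 2) ** 2 - 1e-12
```

The worst case is an interval centered at the origin, where the two halves fold onto each other.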

On the other hand, the exponential function ${f(x)=e^x}$ does not have the LOB property, since ${\textrm{diam}\, f([x,x+1])}$ tends to ${0}$ as ${x\to-\infty}$. No surprise there; we know from the relation of continuity and uniform continuity that things like that happen on a non-compact domain.

Also, a function that is constant on some nontrivial connected set will obviously fail LOB. In topology, a mapping is called light if the preimage of every point is totally disconnected, which is exactly the same as not being constant on any nontrivial connected set. So, lightness is necessary for LOB, but not sufficient as ${e^x}$ shows.

Theorem 1: Every continuous light map ${f\colon X\to Y}$ with compact domain ${X}$ admits a lower oscillation bound.

Proof. Suppose not. Then there exist ${\epsilon>0}$ and a sequence of connected subsets ${E_n\subset X}$ such that ${\textrm{diam}\, E_n\ge \epsilon}$ and ${\textrm{diam}\, f(E_n)\to 0}$. We may assume each ${E_n}$ is compact: replacing ${E_n}$ with its closure ${\overline{E_n}}$ keeps it connected and does not spoil the assumptions, since ${f(\overline{E_n})\subset \overline{f(E_n)}}$ by continuity.

The space of nonempty compact subsets of ${X}$ is called the hyperspace of ${X}$; when equipped with the Hausdorff metric, it becomes a compact metric space itself. Pass to a convergent subsequence, still denoted ${\{E_n\}}$. Its limit ${E}$ is connected (a Hausdorff limit of connected sets is connected) and has diameter at least ${\epsilon}$, because diameter is a continuous function on the hyperspace. Finally, using the uniform continuity of ${f}$ we get ${\textrm{diam}\, f(E) = \lim \textrm{diam}\, f(E_n) = 0}$, so ${f}$ is constant on the nontrivial connected set ${E}$, contradicting the lightness of ${f}$. ${\quad \Box}$

Here is another example to demonstrate the importance of compactness (not just boundedness) and continuity: on the domain ${X = \{(x,y)\colon 0 < x < 1, 0 < y < 1\}}$ define ${f(x,y)=(x,xy)}$. This is a homeomorphism, the inverse being ${(u,v)\mapsto (u, v/u)}$. Yet it fails LOB because the image of line segment ${\{x\}\times (0,1)}$ has diameter ${x}$, which can be arbitrarily close to 0. So, the lack of compactness hurts. Extending ${f}$ to the closed square in a discontinuous way, say by letting it be the identity map on the boundary, we see that continuity is also needed, although it’s slightly non-intuitive that one needs continuity (essentially an upper oscillation bound) to estimate oscillation from below.

All that said, on a bounded interval of the real line we need neither compactness nor continuity.

Theorem 2: If ${I\subset \mathbb R}$ is a bounded interval, then every light map ${f\colon I\to Y}$ admits a lower oscillation bound.

Proof. Following the proof of Theorem 1, consider a sequence of intervals ${(a_n, b_n)}$ such that ${b_n-a_n\ge \epsilon}$ and ${\textrm{diam}\, f((a_n,b_n))\to 0}$. There is no loss of generality in considering open intervals, since it can only make the diameter of the image smaller. Also WLOG, suppose ${a_n\to a}$ and ${b_n\to b}$; this uses the boundedness of ${I}$. Consider a nontrivial closed interval ${[c,d]\subset (a,b)}$. For all sufficiently large ${n}$ we have ${[c,d]\subset (a_n,b_n)}$, which implies ${\textrm{diam}\, f([c,d])\le \textrm{diam}\, f((a_n,b_n))\to 0}$. Thus ${f}$ is constant on ${[c,d]}$, a contradiction. ${\quad \Box}$

The property that distinguishes the real line here is that its nontrivial connected sets have nonempty interior. The same works on the circle and various tree-like spaces, but fails for spaces that don’t look one-dimensional.

## Words that contain UIO, and best-fitting lines

In Calculus I we spend a fair amount of time talking about how nicely the tangent line fits a smooth curve.

But truth be told, it fits only near the point of tangency. How can we find the best approximating line for a function ${f}$ on a given interval?

A natural measure of quality of approximation is the maximum deviation of the curve from the line, ${E(f;\alpha,\beta) = \max_{x\in [a, b]} |f(x) - \alpha x-\beta|}$ where ${\alpha,\beta}$ are the coefficients in the line equation, to be determined. We need ${\alpha,\beta}$ that minimize ${E(f;\alpha,\beta)}$.

The Chebyshev equioscillation theorem is quite useful here. For one thing, its name contains the letter combination uio, which Scrabble players may appreciate. (Can you think of other words with this combination?) Also, its statement does not involve concepts outside of Calculus I. Specialized to the case of linear fit, it says that ${\alpha,\beta}$ are optimal if and only if there exist three numbers ${x_1<x_2<x_3}$ in ${[a, b]}$ such that the deviations ${\delta_i = f(x_i) - \alpha x_i-\beta}$

• are equal to ${E(f;\alpha,\beta)}$ in absolute value: ${|\delta_i| = E(f;\alpha,\beta)}$ for ${i=1,2,3}$
• have alternating signs: ${\delta_1 = -\delta_2 = \delta_3}$

Let’s consider what this means. First, ${f'(x_i) =\alpha}$ unless ${x_i}$ is an endpoint of ${[a,b]}$. Since ${x_2}$ cannot be an endpoint, we have ${f'(x_2)=\alpha}$.

Furthermore, ${f(x) - \alpha x }$ takes the same value at ${x_1}$ and ${x_3}$. This gives an equation for ${x_2}$

$\displaystyle f(x_1)-f'(x_2)x_1 = f(x_3)-f'(x_2) x_3 \qquad \qquad (1)$

which can be rewritten in the form resembling the Mean Value Theorem:

$\displaystyle f'(x_2) = \frac{f(x_1)-f(x_3)}{x_1-x_3} \qquad \qquad (2)$

If ${f'}$ is strictly monotone, there can be only one ${x_i}$ with ${f'(x_i)=\alpha}$. Hence ${x_1=a}$ and ${x_3=b}$ in this case, and we find ${x_2}$ by solving (2). This gives ${\alpha = f'(x_2)}$, and then ${\beta}$ is not hard to find.
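For the record, combining ${\delta_1=-\delta_2}$ with ${\delta_i = f(x_i)-\alpha x_i-\beta}$ gives the intercept explicitly:

$\displaystyle \beta = \frac12\left[(f(x_1)-\alpha x_1) + (f(x_2)-\alpha x_2)\right]$

This is the average that the Sage code below computes.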

Here is how I did this in Sage:

var('x a b')
f = sin(x)  # or another function
df = f.diff(x)
a = 0     # left endpoint (example value)
b = pi/2  # right endpoint (example value)

That was the setup. Now the actual computation:

var('x1 x2 x3')
x1 = a
x3 = b
x2 = find_root(f(x=x1)-df(x=x2)*x1 == f(x=x3)-df(x=x2)*x3, a, b)
alpha = df(x=x2)
beta = 1/2*(f(x=x1)-alpha*x1 + f(x=x2)-alpha*x2)
show(plot(f,a,b)+plot(alpha*x+beta,a,b,color='red'))
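For readers without Sage, the same computation can be sketched in plain Python with SciPy (the choice of ${e^x}$ on ${[0,1]}$ is just an example with strictly monotone derivative):

```python
import numpy as np
from scipy.optimize import brentq

def best_linear_fit(f, df, a, b):
    """Best uniform linear fit alpha*x + beta on [a, b], assuming f' is
    strictly monotone (so the alternation points are a, an interior x2, and b)."""
    secant = (f(a) - f(b)) / (a - b)
    x2 = brentq(lambda x: df(x) - secant, a, b)  # solve equation (2)
    alpha = df(x2)
    beta = 0.5 * ((f(a) - alpha * a) + (f(x2) - alpha * x2))
    return alpha, beta

# Example: e^x on [0,1]; the optimal slope is e - 1.
alpha, beta = best_linear_fit(np.exp, np.exp, 0.0, 1.0)
```

The root exists because, by the Mean Value Theorem, the secant slope lies between the endpoint values of ${f'}$.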

However, the algorithm fails to properly fit a line to the sine function on ${[0,3\pi/2]}$.

The problem is, ${f'(x)=\cos x}$ is no longer monotone, making it possible for two of ${x_i}$ to be interior points. Recalling the identities for cosine, we see that these points must be symmetric about ${x=\pi}$. One of ${x_i}$ must still be an endpoint, so either ${x_1=a}$ (and ${x_3=2\pi-x_2}$) or ${x_3=b}$ (and ${x_1=2\pi-x_2}$). The first option works.

This same line is also the best fit on the full period ${[0,2\pi]}$. It passes through ${(\pi,0)}$ and has the slope of ${-0.2172336...}$ which is not a number I can recognize.
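That slope can be corroborated without a plot. By the symmetry about ${x=\pi}$, on ${[0,2\pi]}$ the interior equioscillation point ${x_2}$ satisfies ${\cos x_2=\alpha}$ together with ${\sin x_2 = \alpha(x_2-2\pi)}$ (my own reduction of the equioscillation conditions, not from the post); solving numerically recovers the slope:

```python
import numpy as np
from scipy.optimize import brentq

# Interior equioscillation point x2 in (pi/2, pi): sin(x2) = cos(x2)*(x2 - 2*pi).
x2 = brentq(lambda x: np.sin(x) - np.cos(x) * (x - 2 * np.pi), np.pi / 2, np.pi)
alpha = np.cos(x2)     # slope of the best-fitting line on [0, 2*pi]
beta = -np.pi * alpha  # so that the line passes through (pi, 0)
print(alpha)           # about -0.2172336
```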

On the interval ${[0,4\pi]}$, all three of the above approaches fail.

Luckily we don’t need a computer in this case. Whenever ${|f|}$ has at least three points of maximum with alternating signs of ${f}$, the Chebyshev equioscillation theorem implies that the best linear fit is the zero function.

## Scaling and oscillation

A function ${f\colon \mathbb R\rightarrow\mathbb R}$ can be much larger than its derivative. Take the constant function ${f(x)=10^{10}}$, for example. Or ${f(x)=10^{10}+\sin x}$ to make it nonconstant. But if one subtracts the average (mean) from ${f}$, the residual is nicely estimated by the derivative:

$\displaystyle \frac{1}{b-a}\int_a^b |f(x)-\overline{f}|\,dx \le \frac12 \int_a^b |f'(x)|\,dx \ \ \ \ \ (1)$

Here ${\overline{f}}$ is the mean of ${f}$ on ${[a,b]}$, namely ${\overline{f}=\frac{1}{b-a}\int_a^b f(t)\,dt}$. Indeed, what’s the worst that could happen? Something like a step: ${f}$ stays at one level, then climbs by a height ${H}$ to another level.

Here ${H}$ is at most the integral of ${|f'|}$, and the area between the graph and the mean line (which is ${\int_a^b |f-\overline{f}|}$) is at most ${\frac12 H(b-a)}$. This is what the inequality (1) says.
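Inequality (1) is easy to test numerically. A plain-Python sketch (the functions and intervals are arbitrary examples of mine):

```python
import numpy as np
from scipy.integrate import quad

def check_mean_bound(f, df, a, b):
    """Return (left side, right side) of inequality (1) for f on [a, b]."""
    mean = quad(f, a, b)[0] / (b - a)
    lhs = quad(lambda x: abs(f(x) - mean), a, b)[0] / (b - a)
    rhs = 0.5 * quad(lambda x: abs(df(x)), a, b)[0]
    return lhs, rhs

lhs, rhs = check_mean_bound(np.sin, np.cos, 0.0, 3.0)
assert lhs <= rhs
```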

An appealing feature of (1) is that it is scale-invariant. For example, if we change the variable ${u=2x}$, both sides remain the same. The derivative will be greater by the factor of ${2}$, but will be integrated over the shorter interval. And on the left we have averages upon averages, which do not change under scaling.

What happens in higher dimensions? Let’s stick to two dimensions and consider a smooth function ${f\colon\mathbb R^2\rightarrow\mathbb R}$. Instead of an interval we now have a square, denoted ${Q}$. It makes sense to denote squares by ${Q}$, because it’s natural to call a square a cube, and “Q” is the first letter of “cube”. Oh wait, it isn’t. Moving on…

The quantity ${b-a}$ was the length of interval of integration. Now we will use the area of ${Q}$, denoted ${|Q|}$. And ${\overline{f}=\frac{1}{|Q|}\iint_Q f}$ is now the mean value of ${f}$ on ${Q}$. At first glance one might conjecture the following version of (1):

$\displaystyle \frac{1}{|Q|}\iint_Q |f(x,y)-\overline{f}|\,dx\,dy \le C \int_Q |\nabla f(x,y)|\,dx\,dy \ \ \ \ \ (2)$

But this can’t be true because of inconsistent scaling. The left side of (2) is scale-invariant as before. The right side is not. If we shrink the cube by factor of ${2}$, the gradient ${|\nabla f|}$ will go up by ${2}$, but the area goes down by ${4}$. This suggests that the correct inequality should be

$\displaystyle \frac{1}{|Q|}\iint_Q |f(x,y)-\overline{f}|\,dx\,dy \le C \left(\int_Q |\nabla f(x,y)|^2\,dx\,dy\right)^{1/2} \ \ \ \ \ (3)$

We need the square root so that the right side of (3) scales correctly with ${f}$: to first power.

And here is the proof. Let ${f(*,y)}$ denote ${f}$ averaged over ${x}$. Applying (1) to every horizontal segment in ${Q}$, we obtain

$\displaystyle \frac{1}{h}\iint_Q |f(x,y)-f(*,y)|\,dx\,dy \le \frac12 \int_Q |f_x(x,y)|\,dx\,dy \ \ \ \ \ (4)$

where ${h}$ is the sidelength of ${Q}$. Now work with ${f(*,y)}$, using (1) along vertical segments:

$\displaystyle \frac{1}{h}\iint_Q |f(*,y)-f(*,*)|\,dx\,dy \le \frac12 \int_Q |f_y(*,y)|\,dx\,dy \ \ \ \ \ (5)$

Of course, ${f(*,*)}$ is the same as ${\overline{f}}$. The derivative on the right can be estimated: the derivative of average does not exceed the average of the absolute value of derivative. To keep estimates clean, simply estimate both partial derivatives by ${|\nabla f|}$. From (4) and (5) taken together it follows that

$\displaystyle \frac{1}{h}\iint_Q |f(x,y)-\overline{f}|\,dx\,dy \le \int_Q |\nabla f(x,y)|\,dx\,dy \ \ \ \ \ (6)$

This is an interesting result (a form of the Poincaré inequality), but in the present form it’s not scale-invariant. Remember that we expect the square of the gradient on the right. Cauchy-Schwarz to the rescue:

$\displaystyle \int_Q 1\cdot |\nabla f| \le \left( \int_Q 1 \right)^{1/2} \left( \int_Q |\nabla f|^2 \right)^{1/2}$

The first factor on the right is simply ${h}$. Move it to the left and we are done:

$\displaystyle \frac{1}{|Q|}\iint_Q |f(x,y)-\overline{f}|\,dx\,dy \le \left(\int_Q |\nabla f(x,y)|^2\,dx\,dy\right)^{1/2} \ \ \ \ \ (7)$
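Here is a discrete spot check of (7) on the unit square (plain Python; the test function is an arbitrary choice of mine, and the grid-based integrals are approximations):

```python
import numpy as np

def check_poincare(f, n=400):
    """Return (left side, right side) of (7) for f on the unit square,
    approximated on an n-by-n midpoint grid."""
    h = 1.0 / n
    t = (np.arange(n) + 0.5) * h
    X, Y = np.meshgrid(t, t)
    F = f(X, Y)
    lhs = np.mean(np.abs(F - F.mean()))           # mean oscillation of f
    g0, g1 = np.gradient(F, h)                    # partials along the two axes
    rhs = np.sqrt(np.sum(g0**2 + g1**2) * h * h)  # sqrt of integral of |grad f|^2
    return lhs, rhs

lhs, rhs = check_poincare(lambda x, y: np.sin(3 * x) * np.cos(2 * y))
assert lhs <= rhs
```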

In higher dimensions we would of course have ${n}$ instead of ${2}$, which is one of many reasons why analysis in two dimensions is special: ${L^n}$ is a Hilbert space only when ${n=2}$.

The left side of (7) is the mean oscillation of ${f}$ on the square ${Q}$. The integrability of ${|\nabla f|^n}$ in ${n}$ dimensions ensures that ${f}$ is a function of bounded mean oscillation, known as BMO. Actually, it is even in the smaller space VMO because the right side of (7) tends to zero as the square shrinks. But it need not be continuous or even bounded: for ${f(x)=\log\log (1/|x|) }$ the integral of ${|\nabla f|^n}$ converges in a neighborhood of the origin (just barely, thanks to ${\log^n (1/|x|)}$ in the denominator). This is unlike the one-dimensional situation where the integrability of ${|f'|}$ guarantees that the function is bounded.
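The borderline convergence can be verified directly. For ${f(x)=\log\log(1/|x|)}$ in two dimensions, ${|\nabla f| = 1/(|x|\log(1/|x|))}$, so in polar coordinates the integral of ${|\nabla f|^2}$ over ${|x|<1/2}$ is ${2\pi\int_0^{1/2} \frac{dr}{r\log^2(1/r)}}$, which the substitution ${u=\log(1/r)}$ turns into ${2\pi\int_{\log 2}^\infty u^{-2}\,du = 2\pi/\log 2}$ (my computation, consistent with the “just barely” remark above). A quick numeric check:

```python
import numpy as np
from scipy.integrate import quad

# Radial part of the integral of |grad f|^2 after the substitution u = log(1/r):
# 2*pi * int_{log 2}^inf du / u^2, which equals 2*pi / log 2 (finite).
val = 2 * np.pi * quad(lambda u: u**-2, np.log(2), np.inf)[0]
assert abs(val - 2 * np.pi / np.log(2)) < 1e-8
```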