Branching

Multiple kinds of branching here. First, the motorsport content has been moved to formula7.blog. Two blogs? Well, it became clear that my Stack Exchange activity, already on hiatus since 2018, is not going to resume (context: January 14, January 15, January 17). But typing words in boxes is still a hobby of mine.

There may be yet more branching in the knowledge market space, with Codidact and TopAnswers attempting to rise from the ashes of Stack Exchange. (I do not expect either project to have much success.)

Also, examples of branching in complex analysis are often limited to the situations where any two branches differ either by an additive constant like {\log z} or by a multiplicative constant like {z^p}. But different branches can even have different branch sets. Consider the dilogarithm, which has a very nice power series in the unit disk:

{\displaystyle f(z) = \sum_{n=1}^\infty \frac{z^n}{n^2} = z + \frac{z^2}{4} + \frac{z^3}{9} + \frac{z^4}{16} + \cdots}

The series even converges on the unit circle {|z|=1}, providing a continuous extension there. But this circle is also the boundary of the disk of convergence, so some singularity has to appear. And it does, at {z=1}. Going around this singularity and coming back to the unit disk, we suddenly see a function with a branch point at {z=0}, where there was no branching previously.

What gives? Consider the derivative:

{\displaystyle f'(z) = \sum_{n=1}^\infty \frac{z^{n-1}}{n} = -\frac{\log (1-z)}{z}}

As long as the principal branch of logarithm is considered, there is no singularity at {z} since {\log(1-0) = 0} cancels the denominator. But once we move around {z=1}, the logarithm acquires a multiple of {2\pi i }, and so {f'} gets an additional term {cz^{-1}}, and integrating that results in logarithmic branching at {z=0}.

Of course, this does not even begin the story of the dilogarithm, so I refer to Zagier’s expanded survey which has a few branch points itself.

Thus the dilogarithm is one of the simplest non-elementary functions one can imagine. It is also one of the strangest. It occurs not quite often enough, and in not quite an important enough way, to be included in the Valhalla of the great transcendental functions—the gamma function, Bessel and Legendre- functions, hypergeometric series, or Riemann’s zeta function. And yet it occurs too often, and in far too varied contexts, to be dismissed as a mere curiosity. First defined by Euler, it has been studied by some of the great mathematicians of the past—Abel, Lobachevsky, Kummer, and Ramanujan, to name just a few—and there is a whole book devoted to it. Almost all of its appearances in mathematics, and almost all the formulas relating to it, have something of the fantastical in them, as if this function alone among all others possessed a sense of humor.

Fundamental limits as differences of reciprocals

The three fundamental limits (as they were taught to me) are

{\displaystyle \lim_{x\to 0} \frac{e^x-1}{x} = \lim_{x\to 0} \frac{\log(1+x)}{x} = \lim_{x\to 0} \frac{\sin x}{x} = 1}

(exponential, logarithmic, trigonometric). Usually they come with pictures like

standard-pic
exp(x) – 1 and x

which show that the graph of the numerator in each limit, say {y=f(x)}, is indeed close to the line {y=x}.

But there are so many different degrees of “close”: for example, {0.0103\approx 0.01} but the reciprocals of these numbers differ by almost {3}. As a stress-test of the approximation {f(x)\approx x}, let us consider the behavior of {\displaystyle \frac{1}{f(x)} - \frac{1}{x}} for each of the fundamental limits.

Exponential limit

Expressed as a difference: {\displaystyle \frac{1}{e^x - 1} - \frac{1}{x}}

expm1
1/(exp(x)-1) – 1/x

Always negative, which is obvious once we recall that the graph of {e^x-1} lies above its tangent line. The graph has central symmetry about {(0, -1/2)} which is not as obvious but is an easy algebra exercise. Since the asymptote on the right is {y=0}, the other asymptote is {y=-1}. This looks a lot like the logistic function {1/(1+\exp(-x))} … but it is not, because the logistic function approaches its asymptotes exponentially fast, while our function does so at the rate {1/x}.
For comparison, here is the logistic curve {\displaystyle y = \frac{1}{1+\exp(-x/3)} - 1} (in green) scaled to match the behavior at {0}.

expm1-compare
Too slow to be logistic

This could be potentially useful if one needs a logistic-type function that behaves like a rational function at infinity. Simply using a rational function for this purpose would not do: it cannot have two distinct horizontal asymptotes.

Logarithmic limit

{\displaystyle \frac{1}{\log(x+1)} - \frac{1}{x}}

log1p
1/log(1+x) – 1/x

This one is always positive, and has a vertical asymptote at {x=-1} in addition to the horizontal asymptote at {0}. At a glance it may look like shifted/scaled hyperbola {y=1/x}. Indeed, {y = (1+x/3)/(2+x)} is a decent approximation to it, shown in green below.

log1p-compare
Not really a hyperbola

Trigonometric limit

{\displaystyle \frac{1}{\sin x} - \frac{1}{x}}

sin
1/sin(x) – 1/x

Unlike in the previous cases, the difference of reciprocals vanished at {0} due to the approximation {\sin x \approx x} being of higher order: the error term is cubic rather than quadratic. The graph looks like the tangent function but it cannot be exactly that since the nearest vertical asymptotes are {x=\pm \pi} rather than {x=\pm \pi/2}. (Not to mention other reasons such as non-periodicity.) A stretched and rescaled tangent, namely {\displaystyle \frac{1}{3}\tan \frac{x}{2}}, sort of fits:

sin-compare
Like a tangent?

Bonus limit

{\tan x \approx x}, with the reciprocal difference being {\displaystyle \frac{1}{\tan x} - \frac{1}{x}}.

tan
1/tan(x) – 1/x

The limit {\displaystyle \lim_{x\to 0} \frac{\tan x}{x} = 1} adds nothing new compared to the previous one, since {\cos 0=1}. But the difference of reciprocals is another story. For one thing, the principal error term for it is twice as large as for the sine limit: {-x/3} versis {x/6}. Accordingly, the graph looks more like {\displaystyle -\frac{2}{3}\tan \frac{x}{2}}.

tan-compare
Not really like a tangent

Taylor series

All of the above differences have derivatives of all orders at the origin, which is not easy to prove with the standard calculus tools. Complex analysis makes the situation clear: the reciprocal of a holomorphic function with a zero at {z=0} can be expanded into a Laurent series, and subtracting {1/z} eliminates the principal part of that series, leaving a convergent power series, i.e., another holomorphic function. Let us take a look at these series:

{\displaystyle \frac{1}{e^x - 1} - \frac{1}{x} = - \frac{1}{2} + \frac{x}{12} - \frac{x^{3}}{720} + \frac{x^{5}}{30240} - \frac{x^{7}}{1209600} + \frac{x^{9}}{47900160} -\cdots }

Nice, the coefficients have alternating signs and all of them have numerator {1}… oh no, next term is {\displaystyle - \frac{691 x^{11}}{1307674368000}}. The signs do continue to alternate. Apart from the constant term, only odd powers of {x} appear, according to the central symmetry noted above. The coefficient of {x^{n-1}} in this series is {B_n/n!} where {B_n} is the {n}th Bernoulli number. These are the “modern” Bernoulli numbers, with {B_1 = -1/2} rather than {1/2}.

{\displaystyle \frac{1}{\log(x+1)} - \frac{1}{x} = \frac{1}{2} - \frac{x}{12} + \frac{x^{2}}{24} - \frac{19 x^{3}}{720} + \frac{3 x^{4}}{160} - \frac{863 x^{5}}{60480} + \cdots }

Also alternating but not omitting even powers, and not decaying nearly as fast as the coefficients of the previous series. (Of course: this function has a singularity at {-1} while the nearest singularities of the exponential thing are at {\pm 2\pi i}.)  These are Gregory coefficients, which according to Wikipedia “often appear in works of modern authors who do not recognize them”. I would not recognize them either without OEIS.

{\displaystyle \frac{1}{\sin x} - \frac{1}{x} = \frac{x}{6} + \frac{7 x^{3}}{360} + \frac{31 x^{5}}{15120} + \frac{127 x^{7}}{604800} + \frac{73 x^{9}}{3421440} + \cdots }

These are all positive for some reason.

{\displaystyle \frac{1}{\tan x} - \frac{1}{x} = - \frac{x}{3} - \frac{x^{3}}{45} - \frac{2 x^{5}}{945} - \frac{x^{7}}{4725} - \frac{2 x^{9}}{93555} - \cdots }

These are all negative for some reason.

Discrete maximum principle for polynomials

The polynomially convex hull of a compact set {K\subset \mathbb C} is defined as the set of all points {z\in \mathbb C} such that the inequality {|p(z)|\le \sup_K |p|} (a form of the maximum principle) holds for every polynomial {p}. For example, the polynomially convex hull of a simple closed curve is the union of that curve with its interior region. In general, this process fills up the holes in the set {K}, resulting in the complement of the unbounded connected component of {\mathbb C\setminus K}.

We can recover the usual convex hull from this construction by restricting {p} to the polynomials of first degree. Indeed, when {p} is linear, the set {|p|\le M} is a closed disk, and we end up with the intersection of all closed disks that contain {K}. This is precisely the convex hull of {K}.

What if we restrict {p} to the polynomials of degree at most {n}? Let’s call the resulting set the degree-{n} convex hull of {K}, denoted {K_n}. By definition, it is contained in the convex hull and contains the polynomially convex hull. To exactly compute {K_n} for general {K} appears to be difficult even when {n=2}.

Consider finite sets. When {K} has at most {n} points, we have {K_n=K} because there is a polynomial of degree {n} whose zero set is precisely {K}. So, the first nontrivial case is of {K} having {n+1} points. Let us write {K=\{z_0, \dots, z_n\}}.

Depending on the location of the points, {K_n} may be strictly larger than {K}. For example, if {K} consists of the vertices of a regular {(n+1)}-gon, then {K_n} also contains its center. Here is why. By a linear transformation, we can make sure that {K=\{\zeta^k\colon k=0, \dots, n\}} where {\zeta = \exp(2\pi i/(n+1))}. For {j=1, \dots, n} we have {\sum_{k=0}^n \zeta^{kj} = (\zeta^{(n+1)j}-1)/(\zeta^j - 1) = 0}. Hence, for any polynomial {p} of degree at most {n}, the sum {\sum_{k=0}^n p(\zeta^k)} is equal to {(n+1)p(0)}. This implies {|p(0)|\le \max_{0\le k\le n}|p(\zeta^k)|}, a kind of a discrete maximum principle.

A more systematic approach is to use the Lagrange basis polynomials, that is {L_j(z) = \prod_{k\ne j} (z-z_k)/(z_j-z_k)}, which satisfy {L_j(z_k) = \delta_{jk}}. Since {p = \sum_j p(z_j)L_j} for any polynomial of degree at most {n}, it follows that {z\in K_n} if and only if {\left|\sum_j c_j L_j(z)\right|\le \max |c_j| } holds for every choice of scalars {c_0, \dots, c_n}. The latter is equivalent to the inequality {\sum_j |L_j(z)|\le 1}.

This leads us to consider the function {S=\sum_j |L_j|}, the sum of the absolute values of the Lagrange basis polynomials. (Remark: S is called a Lebesgue function for this interpolation problem.) Since {\sum_j L_j\equiv 1}, it follows that {S\ge 1} everywhere. By construction, {S=1} on {K}. At a point {z\notin K}, the equality {S(z)=1} holds if and only if {\arg L_j(z)} is the same for all {j}.

In the trivial case {K=\{z_0, z_1\}}, the function {S(z) = (|z-z_0|+|z-z_1|)/|z_0-z_1|} is equal to {1} precisely on the linear segment with endpoints {z_0, z_1}. Of course, this only repeats what we already knew: the degree-1 convex hull is the ordinary convex hull.

If {K=\{x_0, \dots, x_n\}} with {x_0<\cdots<x_n} real and {n\ge 2}, then {K_n=K}. Indeed, if {x\in K_n\setminus K}, then {x} lies in the convex hull of {K}, and therefore {x_{j-1}<x<x_j} for some {j}. The basis polynomial {L_j} is positive at {x}, since it is equal to {1} at {x_j} and does not vanish outside of {K}. Since a polynomial changes its sign at every simple zero, it follows that {L_{j+1}(x) < 0}. Well, there is no {L_{j+1}} if {j=n}, but in that case, the same reasoning applies to {L_{j-2}(x)<0}. In any case, the conclusion is that {\arg L_k(x)} cannot be the same for all {k}.

At this point one might guess that the vertex set of a regular polygon is the only kind of finite sets that admit a nontrivial discrete maximum principle for polynomials. But this is not so: the vertices of a rhombus work as well. Indeed, if {K=\{a, -a, ib, -ib\}} with {a, b>0}, then {L_j(0)>0} for all {j}, hence {S(0)=1}.

The vertices of a non-square rectangle do not work: if {K} is the set of these vertices, the associated function {S} is strictly greater than 1 on the complement of {K}.

Are there any other finite sets that support a discrete maximum principle for polynomials?

Institutions ranked by the number of AMS Fellows, 2020 edition

Only those with the count of 4 or greater are included. Not counting the deceased. Considering CUNY as a single institution. Source of data.

Top 10 changes compared to 2019 ranking: MIT takes sole possession of the 3rd place with Berkeley dropping into 4th, UIUC rises from 8th to 6th, Princeton drops from 6th to 8th, Stanford rises from 11th to 9th, Wisconsin-Madison drops from 8th to 10th, Illinois-Chicago rises from 13th to 10th. Due to ties, the “top 10” are actually top 12.

Honorable mention: Texas A&M rises from 19th to 16th. Having once set “top 20” as the goal of their “Vision 2020” campaign, they achieved it at least by this measure.

  1. Rutgers The State University of New Jersey New Brunswick : 44
  2. University of California, Los Angeles : 39
  3. Massachusetts Institute of Technology : 35
  4. University of California, Berkeley : 32
  5. University of Michigan : 31
  6. Cornell University : 26
  7. University of Illinois, Urbana-Champaign : 26
  8. Princeton University : 25
  9. Stanford University : 24
  10. Brown University : 23
  11. University of Illinois at Chicago : 23
  12. University of Wisconsin, Madison : 23
  13. New York University, Courant Institute : 22
  14. University of California, San Diego : 22
  15. University of Texas at Austin : 22
  16. Texas A&M University : 21
  17. University of Chicago : 21
  18. The City University of New York : 20
  19. University of Washington : 20
  20. Stony Brook University : 19
  21. University of Minnesota-Twin Cities : 19
  22. Purdue University : 17
  23. University of California, Santa Barbara : 17
  24. University of Pennsylvania : 17
  25. Duke University : 16
  26. Indiana University, Bloomington : 16
  27. Ohio State University, Columbus : 16
  28. University of Maryland : 16
  29. Georgia Institute of Technology : 15
  30. Northwestern University : 15
  31. Pennsylvania State University : 15
  32. University of California, Irvine : 14
  33. University of Utah : 14
  34. Johns Hopkins University, Baltimore : 13
  35. University of British Columbia : 12
  36. Boston University : 11
  37. Harvard University : 11
  38. University of California, Davis : 11
  39. University of Notre Dame : 11
  40. University of Toronto : 11
  41. Eidgenössische Technische Hochschule Zürich (ETH Zürich) : 10
  42. University of North Carolina at Chapel Hill : 10
  43. University of Virginia : 10
  44. Vanderbilt University : 10
  45. Brandeis University : 9
  46. Columbia University : 9
  47. Institute for Advanced Study : 9
  48. University of Oregon : 9
  49. Michigan State University : 8
  50. Rice University : 8
  51. Tel Aviv University : 8
  52. University of Georgia : 8
  53. University of Nebraska-Lincoln : 8
  54. University of Southern California : 8
  55. California Institute of Technology : 7
  56. Ecole Polytechnique Fédérale de Lausanne (EPFL) : 7
  57. Microsoft Research : 7
  58. North Carolina State University : 7
  59. University of Oxford : 7
  60. University of Rochester : 7
  61. Williams College : 7
  62. Carnegie Mellon University : 6
  63. Imperial College : 6
  64. Louisiana State University, Baton Rouge : 6
  65. The Hebrew University of Jerusalem : 6
  66. Université Pierre et Marie Curie (Paris VI) : 6
  67. University of Arizona : 6
  68. Harvey Mudd College : 5
  69. Northeastern University : 5
  70. Temple University : 5
  71. Université Paris-Diderot : 5
  72. University of California, Riverside : 5
  73. University of Colorado, Boulder : 5
  74. Virginia Polytechnic Institute and State University : 5
  75. Boston College : 4
  76. Florida State University : 4
  77. NYU Polytechnic School of Engineering : 4
  78. Norwegian University of Science and Technology : 4
  79. Rutgers The State University of New Jersey Newark : 4
  80. Université Paris-Sud (Paris XI) : 4
  81. University of California, Santa Cruz : 4
  82. University of Cambridge : 4
  83. University of Connecticut, Storrs : 4
  84. University of Missouri-Columbia : 4
  85. University of Tennessee, Knoxville : 4
  86. University of Warwick : 4
  87. Washington University : 4
  88. Weizmann Institute of Science : 4
  89. Yale University : 4
  90.  

Zeros of Taylor polynomials of (1+z)^p

This is post is related to Extremal Taylor polynomials where it was important to observe that the Taylor polynomials of the function {(1+z)^{-1/2}} do not have zeros in the unit disk. Let’s see how far this generalizes.

The function {f(z)=(1+z)^{-1}} has the rare property that all zeros of its Taylor polynomial have unit modulus. This is clear from

{\displaystyle T_n(z) = \sum_{k=0}^n (-z)^k = (1-(-z)^{n+1})/(1+z)}.

p-33
The Taylor zeros of (1+z)^(-1)

In this and subsequent illustrations, the zeros of the first 50 Taylor polynomials are shown as blue dots, with the unit circle in red for reference.

When the exponent is less than -1, the zeros move inside the unit disk and begin forming nice patterns in there.

p-43
(1+z)^(-4/3)
p-63
(1+z)^(-2)
p-93
(1+z)^(-3)
p-123
(1+z)^(-4)

When the exponent is strictly between -1 and 1, the zeros are all outside of the unit disk. Some of them get quite large, forcing a change of scale in the image.

p-23
(1+z)^(-2/3)
p-13
(1+z)^(-1/3)
p13
(1+z)^(1/3)
p23
(1+z)^(2/3)

Why does this happen when the exponent approaches 1? The function {1+z} is its own Taylor polynomial, and has the only zero at -1.  So, when {p\approx 1}, the Taylor polynomials are small perturbations of {1+z}. These perturbations of coefficients have to create additional zeros, but being small, they require a large value of {z} to help them.

For a specific example, the quadratic Taylor polynomial of {(1+z)^p} is {1 + pz + p(p-1)z^2/2}, with roots {(1\pm \sqrt{(2-p)/p})/(1-p) }. When {p\approx 1}, one of these roots is near {-1} (as it has to be) and the other is large.

Finally, when {p>1} and is not an integer, we get zeros on both sides of the unit circle. The majority of them are still outside. A prominent example of an interior zero is {-1/p} produced by the first-degree polynomial {1 + pz}.

p43
(1+z)^(4/3)
p73
(1+z)^(7/3)
p103
(1+z)^(10/3)

Another related post: Real zeros of sine Taylor polynomials.

Measuring the regularity of a graph by its Laplacian eigenvalues

Let {G} be a graph with vertices {1, 2, \dots, n}. The degree of vertex {i} is denoted {d_i}. Let {L} be the Laplacian matrix of {G}, so that {L_{ii}=d_i}, {L_{ij}} is {-1} when the vertices {i, j} are adjacent, and is {0} otherwise. The eigenvalues of {L} are written as {\lambda_1\le \dots \le \lambda_n}.

The graph is regular if all vertices have the same degree: {d_1=\cdots = d_n}. How can this property be seen from its Laplacian eigenvalues {\lambda_1, \dots, \lambda_n}?

Since the sum of eigenvalues is equal to the trace, we have {\sum \lambda_i = \sum d_i}. Moreover, {\sum \lambda_i^2} is the trace of {L^2}, which is equal to the sum of the squares of all entries of {L}. This sum is {\sum d_i^2 + \sum d_i} because the {i}th row of {L} contains one entry equal to {d_i} and {d_i} entries equal to {-1}. In conclusion, {\sum d_i^2 = \sum \lambda_i^2 - \sum\lambda_i}.

The Cauchy-Schwarz inequality says that {n\sum d_i^2 \ge \left(\sum d_i \right)^2} with equality if and only if all numbers {d_i} are equal, i.e., the graph is regular. In terms of eigenvalues, this means that the difference
{\displaystyle D =n\sum d_i^2 - \left(\sum d_i \right)^2 = n\sum (\lambda_i^2 - \lambda_i) - \left( \sum\lambda_i \right)^2 }
is always nonnegative, and is equal to zero precisely when the graph is regular. This is how one can see the regularity of a graph from its Laplacian spectrum.

As an aside, {D } is an even integer. Indeed, the sum {\sum d_i} is even because it double-counts the edges. Hence the number of vertices of odd degree is even, which implies that {\sum d_i^k } is even for every positive integer  {k }.

Up to a constant factor, {D} is simply the degree variance: the variance of the sequence {d_1, \dots, d_n}. What graph maximizes it for a given {n}? We want to have some very large degrees and some very small ones.

Let {G_{m, n}} be the union of the complete graph {K_m} on {m} vertices and {(n-m)} isolated vertices. The sum of degrees is {m(m-1)} and the sum of squares of degrees is {m(m-1)^2}. Hence,

{D = nm(m-1)^2 - (m(m-1))^2 = m(m-1)^2(n-m)}

For {n=3, 4, 5, 6} the maximum is attained by {m=n-1}, that is there is one isolated vertex. For {n=7, 8, 9, 10} the maximum is {m=n-2}. In general it is attained by {m^*=\lfloor (3n+2)/4 \rfloor}.

The graph {G_{m, n}} is disconnected. But any graph has the same degree variance as its complement. And the complement {G^c(m, n)} is always connected: it consists of a “center”, a complete graph on {n-m} vertices, and “periphery”, a set of {m} vertices that are connected to each central vertex. Put another way, {G^c(m, n)} is obtained from the complete bipartite graph {K_{m, n-m}} by connecting all vertices of the {n-m} group together.

Tom A. B. Snijders (1981) proved that {G(m^*, n)} and {G^c(m^*, n)} are the only graphs maximizing the degree variance; in particular, {G^c(m^*, n)} is the unique maximizer among the connected graphs. It is pictured below for {n=4, \dots, 9}.

The displacement set of nonlinear maps in vector spaces

Given a vector space {V} and a map {f\colon V\to V} (linear or not), consider the displacement set of {f}, denoted {D(f) = \{f(x)-x\colon x\in V\}}. For linear maps this is simply the range of the operator {f-I} and therefore is a subspace.

The essentially nonlinear operations of taking the inverse or composition of maps become almost linear when the displacement set is considered. Specifically, if {f} has an inverse, then {D(f^{-1}) = -D(f)}, which is immediate from the definition. Also, {D(f\circ g)\subset D(f)+D(g)}.

When {V} is a topological vector space, the maps for which {D(f)} has compact closure are of particular interest: these are compact perturbations of the identity, for which degree theory can be developed. The consideration of {D(f)} makes it very clear that if {f} is an invertible compact perturbation of the identity, then {f^{-1}} is in this class as well.

It is also of interest to consider the maps for which {D(f)} is either bounded, or is bounded away from {0}. Neither case can occur for linear operators, so this is essentially nonlinear analysis. In the nonlinear case, the boundedness assumption for linear operators is usually replaced by the Lipschitz condition. Let us say that {f} is {(L, \ell)}-bi-Lipschitz if {\ell\|x-y\|\le \|f(x)-f(y)\|\le L\|x-y\|} for all {x, y} in the domain of {f}.

Brouwer’s fixed point theorem fails in infinite-dimensional Hilbert spaces, but it not yet clear how hard it can fail. The strongest possible counterexample would be a bi-Lipschitz automorphism of the unit ball with displacement bounded away from 0. The existence of such a map is unknown. If it does not exist, that would imply that the unit ball and the unit sphere in the Hilbert space are not bi-Lipschitz equivalent, because the unit sphere does have such an automorphism: {x\mapsto -x}.

Concerning the maps with bounded displacement, here is a theorem from Patrick Biermann’s thesis (Theorem 3.3.2): if {f} is an {(L, \ell)}-bi-Lipschitz map in a Hilbert space, {L/\ell < \pi/\sqrt{8}}, and {f} has bounded displacement, then {f} is onto. The importance of bounded displacement is illustrated by the forward shift map {S(x_1, x_2, \dots) = (0, x_1, x_2, \dots)} for which {L=\ell=1} but surjectivity nonetheless fails.

It would be nice to get rid of the assumption {L/\ell < \pi/\sqrt{8}} in the preceding paragraph. I guess any bi-Lipschitz map with bounded displacement should be surjective, at least in Hilbert spaces, but possibly in general Banach spaces as well.

Orthogonality in normed spaces

For a vector {x} in a normed space {X}, define the orthogonal complement {x^\perp} to be the set of all vectors {y} such that {\|x+ty\|\ge \|x\|} for all scalars {t}. In an inner product space (real or complex), this agrees with the normal definition of orthogonality because {\|x+ty\|^2 - \|x\|^2 = 2\,\mathrm{Re}\,\langle x, ty\rangle + o(t)} as {t\to 0}, and the right hand side can be nonnegative only if {\langle x, y\rangle=0}.

Let’s see what properties of orthogonal complement survive in a general normed space. For one thing, {x^\perp=X} if and only if {x=0}. Another trivial property is that {0\in x^\perp} for all {x}. More importantly, {x^\perp} is a closed set that contains some nonzero vectors.

  •  Closed because the complement is open: if {\|x+ty\| < \|x\|} for some {t}, the same will be true for vectors close to {y}.
  • Contains a nonzero vector because the Hahn-Banach theorem provides a norming functional for {x}, i.e., a unit-norm linear functional {f\in X^*} such that {f(x)=\|x\|}. Any {y\in \ker f} is orthogonal to {x}, because {\|x+ty\|\ge f(x+ty) = f(x) = \|x\|}.

In general, {x^\perp} is not a linear subspace; it need not even have empty interior. For example, consider the orthogonal complement of the first basis vector in the plane with {\ell_1} (taxicab) metric: it is \{(x, y)\colon |y|\ge |x|\}.

download
The orthogonal complement of a horizontal vector in the taxicab plane

This example also shows that orthogonality is not symmetric in general normed spaces: {(1,1)\in (1,0)^\perp} but {(1,0)\notin (1,1)^\perp}. This is why I avoid using notation {y \perp x} here.

In fact, {x^\perp} is the union of kernels of all norming functionals of {x}, so it is only a linear subspace when the norming functional is unique. Containment in one direction was already proved. Conversely, suppose {y\in x^\perp} and define a linear functional {f} on the span of {x,y} so that {f(ax+by) = a\|x\|}. By construction, {f} has norm 1. Its Hahn-Banach extension is a norming functional for {x} that vanishes on {y}.

Consider {X=L^p[0,1]} as an example. A function {f} satisfies {1\in f^\perp} precisely when its {p}th moment is minimal among all translates {f+c}. This means, by definition, that its “{L^p}-estimator” is zero. In the special cases {p=1,2,\infty} the {L^p} estimator is known as the median, mean, and midrange, respectively. Increasing {p} gives more influence to outliers, so {1\le p\le 2} is the more useful range for it.

Unpopular positive opinion challenge

Challenge accepted. I got three, though none are below 40%.

The Yellow Birds (2018), 45% on RT

The market isn’t so hot for Iraq war movies. And it’s nearly impossible to adapt such an introspective novel into film. I still respect the effort and its outcome, even if all references to the normal distribution got left out of it.

I spent a lot of time trying to identify the exact point at which I noticed a change in Murph, somehow thinking that if I could figure out where he had begun to slide down the curve of the bell that I could do something about it. But these are subtle shifts, and trying to distinguish them is like trying to measure the degrees of gray when evening comes. It’s impossible to identify the cause of anything, and I began to see the war as a big joke, for how cruel it was, for how desperately I wanted to measure the particulars of Murph’s new, strange behavior and trace it back to one moment, to one cause, to one thing I would not be guilty of. And I realized very suddenly one afternoon while throwing rocks into a bucket in a daze that the joke was in fact on me. Because how can you measure deviation if you don’t know the mean? There was no center in the world. The curves of all our bells were cracked.

(From The Yellow Birds by Kevin Powers)

Hearts in Atlantis (2001), 49% on RT

Two actors with (essentially) the same first name and over 50 years of age difference (Anthony Hopkins 1937-, Anton Yelchin 1989-2016) make this Stephen King adaptation well worth watching.

He made another circuit of his room, working the tingles out of his legs, feeling like a prisoner pacing his cell. The door had no lock on it—no more than his mom’s did—but he felt like a jailbird just the same. He was afraid to go out. She hadn’t called him for supper, and although he was hungry—a little, anyway—he was afraid to go out. He was afraid of how he might find her… or of not finding her at all. Suppose she had decided she’d finally had enough of Bobby-O, stupid lying little Bobby-O, his father’s son? Even if she was here, and seemingly back to normal… was there even such a thing as normal? People had terrible things behind their faces sometimes. He knew that now.

(From Low Men in Yellow Coats by Stephen King)

Maze Runner: The Death Cure (2018), 43% on RT

Sure, it’s not as good as the first film in the series (which does not qualify for the challenge, scoring 65% on RT), but a major improvement on the mindless zombie chases of the second part. I like to think of it as a parable illustrating ethical issues in public health… allowing for the customary movie-science vs actual-science differences.

Measuring nonlinearity and reducing it

How to measure the nonlinearity of a function {f\colon I\to \mathbb R} where {I\subset \mathbb R} is an interval? A natural way is to consider the smallest possible deviation from a line {y=kx+b}, that is {\inf_{k, b}\sup_{x\in I}|f(x)-kx-b|}. It turns out to be convenient to divide this by {|I|}, the length of the interval {I}. So, let {\displaystyle NL(f;I) = \frac{1}{|I|} \inf_{k, b}\sup_{x\in I}|f(x)-kx-b|}. (This is similar to β-numbers of Peter Jones, except the deviation from a line is measured only in the vertical direction.)

NL-def
NL(f; I) is the maximal vertical distance between red and black, divided by the length of I

Relation with derivatives

The definition of derivative immediately implies that if {f'(a)} exists, then {NL(f;I)\to 0} as {I} shrinks to {a} (that is, gets smaller while containing {a}). A typical construction of a nowhere differentiable continuous function is based on making {NL(f;I)} bounded from below; it is enough to do this for dyadic intervals, and that can be done by adding wiggly terms like {2^{-n}\mathrm{dist}\,(x, 2^{-n}\mathbb Z)}: see the blancmange curve.

The converse is false: if {NL(f; I)\to 0} as {I} shrinks to {a}, the function {f} may still fail to be differentiable at {a}. The reason is that the affine approximation may have different slopes at different scales. An example is {f(x)=x \sin \sqrt{-\log |x|}} in a neighborhood of {0}. Consider a small interval {[-\delta, \delta]}. The line {y = kx} with {k=\sin\sqrt{-\log \delta}} is a good approximation to {f} because {f(x)/x\approx k} on most of the interval except for a very small part near {0}, and on that part {f} is very close to {0} anyway.

Why the root of logarithm? Because {\sin \log |x|} has a fixed amount of change on a fixed proportion of  {[-\delta, \delta]}, independently of {\delta}. We need a function slower than the logarithm, so that as {\delta} decreases, there is a smaller amount of change on a larger part of the interval {[-\delta, \delta]}.

Nonlinearity of Lipschitz functions

Suppose {f} is a Lipschitz function, that is, there exists a constant {L} such that {|f(x)-f(y)|\le L|x-y|} for all {x, y\in I}. It’s easy to see that {NL(f;I)\le L/2}, by taking the mid-range approximation {y=\frac12 (\max_I f + \min_I f)}. But the sharp bound is {NL(f;I)\le L/4} whose proof is not as trivial. The sharpness is shown by {f(x)=|x|} with {I=[-1,1]}.

sharpness
With the maximum difference 1/2 and the length of interval 2, we get NL(f; [-1, 1]) = 1/4
Proof. Let {k} be the slope of the linear function that agrees with {f} at the endpoints of {I}. Subtracting this linear function from {f} gives us a Lipschitz function {g} such that {-L-k\le g'\le L-k} and {\int_I g'= 0}. Let {A = \int_I (g')^+ = \int_I (g')^-}. Chebyshev’s inequality gives lower bounds for the measures of the sets {g'>0} and {g'<0}: namely, {|g'>0|\ge A/(L-k)} and {|g'<0|\le A/(L+k)}. By adding these, we find that {|I| \ge 2LA/(L^2-k^2)\ge 2A/L}. Since {\max _I g - \min_I g \le A}, the mid-range approximation to {g} has error at most {A/2 \le |I|L/4}. Hence {NL(f; I) = NL(g; I) \le L/4}.

Reducing nonlinearity

Turns out, the graph of every Lipschitz function has relatively large almost-flat pieces.  That is, there are subintervals of nontrivial size where the measure of nonlinearity is much smaller than the Lipschitz constant. This result is a special (one-dimensional) case of Theorem 2.3 in Affine approximation of Lipschitz functions and nonlinear quotients by Bates, Johnson, Lindenstrauss, Preiss, and Schechtman.

Theorem AA (for “affine approximation”): For every {\epsilon>0} there exists {\delta>0} with the following property. If {f\colon I\to \mathbb R} is an {L}-Lipschitz function, then there exists an interval {J\subset I} with {|J|\ge \delta |I|} and {NL(f; J)\le \epsilon L}.

Theorem AA should not be confused with Rademacher’s theorem which says that a Lipschitz function is differentiable almost everywhere. The point here is a lower bound on the size of the interval {J}. Differentiability does not provide that. In fact, if we knew that {f} is smooth, or even a polynomial, the proof of Theorem AA would not become any easier.

Proof of Theorem AA

We may assume {I=[-1, 1]} and {L=1}. For {t\in (0, 2]} let {L(t) = \sup \{|f(x)-f(y)|/|x-y| \colon x, y\in I, \ |x-y|\ge t\}}. That is, {L(t)} is the restricted Lipschitz constant, one that applies for distances at least {t}. It is a decreasing function of {t}, and {L(0+)=1}.

Note that {|f(-1)-f(1)|\le 2L(1)} and that every value of {f} is within {2L(1)} of either {f(-1)} or {f(1)}. Hence, the oscillation of {f} on {I} is at most {6L(1)}. If {L(1) \le \epsilon/3}, then the constant mid-range approximation on {I} gives the desired conclusion, with {J=I}. From now on {L(1) > \epsilon/3}.

The sequence {L_k = L(4^{-k})} is increasing toward {L(0+)=1}, which implies {L_{k+1}\le (1+\epsilon) L_k} for some {k}. Pick an interval {[a, b]\subset I} that realizes {L_k}, that is {b-a\ge 4^{-k}} and {|f(b)-f(a)| = 4^{-k}L_k}. Without loss of generality {f(b)>f(a)} (otherwise consider {-f}). Let {J = [(3a+b)/4, (a+3b)/4]} be the middle half of {[a. b]}. Since each point of {J} is within distance {\ge 4^{-k-1}} of both {a} and {b}, it follows that {\displaystyle f(b) + L_{k+1}(x-b) \le f(x) \le f(a) + L_{k+1}(x-a) } for all {x \in J}.

proof
The green lines have slope L_{k+1} which is close to the slope L_k of the secant line through (a, f(a)) and (b, f(b)). The graph of f is pinched between these green lines, except near a or b.

So far we have pinched {f} between two affine functions of equal slope. Let us consider their difference:
{\displaystyle (f(a) + L_{k+1}(x-a)) - (f(b) + L_{k+1}(x-b)) = (L_{k+1}-L_k) (b-a)}. Recall that {L_{k+1}\le (1+\epsilon) L_k}, which gives a bound of {\epsilon L_k(b-a) \le 2\epsilon L |J|} for the difference. Approximating {f} by the average of the two affine functions we conclude that {NL(f;J)\le \epsilon L} as required.

It remains to consider the size of {J}, about which we only know {|J|\ge 4^{-k}/2} so far. Naturally, we want to take the smallest {k} such that {L_{k+1}\le (1+\epsilon) L_k} holds. Let {m} be this value; then {L_m > (1+\epsilon)^{m} L_0}. Here {L_m\le 1} and {L_0 = L(1)> \epsilon/3 }. The conclusion is that {(1+\epsilon)^m < 3/\epsilon}, hence {m< \log(3/\epsilon)/\log(1+\epsilon)}. This finally yields {\displaystyle \delta = 4^{-\log(3/\epsilon)/\log(1+\epsilon)}/2} as an acceptable choice, completing the proof of Theorem AA.

A large amount of work has been done on quantifying {\delta} in various contexts; for example Heat flow and quantitative differentiation by Hytönen and Naor.