Biased and unbiased mollification

When we want to smooth (mollify) a given function {f:{\mathbb R}\rightarrow{\mathbb R}}, the standard recipe is: take the {C^{\infty}}-smooth bump function

\displaystyle    \varphi(t) = \begin{cases} c\, \exp\{-1/(1-t^2)\}\quad & |t|<1; \\   0 \quad & |t|\ge 1 \end{cases}

where {c} is chosen so that {\int_{{\mathbb R}} \varphi=1} (for the record, {c\approx 2.2523}).

[Figure: the standard bump {\varphi}]

Make the bump narrow and tall: {\varphi_{\delta}(t)=\delta^{-1}\varphi(t/\delta)}. Then define {f_\delta = f*\varphi_\delta}, that is

\displaystyle    f_\delta(x) = \int_{\mathbb R} f(x-t) \varphi_\delta(t)\,dt = \int_{\mathbb R} f(t) \varphi_\delta(x-t)\,dt

The second form of the integral makes it clear that {f_\delta} is infinitely differentiable. And it is easy to prove that for any continuous {f} the approximation {f_\delta} converges to { f} uniformly on compact subsets of {{\mathbb R}}.
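
To make this concrete, here is a small numerical sketch (Python with SciPy; the function names are my own, not part of the recipe): it normalizes the bump, forms {f_\delta} by quadrature, and watches the error on {[-1,1]} for the "ugly" function {f(x)=\sqrt{|x|}} discussed below.

import numpy as np
from scipy.integrate import quad

def bump(t):
    # the standard bump before normalization (note the minus sign in the exponent)
    return np.exp(-1.0 / (1.0 - t * t)) if abs(t) < 1 else 0.0

c = 1.0 / quad(bump, -1, 1)[0]       # c is approximately 2.2523, as quoted above
phi = lambda t: c * bump(t)

def mollify(f, delta):
    # f_delta = f * phi_delta, evaluated pointwise by numerical quadrature
    return lambda x: quad(lambda t: f(x - t) * phi(t / delta) / delta, -delta, delta)[0]

f = lambda x: np.sqrt(abs(x))
xs = np.linspace(-1, 1, 41)
for delta in (0.1, 0.01):
    f_delta = mollify(f, delta)
    print(delta, max(abs(f_delta(x) - f(x)) for x in xs))   # the error shrinks as delta decreases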

The choice of the particular mollifier given above is quite natural: we want a {C^\infty} function with compact support (to avoid any issues with fast-growing functions {f}), so it cannot be analytic. And functions like {\exp(-1/t)} are standard examples of infinitely smooth non-analytic functions. Being nonnegative is obviously a plus. What else to ask for?

Well, one may ask for a good rate of convergence. If {f} is an ugly function like {f(x)=\sqrt{|x|}}, then we probably should not expect fast convergence. But it could be something like {f(x)=|x|^7}: a function that is already six times differentiable. Will the rate of convergence be commensurate with the head start {f\in C^6} that we are given?

No, it will not. The limiting factor is not the lack of seventh derivative at {x=0}; it is the presence of (nonzero) second derivative at {x\ne 0}. To study this effect in isolation, consider the function {f(x)=x^2}, which has nothing beyond the second derivative. Here it is together with {f_{0.1}}: the red and blue graphs are nearly indistinguishable.

[Figure: the parabola {f(x)=x^2} and {f_{0.1}}: a good approximation]

But upon closer inspection, {f_{0.1}} misses the target by almost {2\cdot 10^{-3}}. And not only around the origin: the difference {f_{0.1}-f} is constant.

[Figure: but it overshoots the target]

With {\delta=0.01} the approximation is better.

[Figure: better approximation with {\delta=0.01}]

But upon closer inspection, {f_{0.01}} misses the target by almost {2\cdot 10^{-5}}.

[Figure: still overshoots]

And so it continues, with the error of order {\delta^2}. And here is where it comes from:

\displaystyle f_\delta(0) = \int_{\mathbb R} t^2\varphi_\delta(t)\,dt = \delta^{-1} \int_{\mathbb R} t^2\varphi(t/\delta)\,dt  = \delta^{2} \int_{\mathbb R} s^2\varphi(s)\,ds

The root of the problem is the nonzero second moment {\int_{\mathbb R} s^2\varphi(s)\,ds \approx 0.158}. But of course, this moment cannot be zero if {\varphi} does not change sign. All familiar mollifiers, from Gaussian and Poisson kernels to compactly supported bumps such as {\varphi}, have this limitation. Since they do not reproduce quadratic polynomials exactly, they cannot approximate anything with nonzero second derivative better than to the order {\delta^2}.
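
A quick numerical check of this bias (a throwaway Python sketch; it only recomputes the constants just quoted):

import numpy as np
from scipy.integrate import quad

bump = lambda t: np.exp(-1.0 / (1.0 - t * t)) if abs(t) < 1 else 0.0
c = 1.0 / quad(bump, -1, 1)[0]                       # approximately 2.2523
m2 = quad(lambda t: t * t * c * bump(t), -1, 1)[0]   # the second moment, approximately 0.158

# for f(x) = x^2, the constant offset f_delta - f equals m2 * delta^2
for delta in (0.1, 0.01):
    print(delta, m2 * delta**2)   # consistent with the "almost 2e-3" and "almost 2e-5" above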

Let’s find a mollifier without such limitations; that is, with zero moments of all orders {n\ge 1} (the zeroth moment must remain {1}). One way to do it is to use the Fourier transform. Since {\int_{\mathbb R} t^n \varphi(t)\,dt } is a multiple of {\widehat{\varphi}^{(n)}(0)}, it suffices to find a nice function {\psi} such that {\psi(0)=1} and {\psi^{(n)}(0)=0} for {n\ge 1}; the mollifier will be the inverse Fourier transform of {\psi}.
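
Explicitly, with the transform convention {\widehat{\varphi}(\xi)=\int_{\mathbb R} \varphi(t)e^{-2\pi i \xi t}\,dt} (the same one used in the next section), differentiation under the integral sign gives

\displaystyle \widehat{\varphi}^{(n)}(0) = \int_{\mathbb R} (-2\pi i t)^n \varphi(t)\,dt = (-2\pi i)^n \int_{\mathbb R} t^n \varphi(t)\,dt

so the moments of {\varphi} of order {n\ge 1} vanish exactly when the corresponding derivatives of {\widehat{\varphi}} vanish at the origin.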

As an example, I took something similar to the original {\varphi}, but with a flat top:

\displaystyle  \psi(\xi) = \begin{cases} 1 \quad & |\xi|\le 0.1; \\    \exp\{1-1/(1-(|\xi|-0.1)^2)\} \quad & 0.1<|\xi|<1.1; \\  0\quad & |\xi|\ge 1.1  \end{cases}

[Figure: {\psi}, the Fourier transform of the unbiased mollifier]

The inverse Fourier transform of {\psi} is a mollifier that reproduces all polynomials exactly: {p*\varphi = p} for any polynomial {p}. Here it is:

[Figure: the unbiased mollifier]
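
This graph can be reproduced numerically; here is a sketch (my own, computing the inverse transform by direct quadrature rather than FFT) that uses the fact that {\psi} is real, even, and supported on {[-1.1,1.1]}:

import numpy as np
from scipy.integrate import quad

def psi(xi):
    a = abs(xi)
    if a <= 0.1:
        return 1.0
    if a < 1.1:
        s = a - 0.1
        return np.exp(1.0 - 1.0 / (1.0 - s * s))
    return 0.0

def unbiased_mollifier(t):
    # psi is real and even, so its inverse Fourier transform reduces to a cosine integral
    return 2.0 * quad(lambda xi: psi(xi) * np.cos(2 * np.pi * xi * t), 0, 1.1, limit=200)[0]

ts = np.linspace(-10, 10, 401)
vals = [unbiased_mollifier(t) for t in ts]
# the values take both signs, and the tail decays rather slowly, as discussed next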

Since I did not make {\psi} very carefully (its second derivative is discontinuous at {\pm 0.1}), the mollifier {\varphi} has a moderately heavy tail. With a more careful construction it would decay faster than any power of {t}. However, it cannot be compactly supported. Indeed, if {\varphi} were compactly supported, then {\widehat{\varphi}} would be real-analytic; that is, represented by its power series centered at {\xi=0}. But that power series is

\displaystyle 1+0+0+0+0+0+0+0+0+0+\dots

which would force {\widehat{\varphi}} to be identically {1}; but {\widehat{\varphi}=\psi} vanishes for {|\xi|\ge 1.1}, a contradiction.

The idea of using negative weights in the process of averaging {f} looks counterintuitive, but it’s a fact of life. Like the appearance of negative coefficients in the 9-point Newton-Cotes quadrature formula… but that’s another story.

Credit: I got the idea of this post from the following remark by fedja on MathOverflow:

The usual spherical cap mollifiers reproduce constants and linear functions faithfully but have a bias on quadratic polynomials. That’s why you cannot go beyond {C^2} and {\delta^2} with them.

Entropic uncertainty

Having considered the SMBC version of the Fourier transform, it is time to take a look at the traditional one:

\displaystyle \widehat{f}(\xi)=\int_{{\mathbb R}}f(x)e^{-2\pi i \xi x}\,dx

(I am not going to worry about the convergence of any integrals in this post.) It is obvious that for any {\xi\in{\mathbb R}}

\displaystyle |\widehat{f}(\xi)|\le \int_{{\mathbb R}}|f(x)|\,dx

which can be tersely stated as {\|\widehat{f}\|_\infty\le \|f\|_1} using the {L^p} norm notation. A less obvious, but more important, relation is {\|\widehat{f}\|_2= \|f\|_2}. Interpolating between {p=1} and {p=2} we obtain the Hausdorff-Young inequality {\|\widehat{f}\|_q\le \|f\|_p} for {1\le p\le 2}. Here and in what follows {q=p/(p-1)}.

Summarizing the above, the function {r(p)=\|\widehat{f}\|_q/\|f\|_p} does not exceed {1} on the interval {[1,2]} and attains the value {1} at {p=2}. This brings back memories of Calculus I and the fun we had finding the absolute maximum of a function on a closed interval. More specifically, it brings the realization that {r'(2)\ge 0}. (I do not worry about differentiability either.)

What does the inequality {r'(2)\ge 0} tell us about {f}? When writing it out, it is better to work with {(\log r)' = r'/r}, avoiding another memory of Calculus I: the Quotient Rule.

\displaystyle    \log r = \frac{1}{q} \log \int_{{\mathbb R}} |\widehat{f}(\xi )|^q \,d\xi   -\frac{1}{p} \log \int_{{\mathbb R}} |f(x)|^{p} \,dx

To differentiate this at {p=2}, normalize {\|f\|_2=1}, so that both integrals equal {1} when {p=q=2} (Plancherel); then we only have to recall {(a^p)'=a^p \log a}, and nothing more unpleasant happens:

\displaystyle    (\log r)'(2) =  - \frac12 \int_{{\mathbb R}} |\widehat{f}(\xi )|^2 \,\log |\widehat{f}(\xi)|\,d\xi - \frac12 \int_{{\mathbb R}} |f(x)|^2 \,\log |f(x)|\,dx

Here the integral with {\widehat{f}} gets the minus sign from the chain rule: {(p/(p-1))'=-1} at {p=2}. In terms of the Shannon entropy {H(\phi)=-\int |\phi|\log |\phi| }, the inequality {(\log r)'(2)\ge 0} becomes simply

\displaystyle    H(|f|^2)+H(|\widehat{f}|^2)\ge 0    \ \ \ \ \ (1)

Inequality (1) was proved by I. Hirschman in 1957, and I followed his proof above. The left side of (1) is known as the entropic uncertainty (or Hirschman uncertainty) of {f}. As Hirschman himself conjectured, (1) is not sharp: it can be improved to

\displaystyle    H(|f|^2)+H(|\widehat{f}|^2)\ge 1-\log 2   \ \ \ \ \ (2)
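
For the record, (2) cannot be improved further: equality holds for Gaussians. For instance, take {f(x)=2^{1/4}e^{-\pi x^2}}; then {\widehat{f}=f}, {\|f\|_2=1}, and {|f|^2} is the density of a normal distribution with variance {1/(4\pi)}, whose entropy is {\frac12\log(2\pi e\sigma^2)}. Hence

\displaystyle H(|f|^2)+H(|\widehat{f}|^2)= 2\cdot\frac12 \log\Bigl(2\pi e\cdot \frac{1}{4\pi}\Bigr) = 1-\log 2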

The reason is that the Hausdorff-Young inequality {r(p)\le 1} is itself not sharp for {1<p<2}. It took about twenty years until W. Beckner proved the sharp form of the Hausdorff-Young inequality in his Ph.D. thesis (1975):

\displaystyle    r(p) \le \sqrt{p^{1/p}/q^{1/q}}   \ \ \ \ \ (3)

Here is the plot of the upper bound in (3):

[Figure: the sharp constant in (3) as a function of {p}]

Since the graph of {r} stays below this curve and touches it at {(2,1)}, the derivative {r'(2)} is no less than the slope of the curve at {p=2}, which is {(1-\log 2)/4}. Recalling that {H(|f|^2)+H(|\widehat{f}|^2)=4r'(2)}, we arrive at (2).
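
As a sanity check, that slope can be confirmed numerically (a throwaway sketch):

import numpy as np

g = lambda p: np.sqrt(p**(1 / p) / (p / (p - 1))**((p - 1) / p))   # the bound in (3), with q = p/(p-1)
h = 1e-6
print((g(2.0) - g(2.0 - h)) / h, (1 - np.log(2)) / 4)   # both are approximately 0.0767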

The best known form of the uncertainty principle is due to H. Weyl:

\displaystyle    \|\,x f\,\|_2 \cdot \|\,\xi \widehat{f}\,\|_2 \ge \frac{1}{4\pi}\|f\|_2^2   \ \ \ \ \ (4)

Although (4) can be derived from (2), this route is rather inefficient: Beckner’s theorem is hard, while a direct proof of (4) takes only a few lines: integration by parts {\int |f(x)|^2\,dx = -\int x \frac{d}{dx}|f(x)|^2\,dx }, chain rule and the Cauchy-Schwarz inequality.
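
For completeness, here is how those few lines might look (assuming {f} is smooth and decays fast enough to justify the integration by parts). Since {\frac{d}{dx}|f|^2=f'\bar{f}+f\overline{f'}},

\displaystyle \|f\|_2^2 = -\int_{\mathbb R} x\,\frac{d}{dx}|f(x)|^2\,dx \le 2\int_{\mathbb R} |x f(x)|\,|f'(x)|\,dx \le 2\,\|xf\|_2\,\|f'\|_2 = 4\pi\,\|xf\|_2\,\|\xi\widehat{f}\,\|_2

where the last step uses {\widehat{f'}(\xi)=2\pi i \xi\,\widehat{f}(\xi)} and the Plancherel identity.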

But we can take another direction and use (1) (not the hard, sharp form (2)) to obtain the following inequality, also due to Hirschman: for every {\alpha>0} there is {C_\alpha>0} such that

\displaystyle    \|\,|x|^\alpha f\,\|_2 \cdot \|\,|\xi|^\alpha \widehat{f}\,\|_2 \ge C_\alpha \|f\|_2^2   \ \ \ \ \ (5)

It is convenient to normalize {f} so that {\|f\|_2=1}. This makes {\rho =|f|^2} a probability distribution (and {|\widehat f |^2 } as well). Our goal is to show that for any probability distribution {\rho}

\displaystyle    H(\rho) \le \alpha^{-1} \log \int |x|^\alpha \rho(x)\,dx + B_\alpha    \ \ \ \ \ (6)

where {B_\alpha} depends only on {\alpha}. Clearly, (1) and (6) imply (5).
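
Indeed, apply (6), with {2\alpha} in place of {\alpha}, to the probability densities {|f|^2} and {|\widehat{f}|^2}, add the two estimates, and use (1):

\displaystyle 0\le H(|f|^2)+H(|\widehat{f}|^2) \le \alpha^{-1}\log\bigl(\|\,|x|^\alpha f\,\|_2\,\|\,|\xi|^\alpha \widehat{f}\,\|_2\bigr)+2B_{2\alpha}

which is (5) with {C_\alpha = e^{-2\alpha B_{2\alpha}}}, thanks to the normalization {\|f\|_2=1}.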

A peculiar feature of (6) is that {x} appears in the integral on the right, but not on the left. This naturally makes one wonder how (6) behaves under scaling {\rho_\lambda(x)=\lambda \rho (\lambda x)}. Well, wonder no more—

\displaystyle    H(\rho_\lambda)= H(\rho) - \log \lambda \int \rho = H(\rho)- \log\lambda

and

\displaystyle     \int |x|^\alpha \rho_\lambda (x)\,dx =\lambda^{-\alpha}    \int |x|^\alpha \rho (x)\,dx

Thus, both sides of (6) change by {-\log \lambda}. The inequality passed the scaling test, and now we turn scaling to our advantage by making {\int |x|^\alpha \rho(x)\,dx =1}. This reduces (6) to {H(\rho)\le B_\alpha}.

Now comes a clever trick (due to Beckner): introduce another probability measure {d\gamma = A_\alpha \exp(-|x|^\alpha/\alpha)\,dx} where {A_\alpha} is a normalizing factor. Let {\phi(x) = A_\alpha^{-1}\exp( |x|^\alpha/\alpha)\,\rho (x)}, so that {\int \phi\,d\gamma=1}. By Jensen’s inequality,

\displaystyle    \int \phi \log \phi \,d\gamma \ge \int \phi \,d\gamma \cdot \log \int \phi \,d\gamma =0

On the other hand,

\displaystyle    \int \phi \log \phi \,d\gamma = \int \rho \log \phi \,dx    = -\log A_\alpha+\alpha^{-1} -H(\rho)

and we obtain the desired bound {H(\rho)\le B_\alpha} with {B_\alpha=\alpha^{-1}-\log A_\alpha}.

Biographical note

Although Wikipedia refers to Isidore Isaac Hirschman, Jr. as a living person, he died in 1990. From the MAA Digital Library:

Halmos photographed analyst Isidore Hirschman (1922-1990) in June of 1960. Hirschman earned his Ph.D. in 1947 from Harvard with the dissertation “Some Representation and Inversion Problems for the Laplace Transform,” written under David Widder. After writing ten papers together, Hirschman and Widder published the book The Convolution Transform in 1955 (Princeton University Press; now available from Dover Publications). Hirschman spent most of his career (1949-1978) at Washington University in St. Louis, Missouri, where he published mainly in harmonic analysis and operator theory.

Further reading:

  1. Beckner, William. “Inequalities in Fourier analysis.” Ann. of Math. (2) 102.1 (1975): 159-182.
  2. Cowling, Michael G., and John F. Price. “Bandwidth versus time concentration: the Heisenberg-Pauli-Weyl inequality.” SIAM Journal on Mathematical Analysis 15.1 (1984): 151-165.
  3. Folland, Gerald B., and Alladi Sitaram. “The uncertainty principle: a mathematical survey.” Journal of Fourier Analysis and Applications 3.3 (1997): 207-238.
  4. Hirschman Jr, I. I. “A note on entropy.” American Journal of Mathematics 79.1 (1957): 152-156.

fourier transform according to Saturday Morning Breakfast Cereal

Although not (yet?) tenured, I decided to implement this transform anyway: see fourier transform. To make the fouriest transform unique, ties are broken in favor of the greatest base (which is reasonable, because greater bases yield more compact representations).

[Figure: Caution: JavaScript integers go only this far]

If no base can beat the decimal, the fourier transform does not exist. I limited consideration to bases up to 36, because they yield convenient representations using the alphabet 012..xyz.
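
The original implementation is in JavaScript; below is a minimal Python sketch of the same search (function and variable names are my own), counting 4s in every base from 2 to 36 and breaking ties in favor of the greatest base.

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base(n, b):
    # representation of a nonnegative integer n in base b, using the alphabet above
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, r = divmod(n, b)
        digits.append(DIGITS[r])
    return "".join(reversed(digits))

def fouriest(n):
    # the representation (and base) with the most 4s among bases 2..36, provided it
    # beats the decimal representation; ties are broken in favor of the greatest base
    best = None
    decimal_fours = str(n).count("4")
    for b in range(2, 37):
        rep = to_base(n, b)
        fours = rep.count("4")
        if fours > decimal_fours and (best is None or fours >= best[2]):
            best = (rep, b, fours)
    return None if best is None else (best[0], best[1])

print(fouriest(42))     # None: no base beats the decimal representation
print(fouriest(2013))   # a representation with more 4s than "2013"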

Some examples:

number      fourier transform(s)                           fouriest
42          does not exist                                 does not exist
2013        43b (base 22), 4bi (base 21), 5604 (base 7)    43b (base 22)
13244       410434 (base 5)                                410434 (base 5)
57885161    (too many to list)                             5424402505 (base 6)

The last line is merely the exponent of the latest and greatest Mersenne prime. If you want to transform the number itself, go four it…

The transform may have a practical application. For example, my office phone number 4431487 is a pretty undistinguished number. But it has a unique fourier (hence fouriest) transform 52444534 (base 7), which is easier to remember. This will come in handy when dialpads become equipped with a number base switch.


I added the line “Input interpreted as …” to give some indication of how JavaScript handles integers outside of its {2^{53}} range. For example, entering 9999999999999999 you will find that the code actually transforms 10000000000000000. This one is handled correctly: the fouriest transform is 242244120325135240144 (base 6). But in general, round-off errors may distort the results when the input is large. Do not rely on this code in situations where the fourierness of a very large number is a matter of life and death.

Open problem: can a number of the form 444…4 have a fourier transform? It is conceivable, though unlikely, that its representation in a base smaller than 10 could have more 4s.

The OEIS sequence A000002

The sequences in The On-Line Encyclopedia of Integer Sequences® are themselves arranged into a sequence, each being indexed by Axxxxxx. Hence, the sequence
2, 3, 2, 1, 3, 4, 1, 7, 7, 5, 45, 2, 181, 43, 17, ...
can never be included in the encyclopedia: its definition is a_n = 1 + (nth term of the nth sequence in the OEIS). Indeed, if it were included, say as sequence number m, we would need a_m = 1 + a_m.

By the way, which sequences have the honor of being on the virtual first page of the OEIS? A000004 consists of zeros. Its description offers the following explicit formula:

a(n)=A*sin(n*Pi) for any A. [From Jaume Oliver Lafont, Apr 02 2010]

The sequence A000001 counts the groups of order n.

The sequence A000002 consists of 1s and 2s only:
1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, ...
The defining property of this sequence is self-similarity: consider the runs of equal numbers
1, 22, 11, 2, 1, 22, 1, 22, 11, 2, 11, 22, 1, 2, 11, 2, 1, 22, 11, ...
and replace each run by its length:
1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, ...
You get the original sequence again.

This is the Kolakoski sequence, which first appeared in print in 1965. Contrary to what one might expect from such a simple definition, there is no simple pattern; it is not even known whether 1s and 2s appear with the same asymptotic frequency 1/2. One way to look for patterns in a function/sequence is to take its Fourier transform, and this is what I did below:

[Figure: Fourier transform {|F(t)|} of the Kolakoski sequence]

Specifically, this is the absolute value of {F(t)=\sum_{n=1}^{500} (2a_n-3) e^{2\pi i n t}}. To better see where the peaks fall, it’s convenient to look at {|F(1/t)|}:

[Figure: {|F(1/t)|}, reciprocal axis]
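
These plots can be reproduced with a short Python sketch (my own; the sequence generator mirrors the Maple code at the end of this post):

import numpy as np

def kolakoski(N):
    # first N terms of the Kolakoski sequence
    ks, j, new = [], 0, 1
    while len(ks) < N:
        ks.append(new)
        if ks[j] == 2:      # the current run has length 2
            ks.append(new)
        j += 1
        new = 3 - new
    return ks[:N]

a = np.array(kolakoski(500))
n = np.arange(1, 501)
t = np.linspace(0, 1, 2001)
# F(t) = sum over n of (2 a_n - 3) exp(2 pi i n t); the coefficients are -1 for 1s and +1 for 2s
F = np.exp(2j * np.pi * np.outer(t, n)) @ (2 * a - 3)
print(t[np.argsort(np.abs(F))[-6:]])   # the largest peaks should sit near t = 1/3 and t = 2/3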

Something is going on at t=1/3, i.e. at period 3. Recall that the sequence contains no triples 111 or 222: there are no runs of more than 2 equal numbers. So perhaps it’s not surprising that a pattern emerges at period 3. And indeed, even though the percentage of 1s among the first 500 terms of the sequence is 49.8%, breaking it down by n mod 3 yields the following:

  • Among a_{3k}, the percentage of 1s is 17.4%
  • Among a_{3k+1}, the percentage of 1s is 85.2%
  • Among a_{3k+2}, the percentage of 1s is 46.8%

Noticing another peak at t=1/9, I also broke down the percentages by n mod 9:

  • n=9k: there are 5.4% 1s
  • n=9k+1: there are 74.8% 1s
  • n=9k+2: there are 41.4% 1s
  • n=9k+3: there are 21.6% 1s
  • n=9k+4: there are 97.2% 1s
  • n=9k+5: there are 70.2% 1s
  • n=9k+6: there are 25.2% 1s
  • n=9k+7: there are 84.6% 1s
  • n=9k+8: there are 28.8% 1s

Nothing close to 50%…

In conclusion, my code for the Kolakoski sequence. This time I used Maple (instead of Scilab) for no particular reason.

# Generate the first N terms of the Kolakoski sequence:
# i is the position being written, j is the index of the current run,
# and new alternates between 1 and 2 (the value of the current run).
N:=500; ks:=Array(1..N+2); i:=0; j:=1; new:=1;
while i<=N do i:=i+1; ks[i]:=new;
  # when the j-th term equals 2, the current run has length 2
  if (ks[j]=2) then i:=i+1; ks[i]:=new; end if;
  j:=j+1; new:=3-new;
end do: