Continued fractions vs Power series

The Taylor series of the exponential function,
{\displaystyle e^x = \sum_{n=0}^\infty \frac{x^n}{n!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \frac{x^5}{120} + \cdots }
provides reasonable approximations to the function near zero: for example, with {\displaystyle T_3(x) = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} } we get

Taylor polynomial T_3 in red

The quality of approximation not being spectacular, one can try to improve it by using rational functions instead of polynomials. In view of the identity
{\displaystyle e^x = \frac{e^{x/2}}{e^{-x/2}} \approx \frac{T_3(x/2)}{T_3(-x/2)}}
one can get {e^x\approx p(x)/p(-x)} with {\displaystyle p(x) = 1 + \frac{x}{2} + \frac{x^2}{8} + \frac{x^3}{48} }. The improvement is substantial for negative {x}:

Rational approximation based on exp(x/2)

Having rational approximation of the form {e^x\approx p(x)/p(-x)} makes perfect sense, because such approximants obey the same functional equation {f(x)f(-x)=1} as the exponential function itself. We cannot hope to satisfy other functional equations like {f(x+1)=e f(x)} by functions simpler than {\exp}.

However, the polynomial {T_n(x/2)} is not the optimal choice for the approximation {e^x \approx p(x)/p(-x)}, except for {n=0, 1}. For degree {3}, the optimal choice is {\displaystyle p(x) = 1 + \frac{x}{2} + \frac{x^2}{10} + \frac{x^3}{120} }. In the same plot window as above, the graph of {p(x)/p(-x)} is indistinguishable from {e^x}.

Red=rational, Blue=exponential

This is a Padé approximant to the exponential function. One way to obtain such approximants is to replace the Taylor series with a continued fraction, using long division:

{\displaystyle e^x = 1 + \dfrac{x}{1 + \dfrac{-x/2}{1 + \dfrac{x/6}{1 + \dfrac{-x/6}{1 + \dfrac{x/10}{1 + \dfrac{-x/10}{1 + \cdots }}}}}} }

Terminating the expansion after an even number of {x} appearances gives a Padé approximant of the above form.

This can be compared to replacing the decimal expansion of the number {e}:

{\displaystyle e = 2 + \frac{7}{10} + \frac{1}{10^2} + \frac{8}{10^3} + \frac{2}{10^4}+\cdots }

with the continued fraction expansion

{\displaystyle e = 2 + \dfrac{1}{1 + \dfrac{1}{2 + \dfrac{1}{1 + \dfrac{1}{1 + \dfrac{1}{4 + \dfrac{1}{1 + \cdots }}}}}} }

which, besides greater accuracy, has a regular pattern: 1 2 1, 1 4 1, 1 6 1, 1 8 1, …

The numerators {p} of the (diagonal) Padé approximants to the exponential function happen to have a closed form:

{\displaystyle p(x) = \sum_{k=0}^n \frac{\binom{n}{k}}{\binom{2n}{k}} \frac{x^k}{k!} }

which shows that for every fixed {k}, the coefficient of {x^k} converges to {2^{-k}/k!} as {n\to\infty }. The latter is precisely the Taylor coefficient of {e^{x/2}}.

In practice, a recurrence relation is probably the easiest way to get these numerators: begin with {p_0(x)=1} and {p_1(x) = 1+x/2}, and after that use {\displaystyle p_{n+1}(x) = p_n(x) + \frac{x^2}{16n^2 - 4} p_{n-1}(x)}. This relation can be derived from the recurrence relations for the convergents {A_n/B_n} of a generalized continued fraction {\displaystyle \mathop{\raisebox{-5pt}{\huge K}}_{n=1}^\infty \frac{a_n}{b_n}}. Namely, {A_n = b_nA_{n-1}+a_nA_{n-2}} and {B_n = b_nB_{n-1}+a_nB_{n-2}}. Only the first relation is actually needed here.

Using the recurrence relation, we get

{\displaystyle p_2(x) = p_1(x) + \frac{x^2}{12}p_0(x) = 1 + \frac{x}{2} + \frac{x^{2}}{12} }

{\displaystyle p_3(x) = p_2(x) + \frac{x^2}{60}p_1(x) = 1 + \frac{x}{2} + \frac{x^{2}}{10} + \frac{x^{3}}{120} }

{\displaystyle p_4(x) = p_3(x) + \frac{x^2}{140}p_2(x) = 1 + \frac{x}{2} + \frac{3 x^{2}}{28} + \frac{x^{3}}{84} + \frac{x^{4}}{1680} }

(so, not all coefficients have numerator 1…)

{\displaystyle p_5(x) = p_4(x) + \frac{x^2}{252}p_3(x) = 1 + \frac{x}{2} + \frac{x^{2}}{9} + \frac{x^{3}}{72} + \frac{x^{4}}{1008} + \frac{x^{5}}{30240} }
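
As a sanity check, this recurrence is easy to run in Scilab, using its built-in polynomial type; the last two lines anticipate the log-scale comparison below, and the range {[-3, 3]} is an arbitrary choice.

// Diagonal Padé numerators for exp(x) via the recurrence
// p_{n+1} = p_n + x^2/(16 n^2 - 4) p_{n-1}, starting from p_0 = 1, p_1 = 1 + x/2
x = poly(0, "x");
pprev = 1 + 0*x;                  // p_0
pcur  = 1 + 0.5*x;                // p_1
for n = 1:4
    pnext = pcur + (1/(16*n^2 - 4))*x^2*pprev;
    pprev = pcur;  pcur = pnext;
end
disp(pcur)                        // p_5, matching the hand computation above
t = linspace(-3, 3, 301);
err = log(horner(pcur, t) ./ horner(pcur, -t)) - t;   // log(p_5(x)/p_5(-x)) - x
disp(max(abs(err)))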

The quality of approximation to {e^x} is best seen on logarithmic scale: i.e., how close is {\log(p(x)/p(-x))} to {x}? Here is this comparison using {p_5}.

Rational approximation based on 5th degree polynomial

For comparison, the Taylor polynomial of fifth degree, also on logarithmic scale: {\log(T_5(x))} where {\displaystyle T_5(x) = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \frac{x^5}{120}}.

log(T_5) in red

Transcendental-free Riemann-Lebesgue lemma

Calculus books tend to introduce transcendental functions (trigonometric, exponential, logarithm) early. Analysis textbooks such as Principles of Mathematical Analysis by Rudin tend to introduce them later, because of how long it takes to develop enough of the theory of power series.

The Riemann-Lebesgue lemma involves either trigonometric or exponential functions. But the following version works with the “late transcendentals” approach.

Transcendental-free Riemann-Lebesgue Lemma

TFRLL. Suppose that {f\colon [a, b]\to \mathbb R} and {g\colon \mathbb R\to \mathbb R} are continuously differentiable functions, and {g} is bounded. Then {\int_a^b f(x)g'(nx)\,dx \to 0} as {n\to\infty}.

The familiar form of the lemma is recovered by letting {g(x) = \sin x} or {g(x) = \cos x}.

Proof. By the chain rule, {g'(nx)} is the derivative of {g(nx)/n}. Integrate by parts:

{\displaystyle  \int_a^b f(x)g'(nx)\,dx = \frac{f(b)g(nb)}{n} - \frac{f(a)g(na)}{n} - \int_a^b f'(x)\frac{g(nx)}{n}\,dx }

By assumption, there exists a constant {M} such that {|g|\le M} everywhere. Hence {\displaystyle \left| \frac{f(b)g(nb)}{n}\right| \le \frac{|f(b)| M}{n} }, {\displaystyle \left|\frac{f(a)g(na)}{n}\right| \le \frac{|f(a)| M}{n}}, and {\displaystyle \left|\int_a^b f'(x)\frac{g(nx)}{n}\,dx\right| \le \frac{M}{n} \int_a^b |f'(x)|\,dx}. By the triangle inequality,

{\displaystyle \left|\int_a^b f(x)g'(nx)\,dx \right| \le \frac{M}{n}\left(|f(b)|+|f(a)| + \int_a^b |f'(x)|\,dx \right) \to 0 }

completing the proof.

As a non-standard example, TFRLL applies to, say, {g(x) = \sin (x^2) } for which {g'(x) = 2x\cos (x^2)}. The conclusion is that {\displaystyle \int_a^b f(x) nx \cos (n^2 x^2) \,dx \to 0 }, that is, {\displaystyle  \int_a^b xf(x) \cos (n^2 x^2) \,dx = o(1/n)} which seems somewhat interesting. When {0\notin [a, b]}, the factor of {x} can be removed by applying the result to {f(x)/x}, leading to {\displaystyle \int_a^b f(x) \cos (n^2 x^2) \,dx = o(1/n)}.
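
Here is a quick numerical illustration in Scilab (the choice {f(x)=\sqrt{1+x}} on {[0,1]} is arbitrary, and the oscillatory integral is computed with a fine trapezoidal rule; this is only a sketch):

// Check: integral of f(x) g'(nx) over [0,1] with g(x) = sin(x^2), g'(x) = 2x cos(x^2)
a = 0; b = 1;
for n = [10 40 160]
    x = linspace(a, b, 20*n^2 + 1);                  // roughly 60 samples per oscillation
    y = sqrt(1 + x) .* (2*n*x .* cos(n^2 * x.^2));   // f(x) * g'(nx)
    mprintf("n = %4d   integral = %f\n", n, inttrap(x, y));
end

The printed values should tend to zero, at a rate consistent with the {M/n} bound obtained in the proof.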

What if we tried less smoothness?

Later in Rudin’s book we encounter the Weierstrass theorem: every continuous function on {[a, b]} is a uniform limit of polynomials. Normally, this would be used to make the Riemann-Lebesgue lemma work for any continuous function {f}. But the general form given above, with an unspecified {g}, presents a difficulty.

Indeed, suppose {f} is continuous on {[a, b]}. Given {\epsilon > 0 }, choose a polynomial {p} such that {|p-f|\le \epsilon} on {[a, b]}. Since {p} has continuous derivative, it follows that {\int_a^b p(x)g'(nx)\,dx \to 0}. It remains to show that {\int_a^b p(x)g'(nx)\,dx} is close to {\int_a^b f(x)g'(nx)\,dx}. By the triangle inequality,

{\displaystyle \left| \int_a^b (p(x) - f(x))g'(nx)\,dx \right| \le \epsilon \int_a^b |g'(nx)|\,dx }

which is bounded by … um. Unlike for {\sin } and {\cos}, we do not have a uniform bound for {|g'|} or for its integral. Indeed, with {g(x) = \sin x^2} the integrals {\displaystyle  \int_0^1 |g'(nx)| \,dx = \int_0^1 2nx |\cos (n^2x^2)| \,dx  } grow linearly with {n}. And this behavior would be even worse with {g(x) = \sin e^x}, for example.

At present I do not see a way to prove TFRLL for continuous {f}, let alone for integrable {f}. But I do not have a counterexample either.

Partitions of unity and monotonicity-preserving approximation

There are many ways to approximate a given continuous function {f\colon [a, b]\to \mathbb R} (I will consider the interval {[a, b]=[0, 1]} for convenience.) For example, one can use piecewise linear interpolation through the points {(k/n, f(k/n))}, where {k=0, 1, \dots, n}. The resulting piecewise linear function {g} has some nice properties: for example, it is increasing if {f} is increasing. But it is not smooth.

A convenient way to represent piecewise linear interpolation is the sum {g(x) = \sum_{k=0}^n f(k/n) \varphi_k(x)} where the functions {\varphi_k} are the triangles shown below: {\varphi_k(x) = \max(0, 1 - |nx-k|)}.

Triangular basis functions

The functions {{\varphi_k}} form a partition of unity, meaning that {\sum_k \varphi_k \equiv 1} and all {\varphi_k} are nonnegative. This property leads to the estimate

{\displaystyle |f(x) - g(x)| = \left| \sum_{k=0}^n (f(x) - f(k/n)) \varphi_k(x)\right| \le \sum_{k=0}^n |f(x) - f(k/n)| \varphi_k(x) }

The latter sum is small because when {x} is close to {k/n}, the first factor {|f(x) - f(k/n)|} is small by virtue of continuity, while the second factor {\varphi_k(x)} is bounded by {1}. When {x} is far from {k/n}, the second factor {\varphi_k(x)} is zero, so the first one is irrelevant. The upshot is that {f-g} is uniformly small.
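
For instance, a Scilab sketch of this construction might look as follows (the choices {f(x)=e^{-x/2}} and {n=5} are arbitrary):

// Piecewise linear interpolation as a sum over the triangle (hat) functions
n = 5;
x = linspace(0, 1, 501);
g = zeros(x);
for k = 0:n
    phi = max(0, 1 - abs(n*x - k));      // the k-th hat function
    g = g + exp(-k/(2*n)) * phi;         // f(k/n) * phi_k(x), with f(x) = exp(-x/2)
end
plot(x', [exp(-x/2)', g'])               // f and its piecewise linear interpolant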

If we want a smooth approximation {g}, we need a smooth partition of unity {{\varphi_k}}. But not just any set of smooth nonnegative functions that add up to {1} is equally good. One desirable property is preserving monotonicity: if {f} is increasing, then {g} should be increasing, just as this works for piecewise linear interpolation. What does this condition require of our partition of unity?

An increasing function can be expressed as a limit of sums of the form {\sum_{j} c_j [x \ge t_j]} where {c_j>0} and {[\cdots ]} is the Iverson bracket: 1 if true, 0 if false. By linearity, it suffices to have increasing {g} for the case {f(x) = [x \ge t]}. In this case {g} is simply {s_m := \sum_{k=m}^n \varphi_k} for some {m}, {0\le m\le n}. So we want all {s_m} to be increasing functions. Which is the case for the triangular partition of unity, when each {s_m} looks like this:

Also known as a piecewise linear activation function

One smooth choice is Bernstein basis polynomials: {\displaystyle \varphi_k(x) = \binom{n}{k} x^k (1-x)^{n-k}}. These are nonnegative on {[0, 1]}, and the binomial formula shows {\displaystyle \sum_{k=0}^n \varphi_k(x) = (x + 1-x)^n \equiv 1}. Are the sums {\displaystyle s_m(x) = \sum_{k=m}^n \binom{n}{k} x^k (1-x)^{n-k}} increasing with {x}? Let’s find out. By the product rule,

{\displaystyle s_m'(x) = \sum_{k=m}^n \binom{n}{k} k x^{k-1} (1-x)^{n-k} - \sum_{k=m}^n \binom{n}{k} (n-k) x^{k} (1-x)^{n-k - 1}}

In the second sum the term with {k=n} vanishes, and the terms with {k<n} can be rewritten as {\displaystyle \frac{n!}{k! (n-k)!} (n-k) x^{k} (1-x)^{n-k - 1}}, which is {\frac{n!}{(k+1)! (n-k-1)!} (k+1) x^{k} (1-x)^{n-k - 1}}, which is {\binom{n}{k+1} (k+1) x^{k} (1-x)^{n-k - 1} }. After the index shift {k+1\mapsto k} this becomes identical to the terms of the first sum and cancels them out (except for the first one). Thus,

{\displaystyle s_m'(x) = \binom{n}{m} m x^{m-1} (1-x)^{n-m} \ge 0 }

To summarize: the Bernstein polynomials {\displaystyle B_n(x) = \sum_{k=0}^n f(k/n) \binom{n}{k} x^k (1-x)^{n-k}} are monotone whenever {f} is. On the other hand, the proof that {B_n\to f} uniformly is somewhat complicated by the fact that the polynomial basis functions {\varphi_k} are not localized the way that the triangle basis functions are: the factors {\varphi_k(x)} do not vanish when {x} is far from {k/n}. I refer to Wikipedia for a proof of convergence (which, by the way, is quite slow).

Bernstein polynomial basis
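
A short Scilab sketch of the Bernstein construction (the increasing function {f(x)=x^2} and the degree {n=8} are arbitrary choices; the binomial coefficients are generated by the ratio recurrence to keep things simple):

// Bernstein polynomial B_n f(x) = sum_k f(k/n) binom(n,k) x^k (1-x)^(n-k)
n = 8;
x = linspace(0, 1, 501);
B = zeros(x);
c = 1;                                    // binom(n, 0)
for k = 0:n
    B = B + (k/n)^2 * c * x.^k .* (1 - x).^(n - k);   // f(k/n) = (k/n)^2
    c = c * (n - k)/(k + 1);              // binom(n, k+1)
end
plot(x', [(x.^2)', B'])                   // f and its (increasing) Bernstein approximation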

Is there some middle ground between non-smooth triangles and non-localized polynomials? Yes, of course: piecewise polynomials, that is, splines. More specifically, B-splines, which can be defined as follows: B-splines of degree {1} are the triangle basis functions shown above; a B-spline of degree {d+1} is the moving average of a B-spline of degree {d} with a window of length {h = 1/n}. The moving average of a function {F} is {\frac{1}{h} \int_{x-h/2}^{x+h/2} F(t)\,dt}. We get a partition of unity because the sum of moving averages is the moving average of the sum, and averaging a constant function does not change it.

The splines of even degree are awkward to work with: they are obtained from the triangles by applying the moving average an odd number of times, which makes their knots fall at the midpoints of the uniform grid instead of at the grid points themselves. But I will use {d=2} anyway, because this degree is enough for {C^1}-smooth approximation.

Recall that a triangular basis function {\varphi_k} has slope {\pm n} and is supported on an interval {[(k-1)h, (k+1)h]} where {h=1/n}. Accordingly, its moving average {\psi_k} will be supported on {[(k-3/2)h, (k+3/2)h]}. Since {\psi_k'(x) = n(\varphi_k(x+h/2) - \varphi_k(x-h/2))}, the second derivative {\psi_k''} is {n^2} when {-3/2< nx-k< -1/2}, is {-2n^2} when {|nx-k| < 1/2}, and is {n^2} again when {1/2 < nx-k < 3/2}. This is enough to figure out the formula for {\psi_k}:

{\displaystyle \psi_k(x) = \begin{cases} (nx-k+3/2)^2 / 2, & -3/2\le nx -k\le -1/2 \\ 3/4 -(nx-k)^2, & -1/2\le nx-k \le 1/2 \\ (nx-k-3/2)^2 / 2, & 1/2\le nx -k \le 3/2 \\ \end{cases} }

These look like:

Is this right?

Nice! But wait a moment: the sum near the endpoints is not constant; it is less than 1 because the contributions of the splines just outside the interval are missing. To correct for this boundary effect, replace {\psi_0} with {\psi_0 + \psi_{-1}} and {\psi_n} with {\psi_n + \psi_{n+1}}, using “ghost” elements of the basis that lie outside of the actual grid. Now the quadratic B-spline basis is correct:

A smooth partition of unity by quadratic splines

Does this partition of unity preserve monotonicity? Yes, it does: {\displaystyle \sum_{k\ge m}\psi_k'(x) = n\sum_{k\ge m} (\varphi_k(x+h/2) - \varphi_k(x-h/2)) = n(s(x+h/2) - s(x-h/2))} which is nonnegative because the sum {s := \sum_{k\ge m} \varphi_k} is an increasing piecewise linear function, as noted previously. The same logic works for B-splines of higher degree.
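
Here is a Scilab sketch of the construction, including the folded “ghost” splines at the ends (the target {f(x)=e^{-x/2}} and {n=8} are arbitrary choices):

// Quadratic B-spline as a function of u = n*x - k, following the formula above
function v = bspline2(u)
    v = bool2s(abs(u) <= 0.5) .* (0.75 - u.^2) ..
      + bool2s((abs(u) > 0.5) & (abs(u) <= 1.5)) .* (abs(u) - 1.5).^2 / 2;
endfunction

n = 8;
x = linspace(0, 1, 501);
g = zeros(x);
for k = 0:n
    w = bspline2(n*x - k);
    if k == 0 then w = w + bspline2(n*x + 1); end          // fold in the ghost psi_{-1}
    if k == n then w = w + bspline2(n*x - n - 1); end      // fold in the ghost psi_{n+1}
    g = g + exp(-k/(2*n)) * w;                             // f(k/n) * (corrected) psi_k
end
plot(x', [exp(-x/2)', g'])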

In conclusion, here is a quadratic B-spline approximation (orange) to a tricky increasing function (blue).

Smooth approximation

One may wonder why the orange curve deviates from the line at the end – did we miss some boundary effect there? Yes, in a way… the spline actually approximates the continuous extension of our original function by constant values on the left and right. Imagine the blue graph continuing to the right as a horizontal line: this creates a corner at {x=1} and the spline is smoothing that corner. To avoid this effect, one may want to extend {f} in a better way and then work with the extended function, not folding the ghosts {\psi_{-1}, \psi_{n+1}} into {\psi_0, \psi_n}.

But even so, the B-spline achieves a better approximation than the Bernstein polynomial with the same number of basis functions (eight):

Bernstein polynomial

The reason is the non-local nature of the polynomial basis {\varphi_k}, which was noted above. Bernstein polynomials do match the function perfectly at the endpoints, but this is small consolation.

Measuring nonlinearity and reducing it

How to measure the nonlinearity of a function {f\colon I\to \mathbb R} where {I\subset \mathbb R} is an interval? A natural way is to consider the smallest possible deviation from a line {y=kx+b}, that is {\inf_{k, b}\sup_{x\in I}|f(x)-kx-b|}. It turns out to be convenient to divide this by {|I|}, the length of the interval {I}. So, let {\displaystyle NL(f;I) = \frac{1}{|I|} \inf_{k, b}\sup_{x\in I}|f(x)-kx-b|}. (This is similar to β-numbers of Peter Jones, except the deviation from a line is measured only in the vertical direction.)
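
For example, if {f(x)=x^2}, then {f} minus its secant line over {I=[a,b]} equals {(x-a)(x-b)}, which reaches {|I|^2/4} in absolute value at the midpoint; lowering the secant line by {|I|^2/8} halves the maximal deviation and gives the optimal line, so {NL(x^2; I) = |I|/8}.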

NL(f; I) is the maximal vertical distance between red and black, divided by the length of I

Relation with derivatives

The definition of derivative immediately implies that if {f'(a)} exists, then {NL(f;I)\to 0} as {I} shrinks to {a} (that is, gets smaller while containing {a}). A typical construction of a nowhere differentiable continuous function is based on making {NL(f;I)} bounded from below; it is enough to do this for dyadic intervals, and that can be done by adding wiggly terms like {2^{-n}\mathrm{dist}\,(2^n x, \mathbb Z)}: see the blancmange curve.

The converse is false: if {NL(f; I)\to 0} as {I} shrinks to {a}, the function {f} may still fail to be differentiable at {a}. The reason is that the affine approximation may have different slopes at different scales. An example is {f(x)=x \sin \sqrt{-\log |x|}} in a neighborhood of {0}. Consider a small interval {[-\delta, \delta]}. The line {y = kx} with {k=\sin\sqrt{-\log \delta}} is a good approximation to {f} because {f(x)/x\approx k} on most of the interval except for a very small part near {0}, and on that part {f} is very close to {0} anyway.

Why the root of logarithm? Because {\sin \log |x|} has a fixed amount of change on a fixed proportion of  {[-\delta, \delta]}, independently of {\delta}. We need a function slower than the logarithm, so that as {\delta} decreases, there is a smaller amount of change on a larger part of the interval {[-\delta, \delta]}.

Nonlinearity of Lipschitz functions

Suppose {f} is a Lipschitz function, that is, there exists a constant {L} such that {|f(x)-f(y)|\le L|x-y|} for all {x, y\in I}. It’s easy to see that {NL(f;I)\le L/2}, by taking the mid-range approximation {y=\frac12 (\max_I f + \min_I f)}. But the sharp bound is {NL(f;I)\le L/4} whose proof is not as trivial. The sharpness is shown by {f(x)=|x|} with {I=[-1,1]}.

With the maximum difference 1/2 and the length of interval 2, we get NL(f; [-1, 1]) = 1/4
Proof. Let {k} be the slope of the linear function that agrees with {f} at the endpoints of {I}. Subtracting this linear function from {f} gives us a Lipschitz function {g} such that {-L-k\le g'\le L-k} and {\int_I g'= 0}. Let {A = \int_I (g')^+ = \int_I (g')^-}. Chebyshev’s inequality gives lower bounds for the measures of the sets {g'>0} and {g'<0}: namely, {|g'>0|\ge A/(L-k)} and {|g'<0|\ge A/(L+k)}. By adding these, we find that {|I| \ge 2LA/(L^2-k^2)\ge 2A/L}. Since {\max _I g - \min_I g \le A}, the mid-range approximation to {g} has error at most {A/2 \le |I|L/4}. Hence {NL(f; I) = NL(g; I) \le L/4}.

Reducing nonlinearity

It turns out that the graph of every Lipschitz function has relatively large almost-flat pieces. That is, there are subintervals of nontrivial size on which the measure of nonlinearity is much smaller than the Lipschitz constant. This result is a special (one-dimensional) case of Theorem 2.3 in Affine approximation of Lipschitz functions and nonlinear quotients by Bates, Johnson, Lindenstrauss, Preiss, and Schechtman.

Theorem AA (for “affine approximation”): For every {\epsilon>0} there exists {\delta>0} with the following property. If {f\colon I\to \mathbb R} is an {L}-Lipschitz function, then there exists an interval {J\subset I} with {|J|\ge \delta |I|} and {NL(f; J)\le \epsilon L}.

Theorem AA should not be confused with Rademacher’s theorem which says that a Lipschitz function is differentiable almost everywhere. The point here is a lower bound on the size of the interval {J}. Differentiability does not provide that. In fact, if we knew that {f} is smooth, or even a polynomial, the proof of Theorem AA would not become any easier.

Proof of Theorem AA

We may assume {I=[-1, 1]} and {L=1}. For {t\in (0, 2]} let {L(t) = \sup \{|f(x)-f(y)|/|x-y| \colon x, y\in I, \ |x-y|\ge t\}}. That is, {L(t)} is the restricted Lipschitz constant, one that applies for distances at least {t}. It is a decreasing function of {t}, and {L(0+)=1}.

Note that {|f(-1)-f(1)|\le 2L(1)} and that every value of {f} is within {2L(1)} of either {f(-1)} or {f(1)}. Hence, the oscillation of {f} on {I} is at most {6L(1)}. If {L(1) \le \epsilon/3}, then the constant mid-range approximation on {I} gives the desired conclusion, with {J=I}. From now on {L(1) > \epsilon/3}.

The sequence {L_k = L(4^{-k})} is increasing toward {L(0+)=1}, which implies {L_{k+1}\le (1+\epsilon) L_k} for some {k}. Pick an interval {[a, b]\subset I} that realizes {L_k}, that is, {b-a\ge 4^{-k}} and {|f(b)-f(a)| = L_k(b-a)}. Without loss of generality {f(b)>f(a)} (otherwise consider {-f}). Let {J = [(3a+b)/4, (a+3b)/4]} be the middle half of {[a, b]}. Since each point of {J} is at distance at least {4^{-k-1}} from both {a} and {b}, it follows that {\displaystyle f(b) + L_{k+1}(x-b) \le f(x) \le f(a) + L_{k+1}(x-a) } for all {x \in J}.

The green lines have slope L_{k+1} which is close to the slope L_k of the secant line through (a, f(a)) and (b, f(b)). The graph of f is pinched between these green lines, except near a or b.

So far we have pinched {f} between two affine functions of equal slope. Let us consider their difference:
{\displaystyle (f(a) + L_{k+1}(x-a)) - (f(b) + L_{k+1}(x-b)) = (L_{k+1}-L_k) (b-a)}. Recall that {L_{k+1}\le (1+\epsilon) L_k}, which gives a bound of {\epsilon L_k(b-a) \le 2\epsilon L |J|} for the difference. Approximating {f} by the average of the two affine functions we conclude that {NL(f;J)\le \epsilon L} as required.

It remains to consider the size of {J}, about which we only know {|J|\ge 4^{-k}/2} so far. Naturally, we want to take the smallest {k} such that {L_{k+1}\le (1+\epsilon) L_k} holds. Let {m} be this value; then {L_m > (1+\epsilon)^{m} L_0}. Here {L_m\le 1} and {L_0 = L(1)> \epsilon/3 }. The conclusion is that {(1+\epsilon)^m < 3/\epsilon}, hence {m< \log(3/\epsilon)/\log(1+\epsilon)}. This finally yields {\displaystyle \delta = 4^{-\log(3/\epsilon)/\log(1+\epsilon)}/2} as an acceptable choice, completing the proof of Theorem AA.

A large amount of work has been done on quantifying {\delta} in various contexts; for example Heat flow and quantitative differentiation by Hytönen and Naor.

Between a cubic spline and the line of regression

Given some data points {(x_i,y_i)} one can devise various functions that extract information from them. Such as

  • Regression line (considered in When the digits of {\pi} go to 11): it detects a linear trend {y=ax+b}, minimizing the sum of squares of residuals {y_i-(ax_i+b)}.
  • Natural cubic spline (considered in Connecting dots naturally), which passes through every data point while minimizing the amount of wiggling, measured by the square of its second derivative. Like this:

Natural cubic spline

A smoothing spline is something in between the above extremes: it insists on neither being a line (i.e., having zero second derivative) nor on passing through given points (i.e., having zero residuals). Instead, it minimizes a weighted sum of both things: the integral of second derivative squared, and the sum of residuals squared. Like this plot, where the red points are given but the spline chooses to interpolate the green ones:

Smoothing spline

I’ll demonstrate a version of a smoothing spline that might not be exactly canonical, but is very easy to implement in Matlab or Scilab (which I prefer to use). As in Connecting dots naturally, assume the knots {a=x_0<x_1<\dots<x_n=b} are equally spaced at distance {h=(b-a)/n}. The natural cubic spline is determined by the values of its second derivative at {x_1,\dots,x_{n-1}}, denoted {z_1,\dots,z_{n-1}}. As explained earlier, these can be found from the linear system

\displaystyle    \frac{h}{6}\begin{pmatrix} 4 & 1 & 0 & 0 & \ldots & 0 & 0 \\ 1 & 4 & 1 & 0 & \ldots & 0 & 0 \\ 0 & 1 & 4 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \ldots & \ldots \\ 0 & 0 & 0 & 0 & \ldots & 1 & 4 \end{pmatrix} \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_{n-1}\end{pmatrix}   = - \frac{1}{h} \begin{pmatrix} \Delta^2 y_1 \\ \Delta^2 y_2 \\ \vdots \\ \Delta^2 y_{n-1}\end{pmatrix}

where the column on the right contains the amounts by which the derivative of the piecewise linear interpolant through given points jumps at every knot. The notation {\Delta^2y } means the second difference of {(y_i)}, for example {\Delta^2y_1 =y_2-2y_1+y_0}.

A smoothing spline is also a natural spline, but for a different set of points {(x_i,\tilde y_i)}. One has to find {\tilde y_i} that minimize a weighted sum of {\sum_i (\tilde y_i-y_i)^2} and of {\int_a^b (f''(x))^2\,dx}. The latter integral is easily computed in terms of {z_i}: it is equal to {\frac{h}{3}\sum_{i=1}^{n}(z_i^2+z_{i-1}^2+z_iz_{i-1})}. Since this quadratic form is comparable to {\sum_{i=1}^{n}z_i^2}, I’ll work with the latter instead.

The idea is to set up an underdetermined system for {z_i} and {\tilde y_i-y_i}, and let Scilab find a least squares solution of that. Let’s introduce a weight parameter {\lambda>0} that will determine the relative importance of curvature vs residuals. It is convenient to let {\tilde y_i=y_i+\lambda h^2 \delta_i}, so that both {\delta_i} and {z_i} (second derivative) scale the same way. The terms {\delta_i} contribute to the linear system for {z_i}, since the right-hand side now has {\tilde y_i} instead of {y_i}. Their contribution is {- \lambda h \Delta^2 \delta}. Moving it to the left-hand side (since the {\delta_i} are unknowns) we obtain the following system.

\displaystyle    \begin{pmatrix} A & | & B \end{pmatrix} \begin{pmatrix} z \\ \delta \end{pmatrix}   = - \frac{1}{h} \Delta^2 y

where {A} is the same tridiagonal matrix as above, and {B} is the rectangular Laplacian-type matrix

\displaystyle    B = \lambda h \begin{pmatrix} 1 & -2 & 1 & 0 & \ldots & 0 & 0 \\ 0 & 1 & -2 & 1 & \ldots & 0 & 0 \\ 0 & 0 & 1 & -2 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \ldots & \ldots \\ 0 & 0 & 0 & 0 & \ldots & -2 & 1 \end{pmatrix}

All in all, the system has {2n } unknowns {z_1,\dots,z_{n-1},\delta_0,\dots,\delta_n}, and {(n-1)} equations, reflecting the continuity of first derivative at each interior knot. The lsq command of Scilab finds the solution with the least sum of squares of the unknowns, which is what we are after.

Time for some examples. {\lambda=0} and {\lambda=0.05} can be seen above. Here are more:

lambda = 0.1

lambda = 1

As {\lambda} increases, the spline approaches the regression line:

lambda = 10

lambda=100

Finally, the Scilab code. It is almost the same as for natural spline; the difference is in five lines from B=... to newy=... The code after that is merely evaluation and plotting of the spline.

a = 0; b = 10
y = [3 1 4 1 5 9 2 6 5 3 5]
lambda = 0.1

n = length(y)-1
h = (b-a)/n
jumps = diff(y,2)/h
A = (h/6)*(diag(4*ones(1,n-1))+diag(ones(1,n-2),1)+diag(ones(1,n-2),-1))

B = lambda*h*(diag(-2*ones(1,n+1))+diag(ones(1,n),1)+diag(ones(1,n),-1))   // lambda*h times the second-difference matrix, as derived above
C = [A, B(2:$-1,:)]              // keep rows 2..n of B so the sizes match A
sol = lsq(C,-jumps')'            // minimum-norm least squares solution for [z, delta]
z = [0,sol(1:n-1),0]             // natural spline: z vanishes at the endpoints
newy = y + lambda*h^2*sol(n:$)   // adjusted values through which the spline passes

allx = []
spl = []
for i=1:n  
   xL = a+h*(i-1)
   xR = a+h*i
   x = linspace(xL,xR,100)
   linear = newy(i)*(xR-x)/h + newy(i+1)*(x-xL)/h
   correction = ((z(i+1)+2*z(i))*(xR-x)+(2*z(i+1)+z(i))*(x-xL)).*(xR-x).*(x-xL)/(6*h)
   allx = [allx, x]   
   spl = [spl, linear+correction] 
end

plot(allx, spl) 
plot(a:h:b, newy, 'g*')
plot(a:h:b, y, 'r*')

Trigonometric approximation and the Clenshaw-Curtis quadrature

Trying to approximate a generic continuous function on {[-1,1]} with the Fourier trigonometric series of the form {\sum_n (A_n\cos \pi nx+B_n\sin \pi nx)} is in general not very fruitful. Here’s such an approximation to {f(x)=\exp(x)}, with the sum over {n\le 4}:

Poor approximation to exp(x)


It’s better to make a linear change of variable: consider {f(2x-1)} on the interval {[0,1]}, and use the formula for the cosine series. This results in {\exp(2x-1)}, which is approximated by the partial sum {\sum_{n=0}^4 A_n\cos \pi nx} of its cosine Fourier series as follows.

Better approximation after a change of variable


But one can do much better with a different, nonlinear change of variable. Consider {f(\cos t)} on the interval {[0,\pi]}, and again use the formula for the cosine series. This results in {\exp(\cos t)}, which is approximated by the partial sum {\sum_{n=0}^4 A_n\cos nt} of its cosine Fourier series as follows.

Excellent approximation after nonlinear change of variable

Yes, I can’t see any difference either: the error is less than {10^{-3}}.

The composition with cosine improves approximation because {f(\cos t)} is naturally a periodic function, with no jumps or corners in its graph. Fourier series, which are periodic by nature, follow such functions more easily.


A practical implication of this approximation of {f(\cos t)} is the Clenshaw-Curtis integration method. It can be expressed in one line:

\displaystyle    \int_{-1}^1 f(x)\,dx = \int_0^\pi f(\cos t)\sin t\,dt \approx \int_0^\pi \sum_{n=0}^N a_n \cos nt \sin t\,dt   = \sum_{n=0}^N \frac{(1+(-1)^n) a_n}{1-n^2}

The integral {\int_{-1}^1 f(x)\,dx} is approximated by summing {2a_{2k}/(1-4k^2)}, where {a_{2k}} are even-numbered cosine coefficients of {f(\cos t)}. In the example with {f(x)=\exp(x)} using just three coefficients yields

\displaystyle    \frac{2a_0}{1}+\frac{2a_2}{-3}+\frac{2a_4}{-15} \approx 2.350405

while the exact integral is {\approx 2.350402}.


At first this doesn’t look practical at all: after all, the Fourier coefficients are themselves found by integration. But since {f(\cos t)} is so close to a trigonometric polynomial, one can sample it at equally spaced points and apply the Fast Fourier transform to the result, quickly obtaining many coefficients at once. This is what the Clenshaw-Curtis quadrature does (at least in principle, the practical implementation may streamline these steps.)
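
Here is a Scilab sketch of this procedure for {f(x)=\exp(x)}. The cosine coefficients of {f(\cos t)} are computed with the trapezoidal rule on {[0,\pi]}, which is essentially what an FFT-based implementation produces; the numbers of coefficients and samples are arbitrary choices.

// Clenshaw-Curtis quadrature sketch for f(x) = exp(x) on [-1,1]
N = 4;                               // use cosine coefficients a_0 .. a_N
M = 64;                              // number of sample intervals on [0, %pi]
t = linspace(0, %pi, M + 1);
g = exp(cos(t));                     // the function f(cos t) being expanded
a = zeros(1, N + 1);
a(1) = inttrap(t, g) / %pi;          // a_0: mean value of f(cos t)
for n = 1:N
    a(n + 1) = (2/%pi) * inttrap(t, g .* cos(n*t));   // cosine coefficient a_n
end
Q = 0;
for n = 0:2:N
    Q = Q + 2*a(n + 1)/(1 - n^2);    // only even n contribute
end
mprintf("approximation: %f   exact: %f\n", Q, exp(1) - exp(-1));

This should reproduce the value {2.350405} quoted above, up to the (tiny) error of the trapezoidal rule.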

B-splines and probability

If one picks two real numbers {X_1,X_2} from the interval {[0,1]} (independent, uniformly distributed), their sum {S_2=X_1+X_2} has the triangular distribution.

Also known as the hat function

The sum {S_3} of three such numbers has a differentiable probability density function:

Piecewise quadratic, C1 smooth

And the density of {S_4=X_1+X_2+X_3+X_4} is smoother still: the p.d.f. has two continuous derivatives.

This begins to look normal

As the number of summands increases, these distributions converge to normal if they are translated and scaled properly. But I am not going to do that. Let’s keep the number of summands to four at most.

The p.d.f. of {S_n} is a piecewise polynomial of degree {n-1}. Indeed, for {S_1=X_1} the density is piecewise constant, and the formula

\displaystyle  S_n(x) = \int_{x-1}^x S_{n-1}(t)\,dt

provides the inductive step.
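
The recurrence also gives a painless way to compute these densities numerically. A Scilab sketch (the grid step is an arbitrary choice):

// Densities of S_1..S_4 from S_n(x) = integral of S_{n-1} over [x-1, x]
dx = 0.001;
x = 0:dx:4;                              // S_4 is supported on [0,4]
S = zeros(4, length(x));
S(1, :) = bool2s((x >= 0) & (x < 1));    // S_1: uniform density on [0,1]
for n = 2:4
    C = cumsum(S(n-1, :)) * dx;          // approximate antiderivative of S_{n-1}
    shift = round(1/dx);                 // index offset corresponding to a shift by 1
    S(n, :) = C - [zeros(1, shift), C(1:$-shift)];   // F(x) - F(x-1)
end
plot(x', S')                             // piecewise polynomial densities of degrees 0,1,2,3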

For each {n}, the translated copies of function {S_n} form a partition of unity:

\displaystyle  \sum_{k\in\mathbb Z}S_n(x-k)\equiv 1

The integral recurrence relation gives an easy proof of this:

\displaystyle  \sum_{k\in\mathbb Z}\int_{x-k-1}^{x-k} S_{n-1}(t)\,dt = \int_{\mathbb R} S_{n-1}(t)\,dt = 1

And here is the picture for the quadratic case:

Piecewise quadratic partition of unity

A partition of unity can be used to approximate functions by piecewise polynomials: just multiply each partition element by the value of the function at the center of the corresponding interval, and add the results.

Doing this with {S_2} amounts to piecewise linear interpolation: the original function {f(x)=e^{-x/2}} is in blue, the weighted sum of hat functions in red.

PL interpolation of exp(-x/2)

With {S_4} we get a smooth curve.

Approximating exp(-x/2) with a cubic B-spline

Unlike interpolating splines, this curve does not attempt to pass through the given points exactly. However, it has several advantages over interpolating splines:

  • It is easier to calculate: there is no linear system to solve;
  • It yields a positive function for positive data;
  • It yields a monotone function for monotone data.

Linear approximation and differentiability

If a function {f\colon \mathbb R\rightarrow \mathbb R} is differentiable at {a\in \mathbb R}, then it admits good linear approximation at small scales. Precisely: for every {\epsilon>0} there is {\delta>0} and a linear function {\ell(x)} such that {|f(x)-\ell(x)|<\epsilon \,\delta} for all {|x-a|<\delta}. Having {\delta} multiplied by {\epsilon} means that the deviation from linearity is small compared to the (already small) scale {\delta} on which the function is considered.

For example, this is a linear approximation to {f(x)=e^x} near {0} at scale {\delta=0.1}.

Linear approximation to exponential function

As is done on this graph, we can always take {\ell} to be the secant line to the graph of {f} based on the endpoints of the interval of consideration (at the cost of a factor of {2} in {\epsilon}). Indeed, if {L} is another line for which {|f(x)-L(x)|<\epsilon \,\delta} holds, then {|\ell-L|\le \epsilon \,\delta} at the endpoints, and therefore on all of the interval (the function {x\mapsto |\ell(x)-L(x)|} is convex); hence {|f-\ell|\le |f-L|+|L-\ell|<2\epsilon\,\delta}.


Here is a non-differentiable function that obviously fails the linear approximation property at {0}.

Self-similar graph

(By the way, this post is mostly about me trying out SageMathCloud.) A nice thing about {f(x)=x\sin \log |x|} is self-similarity: {f(rx)=rf(x)} with the similarity factor {r=e^{2\pi}}. This implies that no matter how far we zoom in on the graph at {x=0}, the graph will not get any closer to linear.

I like {x\sin \log |x|} more than its famous, but not self-similar, cousin {x\sin(1/x)}, pictured below.

Standard example from intro to real analysis


Interestingly, the linear approximation property does not imply differentiability. The function {f(x)=x\sin \sqrt{-\log|x|}} has this property at {0}, but it lacks a derivative there since {f(x)/x} does not have a limit as {x\rightarrow 0}. Here is how it looks.

Now with the square root!

Let’s look at the scale {\delta=0.1}

scale 0.1

and compare to the scale {\delta=0.001}

scale 0.001

Well, that was disappointing. Let’s use math instead. Fix {\epsilon>0} and consider the function {\phi(\delta)=\sqrt{-\log \delta}-\sqrt{-\log (\epsilon \delta)}}. Rewriting it as

\displaystyle    \frac{\log \epsilon}{\sqrt{-\log \delta}+\sqrt{-\log (\epsilon \delta)}}

shows {\phi(\delta)\rightarrow 0} as {\delta\rightarrow 0}. Choose {\delta} so that {|\phi(\delta)|<\epsilon} and define {\ell(x)=x\sin\sqrt{-\log \delta}}. Then for {\epsilon \,\delta\le |x|< \delta} we have {|f(x)-\ell(x)|\le |x|\,|\sqrt{-\log |x|}-\sqrt{-\log \delta}|\le \epsilon |x|<\epsilon\,\delta} (using {|\sin s - \sin t|\le |s-t|}), and for {|x|<\epsilon \delta} the trivial bound {|f(x)-\ell(x)|\le |f(x)|+|\ell(x)|\le 2|x|} suffices, up to the harmless factor of {2}.

Thus, {f} can be well approximated by linear functions near {0}; it’s just that the linear function has to depend on the scale on which the approximation is made: its slope {\sin\sqrt{-\log \delta}} does not have a limit as {\delta\to0}.

The linear approximation property does not become apparent until extremely small scales. Here is {\delta = 10^{-30}}.

Scale 1e-30
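
For anyone who wants to reproduce this picture, here is a Scilab sketch (the plotting grid is an arbitrary choice; the line has the scale-dependent slope {\sin\sqrt{-\log\delta}} discussed above):

// f(x) = x sin(sqrt(-log|x|)) at scale delta = 1e-30, against its approximating line
delta = 1e-30;
x = linspace(-delta, delta, 2001);
x(x == 0) = [];                          // avoid log(0) at the midpoint
f = x .* sin(sqrt(-log(abs(x))));
lin = x * sin(sqrt(-log(delta)));        // line of slope sin(sqrt(-log(delta)))
plot(x', [f', lin'])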

Squarish polynomials

For some reason I wanted to construct polynomials approximating this piecewise constant function {f}:

So square

Of course approximation cannot be uniform, since the function is not continuous. But it can be achieved in the sense of convergence of graphs in the Hausdorff metric: their limit should be the “graph” shown above, with the vertical line included. In concrete terms, this means for every {\epsilon>0} there is {N} such that for {n\ge N} the polynomial {p_n} satisfies

\displaystyle  |p_n-f|\le \epsilon\quad \text{ on }\ [0,2]\setminus [1-\epsilon,1+\epsilon]

and also

\displaystyle  -\epsilon\le  p_n  \le 1+\epsilon\quad \text{ on }\ [1-\epsilon,1+\epsilon]

How to get such {p_n} explicitly? I started with the functions {f_m(x) = \exp(-x^m)} when {m} is large. The idea is that as {m\rightarrow\infty}, the limit of {\exp(-x^m)} is what is wanted: {1} when {x<1}, {0} when {x>1}. Also, for each {m} there is a Taylor polynomial {T_{m,n}} that approximates {f_m} uniformly on {[0,2]}. Since the Taylor series is alternating, it is not hard to find suitable {n}. Let’s shoot for {\epsilon=0.01} in the Taylor remainder and see where this leads:

  • Degree {7} polynomial for {\exp(-x)}
  • Degree {26} polynomial for {\exp(-x^2)}
  • Degree {69} polynomial for {\exp(-x^3)}
  • Degree {180} polynomial for {\exp(-x^4)}
  • Degree {440} polynomial for {\exp(-x^5)}

The results are unimpressive, though:

Taylor polynomials of exp(-x^m) are not so square

To get within {0.01} of the desired square-ness, we need {\exp(-1.01^m)<0.01} and also {1-\exp(-0.99^{m})<0.01}. The second condition is the restrictive one: it forces {m} to be about {460}, so take {m = 463}. Then, to have the Taylor remainder bounded by {0.01} at {x=2}, we need {2^{463n}/n! < 0.01}. Instead of messing with Stirling’s formula, just observe that {2^{463n}/n!} does not even begin to decrease until {n} exceeds {2^{463}}, which is more than {10^{139}}. That’s a … high degree polynomial. I would not try to ask a computer algebra system to plot it.

Bernstein polynomials turn out to work better. On the interval {[0,2]} they are given by

\displaystyle    p_n(x) = 2^{-n} \sum_{k=0}^n f(2k/n) \binom{n}{k} x^k (2-x)^{n-k}

To avoid dealing with {f(1)}, it is better to use odd degrees. For comparison, I used the same or smaller degrees as above: {7, 25, 69, 179, 439}.

Squarish Bernstein polynomials

Looks good. But I don’t know of a way to estimate the degree of Bernstein polynomial required to obtain Hausdorff distance less than a given {\epsilon} (say, {0.01}) from the square function.
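
For completeness, here is a Scilab sketch that evaluates one of these squarish Bernstein polynomials (degree {69} is taken from the list above; the binomial coefficients are generated by the ratio recurrence to avoid huge factorials):

// Bernstein polynomial of odd degree n for the step function on [0,2]
// (f = 1 on [0,1), f = 0 on (1,2], so only k with 2k/n < 1 contribute)
n = 69;
x = linspace(0, 2, 401);
p = zeros(x);
c = 1;                                   // binom(n, 0)
for k = 0:n
    if 2*k/n < 1 then
        p = p + c * (x/2).^k .* (1 - x/2).^(n - k);
    end
    c = c * (n - k)/(k + 1);             // binom(n, k+1)
end
plot(x', p')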