The Taylor series of the exponential function, exp(x) = 1 + x + x^2/2! + x^3/3! + …,
provides reasonable approximations to the function near zero: for example, with the cubic Taylor polynomial T_3(x) = 1 + x + x^2/2 + x^3/6 we get
Taylor polynomial T_3 in red
The quality of approximation not being spectacular, one can try to improve it by using rational functions instead of polynomials. In view of the identity exp(x) = exp(x/2)/exp(-x/2),
one can replace each exponential on the right by its Taylor polynomial and obtain a rational approximation. The improvement is substantial for negative x:
Rational approximation based on exp(x/2)
Having a rational approximation of the form p(x)/p(-x) makes perfect sense, because such approximants obey the same functional equation f(x) f(-x) = 1 as the exponential function itself. We cannot hope to satisfy other functional equations, like f(x+y) = f(x) f(y), by functions simpler than the exponential.
However, the Taylor polynomial is not the optimal choice of p here, except in degree 1. With the optimal p of the same degree, in the same plot window as above, the graph of p(x)/p(-x) is indistinguishable from exp(x).
Red=rational, Blue=exponential
This is a Padé approximant to the exponential function. One way to obtain such approximants is to replace the Taylor series with a continued fraction, obtained by repeated long division: exp(x) = 1 + 2x/(2 - x + x^2/(6 + x^2/(10 + x^2/(14 + …)))).
Terminating the expansion after an even number of appearances gives a Padé approximant of the above form.
This can be compared to replacing the decimal expansion of the number e,
e = 2.718281828459045…,
with the continued fraction expansion
e = [2; 1, 2, 1, 1, 4, 1, 1, 6, 1, 1, 8, …]
which, besides greater accuracy, has a regular pattern: 121 141 161 181 …
The numerators of the (diagonal) Padé approximants to the exponential function happen to have a closed form:
p_n(x) = sum over k = 0, …, n of [(2n-k)! n! / ((2n)! k! (n-k)!)] x^k,
which shows that for every fixed k, the coefficient of x^k converges to 1/(2^k k!) as n → ∞. The latter is precisely the Taylor coefficient of exp(x/2).
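A quick numeric check of this convergence, as a sketch in Scilab (the index k = 3 and the values of n are arbitrary choices):

// coefficient of x^3 in the normalized numerator, for increasing n
k = 3;
for n = [5 20 80]
    c = prod(n-(0:k-1)) / prod(2*n-(0:k-1)) / factorial(k);
    printf("n = %2d   coefficient of x^3 = %.6f\n", n, c);
end
printf("limit 1/(2^3*3!) = %.6f\n", 1/(2^3*factorial(3)));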
In practice, a recurrence relation is probably the easiest way to get these numerators: begin with p_0(x) = 1 and p_1(x) = 2 + x, and after that use p_{k+1}(x) = (4k+2) p_k(x) + x^2 p_{k-1}(x). This relation can be derived from the recurrence relations for the convergents of a generalized continued fraction: the numerators satisfy A_k = b_k A_{k-1} + a_k A_{k-2} and the denominators satisfy B_k = b_k B_{k-1} + a_k B_{k-2}. Only the first relation is actually needed here.
Using the recurrence relation, we get, for example, p_4(x) = 1680 + 840x + 180x^2 + 20x^3 + x^4, or, after normalization, 1 + x/2 + 3x^2/28 + x^3/84 + x^4/1680
(so, not all coefficients have numerator 1…)
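Here is a small Scilab sketch of this computation using Scilab's polynomial type (the number of steps and the test point x = 1 are arbitrary choices):

// diagonal Padé numerators for exp via the three-term recurrence
// p0 = 1, p1 = 2 + x, p_{k+1} = (4k+2)*p_k + x^2*p_{k-1}
x = poly(0, 'x');              // the polynomial "x"
P = list(1 + 0*x, 2 + x);      // p_0 and p_1
for k = 1:4
    P($+1) = (4*k+2)*P($) + x^2*P($-1);
end
disp(P(4))                     // degree-3 numerator: 120 + 60x + 12x^2 + x^3
// the corresponding approximant to exp is p(x)/p(-x); compare at x = 1
disp(horner(P(4),1)/horner(P(4),-1)), disp(exp(1))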
The quality of approximation to exp(x) is best seen on a logarithmic scale: i.e., how close is the logarithm of the approximant to x? Here is this comparison, using the numerator of degree 5.
Rational approximation based on 5th degree polynomial
For comparison, here is the Taylor polynomial of fifth degree, also on a logarithmic scale.
Calculus books tend to introduce transcendental functions (trigonometric, exponential, logarithm) early. Analysis textbooks such as Principles of Mathematical Analysis by Rudin tend to introduce them later, because of how long it takes to develop enough of the theory of power series.
The Riemann-Lebesgue lemma involves either trigonometric or exponential functions. But the following version works with the “late transcendentals” approach.
Transcendental-free Riemann-Lebesgue Lemma
TFRLL. Suppose that f: [a, b] → R and g: R → R are continuously differentiable functions, and g is bounded. Then ∫ from a to b of f(x) g'(nx) dx → 0 as n → ∞.
The familiar form of the lemma is recovered by letting g = sin or g = cos.
Proof. By the chain rule, g'(nx) is the derivative of g(nx)/n with respect to x. Integrating by parts,
∫ from a to b of f(x) g'(nx) dx = [f(b) g(nb) - f(a) g(na)]/n - (1/n) ∫ from a to b of f'(x) g(nx) dx.
By assumption, there exists a constant M such that |g| ≤ M everywhere. Hence |f(b) g(nb)|/n ≤ M |f(b)|/n, |f(a) g(na)|/n ≤ M |f(a)|/n, and the last integral is at most (M/n) ∫ |f'|. By the triangle inequality, the whole expression tends to 0 as n → ∞,
completing the proof.
As a non-standard example, TFRLL applies to, say, g(x) = sin(x^2), for which g'(x) = 2x cos(x^2). The conclusion is that ∫ f(x)·2nx cos(n^2 x^2) dx → 0, that is, ∫ x f(x) cos(n^2 x^2) dx = o(1/n), which seems somewhat interesting. When 0 is not in [a, b], the factor of x can be removed by applying the result to f(x)/x, leading to ∫ f(x) cos(n^2 x^2) dx = o(1/n).
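A crude numerical check, as a sketch in Scilab (the choices f(x) = exp(-x) on [0, 1], g(x) = sin(x^2), and the grid size are only for illustration): the integral should decay roughly like 1/n.

// Riemann-sum check of TFRLL with f(x) = exp(-x) and g(x) = sin(x^2), so g'(x) = 2x*cos(x^2)
x = linspace(0, 1, 200001);
dx = x(2) - x(1);
f = exp(-x);
for n = [5 20 80]
    gp = 2*(n*x).*cos((n*x).^2);     // g'(nx)
    I = sum(f.*gp)*dx;               // crude Riemann sum for the integral over [0,1]
    printf("n = %3d    integral = %10.6f\n", n, I);
end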
What if we tried less smoothness?
Later in Rudin’s book we encounter the Weierstrass theorem: every continuous function on [a, b] is a uniform limit of polynomials. Normally, this would be used to extend the Riemann-Lebesgue lemma to an arbitrary continuous function f. But the general form given above, with an unspecified g, presents a difficulty.
Indeed, suppose f is continuous on [a, b]. Given ε > 0, choose a polynomial p such that |f - p| < ε on [a, b]. Since p has a continuous derivative, it follows that ∫ p(x) g'(nx) dx → 0. It remains to show that ∫ f(x) g'(nx) dx is close to ∫ p(x) g'(nx) dx. By the triangle inequality,
|∫ (f(x) - p(x)) g'(nx) dx| ≤ ε ∫ |g'(nx)| dx,
which is bounded by … um. Unlike with sine and cosine, we do not have a uniform bound for g' or for its integral. Indeed, with g(x) = sin(x^2) the integrals ∫ |g'(nx)| dx grow linearly with n. And this behavior could be even worse with other choices of g.
At present I do not see a way to prove TFRLL for continuous f, let alone for integrable f. But I do not have a counterexample either.
There are many ways to approximate a given continuous function f (I will consider the interval [0, 1] for convenience). For example, one can use piecewise linear interpolation through the points (k/n, f(k/n)), k = 0, 1, …, n. The resulting piecewise linear function has some nice properties: for example, it is increasing if f is increasing. But it is not smooth.
A convenient way to represent piecewise linear interpolation is the sum Σ_k f(k/n) φ_k(x), where the functions φ_k are the triangles shown below: φ_k(x) = max(0, 1 - |nx - k|).
Triangular basis functions
The functions φ_k form a partition of unity, meaning that Σ_k φ_k ≡ 1 and all φ_k are nonnegative. This property leads to the estimate
|f(x) - Σ_k f(k/n) φ_k(x)| ≤ Σ_k |f(x) - f(k/n)| φ_k(x).
The latter sum is small because when x is close to k/n, the first factor is small by virtue of continuity, while the second factor is at most 1. When x is far from k/n, the second factor is zero, so the first one is irrelevant. The upshot is that the error is uniformly small.
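As a Scilab sketch of the construction (the sample function exp(-x), the number of nodes, and the explicit triangle formula max(0, 1 - |nx - k|) are my illustrative choices):

// piecewise linear interpolation as a weighted sum of triangular basis functions
n = 10;
x = linspace(0, 1, 1001);
g = zeros(x);
for k = 0:n
    phi = max(0, 1 - abs(n*x - k));     // the k-th triangular basis function
    g = g + exp(-k/n)*phi;              // weight: value of exp(-t) at the node t = k/n
end
disp(max(abs(g - exp(-x))))             // uniform error of the piecewise linear interpolant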
But if we want a smooth approximation, we need a smooth partition of unity φ_k. Not just any collection of smooth nonnegative functions that add up to 1 is equally good, though. One desirable property is preserving monotonicity: if f is increasing, then the approximation Σ_k f(k/n) φ_k should be increasing too, just as it is for piecewise linear interpolation. What does this condition require of our partition of unity?
An increasing function can be expressed as a limit of sums of the form Σ_j c_j [x ≥ t_j], where c_j > 0 and the bracket is the Iverson bracket: 1 if true, 0 if false. By linearity, it suffices to have Σ_k f(k/n) φ_k increasing in the case f(x) = [x ≥ t]. In this case the sum is simply a tail sum Σ_{k ≥ m} φ_k for some m. So we want all the tail sums Σ_{k ≥ m} φ_k to be increasing functions. This is the case for the triangular partition of unity, where each tail sum looks like this:
Also known as a piecewise linear activation function
One smooth choice is the Bernstein basis polynomials φ_k(x) = C(n, k) x^k (1-x)^(n-k), k = 0, …, n. These are nonnegative on [0, 1], and the binomial formula shows that they add up to (x + (1-x))^n = 1. Are the tail sums Σ_{k ≥ m} φ_k increasing? Let's find out. By the product rule,
d/dx Σ_{k ≥ m} φ_k = Σ_{k ≥ m} k C(n, k) x^(k-1) (1-x)^(n-k) - Σ_{k ≥ m} (n-k) C(n, k) x^k (1-x)^(n-k-1).
In the second sum the term with k = n vanishes, and the terms with k < n can be rewritten using (n-k) C(n, k) = (k+1) C(n, k+1), which turns them into terms of the same kind as in the first sum. After the index shift k+1 → k they become identical to the terms of the first sum and cancel them out (except for the first one, k = m). Thus,
d/dx Σ_{k ≥ m} φ_k = m C(n, m) x^(m-1) (1-x)^(n-m) ≥ 0.
To summarize: the Bernstein approximation Σ_k f(k/n) φ_k is monotone whenever f is. On the other hand, the proof that it converges to f uniformly is somewhat complicated by the fact that the polynomial basis functions are not localized the way the triangle basis functions are: the factors φ_k(x) do not vanish when x is far from k/n. I refer to Wikipedia for a proof of convergence (which, by the way, is quite slow).
Bernstein polynomial basis
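For comparison, here is a sketch of the Bernstein approximation in Scilab (the sample function exp(-x) and the degree n = 10 are arbitrary choices); the error comes out noticeably larger than for the triangular basis with the same number of nodes.

// Bernstein approximation B_n f(x) = sum over k of f(k/n)*C(n,k)*x^k*(1-x)^(n-k)
function y = bernstein(f, n, x)
    y = zeros(x);
    for k = 0:n
        c = factorial(n)/(factorial(k)*factorial(n-k));    // binomial coefficient C(n,k)
        y = y + f(k/n) * c * x.^k .* (1-x).^(n-k);
    end
endfunction

function y = g(x), y = exp(-x), endfunction
x = linspace(0, 1, 1001);
disp(max(abs(bernstein(g, 10, x) - g(x))))   // uniform error of the degree-10 Bernstein polynomial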
Is there some middle ground between non-smooth triangles and non-localized polynomials? Yes, of course: piecewise polynomials, that is, splines. More specifically, B-splines, which can be defined as follows: B-splines of degree 1 are the triangle basis functions shown above; a B-spline of degree d+1 is the moving average of a B-spline of degree d with a window of length h (the distance between knots). The moving average of f can be written as (1/h) times the integral of f over [x - h/2, x + h/2]. We get a partition of unity because the sum of moving averages is the moving average of the sum, and averaging a constant function does not change it.
The splines of even degrees are awkward to work with… they are obtained from the triangles by taking such averages an odd number of times, which makes their knots fall at the midpoints of the uniform grid instead of the grid points themselves. But I will use degree 2 anyway, because this degree is enough for C^1-smooth approximation.
Recall that a triangular basis function has slopes ±1/h and is supported on an interval of length 2h. Accordingly, its moving average is supported on an interval of length 3h. Since the triangle's slope is ±1/h, the second derivative of the moving average is 1/h^2 on the first third of that interval, -2/h^2 on the middle third, and 1/h^2 again on the last third. This is enough to figure out the formula for the quadratic B-spline:
These look like:
Is this right?
Nice! But wait a moment: the sum near the endpoints is not constant. It is less than 1 because we do not get the contributions of the splines centered to the left and right of the interval. To correct for this boundary effect, replace the first and last basis functions by their sums with the “ghost” elements of the basis that lie outside of the actual grid. Now the quadratic B-spline basis is correct:
A smooth partition of unity by quadratic splines
Does this partition of unity preserve monotonicity? Yes, it does: the derivative of the moving average of a tail sum is the difference quotient (S(x + h/2) - S(x - h/2))/h, where S is the corresponding tail sum of the triangular basis; this is nonnegative because S is an increasing piecewise linear function, as noted previously. The same logic works for B-splines of higher degree.
In conclusion, here is a quadratic B-spline approximation (orange) to a tricky increasing function (blue).
Smooth approximation
One may wonder why the orange curve deviates from the line at the end – did we miss some boundary effect there? Yes, in a way… the spline actually approximates the continuous extension of our original function by constant values on the left and right. Imagine the blue graph continuing to the right as a horizontal line: this creates a corner, and the spline smooths that corner out. To avoid this effect, one may want to extend f in a better way and then work with the extended function, instead of folding the ghosts into the boundary basis elements.
But even so, the B-spline achieves a better approximation than the Bernstein polynomial with the same number of basis functions (eight):
Bernstein polynomial
The reason is the non-local nature of the polynomial basis functions, which was noted above. Bernstein polynomials do match the function perfectly at the endpoints, but this is small consolation.
How to measure the nonlinearity of a function f: I → R, where I is an interval? A natural way is to consider the smallest possible deviation from a line, that is, the infimum over k, b of the supremum of |f(x) - kx - b| over x in I. It turns out to be convenient to divide this by |I|, the length of the interval I. So, let NL(f; I) be that infimum divided by |I|. (This is similar to the β-numbers of Peter Jones, except that the deviation from a line is measured only in the vertical direction.)
NL(f; I) is the maximal vertical distance between red and black, divided by the length of I
Relation with derivatives
The definition of the derivative immediately implies that if f'(a) exists, then NL(f; I) → 0 as I shrinks to a (that is, |I| gets smaller while I contains a). A typical construction of a nowhere differentiable continuous function is based on making NL(f; I) bounded from below; it is enough to do this for dyadic intervals, and that can be done by adding wiggly terms like 2^(-n) dist(2^n x, Z): see the blancmange curve.
The converse is false: if NL(f; I) → 0 as I shrinks to a, the function may still fail to be differentiable at a. The reason is that the affine approximation may have different slopes at different scales. An example is f(x) = x sin(sqrt(log(1/x))) in a neighborhood of 0. Consider a small interval [0, δ]. The line through the origin with slope sin(sqrt(log(1/δ))) is a good approximation to f, because on most of the interval, except for a very small part near 0, sqrt(log(1/x)) is very close to sqrt(log(1/δ)), and on that part f is very close to 0 anyway.
Why the root of the logarithm? Because the logarithm itself has a fixed amount of change on a fixed proportion of [0, δ], independently of δ. We need a function slower than the logarithm, so that as δ decreases, there is a smaller amount of change on a larger part of the interval [0, δ].
Nonlinearity of Lipschitz functions
Suppose f is a Lipschitz function, that is, there exists a constant L such that |f(x) - f(y)| ≤ L |x - y| for all x, y. It's easy to see that NL(f; I) ≤ L/2, by taking the mid-range constant approximation. But the sharp bound is NL(f; I) ≤ L/4, whose proof is not as trivial. The sharpness is shown by f(x) = |x| with I = [-1, 1].
With the maximum difference 1/2 and the length of interval 2, we get NL(f; [-1, 1]) = 1/4
Proof. Let k be the slope of the linear function that agrees with f at the endpoints of I. Subtracting this linear function from f gives us a Lipschitz function g that vanishes at both endpoints of I. Chebyshev’s inequality gives lower bounds for the measures of the sets where g is near its maximum and near its minimum. By adding these, we find that the oscillation of g on I is at most L|I|/2. Hence the mid-range approximation to g has error at most L|I|/4, and NL(f; I) ≤ L/4.
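A brute-force numerical estimate of NL(f; I), as a sketch (for a fixed slope k the optimal offset is the mid-range of f(x) - kx, so it suffices to scan over slopes; the slope range and grid sizes below are arbitrary):

// numerical estimate of NL(f; [a,b])
function v = NL(f, a, b)
    x = linspace(a, b, 2001);
    y = f(x);
    best = %inf;
    for k = linspace(-2, 2, 4001)               // candidate slopes (enough for 1-Lipschitz f)
        r = y - k*x;
        best = min(best, (max(r) - min(r))/2);  // best offset for this slope is the mid-range of r
    end
    v = best/(b - a);
endfunction

function y = absf(x), y = abs(x), endfunction
disp(NL(absf, -1, 1))    // prints approximately 0.25 = L/4 for f(x) = |x|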
Reducing nonlinearity
Turns out, the graph of every Lipschitz function has relatively large almost-flat pieces. That is, there are subintervals of nontrivial size where the measure of nonlinearity is much smaller than the Lipschitz constant. This result is a special (one-dimensional) case of Theorem 2.3 in Affine approximation of Lipschitz functions and nonlinear quotients by Bates, Johnson, Lindenstrauss, Preiss, and Schechtman.
Theorem AA (for “affine approximation”): For every ε > 0 there exists δ > 0 with the following property. If f is an L-Lipschitz function on [0, 1], then there exists an interval I ⊂ [0, 1] with |I| ≥ δ and NL(f; I) ≤ εL.
Theorem AA should not be confused with Rademacher’s theorem, which says that a Lipschitz function is differentiable almost everywhere. The point here is the lower bound on the size of the interval I. Differentiability does not provide that. In fact, if we knew that f is smooth, or even a polynomial, the proof of Theorem AA would not become any easier.
Proof of Theorem AA
We may assume and . For let . That is, is the restricted Lipschitz constant, one that applies for distances at least . It is a decreasing function of , and .
Note that and that every value of is within of either or . Hence, the oscillation of on is at most . If , then the constant mid-range approximation on gives the desired conclusion, with . From now on .
The sequence is increasing toward , which implies for some . Pick an interval that realizes , that is and . Without loss of generality (otherwise consider ). Let be the middle half of . Since each point of is within distance of both and , it follows that for all .
The green lines have slope L_{k+1} which is close to the slope L_k of the secant line through (a, f(a)) and (b, f(b)). The graph of f is pinched between these green lines, except near a or b.
So far we have pinched between two affine functions of equal slope. Let us consider their difference: . Recall that , which gives a bound of for the difference. Approximating by the average of the two affine functions we conclude that as required.
It remains to consider the size of , about which we only know so far. Naturally, we want to take the smallest such that holds. Let be this value; then . Here and . The conclusion is that , hence . This finally yields as an acceptable choice, completing the proof of Theorem AA.
Given some data points (x_k, y_k), one can devise various functions that extract information from them. Such as:
Regression line (considered in When the digits of go to 11): it detects a linear trend, minimizing the sum of squares of the residuals.
Natural cubic spline (considered in Connecting dots naturally), which passes through every data point while minimizing the amount of wiggling, measured by the integral of the square of its second derivative. Like this:
Natural cubic spline
A smoothing spline is something in between the above extremes: it insists neither on being a line (i.e., having zero second derivative) nor on passing through the given points (i.e., having zero residuals). Instead, it minimizes a weighted sum of both quantities: the integral of the second derivative squared, and the sum of the residuals squared. Like this plot, where the red points are given but the spline chooses to interpolate the green ones:
Smoothing spline
I’ll demonstrate a version of a smoothing spline that might not be exactly canonical, but is very easy to implement in Matlab or Scilab (which I prefer to use). As in Connecting dots naturally, assume the knots are equally spaced at distance h. The natural cubic spline is determined by the values of its second derivative at the interior knots, denoted z_1, …, z_{n-1}. As explained earlier, these can be found from a linear system with a tridiagonal matrix,
where the column on the right contains the amounts by which the derivative of the piecewise linear interpolant through the given points jumps at every knot. The jump at knot k is the second difference of y divided by h, that is, (y_{k+1} - 2y_k + y_{k-1})/h.
A smoothing spline is also a natural spline, but for a different set of points (x_k, ŷ_k). One has to find the values ŷ_k that minimize a weighted sum of Σ_k (ŷ_k - y_k)^2 and of the integral of the second derivative squared. The latter integral is easily computed in terms of the z_k: it is equal to (h/3) Σ_k (z_k^2 + z_k z_{k+1} + z_{k+1}^2). Since this quadratic form is comparable to Σ_k z_k^2, I’ll work with the latter instead.
The idea is to set up an underdetermined system for the z_k and the adjustments to the data, and let Scilab find a least squares solution of that. Let’s introduce a weight parameter λ that will determine the relative importance of curvature vs residuals. It is convenient to write ŷ_k = y_k + λh^2 u_k, so that the unknowns u_k and z_k (second derivatives) scale the same way. The adjusted values ŷ_k contribute to the linear system for the z_k, since the right-hand side now involves the second differences of ŷ instead of y; this contribution is a multiple of the second difference of u. Moving it to the left-hand side (since the u_k are unknowns), we obtain a combined system with block matrix [A, B],
where A is the same tridiagonal matrix as above, and B is a rectangular Laplacian-type (second difference) matrix.
All in all, the system has 2n unknowns z_1, …, z_{n-1}, u_0, …, u_n, and n-1 equations, reflecting the continuity of the first derivative at each interior knot. The lsq command of Scilab finds the solution with the least sum of squares of the unknowns, which is what we are after.
Time for some examples; a couple can be seen above. Here are more:
lambda = 0.1
lambda = 1
As λ increases, the spline approaches the regression line:
lambda = 10
lambda=100
Finally, the Scilab code. It is almost the same as for the natural spline; the difference is in the five lines from B=... to newy=... The code after that merely evaluates and plots the spline.
a = 0; b = 10
y = [3 1 4 1 5 9 2 6 5 3 5]
lambda = 0.1
n = length(y)-1
h = (b-a)/n
jumps = diff(y,2)/h   // jumps of the slope of the piecewise linear interpolant at the knots
A = (h/6)*(diag(4*ones(1,n-1))+diag(ones(1,n-2),1)+diag(ones(1,n-2),-1))
B = lambda/h*(diag(-2*ones(1,n+1))+diag(ones(1,n),1)+diag(ones(1,n),-1))   // weighted second-difference matrix
C = [A, B(2:$-1,:)]   // underdetermined combined system for [z, u]
sol = lsq(C,-jumps')'   // least squares solution with the smallest sum of squares
z = [0,sol(1:n-1),0]   // second derivatives at the knots; natural spline: zero at the ends
newy = y + lambda*h^2*sol(n:$)   // adjusted data values that the spline interpolates
allx = []
spl = []
for i=1:n
xL = a+h*(i-1)
xR = a+h*i
x = linspace(xL,xR,100)
linear = newy(i)*(xR-x)/h + newy(i+1)*(x-xL)/h
correction = ((z(i+1)+2*z(i))*(xR-x)+(2*z(i+1)+z(i))*(x-xL)).*(xR-x).*(x-xL)/(6*h)
allx = [allx, x]
spl = [spl, linear+correction]
end
plot(allx, spl)
plot(a:h:b, newy, 'g*')
plot(a:h:b, y, 'r*')
Trying to approximate a generic continuous function on an interval by a trigonometric Fourier series of the usual form is in general not very fruitful. Here’s such an approximation to exp(x), with the sum taken over a modest number of terms:
Poor approximation to exp(x)
It’s better to make a linear change of variable and use the cosine series instead: the resulting function is approximated by the partial sum of its cosine Fourier series as follows.
Better approximation after a change of variable
But one can do much better with a different, nonlinear change of variable: x = cos θ. Consider exp(cos θ) on the interval [0, π], and again use the cosine series. This time the partial sum of the cosine Fourier series approximates the function as follows.
Excellent approximation after nonlinear change of variable
Yes, I can’t see any difference either: the error is very small.
The composition with cosine improves approximation because exp(cos θ) is naturally a periodic function, with no jumps or corners in its graph. Fourier series, which are periodic by nature, follow such functions more easily.
A practical implication of this approximation of f(cos θ) is the Clenshaw-Curtis integration method. It can be expressed in one line: the substitution x = cos θ gives
∫ from -1 to 1 of f(x) dx = ∫ from 0 to π of f(cos θ) sin θ dθ = sum over even k ≥ 0 of 2 a_k/(1 - k^2),
where a_k are the coefficients in f(cos θ) = Σ_k a_k cos kθ. The integral is approximated by truncating the sum, that is, by using only finitely many of the even-numbered cosine coefficients of f(cos θ). In the example above, using just three coefficients already yields a value very close to the exact integral.
At first this doesn’t look practical at all: after all, the Fourier coefficients are themselves found by integration. But since f(cos θ) is so close to a trigonometric polynomial, one can sample it at equally spaced points and apply the Fast Fourier transform to the result, quickly obtaining many coefficients at once. This is what the Clenshaw-Curtis quadrature does (at least in principle; a practical implementation may streamline these steps).
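Here is a sketch of the whole procedure in Scilab (the integrand exp(x) and the number of sample points are my choices for illustration, and the cosine coefficients are computed by a naive trapezoid-rule transform rather than an actual FFT):

// Clenshaw-Curtis quadrature sketch for the integral of f over [-1, 1]
function y = f(x), y = exp(x), endfunction
N = 16;                           // number of sample intervals in theta
theta = (0:N)*%pi/N;              // equally spaced angles
g = f(cos(theta));                // g(theta) = f(cos theta) is smooth and periodic
w = [0.5, ones(1,N-1), 0.5];      // trapezoid weights
a = zeros(1,N+1);
for k = 0:N
    a(k+1) = (2/N)*sum(w.*g.*cos(k*theta));   // cosine coefficient a_k of g
end
a(1) = a(1)/2;                    // the constant term uses 1/N, not 2/N
I = 2*a(1);                       // k = 0 term
for k = 2:2:N
    I = I + 2*a(k+1)/(1-k^2);     // even-numbered coefficients only
end
printf("Clenshaw-Curtis: %.10f   exact: %.10f\n", I, exp(1)-exp(-1));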
If one picks two real numbers from the interval [0, 1] (independent, uniformly distributed), their sum has the triangular distribution.
Also known as the hat function
The sum of three such numbers has a differentiable probability density function:
Piecewise quadratic, C1 smooth
And the density of the sum of four such numbers is smoother still: the p.d.f. has two continuous derivatives.
This begins to look normal
As the number of summands increases, these distributions converge to the normal distribution if they are translated and scaled properly. But I am not going to do that. Let’s keep the number of summands to four at most.
The p.d.f. of the sum of n such numbers is a piecewise polynomial of degree n-1. Indeed, for n = 1 the density is piecewise constant, and the formula
f_{n+1}(x) = ∫ from x-1 to x of f_n(t) dt
provides the inductive step.
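A numerical sketch of this recurrence (grid spacing and plot range are arbitrary choices): starting from the density of a single uniform variable, each step below is a moving average over a window of length 1.

// densities of S_1, ..., S_4 built on a grid via f_{n+1}(x) = integral of f_n over [x-1, x]
dx = 0.001;
x  = 0:dx:4;
f  = zeros(x);  f(x <= 1) = 1;      // density of one uniform variable on [0,1]
plot(x, f)
for n = 1:3
    F = cumsum(f)*dx;               // antiderivative of the current density
    m = round(1/dx);                // grid steps per unit length
    f = F - [zeros(1,m), F(1:$-m)]; // f_{n+1}(x) = F(x) - F(x-1)
    plot(x, f)
end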
For each n, the integer translates of the density f_n form a partition of unity:
sum over k in Z of f_n(x - k) = 1.
The integral recurrence relation gives an easy proof of this: the translates of f_{n+1} are integrals of f_n over adjacent intervals of length 1, so their sum is the integral of f_n over the whole line, which is 1.
And here is the picture for the quadratic case:
Piecewise quadratic partition of unity
A partition of unity can be used to approximate functions by piecewise polynomials: just multiply each partition element by the value of the function at the center of the corresponding interval, and add the results.
Doing this with the triangular density (two summands) amounts to piecewise linear interpolation: the original function is in blue, the weighted sum of hat functions in red.
PL interpolation of exp(-x/2)
With the piecewise cubic density (four summands) we get a smooth curve.
Approximating exp(-x/2) with a cubic B-spline
Unlike interpolating splines, this curve does not attempt to pass through the given points exactly. However, it has several advantages over interpolating splines:
It is easier to calculate; there is no linear system to solve.
If a function f is differentiable at a point a, then it admits a good linear approximation at small scales. Precisely: for every ε > 0 there are δ > 0 and a linear function ℓ such that |f(x) - ℓ(x)| ≤ εδ for all |x - a| ≤ δ. Having ε multiplied by δ means that the deviation from linearity is small compared to the (already small) scale δ on which the function is considered.
For example, this is a linear approximation to the exponential function near a point, at a small scale.
Linear approximation to exponential function
As is done on this graph, we can always take ℓ to be the secant line to the graph of f over the interval of consideration. Indeed, if g is another linear function for which the approximation property holds, then |ℓ - g| ≤ εδ at the endpoints of the interval, and therefore on all of it, because |ℓ - g| is a convex function.
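A quick numerical illustration with the exponential function (the point a = 1 and the scales are arbitrary choices): the deviation from the secant line, divided by the scale δ, tends to 0 as δ shrinks.

// deviation of exp from its secant line on [a-delta, a+delta], relative to delta
a = 1;
for delta = [0.1 0.01 0.001]
    x = linspace(a-delta, a+delta, 1001);
    k = (exp(a+delta) - exp(a-delta))/(2*delta);    // slope of the secant line
    secant = exp(a-delta) + k*(x - (a-delta));
    printf("delta = %6.4f   max deviation / delta = %.6f\n", delta, max(abs(exp(x) - secant))/delta);
end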
Here is a non-differentiable function that obviously fails the linear approximation property at 0.
Self-similar graph
(By the way, this post is mostly about me trying out SageMathCloud.) A nice thing about this function is its self-similarity: it satisfies f(cx) = c f(x) for a suitable constant c, the similarity factor. This implies that no matter how far we zoom in on the graph at 0, the graph will not get any closer to linear.
I like it more than its famous, but not self-similar, cousin x sin(1/x), pictured below.
Standard example from intro to real analysis
Interestingly, the linear approximation property does not imply differentiability. The function x sin(sqrt(log(1/x))) has this property at 0, but it lacks a derivative there, since sin(sqrt(log(1/x))) does not have a limit as x → 0. Here is how it looks.
Now with the square root!
Let’s look at the scale 0.1
scale 0.1
and compare to the scale 0.001:
scale 0.001
Well, that was disappointing. Let’s use math instead. Fix and consider the function . Rewriting it as
shows as . Choose so that and define . Then for we have , and for the trivial bound suffices.
Thus, the function can be well approximated by linear functions near 0; it’s just that the approximating linear function has to depend on the scale on which the approximation is made: its slope does not have a limit as the scale tends to 0.
The linear approximation property does not become apparent until extremely small scales. Here is the picture at a much smaller scale than before.
For some reason I wanted to construct polynomials approximating this piecewise constant function:
So square
Of course the approximation cannot be uniform, since the function is not continuous. But it can be achieved in the sense of convergence of graphs in the Hausdorff metric: their limit should be the “graph” shown above, with the vertical line included. In concrete terms, this means that for every ε > 0 there is N such that for n ≥ N the polynomial p_n satisfies
and also
How to get such polynomials explicitly? I started with the functions exp(-x^m) for large m. The idea is that as m → ∞, the limit of exp(-x^m) is what is wanted: 1 when 0 ≤ x < 1, and 0 when x > 1. Also, for each m there is a Taylor polynomial that approximates exp(-x^m) uniformly on a bounded interval. Since the Taylor series is alternating, it is not hard to find a suitable degree. Let’s aim for a modest error in the Taylor remainder and see where this leads:
Degree polynomial for
Degree polynomial for
Degree polynomial for
Degree polynomial for
Degree polynomial for
The results are unimpressive, though:
Taylor polynomials of exp(-x^m) are not so square
To get within a reasonable distance of the desired square-ness, the exponent m has to be quite large. Then, to keep the Taylor remainder small on the whole plotting window, the degree must be enormous: instead of messing with Stirling’s formula, just observe that the terms x^(mk)/k! do not even begin to decrease until k exceeds x^m. That’s a … high degree polynomial. I would not try to ask a computer algebra system to plot it.
Bernstein polynomials of the step function itself do better. To avoid a sample point landing exactly at the jump, it is better to use odd degrees. For comparison, I used the same or smaller degrees as above.
Squarish Bernstein polynomials
Looks good. But I don’t know of a way to estimate the degree of the Bernstein polynomial required to obtain Hausdorff distance less than a given ε from the square function.
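One can at least watch the transition zone shrink numerically. Here is a sketch with an assumed step function, equal to 1 on [0, 1/2) and 0 on (1/2, 1], standing in for the square function, and arbitrary odd degrees:

// Bernstein polynomials of a step function; how wide is the zone where the error exceeds 0.1?
function y = stepf(t), y = bool2s(t < 0.5), endfunction
x = linspace(0, 1, 2001);
for n = [7 25 101]                              // odd degrees, chosen arbitrarily
    b = zeros(x);
    for k = 0:n
        c = factorial(n)/(factorial(k)*factorial(n-k));
        b = b + stepf(k/n) * c * x.^k .* (1-x).^(n-k);
    end
    bad = find(abs(b - stepf(x)) >= 0.1);       // points where the error is at least 0.1
    printf("degree %3d: error >= 0.1 on an interval of length %.3f around the jump\n", n, max(x(bad)) - min(x(bad)));
end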