## Between a cubic spline and the line of regression

Given some data points ${(x_i,y_i)}$ one can devise various functions that extract information from them. Such as

A smoothing spline is something in between the above extremes: it insists on neither being a line (i.e., having zero second derivative) nor on passing through given points (i.e., having zero residuals). Instead, it minimizes a weighted sum of both things: the integral of second derivative squared, and the sum of residuals squared. Like this plot, where the red points are given but the spline chooses to interpolate the green ones:

I’ll demonstrate a version of a smoothing spline that might not be exactly canonical, but is very easy to implement in Matlab or Scilab (which I prefer to use). As in Connecting dots naturally, assume the knots ${a=x_0 equally spaced at distance ${h=(b-a)/n}$. The natural cubic spline is determined by the values of its second derivative at ${x_1,\dots,x_{n-1}}$, denoted ${z_1,\dots,z_n}$. As explained earlier, these can be found from the linear system

$\displaystyle \frac{h}{6}\begin{pmatrix} 4 & 1 & 0 & 0 & \ldots & 0 & 0 \\ 1 & 4 & 1 & 0 & \ldots & 0 & 0 \\ 0 & 1 & 4 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \ldots & \ldots \\ 0 & 0 & 0 & 0 & \ldots & 1 & 4 \end{pmatrix} \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_{n-1}\end{pmatrix} = - \frac{1}{h} \begin{pmatrix} \Delta^2 y_1 \\ \Delta^2 y_2 \\ \vdots \\ \Delta^2 y_{n-1}\end{pmatrix}$

where the column on the right contains the amounts by which the derivative of the piecewise linear interpolant through given points jumps at every knot. The notation ${\Delta^2y }$ means the second difference of ${(y_i)}$, for example ${\Delta^2y_1 =y_2-2y_1+y_0}$.

A smoothing spline is also a natural spline, but for a different set of points ${(x_i,\tilde y_i)}$. One has to find ${\tilde y_i}$ that minimize a weighted sum of ${\sum_i (\tilde y_i-y_i)^2}$ and of ${\int_a^b (f''(x))^2\,dx}$. The latter integral is easily computed in terms of ${z_i}$: it is equal to ${\frac{h}{3}\sum_{i=1}^{n}(z_i^2+z_{i-1}^2+z_iz_{i-1})}$. Since this quadratic form is comparable to ${\sum_{i=1}^{n}z_i^2}$, I’ll work with the latter instead.

The idea is to set up an underdetermined system for ${z_i}$ and ${\tilde y_i-y_i}$, and let Scilab find a least squares solution of that. Let’s introduce a weight parameter ${\lambda>0}$ that will determine the relative importance of curvature vs residuals. It is convenient to let ${\tilde y_i=y_i+\lambda h^2 \delta_i}$, so that both ${\delta_i}$ and ${z_i}$ (second derivative) scale the same way. The terms ${\delta_i}$ contributes to the linear system for ${z_i}$, since the right hand side now has ${\tilde y_i}$ instead of ${y_i}$. This contribution is ${- \lambda h \Delta^2 \delta}$. Moving it to the left hand-side (since ${\delta_i}$ are unknowns) we obtain the following system.

$\displaystyle \begin{pmatrix} A & | & B \end{pmatrix} \begin{pmatrix} z \\ \delta \end{pmatrix} = - \frac{1}{h} \Delta^2 y$

where ${A}$ is the same tridiagonal matrix as above, and ${B}$ is the rectangular Laplacian-type matrix

$\displaystyle B = \frac{\lambda}{h} \begin{pmatrix} -1 & 2 & -1 & 0 & \ldots & 0 & 0 \\ 0 & -1 & 2 & -1 & \ldots & 0 & 0 \\ 0 & 0 & -1 & 2 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \ldots & \ldots \\ 0 & 0 & 0 & 0 & \ldots & 2 & -1 \end{pmatrix}$

All in all, the system has ${2n }$ unknowns ${z_1,\dots,z_{n-1},\delta_0,\dots,\delta_n}$, and ${(n-1)}$ equations, reflecting the continuity of first derivative at each interior knot. The lsq command of Scilab finds the solution with the least sum of squares of the unknowns, which is what we are after.

Time for some examples. ${\lambda=0}$ and ${\lambda=0.05}$ can be seen above. Here are more:

As ${\lambda}$ increases, the spline approaches the regression line:

Finally, the Scilab code. It is almost the same as for natural spline; the difference is in five lines from B=... to newy=... The code after that is merely evaluation and plotting of the spline.

a = 0; b = 10
y = [3 1 4 1 5 9 2 6 5 3 5]
lambda = 0.1

n = length(y)-1
h = (b-a)/n
jumps = diff(y,2)/h
A = (h/6)*(diag(4*ones(1,n-1))+diag(ones(1,n-2),1)+diag(ones(1,n-2),-1))

B = lambda/h*(diag(-2*ones(1,n+1))+diag(ones(1,n),1)+diag(ones(1,n),-1))
C = [A, B(2:$-1,:)] sol = lsq(C,-jumps')' z = [0,sol(1:n-1),0] newy = y + lambda*h^2*sol(n:$)

allx = []
spl = []
for i=1:n
xL = a+h*(i-1)
xR = a+h*i
x = linspace(xL,xR,100)
linear = newy(i)*(xR-x)/h + newy(i+1)*(x-xL)/h
correction = ((z(i+1)+2*z(i))*(xR-x)+(2*z(i+1)+z(i))*(x-xL)).*(xR-x).*(x-xL)/(6*h)
allx = [allx, x]
spl = [spl, linear+correction]
end

plot(allx, spl)
plot(a:h:b, newy, 'g*')
plot(a:h:b, y, 'r*')

## Trigonometric approximation and the Clenshaw-Curtis quadrature

Trying to approximate a generic continuous function on ${[-1,1]}$ with the Fourier trigonometric series of the form ${\sum_n (A_n\cos \pi nx+B_n\sin \pi nx)}$ is in general not very fruitful. Here’s such an approximation to ${f(x)=\exp(x)}$, with the sum over ${n\le 4}$:

It’s better to make a linear change of variable: consider ${f(2x-1)}$ on the interval ${[0,1]}$, and use the formula for the cosine series. This results in ${\exp(2x-1)}$, which is approximated by the partial sum ${\sum_{n=0}^4 A_n\cos \pi nx}$ of its cosine Fourier series as follows.

But one can do much better with a different, nonlinear change of variable. Consider ${f(\cos x)}$ on the interval ${[0,\pi]}$, and again use the formula for the cosine series. This results in ${\exp(\cos x)}$, which is approximated by the partial sum ${\sum_{n=0}^4 A_n\cos nx}$ of its cosine Fourier series as follows.

Yes, I can’t see any difference either: the error is less than ${10^{-3}}$.

The composition with cosine improves approximation because ${f(\cos t)}$ is naturally a periodic function, with no jumps or corners in its graph. Fourier series, which are periodic by nature, follow such functions more easily.

A practical implication of this approximation of ${f(\cos t)}$ is the Clenshaw-Curtis integration method. It can be expressed in one line:

$\displaystyle \int_{-1}^1 f(x)\,dx = \int_0^\pi f(\cos t)\sin t\,dt \approx \int_0^\pi \sum_{n=0}^N a_n \cos nt \sin t\,dt = \sum_{n=0}^N \frac{(1+(-1)^n) a_n}{1-n^2}$

The integral ${\int_{-1}^1 f(x)\,dx}$ is approximated by summing ${2a_{2k}/(1-4k^2)}$, where ${a_{2k}}$ are even-numbered cosine coefficients of ${f(\cos t)}$. In the example with ${f(x)=\exp(x)}$ using just three coefficients yields

$\displaystyle \frac{2a_0}{1}+\frac{2a_2}{-3}+\frac{2a_4}{-15} \approx 2.350405$

while the exact integral is ${\approx 2.350402}$.

At first this doesn’t look practical at all: after all, the Fourier coefficients are themselves found by integration. But since ${f(\cos t)}$ is so close to a trigonometric polynomial, one can sample it at equally spaced points and apply the Fast Fourier transform to the result, quickly obtaining many coefficients at once. This is what the Clenshaw-Curtis quadrature does (at least in principle, the practical implementation may streamline these steps.)

## B-splines and probability

If one picks two real numbers ${X_1,X_2}$ from the interval ${[0,1]}$ (independent, uniformly distributed), their sum ${S_2=X_1+X_2}$ has the triangular distribution.

The sum ${S_3}$ of three such numbers has a differentiable probability density function:

And the density of ${S_4=X_1+X_2+X_3+X_4}$ is smoother still: the p.d.f. has two
continuous derivatives.

As the number of summands increases, these distributions converge to normal if they are translated and scaled properly. But I am not going to do that. Let’s keep the number of summands to four at most.

The p.d.f. of ${S_n}$ is a piecewise polynomial of degree ${n-1}$. Indeed, for ${S_1=X_1}$ the density is piecewise constant, and the formula

$\displaystyle S_n(x) = \int_{x-1}^x S_{n-1}(t)\,dt$

provides the inductive step.

For each ${n}$, the translated copies of function ${S_n}$ form a partition of unity:

$\displaystyle \sum_{k\in\mathbb Z}S_n(x-k)\equiv 1$

The integral recurrence relation gives an easy proof of this:

$\displaystyle \sum_{k\in\mathbb Z}\int_{x-k-1}^{x-k} S_{n-1}(t)\,dt = \int_{\mathbb R} S_{n-1}(t)\,dt = 1$

And here is the picture for the quadratic case:

A partition of unity can be used to approximate functions by piecewise polynomials: just multiply each partition element by the value of the function at the center of the corresponding interval, and add the results.

Doing this with ${S_2}$ amounts to piecewise linear interpolation: the original function ${f(x)=e^{-x/2}}$ is in blue, the weighted sum of hat functions in red.

With ${S_4}$ we get a smooth curve.

Unlike interpolating splines, this curve does not attempt to pass through the given points exactly. However, it has several advantages over interpolating splines:

• Is easier to calculate; no linear system to solve;
• Yields positive function for positive data;
• Yields monotone function for monotone data

## Linear approximation and differentiability

If a function ${f\colon \mathbb R\rightarrow \mathbb R}$ is differentiable at ${a\in \mathbb R}$, then it admits good linear approximation at small scales. Precisely: for every ${\epsilon>0}$ there is ${\delta>0}$ and a linear function ${\ell(x)}$ such that ${|f(x)-\ell(x)|<\epsilon \,\delta}$ for all ${|x|<\delta}$. Having ${\delta}$ multiplied by ${\epsilon}$ means that the deviation from linearity is small compared to the (already small) scale ${\delta}$ on which the function is considered.

For example, this is a linear approximation to ${f(x)=e^x}$ near ${0}$ at scale ${\delta=0.1}$.

As is done on this graph, we can always take ${\ell}$ to be the secant line to the graph of ${f}$ based on the endpoints of the interval of consideration. This is because if ${L}$ is another line for which ${|f(x)-L(x)|<\epsilon \,\delta}$ holds, then ${|\ell-L|\le \epsilon \,\delta}$ at the endpoints, and therefore on all of the interval (the function ${x\mapsto |\ell(x)-L(x)|}$ is convex).

Here is a non-differentiable function that obviously fails the linear approximation property at ${0}$.

(By the way, this post is mostly about me trying out SageMathCloud.) A nice thing about ${f(x)=x\sin \log |x|}$ is self-similarity: ${f(rx)=rf(x)}$ with the similarity factor ${r=e^{2\pi}}$. This implies that no matter how far we zoom in on the graph at ${x=0}$, the graph will not get any closer to linear.

I like ${x\sin \log |x|}$ more than its famous, but not self-similar, cousin ${x\sin(1/x)}$, pictured below.

Interestingly, linear approximation property does not imply differentiability. The function ${f(x)=x\sin \sqrt{-\log|x|}}$ has this property at ${0}$, but it lacks derivative there since ${f(x)/x}$ does not have a limit as ${x\rightarrow 0}$. Here is how it looks.

Let’s look at the scale ${\delta=0.1}$

and compare to the scale ${\delta=0.001}$

Well, that was disappointing. Let’s use math instead. Fix ${\epsilon>0}$ and consider the function ${\phi(\delta)=\sqrt{-\log \delta}-\sqrt{-\log (\epsilon \delta)}}$. Rewriting it as

$\displaystyle \frac{\log \epsilon}{\sqrt{-\log \delta}+\sqrt{-\log (\epsilon \delta)}}$

shows ${\phi(\delta)\rightarrow 0}$ as ${\delta\rightarrow 0}$. Choose ${\delta}$ so that ${|\phi(\delta)|<\epsilon}$ and define ${\ell(x)=x\sqrt{-\log \delta}}$. Then for ${\epsilon \,\delta\le |x|< \delta}$ we have ${|f(x)-\ell(x)|\le \epsilon |x|<\epsilon\,\delta}$, and for ${|x|<\epsilon \delta}$ the trivial bound ${|f(x)-\ell(x)|\le |f(x)|+|\ell(x)|}$ suffices.

Thus, ${f}$ can be well approximated by linear functions near ${0}$; it’s just that the linear function has to depend on the scale on which approximation is made: its slope ${\sqrt{-\log \delta}}$ does not have a limit as ${\delta\to0}$.

The linear approximation property does not become apparent until extremely small scales. Here is ${\delta = 10^{-30}}$.

## Squarish polynomials

For some reason I wanted to construct polynomials approximating this piecewise constant function ${f}$:

Of course approximation cannot be uniform, since the function is not continuous. But it can be achieved in the sense of convergence of graphs in the Hausdorff metric: their limit should be the “graph” shown above, with the vertical line included. In concrete terms, this means for every ${\epsilon>0}$ there is ${N}$ such that for ${n\ge N}$ the polynomial ${p_n}$ satisfies

$\displaystyle |p_n-f|\le \epsilon\quad \text{ on }\ [0,2]\setminus [1-\epsilon,1+\epsilon]$

and also

$\displaystyle -\epsilon\le p_n \le 1+\epsilon\quad \text{ on }\ [1-\epsilon,1+\epsilon]$

How to get such ${p_n}$ explicitly? I started with the functions ${f_m(x) = \exp(-x^m)}$ when ${m}$ is large. The idea is that as ${m\rightarrow\infty}$, the limit of ${\exp(-x^m)}$ is what is wanted: ${1}$ when ${x<1}$, ${0}$ when ${x>1}$. Also, for each ${m}$ there is a Taylor polynomial ${T_{m,n}}$ that approximates ${f_m}$ uniformly on ${[0,2]}$. Since the Taylor series is alternating, it is not hard to find suitable ${n}$. Let’s shoot for ${\epsilon=0.01}$ in the Taylor remainder and see where this leads:

• Degree ${7}$ polynomial for ${\exp(-x)}$
• Degree ${26}$ polynomial for ${\exp(-x^2)}$
• Degree ${69}$ polynomial for ${\exp(-x^3)}$
• Degree ${180}$ polynomial for ${\exp(-x^4)}$
• Degree ${440}$ polynomial for ${\exp(-x^5)}$

The results are unimpressive, though:

To get within ${0.01}$ of the desired square-ness, we need ${\exp(-1.01^m)<0.01}$. This means ${m\ge 463}$. Then, to have the Taylor remainder bounded by ${0.01}$ at ${x=2}$, we need ${2^{463n}/n! < 0.01}$. Instead of messing with Stirling’s formula, just observe that ${2^{463n}/n!}$ does not even begin to decrease until ${n}$ exceeds ${2^{463}}$, which is more than ${10^{139}}$. That’s a … high degree polynomial. I would not try to ask a computer algebra system to plot it.

Bernstein polynomials turn out to work better. On the interval ${[0,2]}$ they are given by

$\displaystyle p_n(x) = 2^{-n} \sum_{k=0}^n f(2k/n) \binom{n}{k} x^k (2-x)^{n-k}$

To avoid dealing with ${f(1)}$, it is better to use odd degrees. For comparison, I used the same or smaller degrees as above: ${7, 25, 69, 179, 439}$.

Looks good. But I don’t know of a way to estimate the degree of Bernstein polynomial required to obtain Hausdorff distance less than a given ${\epsilon}$ (say, ${0.01}$) from the square function.