Given some data points, one can devise various functions that extract information from them, such as:
- Regression line (considered in When the digits of π go to 11): it detects a linear trend, minimizing the sum of squares of the residuals.
- Natural cubic spline (considered in Connecting dots naturally), which passes through every data point while minimizing the amount of wiggling, measured by the integral of the square of its second derivative. Like this:
A smoothing spline is something in between the above extremes: it insists neither on being a line (i.e., having zero second derivative) nor on passing through the given points (i.e., having zero residuals). Instead, it minimizes a weighted sum of both quantities: the integral of the second derivative squared, and the sum of the residuals squared. Like this plot, where the red points are given but the spline chooses to interpolate the green ones:
I’ll demonstrate a version of a smoothing spline that might not be exactly canonical, but is very easy to implement in Matlab or Scilab (which I prefer to use). As in Connecting dots naturally, assume the knots $x_0, x_1, \dots, x_n$ are equally spaced at distance $h$. The natural cubic spline is determined by the values of its second derivative at the interior knots, denoted $z_1, \dots, z_{n-1}$ (at the endpoints, $z_0 = z_n = 0$). As explained earlier, these values can be found from the linear system

$$\frac{h}{6}\left(z_{j-1} + 4z_j + z_{j+1}\right) = \frac{\Delta^2 y_j}{h}, \qquad j = 1, \dots, n-1,$$

where the column on the right contains the amounts by which the derivative of the piecewise linear interpolant through the given points jumps at each interior knot. The notation $\Delta^2 y_j$ means the second difference of $y$: for example, $\Delta^2 y_1 = y_0 - 2y_1 + y_2$.
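To make the right-hand side concrete, here is a small NumPy sketch (Python rather than the post's Scilab; the data vector is purely illustrative) checking that the derivative of the piecewise linear interpolant jumps by exactly the second difference divided by $h$ at each interior knot:

```python
import numpy as np

# Illustrative sample data at equally spaced knots (h = 1).
y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
h = 1.0

# Slopes of the piecewise linear interpolant on each interval.
slopes = np.diff(y) / h

# Jump of the derivative at each interior knot...
jumps = np.diff(slopes)

# ...equals the second difference of y divided by h.
second_diff = np.diff(y, 2) / h

print(np.allclose(jumps, second_diff))  # True
```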
A smoothing spline is also a natural spline, but for a different set of points $\tilde y_0, \dots, \tilde y_n$. One has to find the $\tilde y_j$ that minimize a weighted sum of $\sum_j (\tilde y_j - y_j)^2$ and of $\int (f'')^2$. The latter integral is easily computed in terms of $z_j$: it is equal to $\frac{h}{3} \sum_{j=0}^{n-1} \left(z_j^2 + z_j z_{j+1} + z_{j+1}^2\right)$. Since this quadratic form is comparable to $\sum_j z_j^2$, I’ll work with the latter instead.
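The closed form of the integral can be sanity-checked numerically. A NumPy sketch (the $z$ values are arbitrary numbers of my choosing; Simpson's rule is exact here because the square of a piecewise linear function is piecewise quadratic):

```python
import numpy as np

# Second-derivative values at the knots of a natural spline (z_0 = z_n = 0);
# the interior values are arbitrary illustrative numbers.
z = np.array([0.0, 2.0, -1.0, 3.0, 0.0])
h = 0.5

# Closed form: integral of (f'')^2, where f'' is piecewise linear through z.
closed_form = (h / 3) * np.sum(z[:-1]**2 + z[:-1]*z[1:] + z[1:]**2)

# Simpson's rule on each interval, exact for the quadratic integrand.
mid = (z[:-1] + z[1:]) / 2          # value of f'' at each interval midpoint
simpson = (h / 6) * np.sum(z[:-1]**2 + 4*mid**2 + z[1:]**2)

print(np.isclose(closed_form, simpson))  # True
```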
The idea is to set up an underdetermined system for $z_j$ and $\tilde y_j$, and let Scilab find a least squares solution of that. Let’s introduce a weight parameter $\lambda > 0$ that will determine the relative importance of curvature vs residuals. It is convenient to let $\tilde y_j = y_j + \lambda h^2 \delta_j$, so that both $\delta_j$ and $z_j$ (second derivative) scale the same way. The term $\lambda h^2 \delta_j$ contributes $\lambda h\, \Delta^2 \delta_j$ to the linear system for $z_j$, since the right-hand side now has $\Delta^2 \tilde y_j / h$ instead of $\Delta^2 y_j / h$. Moving this contribution to the left-hand side (since the $\delta_j$ are unknowns), we obtain the following system:

$$\begin{pmatrix} A & -\lambda h L \end{pmatrix} \begin{pmatrix} z_1 \\ \vdots \\ z_{n-1} \\ \delta_0 \\ \vdots \\ \delta_n \end{pmatrix} = \frac{1}{h}\begin{pmatrix} \Delta^2 y_1 \\ \vdots \\ \Delta^2 y_{n-1} \end{pmatrix}$$

where $A$ is the same tridiagonal matrix as above, and $L$ is the rectangular $(n-1)\times(n+1)$ Laplacian-type matrix

$$L = \begin{pmatrix} 1 & -2 & 1 & & & \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & 1 & -2 & 1 \end{pmatrix}$$
All in all, the system has $2n$ unknowns $z_1, \dots, z_{n-1}, \delta_0, \dots, \delta_n$, and $n-1$ equations, reflecting the continuity of the first derivative at each interior knot. The lsq command of Scilab finds the solution with the least sum of squares of the unknowns, i.e., with the least $\sum z_j^2 + \sum \delta_j^2$, which is what we are after.
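Scilab's lsq is not available everywhere, but NumPy's np.linalg.lstsq behaves the same way on an underdetermined system: among all solutions, it returns the one of minimum norm. A toy example:

```python
import numpy as np

# Toy underdetermined system: one equation, two unknowns.
# All solutions satisfy x1 + x2 = 2; the minimum-norm one is (1, 1).
C = np.array([[1.0, 1.0]])
b = np.array([2.0])

x, *_ = np.linalg.lstsq(C, b, rcond=None)
print(x)  # [1. 1.]
```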
Time for some examples; a couple can be seen above. Here are more:
As $\lambda$ increases, the spline approaches the regression line:
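This limit can be checked numerically. Below is a NumPy sketch of the construction (variable names are mine, not from the Scilab code): with a very large $\lambda$, the adjusted points land on the least-squares regression line.

```python
import numpy as np

# NumPy sketch of the smoothing-spline system; an illustration, not the
# original Scilab code.
a, b = 0.0, 10.0
y = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5], dtype=float)
n = len(y) - 1
h = (b - a) / n
lam = 1e8                        # very heavy weight on curvature

jumps = np.diff(y, 2) / h        # derivative jumps of the PL interpolant

# A: (n-1)x(n-1) tridiagonal; L: (n-1)x(n+1) second-difference matrix.
A = (h / 6) * (4 * np.eye(n - 1) + np.eye(n - 1, k=1) + np.eye(n - 1, k=-1))
L = np.zeros((n - 1, n + 1))
for j in range(n - 1):
    L[j, j:j + 3] = [1.0, -2.0, 1.0]

# Underdetermined system [A, -lam*h*L][z; delta] = jumps;
# lstsq returns its minimum-norm solution, like Scilab's lsq.
C = np.hstack([A, -lam * h * L])
sol, *_ = np.linalg.lstsq(C, jumps, rcond=None)
delta = sol[n - 1:]
newy = y + lam * h**2 * delta

# For large lambda, newy lies on the least-squares regression line.
x = a + h * np.arange(n + 1)
line = np.polyval(np.polyfit(x, y, 1), x)
print(np.max(np.abs(newy - line)) < 1e-4)  # True
```

The limit is the orthogonal projection of the data onto the kernel of $L$, which consists exactly of the affine sequences; that is why the regression line appears.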
Finally, the Scilab code. It is almost the same as for the natural spline; the difference is in the five lines ending with newy=... The code after that is merely the evaluation and plotting of the spline.
```scilab
a = 0; b = 10;
y = [3 1 4 1 5 9 2 6 5 3 5];
lambda = 0.1;
n = length(y)-1;
h = (b-a)/n;
jumps = diff(y,2)/h;
A = (h/6)*(diag(4*ones(1,n-1)) + diag(ones(1,n-2),1) + diag(ones(1,n-2),-1));
// weight lambda*h, so that the contribution matches newy = y + lambda*h^2*delta
B = lambda*h*(diag(-2*ones(1,n+1)) + diag(ones(1,n),1) + diag(ones(1,n),-1));
C = [A, B(2:$-1,:)];
sol = lsq(C, -jumps')';
z = [0, sol(1:n-1), 0];
newy = y + lambda*h^2*sol(n:$);
allx = [];
spl = [];
for i = 1:n
    xL = a + h*(i-1);
    xR = a + h*i;
    x = linspace(xL, xR, 100);
    linear = newy(i)*(xR-x)/h + newy(i+1)*(x-xL)/h;
    correction = ((z(i+1)+2*z(i))*(xR-x) + (2*z(i+1)+z(i))*(x-xL)).*(xR-x).*(x-xL)/(6*h);
    allx = [allx, x];
    spl = [spl, linear+correction];
end
plot(allx, spl)
plot(a:h:b, newy, 'g*')
plot(a:h:b, y, 'r*')
```
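As a final consistency check, here is a NumPy sketch (again a translation, not the original code) of the $\lambda = 0$ endpoint: the minimum-norm solution then has $\delta = 0$, so the construction reduces to the natural spline through the original points.

```python
import numpy as np

# With lambda = 0 the delta-columns of the system vanish; the minimum-norm
# solution then has delta = 0, and z solves the natural-spline system alone.
y = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5], dtype=float)
n = len(y) - 1
h = 1.0                           # (b - a)/n for a = 0, b = 10
jumps = np.diff(y, 2) / h

A = (h / 6) * (4 * np.eye(n - 1) + np.eye(n - 1, k=1) + np.eye(n - 1, k=-1))
L = np.zeros((n - 1, n + 1))
for j in range(n - 1):
    L[j, j:j + 3] = [1.0, -2.0, 1.0]

lam = 0.0
C = np.hstack([A, -lam * h * L])
sol, *_ = np.linalg.lstsq(C, jumps, rcond=None)
z, delta = sol[:n - 1], sol[n - 1:]

print(np.allclose(delta, 0))      # True: the data are not adjusted at all
print(np.allclose(A @ z, jumps))  # True: the natural-spline system holds
```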