A neat way to visualize a real number $\alpha$ is to make a sunflower out of it. This is an arrangement of points with polar angles $2\pi \alpha k$ and polar radii $\sqrt{k}$, $k=1,2,3,\dots$ (so that the concentric disks around the origin get the number of points proportional to their area). The prototypical sunflower has $\alpha=\frac{\sqrt{5}+1}{2}$, the golden ratio. This is about the most uniform arrangement of points within a disk that one can get.

But nothing stops us from using other numbers. The square root of 5 is not nearly as uniform, forming distinct spirals.

Another irrational number begins with spirals, but quickly turns into something more uniform.

The number $\pi$ has stronger spirals: seven of them, due to the approximation $\pi\approx 22/7$.

Of course, if $\pi$ was actually $22/7$, the arrangement would have rays instead of spirals:

What if we used more points? The previous pictures have 500 points; here is $\pi$ again, with 3000. The new pattern has 113 rays: $\pi\approx 355/113$.
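The spiral and ray counts come from the continued-fraction convergents of $\pi$: the denominators 7 and 113. Here is a quick Python check (illustrative only — the post's own code is Scilab, and the helper name `convergents` is mine):

```python
import math
from fractions import Fraction

def convergents(x, k):
    """First k continued-fraction convergents of x (limited by float precision)."""
    result = []
    p0, p1 = 1, 0   # numerator recurrence p_n = a_n p_{n-1} + p_{n-2}
    q0, q1 = 0, 1   # denominator recurrence
    a = x
    for _ in range(k):
        q = math.floor(a)
        p0, p1 = q*p0 + p1, p0
        q0, q1 = q*q0 + q1, q0
        result.append(Fraction(p0, q0))
        if a == q:
            break
        a = 1/(a - q)
    return result

pi_approx = convergents(math.pi, 5)   # 3, 22/7, 333/106, 355/113, ...
```

Both famous approximations show up among the first few convergents, which is why the 500-point and 3000-point pictures lock onto 7 and 113 arms respectively.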

Apéry’s constant $\zeta(3)$, after beginning with five spirals ($\zeta(3)\approx 6/5$), refuses to form rays or spirals again even with 3000 points.

The images were made with Scilab as follows, with an offset by 1/2 in the polar radius to prevent the center from sticking out too much.

n = 500
alpha = (sqrt(5)+1)/2
r = sqrt([1:n]-1/2)
theta = 2*%pi*alpha*[1:n]
plot(r.*cos(theta), r.*sin(theta), '*');
set(gca(), "isoview", "on")

Given some data points one can devise various functions that extract information from them. Such as

Regression line (considered in When the digits of $\pi$ go to 11): it detects a linear trend $y=ax+b$, minimizing the sum of squares of the residuals $y_i-(ax_i+b)$.

Natural cubic spline (considered in Connecting dots naturally), which passes through every data point while minimizing the amount of wiggling, measured by the integral of the square of its second derivative. Like this:

A smoothing spline is something in between the above extremes: it insists on neither being a line (i.e., having zero second derivative) nor on passing through given points (i.e., having zero residuals). Instead, it minimizes a weighted sum of both things: the integral of second derivative squared, and the sum of residuals squared. Like this plot, where the red points are given but the spline chooses to interpolate the green ones:

I’ll demonstrate a version of a smoothing spline that might not be exactly canonical, but is very easy to implement in Matlab or Scilab (which I prefer to use). As in Connecting dots naturally, assume the knots $x_0,\dots,x_n$ are equally spaced at distance $h$. The natural cubic spline is determined by the values $z_1,\dots,z_{n-1}$ encoding its second derivative at the interior knots (with $z_0=z_n=0$). As explained earlier, these can be found from the linear system

$\frac{h}{6}\begin{pmatrix} 4 & 1 & & \\ 1 & 4 & \ddots & \\ & \ddots & \ddots & 1 \\ & & 1 & 4 \end{pmatrix}\begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_{n-1}\end{pmatrix} = -\frac{1}{h}\begin{pmatrix} \Delta^2 y_1 \\ \Delta^2 y_2 \\ \vdots \\ \Delta^2 y_{n-1}\end{pmatrix}$

where the column on the right contains (up to sign) the amounts by which the derivative of the piecewise linear interpolant through the given points jumps at each knot. The notation $\Delta^2 y_j$ means the second difference of $y$; for example, $\Delta^2 y_1 = y_0-2y_1+y_2$.

A smoothing spline is also a natural spline, but for a different set of points $(x_j,\tilde y_j)$. One has to find $\tilde y_j$ that minimize a weighted sum of $\sum_j (y_j-\tilde y_j)^2$ and of $\int (f'')^2$. The latter integral is easily computed in terms of $z_j$: it is equal to $\frac{h}{3}\sum_{j=0}^{n-1}\left(z_j^2+z_jz_{j+1}+z_{j+1}^2\right)$. Since this quadratic form is comparable to $\sum_j z_j^2$, I’ll work with the latter instead.
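For a piecewise linear $f''$ taking values $z_j$ at knots spaced $h$ apart, $\int (f'')^2=\frac{h}{3}\sum_j (z_j^2+z_jz_{j+1}+z_{j+1}^2)$. A brute-force Python check of this identity (my own sketch, not part of the post):

```python
z = [0.0, 2.0, -1.0, 3.0, 0.5, 0.0]   # sample values of f'' at the knots
h = 0.7                                # knot spacing

# closed form: (h/3) * sum over intervals of (z_j^2 + z_j z_{j+1} + z_{j+1}^2)
closed = (h/3)*sum(z[j]**2 + z[j]*z[j+1] + z[j+1]**2 for j in range(len(z)-1))

# brute force: midpoint rule on a fine grid, interval by interval
K = 10000
brute = 0.0
for j in range(len(z)-1):
    for k in range(K):
        u = (k + 0.5)/K                    # relative position within the interval
        val = z[j] + (z[j+1] - z[j])*u     # f'' is linear on each interval
        brute += val*val*(h/K)
```

The two numbers agree to quadrature accuracy, confirming the formula used above.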

The idea is to set up an underdetermined system for $z_j$ and the corrections to $y_j$, and let Scilab find a least squares solution of that. Let’s introduce a weight parameter $\lambda$ that will determine the relative importance of curvature vs residuals. It is convenient to write the new values as $\tilde y_j = y_j+\lambda h^2 w_j$, so that both $w_j$ and $z_j$ (a second derivative) scale the same way. The correction terms contribute to the linear system for $z_j$, since the right hand side now has $\Delta^2 \tilde y_j$ instead of $\Delta^2 y_j$. This contribution is a multiple of the second difference $\Delta^2 w_j$. Moving it to the left hand side (since the $w_j$ are unknowns) we obtain the following system,

where $A$ is the same tridiagonal matrix as above, and $B$ is the rectangular Laplacian-type matrix: $\frac{\lambda}{h}$ times the second-difference matrix, with its first and last rows deleted (see the code below).

All in all, the system has $2n$ unknowns $z_1,\dots,z_{n-1},w_0,\dots,w_n$, and $n-1$ equations, reflecting the continuity of the first derivative at each interior knot. The lsq command of Scilab finds the solution with the least sum of squares of the unknowns, which is what we are after.
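For an underdetermined consistent system, the least-sum-of-squares solution is the minimum-norm one, $x=A^{\mathsf T}(AA^{\mathsf T})^{-1}b$. A tiny pure-Python sketch of this idea (illustrative; this is not how Scilab's lsq is actually implemented, and the helper names are mine):

```python
def solve(M, v):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(M)
    M = [row[:] + [v[i]] for i, row in enumerate(M)]  # augmented matrix
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i]/M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f*M[i][c]
    x = [0.0]*n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c]*x[c] for c in range(i + 1, n)))/M[i][i]
    return x

def lsq_min_norm(A, b):
    """Minimum-norm solution of the underdetermined system A x = b."""
    m, n = len(A), len(A[0])
    AAT = [[sum(A[i][k]*A[j][k] for k in range(n)) for j in range(m)] for i in range(m)]
    y = solve(AAT, b)
    return [sum(A[i][j]*y[i] for i in range(m)) for j in range(n)]

# one equation, two unknowns: x + y = 2, minimizing x^2 + y^2
x = lsq_min_norm([[1.0, 1.0]], [2.0])
```

Among all solutions of $x+y=2$, the smallest-norm one is $(1,1)$, which is what the sketch returns.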

Time for some examples. and can be seen above. Here are more:

As $\lambda$ increases, the spline approaches the regression line:

Finally, the Scilab code. It is almost the same as for the natural spline; the difference is in the five lines from B=... to newy=... The code after that merely evaluates and plots the spline.

a = 0; b = 10
y = [3 1 4 1 5 9 2 6 5 3 5]
lambda = 0.1
n = length(y)-1
h = (b-a)/n
jumps = diff(y,2)/h
A = (h/6)*(diag(4*ones(1,n-1))+diag(ones(1,n-2),1)+diag(ones(1,n-2),-1))
B = lambda/h*(diag(-2*ones(1,n+1))+diag(ones(1,n),1)+diag(ones(1,n),-1))
C = [A, B(2:$-1,:)]
sol = lsq(C,-jumps')'
z = [0,sol(1:n-1),0]
newy = y + lambda*h^2*sol(n:$)
allx = []
spl = []
for i=1:n
xL = a+h*(i-1)
xR = a+h*i
x = linspace(xL,xR,100)
linear = newy(i)*(xR-x)/h + newy(i+1)*(x-xL)/h
correction = ((z(i+1)+2*z(i))*(xR-x)+(2*z(i+1)+z(i))*(x-xL)).*(xR-x).*(x-xL)/(6*h)
allx = [allx, x]
spl = [spl, linear+correction]
end
plot(allx, spl)
plot(a:h:b, newy, 'g*')
plot(a:h:b, y, 'r*')

Pick two random numbers $x_1,x_2$ from the interval $[0,1]$; independent, uniformly distributed. Normalize them to have mean zero, which simply means subtracting their mean from each. Repeat many times. Plot the histogram of all numbers obtained in the process.

No surprise here. In effect this is the distribution of $\pm(X-Y)/2$ with $X,Y$ independent and uniformly distributed over $[0,1]$. The probability density function is found via convolution, and the convolution of the uniform density with itself is a triangular function.
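A quick Python simulation (a stand-in for the Scilab code further below) confirms the support $[-1/2,1/2]$ and the variance $1/24$ of this triangular density:

```python
import random

random.seed(1)
samples = []
for _ in range(100000):
    x1, x2 = random.random(), random.random()
    m = (x1 + x2)/2
    samples += [x1 - m, x2 - m]   # the two normalized numbers, equal to ±(x1-x2)/2

largest = max(abs(s) for s in samples)
variance = sum(s*s for s in samples)/len(samples)
```

Here $\mathrm{Var}\bigl((X-Y)/2\bigr) = \frac14\left(\frac1{12}+\frac1{12}\right)=\frac1{24}$, and the simulation matches.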

Repeat the same with four numbers $x_1,\dots,x_4$, again subtracting the mean. Now the distribution looks vaguely bell-shaped.

With ten numbers or more, the distribution is not so bell-shaped anymore: the top is too flat.

The mean now follows an approximately normal distribution, but the fact that it’s subtracted from the uniformly distributed $x_i$ amounts to convolving the Gaussian with the uniform density. Hence the flattened top.

What if we use the median instead of the mean? With two numbers there is no difference: the median is the same as the mean. With four there is.

That’s an odd-looking distribution, with convex curves on both sides of a pointy maximum. And with 10 points it becomes even more strange.

Scilab code:

k = 10
A = rand(200000,k)
A = A - median(A,'c').*.ones(1,k)
histplot(100,A(:))

To generate a random number $x$ uniformly distributed on the interval $[0,1]$, one can keep tossing a fair coin, record the outcomes as an infinite sequence $d_1d_2d_3\dots$ of 0s and 1s, and let $x=\sum_{n=1}^{\infty} d_n 2^{-n}$. Here is a histogram of samples from the uniform distribution… nothing to see here, except maybe an incidental interference pattern.

Let’s note that $x = \frac{d_1}{2}+\frac{x'}{2}$, where $x'=\sum_{n=1}^{\infty} d_{n+1}2^{-n}$ has the same distribution as $x$ itself, and is independent of $d_1$. This has an implication for the (constant) probability density function $p$ of $x$:

$p(x) = p(2x) + p(2x-1)$

because $2p(2x)$ is the p.d.f. of $\frac{x'}{2}$ and $2p(2x-1)$ is the p.d.f. of $\frac{1+x'}{2}$. Simply put, $p$ is equal to the convolution of the rescaled function $2p(2x)$ with the discrete measure $\frac12\left(\delta_0+\delta_{1/2}\right)$.

Let’s iterate the above construction by letting each $d_n$ be uniformly distributed on $[0,1]$ instead of being constrained to the endpoints. This is like tossing a “continuous fair coin”. Here is a histogram of samples of $x=\sum_{n=1}^{\infty} d_n 2^{-n}$; predictably, with more averaging the numbers gravitate toward the middle.

This is not a normal distribution; the top is too flat. The plot was made with this Scilab code, putting n samples into b buckets:

n = 1e6
b = 200
z = zeros(1,n)
for i = 1:10
z = z + rand(1,n)/2^i
end
c = histplot(b,z)

If this plot is too jagged, look at the cumulative distribution function instead:

It took just one more line of code: plot(linspace(0,1,b),cumsum(c)/sum(c))

Compare the two plots: the c.d.f. looks very similar to the left half of the p.d.f. It turns out, they are identical up to scaling.

Let’s see what is going on here. As before, $x=\frac{d_1}{2}+\frac{x'}{2}$, where $x'$ has the same distribution as $x$ itself, and the summands are independent. But now that $d_1$ is uniform, the implication for the p.d.f. of $x$ is different:

$p(x) = 2\int_0^1 p(2x-u)\,du = 2\bigl(P(2x)-P(2x-1)\bigr), \qquad P'=p.$

This is a direct relation between $p$ and its antiderivative. Incidentally, it shows that $p$ is infinitely differentiable, because the right hand side always has one more derivative than the left hand side.

To state the self-similarity property of $p$ in the cleanest way possible, one introduces the cumulative distribution function $f$ (the Fabius function) and extends it beyond $[0,1]$ by alternating even and odd reflections across the right endpoint. The resulting function satisfies the delay-differential equation $f'(x)=2f(2x)$: the derivative is a rescaled copy of the function itself.

Since $f$ vanishes at the even integers, it follows that at every dyadic rational all but finitely many derivatives of $f$ are zero. The Taylor expansion at such points is a polynomial, while $f$ itself is not. Thus, $f$ is nowhere analytic despite being everywhere $C^\infty$.

This was, in fact, the motivation for J. Fabius to introduce this construction in the 1966 paper Probabilistic Example of a Nowhere Analytic $C^\infty$-Function.

It is easy to find the minimum of a function like this one if you are human. For a computer this takes more work:

The animation shows a simplified form of the Nelder-Mead algorithm: a simplex-based minimization algorithm that does not use any derivatives of $f$. Such algorithms are easy to come up with for functions of one variable, e.g., the bisection method. But how to minimize a function of two variables?

A natural way to look for the minimum is to slide along the graph in the direction opposite to the gradient $\nabla f$; this is the method of steepest descent. But for computational purposes we need a discrete process, not a continuous one. Instead of thinking of a point sliding down, think of a small tetrahedron tumbling down the graph of $f$; this is a discrete process of flips and flops. The process amounts to the triangle of contact being replaced by another triangle with an adjacent side. The triangle is flipped in the direction away from the highest vertex.

This is already a reasonable minimization algorithm: begin with a triangle $T$; find the values of $f$ at the vertices of $T$; reflect the triangle away from the highest value; if the reflected point has a smaller value, move there; otherwise stop.

But there’s a problem: the size of the triangle never changes in this process. If $T$ is large, we won’t know where the minimum is even if $T$ eventually covers it. If $T$ is small, it will be moving in tiny steps.

Perhaps, instead of stopping when reflection does not work anymore, we should reduce the size of $T$. It is natural to contract it toward the “best” vertex (the one with the smallest value of $f$), replacing the two other vertices with the midpoints of the corresponding sides. Then repeat. The stopping condition can be the values of $f$ at all vertices becoming very close to one another.

This looks clever, but the results are unspectacular. The algorithm is prone to converging to a non-stationary point where just by an accident the triangle attains a nearly horizontal position. The problem is that the triangle, while changing its size, does not change its shape to fit the geometry of the graph of $f$.

The Nelder-Mead algorithm adapts the shape of the triangle by including the possibility of stretching while flipping. Thus, the triangle can grow smaller and larger, moving faster when the path is clear, or becoming very thin to fit into a narrow passage. Here is a simplified description:

Begin with some triangle $T$.

Evaluate the function at each vertex. Call the vertices $W$, $G$, $B$, where $W$ is the worst one (the largest value of $f$) and $B$ is the best.

Reflect $W$ about the midpoint of the good side $GB$. Let $R$ be the reflected point.

If $f(R)<f(B)$, then we consider moving even further in the same direction, extending the line $WR$ beyond $R$ by half the length of $WR$; call the extended point $E$. Choose between $R$ and $E$ based on where $f$ is smaller, and make the chosen point a new vertex of our triangle, replacing $W$.

Else, do not reflect and instead shrink the triangle toward $B$.

Repeat, stopping when we either exceed the allowed number of iterations or all values of $f$ at the vertices of the triangle become nearly equal.

(The full version of the Nelder-Mead algorithm also includes the comparison of $f(R)$ with $f(G)$, and also involves trying a point inside the triangle.)
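The steps above can be sketched as a self-contained Python program (my transcription, with Rosenbrock's function as the test case and the same initial triangle as the Scilab listing further below):

```python
def f(x, y):
    """Rosenbrock's function; minimum value 0 attained at (1, 1)."""
    return (1 - x)**2 + 100*(y - x*x)**2

def minimize(tri, tol=1e-9, max_iter=1000):
    vals = [f(*p) for p in tri]
    for _ in range(max_iter):
        # sort vertices: worst (largest value) first, best last
        order = sorted(range(3), key=lambda i: vals[i], reverse=True)
        tri = [tri[i] for i in order]
        vals = [vals[i] for i in order]
        if vals[0] - vals[2] < tol:          # values nearly equal: stop
            break
        W, G, B = tri
        R = (G[0] + B[0] - W[0], G[1] + B[1] - W[1])   # reflect W across GB midpoint
        fR = f(*R)
        if fR < vals[2]:
            # try extending beyond R by half the length of WR
            E = (1.5*(G[0] + B[0]) - 2*W[0], 1.5*(G[1] + B[1]) - 2*W[1])
            fE = f(*E)
            tri[0], vals[0] = (E, fE) if fE < fR else (R, fR)
        else:
            # shrink the triangle toward the best vertex B
            for i in range(2):
                tri[i] = ((tri[i][0] + B[0])/2, (tri[i][1] + B[1])/2)
                vals[i] = f(*tri[i])
    return tri[2]                             # best vertex found

best = minimize([(0.0, -1.5), (1.4, -1.5), (1.5, 0.5)])
```

Run on Rosenbrock's function, this sketch homes in on the minimum at $(1,1)$.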

This is Rosenbrock’s function $f(x,y)=(1-x)^2+100(y-x^2)^2$, one of the standard torture tests for minimization algorithms. Its graph has a narrow valley along the parabola $y=x^2$. At the bottom of the valley, the incline toward the minimum is relatively small, compared to the steep walls surrounding the valley. The steepest descent trajectory quickly reaches the valley but dramatically slows down there, moving in tiny zig-zagging steps.

The algorithm described above gets within $10^{-6}$ of the minimum in 65 steps.

In conclusion, Scilab code with this algorithm.

deff("z=f(x,y)","z=(1-x).^2+100*(y-x.^2).^2") // Rosenbrock's function
x = -0.4:0.1:1.6; y = -2:0.1:1.4 // viewing window
[X,Y] = meshgrid(x,y); contour(x,y,f(X,Y)',30) // contour plot
plot([1],[1],'r+') // minimum point
tol = 10^(-6)
n = 0
T = [0, -1.5 ; 1.4, -1.5; 1.5, 0.5] // initial triangle
for i=1:3
values(i) = f(T(i,1), T(i,2))
end
while (%T)
xpoly(T(:,1),T(:,2),'lines',1) // draw the triangle
[values, index] = gsort(values) // sort the values
T = T(index,:)
if values(1)-values(3) < tol // close enough?
mfprintf(6, "Minimum at (%.3f, %.3f)",T(3,1),T(3,2))
break
end
R = T(2,:) + T(3,:) - T(1,:) // reflected
fR = f(R(1),R(2))
if fR < values(3)
E = 1.5*T(2,:) + 1.5*T(3,:) - 2*T(1,:) // extended
fE = f(E(1),E(2))
if fE < fR
T(1,:)=E; values(1)=fE // pick extended
else
T(1,:)=R; values(1)=fR // pick reflected
end
else
for i=1:2
T(i,:) = (T(i,:)+T(3,:))/2 // shrink
values(i) = f(T(i,1), T(i,2))
end
end
n = n+1
if n >= 200
disp('Failed to converge'); break // too bad
end
end

Here is a streamlined version of the construction.

Fix $r\in(0,1)$, $r\ne \frac12$ (on the above picture, $r=1/4$). Let $f_0(x)=x$, and inductively define $f_{n+1}$ so that

$f_{n+1}(x)=f_n(x)$ when $x\in 2^{-n}\mathbb{Z}$.

If $x\in 2^{-n-1}\mathbb{Z}\setminus 2^{-n}\mathbb{Z}$, let $f_{n+1}(x)=r\,f_n\left(x-2^{-n-1}\right)+(1-r)\,f_n\left(x+2^{-n-1}\right)$.

Now that $f_{n+1}$ has been defined on $2^{-n-1}\mathbb{Z}$, extend it to $[0,1]$ by linear interpolation.

Let $f=\lim_{n\to\infty} f_n$.

Since $f=f_n$ on $2^{-n}\mathbb{Z}$ by construction, it suffices to understand the behavior of $f_n$ on $2^{-n}\mathbb{Z}$.

Each $f_n$ is piecewise linear and increasing. At each step of the construction, every line segment of $f_n$ (say, with slope $s$) is replaced by two segments, with slopes $2rs$ and $2(1-r)s$. Since $\max(2r,2(1-r))=2M<2$, where $M=\max(r,1-r)$, the slopes of $f_n$ do not exceed $(2M)^n$, and $|f_{n+1}-f_n|\le \left|\tfrac12-r\right|M^n$ everywhere. Hence, $f_n$ converges uniformly to an increasing function $f$.

Since $f=f_n$ on $2^{-n}\mathbb{Z}$, it is easy to understand $f$ by considering its values at dyadic rationals and using monotonicity. This is how one can see that:

The difference of values of $f$ at consecutive points of $2^{-n}\mathbb{Z}$ is at most $M^n$, where $M=\max(r,1-r)$. Therefore, $f$ is Hölder continuous with exponent $\frac{\log M}{\log(1/2)}$.

The difference of values of $f$ at consecutive points of $2^{-n}\mathbb{Z}$ is at least $m^n$, where $m=\min(r,1-r)$. Therefore, $f$ is strictly increasing, and its inverse is Hölder continuous with exponent $\frac{\log(1/2)}{\log m}$.
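Both bounds are easy to test numerically. Here is a short Python version of the construction (mirroring the Scilab code at the end of the post) that checks the increments after ten refinement steps:

```python
r = 0.25                    # the value used in the pictures
t = [0.0, 1.0]
f = [0.0, 1.0]
for n in range(10):
    # refine: insert midpoints with value r*f(left) + (1-r)*f(right)
    new_t, new_f = [t[0]], [f[0]]
    for j in range(len(t) - 1):
        new_t += [(t[j] + t[j+1])/2, t[j+1]]
        new_f += [r*f[j] + (1 - r)*f[j+1], f[j+1]]
    t, f = new_t, new_f

diffs = [f[j+1] - f[j] for j in range(len(f) - 1)]   # increments over 2^-10 intervals
m, M = min(r, 1 - r), max(r, 1 - r)
```

Every increment is a product of ten factors, each equal to $r$ or $1-r$, so all of them lie between $m^{10}$ and $M^{10}$, and they sum to $1$.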

It remains to check that $f'=0$ almost everywhere. Since $f$ is monotone, it is differentiable almost everywhere. Let $x$ be a point of differentiability (and not a dyadic rational, though this is automatic). For each $n$ there is $k$ such that $x\in\left(k2^{-n},(k+1)2^{-n}\right)$. Let $s_n=2^n\left(f\left((k+1)2^{-n}\right)-f\left(k2^{-n}\right)\right)$; this is the slope of $f_n$ on the $n$-dyadic interval containing $x$. Since $f'(x)$ exists, we must have $s_n\to f'(x)$. On the other hand, the ratio of consecutive terms of this sequence, $s_{n+1}/s_n$, is always either $2r$ or $2(1-r)$, and neither of these equals $1$. Such a sequence cannot have a finite nonzero limit. Thus $f'(x)=0$.

Here is another such function, with a different value of $r$.

By making $r$ very small, and being more careful with the analysis of the increments, one can make the Hausdorff dimension of the complement of the set $\{f'=0\}$ arbitrarily small.

An interesting modification of Salem’s function was introduced by Tukia in Hausdorff dimension and quasisymmetric mappings (1989). For the functions considered above, the one-sided derivatives at every dyadic rational are zero and infinity, which is a rather non-symmetric state of affairs. In particular, these functions are not quasisymmetric. But Tukia showed that if one alternates between $r$ and $1-r$ at every step, the resulting homeomorphism of $[0,1]$ becomes quasisymmetric. Here is the picture of the alternating construction; preliminary stages of the construction are in green.

This is quite similar to how the introduction of alternating signs turns Takagi curve (blancmange curve) into a quasiarc (i.e., a curve without cusps); see Sweetened and flavored dessert made from gelatinous or starchy ingredients and milk. But the fractal curves in this post are relatively mild-mannered: they are rectifiable (and thus, not really fractal).

Here is the simple Scilab code I used for the above plots.

r = 1/4
t = [0 1]
f = t
for i = 1:10
t = [t, t(1:$-1)+2^(-i)]
f = [f, r*f(1:$-1)+(1-r)*f(2:$)]
[t, ind] = gsort(t,'g','i')
f = f(ind)
end
plot(t,f)

To have preliminary stages shown as well, move plot(t,f) into the loop. For Tukia’s alternating version, insert the line r = 1-r into the loop.

One way is to connect them with straight lines, creating a piecewise linear function:

This is the shortest graph of a function that interpolates the data. In other words, the piecewise linear function minimizes the integral

$\int_a^b \sqrt{1+f'(x)^2}\,dx$

among all functions with $f(x_i)=y_i$. As is often the case, the length functional can be replaced with the elastic energy

$\int_a^b f'(x)^2\,dx$

because the piecewise linear function (and only it) minimizes it too.
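Why the piecewise linear interpolant wins: any admissible perturbation $g$ vanishes at the knots, so the cross term $\int f'g' = g$ evaluated at the endpoints is zero, and the perturbation can only add energy. A small numerical illustration in Python (my sketch, not from the post), on a single interval with $f(x)=x$:

```python
import math

N = 100000  # midpoint-rule resolution on [0, 1]

def energy(fprime):
    """Approximate the elastic energy: the integral of f'(x)^2 over [0, 1]."""
    return sum(fprime((k + 0.5)/N)**2 for k in range(N))/N

line = energy(lambda x: 1.0)   # straight segment from (0, 0) to (1, 1)

# add a bump amp*sin(pi*x), which vanishes at both knots 0 and 1
perturbed = [
    energy(lambda x, a=amp: 1.0 + a*math.pi*math.cos(math.pi*x))
    for amp in (0.1, 0.5, 2.0)
]
```

Every perturbed energy exceeds the straight line's energy of $1$, by $a^2\pi^2/2$ for amplitude $a$.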

Of course, it is not natural for the connecting curve to take such sharp turns at the data points. One could try to fit a polynomial function to these points, which is guaranteed to be smooth. With 11 points we need a 10th degree polynomial. The result is disappointing:

It is not natural for a curve connecting points with values between $1$ and $9$ to shoot up so far above them. We want a connecting curve that does not wiggle more than necessary.

To reduce the wiggling and remove sharp turns at the same time, one can minimize the bending energy of the function, thinking of its graph as a thin metal rod. This energy is

$\int_a^b f''(x)^2\,dx$

and the function that minimizes it subject to the conditions $f(x_i)=y_i$ looks very nice indeed:

The Euler-Lagrange equation for this functional dictates that the fourth derivative of $f$ is zero in the intervals between the knots $x_i$. Thus, $f$ is a piecewise cubic polynomial. Also, both $f$ and $f'$ must be continuous for any function with integrable second derivative. More delicate analysis is required for $f''$, but it also can be shown to be continuous for the minimizing function; moreover, $f''$ must vanish at the endpoints $a$ and $b$. Taken together, these properties (all derived from the variational problem) complete the description of a natural cubic spline.

It remains to actually construct one. I prefer to think of this process as adding a correction term to the piecewise linear interpolant. Here the spline is shown together with the piecewise linear interpolant (green) and the correction term (magenta).

On each interval $[x_i,x_{i+1}]$ the correction term $g$ is a cubic polynomial vanishing at both endpoints. The space of such polynomials is two-dimensional: thus,

$g(x)=\bigl(\alpha_i\,(x_{i+1}-x)+\beta_i\,(x-x_i)\bigr)(x_{i+1}-x)(x-x_i)$

on this interval. There are 20 coefficients $\alpha_i,\beta_i$ to find. At each of the 9 interior knots $x_1,\dots,x_9$ we have two conditions: $g''$ must have a removable singularity, and $g'$ must jump by the amount opposite to the jump of the derivative of the interpolant. Since $g''$ also vanishes at the endpoints, there are 20 linear equations for 20 unknowns.

It is easier to set up a linear system in terms of the values $z_i$ encoding the second derivative of the spline at the knots. Indeed, the values $z_i$ and $z_{i+1}$ at two consecutive knots determine the correction term within the interval between them. This leaves 9 equations (from the jumps of the first derivative) for the 9 unknowns $z_1,\dots,z_9$ (with $z_0=z_{10}=0$). The best part is that the matrix of this system is really nice: tridiagonal with dominant diagonal.
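As a cross-check of this setup, here is the same computation in Python (my transcription of the Scilab code below; the one-sided derivative formulas come from differentiating the correction term used in that code). After solving the tridiagonal system, the one-sided first derivatives agree at every interior knot:

```python
a, b = 0.0, 10.0
y = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0, 5.0]
n = len(y) - 1
h = (b - a)/n
# jumps of the derivative of the piecewise linear interpolant at interior knots
jumps = [(y[j+1] - 2*y[j] + y[j-1])/h for j in range(1, n)]

# tridiagonal system (h/6)*(z[j-1] + 4 z[j] + z[j+1]) = -jump[j]
m = n - 1
A = [[0.0]*m for _ in range(m)]
rhs = [-jp for jp in jumps]
for i in range(m):
    A[i][i] = 4*h/6
    if i > 0: A[i][i-1] = h/6
    if i < m - 1: A[i][i+1] = h/6
# Gaussian elimination (no pivoting needed: diagonally dominant matrix)
for i in range(m):
    for k in range(i + 1, m):
        fac = A[k][i]/A[i][i]
        for c in range(i, m):
            A[k][c] -= fac*A[i][c]
        rhs[k] -= fac*rhs[i]
zi = [0.0]*m
for i in range(m - 1, -1, -1):
    zi[i] = (rhs[i] - sum(A[i][c]*zi[c] for c in range(i + 1, m)))/A[i][i]
z = [0.0] + zi + [0.0]   # natural boundary values at the endpoints

# one-sided derivatives of the spline at the interior knots must match
mismatch = max(
    abs(((y[j+1] - y[j])/h + h*(z[j+1] + 2*z[j])/6)      # from the right
        - ((y[j] - y[j-1])/h - h*(2*z[j] + z[j-1])/6))   # from the left
    for j in range(1, n)
)
```

The mismatch is zero up to roundoff, which is exactly what the tridiagonal system enforces.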

One can solve the system for within a for loop, but I used the Scilab solver instead. Here is the Scilab code for the most interesting part: the spline. The jumps of the derivative of the piecewise linear interpolant are obtained from the second order difference of the sequence of y-values.

a = 0; b = 10
y = [3 1 4 1 5 9 2 6 5 3 5]
n = length(y)-1
h = (b-a)/n
jumps = diff(y,2)/h
A = (h/6)*(diag(4*ones(1,n-1))+diag(ones(1,n-2),1)+diag(ones(1,n-2),-1))
z = [0,-jumps/A,0]
allx = []; spl = []
for i=1:n
xL = a+h*(i-1)
xR = a+h*i
x = linspace(xL,xR,100)
linear = y(i)*(xR-x)/h + y(i+1)*(x-xL)/h
cor = ((z(i+1)+2*z(i))*(xR-x)+(2*z(i+1)+z(i))*(x-xL)).*(xR-x).*(x-xL)/(6*h)
allx = [allx, x]
spl = [spl, linear + cor]
end
plot(allx, spl)
plot(a:h:b, y, 'r*')

A knight walks randomly on the standard chessboard. What is the proportion of time that it will spend at e4?

The answer (1/42) is not hard to get, but I’ll take the computational approach first (testing Scilab 5.5.0 beta 1, by the way). Begin by placing the knight at b1 and letting it make five random moves. This is the distribution of its position after five moves:

Unsurprisingly, half of the board is black: after an odd number of moves the knight stands on a square of the color opposite to its starting one. Actually, more than half of the board is nearly black, because the knight can hardly get to h8 in five moves. But the probability there is not zero — you can get to h8 in five moves (want to find all possible ways to do this?).

After ten moves, the colors become more uniform.

After 200 moves (or any large even number of moves), the distribution is little changed. But you may notice that it is centrally symmetric, while the previous one isn’t quite.

Let’s repeat the process beginning with two knights at b1 and g1. After five moves of each:

After ten moves:

After a large number of moves (does not matter, even or odd), the variety of colors is greatly reduced:

Indeed, this is the distribution which also describes the proportion of time that the knight (wherever it started) will spend at a given square Q.

Precisely, the proportion of time spent at Q is P(Q)=N(Q)/336 where N(Q) is the number of legal moves from Q. For the sixteen central squares P(Q) = 8/336 = 1/42, while for the corners we get 2/336 = 1/168.

Here is a quick argument to support the above. Let Q1 and Q2 be two squares such that the move from Q1 to Q2 is legal. The proportion of time that the knight makes this move is P(Q1)/N(Q1). Similarly, the time proportion of the Q2-Q1 move is P(Q2)/N(Q2). Since the process is symmetric in time (we have a reversible Markov chain), P(Q1)/N(Q1)=P(Q2)/N(Q2). In other words, the ratio P(Q)/N(Q) is the same for any two squares that are joined by a legal move; it follows that P(Q)/N(Q) is the same for all squares. Finding the coefficient of proportionality is a matter of counting, since the sum of P(Q) over all Q is 1.
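The argument is easy to verify exhaustively. A Python check (illustration only; the post's own code, in Scilab, follows):

```python
moves = [(2,1), (1,2), (-2,1), (1,-2), (2,-1), (-1,2), (-2,-1), (-1,-2)]

def legal_from(i, j):
    """Squares a knight can reach from (i, j) on an 8x8 board."""
    return [(i+di, j+dj) for di, dj in moves
            if 1 <= i+di <= 8 and 1 <= j+dj <= 8]

squares = [(i, j) for i in range(1, 9) for j in range(1, 9)]
N = {q: len(legal_from(*q)) for q in squares}
total = sum(N.values())                 # total number of legal moves: 336

# candidate stationary distribution: pi(Q) = N(Q)/336
pi = {q: N[q]/total for q in squares}
# apply one step of the random walk to pi and measure the change
step = {q: sum(pi[r]/N[r] for r in legal_from(*q)) for q in squares}
defect = max(abs(step[q] - pi[q]) for q in squares)
```

Since the knight graph is undirected, each incoming term equals $1/336$, so the distribution is exactly stationary; the central square e4 (coordinates (5,4) here) gets $8/336=1/42$.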

The Scilab code I wrote for this post is largely not knight-specific. The function update receives the initial distribution of a chess piece and the set of its legal moves. It computes the distribution after one random move.

function newState=update(state,moves)
[maxK,n]=size(moves) // number of legal move vectors
[n,n]=size(state) // board size
newState=zeros(state)
for i=1:n
for j=1:n
neighborCount=0
contribution=zeros(state)
for k=1:maxK
pos=[i,j]+moves(k,:)
if (min(pos)>=1)&(max(pos)<=n) then // the move stays on the board
neighborCount=neighborCount+1
contribution(pos(1),pos(2))=state(i,j)
end
end
newState=newState+contribution/neighborCount // spread the mass evenly over targets
end
end
endfunction

This set of moves is given in the function knight, which calls update iteratively.

function state=knight(state,iter)
moves=[2 1;1 2;-2 1;1 -2;2 -1;-1 2;-2 -1;-1 -2]
for i=1:iter
state=update(state,moves)
end
endfunction

For example, this is how I plotted the distribution after 10 random moves starting at b1:

initial = zeros(8,8)
initial(8,2)=1
state = knight(initial,10)
isoview(0,0,9,9)
Matplot(state*500)
f = gcf()
f.color_map = hotcolormap(32)

NB: getting the correct color representation of matrices from Matplot requires an up-to-date version of Scilab; in version 5.4.0 Matplot smoothens colors, producing intriguing but less informative images.

Returning to the stopping time of the process (first part here):

Here is the plot of the stopping time embellished with a few logarithmic functions.

This structure is explained by looking at the number of “$3x+1$” operations that a number experiences before reaching $1$.

No operations. Such numbers are obviously of the form $2^m$, with stopping time $m$. These create the points which lie on the curve $y=\log_2 x$.

One operation. To find such numbers, follow their orbit backward: a series of multiplications by $2$, then the inverse of the $(3x+1)/2$ operation, then more multiplications by $2$. This leads to numbers of the form $2^a(2^b-1)/3$, where $b$ must be even in order for $2^b-1$ to be divisible by $3$. The stopping time is $a+b$. Since $2^a(2^b-1)/3\approx 2^{a+b}/3$, the corresponding points lie close to the curve $y=\log_2 3x$. Also notice that unlike the preceding case, clusters appear: there may be multiple pairs $(a,b)$ with $b$ even and the same sum $a+b$. The larger the sum is, the more such pairs occur.
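The one-operation family is easy to test numerically. A quick Python check (mine, not from the post) that numbers $2^a(2^b-1)/3$ with even $b\ge 4$ have stopping time $a+b$ under the accelerated map:

```python
def stopping_time(n):
    """Steps of the accelerated map (x/2 or (3x+1)/2) until reaching 1."""
    s = 0
    while n > 1:
        n = n//2 if n % 2 == 0 else (3*n + 1)//2
        s += 1
    return s

checks = []
for a in range(5):
    for b in (4, 6, 8, 10):            # b even, so 2^b - 1 is divisible by 3
        n = 2**a * (2**b - 1)//3       # exactly one "3x+1" operation on the way down
        checks.append(stopping_time(n) == a + b)
```

(For $b=2$ the formula gives $2^a$ itself, which belongs to the zero-operation family, so the check starts at $b=4$.)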

Two operations. Tracing the orbit backwards again, we find that these are numbers of the form

$\dfrac{2^a\left(2^{b+1}(2^c-1)-3\right)}{9}$

It is straightforward to work out the conditions on $b,c$ which allow both divisions by $3$ to proceed: $c$ must be even, and a further congruence condition depends on the parity of $b$. In any event, the stopping time is $a+b+c+1$ and the number itself is approximately $2^{a+b+c+1}/9$. On the above chart, these points lie near the curve $y=\log_2 9x$. Clustering will be more prominent than in the previous case, because we now deal with triples $(a,b,c)$ that will be nearby each other if $a+b+c$ is the same.

…

$k$ operations of this kind. These yield the points near the curve $y=\log_2 x+k\log_2 3$, or, to put it another way, $y=\log_2(3^k x)$. The plot above shows such curves for several values of $k$.

It is a well-known open problem whether the following process terminates for every positive integer: while $x\ne 1$, replace $x$ with $x/2$ if $x$ is even, and with $3x+1$ if $x$ is odd.

Experiments suggest that it does, possibly with unintended side effects.

Since for any odd integer $x$ the number $3x+1$ is even, it is convenient to replace the step $x\mapsto 3x+1$ with $x\mapsto (3x+1)/2$, as shown below:

As a function of $n$, the stopping time has a nice patterned graph:

An odd integer $x$ is of the form $4k+1$ or $4k+3$. In the first case, $(3x+1)/2=6k+2$ is even, while in the second case $(3x+1)/2=6k+5$ is odd. So, if $x$ is picked randomly from all odd integers of a certain size, the probability of $(3x+1)/2$ being even is $1/2$. Similarly, for an even $x$, the number $x/2$ can be even or odd with probability $1/2$.
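A quick exhaustive Python check of this half-and-half claim (my own sanity check, not part of the post):

```python
odds = list(range(1, 20001, 2))
evens = list(range(2, 20001, 2))

# (3x+1)/2 is even exactly when the odd x is of the form 4k+1
odd_to_even = sum(1 for x in odds if ((3*x + 1)//2) % 2 == 0)
# x/2 is even exactly when the even x is divisible by 4
even_to_even = sum(1 for x in evens if (x//2) % 2 == 0)
```

In both cases exactly half of the candidates land on an even number.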

This leads to a stochastic model of the process: at every step, flip a fair coin and replace $x$ either with $x/2$ or with $(3x+1)/2$.

The graph of stopping time in the stochastic model is, of course, random. It looks nothing like the nice pattern of the deterministic process.

However, smoothing out both graphs with a moving average, we see the similarity:

The stochastic process is much easier to analyze. Focusing on the logarithm $\log x$, we see that it changes either by $-\log 2$ or by approximately $\log\frac32$, each with probability $1/2$. The expected value of the change is $\frac12\log\frac34=-\frac12\log\frac43$. This suggests that we can expect the logarithm to drop down to $0$ in about $\frac{2\log x}{\log(4/3)}$ steps. (Rigorous derivation requires more tools from probability theory, but is still routine.)
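The drift argument can be checked against a simulation of the stochastic model; a Python sketch (with an arbitrary seed, and a modest number of trials):

```python
import math
import random

random.seed(7)

def model_steps(x):
    """Steps of the stochastic model until the value drops to 1 or below."""
    s = 0
    while x > 1:
        x = x/2 if random.random() < 0.5 else (3*x + 1)/2
        s += 1
    return s

n = 10**6
trials = 2000
average = sum(model_steps(n) for _ in range(trials))/trials
predicted = 2*math.log(n)/math.log(4/3)   # roughly 96 for n = 10^6
```

The simulated average stopping time lands close to the drift-based prediction.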

The curve $y=\frac{2\log x}{\log(4/3)}$ fits the experimental data nicely. (The red curve, being randomly generated, is different from the one on the previous graph.)

For the computations, I used Scilab. The function hail(n,m) calculates the stopping times for starting values up to n, and takes a moving average with window size m (which can be set to 1 for no averaging).

function hail(n,m)
steps=zeros(1:n);
steps(1)=0
for i=2:n
k=i;
s=0;
while k>=i
if modulo(k,2)==0 then
k=k/2;
s=s+1;
else
k=(3*k+1)/2;
s=s+1;
end
end
steps(i)=s+steps(k);
end
total = cumsum(steps)
for i=1:n-m
average(i)=(total(i+m)-total(i))/m;
end
plot(average,'+');
endfunction

As soon as the result of computations drops below the starting value, the number of remaining steps is fetched from the array that is already computed. This speeds up the process a bit.

The second function follows the stochastic model, for which the aforementioned optimization is not available. This is actually an interesting point: it is conceivable that the stochastic model would be more accurate if it also used the pre-computed stopping times once the running value drops below the starting one. This would change the distribution of stopping times, resulting in wider fluctuations after averaging.

function randomhail(n,m)
rsteps=zeros(1:n);
rsteps(1)=0
for i=2:n
k=i;
s=0;
while k>1
if grand(1,1,"bin",1,1/2)==0 then
k=k/2;
s=s+1;
else
k=(3*k+1)/2;
s=s+1;
end
end
rsteps(i)=s;
end
rtotal = cumsum(rsteps)
for i=1:n-m
raverage(i)=(rtotal(i+m)-rtotal(i))/m;
end
plot(raverage,'r+');
endfunction