Consider this linear differential equation: with boundary conditions and . Nothing looks particularly scary here. Just one nonconstant coefficient, and it’s a simple one. Entering this problem into Wolfram Alpha produces the following explicit solution:

I am not sure how anyone could use this formula for any purpose.

Let us see what simple linear algebra can do here. The differential equation can be discretized by placing, for example, equally spaces interior grid points on the interval: , . The yet-unknown values of at these points are denoted . Standard finite-difference formulas provide approximate values of and :

where is the step size, in our case. Stick all this into the equation: we get 4 linear equations, one for each interior point. Namely, at it is

(notice how the condition is used above), at it is

and so on. Clean up this system and put it in matrix form:

This isn’t too hard to solve even with pencil and paper. The solution is

It can be visualized by plotting 4 points :

Not particularly impressive is it? And why are all these negative y-values in a problem with boundary condition ? They do not really look like they want to approach at the left end of the interval. But let us go ahead and plot them together with the boundary conditions, using linear interpolation in between:

Or better, use cubic spline interpolation, which only adds another step of linear algebra (see Connecting dots naturally) to our computations.

This begins to look believable. For comparison, I used a heavier tool: BVP solver from SciPy. Its output is the red curve below.

Those four points we got from a 4-by-4 system, solvable by hand, pretty much tell the whole story. At any rate, they tell a better story than the explicit solution does.

Graphics made with: SciPy and Matplotlib using Google Colab.

For a positive integer let be the remainder of the division of by . This is conveniently computed in Python as pow(2, n, n). For example, the first 20 values are returned by

Obviously, if and only if is a power of . It also looks like is always even, with powers of 2 dominating the list of its values. But does take on odd values, although this does not happen often: only 6 times in the first 100 integers.

[(n, pow(2, n, n)) for n in range(1, 101) if pow(2, n, n) % 2]

It is conjectured that the range of consists of all nonnegative integers except 1. Here is a proof that , provided by Max Alexeyev on the OEIS page for A036236

Suppose for some . Let be the smallest prime divisor of . Then . This means that the order of in the multiplicative group is a divisor of . But this group has elements, and the only divisor of that is smaller than is . Thus, the order of is , which means , which is absurd.

The fact that the smallest with is 4700063497 deserves to be mentioned here. But I would like to consider the most frequent values of instead. Experimentally, they are found like this:

from collections import Counter
freq = Counter(pow(2, n, n) for n in range(1, 100000000 + 1))
print(freq.most_common(20))

These are all powers of 2… but not exactly in the order one might expect. I repeated the experiment for the first integers with . The frequency order remained stable at the top. These are the most common 11 values, with their frequency (in %) listed for the aforementioned six ranges.

It seems that 32 and 128 are consistently less common than one might expect.

The most common value, 2, is contributed by primes and pseudoprimes to base 2. The value of 4 appears when with prime (but not only then). Still, the primes having zero density makes it difficult to explain the frequency pattern based on primality. A more convincing reason could be: when is even, we are computing and that is likely to be a power of . This boosts the frequency of the even (generally, composite) powers of 2 in the sequence.

scipy.integrate.quad is a popular method of numerical integration with Python. Let’s see how it chooses the points at which to evaluate the function being integrated. Begin with a simple example, the exponential function.

The blue dots indicate the evaluation points, their y-coordinate being the order of evaluation. So, it begins with x=0, continues with something close to -1, etc. The function, drawn in red, is not to its true vertical scale.

The placement of dots gives away the method of integration: it is the Gauss-Kronrod quadrature with 10 Gauss nodes and 21 Kronrod nodes, abbreviated (G10, K21). The Gauss nodes are included in the Kronrod nodes, and of course the function is not evaluated there again. The evaluation process is slightly out of order in that 0 is a Kronrod node but comes first, followed by 10 Gauss nodes, followed by 10 remaining Kronrod nodes. The process ends there, as the comparison of G10 and K21 results shows the necessary precision was reached.

The square root on [0, 1] is not such a nice function, so (G10, K21) does not reach the required precision at once. The interval is bisected again and again until it does.

Surely the cube root is even worse. Or is it?
The nodes are symmetric about the midpoint of the interval of integration. So for any odd function on an interval symmetric about 0 we get G10 = 0 and K21 = 0, and the process stops at once. To see the effect of the cube root singularity, one has to use a non-symmetric interval such as [-1, 2].
That’s still fewer subdivisions than for the square root: cancellation between the left and right neighborhoods of 0 still helps. Let’s look at smooth functions next.
Rapid oscillations forces subdivisions here. There are also other reasons for subdivision, such as the loss of analyticity:
Although exp(-1/x) is infinitely differentiable on this interval, the fact that it is not analytic at 0 makes it a more difficult integrand that exp(x). Finally, an analytic function with no oscillation which still needs a bunch of subintervals:
This is the standard example used to illustrate the Runge phenomenon. Although we are not interpolating here, numerical integration is also influenced by the function having a small radius of convergence of its Taylor series. After all, G10 can be thought of as degree-9 interpolation of the given function at the Gauss nodes (the zeros of a Legendre polynomial), with the formula returning the integral of the interpolating polynomial.

The code used to plot these things:

import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import quad
global eval_points
f = lambda x: np.exp(x)
def f_int(x):
eval_points.append(x)
return f(x)
eval_points = []
a, b = -1, 1
quad(f_int, a, b)
n = len(eval_points)
t = np.linspace(a, b, 1000)
y = f(t)
yy = n*(y - y.min())/(y.max() - y.min())
plt.plot(t, yy, 'r')
plt.plot(eval_points, np.arange(0, n), '.')
plt.show()

The choice of piecewise polynomials of degree 3 for interpolation is justifiably popular: even-degree splines are algebraically awkward to construct, degree 1 is simply piecewise linear interpolation (not smooth), and degree 5, while feasible, entails juggling too many coefficients. Besides, a cubic polynomial minimizes the amount of wiggling (the integral of second derivative squared) for given values and slopes at the endpoints of an interval. (Recall Connecting dots naturally.)

But the derivative of a cubic spline is a quadratic spline. And one needs the derivative to find the critical points. This results in an awkward example in SciPy documentation, annotated with “(NB: sproot only works for order 3 splines, so we fit an order 4 spline)”.

Although not implemented in SciPy, the task of computing the roots of a quadratic spline is a simple one. Obtaining the roots from the internal representation of a quadratic spline in SciPy (as a linear combination of B-splines) would take some work and reading. But a quadratic polynomial is determined by three values, so sampling it at three points, such as two consecutive knots and their average, is enough.

Quadratic formula with values instead of coefficients

Suppose we know the values of a quadratic polynomial q at -1, 0, 1, and wish to find if it has roots between -1 and 1. Let’s normalize so that q(0)=1, and let x = q(-1), y = q(1). If either x or y is negative, there is definitely a root on the interval. If they are positive, there is still a chance: we need the parabola to be concave up, have a minimum within [-1, 1], and for the minimum to be negative. All of this is easily determined once we note that the coefficients of the polynomial are a = (x+y)/2 – 1, b = (y-x)/2, and c = 1.

The inequality ensures the suitable sign of the discriminant. It describes a parabola with vertex (1, 1) and focus (2, 2), contained in the first quadrant and tangent to the axes at (4, 0) and (0, 4). Within the orange region there are no real roots.

The line x+y=2, tangent to the parabola at its vertex, separates convex and concave parabolas. While concavity in conjunction with x, y being positive definitely precludes having roots in [-1, 1], slight convexity is not much better: it results in real roots outside of the interval. Here is the complete picture: green means there is a root in [-1, 1], orange means no real roots, red covers the rest.

Back to splines

Since the derivative of a spline is implemented in SciPy (B-splines have a nice formula for derivatives), all we need is a root-finding routine for quadratic splines. Here it is, based on the above observations but using built-in NumPy polynomial solver np.roots to avoid dealing with various special cases for the coefficients.

def quadratic_spline_roots(spl):
roots = []
knots = spl.get_knots()
for a, b in zip(knots[:-1], knots[1:]):
u, v, w = spl(a), spl((a+b)/2), spl(b)
t = np.roots([u+w-2*v, w-u, 2*v])
t = t[np.isreal(t) & (np.abs(t) <= 1)]
roots.extend(t*(b-a)/2 + (b+a)/2)
return np.array(roots)

A demonstration, which plots the spline (blue), its critical points (red), and original data points (black) as follows:

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import InterpolatedUnivariateSpline
x = np.arange(7)
y = np.array([3, 1, 1, 2, 2, 4, 3])
f = InterpolatedUnivariateSpline(x, y, k=3)
crit_pts = quadratic_spline_roots(f.derivative())
t = np.linspace(x[0], x[-1], 500)
plt.plot(t, f(t))
plt.plot(x, y, 'kd')
plt.plot(crit_pts, f(crit_pts), 'ro')
plt.show()

Suppose we have real numbers and want to find the sum of all distances over . Why? Maybe because over five years ago, the gradient flow of this quantity was used for "clustering by collision" (part 1, part 2, part 3).

If I have a Python console open, the problem appears to be solved with one line:

>>> 0.5 * np.abs(np.subtract.outer(x, x)).sum()

where the outer difference of x with x creates a matrix of all differences , then absolute values are taken, and then they are all added up. Double-counted, hence the factor of 0.5.

But trying this with, say, one million numbers is not likely to work. If each number takes 8 bytes of memory (64 bits, double precision), then the array x is still pretty small (under 8 MB) but a million-by-million matrix will require over 7 terabytes, and I won’t have that kind of RAM anytime soon.

In principle, one could run a loop adding these values, or store the matrix on a hard drive. Both are going to take forever.

There is a much better way, though. First, sort the numbers in nondecreasing order; this does not require much time or memory (compared to quadratic memory cost of forming a matrix). Then consider the partial sums ; the cost of computing them is linear in time and memory. For each fixed , the sum of distances to with is simply , or, equivalently, . So, all we have to do is add these up. Still one line of code (after sorting), but a much faster one:

For example, x could be a sample from some continuous distribution. Assuming the distribution has a mean (i.e., is not too heavy tailed), the sum of all pairwise distances grows quadratically with n, and its average approaches a finite limit. For the uniform distribution on [0, 1] the computation shows this limit is 1/3. For the standard normal distribution it is 1.128… which is not as recognizable a number.

As , the average distance of a sample taken from a distribution converges to the expected value of |X-Y| where X, Y are two independent variables with that distribution. Let’s express this in terms of the probability density function and the cumulative distribution function . By symmetry, we can integrate over and double the result:

Integrate by parts in the second integral: , and the boundary terms are zero.

Integrate by parts in the other integral, throwing the derivative onto the indefinite integral and thus eliminating it. There is a boundary term this time.

Since , this simplifies nicely:

This is a lot neater than I expected: is simply the integral of . I don’t often see CDF squared, like here. Some examples: for the uniform distribution on [0,1] we get

and for the standard normal, with , it is

The trick with sorting and cumulative sums can also be used to find, for every point , the sum (or average) of distances to all other points. To do this, we don’t sum over but must also add for . The latter sum is simply where is the total sum. So, all we need is

Unfortunately, the analogous problems for vector-valued sequences are not as easy. If the Manhattan metric is used, we can do the computations for each coordinate separately, and add the results. For the Euclidean metric…

The existence of multiple versions of Discrete Cosine Transform (DCT) can be confusing. Wikipedia explains that its 8 types are determined by how one reflects across the boundaries. E.g., one can reflect 1, 2, 3 across the left boundary as 3, 2, 1, 2, 3 or as 3, 2, 1, 1, 2, 3, and there are such choices for the other boundary too (also, the other reflection can be odd or even). Makes sense enough.

But there is another aspect to the two most used forms, DCT-I and DCT-II (types I and II): they can be expressed in terms of the Trapezoidal and Midpoint rules for integration. Here is how.

The cosines , are orthogonal on the interval with respect to the Lebesgue measure. The basis of discrete transforms is that these cosines are also orthogonal with respect to some discrete measures, until certain frequency is reached. Indeed, if cosines are orthogonal with respect to measure whose support consists of points, then we can efficiently use them to represent any function defined on the support of , and those are naturally identified with sequences of length n.

How to find such measures? It helps that some simple rules of numerical integration are exact for trigonometric polynomials up to some degree.

For example, the trapezoidal rule with n sample points exactly integrates the functions for . This could be checked by converting to exponential form and summing geometric progressions, but here is a visual explanation with n=4, where the interval is represented as upper semi-circle. The radius of each red circle indicates the weight placed at that point; the endpoints get 1/2 of the weight of other sample points. To integrate correctly, we must have the x-coordinate of the center of mass equal to zero, which is obviously the case.

Replacing by means multiplying the polar angle of each sample point by . This is what we get:

In all cases the x-coordinate of the center of mass is zero. With k=6 this breaks down, as all the weight gets placed in one point. And this is how it goes in general with integration of sines and cosines: equally spaced points work perfectly until they don’t work at all, which happens when the step size is equal to the period of the function. When , the period is equal to , the spacing of points in the trapezoidal rule.

The orthogonality of cosines has to do with the formula . Let be the measure expressing the trapezoidal rule on with sample points; so it’s the sum of point masses at . Then are orthogonal with respect to because any product with taken from this range will have . Consequently, we can compute the coefficients of any function in the cosine basis as

The above is what DCT-I (discrete cosine transform of type 1) does, up to normalization.

The DCT-II transform uses the Midpoint rule instead of the Trapezoidal rule. Let be the measure expressing the Midpoint rule on with sample points; it gives equal mass to the points for . These are spaced at and therefore the midpoint rule is exact for with which is better than what the trapezoidal rule does. Perhaps more significantly, by identifying the given data points with function values at the midpoints of subintervals we stay away from the endpoints where the cosines are somewhat restricted by having to have zero slope.

Let’s compare DCT-I and DCT-II on the same data set, . There are 9 numbers here. Following DCT-I we place them at the sample points of the trapezoidal rule, and expand into cosines using the inner product with respect to . Here is the plot of the resulting trigonometric polynomial: of course it interpolates the data.

But DCT-II does it better, despite having exactly the same cosine functions. The only change is that we use and so place the -values along its support.

Less oscillation means the high-degree coefficients are smaller, and therefore easier to discard in order to compress information. For example, drop the last two coefficients in each expansion, keeping 6 numbers instead of 8. DCT-II clearly wins in accuracy then.

Okay, so the Midpoint rule is better, no surprise. After all, it’s in general about twice as accurate as the Trapezoidal rule. What about Simpson’s rule, would it lead to some super-efficient form of DCT? That is, why don’t we let be the discrete measure that expresses Simpson’s rule and use the inner product for cosine expansion? Alas, Simpson’s rule on points is exact only for with , which is substantially worse than either Trapezoidal or Midpoint rules. As a result, we don’t get enough orthogonal cosines with respect to to have an orthogonal basis. Simpson’s rule has an advantage when dealing with algebraic polynomials, not with trigonometric ones.

Finally, the Python code used for the graphics; I did not use SciPy’s DCT method (which is of course more efficient) to keep the relation to numerical integration explicit in the code. The method trapz implements the trapezoidal rule, and the midpoint rule is just the summation of sampled values. In both cases there is no need to worry about factor dx, since it cancels out when we divide one numerical integral by the other.

import numpy as np
import matplotlib.pyplot as plt
# Setup
y = np.sqrt(np.arange(9))
c = np.zeros_like(y)
n = y.size
# DCT-I, trapezoidal
x = np.arange(n)*np.pi/(n-1)
for k in range(n):
c[k] = np.trapz(y*np.cos(k*x))/np.trapz(np.cos(k*x)**2)
t = np.linspace(0, np.pi, 500)
yy = np.sum(c*np.cos(np.arange(9)*t.reshape(-1, 1)), axis=1)
plt.plot(x, y, 'ro')
plt.plot(t, yy)
plt.show()
# DCT-II, midpoint
x = np.arange(n)*np.pi/n + np.pi/(2*n)
for k in range(n):
c[k] = np.sum(y*np.cos(k*x))/np.sum(np.cos(k*x)**2)
t = np.linspace(0, np.pi, 500)
yy = np.sum(c*np.cos(np.arange(9)*t.reshape(-1, 1)), axis=1)
plt.plot(x, y, 'ro')
plt.plot(t, yy)
plt.show()

Given a sequence of numbers of length one may want to look for evidence of its periodic behavior. One way to do this is by computing autocorrelation, the correlation of the sequence with a shift of itself. Here is one reasonable way to do so: for lag values compute the correlation coefficient of with . That the lag does not exceed ensures the entire sequence participates in the computation, so we are not making a conclusion about its periodicity after comparing a handful of terms at the beginning and the end. In other words, we are not going to detect periodicity if the period is more than half of the observed time period.

Having obtained the correlation coefficients, pick one with the largest absolute value; call it R. How large does R have to be in order for us to conclude the correlation is not a fluke? The answer depends on the distribution of our data, but an experiment can be used to get some idea of likelihood of large R.

I picked independently from the standard normal distribution, and computed as above. After 5 million trials with a sequence of length 100, the distribution of R was as follows:

Based on this experiment, the probability of obtaining |R| greater than 0.5 is less than 0.0016. So, 0.5 is pretty solid evidence. The probability of is two orders of magnitude less, etc. Also, |R| is unlikely to be very close to zero unless the data is structured in some strange way. Some kind of correlation ought to be present in the white noise.

Aside: it’s not easy to construct perfectly non-autocorrelated sequences for the above test. For length 5 an example is 1,2,3,2,3. Indeed, (1,2,3,2) is uncorrelated with (2,3,2,3) and (1,2,3) is uncorrelated with (3,2,3). For length 6 and more I can’t construct these without filling them with a bunch of zeros.

Repeating the experiment with sequences of length 1000 shows a tighter distribution of R: now |R| is unlikely to be above 0.2. So, if a universal threshold is to be used here, we need to adjust R based on sequence length.

I did not look hard for statistical studies of this subject, resorting to an experiment. Experimentally obtained p-values are pretty consistent for the criterion . The number of trials was not very large (10000) so there is some fluctuation, but the pattern is clear.

Length, L

P(L^{0.45}|R| > 4)

100

0.002

300

0.0028

500

0.0022

700

0.0028

900

0.0034

1100

0.0036

1300

0.0039

1500

0.003

1700

0.003

1900

0.0042

2100

0.003

2300

0.0036

2500

0.0042

2700

0.0032

2900

0.0043

3100

0.0042

3300

0.0025

3500

0.0031

3700

0.0027

3900

0.0042

Naturally, all this depends on the assumption of independent normal variables.

And this is the approach I took to computing r in Python:

import numpy as np
n = 1000
x = np.random.normal(size=(n,))
acorr = np.correlate(x, x, mode='same')
acorr = acorr[n//2+1:]/(x.var()*np.arange(n-1, n//2, -1))
r = acorr[np.abs(acorr).argmax()]

In a comment to Recursive randomness of integers Rahul pointed out a continuous version of the same problem: pick uniformly in , then uniformly in , then uniformly in , etc. What is the distribution of the sum ?

The continuous version turns out to be easier to analyse. To begin with, it’s equivalent to picking uniformly distributed, independent and letting . Then the sum is

which can be written as

So, is a stationary point of the random process where is uniformly distributed in . Simply put, and have the same distribution. This yields the value of in a much simpler way than in the previous post:

hence .

We also get an equation for the cumulative distribution function . Indeed,

The latter probability is . Conclusion: . Differentiate to get an equation for the probability density function , namely . It’s convenient to change the variable of integration to , which leads to

Looks pretty simple, doesn’t it? Since the density is zero for negative arguments, it is constant on . This constant, which I’ll denote , is , or simply . I couldn’t get an analytic formula for . My attempt was where are the moments of . The moments can be computed recursively using , which yields

The first few moments, starting with , are 1, 1, 3/2, 17/6, 19/3, 81/5, 8351/80… Unfortunately the series diverges, so this approach seems doomed. Numerically which is not far from the Euler-Mascheroni constant, hence the choice of notation.

On the interval (1,2) we have , hence

for .

The DDE gets harder to integrate after that… on the interval the solution already involves the dilogarithm (Spence’s function):

To generate a sample from distribution S, I begin with a bunch of zeros and repeat “add 1, multiply by U[0,1]” many times. That’s it.

import numpy as np
import matplotlib.pyplot as plt
trials = 10000000
terms = 10000
x = np.zeros(shape=(trials,))
for _ in range(terms):
np.multiply(x+1, np.random.uniform(size=(trials,)), out=x)
_ = plt.hist(x, bins=1000, normed=True)
plt.show()

I still want to know the exact value of … after all, it’s also the probability that the sum of our random decreasing sequence is less than 1.

Update

The constant I called “” is in fact where is indeed Euler’s constant… This is what I learned from the Inverse Symbolic Calculator after solving the DDE (with initial value 1) numerically, and calculating the integral of the solution. From there, it did not take long to find that

The p.d.f. of S is the Dickman function, normalized to have integral 1. Wikipedia plots it on log scale, but Wolfram Mathworld has a plot identical to the above. Neither one mention the sum of random decreasing sequence.

The problem and its solution go back to 1973 paper “A probabilistic approach to differential-difference equations arising in analytic number theory” by J.-M-F. Chamayou, cited in the survey Euler’s constant: Euler’s work and modern developments by J. C. Lagarias.

Oh well. At least I practiced solving delay differential equations in Python. There is no built-in method in SciPy for that, and although there are some modules for DDE out there, I decided to roll my own. The logic is straightforward: solve the ODE on an interval of length 1, then build an interpolating spline out of the numeric solution and use it as the right hand side in the ODE, repeat. I used Romberg’s method for integrating the solution; the integration is done separately on each interval [k, k+1] because of the lack of smoothness at the integers.

import numpy as np
from scipy.integrate import odeint, romb
from scipy.interpolate import interp1d
numpoints = 2**12 + 1
solution = [lambda x: 1]
integrals = [1]
for k in range(1, 15):
y0 = solution[k-1](k)
t = np.linspace(k, k+1, numpoints)
rhs = lambda y, x: -solution[k-1](np.clip(x-1, k-1, k))/x
y = odeint(rhs, y0, t, atol=1e-15, rtol=1e-13).squeeze()
solution.append(interp1d(t, y, kind='cubic', assume_sorted=True))
integrals.append(romb(y, dx=1/(numpoints-1)))
total_integral = sum(integrals)
print("{:.15f}".format(1/total_integral))

As a byproduct, the program found the probabilities of the random sum being in each integer interval:

A 0-1 sequence can be interpreted as a point in the interval [0,1]. But this makes the long-term behavior of the sequence practically invisible due to limited resolution of our screens (and eyes). To make it visible, we can also plot the points obtained by shifting the binary sequence to the left (Bernoulli shift, which also goes by many other names). The resulting orbit is often dense in the interval, which doesn’t really help us visualize any patterns. But sometimes we get an interesting complex structure.

The vertical axis here is the time parameter, the number of dyadic shifts. The 0-1 sequence being visualized is the Kolakoski sequence in its binary form, with 0 and 1 instead of 1 and 2. By definition, the n-th run of equal digits in this sequence has length . In particular, 000 and 111 never occur, which contributes to the blank spots near 0 and 1.

Although the sequence is not periodic, the set is quite stable in time; it does not make a visible difference whether one plots the first 10,000 shifts, or 10,000,000. The apparent symmetry about 1/2 is related to the open problem of whether the Kolakoski sequence is mirror invariant, meaning that together with any finite word (such as 0010) it also contains its complement (that would be 1101).

There are infinitely many forbidden words apart from 000 and 111 (and the words containing those). For example, 01010 cannot occur because it has 3 consecutive runs of length 1, which implies having 000 elsewhere in the sequence. For the same reason, 001100 is forbidden. This goes on forever: 00100100 is forbidden because it implies having 10101, etc.

The number of distinct words of length n in the Kolakoski sequence is bounded by a power of n (see F. M. Dekking, What is the long range order in the Kolakoski sequence?). Hence, the set pictured above is covered by intervals of length , which implies it (and even its closure) is zero-dimensional in any fractal sense (has Minkowski dimension 0).

The set KC apparently does not have any isolated points; this is also an open problem, of recurrence (whether every word that appears in the sequence has to appear infinitely many times). Assuming this is so, the closure of the orbit is a totally disconnected compact set without isolated points, i.e., a Cantor set. It is not self-similar (not surprising, given it’s zero-dimensional), but its relation to the Bernoulli shift implies a structure resembling self-similarity:

Applying the transformations and yields two disjoint smaller copies that cover the original set, but with some spare parts left. The leftover bits exist because not every word in the sequence can be preceded by both 0 and 1.

Applying the transformations and yields two larger copies that cover the original set. There are no extra parts within the interval [0,1] but there is an overlap between the two copies.

The number appears several times in the structure of the set: for instance, the central gap is , the second-largest gap on the left has the left endpoint , etc. The Inverse Symbolic Calculator has not found anything about this number. Its binary expansion begins with 0.001 001 011 001 001 101 001 001 101 100… which one can recognize as the smallest binary number that can be written without doing anything three times in a row. (Can’t have 000; also can’t have 001 three times in a row; and 001 010 is not allowed because it contains 01010, three runs of length 1. Hence, the number begins with 001 001 011.) This number is obviously irrational, but other than that…

In conclusion, the Python code used to plot KC.

import numpy as np
import matplotlib.pyplot as plt
n = 1000000
a = np.zeros(n, dtype=int)
j = 0
same = False
for i in range(1, n):
if same:
a[i] = a[i-1]
same = False
else:
a[i] = 1 - a[i-1]
j += 1
same = bool(a[j])
v = np.array([1/2**k for k in range(60, 0, -1)])
b = np.convolve(a, v, mode='valid')
plt.plot(b, np.arange(np.size(b)), '.', ms=2)
plt.show()

In a 1946 paper Charles Pisot proved a theorem involving a curious constant . It can be defined as follows:

monic polynomial such that whenever

Equivalently, is determined by the requirement that the set have logarithmic capacity 1; this won’t be used here. The theorem is stated below, although this post is really about the constant.

Theorem: If an entire function takes integer values at nonnegative integers and is for some , then it is a finite linear combination of terms of the form , where each is an algebraic integer.

The value of is best possible; thus, in some sense Pisot’s theorem completed a line of investigation that began with a 1915 theorem by Pólya which had in place of , and where the conclusion was that is a polynomial. (Informally speaking, Pólya proved that is the “smallest” entire-function that is integer-valued on nonnegative integers.)

Although the constant was mentioned in later literature (here, here, and here), no further digits of it have been stated anywhere, as far as I know. So, let it be known that the decimal expansion of begins with 0.84383.

A lower bound on can be obtained by constructing a monic polynomial that is bounded by 1 on the set . Here is E(0.843):

It looks pretty round, except for that flat part on the left. In fact, E(0.82) is covered by a disk of unit radius centered at 1.3, which means that the choice shows .

How to get an upper bound on ? Turns out, it suffices to exhibit a monic polynomial that has all zeros in and satisfies on the boundary of . The existence of such shows . Indeed, suppose that is monic and on . Consider the function . By construction on the boundary of . Also, is subharmonic in its complement, including , where the singularities of both logarithms cancel out, leaving . This contradicts the maximum principle for subharmonic functions, according to which cannot exceed the maximum of on the boundary.

The choice of works for .

So we have boxed between 0.82 and 0.89; how to get more precise bounds? I don’t know how Pisot achieved the precision of 0.843… it’s possible that he strategically picked some linear and quadratic factors, raised them to variable integer powers and optimized the latter. Today it is too tempting to throw some optimization routine on the problem and let it run for a while.

But what to optimize? The straightforward approach is to minimize the maximum of on the circle , approximated by sampling the function at a sufficiently fine uniform grid and picking the maximal value. This works… unspectacularly. One problem is that the objective function is non-differentiable. Another is that taking maximum throws out a lot of information: we are not using the values at other sample points to better direct the search. After running optimization for days, trying different optimization methods, tolerance options, degrees of the polynomial, and starting values, I was not happy with the results…

Turns out, the optimization is much more effective if one minimizes the variance of the set . Now we are minimizing a polynomial function of , which pushes them toward having the same absolute value — the behavior that we want the polynomial to have. It took from seconds to minutes to produce the polynomials shown below, using BFGS method as implemented in SciPy.

As the arguments for optimization function I took the real and imaginary parts of the zeros of the polynomial. The symmetry about the real axis was enforced automatically: the polynomial was the product of quadratic terms . This eliminated the potentially useful option of having real zeros of odd order, but I did not feel like special-casing those.

Again, nearly the same polynomial works for upper and lower bounds. The fact that the absolute value of each of these polynomials is below 1 (for lower bounds) or greater than 1 (for upper bounds) can be ascertained by sampling them and using an upper estimate on the derivative; there is enough margin to trust computations with double precision.

Finally, the Python script I used. The function “obj” is getting minimized while function “values” returns the actual values of interest: the minimum and maximum of polynomial. The degree of polynomial is 2n, and the radius under consideration is r. The sample points are collected in array s. To begin with, the roots are chosen randomly. After minimization runs (inevitably, ending in a local minimum of which there are myriads), the new starting point is obtained by randomly perturbing the local minimum found. (The perturbation is smaller if minimization was particularly successful.)

import numpy as np
from scipy.optimize import minimize
def obj(r):
rc = np.concatenate((r[:n]+1j*r[n:], r[:n]-1j*r[n:])).reshape(-1,1)
p = np.prod(np.abs(s-rc)**2, axis=0)
return np.var(p)
def values(r):
rc = np.concatenate((r[:n]+1j*r[n:], r[:n]-1j*r[n:])).reshape(-1,1)
p = np.prod(np.abs(s-rc), axis=0)
return [np.min(p), np.max(p)]
r = 0.84384
n = 10
record = 2
s = np.exp(r * np.exp(1j*np.arange(0, np.pi, 0.01)))
xr = np.random.uniform(0.8, 1.8, size=(n,))
xi = np.random.uniform(0, 0.7, size=(n,))
x0 = np.concatenate((xr, xi))
while True:
res = minimize(obj, x0, method = 'BFGS')
if res['fun'] < record:
record = res['fun']
print(repr(res['x']))
print(values(res['x']))
x0 = res['x'] + np.random.uniform(-0.001, 0.001, size=x0.shape)
else:
x0 = res['x'] + np.random.uniform(-0.05, 0.05, size=x0.shape)