The sum of pairwise distances and the square of CDF

Suppose we have {n} real numbers {x_0,\dots, x_{n-1}} and want to find the sum of all distances {|x_j-x_k|} over {j < k}. Why? Maybe because over five years ago, the gradient flow of this quantity was used for "clustering by collision" (part 1, part 2, part 3).

If I have a Python console open, the problem appears to be solved with one line:

>>> 0.5 * np.abs(np.subtract.outer(x, x)).sum()

where the outer difference of x with x creates a matrix of all differences {x_i-x_j}, then absolute values are taken, and then they are all added up. Double-counted, hence the factor of 0.5.

But trying this with, say, one million numbers is not likely to work. If each number takes 8 bytes of memory (64 bits, double precision), then the array x is still pretty small (under 8 MB) but a million-by-million matrix will require over 7 terabytes, and I won’t have that kind of RAM anytime soon.

In principle, one could run a loop adding these values, or store the matrix on a hard drive. Both are going to take forever.

There is a much better way, though. First, sort the numbers in nondecreasing order; this does not require much time or memory (compared to quadratic memory cost of forming a matrix). Then consider the partial sums {s_k = x_0+\dots+x_k}; the cost of computing them is linear in time and memory. For each fixed {k}, the sum of distances to {x_j} with {j<k} is simply {kx_k - s_{k-1}}, or, equivalently, {(k+1)x_k - s_k}. So, all we have to do is add these up. Still one line of code (after sorting), but a much faster one:

>>> x.sort()
>>> (np.arange(1, n+1)*x - np.cumsum(x)).sum()

For example, x could be a sample from some continuous distribution. Assuming the distribution has a mean (i.e., is not too heavy tailed), the sum of all pairwise distances grows quadratically with n, and its average approaches a finite limit. For the uniform distribution on [0, 1] the computation shows this limit is 1/3. For the standard normal distribution it is 1.128… which is not as recognizable a number.


As {n\to \infty}, the average distance of a sample taken from a distribution converges to the expected value of |X-Y| where X, Y are two independent variables with that distribution. Let’s express this in terms of the probability density function {p} and the cumulative distribution function {\Phi}. By symmetry, we can integrate over {x> y} and double the result:

{\displaystyle \frac12 E|X-Y| = \int_{-\infty}^\infty p(x)\,dx \int_{-\infty}^x (x-y) p(y)\,dy}

Integrate by parts in the second integral: {p(y) = \Phi'(y)}, and the boundary terms are zero.

{\displaystyle \frac12 E|X-Y| = \int_{-\infty}^\infty p(x)\,dx \int_{-\infty}^x \Phi(y)\,dy}

Integrate by parts in the other integral, throwing the derivative onto the indefinite integral and thus eliminating it. There is a boundary term this time.

{\displaystyle \frac12 E|X-Y| = \Phi(\infty) \int_{-\infty}^\infty \Phi(y)\,dy - \int_{-\infty}^\infty \Phi(x)^2\,dx}

Since {\Phi(\infty) = 1}, this simplifies nicely:

{\displaystyle \frac12 E|X-Y| = \int_{-\infty}^\infty \Phi(x) (1-\Phi(x))\,dx}

This is a lot neater than I expected: {E|X-Y|} is simply the integral of {2\Phi(1-\Phi)}. I don’t often see CDF squared, like here. Some examples: for the uniform distribution on [0,1] we get

{\displaystyle E|X-Y| = \int_0^1 2x(1-x)\,dx = \frac13}

and for the standard normal, with {\Phi(x) = (1+\mathrm{erf}\,(x/\sqrt{2}))/2}, it is

{\displaystyle \int_{-\infty}^\infty \frac12 \left(1-\mathrm{erf}\,(x/\sqrt{2}) ^2 \right)\,dx = \frac{2}{\sqrt{\pi}}\approx 1.12838\ldots }


The trick with sorting and cumulative sums can also be used to find, for every point {x_k}, the sum (or average) of distances to all other points. To do this, we don’t sum over {k} but must also add {|x_j-x_k|} for {j>k}. The latter sum is simply {S - s_k - (n-k-1)x_k} where {S} is the total sum. So, all we need is

>>> (2*np.arange(1,n+1)-n)*x - 2*np.cumsum(x) + x.sum()

Unfortunately, the analogous problems for vector-valued sequences are not as easy. If the Manhattan metric is used, we can do the computations for each coordinate separately, and add the results. For the Euclidean metric…

Moderatorship connections between Stack Exchange sites

Consider the graph in which the vertices are Stack Exchange sites, connected if two sites share a moderator. What does this graph look like? As being a moderator correlates with some level of expertise or at least interest in the topic, we can expect the structure of the graph to highlight topical connections between the sites.

Unsurprisingly, there is a dominant component (55 sites) centered at Stack Overflow. (Click the image for the larger version.)

SO_component
Stack Overflow component

Its radius is 6: every site can be joined to Stack Overflow by a path of length at most 6. The five sites at distance 6 from SO are Lifehacks, Martial Arts, Physical Fitness, Space Exploration, and Travel.

By the triangle inequality, the diameter of the SO component is at most 12. In fact it is exactly 12: it takes 12 steps to go from Space to either Fitness, Lifehacks, or Martial Arts.

The center of the SO component is not unique: Web Applications is another site from which every other one can be reached in 6 steps. And it has an advantage over SO in that only four sites in the component are at distance 6 from Web Apps. Naturally, these are the four sites that realize the diameter 12 of the component.

That said, Stack Overflow is the vertex of highest degree (8), meaning that Stack Overflow moderators also take part in moderating eight other sites.

What about other components? Excluding isolated vertices, we have the following picture of small components.

other_components
Other nontrivial components

The largest of these is the “language component”: English Language & Usage, English Language Learners, Japanese Language, Portuguese Language, and less logically, Sustainable Living. Other notable components are Unix (with bioinformatics thrown in) and Mathematics.


Source of the information: on the page listing moderators grouped by users I executed a JavaScript one-liner

JSON.stringify(Array.from(document.querySelectorAll('.mods-summary-list')).filter(e=>e.children.length>1).map(e=>Array.from(e.children).map(x=>x.hostname.split('.')[0])))

which gave a list of connections. The rest was done in Python:

import networkx as nx
from itertools import chain, combinations
connections = # copied JS output
edges = list(chain.from_iterable(combinations(c, 2) for c in connections))
G = nx.Graph()
G.add_edges_from(edges)
components = sorted(list(nx.connected_components(G)), key=len)
main_component = G.subgraph(components[-1])
pos = nx.kamada_kawai_layout(main_component)
nx.draw_networkx(main_component, pos=pos, node_size=100)

I used different layout algorithms for the dominant component and for the rest. The default “spring” layout makes the SO component too crowded, but is okay for small components:

other_components = G.subgraph(chain.from_iterable(components[:-1]))
pos = nx.spring_layout(other_components, k=0.25) 
nx.draw_networkx(other_components, pos=pos, node_size=100)

The rest was done by various functions of the NetworkX library such as

nx.degree(main_component)
nx.center(main_component)
nx.diameter(main_component)
nx.periphery(main_component)
nx.radius(main_component)
nx.shortest_path_length(main_component, source="stackoverflow")
nx.shortest_path_length(main_component, source="webapps")

Discrete Cosine Transforms: Trapezoidal vs Midpoint

The existence of multiple versions of Discrete Cosine Transform (DCT) can be confusing. Wikipedia explains that its 8 types are determined by how one reflects across the boundaries. E.g., one can reflect 1, 2, 3 across the left boundary as 3, 2, 1, 2, 3 or as 3, 2, 1, 1, 2, 3, and there are such choices for the other boundary too (also, the other reflection can be odd or even). Makes sense enough.

But there is another aspect to the two most used forms, DCT-I and DCT-II (types I and II): they can be expressed in terms of the Trapezoidal and Midpoint rules for integration. Here is how.

The cosines {\cos kx}, {k=0,1,\dots} are orthogonal on the interval {[0,\pi]} with respect to the Lebesgue measure. The basis of discrete transforms is that these cosines are also orthogonal with respect to some discrete measures, until certain frequency is reached. Indeed, if cosines are orthogonal with respect to measure {\mu} whose support consists of {n} points, then we can efficiently use them to represent any function defined on the support of {\mu}, and those are naturally identified with sequences of length n.

How to find such measures? It helps that some simple rules of numerical integration are exact for trigonometric polynomials up to some degree.

For example, the trapezoidal rule with n sample points exactly integrates the functions {\cos kx} for {k=0,\dots, 2n-3}. This could be checked by converting to exponential form and summing geometric progressions, but here is a visual explanation with n=4, where the interval {[0,\pi]} is represented as upper semi-circle. The radius of each red circle indicates the weight placed at that point; the endpoints get 1/2 of the weight of other sample points. To integrate {\cos x} correctly, we must have the x-coordinate of the center of mass equal to zero, which is obviously the case.

t1
Trapezoidal rule with 4 points

Replacing {\cos x} by {\cos kx} means multiplying the polar angle of each sample point by {k}. This is what we get:

t2
Argument of cosine multiplied by k=2 or k=4
t3
Argument multiplied by k=3
t5
Argument multiplied by k=5

In all cases the x-coordinate of the center of mass is zero. With k=6 this breaks down, as all the weight gets placed in one point. And this is how it goes in general with integration of sines and cosines: equally spaced points work perfectly until they don’t work at all, which happens when the step size is equal to the period of the function. When {k=2n-2}, the period {2\pi/k} is equal to {\pi/(n-1)}, the spacing of points in the trapezoidal rule.

The orthogonality of cosines has to do with the formula {\cos kx \cos j x = \frac12(\cos (k-j)x) + \frac12(\cos (k+j)x)}. Let {\tau_n} be the measure expressing the trapezoidal rule on {[0,\pi]} with {n} sample points; so it’s the sum of point masses at {0, \pi/(n-1), \dots , \pi}. Then {\{\cos k x \colon k=0,\dots, n-1\}} are orthogonal with respect to {\tau_n} because any product {\cos k x\cos j x } with {k >j} taken from this range will have {k-j, k+j \le 2n-3}. Consequently, we can compute the coefficients of any function {f} in the cosine basis as

{\displaystyle c_k = \int f(x)\cos kx\,d\tau_n(x) \bigg/ \int \cos^2 kx\,d\tau_n(x)}

The above is what DCT-I (discrete cosine transform of type 1) does, up to normalization.

The DCT-II transform uses the Midpoint rule instead of the Trapezoidal rule. Let {\mu_n} be the measure expressing the Midpoint rule on {[0,\pi]} with {n} sample points; it gives equal mass to the points {(2j-1)\pi/(2n)} for {j=1,\dots, n}. These are spaced at {\pi/n} and therefore the midpoint rule is exact for {\cos kx} with {k=0,\dots, 2n-1} which is better than what the trapezoidal rule does. Perhaps more significantly, by identifying the given data points with function values at the midpoints of subintervals we stay away from the endpoints {0,\pi} where the cosines are somewhat restricted by having to have zero slope.

Let’s compare DCT-I and DCT-II on the same data set, {y=(0, \sqrt{1}, \sqrt{2}, \dots, \sqrt{8})}. There are 9 numbers here. Following DCT-I we place them at the sample points of the trapezoidal rule, and expand into cosines using the inner product with respect to {\tau_n}. Here is the plot of the resulting trigonometric polynomial: of course it interpolates the data.

trapezoid
DCT-I interpolation

But DCT-II does it better, despite having exactly the same cosine functions. The only change is that we use {\mu_n} and so place the {y}-values along its support.

midpoint
DCT-II interpolation

Less oscillation means the high-degree coefficients are smaller, and therefore easier to discard in order to compress information. For example, drop the last two coefficients in each expansion, keeping 6 numbers instead of 8. DCT-II clearly wins in accuracy then.

trap-truncated
Truncated DCT-I
midpoint-truncated
Truncated DCT-II

Okay, so the Midpoint rule is better, no surprise. After all, it’s in general about twice as accurate as the Trapezoidal rule. What about Simpson’s rule, would it lead to some super-efficient form of DCT? That is, why don’t we let {\sigma_n} be the discrete measure that expresses Simpson’s rule and use the inner product {\int fg\,d\sigma_n} for cosine expansion? Alas, Simpson’s rule on {n} points is exact only for {\cos kx} with {k=0,\dots, n-2}, which is substantially worse than either Trapezoidal or Midpoint rules. As a result, we don’t get enough orthogonal cosines with respect to {\sigma_n} to have an orthogonal basis. Simpson’s rule has an advantage when dealing with algebraic polynomials, not with trigonometric ones.

Finally, the Python code used for the graphics; I did not use SciPy’s DCT method (which is of course more efficient) to keep the relation to numerical integration explicit in the code. The method trapz implements the trapezoidal rule, and the midpoint rule is just the summation of sampled values. In both cases there is no need to worry about factor dx, since it cancels out when we divide one numerical integral by the other.

import numpy as np
import matplotlib.pyplot as plt
#   Setup 
y = np.sqrt(np.arange(9))
c = np.zeros_like(y)
n = y.size
#   DCT-I, trapezoidal
x = np.arange(n)*np.pi/(n-1)
for k in range(n):
  c[k] = np.trapz(y*np.cos(k*x))/np.trapz(np.cos(k*x)**2)
t = np.linspace(0, np.pi, 500)
yy = np.sum(c*np.cos(np.arange(9)*t.reshape(-1, 1)), axis=1)
plt.plot(x, y, 'ro')
plt.plot(t, yy)
plt.show()
#   DCT-II, midpoint
x = np.arange(n)*np.pi/n + np.pi/(2*n)
for k in range(n):
  c[k] = np.sum(y*np.cos(k*x))/np.sum(np.cos(k*x)**2)
t = np.linspace(0, np.pi, 500)
yy = np.sum(c*np.cos(np.arange(9)*t.reshape(-1, 1)), axis=1)
plt.plot(x, y, 'ro')
plt.plot(t, yy)
plt.show()

Experiments with the significance of autocorrelation

Given a sequence of numbers {x_j} of length {L} one may want to look for evidence of its periodic behavior. One way to do this is by computing autocorrelation, the correlation of the sequence with a shift of itself. Here is one reasonable way to do so: for lag values {\ell=1,\dots, \lfloor L/2 \rfloor} compute the correlation coefficient of {(x_1,\dots, x_{L-\ell}} with {(x_{\ell+1},\dots, x_L)}. That the lag does not exceed {L/2} ensures the entire sequence participates in the computation, so we are not making a conclusion about its periodicity after comparing a handful of terms at the beginning and the end. In other words, we are not going to detect periodicity if the period is more than half of the observed time period.

Having obtained the correlation coefficients, pick one with the largest absolute value; call it R. How large does R have to be in order for us to conclude the correlation is not a fluke? The answer depends on the distribution of our data, but an experiment can be used to get some idea of likelihood of large R.

I picked {x_j} independently from the standard normal distribution, and computed {r} as above. After 5 million trials with a sequence of length 100, the distribution of R was as follows:

ac100-5M
Extremal correlation coefficient in a sequence of length 100

Based on this experiment, the probability of obtaining |R| greater than 0.5 is less than 0.0016. So, 0.5 is pretty solid evidence. The probability of {|R| > 0.6} is two orders of magnitude less, etc. Also, |R| is unlikely to be very close to zero unless the data is structured in some strange way. Some kind of correlation ought to be present in the white noise.

Aside: it’s not easy to construct perfectly non-autocorrelated sequences for the above test. For length 5 an example is 1,2,3,2,3. Indeed, (1,2,3,2) is uncorrelated with (2,3,2,3) and (1,2,3) is uncorrelated with (3,2,3). For length 6 and more I can’t construct these without filling them with a bunch of zeros.

Repeating the experiment with sequences of length 1000 shows a tighter distribution of R: now |R| is unlikely to be above 0.2. So, if a universal threshold is to be used here, we need to adjust R based on sequence length.

ac1000-5M
Extremal correlation coefficient in a sequence of length 1000

I did not look hard for statistical studies of this subject, resorting to an experiment. Experimentally obtained p-values are pretty consistent for the criterion {L^{0.45}|R| > 4}. The number of trials was not very large (10000) so there is some fluctuation, but the pattern is clear.
 

Length, L P(L0.45|R| > 4)
100 0.002
300 0.0028
500 0.0022
700 0.0028
900 0.0034
1100 0.0036
1300 0.0039
1500 0.003
1700 0.003
1900 0.0042
2100 0.003
2300 0.0036
2500 0.0042
2700 0.0032
2900 0.0043
3100 0.0042
3300 0.0025
3500 0.0031
3700 0.0027
3900 0.0042

Naturally, all this depends on the assumption of independent normal variables.

And this is the approach I took to computing r in Python:

import numpy as np
n = 1000
x = np.random.normal(size=(n,))
acorr = np.correlate(x, x, mode='same')
acorr = acorr[n//2+1:]/(x.var()*np.arange(n-1, n//2, -1))
r = acorr[np.abs(acorr).argmax()]

Top 10 xkcd comics according to Stack Overflow

Sorted according to the number of Stack Overflow posts (as of now) in which the comic is linked. The posts themselves can be seen by clicking the “posts” link.

#10: Standards (22 posts)

#9: Compiling (29 posts)

#8: Python (33 posts)

#7: Regular Expressions (36 posts)

#6: ISO 8601 (50 posts)

#5: Password Strength (60 posts)

#4: Wisdom of the Ancients (63 posts)

#3: goto (64 posts)

#2: Random Number (74 posts)

#1: Exploits of a Mom (680 posts)

Recursive randomness of reals: summing a random decreasing sequence

In a comment to Recursive randomness of integers Rahul pointed out a continuous version of the same problem: pick {x_0} uniformly in {[0,1]}, then {x_1} uniformly in {[0,x_0]}, then {x_2} uniformly in {[0, x_1]}, etc. What is the distribution of the sum {S=\sum_{n=0}^\infty x_n}?

The continuous version turns out to be easier to analyse. To begin with, it’s equivalent to picking uniformly distributed, independent {y_n\in [0,1]} and letting {x_n = y_0y_1\cdots y_n}. Then the sum is

{\displaystyle y_0+y_0y_1 + y_0y_1y_2 + \cdots }

which can be written as

{\displaystyle y_0(1+y_1(1+y_2(1+\cdots )))}

So, {S} is a stationary point of the random process {X\mapsto (X+1)U} where {U} is uniformly distributed in {[0,1]}. Simply put, {S} and {(S+1)U} have the same distribution. This yields the value of {E[S]} in a much simpler way than in the previous post:

{E[S]=E[(S+1)U] = (E[S] + 1) E[U] = (E[S] + 1)/2}

hence {E[S]=1}.

We also get an equation for the cumulative distribution function {C(t) = P[S\le t]}. Indeed,

{\displaystyle P[S\le t] = P[(S+1)U \le t] = P[S \le t/U-1]}

The latter probability is {\int_0^1 P[S\le t/u-1]\,du = \int_0^1 C(t/u-1)\,du}. Conclusion: {C(t) = \int_0^1 C(t/u-1)\,du}. Differentiate to get an equation for the probability density function {p(t)}, namely {p(t) = \int_0^1 p(t/u-1)\,du/u}. It’s convenient to change the variable of integration to {v = t/u-1}, which leads to

{\displaystyle p(t) = \int_{t-1}^\infty p(v)\,\frac{dv}{v+1}}

Another differentiation turns the integral equation into a delay differential equation,

{\displaystyle p'(t) = - \frac{p(t-1)}{t}}

Looks pretty simple, doesn’t it? Since the density is zero for negative arguments, it is constant on {[0,1]}. This constant, which I’ll denote {\gamma}, is {\int_{0}^\infty p(v)\,\frac{dv}{v+1}}, or simply {E[1/(S+1)]}. I couldn’t get an analytic formula for {\gamma}. My attempt was {E[1/(S+1)] = \sum_{n=0}^\infty (-1)^n M_n} where {M_n=E[S^n]} are the moments of {S}. The moments can be computed recursively using {E[S^n] = E[(S+1)^n]E[U^n]}, which yields

{\displaystyle M_n=\frac{1}{n} \sum_{k=0}^{n-1} \binom{n}{k}M_k}

The first few moments, starting with {M_0}, are 1, 1, 3/2, 17/6, 19/3, 81/5, 8351/80… Unfortunately the series {\sum_{n=0}^\infty (-1)^n M_n} diverges, so this approach seems doomed. Numerically {\gamma \approx 0.5614} which is not far from the Euler-Mascheroni constant, hence the choice of notation.

On the interval (1,2) we have {p'(t) = -\gamma/t}, hence

{p(t) = \gamma(1-\log t)} for {1 \le t \le 2}.

The DDE gets harder to integrate after that… on the interval {[2,3]} the solution already involves the dilogarithm (Spence’s function):

{\displaystyle p(t) = \gamma(1+\pi^2/12 - \log t + \log(t-1)\log t + \mathrm{Spence}\,(t))}

following SciPy’s convention for Spence. This is as far as I went… but here is an experimental confirmation of the formulas obtained so far (link to full size image).

match03
Histogram of 10 million samples and the pdf (red) plotted on [0,3]

To generate a sample from distribution S, I begin with a bunch of zeros and repeat “add 1, multiply by U[0,1]” many times. That’s it.

import numpy as np
import matplotlib.pyplot as plt
trials = 10000000
terms = 10000
x = np.zeros(shape=(trials,))
for _ in range(terms):
    np.multiply(x+1, np.random.uniform(size=(trials,)), out=x)
_ = plt.hist(x, bins=1000, normed=True)
plt.show()

I still want to know the exact value of {\gamma}… after all, it’s also the probability that the sum of our random decreasing sequence is less than 1.

Update

The constant I called “{\gamma}” is in fact {\exp(-\gamma)} where {\gamma} is indeed Euler’s constant… This is what I learned from the Inverse Symbolic Calculator after solving the DDE (with initial value 1) numerically, and calculating the integral of the solution. From there, it did not take long to find that

Oh well. At least I practiced solving delay differential equations in Python. There is no built-in method in SciPy for that, and although there are some modules for DDE out there, I decided to roll my own. The logic is straightforward: solve the ODE on an interval of length 1, then build an interpolating spline out of the numeric solution and use it as the right hand side in the ODE, repeat. I used Romberg’s method for integrating the solution; the integration is done separately on each interval [k, k+1] because of the lack of smoothness at the integers.

import numpy as np
from scipy.integrate import odeint, romb
from scipy.interpolate import interp1d
numpoints = 2**12 + 1
solution = [lambda x: 1]
integrals = [1]
for k in range(1, 15):
    y0 = solution[k-1](k)
    t = np.linspace(k, k+1, numpoints)
    rhs = lambda y, x: -solution[k-1](np.clip(x-1, k-1, k))/x
    y = odeint(rhs, y0, t, atol=1e-15, rtol=1e-13).squeeze()
    solution.append(interp1d(t, y, kind='cubic', assume_sorted=True))
    integrals.append(romb(y, dx=1/(numpoints-1)))
total_integral = sum(integrals)
print("{:.15f}".format(1/total_integral))

As a byproduct, the program found the probabilities of the random sum being in each integer interval:

  • 56.15% in [0,1]
  • 34.46% in [1,2]
  • 8.19% in [2,3]
  • 1.1% in [3,4]
  • 0.1% in [4,5]
  • less than 0.01% chance of being greater than 5

The Kolakoski-Cantor set

A 0-1 sequence can be interpreted as a point in the interval [0,1]. But this makes the long-term behavior of the sequence practically invisible due to limited resolution of our screens (and eyes). To make it visible, we can also plot the points obtained by shifting the binary sequence to the left (Bernoulli shift, which also goes by many other names). The resulting orbit  is often dense in the interval, which doesn’t really help us visualize any patterns. But sometimes we get an interesting complex structure.

kol_set_small
The Kolakoski-Cantor set, KC

The vertical axis here is the time parameter, the number of dyadic shifts. The 0-1 sequence being visualized is the Kolakoski sequence in its binary form, with 0 and 1 instead of 1 and 2. By definition, the n-th run of equal digits in this sequence has length {x_n+1}. In particular, 000 and 111 never occur, which contributes to the blank spots near 0 and 1.

Although the sequence is not periodic, the set is quite stable in time; it does not make a visible difference whether one plots the first 10,000 shifts, or 10,000,000. The apparent symmetry about 1/2 is related to the open problem of whether the Kolakoski sequence is mirror invariant, meaning that together with any finite word (such as 0010) it also contains its complement (that would be 1101).

There are infinitely many forbidden words apart from 000 and 111 (and the words containing those). For example, 01010 cannot occur because it has 3 consecutive runs of length 1, which implies having 000 elsewhere in the sequence. For the same reason, 001100 is forbidden. This goes on forever: 00100100 is forbidden because it implies having 10101, etc.

The number of distinct words of length n in the Kolakoski sequence is bounded by a power of n (see F. M. Dekking, What is the long range order in the Kolakoski sequence?). Hence, the set pictured above is covered by {O(n^p)} intervals of length {2^{-n}}, which implies it (and even its closure) is zero-dimensional in any fractal sense (has Minkowski dimension 0).

The set KC apparently does not have any isolated points; this is also an open problem, of recurrence (whether every word that appears in the sequence has to appear infinitely many times). Assuming this is so, the closure of the orbit is a totally disconnected compact set without isolated points, i.e., a Cantor set. It is not self-similar (not surprising, given it’s zero-dimensional), but its relation to the Bernoulli shift implies a structure resembling self-similarity:

kolhalf
KC is covered by two copies scaled by 1/2

Applying the transformations {x\mapsto x/2} and {x\mapsto (1+x)/2} yields two disjoint smaller copies that cover the original set, but with some spare parts left. The leftover bits exist because not every word in the sequence can be preceded by both 0 and 1.

kolx2
KC covered by two copies scaled by 2

Applying the transformations {x\mapsto 2x} and {x\mapsto 2x-1} yields two larger copies that cover the original set. There are no extra parts within the interval [0,1] but there is an overlap between the two copies.

The number {c = \inf KC\approx 0.146778684766479} appears several times in the structure of the set: for instance, the central gap is {((1-c)/2, (1+c)/2)}, the second-largest gap on the left has the left endpoint {(1-c)/4}, etc. The Inverse Symbolic Calculator has not found anything about this number. Its binary expansion begins with 0.001 001 011 001 001 101 001 001 101 100… which one can recognize as the smallest binary number that can be written without doing anything three times in a row. (Can’t have 000; also can’t have 001 three times in a row; and 001 010 is not allowed because it contains 01010, three runs of length 1. Hence, the number begins with 001 001 011.) This number is obviously irrational, but other than that…

In conclusion, the Python code used to plot KC.

import numpy as np
import matplotlib.pyplot as plt
n = 1000000
a = np.zeros(n, dtype=int)
j = 0                  
same = False  
for i in range(1, n):
    if same:
        a[i] = a[i-1]    
        same = False
    else:
        a[i] = 1 - a[i-1]
        j += 1            
        same = bool(a[j])
v = np.array([1/2**k for k in range(60, 0, -1)])
b = np.convolve(a, v, mode='valid')
plt.plot(b, np.arange(np.size(b)), '.', ms=2)
plt.show()