The Kolakoski-Cantor set

A 0-1 sequence can be interpreted as a point in the interval [0,1]. But this makes the long-term behavior of the sequence practically invisible due to limited resolution of our screens (and eyes). To make it visible, we can also plot the points obtained by shifting the binary sequence to the left (Bernoulli shift, which also goes by many other names). The resulting orbit  is often dense in the interval, which doesn’t really help us visualize any patterns. But sometimes we get an interesting complex structure.

The Kolakoski-Cantor set, KC

The vertical axis here is the time parameter, the number of dyadic shifts. The 0-1 sequence being visualized is the Kolakoski sequence in its binary form, with 0 and 1 instead of 1 and 2. By definition, the n-th run of equal digits in this sequence has length {x_n+1}. In particular, 000 and 111 never occur, which contributes to the blank spots near 0 and 1.

Although the sequence is not periodic, the set is quite stable in time; it does not make a visible difference whether one plots the first 10,000 shifts, or 10,000,000. The apparent symmetry about 1/2 is related to the open problem of whether the Kolakoski sequence is mirror invariant, meaning that together with any finite word (such as 0010) it also contains its complement (that would be 1101).

There are infinitely many forbidden words apart from 000 and 111 (and the words containing those). For example, 01010 cannot occur because it has 3 consecutive runs of length 1, which implies having 000 elsewhere in the sequence. For the same reason, 001100 is forbidden. This goes on forever: 00100100 is forbidden because it implies having 10101, etc.

The number of distinct words of length n in the Kolakoski sequence is bounded by a power of n (see F. M. Dekking, What is the long range order in the Kolakoski sequence?). Hence, the set pictured above is covered by {O(n^p)} intervals of length {2^{-n}}, which implies it (and even its closure) is zero-dimensional in any fractal sense (has Minkowski dimension 0).

The set KC apparently does not have any isolated points; this is also an open problem, of recurrence (whether every word that appears in the sequence has to appear infinitely many times). Assuming this is so, the closure of the orbit is a totally disconnected compact set without isolated points, i.e., a Cantor set. It is not self-similar (not surprising, given it’s zero-dimensional), but its relation to the Bernoulli shift implies a structure resembling self-similarity:

KC is covered by two copies scaled by 1/2

Applying the transformations {x\mapsto x/2} and {x\mapsto (1+x)/2} yields two disjoint smaller copies that cover the original set, but with some spare parts left. The leftover bits exist because not every word in the sequence can be preceded by both 0 and 1.

KC covered by two copies scaled by 2

Applying the transformations {x\mapsto 2x} and {x\mapsto 2x-1} yields two larger copies that cover the original set. There are no extra parts within the interval [0,1] but there is an overlap between the two copies.

The number {c = \inf KC\approx 0.146778684766479} appears several times in the structure of the set: for instance, the central gap is {((1-c)/2, (1+c)/2)}, the second-largest gap on the left has the left endpoint {(1-c)/4}, etc. The Inverse Symbolic Calculator has not found anything about this number. Its binary expansion begins with 0.001 001 011 001 001 101 001 001 101 100… which one can recognize as the smallest binary number that can be written without doing anything three times in a row. (Can’t have 000; also can’t have 001 three times in a row; and 001 010 is not allowed because it contains 01010, three runs of length 1. Hence, the number begins with 001 001 011.) This number is obviously irrational, but other than that…

In conclusion, the Python code used to plot KC.

import numpy as np
import matplotlib.pyplot as plt
n = 1000000
a = np.zeros(n, dtype=int)
j = 0                  
same = False  
for i in range(1, n):
    if same:
        a[i] = a[i-1]    
        same = False
        a[i] = 1 - a[i-1]
        j += 1            
        same = bool(a[j])
v = np.array([1/2**k for k in range(60, 0, -1)])
b = np.convolve(a, v, mode='valid')
plt.plot(b, np.arange(np.size(b)), '.', ms=2)

Kolakoski turtle curve

Let’s take another look at the Kolakoski sequence (part 1, part 2) which, by definition, is the sequence of 1s and 2s in which the nth term is equal to the length of the nth run of consecutive equal numbers in the same sequence. When a sequence has only two distinct entries, it can be visualized with the help of a turtle that turns left (when the entry is 1) or right (when the entry is 2). This visualization method seems particularly appropriate for the  Kolakoski sequence since there are no runs of 3 equal entries, meaning the turtle will never move around a square of sidelength equal to its step. In particular, this leaves open the possibility of getting a simple curve… Here are the first 300 terms; the turtle makes its first move down and then goes left-right-right-left-left-right-left-… according to the terms 1,2,2,1,1,2,1,…

300 terms of the sequence
300 terms of the sequence

No self-intersections yet… alas, at the 366th term it finally happens.

First self-intersection: 366 terms
First self-intersection: 366 terms

Self-intersections keep occurring after that:

1000 terms
1000 terms

again and again…

5000 terms
5000 terms

Okay, the curve obviously doesn’t mind intersecting self. But it can’t be periodic since the Kolakoski sequence isn’t. This leaves two questions unresolved:

  • Does the turtle ever get above its initial position? Probably… I haven’t tried more than 5000 terms
  • Is the curve bounded? Unlikely, but I’ve no idea how one would dis/prove that. For example, there cannot be a long diagonal run (left-right-left-right-left) because having 1,2,1,2,1 in the sequence implies that elsewhere, there are three consecutive 1s, and that doesn’t happen.

Here’s the Python code used for the above. I represented the sequence as a Boolean array with 1 = False, 2 = True.

import numpy as np
import turtle
n = 500                   # number of terms to compute
a = np.zeros(n, dtype=np.bool_)
j = 0                     # the index to look back at 
same = False   # will next term be same as current one?
for i in range(1, n):
    if same:
        a[i] = a[i-1]     # current run continues
        same = False
        a[i] = not a[i-1] # the run is over
        j += 1            # another run begins
        same = a[j]       # a[j] determines its length

for i in range(n):
    turtle.forward(10)    # used steps of 10 or 5 pixels 
    if a[i]:

Kolakoski sequence II

The computational evidence in my previous post on the Kolakoski sequence was meagre: I looked only at the first 500 terms. By definition, the Kolakoski sequence consists of 1s and 2s, begins with 1, and is self-similar in the following sense: replacing each run by its length, you recover the original sequence. Here is its beginning, with separating spaces omitted within each run.

1 22 11 2 1 22 1 22 11 2 11 22 1 2 11 2 1 22 11 2 11 2 1 22 1 22 11 2 1 22 1 2 11 2 11 22 1 22 11 2 1 22 1 22 11 2 11 2 1 22 1 2 11 22 1 22 11 2 1 22 1 22 11 2 11 22 1 2 11 2 1 22 1 22 11 2 11 2 1 22 11 2 11 22 1 2 11 2 11 22 1 22 11 2 1 22 1 2 11 22 1 22 1 2 11 2 11 22 1 22 11 …

Conjecturally, the density of 1s and of 2s in this sequence is 50%. And indeed, the number of 2s among the first N terms grows in a way that’s hard to distinguish from N/2 on a plot. However, the density looks quite different if one looks at three subsequences separately:

  • a_{3k+1} (red subsequence)
  • a_{3k+2} (green subsequence)
  • a_{3k+3} (blue subsequence)

At the very beginning (among the first 500 terms) the red sequence is dominated by 1s while the blue is dominated by 2s. But the story changes quickly. The plot below shows the count of 2s in each subsequence, minus the linear prediction. The average of red, green, and blue is shown in black.

200000 terms, separated into 3 subsequences

The nearly flawless behavior of the overall count of 2s hides the wild oscillation within the three subsequences. Presumably, they continue moving back and forth and the amplitude of their oscillation appears to be growing. However, the growth looks sublinear, which means that even within each subsequence, the asymptotic density of 2s might converge to 1/2.

The computation was done in scilab. For simplicity I replaced 1s and 2s by 0s ans 1s, so that the length of the jth run is a_j+1. The variable rep indicates whether the next term should be the same as its predecessor; j keeps the count of runs.

N=200000; a=zeros(N); j=1; rep=0;
for i=2:N
if (rep==1) a(i)=a(i-1); rep=0;
else a(i)=1-a(i-1); j=j+1; rep=a(j);

Pretty neat, isn’t it? The rest of the code is less neat, though. I still don’t know of any good way to calculate running totals in scilab.

M=floor(N/3)+1; s=zeros(M,4);
for i=0:M-2
for j=1:3
plot(s(:,1)-s(:,4),'r'); plot(s(:,2)-s(:,4),'g'); plot(s(:,3)-s(:,4),'b');

Bonus content: I extended the computations to N=1663500 (yeah, should have increased the stack size beforehand).

1663500 terms

The OEIS sequence A000002

The sequences in The On-Line Encyclopedia of Integer Sequences® are themselves arranged into a sequence, each being indexed by Axxxxxx. Hence, the sequence
2, 3, 2, 1, 3, 4, 1, 7, 7, 5, 45, 2, 181, 43, 17, ...
can never be included in the encyclopedia: its definition is a_n=1+\text{nth term of the nth sequence in the OEIS}.

By the way, which sequences have the honor of being on the virtual first page of the OEIS? A000004 consists of zeros. Its description offers the following explicit formula:

a(n)=A*sin(n*Pi) for any A. [From Jaume Oliver Lafont, Apr 02 2010]

The sequence A000001 counts the groups of order n.

The sequence A000002 consists of 1s and 2s only:
1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, ...
The defining property of this sequence is self-similarity: consider the runs of equal numbers
1, 22, 11, 2, 1, 22, 1, 22, 11, 2, 11, 22, 1, 2, 11, 2, 1, 22, 11, ...
and replace each run by its length:
1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, ...
You get the original sequence again.

This is the Kolakoski sequence, which first appeared in print in 1965. Contrary to what one might expect from such a simple definition, there is no simple pattern; it is not even known whether 1s and 2s appear with the same asymptotic frequency 1/2. One way to look for patterns in a function/sequence is to take its Fourier transform, and this is what I did below:

Fourier transform: click to magnify

Specifically, this is the absolute value of \displaystyle F(t)=\sum_{n=1}^{500} (2a_n-3) e^{2\pi i n t}. To better see where the peaks fall, it’s convenient to look at |F(1/t)|:

Reciprocal axis: click to magnify

Something is going on at t=1/3, i.e. at period 3. Recall that the sequence contains no triples 111 or 222: there are no runs of more than 2 equal numbers. So perhaps it’s not surprising that a pattern emerges at period 3. And indeed, even though the percentage of 1s among the first 500 terms of the sequence is 49.8%, breaking it down by n mod 3 yields the following:

  • Among a_{3k}, the percentage of 1s is 17.4%
  • Among a_{3k+1}, the percentage of 1s is 85.2%
  • Among a_{3k+2}, the percentage of 1s is 46.8%

Noticing another peak at t=1/9, I also broke down the percentages by n mod 9:

  • n=9k: there are 5.4% 1s
  • n=9k+1: there are 74.8% 1s
  • n=9k+2: there are 41.4% 1s
  • n=9k+3: there are 21.6% 1s
  • n=9k+4: there are 97.2% 1s
  • n=9k+5: there are 70.2% 1s
  • n=9k+6: there are 25.2% 1s
  • n=9k+7: there are 84.6% 1s
  • n=9k+8: there are 28.8% 1s

Nothing close to 50%…

In conclusion, my code for the Kolakoski sequence. This time I used Maple (instead of Scilab) for no particular reason.

N:=500; ks:=Array(1..N+2); i:=0; j:=1; new:=1;
while i<=N do i:=i+1; ks[i]:=new;
if (ks[j]=2) then i:=i+1; ks[i]:=new; end if;
j:=j+1; new:=3-new;
end do: