## The Kolakoski-Cantor set

A 0-1 sequence can be interpreted as a point in the interval [0,1]. But this makes the long-term behavior of the sequence practically invisible due to limited resolution of our screens (and eyes). To make it visible, we can also plot the points obtained by shifting the binary sequence to the left (Bernoulli shift, which also goes by many other names). The resulting orbit  is often dense in the interval, which doesn’t really help us visualize any patterns. But sometimes we get an interesting complex structure.

The vertical axis here is the time parameter, the number of dyadic shifts. The 0-1 sequence being visualized is the Kolakoski sequence in its binary form, with 0 and 1 instead of 1 and 2. By definition, the n-th run of equal digits in this sequence has length ${x_n+1}$. In particular, 000 and 111 never occur, which contributes to the blank spots near 0 and 1.

Although the sequence is not periodic, the set is quite stable in time; it does not make a visible difference whether one plots the first 10,000 shifts, or 10,000,000. The apparent symmetry about 1/2 is related to the open problem of whether the Kolakoski sequence is mirror invariant, meaning that together with any finite word (such as 0010) it also contains its complement (that would be 1101).

There are infinitely many forbidden words apart from 000 and 111 (and the words containing those). For example, 01010 cannot occur because it has 3 consecutive runs of length 1, which implies having 000 elsewhere in the sequence. For the same reason, 001100 is forbidden. This goes on forever: 00100100 is forbidden because it implies having 10101, etc.

The number of distinct words of length n in the Kolakoski sequence is bounded by a power of n (see F. M. Dekking, What is the long range order in the Kolakoski sequence?). Hence, the set pictured above is covered by ${O(n^p)}$ intervals of length ${2^{-n}}$, which implies it (and even its closure) is zero-dimensional in any fractal sense (has Minkowski dimension 0).

The set KC apparently does not have any isolated points; this is also an open problem, of recurrence (whether every word that appears in the sequence has to appear infinitely many times). Assuming this is so, the closure of the orbit is a totally disconnected compact set without isolated points, i.e., a Cantor set. It is not self-similar (not surprising, given it’s zero-dimensional), but its relation to the Bernoulli shift implies a structure resembling self-similarity:

Applying the transformations ${x\mapsto x/2}$ and ${x\mapsto (1+x)/2}$ yields two disjoint smaller copies that cover the original set, but with some spare parts left. The leftover bits exist because not every word in the sequence can be preceded by both 0 and 1.

Applying the transformations ${x\mapsto 2x}$ and ${x\mapsto 2x-1}$ yields two larger copies that cover the original set. There are no extra parts within the interval [0,1] but there is an overlap between the two copies.

The number ${c = \inf KC\approx 0.146778684766479}$ appears several times in the structure of the set: for instance, the central gap is ${((1-c)/2, (1+c)/2)}$, the second-largest gap on the left has the left endpoint ${(1-c)/4}$, etc. The Inverse Symbolic Calculator has not found anything about this number. Its binary expansion begins with 0.001 001 011 001 001 101 001 001 101 100… which one can recognize as the smallest binary number that can be written without doing anything three times in a row. (Can’t have 000; also can’t have 001 three times in a row; and 001 010 is not allowed because it contains 01010, three runs of length 1. Hence, the number begins with 001 001 011.) This number is obviously irrational, but other than that…

In conclusion, the Python code used to plot KC.

import numpy as np
import matplotlib.pyplot as plt
n = 1000000
a = np.zeros(n, dtype=int)
j = 0
same = False
for i in range(1, n):
if same:
a[i] = a[i-1]
same = False
else:
a[i] = 1 - a[i-1]
j += 1
same = bool(a[j])
v = np.array([1/2**k for k in range(60, 0, -1)])
b = np.convolve(a, v, mode='valid')
plt.plot(b, np.arange(np.size(b)), '.', ms=2)
plt.show()

## Kolakoski turtle curve

Let’s take another look at the Kolakoski sequence (part 1, part 2) which, by definition, is the sequence of 1s and 2s in which the nth term is equal to the length of the nth run of consecutive equal numbers in the same sequence. When a sequence has only two distinct entries, it can be visualized with the help of a turtle that turns left (when the entry is 1) or right (when the entry is 2). This visualization method seems particularly appropriate for the  Kolakoski sequence since there are no runs of 3 equal entries, meaning the turtle will never move around a square of sidelength equal to its step. In particular, this leaves open the possibility of getting a simple curve… Here are the first 300 terms; the turtle makes its first move down and then goes left-right-right-left-left-right-left-… according to the terms 1,2,2,1,1,2,1,…

No self-intersections yet… alas, at the 366th term it finally happens.

Self-intersections keep occurring after that:

again and again…

Okay, the curve obviously doesn’t mind intersecting self. But it can’t be periodic since the Kolakoski sequence isn’t. This leaves two questions unresolved:

• Does the turtle ever get above its initial position? Probably… I haven’t tried more than 5000 terms
• Is the curve bounded? Unlikely, but I’ve no idea how one would dis/prove that. For example, there cannot be a long diagonal run (left-right-left-right-left) because having 1,2,1,2,1 in the sequence implies that elsewhere, there are three consecutive 1s, and that doesn’t happen.

Here’s the Python code used for the above. I represented the sequence as a Boolean array with 1 = False, 2 = True.

import numpy as np
import turtle
n = 500                   # number of terms to compute
a = np.zeros(n, dtype=np.bool_)
j = 0                     # the index to look back at
same = False   # will next term be same as current one?
for i in range(1, n):
if same:
a[i] = a[i-1]     # current run continues
same = False
else:
a[i] = not a[i-1] # the run is over
j += 1            # another run begins
same = a[j]       # a[j] determines its length

turtle.hideturtle()
turtle.right(90)
for i in range(n):
turtle.forward(10)    # used steps of 10 or 5 pixels
if a[i]:
turtle.right(90)
else:
turtle.left(90)


## Kolakoski sequence II

The computational evidence in my previous post on the Kolakoski sequence was meagre: I looked only at the first 500 terms. By definition, the Kolakoski sequence consists of 1s and 2s, begins with 1, and is self-similar in the following sense: replacing each run by its length, you recover the original sequence. Here is its beginning, with separating spaces omitted within each run.

1 22 11 2 1 22 1 22 11 2 11 22 1 2 11 2 1 22 11 2 11 2 1 22 1 22 11 2 1 22 1 2 11 2 11 22 1 22 11 2 1 22 1 22 11 2 11 2 1 22 1 2 11 22 1 22 11 2 1 22 1 22 11 2 11 22 1 2 11 2 1 22 1 22 11 2 11 2 1 22 11 2 11 22 1 2 11 2 11 22 1 22 11 2 1 22 1 2 11 22 1 22 1 2 11 2 11 22 1 22 11 …

Conjecturally, the density of 1s and of 2s in this sequence is 50%. And indeed, the number of 2s among the first N terms grows in a way that’s hard to distinguish from N/2 on a plot. However, the density looks quite different if one looks at three subsequences separately:

• $a_{3k+1}$ (red subsequence)
• $a_{3k+2}$ (green subsequence)
• $a_{3k+3}$ (blue subsequence)

At the very beginning (among the first 500 terms) the red sequence is dominated by 1s while the blue is dominated by 2s. But the story changes quickly. The plot below shows the count of 2s in each subsequence, minus the linear prediction. The average of red, green, and blue is shown in black.

The nearly flawless behavior of the overall count of 2s hides the wild oscillation within the three subsequences. Presumably, they continue moving back and forth and the amplitude of their oscillation appears to be growing. However, the growth looks sublinear, which means that even within each subsequence, the asymptotic density of 2s might converge to 1/2.

The computation was done in scilab. For simplicity I replaced 1s and 2s by 0s ans 1s, so that the length of the $j$th run is $a_j+1$. The variable rep indicates whether the next term should be the same as its predecessor; $j$ keeps the count of runs.
 N=200000; a=zeros(N); j=1; rep=0; for i=2:N if (rep==1) a(i)=a(i-1); rep=0; else a(i)=1-a(i-1); j=j+1; rep=a(j); end end 

Pretty neat, isn’t it? The rest of the code is less neat, though. I still don’t know of any good way to calculate running totals in scilab.
 M=floor(N/3)+1; s=zeros(M,4); for i=0:M-2 for j=1:3 s(i+2,j)=s(i+1,j)+a(3*i+j) end s(i+2,4)=i/2; end plot(s(:,1)-s(:,4),'r'); plot(s(:,2)-s(:,4),'g'); plot(s(:,3)-s(:,4),'b'); plot((s(:,1)+s(:,2)+s(:,3))/3-s(:,4),'k'); 

Bonus content: I extended the computations to N=1663500 (yeah, should have increased the stack size beforehand).

## The OEIS sequence A000002

The sequences in The On-Line Encyclopedia of Integer Sequences® are themselves arranged into a sequence, each being indexed by Axxxxxx. Hence, the sequence
2, 3, 2, 1, 3, 4, 1, 7, 7, 5, 45, 2, 181, 43, 17, ...
can never be included in the encyclopedia: its definition is $a_n=1+\text{nth term of the nth sequence in the OEIS}$.

By the way, which sequences have the honor of being on the virtual first page of the OEIS? A000004 consists of zeros. Its description offers the following explicit formula:

a(n)=A*sin(n*Pi) for any A. [From Jaume Oliver Lafont, Apr 02 2010]

The sequence A000001 counts the groups of order n.

The sequence A000002 consists of 1s and 2s only:
1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, ...
The defining property of this sequence is self-similarity: consider the runs of equal numbers
1, 22, 11, 2, 1, 22, 1, 22, 11, 2, 11, 22, 1, 2, 11, 2, 1, 22, 11, ...
and replace each run by its length:
1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, ...
You get the original sequence again.

This is the Kolakoski sequence, which first appeared in print in 1965. Contrary to what one might expect from such a simple definition, there is no simple pattern; it is not even known whether 1s and 2s appear with the same asymptotic frequency 1/2. One way to look for patterns in a function/sequence is to take its Fourier transform, and this is what I did below:

Specifically, this is the absolute value of $\displaystyle F(t)=\sum_{n=1}^{500} (2a_n-3) e^{2\pi i n t}$. To better see where the peaks fall, it’s convenient to look at $|F(1/t)|$:

Something is going on at $t=1/3$, i.e. at period 3. Recall that the sequence contains no triples 111 or 222: there are no runs of more than 2 equal numbers. So perhaps it’s not surprising that a pattern emerges at period 3. And indeed, even though the percentage of 1s among the first 500 terms of the sequence is 49.8%, breaking it down by n mod 3 yields the following:

• Among $a_{3k}$, the percentage of 1s is 17.4%
• Among $a_{3k+1}$, the percentage of 1s is 85.2%
• Among $a_{3k+2}$, the percentage of 1s is 46.8%

Noticing another peak at $t=1/9$, I also broke down the percentages by n mod 9:

• $n=9k$: there are 5.4% 1s
• $n=9k+1$: there are 74.8% 1s
• $n=9k+2$: there are 41.4% 1s
• $n=9k+3$: there are 21.6% 1s
• $n=9k+4$: there are 97.2% 1s
• $n=9k+5$: there are 70.2% 1s
• $n=9k+6$: there are 25.2% 1s
• $n=9k+7$: there are 84.6% 1s
• $n=9k+8$: there are 28.8% 1s

Nothing close to 50%…

In conclusion, my code for the Kolakoski sequence. This time I used Maple (instead of Scilab) for no particular reason.

N:=500; ks:=Array(1..N+2); i:=0; j:=1; new:=1; while i<=N do i:=i+1; ks[i]:=new; if (ks[j]=2) then i:=i+1; ks[i]:=new; end if; j:=j+1; new:=3-new; end do: