The sum of pairwise distances and the square of CDF

Suppose we have {n} real numbers {x_0,\dots, x_{n-1}} and want to find the sum of all distances {|x_j-x_k|} over {j < k}. Why? Maybe because, over five years ago, the gradient flow of this quantity was used for "clustering by collision" (part 1, part 2, part 3).

If I have a Python console open, the problem appears to be solved with one line:

>>> import numpy as np
>>> 0.5 * np.abs(np.subtract.outer(x, x)).sum()

where the outer difference of x with x creates a matrix of all pairwise differences {x_j-x_k}, then absolute values are taken, and then everything is added up. Each distance is counted twice, hence the factor of 0.5.

But trying this with, say, one million numbers is not likely to work. If each number takes 8 bytes of memory (64 bits, double precision), then the array x is still pretty small (under 8 MB) but a million-by-million matrix will require over 7 terabytes, and I won’t have that kind of RAM anytime soon.
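To make the memory estimate explicit (a back-of-the-envelope check of mine, not from the original post):

```python
n = 10**6
vector_bytes = n * 8        # the array x itself: one 8-byte float per entry
matrix_bytes = n * n * 8    # one 8-byte float per entry of the n-by-n matrix
print(vector_bytes / 2**20) # under 8 mebibytes
print(matrix_bytes / 2**40) # over 7 tebibytes
```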

In principle, one could run a loop adding these values, or store the matrix on a hard drive. Both are going to take forever.

There is a much better way, though. First, sort the numbers in nondecreasing order; this takes little time or memory compared to the quadratic cost of forming the matrix. Then consider the partial sums {s_k = x_0+\dots+x_k}; computing them is linear in time and memory. For each fixed {k}, the sum of distances from {x_k} to the {x_j} with {j<k} is simply {kx_k - s_{k-1}}, or, equivalently, {(k+1)x_k - s_k}. So, all we have to do is add these up. Still one line of code (after sorting), but a much faster one:

>>> x.sort()
>>> (np.arange(1, n+1)*x - np.cumsum(x)).sum()
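As a sanity check (mine, not part of the original post), the sorted-cumsum formula can be compared with the brute-force outer-difference computation on a sample small enough for both:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(size=n)

# brute force: sum all |x_j - x_k| over ordered pairs, halved
brute = 0.5 * np.abs(np.subtract.outer(x, x)).sum()

# fast version: sort, then (k+1)*x_k minus the partial sums s_k
x.sort()
fast = (np.arange(1, n + 1) * x - np.cumsum(x)).sum()

print(np.isclose(brute, fast))  # True
```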

For example, x could be a sample from some continuous distribution. Assuming the distribution has a finite mean (i.e., is not too heavy-tailed), the sum of all pairwise distances grows quadratically with n, and the average distance over all {n(n-1)/2} pairs approaches a finite limit. For the uniform distribution on [0, 1] the computation shows this limit is 1/3. For the standard normal distribution it is 1.128…, not as recognizable a number.
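Here is a quick Monte Carlo illustration of both limits using the fast formula (my own sketch; the sample size, seed, and helper name are arbitrary choices):

```python
import numpy as np

def mean_pairwise_distance(x):
    """Average of |x_j - x_k| over all pairs j < k, via sorting and cumsum."""
    n = len(x)
    x = np.sort(x)
    total = (np.arange(1, n + 1) * x - np.cumsum(x)).sum()
    return total / (n * (n - 1) / 2)

rng = np.random.default_rng(1)
n = 200_000
u = mean_pairwise_distance(rng.uniform(size=n))
g = mean_pairwise_distance(rng.standard_normal(n))
print(u)  # close to 1/3
print(g)  # close to 1.128...
```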

As {n\to \infty}, the average pairwise distance in a sample from a distribution converges to the expected value of |X-Y|, where X, Y are two independent variables with that distribution. Let’s express this in terms of the probability density function {p} and the cumulative distribution function {\Phi}. By symmetry, we can integrate over {x> y} and double the result:

{\displaystyle \frac12 E|X-Y| = \int_{-\infty}^\infty p(x)\,dx \int_{-\infty}^x (x-y) p(y)\,dy}

Integrate by parts in the inner integral, using {p(y) = \Phi'(y)}; the boundary terms vanish: at {y=x} because of the factor {x-y}, and at {-\infty} because a finite mean forces {(x-y)\Phi(y)\to 0}.

{\displaystyle \frac12 E|X-Y| = \int_{-\infty}^\infty p(x)\,dx \int_{-\infty}^x \Phi(y)\,dy}

Integrate by parts in the outer integral as well, again using {p(x) = \Phi'(x)}: the derivative moves onto the inner integral and eliminates it. This time there is a boundary term.

{\displaystyle \frac12 E|X-Y| = \Phi(\infty) \int_{-\infty}^\infty \Phi(y)\,dy - \int_{-\infty}^\infty \Phi(x)^2\,dx}

Since {\Phi(\infty) = 1}, the two (individually divergent) integrals combine nicely:

{\displaystyle \frac12 E|X-Y| = \int_{-\infty}^\infty \Phi(x) (1-\Phi(x))\,dx}

This is a lot neater than I expected: {E|X-Y|} is simply the integral of {2\Phi(1-\Phi)}. I don’t often see a CDF squared, as it appears here. Some examples: for the uniform distribution on [0,1] we get

{\displaystyle E|X-Y| = \int_0^1 2x(1-x)\,dx = \frac13}

and for the standard normal, with {\Phi(x) = (1+\mathrm{erf}\,(x/\sqrt{2}))/2}, it is

{\displaystyle \int_{-\infty}^\infty \frac12 \left(1-\mathrm{erf}(x/\sqrt{2})^2 \right)\,dx = \frac{2}{\sqrt{\pi}}\approx 1.12838\ldots }
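This value is easy to check numerically with nothing but the standard library (math.erf); a crude trapezoidal rule on [−10, 10], a sketch of mine, already matches 2/√π closely:

```python
import math

def f(x):
    # the integrand 2*Phi(x)*(1 - Phi(x)) = (1 - erf(x/sqrt(2))**2) / 2
    return 0.5 * (1 - math.erf(x / math.sqrt(2)) ** 2)

a, b, m = -10.0, 10.0, 20_000        # the tails beyond |x| = 10 are negligible
h = (b - a) / m
integral = h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, m)))

print(integral)                # about 1.12838
print(2 / math.sqrt(math.pi))  # the exact limit 2/sqrt(pi)
```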

The trick with sorting and cumulative sums can also be used to find, for every point {x_k}, the sum (or average) of its distances to all other points. This time we do not sum over {k}, and for each {k} we must also add the distances {x_j-x_k} with {j>k}. That sum is simply {S - s_k - (n-k-1)x_k}, where {S} is the total sum. So, all we need is

>>> (2*np.arange(1,n+1)-n)*x - 2*np.cumsum(x) + x.sum()
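Again a brute-force cross-check of mine: for sorted input, the one-liner should reproduce the row sums of the full distance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = np.sort(rng.uniform(size=n))   # the formula assumes x is sorted

# per-point distance sums in O(n): (2(k+1) - n) x_k - 2 s_k + S
per_point = (2 * np.arange(1, n + 1) - n) * x - 2 * np.cumsum(x) + x.sum()

# brute force: row sums of the full matrix of absolute differences
brute = np.abs(np.subtract.outer(x, x)).sum(axis=1)

print(np.allclose(per_point, brute))  # True
```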

Unfortunately, the analogous problems for vector-valued sequences are not as easy. If the Manhattan metric is used, we can do the computations for each coordinate separately, and add the results. For the Euclidean metric…
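For the Manhattan metric, the coordinate-by-coordinate reduction looks like this (a sketch of mine; the helper name sum_pairwise_l1 is not from the post):

```python
import numpy as np

def sum_pairwise_l1(points):
    """Sum of Manhattan distances over all pairs, one coordinate at a time."""
    total = 0.0
    for col in points.T:                  # each coordinate separately
        c = np.sort(col)
        n = len(c)
        total += (np.arange(1, n + 1) * c - np.cumsum(c)).sum()
    return total

rng = np.random.default_rng(3)
pts = rng.uniform(size=(300, 2))
# brute force: all ordered-pair l1 distances, halved
brute = 0.5 * np.abs(pts[:, None, :] - pts[None, :, :]).sum()
print(np.isclose(sum_pairwise_l1(pts), brute))  # True
```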

Kolakoski sequence II

The computational evidence in my previous post on the Kolakoski sequence was meagre: I looked only at the first 500 terms. By definition, the Kolakoski sequence consists of 1s and 2s, begins with 1, and is self-similar in the following sense: replacing each run by its length, you recover the original sequence. Here is its beginning, with separating spaces omitted within each run.

1 22 11 2 1 22 1 22 11 2 11 22 1 2 11 2 1 22 11 2 11 2 1 22 1 22 11 2 1 22 1 2 11 2 11 22 1 22 11 2 1 22 1 22 11 2 11 2 1 22 1 2 11 22 1 22 11 2 1 22 1 22 11 2 11 22 1 2 11 2 1 22 1 22 11 2 11 2 1 22 11 2 11 22 1 2 11 2 11 22 1 22 11 2 1 22 1 2 11 22 1 22 1 2 11 2 11 22 1 22 11 …

Conjecturally, the density of 1s and of 2s in this sequence is 50%. And indeed, the number of 2s among the first N terms grows in a way that’s hard to distinguish from N/2 on a plot. However, the density looks quite different if one looks at three subsequences separately:

  • a_{3k+1} (red subsequence)
  • a_{3k+2} (green subsequence)
  • a_{3k+3} (blue subsequence)

At the very beginning (among the first 500 terms) the red sequence is dominated by 1s while the blue is dominated by 2s. But the story changes quickly. The plot below shows the count of 2s in each subsequence, minus the linear prediction. The average of red, green, and blue is shown in black.

200000 terms, separated into 3 subsequences

The nearly flawless behavior of the overall count of 2s hides wild oscillations within the three subsequences. Presumably they continue moving back and forth, and the amplitude of the oscillation appears to grow. However, the growth looks sublinear, so even within each subsequence the asymptotic density of 2s may still converge to 1/2.

The computation was done in Scilab. For simplicity I replaced 1s and 2s by 0s and 1s, so that the length of the jth run is a_j+1. The variable rep indicates whether the next term should repeat its predecessor; j keeps the count of runs.

N=200000; a=zeros(1,N); j=1; rep=0;
for i=2:N
  if rep==1 then
    a(i)=a(i-1); rep=0;
  else
    a(i)=1-a(i-1); j=j+1; rep=a(j);
  end
end

Pretty neat, isn’t it? The rest of the code is less neat, though. I still don’t know of any good way to calculate running totals in Scilab.

M=floor(N/3)+1; s=zeros(M,4);
for i=0:M-2
  for j=1:3
    s(i+2,j)=s(i+1,j)+a(3*i+j);  // running count of 2s in subsequence j
  end
  s(i+2,4)=(i+1)/2;              // linear prediction: half the terms so far
end
plot(s(:,1)-s(:,4),'r'); plot(s(:,2)-s(:,4),'g'); plot(s(:,3)-s(:,4),'b');
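For readers without Scilab, the whole computation fits in a few lines of Python with NumPy (my translation of the loops above, same 0/1 convention; the linear prediction k/2 is my reading of what the plots subtract):

```python
import numpy as np

# generate the 0/1 Kolakoski sequence (0 stands for 1, 1 stands for 2)
N = 200_000
a = np.zeros(N, dtype=int)
j, rep = 0, 0                   # 0-based run index; rep = repeat next term?
for i in range(1, N):
    if rep == 1:
        a[i] = a[i - 1]
        rep = 0
    else:
        a[i] = 1 - a[i - 1]
        j += 1
        rep = a[j]

print((a[:12] + 1).tolist())    # [1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2]

# running counts of 2s in the three subsequences, minus the linear
# prediction k/2 after k terms of each subsequence
M = N // 3
sub = a[: 3 * M].reshape(M, 3)  # row k holds a_{3k+1}, a_{3k+2}, a_{3k+3}
dev = np.cumsum(sub, axis=0) - np.arange(1, M + 1)[:, None] / 2
```

Plotting the three columns of dev reproduces the red, green, and blue curves.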

Bonus content: I extended the computations to N=1663500 (yeah, should have increased the stack size beforehand).

1663500 terms