# The sum of pairwise distances and the square of CDF

Suppose we have ${n}$ real numbers ${x_0,\dots, x_{n-1}}$ and want to find the sum of all distances ${|x_j-x_k|}$ over ${j < k}$. Why? Maybe because over five years ago, the gradient flow of this quantity was used for "clustering by collision" (part 1, part 2, part 3).

If I have a Python console open, the problem appears to be solved with one line:

>>> 0.5 * np.abs(np.subtract.outer(x, x)).sum()

where the outer difference of x with x creates a matrix of all differences ${x_i-x_j}$, then absolute values are taken, and then they are all added up. Double-counted, hence the factor of 0.5.

But trying this with, say, one million numbers is not likely to work. If each number takes 8 bytes of memory (64 bits, double precision), then the array x is still pretty small (under 8 MB) but a million-by-million matrix will require over 7 terabytes, and I won’t have that kind of RAM anytime soon.

In principle, one could run a loop adding these values, or store the matrix on a hard drive. Both are going to take forever.

There is a much better way, though. First, sort the numbers in nondecreasing order; this does not require much time or memory (compared to quadratic memory cost of forming a matrix). Then consider the partial sums ${s_k = x_0+\dots+x_k}$; the cost of computing them is linear in time and memory. For each fixed ${k}$, the sum of distances to ${x_j}$ with ${j is simply ${kx_k - s_{k-1}}$, or, equivalently, ${(k+1)x_k - s_k}$. So, all we have to do is add these up. Still one line of code (after sorting), but a much faster one:

>>> x.sort()
>>> (np.arange(1, n+1)*x - np.cumsum(x)).sum()

For example, x could be a sample from some continuous distribution. Assuming the distribution has a mean (i.e., is not too heavy tailed), the sum of all pairwise distances grows quadratically with n, and its average approaches a finite limit. For the uniform distribution on [0, 1] the computation shows this limit is 1/3. For the standard normal distribution it is 1.128… which is not as recognizable a number.

As ${n\to \infty}$, the average distance of a sample taken from a distribution converges to the expected value of |X-Y| where X, Y are two independent variables with that distribution. Let’s express this in terms of the probability density function ${p}$ and the cumulative distribution function ${\Phi}$. By symmetry, we can integrate over ${x> y}$ and double the result:

${\displaystyle \frac12 E|X-Y| = \int_{-\infty}^\infty p(x)\,dx \int_{-\infty}^x (x-y) p(y)\,dy}$

Integrate by parts in the second integral: ${p(y) = \Phi'(y)}$, and the boundary terms are zero.

${\displaystyle \frac12 E|X-Y| = \int_{-\infty}^\infty p(x)\,dx \int_{-\infty}^x \Phi(y)\,dy}$

Integrate by parts in the other integral, throwing the derivative onto the indefinite integral and thus eliminating it. There is a boundary term this time.

${\displaystyle \frac12 E|X-Y| = \Phi(\infty) \int_{-\infty}^\infty \Phi(y)\,dy - \int_{-\infty}^\infty \Phi(x)^2\,dx}$

Since ${\Phi(\infty) = 1}$, this simplifies nicely:

${\displaystyle \frac12 E|X-Y| = \int_{-\infty}^\infty \Phi(x) (1-\Phi(x))\,dx}$

This is a lot neater than I expected: ${E|X-Y|}$ is simply the integral of ${2\Phi(1-\Phi)}$. I don’t often see CDF squared, like here. Some examples: for the uniform distribution on [0,1] we get

${\displaystyle E|X-Y| = \int_0^1 2x(1-x)\,dx = \frac13}$

and for the standard normal, with ${\Phi(x) = (1+\mathrm{erf}\,(x/\sqrt{2}))/2}$, it is

${\displaystyle \int_{-\infty}^\infty \frac12 \left(1-\mathrm{erf}\,(x/\sqrt{2}) ^2 \right)\,dx = \frac{2}{\sqrt{\pi}}\approx 1.12838\ldots }$

The trick with sorting and cumulative sums can also be used to find, for every point ${x_k}$, the sum (or average) of distances to all other points. To do this, we don’t sum over ${k}$ but must also add ${|x_j-x_k|}$ for ${j>k}$. The latter sum is simply ${S - s_k - (n-k-1)x_k}$ where ${S}$ is the total sum. So, all we need is

>>> (2*np.arange(1,n+1)-n)*x - 2*np.cumsum(x) + x.sum()

Unfortunately, the analogous problems for vector-valued sequences are not as easy. If the Manhattan metric is used, we can do the computations for each coordinate separately, and add the results. For the Euclidean metric…