Suppose we have real numbers $x_1, \dots, x_n$ and want to find the sum of all distances $|x_i - x_j|$ over $i < j$. Why? Maybe because over five years ago, the gradient flow of this quantity was used for "clustering by collision" (part 1, part 2, part 3).
If I have a Python console open, the problem appears to be solved with one line:
>>> 0.5 * np.abs(np.subtract.outer(x, x)).sum()
where the outer difference of x with x creates a matrix of all differences $x_i - x_j$, then absolute values are taken, and then they are all added up. Double-counted, hence the factor of 0.5.
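For a tiny made-up array, the whole computation is easy to check by hand; here is a self-contained version of the same one-liner (the numbers are only for illustration):

import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.5])    # made-up sample data
diffs = np.subtract.outer(x, x)       # matrix of all differences x_i - x_j
print(0.5 * np.abs(diffs).sum())      # 10.5, the sum over the six pairs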
But trying this with, say, one million numbers is not likely to work. If each number takes 8 bytes of memory (64 bits, double precision), then the array x is still pretty small (under 8 MB) but a million-by-million matrix will require over 7 terabytes, and I won’t have that kind of RAM anytime soon.
In principle, one could run a loop adding these values, or store the matrix on a hard drive. Both are going to take forever.
There is a much better way, though. First, sort the numbers in nondecreasing order; this does not require much time or memory (compared to the quadratic memory cost of forming a matrix). Then consider the partial sums $s_k = x_1 + \cdots + x_k$; the cost of computing them is linear in time and memory. For each fixed $k$, the sum of distances to the points $x_j$ with $j < k$ is simply $(k-1)x_k - s_{k-1}$, or, equivalently, $k x_k - s_k$. So, all we have to do is add these up. Still one line of code (after sorting), but a much faster one:
>>> x.sort()
>>> (np.arange(1, n+1)*x - np.cumsum(x)).sum()
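To see that the quadratic and the sorted/cumulative-sum computations agree, here is a small self-contained comparison (a sketch; the sample size is arbitrary and n stands for len(x), as in the line above):

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(2000)                  # small enough for the naive method
n = len(x)

naive = 0.5 * np.abs(np.subtract.outer(x, x)).sum()   # quadratic time and memory
x.sort()                                              # nondecreasing order
fast = (np.arange(1, n+1)*x - np.cumsum(x)).sum()     # linear after the sort
print(naive, fast)                                    # the two agree up to rounding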
For example, x could be a sample from some continuous distribution. Assuming the distribution has a mean (i.e., is not too heavy-tailed), the sum of all pairwise distances grows quadratically with n, while the average distance approaches a finite limit. For the uniform distribution on [0, 1] the computation shows this limit is 1/3. For the standard normal distribution it is 1.128…, which is not as recognizable a number.
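A quick empirical check of these limits, using the fast formula above (the sample sizes and seed are arbitrary):

import numpy as np

def average_pairwise_distance(x):
    # mean of |x_i - x_j| over pairs i < j, via sorting and cumulative sums
    x = np.sort(x)
    n = len(x)
    pair_sum = (np.arange(1, n + 1) * x - np.cumsum(x)).sum()
    return pair_sum / (n * (n - 1) / 2)

rng = np.random.default_rng(1)
print(average_pairwise_distance(rng.random(10**6)))           # close to 1/3
print(average_pairwise_distance(rng.standard_normal(10**6)))  # close to 1.128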
As $n \to \infty$, the average distance of a sample taken from a distribution converges to the expected value $E|X-Y|$, where $X, Y$ are two independent variables with that distribution. Let's express this in terms of the probability density function $p$ and the cumulative distribution function $F$. By symmetry, we can integrate over the region where $y < x$ and double the result:

$$E|X-Y| = 2\int_{-\infty}^{\infty} \int_{-\infty}^{x} (x-y)\, p(y)\, p(x)\, dy\, dx.$$
Integrate by parts in the second integral: $\int_{-\infty}^{x} (x-y)\,p(y)\,dy = \int_{-\infty}^{x} F(y)\,dy$, and the boundary terms are zero. Hence

$$E|X-Y| = 2\int_{-\infty}^{\infty} p(x) \int_{-\infty}^{x} F(y)\,dy\,dx.$$

Integrate by parts in the other integral, throwing the derivative onto the indefinite integral and thus eliminating it. There is a boundary term this time:

$$\int_{-\infty}^{b} p(x) \int_{-\infty}^{x} F(y)\,dy\,dx = F(b)\int_{-\infty}^{b} F(y)\,dy - \int_{-\infty}^{b} F(x)^2\,dx.$$

Since $F(b) \to 1$ as $b \to \infty$, this simplifies nicely:

$$E|X-Y| = 2\int_{-\infty}^{\infty} F(x)\,(1 - F(x))\,dx.$$
This is a lot neater than I expected: $E|X-Y|$ is simply the integral of $2(F - F^2) = 2F(1-F)$. I don't often see the CDF squared, as it appears here. Some examples: for the uniform distribution on $[0,1]$ we get $2\int_0^1 x(1-x)\,dx = 1/3$, and for the standard normal, with $F = \Phi$, it is $2\int_{-\infty}^{\infty} \Phi(x)\,(1-\Phi(x))\,dx = 2/\sqrt{\pi} \approx 1.128$.
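The normal value can be checked numerically; a sketch using SciPy's normal CDF (assuming SciPy is available):

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# E|X - Y| for the standard normal: the integral of 2*Phi*(1 - Phi)
value, _ = quad(lambda t: 2 * norm.cdf(t) * (1 - norm.cdf(t)), -np.inf, np.inf)
print(value, 2 / np.sqrt(np.pi))    # both are approximately 1.1284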
The trick with sorting and cumulative sums can also be used to find, for every point $x_k$, the sum (or average) of distances to all other points. To do this, we don't sum over $k$, but must also add the distances $x_j - x_k$ for $j > k$. The latter sum is simply $(s_n - s_k) - (n-k)x_k$, where $s_n$ is the total sum. Adding this to $k x_k - s_k$ gives $(2k - n)x_k - 2s_k + s_n$. So, all we need is
>>> (2*np.arange(1,n+1)-n)*x - 2*np.cumsum(x) + x.sum()
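A brute-force check of this per-point formula on a small sorted sample (a sketch; the names and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.random(1000))        # the formula assumes x is sorted
n = len(x)

fast = (2*np.arange(1, n+1) - n)*x - 2*np.cumsum(x) + x.sum()
brute = np.abs(np.subtract.outer(x, x)).sum(axis=1)   # row sums of the full distance matrix
print(np.allclose(fast, brute))                       # True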
Unfortunately, the analogous problems for vector-valued sequences are not as easy. If the Manhattan metric is used, we can do the computations for each coordinate separately, and add the results. For the Euclidean metric…
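For the Manhattan metric, the coordinate-by-coordinate reduction might look like the following sketch (the 2-D sample, seed, and function name are made up):

import numpy as np

def sum_pairwise_manhattan(points):
    # points: array of shape (n, d); returns the sum of Manhattan distances over pairs i < j,
    # computed one coordinate at a time with the sorting/cumulative-sum trick
    total = 0.0
    for column in points.T:
        c = np.sort(column)
        k = np.arange(1, len(c) + 1)
        total += (k*c - np.cumsum(c)).sum()
    return total

rng = np.random.default_rng(3)
pts = rng.random((500, 2))
brute = 0.5 * np.abs(pts[:, None, :] - pts[None, :, :]).sum(axis=2).sum()   # brute-force check
print(sum_pairwise_manhattan(pts), brute)    # the two agree up to rounding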