1000 Stack Overflow answers

A randomly-round milestone (which does not include 424 answers posted under another account) reminded me of how my Stack Overflow activity shifted over the years. In rough chronological order, it went through the following.

Matlab and Scilab are where I started, being familiar with both from some Numerical Methods courses. Downsides: most Matlab questions are either answered quickly or are unanswerable. Scilab is a much smaller niche, with almost no competition from other answerers. But a low-activity tag also means virtually no votes. Eventually I got bored of both and do not watch these tags anymore. Sorry, Scilab – I like your improvements on the weird Matlab syntax, but the emergence of Python’s scientific stack leaves you with hardly any potential user base.

Google Sheets is a topic awkwardly split between Stack Overflow and Web Applications. Is using spreadsheet functions programming? Well, is writing regular expressions programming? Is writing SQL queries programming? Google Sheets functions include both regex support and a SQL-like query language, although the latter is seriously limited by the lack of joins. It can be fun to find workarounds for various limitations of spreadsheet functions (example). But the fun does not last forever; I have long abandoned both the Web Applications site and the Google Sheets tag on Stack Overflow.

Google Apps Script is the tool I use daily. It processes bundles of spreadsheets, scrapes information from various places, and serves up static webpages and simple web applications for my own use. For example, when I’m in the mood to use some moderation tools, it identifies the posts that need to be disposed of.

Homework dump
Yeah I’ll get right on that

Some scripts also act as me, automating away some tasks. Apps Script may be based on an ancient, frozen JavaScript engine (still no arrow functions), but I like the way it just works. Still, I got tired of spreadsheet-related scripts and now limit my attention to the occasional questions about working with Google Documents programmatically.

SymPy is a library I almost never use for my own work, but I like this mix of pure math and pure Python. Not to mention the plentiful bugs and implementation deficiencies, which create fertile ground for tricky new questions. On multiple occasions, answering a SymPy question led to diving into the source, and possibly filing an issue, sometimes with a pull request. This is the tag I watch most closely now.

NumPy and SciPy, of course. I like interpolation questions the most, followed by other typical numerical tasks: integration, optimization, root finding, numerical linear algebra. Tricky situations arise all the time, and the sparsity of SciPy documentation in niche areas is remarkable. Shockingly, one still needs to know math to do math with a computer.

Although I dabbled in both pandas and scikit-learn, I left both tags pretty quickly: pandas has no shortage of extremely active answerers, and the scikit-learn tag is depressingly full of “I got a machine learning problem and installed Anaconda, what do I do next”.

And I never even considered watching the generic Python or JavaScript tags. Why? The following quote suffices as an explanation.

For a very long time now, Stack Overflow has been headed in two different directions…

  1. The first direction involves building an increasingly-detailed set of Q&A on broad, popular topics. This is the easiest one to observe: if you’re learning, say, C# then you’ll find a very broad and deep set of information here, with more and more corner-cases being filled in hourly. The single biggest obstacle here is noise: with almost every possible topic covered in multiple questions expressed in multiple ways from multiple perspectives, finding a question that seems to match the problem you’re having is easy – but finding the question that has an applicable answer can be a slog.

  2. The second direction involves covering an ever-more-vast set of topics. Everything from languages and platforms with only a few (initial) users, to specialized libraries and tools nominally under the umbrella of a more well-known language or platform. The big problem here is (and has always been) that these topics are nearly impossible to moderate effectively; they may have only a tiny handful of active users, they tend to not attract much voting, and the bulk of the moderation tooling (and elected moderator team) is skewed toward serving #1.

Solving the problems of #1 is… An almost insurmountable challenge. We tried hard to attack it head on, and eventually came to the conclusion that we could probably throw a billion dollars in dev time at it and still only maybe succeed; the root causes are simply a lot bigger and broader than the little Q&A site they affect. That doesn’t mean we shouldn’t try to mitigate those effects, but it very much means we should do so while compulsively reciting the serenity prayer.

Because of this… Or perhaps in spite of it… I’ve come to believe that direction #2 is the future of Stack Overflow, the area that both askers and answerers will find most rewarding in years to come. Specialization may not appeal to Lazarus Long, but as our field matures it’s an increasingly-necessary option – and one we have a real opportunity to serve better here on Stack Overflow.

Nonexpanding Jordan curves

A simple closed curve {\Gamma} on the plane can be parameterized by a homeomorphism {f\colon \mathbb T\to \Gamma} in infinitely many ways. It is natural to look for “nice” parameterizations: smooth ones, for example. I do not want to require smooth {\Gamma} here, so let us try to find {f} that is nonexpanding, that is {|f(a)-f(b)|\le |a-b|} for all {a, b\in \mathbb T}. Note that Euclidean distance is used here, not arclength.

What are some necessary conditions for the existence of a nonexpanding parameterization?

  1. The curve must have length at most {2\pi}, since nonexpanding maps do not increase length. But this is not sufficient: an equilateral triangle of sidelength {2\pi/3} has no nonexpanding parameterization, despite its length being {2\pi}.
  2. The curve must have diameter at most 2 (which the triangle in item 1 fails). Indeed, nonexpanding maps do not increase the diameter either. However, {\mathrm{diam}\,\Gamma\le 2} is not sufficient either: an equilateral triangle of sidelength {2} has no nonexpanding parameterization, despite its diameter being 2 (and length {6 < 2\pi}).
  3. The curve must be contained in some closed disk of radius 1. This is not as obvious as the previous two items. We need Kirszbraun’s theorem here: any nonexpanding map {f\colon \mathbb T\to \Gamma} extends to a nonexpanding map {F\colon \mathbb R^2\to\mathbb R^2}, and therefore {f(\mathbb T)} is contained in the closed disk of radius 1 centered at {F(0)}. (This property fails for the triangle in item 2.)

The combination of 1 and 3 (with 2 being superseded by 3) still is not sufficient. A counterexample is given by any polygon that has length {2\pi} but is small enough to fit in a unit disk, for example:

crown
Uneasy lies the head that cannot find a nonexpanding parameterization

Indeed, since the length is exactly {2\pi}, a nonexpanding parameterization must have constant speed 1: the speed cannot exceed 1 anywhere, and a smaller speed on any arc would make the total length less than {2\pi}. But mapping a circular arc onto a line segment with speed 1 increases pairwise Euclidean distances, since we are straightening out the arc.

Since I do not see a way to refine the necessary conditions further, let us turn to the sufficient ones.

  1. It is sufficient for {\Gamma} to be a convex curve contained in the unit disk. Indeed, the nearest-point projection onto a convex set is a nonexpanding map, and projecting the unit circle onto the curve in this way gives the desired parameterization.
  2. It is sufficient for the curve to have length at most 4. Indeed, in this case there exists a parameterization {f\colon \mathbb T\to\Gamma} with {|f'|\le 4/(2\pi) = 2/\pi}. For any {a, b\in \mathbb T} the length of the shorter subarc {\gamma} between {a} and {b} is at most {(\pi/2)|a-b|}. Therefore, the length of {f(\gamma)} is at most {|a-b|}, which implies {|f(a)-f(b)|\le |a-b|}. (The chain of estimates is spelled out after this list.)
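For the record, here is the chain of estimates from item 2 written out. The chord-arc inequality {|a-b| = 2\sin(\ell/2) \ge 2\ell/\pi}, valid when the shorter arc {\gamma} between {a} and {b} has length {\ell\le\pi}, gives {\mathrm{length}\,\gamma \le (\pi/2)|a-b|}, and therefore

{\displaystyle |f(a)-f(b)| \le \mathrm{length}\, f(\gamma) \le \frac{2}{\pi}\,\mathrm{length}\,\gamma \le \frac{2}{\pi}\cdot \frac{\pi}{2}\,|a-b| = |a-b| }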

Clearly, neither of the two sufficient conditions is necessary for the existence of a nonexpanding parameterization. But one can consider “length {\le 2\pi} is necessary, length {\le 4} is sufficient” a reasonable resolution of the problem: up to a constant factor, the length of a curve decides the existence one way or the other.

Stay tuned for a post on noncontracting parameterizations…

Modes of convergence on abstract (and not so abstract) sets

In how many ways can a series of real-valued functions on an abstract set converge? Having no measure on the domain eliminates the infinitude of modes of convergence based on integral norms. I can think of five modes of convergence of {\sum f_n} where {f_n\colon X\to \mathbb R}:

  • (P) Pointwise convergence: {\sum f_n(x)} converges for each {x\in X}.
  • (U) Uniform convergence: the partial sums {S_n} of the series converge to some function {f\colon X\to\mathbb R} uniformly, i.e., {\sup |S_n-f| \to 0}.
  • (PA) Pointwise convergence of absolute values: {\sum |f_n(x)|} converges for each {x\in X}.
  • (UA) Uniform convergence of absolute values: like uniform, but for {\sum |f_n|}.
  • (M) Weierstrass M-test convergence: {\sum \sup |f_n| } converges.

Implications (all easy): (M) implies (UA), which implies both (U) and (PA). Neither (U) nor (PA) implies the other one, but each of them implies (P).

Perhaps (U) and (PA) deserve an illustration, being incomparable. Let {X = [0, 1]}. The constant functions {f_n(x)=(-1)^n/n} form a series that converges uniformly but not in the sense (PA). In the opposite direction, a series of triangles with height 1 and disjoint supports converges (PA) but not (U).


Notably, the sum of the latter series is not a continuous function. This had to happen: by Dini’s theorem, if a series of continuous functions is (PA)-convergent and its sum is continuous, then it is (UA)-convergent. This “self-improving” property of (PA) convergence will come up again in the second part of this post.

From abstract sets to normed spaces

In functional analysis, the elements of a normed space {E} can often be usefully interpreted as functions on the unit ball of the dual space {E^*}. Indeed, each {x\in E} induces {f_x(z) = z(x)} for {z\in E^*}. Applying the aforementioned modes of convergence to {\sum x_n} with {x_n\in E}, we arrive at

  • (P) ⇔ Convergence in the weak topology of E.
  • (U) ⇔ Convergence in the norm topology of E.
  • (PA) ⇔ Unconditional convergence in the weak topology of E.
  • (UA) ⇔ Unconditional convergence in the norm topology of E.
  • (M) ⇔ Absolute convergence, i.e., {\sum \|x_n\|} converges.

The equivalences (P), (U), (M) are straightforward exercises, but the unconditional convergence merits further discussion. For one thing, there are subtly different approaches to formalizing the concept. Following “Normed Linear Spaces” by M. M. Day, let’s say that a series {\sum x_n} is

  • (RC) Reordered convergent if there exists {x} such that {\sum x_{\pi(n)} =x} for every bijection {\pi:\mathbb{N}\to\mathbb{N}}.
  • (UC) Unordered convergent if there exists {x} such that for every neighborhood {U} of {x} there exists a finite set {A\subset \mathbb{N}} with the property that {\sum_{n\in F}x_n\in U} for every finite set {F} containing {A}.
  • (SC) Subseries convergent if for every increasing sequence of integers {(n_k)} the series {\sum x_{n_k}} converges.
  • (BC) Bounded-multiplier convergent if for every bounded sequence of scalars {(a_n)}, the series {\sum a_n x_n} converges.

In a general locally convex space, (BC) ⇒ (SC) ⇒ (UC) ⇔ (RC). The issue with reversing the first two implications is that they involve the existence of a sum for some new series, and if the space lacks completeness, the sum might fail to exist for no good reason. All four properties are equivalent in sequentially complete spaces (those where every Cauchy sequence converges).

Let’s prove the interpretation of (PA) stated above, using the (BC) form of unconditional convergence. Suppose {\sum x_n} converges in the sense (PA), that is, for every linear functional {z} the series {\sum |z(x_n)|} converges. Then it’s clear that {\sum a_n x_n} has the same property for any bounded scalar sequence {(a_n)}. That is, (PA) implies bounded-multiplier convergence in the weak topology. Conversely, suppose {\sum x_n} enjoys weak bounded-multiplier convergence and let {z\in E^*}. Multiplying each {x_n} by a suitable unimodular factor {a_n} we can get {z(a_n x_n) = |z(x_n)|} for all {n}. Now the weak convergence of {\sum a_n x_n} yields the convergence of {\sum |z(x_n)|}; since {z} was arbitrary, this is (PA) convergence.

A theorem of Orlicz, proved in the aforementioned book by Day, says that (SC) convergence in the weak topology of a Banach space is equivalent to (SC) convergence in the norm topology. Thanks to completeness, in the norm topology of a Banach space all forms of unconditional convergence are equivalent. The upshot is that (PA) automatically upgrades to (UA) in the context of the elements of a Banach space being considered as functions on the dual unit ball.

Orthonormal bases formed by translation of sequences

The standard orthonormal basis (ONB) in the Hilbert space {\ell^2(\mathbb N_0)} consists of the vectors
(1, 0, 0, 0, …)
(0, 1, 0, 0, …)
(0, 0, 1, 0, …)

Let S be the forward shift operator: {S(x_0, x_1, \dots) = (0, x_0, x_1, \dots)}. The aforementioned ONB is precisely the orbit of the first basis vector {e_0} under the iteration of S. Are there any other vectors x whose orbit under S is an ONB?

If one tries to construct such x by hand, taking some finite linear combination of {e_k}, this is not going to work. Note first that {c_0\ne 0} is required in any case: otherwise every vector of the orbit has first coordinate 0, so the orbit cannot span the space. Now, if the coefficient sequence has finitely many nonzero terms, then one of them, say {c_m} with {m\ge 1}, is the last one. Then {S^m x} is not orthogonal to {x}, because the inner product is {\overline{c_0} c_m} and that is not zero.

However, such vectors x exist, and arise naturally in complex analysis. Indeed, to a sequence {(c_n)\in\ell^2(\mathbb N_0)} we can associate a function {f(z) = \sum_{n=0}^\infty c_n z^n}. The series converges in the open unit disk to a holomorphic function which, being in the Hardy space {H^2}, has boundary values represented by a square-integrable function on the unit circle {\mathbb T}. The forward shift corresponds to multiplication by {z}. Thus, orthogonality requires that for every {k\in \mathbb N} the function {z^kf} be orthogonal to {f} in {L^2(\mathbb T)}. This means that {|f|^2} is orthogonal to all such {z^k}; and since it’s real, it is orthogonal to {z^k} for all {k\in \mathbb Z\setminus \{0\}} by virtue of conjugation. Conclusion: |f| has to be constant on the boundary; specifically, we need |f|=1 a.e. to have a normalized basis. All the steps can be reversed: |f|=1 is also sufficient for orthogonality.

So, all we need is a holomorphic function f on the unit disk such that almost all boundary values are unimodular and f(0) is nonzero; the latter requirement comes from having to span the entire space. In addition to the constant 1, which yields the canonical ONB, one can use

  • A Möbius transformation {f(z)=(z-a)/(1-\bar a z)} where {a\ne 0}.
  • A product of those (a Blaschke product), which can be infinite if the numbers {a_k} converge to the boundary at a rate sufficient to ensure the convergence of the series {\sum_k (1-|a_k|)}.
  • The function {f(z) = \exp((z+1)/(z-1))}, which is not a Blaschke product (indeed, it has no zeros) yet satisfies {|f(z)|=1} for all {z\in \mathbb T\setminus \{1\}}.
  • Most generally, an inner function: a Blaschke product multiplied by the exponential of an integral of rotated versions of {(z+1)/(z-1)} against a singular measure on the circle.

Arguably the simplest of these is the Möbius transformation with {a=1/2}; expanding it into the Taylor series we get
{\displaystyle -\frac12 +\sum_{n=1}^\infty \frac{3}{2^{n+1}} z^n }
Thus, the second simplest ONB-by-translations after the canonical one consists of
(-1/2, 3/4, 3/8, 3/16, 3/32, 3/64, …)
(0, -1/2, 3/4, 3/8, 3/16, 3/32, …)
(0, 0, -1/2, 3/4, 3/8, 3/16, …)
and so on. Direct verification of the ONB properties is an exercise in summing geometric series.
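Here is a quick numerical sanity check of that exercise (a sketch: the sequences are truncated to finite length, so the Gram matrix is only approximately the identity, with error on the order of the neglected geometric tails):

import numpy as np

N = 60                                        # truncation length, an arbitrary choice
c = np.empty(N)
c[0] = -0.5
c[1:] = 3.0 / 2.0 ** (np.arange(1, N) + 1)    # 3/2^(n+1) for n >= 1

# The first few forward shifts of the truncated coefficient sequence
shifts = np.array([np.roll(c, k) * (np.arange(N) >= k) for k in range(5)])

# Gram matrix of the shifted vectors: should be close to the 5x5 identity
print(np.round(shifts @ shifts.T, 8))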

What about the exponential one? The Taylor series of {\exp((z+1)/(z-1))} begins with
{\displaystyle \frac{1}{e}\left(1 - 2z +\frac23 z^3 + \frac23 z^4 +\frac25 z^5 + \frac{4}{45} z^6 - \frac{10}{63} z^7 - \frac{32}{105}z^8 - \cdots \right) }
I don’t know if these coefficients in parentheses have any significance. Well, perhaps they do: the sum of their squares is {e^2}, since the coefficients of {f} itself satisfy {\sum |c_n|^2 = \|f\|_{H^2}^2 = 1}. But I don’t know anything else about them. For example, are there infinitely many terms of either sign?
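These coefficients can be reproduced with SymPy; a quick sanity check, nothing more:

import sympy as sp

z = sp.symbols('z')
f = sp.exp((z + 1) / (z - 1))
# Taylor expansion around z = 0; the coefficients (each carrying a factor exp(-1))
# agree with the ones listed above
print(sp.series(f, z, 0, 9))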

Geometrically, a Möbius transform corresponds to traversing the boundary circle once, a Blaschke product of degree n means doing it n times, while the exponential function, as well as infinite Blaschke products, manage to map a circle onto itself so that it winds around infinitely many times.

Finally, is there anything like that for the backward shift {S^*(x_0, x_1, x_2, \dots) = (x_1, x_2, \dots)}? The vector {(S^*)^k x} is orthogonal to {x} if and only if {x} is orthogonal to {S^kx}, so the condition for orthogonality is the same as above. But the orbit of any {\ell^2(\mathbb N_0)} vector under {S^*} tends to zero, thus cannot be an orthonormal basis.

Critical points of a cubic spline

The choice of piecewise polynomials of degree 3 for interpolation is justifiably popular: even-degree splines are algebraically awkward to construct, degree 1 is simply piecewise linear interpolation (not smooth), and degree 5, while feasible, entails juggling too many coefficients. Besides, a cubic polynomial minimizes the amount of wiggling (the integral of the squared second derivative) for given values and slopes at the endpoints of an interval. (Recall Connecting dots naturally.)

But the derivative of a cubic spline is a quadratic spline. And one needs the derivative to find the critical points. This results in an awkward example in SciPy documentation, annotated with “(NB: sproot only works for order 3 splines, so we fit an order 4 spline)”.

Although not implemented in SciPy, the task of computing the roots of a quadratic spline is a simple one. Obtaining the roots from the internal representation of a quadratic spline in SciPy (as a linear combination of B-splines) would take some work and reading. But a quadratic polynomial is determined by three values, so sampling it at three points, such as two consecutive knots and their average, is enough.

Quadratic formula with values instead of coefficients

Suppose we know the values of a quadratic polynomial q at -1, 0, 1, and wish to find out whether it has roots between -1 and 1. Let’s normalize so that q(0)=1, and let x = q(-1), y = q(1). If either x or y is negative, there is definitely a root on the interval. If they are positive, there is still a chance: we need the parabola to be concave up, to have its minimum within [-1, 1], and the minimum value to be negative. All of this is easily determined once we note that the coefficients of the polynomial are a = (x+y)/2 - 1, b = (y-x)/2, and c = 1.
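As a code sketch of this observation (the helper name is made up for illustration; the spline routine below does the same thing interval by interval):

import numpy as np

# Given x = q(-1) and y = q(1) for a quadratic q normalized so that q(0) = 1,
# decide whether q has a real root in [-1, 1].
def has_root_between(x, y):
    a, b, c = (x + y) / 2 - 1, (y - x) / 2, 1.0
    r = np.roots([a, b, c])           # np.roots also copes with the degenerate (linear) case
    r = r[np.isreal(r)].real
    return bool(np.any(np.abs(r) <= 1))

print(has_root_between(6.75, 0.25))   # True: these are the values of q(t) = (t - 0.5)(t - 0.8)/0.4
print(has_root_between(3.0, 3.0))     # False: q(t) = 2t^2 + 1 has no real roots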

The inequality {(x-y)^2 \ge 8(x+y-2)} ensures the suitable sign of the discriminant. It describes a parabola with vertex (1, 1) and focus (2, 2), contained in the first quadrant and tangent to the axes at (4, 0) and (0, 4). Within the orange region there are no real roots.

orange
No real roots in the orange region

The line x+y=2, tangent to the parabola at its vertex, separates convex and concave parabolas. While concavity in conjunction with x, y being positive definitely precludes having roots in [-1, 1], slight convexity is not much better: it results in real roots outside of the interval. Here is the complete picture: green means there is a root in [-1, 1], orange means no real roots, red covers the rest.

all_colors
Green = there is a root in the interval [-1, 1]

Back to splines

Since the derivative of a spline is implemented in SciPy (B-splines have a nice formula for derivatives), all we need is a root-finding routine for quadratic splines. Here it is, based on the above observations but using built-in NumPy polynomial solver np.roots to avoid dealing with various special cases for the coefficients.

def quadratic_spline_roots(spl):
    roots = []
    knots = spl.get_knots()
    for a, b in zip(knots[:-1], knots[1:]):
        # Sample the quadratic at the endpoints and the midpoint of the knot interval
        u, v, w = spl(a), spl((a+b)/2), spl(b)
        # In the normalized variable t on [-1, 1] the quadratic is
        # ((u+w)/2 - v) t^2 + ((w-u)/2) t + v; np.roots gets twice these coefficients
        t = np.roots([u+w-2*v, w-u, 2*v])
        # Keep only the real roots that fall inside [-1, 1]
        t = t[np.isreal(t) & (np.abs(t) <= 1)]
        # Map them back from [-1, 1] to [a, b]
        roots.extend(t*(b-a)/2 + (b+a)/2)
    return np.array(roots)

A demonstration that plots the spline (blue), its critical points (red), and the original data points (black):

spline
There can be 0, 1, or 2 critical points between two knots
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import InterpolatedUnivariateSpline

x = np.arange(7)
y = np.array([3, 1, 1, 2, 2, 4, 3])
f = InterpolatedUnivariateSpline(x, y, k=3)
crit_pts = quadratic_spline_roots(f.derivative())

t = np.linspace(x[0], x[-1], 500)
plt.plot(t, f(t))
plt.plot(x, y, 'kd')
plt.plot(crit_pts, f(crit_pts), 'ro')
plt.show()


Quasi-projections and isometries

A typical scenario: given a subset {E} of a metric space {X} and a point {x\in X}, we look for a point {y\in E} that is nearest to {x}: that is, {d(x, y) = \mathrm{dist}\,(x, E)}. Such a point is generally not unique: for example, if {E} is the graph of the cosine function and {x = (\pi, \pi/2)}, then both {(\pi/2, 0)} and {(3\pi/2, 0)} qualify as nearest to {x}. This makes the nearest-point projection onto {E} discontinuous: moving {x} slightly to the left or to the right will make its projection onto {E} jump from one point to another. Not good.

proj1
Discontinuous nearest-point projection

Even when the nearest point projection is well-defined and continuous, it may not be the kind of projection we want. For example, in a finite-dimensional normed space with strictly convex norm we have a continuous nearest-point projection onto any linear subspace, but it is in general a nonlinear map.

Let’s say that {P\colon X\to E} is a quasi-projection if {d(x, P(x)) \le C \mathrm{dist}\,(x, E)} for some constant {C} independent of {x}. Such maps are much easier to construct: indeed, every Lipschitz continuous map {P\colon X\to E} such that {P(x)=x} for {x \in E} is a quasi-projection. For example, one quasi-projection onto the graph of cosine is the map {(x, y)\mapsto (x, \cos x)} shown below.

proj2
Continuous quasi-projection

If {X} is a Banach space and {E} is its subspace, then any idempotent operator with range {E} is a quasi-projection onto {E}. Not every subspace admits such an operator but many do (these are complemented subspaces; they include all subspaces of finite dimension or finite codimension). By replacing “nearest” with “close enough” we gain linearity. And even some subspaces that are not linearly complemented admit a continuous quasi-projection.

Here is a neat fact: if {M} and {N} are subspaces of a Euclidean space and {\dim M = \dim N}, then there exists an isometric quasi-projection of {M} onto {N} with constant {C=\sqrt{2}}. This constant is best possible: for example, an isometry from the {y}-axis onto the {x}-axis has to send {(0, 1)} to one of {(\pm 1, 0)}, thus moving it by distance {\sqrt{2}}.

proj3
An isometry must incur sqrt(2) distance cost

Proof. Let {k} be the common dimension of {M} and {N}. Fix some orthonormal bases in {M} and {N}. In these bases, the orthogonal (nearest-point) projection from {M} to {N} is represented by some {k\times k} matrix {P} of norm at most {1}. We need an orthogonal {k\times k} matrix {Q} such that the map {M\to N} that it defines is a {\sqrt{2}}-quasi-projection. What exactly does this condition mean for {Q}?

Let’s say {x\in M}, {y\in N} is the orthogonal projection of {x} onto {N}, and {z\in N} is where we want to send {x} by an isometry. Our goal is {\|x-z\|\le \sqrt{2}\|x-y\|}, in addition to {\|z\|=\|x\|}. Squaring and expanding inner products yields {2\langle x, y\rangle - \langle x, z \rangle \le \|y\|^2}. Since both {y} and {z} are in {N}, we can replace {x} on the left by its projection {y}. So, the goal simplifies to {\|y\|^2 \le \langle y, z\rangle}. Geometrically, this means placing {z} so that its projection onto the line through {y} lies on the continuation of this line beyond {y}.

So far so good, but the disappearance of {x} from the inequality is disturbing. Let’s bring it back by observing that {\|y\|^2 \le \langle y, z\rangle} is equivalent to {4(\|y\|^2 - \langle y, z\rangle) + \|z\|^2 \le \|x\|^2}, which is simply {\|2y-z\| \le \|x\|}. So that’s what we want to do: map {x} so that the distance from its image to {2y} does not exceed {\|x\|}. In terms of matrices and their operator norm, this means {\|2P-Q\|\le 1}.

It remains to show that every square matrix of norm at most {2} (such as {2P} here) is within distance {1} of some orthogonal matrix. Let {2P = U\Sigma V^T} be the singular value decomposition, with {U, V} orthogonal and {\Sigma} a diagonal matrix with the singular values of {2P} on the diagonal. Since the singular values of {2P} are between {0} and {2}, it follows that {\|\Sigma-I\|\le 1}. Hence {\|2P - UV^T\|\le 1}, and taking {Q=UV^T} concludes the proof.

(The proof is based on a Stack Exchange post by user hypernova.)
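The construction is easy to test numerically. Here is a sketch (the dimensions, the random subspaces, and the tolerance are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 3   # ambient dimension and the common dimension of M and N

# Orthonormal bases of two random k-dimensional subspaces of R^d
A = np.linalg.qr(rng.standard_normal((d, k)))[0]   # columns span M
B = np.linalg.qr(rng.standard_normal((d, k)))[0]   # columns span N

# Matrix of the orthogonal projection M -> N in these bases; its norm is at most 1
P = B.T @ A

# Round 2P to an orthogonal matrix via the singular value decomposition: Q = U V^T
U, s, Vt = np.linalg.svd(2 * P)
Q = U @ Vt
print(np.linalg.norm(2 * P - Q, 2))   # at most 1, as in the proof

# Check the sqrt(2)-quasi-projection property on random vectors of M
for _ in range(1000):
    c = rng.standard_normal(k)
    x = A @ c                  # a point of M
    z = B @ (Q @ c)            # its image in N under the isometry
    y = B @ (B.T @ x)          # the nearest point of N
    assert np.linalg.norm(x - z) <= np.sqrt(2) * np.linalg.norm(x - y) + 1e-12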

The sum of pairwise distances and the square of CDF

Suppose we have {n} real numbers {x_0,\dots, x_{n-1}} and want to find the sum of all distances {|x_j-x_k|} over {j < k}. Why? Maybe because, over five years ago, the gradient flow of this quantity was used for “clustering by collision” (part 1, part 2, part 3).

If I have a Python console open, the problem appears to be solved with one line:

>>> 0.5 * np.abs(np.subtract.outer(x, x)).sum()

where the outer difference of x with x creates a matrix of all differences {x_i-x_j}, then absolute values are taken, and then they are all added up. Double-counted, hence the factor of 0.5.

But trying this with, say, one million numbers is not likely to work. If each number takes 8 bytes of memory (64 bits, double precision), then the array x is still pretty small (under 8 MB) but a million-by-million matrix will require over 7 terabytes, and I won’t have that kind of RAM anytime soon.

In principle, one could run a loop adding these values, or store the matrix on a hard drive. Both are going to take forever.

There is a much better way, though. First, sort the numbers in nondecreasing order; this does not require much time or memory (compared to quadratic memory cost of forming a matrix). Then consider the partial sums {s_k = x_0+\dots+x_k}; the cost of computing them is linear in time and memory. For each fixed {k}, the sum of distances to {x_j} with {j<k} is simply {kx_k - s_{k-1}}, or, equivalently, {(k+1)x_k - s_k}. So, all we have to do is add these up. Still one line of code (after sorting), but a much faster one:

>>> x.sort()
>>> (np.arange(1, n+1)*x - np.cumsum(x)).sum()

For example, x could be a sample from some continuous distribution. Assuming the distribution has a mean (i.e., is not too heavy tailed), the sum of all pairwise distances grows quadratically with n, and its average approaches a finite limit. For the uniform distribution on [0, 1] the computation shows this limit is 1/3. For the standard normal distribution it is 1.128… which is not as recognizable a number.
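A quick empirical check of the fast formula and of these averages (a sketch; the sample size and the seed are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n = 10**6

for name, sample in [("uniform on [0, 1]", rng.random(n)),
                     ("standard normal", rng.standard_normal(n))]:
    x = np.sort(sample)
    total = (np.arange(1, n + 1) * x - np.cumsum(x)).sum()
    print(name, total / (n * (n - 1) / 2))   # about 0.3333 and 1.128, respectively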


As {n\to \infty}, the average distance of a sample taken from a distribution converges to the expected value of |X-Y| where X, Y are two independent variables with that distribution. Let’s express this in terms of the probability density function {p} and the cumulative distribution function {\Phi}. By symmetry, we can integrate over {x> y} and double the result:

{\displaystyle \frac12 E|X-Y| = \int_{-\infty}^\infty p(x)\,dx \int_{-\infty}^x (x-y) p(y)\,dy}

Integrate by parts in the second integral: {p(y) = \Phi'(y)}, and the boundary terms are zero.

{\displaystyle \frac12 E|X-Y| = \int_{-\infty}^\infty p(x)\,dx \int_{-\infty}^x \Phi(y)\,dy}

Integrate by parts in the other integral, throwing the derivative onto the indefinite integral and thus eliminating it. There is a boundary term this time.

{\displaystyle \frac12 E|X-Y| = \Phi(\infty) \int_{-\infty}^\infty \Phi(y)\,dy - \int_{-\infty}^\infty \Phi(x)^2\,dx}

Since {\Phi(\infty) = 1}, this simplifies nicely:

{\displaystyle \frac12 E|X-Y| = \int_{-\infty}^\infty \Phi(x) (1-\Phi(x))\,dx}

This is a lot neater than I expected: {E|X-Y|} is simply the integral of {2\Phi(1-\Phi)}. I don’t often see CDF squared, like here. Some examples: for the uniform distribution on [0,1] we get

{\displaystyle E|X-Y| = \int_0^1 2x(1-x)\,dx = \frac13}

and for the standard normal, with {\Phi(x) = (1+\mathrm{erf}\,(x/\sqrt{2}))/2}, it is

{\displaystyle \int_{-\infty}^\infty \frac12 \left(1-\mathrm{erf}\,(x/\sqrt{2}) ^2 \right)\,dx = \frac{2}{\sqrt{\pi}}\approx 1.12838\ldots }
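This integral is also easy to check numerically (a sketch using scipy.stats for the normal CDF and scipy.integrate.quad):

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Integrate 2*Phi(x)*(1 - Phi(x)) over the real line, Phi being the standard normal CDF
val, err = quad(lambda t: 2 * norm.cdf(t) * (1 - norm.cdf(t)), -np.inf, np.inf)
print(val, 2 / np.sqrt(np.pi))   # both are about 1.1284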


The trick with sorting and cumulative sums can also be used to find, for every point {x_k}, the sum (or average) of distances to all other points. To do this, we keep the sums for different {k} separate instead of adding them up, and for each {k} we also add {|x_j-x_k|} over {j>k}. The latter sum is simply {S - s_k - (n-k-1)x_k} where {S} is the total sum. So, all we need is

>>> (2*np.arange(1,n+1)-n)*x - 2*np.cumsum(x) + x.sum()
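The one-liner assumes, as before, that x is sorted and that n = len(x). A quick cross-check against the brute force on a small sample:

import numpy as np

x = np.sort(np.random.default_rng(1).random(7))
n = len(x)
fast = (2*np.arange(1, n+1) - n)*x - 2*np.cumsum(x) + x.sum()
brute = np.abs(np.subtract.outer(x, x)).sum(axis=1)
print(np.allclose(fast, brute))   # True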

Unfortunately, the analogous problems for vector-valued sequences are not as easy. If the Manhattan metric is used, we can do the computations for each coordinate separately, and add the results. For the Euclidean metric…