Bourgain’s embedding theorem

Let X be a finite metric space, say of cardinality n. For n\ge 4 we cannot expect to have an isometric embedding of X into a Euclidean space; the 4-cycle with path metric cannot be so embedded. So we ask for an embedding f\colon X\to \mathbb R^n with reasonable distortion D(f). The distortion is defined as the product of the Lipschitz norms of f and of its inverse,

\displaystyle D(f)=\mathrm{Lip}(f)\,\mathrm{Lip}(f^{-1}):=\sup_{a\ne b}\frac{|f(a)-f(b)|}{d_X(a,b)} \cdot \sup_{a\ne b}\frac{d_X(a,b)}{|f(a)-f(b)|}

So, D(f)=1 if and only if f scales all distances by the same factor (which we can then make equal to 1 by rescaling the target space).
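As a quick sanity check, the distortion of a concrete map between finite metric spaces can be computed straight from this definition. Here is a small Python sketch (the function names are mine, purely illustrative):

```python
from itertools import combinations

def distortion(points, d, f, norm):
    """D(f) = Lip(f) * Lip(f^{-1}), each Lipschitz constant a sup over pairs."""
    lip = max(norm(f(a), f(b)) / d(a, b) for a, b in combinations(points, 2))
    lip_inv = max(d(a, b) / norm(f(a), f(b)) for a, b in combinations(points, 2))
    return lip * lip_inv

# Example: the 4-cycle with path metric, mapped onto the unit square's vertices.
cycle = [0, 1, 2, 3]
d = lambda a, b: min(abs(a - b), 4 - abs(a - b))          # path metric
square = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}     # vertices of a square
euclid = lambda u, v: sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5
print(distortion(cycle, d, square.get, euclid))  # sqrt(2) ~ 1.414
```

Note that edges of the cycle keep their length 1 while the diagonals shrink from 2 to \sqrt{2}, which is exactly where the \sqrt{2} comes from.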

It is easy to embed X isometrically into the normed vector space \ell_{\infty}(X), which consists of all functions X\to\mathbb R with the maximum norm. This (Kuratowski) embedding simply sends each point p\in X to the function \rho_p(x)=d(x,p). It is easy to see that \max_x |\rho_p(x)-\rho_q(x)|=d(p,q), with the maximum attained at x\in \lbrace p,q\rbrace. The identity map from \ell_{\infty}(X) to \ell_{2}(X) has distortion \sqrt{n}; composing with it gives an embedding of X into Euclidean space with D(f)\le\sqrt{n}, and this is the price we have to pay.
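The isometry claim is easy to test numerically. A minimal sketch (variable names are mine), again on the 4-cycle:

```python
def kuratowski(points, d):
    """Map each p to the vector (d(x, p))_{x in points}: an isometry into l_inf."""
    return {p: [d(x, p) for x in points] for p in points}

cycle = [0, 1, 2, 3]
d = lambda a, b: min(abs(a - b), 4 - abs(a - b))  # path metric on the 4-cycle
rho = kuratowski(cycle, d)
linf = lambda u, v: max(abs(x - y) for x, y in zip(u, v))
# The embedding is isometric: max_x |d(x,p) - d(x,q)| = d(p,q),
# with the maximum attained at x = p or x = q.
assert all(linf(rho[p], rho[q]) == d(p, q) for p in cycle for q in cycle)
```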

In 1985 Bourgain achieved a dramatic improvement: D(f) \le C\log n, which is the best possible bound apart from the value of the absolute constant C. (It can be taken to be 128.) The starting point is remarkably simple: instead of \rho_p(x)=d(x,p), which measures the distances of p to other points, Bourgain used \rho_p(S)=d(p,S) where S runs over all nonempty proper subsets of X. The fact that the target space now has 2^n-2 dimensions is of no theoretical consequence, since the image of X is always contained in an (n-1)-dimensional affine subspace. (From the practical viewpoint, even dimension n is much too high, but the Johnson-Lindenstrauss flattening lemma reduces it greatly.) We can hope to improve upon the Kuratowski embedding because the geometry of X is now measured by more sophisticated probes than singletons.

As before, the map p\mapsto \rho_p is easily seen to be an isometric embedding into \ell_{\infty}(P^X), where P^X is the set of all nonempty proper subsets of X. But using the identity map onto \ell_{2}(P^X) is out of the question: its Lipschitz constant is almost 2^{n/2}. Instead, Bourgain applies the diagonal operator T\colon \ell_{\infty}(P^X)\to \ell_{2}(P^X) defined by

\displaystyle T(e_S) := \frac{1}{\sqrt{|S|}} \binom{n}{|S|}^{-1/2}e_S

To understand this better, write \ell_{\infty}(P^X) as the direct sum of spaces \ell_{\infty}(P_k^X) where P_k^X is the family of all k-point subsets of X. The dimension of \ell_{\infty}(P_k^X) is exactly \binom{n}{k}, and the square root appears because we are after the \ell_2 norm.

But it would be better to say that \sqrt{\cdot} appears because \ell_2 is halfway between \ell_\infty and \ell_1. Indeed, Bourgain applies T twice: first as a map \ell_{\infty}(P^X)\to \ell_{2}(P^X) and then as \ell_{2}(P^X)\to \ell_{1}(P^X). The point is that \|T^2\|_{\ell_\infty\to\ell_1} is easy to estimate: decomposing P^X by cardinality as above, we find that the \ell_\infty\to\ell_1 norm of T^2 on k-subsets is at most 1/k, due to the strategically placed factor 1/{\sqrt{|S|}}. Unfortunately, the harmonic series diverges; fortunately, it diverges slowly, yielding the estimate \|T^2\|_{\ell_\infty\to\ell_1}\le \sum_{k=1}^{n-1} \frac{1}{k}\le 1+\log n. This is where the logarithm comes from.
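The block-by-block norm computation can be checked numerically: on the k-subset block, T^2 multiplies each of the \binom{n}{k} coordinates by \frac{1}{k}\binom{n}{k}^{-1}, and the \ell_\infty\to\ell_1 norm of a nonnegative diagonal operator is the sum of its diagonal entries. A short sketch (my own variable names):

```python
from math import comb, log

n = 7
# On the block of k-subsets, T^2 scales each of the C(n,k) coordinates
# by (1/k) * C(n,k)^{-1}; the l_inf -> l_1 norm of such a diagonal map
# is the sum of its entries, which is exactly 1/k.
block_norms = [comb(n, k) * (1 / (k * comb(n, k))) for k in range(1, n)]
total = sum(block_norms)  # the harmonic sum H_{n-1}
assert all(abs(block_norms[k - 1] - 1 / k) < 1e-12 for k in range(1, n))
assert total <= 1 + log(n)  # H_{n-1} <= 1 + log n
```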

The hard part of the proof is the lower bound \|T^2(\rho_p)-T^2(\rho_q)\|_{\ell_1} \ge c\, d(p,q) with universal c. I will not say anything about that.

Let’s test this on the 4-cycle with the path metric. Split \ell_{\infty}(P^X) by subset cardinality k=1,2,3, as before. Let a,b,c,d denote the vertices. The image of the vertex a is:

  • in \ell_{\infty}(P_1^X): (0,1,2,1): this is the Kuratowski embedding. The singletons are ordered a,b,c,d.
  • in \ell_{\infty}(P_2^X): (0,0,0,1,1,1) where the sets are ordered ab,ac,ad,bc,bd,cd.
  • in \ell_{\infty}(P_3^X): (0,0,0,1). Sets are ordered abc,abd,acd,bcd.

Now we apply T, which means multiplying the coordinates coming from k-point subsets by \frac{1}{\sqrt{k}} \binom{4}{k}^{-1/2}: by \frac{1}{2} for k=1 and by \frac{1}{\sqrt{12}} for both k=2 and k=3. The resulting vector

\displaystyle \left(0,\frac{1}{2},1,\frac{1}{2},0,0,0,\frac{1}{\sqrt{12}},\frac{1}{\sqrt{12}},\frac{1}{\sqrt{12}},0,0,0,\frac{1}{\sqrt{12}}\right)

represents the vertex a. In much the same way, the vertices b and c are mapped to

\displaystyle \left(\frac{1}{2},0,\frac{1}{2},1,0,\frac{1}{\sqrt{12}},\frac{1}{\sqrt{12}},0,0,\frac{1}{\sqrt{12}},0,0,\frac{1}{\sqrt{12}},0\right)

\displaystyle \left(1,\frac{1}{2},0,\frac{1}{2},\frac{1}{\sqrt{12}},0,\frac{1}{\sqrt{12}},0,\frac{1}{\sqrt{12}},0,0,\frac{1}{\sqrt{12}},0,0\right)

respectively. And so, |f(a)-f(b)|=\sqrt{6}/2\approx 1.22 while |f(a)-f(c)|=\sqrt{10}/2\approx 1.58, to be compared with d_X(a,b)=1 and d_X(a,c)=2. The distortion D(f)=\sqrt{12/5}\approx 1.55 is only slightly worse than for the optimal embedding of X (onto the vertices of a square), which has D(f)=\sqrt{2}\approx 1.41. Well, we could hardly expect a general algorithm like this to find the absolutely best way to embed each finite metric space.
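For the skeptical reader, the whole computation can be scripted. Here is a Python sketch (my code, not part of the original argument) that builds T\circ\rho for the 4-cycle and evaluates the distortion directly:

```python
from itertools import combinations
from math import comb, sqrt

cycle = [0, 1, 2, 3]  # vertices a, b, c, d
n = len(cycle)
d = lambda a, b: min(abs(a - b), n - abs(a - b))  # path metric

# All nonempty proper subsets, grouped by cardinality so that the
# coordinates line up across vertices.
subsets = [S for k in range(1, n) for S in combinations(cycle, k)]

def embed(p):
    """Coordinate for subset S: d(p,S) scaled by (1/sqrt(|S|)) * C(n,|S|)^{-1/2}."""
    return [min(d(p, x) for x in S) / sqrt(len(S) * comb(n, len(S)))
            for S in subsets]

f = {p: embed(p) for p in cycle}
dist = lambda u, v: sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

ratios = [dist(f[p], f[q]) / d(p, q) for p, q in combinations(cycle, 2)]
D = max(ratios) / min(ratios)  # Lip(f) * Lip(f^{-1})
print(dist(f[0], f[1]), dist(f[0], f[2]), D)  # ~ 1.2247 1.5811 1.5492
```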

Apart from the constant, Bourgain’s theorem cannot be improved for general metric spaces. However, there are better embedding results for special classes of metric spaces (doubling spaces, ultrametrics, etc.).