Graphical embedding

This post continues the theme of operating with functions using their graphs. Given an integrable function {f} on the interval {[0,1]}, consider the region {R_f} bounded by the graph {y=f(x)}, the axis {y=0}, and the vertical lines {x=0}, {x=1}.

Total area under and over the graph is the L1 norm
Total area under and over the graph is the L1 norm

The area of {R_f} is exactly {\int_0^1 |f(x)|\,dx}, the {L^1} norm of {f}. On the other hand, the area of a set is the integral of its characteristic function,

\displaystyle    \chi_f = \begin{cases}1, \quad x\in R_f, \\ 0,\quad x\notin R_f \end{cases}

So, the correspondence {f\mapsto \chi_f } is a map from the space of integrable functions on {[0,1]}, denoted {L^1([0,1])}, to the space of integrable functions on the plane, denoted {L^1(\mathbb R^2)}. The above shows that this correspondence is norm-preserving. It also preserves the metric, because integration of {|\chi_f-\chi_g|} gives the area of the symmetric difference {R_f\triangle R_g}, which in turn is equal to {\int_0^1 |f-g| }. In symbols:

\displaystyle    \|\chi_f-\chi_g\|_{L^1} = \int |\chi_f-\chi_g| = \int |f-g| = \|f-g\|_{L^1}

Distance between two functions in terms of their graphs
Distance between two functions in terms of their graphs

The map {f\mapsto \chi_f} is nonlinear: for example {2f} is not mapped to {2 \chi_f} (the function that is equal to 2 on the same region) but rather to a function that is equal to 1 on a larger region.

So far, this nonlinear embedding did not really offer anything new: from one {L^1} space we got into another. It is more interesting (and more difficult) to embed things into a Hilbert space such as {L^2(\mathbb R^2)}. But for the functions that take only the values {0,1,-1}, the {L^2} norm is exactly the square root of the {L^1} norm. Therefore,

\displaystyle    \|\chi_f-\chi_g\|_{L^2} = \sqrt{\int |\chi_f-\chi_g|^2} =    \sqrt{\int |\chi_f-\chi_g|} = \sqrt{\|f-g\|_{L^1}}

In other words, raising the {L^1} metric to power {1/2} creates a metric space that is isometric to a subset of a Hilbert space. The exponent {1/2} is sharp: there is no such embedding for the metric {d(f,g)=\|f-g\|_{L^1}^{\alpha} } with {\alpha>1/2}. The reason is that {L^1}, having the Manhattan metric, contains geodesic squares: 4-cycles where the distances between adjacent vertices are 1 and the diagonal distances are equal to 2. Having such long diagonals is inconsistent with the parallelogram law in Hilbert spaces. Taking the square root reduces the diagonals to {\sqrt{2}}, which is the length they would have in a Hilbert space.

This embedding, and much more, can be found in the ICM 2010 talk by Assaf Naor.

Polygonal inequalities: beyond the triangle

(Related to previous post but can be read independently). The triangle inequality, one of the axioms of a metric space, can be visualized by coloring the vertices of a triangle by red and blue.

Triangle with colored vertices
Colored Triangle

The inequality says that the sum of monochromatic distances must not exceed the sum of dichromatic distances. That is, for every assignment of the vertices to points in the space, the sum of all red-red and blue-blue distances does not exceed the sum of red-blue distances. An assignment is just a map from the set of vertices {V} to the metric space {X}; it need not be injective.

But why stop at the triangle? We can take any number of points (vertices), color them in some way, and require the same polygonal inequality:

\displaystyle \text{monochromatic } \le \text{ dichromatic}

Already for the square we have two distinct plausible colorings to explore: even red-blue split

Evenly colored square
Evenly Colored Square

and predominantly red square

(mostly) Red Square
(mostly) Red Square

But it turns out that the second coloring is useless: the inequality it defines fails in every metric space with more than one point. More generally, suppose we have {R} red points and {B} blue ones, and {R- B\ge 2}. Pick two distinct points {a,b\in X}. Assign one red point to {a} and all others to {b}. The sum of monochromatic distances is {(R-1)\,d(a,b)} while the sum of dichromatic distances is {B\,d(a,b)}, which is strictly less.

So, we are limited to nearly-even colorings: those with {|R-B|\le 1}. For even numbers of vertices this means even split, while odd numbers should be divided as evenly as possible: like 3+2 for the pentagon.

Here is the pentagram again.
Colored Pentagon

The inequalities turn out to be related. For every {n}, the {n}-gonal inequality implies the {(n-2)}-gonal inequality, because if we assign two opposite-colored vertices to the same point, their contributions cancel out. More interestingly: when {n} is odd, the {n}-gonal inequality implies the {(n-1)}-gonal inequality. Indeed, suppose we have {(n-1)} points, evenly colored. Add an arbitrary {n}th point. Whether the added point is blue or red, the {n}-gonal inequality holds. Averaging these two inequalities, we see the effect of the added points canceling out.

So, if the {n}-gonal inequality holds for all odd {n}, it holds for all {n}. This property is exactly the hypermetric property from the previous post. Except it was stated there in a different form:

\displaystyle  \sum_{i,j}b_i b_j d(x_i , x_j ) \le 0

for every finite sequence of points {x_i\in X} and every choice of integers {b_i} such that {\sum_i b_i=1}. But if the point {x_i} is repeated {|b_i|} times, we can replace {b_i} by {\mathrm{sign}\,b_i}. Then represent +1 as blue and -1 as red.

The hypermetric inequalities were introduced by John B. Kelly (a student of William Ted Martin) in late 1960s. He showed they are necessary for the space to be embeddable into the space {\ell^1}. It would be great if they were also sufficient (and for some classes of spaces they are), but this is not so: a counterexample was given by Patrice Assouad in 1977.

It is also interesting to consider the {n}-gonal inequalities for even {n} only. By repetition of vertices, they are equivalent to requiring

\displaystyle  \sum_{i,j}b_i b_j d(x_i , x_j ) \le 0 \quad\quad \quad (1)

for every finite sequence of points {x_i\in X} and every choice of integers {b_i} such that {\sum_i b_i=0}. But of course then we have (1) for rational {b_i} (just clear the denominators), hence for all real {b_i} (by approximation), as long as they sum to {0}. So, the requirement amounts to the matrix {(d(x_i,d_j))} being negative semidefinite on the subspace {\sum x_i=0}. Such metrics are called metrics of negative type.

Their relation to embeddability of the space is well-known: {(X,d)} is of negative type if and only if the “snowflake” {(X,\sqrt{d})} isometrically embeds into a Hilbert space. In other words, we can “draw” any finite metric space of negative type in a Euclidean space, with the understanding that Euclidean distances represent the square roots of the actual distances. This embedding result is a 1935 theorem of Isaac Schoenberg who is also known for connecting dots naturally (introducing splines).

Pentagrams and hypermetrics

The Wikipedia article Metric (mathematics) offers a plenty of flavors of metrics, from common to obscure: ultrametric, pseudometric, quasimetric, semimetric, premetric, hemimetric and pseudoquasimetric (I kid you not).

One flavor it does not mention is a hypermetric. This is a metric {d} on a set {X} such that the inequality

\displaystyle  \sum_{i,j}b_i b_j d(x_i , x_j ) \le 0 \ \ \ \ \ \ \ \ \ (1)

holds for every finite sequence of points {x_i\in X} and every choice of integers {b_i} such that {\sum_i b_i=1}. The requirement that {b_i} be integers gives some combinatorial meaning to (1); this is not just some quadratic form being negative semidefinite.

As a warm-up, observe that (1) contains in it the triangle inequality: with {b_1=b_2=1} and {b_3=-1} we get {d(x_1,x_2)-d(x_1,x_3)-d(x_2,x_3)\le 0}. But it appears that (1) says something about “polygons” with more vertices too.

To make (1) worth thinking about, it should be satisfied by some important metric space. Such as the real line {\mathbb R}, for example. It is not quite obvious that the inequality

\displaystyle  \sum_{i,j}b_i b_j |a_i - a_j| \le 0 \ \ \ \ \ \ \ \ \ (2)

holds for all reals {a_i} and all integers {b_i} adding up to {1}. It helps to order the numbers: {a_1\le \dots\le a_m} and focus on the contribution of a particular gap {[a_k,a_{k+1}]} to the sum (2). The amount it contributes is {|a_k-a_{k+1}|} multiplied by

\displaystyle    \sum_{i\le k<j} b_i b_j = \left(\sum_{i\le k} b_i \right) \left(\sum_{j > k} b_j \right) = \left(\sum_{i\le k} b_i \right) \left(1-\sum_{i\le k} b_i \right) \le 0

because {n(1-n)\le 0} for every integer {n}. This proves (2).

Now that we have one hypermetric space, {\mathbb R}, other such spaces can be created easily. If {X} is any set and {f \colon X\rightarrow\mathbb R} any function, consider {d(x,y) = |f(x)-f(y)|}, the pullback pseudometric on {X}. By applying (2) to the numbers {f(x_i)}, we see that {d} satisfies the hypermetric inequality. Since (1) is additive in {d}, we can take any family of functions {f_\alpha \colon X\rightarrow\mathbb R} and add together the corresponding pseudometrics. Or even integrate them against a positive measure: {d(x,y)=\int |f_\alpha(x)-f_\alpha(y)|\,d\mu(\alpha)}.

For example, the plane {\mathbb R^2} is a hypermetric space, because the distance between two points {(x_1,y_1)} and {(x_2,y_2)}, besides the familiar form

\displaystyle  \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}

can also be represented as an integral of the aforementioned pullbacks:

\displaystyle  \frac12 \int_0^\pi \big| (x_1-x_2)\cos \alpha + (y_1-y_2) \sin\alpha \big| \,d\alpha

A similar integral representation holds in all dimensions; thus, all Euclidean spaces are hypermetric.

Okay, what is not a hypermetric then? For example, the cube distance induced by the norm {\|x\|_\infty=\max |x_i|} is not, in dimensions 3 and higher. Specifically, (1) fails as the five-point inequality with {(b_1,\dots,b_5) =(1,1,1,-1,-1)}. I’ll call it the pentagram inequality:

also known as K_5 in mystical literature
also known as K_5 in mystical literature

It says that for any five points in the space the sum of monochromatic distances does not exceed the sum of all bi-chromatic (red-blue) distances.

The pentagram inequality fails when {x_1,\dots,x_5} are the columns of the matrix

\displaystyle  \begin{pmatrix}    1& 1& 1& 2& 0\\    0& 2& 2& 1& 1\\    0& 1& 0& 1& 0\\    \end{pmatrix}

(first three columns blue, the last two red). Indeed, the sum of monochromatic distances is {2+1+2+2=7} while the sum of bichromatic distances is {1+1+1+1+1+1=6}.

If the above example does not look conceptual enough, it’s because I found it via computer search. I don’t have much intuition for the pentagram inequality.

Anyway, the example delivers another proof that taking the maximum of three numbers is hard. More precisely, there is no isometric embedding of {\mathbb R^3} with the maximum metric into {\ell_1}. Unlike the earlier proof, this one does not assume the embedding is linear.

A good reference for hypermetric inequalities is the book Geometry of cuts and metrics by Deza and Laurent.

Maximum of three numbers: it’s harder than it sounds

This simple identity hold for any two real numbers {x,y}:

\displaystyle   \max(|x|,|y|) = \frac12\,(|x+y|+|x-y|)   \ \ \ \ \  \ \ \ \ \ (1)

Indeed, if {|x|} realizes the maximum, then both {x+y} and {x-y} have the same sign as {x}. After opening the absolute value signs, we get {y} to cancel out.

So, (1) represents {\max(|x|,|y|)}, also known as the {\ell_\infty} norm, as the sum of absolute values of linear functions. Let’s try the same with {\max(|x|,|y|,|z|)}. Since the right hand side of (1) is just the average of {|\pm x \pm y|} over all possible choices of {\pm } signs, the natural thing to do is to average {|\pm x \pm y \pm z|} over all eight choices. The sign in front of {x} can be taken to be {+}, which simplifies the average to

\displaystyle    \frac14\,(|x+y+z|+|x+y-z|+|x-y+z|+|x-y-z|)    \ \ \ \ \  \ \ \ \ \ (2)

Does this formula give {\max(|x|,|y|,|z|)}? Instead of trying random numbers, let’s just plot the unit ball for the norm given by (2). If the identity works, it will be a cube. I used Maple:

with(plots): f:=(abs(x+y+z)+abs(x+y-z)+abs(x-y-z)+abs(x-y+z))/4:
My precious!
My precious!

Close, but no cube. Luckily, this is my favorite Archimedean solid, the cuboctahedron.

Although all terms of (2) look exactly the same, the resulting shape has both triangular and square faces. Where does the difference of shapes come from?

More importantly, is (2) really the best we can do? Is there some other sum of moduli of linear functions that will produce {\max(|x|,|y|,|z|)}?

— No.

Even if negative coefficients are allowed?

— Even then. (But you can come arbitrarily close.)

What if we allow integrals with respect to an arbitrary (signed) measure, as in

\displaystyle    \iiint |\alpha x+\beta y+\gamma z|\,d \mu(\alpha, \beta, \gamma)    \ \ \ \ \  \ \ \ \ \ (3)

— Still no. But if {\mu} is allowed to be a distribution of higher order (an object more singular than a measure), then a representation exists (W. Weil, 1976). Yes, one needs the theory of distributions to write the maximum of three numbers as a combination of linear functions.

I’ll only prove that there is no identity of the form

\displaystyle  \max(|x|,|y|,|z|) = \sum_{i=1}^N |\alpha_i x+\beta_i y+ \gamma_i z|

Indeed, such an identity amounts to having an isometric embedding {T\colon \ell_\infty^3 \rightarrow \ell_1^N}. The adjoint operator {T^* \colon \ell_\infty^N \rightarrow \ell_1^3} is a submetry meaning that it maps the unit ball of {\ell_\infty^N } onto the unit ball {\ell_1^3}. The unit ball of {\ell_\infty^N } is just a cube; all of its faces are centrally symmetric, and this symmetry is preserved by linear maps. But {\ell_1^3} is an octahedron, with triangular faces. A contradiction. {\ \Box}

An aside: what if instead of averaging {|\pm x \pm y|} over all {\pm } choices (i.e., unimodular real coefficients) we take the average over all unimodular complex coefficients? This amounts to {\|(x,y)\| = \frac{1}{2\pi} \int_0^{2\pi} |x+e^{it}y|\,dt}. I expected something nice from this norm, but

Complex averaging in real space
Complex averaging in real space

it’s a strange shape whose equation involves the complete elliptic integral of second kind. Yuck.

Dunce hats in assorted dimensions

The standard n-simplex is the set \Delta^n=\{x\in\mathbb R^{n+1}\colon \forall i\ x_i\ge 0,\ \sum_{i=1}^{n+1} x_i=1\}. Let’s call two vectors (x_1,\dots,x_{n+1}) and (x_1',\dots,x_{n+1}') equivalent if they become equal after all zero entries are deleted. For example, (1/3,0,2/3,0,)\sim (0,1/3,2/3,0). The axioms of the equivalence relation are obviously satisfied. The quotient space \Delta^n/\sim is a compact topological space. How does it look?

When n=1, the simplex is a line segment, and the equivalence relation glues its endpoints together. Thus, \Delta^1/\sim is the circle.

When n=2, the simplex is a triangle. Gluing two sides that share the vertex (1,0,0) creates a cone, such as on the cover of Fastball debut album. The remaining identification takes an effort to carry out: (1) bend the cone so that the vertex touches the base, making the previous glue line into a circle. (2) glue this circle to the base of the cone. The result is a space in which all three former edges become a single circle; clearly this is not a manifold. It is the topological dunce hat. Despite the difficulty of twisting and gluing, one can realize this space as a subset of \mathbb R^3.

Making a dunce hat, by Claudio Rocchini. From

When n\ge 3, I don’t know if \Delta^n/\sim is homeomorphic to a subset of \mathbb R^{n+1}. Actually, one should ask for bi-Lipschitz equivalence, since \Delta^n/\sim carries a natural quotient metric. The question stems from a 1931 paper by Borsuk and Ulam. The term “higher dimensional dunce hat” was introduced by Andersen, Marjanović and Schori, although they apply it only to even-dimensional hats. They prove that \Delta^n/\sim is contractible when n is even, and has the homotopy type of S^{n} when n is odd. (This does not imply that \Delta^3/\sim is homeomorphic to S^3: the Poincaré conjecture concerns manifolds, and \Delta^n/\sim is not a manifold when n\ge 2).

Zeeman conjectured that the product of any contractible 2-complex (such as \Delta^2/\sim) with [0,1] is collapsible. This statement implies the Poincaré conjecture, but … oh well, so much for this chance to make your mama proud.

Bourgain’s embedding theorem

Let X be a finite metric space, say of cardinality n. For n\ge 4 we cannot expect to have an isometric embedding of X into a Euclidean space; the 4-cycle with path metric cannot be so embedded. So we ask for an embedding f\colon X\to \mathbb R^n with reasonable distortion D(f). The distortion is defined as the product of the Lipschitz norms of f and of its inverse,

\displaystyle D=\mathrm{Lip}(f)\mathrm{Lip}(f^{-1}):=\sup_{a\ne b}\frac{|f(a)-f(b)|}{d_X(a,b)} \cdot \frac{d_X(a,b)}{|f(a)-f(b)|}

So, D(f)=1 if and only if f scales all distances by the same factor (which we can then make equal to 1 by rescaling the target space).

It is easy to isometrically embed X into the normed vector space \ell_{\infty}(X) which consists of all functions X\to\mathbb R with the maximum norm. This (Kuratowski) embedding simply sends each point p\in X into the function \rho_p(x)=d(x,p). It is easy to see that \max |\rho_p-\rho_q|=d(p,q), attained when x\in \lbrace p,q\rbrace. The identity map from \ell_{\infty}(X) to \ell_{2}(X) has distortion \sqrt{n}, and this is the price D(f) we have to pay.

In 1985 Bourgain achieved a dramatic improvement: D(f) \le C\log n, which is the best possible bound apart from the value of the absolute constant C. (It can be taken to be 128.) The starting point is remarkably simple: instead of \rho_p(x)=d(x,p), which measures the distances of p to other points, Bourgain used \rho_p(S)=d(p,S) where S runs over all nonempty proper subsets of X. The fact that the target space now has 2^n-2 dimensions is of no theoretical consequence, since the image of X is always contained in an (n-1) dimensional hyperplane. (From the practical viewpoint, even dimension n is much too high, but the Johnson-Lindenstrauss flattening lemma reduces it greatly.) We can hope to improve upon the Kuratowski embedding because the geometry of X is now measured by more sophisticated probes than singletons.

As before, the map p\mapsto \rho_p is easily seen to be an isometric embedding into \ell_{\infty}(P^X) where P^X is the set of all nonempty proper subsets of X. But using the identity map onto \ell_{2}(P^X) is out of question: its Lipschitz constant is almost 2^{n/2}. Instead, Bourgain applies the diagonal operator T\colon \ell_{\infty}(P^X)\to \ell_{2}(P^X) defined by

\displaystyle T(e_S) := \frac{1}{\sqrt{|S|}} \binom{n}{|S|}^{-1/2}e_S

To understand this better, write \ell_{\infty}(P^X) as the direct sum of spaces \ell_{\infty}(P_k^X) where P_k^X is the family of all k-point subsets of X. The dimension of \ell_{\infty}(P_k^X) is exactly \binom{n}{k}, and the square root appears because we are after the \ell_2 norm.

But it would be better to say that \sqrt{\cdot} appears because \ell_2 is halfway between \ell_\infty and \ell_1. Indeed, Bourgain applies T twice: first as a map \ell_{\infty}(P^X)\to \ell_{2}(P^X) and then as \ell_{2}(P^X)\to \ell_{1}(P^X). The point is that \|T^2\|_{\ell_\infty\to\ell_1} is easy to estimate: decomposing P_X by cardinality as above, we find that the \ell_\infty\to\ell_1 norm of T^2 on k-subsets is at most 1/k, due to the strategically placed factor 1/{\sqrt{|S|}}. Unfortunately, the harmonic series diverges, but fortunately it diverges slowly, yielding the estimate \|T^2\|_{\ell_\infty\to\ell_1}\le \sum_{k=1}^{n-1} \frac{1}{k}\le \log n. This is where the logarithm comes from.

The hard part of the proof is the lower bound \|T^2(\rho_p)-T^2(\rho_q)\|_{\ell_1} \ge c\, d(p,q) with universal c. I will not say anything about that.

Let’s test this on the 4-cycle with the path metric. Split \ell_{\infty}(P^X) by subset cardinality k=1,2,3, as before. Let a,b,c,d denote the vertices. The image of the vertex a is:

  • in \ell_{\infty}(P_1^X): (0,1,2,1): this is the Kuratowski embedding. The singletons are ordered a,b,c,d.
  • in \ell_{\infty}(P_2^X): (0,0,0,1,1,1) where the sets are ordered ab,ac,ad,bc,bd,cd.
  • in \ell_{\infty}(P_3^X): (0,0,0,1). Sets are ordered abc,abd,acd,bcd.

Now we apply T, which means multiplication by \frac{1}{\sqrt{k}} \binom{n}{k}^{-1/2}. The resulting vector

\displaystyle \left(0,\frac{1}{2},1,\frac{1}{2},0,0,0,\frac{1}{\sqrt{12}},\frac{1}{\sqrt{12}},\frac{1}{\sqrt{12}},0,0,0,\frac{1}{\sqrt{6}}\right)

represents the vertex a. In much the same way, the vertices b and c are mapped into

\displaystyle \left(\frac{1}{2},0,\frac{1}{2},1,0,\frac{1}{\sqrt{12}},\frac{1}{\sqrt{12}},0,0,\frac{1}{\sqrt{12}},0,0,\frac{1}{\sqrt{6}},0\right)


\displaystyle \left(1,\frac{1}{2},0,\frac{1}{2},\frac{1}{\sqrt{12}},0,\frac{1}{\sqrt{12}},0,\frac{1}{\sqrt{12}},0,0,\frac{1}{\sqrt{6}},0,0\right)

respectively. And so, |f(a)-f(b)|=\sqrt{15}/3\approx 1.29 while |f(a)-f(c)|=2\sqrt{6}/3\approx 1.633. This is compared to d_X(a,b)=1 and d_X(a,c)=2. The distortion D(f)\approx 1.58 is only slightly worse than for the optimal embedding of X (into the vertices of a square), which has D(f)=\sqrt{2}\approx 1.41. Well, we could hardly expect a general algorithm like this to find the absolutely best way to embed each finite metric space.

Apart from the constant, Bourgain’s theorem cannot be improved for general metric spaces. However, there are better embedding results for special classes of metric spaces (doubling, ultrametric etc.)

Embeddings II: searching for roundness in ugliness

The concluding observation of Part I was that it’s hard to embed things into a Hilbert space: the geometry is the same in all directions, and the length of diagonals of parallelepipeds is tightly controlled. One may think that it should be easier to embed the Hilbert space itself into other things. And this is indeed so.

Let’s prove that the space L^p[0,1] contains an isomorphic copy of the Hilbert space \ell_2, for every 1\le p \le \infty. The case p=\infty can be immediately dismissed: this is a huge space which, by virtue of Kuratowski’s embedding, contains an isometric copy of every separable metric space. Assume 1\le p<\infty.

Recall the Rademacher functions r_n(t)=\mathrm{sign}\, \sin (2^{n+1}\pi t), n=0,1,2,3,\dots They are simply square waves:

Define a linear operator T\colon \ell_2\to L^p[0,1] by T(c_1,c_2,\dots)=\sum_{n}c_nr_n. Why is the sum \sum_{n}c_nr_n in L^p, you ask? By the Хинчин inequality:

\displaystyle A_p\sqrt{\sum c_n^2} \le \left\|\sum c_n r_n\right\|_{L^p} \le B_p\sqrt{\sum c_n^2}

The inequality tells us precisely that T is an isomorphism onto its image.

So, even the indescribably ugly space L^1[0,1] contains a nice roundish subspace.
(Why is L^1[0,1] ugly? Every point of its unit sphere is the midpoint of a line segment that lies on the sphere. Imagine that.) One might ask if the same holds for every Banach space, but that’s way too much to ask. For instance, the sequence space \ell_p for p\ne 2,\infty does not have any subspace isomorphic to \ell_2. Informally, this is because the underlying measure (the counting measure on \mathbb N) is not infinitely divisible; the space consists of atoms. For any given N we can model the first N Rademacher functions on sequences, but the process has to stop once we reach the atomic level. On the positive side, this shows that \ell_p contains isomorphic copies of Euclidean spaces of arbitrarily high dimension, with uniform control on the distortion expressed by A_p and B_p. And this property is indeed shared by all Banach spaces: see Dvoretzky’s theorem.

Embeddings vs embeddings

In mathematics, an embedding (or imbedding) is one instance of some mathematical structure contained within another instance, such as a group that is a subgroup. When some object X is said to be embedded in another object Y, the embedding is given by some injective and structure-preserving map f \colon X \to Y.

Thus speaks Wikipedia. And so we must insist on f(X) being isomorphic to X in whatever category we are dealing with; otherwise there is no reason to say that Y contains X in any way. This is why geometers distinguish immersions from embeddings, and note that even an injective immersion may fail to be an embedding.

Not an embedding

And yet, people speak of the Sobolev (Соболев) embedding theorem and its relatives due to Morrey, Rellich, Кондрашов… The common feature of these theorems is that one normed space X is set-theoretically contained in another normed space Y, and the inclusion map is continuous (or even compact in the case of Rellich-Кондрашов). But this inclusion map is not an isomorphism; the image of X inside of Y may look nothing like X itself. For instance, the Sobolev theorem says that, in the two-dimensional setting, the space of functions with integrable gradient (W^{1,1}) is “embedded” into L^2. Even though there is nothing inside of L^2 that looks like W^{1,1}.

Let’s consider a much simpler example: the space of absolutely summable sequences \ell_1 is contained in the space of square-summable sequences \ell_2. This inclusion is a continuous, indeed non-expanding map: \|x\|_2\le \|x\|_1 because \displaystyle \sum_i |x_i|^2\le \left(\sum_i |x_i|\right)^2. But there is no subspace of \ell_2 that is isomorphic to \ell_1, isomorphisms being invertible linear maps. This can be shown without any machinery: Suppose f\colon \ell_1\to\ell_2 is a linear map such that C^{-1}\|x\|_1 \le \|f(x)\|_2 \le C\|x\|_1 for some constant C, for all x\in \ell_1. Consider the unit basis vectors e_i, i.e., e_1=(1,0,0,0,\dots), etc. Denote v_i=f(e_i). For any choice of \pm signs, \|\sum_{i=1}^n \pm e_i \|_1 =n and, therefore,
\displaystyle (*) \qquad \left\|\sum_{i=1}^n \pm v_i \right\|_2^2 \ge C^{-2} n^2.
On the other hand, taking the average over all sign assignments in \displaystyle \left\|\sum_{i=1}^n \pm v_i \right\|_2^2 kills all cross terms, leaving us with \sum_{i=1}^n \| v_i \|_2^2, which is \le C^2n. As n\to\infty, we obtain a contradiction between the latter inequality and (*).

Parallelogram (or parallelepiped) law

In a Hilbert space, the sum of squared diagonals of a parallelepiped is precisely controlled by the square of sums of its sides. In \ell_p for 1\le p<2 the diagonals may be “too long'', but cannot be ''too short''. In \ell_p for 2\le p<\infty it's the opposite. (And in \ell_{\infty} anything goes, since it contains an isometric copy of every \ell_p.) Further development of these computations with diagonals leads to the notions of type and cotype of a Banach space: type gives an upper bound on the length of diagonals, cotype gives a lower bound.