Measure-distanced subsets of an interval

Suppose {(X, d)} is a bounded metric space in which we want to find {n} points at safe distance {\delta} from one other: {d(x_j, x_k)\ge \delta} for all {j\ne k}. Let {\delta_n} be the greatest value of {\delta} for which this is possible (technically, the supremum which might not be attained). We have {\mathrm{diam}\,(X) = \delta_2 \ge \delta_3\ge \cdots}.

The sequence {\delta_n} tends to zero if and only if {X} is totally bounded. From now on, consider a specific metric space which is bounded but not totally bounded: the space {X} of all measurable subsets of {[0, 1]} with the metric {d(A, B)=|A\triangle B|}, the measure of the symmetric difference. More specifically, the elements of {X} are equivalence classes of sets that differ by a null subset. Another way to think of it is the subset of {L^1([0, 1])} which consists of functions that attain only the values {0} and {1}. In this form we identify a set {A} with its characteristic function {\chi_A}. The measure space {[0, 1]} could be replaced by any isomorphic probability space, of which there are many.

The space {X} contains an infinite subset {\{E_1, E_2, \dots \}} with {\delta=1/2} separation: namely, let {E_k} be the set of all numbers {x\in [0, 1]} such that the fractional part of {2^k x} is less than {1/2}; or equivalently, the {k}th binary digit of {x} is zero. Each symmetric difference {E_j\triangle E_k} has measure exactly {1/2}. Therefore, {\delta_n\ge 1/2} for every {n}.

Upper bound for distances

To get an upper bound for {\delta_n}, suppose {\{E_1, \dots, E_n\}} are {\delta}-separated. Let {\chi_k} be the characteristic function of {E_k}. By assumption, {\int_0^1 |\chi_k-\chi_j|\ge \delta} whenever {1\le k < j \le n}. Summing over all such pairs {(k, j)} shows that {\int_0^1 \sum_{k<j} |\chi_k-\chi_j|\ge \delta n(n-1)/2}. Fix a point {x\in [0, 1]} and let {p} be the number of sets among {E_1, \dots, E_n} that contain {x}. An easy counting argument shows {\sum_{k<j} |\chi_k(x)-\chi_j(x)| = p(n-p)}. The latter quantity is bounded by {r(n-r)} where {r = \lceil n/2\rceil}. Since this bound holds for every {x}, it follows that {\delta n(n-1)/2 \le r(n-r)}, which simplifies to

{\displaystyle \delta \le \frac{r}{2r-1}}, where {r = \lceil n/2\rceil}.

For example, {\delta_4\le \delta_3\le 2/3}, {\delta_6\le \delta_5\le 3/5}, {\delta_8\le \delta_7\le 4/7}, and so on. Are these estimates sharp? For an affirmative answer it would be enough to demonstrate that {\delta_{2r}\ge r/(2r-1)} by finding a suitable collection of {2r} subsets. Note that sharpness requires that almost every point of {[0, 1]} be covered by exactly {r} of these subsets, and that all pairwise distances between sets must be equal to {r/(2r-1)}.

Sharpness of the upper bound

Four subsets with {2/3} separation are easy to find: {[0,1/3]}, {[1/3, 2/3]}, {[2/3, 1]}, and {[0, 1]}. Thus, {\delta_4=2/3}.

To find six subsets with {3/5} separation, some preparation is helpful.

For any set {A}, the map {E\mapsto E\triangle A} is an isometry of {X} onto {X}. So, there is no loss of generality in assuming that one of our {2r} sets is {[0, 1]}. Then we are to find {E_1,\dots, E_{2r-1}} of measure {1-r/(2r-1) = (r-1)/(2r-1)}. In order to have the specified distances, the measure of intersection of any two sets must be {(r-2)/(4r-2)}.

For example, if {r=2}, the above says we need three disjoint sets of measure {1/3}, which is a trivial thing to find. For {r>2} the sets will no longer be disjoint.

Discretization of the problem

A natural way to proceed is to partition {[0, 1]} into {4r-2} equal subintervals (or other sets of equal measure). Then we look for {2r-1} sets, each of which consists of {(4r-2)(r-1)/(2r-1) = 2r-2} subintervals. Their pairwise intersections must consist of {(4r-2)(r-2)/(4r-2) = r-2} subintervals.

The case {r=2} was done above. If {r=3}, we want to use {10} subintervals to form {5} sets with {4} subintervals in each. Pairwise intersections must consist of precisely {1} subinterval. Encoding subintervals by their indices 0-9, we can describe such sets as 0123, 3456, 6780, 1479, 2589. And since 10 is a triangular number, there is a neat illustration of these five sets: three sides of the triangle, and two “stars” connecting its center to the sides.

I admit it could be neater

There is no need to translate this back into sets formed by subintervals: as long as the combinatorial problem has a solution, we have the desired conclusion. In this case, the conclusion is {\delta_6=3/5}.

Next up, {r=4}. Following the above method, we partition {[0, 1]} into {4r-2 = 14} equal subintervals and look for {7} sets, each of which consists of {6} subintervals. Pairwise intersections must consist of {2} subintervals each. Can this be done?

Think of a cube. It has {8} vertices and {6} faces, and {8+6=14}. In the set {faces and vertices}, we choose {7} subsets of cardinality {6} as follows:

  • {face, all its vertices, and the opposite face}. There are 6 such sets. Any two of them have exactly 2 elements in common.
  • {all faces} is our 7th set, it has 2 elements in common with each of the above.

Thus, {\delta_8 = 4/7}. But it is concerning that {\delta_6} and {\delta_8} required different combinatorial constructions. What happens next?

For {r=5}, we partition {[0, 1]} into {18} equal subintervals and look for {9} sets, with {8} subintervals in each. Pairwise intersections must consist of {3} subintervals. I don’t see any neat construction here.

Summary so far

The results obtained so far for {r=2, 3, 4} can be summarized as follows: there exists a 0-1 matrix {B_r} of size {(2r-1)\times (4r-2)} in which the dot product of two rows is {r-2} when the rows are distinct and {2r-2} otherwise. Specifically,

{\displaystyle B_2 = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \end{pmatrix} }

{\displaystyle B_3 = \begin{pmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 1 \\ \end{pmatrix}}

The matrix {B_4} is best written in block form {[F | V]} according to the faces-vertices interpretation. Here

{\displaystyle F = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 \end{pmatrix}} and {\displaystyle V = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ \end{pmatrix}}

Does {B_5} exist? Perhaps not.

Combinatorics of finite sets

In combinatorial terms, we are looking for a {k}-uniform {\ell}-intersecting family of {m} subsets of {[n]}, with specific values

{k = 2r-2}, {\ell=r-2}, {m=2r-1}, {n=4r-2}.

Explanation of terms: {[n]=\{1, 2, \dots, n\}}, being {k}-uniform means every set has {k} elements, and being {\ell}-intersecting means all pairwise intersections have {\ell} elements.

Thus began my fishing expedition into the literature on combinatorics of finite sets. Surprisingly, it ended with a pivot back to where it started: measurable subsets of an interval.

Fractional intersecting uniform families

A recent (2018) preprint Uniform sets in a family with restricted intersections (Yandong Bai, Binlong Li, Jiuqiang Liu, Shenggui Zhang) relates the combinatorial problem back to subsets of an interval. I quote the relevant result (Theorem 2.5) with slight changes of notation.

Let {X=[0,n]} be the real interval and let {\lambda} be the Lebesgue measure on {X}. If {\mathcal{F}=\{F_1,\ldots,F_m\}} is a family of subsets of {X} such that

  1. {\lambda(F_i)=k} for all {i\in [m]}, and
  2. {\lambda(F_i\cap F_j)=\ell} for all distinct {i,j\in[m]},

then we call {\mathcal{F}} a fractional {\ell}-intersecting {k}-uniform {m}-family of {X}. For given real numbers {n,\ell} and integer {m}, let {\kappa_{\ell}^{\text{frac}}(n,m)} be the largest real number {k} such that there exists a fractional {\ell}-intersecting {k}-uniform {m}-family of {[0,n]}.

Theorem 2.5 Let {n>\ell\ge 0} be two real numbers and let {m\ge 1} be an integer. Then

{\displaystyle \kappa_{\ell}^{\text{frac}}(n,m)={m-1\choose s-1}\alpha+{m-1\choose t-1}\beta, }

where {\displaystyle s=\left\lfloor\frac{1+\sqrt{1+4m(m-1)\ell/n}}{2}\right\rfloor}, {\displaystyle t=\left\lceil\frac{1+\sqrt{1+4m(m-1)\ell/n}}{2}\right\rceil}

and {(\alpha,\beta)} is the solution of the system

{\displaystyle \binom{m}{s}\alpha+\binom{m}{t}\beta=n}

{\displaystyle \binom{m-2}{s-2}\alpha+\binom{m-2}{t-2}\beta=\ell}.

This looks complicated, but with our values {\ell=r-2}, {m=2r-1}, {n=4r-2} both {s} and {t} turn out to be equal to {r-1}. This is an exceptional case when the linear system is degenerate: it amounts to {\alpha+\beta = (4r-2)/\binom{2r-1}{r-1}}. The final result is {\kappa_{\ell}^{\text{frac}}(n,m)=2r-2} as we wanted. Yay.

So, the upper bounds for distances are indeed sharp, meaning that {n} subsets of the unit interval can be separated by measure-distance of {r/(2r-1)}, where {r = \lceil n/2\rceil}. But dividing the intervals into {4r-2} equal parts may not be enough to construct such subsets: in general one has to use a larger number of parts, some integer multiple of {4r-2}.

The authors remark that if {\alpha, \beta} are integers, then a partition with {k=\kappa^{\text{frac}}} exists in the combinatorial realm of {\{1, \dots, n\}}. However, this is not “if and only if”. Indeed, {\alpha+\beta = (4r-2)/\binom{2r-1}{r-1}} which takes integer values when {r=1, 2, 3}, and is fractional (less than 1) when {r\ge 4}. But we do have a combinatorial solution when {r=4}. So it is not quite clear for what values of {r} there exists a {(2r-2)}-uniform {(r-2)}-intersecting family of {2r-1} subsets of {[4r-2]}. Perhaps an answer could be found by following the construction for this specific case, when the formulas magically simplify.

Bipartite metric spaces and positive kernels

Bipartite metric spaces provide simple examples of highly non-Euclidean metric spaces. Let {A} and {B} be two disjoint abstract sets, say, with {m} and {n} points respectively. Define a metric on the union {A\cup B} as follows, where {a, b, c} are some positive numbers:

  • The distance between distinct elements of {A} is {a}.
  • The distance between distinct elements of {B} is {b}.
  • The distance between two elements from different sets is {c}.

The triangle inequality is satisfied provided that {a, b\le 2c}. When {c} is large, the space is not very interesting: it is approximately the union of two simplexes placed at some large distance from each other. On the other hand, when {a=b=2c} we encounter a bizarre situation: every point of {B} is a “midpoint” between any two distinct points of {A}, and vice versa.

A function {K\colon X\times X \to \mathbb R} is called a positive kernel if the matrix {K(x_i, x_j)_{i, j=1}^n} is positive semidefinite for any choice of {n} and {x_1, \dots, x_n\in X}. A classical theorem of Schoenberg says that a metric space {(X, d)} is isometric to a subset of a Hilbert space if and only if {\exp(-\gamma  d^2)} is a positive kernel for all {\gamma >0}. One may ask: is there any function {f\colon [0, \infty)\to\mathbb R} such that {f\circ d} is a positive kernel on {X} for every metric space {(X, d)}?

Suppose {f} is such a function. Applying it to the bipartite metric defined above, we find that the following matrix must be positive semidefinite (the case {m=3}, {n=2} shown; in general the diagonal blocks have sizes {m\times m} and {n\times n}):

{\displaystyle \begin{pmatrix} f(0) & f(a) & f(a)  & f(c) & f(c) \\   f(a) & f(0) & f(a)  & f(c) & f(c) \\   f(a) & f(a) & f(0)  & f(c) & f(c) \\   f(c) & f(c) & f(c)  & f(0) & f(b) \\   f(c) & f(c) & f(c)  & f(b) & f(0)  \end{pmatrix} }

Test the PSD property against vectors of the form {(s, s, s, t, t)}, that is {m} entries equal to some number {s} followed by {n} entries equal to some number {t}. The result is the quadratic form {As^2 + Bt^2 + 2C^2st } where {A = (m^2-m) f(a)+mf(0)}, {B = (n^2-n)f(b)+nf(0)}, and {C = mnf(c)}.

For this quadratic form to be positive semidefinite, we need {A\ge 0} and {C^2\le AB} to hold. The inequality {A\ge 0} amounts to {f(a)\ge 0} and {f(0)\ge 0}, because {m} can be any positive integer. Simply put, {f} must be nonnegative.

The inequality {C^2\le AB} simplifies to

{ mn f(c)^2 \le ( (m-1) f(a)+f(0)) \cdot ((n-1)f(b)+ f(0)) }

When {m=n=1} and {a=b}, we obtain {f(c)\le f(0)}; so, {f} is maximized at {0}. Letting {m, n\to \infty} yields {f(c)^2 \le f(a)f(b)}. What does this mean?

If {c=(a+b)/2}, the inequality {f(c)^2 \le f(a)f(b)} amounts to log-convexity of {f}, and there are lots of log-convex functions. But crucially, our only constraint on {c} in the construction of a bipartite metric space is {c\ge \max(a, b)/2}. In the special case {a=b} we find that {f(c)\le f(a)} whenever {c\ge a/2} (and {a>0}). On one hand, {f} must be nonincreasing because all {c\ge a} are allowed. On the other hand, {f(a/2)\le f(a)} implies that the sequence {k\mapsto f(2^k)} is nondecreasing for {k\in\mathbb Z}. These conditions can be satisfied together only if {f} is constant on {(0, \infty)}.

And there you have it, a complete description of functions {f\colon [0, \infty)\to\mathbb R} that transform every distance matrix into a positive semidefinite matrix: {f\equiv c\ge 0} on {(0, \infty)}, and {f(0)\ge c}. It is immediate that such {f} indeed works for all metric spaces, not just bipartite ones.

Graphs with an invertible incidence matrix

Graphs are often encoded by their adjacency matrix: a symmetric matrix where {1} in the entry {(i,j)} means there is an edge between vertices labeled {i} and {j}. Another useful encoding is the incidence matrix, in which rows correspond to edges and columns to vertices (or the other way around). Having {1} in the entry {(i,j)} means that {i}th edge is incident to {j}th vertex. Simply put, each row contains exactly two nonzero entries, which indicate the endpoints of an edge.

An incidence matrix is generally rectangular. It is the square matrix when the graph has as many edges as vertices. For example, the 4-cycle has incidence matrix

{M = \displaystyle \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 1 \end{pmatrix}}

(up to relabeling of vertices). This matrix is not invertible. Indeed, {M(1,-1,1,-1)^T=0}, which is a reflection of the fact that the graph is bipartite. Indeed, for any bipartite graph the {\pm 1} vector describing a 2-coloring of the vertices belongs to the kernel of the incidence matrix.

The incidence matrix of an odd cycle is invertible, however. What other such “invertible” graphs exist? Clearly, the disjoint union of “invertible” graphs is itself invertible, so we can focus on connected graphs.

There are no invertible graphs on 1 or 2 vertices. For 3 through 8 vertices, exhaustive search yields the following counts:

  • 1 graph on 3 vertices (3-cycle)
  • 1 graph on 4 vertices (3-cycle with an edge attached)
  • 4 graphs on 5 vertices
  • 8 graphs on 6 vertices
  • 23 graphs on 7 vertices
  • 55 graphs on 8 vertices

The numbers match the OEIS sequence A181688 but it’s not clear whether this is for real or a coincidence. The brute force search takes long for more than 8 vertices. It’s probably better to count such graphs by using their (folklore?) characterization as “trees planted on an odd cycle”, described by Chris Godsil on MathOverflow.

Some examples:




The examples were found and plotted by this Python script.

import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from itertools import combinations

def is_invertible(graph, n):
    matrix = []
    for i in range(n):
        row = [0]*n
        row[graph[i][0]] = 1
        row[graph[i][1]] = 1
    return abs(np.linalg.det(matrix)) > 0.5

n = 6   # number of vertices
graphs = combinations(combinations(range(n), 2), n)
inv = []
for g in graphs:
    if is_invertible(g, n):
        gr = nx.Graph(g)
        if nx.is_connected(gr) and not any([nx.is_isomorphic(gr, gr2) for gr2 in inv]):
            plt.savefig("graph{}_{}.png".format(n, len(inv)))

print("There are {} connected invertible graphs with {} vertices".format(len(inv), n))

Classifying words: a tiny example of SVD truncation

Given a bunch of words, specifically the names of divisions of plants and bacteria, I’m going to use a truncated Singular Value Decomposition to separate bacteria from plants. This isn’t a novel or challenging task, but I like the small size of the example. A similar type of examples is classifying a bunch of text fragments by keywords, but that requires a lot more setup.

Here are 33 words to classify: acidobacteria, actinobacteria, anthocerotophyta, aquificae, bacteroidetes, bryophyta, charophyta, chlamydiae, chloroflexi, chlorophyta, chrysiogenetes, cyanobacteria, cycadophyta, deferribacteres, deinococcus-thermus, dictyoglomi, firmicutes, fusobacteria, gemmatimonadetes, ginkgophyta, gnetophyta, lycopodiophyta, magnoliophyta, marchantiophyta, nitrospirae, pinophyta, proteobacteria, pteridophyta, spirochaetes, synergistetes, tenericutes, thermodesulfobacteria, thermotogae.

As is, the task is too easy: we can recognize the -phyta ending in the names of plant divisions. Let’s jumble the letters within each word:

aaabccdeiiort, aaabcceiinortt, aacehhnoooprttty, aacefiiqu, abcdeeeiorstt, abhoprtyy, aachhoprty, aacdehilmy, cefhilloorx, achhlooprty, ceeeghinorssty, aaabcceinorty, aaccdhoptyy, abcdeeeefirrrst, cccdeehimnoorsstuu, cdgiilmooty, cefiimrstu, aabcefiorstu, aadeeegimmmnostt, agghiknopty, aeghnoptty, acdhilooopptyy, aaghilmnoopty, aaachhimnoprtty, aeiinoprrst, ahinoppty, aabceeiooprrtt, adehiopprtty, aceehioprsst, eeeginrssstty, ceeeinrsttu, aabcdeeefhilmoorrsttu, aeeghmoortt

Not so obvious anymore, is it? Recalling the -phyta ending, we may want to focus on the presence of letter y, which is not so common otherwise. Indeed, the count of y letters is a decent prediction: on the following plot, green asterisks are plants and red are bacteria, the vertical axis is the count of letter Y in each word.

Count of Y in each word

However, the simple count fails to classify several words: having 1 letter Y may or may not mean a plant. Instead, let’s consider the entire matrix {A} of letter counts (here it is in a spreadsheet: 33 rows, one for each word; 26 columns, one for each letter.) So far, we looked at its 25th column in isolation from the rest of the matrix. Truncated SVD uncovers the relations between columns that are not obvious but express patterns such as the presence of letters p,h,t,a along with y. Specifically, write {A = UDV^T} with {U,V} unitary and {D} diagonal. Replace all entries of {D}, except the four largest ones, by zeros. The result is a rank-4 diagonal matrix {D_4}. The product {A_4 = UD_4V^T} is a rank-4 matrix, which keeps some of the essential patterns in {A} but de-emphasizes the accidental.

The entries of {D_4} are no longer integers. Here is a color-coded plot of its 25th column, which still somehow corresponds to letter Y but takes into account the other letters with which it appears.

The same column of the letter-count matrix, after truncation

Plants are now cleanly separated from bacteria. Plots made in MATLAB as follows:

[U, D, V] = svd(A);
D4 = D .* (D >= D(4, 4));
A4 = U * D4 * V';
plants = (A4(:, 25) > 0.8);
bacteria = (A4(:, 25) <= 0.8);
  % the rest is output
words = 1:33;
hold on
plot(words(plants), A4(plants, 25), 'g*');
plot(words(bacteria), A4(bacteria, 25), 'r*');

Angel dust of matrices

A matrix {A} with real entries has positive characteristic polynomial if {\det(tI-A)\ge 0} for all real {t}. For example,

\displaystyle A=\begin{pmatrix} 1 & -5 \\ 3 & -2 \end{pmatrix}

has this property: {\det (tI-A)=(t-1)(t+2)+15=t^2+t+13}. For brevity, let’s say that {A} is a PCP matrix.

Clearly, any PCP matrix must be of even size. Incidentally, this implies that the algebraists’ characteristic polynomial {\det (tI-A)} and the analysts’ characteristic polynomial {\det (A-tI)} coincide.

In general, there is no reason for the PCP property to be preserved under either addition or multiplication of matrices. But there are some natural rings of PCP matrices, such as

  • complex numbers {\begin{pmatrix} a & b \\ -b & a \end{pmatrix} }
  • quaternions {\begin{pmatrix} a & b & c & d \\ -b & a & -d & c    \\ -c & d & a & -b \\ -d & -c & b & a \end{pmatrix}}

A {2\times 2} matrix is PCP if and only if its eigenvalues are either complex or repeated. This is equivalent to {(\mathrm{tr}\, A)^2 \le 4\det A}. In general, a matrix of even size is PCP if and only if it has no real eigenvalues of odd algebraic multiplicity.

Is this post merely an excuse for 1995 flashback: Ангельская Пыль by Ария?

It might not be.

Integer sets without singular matrices

Let’s say that a set A\subset \mathbb N contains an n\times n singular matrix if there exists an n\times n matrix with determinant zero whose entries are distinct elements of A. For example, the set of prime numbers does not contain any 2\times 2 singular matrices; however, for every n\ge 3 it contains infinitely many n\times n singular matrices.

I don’t know of an elementary proof of the latter fact. By a 1939 theorem of van der Corput, the set of primes contains infinitely many progressions of length 3. (Much more recently, Green and Tao replaced 3 with an arbitrary k.) If every row of a matrix begins with a 3-term arithmetic progression, the matrix is singular.

In the opposite direction, one may want to see an example of an infinite set A\subset \mathbb N that contains no singular matrices of any size. Here is one:
Continue reading “Integer sets without singular matrices”