## The sum of pairwise distances and the square of CDF

Suppose we have ${n}$ real numbers ${x_0,\dots, x_{n-1}}$ and want to find the sum of all distances ${|x_j-x_k|}$ over ${j < k}$. Why? Maybe because over five years ago, the gradient flow of this quantity was used for "clustering by collision" (part 1, part 2, part 3).

If I have a Python console open, the problem appears to be solved with one line:

>>> 0.5 * np.abs(np.subtract.outer(x, x)).sum()

where the outer difference of x with x creates a matrix of all differences ${x_i-x_j}$, then absolute values are taken, and then they are all added up. Double-counted, hence the factor of 0.5.

But trying this with, say, one million numbers is not likely to work. If each number takes 8 bytes of memory (64 bits, double precision), then the array x is still pretty small (under 8 MB) but a million-by-million matrix will require over 7 terabytes, and I won’t have that kind of RAM anytime soon.

In principle, one could run a loop adding these values, or store the matrix on a hard drive. Both are going to take forever.

There is a much better way, though. First, sort the numbers in nondecreasing order; this does not require much time or memory (compared to the quadratic memory cost of forming a matrix). Then consider the partial sums ${s_k = x_0+\dots+x_k}$; the cost of computing them is linear in time and memory. For each fixed ${k}$, the sum of distances to ${x_j}$ with ${j<k}$ is simply ${kx_k - s_{k-1}}$, or, equivalently, ${(k+1)x_k - s_k}$. So, all we have to do is add these up. Still one line of code (after sorting), but a much faster one:

>>> x.sort()
>>> (np.arange(1, n+1)*x - np.cumsum(x)).sum()
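As a sanity check, the fast formula can be compared with the naive quadratic computation on a small array (a sketch assuming NumPy):

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.5, 9.0, 2.6])
n = x.size

# naive O(n^2) sum of pairwise distances (double-counted, hence the 0.5)
naive = 0.5 * np.abs(np.subtract.outer(x, x)).sum()

# fast O(n log n) version: sort, then combine with partial sums
x.sort()
fast = (np.arange(1, n + 1) * x - np.cumsum(x)).sum()

print(naive, fast)  # the two agree
```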

For example, x could be a sample from some continuous distribution. Assuming the distribution has a mean (i.e., is not too heavy tailed), the sum of all pairwise distances grows quadratically with n, and its average approaches a finite limit. For the uniform distribution on [0, 1] the computation shows this limit is 1/3. For the standard normal distribution it is 1.128… which is not as recognizable a number.

As ${n\to \infty}$, the average distance of a sample taken from a distribution converges to the expected value of |X-Y| where X, Y are two independent variables with that distribution. Let’s express this in terms of the probability density function ${p}$ and the cumulative distribution function ${\Phi}$. By symmetry, we can integrate over ${x> y}$ and double the result:

${\displaystyle \frac12 E|X-Y| = \int_{-\infty}^\infty p(x)\,dx \int_{-\infty}^x (x-y) p(y)\,dy}$

Integrate by parts in the second integral: ${p(y) = \Phi'(y)}$, and the boundary terms are zero.

${\displaystyle \frac12 E|X-Y| = \int_{-\infty}^\infty p(x)\,dx \int_{-\infty}^x \Phi(y)\,dy}$

Integrate by parts in the other integral, throwing the derivative onto the indefinite integral and thus eliminating it. There is a boundary term this time.

${\displaystyle \frac12 E|X-Y| = \Phi(\infty) \int_{-\infty}^\infty \Phi(y)\,dy - \int_{-\infty}^\infty \Phi(x)^2\,dx}$

Since ${\Phi(\infty) = 1}$, this simplifies nicely:

${\displaystyle \frac12 E|X-Y| = \int_{-\infty}^\infty \Phi(x) (1-\Phi(x))\,dx}$

This is a lot neater than I expected: ${E|X-Y|}$ is simply the integral of ${2\Phi(1-\Phi)}$. It is not often that the square of a CDF shows up like this. Some examples: for the uniform distribution on [0,1] we get

${\displaystyle E|X-Y| = \int_0^1 2x(1-x)\,dx = \frac13}$

and for the standard normal, with ${\Phi(x) = (1+\mathrm{erf}\,(x/\sqrt{2}))/2}$, it is

${\displaystyle \int_{-\infty}^\infty \frac12 \left(1-\mathrm{erf}\,(x/\sqrt{2}) ^2 \right)\,dx = \frac{2}{\sqrt{\pi}}\approx 1.12838\ldots }$
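Both values are easy to confirm numerically; here is a sketch using only the standard library, approximating the integral of ${2\Phi(1-\Phi)}$ by a midpoint Riemann sum:

```python
from math import erf, sqrt, pi

def expected_distance(cdf, a, b, steps=200000):
    """Midpoint Riemann sum of 2*F*(1-F) over [a, b], approximating E|X-Y|."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        F = cdf(a + (i + 0.5) * h)
        total += 2 * F * (1 - F) * h
    return total

# uniform on [0, 1]: the CDF is x itself, and the integral is 1/3
uniform = expected_distance(lambda x: x, 0.0, 1.0)

# standard normal: CDF (1 + erf(x/sqrt(2)))/2; outside [-10, 10] the integrand is negligible
normal = expected_distance(lambda x: (1 + erf(x / sqrt(2))) / 2, -10.0, 10.0)
```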

The trick with sorting and cumulative sums can also be used to find, for every point ${x_k}$, the sum (or average) of distances to all other points. To do this, we skip the final summation over ${k}$, but must also account for the distances ${|x_j-x_k|}$ with ${j>k}$. Their sum is simply ${S - s_k - (n-k-1)x_k}$ where ${S}$ is the total sum. So, all we need is

>>> (2*np.arange(1,n+1)-n)*x - 2*np.cumsum(x) + x.sum()
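Again a naive comparison serves as a sanity check (a sketch assuming NumPy, with the data already sorted as the formula requires):

```python
import numpy as np

x = np.sort(np.array([3.0, 1.0, 4.0, 1.5, 9.0, 2.6]))
n = x.size

# per-point sums of distances via cumulative sums: O(n) after sorting
fast = (2 * np.arange(1, n + 1) - n) * x - 2 * np.cumsum(x) + x.sum()

# naive version: row sums of the full distance matrix
naive = np.abs(np.subtract.outer(x, x)).sum(axis=1)
```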

Unfortunately, the analogous problems for vector-valued sequences are not as easy. If the Manhattan metric is used, we can do the computations for each coordinate separately, and add the results. For the Euclidean metric…

## The infinitely big picture: tanh-tanh scale

When plotting the familiar elementary functions like ${x^2}$ or ${\exp(x)}$, we only see whatever part of the infinitely long curve fits in the plot window. What if we could see the entire curve at once?

The double-tanh scale can help with that. The function u = tanh(x) is a diffeomorphism of the real line onto the interval (-1, 1). Its inverse, arctanh or artanh or arth or ${\tanh^{-1}x}$ or ${\frac12 \log((1+x)/(1-x))}$, whatever you prefer to call it, does the opposite. So, conjugating any function ${f\colon \mathbb R\to \mathbb R}$ by the hyperbolic tangent produces a function ${g\colon (-1, 1)\to (-1,1)}$ which we can plot in its entirety. Let’s try this.
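The plots below can be produced by conjugating with the hyperbolic tangent directly; a minimal sketch, assuming NumPy and Matplotlib (the helper name tanh_conjugate is mine):

```python
import numpy as np
import matplotlib.pyplot as plt

def tanh_conjugate(f):
    """Conjugate f by tanh: g maps (-1, 1) to (-1, 1), showing the whole curve."""
    return lambda u: np.tanh(f(np.arctanh(u)))

u = np.linspace(-0.999, 0.999, 2000)
for f, label in [(np.exp, 'exp(x)'), (lambda x: x**2, 'x^2'), (np.sin, 'sin(x)')]:
    plt.plot(u, tanh_conjugate(f)(u), label=label)
plt.legend()
plt.show()
```

The identity function conjugates to the identity, consistent with the observation that ${y=\pm x}$ are the only lines that remain lines.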

Out of linear functions y = kx, only y=x and y=-x remain lines.

The powers of x, from 1 to 4, look mostly familiar:

Sine, cosine, and tangent functions are not periodic anymore:

The exponential function looks concave instead of convex, although I don’t recommend trying to prove this by taking the second derivative of its tanh-conjugate.

The Gaussian loses its bell-shaped appearance and becomes suspiciously similar to a semicircle.

This raises the question: which function does appear as a perfect semi-circle of radius 1 on the tanh-tanh scale? Turns out, it is ${f(x) = \log|\coth(x/2)|}$. Here it is shown in the normal coordinate system.

## Discrete Cosine Transforms: Trapezoidal vs Midpoint

The existence of multiple versions of Discrete Cosine Transform (DCT) can be confusing. Wikipedia explains that its 8 types are determined by how one reflects across the boundaries. E.g., one can reflect 1, 2, 3 across the left boundary as 3, 2, 1, 2, 3 or as 3, 2, 1, 1, 2, 3, and there are such choices for the other boundary too (also, the other reflection can be odd or even). Makes sense enough.

But there is another aspect to the two most used forms, DCT-I and DCT-II (types I and II): they can be expressed in terms of the Trapezoidal and Midpoint rules for integration. Here is how.

The cosines ${\cos kx}$, ${k=0,1,\dots}$ are orthogonal on the interval ${[0,\pi]}$ with respect to the Lebesgue measure. The basis of discrete transforms is that these cosines are also orthogonal with respect to certain discrete measures, up to a certain frequency. Indeed, if the cosines are orthogonal with respect to a measure ${\mu}$ whose support consists of ${n}$ points, then we can efficiently use them to represent any function defined on the support of ${\mu}$, and such functions are naturally identified with sequences of length n.

How to find such measures? It helps that some simple rules of numerical integration are exact for trigonometric polynomials up to some degree.

For example, the trapezoidal rule with n sample points exactly integrates the functions ${\cos kx}$ for ${k=0,\dots, 2n-3}$. This could be checked by converting to exponential form and summing geometric progressions, but here is a visual explanation with n=4, where the interval ${[0,\pi]}$ is represented as the upper semicircle. The radius of each red circle indicates the weight placed at that point; the endpoints get 1/2 of the weight of the other sample points. To integrate ${\cos x}$ correctly, we must have the x-coordinate of the center of mass equal to zero, which is obviously the case.

Replacing ${\cos x}$ by ${\cos kx}$ means multiplying the polar angle of each sample point by ${k}$. This is what we get:

In all cases the x-coordinate of the center of mass is zero. With k=6 this breaks down, as all the weight gets placed in one point. And this is how it goes in general with integration of sines and cosines: equally spaced points work perfectly until they don’t work at all, which happens when the step size is equal to the period of the function. When ${k=2n-2}$, the period ${2\pi/k}$ is equal to ${\pi/(n-1)}$, the spacing of points in the trapezoidal rule.

The orthogonality of cosines has to do with the product formula ${\cos kx \cos jx = \frac12\cos((k-j)x) + \frac12\cos((k+j)x)}$. Let ${\tau_n}$ be the measure expressing the trapezoidal rule on ${[0,\pi]}$ with ${n}$ sample points; so it is the sum of point masses at ${0, \pi/(n-1), \dots, \pi}$ (with half-weights at the endpoints). Then ${\{\cos kx \colon k=0,\dots, n-1\}}$ are orthogonal with respect to ${\tau_n}$ because any product ${\cos kx\cos jx}$ with ${k > j}$ taken from this range will have ${k-j, k+j \le 2n-3}$. Consequently, we can compute the coefficients of any function ${f}$ in the cosine basis as

${\displaystyle c_k = \int f(x)\cos kx\,d\tau_n(x) \bigg/ \int \cos^2 kx\,d\tau_n(x)}$

The above is what DCT-I (discrete cosine transform of type 1) does, up to normalization.

The DCT-II transform uses the Midpoint rule instead of the Trapezoidal rule. Let ${\mu_n}$ be the measure expressing the Midpoint rule on ${[0,\pi]}$ with ${n}$ sample points; it gives equal mass to the points ${(2j-1)\pi/(2n)}$ for ${j=1,\dots, n}$. These are spaced at ${\pi/n}$ and therefore the midpoint rule is exact for ${\cos kx}$ with ${k=0,\dots, 2n-1}$ which is better than what the trapezoidal rule does. Perhaps more significantly, by identifying the given data points with function values at the midpoints of subintervals we stay away from the endpoints ${0,\pi}$ where the cosines are somewhat restricted by having to have zero slope.
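These orthogonality claims can be verified numerically; a sketch assuming NumPy, with ${n=9}$ midpoint samples:

```python
import numpy as np

n = 9
# midpoint rule sample points on [0, pi]: (2j-1)*pi/(2n) for j = 1, ..., n
x = (2 * np.arange(1, n + 1) - 1) * np.pi / (2 * n)

# rows of C are cos(kx) sampled at those points, k = 0, ..., n-1
C = np.cos(np.outer(np.arange(n), x))

# Gram matrix with respect to the (equal-weight) midpoint measure
G = C @ C.T
```

The Gram matrix comes out diagonal, with diagonal entries ${n}$ (for ${k=0}$) and ${n/2}$ otherwise.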

Let’s compare DCT-I and DCT-II on the same data set, ${y=(0, \sqrt{1}, \sqrt{2}, \dots, \sqrt{8})}$. There are 9 numbers here. Following DCT-I we place them at the sample points of the trapezoidal rule, and expand into cosines using the inner product with respect to ${\tau_n}$. Here is the plot of the resulting trigonometric polynomial: of course it interpolates the data.

But DCT-II does it better, despite having exactly the same cosine functions. The only change is that we use ${\mu_n}$ and so place the ${y}$-values along its support.

Less oscillation means the high-degree coefficients are smaller, and therefore easier to discard in order to compress information. For example, drop the last two coefficients in each expansion, keeping 7 numbers instead of 9. DCT-II clearly wins in accuracy then.

Okay, so the Midpoint rule is better, no surprise. After all, it’s in general about twice as accurate as the Trapezoidal rule. What about Simpson’s rule, would it lead to some super-efficient form of DCT? That is, why don’t we let ${\sigma_n}$ be the discrete measure that expresses Simpson’s rule and use the inner product ${\int fg\,d\sigma_n}$ for cosine expansion? Alas, Simpson’s rule on ${n}$ points is exact only for ${\cos kx}$ with ${k=0,\dots, n-2}$, which is substantially worse than either Trapezoidal or Midpoint rules. As a result, we don’t get enough orthogonal cosines with respect to ${\sigma_n}$ to have an orthogonal basis. Simpson’s rule has an advantage when dealing with algebraic polynomials, not with trigonometric ones.

Finally, the Python code used for the graphics. I did not use SciPy’s DCT method (which is, of course, more efficient) so as to keep the relation to numerical integration explicit in the code. The function np.trapz implements the trapezoidal rule, and the midpoint rule is just the summation of sampled values. In either case there is no need to worry about the factor ${dx}$, since it cancels when we divide one numerical integral by the other.

import numpy as np
import matplotlib.pyplot as plt
#   Setup
y = np.sqrt(np.arange(9))
c = np.zeros_like(y)
n = y.size
#   DCT-I, trapezoidal
x = np.arange(n)*np.pi/(n-1)
for k in range(n):
    c[k] = np.trapz(y*np.cos(k*x))/np.trapz(np.cos(k*x)**2)
t = np.linspace(0, np.pi, 500)
yy = np.sum(c*np.cos(np.arange(n)*t.reshape(-1, 1)), axis=1)
plt.plot(x, y, 'ro')
plt.plot(t, yy)
plt.show()
#   DCT-II, midpoint
x = np.arange(n)*np.pi/n + np.pi/(2*n)
for k in range(n):
    c[k] = np.sum(y*np.cos(k*x))/np.sum(np.cos(k*x)**2)
t = np.linspace(0, np.pi, 500)
yy = np.sum(c*np.cos(np.arange(n)*t.reshape(-1, 1)), axis=1)
plt.plot(x, y, 'ro')
plt.plot(t, yy)
plt.show()

## Experiments with the significance of autocorrelation

Given a sequence of numbers ${x_j}$ of length ${L}$, one may want to look for evidence of its periodic behavior. One way to do this is by computing the autocorrelation, the correlation of the sequence with a shift of itself. Here is one reasonable way to do so: for lag values ${\ell=1,\dots, \lfloor L/2 \rfloor}$, compute the correlation coefficient of ${(x_1,\dots, x_{L-\ell})}$ with ${(x_{\ell+1},\dots, x_L)}$. That the lag does not exceed ${L/2}$ ensures the entire sequence participates in the computation, so we are not making a conclusion about its periodicity after comparing a handful of terms at the beginning and the end. In other words, we are not going to detect periodicity if the period is more than half of the observed time interval.

Having obtained the correlation coefficients, pick one with the largest absolute value; call it R. How large does R have to be in order for us to conclude the correlation is not a fluke? The answer depends on the distribution of our data, but an experiment can be used to get some idea of likelihood of large R.

I picked ${x_j}$ independently from the standard normal distribution, and computed ${R}$ as above. After 5 million trials with a sequence of length 100, the distribution of R was as follows:

Based on this experiment, the probability of obtaining |R| greater than 0.5 is less than 0.0016. So, 0.5 is pretty solid evidence. The probability of ${|R| > 0.6}$ is two orders of magnitude less, etc. Also, |R| is unlikely to be very close to zero unless the data is structured in some strange way. Some kind of correlation ought to be present in the white noise.

Aside: it’s not easy to construct perfectly non-autocorrelated sequences for the above test. For length 5 an example is 1,2,3,2,3. Indeed, (1,2,3,2) is uncorrelated with (2,3,2,3) and (1,2,3) is uncorrelated with (3,2,3). For length 6 and more I can’t construct these without filling them with a bunch of zeros.
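A quick sketch verifying the length-5 example with NumPy's corrcoef:

```python
import numpy as np

x = np.array([1, 2, 3, 2, 3], dtype=float)

# lag 1: correlate (x_1, ..., x_4) with (x_2, ..., x_5)
r1 = np.corrcoef(x[:-1], x[1:])[0, 1]
# lag 2: correlate (x_1, x_2, x_3) with (x_3, x_4, x_5)
r2 = np.corrcoef(x[:-2], x[2:])[0, 1]
```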

Repeating the experiment with sequences of length 1000 shows a tighter distribution of R: now |R| is unlikely to be above 0.2. So, if a universal threshold is to be used here, we need to adjust R based on sequence length.

I did not look hard for statistical studies of this subject, resorting to an experiment. Experimentally obtained p-values are pretty consistent for the criterion ${L^{0.45}|R| > 4}$. The number of trials was not very large (10000) so there is some fluctuation, but the pattern is clear.

| Length ${L}$ | ${P(L^{0.45}\lvert R\rvert > 4)}$ |
|---|---|
| 100 | 0.002 |
| 300 | 0.0028 |
| 500 | 0.0022 |
| 700 | 0.0028 |
| 900 | 0.0034 |
| 1100 | 0.0036 |
| 1300 | 0.0039 |
| 1500 | 0.003 |
| 1700 | 0.003 |
| 1900 | 0.0042 |
| 2100 | 0.003 |
| 2300 | 0.0036 |
| 2500 | 0.0042 |
| 2700 | 0.0032 |
| 2900 | 0.0043 |
| 3100 | 0.0042 |
| 3300 | 0.0025 |
| 3500 | 0.0031 |
| 3700 | 0.0027 |
| 3900 | 0.0042 |

Naturally, all this depends on the assumption of independent normal variables.

And this is the approach I took to computing R in Python:

import numpy as np
n = 1000
x = np.random.normal(size=(n,))
# raw autocorrelations; entries past the center correspond to positive lags
acorr = np.correlate(x, x, mode='same')
# keep lags 1 through n/2 - 1, normalizing by the variance times the overlap length
acorr = acorr[n//2+1:]/(x.var()*np.arange(n-1, n//2, -1))
# R is the autocorrelation of largest absolute value
R = acorr[np.abs(acorr).argmax()]


## Relating integers by differences of reciprocals

Let’s say that two positive integers m, n are compatible if the difference of their reciprocals is the reciprocal of an integer: that is, mn/(m-n) is an integer. For example, 2 is compatible with 1, 3, 4, and 6 but not with 5. Compatibility is a symmetric relation, which we’ll denote ${m\sim n}$ even though it’s not an equivalence. Here is a chart of this relation, with red dots indicating compatibility.

### Extremes

A few natural questions arise, for any given ${n}$:

1. What is the greatest number compatible with ${n}$?
2. What is the smallest number compatible with ${n}$?
3. How many numbers are compatible with ${n}$?

Before answering them, let’s observe that ${m\sim n}$ if and only if ${|m-n|}$ is a product of two divisors of ${n}$. Indeed, since ${m \equiv n \pmod{m-n}}$, we have ${mn \equiv n^2 \pmod{m-n}}$, so ${m-n}$ divides ${mn}$ if and only if it divides ${n^2}$.

Of course, “a product of two divisors of ${n}$” is the same as “a divisor of ${n^2}$“. But it’s sometimes easier to think in terms of the former.
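The definition and the characterization are easy to test in code; a sketch, with a helper name (compatible) of my own choosing:

```python
def compatible(m, n):
    """m ~ n: the difference of reciprocals 1/n - 1/m is the reciprocal of an integer."""
    return m != n and (m * n) % (m - n) == 0

# 2 is compatible with 1, 3, 4, 6 but not 5
print([m for m in range(1, 10) if compatible(m, 2)])  # → [1, 3, 4, 6]

# equivalent characterization: |m - n| divides n^2
assert all(compatible(m, 7) == (m != 7 and 7**2 % abs(m - 7) == 0)
           for m in range(1, 60))
```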

Question 1 is now easy to answer: the greatest number compatible with ${n}$ is ${n+n^2 = n(n+1)}$.

But there is no such easy answer to Questions 2 and 3, because of the possibility of overshooting into negative territory when subtracting a divisor of ${n^2}$ from ${n}$. The answer to Question 2 is ${n-d}$ where ${d}$ is the greatest divisor of ${n^2}$ that is less than ${n}$. This is the OEIS sequence A063428, pictured below.

The lines are numbers with few divisors: for a prime p, the smallest compatible number is p-1, while for 2p it is p, etc.

The answer to Question 3 is: the number of distinct divisors of ${n^2}$, plus the number of such divisors that are less than ${n}$. This is the OEIS sequence A146564.

### Chains

Since consecutive integers are compatible, every number ${n}$ is a part of a compatible chain ${1\sim \cdots\sim n}$. How to build a short chain like this?

Strategy A: starting with ${n}$, take the smallest integer compatible with the previous one. This is sure to reach 1. But in general, this greedy algorithm is not optimal. For n=22 it yields 22, 11, 10, 5, 4, 2, 1 but there is a shorter chain: 22, 18, 6, 2, 1.

Strategy B: starting with ${1}$, write the greatest integer compatible with the previous one (which is ${k(k+1)}$ by the above). This initially results in the OEIS sequence A007018: 1, 2, 6, 42, 1806, … which is notable, among other things, for giving a constructive proof of the infinitude of primes, even with a (very weak) lower density bound. Eventually we have to stop adding ${k^2}$ to ${k}$ every time; so instead add the greatest divisor of ${k^2}$ such that the sum does not exceed ${n}$. For 22, this yields the optimal chain 1, 2, 6, 18, 22 stated above. But for 20 strategy B yields 1, 2, 6, 18, 20 while the shortest chain is 1, 2, 4, 20. Being greedy is not optimal, in either up or down direction.
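Strategy A is a few lines of Python (a sketch; the helper names are mine):

```python
def smallest_compatible(n):
    """n - d, where d is the greatest divisor of n^2 that is less than n."""
    d = max(d for d in range(1, n) if n * n % d == 0)
    return n - d

def chain_down(n):
    """Greedy Strategy A: repeatedly step down to the smallest compatible number."""
    chain = [n]
    while chain[-1] > 1:
        chain.append(smallest_compatible(chain[-1]))
    return chain

print(chain_down(22))  # → [22, 11, 10, 5, 4, 2, 1]
```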

Strategy C is not optimal either, but it is explicit and provides a simple upper bound on the length of a shortest chain. It uses the expansion of n in the factorial number system, which is the sum of factorials k! with coefficients not exceeding k. For example, ${67 = 2\cdot 4! + 3\cdot 3! + 0\cdot 2! + 1\cdot 1!}$, so its factorial representation is 2301.

If n is written as abcd in the factorial system, then the following is a compatible chain leading to n (possibly with repetitions), as is easy to check:

1, 10, 100, 1000, a000, ab00, abc0, abcd

In the example with 67, this chain is

1, 10, 100, 1000, 2000, 2300, 2300, 2301

in factorial notation, which converts to decimals as

1, 2, 6, 24, 48, 66, 66, 67

The repeated 66 (due to 0 in 2301) should be removed.
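Strategy C can be sketched as follows (helper names are mine); the chain for 67 comes out as in the text once consecutive repetitions are removed:

```python
def factorial_digits(n):
    """Digits of n in the factorial number system, most significant first."""
    digits, k = [], 2
    while n:
        digits.append(n % k)
        n //= k
        k += 1
    return digits[::-1]

def chain_c(n):
    """Strategy C: 1, 10, 100, ..., then fill in the factorial digits one at a time."""
    digits = factorial_digits(n)
    m = len(digits)
    factorials = [1]
    for i in range(2, m + 1):
        factorials.append(factorials[-1] * i)   # 1!, 2!, ..., m!
    chain = factorials[:]                       # 1, 10, 100, ... in factorial notation
    value = factorials[-1]
    for i, d in enumerate(digits):
        step = d - 1 if i == 0 else d           # the leading factorial is already in place
        value += step * factorials[m - 1 - i]
        chain.append(value)
    # remove consecutive duplicates caused by zero digits
    return [v for j, v in enumerate(chain) if j == 0 or v != chain[j - 1]]

print(chain_c(67))  # → [1, 2, 6, 24, 48, 66, 67]
```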

Thus, for an even number s, there is a chain of length at most s leading to any integer that is less than (s/2+1)!.

As a consequence, the smallest possible length of chain leading to n is ${O(\log n/\log \log n)}$. Is this the best possible O-estimate?

All of the strategies described above produce monotone chains, but the shortest chain is generally not monotone. For example, 17 can be reached by the non-monotone chain 1, 2, 6, 18, 17 of length 5 but any monotone chain will have length at least 6.

The smallest-chain-length sequence is 1, 2, 3, 3, 4, 3, 4, 4, 4, 4, 5, 4, 5, 5, 4, 5, 5, 4, 5, 4, 5, 5, 5, 4, 5, … which is not in OEIS.  Here is its plot:

The sequence 1, 2, 3, 5, 11, 29, 67, 283, 2467,… lists the smallest numbers that require a chain of given length — for example, 11 is the first number that requires a chain of length 5, etc. Not in OEIS; the rate of growth is unclear.

## Need for speed vs bounded position and acceleration

You are driving a car with maximal acceleration (and deceleration) A on a road that’s been blocked off in both directions (or, if you prefer, on the landing strip of an aircraft carrier). Let L be the length of the road available to you.
What is the maximal speed you can reach?

Besides A and L, the answer also depends on your mood: do you want to live, or are you willing to go out in a blaze of glory? In the latter case the answer is obvious: position the car at one end of the interval, and put the pedal to the metal. The car will cover the distance L within the time ${\sqrt{2L/A}}$, reaching the speed ${v=\sqrt{2AL}}$ at the end. In the former scenario one has to switch to the brake pedal midway through the distance, so the maximal speed will be attained at half the length, ${\sqrt{AL}}$.

Rephrased in mathematical terms: if ${f}$ is a twice differentiable function and ${M_k = \sup|f^{(k)}|}$ for ${k=0,1,2}$, then ${M_1^2 \le 4M_0M_2}$ if ${f}$ is defined on a half-infinite interval, and ${M_1^2 \le 2M_0M_2}$ if the domain of ${f}$ is the entire line. To connect the notation, just put ${L=2M_0}$ and ${A=M_2}$ in the previous paragraph… and I guess some proof other than “this is obvious” is called for, but it’s not hard to find one: this is problem 5.15 in Rudin’s Principles of Mathematical Analysis.
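One standard proof of the half-line inequality runs through Taylor's theorem with remainder: for ${x}$ in the domain and ${h>0}$, ${f(x+h) = f(x) + hf'(x) + \frac{h^2}{2}f''(\xi)}$ for some intermediate ${\xi}$, hence

```latex
\[
  |f'(x)| \le \frac{|f(x+h)-f(x)|}{h} + \frac{h}{2}\,|f''(\xi)|
          \le \frac{2M_0}{h} + \frac{h M_2}{2}.
\]
% Minimizing the right-hand side over h (take h = 2\sqrt{M_0/M_2}) gives
\[
  M_1 \le 2\sqrt{M_0 M_2}, \qquad \text{i.e.} \quad M_1^2 \le 4 M_0 M_2.
\]
```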

Perhaps more interesting is to study the problem in higher dimensions: one could be driving in a parking lot of some shape, etc. Let’s normalize the maximal acceleration as 1, keeping in mind it’s a vector. Given a set E, let S(E) be the square of maximal speed attainable by a unit-acceleration vehicle which stays in E indefinitely. Also let U(E) be the square of maximal speed one can attain while crashing out of bounds after the record is set. Squaring makes these quantities scale linearly with the size of the set. Both are monotone with respect to set inclusion. And we know what they are for an interval of length L: namely, ${S = L}$ and ${U=2L}$, so that gives some lower bounds for sets that contain a line interval.

When E is a circle of radius 1, the best we can do is to drive along it with constant speed 1; then the centripetal acceleration is also 1. Any higher speed will exceed the allowable acceleration in the normal direction, never mind the tangential one. So, for a circle both S and U are equal to its radius.

On the other hand, if E is a disk of radius R, then driving along its diameter is better: it gives ${S\ge 2R}$ and ${U\ge 4R}$.

Some questions:

1. If E is a convex set of diameter D, is it true that ${S(E) = D}$ and ${U(E) = 2D}$?
2. Is it true that ${U\le 2S}$ in general?
3. How to express S and U for a smooth closed curve in terms of its curvature? They are not necessarily equal (like they are for a circle): consider thin ellipses converging to a line segment, for which S and U approach the corresponding values for that segment.

The answer to Question 1 is yes. Consider the orthogonal projection of E, and of a trajectory it contains, onto some line L. This does not increase the diameter or the acceleration; thus, the one-dimensional result implies that the projection of velocity vector onto L does not exceed ${\sqrt{D}}$ (or ${\sqrt{2D}}$ for the crashing-out version). Since L was arbitrary, it follows that ${S(E) \le D}$ and ${U(E) \le 2D}$. These upper bounds hold for general sets, not only convex ones. But when E is convex, we get matching lower bounds by considering the longest segment contained in E.

I don’t have answers to questions 2 and 3.

## Lightness, hyperspace, and lower oscillation bounds

When does a map ${f\colon X\to Y}$ admit a lower “anti-continuity” bound like ${d_Y(f(a),f(b))\ge \lambda(d_X(a,b))}$ for some function ${\lambda\colon (0,\infty)\to (0, \infty)}$ and for all ${a\ne b}$? The answer is easy: ${f}$ must be injective and its inverse must be uniformly continuous. End of story.

But recalling what happened with diameters of connected sets last time, let’s focus on the inequality ${\textrm{diam}\, f(E)\ge \lambda (\textrm{diam}\, E)}$ for connected subsets ${E\subset X}$. If such a ${\lambda}$ exists, the map f has the LOB property, for “lower oscillation bound” (oscillation being the diameter of the image). The LOB property does not require ${f}$ to be injective. On the real line, ${f(x)=|x|}$ satisfies it with ${\lambda(\delta)=\delta/2}$: since it simply folds the line, the worst that can happen to the diameter of an interval is to be halved. Similarly, ${f(x)=x^2}$ admits a lower oscillation bound ${\lambda(\delta) = (\delta/2)^2}$. This one decays faster than linearly at 0, indicating some amount of squeezing going on. One may check that every polynomial has the LOB property as well.
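The claimed bound for ${f(x)=x^2}$ can be checked numerically over a grid of intervals (a small self-contained sketch):

```python
def osc_square(a, b):
    """Diameter of the image of the interval [a, b] under x -> x^2."""
    hi = max(a * a, b * b)
    lo = 0.0 if a < 0 < b else min(a * a, b * b)
    return hi - lo

# verify diam f([a, b]) >= ((b - a)/2)^2 over a grid of intervals in [-5, 5]
ok = all(
    osc_square(a / 10, b / 10) >= ((b - a) / 20) ** 2 - 1e-12
    for a in range(-50, 50)
    for b in range(a + 1, 51)
)
```

Equality is attained on symmetric intervals ${[-\delta/2, \delta/2]}$, whose images are ${[0, \delta^2/4]}$.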

On the other hand, the exponential function ${f(x)=e^x}$ does not have the LOB property, since ${\textrm{diam}\, f([x,x+1])}$ tends to ${0}$ as ${x\to-\infty}$. No surprise there; we know from the relation of continuity and uniform continuity that things like that happen on a non-compact domain.

Also, a function that is constant on some nontrivial connected set will obviously fail LOB. In topology, a mapping is called light if the preimage of every point is totally disconnected, which is exactly the same as not being constant on any nontrivial connected set. So, lightness is necessary for LOB, but not sufficient as ${e^x}$ shows.

Theorem 1: Every continuous light map ${f\colon X\to Y}$ with compact domain ${X}$ admits a lower oscillation bound.

Proof. Suppose not. Then there exists ${\epsilon>0}$ and a sequence of connected subsets ${E_n\subset X}$ such that ${\textrm{diam}\, E_n\ge \epsilon}$ and ${\textrm{diam}\, f(E_n)\to 0}$. We can assume ${E_n}$ compact, otherwise replace it with its closure ${\overline{E_n}}$ which we can because ${f(\overline{E_n})\subset \overline{f(E_n)}}$.

The space of nonempty compact subsets of ${X}$ is called the hyperspace of ${X}$; when equipped with the Hausdorff metric, it becomes a compact metric space itself. Pass to a convergent subsequence, still denoted ${\{E_n\}}$. Its limit ${E}$ has diameter at least ${\epsilon}$, because diameter is a continuous function on the hyperspace. Finally, using the uniform continuity of ${f}$ we get ${\textrm{diam}\, f(E) = \lim \textrm{diam}\, f(E_n) = 0}$, contradicting the lightness of ${f}$. ${\quad \Box}$

Here is another example to demonstrate the importance of compactness (not just boundedness) and continuity: on the domain ${X = \{(x,y)\colon 0 < x < 1, 0 < y < 1\}}$ define ${f(x,y)=(x,xy)}$. This is a homeomorphism, the inverse being ${(u,v)\mapsto (u, v/u)}$. Yet it fails LOB because the image of line segment ${\{x\}\times (0,1)}$ has diameter ${x}$, which can be arbitrarily close to 0. So, the lack of compactness hurts. Extending ${f}$ to the closed square in a discontinuous way, say by letting it be the identity map on the boundary, we see that continuity is also needed, although it’s slightly non-intuitive that one needs continuity (essentially an upper oscillation bound) to estimate oscillation from below.

All that said, on a bounded interval of real line we need neither compactness nor continuity.

Theorem 2: If ${I\subset \mathbb R}$ is a bounded interval, then every light map ${f\colon I\to Y}$ admits a lower oscillation bound.

Proof. Following the proof of Theorem 1, consider a sequence of intervals ${(a_n, b_n)}$ such that ${b_n-a_n\ge \epsilon}$ and ${\textrm{diam}\, f((a_n,b_n))\to 0}$. There is no loss of generality in considering open intervals, since it can only make the diameter of the image smaller. Also WLOG, suppose ${a_n\to a}$ and ${b_n\to b}$; this uses the boundedness of ${I}$. Consider a nontrivial closed interval ${[c,d]\subset (a,b)}$. For all sufficiently large ${n}$ we have ${[c,d]\subset (a_n,b_n)}$, which implies ${\textrm{diam}\, f([c,d])\le \textrm{diam}\, f((a_n,b_n))\to 0}$. Thus ${f}$ is constant on ${[c,d]}$, a contradiction. ${\quad \Box}$

The property that distinguishes the real line here is that its nontrivial connected sets have nonempty interior. The same works on the circle and on various tree-like spaces, but fails for spaces that do not look one-dimensional.