A necklace of tears

The problem is: given a set of objects drawn from several groups, put all of them in a row in a “uniform” way. Whatever that means.

For example, suppose we have 21 gemstones: 9 red, 5 blue, 3 green, 2 cyan, 1 magenta and 1 yellow. How can we place them on a string to make a reasonable-looking necklace? The criterion is subjective, but we can probably agree that

Arrangement #1

looks better than

Arrangement #2

(referring to the uniformity of distributions, not the aesthetics of color.)

The approach I took was to repeatedly add the “most underrepresented” gem, defined by the maximal difference between the projected frequency of appearance (e.g., 5/21 for blue) and the frequency of appearance so far in the string (with the convention 0/0 = 0). Here is this algorithm in Python — not in Ruby, which would be more topical.

count = {'r': 9, 'b': 5, 'g': 3, 'c': 2, 'm': 1, 'y': 1}
length = sum(count.values())
necklace = ''
while len(necklace) < length:
    deficit = {}
    for char in count:
        # projected frequency minus frequency so far (0/0 treated as 0)
        deficit[char] = count[char]/length - (necklace.count(char)/len(necklace) if necklace else 0)
    necklace += max(deficit, key=deficit.get)

The output, “rbgcryrbrmgrbrcbrgrbr”, is what the first image in this post represents. The second image is the result of an ill-fated attempt to replace the difference with a ratio when determining the under-representation.

I initially thought that the two unique gems (yellow and magenta) would end up next to each other, but this hasn’t happened: after one of them is added, the frequency of the more common gems drops, which brings them back into play for a while. Still, the left half of the string is noticeably more diverse than the right half. It would be better if the two unique gems landed in roughly symmetric positions, and more generally if the algorithm showed no preference between the left-to-right and right-to-left directions.

Perhaps the new character should be added to the string either on the right or on the left, in alternating fashion. That should make things nice and symmetric, right?
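Here is a minimal sketch of that variant (my reconstruction, reusing count and length from above; the post does not show the code for this experiment):

necklace = ''
append_right = True
while len(necklace) < length:
    deficit = {char: count[char]/length - (necklace.count(char)/len(necklace) if necklace else 0)
               for char in count}
    gem = max(deficit, key=deficit.get)
    # alternate the side on which the next gem is strung
    necklace = necklace + gem if append_right else gem + necklace
    append_right = not append_right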


Alternating concatenation

The search continues…

Update: Rahul suggested in a comment to adjust the deficit computation to

deficit[char] = count[char]/length - necklace.count(char)/(len(necklace) + 1)

This has some advantages, but on the other hand the two unique gems (magenta and yellow) end up next to each other, which is something I wanted to avoid.

Dividing by len(necklace) + 1

Bankers’ rounding and floating-point arithmetic

Bankers’ rounding means preferring even digits in edge cases: both 3.5 and 4.5 are rounded to 4. When a random amount of money is rounded to the nearest dollar in this way, the expected value added or subtracted is zero. Indeed: 0 cents don’t get rounded, 1 through 49 cents get rounded down, 51 through 99 get rounded up (matching the amounts rounded down), so 50 has to be split evenly between the two directions.
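A quick numerical check of the zero-expected-error claim (my sketch, not from the original post), averaging the rounding error over whole-dollar parts 0–9 and all cent values 0–99:

# only c = 50 lands on a rounding boundary, and d + 0.5 is exact in binary,
# so the float arithmetic below does not disturb the tie-breaking
errors = [round(d + c/100) - (d + c/100) for d in range(10) for c in range(100)]
print(sum(errors))    # ≈ 0, up to floating-point noise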

Python follows this convention: rounding the numbers 1/2, 3/2, 5/2, … via

[round((2*k+1)/2) for k in range(10)]

results in [0, 2, 2, 4, 4, 6, 6, 8, 8, 10].

However, when rounding 0.05 = 1/20, 0.15 = 3/20, … to one digit after the decimal point, we get a surprise:

[round((2*k+1)/20, 1) for k in range(10)]

returns [0.1, 0.1, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8, 0.8, 0.9]. An irregularly spaced array, in which odd digits win 6 to 4…

This is because numbers like 0.05 cannot be precisely represented in the double-precision format used for floating-point numbers in Python (and other languages): the denominator of 1/20 is not a power of 2. And so, 0.05 gets represented by 3602879701896397/2^56. Since 0.05 · 2^56 = 3602879701896396.8, this representation is greater than the real number 0.05. And this pushes the rounding routine over the edge, toward 0.1 rather than 0.0.
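One can inspect the stored value exactly with the standard fractions module (a quick check of mine, not from the original post):

from fractions import Fraction

print(Fraction(0.05))                      # 3602879701896397/72057594037927936, i.e. n/2^56
print(Fraction(0.05) > Fraction(1, 20))    # True: the stored value exceeds 1/20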

Out of the numbers tested above, only 0.25 = 1/4 and 0.75 = 3/4 get a precise binary representation and are therefore rounded predictably. For the other 8 the direction of rounding has nothing to do with bankers… oddly enough, 6 out of these 8 are rounded toward an odd digit.

This isn’t the end of surprises, though. Recalling that numerical manipulations on arrays are best done with the NumPy library, we may want to import it and try

numpy.around([(2*k+1)/20 for k in range(10)], 1)

The output ought to make any banker happy: [0., 0.2, 0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 0.8, 1.]

Why? NumPy takes a different approach to rounding: when asked to round to one digit after the decimal point, it multiplies the numbers by 10, then rounds to the nearest integer, and divides by 10. Multiplication by 10 restores sanity: 0.05 * 10 = 0.5 is represented exactly and gets rounded to 0; similarly for the others.
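For illustration, here is a hand-rolled imitation of this scale-round-unscale approach (around1 is a hypothetical name of mine; NumPy’s actual implementation differs in details):

def around1(x, digits=1):
    scale = 10.0 ** digits
    return round(x * scale) / scale    # Python’s round also ties to even

print([around1((2*k + 1)/20) for k in range(10)])
# [0.0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 0.8, 1.0]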

This algorithm isn’t perfect, however. When rounding 0.005, 0.015, …, 0.995 to two decimal digits, it misplaces 0.545 (to 0.55) and 0.575 (to 0.57). The other 98 numbers are rounded as expected; tested with

numpy.around([(2*k+1)/200 for k in range(100)], 2)

Out of the 1000 numbers of the form (2k+1)/2000, k = 0, 1, …, 999, there are six deviations from the bankers’ rule. They all appear together, at increments of 0.002, from 0.5015 to 0.5115.

Is there a rounding algorithm that follows bankers’ convention for all (not too long) finite decimal expansions?
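One route, at least when the inputs are available in exact decimal form, is the standard decimal module, which sidesteps binary floating point entirely (a sketch, not a complete answer to the question above):

from decimal import Decimal, ROUND_HALF_EVEN

vals = [Decimal(2*k + 1) / Decimal(20) for k in range(10)]    # exactly 0.05, 0.15, ..., 0.95
print([v.quantize(Decimal('0.1'), rounding=ROUND_HALF_EVEN) for v in vals])
# [Decimal('0.0'), Decimal('0.2'), Decimal('0.2'), Decimal('0.4'), ..., Decimal('1.0')]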

Autogenerated numbers 2

Although this is a follow-up to an earlier post, it can be read independently.

Repeatedly squaring the number 5, we get the following.

5^1 =                        5
5^2 =                       25
5^4 =                      625
5^8 =                   390625
5^16 =            152587890625
5^32 = 23283064365386962890625

There is a pattern here, which becomes more evident if the table is truncated to a triangle:

5^1 =                        5
5^2 =                       25
5^4 =                      625
5^8 =                  ...0625
5^16 =                ...90625
5^32 =               ...890625
5^64 =              ...2890625
5^128 =            ...12890625
5^256 =           ...212890625

What other digits, besides 5, have this property? None in the decimal system (ignoring the trivial 0 and 1). But the same pattern is shown by 3 in base 6 (below, the exponent is still in base 10, for clarity).

3^1 =                        3
3^2 =                       13
3^4 =                      213
3^8 =                  ...0213
3^16 =                ...50213
3^32 =               ...350213
3^64 =              ...1350213

and so on.

Here is a proof that the pattern appears whenever the base is 2 mod 4 (excluding 2 itself) and the starting digit is half the base.

Let {x} be the starting digit, and {b=2x} the base. The claim is that {x^{2^k}\equiv x^{2^{k-1}} \bmod b^k} for all {k=1,2,\dots}.

Base of induction: {x^2-x = x(x-1)} is divisible by {x}, and also by {2}: since {b\equiv 2\bmod 4}, the digit {x=b/2} is odd, so {x-1} is even. As {x} is odd, divisibility by both {2} and {x} gives divisibility by {2x=b}.

Step of induction: Suppose {x^{2^k} - x^{2^{k-1}}} is divisible by {b^k}. Then
{x^{2^{k+1}} - x^{2^{k}} = (x^{2^k} - x^{2^{k-1}})(x^{2^k} + x^{2^{k-1}})}
The second factor is divisible by {x} and also by {2}, being the sum of two odd numbers. So, it is divisible by {b}. Since the first factor is divisible by {b^k}, the claim follows: {x^{2^{k+1}} - x^{2^{k}} } is divisible by {b^{k+1}}.

So that was easy. The part I can’t prove is the converse. Numerical evidence strongly suggests that {b\equiv 2\bmod 4} with {x=b/2} is also a necessary condition for the triangular pattern to appear.
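The kind of evidence I mean can be gathered with a short scan like this (a sketch; stabilizes is my ad hoc helper):

def stabilizes(b, x, depth=10):
    # does the digit x keep reproducing its trailing digits under squaring in base b?
    mod, cur = b, x
    for _ in range(depth):
        if (cur * cur - cur) % mod != 0:    # the next square must end with the current digits
            return False
        mod *= b
        cur = cur * cur % mod
    return True

for b in range(3, 40):
    hits = [x for x in range(2, b) if stabilizes(b, x)]
    if hits:
        print(b, hits)    # observed output: only bases b = 2 mod 4, each with hits == [b // 2]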

Autogenerated numbers

An integer is automorphic if its square ends with that integer, like 76^2 = 5776. This notion is clearly base-dependent. Ignoring the trivial 0 and 1, in base 10 there are two such numbers for any number of digits; they comprise two infinite sequences that can be thought of informally as “infinite” solutions of x^2 = x, and formally as solutions of this equation in the ring of 10-adic numbers. They are …56259918212890625 and …740081787109376, both recorded in the OEIS, as Wolfram MathWorld points out.

There is a difference between these two sequences. The former naturally grows from the single digit 5, by repeatedly squaring and keeping one more digit than we had: 5^2 = 25, 25^2 = 625, 625^2 = 390625, 0625^2 = 390625, 90625^2 = 8212890625, … (One should not get greedy and keep all the digits after squaring: for example, the leading 3 in 390625 is not the digit we want.) The process described above does not work for 6, because 6^2 = 36 rather than 76. For lack of a better term, I’ll call the infinite numbers such as …56259918212890625 autogenerated.

According to Wikipedia, the number of b-adic solutions of x^2 = x is 2^d where d is the number of distinct prime factors of b. (Another way to put it: there are as many solutions as square-free divisors of the base.) Two of the solutions are trivial: 0 and 1. So, infinite automorphic numbers exist in every base that is not a prime power, and only in those.

Autogenerated numbers are rarer. For example, there are none in base 12. Indeed, the two viable seed digits are 4 and 9: in base 12, 4^2 is 14 and 9^2 is 69. But 14 is not automorphic: 14^2 = 194. With 69 we get a little further: 69^2 = 3969. But then 969 is not automorphic, and the process stops.

Computer search suggests that autogenerated numbers exist if and only if the base is 2 mod 4 (and is not 2). Furthermore, there is exactly one autogenerated number for each such base, and its seed is half the base. Some examples, with 20 digits shown in each base:

  • … 21314 15515 22213 50213 in base 6
  • … 92256 25991 82128 90625 in base 10
  • … 8676a 8cba5 7337a a0c37 in base 14
  • … aea80 1g4c9 68da4 e1249 in base 18
  • … 179aa 1f0e7 igdi8 d185b in base 22
  • … b9ofn odpbn 31mm3 h1g6d in base 26
  • … f46rg 1jirj r6f3f e1q7f in base 30
  • … g2khh vlas5 k7h4h i248h in base 34

And this is how they were computed with Python: a brute-force search works fine, because unsuccessful candidates drop out very quickly.

def baseN(num, b, numerals="0123456789abcdefghijklmnopqrstuvwxyz"):
    # represent num in base b; from http://stackoverflow.com/a/2267428
    return ((num == 0) and numerals[0]) or (
        baseN(num // b, b, numerals).lstrip(numerals[0]) + numerals[num % b])

target = 100    # number of digits to compute
for b in range(3, 36):       # base
    for x in range(2, b):    # seed digit
        mod = b
        # keep squaring while the last digits keep reproducing themselves
        while x < b**(target - 1) and (x**2 - x) % mod == 0:
            mod = mod * b
            x = x * x % mod
        if x >= b**(target - 1):    # the candidate survived to the target length
            print('Autogenerated number {} in base {}'.format(baseN(x, b), b))

I don’t have a proof of the “2 mod 4” claim, but it may well have a proof somewhere already… According to Dave Richeson, the autogenerating property of 5 in base 10 was discussed in a blog post by Foxmaths!, but the blog is private now. It is also stated in the OEIS entry for the sequence as “a(n+1) = a(n)^2 mod 10^(n+1). – Eric M. Schmidt, Jul 28 2012”.

Convex up to a reparametrization

Given a function {f\colon \mathbb R\rightarrow \mathbb R}, we have several ways of checking if it is convex. But how to recognize the functions that are convex up to a reparametrization, meaning that there is a homeomorphism {h} of the real line such that the composition {g = f\circ h} is convex?

Let’s start by listing some necessary conditions for {f} to have this property.

1. Since convex functions are continuous, {f} must be continuous.

2. Convex functions are quasiconvex, meaning that each sublevel set {U_t = \{x\colon f(x) < t\}} is convex (in this case, an open interval). This property is preserved under reparametrization. So, {f} must be quasiconvex as well.

3. Nonconstant convex functions are unbounded above. So, {f} must have this property too.

Is this all? No. For example, let {f(x) = -x} if {x< 0} and {f(x)=\arctan x} if {x\ge 0}. This is a continuous quasiconvex function that is unbounded above. But it can’t be made convex by a reparametrization, because as {x\rightarrow +\infty}, it increases while staying bounded. This leads to another condition:

4. If {f} is not monotone, then its limits as {x\rightarrow+\infty} and {x\rightarrow -\infty} must both be {+\infty}.

And still we are not done. The function

{f(x) = \begin{cases} x+1, \quad &x<-1 \\ 0, \quad & -1 \le x\le 1 \\ x-1, \quad & x>1 \end{cases} }

satisfies all of 1–4. But it has a flat spot (the interval {[-1,1]} where it is constant) on which its minimum is not attained: the infimum of {f} is {-\infty}. This can’t happen to convex functions: if a convex function is constant on an interval, then (by virtue of being above its tangent lines) it has to attain its minimum there. This leads to the 5th condition:

5. {f} is either strictly monotone, or attains its minimum on some closed interval {I_{\min}(f)} (possibly unbounded or a single point). In the latter case, it is strictly monotone on each component of {\mathbb R\setminus I_{\min}(f)}.

Finally, we have enough: conditions 1–5 together are necessary and sufficient for {f} to be convex up to a reparametrization. The proof of sufficiency turns out to be easy. A continuous strictly monotone function is a homeomorphism between its domain and range. The strictly monotone (and unbounded) parts of {f} give us homeomorphisms such that the composition of {f} with them is affine. Consequently, there is a homeomorphism {h} of the real line such that {f\circ h} is one of the following five functions:

  1. {g(x) \equiv 1} if {I_{\min}(f) = \mathbb R}
  2. {g(x) = \max(x,0)} if {I_{\min}(f)} is unbounded, but not all of {\mathbb R}
  3. {g(x) = \max(|x|-1, 0)} if {I_{\min}(f)} is a bounded nondegenerate interval
  4. {g(x) =|x|} if {I_{\min}(f)} is a single point
  5. {g(x) =x} if {I_{\min}(f)} is empty
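For instance (a concrete illustration of case 4, not part of the original argument): {f(x)=\sqrt{|x|}} is not convex, but it is continuous, quasiconvex, unbounded above, and attains its minimum exactly on {I_{\min}(f)=\{0\}}; the homeomorphism {h(x)=x^2\,\mathrm{sign}\,x} gives {f(h(x))=\sqrt{x^2}=|x|}.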

Incidentally, we also see there are five equivalence classes of convex functions on {\mathbb R}, the equivalence being any homeomorphic change of variables.

What functions on {\mathbb R^n} are convex up to reparametrization? Unclear. (Related Math SE question)


String matching in Google Sheets/Docs/Apps Script

Here I summarize various ways of matching string patterns in selected Google products: Sheets, Docs, and Apps Script. They are ordered from least to most powerful.

  1. String comparison in Sheets, =if(A1 = "cat", "is", "is not"). Perhaps surprisingly, this is not a literal comparison but a case-insensitive match: the condition also holds true if A1 has “Cat” or “CAT”. This makes one wonder how to actually test strings for equality; I’ll return to this later.
  2. Wildcards in Sheets functions like countif and sumif: ? matches any single character, * matches any number of any characters.
    Example: =countif(A1:A9, "a?b*")
  3. like comparison in query, the Sheets function using the Google Visualization API Query Language. This is similar to the previous item: underscore _ matches any single character, the percent sign % any number of any characters.
    Example: =query(A1:B9, "select A where B like 'a_b%'")
  4. findText method of objects of class Text in Apps Script, extending Google Docs. The documentation says “A subset of the JavaScript regular expression features are not fully supported, such as capture groups and mode modifiers.” In particular, it does not support lookarounds.
    Example:
    var body = DocumentApp.getActiveDocument().getBody().editAsText();
    var found = body.findText("ar{3,}gh").getElement().asText().getText();

    Yeah… I find the Google Docs API clumsy, verbose and generally frustrating, unlike the Google Sheets API.

  5. regexmatch function in Sheets and its relatives regexextract and regexreplace. Uses the RE2 regex library, which is performance-oriented but somewhat limited; for example, it does not support lookarounds. It does support capture groups.
    Example: =regexmatch(A1, "^h[ao]t+\b")
  6. matches comparison in query, the Sheets function using the Google Visualization API Query Language. Supports regular expressions (matching an entire string), including lookaheads but apparently not lookbehinds. It is not clear what exactly is supported.
    Example: =query(A1:B9, "select A where B matches 'a.b(?!c).*'")
  7. JavaScript regular expression methods, which are fully supported in Apps Script.

So what about the literal, case-sensitive string comparison in Google Sheets? Apparently, the way to do it is to use regexmatch… Example: =regexmatch(A1, "^cat$") returns True when A1 contains “cat” in lower case.

Weak convergence in metric spaces

Weak convergence in a Hilbert space is defined as pointwise convergence of the functionals associated to the elements of the space. Specifically, {x_n\rightarrow x} weakly if the associated functionals {f_n(y) = \langle x_n, y \rangle} converge to {f(y) = \langle x, y \rangle} pointwise.

How could this idea be extended to metric spaces without linear structure? To begin with, {f_n(y)} could be replaced with {\|x_n-y\|^2-\|x_n\|^2}, which equals {\|y\|^2 - 2f_n(y)}; so the pointwise convergence of these functions is equivalent to the original weak convergence. Also, {\|x_n\|^2} here could be {\|x_n-z\|^2} for any fixed {z}; to avoid introducing another variable, let’s use {x_1} as the fixed reference point {z}. Now we have a metric space version of the weak convergence: the functions
{f_n(y) = d(x_n,y)^2 - d(x_n,x_1)^2}
must converge pointwise to
{f(y) = d(x,y)^2 - d(x,x_1)^2}

The triangle inequality shows that strong convergence {d(x_n,x)\rightarrow 0} implies weak convergence, as expected. And the converse is not necessarily true, as the example of a Hilbert space shows.

Aside: the above is not the only way to define weak convergence in metric spaces. Another approach is to think of {\langle x_n , y\rangle} in terms of projection onto a line through {y}. A metric space version of this concept is the nearest-point projection onto a geodesic curve. This is a useful approach, but it is only viable for metric spaces with additional properties (geodesic, nonpositive curvature).

Also, both of these approaches take the Hilbert space case as the point of departure, and do not necessarily capture weak convergence in other normed spaces.

Let’s try this out in {\ell^1} with the standard basis sequence {e_n}. Here {f_n(y) = \|e_n-y\|^2 - 4 \rightarrow (\|y\| + 1)^2 - 4}, since {\|e_n-y\| = |1-y_n| + \sum_{k\ne n}|y_k| \rightarrow 1 + \|y\|} and {\|e_n-e_1\| = 2} for {n\ge 2}. Is there an element {x\in \ell^1} such that

{\|x-y\|^2 - \|x-e_1\|^2= (\|y\|+1)^2 - 4} for all {y}?

Considering both sides as functions of one variable {y_n}, for a fixed {n}, shows that {x_n=0} for every {n}, because the left-hand side is non-differentiable at {y_n=x_n} while the right-hand side is non-differentiable at {y_n=0}. Then the desired identity simplifies to {\|y\|^2 - 1 = (\|y\|+1)^2 - 4}, which is false. Oh well, that sequence wasn’t weakly convergent to begin with: by Schur’s theorem, every weakly convergent sequence in {\ell^1} also converges strongly.
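As a quick numerical illustration of the pointwise limit {f_n(y)\rightarrow(\|y\|+1)^2-4} stated above (a sketch of mine; the truncation to 1000 coordinates and the particular {y} are arbitrary):

import numpy as np

N = 1000
y = np.zeros(N)
y[:3] = [0.5, 0.25, 0.125]    # a finitely supported y with l^1 norm 0.875
for n in [1, 2, 10, 100]:     # 0-based positions; the post's e_1 is index 0
    e_n = np.zeros(N)
    e_n[n] = 1.0
    print(n, np.linalg.norm(e_n - y, 1)**2 - 4)    # f_n(y), using d(e_n, e_1) = 2
# settles at (0.875 + 1)^2 - 4 = -0.484375 once n leaves the support of y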

This example also shows that not every bounded sequence in a metric space has a weakly convergent subsequence, unlike the way it works in Hilbert spaces.