Classifying words: a tiny example of SVD truncation

Given a bunch of words, specifically the names of divisions of plants and bacteria, I’m going to use a truncated Singular Value Decomposition to separate bacteria from plants. This isn’t a novel or challenging task, but I like the small size of the example. A similar type of example is classifying a bunch of text fragments by keywords, but that requires a lot more setup.

Here are 33 words to classify: acidobacteria, actinobacteria, anthocerotophyta, aquificae, bacteroidetes, bryophyta, charophyta, chlamydiae, chloroflexi, chlorophyta, chrysiogenetes, cyanobacteria, cycadophyta, deferribacteres, deinococcus-thermus, dictyoglomi, firmicutes, fusobacteria, gemmatimonadetes, ginkgophyta, gnetophyta, lycopodiophyta, magnoliophyta, marchantiophyta, nitrospirae, pinophyta, proteobacteria, pteridophyta, spirochaetes, synergistetes, tenericutes, thermodesulfobacteria, thermotogae.

As is, the task is too easy: we can recognize the -phyta ending in the names of plant divisions. Let’s jumble the letters within each word:

aaabccdeiiort, aaabcceiinortt, aacehhnoooprttty, aacefiiqu, abcdeeeiorstt, abhoprtyy, aachhoprty, aacdehilmy, cefhilloorx, achhlooprty, ceeeghinorssty, aaabcceinorty, aaccdhoptyy, abcdeeeefirrrst, cccdeehimnoorsstuu, cdgiilmooty, cefiimrstu, aabcefiorstu, aadeeegimmmnostt, agghiknopty, aeghnoptty, acdhilooopptyy, aaghilmnoopty, aaachhimnoprtty, aeiinoprrst, ahinoppty, aabceeiooprrtt, adehiopprtty, aceehioprsst, eeeginrssstty, ceeeinrsttu, aabcdeeefhilmoorrsttu, aeeghmoortt

Not so obvious anymore, is it? Recalling the -phyta ending, we may want to focus on the presence of the letter y, which is not so common otherwise. Indeed, the count of y letters is a decent predictor: in the following plot, green asterisks are plants and red ones are bacteria; the vertical axis is the count of the letter Y in each word.

[Image: Count of Y in each word]

However, the simple count fails to classify several words: having one letter Y may or may not mean a plant. Instead, let’s consider the entire matrix {A} of letter counts: 33 rows, one for each word; 26 columns, one for each letter. So far, we looked at its 25th column in isolation from the rest of the matrix. Truncated SVD uncovers relations between columns that are not obvious individually but express patterns, such as the letters p, h, t, a appearing along with y. Specifically, write {A = UDV^T} with {U,V} orthogonal and {D} diagonal, the singular values on the diagonal of {D} arranged in decreasing order. Replace all entries of {D}, except the four largest ones, by zeros. The result is a diagonal matrix {D_4} of rank 4. The product {A_4 = UD_4V^T} is a rank-4 matrix, which keeps some of the essential patterns in {A} but de-emphasizes the accidental ones.

The entries of {A_4} are no longer integers. Here is a color-coded plot of its 25th column, which still roughly corresponds to the letter Y but takes into account the other letters with which it appears.

[Image: The same column of the letter-count matrix, after truncation]

Plants are now cleanly separated from bacteria. The plots were made in MATLAB as follows:


A=[3,1,2,1,1,0,0,0,2,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0;3,1,2,0,1,0,0,0,2,0,0,0,0,1,1,0,0,1,0,2,0,0,0,0,0,0;2,0,1,0,1,0,0,2,0,0,0,0,0,1,3,1,0,1,0,3,0,0,0,0,1,0;2,0,1,0,1,1,0,0,2,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0;1,1,1,1,3,0,0,0,1,0,0,0,0,0,1,0,0,1,1,2,0,0,0,0,0,0;1,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,2,0;2,0,1,0,0,0,0,2,0,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,1,0;2,0,1,1,1,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0;0,0,1,0,1,1,0,1,1,0,0,2,0,0,2,0,0,1,0,0,0,0,0,1,0,0;1,0,1,0,0,0,0,2,0,0,0,1,0,0,2,1,0,1,0,1,0,0,0,0,1,0;0,0,1,0,3,0,1,1,1,0,0,0,0,1,1,0,0,1,2,1,0,0,0,0,1,0;3,1,2,0,1,0,0,0,1,0,0,0,0,1,1,0,0,1,0,1,0,0,0,0,1,0;2,0,2,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,2,0;1,1,1,1,4,1,0,0,1,0,0,0,0,0,0,0,0,3,1,1,0,0,0,0,0,0;0,0,3,1,2,0,0,1,1,0,0,0,1,1,2,0,0,1,2,1,2,0,0,0,0,0;0,0,1,1,0,0,1,0,2,0,0,1,1,0,2,0,0,0,0,1,0,0,0,0,1,0;0,0,1,0,1,1,0,0,2,0,0,0,1,0,0,0,0,1,1,1,1,0,0,0,0,0;2,1,1,0,1,1,0,0,1,0,0,0,0,0,1,0,0,1,1,1,1,0,0,0,0,0;2,0,0,1,3,0,1,0,1,0,0,0,3,1,1,0,0,0,1,2,0,0,0,0,0,0;1,0,0,0,0,0,2,1,1,0,1,0,0,1,1,1,0,0,0,1,0,0,0,0,1,0;1,0,0,0,1,0,1,1,0,0,0,0,0,1,1,1,0,0,0,2,0,0,0,0,1,0;1,0,1,1,0,0,0,1,1,0,0,1,0,0,3,2,0,0,0,1,0,0,0,0,2,0;2,0,0,0,0,0,1,1,1,0,0,1,1,1,2,1,0,0,0,1,0,0,0,0,1,0;3,0,1,0,0,0,0,2,1,0,0,0,1,1,1,1,0,1,0,2,0,0,0,0,1,0;1,0,0,0,1,0,0,0,2,0,0,0,0,1,1,1,0,2,1,1,0,0,0,0,0,0;1,0,0,0,0,0,0,1,1,0,0,0,0,1,1,2,0,0,0,1,0,0,0,0,1,0;2,1,1,0,2,0,0,0,1,0,0,0,0,0,2,1,0,2,0,2,0,0,0,0,0,0;1,0,0,1,1,0,0,1,1,0,0,0,0,0,1,2,0,1,0,2,0,0,0,0,1,0;1,0,1,0,2,0,0,1,1,0,0,0,0,0,1,1,0,1,2,1,0,0,0,0,0,0;0,0,0,0,3,0,1,0,1,0,0,0,0,1,0,0,0,1,3,2,0,0,0,0,1,0;0,0,1,0,3,0,0,0,1,0,0,0,0,1,0,0,0,1,1,2,1,0,0,0,0,0;2,1,1,1,3,1,0,1,1,0,0,1,1,0,2,0,0,2,1,2,1,0,0,0,0,0;1,0,0,0,2,0,1,1,0,0,0,0,1,0,2,0,0,1,0,2,0,0,0,0,0,0];
[U, D, V] = svd(A);
D4 = D .* (D >= D(4, 4));      % zero out all but the four largest singular values
A4 = U * D4 * V';              % rank-4 approximation of A
plants = (A4(:, 25) > 0.8);    % column 25 corresponds to the letter y
bacteria = (A4(:, 25) <= 0.8);
  % the rest is output
words = 1:33;
hold on
plot(words(plants), A4(plants, 25), 'g*');
plot(words(bacteria), A4(bacteria, 25), 'r*');
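For readers without MATLAB, roughly the same computation can be sketched in Python with NumPy (an addition for illustration, not part of the original post). The letter-count matrix is rebuilt from the word list instead of being typed in, and the 0.8 threshold is carried over from the MATLAB code above.

```python
import numpy as np

words = ("acidobacteria actinobacteria anthocerotophyta aquificae bacteroidetes "
         "bryophyta charophyta chlamydiae chloroflexi chlorophyta chrysiogenetes "
         "cyanobacteria cycadophyta deferribacteres deinococcus-thermus dictyoglomi "
         "firmicutes fusobacteria gemmatimonadetes ginkgophyta gnetophyta "
         "lycopodiophyta magnoliophyta marchantiophyta nitrospirae pinophyta "
         "proteobacteria pteridophyta spirochaetes synergistetes tenericutes "
         "thermodesulfobacteria thermotogae").split()

# Build the 33-by-26 letter-count matrix (the hyphen in deinococcus-thermus is skipped)
A = np.zeros((len(words), 26))
for i, w in enumerate(words):
    for ch in w:
        if ch.isalpha():
            A[i, ord(ch) - ord('a')] += 1

# Truncated SVD: keep only the four largest singular values
U, s, Vt = np.linalg.svd(A, full_matrices=False)
s[4:] = 0
A4 = (U * s) @ Vt

# Column 24 is the letter y (0-based indexing); same threshold as in the MATLAB code
plants = A4[:, 24] > 0.8
```

If everything matches the MATLAB run above, the thresholded column reproduces the plant/bacteria split from the plot.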

Bi-Lipschitz equivalence and fixed points

Two metric spaces {X,Y} are bi-Lipschitz equivalent if there is a bijection {f:X\rightarrow Y} and a constant {L} such that
{L^{-1}d_X(a,b) \le d_Y(f(a),f(b)) \le L\,d_X(a,b)} for all {a,b\in X}. In other words, {f} must be Lipschitz with a Lipschitz inverse; this is asking more than just a homeomorphism (continuous with continuous inverse).

For example, the graph {y=\sqrt{|x|}} is not bi-Lipschitz equivalent to a line. The cusp is an obstruction:

[Image: Curve with a cusp]

Indeed, the points {A,B = (\pm \epsilon,\sqrt{\epsilon})} are separated by {C=(0,0)} and lie at distance {\sim \sqrt{\epsilon}} from it. So, the images {f(A), f(B)} would lie on opposite sides of {f(C)} on the line, at distance {\sim \sqrt{\epsilon}} from it. But then the distance between {f(A)} and {f(B)} would be of order {\sqrt{\epsilon}}, contradicting the Lipschitz bound {d(f(A),f(B)) \le L\, d(A,B) \sim \epsilon} when {\epsilon} is small.
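The obstruction can be seen numerically: the ratio {d(A,C)/d(A,B)} blows up as {\epsilon \to 0}, while a bi-Lipschitz map into a line would keep the image of this ratio bounded. A quick Python check (a sketch, using Euclidean distances between the points defined above):

```python
import math

def ratio(eps):
    # A = (eps, sqrt(eps)) and C = (0, 0): Euclidean distance between them,
    # divided by d(A, B) = 2*eps
    d_AC = math.hypot(eps, math.sqrt(eps))
    return d_AC / (2 * eps)

# each step shrinks eps by 100, so the ratio ~ 1/(2*sqrt(eps)) grows about 10x
ratios = [ratio(10.0 ** (-k)) for k in (2, 4, 6)]
```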

There are other pairs of spaces which are homeomorphic but not bi-Lipschitz equivalent, like a circle and the Koch snowflake.


What is the simplest example of two spaces that are not known to be bi-Lipschitz equivalent? Probably: the unit ball {B} and the unit sphere {S} in an infinite-dimensional Hilbert space.

In finite dimensions these are not even homeomorphic: e.g., the sphere has a self-homeomorphism without fixed points, namely {R(x) = -x }, while the ball has no such thing due to Brouwer’s fixed point theorem. But in the infinite-dimensional case {B} and {S} are homeomorphic. Moreover, there exists a Lipschitz map {F\colon B\rightarrow B} such that the displacement function {\|F(x)-x\|} is bounded below by a positive constant: no hope for anything like Brouwer’s fixed point theorem.

It’s hard to see what an obstruction to bi-Lipschitz equivalence could be: there are no cusps, nothing is fractal-like, dimensions sort of agree (both infinite), topologically they are the same… Here is a possible direction of attack (from Geometric Nonlinear Functional Analysis by Benyamini and Lindenstrauss). If a bi-Lipschitz map {f\colon B\rightarrow S} exists, then the composition {f^{-1}\circ R\circ f} is a Lipschitz involution of {B} with displacement bounded from below. So, if such an involution could be ruled out, the problem would be answered in the negative. As far as I know, it remains open.

Irrational sunflowers

A neat way to visualize a real number {\alpha} is to make a sunflower out of it. This is an arrangement of points with polar angles {2 \pi \alpha k} and polar radii {\sqrt{k}} (so that the concentric disks around the origin get the number of points proportional to their area). The prototypical sunflower has {\alpha=(\sqrt{5}+1)/2}, the golden ratio. This is about the most uniform arrangement of points within a disk that one can get.

[Image: Golden ratio sunflower]

But nothing stops us from using other numbers. The square root of 5 is not nearly as uniform, forming distinct spirals.

[Image: Square root of 5 sunflower]

The number {e} begins with spirals, but quickly turns into something more uniform.

[Image: Euler’s sunflower]

The number {\pi} has stronger spirals: seven of them, due to the approximation {\pi\approx 22/7}.

[Image: Pi sunflower]

Of course, if {\pi} were actually {22/7}, the arrangement would have rays instead of spirals:

[Image: Rational sunflower: 22/7]

What if we used more points? The previous pictures have 500 points; here is {\pi} with {3000}. The new pattern has 113 rays: {\pi\approx 355/113}.
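The exceptionally good rational approximations behind these patterns can be pulled out with Python’s standard fractions module (a side note for illustration, not part of the original post):

```python
import math
from fractions import Fraction

# best rational approximations to pi with a bounded denominator
seven_rays = Fraction(math.pi).limit_denominator(10)    # 22/7, behind the 7 spirals
many_rays = Fraction(math.pi).limit_denominator(200)    # 355/113, behind the 113 rays
```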

[Image: Pi sunflower with 3000 points]

Apéry’s constant, after beginning with five spirals, refuses to form rays or spirals again even with 3000 points.

[Image: Apéry’s constant, 3000 points]

The images were made with Scilab as follows, with an offset of 1/2 under the square root to prevent the center from sticking out too much.

n = 500
alpha = (sqrt(5)+1)/2
r = sqrt([1:n]-1/2)  
theta = 2*%pi*alpha*[1:n]
plot(r.*cos(theta), r.*sin(theta), '*');
set(gca(), "isoview", "on")
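An equivalent version in Python (a sketch added for illustration; matplotlib is assumed for the actual plotting and is left out of the computation):

```python
import math

def sunflower(alpha, n=500):
    """Points with polar angle 2*pi*alpha*k and radius sqrt(k - 1/2), k = 1..n."""
    pts = []
    for k in range(1, n + 1):
        r = math.sqrt(k - 0.5)  # the 1/2 offset keeps the center from sticking out
        t = 2 * math.pi * alpha * k
        pts.append((r * math.cos(t), r * math.sin(t)))
    return pts

golden = sunflower((math.sqrt(5) + 1) / 2)
```

To draw it: `x, y = zip(*golden); plt.plot(x, y, '*'); plt.axis('equal')`, the last call playing the role of Scilab’s isoview.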

2015 syllabus

Reviewing Calculus VII material: one post from each month of 2015.

Richardson Extrapolation and Midpoint Rule

The Richardson extrapolation is a simple yet extremely useful idea. Suppose we compute some quantity {Q} (such as an integral) where the error of computation depends on a positive parameter {h} (such as step size). Let’s write {Q=A(h)+E(h)} where {A(h)} is the result of computation and {E(h)} is the error. Often, one can observe either numerically or theoretically that {E(h)} is proportional to some power of {h}, say {h^k}.

The quantity {A(h)-2^k A(h/2)} is an approximation to {(1-2^k)Q} in which the main source of error is cancelled out. Thus, writing

\displaystyle Q\approx \frac{2^kA(h/2)-A(h)}{2^k-1}

should give us a much better approximation than either {A(h)} or {A(h/2)}.

It is not quite intuitive that one can improve the accuracy of {A(h/2)} by mixing it with a less accurate approximation {A(h)}.


For a concrete example, let’s apply the trapezoidal rule to the integral {\int_{-1}^1 e^x\,dx=2\sinh 1} with step size {h=2}. The rule gives

\displaystyle A(2) = \frac{2}{2}(f(-1)+f(1)) = 2\cosh 1

which is quite wrong: the error is {2e^{-1}\approx 0.74}. Little surprise, considering the geometry of this approximation:

[Image: Trapezoidal approximation to exp(x)]

With the smaller step we get

\displaystyle A(1) = \frac{1}{2}(f(-1)+2f(0)+f(1)) = 1+\cosh 1

with the error {\approx 0.19}. The rule being of second order, this error is about {4} times smaller.

[Image: Half-step trapezoidal approximation]

The Richardson extrapolation gives

\displaystyle \frac{4A(1)-A(2)}{4-1} = \frac{f(-1)+4f(0)+f(1)}{3} = \frac{4+2\cosh 1}{3}

reducing the error by a factor of about {16}: it’s now only {0.012}. And this did not require any additional evaluations of the integrand.
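These error values are easy to verify numerically; a short Python check of the computation above:

```python
import math

f = math.exp
Q = 2 * math.sinh(1)                # exact value of the integral

A2 = f(-1) + f(1)                   # trapezoidal rule with h = 2, equals 2*cosh(1)
A1 = (f(-1) + 2 * f(0) + f(1)) / 2  # trapezoidal rule with h = 1
R = (4 * A1 - A2) / 3               # Richardson extrapolation with k = 2

errors = [abs(v - Q) for v in (A2, A1, R)]   # about 0.74, 0.19, 0.012
```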

The formula {\frac{f(-1)+4f(0)+f(1)}{3}} may look familiar: it’s nothing but Simpson’s rule. But the above derivation is much easier than what is typically shown in a calculus textbook: fitting parabolas to the graph of {f} and integrating those.


All this is nice. But what if we started with the Midpoint rule? Let’s find out, following the example above. With {h=2},

\displaystyle A(2) = 2f(0) = 2

which is off by {0.35} (the Midpoint rule is generally twice as accurate as Trapezoidal). Then

\displaystyle A(1) = f(-1/2)+f(1/2) = 2\cosh(1/2)

with error {\approx 0.1}. Extrapolate:

\displaystyle \frac{4A(1)-A(2)}{4-1} = \frac{4f(-1/2)-2f(0)+4f(1/2)}{3} = \frac{8\cosh(1/2) - 2}{3}

and the error drops to {0.010}, slightly less than for Simpson’s rule.

So, the extrapolated-midpoint (EM) rule appears to be slightly more accurate than Simpson’s, with an equal number of evaluations. Why isn’t it in textbooks alongside Simpson’s rule, then?


A few factors make the EM rule impractical.

1. There is a negative coefficient, that of {f(0)}. This means that the rule can give a negative result when integrating a positive function! For example, {\int_{-1}^1 \exp(-9x^2)\,dx} evaluates to {-0.39} with the EM rule. Simpson’s rule, like pretty much any practical quadrature rule, preserves positivity.

2. The sample points for EM are less convenient than for Simpson’s rule.

3. The gain in accuracy isn’t that great. The reason that Midpoint outperforms Trapezoidal by a factor of {2} has to do with how these rules approximate {\int_{-1}^1 x^2\,dx=\frac23}. Midpoint gives {0}, Trapezoidal gives {2}; the former is twice as accurate. To compare Simpson’s and EM rules, we should consider {\int_{-1}^1 x^4\,dx=\frac25}, since both rules are of the {4}th order of accuracy: they evaluate cubic polynomials exactly. The results are {1/6} for EM and {2/3} for Simpson’s: these are nearly equidistant from the correct value {2/5}. The difference in accuracy is less than {15\%}.
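Points 1 and 3 can be checked in a few lines of Python (a sketch; the EM rule on [-1, 1] is as derived above):

```python
import math

def em(f):
    # extrapolated-midpoint (EM) rule on [-1, 1]
    return (4 * f(-0.5) - 2 * f(0) + 4 * f(0.5)) / 3

# point 1: a positive integrand with a negative EM value
neg = em(lambda x: math.exp(-9 * x * x))    # about -0.39

# point 3: both rules applied to x^4, to compare against the exact value 2/5
em_x4 = em(lambda x: x ** 4)                # 1/6
simpson_x4 = (1 + 4 * 0 + 1) / 3            # 2/3
```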


Despite being impractical, the EM rule has an insight to offer. By moving two evaluation points from {\pm 1} to {\pm 1/2}, we made the coefficient of {f(0)} go from positive {4/3} to negative {-2/3}. (The coefficients are determined by the requirement to evaluate {\int_{-1}^1 x^2\,dx} exactly.) So, there must be an intermediate position at which the coefficient of {f(0)} is zero, and we don’t need to compute {f(0)} at all. This idea leads to the two-point Gaussian quadrature

\displaystyle \int_{-1}^1 f(x)\,dx \approx f(-1/\sqrt{3})+f(1/\sqrt{3})

which on the example {\int_{-1}^1 e^x\,dx} beats both EM and Simpson’s rules in accuracy, while using fewer function evaluations.
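Comparing the three rules on this integral in Python (a sketch) confirms the ordering of the errors:

```python
import math

f = math.exp
Q = 2 * math.sinh(1)   # exact value of the integral of e^x over [-1, 1]

simpson = (f(-1) + 4 * f(0) + f(1)) / 3
em = (4 * f(-0.5) - 2 * f(0) + 4 * f(0.5)) / 3
gauss = f(-1 / math.sqrt(3)) + f(1 / math.sqrt(3))   # two-point Gaussian rule

errors = {rule: abs(v - Q) for rule, v in
          [("simpson", simpson), ("em", em), ("gauss", gauss)]}
```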


After writing the above, I found that the Extrapolated Midpoint rule has a name: it is Milne’s rule, a member of the family of open Newton-Cotes formulas.

Poor man’s discrete cosine transform

The discrete cosine transform can be understood as a trigonometric interpolation problem. Given some data, like {y = (3,1,4,1,5,9)}, I want to find the coefficients {A_0,\dots,A_5} such that

\displaystyle \sum_{k=0}^5 A_k\cos kx_j = y_j,\quad j= 0,\dots,5

with {x_j=\pi j/5}. (This is different from the unitary scaling preferred in image processing.) Matlab’s Signal Processing Toolbox and the “signal” package of Octave-Forge both have a dct command. But the former is expensive and the latter is not entirely painless to install. Let’s work with the standard tool fft, readily available in Matlab, Scilab, and Octave.

[Image: Points to interpolate]

The full Fourier series reduces to cosines when the function is even. Therefore, we need to extend {y} by an even reflection. One might imagine such an extension as

\displaystyle (3,1,4,1,5,9,9,5,1,4,1,3)

but this is incorrect. Thinking of the six {y} values as the samples of a function at {0,\pi/5,2\pi/5,\dots,\pi} helps: the even reflection across {\pi} should not duplicate the boundary value at {\pi}. Also, {0} gets reflected to {2\pi}, which is the same as {0} due to periodicity, so this value also should not be repeated. The appropriate reflection is

\displaystyle (3,1,4,1,5,9,5,1,4,1)

In Matlab terms,

y = [3 1 4 1 5 9]
z = [y, y(end-1:-1:2)]
w = fft(z)

The output is real, as desired. It also has the same kind of symmetry as the input.

\displaystyle (34, -10.618, 7.618, -8.382, 5.382, 8, 5.382, -8.382, 7.618, -10.618)

The term {34} is simply the sum of the {z}-values, but the constant term in a Fourier expansion is the mean value. So, division by the data size ({10}) is needed to get the coefficients. Now that we have them, it seems that the interpolating function should be

\displaystyle 3.4 -1.0618\cos x+0.7618 \cos 2x -\dots -1.0618\cos 9x

but this is incorrect. Although this function does pass through the given points,

[Image: Interpolation, 1st attempt]

it wiggles too much due to high frequency terms such as {\cos 9x}. The aliasing comes into play: at the multiples of {\pi/5}, the function {\cos 9x} is the same as {\cos x}.

[Image: Aliasing: at the sample points marked in red, the functions agree]

Similarly, {\cos 8x} should be replaced by {\cos 2x} in the interpolating function, etc. This results in a much smoother interpolant, which is a linear combination of cosines up to {\cos 5x}.

[Image: Correct cosine expansion in green]

The final answer is

\displaystyle y= 3.4 -2.1236\cos x + 1.5236\cos 2x -1.6764\cos 3x +1.0764\cos 4x + 0.8\cos 5x

Here is the complete Matlab code for this process of reflection, transformation, and subsequent “folding” of the coefficients.

y = [3 1 4 1 5 9]
z = [y, y(end-1:-1:2)]
w = real(fft(z))/length(z)
a = [w(1), 2*w(2:end/2), w(end/2+1)]

Taking the real part of the FFT makes sure that nothing imaginary slips in through numerical round-off errors.
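The same computation can be sketched in Python with NumPy (numpy.fft standing in for Matlab’s fft), together with a check that the folded coefficients actually interpolate the data at the sample points {x_j = \pi j/5}:

```python
import math
import numpy as np

y = np.array([3, 1, 4, 1, 5, 9], dtype=float)
z = np.concatenate([y, y[-2:0:-1]])   # even reflection, endpoints not repeated
w = np.real(np.fft.fft(z)) / len(z)
a = np.concatenate([[w[0]], 2 * w[1:len(z) // 2], [w[len(z) // 2]]])

# evaluate the cosine interpolant at the sample points
x = math.pi * np.arange(6) / 5
interp = sum(a[k] * np.cos(k * x) for k in range(6))
```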

Maintaining restyled Google Calendar

RESTYLEgc is a pretty old (2008-11) MIT-licensed project by Brian Gibson, which enables, among other things, replacing the default stylesheet of embedded Google Calendar with a custom one. One can see it in action on the front page of Stony Brook math department.

Due to various changes on Google’s side made since 2011, the version available from the Google Code link above no longer works. So I put up a GitHub repo with slightly modified files; only the two PHP scripts and the CSS file are included, the optional resources are not.

Keeping up with the changes in Google URL structure

Line 120 of restylegc.php changed from

$url = "https://www.google.com/calendar/embed?" . $_SERVER['QUERY_STRING'];

to

$url = "https://calendar.google.com/calendar/embed?" . $_SERVER['QUERY_STRING'];

Line 19 of restylegc-js.php changed from

$url = "https://www.google.com/calendar/" . $_SERVER['QUERY_STRING'];

to

$url = "https:" . $_SERVER['QUERY_STRING'];

Line 42 of restylegc-js.php changed from

$replacement = '"https://www.google.com/calendar';

to

$replacement = '"https://calendar.google.com/calendar';

Better wrapping of text in narrow calendars

Line 1181 of restylegc.css changed from

  white-space:nowrap;

to

  white-space:pre-wrap;

Line 1263 of restylegc.css changed from

  white-space:nowrap;

to

  white-space:normal;

Also (line 1323) added

white-space:normal;

to the rule

.ie6 .agenda .event-title

although in 2015, IE6 isn’t quite as big a deal anymore.

Tangentially related to better space use: changed font-weight:bold; to font-weight:normal; in line 1263, in the rule for .agenda .event-summary-expanded