Classifying words: a tiny example of SVD truncation

Given a bunch of words, specifically the names of divisions of plants and bacteria, I’m going to use a truncated Singular Value Decomposition to separate bacteria from plants. This isn’t a novel or challenging task, but I like the small size of the example. A similar kind of example is classifying a collection of text fragments by keywords, but that requires a lot more setup.

Here are 33 words to classify: acidobacteria, actinobacteria, anthocerotophyta, aquificae, bacteroidetes, bryophyta, charophyta, chlamydiae, chloroflexi, chlorophyta, chrysiogenetes, cyanobacteria, cycadophyta, deferribacteres, deinococcus-thermus, dictyoglomi, firmicutes, fusobacteria, gemmatimonadetes, ginkgophyta, gnetophyta, lycopodiophyta, magnoliophyta, marchantiophyta, nitrospirae, pinophyta, proteobacteria, pteridophyta, spirochaetes, synergistetes, tenericutes, thermodesulfobacteria, thermotogae.

As is, the task is too easy: we can recognize the -phyta ending in the names of plant divisions. Let’s jumble the letters within each word:

aaabccdeiiort, aaabcceiinortt, aacehhnoooprttty, aacefiiqu, abcdeeeiorstt, abhoprtyy, aachhoprty, aacdehilmy, cefhilloorx, achhlooprty, ceeeghinorssty, aaabcceinorty, aaccdhoptyy, abcdeeeefirrrst, cccdeehimnoorsstuu, cdgiilmooty, cefiimrstu, aabcefiorstu, aadeeegimmmnostt, agghiknopty, aeghnoptty, acdhilooopptyy, aaghilmnoopty, aaachhimnoprtty, aeiinoprrst, ahinoppty, aabceeiooprrtt, adehiopprtty, aceehioprsst, eeeginrssstty, ceeeinrsttu, aabcdeeefhilmoorrsttu, aeeghmoortt

Not so obvious anymore, is it? Recalling the -phyta ending, we may want to focus on the presence of letter y, which is not so common otherwise. Indeed, the count of y letters is a decent predictor. On the following plot, green asterisks are plants and red ones are bacteria; the vertical axis is the count of letter Y in each word.

Count of Y in each word

However, the simple count fails to classify several words: having 1 letter Y may or may not mean a plant. Instead, let’s consider the entire matrix {A} of letter counts (here it is in a spreadsheet: 33 rows, one for each word; 26 columns, one for each letter.) So far, we looked at its 25th column in isolation from the rest of the matrix. Truncated SVD uncovers the relations between columns that are not obvious but express patterns such as the presence of letters p,h,t,a along with y. Specifically, write {A = UDV^T} with {U,V} unitary and {D} diagonal. Replace all entries of {D}, except the four largest ones, by zeros. The result is a rank-4 diagonal matrix {D_4}. The product {A_4 = UD_4V^T} is a rank-4 matrix, which keeps some of the essential patterns in {A} but de-emphasizes the accidental.
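The spreadsheet itself is not reproduced here. As a minimal sketch, the letter-count matrix could be rebuilt from the word list along these lines. The variable names `words`, `w` and `yCount` are mine, and dropping the hyphen in deinococcus-thermus is an assumption that matches the jumbled list above.

```matlab
% Sketch: rebuild the 33-by-26 letter-count matrix A from the word list.
words = {'acidobacteria', 'actinobacteria', 'anthocerotophyta', ...
    'aquificae', 'bacteroidetes', 'bryophyta', 'charophyta', ...
    'chlamydiae', 'chloroflexi', 'chlorophyta', 'chrysiogenetes', ...
    'cyanobacteria', 'cycadophyta', 'deferribacteres', ...
    'deinococcus-thermus', 'dictyoglomi', 'firmicutes', 'fusobacteria', ...
    'gemmatimonadetes', 'ginkgophyta', 'gnetophyta', 'lycopodiophyta', ...
    'magnoliophyta', 'marchantiophyta', 'nitrospirae', 'pinophyta', ...
    'proteobacteria', 'pteridophyta', 'spirochaetes', 'synergistetes', ...
    'tenericutes', 'thermodesulfobacteria', 'thermotogae'};
A = zeros(numel(words), 26);
for i = 1:numel(words)
    w = lower(words{i});
    w = w(w >= 'a' & w <= 'z');      % keep letters only (drops the hyphen)
    % 'a' has character code 97, so it lands in column 1, 'y' in column 25
    A(i, :) = accumarray(double(w(:)) - 96, 1, [26 1])';
end
yCount = A(:, 25)';                  % the column examined so far
```

Since only counts are stored, jumbling the letters within a word does not change its row of `A`.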

The entries of {A_4} are no longer integers. Here is a color-coded plot of its 25th column, which still somehow corresponds to letter Y but takes into account the other letters with which it appears.

The same column of the letter-count matrix, after truncation

Plants are now cleanly separated from bacteria. Plots made in MATLAB as follows:

[U, D, V] = svd(A);
D4 = D .* (D >= D(4, 4));
A4 = U * D4 * V';
plants = (A4(:, 25) > 0.8);
bacteria = (A4(:, 25) <= 0.8);
  % the rest is output
words = 1:33;
hold on
plot(words(plants), A4(plants, 25), 'g*');
plot(words(bacteria), A4(bacteria, 25), 'r*');
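A quick sanity check of the truncation step can be run on a small stand-in matrix instead of the letter counts. Here `magic(6)` and the choice `k = 2` are my assumptions, picked only to keep the sketch self-contained. The truncated matrix should have rank `k`, and by the Eckart-Young theorem its spectral-norm distance from the original equals the first discarded singular value.

```matlab
A = magic(6);                 % stand-in data, not the letter-count matrix
k = 2;                        % number of singular values to keep
[U, D, V] = svd(A);
Dk = D .* (D >= D(k, k));     % same masking trick as in the code above
Ak = U * Dk * V';
% An explicit tolerance guards against round-off in the product U*Dk*V'.
fprintf('rank of truncation = %d\n', rank(Ak, 1e-8));
fprintf('error = %.6f, next singular value = %.6f\n', ...
    norm(A - Ak), D(k+1, k+1));
```

The mask `D >= D(k, k)` keeps exactly `k` values only when the k-th and (k+1)-th singular values differ; with a tie it would keep more.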

Lattice points in a disk

The closed disk of radius {72} has area {\pi \cdot 72^2\approx 16286}. But it happens to contain only {16241} points with integer coordinates. Here is a picture of one quarter of this disk.

Unusually few lattice points in the disk
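The count can be reproduced by brute force. The sketch below is mine (the name `countPts` is hypothetical); it tests every integer point of the square {[-n,n]^2} against the disk and prints the comparison with the area. The post reports {16241} points for radius {72}.

```matlab
% Count integer points (x, y) with x^2 + y^2 <= n^2 by brute force.
% Implicit expansion (row + column) needs MATLAB R2016b or later, or Octave.
countPts = @(n) nnz((-n:n).^2 + ((-n:n)').^2 <= n^2);
n = 72;
pts = countPts(n);
fprintf('points = %d, area = %.1f, difference = %.1f\n', ...
    pts, pi*n^2, pi*n^2 - pts);
```

The grid has only {(2n+1)^2} points, so this is fine for a single radius; for a long range of radii it pays to use the symmetry of the disk and count one quadrant only.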

What makes the radius {72} somewhat notable is that the discrepancy of {45} between the area and the number of integer points is unusually large. Here is the plot of the absolute value of the difference |area-points| as a function of integer radius {n}. The curve in red is {y = 1.858 n^{0.745}}, which is an experimentally found upper bound for the discrepancy in this range of {n}.

Radius from 1 to 100

On the scale up to {n=1000}, the upper bound is {4.902 n^{0.548}}, and the radii bumping against this bound are {449} and {893}. The exponent {0.548} begins to resemble the conjectural {0.5+\epsilon} in the Gauss circle problem.

Radius up to 1000

Finally, over the interval {1000\le n\le 3000} the upper bound comes out as {6.607n^{0.517}}. The exponent {0.517} is closer still to the conjectural {0.5+\epsilon}.

Radius from 1000 to 3000

This little numerical experiment in Matlab involved using the convex hull function convhull on log-log scale. The function identifies the vertices of the convex hull, which is a polygon. I pick the side of the polygon lying over the midpoint of the range; this yields a linear upper bound for the set of points. On the normal scale, this line becomes a power function. Matlab code is given below; it’s divided into three logical steps.

Find the difference between area and the number of lattice points
a = 1000; 
b = 3000;
R = a:b;
E = [];
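for n = a:b
    [X,Y] = meshgrid(1:n, 1:n);
    % 4*n + 1 points on the axes, plus four copies of the open quadrant
    pts = 4*n + 1 + 4*nnz(X.^2+Y.^2<=n^2);
    % max(1, ...) keeps log(E) nonnegative for the log-log convex hull
    E = [E, max(1,abs(pts - pi*n^2))];
end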
Pick a suitable side of log-log convex hull
ix = convhull(log(R), log(E));
k = numel(ix);
while (R(ix(k))<(a+b)/2)
    k = k-1;
end
Plot the result and output the parameters of the upper bound
R1 = R(ix(k)); E1 = E(ix(k));
R2 = R(ix(k+1)); E2 = E(ix(k+1));
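% from here on, a and b are reused as the parameters of the bound a*R^b
b = log(E1/E2)/log(R1/R2);   % slope of the hull side on log-log scale
a = E1/R1^b;                 % chosen so that a*R1^b equals E1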
plot(R, E, '.');
hold on
plot(R, a*R.^b , 'r-');
axis tight
hold off
fprintf('a = %.3f, b = %.3f\n', a, b);
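To see why a line on the log-log plot becomes a power function, note that the line through {(\log R_1, \log E_1)} and {(\log R_2, \log E_2)} has the form {\log E = \log c + p\log R}, which is {E = cR^p}. Here is a small check of the formulas for the slope and the coefficient on two made-up points; the numbers and the names `c` and `p` are hypothetical, not taken from the experiment.

```matlab
% Two made-up points (hypothetical data, not from the experiment).
R1 = 10;   E1 = 20;
R2 = 1000; E2 = 200;
% Slope of the line through (log R1, log E1) and (log R2, log E2).
p = log(E1/E2) / log(R1/R2);
% Intercept: log c = log E1 - p*log R1, that is, c = E1 / R1^p.
c = E1 / R1^p;
fprintf('c = %.4f, p = %.4f\n', c, p);
% The power function passes through both points by construction.
fprintf('c*R1^p = %.4f, c*R2^p = %.4f\n', c*R1^p, c*R2^p);
```

Here the radius grows by a factor of 100 while the value grows by a factor of 10, so the exponent is one half.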