Given a bunch of words, specifically the names of divisions of plants and bacteria, I’m going to use a truncated Singular Value Decomposition to separate bacteria from plants. This isn’t a novel or challenging task, but I like the small size of the example. A similar kind of example is classifying a bunch of text fragments by keywords, but that requires a lot more setup.
Here are 33 words to classify: acidobacteria, actinobacteria, anthocerotophyta, aquificae, bacteroidetes, bryophyta, charophyta, chlamydiae, chloroflexi, chlorophyta, chrysiogenetes, cyanobacteria, cycadophyta, deferribacteres, deinococcus-thermus, dictyoglomi, firmicutes, fusobacteria, gemmatimonadetes, ginkgophyta, gnetophyta, lycopodiophyta, magnoliophyta, marchantiophyta, nitrospirae, pinophyta, proteobacteria, pteridophyta, spirochaetes, synergistetes, tenericutes, thermodesulfobacteria, thermotogae.
As is, the task is too easy: we can recognize the -phyta ending in the names of plant divisions. Let’s jumble the letters within each word:
aaabccdeiiort, aaabcceiinortt, aacehhnoooprttty, aacefiiqu, abcdeeeiorstt, abhoprtyy, aachhoprty, aacdehilmy, cefhilloorx, achhlooprty, ceeeghinorssty, aaabcceinorty, aaccdhoptyy, abcdeeeefirrrst, cccdeehimnoorsstuu, cdgiilmooty, cefiimrstu, aabcefiorstu, aadeeegimmmnostt, agghiknopty, aeghnoptty, acdhilooopptyy, aaghilmnoopty, aaachhimnoprtty, aeiinoprrst, ahinoppty, aabceeiooprrtt, adehiopprtty, aceehioprsst, eeeginrssstty, ceeeinrsttu, aabcdeeefhilmoorrsttu, aeeghmoortt
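The jumbled forms above look like the letters of each word sorted alphabetically, with the hyphen dropped. A minimal Python sketch of that reading follows; the function name `jumble` is made up for illustration, and this is not the author's code.

```python
def jumble(word):
    """Return the letters of `word` in alphabetical order, dropping non-letters."""
    return "".join(sorted(ch for ch in word.lower() if ch.isalpha()))


# A few of the 33 names, including the hyphenated one.
for name in ["acidobacteria", "ginkgophyta", "deinococcus-thermus"]:
    print(name, "->", jumble(name))
```

Sorting is just one convenient way to scramble: any permutation of the letters leaves the letter counts unchanged, and the counts are all that the rest of the post uses.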
Not so obvious anymore, is it? Recalling the -phyta ending, we may want to focus on the presence of the letter y, which is not so common otherwise. Indeed, the count of y letters is a decent predictor: on the following plot, green asterisks are plants and red are bacteria, and the vertical axis is the count of the letter Y in each word.
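To see how far the count of y alone goes, here is a small Python sketch that groups the 33 words by their number of y letters. It assumes a word is a plant exactly when it ends in -phyta, as noted earlier; the names `WORDS`, `is_plant`, and `by_count` are chosen for this illustration only.

```python
from collections import defaultdict

WORDS = """
acidobacteria actinobacteria anthocerotophyta aquificae bacteroidetes
bryophyta charophyta chlamydiae chloroflexi chlorophyta chrysiogenetes
cyanobacteria cycadophyta deferribacteres deinococcus-thermus dictyoglomi
firmicutes fusobacteria gemmatimonadetes ginkgophyta gnetophyta
lycopodiophyta magnoliophyta marchantiophyta nitrospirae pinophyta
proteobacteria pteridophyta spirochaetes synergistetes tenericutes
thermodesulfobacteria thermotogae
""".split()


def is_plant(word):
    # Assumption: plant divisions are exactly the names ending in -phyta.
    return word.endswith("phyta")


# Map each y-count to the plants and bacteria that have that count.
by_count = defaultdict(lambda: {"plant": [], "bacteria": []})
for w in WORDS:
    kind = "plant" if is_plant(w) else "bacteria"
    by_count[w.count("y")][kind].append(w)

for n in sorted(by_count):
    groups = by_count[n]
    print(f"y x {n}: {len(groups['plant'])} plants, {len(groups['bacteria'])} bacteria")
```

Counts of 0 and 2 each contain only one class, while a count of 1 mixes plants with bacteria.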
However, the simple count fails to classify several words: having 1 letter Y may or may not mean a plant. Instead, let’s consider the entire matrix $A$ of letter counts (here it is in a spreadsheet: 33 rows, one for each word; 26 columns, one for each letter). So far, we looked at its 25th column in isolation from the rest of the matrix. Truncated SVD uncovers relations between columns that are not obvious but express patterns such as the presence of letters p, h, t, a along with y. Specifically, write $A = U\Sigma V^*$ with $U, V$ unitary and $\Sigma$ diagonal. Replace all entries of $\Sigma$, except the four largest ones, by zeros. The result is a rank-4 diagonal matrix $\Sigma_4$. The product $A_4 = U\Sigma_4 V^*$ is a rank-4 matrix, which keeps some of the essential patterns in $A$ but de-emphasizes the accidental.
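The truncation step can be sketched in Python with NumPy. This is an illustration under stated assumptions, not the author's MATLAB code: the variable names (`A` for the 33-by-26 letter-count matrix, `A4` for its rank-4 truncation, `y_col` for the column of the letter y) are chosen here, and the plant label again assumes the -phyta ending.

```python
import string

import numpy as np

WORDS = """
acidobacteria actinobacteria anthocerotophyta aquificae bacteroidetes
bryophyta charophyta chlamydiae chloroflexi chlorophyta chrysiogenetes
cyanobacteria cycadophyta deferribacteres deinococcus-thermus dictyoglomi
firmicutes fusobacteria gemmatimonadetes ginkgophyta gnetophyta
lycopodiophyta magnoliophyta marchantiophyta nitrospirae pinophyta
proteobacteria pteridophyta spirochaetes synergistetes tenericutes
thermodesulfobacteria thermotogae
""".split()

letters = string.ascii_lowercase

# Letter-count matrix: one row per word, one column per letter a..z.
A = np.array([[w.count(c) for c in letters] for w in WORDS], dtype=float)

# Thin SVD; NumPy returns the singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the four largest singular values and zero out the rest.
k = 4
A4 = (U[:, :k] * s[:k]) @ Vt[:k, :]

# The 25th column (index 24) still corresponds to the letter y.
y_col = A4[:, letters.index("y")]
for word, value in sorted(zip(WORDS, y_col), key=lambda pair: pair[1]):
    kind = "plant" if word.endswith("phyta") else "bacteria"
    print(f"{value:7.3f}  {kind:8s}  {word}")
```

Printing the column sorted by value makes it easy to check by eye whether the two classes fall on opposite sides of some threshold.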
The entries of the rank-4 approximation $A_4$ are no longer integers. Here is a color-coded plot of its 25th column, which still somehow corresponds to letter Y but takes into account the other letters with which it appears.
Plants are now cleanly separated from bacteria. The plots were made in MATLAB as follows: