In the few spare minutes I’ve had this week, I’ve been trying out n-grams as a point of comparison with other text-mining processes. This still fits squarely in the category of “a lot to learn,” but I’m happy to be running Perl and various Lingua::EN modules on this (nothing super-complicated in the switch, but all of my previous Perl tinkering was on a PC). Today’s corpus was perhaps too small to yield many insights: just 2300 words across 10-12 email messages sent via WPA-L on Tuesday, August 24. All the same, insights or no, here are the top five bigrams, with T-Scores carried out to the fourteenth decimal place, i.e., the hundred-trillionths position. Such precision is useful, I suppose, for avoiding ties.
Bi-grams (T-Score, count, bigram)

2.22372310460229  5  audience awareness
1.72376348313075  3  writing centers
1.72376348313075  3  writing spaces
1.41265204574184  2  develop new
1.41109052911059  2  new ways
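For anyone curious where numbers like these come from, here is a minimal sketch in Python (not the Perl script I actually ran). It assumes the common collocation t-score, (observed - expected) / sqrt(observed), where the expected count is the product of the two words' left and right marginal counts divided by the total number of bigrams. Whether the script uses exactly this formula is my assumption, and the function name `bigram_t_scores` is made up for illustration.

```python
import math
from collections import Counter

def bigram_t_scores(tokens):
    """Return (t_score, count, bigram) rows, highest t-score first.

    Assumes t = (observed - expected) / sqrt(observed), a common
    collocation t-score; the original script's formula may differ.
    """
    pairs = list(zip(tokens, tokens[1:]))       # adjacent word pairs
    total = len(pairs)                          # total number of bigrams
    pair_counts = Counter(pairs)
    left = Counter(w1 for w1, _ in pairs)       # times a word starts a bigram
    right = Counter(w2 for _, w2 in pairs)      # times a word ends a bigram

    rows = []
    for (w1, w2), observed in pair_counts.items():
        expected = left[w1] * right[w2] / total
        t = (observed - expected) / math.sqrt(observed)
        rows.append((t, observed, f"{w1} {w2}"))

    # Sort by descending t-score, breaking ties alphabetically.
    rows.sort(key=lambda row: (-row[0], row[2]))
    return rows

if __name__ == "__main__":
    sample = "a b a b c".split()
    for t, count, bigram in bigram_t_scores(sample):
        print(f"{t:.14f} {count} {bigram}")
```

With a corpus as small as mine the expected counts are tiny, so the score ends up sitting just under the square root of the raw count, which is why the bigrams with equal counts land so close together.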
It looks like “develop new ways” is a trigram that shows up twice in the corpus. This script, a fine one by the way, splits those three words into two overlapping bigrams (“develop new” and “new ways”). But that’s exactly what it was assigned to do.
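The overlap is easy to see with a sliding window. This is a small Python sketch, not the script I ran, and `sliding_ngrams` is a hypothetical helper name: a window of two over a three-word phrase gives two bigrams, while a window of three gives back the single trigram.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

phrase = ["develop", "new", "ways"]

print(sliding_ngrams(phrase, 2))
# [('develop', 'new'), ('new', 'ways')]

print(sliding_ngrams(phrase, 3))
# [('develop', 'new', 'ways')]
```

So a bigram counter will always report a repeated trigram as two repeated bigrams; catching the trigram itself means running the count again with a window of three.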