N-grams

In the few spare minutes I’ve had this week, I’ve been trying out n-grams as a point of comparison with other text-mining processes. This still fits squarely in the category of “a lot to learn,” but I’m happy to be running Perl and various Lingua::EN modules on this (nothing super-complicated in the switch, but all of my previous Perl tinkering was on a PC). Today’s corpus was perhaps too small to yield many insights: just 2300 words across 10-12 email messages sent via WPA-L on Tuesday, August 24. All the same, insights or no, here are the top five bigrams, with T-scores carried out to the fourteenth decimal place, i.e., the hundred-trillionth position. Such precision is useful, I suppose, for avoiding ties.

Bigrams (T-score, count, bigram)
2.22372310460229	5	audience awareness	
1.72376348313075	3	writing centers	
1.72376348313075	3	writing spaces	
1.41265204574184	2	develop new	
1.41109052911059	2	new ways	
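
For anyone curious how scores like these come about, here is a minimal sketch in Python (not the Perl script used above). It assumes the common collocation t-score, (observed - expected) / sqrt(observed), where the expected count is the product of the two word counts divided by the corpus size. Whether the Lingua::EN module computes exactly this is an assumption on my part; the function names and the crude tokenizer are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Deliberately crude: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def bigram_t_scores(tokens):
    """Return (t_score, count, bigram) tuples, highest score first."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    rows = []
    for (w1, w2), observed in bigrams.items():
        # Expected count if the two words co-occurred only by chance.
        expected = unigrams[w1] * unigrams[w2] / n
        # Assumed formula: t = (O - E) / sqrt(O).
        t = (observed - expected) / math.sqrt(observed)
        rows.append((t, observed, f"{w1} {w2}"))
    # Sort by descending score, breaking ties alphabetically.
    return sorted(rows, key=lambda r: (-r[0], r[2]))


sample = "develop new ways to teach and develop new ways to learn"
for t, count, bigram in bigram_t_scores(tokenize(sample))[:5]:
    print(f"{t:.14f}\t{count}\t{bigram}")
```

Under that formula, when the expected count is tiny the score sits just below the square root of the raw count (about 2.236 for a count of 5, 1.732 for 3, 1.414 for 2), which is consistent with the figures in the table and would explain why two bigrams with the same count can still tie.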

It looks like “develop new ways” is a trigram that shows up twice in the corpus. This script, a fine one by the way, splits those three words into two overlapping bigrams. But that’s exactly what it was assigned to do.
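
What happens to “develop new ways” can be shown with a small Python sketch (again, not the Perl script itself; the `ngrams` helper is a hypothetical name). A sliding window of size two turns the three words into a pair of overlapping bigrams, while a window of size three keeps the phrase whole.

```python
def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "develop new ways".split()
print(ngrams(tokens, 2))  # [('develop', 'new'), ('new', 'ways')]
print(ngrams(tokens, 3))  # [('develop', 'new', 'ways')]
```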