N-grams

In the few spare minutes I’ve had this week, I’ve been trying out n-grams as a point of comparison with other text-mining processes. This still fits squarely in the category of “a lot to learn,” but I’m happy to be running Perl and various Lingua::EN modules on this (nothing super-complicated in the switch, but all of my previous Perl tinkering was on a PC). Today’s corpus was perhaps too small to yield many insights: just 2300 words across 10-12 email messages sent via WPA-L on Tuesday, August 24. All the same, insights or no, here are the top five bigrams, with T-Scores carried out to the fourteenth decimal place, i.e., the hundred-trillionths position. Such precision is useful, I suppose, for avoiding ties.

Bigrams (T-Score, count, bigram)
2.22372310460229	5	audience awareness	
1.72376348313075	3	writing centers	
1.72376348313075	3	writing spaces	
1.41265204574184	2	develop new	
1.41109052911059	2	new ways	
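
For anyone curious where numbers like these can come from, here is a minimal sketch in Python. It is not the Perl script that produced the figures above. It assumes the common collocation t-score, (observed - expected) / sqrt(observed), where expected is the count the pair would have if its two words were independent. The function name `bigram_t_scores` and the whitespace-tokenized input are illustrative assumptions.

```python
import math
from collections import Counter


def bigram_t_scores(tokens):
    """Return {(w1, w2): (t_score, count)} for every adjacent word pair.

    Assumes t = (observed - expected) / sqrt(observed), with
    expected = count(w1 as first word) * count(w2 as second word) / total bigrams.
    """
    bigrams = list(zip(tokens, tokens[1:]))
    total = len(bigrams)
    pair_counts = Counter(bigrams)
    first_counts = Counter(w1 for w1, _ in bigrams)
    second_counts = Counter(w2 for _, w2 in bigrams)

    scores = {}
    for (w1, w2), observed in pair_counts.items():
        expected = first_counts[w1] * second_counts[w2] / total
        t_score = (observed - expected) / math.sqrt(observed)
        scores[(w1, w2)] = (t_score, observed)
    return scores


def top_bigrams(tokens, n=5):
    """List the n highest-scoring bigrams as (t_score, count, 'w1 w2') rows."""
    rows = [(t, c, " ".join(pair)) for pair, (t, c) in bigram_t_scores(tokens).items()]
    return sorted(rows, reverse=True)[:n]
```

Under this formula the score can never exceed the square root of the count, which fits the table: a bigram seen 5 times tops out just under 2.236, and two bigrams with the same count can still differ slightly when their individual words are more or less common.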

It looks like “develop new ways” is a trigram that shows up twice in the corpus. This script (a fine one, by the way) splits those three words into two overlapping bigrams, “develop new” and “new ways,” each counted twice. But that’s exactly what it was assigned to do.
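
The overlap is easy to see in a small sketch. This is Python rather than the Perl script itself, and the sample sentence is made up for illustration; `ngrams` is a hypothetical helper, not part of any module mentioned here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Made-up sample text in which the trigram occurs twice.
tokens = "we develop new ways and they develop new ways".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(trigram_counts[("develop", "new", "ways")])
print(bigram_counts[("develop", "new")], bigram_counts[("new", "ways")])
```

A bigram counter has no notion of the longer phrase, so a repeated trigram simply surfaces as two bigrams with matching counts.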

2 Comments

  1. dmueller,

    What is the process involved for gathering
    a large file of bigrams and trigrams from the
    web?

  2. I’m running the process against .txt files. You can produce .txt files in a number of ways. So, you could either piece together a text file to include whatever text you want to process, or you could probably set something up that would produce a periodic text file from an RSS feed.

    Maybe I’m not answering your question straightforwardly enough, Frank460. Let me know if you have something else in mind.
