Earth Wide Moth: N-grams

Wednesday, August 25, 2010

N-grams

In the few spare minutes I've had this week, I've been trying out n-grams as a point of comparison with other text-mining processes. This still fits squarely in the category of "a lot to learn," but I'm happy to be running Perl and various Lingua::EN modules on this (nothing super-complicated in the switch, but all of my previous Perl tinkering was on a PC). Today's corpus was perhaps too small to yield many insights: just 2300 words across 10-12 email messages sent via WPA-L on Tuesday, August 24. All the same, insights or no, here are the top five bigrams, with T-Scores reported to the fourteenth decimal place, i.e., the hundred-trillionths position. Such precision is useful, I suppose, for avoiding ties.

Bigrams (T-Score, count, bigram)
2.22372310460229	5	audience awareness
1.72376348313075	3	writing centers
1.72376348313075	3	writing spaces
1.41265204574184	2	develop new
1.41109052911059	2	new ways

It looks like "develop new ways" is a trigram that shows up twice in the corpus. This script--a fine one, by the way--splits those three words into two overlapping bigrams, each counted twice. But that's exactly what it was assigned to do.
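
For anyone curious how a table like the one above comes together, here is a minimal sketch in Python. It is not the Perl script used for this post. It assumes one common textbook form of the t-score, (observed - expected) / sqrt(observed), where the expected count is what you would get if the two words turned up independently of each other. The function names and the simple tokenizer are made up for illustration, and the scores will not match the table exactly, since the original script may handle stop words and counts differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def bigram_t_scores(tokens):
    """Return (t_score, count, bigram) rows, highest score first."""
    unigrams = Counter(tokens)
    # Every adjacent pair is a bigram, so a repeated trigram such as
    # "develop new ways" contributes two overlapping bigrams each time.
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    rows = []
    for (w1, w2), observed in bigrams.items():
        # Expected count if the two words occurred independently.
        expected = unigrams[w1] * unigrams[w2] / total
        t_score = (observed - expected) / math.sqrt(observed)
        rows.append((t_score, observed, f"{w1} {w2}"))
    return sorted(rows, reverse=True)


if __name__ == "__main__":
    sample = "We develop new ways to teach. They develop new ways to write."
    for t_score, count, bigram in bigram_t_scores(tokenize(sample))[:5]:
        print(f"{t_score:.14f}\t{count}\t{bigram}")
```

Run against the two-sentence sample, "develop new" and "new ways" each come out with a count of 2, which is the same pattern the table shows for the real corpus.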

Posted by Derek Mueller at August 25, 2010 6:27 PM to Methods

Comments

dmueller,

What is the process involved for gathering a large file of bigrams and trigrams from the web?

Posted by: Frank460 at August 26, 2010 10:00 AM

I'm running the process against .txt files. You can produce .txt files in a number of ways. So, you could either piece together a text file to include whatever text you want to process, or you could probably set something up that would produce a periodic text file from an RSS feed.

Maybe I'm not answering your question straightforwardly enough, Frank460. Let me know if you have something else in mind.

Posted by: Derek at August 26, 2010 10:05 AM