N-grams

In the few spare minutes I’ve had this week, I’ve been trying out n-grams as a point of comparison with other text-mining processes. This still fits squarely in the category of “a lot to learn,” but I’m happy to be running Perl and various Lingua::EN modules on this (nothing super-complicated in the switch, but all of my previous Perl tinkering was on a PC). Today’s corpus was perhaps too small to yield many insights: just 2,300 words across 10-12 email messages sent via WPA-L on Tuesday, August 24. All the same, insights or no, here are the top five bigrams, with T-scores carried out to the fourteenth decimal place, i.e., the hundred-trillionths position. Such precision is useful, I suppose, for avoiding ties.

Bi-grams (T-Score, count, bigram)
2.22372310460229	5	audience awareness	
1.72376348313075	3	writing centers	
1.72376348313075	3	writing spaces	
1.41265204574184	2	develop new	
1.41109052911059	2	new ways
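
For anyone curious how scores like these might be computed, here is a minimal Python sketch (not the Perl script used for this post). It assumes the common textbook form of the bigram t-score, t = (O - E) / sqrt(O), where O is the observed bigram count and E = count(w1) * count(w2) / N is the count expected if the two words were independent. Whether the Lingua::EN module applies exactly this formula is an assumption; the function names and sample text are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def bigram_t_scores(tokens):
    """Return (t_score, count, bigram) tuples, highest score first.

    Assumed formula: t = (O - E) / sqrt(O), with E = c(w1) * c(w2) / N.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    rows = []
    for (w1, w2), observed in bigrams.items():
        expected = unigrams[w1] * unigrams[w2] / total
        t = (observed - expected) / math.sqrt(observed)
        rows.append((t, observed, f"{w1} {w2}"))
    return sorted(rows, reverse=True)


# Hypothetical sample text, standing in for a real corpus file.
sample = "Audience awareness matters. Audience awareness helps."
for t, count, bigram in bigram_t_scores(tokenize(sample))[:5]:
    print(f"{t:.14f}\t{count}\t{bigram}")
```

Because E is tiny in a small corpus, the score sits just under the square root of the count, which is why bigrams with equal counts and similar word frequencies can still tie.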

It looks like “develop new ways” is a trigram that shows up twice in the corpus. This script (a fine one, by the way) splits those three words into two overlapping bigrams, “develop new” and “new ways,” each counted twice. But that’s exactly what it was assigned to do.
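
The overlap is easy to see in a small Python sketch (again, not the Perl script itself): a sliding window of two words breaks every repeated trigram into two repeated bigrams. The `ngrams` helper and the sample sentence are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Hypothetical sentence in which one trigram occurs twice.
tokens = "we develop new ways and they develop new ways".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(trigram_counts[("develop", "new", "ways")])
print(bigram_counts[("develop", "new")], bigram_counts[("new", "ways")])
```

Counting trigrams alongside bigrams is one way to tell a genuine two-word pairing from the halves of a longer phrase.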

2 Comments

dmueller,

What is the process involved in gathering a large file of bigrams and trigrams from the web?

I’m running the process against .txt files, which you can produce in a number of ways. You could either piece together a text file containing whatever text you want to process, or you could probably set something up that would produce a periodic text file from an RSS feed.

Maybe I’m not answering your question straightforwardly enough, Frank460. Let me know if you have something else in mind.
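
For the RSS route, something like the following Python sketch might do. It uses only the standard library and assumes a plain RSS 2.0 feed whose items carry `title` and `description` elements. The function name and the sample feed are invented for illustration, and this is not something I have set up myself.

```python
import re
import xml.etree.ElementTree as ET
from html import unescape


def rss_to_text(rss_xml):
    """Collect the title and description of each <item> as plain text."""
    root = ET.fromstring(rss_xml)
    chunks = []
    for item in root.iter("item"):
        for tag in ("title", "description"):
            raw = item.findtext(tag, default="")
            # Drop any HTML tags embedded in the field, then tidy whitespace.
            plain = re.sub(r"<[^>]+>", " ", unescape(raw))
            plain = " ".join(plain.split())
            if plain:
                chunks.append(plain)
    return "\n".join(chunks)


# Hypothetical feed content; a real feed would be downloaded first.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Feed</title>
<item><title>Writing centers</title>
<description>&lt;p&gt;New ways to develop audience awareness.&lt;/p&gt;</description></item>
</channel></rss>"""

print(rss_to_text(SAMPLE_FEED))
```

The returned string can be saved as a .txt file and run through the n-gram script. Fetching the feed on a schedule (for example with `urllib.request.urlopen` and a cron job) is left out of the sketch.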


Earth Wide Moth
