I have written and uploaded to CPAN version 0.02 of my Perl module Lingua::EN::Bigram. From the README file:
This module is designed to: 1) pull out all of the two-, three-, and four-word phrases in a given text, and 2) list these phrases according to their frequency. Using this module it is possible to create lists of the most common phrases in a text as well as order them by their probable occurrence, thus implying significance. This process is useful for the purposes of textual analysis and “distant reading”.
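To make those two steps concrete, here is a minimal sketch written in Python rather than Perl. It is not the module's code and does not use its API. The function names (tokenize, ngram_counts, bigram_tscores) are hypothetical, the tokenizer is deliberately crude, and the formula is the common textbook t-score, (observed - expected) / sqrt(observed). Whether Lingua::EN::Bigram computes its T-Score in exactly this way is an assumption.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(words, n):
    """Count every run of n consecutive words."""
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def bigram_tscores(words):
    """Score each bigram with the textbook t-score (an assumed formula)."""
    total = len(words)
    word_counts = Counter(words)
    scores = {}
    for bigram, observed in ngram_counts(words, 2).items():
        first, second = bigram.split(" ")
        # Count expected if the two words were independent of each other.
        expected = word_counts[first] * word_counts[second] / total
        scores[bigram] = (observed - expected) / math.sqrt(observed)
    return scores


# A made-up sentence, used only to exercise the functions.
text = "the pond is deep and the pond is clear and the pond is cold"
words = tokenize(text)

# Most frequent three-word phrases.
print(ngram_counts(words, 3).most_common(2))

# Bigrams ordered by t-score, highest first.
ranked = sorted(bigram_tscores(words).items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:3])
```

Calling ngram_counts(words, 4) gives the four-word phrases in the same way. A real run would also need to decide how to treat punctuation and very common words, which this sketch ignores.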
Using this module I wrote a script called n-grams.pl. Feed it a plain text file, and it will return the top 10 most significant bigrams (as calculated by T-Score) as well as the top 10 most common trigrams and quadgrams. For example, here is the output of n-grams.pl when Henry David Thoreau’s Walden is input:
Bi-grams (T-Score, count, bigram)
4.54348783312048 22 one day
4.35133234596553 19 new england
3.705427371426 14 walden pond
3.66575742655033 14 one another
3.57857056272537 13 many years
3.55592136768501 13 every day
3.46339791276118 12 fair haven
3.46101939872834 12 years ago
3.38519781332654 12 every man
3.29818626191729 11 let us

Tri-grams (count, trigram)
41 in the woods
40 i did not
28 i do not
28 of the pond
27 as well as
27 it is a
26 part of the
25 that it was
25 as if it
25 out of the

Quad-grams (count, quadgram)
20 for the most part
16 from time to time
15 as if it were
14 in the midst of
11 at the same time
9 the surface of the
9 i think that i
8 in the middle of
8 worth the while to
7 as if they were
The whole thing gets more interesting when you compare that output to another of Thoreau’s works — A Week on the Concord and Merrimack Rivers:
Bi-grams (T-Score, count, bi-gram)
4.62683939320543 22 one another
4.57637831535376 21 new england
4.08356124174142 17 let us
3.86858364314677 15 new hampshire
3.43311180449584 12 one hundred
3.31196701774012 11 common sense
3.25007069543896 11 can never
3.15955504269006 10 years ago
3.14821552996352 10 human life
3.13793008615654 10 told us

Tri-grams (count, tri-gram)
41 as well as
38 of the river
34 it is a
30 there is a
30 one of the
28 it is the
27 as if it
26 it is not
26 if it were
24 it was a

Quad-grams (count, quad-gram)
21 for the most part
20 as if it were
17 from time to time
9 on the bank of
8 the bank of the
8 in the midst of
8 a quarter of a
8 the middle of the
8 quarter of a mile
7 at the same time
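One simple way to start comparing the two listings is to separate the phrases they share from the phrases unique to each. The sketch below is in Python and is not part of n-grams.pl. It uses only the quad-grams copied from the two outputs above, and the variable names are hypothetical.

```python
# Top 10 quad-grams from each listing, in the order they were reported.
walden = [
    "for the most part", "from time to time", "as if it were",
    "in the midst of", "at the same time", "the surface of the",
    "i think that i", "in the middle of", "worth the while to",
    "as if they were",
]
week = [
    "for the most part", "as if it were", "from time to time",
    "on the bank of", "the bank of the", "in the midst of",
    "a quarter of a", "the middle of the", "quarter of a mile",
    "at the same time",
]

walden_set, week_set = set(walden), set(week)

# List comprehensions keep the original ranking order of each listing.
shared = [phrase for phrase in walden if phrase in week_set]
only_walden = [phrase for phrase in walden if phrase not in week_set]
only_week = [phrase for phrase in week if phrase not in walden_set]

print("In both:     ", shared)
print("Walden only: ", only_walden)
print("A Week only: ", only_week)
```

With lists this short the comparison can be done by eye. The same approach becomes more useful when the lists are longer or when more than two texts are involved.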
Ask yourself, “Are there similarities between the outputs? How about differences? Do you notice any patterns or anomalies? What sorts of new discoveries might be made if n-grams.pl were applied to the entire corpus of Thoreau’s works? How might the output be different if a second author’s works were introduced?” Such questions are the core of digital humanities research. With the increasing availability of full text content in library collections, these are questions the library profession could help answer if it were to expand its definition of “service”.
Search and retrieval are not the pressing problems to be solved. People can find more data and information than they know what to do with. Instead, the pressing problems surround use and understanding. Lingua::EN::Bigram is an example of how these newer and more pressing problems can be addressed. The module is available for downloading (locally as well as from CPAN). Also for your perusal is n-grams.pl.