The Wall Street Journal (WSJ) has a column devoted to statistics, The Numbers Guy by Carl Bialik, which recently turned to words. Making Every Word Count is an update on the statistics of corpus linguistics, the study of language (linguistics) through large collections of words (a corpus; plural corpora). The field has been really heating up in the last decade, with some very nice dictionaries based on corpora.
Computers have spawned a burst of activity in the field. But even computers don’t suffice for the daunting task of word collecting and counting. Brown University’s one-million-word corpus was considered adequate in the 1960s. Today, the 100-million-word British National Corpus is considered small — and dated — because it preceded the Internet era, and other sources of new language.
The problem these days is that spoken language costs about 5 times as much to collect as written text. And with the larger corpora needed for finer distinctions of language, the cost becomes prohibitive.