Full-text corpus data


Once you have the full-text data on your computer, there is no end to what you can do with it. The following are just a few ideas:

  • Create your own frequency lists -- for the entire corpus, for specific genres (COCA, e.g. Fiction), dialects (GloWbE, e.g. Australia), time periods (COHA, e.g. 1950s-1960s), topics (Wikipedia, e.g. molecular biology), websites/dates (NOW, e.g. Wall Street Journal from Sep-Oct 2016), very informal language (TV, Movies, or SOAP corpus), or specific sub-genres (COCA, e.g. Newspaper-Financial). (See the frequency-list sketch after this list.)

  • Find collocates -- the most common words occurring near a given word -- which provide great insight into the meaning and usage of that word. (See the collocate sketch below.)

  • Create your own n-gram lists -- the most common strings containing whatever words you want. (See the n-gram sketch below.)

  • Generate your own concordance lines -- thousands or tens of thousands of lines of data for any list of words -- without the limits imposed by the web interfaces for the corpora at English-Corpora.org. (See the concordance sketch below.)

  • If you're a computational linguist, you can do all of the things that are possible only with full-text data -- sentiment analysis, topic modeling, named entity recognition, advanced regex searches, creating treebanks, etc.
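
To make the frequency-list idea concrete, here is a minimal Python sketch. It assumes the data has been downloaded in a one-token-per-line word/lemma/PoS format with tab-separated columns; the directory name, file layout, and column order are assumptions, so adjust them to whatever your copy of the full-text data actually uses.

```python
from collections import Counter
from pathlib import Path

def frequency_list(files, column=0):
    """Count tokens across a set of full-text files.

    Assumes one token per line, tab-separated as word<TAB>lemma<TAB>PoS;
    column 0 counts word forms, column 1 counts lemmas.
    """
    counts = Counter()
    for path in files:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                fields = line.rstrip("\n").split("\t")
                if len(fields) > column and fields[column]:
                    counts[fields[column].lower()] += 1
    return counts

# Hypothetical file layout -- adjust the directory and glob to your download.
files = sorted(Path("coca_wlp").glob("*.txt"))
for word, freq in frequency_list(files, column=1).most_common(20):
    print(f"{freq}\t{word}")
```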
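
For collocates, a simple window-based count is enough to get started, assuming the same hypothetical one-token-per-line format. A real study would normally add an association measure such as MI or log-likelihood on top of the raw counts.

```python
from collections import Counter

def collocates(path, node, window=4):
    """Count word forms within `window` tokens of `node`.

    Assumes one token per line, tab-separated, word form in the first column.
    """
    with open(path, encoding="utf-8") as fh:
        words = [line.rstrip("\n").split("\t")[0].lower() for line in fh if line.strip()]
    counts = Counter()
    for i, w in enumerate(words):
        if w == node:
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[words[j]] += 1
    return counts

# File name is illustrative only.
print(collocates("coca_wlp/acad_2012.txt", "evidence").most_common(15))
```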
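
An n-gram list is just a frequency list over sliding windows of tokens. This sketch (same assumed format) counts 3-grams of word forms; counting lemma or PoS strings only means selecting a different column.

```python
from collections import Counter

def ngram_list(path, n=3, column=0):
    """Count n-grams of a chosen column (0 = word form, 1 = lemma, 2 = PoS).

    Assumes one token per line, tab-separated word<TAB>lemma<TAB>PoS.
    """
    with open(path, encoding="utf-8") as fh:
        tokens = [line.rstrip("\n").split("\t")[column].lower()
                  for line in fh if line.strip()]
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)

# File name is illustrative only.
for gram, freq in ngram_list("coca_wlp/fic_2012.txt", n=3).most_common(10):
    print(freq, " ".join(gram))
```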
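
Concordance (KWIC) lines are equally straightforward once the text is a token list. This sketch prints a fixed-width left and right context around every hit; the input format and file name are, again, assumptions.

```python
def concordance(path, node, width=8):
    """Print simple KWIC lines for `node` (word-form match, case-insensitive).

    Assumes one token per line, tab-separated, word form in the first column.
    """
    with open(path, encoding="utf-8") as fh:
        tokens = [line.rstrip("\n").split("\t")[0] for line in fh if line.strip()]
    for i, w in enumerate(tokens):
        if w.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            print(f"{left:>50}  [{w}]  {right}")

# File name is illustrative only.
concordance("coca_wlp/news_2015.txt", "budget")
```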

Note that "pre-packaged" COCA-based frequency lists, collocates, and n-grams are all available (see samples) for those who don't want to extract their own data. But the full-text data gives you much more control over what you extract and how.

  • Remember that in your queries you can search by word form, lemma (e.g. walk = walks, walked, walking), part of speech, or any combination of these. This can be very useful, for example, for advanced work on syntactic constructions. (See the query sketch after this list.)

  • You can compare one section of the corpus to another -- for example, words that occur much more often in Magazine-Financial than in magazines in general (COCA), adjectives that are much more common in Great Britain than in the United States (GloWbE), or the collocates of a word in the 1800s versus the 1900s (COHA). (Some of this data is already available in "pre-packaged" form, but you will have much more control over the data; see the comparison sketch below.)
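
As a sketch of combining word form, lemma, and part of speech in a query: the function below filters the assumed one-token-per-line word/lemma/PoS stream by any mix of the three. The tag prefix in the example call is purely illustrative; use whatever tagset your copy of the data comes with.

```python
def match(path, word=None, lemma=None, pos_prefix=None):
    """Yield (word, lemma, PoS) tuples matching all of the given criteria.

    Assumes one token per line, tab-separated word<TAB>lemma<TAB>PoS.
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue
            w, l, p = fields[:3]
            if word and w.lower() != word:
                continue
            if lemma and l.lower() != lemma:
                continue
            if pos_prefix and not p.startswith(pos_prefix):
                continue
            yield w, l, p

# e.g. every form of the lemma "walk" tagged as a verb (tag prefix is illustrative)
for w, l, p in match("coca_wlp/fic_2012.txt", lemma="walk", pos_prefix="v"):
    print(w, l, p)
```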
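
Comparing one section to another then reduces to building two frequency lists (e.g. with the frequency_list() sketch above) and ranking words by their normalized frequency ratio; a fuller treatment would use a keyness statistic such as log-likelihood.

```python
def compare_sections(freq_a, freq_b, min_freq=20):
    """Rank words by per-million rate in section A relative to section B.

    `freq_a` and `freq_b` are Counters, e.g. from the frequency_list() sketch.
    """
    size_a, size_b = sum(freq_a.values()), sum(freq_b.values())
    ratios = {}
    for word, fa in freq_a.items():
        if fa < min_freq:
            continue
        rate_a = fa * 1_000_000 / size_a
        # +1 smoothing so words absent from section B do not divide by zero.
        rate_b = (freq_b.get(word, 0) + 1) * 1_000_000 / size_b
        ratios[word] = rate_a / rate_b
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# e.g. compare_sections(freq_mag_financial, freq_mag_all)[:50]
```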

Using the lexicon and sources files:

  • You have access to a file that lists all of the sources in the corpora (along with their associated metadata). You can use this data to create your own "sub-corpora" -- texts from just a particular year, a certain source, or whatever other criteria you want. (See the sources-file sketch below.)

  • Because you have access to a lexicon of all word forms in the corpus (millions of entries for each corpus), you can add any features you want for any word -- pronunciation, meaning, etc. -- and then use that as part of your search. (See the lexicon sketch below.)
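
As a sketch of the sub-corpus idea: the snippet below reads the sources file and keeps the text IDs whose metadata matches some criteria, which can then be used to filter the corpus files themselves. The delimiter and column names are assumptions, so check the header row of the sources file in your download.

```python
import csv

def select_texts(sources_path, **criteria):
    """Return the set of textIDs whose metadata matches every criterion.

    Assumes a tab-separated sources file with a header row containing columns
    such as textID, year, genre, and source (the names are assumptions).
    """
    keep = set()
    with open(sources_path, encoding="utf-8", newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if all(row.get(key) == value for key, value in criteria.items()):
                keep.add(row["textID"])
    return keep

# e.g. a sub-corpus of 1995 newspaper texts (field values are illustrative)
newspaper_1995 = select_texts("sources.txt", genre="NEWS", year="1995")
```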
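
And a sketch of extending the lexicon: load it, attach a feature of your own (here a hypothetical pronunciation file keyed on the word form), and the added field is then available to any later search. File names and column names are, again, assumptions.

```python
import csv

def load_tsv(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, encoding="utf-8", newline="") as fh:
        return list(csv.DictReader(fh, delimiter="\t"))

# Assumed columns: wordID, word, lemma, PoS (check your lexicon file's header).
lexicon = load_tsv("lexicon.txt")

# Hypothetical extra resource: word form -> pronunciation.
pron = {row["word"]: row["pron"] for row in load_tsv("my_pronunciations.txt")}

# Attach the new feature; entries without a pronunciation get None.
for entry in lexicon:
    entry["pron"] = pron.get(entry["word"])

# The added field can now drive a search -- here just a placeholder filter.
matches = [e for e in lexicon if e["pron"] and e["pron"].endswith("d")]
```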

Basically, anything that you can do with the corpora online, you can do with this data -- and much more. But because the data is on your own computer, there are no limits on how many queries you can do each day; you don't have to worry about hundreds of other people using the corpora at the same time you are; and you can even leave programs running overnight to search the hundreds of millions (or billions) of words in complex ways. The possibilities are endless.