Once you have the full-text data on your computer, there is no end to the possible uses for it. The following are just a few ideas:
Create your own frequency lists -- for the entire corpus, or for specific genres (COCA, e.g. Fiction), dialects (GloWbE, e.g. Australia), time periods (COHA, e.g. the 1950s-1960s), topics (Wikipedia, e.g. molecular biology), websites/dates (NOW, e.g. the Wall Street Journal from Sep-Oct 2016), or sub-genres (COCA, e.g. Newspaper-Financial).
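For example, here is a minimal Python sketch of a frequency list. It assumes the full-text files have been unpacked into a folder of plain .txt files (the folder name "coca/fiction" is hypothetical), and it uses naive whitespace tokenization rather than the corpus's own tokenization:

    from collections import Counter
    from pathlib import Path

    def frequency_list(corpus_dir, top=20):
        """Count word frequencies across every .txt file in a folder."""
        counts = Counter()
        for path in Path(corpus_dir).glob("*.txt"):
            with open(path, encoding="utf-8", errors="ignore") as f:
                for line in f:
                    # Naive tokenization; adapt to the corpus's own format.
                    counts.update(w.lower() for w in line.split() if w.isalpha())
        return counts.most_common(top)

    # e.g. the 20 most frequent words in a (hypothetical) fiction folder
    for word, freq in frequency_list("coca/fiction"):
        print(word, freq)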
Find collocates -- the most common words occurring near a specific word -- which provide great insight into the meaning and usage of that word.
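A simple way to do this yourself is to count every word within a few tokens of the node word. A sketch (the four-token window is just a common default; tokens is assumed to be an already-tokenized, lowercased list):

    from collections import Counter

    def collocates(tokens, node, window=4, top=20):
        """Count words occurring within `window` tokens of `node`."""
        counts = Counter()
        for i, tok in enumerate(tokens):
            if tok == node:
                lo, hi = max(0, i - window), i + window + 1
                # Exclude the node word itself from the counts.
                counts.update(t for t in tokens[lo:hi] if t != node)
        return counts.most_common(top)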
Create your own n-gram lists -- the most common strings involving whatever words you want.
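Counting n-grams is only a few lines once the text is tokenized; a sketch:

    from collections import Counter

    def ngrams(tokens, n=3, top=20):
        """Count the most common n-token strings."""
        grams = zip(*(tokens[i:] for i in range(n)))
        return Counter(" ".join(g) for g in grams).most_common(top)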
Generate your own concordance lines -- thousands or tens of thousands of lines of data for any list of words -- without the limits imposed by the web interfaces for the corpora.
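A bare-bones keyword-in-context (KWIC) generator might look like this (the eight-token context width is arbitrary):

    def concordance(tokens, node, width=8):
        """Yield keyword-in-context lines for every hit of `node`."""
        for i, tok in enumerate(tokens):
            if tok == node:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                yield f"{left:>50}  [{tok}]  {right}"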
If you're a computational linguist, you can do all of the things
that you can do only with full-text data -- sentiment analysis, topic
modeling, named entity
recognition, advanced regex searches, creating treebanks, etc.
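For instance, named entity recognition over the raw text takes only a few lines with an off-the-shelf library such as spaCy (shown here purely as one option among many):

    import spacy

    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The Wall Street Journal reported from Washington in October 2016.")
    for ent in doc.ents:
        print(ent.text, ent.label_)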
Note that "pre-packaged" COCA-based frequency lists, collocates, and n-grams are all available (see samples) for those who don't want to extract their own data. But with the full-text data, you have much more control over what you extract and how.
Remember that in your queries you can search by word form, by lemma (e.g. walk = walks, walked, walking), or by part of speech -- or by any combination of these. This can be very useful, for example, for advanced work on syntactic constructions.
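For example, with the word/lemma/PoS version of the full-text data you can pull out every word form of a lemma with a given tag. The sketch below assumes a tab-separated file with word, lemma, and PoS columns in that order, and a CLAWS-style tagset where lexical verbs start with "vv" -- check both of these, and the file name, against the files you actually receive:

    def find_lemma_pos(wlp_path, lemma, pos_prefix):
        """Collect word forms matching a lemma and a PoS-tag prefix."""
        hits = set()
        with open(wlp_path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t")
                if len(parts) < 3:
                    continue
                word, lem, tag = parts[:3]
                if lem == lemma and tag.startswith(pos_prefix):
                    hits.add(word.lower())
        return sorted(hits)

    # e.g. all verb forms of WALK: walk, walked, walking, walks ...
    print(find_lemma_pos("coca_wlp.txt", "walk", "vv"))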
You can compare one section of the corpus to another -- for example, words that occur much more often in Magazine-Financial than in magazines in general (COCA), adjectives that are much more common in Great Britain than in the United States (GloWbE), or the collocates of a word in the 1800s versus the 1900s (COHA). (Some of this data is available in "pre-packaged" form, but the full-text data gives you much more control.)
Using the lexicon and sources files:
You have access to a file that lists all sources in the corpora (along with their associated metadata). You can use this data to create your own "sub-corpora" -- texts from just a particular year, a certain source, or whatever other criteria you want.
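A sketch of that filtering step -- the column layout (textID, year, source, tab-separated, with a header row) is an assumption; check it against the sources file for your corpus:

    def subcorpus_ids(sources_path, year=None, source=None):
        """Collect text IDs matching a year and/or a source name."""
        ids = set()
        with open(sources_path, encoding="utf-8", errors="ignore") as f:
            next(f)  # skip the (assumed) header row
            for line in f:
                parts = line.rstrip("\n").split("\t")
                if len(parts) < 3:
                    continue
                text_id, text_year, text_source = parts[:3]
                if year and text_year != year:
                    continue
                if source and source.lower() not in text_source.lower():
                    continue
                ids.add(text_id)
        return ids

You would then keep only the texts whose IDs are in that set when running any of the scripts above.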
Because you have access to a lexicon of all word forms in the corpus (millions of entries for each corpus), you can add any features you want for any word -- pronunciation, meaning, etc. -- and then use that as part of your search.
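For instance, you could join the lexicon against a free pronunciation resource like the CMU Pronouncing Dictionary and then search by phonetic shape. A sketch (the lexicon is assumed to be tab-separated with the word form in the first column; both file names are hypothetical):

    def load_cmudict(path):
        """Load word -> ARPAbet pronunciation from a cmudict file."""
        prons = {}
        with open(path, encoding="latin-1") as f:
            for line in f:
                if line.startswith(";;;"):  # comment lines
                    continue
                word, _, pron = line.rstrip("\n").partition("  ")
                prons[word.lower()] = pron
        return prons

    # Annotate each lexicon entry with its pronunciation, then filter
    # however you like.
    prons = load_cmudict("cmudict-0.7b")
    with open("lexicon.txt", encoding="utf-8", errors="ignore") as f:
        for line in f:
            word = line.split("\t")[0].lower()
            if word in prons:
                print(word, prons[word])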
Basically, anything that you can do with the web interfaces, you can do with this data -- and much more. But because the data is on your own computer, there are no limits on how many queries you can run each day; you don't have to worry about hundreds of other people using the corpora at the same time you are; and you can even leave programs running overnight to search the hundreds of millions (or billions) of words in complex ways. The possibilities are endless.