Once you have the full-text data on your own computer, there is no end to its possible uses. The following are just a few ideas:
- Create your own frequency lists -- in the entire corpus, for specific genres (COCA, e.g. Fiction), dialects (GloWbE, e.g. Australia), time periods (COHA, e.g. 1950s-1960s), topics (Wikipedia, e.g. molecular biology), websites/dates (NOW, e.g. Wall Street Journal from Sep-Oct 2016), very informal language (TV, Movies, or SOAP corpus), or specific sub-genres (COCA, e.g. Newspaper-Financial).
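As a minimal sketch of the idea, a frequency list is just a count of word forms over the text. The tokenizer below is an assumption -- the released files may already be tokenized, in which case you would count lines or columns instead:

```python
import re
from collections import Counter

def frequency_list(text):
    """Build a word-form frequency list from a block of plain corpus text."""
    # Naive tokenizer (an assumption about the format): lowercase,
    # alphabetic word forms, with optional internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

freqs = frequency_list("The data is here , and the data is useful .")
```

To restrict the list to one genre or time period, you would simply run this over only the files (or rows) belonging to that section of the corpus.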
- Find collocates -- the most common words occurring near a given word, which provide great insight into its meaning and usage.
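A basic collocate extractor only needs a token list and a window size; here is one possible sketch (the window of four words on each side is an arbitrary choice, not a fixed convention of the corpora):

```python
from collections import Counter

def collocates(tokens, node, window=4):
    """Count every word occurring within `window` tokens of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - window)
        for j in range(start, min(len(tokens), i + window + 1)):
            if j != i:  # don't count the node word itself
                counts[tokens[j]] += 1
    return counts

tokens = "strong tea and strong coffee but powerful tea".split()
near_strong = collocates(tokens, "strong", window=2)
```

In practice you would also rank collocates by an association measure (e.g. mutual information) rather than raw counts, so that frequent function words don't dominate the list.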
- Create your own n-gram lists -- the most common strings involving whatever words you want.
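Counting n-grams is a one-liner once the text is tokenized; a sliding window over the token list does the job:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-token sequences in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

bigrams = ngram_counts("to be or not to be".split(), 2)
```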
- Generate your own concordance lines -- thousands or tens of thousands of lines of data for any list of words -- without the limits imposed by the web interfaces for the corpora from English-Corpora.org.
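A concordance is typically shown in keyword-in-context (KWIC) format: the search word with a few words of context on either side. A minimal generator might look like this (the context width is a free parameter):

```python
def concordance(tokens, node, width=4):
    """Return keyword-in-context (KWIC) lines for every hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

hits = concordance("the quick brown fox jumps over the lazy dog".split(), "the", width=2)
```

Because this runs locally, there is no cap on how many lines you can pull for a given word or word list.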
- If you're a computational linguist, you can do all of the things that are possible only with full-text data -- sentiment analysis, topic modeling, named entity recognition, advanced regex searches, creating treebanks, etc.
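To take just the simplest item on that list, an arbitrary regex search over the raw text is trivial locally but awkward (or impossible) through a web interface. For example, pulling out every "-ization" noun:

```python
import re

def regex_search(text, pattern):
    """Return all matches of an arbitrary regex over raw corpus text."""
    return re.findall(pattern, text)

matches = regex_search(
    "globalization and urbanization drive modernization",
    r"\b\w+ization\b",
)
```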
Note that "pre-packaged" COCA-based frequency lists, collocates, and n-grams are all available (see samples) for those who don't want to extract their own data. But with the full-text data, you have much more control over the output.
- Remember that in your queries you can search by word form, lemma (e.g. walk = walks, walked, walking), part of speech, or any combination of these. This can be very useful, for example, for advanced work on syntactic constructions.
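If the release you have includes an annotated version with one word per line in word / lemma / part-of-speech columns (the exact layout and tagset here are assumptions for illustration), a lemma-plus-PoS search is a simple filter:

```python
def match_lemma(wlp_lines, lemma, pos_prefix=""):
    """Collect word forms matching a lemma (and optional PoS tag prefix)
    from 'word<TAB>lemma<TAB>PoS' lines. The three-column layout and the
    'vv'-style verb tags are assumptions about the release format."""
    hits = []
    for line in wlp_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        word, lem, pos = fields
        if lem == lemma and pos.startswith(pos_prefix):
            hits.append(word)
    return hits

sample = ["walked\twalk\tvvd", "walks\twalk\tvvz", "walk\twalk\tnn1"]
verb_forms = match_lemma(sample, "walk", pos_prefix="vv")
```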
- You can compare one section of the corpus to another -- for example, words that occur much more frequently in Magazine-Financial than in magazines in general (COCA), adjectives that are much more common in Great Britain than in the United States (GloWbE), or the collocates of a word in the 1800s versus the 1900s (COHA). (Some of this data is already available in "pre-packaged" form, but you will have much more control over the data.)
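One simple way to compare two sections is a log-ratio of relative frequencies between their frequency lists (this particular measure and the 0.5 smoothing constant are just one reasonable choice; log-likelihood is a common alternative):

```python
import math
from collections import Counter

def log_ratio(target, reference):
    """Log2 ratio of relative frequencies: positive scores mark words
    overrepresented in `target` versus `reference` (0.5 smoothing avoids
    division by zero for words absent from one section)."""
    t_total = sum(target.values()) or 1
    r_total = sum(reference.values()) or 1
    return {
        w: math.log2(((c + 0.5) / t_total) / ((reference.get(w, 0) + 0.5) / r_total))
        for w, c in target.items()
    }

# Toy counts standing in for Magazine-Financial vs. magazines in general:
financial = Counter({"equity": 30, "market": 50, "people": 5})
general = Counter({"equity": 2, "market": 10, "people": 60})
scores = log_ratio(financial, general)
```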
Using the lexicon and sources files:
- You have access to a file that lists all sources in the corpora (along with their associated metadata). You can use this data to create your own "sub-corpora" -- texts from just a particular year, a certain source, or whatever other criteria you want.
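Building a sub-corpus then amounts to selecting the text IDs whose metadata matches your criteria. The column names below (textID, year, genre) are illustrative assumptions about the sources file's layout:

```python
import csv
import io

def text_ids_for(sources_tsv, **criteria):
    """Pick textIDs from a tab-separated sources file whose rows match
    every given column=value pair. Column names are assumed for illustration."""
    reader = csv.DictReader(io.StringIO(sources_tsv), delimiter="\t")
    return [
        row["textID"]
        for row in reader
        if all(row[col] == value for col, value in criteria.items())
    ]

sources = "textID\tyear\tgenre\n101\t1995\tFIC\n102\t1995\tNEWS\n103\t2004\tFIC\n"
fiction_1995 = text_ids_for(sources, year="1995", genre="FIC")
```

The resulting ID list can then be used to restrict any of the searches above to just those texts.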
- Because you have access to a lexicon of all word forms in the corpus (millions of entries for each corpus), you can add any features you want for any word -- pronunciation, meaning, etc. -- and then use that as part of your search.
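For instance, you might parse the lexicon into a lookup table and attach your own features to each entry; the two-column wordID/word layout here is a simplified assumption about the file format:

```python
def enrich_lexicon(lexicon_lines, features):
    """Parse 'wordID<TAB>word' lexicon lines (a simplified, assumed layout)
    into dicts keyed by word form, attaching any extra features you supply
    so they can drive later searches."""
    entries = {}
    for line in lexicon_lines:
        word_id, word = line.rstrip("\n").split("\t")
        entries[word] = {"wordID": word_id, **features.get(word, {})}
    return entries

lex = enrich_lexicon(
    ["1\tcolour", "2\tcolor"],
    {"colour": {"variant": "BrE"}, "color": {"variant": "AmE"}},
)
```

A search could then filter on the added feature -- e.g. restrict a query to British-variant spellings.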
Basically, anything that you can do with the
corpora online,
you can do with this data -- and much more. But because the
data is on your own computer, there are no limits on how many
queries you can do each day; you don't have to worry about hundreds
of other people using the corpora at the same time you are; and you
can even leave programs running overnight to search the hundreds of
millions (or billions) of words in complex ways. The possibilities
are endless.