The database format is one of the three data formats. When you purchase the data, you purchase the rights to all three formats, and you can download whichever ones you want.
This format is composed of three tables:
- [corpus]: one row for each word in the corpus, showing the textID, the ID (a sequential offset value: 293, 294, 295, etc.), and an integer value for the word (wordID).
- [lexicon]: information on each wordID: word form (walked), lemma (walk), and part of speech (vvd).
- [sources]: information on each textID: genre or country, source, title, etc.
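As a rough sketch of how these three tables fit together (the column types and lengths here are assumptions for illustration only; check the downloaded files and documentation for the exact layout), they could be created like this:

-- illustrative table definitions; column names follow the tables described above,
-- but the data types are assumptions
create table corpus (
  textID int,           -- which text the word occurs in (see [sources])
  ID int,               -- sequential position (offset) of the word in the corpus
  wordID int            -- integer code for the word (see [lexicon])
);
create table lexicon (
  wordID int,           -- integer code used in [corpus]
  word varchar(100),    -- word form, e.g. walked
  lemma varchar(100),   -- lemma, e.g. walk
  pos varchar(50)       -- part of speech tag, e.g. vvd
);
create table sources (
  textID int,           -- integer code used in [corpus]
  genre varchar(20),    -- genre (or country, depending on the corpus)
  [sub-genre] varchar(50),
  year int,
  sourceTitle varchar(200),
  textTitle varchar(200)
);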
This format allows you to carry out
extremely powerful queries on the corpora. You can search by word form, lemma,
or part of speech. You can also limit the search to just certain texts in the
corpus via JOINs on the [sources] table (e.g. just Magazines-Sports in COCA, Great Britain in GloWbE,
fiction from the 1870s in COHA, a particular website in NOW, or pages with a
given word in the title in Wikipedia). You can also modify the [lexicon] table to add whatever additional features you would like for each wordID, or modify the [sources] table to create your own sub-corpora (see the sketch below). (See also the information about the similar corpus architecture used at English-Corpora.org.)
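For example, one simple way to mark a sub-corpus of your own is to add a flag column to [sources] and set it for the texts you care about. This is only a sketch, assuming SQL Server-style tooling (matching the "select top N" syntax used in the queries below); the [subCorpus] column is a hypothetical addition, not part of the downloaded table:

-- hypothetical flag column marking 1990s fiction as a personal sub-corpus
alter table sources add subCorpus bit
go
update sources
set subCorpus = 1
where genre = 'FIC' and year between 1990 and 1999
-- queries can then join [corpus] to [sources] and add "sources.subCorpus = 1"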
corpus table

textID | ID | wordID
2040250 | 110933753 | 848
2040250 | 110933754 | 3
2040250 | 110933755 | 560
2040250 | 110933756 | 459620
2040250 | 110933757 | 6891
2040250 | 110933758 | 10
2040251 | 110933779 | 7140251
2040251 | 110933780 | 11850
2040251 | 110933781 | 187678
2040251 | 110933782 | 26
2040251 | 110933783 | 19957
2040251 | 110933784 | 19
2040251 | 110933785 | 41
2040251 | 110933786 | 64
2040251 | 110933787 | 160
2040251 | 110933788 | 4
2040251 | 110933789 | 4155

lexicon table

wordID | word | lemma | PoS
71186 | swab | swab | nn1
77653 | swag | swag | nn1
36155 | swagger | swagger | nn1
62562 | Swahili | swahili | jj_nn1
44807 | Swain | swain | np1
13782 | swallow | swallow | vvi
37384 | swallow | swallow | vv0
44376 | swallow | swallow | nn1@
73593 | swallow | swallow | nn1@_vv0
16873 | swallowed | swallow | vvd
22661 | swallowed | swallow | vvn
36293 | swallowed | swallow | vvd_vvn
23040 | swallowing | swallow | vvg
39247 | swallows | swallow | vvz
57594 | swallows | swallow | nn2
14960 | swam | swim | vvd
15768 | swamp | swamp | nn1

sources table (selected columns for this web page; there are more columns in the downloadable sources table)

COCA
textID | genre | sub-genre | year | sourceTitle | textTitle
2040250 | MAG | Religion | 2007 | Source_A | Title_L
2040251 | MAG | Sports | 2007 | Source_B | Title_M
2040252 | NEWS | Financial | 2012 | Source_C | Title_N
2040253 | ACAD | Science | 1993 | Source_D | Title_O
2040254 | FIC | MovieScript | 1997 | Source_E | Title_P
2040255 | SPOK | CNN | 2001 | Source_F | Title_Q
2040256 | NEWS | Financial | 2012 | Source_G | Title_R

COHA
textID | year | genre | sourceTitle | textTitle
728282 | 1837 | FIC | Source_A | SampleTitle_N
728283 | 1872 | FIC | Source_B | SampleTitle_O
728284 | 1904 | NF | Source_C | SampleTitle_P
728285 | 1938 | MAG | Source_D | SampleTitle_Q
728286 | 1959 | NEWS | Source_E | SampleTitle_R
728287 | 1987 | MAG | Source_F | SampleTitle_S

GloWbE
textID | country | genre | url | textTitle
3282569 | AU | BLOG | SampleURL_A | SampleTitle_N
3282570 | IN | BLOG | SampleURL_B | SampleTitle_O
3282571 | US | GENL | SampleURL_C | SampleTitle_P
3282572 | GB | BLOG | SampleURL_D | SampleTitle_Q
3282573 | IE | GENL | SampleURL_E | SampleTitle_R
3282574 | NZ | GENL | SampleURL_F | SampleTitle_S
3282575 | SG | BLOG | SampleURL_G | SampleTitle_T
There are similar [sources] tables for NOW, Wikipedia, and the Spanish corpora.
NOTE: The database format
assumes that you know SQL (Structured Query Language), and that you can create the tables, populate them
with the downloaded data, and (most importantly) run the SQL queries to extract
the data. Please do not use this format unless you are well-acquainted with
databases and SQL. The following are just a handful of SQL queries that you can
run on the data, but there is of course no limit to what you can do.
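The sample queries below use SQL Server-style syntax (select top N ...). As one illustration of the setup step only (the file name, path, and delimiters here are assumptions and will not necessarily match the actual download), populating a table and indexing the join columns might look like this:

-- load a tab-delimited download file into the lexicon table
-- (file name and path are hypothetical)
bulk insert lexicon
from 'C:\corpusdata\lexicon.txt'
with (fieldterminator = '\t', rowterminator = '\n')

-- indexes on the join columns make the sample queries below much faster
create index idx_corpus_wordID on corpus (wordID)
create index idx_corpus_textID on corpus (textID)
create index idx_lexicon_wordID on lexicon (wordID)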
1. Find the 1,000 most frequent nouns in (COCA) ACAD-Science
select top 1000 count(*), lex.word
from lexicon as lex, sources, corpus where
sources.genre = 'ACAD' and sources.[sub-genre] = 'Science' and
sources.textID = corpus.textID and
lex.pos like 'nn%' and
lex.wordID = corpus.wordID
group by lex.word
order by count(*) desc
2. Find the top 500 strings of get + V-ed (e.g. got married, gets paid)
(using a "runtime" self-join on [corpus]; much faster with the multi-column table; see below)
select top 500 count(*), lex1.word, lex2.word
from lexicon as lex1, lexicon as lex2, corpus as corpus1, corpus as corpus2
where
lex1.lemma = 'get' and
lex2.pos like 'v_n%' and
lex1.wordID = corpus1.wordID and
lex2.wordID = corpus2.wordID and
corpus2.ID = corpus1.ID + 1
group by lex1.word, lex2.word
order by count(*) desc
3. Find the top 500 3-grams with point in the second position
(using a "runtime" self-join on [corpus]; much faster with the multi-column table; see below)
select top 500 count(*), lex1.word, lex2.word, lex3.word
from lexicon as lex1, lexicon as lex2, lexicon as lex3, corpus as corpus1,
corpus as corpus2, corpus as corpus3 where
lex2.word = 'point' and lex2.wordID = corpus2.wordID and
lex1.wordID = corpus1.wordID and
lex3.wordID = corpus3.wordID and
corpus1.ID = corpus2.ID - 1 and
corpus3.ID = corpus2.ID + 1
group by lex1.word, lex2.word, lex3.word
order by count(*) desc
Note: rather than using self-joins (as in #2 and #3 above), the architecture for the corpora from English-Corpora.org has tables like the one shown below. The [w5] column here corresponds to the [wordID] column in the [corpus] table above, but a massive self-join has been done on this table (as the corpus was created, not as each query is run) to create the "adjacent" [w1]-[w4] and [w6]-[w9] columns. As a result, the four preceding and four following words are already on the row when one searches [w5]. With the full-text data, you can create similar tables yourself (see the sketch after the sample rows below).
w1 | w2 | w3 | w4 | w5 | w6 | w7 | w8 | w9
43 | 3 | 858 | 5 | 432 | 3319 | 9 | 132 | 2876
3 | 858 | 5 | 432 | 3319 | 9 | 132 | 2876 | 3643
858 | 5 | 432 | 3319 | 9 | 132 | 2876 | 3643 | 5
5 | 432 | 3319 | 9 | 132 | 2876 | 3643 | 5 | 1729
432 | 3319 | 9 | 132 | 2876 | 3643 | 5 | 1729 | 72
3319 | 9 | 132 | 2876 | 3643 | 5 | 1729 | 72 | 43
9 | 132 | 2876 | 3643 | 5 | 1729 | 72 | 43 | 21887
132 | 2876 | 3643 | 5 | 1729 | 72 | 43 | 21887 | 746929
2876 | 3643 | 5 | 1729 | 72 | 43 | 21887 | 746929 | 676
3643 | 5 | 1729 | 72 | 43 | 21887 | 746929 | 676 | 62900
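With the full-text or database data, one way to build this kind of multi-column table yourself is a one-time query over [corpus]. This is a sketch only, assuming a SQL Server-style database with window functions; the corpusWide name and the exact build strategy are illustrative, and on a billion-word corpus this step would normally be batched rather than run in one pass:

-- hypothetical one-time build of a "wide" table with the four preceding and
-- four following wordIDs on each row
select
  lag(wordID, 4) over (order by ID) as w1,
  lag(wordID, 3) over (order by ID) as w2,
  lag(wordID, 2) over (order by ID) as w3,
  lag(wordID, 1) over (order by ID) as w4,
  wordID as w5,
  lead(wordID, 1) over (order by ID) as w6,
  lead(wordID, 2) over (order by ID) as w7,
  lead(wordID, 3) over (order by ID) as w8,
  lead(wordID, 4) over (order by ID) as w9
into corpusWide
from corpus
-- (a real build might partition by textID so that rows never span two texts)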
This multi-column table allows for much faster queries than self-joins at SQL runtime. For example, to find the most frequent collocates of a given word, the SQL query would be:
4. Find the top 200 noun collocates of break as a verb, in the four "slots" after break (columns [w6]-[w9] above)
select top 200 count(*), a.w6 from (
select x.w6 from corpus as x, lexicon as x1 where x1.lemma like 'break' and x1.pos like 'v%' and x.w5 = x1.wordID union all
select x.w7 from corpus as x, lexicon as x1 where x1.lemma like 'break' and x1.pos like 'v%' and x.w5 = x1.wordID union all
select x.w8 from corpus as x, lexicon as x1 where x1.lemma like 'break' and x1.pos like 'v%' and x.w5 = x1.wordID union all
select x.w9 from corpus as x, lexicon as x1 where x1.lemma like 'break' and x1.pos like 'v%' and x.w5 = x1.wordID
) a, lexicon b where
b.pos like 'nn%' and
a.w6 = b.wordID
group by a.w6
order by count(*) desc
On a fairly fast machine, this will only take about two seconds for COCA (one billion words).