Full-text corpus data


The full-text corpus data is available in three different formats. When you purchase the data, you purchase the rights to all three formats, and you can download whichever ones you want.

Samples: The sample data linked below is drawn completely at random from each of the corpora (usually about 1/100th of the total number of texts). No attempt has been made to "clean up" this sample data in any way. If you're happy with the sample data that you download, you should be equally happy with the complete set of data.

Note that the size shown in the first column is the total number of words that you can download after purchasing the data. The sizes in the other three columns are for the samples. Note also that the shared files below (sources and lexicon) are just for the sample texts.

NOW corpus: the samples below are just for 2010-2016, but the full-text data continues to grow by 130-150 million words each month. The last update is for March 2024. (More info)

Corpus (size of complete    Shared files        Samples
full-text data)                                 Database (more)   Word/lemma/PoS    Linear text

iWeb (14 billion)           (See)               (See): 14 mw      (See): 14 mw      (See): 14 mw
COCA (950 million)          Lexicon  Sources    8.9 mw            8.9 mw            8.9 mw
COHA (385 million)          Lexicon  Sources    3.6 mw            3.6 mw            3.6 mw
GloWbE (1.8 billion)        Lexicon  Sources    2.1 mw            2.1 mw            2.1 mw
NOW ( billion)              Lexicon  Sources    1.7 mw            1.7 mw            1.7 mw
Coronavirus (1.5 billion)   Lexicon  Sources    3.2 mw            3.2 mw            3.2 mw
Wikipedia (1.8 billion)     Lexicon  Sources    1.8 mw            1.8 mw            1.8 mw
TV (310 million)            Lexicon  Sources    2.1 mw            2.1 mw            2.1 mw
Movies (190 million)        Lexicon  Sources    1.6 mw            1.6 mw            1.6 mw
SOAP (95 million)           Lexicon  Sources    2.1 mw            2.1 mw            2.1 mw

Spanish (1.8 billion)       Lexicon  Sources    2.0 mw            2.0 mw            2.0 mw
Portuguese (1 billion)      Lexicon  Sources    9.5 mw            9.5 mw            9.5 mw

Explanation and notes

Database: Most robust format, but requires knowledge of SQL. Allows for powerful JOINs across the corpus, lexicon, and sources tables.

Word/lemma/PoS: Word, lemma, and part of speech in vertical format; can be imported into a database. In most of the corpora, texts are separated by a line with ## and the textID. (In COHA, each text is its own file.)

Linear text: This format provides a textID for each text, followed by the entire text on the same line. Words are not annotated for part of speech or lemma. In addition, contracted words like <can't> are split into two parts (ca n't), and punctuation is separated from words (eye level . As her).

Short sample

Database:
textID    ID          wordID
2002364   153180333   69
2002364   153180334   3
2002364   153180335   978
2002364   153180336   8880
2002364   153180337   8047
2002364   153180338   12
2002364   153180339   3
2002364   153180340   351
2002364   153180341   19630
2002364   153180342   134
2002364   153180343   6720
2002364   153180344   38
2002364   153180345   42
2002364   153180346   3355
2002364   153180347   3923
2002364   153180348   52
2002364   153180349   10985
2002364   153180350   3
2002364   153180351   44306
2002364   153180352   3792
2002364   153180353   22
2002364   153180354   3
2002364   153180355   809
2002364   153180356   449
2002364   153180357   3531
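
To illustrate the kind of JOIN the database format allows, here is a minimal sketch using Python's built-in sqlite3 module. The table names (corpus, lexicon, sources) come from the description above; the lexicon and sources column names, the `genre` column, and the pairing of each wordID with a word (inferred by lining up the database sample with the other two samples) are assumptions for illustration, not the documented schema.

```python
import sqlite3

# In-memory database; the real data would be imported from the downloaded files.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Assumed schema: only the corpus columns (textID, ID, wordID) appear in the sample.
cur.executescript("""
CREATE TABLE corpus  (textID INTEGER, ID INTEGER, wordID INTEGER);
CREATE TABLE lexicon (wordID INTEGER PRIMARY KEY, word TEXT, lemma TEXT, PoS TEXT);
CREATE TABLE sources (textID INTEGER PRIMARY KEY, genre TEXT);
""")

# First five rows of the database sample.
cur.executemany("INSERT INTO corpus VALUES (?, ?, ?)", [
    (2002364, 153180333, 69),
    (2002364, 153180334, 3),
    (2002364, 153180335, 978),
    (2002364, 153180336, 8880),
    (2002364, 153180337, 8047),
])

# Assumed lexicon entries, inferred by aligning the samples word by word.
cur.executemany("INSERT INTO lexicon VALUES (?, ?, ?, ?)", [
    (69, "But", "but", "ccb"),
    (3, "the", "the", "at"),
    (978, "huge", "huge", "jj"),
    (8880, "bonus", "bonus", "nn1"),
    (8047, "prize", "prize", "nn1"),
])

# Hypothetical metadata row; the real sources table has its own columns.
cur.execute("INSERT INTO sources VALUES (?, ?)", (2002364, "placeholder-genre"))

# Join all three tables and keep only singular common nouns (nn1).
rows = cur.execute("""
    SELECT c.ID, l.word, l.lemma, l.PoS, s.genre
    FROM corpus AS c
    JOIN lexicon AS l ON l.wordID = c.wordID
    JOIN sources AS s ON s.textID = c.textID
    WHERE l.PoS = 'nn1'
    ORDER BY c.ID
""").fetchall()

for row in rows:
    print(row)
```

Because the corpus table stores only integer IDs, filtering by word form, lemma, or part of speech always goes through the lexicon, and filtering by text metadata goes through sources.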

Word/lemma/PoS:
word          lemma         PoS
But           but           ccb
the           the           at
huge          huge          jj
bonus         bonus         nn1
prize         prize         nn1
is            be            vbz
the           the           at
real          real          jj
draw          draw          nn1@
--            --            x
announced     announce      vvn
by            by            ii
an            a             at1
electronic    electronic    jj
display       display       nn1
that          that          cst_dd1
resembles     resemble      vvz
the           the           at
ticking       ticking       jj
wheel         wheel         nn1
on            on            ii
the           the           at
TV            tv            nn1
game          game          nn1
show          show          nn1_vv0
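
The vertical format can also be read without a database. The sketch below groups (word, lemma, PoS) rows by text, assuming the three columns are tab-separated and that each text begins with a line consisting of ## followed by the textID; both details are assumptions based on the description above, and `parse_wlp` is a hypothetical helper name.

```python
from collections import defaultdict


def parse_wlp(lines):
    """Group (word, lemma, PoS) tuples by textID.

    Assumes tab-separated columns and a '##<textID>' line before each text.
    """
    texts = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("##"):
            current = line[2:].strip()  # start of a new text
            continue
        parts = line.split("\t")
        if current is None or len(parts) != 3:
            continue  # skip rows that do not fit the assumed layout
        word, lemma, pos = (p.strip() for p in parts)
        texts[current].append((word, lemma, pos))
    return dict(texts)


# First rows of the sample, written out in the assumed layout.
sample = [
    "##2002364\n",
    "But\tbut\tccb\n",
    "the\tthe\tat\n",
    "huge\thuge\tjj\n",
    "bonus\tbonus\tnn1\n",
    "prize\tprize\tnn1\n",
]

parsed = parse_wlp(sample)
lemmas = [lemma for _, lemma, _ in parsed["2002364"]]
print(lemmas)
```

For COHA, where each text is its own file, the same row-splitting logic would apply, but the textID would have to come from somewhere other than a ## line.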

Linear text:
##2002364 But the huge bonus prize is the real draw -- announced by an electronic display that resembles the ticking wheel on the TV game show , placed just above eye level . As her losses mounted to more than $200 , Budz fed the machine $5 tokens , pressing the Spin button almost rhythmically -- no serious slot player touches the pull handle on a one-armed bandit .
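
A linear-text line like the one above can be split into its textID and tokens with a few lines of Python. This is a rough sketch: it assumes every line starts with ## and the textID followed by a space, and the re-joining step (attaching n't and a small set of punctuation marks to the preceding token) is a simplification for illustration, not an official inverse of the corpus tokenization. Both function names are hypothetical.

```python
# Punctuation that is re-attached to the preceding token (an illustrative choice).
ATTACH_LEFT = {".", ",", "!", "?", ";", ":"}


def parse_linear_line(line):
    """Split '##<textID> <text>' into (textID, list of tokens)."""
    head, _, body = line.strip().partition(" ")
    if not head.startswith("##"):
        raise ValueError("expected a line starting with '##' and a textID")
    return head[2:], body.split()


def rough_detokenize(tokens):
    """Re-attach n't and simple punctuation to the preceding token."""
    out = []
    for tok in tokens:
        if out and (tok == "n't" or tok in ATTACH_LEFT):
            out[-1] += tok
        else:
            out.append(tok)
    return " ".join(out)


# A fragment of the sample line above.
line = "##2002364 on the TV game show , placed just above eye level ."
text_id, tokens = parse_linear_line(line)

print(text_id)
print(rough_detokenize(tokens))
print(rough_detokenize(["ca", "n't"]))
```

Splitting on whitespace works here because contractions and punctuation are already separated in this format, so every space-delimited item is one token.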