Full-text corpus data

The full-text corpus data is available in three different formats. When you purchase the data, you purchase the rights to all three formats, and you can download whichever ones you want.

Samples: The sample data linked below is taken completely at random from each of the corpora (usually about 1/100th of the total number of texts; 1/1,000th in the case of NOW). No attempt has been made to "clean up" this sample data in any way. If you're happy with the sample data that you download, you should be equally happy with the complete set of data.

Note that the size shown in the first column is the total number of words that you can download after purchasing the data. The size in the other three columns is for the samples. Note also that the shared files below (sources and lexicon) are just for the sample texts.

NOW corpus: the samples below (for the entire period 2010-2024) are quite large (more than 215 million words). If you have limited bandwidth, you might want to download just the sample data from 2024, which is "just" 21.7 million words. (More info)

Formats: Database (more), Word/lemma/PoS, Linear text

Corpus (size of complete full-text data)	Shared files	Samples (mw = million words)
NOW ( billion)	Lexicon, Sources	215 mw (2024)
iWeb (14 billion)	(See)	14 mw
COCA (950 million)	Lexicon, Sources	8.9 mw
COHA (385 million)	Lexicon, Sources	3.6 mw
GloWbE (1.8 billion)	Lexicon, Sources	2.1 mw
Coronavirus (1.5 billion)	Lexicon, Sources	3.2 mw
Wikipedia (1.8 billion)	Lexicon, Sources	1.8 mw
TV (310 million)	Lexicon, Sources	2.1 mw
Movies (190 million)	Lexicon, Sources	1.6 mw
SOAP (95 million)	Lexicon, Sources	2.1 mw

Corpus (size of complete full-text data)	Shared files	Samples
Spanish (1.8 billion)	Lexicon, Sources	2.0 mw
Portuguese (1 billion)	Lexicon, Sources	9.5 mw

Explanation and notes

Database: The most robust format, but it requires knowledge of SQL. It allows for powerful JOINs across the corpus, lexicon, and sources tables.

Word/lemma/PoS: Word, lemma, and part of speech in vertical format (one token per line); can be imported into a database. In most of the corpora, texts are separated by a line with ## and the textID. (In COHA, each text is its own file.)

Linear text: This format provides a textID for each text, followed by the entire text on the same line. In this format, words are not annotated for part of speech or lemma. In addition, contracted words like <can't> are separated into two parts (ca n't), and punctuation is separated from words (eye level . As her).

Short sample

textID	ID	wordID
2002364	153180333	69
2002364	153180334	3
2002364	153180335	978
2002364	153180336	8880
2002364	153180337	8047
2002364	153180338	12
2002364	153180339	3
2002364	153180340	351
2002364	153180341	19630
2002364	153180342	134
2002364	153180343	6720
2002364	153180344	38
2002364	153180345	42
2002364	153180346	3355
2002364	153180347	3923
2002364	153180348	52
2002364	153180349	10985
2002364	153180350	3
2002364	153180351	44306
2002364	153180352	3792
2002364	153180353	22
2002364	153180354	3
2002364	153180355	809
2002364	153180356	449
2002364	153180357	3531
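
To give a rough idea of the kind of JOIN the database format makes possible, here is a minimal sketch using Python's built-in sqlite3 module. Only the corpus columns (textID, ID, wordID) come from the sample above. The table names, the lexicon and sources column layouts, and the title value are assumptions made for illustration, and the wordID-to-word pairs were inferred by lining up this sample with the word/lemma/PoS sample on this page, so check them against the real lexicon file before relying on them.

```python
import sqlite3

# In-memory database; the schema below is an assumed layout, not the official one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE corpus  (textID INTEGER, ID INTEGER PRIMARY KEY, wordID INTEGER);
    CREATE TABLE lexicon (wordID INTEGER PRIMARY KEY, word TEXT, lemma TEXT, PoS TEXT);
    CREATE TABLE sources (textID INTEGER PRIMARY KEY, title TEXT);
""")

# First five rows of the database sample.
corpus_rows = [
    (2002364, 153180333, 69),
    (2002364, 153180334, 3),
    (2002364, 153180335, 978),
    (2002364, 153180336, 8880),
    (2002364, 153180337, 8047),
]
# Inferred by aligning the two samples (an assumption, not taken from a lexicon file).
lexicon_rows = [
    (69, "But", "but", "ccb"),
    (3, "the", "the", "at"),
    (978, "huge", "huge", "jj"),
    (8880, "bonus", "bonus", "nn1"),
    (8047, "prize", "prize", "nn1"),
]
# Hypothetical metadata row; the real sources table will have other columns.
sources_rows = [(2002364, "(hypothetical title)")]

conn.executemany("INSERT INTO corpus VALUES (?, ?, ?)", corpus_rows)
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?, ?)", lexicon_rows)
conn.executemany("INSERT INTO sources VALUES (?, ?)", sources_rows)

# Resolve each token's wordID to its word, lemma, and PoS, and attach text metadata.
query = """
    SELECT c.ID, l.word, l.lemma, l.PoS, s.title
    FROM corpus AS c
    JOIN lexicon AS l ON l.wordID = c.wordID
    JOIN sources AS s ON s.textID = c.textID
    WHERE c.textID = ?
    ORDER BY c.ID
"""
rows = conn.execute(query, (2002364,)).fetchall()
words = [row[1] for row in rows]
print(" ".join(words))
```

Ordering by ID restores the original token order, which is why the corpus table keeps a running ID alongside the textID.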

word	lemma	PoS
But	but	ccb
the	the	at
huge	huge	jj
bonus	bonus	nn1
prize	prize	nn1
is	be	vbz
the	the	at
real	real	jj
draw	draw	nn1@
--	--	x
announced	announce	vvn
by	by	ii
an	a	at1
electronic	electronic	jj
display	display	nn1
that	that	cst_dd1
resembles	resemble	vvz
the	the	at
ticking	ticking	jj
wheel	wheel	nn1
on	on	ii
the	the	at
TV	tv	nn1
game	game	nn1
show	show	nn1_vv0
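
The vertical word/lemma/PoS format can be read with a few lines of code. The sketch below assumes that columns are tab-separated and that a text separator is a line that starts with ## followed directly by the textID (as in the linear text sample); the function name parse_wlp is made up for this example. For COHA, where each text is its own file, the grouping step would not be needed.

```python
from collections import defaultdict

def parse_wlp(lines):
    """Group (word, lemma, PoS) rows by text, using '##<textID>' lines as separators."""
    texts = defaultdict(list)
    current_id = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed separator form: starts with '##' and has no tab-separated columns.
        if line.startswith("##") and "\t" not in line:
            current_id = line[2:].strip()
            continue
        parts = line.split("\t")
        if len(parts) != 3 or current_id is None:
            continue  # skip malformed rows and rows before the first separator
        texts[current_id].append(tuple(parts))
    return dict(texts)

# A few rows from the sample above, preceded by an assumed separator line.
sample = [
    "##2002364",
    "But\tbut\tccb",
    "the\tthe\tat",
    "huge\thuge\tjj",
    "bonus\tbonus\tnn1",
    "prize\tprize\tnn1",
    "is\tbe\tvbz",
]
texts = parse_wlp(sample)
lemmas = [lemma for _, lemma, _ in texts["2002364"]]
print(lemmas)
```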

##2002364 But the huge bonus prize is the real draw -- announced by an electronic display that resembles the ticking wheel on the TV game show , placed just above eye level . As her losses mounted to more than $200 , Budz fed the machine $5 tokens , pressing the Spin button almost rhythmically -- no serious slot player touches the pull handle on a one-armed bandit .
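
A line in the linear text format can be split into its textID and tokens as sketched below. This assumes the ## prefix and whitespace separator seen in the sample line; both function names are made up for this example. The detokenize helper is only a rough heuristic for undoing the separation of contractions and punctuation, not the rule used to produce the data, and it will wrongly attach things like opening single quotes.

```python
NO_SPACE_BEFORE = {".", ",", ";", ":", "?", "!"}

def parse_linear_line(line):
    """Split '##<textID> <text>' into (textID, list of tokens)."""
    parts = line.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("empty line")
    text_id = parts[0].lstrip("#")
    tokens = parts[1].split() if len(parts) > 1 else []
    return text_id, tokens

def detokenize(tokens):
    """Heuristically reattach split contractions and punctuation to the previous token."""
    out = []
    for tok in tokens:
        attach = tok in NO_SPACE_BEFORE or tok == "n't" or tok.startswith("'")
        if out and attach:
            out[-1] += tok
        else:
            out.append(tok)
    return " ".join(out)

# Shortened excerpt of the sample line above.
line = "##2002364 But the huge bonus prize is the real draw"
text_id, tokens = parse_linear_line(line)
print(text_id, len(tokens))
print(detokenize(["eye", "level", ".", "As", "her"]))
print(detokenize(["ca", "n't"]))
```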