Full-text corpus data


For more information on texts and composition, click on the icon at the top of the page of each corpus.
 
Corpus | Texts (95% available in full-text data) | Focus / strengths
iWeb: The Intelligent Web Corpus | 14 billion words / 22 million web pages / ~100,000 websites | Size, size, and more size. Taken from ~100,000 of the most widely-used websites (for English) in the world. Probably the best for "web / tech" language.
NOW: News on the Web | 6.04 billion words / 6.0+ million texts. (As of early Dec 2016; continually growing). 20 countries. | The most up-to-date corpus of English. 4-5 million words added each day (130 million each month, 1.5 billion each year). Wide range of online newspapers and magazines (technology, entertainment, sports, politics, etc.).
GloWbE: Global Web-based English | 1.9 billion words / 1.8 million texts. 20 countries. About 60% blogs (very informal). Recent: 2013. | Comparing varieties of English: American, British, Australian, etc. 100x as large as the next-largest corpus of English dialects.
Wikipedia Corpus | 1.9 billion words / 4.4 million texts. | Best corpus for specialized language for an almost unlimited range of topics: science, entertainment, technology, history, sports, etc.
COCA: Corpus of Contemporary American English | 560 million words / 220,000 texts. US, 1990-2017. | Best coverage of all types of genres (informal to formal): spoken, fiction, magazines, newspaper, academic. The most widely-used corpus of English.
COHA: Corpus of Historical American English | 400 million words / 107,000 texts. US, 1810-2009. | Historical change. 100x as large as the next-largest historical corpus of English.
Corpus del Español (Spanish) | 2.0 billion words / 2.0 million texts. 21 countries. | The largest well-annotated corpus of Spanish. All of the strengths of GloWbE (above), but for Spanish.