Full-text corpus data


For more information on texts and composition, click on the icon at the top of each corpus's page.
 
| Corpus | Texts (95% available in full-text data) | Focus / strengths |
| --- | --- | --- |
| iWeb: The Intelligent Web Corpus | 14 billion words / 22 million web pages / ~100,000 websites | Size, size, and more size. Taken from ~100,000 of the most widely-used websites (for English) in the world. Probably the best for "web / tech" language. |
| NOW: News on the Web | 8.4 billion words / 8 million texts. (Still growing every month) | The most up-to-date corpus of English. Wide range of online newspapers and magazines (technology, entertainment, sports, politics, etc.). |
| GloWbE: Global Web-based English | 1.9 billion words / 1.8 million texts. 20 countries. About 60% blogs (very informal). Recent: 2013. | Comparing varieties of English: American, British, Australian, etc. 100x as large as the next-largest corpus of English dialects. |
| Wikipedia Corpus | 1.9 billion words / 4.4 million texts | Best corpus for specialized language for an almost unlimited range of topics: science, entertainment, technology, history, sports, etc. |
| COCA: Corpus of Contemporary American English | 560 million words / 220,000 texts. US, 1990-2017 | Best coverage of all types of genres (informal to formal): spoken, fiction, magazines, newspaper, academic. The most widely-used corpus of English. |
| COHA: Corpus of Historical American English | 400 million words / 107,000 texts. US, 1810-2009 | Historical change. 100x as large as the next-largest historical corpus of English. |
| TV Corpus | 325 million words / 75,000 episodes. US, UK, 4 other dialects, 1950-2018 | Extremely informal language. Can also be used to compare dialects and changes since the 1950s. |
| Movies Corpus | 200 million words / 25,000 movies. US, UK, 4 other dialects, 1930-2018 | Extremely informal language. Can also be used to compare dialects and changes since the 1930s. |
| SOAP Corpus | 100 million words / US, 2000-2012 | Very informal language from US soap operas. |
| Corpus del Español (Spanish) | 2.0 billion words / 2.0 million texts. 21 countries | The largest well-annotated corpus of Spanish. All of the strengths of GloWbE (above), but for Spanish. |