Full-text corpus data


For more information on texts and composition, click on the icon at the top of the page for each corpus.
 
For each corpus: texts (95% available in full-text data) and focus / strengths
iWeb: The Intelligent Web Corpus
Texts: 14 billion words / 22 million web pages / ~100,000 websites
Focus: Size, size, and more size. Taken from ~100,000 of the most widely used English-language websites in the world. Probably the best corpus for "web / tech" language.
NOW: News on the Web (two datasets)
Texts: 13.4 billion words / 23,602,535 texts. Still growing every month; the last update is for Aug 2021.
Focus: The most up-to-date corpus of English. Wide range of online newspapers and magazines (technology, entertainment, sports, politics, etc.).
Coronavirus Corpus (two datasets)
Texts: 1.81 billion words / 1,551,198 texts. Still growing every month; the last update is for Aug 2021.
Focus: Designed to be the definitive record of the social, cultural, and economic impact of the coronavirus (COVID-19) in 2020 and beyond.
COCA: Corpus of Contemporary American English
Texts: 1 billion words / 485,000 texts. US, 1990-2019
Focus: Best coverage of all types of genres (informal to formal): TV/movie subtitles, blogs, web pages, spoken, fiction, magazines, newspapers, academic. The most widely used corpus of English.
GloWbE: Global Web-based English
Texts: 1.9 billion words / 1.8 million texts. 20 countries
Focus: About 60% blogs (very informal). Recent: 2013. Comparing varieties of English: American, British, Australian, etc. 100x as large as the next-largest corpus of English dialects.
Wikipedia Corpus
Texts: 1.9 billion words / 4.4 million texts
Focus: Best corpus for specialized language across an almost unlimited range of topics: science, entertainment, technology, history, sports, etc.
COHA: Corpus of Historical American English
Texts: 400 million words / 107,000 texts. US, 1810-2009
Focus: Historical change. 100x as large as the next-largest historical corpus of English.
TV Corpus
Texts: 325 million words / 75,000 episodes. US, UK, 4 other dialects, 1950-2018
Focus: Extremely informal language. Can also be used to compare dialects and changes since the 1950s.
Movies Corpus
Texts: 200 million words / 25,000 movies. US, UK, 4 other dialects, 1930-2018
Focus: Extremely informal language. Can also be used to compare dialects and changes since the 1930s.
SOAP Corpus
Texts: 100 million words. US, 2000-2012
Focus: Very informal language from US soap operas.
     
Corpus del Español
Texts: 2.0 billion words / 2.0 million texts. 21 countries
Focus: The largest well-annotated corpus of Spanish. All of the strengths of GloWbE (above), but for Spanish.
Corpus do Português
Texts: 1.0 billion words / 1.1 million texts. 4 countries
Focus: The largest well-annotated corpus of Portuguese. All of the strengths of GloWbE (above), but for Portuguese.