Full-text corpus data

For more information on the texts and composition of each corpus, click the icon at the top of that corpus's page.

Columns: Corpus; Texts (95% available in full-text data); Focus / strengths

iWeb: The Intelligent Web Corpus
14 billion words / 22 million web pages / ~100,000 websites
Size, size, and more size. Taken from ~100,000 of the most widely used English-language websites in the world. Probably the best for "web / tech" language.

NOW: News on the Web
18.9 billion words / 32,190,286 texts. 20 countries. Two datasets. Still growing every month; last update is for March 2024.
The most up-to-date corpus of English. Wide range of online newspapers and magazines (technology, entertainment, sports, politics, etc.).

Coronavirus Corpus
1.5 billion words / 1.9 million texts. 20 countries, Jan 2020 - Dec 2022
Designed to be the definitive record of the social, cultural, and economic impact of the coronavirus (COVID-19) in 2020-2022.

COCA: Corpus of Contemporary American English
1 billion words / 485,000 texts. US, 1990-2019
Best coverage of all types of genres (informal to formal): TV and movie subtitles, blogs, web pages, spoken, fiction, magazines, newspapers, academic. The most widely used corpus of English.

GloWbE: Global Web-based English
1.9 billion words / 1.8 million texts. 20 countries
About 60% blogs (very informal). Recent: 2013. Comparing varieties of English: American, British, Australian, etc. 100x as large as the next-largest corpus of English dialects.

Wikipedia Corpus
1.9 billion words / 4.4 million texts
Best corpus for specialized language for an almost unlimited range of topics: science, entertainment, technology, history, sports, etc.

COHA: Corpus of Historical American English
400 million words / 107,000 texts. US, 1810-2009
Historical change. 100x as large as the next-largest historical corpus of English.

TV Corpus
325 million words / 75,000 episodes. US, UK, 4 other dialects, 1950-2018
Extremely informal language. Can also be used to compare dialects and changes since the 1950s.

Movies Corpus
200 million words / 25,000 movies. US, UK, 4 other dialects, 1930-2018
Extremely informal language. Can also be used to compare dialects and changes since the 1930s.

SOAP Corpus
100 million words. US, 2000-2012
Very informal language from US soap operas.

Corpus del Español
2.0 billion words / 2.0 million texts. 21 countries
The largest well-annotated corpus of Spanish. All of the strengths of GloWbE (above), but for Spanish.

Corpus do Português
1.0 billion words / 1.1 million texts. 4 countries
The largest well-annotated corpus of Portuguese. All of the strengths of GloWbE (above), but for Portuguese.