For more information on texts and composition, click on the icon at the top of the page of each corpus.
Corpus | Texts (95% available in full-text data) | Focus / strengths |
iWeb: The Intelligent Web Corpus | 14 billion words / 22 million web pages / ~100,000 websites | Size, size, and more size. Taken from ~100,000 of the most widely used websites (for English) in the world. Probably the best for "web / tech" language. |
NOW: News on the Web (two datasets) | 19.8 billion words / 33,800,973 texts. 20 countries. (Still growing every month; last update is for Sep 2024) | The most up-to-date corpus of English. Wide range of online newspapers and magazines (technology, entertainment, sports, politics, etc.) |
Coronavirus Corpus | 1.5 billion words / 1.9 million texts. 20 countries, Jan 2020 - Dec 2022 | Designed to be the definitive record of the social, cultural, and economic impact of the coronavirus (COVID-19) in 2020-2022. |
COCA: Corpus of Contemporary American English | 1 billion words / 485,000 texts. US, 1990-2019 | Best coverage of all types of genres (informal to formal): TV/movie subtitles, blogs, web pages, spoken, fiction, magazines, newspapers, academic. The most widely used corpus of English. |
GloWbE: Global Web-based English | 1.9 billion words / 1.8 million texts. 20 countries | About 60% blogs (very informal). Recent: 2013. Comparing varieties of English: American, British, Australian, etc. 100x as large as the next-largest corpus of English dialects. |
Wikipedia Corpus | 1.9 billion words / 4.4 million texts | Best corpus for specialized language for an almost unlimited range of topics: science, entertainment, technology, history, sports, etc. |
COHA: Corpus of Historical American English | 400 million words / 107,000 texts. US, 1810-2009 | Historical change. 100x as large as the next-largest historical corpus of English. |
TV Corpus | 325 million words / 75,000 episodes. US, UK, 4 other dialects, 1950-2018 | Extremely informal language. Can also be used to compare dialects and changes since the 1950s. |
Movies Corpus | 200 million words / 25,000 movies. US, UK, 4 other dialects, 1930-2018 | Extremely informal language. Can also be used to compare dialects and changes since the 1930s. |
SOAP Corpus | 100 million words / US, 2000-2012 | Very informal language from US soap operas. |
Corpus del Español | 2.0 billion words / 2.0 million texts. 21 countries | The largest well-annotated corpus of Spanish. All of the strengths of GloWbE (above), but for Spanish. |
Corpus do Português | 1.0 billion words / 1.1 million texts. 4 countries | The largest well-annotated corpus of Portuguese. All of the strengths of GloWbE (above), but for Portuguese. |