Full-text corpus data



After you have purchased the data, we will send you links to download it. There are five files for each corpus: the three formats of the data (text, word / lemma / PoS, and database), a sources file (with metadata on each of the texts), and a "lexicon" (all distinct word forms + lemma + PoS). You can download just the formats or files that you need.

The preferred download method is Google Drive. The person who will be downloading the data should first check that they have access to Google Drive. If they can download and view this file (which says "You have access to Google Drive"), then everything is fine.

Things work best if you have a Google account, but you can still download the data without one. In that case, you will simply need to click on a link in an email that you receive from Google, to confirm the email address that we will enter.

If you do not have access to Google Drive (for example, if you are in a country that does not allow access to Google), you can download the data from our server, but there will be a surcharge for each corpus purchased: $25 (for TV, Movies, SOAP, COHA), $45 (for COCA, Wikipedia, GloWbE, Coronavirus, Spanish, Portuguese), or $95 (for NOW or iWeb).
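As a rough illustration of how the server surcharge adds up across several corpora, here is a minimal Python sketch. The function name `server_surcharge` and the table name are hypothetical, the corpus labels are simply the ones used in the paragraph above, and the sketch assumes the per-corpus surcharges are summed with no discounts or other adjustments.

```python
# Per-corpus surcharge (US dollars) for downloading from the server
# instead of Google Drive, as listed in the paragraph above.
SERVER_SURCHARGE_USD = {
    **dict.fromkeys(["TV", "Movies", "SOAP", "COHA"], 25),
    **dict.fromkeys(
        ["COCA", "Wikipedia", "GloWbE", "Coronavirus", "Spanish", "Portuguese"], 45
    ),
    **dict.fromkeys(["NOW", "iWeb"], 95),
}


def server_surcharge(corpora):
    """Return the total surcharge for a list of corpus names.

    Hypothetical helper: assumes surcharges are simply added together.
    """
    unknown = [name for name in corpora if name not in SERVER_SURCHARGE_USD]
    if unknown:
        raise ValueError(f"No surcharge listed for: {', '.join(unknown)}")
    return sum(SERVER_SURCHARGE_USD[name] for name in corpora)


# Example: one corpus from each price tier.
print(server_surcharge(["COHA", "COCA", "iWeb"]))  # 25 + 45 + 95 = 165
```

For example, buying COHA, COCA, and iWeb and downloading all three from the server would add $165 to the order under these assumptions.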