Full-text corpus data


The new Corpus del Español (Web / Dialects) contains about two billion words of data from 21 different countries. It is the largest corpus of Spanish that has been carefully lemmatized and tagged for part of speech. In addition, it has many advantages over the much smaller CORPES corpus from the Real Academia Española.

In terms of large, recent, well-annotated corpora of Spanish, there really are no other options -- especially since you can now download virtually the entire corpus for offline use.

The corpus can be used for many different purposes: creating word frequency and n-gram data, creating teaching materials, natural language processing, and linguistic analysis (e.g. comparing dialects, looking for complex syntactic constructions, etc.).

The corpus comes in three different formats. See samples (about 1/100th of the entire corpus) of the text, word / lemma / part of speech, and database formats. When you purchase the data, you have a license to download and use all three formats.
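
As a rough illustration of the word frequency and n-gram uses mentioned above, here is a minimal Python sketch that works on word / lemma / part of speech data. It assumes, purely for illustration, that each token is on its own line with three tab-separated columns (word, lemma, part of speech). The actual column layout of the download may differ, so check the samples first. The function names and the part of speech tags in the example are hypothetical.

```python
from collections import Counter


def parse_rows(lines):
    """Yield (word, lemma, pos) tuples.

    ASSUMPTION: one token per line, three tab-separated columns.
    Lines that do not match this layout are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            yield tuple(parts)


def lemma_frequencies(lines):
    """Count how often each (lowercased) lemma occurs."""
    return Counter(lemma.lower() for _, lemma, _ in parse_rows(lines))


def word_ngrams(lines, n=2):
    """Count n-grams of (lowercased) word forms, in running order."""
    words = [word.lower() for word, _, _ in parse_rows(lines)]
    # zip over n shifted copies of the word list to get consecutive n-tuples
    return Counter(zip(*(words[i:] for i in range(n))))


if __name__ == "__main__":
    # Hypothetical rows; the tags "art", "n", "v" are placeholders,
    # not the corpus's real tagset.
    sample = [
        "Los\tel\tart",
        "gatos\tgato\tn",
        "comen\tcomer\tv",
        "los\tel\tart",
        "gatos\tgato\tn",
    ]
    print(lemma_frequencies(sample).most_common(3))
    print(word_ngrams(sample, n=2).most_common(3))
```

For a file on disk, you could pass the open file object directly (for example `lemma_frequencies(open(path, encoding="utf-8"))`). The right encoding and path depend on the files you download.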