The new
Corpus del Español (Web / Dialects) contains about
two billion words of data from 21 different
countries. It is the largest corpus of Spanish that has
been
carefully lemmatized and tagged for part of speech.
In addition, it has
many advantages over the much smaller CORPES corpus
from the Real Academia Española.
Among large, recent,
well-annotated corpora of Spanish, there are no
comparable options, especially since you can now download
virtually the entire corpus for offline use.
The corpus can be used for
many different purposes: creating
word frequency and n-gram data, creating teaching
materials, natural language processing, and linguistic
analysis (e.g. comparing dialects, looking for complex
syntactic constructions, etc.).
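
As a rough illustration of the first of these uses, here is a minimal Python sketch that counts word frequencies and bigrams from plain running text. It is not tied to the actual layout of the corpus files: the sample sentence, the `tokenize` helper, and the `ngram_counts` helper are all hypothetical, and a real workflow would read the downloaded text files instead.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters (accented letters included).

    Hypothetical helper: a deliberately simple tokenizer, not the one used
    to build the corpus.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens.

    Note: this simple version also counts n-grams that span sentence
    boundaries, because punctuation is dropped by tokenize().
    """
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Made-up sample text, standing in for a file of corpus text.
sample = "El gato come. El gato duerme."

tokens = tokenize(sample)
word_freq = Counter(tokens)
bigrams = ngram_counts(tokens, 2)

# Show the most frequent words and bigrams.
print(word_freq.most_common(3))
print(bigrams.most_common(3))
```

For lemma-based counts, the same counting step could be applied to the lemma column of the word / lemma / part of speech format, once its exact layout has been checked against the samples.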
The corpus comes in three
different formats. See
samples (about 1/100th of the
entire corpus) of
text,
word / lemma / part of speech, and
database formats. When you purchase the data, you
have a license to download and use all three formats.