The Corpus do Português (Web / Dialects) contains about one billion
words of data from four countries (656 million words from Brazil,
327 million from Portugal, 35 million from Angola, and 32 million
from Mozambique). It is the largest corpus of Portuguese that has
been carefully lemmatized and tagged for part of speech.
The corpus can be used for many different purposes: creating
word frequency and n-gram data, creating teaching materials,
natural language processing, and linguistic analysis (e.g. comparing
dialects or looking for complex syntactic constructions).
The corpus comes in three different formats: text,
word / lemma / part of speech, and database. Samples
(about 1/100th of the entire corpus) are available for
each format. When you purchase the data, you
have a license to download and use all three formats.
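As a rough illustration of how the word / lemma / part of speech format could feed word frequency and n-gram work, here is a minimal Python sketch. It assumes one token per line with three tab-separated fields (word, lemma, part of speech); that layout, the tag names, and the function names are assumptions made for the example, not the documented file format, so check the samples before adapting it.

```python
from collections import Counter


def parse_tokens(lines):
    """Yield (word, lemma, pos) from lines of 'word<TAB>lemma<TAB>pos'.

    The three-column tab-separated layout is an assumption for this sketch.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        yield tuple(parts)


def lemma_frequencies(lines):
    """Count how often each lemma occurs (case-insensitive)."""
    return Counter(lemma.lower() for _, lemma, _ in parse_tokens(lines))


def word_bigrams(lines):
    """Count adjacent pairs of lowercased word forms."""
    words = [word.lower() for word, _, _ in parse_tokens(lines)]
    return Counter(zip(words, words[1:]))


# Hypothetical sample data; the part-of-speech tags are placeholders.
sample = [
    "Os\to\tDET",
    "meninos\tmenino\tN",
    "viram\tver\tV",
    "o\to\tDET",
    "menino\tmenino\tN",
]

lemma_counts = lemma_frequencies(sample)
bigram_counts = word_bigrams(sample)
```

Counting lemmas rather than word forms groups inflected variants such as "meninos" and "menino" under one entry, which is the main benefit of a lemmatized corpus for frequency lists. For the full corpus, the same functions could be fed an open file handle instead of a list, so the data is read line by line.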