The Corpus do Português (Web / Dialects) contains about
two billion words of data from four different
countries. It is the largest corpus of Portuguese that has
been carefully lemmatized and tagged
for part of speech.
The corpus can be used for
many different purposes: creating
word frequency and n-gram data, creating teaching
materials, natural language processing, and linguistic
analysis (e.g. comparing dialects, looking for complex
syntactic constructions, etc.).
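
As a rough illustration of the first use case, the sketch below counts lemma frequencies and bigrams from word / lemma / part of speech data. It is only a minimal sketch under assumptions: it supposes one token per line with three tab-separated columns, and the sample string, the tag names, and the helper functions (read_tokens, lemma_frequencies, ngrams) are hypothetical and not taken from the corpus documentation. The actual file layout may differ, so check the samples before adapting it.

```python
from collections import Counter

# Hypothetical sample: one token per line as word<TAB>lemma<TAB>pos.
# The real column layout and tag set of the corpus files may differ.
SAMPLE = (
    "o\to\tART\n"
    "menino\tmenino\tN\n"
    "viu\tver\tV\n"
    "os\to\tART\n"
    "meninos\tmenino\tN\n"
)


def read_tokens(text):
    """Parse tab-separated lines into (word, lemma, pos) tuples."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:  # skip blank or malformed lines
            tokens.append(tuple(parts))
    return tokens


def lemma_frequencies(tokens):
    """Count how often each lemma occurs."""
    return Counter(lemma for _, lemma, _ in tokens)


def ngrams(tokens, n=2):
    """Count n-grams over lowercased word forms."""
    words = [word.lower() for word, _, _ in tokens]
    return Counter(zip(*(words[i:] for i in range(n))))


tokens = read_tokens(SAMPLE)
lemma_counts = lemma_frequencies(tokens)
bigram_counts = ngrams(tokens, 2)

print(lemma_counts.most_common(3))
print(bigram_counts.most_common(3))
```

Counting by lemma rather than by word form groups inflected forms such as "menino" and "meninos" together, which is the main benefit of lemmatized data for frequency lists.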
The corpus comes in three
different formats. See
samples (about 1/100th of the
entire corpus) of the
word / lemma / part of speech and
database formats. When you purchase the data, you
receive a license to download and use all three formats.