Full-text corpus data

The Corpus do Português (Web / Dialects) contains about one billion words of data from four countries (656 million words from Brazil, 327 million from Portugal, 35 million from Angola, and 32 million from Mozambique). It is the largest corpus of Portuguese that has been carefully lemmatized and tagged for part of speech.

The corpus can be used for many different purposes: creating word frequency and n-gram data, creating teaching materials, natural language processing, and linguistic analysis (e.g. comparing dialects or looking for complex syntactic constructions).
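As a rough illustration of the first use case, word frequency and n-gram counts can be derived from tokenized text along the following lines. This is only a minimal sketch: the sample sentence is made up rather than taken from the corpus, the whitespace tokenization is a simplification, and the helper name `ngrams` is hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over successive n-token tuples (hypothetical helper)."""
    return zip(*(tokens[i:] for i in range(n)))


# Made-up sample sentence, NOT taken from the corpus.
text = "o gato viu o cão e o gato fugiu"

# Assumption: simple lowercase + whitespace tokenization is good enough here.
tokens = text.lower().split()

# Count individual words and adjacent word pairs (bigrams).
word_freq = Counter(tokens)
bigram_freq = Counter(ngrams(tokens, 2))

print(word_freq.most_common(2))
print(bigram_freq.most_common(1))
```

The same counting could in principle be applied to lemmas instead of word forms, which would group inflected variants together.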

The corpus comes in three different formats: text, word / lemma / part of speech, and database. See samples (about 1/100th of the entire corpus) of each format. When you purchase the data, you have a license to download and use all three formats.