Full-text corpus data


The Corpus do Português (Web / Dialects) contains about two billion words of data from four different countries. It is the largest corpus of Portuguese that has been carefully lemmatized and tagged for part of speech.

The corpus can be used for many different purposes: creating word frequency and n-gram data, creating teaching materials, natural language processing, and linguistic analysis (e.g. comparing dialects or looking for complex syntactic constructions).

The corpus comes in three different formats: text, word / lemma / part of speech, and database. Samples of each format (about 1/100th of the entire corpus) are available. When you purchase the data, you have a license to download and use all three formats.
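
As a rough illustration of how the word / lemma / part of speech format might be used to build frequency and n-gram data, here is a minimal Python sketch. The file layout is an assumption: it supposes one token per line with three tab-separated columns (word, lemma, tag), which may not match the actual download. The tag names, the sample lines, and the function names are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str, str]  # (word, lemma, pos) -- assumed column order


def parse_tokens(lines: Iterable[str]) -> Iterator[Token]:
    """Yield (word, lemma, pos) from tab-separated lines; skip malformed lines."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # assumption: anything without 3 columns is not a token line
        word, lemma, pos = parts
        yield word, lemma, pos


def lemma_frequencies(tokens: List[Token]) -> Counter:
    """Count how often each lemma occurs (case-insensitive)."""
    return Counter(lemma.lower() for _, lemma, _ in tokens)


def bigram_frequencies(tokens: List[Token]) -> Counter:
    """Count adjacent word pairs (2-grams), case-insensitive."""
    words = [word.lower() for word, _, _ in tokens]
    return Counter(zip(words, words[1:]))


# Invented sample lines, not taken from the corpus; tags are placeholders.
sample = [
    "Os\to\tDET",
    "livros\tlivro\tNOUN",
    "e\te\tCONJ",
    "o\to\tDET",
    "livro\tlivro\tNOUN",
]

tokens = list(parse_tokens(sample))
lemma_counts = lemma_frequencies(tokens)
bigram_counts = bigram_frequencies(tokens)
```

For a real file, the same functions could be fed an open file handle instead of the `sample` list, for example `list(parse_tokens(open(path, encoding="utf-8")))`, once the actual column layout and encoding have been checked against the samples.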