Full-text corpus data


The lexicon ("dictionary" entries) is composed of six different files. "Types" are unique combinations of (case-sensitive) word form, lemma, and part of speech. For example, string as a noun and string as a verb are different types, as are string and String. "Tokens" is the total number of occurrences.
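The distinction between types and tokens can be illustrated with a minimal sketch. The (word form, lemma, part of speech) tuples and the tag names NOUN and VERB below are hypothetical and chosen only for illustration; they are not taken from the lexicon files themselves.

```python
from collections import Counter

# Hypothetical (word form, lemma, part of speech) annotations.
# The tag names are illustrative, not the corpus's actual tag set.
tokens = [
    ("string", "string", "NOUN"),
    ("string", "string", "VERB"),  # same form, different part of speech
    ("String", "string", "NOUN"),  # same lemma and tag, different case
    ("string", "string", "NOUN"),  # repeat of the first entry
]

# Each distinct (form, lemma, PoS) combination is one type.
type_counts = Counter(tokens)
num_types = len(type_counts)

# Tokens are the total number of occurrences across all types.
num_tokens = sum(type_counts.values())

print(num_types, num_tokens)  # 3 4
```

Here the four occurrences collapse to three types, because only the first and last entries agree on all three fields.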

File          Types       Tokens          Explanation
lexicon1.txt  3,845,588   16,202,034,905  Words that occur 15 times or more in the corpus (excluding punctuation: 3,757,760 types, 13,968,339,638 tokens)
lexicon2.txt  19,705,013  78,816,683      Words that occur between 2 and 14 times
lexicon3.txt  396,593     396,593         Words that occur just once.
lexicon4.txt  23,580,110  23,580,110      Text delimiters, which are found at the beginning of each text. These have the format "@@{digits}", where the digits refer to the textID for one of the 22 million texts (web pages) in the corpus.
lexicon5.txt  1,046,934   35,132,660      "Placeholders" for duplicate strings, such as "this material has been copyrighted", etc. These have the form "@qwx{digits}", where the digits refer to the entries in this file.
lexicon6.txt  17,217,679  18,757,272      Miscellaneous entries starting with "@", which are neither text delimiters, nor do they refer to duplicate strings. In nearly all cases, they simply refer to the line after the text delimiter, and they refer to the textID for the text at an earlier stage of processing. They can be safely ignored.

The first two files (lexicon1.txt and lexicon2.txt) alone account for 99.997% of all of the "word" (= non-delimiter) tokens in the corpus.
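The coverage figure can be checked against the token counts in the table. The sketch below assumes that the percentage is computed over non-punctuation tokens, using the "excluding punctuation" count given for lexicon1.txt; the text does not say so explicitly, so this is an assumption. With punctuation included, the same calculation gives roughly 99.9976%.

```python
# Token counts copied from the table above.
LEX1_TOKENS_NO_PUNCT = 13_968_339_638  # lexicon1.txt, excluding punctuation (assumed basis)
LEX2_TOKENS = 78_816_683               # lexicon2.txt
LEX3_TOKENS = 396_593                  # lexicon3.txt (words occurring once)


def coverage_pct(covered: int, total: int) -> float:
    """Return the share of `total` accounted for by `covered`, in percent."""
    return 100.0 * covered / total


# Tokens covered by the first two files, and all "word" tokens (files 1-3).
covered = LEX1_TOKENS_NO_PUNCT + LEX2_TOKENS
total = covered + LEX3_TOKENS

pct = coverage_pct(covered, total)
print(f"{pct:.3f}%")  # 99.997%
```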

The beginning of a sample text illustrates files 4-6. In the ID column below, row 4 (compare lexicon4.txt above) is the text delimiter; row 6, which can be safely ignored, is the text delimiter for a previous version of the text; and row 5 refers to the duplicate phrase "If this is your first visit, be sure to check out the FAQ by clicking the link above", which occurs 565 times in website 94099. (The preceding row, marked (5a), has a similar "word" (203275 vs. 403275) and also refers to this repeated phrase.)
 
websiteID  textID    offset      ID    word
94099      34653740  9069765341  4     @@34653741
94099      34653740  9069765342  6     @3653741/
94099      34653740  9069765343  (5a)  203275
94099      34653740  9069765344  5     @qwx403275
94099      34653740  9069765345        <h>
94099      34653741  9069765346        Hood
94099      34653741  9069765347        has
94099      34653741  9069765348        tons
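Rows like those in the sample could be sorted into the categories described above with a short script. This is a minimal sketch, not the corpus's own tooling: it assumes each data row holds four whitespace-separated fields (websiteID, textID, offset, word), that the ID column in the sample is only an explanatory label and not part of the data, and that the names Row, classify, and parse_row are hypothetical. It does not treat the bare number on the (5a) row specially.

```python
import re
from typing import NamedTuple


class Row(NamedTuple):
    website_id: int
    text_id: int
    offset: int
    word: str


# Patterns taken from the formats described in the table of lexicon files.
TEXT_DELIMITER = re.compile(r"@@\d+")    # lexicon4.txt: "@@{digits}"
PLACEHOLDER = re.compile(r"@qwx\d+")     # lexicon5.txt: "@qwx{digits}"


def classify(word: str) -> str:
    """Assign a word-column entry to one of the categories described above."""
    if TEXT_DELIMITER.fullmatch(word):
        return "text_delimiter"
    if PLACEHOLDER.fullmatch(word):
        return "placeholder"
    if word.startswith("@"):
        return "misc"  # lexicon6.txt entries, which can be safely ignored
    return "word"


def parse_row(line: str) -> Row:
    """Parse one row, assuming four whitespace-separated fields (no ID label)."""
    website_id, text_id, offset, word = line.split(maxsplit=3)
    return Row(int(website_id), int(text_id), int(offset), word.strip())


# Sample rows from above, with the explanatory ID label left out.
sample = [
    "94099 34653740 9069765341 @@34653741",
    "94099 34653740 9069765342 @3653741/",
    "94099 34653740 9069765344 @qwx403275",
    "94099 34653740 9069765345 <h>",
    "94099 34653741 9069765346 Hood",
]

rows = [parse_row(line) for line in sample]
kinds = [classify(row.word) for row in rows]

for row, kind in zip(rows, kinds):
    print(row.offset, row.word, kind)
```

Checking the two specific "@" patterns before the generic startswith("@") test matters, since text delimiters and placeholders also begin with "@".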