The lexicon (the "dictionary" entries) is split across six
files. "Types" are unique combinations of (case-sensitive)
word form, lemma, and part of speech. For example, string as a noun
and string as a verb are different types, as are string and String.
"Tokens" is the total number of occurrences.
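The distinction between types and tokens can be illustrated with a small sketch. The four occurrences below are invented for illustration and are not taken from the corpus; the tuple layout (word form, lemma, part of speech) is an assumption about how one might represent an entry.

```python
from collections import Counter

# Hypothetical occurrences, each a (word form, lemma, part of speech) tuple.
occurrences = [
    ("string", "string", "noun"),
    ("string", "string", "verb"),   # same form, different part of speech
    ("String", "string", "noun"),   # differs only in case
    ("string", "string", "noun"),   # repeat of the first type
]

# Counting tuples keeps case and part of speech distinct, as in the lexicon.
type_counts = Counter(occurrences)

n_types = len(type_counts)            # number of unique combinations
n_tokens = sum(type_counts.values())  # total number of occurrences

print(f"{n_types} types, {n_tokens} tokens")
```

Here the noun string is counted twice as a token but only once as a type, so the sketch reports three types and four tokens.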
File | Types | Tokens | Explanation |
lexicon1.txt | 3,845,588 | 16,202,034,905 | Words that occur 15 times or more in the corpus (excluding punctuation: 3,757,760 types, 13,968,339,638 tokens) |
lexicon2.txt | 19,705,013 | 78,816,683 | Words that occur between 2 and 14 times |
lexicon3.txt | 396,593 | 396,593 | Words that occur just once. |
lexicon4.txt | 23,580,110 | 23,580,110 | Text delimiters, which are found at the beginning of each text. These have the format "@@{digits}", where the digits refer to the textID for one of the 22 million texts (web pages) in the corpus. |
lexicon5.txt | 1,046,934 | 35,132,660 | "Placeholders" for duplicate strings, such as "this material has been copyrighted", etc. These have the form "@qwx{digits}", where the digits refer to the entries in this file. |
lexicon6.txt | 17,217,679 | 18,757,272 | Miscellaneous entries starting with "@", which are neither text delimiters, nor do they refer to duplicate strings. In nearly all cases, they simply refer to the line after the text delimiter, and they refer to the textID for the text at an earlier stage of processing. They can be safely ignored. |
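As a rough check on how much of the corpus each file covers, the token counts in the table can be combined directly. The sketch below assumes that "word" tokens means the tokens in lexicon1.txt through lexicon3.txt only (so that delimiters, placeholders, and miscellaneous "@" entries are left out of the denominator); that reading is an assumption, not something the table states.

```python
# Token counts copied from the table above.
tokens = {
    "lexicon1.txt": 16_202_034_905,  # words occurring 15+ times
    "lexicon2.txt": 78_816_683,      # words occurring 2-14 times
    "lexicon3.txt": 396_593,         # words occurring once
}

# Assumption: "word" tokens are the tokens in lexicon1-3 only.
word_tokens = sum(tokens.values())

# Share of word tokens covered by the first two files.
first_two = tokens["lexicon1.txt"] + tokens["lexicon2.txt"]
coverage = first_two / word_tokens

print(f"word tokens: {word_tokens:,}")
print(f"coverage of lexicon1 + lexicon2: {coverage:.3%}")
```

Under that assumption, the only word tokens missing from the first two files are the 396,593 single-occurrence words, which is why the coverage is so close to 100%.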
The first two files alone account for 99.997% of all "word"
( = non-delimiter) tokens in the corpus. The table below illustrates
entry types #4-6 (the numbers correspond to lexicon4.txt through
lexicon6.txt above), using the beginning of a sample text. #4 is the
text delimiter; #6, which can be safely ignored, is the text delimiter
from a previous version of the text; and #5 refers to the duplicate
phrase "If this is your first visit, be sure to check out the FAQ
by clicking the link above", which occurs 565 times in website
94099. (Note that the preceding line, marked (5a), has a similar
"word" (203275 ~= 403275) and also refers to this repeated phrase.)
websiteID | textID | offset ID | | word |
94099 | 34653740 | 9069765341 | 4 | @@34653741 |
94099 | 34653740 | 9069765342 | 6 | @3653741/ |
94099 | 34653740 | 9069765343 | (5a) | 203275 |
94099 | 34653740 | 9069765344 | 5 | @qwx403275 |
94099 | 34653740 | 9069765345 | | <h> |
94099 | 34653741 | 9069765346 | | Hood |
94099 | 34653741 | 9069765347 | | has |
94099 | 34653741 | 9069765348 | | tons |
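The entry formats described for lexicon4.txt through lexicon6.txt suggest a simple way to tell the entry kinds apart by prefix. The following is a minimal sketch, not part of the corpus tooling: the function name `classify_entry` and the returned labels are hypothetical, and the patterns are assumed from the "@@{digits}" and "@qwx{digits}" formats given above.

```python
import re

# Patterns assumed from the formats described for lexicon4.txt and lexicon5.txt.
TEXT_DELIMITER = re.compile(r"@@\d+")
DUPLICATE_PLACEHOLDER = re.compile(r"@qwx\d+")


def classify_entry(word: str) -> str:
    """Return a hypothetical label for the kind of entry a "word" is."""
    if TEXT_DELIMITER.fullmatch(word):
        return "text_delimiter"          # lexicon4.txt
    if DUPLICATE_PLACEHOLDER.fullmatch(word):
        return "duplicate_placeholder"   # lexicon5.txt
    if word.startswith("@"):
        return "miscellaneous"           # lexicon6.txt, can be ignored
    return "word"                        # lexicon1.txt to lexicon3.txt


# The "word" column of the sample rows above.
sample = ["@@34653741", "@3653741/", "203275", "@qwx403275", "<h>", "Hood"]
for w in sample:
    print(f"{w:12} {classify_entry(w)}")
```

A prefix check of this kind cannot recognize a row like (5a), whose "word" 203275 looks like an ordinary number even though it relates to the repeated phrase, so such rows would need separate handling.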
|