Full-text corpus data


The iWeb corpus contains almost 14 billion words from more than 22 million web pages on nearly 95,000 websites. With a corpus of this size, there are bound to be some limitations and errors. The following are some of the known issues:

1. Duplicate texts / material. Hundreds of thousands of duplicate texts have been removed from the corpus and are not part of the full-text data. These were identified by looking for duplicate 11-grams (11-word sequences) across the 22 million texts (see the first sketch after this list). The more difficult issue has been removing duplicate phrases, such as the phrase "If this is your first visit, be sure to check out the FAQ by clicking the link above", which occurs 565 times in a single website. The problem is where to draw the line: phrases that occur 565 times probably ought to be removed (or indicated in some way), but what about phrases that occur just 5 or 10 times? The vast majority of these phrases were removed (and replaced with a placeholder; see list) as the texts were being processed.

2. "Choppy texts". As we were removing other duplicate phrases (such as "Copyright 2017 by Company XYZ . . . (word1) . . . .(word n)  Please contact us. . . , this sometimes produced "choppy texts"  where the duplicate phrases had been. These are not nearly as common as the duplicate phrases discussed in #1, but these fragments do exist in the full-text data.

3. Extended ASCII characters and apostrophes. There was a wide range of text encodings across the 22 million web pages. We tried to retain all of the characters as we were processing the text, but in some cases extended ASCII characters have been lost. In addition, some "curly quote" apostrophes have been lost, resulting in forms like cant instead of can't (the third sketch after this list flags such forms). Only a small percentage of the apostrophes in the original text are affected.
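
To make the 11-gram approach from #1 concrete, the following is a minimal sketch of how duplicate texts can be flagged. It is an illustration only, with assumed names (ngrams, find_shared_ngrams), not the pipeline actually used to build the corpus; at the scale of 22 million texts the 11-grams would be hashed rather than kept in memory.

    from collections import defaultdict

    NGRAM_SIZE = 11  # the "11-grams" described in #1

    def ngrams(words, n=NGRAM_SIZE):
        # Yield every n-word sequence in a tokenized text.
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

    def find_shared_ngrams(texts):
        # texts: dict mapping text_id -> list of words.
        # Returns the 11-grams that appear in more than one text,
        # along with the IDs of the texts that share them.
        seen = defaultdict(set)
        for text_id, words in texts.items():
            for gram in ngrams(words):
                seen[gram].add(text_id)
        return {gram: ids for gram, ids in seen.items() if len(ids) > 1}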
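The second sketch shows how replacing a known duplicate phrase with a placeholder leaves a gap in the surrounding text, which is the source of the "choppy texts" described in #2. The placeholder token here is hypothetical; the corpus uses its own marker (see the list mentioned in #1).

    PLACEHOLDER = "@@REMOVED@@"  # hypothetical marker, not the corpus's actual one

    def remove_phrase(text, phrase):
        # Replace every occurrence of a duplicate phrase with the placeholder.
        return text.replace(phrase, PLACEHOLDER)

    boiler = ("If this is your first visit, be sure to check out "
              "the FAQ by clicking the link above")
    text = "Welcome. " + boiler + ". Our hours are 9-5."
    print(remove_phrase(text, boiler))
    # -> "Welcome. @@REMOVED@@. Our hours are 9-5."
    # The text now reads "choppy" where the phrase used to be.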
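Finally, the third sketch shows the kind of check a consumer of the data could run to flag tokens that may have lost an apostrophe, as described in #3. The word list is illustrative, not exhaustive, and forms like cant and wont are also legitimate English words, so flagged tokens still need to be checked in context.

    # Tokens that commonly result from a lost "curly quote" apostrophe,
    # mapped to the form that was probably intended.
    SUSPECT = {"cant": "can't", "dont": "don't", "wont": "won't",
               "isnt": "isn't", "didnt": "didn't", "youre": "you're"}

    def flag_lost_apostrophes(words):
        # Return (index, token, likely intended form) for suspicious tokens.
        return [(i, w, SUSPECT[w.lower()])
                for i, w in enumerate(words)
                if w.lower() in SUSPECT]

    print(flag_lost_apostrophes("I cant see why you wont try".split()))
    # -> [(1, 'cant', "can't"), (5, 'wont', "won't")]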

Again, the full-text data is very clean, but there are issues of the types discussed above. If you are developing applications that require an extremely low error rate (e.g. no more than one "error" every 100,000 words or so), please look at the samples before purchasing the data, in order to make sure that it meets your needs.