The iWeb corpus contains almost 14 billion words from more than 22
million web pages in nearly 95,000 websites. With a corpus of this
size, there are bound to be some limitations and errors. The
following are some of the known issues:
1. Duplicate texts / material.
Hundreds of thousands of duplicate texts have been removed from the
corpus and are not part of the full-text data. These were identified
by looking for duplicate 11-grams (11 word sequences) across the 22
million texts. But the more difficult issue has been removing
duplicate phrases, such as the phrase "If this is your first
visit, be sure to check out the FAQ by clicking the link above",
which occurs 565 times in a single website. The
problem is where to draw the line. Phrases that occur 565 times
probably ought to be removed (or indicated in some way), but what
about phrases that occur just 5 or 10 times? The vast majority of
these phrases were removed (and replaced with a placeholder;
see list) as the texts were being processed.
2. "Choppy texts". As we were
removing other duplicate phrases (such as "Copyright 2017 by
Company XYZ . . . (word1) . . . (word n) Please contact us . . ."),
this sometimes produced "choppy texts" where the
duplicate phrases had been. These are not nearly as common as the
duplicate phrases discussed in #1, but these fragments do exist in
the full-text data.
3. Extended ASCII characters and
apostrophes. There was a wide range of text encodings across the
22 million web pages. We tried to retain all of the characters as we
were processing the text, but in some cases extended ASCII
characters have been lost. In addition, some "curly quote"
apostrophes have been lost (resulting in forms like cant instead of
can't). These are a small percentage of all of the apostrophes in
the original text.
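The 11-gram comparison described in #1 can be sketched roughly as follows. This is only an illustration and not the procedure actually used to build the corpus: the function names, the lowercase whitespace tokenization, and the in-memory dictionary are all assumptions, and a collection of 22 million texts would need a hashed or disk-backed index instead.

```python
from collections import defaultdict


def ngrams(tokens, n=11):
    """Yield every run of n consecutive tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def find_shared_ngrams(texts, n=11):
    """Return {n-gram: set of text ids} for n-grams found in 2+ texts.

    `texts` is a dict of {text_id: string}. Tokenization is a plain
    lowercase whitespace split (an assumption for illustration only).
    """
    seen = defaultdict(set)
    for text_id, text in texts.items():
        tokens = text.lower().split()
        for gram in ngrams(tokens, n):
            seen[gram].add(text_id)
    return {gram: ids for gram, ids in seen.items() if len(ids) > 1}


def duplicate_pairs(texts, n=11):
    """Return the set of (id_a, id_b) pairs that share at least one n-gram."""
    pairs = set()
    for ids in find_shared_ngrams(texts, n).values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs
```

Pairs returned by `duplicate_pairs` are only candidates. Whether a shared 11-gram marks a duplicate text or just a repeated boilerplate phrase is the "where to draw the line" question raised in #1.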
Again, the full-text data is very
clean, but there are issues of the types discussed above. If you are
developing applications that need an extremely low level of errors
(e.g. just one "error" every 100,000 words or so), please take a
look at the samples before purchasing the data, in order to
make sure that it meets your needs.
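When checking a sample against an error budget like the one above, one rough approach is to count suspicious forms per 100,000 words. The sketch below looks only for contractions that have lost their apostrophe (as in #3). The word list and the function name are hypothetical, and since a form like cant is also a legitimate English word, the result is an upper-bound estimate rather than an exact error count.

```python
import re

# Hypothetical list of contractions written without their apostrophe.
SUSPECT_FORMS = {
    "cant", "dont", "doesnt", "didnt", "isnt",
    "wasnt", "wouldnt", "couldnt", "shouldnt", "havent",
}


def suspect_rate_per_100k(text, suspects=SUSPECT_FORMS):
    """Return the number of suspect forms per 100,000 words in `text`."""
    # Keep straight and curly apostrophes inside words so that
    # intact contractions such as "can't" are not split apart.
    words = re.findall(r"[A-Za-z'\u2019]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in suspects)
    return hits * 100_000 / len(words)
```

A result well above your tolerance on the sample files suggests that the full-text data would need additional cleanup for your application.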