The NOW Corpus (as of late 2024) contains about 20.6 billion words of data from 2010 to the present, in about 34.4 million texts. It is currently growing by about 2.5-3.0 billion words per year, in about 5.5 million new texts. With so much data, there are bound to be some issues with the texts. Before you order the data, please be aware of the following.
Textual data: first few words in a text (2010-2016)
1. In some of the 3.4 million texts from 2010-2015, anywhere from three to five words (usually five) are missing at the beginning of the text:
11-11-02
Original text: SmartCo, a consortium
of electricity lines companies, has confirmed plans
to install 500,000 "smart" electricity meters "
In full-text data:
of electricity lines companies , has confirmed plans to
install 500,000 " smart " electricity meters "
2. Less commonly in these texts from 2010-2015, the 3-5 missing words at
the beginning of the text are replaced with @ @ @ @ @:
12-08-13
Original text: The Future of Apps for Young Children: Beyond ABC & 123 <p> Apps on multi-touch devices like . . .
In full-text data: @ @ @ @ @ @ Young Children : Beyond ABC & 123 <p> Apps on multi-touch devices like . . .
3. In some texts from 2016, either the fourth or the fifth word is missing, and that single word is replaced by "...":
16-04-12
Original text: ASOS on track to hit year targets as first-half profit rises
In full-text data: ASOS on track ... hit year targets as first-half profit rises
In the 18 billion words of data from 2017 to the present, there should be few if any of these missing words among the first five words of a text. Please do remember, however, that 10 words out of every 200 words are purposely removed from each text, so that we don't violate US Copyright Law.
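If you plan to work with the full-text data programmatically, it can be useful to flag texts affected by these gaps. Below is a minimal sketch (not part of the corpus tooling), assuming that the missing or removed words show up as runs of "@" tokens or as a literal "..." near the start of the text, as in the examples above; please check the samples to confirm exactly how these markers appear in the data you receive.

    import re

    # Runs of "@" tokens, e.g. "@ @ @ @ @" (assumed marker for missing/removed words)
    AT_RUN = re.compile(r"@( @)+")

    def flag_gaps(text):
        """Return a list of possible gap types found in one full-text entry."""
        reasons = []
        tokens = text.split()
        if AT_RUN.match(text):
            reasons.append("starts with @ markers (missing leading words, 2010-2015)")
        if "..." in tokens[3:5]:
            reasons.append("4th or 5th word replaced by '...' (2016)")
        if AT_RUN.search(text):
            reasons.append("contains a run of @ tokens (missing or removed words)")
        return reasons

    # Example from above: the leading words were replaced by "@" markers
    print(flag_gaps("@ @ @ @ @ @ Young Children : Beyond ABC & 123"))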
Textual data: duplicate texts and duplicate text
1. There will be some duplicate texts in the full-text data -- for example, an article that appeared in one web domain and was then republished in another web domain, or (more commonly) a text that appears multiple times in the same web domain. We have used advanced algorithms employing 11-grams (eleven-word sequences) to try to find and eliminate these, but undoubtedly some duplicate texts still remain (see the sketch after this list).
2. A given website (web domain) might have "boilerplate" text
that appears on every page. (For example, "this text cannot be
duplicated or copied without the express written consent of ..." or
"to subscribe to the Daily Journal, please send us...".) As we are
processing the texts, we use
JusText to try to find boilerplate
paragraphs, but some will still appear in the full-text data.
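To give a rough idea of how 11-gram-based duplicate detection works (this is only an illustrative sketch, not the corpus's actual pipeline), the following compares tokenized texts by the overlap of their eleven-word sequences and flags pairs above an arbitrary threshold:

    def shingles(tokens, n=11):
        """Return the set of n-grams (as tuples) in a list of tokens."""
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def jaccard(a, b):
        """Jaccard overlap between two shingle sets (0.0 to 1.0)."""
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)

    def near_duplicates(texts, threshold=0.5):
        """Yield pairs of text IDs whose 11-gram overlap exceeds the threshold."""
        sets = {doc_id: shingles(toks) for doc_id, toks in texts.items()}
        ids = sorted(sets)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                if jaccard(sets[a], sets[b]) > threshold:
                    yield a, b

Similarly, the JusText library can be used to drop boilerplate paragraphs from an HTML page, along these lines (the URL below is only a placeholder):

    import requests
    import justext

    response = requests.get("https://example.com/some-article")
    paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
    content = [p.text for p in paragraphs if not p.is_boilerplate]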
Metadata (country)
1. The most common error is the presumed country of origin for the text. In the data from 2010-2019, we simply used whatever country Google assigned when we did the Google searches (see the related information from the GloWbE corpus). From 2020 to the present, we have tried to determine the country of origin ourselves. If the URL ends in a country code (for example, .ph for the Philippines or .au for Australia), then that is easy (see the sketch after the list below). But with .com URLs, for example, it is much more complicated. In these cases, we did the following to try to determine the country:
1. When the web domain was the same as a web domain from the 2010-2019 texts (whose country, again, had been assigned by Google), we simply used that information.
2. For texts from 2020 to the present time where we didn't know
the country, we used ChatGPT and Google Gemini to guess the country,
at least for the top 1,500-2,000 web domains (by number of words in
the corpus). When both sources agreed, we used the country that they
suggested. When they disagreed, we manually went to the website to
try to determine the country.
3. When it was still not clear, or when it was a website that
didn't really have an assignable country (like
phys.org or
FXStreet), or when the
country was not from one of the 20 main English-speaking countries
in the corpus (like politico.eu),
then we assigned "??" as the country.
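For the easy case described above (URLs that end in a country-code top-level domain), the logic is essentially a lookup table with "??" as the fallback. A minimal sketch, using illustrative country labels and only a handful of the corpus's 20 countries:

    from urllib.parse import urlparse

    # Partial, illustrative mapping of country-code TLDs to countries.
    CC_TLDS = {
        "au": "Australia",
        "ca": "Canada",
        "ie": "Ireland",
        "nz": "New Zealand",
        "ph": "Philippines",
        "uk": "Great Britain",
        "za": "South Africa",
    }

    def country_from_url(url):
        """Guess the country from the URL's top-level domain; "??" if unknown."""
        host = urlparse(url).hostname or ""
        tld = host.rsplit(".", 1)[-1].lower()
        return CC_TLDS.get(tld, "??")

    print(country_from_url("https://www.example.com.au/story"))  # Australia
    print(country_from_url("https://phys.org/news/"))            # ?? (no clear country)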
At any rate, you will definitely find domains that are not
assigned to the correct country. But we have tried to minimize this
as much as is reasonable. And the date for the text should be
correct in nearly all cases.
Bottom line
We have tried to be as transparent as possible about these textual issues, as well as about possible issues with the metadata. If either of these is a concern for you, please look carefully at the samples before you order the data. There are no refunds after you download (even part of) the data, unless the full dataset is different from the randomly selected samples (which it isn't).
And in terms of the metadata, please remember that even if the metadata is 99.9% correct (an enviable goal), that would still leave roughly 0.1% of 34.4 million texts -- about 34,000 texts -- that have issues or that are categorized incorrectly. If you need the corpus data to be 100%
correct, then you might want to look at something like the
British
National Corpus, which has been carefully corrected for pretty much
each and every one of the ~4,000 texts. The downside, of course, is
that you will have much less data than with NOW -- about 1/200th the
size (100 million vs 20 billion words) and many fewer texts (about
4,000 vs 34 million texts). So for a word or phrase or syntactic
construction that might appear 800 times in NOW, you might be lucky
to have just four tokens in the BNC. In addition, in the BNC you would
have data from just one country (the UK) and data that is all more
than 30 years old. So just choose whichever corpus
works best for your research needs.