Full-text corpus data

The NOW Corpus currently (as of late 2024) contains about 20.6 billion words of data from 2010 to the present, in about 34.4 million texts. It is growing by about 2.5-3.0 billion words per year, in about 5.5 million new texts. With so much data, there are bound to be some issues with the texts. Before you order the data, please be aware of the following.


Textual data: first few words in a text (2010-2016)

1. In some of the 3.4 million texts from 2010-2015, there are anywhere from three to five words (usually five words) missing at the beginning of the text:

11-11-02
Original text: SmartCo, a consortium of electricity lines companies, has confirmed plans to install 500,000 "smart" electricity meters "
In full-text data:    of electricity lines companies , has confirmed plans to install 500,000 " smart " electricity meters "

2. Less commonly in these texts from 2010-2015, the 3-5 missing words at the beginning of the text are replaced with a sequence of @ symbols (@ @ @ @ @):

12-08-13
Original text: The Future of Apps for Young Children: Beyond ABC & 123 <p> Apps on multi-touch devices like . . .
In full-text data: @ @ @ @ @ @ Young Children : Beyond ABC &amp; 123 <p> Apps on multi-touch devices like . . .

3. In some texts from 2016, either the fourth or the fifth word is missing, and that single word is replaced by "...":

16-04-12
Original text: ASOS on track to hit year targets as first-half profit rises
In full-text data: ASOS on track ... hit year targets as first-half profit rises

In the 18 billion words of data from 2017 to the present, there should be few if any of these missing words among the first five words of the text. Please remember, however, that some words are purposely removed (10 words in every 200), so that we don't violate US copyright law.
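
If you will process the full-text data programmatically, here is a minimal sketch of how runs of "@" placeholder tokens like those shown above could be detected and stripped. It assumes the lines are whitespace-tokenized, as in the examples; the function name and what you then do with affected texts are entirely up to you.

    def strip_placeholders(line):
        """Drop standalone "@" tokens used as placeholders for missing words.

        Assumes a whitespace-tokenized line of full-text data, as in the
        examples above. Returns the remaining tokens and the number of
        placeholders removed, so you can decide whether to keep the text.
        """
        tokens = line.split()
        kept = [t for t in tokens if t != "@"]
        return kept, len(tokens) - len(kept)

    # Example (from the text above):
    # strip_placeholders("@ @ @ @ @ @ Young Children : Beyond ABC &amp; 123")
    # -> (['Young', 'Children', ':', 'Beyond', 'ABC', '&amp;', '123'], 6)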


Textual data: duplicate texts and boilerplate text

1. There will be some duplicate texts in the full-text data -- for example, an article that appeared in one web domain and was then republished in another, or, more commonly, a text that appears multiple times in the same web domain. We have used algorithms based on 11-grams (eleven-word sequences) to try to find and eliminate these, but some duplicate texts undoubtedly still remain.
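
To illustrate the idea, here is a minimal sketch of how 11-gram overlap can be used to flag candidate duplicates. The function names and the min_shared threshold are illustrative only; this is not the corpus's actual deduplication pipeline.

    from collections import defaultdict

    def ngrams(tokens, n=11):
        """Yield each 11-word sequence (11-gram) in a list of tokens."""
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

    def duplicate_candidates(texts, min_shared=5):
        """Return pairs of text IDs that share at least min_shared 11-grams.

        texts maps a text ID to its list of tokens. The threshold is
        illustrative; the corpus's actual criteria are not published here.
        """
        index = defaultdict(set)          # 11-gram -> IDs of texts containing it
        for tid, tokens in texts.items():
            for gram in set(ngrams(tokens)):
                index[gram].add(tid)

        shared = defaultdict(int)         # (id1, id2) -> number of shared 11-grams
        for ids in index.values():
            ids = sorted(ids)
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    shared[(ids[i], ids[j])] += 1

        return [pair for pair, count in shared.items() if count >= min_shared]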

2. A given website (web domain) might have "boilerplate" text that appears on every page (for example, "this text cannot be duplicated or copied without the express written consent of ..." or "to subscribe to the Daily Journal, please send us..."). As we process the texts, we use JusText to try to find and remove these boilerplate paragraphs, but some will still appear in the full-text data.
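
For reference, this is roughly how the JusText library (the justext package in Python) is typically used to drop boilerplate paragraphs; the corpus's actual settings are not documented here, so treat this only as a sketch.

    import justext  # pip install justext

    def extract_main_text(html):
        """Keep only the paragraphs that jusText does not classify as boilerplate."""
        paragraphs = justext.justext(html, justext.get_stoplist("English"))
        return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)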


Metadata (country)

1. The most common error is the presumed country of origin for the text. In the data from 2010-2019, we simply used whatever country was assigned by Google when we did the Google searches (see the related information for the GloWbE corpus). From 2020 to the present, we have tried to determine the country of origin ourselves. If the URL ends in a country code (for example, .ph for the Philippines or .au for Australia), then that is easy (a sketch of this step appears after the list below). But with .com URLs, for example, it is much more complicated. In these cases, we did the following to try to determine the country:

1. When the web domain was the same as a web domain from the 2010-2019 texts (which, again, had been assigned a country by Google), we simply reused that country.

2. For texts from 2020 to the present time where we didn't know the country, we used ChatGPT and Google Gemini to guess the country, at least for the top 1,500-2,000 web domains (by number of words in the corpus). When both sources agreed, we used the country that they suggested. When they disagreed, we manually went to the website to try to determine the country.

3. When it was still not clear, or when the website didn't really have an assignable country (like phys.org or FXStreet), or when the country was not one of the 20 main English-speaking countries in the corpus (like politico.eu), then we assigned "??" as the country.
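
As an illustration of the country-code step mentioned above, the following sketch guesses a country from a URL's top-level domain. The mapping table is just a stub and the names are illustrative; domains without a usable country code (.com, .org, and so on) would fall through to the steps listed above.

    from urllib.parse import urlparse

    # Stub mapping; a real table would cover every country code of interest.
    TLD_TO_COUNTRY = {
        "ph": "Philippines",
        "au": "Australia",
    }

    def country_from_url(url):
        """Return a country guess based on the URL's country-code TLD, or None."""
        host = urlparse(url).hostname or ""
        tld = host.rsplit(".", 1)[-1].lower()
        return TLD_TO_COUNTRY.get(tld)   # None for .com, .org, etc.

    # Example:
    # country_from_url("https://www.smh.com.au/world")  -> "Australia"
    # country_from_url("https://www.politico.com/news") -> None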

At any rate, you will definitely find domains that are not assigned to the correct country. But we have tried to minimize this as much as is reasonable. And the date for the text should be correct in nearly all cases.


Bottom line

We have tried to be as transparent as possible about textual issues, as well as some possible issues with the metadata. If you are concerned about either of these issues, please look carefully at the samples before you order the data. There are no refunds after you download (even part of) the data, unless the full dataset is different from the randomly-selected samples (which it isn't).

And in terms of the metadata, please remember that even if the metadata is 99.9% correct (an enviable goal), that would still result in about 34,000 texts that have issues, or that are categorized incorrectly. If you need the corpus data to be 100% correct, then you might want to look at something like the British National Corpus, which has been carefully corrected for pretty much each and every one of the ~4,000 texts. The downside, of course, is that you will have much less data than with NOW -- about 1/200th the size (100 million vs 20 billion words) and many fewer texts (about 4,000 vs 34 million texts). So for a word or phrase or syntactic construction that might appear 800 times in NOW, you might be lucky to have just four tokens in the BNC. In addition, in the BNC you would have data from just one country (the UK) and data that is all more than 30 years old. So just choose whichever corpus works best for your research needs.