Full-text corpus data


The Wikipedia corpus contains about 2 billion words of text from a 2014 dump of Wikipedia (about 4.4 million pages). As far as we are aware, our Wikipedia full-text data is the only version that is based on a recent copy of Wikipedia. Previous versions from other sites date from 2006 and 2008, when Wikipedia was only a small fraction of its current size.

Of course you can download your own copy of Wikipedia for "free". But then you would have to spend many hours extracting the raw text (eliminating "info tables", headers, footers, etc.). You would then have to tag and lemmatize the entire two billion words of text (see sample). If you're lucky, you might be able to do all of this in about 40-50 hours. Compared to the $245 cost of an individual academic license, that works out to paying yourself roughly $5-6 USD per hour ($245 divided by 40-50 hours). Or you can just get the entire cleaned, tagged, and lemmatized copy from us.

If you know SQL, you might be interested in the database format of the corpus. Using the "sources" table for the 4.4 million texts (see sample), you could search for all pages with a given word or phrase in the title, and then JOIN against the corpus itself to build a "customized corpus" "on the fly" for any topic -- biology, computer science, linguistics, cricket, Harry Potter -- whatever. Or you could just use a simple Python script to place the text files (see sample) for a given topic into a folder (again, using the "title" information in the sources table), and then search those texts as a standalone corpus; a sketch of that approach follows.
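
As a rough illustration of the second approach, the sketch below assumes the "sources" table has been exported as a tab-delimited file with textID and title columns, and that each page is stored as a plain-text file named after its textID. Those names and the file layout are assumptions for illustration only; adjust them to match the layout of your download.

import csv
import shutil
from pathlib import Path

# All paths and column names below are assumptions; change them to match
# the files in your copy of the corpus.
SOURCES_FILE = Path("sources.txt")      # tab-delimited export of the "sources" table
TEXT_DIR     = Path("text")             # one plain-text file per page, named <textID>.txt
OUT_DIR      = Path("corpus_cricket")   # destination folder for the topic-specific corpus
TOPIC        = "cricket"                # word or phrase to look for in the page title

OUT_DIR.mkdir(exist_ok=True)

with SOURCES_FILE.open(encoding="utf-8", newline="") as f:
    # Assumes a header row containing "textID" and "title" columns.
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        # Keep any page whose title mentions the topic (case-insensitive).
        if TOPIC.lower() in row["title"].lower():
            src = TEXT_DIR / f"{row['textID']}.txt"
            if src.exists():
                shutil.copy(src, OUT_DIR / src.name)

Once the matching files are in their own folder, you can point any concordancer or your own scripts at that folder and search it as a standalone, topic-specific corpus, without touching the full two billion words.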

With the online interface, you can also quickly and easily create "virtual corpora" like this, and then search and compare among them. But in that case, you'd be limited in terms of how many searches you can do each day, which types of searches are available via that interface, and so on. With the offline, full-text version of the corpus, you have that same power (and potentially much more) on your own machine.