The Wikipedia corpus contains about 2 billion words
of text from a 2014 dump of Wikipedia (about 4.4
million pages). As far as we are aware, our Wikipedia
full-text data is the only version available from a
recent copy of Wikipedia. Previous versions from other
sites are from
2008, when Wikipedia was only a small fraction of
its current size.
Of course you can download
your own copy of Wikipedia
for "free". But then you would have to spend many
hours extracting the raw text (eliminating "info
tables", headers, footers, etc). You would then have to
tag and lemmatize the entire two billion words of text (see
sample). If you're lucky, you might be able to do
all of this in about 40-50 hours. Compared to the
$245 cost (for an individual
academic license), that means that you'd be paying
yourself about $5-6 USD per hour. Or you can just get
the entire cleaned, tagged, and lemmatized copy from us.
If you know SQL, you might
be interested in the
database format of the corpus. Using the "sources"
table for the 4.4 million texts (see
sample), you could search for all pages with a given
word or phrase in the title, and then JOIN to the
actual corpus texts to build a "customized corpus" "on
the fly" for any topic -- biology, computer science,
linguistics, cricket, Harry Potter -- whatever. Or you
could just use a simple Python script to place the
texts for a given topic into a folder (again,
using the "title" information in the sources table), and
then search those texts as a standalone corpus.
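The sources-table workflow above could be sketched in Python with the standard-library sqlite3 module. The table and column names here ("sources", "texts", "textID", "title") are illustrative assumptions, not the corpus's actual schema, and the tiny in-memory database stands in for the real 4.4-million-row tables:

```python
import sqlite3

# Build a tiny in-memory stand-in for the corpus database.
# Schema names are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sources (textID INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE texts   (textID INTEGER, words TEXT);
INSERT INTO sources VALUES
    (1, 'Cricket World Cup'),
    (2, 'Quantum biology'),
    (3, 'History of cricket');
INSERT INTO texts VALUES
    (1, 'the first cricket world cup ...'),
    (2, 'quantum effects in biology ...'),
    (3, 'cricket originated in england ...');
""")

# Find all pages whose title contains a keyword, then JOIN to the
# full text to assemble a topic-specific corpus "on the fly".
# (SQLite's LIKE is case-insensitive for ASCII by default.)
rows = conn.execute("""
    SELECT s.title, t.words
    FROM sources AS s
    JOIN texts   AS t ON t.textID = s.textID
    WHERE s.title LIKE '%cricket%'
""").fetchall()

for title, words in rows:
    print(title)
```

The same WHERE clause could filter on any topic word, and the result set could either be searched directly or written out to per-text files for use as a standalone corpus.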
With the online interface, you can also quickly and easily
create "virtual corpora" like this, and then search and
compare among them. But in that case, you'd be limited
in terms of how many searches you can do each day, which
types of searches are available via that interface, and
so on. With the offline, full-text version of the
corpus, you can have that same power (and potentially
much more) on your own machine.