Full-text corpus data

You can purchase information on all 22 million web pages in the 14-billion-word iWeb corpus. As far as we are aware, nothing like this is available from any other large corpus.

The rows below show the information that is given for each URL; an explanation of the columns is included in the downloadable sample.

websiteID	textID	# words	URL	title
194853	99057612	981	https://centsationalgirl.com/2013/01/painted-mullions-trim/	Painted Mullions + Trim
194853	99057615	269	https://centsationalgirl.com/2016/07/new-summer-fabrics/	New Summer Fabrics
194853	99057616	1072	https://centsationalgirl.com/2014/04/day-trip-st-helena/	Day Trip: St. Helena
194853	99057619	327	https://centsationalgirl.com/2017/01/modern-mountain-homes/	Modern Mountain Homes
194853	99057621	345	https://centsationalgirl.com/2016/05/cornerstone-gardens/	Cornerstone Gardens
194853	99057622	268	https://centsationalgirl.com/2017/04/weekend-reading-114/	Weekend Reading
194853	99057623	1598	https://centsationalgirl.com/2012/03/a-gray-striped-dresser/	A Gray Striped Dresser
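
As a rough illustration of how rows like these might be processed, the following Python sketch reads a tab-separated listing and totals the word counts per website. It assumes the data is delivered as tab-separated text with the header row shown above; the actual file name, encoding, and layout of the purchased data may differ, and the helper names (read_url_rows, words_per_website) are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Two rows copied from the sample above. The delivery format of the
# purchased data is an assumption; adjust the delimiter and header if needed.
SAMPLE = (
    "websiteID\ttextID\t# words\tURL\ttitle\n"
    "194853\t99057612\t981\thttps://centsationalgirl.com/2013/01/painted-mullions-trim/\tPainted Mullions + Trim\n"
    "194853\t99057615\t269\thttps://centsationalgirl.com/2016/07/new-summer-fabrics/\tNew Summer Fabrics\n"
)


def read_url_rows(handle):
    """Yield one dict per row, converting the numeric columns to int."""
    # QUOTE_NONE keeps any quotation marks in titles as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {
            "websiteID": int(row["websiteID"]),
            "textID": int(row["textID"]),
            "words": int(row["# words"]),
            "URL": row["URL"],
            "title": row["title"],
        }


def words_per_website(rows):
    """Sum the word counts of all pages belonging to each websiteID."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["websiteID"]] += row["words"]
    return dict(totals)


rows = list(read_url_rows(io.StringIO(SAMPLE)))
totals = words_per_website(rows)
print(totals)
```

For a real file, the io.StringIO(SAMPLE) call would be replaced by an open file handle, for example open(path, encoding="utf-8", newline=""), where the path and encoding are again assumptions.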

Some might wonder why the 22 million URLs need to be purchased, rather than being freely available. In fact, collecting the URLs was one of the most difficult parts of creating iWeb. We had to run about 200,000 queries against Google to get "random" URLs for each of the 200,000 websites that we started with. Because Google blocks you if you run more than a certain number of queries per day, we had to purchase five new machines (about $6,000 in total) and collect the 22 million URLs very slowly, over a period of about 2-3 months. This made the process both very expensive (buying the new machines) and extremely time-consuming (solving multiple CAPTCHAs every week, and still ending up blocked).