Full-text corpus data


You can purchase information on all 22 million web pages in the 14-billion-word iWeb corpus. As far as we are aware, nothing like this is available from any other large corpus.

The following shows the information given for each URL; an explanation of the columns is included in the downloadable sample.

websiteID  textID     # words  URL                                                           title
194853     99057612   981      http://centsationalgirl.com/2013/01/painted-mullions-trim/    Painted Mullions + Trim
194853     99057615   269      http://centsationalgirl.com/2016/07/new-summer-fabrics/       New Summer Fabrics
194853     99057616   1072     http://centsationalgirl.com/2014/04/day-trip-st-helena/       Day Trip: St. Helena
194853     99057619   327      http://centsationalgirl.com/2017/01/modern-mountain-homes/    Modern Mountain Homes
194853     99057621   345      http://centsationalgirl.com/2016/05/cornerstone-gardens/      Cornerstone Gardens
194853     99057622   268      https://centsationalgirl.com/2017/04/weekend-reading-114/     Weekend Reading
194853     99057623   1598     http://centsationalgirl.com/2012/03/a-gray-striped-dresser/   A Gray Striped Dresser
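
Once downloaded, this metadata is straightforward to work with programmatically. Below is a minimal sketch, assuming the file is distributed as tab-separated values in the column order shown above (the actual delimiter and layout are documented in the downloadable sample); the file name iweb_urls.txt is just a placeholder.

import csv
from collections import defaultdict

COLUMNS = ["websiteID", "textID", "words", "url", "title"]

def load_url_metadata(path):
    """Read one record per line; the word count is converted to an int."""
    records = []
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row or not row[0].isdigit():   # skip a header line, if present
                continue
            rec = dict(zip(COLUMNS, row))
            rec["words"] = int(rec["words"])
            records.append(rec)
    return records

def pages_per_website(records):
    """Group records by websiteID, e.g. to see how many pages each site contributes."""
    by_site = defaultdict(list)
    for rec in records:
        by_site[rec["websiteID"]].append(rec)
    return by_site

# Example: total word count for website 194853 (the rows shown above)
# records = load_url_metadata("iweb_urls.txt")
# total = sum(r["words"] for r in pages_per_website(records)["194853"])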

Some might wonder why the 22 million URLs need to be purchased rather than being freely available. In fact, getting the URLs was one of the most difficult parts of creating iWeb. We had to run about 200,000 queries against Google to get "random" URLs for each of the 200,000 websites that we started with. Because Google will block you if you exceed a certain number of queries per day, we had to purchase five new machines (a total of about $6,000) and gather the 22 million URLs very slowly, over a period of about 2-3 months. All of this made the process expensive (buying the new machines) as well as extremely time-consuming (solving multiple Captchas every week, and we still ended up getting blocked at times).
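
For readers curious about the mechanics, the sketch below illustrates the kind of throttled collection loop this involves. It is only an illustration under assumed numbers: fetch_random_urls() is a hypothetical stand-in for the actual search queries, and the per-day limit is invented for the example.

import time
import random

SECONDS_PER_DAY = 24 * 60 * 60
MAX_QUERIES_PER_DAY = 500   # hypothetical per-machine limit, chosen to stay under the block threshold

def fetch_random_urls(website):
    # Hypothetical stand-in for one search query returning page URLs for a site.
    return []

def collect_urls(websites, max_per_day=MAX_QUERIES_PER_DAY):
    """Query one website at a time, spacing requests so a machine never
    exceeds max_per_day queries in a day."""
    delay = SECONDS_PER_DAY / max_per_day
    urls = {}
    for site in websites:
        urls[site] = fetch_random_urls(site)
        time.sleep(delay + random.uniform(0, 5))   # even spacing plus a little jitter
    return urls

# Splitting the 200,000 websites across five machines and throttling like this
# is why the collection stretched over roughly 2-3 months.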