Full-text corpus data


You can purchase information on all 22 million web pages in the 14-billion-word iWeb corpus. As far as we are aware, nothing like this is available from any other large corpus.

The following shows the information given for each URL; an explanation of the columns is included in the downloadable sample.

websiteID  textID    # words  URL                                                            title
194853     99057612  981      https://centsationalgirl.com/2013/01/painted-mullions-trim/    Painted Mullions + Trim
194853     99057615  269      https://centsationalgirl.com/2016/07/new-summer-fabrics/       New Summer Fabrics
194853     99057616  1072     https://centsationalgirl.com/2014/04/day-trip-st-helena/       Day Trip: St. Helena
194853     99057619  327      https://centsationalgirl.com/2017/01/modern-mountain-homes/    Modern Mountain Homes
194853     99057621  345      https://centsationalgirl.com/2016/05/cornerstone-gardens/      Cornerstone Gardens
194853     99057622  268      https://centsationalgirl.com/2017/04/weekend-reading-114/      Weekend Reading
194853     99057623  1598     https://centsationalgirl.com/2012/03/a-gray-striped-dresser/   A Gray Striped Dresser
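
As a rough illustration of how the listing could be worked with, the sketch below loads a file with these five columns and groups the pages by website. It assumes a tab-separated file named iweb_urls.tsv; the actual file name and delimiter in the purchased data may differ.

import csv
from collections import defaultdict

# Hypothetical file name; the purchased listing may use a different name/format.
LISTING_FILE = "iweb_urls.tsv"

def load_url_listing(path):
    """Read a tab-separated listing with the columns:
    websiteID, textID, # words, URL, title."""
    records = []
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        for website_id, text_id, n_words, url, title in reader:
            records.append({
                "websiteID": int(website_id),
                "textID": int(text_id),
                "words": int(n_words),
                "url": url,
                "title": title,
            })
    return records

if __name__ == "__main__":
    pages = load_url_listing(LISTING_FILE)
    by_site = defaultdict(list)
    for page in pages:
        by_site[page["websiteID"]].append(page)
    print(f"{len(pages)} pages across {len(by_site)} websites")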

Some might wonder why the 22 million URLs need to be purchased rather than being freely available. In fact, getting the URLs was one of the most difficult parts of creating iWeb. We had to run about 200,000 queries against Google to get "random" URLs for each of the 200,000 websites we started with. Because Google will block you if you exceed a certain number of queries per day, we had to purchase five new machines (about $6,000 in total) and collect the 22 million URLs very slowly over a period of about 2-3 months. All of this made the process expensive (buying the new machines) as well as extremely time-consuming (solving multiple Captchas every week, and we still ended up getting blocked at times).