You can purchase information on all 22 million web pages in the 14 billion word iWeb corpus. As far as we are aware, nothing like this is available from any
other large corpus.
The following shows the information given for each URL. The columns are
explained in the downloadable sample.
websiteID | textID | # words | URL | title
194853 | 99057612 | 981 | https://centsationalgirl.com/2013/01/painted-mullions-trim/ | Painted Mullions + Trim
194853 | 99057615 | 269 | https://centsationalgirl.com/2016/07/new-summer-fabrics/ | New Summer Fabrics
194853 | 99057616 | 1072 | https://centsationalgirl.com/2014/04/day-trip-st-helena/ | Day Trip: St. Helena
194853 | 99057619 | 327 | https://centsationalgirl.com/2017/01/modern-mountain-homes/ | Modern Mountain Homes
194853 | 99057621 | 345 | https://centsationalgirl.com/2016/05/cornerstone-gardens/ | Cornerstone Gardens
194853 | 99057622 | 268 | https://centsationalgirl.com/2017/04/weekend-reading-114/ | Weekend Reading
194853 | 99057623 | 1598 | https://centsationalgirl.com/2012/03/a-gray-striped-dresser/ | A Gray Striped Dresser
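
For readers who want to work with this kind of data programmatically, the following is a minimal Python sketch of how one record per page might be read and summarized. It assumes each record sits on a single line with five pipe-separated fields in the order websiteID, textID, # words, URL, title. That layout is an assumption made for illustration; the actual file format is the one described in the downloadable sample. The names `PageRecord`, `parse_record`, and `words_per_website` are hypothetical helpers, not part of any iWeb tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class PageRecord(NamedTuple):
    """One web page entry (field names are illustrative, not official)."""
    website_id: int
    text_id: int
    n_words: int
    url: str
    title: str


def parse_record(line: str) -> PageRecord:
    """Parse one line of the ASSUMED layout: five fields separated by '|'."""
    # Split on at most four separators so a '|' inside the title is kept.
    parts = [part.strip() for part in line.split("|", 4)]
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    website_id, text_id, n_words, url, title = parts
    return PageRecord(int(website_id), int(text_id), int(n_words), url, title)


def words_per_website(records: Iterable[PageRecord]) -> Dict[int, int]:
    """Sum the word counts of all pages belonging to each websiteID."""
    totals: Dict[int, int] = defaultdict(int)
    for record in records:
        totals[record.website_id] += record.n_words
    return dict(totals)


# Two rows taken from the sample shown above.
sample_lines = [
    "194853 | 99057612 | 981 | https://centsationalgirl.com/2013/01/painted-mullions-trim/ | Painted Mullions + Trim",
    "194853 | 99057615 | 269 | https://centsationalgirl.com/2016/07/new-summer-fabrics/ | New Summer Fabrics",
]
records = [parse_record(line) for line in sample_lines]
totals = words_per_website(records)
```

Because websiteID is shared by every page from the same site, grouping on it gives per-site totals, while textID identifies an individual page.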
Some might wonder why the 22 million URLs need to be purchased, rather than
being freely available. In fact, getting the URLs was one of the most difficult
parts of creating iWeb. We had to run about 200,000 queries against Google
to get "random" URLs for each of the 200,000 websites that we started with. Because Google will block you
if you run more than a certain number of queries per day, we had to purchase five
new machines (a total of about $6,000) and collect the 22 million URLs very slowly,
over a period of about 2-3 months. This made the process both expensive (buying
the new machines) and extremely time-consuming (solving multiple CAPTCHAs
every week, and even then we still ended up getting blocked).