Full-text corpus data


Some of the corpus texts are copyrighted, which could make it a problem to distribute them in "full-text" format. However, US Fair Use Law suggests that they can be used, as long as 1) their format is "transformed" from the original format (which it is), and 2) there is little or no effect on the potential market for the text. To follow US Fair Use Law, we have "transformed" our texts so that people will not use them for their originally intended purpose, which means that there should be no effect on the potential market.

The original texts were meant to be read as full texts. For example, take this 1996 article from Psychology Today. The same text is available in the corpus (see below), but notice how it has been modified: every 200 words, ten words are removed and each is replaced with "@" (see the runs of "@" in the sample below). This occurs in all three versions of the text, and it makes the text rather useless for anyone who wants to read it as a text. As a result, there is essentially no effect on the market for this text.

##2037539 For many women , happiness is n't a prince and a wedding away . So they 're just saying no to nuptials. // Every January from the time she turned 20 , Katherine Wallace* has tried on bridal gowns . " The first few times , it was because I was a bridesmaid at my sister 's or friends ' weddings and all of us got in the spirit of imagining our own special day , " says Wallace , a stockbroker in San Francisco . " Then , in my late twenties , I got engaged and went shopping for real . I found the perfect dress but we called the wedding off when my fiance and I discovered we really wanted different things . He 'd envisioned a quiet life out in the country and I 'm a city girl born and bred . After that , trying on bridal gowns became sort of an annual ritual . " But when January rolled around this year , Katherine broke with tradition . Instead , she booked a flight to Australia and spent a couple of weeks @ @ @ @ @ @ @ @ @ @ turned 45 this year and I realized that that walk down the aisle probably is n't going to happen , " Wallace says with a rich laugh . " But even more important , I realize I do n't need it to happen . I have a terrific life and I do n't have to be married to enjoy it . " Happily never married ? The words just do n't seem to belong together . They 're an oxymoron , like military music or honest politician . Never-married women are supposed to be needy neurotics frantically hunting down a spouse , lonely depressives who hole up with a clutch of cats , or , a more recent image , icy workaholics who trade the cozy warmth of husband and home for glitzy high-power careers . No matter how you look at them , they 're unloved , unwanted , unhealthy Take a closer look . After years of being dismissed and ignored , the never married are coming into the spotlight . 
And much to everyone 's surprise , psychologists are discovering that " happily never married " rings true @ @ @ @ @ @ @ @ @ @ have been staging a quiet revolution , battling social prejudice , family expectations , and their own apprehensions to set a new standard for what it means to be a successful , fulfilled , and content woman . To be sure , the majority of American adults still say " I do , " though at an increasingly later age , but the ranks of the unmarried have been growing . According to the U.S. Census Bureau , in 1984 about 3 million women age 35 and older had never married . By 1994 the figure had climbed to nearly 4.5 million . Some of these women are living with men in what amount to common-law marriages and some are gay ; exact statistics are n't available . But the vast majority are women who have remained single and on their own , many by choice . WHO 'S REMAINING RINGLESS ? Women no longer need to wed out of economic necessity With thriving careers , even steady jobs , they can afford monthly mortgage and car payments on their own . Nor is marriage a requirement for motherhood . The @ @ @ @ @ @ @ @ @ @ wedlock are waning and adoption has become a viable option for single women . . . . . . . .

 

Nevertheless, for those who just want the linguistic data, the text is still very useful. They will be missing 10 out of every 200 words, but 95% of the data is still there. Users can still generate word frequencies, n-grams, collocates, and concordance lines, and where "@" appears in the results, they can simply filter it out.
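As a rough illustration of that filtering step, here is a minimal Python sketch. It assumes the text is already tokenized on whitespace, as in the sample above; the function names `word_frequencies` and `ngrams` are hypothetical and are not part of any corpus tool.

```python
from collections import Counter

MASK = "@"  # placeholder token that stands in for each removed word


def word_frequencies(text):
    """Count whitespace-separated tokens, skipping the "@" placeholder."""
    return Counter(tok.lower() for tok in text.split() if tok != MASK)


def ngrams(text, n=2):
    """Count n-grams, dropping any n-gram that contains a placeholder.

    The placeholders stay in the token sequence while the window slides,
    so words on either side of a gap are never joined into a false n-gram.
    """
    tokens = [tok.lower() for tok in text.split()]
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if MASK not in gram:
            counts[gram] += 1
    return counts


sample = "spent a couple of weeks @ @ @ @ @ @ @ @ @ @ turned 45 this year"
freqs = word_frequencies(sample)
bigrams = ngrams(sample, 2)
```

Keeping the "@" tokens in place until after the n-grams are built is a deliberate choice here: deleting them first would create a spurious bigram such as "weeks turned" that never occurred in the original article.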

Note that the "relative frequency" of words, collocates, and so on will not be affected by these omissions. Since the omissions occur "blindly" every 200 words, without regard to context, they affect all words equally. So if you want to find the top 500,000 words in the corpus, or see the top 1,200 collocates of a given word, or extract all 50,000 n-grams for a given word, this will work essentially as well as if you had 100% of the corpus. It's just that the totals for all words and strings will be about 5% lower than otherwise.
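The effect can be checked with a small simulation. The sketch below is only illustrative: it uses a synthetic token stream instead of real corpus data, and it assumes that the ten removed words are the last ten of each 200-word block (the exact position within the block is not stated here). The helper names `mask_blocks` and `relative_freqs` are hypothetical.

```python
import random
from collections import Counter


def mask_blocks(tokens, block=200, masked=10, mask="@"):
    """Replace the last `masked` tokens of every full `block` with `mask`.

    Which ten words are removed in each block is an assumption of this sketch.
    """
    out = list(tokens)
    for start in range(0, len(out) - block + 1, block):
        for i in range(start + block - masked, start + block):
            out[i] = mask
    return out


def relative_freqs(tokens, mask="@"):
    """Return (word -> share of all real tokens, number of real tokens)."""
    counts = Counter(t for t in tokens if t != mask)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}, total


# Synthetic "corpus": five word types drawn with fixed weights.
random.seed(0)
vocab = ["the", "of", "marriage", "wedding", "corpus"]
weights = [50, 30, 10, 7, 3]
tokens = random.choices(vocab, weights=weights, k=200_000)

full_rel, full_total = relative_freqs(tokens)
masked_rel, masked_total = relative_freqs(mask_blocks(tokens))

print("share of tokens kept:", masked_total / full_total)
for w in vocab:
    print(w, round(full_rel[w], 4), round(masked_rel[w], 4))
```

In this simulation the raw totals drop by 5%, while each word's share of the remaining tokens stays very close to its share in the unmasked stream, because the masked positions are chosen without regard to which word occupies them.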

This is a reasonable solution to the copyright issue. Because of the "missing" text, it is unlikely that anyone would purchase our data to "read" the texts in their entirety. Consequently, there is essentially no economic impact on the copyright holder. But users who want the linguistic data still have access to 95% of that data. The end result is that our use is still within US "Fair Use Law", and you can have large amounts of data -- often 50-100 times as large as the next-largest corpus.