Some of the corpus texts are copyrighted, which could make it a
problem to distribute them in "full text" format. However,
US Fair Use Law
suggests that they can be used, as long as 1) their format is "transformed"
from the original format (which it is), and 2) there is
little or no effect on the
potential market for the text. To comply with US Fair Use Law, we have
"transformed" our texts so that people will not use them for their
originally intended purpose, which means that there should be no
effect on the potential market.
The
original texts were meant to be read as full texts. For example, take this
1996 article from Psychology Today. The same text is available in the
corpus (see below), but notice how it has been modified. Every 200 words,
ten words are removed and replaced with "@" (see the runs of "@" in the excerpt below).
This occurs in all three versions of the text, and it makes the text
rather useless for anyone who wants to read it as a
text. As a result, there is essentially no effect on the market for this
text.
##2037539 For many women , happiness is n't a prince and a wedding
away . So they 're just saying no to nuptials. // Every January from the
time she turned 20 , Katherine Wallace* has tried on bridal gowns . "
The first few times , it was because I was a bridesmaid at my sister 's
or friends ' weddings and all of us got in the spirit of imagining our
own special day , " says Wallace , a stockbroker in San Francisco . "
Then , in my late twenties , I got engaged and went shopping for real .
I found the perfect dress but we called the wedding off when my fiance
and I discovered we really wanted different things . He 'd envisioned a
quiet life out in the country and I 'm a city girl born and bred . After
that , trying on bridal gowns became sort of an annual ritual . " But
when January rolled around this year , Katherine broke with tradition .
Instead , she booked a flight to Australia and spent a couple of weeks
@ @ @ @ @ @ @ @ @ @
turned 45 this year and I realized that that walk down the aisle
probably is n't going to happen , " Wallace says with a rich laugh . "
But even more important , I realize I do n't need it to happen . I have
a terrific life and I do n't have to be married to enjoy it . " Happily
never married ? The words just do n't seem to belong together . They 're
an oxymoron , like military music or honest politician . Never-married
women are supposed to be needy neurotics frantically hunting down a
spouse , lonely depressives who hole up with a clutch of cats , or , a
more recent image , icy workaholics who trade the cozy warmth of husband
and home for glitzy high-power careers . No matter how you look at them
, they 're unloved , unwanted , unhealthy Take a closer look . After
years of being dismissed and ignored , the never married are coming into
the spotlight . And much to everyone 's surprise , psychologists are
discovering that " happily never married " rings true
@ @ @ @ @ @ @ @ @ @ have
been staging a quiet revolution , battling social prejudice , family
expectations , and their own apprehensions to set a new standard for
what it means to be a successful , fulfilled , and content woman . To be
sure , the majority of American adults still say " I do , " though at an
increasingly later age , but the ranks of the unmarried have been
growing . According to the U.S. Census Bureau , in 1984 about 3 million
women age 35 and older had never married . By 1994 the figure had
climbed to nearly 4.5 million . Some of these women are living with men
in what amount to common-law marriages and some are gay ; exact
statistics are n't available . But the vast majority are women who have
remained single and on their own , many by choice . WHO 'S REMAINING
RINGLESS ? Women no longer need to wed out of economic necessity With
thriving careers , even steady jobs , they can afford monthly mortgage
and car payments on their own . Nor is marriage a requirement for
motherhood . The @ @ @ @ @ @ @ @
@ @ wedlock are waning and adoption has become a viable option
for single women . . . . . . . .
Original text
Nevertheless, for those who just want the linguistic data, the text is still very useful.
Only 10 out of every 200 words are missing, so 95% of
the data is still there. Users can still generate word frequencies, n-grams,
collocates, and concordance lines, and where "@" appears in the results, it can
simply be filtered out.
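As a rough illustration of that filtering step, the following Python sketch counts word frequencies and bigrams while skipping the "@" placeholders. The function names and the short sample are hypothetical and chosen only for this example; they are not part of any tool distributed with the corpus, and the sketch assumes the text is already split on whitespace, as in the excerpt above.

```python
from collections import Counter

MASK = "@"  # placeholder the corpus uses for each removed word


def word_frequencies(tokens):
    """Count tokens (case-insensitively), ignoring "@" placeholders."""
    return Counter(t.lower() for t in tokens if t != MASK)


def ngrams(tokens, n=2):
    """Count n-grams, dropping any n-gram that contains a placeholder.

    Words on either side of a gap are never joined into one n-gram,
    because they were not adjacent in the original text.
    """
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if MASK not in gram:
            counts[gram] += 1
    return counts


# A fragment of the excerpt above, split on whitespace.
sample = ("spent a couple of weeks @ @ @ @ @ @ @ @ @ @ "
          "turned 45 this year").split()

freqs = word_frequencies(sample)
bigrams = ngrams(sample, 2)
```

Dropping every n-gram that touches a gap is a conservative choice: it loses a few n-grams at the edges of each gap but never reports a word pair that did not occur in the original.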
Note also that the "relative frequency" of words,
collocates, and so on will not be affected by these omissions. Since the
omissions occur "blindly" every 200 words, without regard to context, they
affect all words equally. So if you want to find the top 500,000 words
in the corpus, or see the top 1,200 collocates with a given word, or extract all
50,000 n-grams for a given word, this will work essentially as well as if you
had 100% of the corpus there. It's just that the totals for all words and strings
will be about 5% lower than otherwise.
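The claim about relative frequencies can be checked with a small simulation. The Python sketch below is only an illustration under stated assumptions: the toy vocabulary and its weights are invented, the name `mask_blindly` is hypothetical, and placing the gap at the end of each 200-word block is a guess, since the exact position of the removed words is not described here.

```python
import random
from collections import Counter

MASK = "@"


def mask_blindly(tokens, block=200, gap=10):
    """Replace `gap` tokens in every full `block`-token stretch with "@".

    Assumption of this sketch: the gap is the last `gap` tokens of
    each block. The masking ignores what the words are.
    """
    masked = list(tokens)
    for end in range(block, len(masked) + 1, block):
        masked[end - gap:end] = [MASK] * gap
    return masked


def relative_frequencies(tokens):
    """Share of each word among the non-placeholder tokens."""
    counts = Counter(t for t in tokens if t != MASK)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Toy "corpus": three words drawn with invented weights.
rng = random.Random(0)
vocab = ["the", "wedding", "oxymoron"]
tokens = rng.choices(vocab, weights=[70, 25, 5], k=200_000)
masked = mask_blindly(tokens)

before = relative_frequencies(tokens)
after = relative_frequencies(masked)
for word in vocab:
    print(f"{word:10s} before={before[word]:.4f} after={after[word]:.4f}")
```

In this toy setup exactly 10,000 of the 200,000 tokens become "@", so raw totals drop by 5%, while each word's share of the remaining tokens should stay very close to its share in the unmasked text.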
This is a reasonable solution to the copyright issue. Because of the
"missing" text, it is unlikely that anyone would
purchase our data
to "read" the texts in their entirety. Consequently, there should be no economic impact
on the copyright holder. But users who want the linguistic data still have
access to 95% of that data. The end result is that our use is still within US
"Fair Use Law", and you can have large amounts of data
-- often 50-100 times as large as the next-largest corpus.