Full-text corpus data


How to create a multi-column table with adjacent words

insert into corpusBig(ID,textID,w1,w2,w3,w4,w5,w6,w7,w8,w9)
select a.ID,a.textID,a.wordID,b.wordID,c.wordID,d.wordID,e.wordID,f.wordID,g.wordID,h.wordID,i.wordID
from corpus as a, corpus as b, corpus as c, corpus as d, corpus as e, corpus as f, corpus as g, corpus as h, corpus as i where
a.ID between 1 and 1000000 and
a.ID+1 = b.ID and
a.ID+2 = c.ID and
a.ID+3 = d.ID and
a.ID+4 = e.ID and
a.ID+5 = f.ID and
a.ID+6 = g.ID and
a.ID+7 = h.ID and
a.ID+8 = i.ID
order by a.ID asc

You don't need to use the line in red above, but if you have a clustered index on the ID column, and if you take the corpus in 1 million word groups (1-1,000,000, then 1,000,001-2,000,000 etc) it will really speed things up. On a fairly fast machine, you should be able to create a 440 million word, 9 column/word table in 3-4 hours.

To make it even faster, you may want to iteratively copy the million or so rows into a temp table, copy from there into [corpusBig], then delete from the temp table, and do the next million rows, etc.