Monday, September 13, 2010
I thought I would carry out an interesting test today: a count of how many distinct words it takes to understand 50% of the words of a typical novel, Bram Stoker's Dracula. The entire book is 160877 words, and we can use Wordle.net to find the most frequent words and remove those first. This is not an exact science though. Even after removing formatting, the find and replace function will not properly recognize some words: a word at the beginning of a line, for example, has no space before it, and simply removing "and" without the surrounding spaces would also remove the "and" from land, band, and so on. And though the most frequent words appear in the largest font, it's not always possible to tell which word is the absolute #1 most frequent when several have a similar font size. But nevertheless, it's accurate enough.
the: 7839 instances
Now down to 122547 words, so with these ten we can already understand 23.8% of the words we encounter.
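For anyone who wants to repeat the count without find and replace, here is a minimal Python sketch, assuming the book is available as plain text. The function names `tokenize` and `coverage_of_top` are made up for illustration, and the tokenizing rule (runs of letters, with apostrophes allowed inside a word) is an assumption, so the totals will not necessarily match the 160877 figure above. Because it counts whole words, removing "and" leaves land and band alone.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: a "word" is a run of letters, optionally with internal apostrophes.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def coverage_of_top(text, n):
    """Return the n most frequent words (with counts) and the fraction of all words they cover."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    top = counts.most_common(n)
    if total == 0:
        return top, 0.0
    covered = sum(count for _, count in top)
    return top, covered / total

# Tiny made-up sample, not text from the novel.
sample = "The band and the land. And the sand!"
top, fraction = coverage_of_top(sample, 2)
print(top, fraction)
```

On the real text you would read the file into a string and call `coverage_of_top(text, 10)`, then 20, 30 and so on.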
Total count now: 107564, thus up to 33%.
Total count now: 96939, 39.7% comprehension.
Total count now: 90345, 43.8% comprehension.
Total count now: 85549, 46.8% comprehension. Now we encounter our first name, and names are bonuses since they don't have to really be memorized, just kept in short-term memory.
Total count now: 81665, 49.2% comprehension. Almost there!
80401 words remaining, 50%! And it only took 63 words. I'm too lazy to make a graph but here's what it looks like:
10 words: ************--------------------------------------
20 words: *****************---------------------------------
30 words: ********************------------------------------
40 words: **********************----------------------------
50 words: ***********************---------------------------
60 words: *************************-------------------------
63 words: *************************-------------------------
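The percentages and the bars above can be recomputed from the running totals with a short Python sketch. The totals are the ones reported in this post; pairing them with 10, 20, ... 60, 63 words is my reading of the rows above, and the names `coverage_percent` and `bar` as well as the round-to-nearest rule are assumptions for illustration.

```python
TOTAL_WORDS = 160877  # full word count of the book, as given above

# (distinct words removed so far, words remaining), assumed to line up with the graph rows
steps = [(10, 122547), (20, 107564), (30, 96939), (40, 90345),
         (50, 85549), (60, 81665), (63, 80401)]

def coverage_percent(remaining, total=TOTAL_WORDS):
    # Share of the running text accounted for by the words removed so far.
    return 100.0 * (total - remaining) / total

def bar(percent, width=50):
    # One character per 100/width percent, rounded to the nearest character.
    filled = int(percent / 100 * width + 0.5)
    filled = max(0, min(width, filled))
    return "*" * filled + "-" * (width - filled)

for n_words, remaining in steps:
    pct = coverage_percent(remaining)
    print(f"{n_words} words: {bar(pct)} {pct:.1f}%")
```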
Now, keep in mind of course that this 50% only means recognizing 50% of the words, not comprehending 50% of what has been written. Knowing words like the, was, I and the rest does almost nothing for a person's understanding of the text, and you'll have to reach a much higher level (probably 90% or so) before real comprehension becomes possible. The easiest way to imagine this is to picture an average sentence, perhaps some 20 words in length. At 50% that means having to look up 10 words per sentence for a complete understanding, so looking up just a few sentences is enough to tire the mental faculties of even a good student. Even at 90% you still don't know two words in each sentence. That is probably good enough to skim through most of the time without worrying about the remaining 10%, but sometimes that 10% will be the most crucial part and will have to be looked up too.
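The lookups-per-sentence arithmetic is simple enough to put in a couple of lines of Python. This is only a sketch: the 20-word sentence is the rough figure used above, and `unknown_words_per_sentence` is a made-up helper name.

```python
def unknown_words_per_sentence(known_fraction, sentence_length=20):
    # Expected number of words you would have to look up in one sentence.
    return round(sentence_length * (1 - known_fraction), 2)

for known in (0.5, 0.9):
    lookups = unknown_words_per_sentence(known)
    print(f"{known:.0%} known: about {lookups:g} lookups per sentence")
```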
And indeed, if you look at the remaining 50% of the book (here is the first page), you can see that the 50% we have removed really isn't necessary to understand the story; the first 63 words actually end up contributing very little. It's the other 50% that really counts.
Jonathan Harker's Journal
3 May Bistritz --Left Munich 8:35 P M 1st May arriving Vienna early next morning should arrived 6:46 train an hour late Buda-Pesth seems wonderful place glimpse got train little walk through streets feared go very far station arrived late start near correct possible
impression leaving West entering East most western splendid bridges over Danube here noble width depth took among traditions Turkish rule
left pretty good came after nightfall Klausenburgh Here stopped night Hotel Royale dinner rather supper chicken done way red pepper very good thirsty (Mem get recipe Mina ) asked waiter called paprika hendl, national dish should able get anywhere along Carpathians
I found smattering German very useful here indeed don't how should able get without
Having disposal London visited British Museum made search among books maps library regarding Transylvania struck foreknowledge country hardly fail importance dealing nobleman country
I find district named extreme east country just borders three states Transylvania Moldavia Bukovina midst Carpathian mountains wildest least known portions Europe
I able light any map work giving exact locality Castle Dracula maps country yet compare own Ordnance Survey Maps found Bistritz post town named Count Dracula fairly well-known place enter here notes they may refresh memory talk over travels Mina
In population Transylvania four distinct nationalities: Saxons South mixed Wallachs who descendants Dacians Magyars West Szekelys East North am going among latter who claim descended Attila Huns may Magyars conquered country eleventh century they found Huns settled
I read every known superstition world gathered into horseshoe Carpathians centre sort imaginative whirlpool stay may very interesting (Mem ask Count about )
I did sleep well though bed comfortable enough sorts queer dreams dog howling night under window may something may paprika drink water carafe still thirsty Towards morning slept wakened continuous knocking door guess sleeping soundly then
I breakfast paprika sort porridge maize flour they mamaliga egg-plant stuffed forcemeat very excellent dish they call impletata (Mem get recipe also )
I hurry breakfast train started little before eight rather ought done after rushing station 7:30 sit carriage than an hour before began move
seems further east go unpunctual trains What ought they China?
All day long seemed dawdle through country full beauty every kind Sometimes saw little towns castles top steep hills such see old missals sometimes ran rivers streams seemed wide stony margin each side subject great floods takes lot water running strong sweep outside edge river clear
At every station groups people sometimes crowds sorts attire just like peasants home those saw coming through France Germany short jackets round hats home-made trousers others very picturesque
women looked pretty except got near they very clumsy about waist They full white sleeves kind other most big belts lot strips something fluttering like dresses ballet course petticoats under
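A stripped page like the one above could also be produced with a short Python sketch instead of find and replace. The helper name `strip_words` and the small word list are hypothetical; the list is not the actual 63 words removed in this post. Matching on word boundaries keeps land and band intact when "and" is removed, and also catches words at the start of a line.

```python
import re

def strip_words(text, words):
    """Remove whole-word, case-insensitive matches of `words` from `text`."""
    if not words:
        return text
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    stripped = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the gaps left behind by the removed words.
    return re.sub(r"\s+", " ", stripped).strip()

# Hypothetical list for illustration only, NOT the 63 words used above.
common = {"the", "and", "i", "a", "was"}
print(strip_words("I left the station and the band played", common))
```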