How many words does it take to understand 50% of the vocabulary in a text? Not that many, but...

Monday, September 13, 2010

I thought I would carry out an interesting test today: counting how many words it takes to understand 50% of the words in a typical novel, Bram Stoker's Dracula. The entire book is 160877 words, and we can use a word cloud to pick out the most frequent words and remove them first. This is not an exact science, though. Even after removing formatting, the find and replace function will not properly recognize some words: a word at the beginning of a line, for example, has no space before it, and the spaces are needed because simply removing "and" without them would also remove it from land, band, and so on. And though the most frequent words appear in the largest font, it's not always possible to tell which word is the absolute #1 most frequent if several have a similar font size. Nevertheless, it's accurate enough.
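
The land/band problem comes from matching substrings rather than whole words. Here is a minimal sketch in Python of a whole-word count, assuming the book is available as a plain string; the function name `count_words` and the sample sentence are made up for illustration, and this is not how the counts in this post were produced:

```python
import re
from collections import Counter

def count_words(text):
    """Count whole words, ignoring case."""
    # [a-z']+ keeps apostrophes inside a token, so "don't" stays one word
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

sample = "The band played on the land, and the sand was warm."
counts = count_words(sample)
print(counts["and"])          # only the standalone word is counted
print(counts.most_common(3))  # most frequent words first
```

Because the text is split into tokens first, "and" inside "band" or "sand" never counts, and a word at the start of a line is treated like any other.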

We begin with a list of words heavily skewed towards the, and, to, and I.

Let's begin.

160877 words

the: 7839 instances
and: 5860
to: 4420
I: 4426
of: 3586
that: 2416
a: 2901
in: 2460
was: 1869
he: 2553

Now down to 122547 words, so with these ten we can already understand 23.8% of the words we encounter.
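
For anyone who wants to check the arithmetic, here is a small sketch using the ten counts listed above (the variable names are just for illustration):

```python
TOTAL = 160877  # words in the whole book, as stated above

# the, and, to, I, of, that, a, in, was, he
first_ten = [7839, 5860, 4420, 4426, 3586, 2416, 2901, 2460, 1869, 2553]

removed = sum(first_ten)          # words covered by these ten
remaining = TOTAL - removed       # words still unknown
coverage = 100 * removed / TOTAL  # percentage of the text understood

print(remaining, round(coverage, 1))
```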

as: 1534
it: 2107
for: 1500
is: 1475
his: 1443
me: 1412
not: 1377
with: 1259
you: 1341
we: 1535

Total count now: 107564, thus up to 33%.

have: 1045
be: 1103
her: 1039
all: 1117
had: 1019
my: 1216
so: 1063
on: 1031
at: 1057
him: 935

Total count now: 96939, 39.7% comprehension.

but: 1032
which: 657
from: 614
could: 487
were: 548
said: 541
she: 772
when: 690
there: 712
are: 573

Total count now: 90345, 43.8% comprehension.

if: 625
must: 434
by: 490
will: 438
them: 464
this: 588
one: 461
up: 435
or: 456
us: 451

Total count now: 85549, 46.8% comprehension. In the next ten we encounter our first name, Van, and names are bonuses since they don't really have to be memorized, just kept in short-term memory.

do: 450
would: 428
some: 433
shall: 414
been: 387
what: 371
know: 386
more: 359
time: 379
Van: 305

Total count now: 81665, 49.2% comprehension. Almost there!

out: 421
no: 439
our: 404

80401 words remaining, 50%! And it only took 63 words. I'm too lazy to make a graph but here's what it looks like:

10 words: ************--------------------------------------
20 words: *****************---------------------------------
30 words: ********************------------------------------
40 words: **********************----------------------------
50 words: ***********************---------------------------
60 words: *************************-------------------------
63 words: *************************-------------------------
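
A chart like this can be generated from the running totals reported in this post. This is a rough sketch, assuming a 50-character bar so that each star stands for 2% of the text; the names `checkpoints` and `bar` are illustrative:

```python
TOTAL = 160877

# (words learned, words remaining in the text), as reported above
checkpoints = [
    (10, 122547), (20, 107564), (30, 96939), (40, 90345),
    (50, 85549), (60, 81665), (63, 80401),
]
WIDTH = 50  # bar width in characters, so one star is 2%

def bar(learned, remaining):
    """Render one line of the chart for a given checkpoint."""
    coverage = (TOTAL - remaining) / TOTAL
    stars = round(coverage * WIDTH)
    return f"{learned} words: " + "*" * stars + "-" * (WIDTH - stars)

lines = [bar(learned, remaining) for learned, remaining in checkpoints]
print("\n".join(lines))
```

Rounding to the nearest star means two checkpoints less than 2% apart can end up with the same bar.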

Now, keep in mind of course that this 50% only means understanding 50% of the words, not comprehending 50% of what has been written. Being able to understand words like the, was, I and the rest will do little for a person's understanding of the story, and you'll have to reach a much higher level (probably 90% or so) before real comprehension is possible. The easiest way to imagine this is to picture an average sentence of some 20 words. At 50%, that means having to look up 10 words per sentence for a complete understanding, so working through just a few sentences is enough to tire the mental faculties of even a good student. Even at 90% you still don't know two words in each sentence. That is probably few enough that you can skim most of the time without worrying about the missing 10%, but sometimes that remaining 10% will be the most crucial part and will have to be looked up too.
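
The look-up estimate is simple multiplication. Here is a sketch, assuming a 20-word sentence and unknown words spread evenly through the text (real sentences will vary); `unknown_per_sentence` is a made-up helper name:

```python
def unknown_per_sentence(sentence_length, coverage):
    """Expected number of unfamiliar words in one sentence."""
    return round(sentence_length * (1 - coverage))

for coverage in (0.5, 0.9):
    print(f"{coverage:.0%} known:", unknown_per_sentence(20, coverage), "words to look up")
```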

And indeed, if you look at the remaining 50% of the book (here is the first page) you can see that the 50% we have removed really isn't necessary to understand the story; the bits and pieces supplied by the first 63 words actually end up contributing very little. It's the other 50% that really counts.



Jonathan Harker's Journal

3 May Bistritz --Left Munich 8:35 P M 1st May arriving Vienna early next morning should arrived 6:46 train an hour late Buda-Pesth seems wonderful place glimpse got train little walk through streets feared go very far station arrived late start near correct possible

impression leaving West entering East most western splendid bridges over Danube here noble width depth took among traditions Turkish rule

left pretty good came after nightfall Klausenburgh Here stopped night Hotel Royale dinner rather supper chicken done way red pepper very good thirsty (Mem get recipe Mina ) asked waiter called paprika hendl, national dish should able get anywhere along Carpathians

I found smattering German very useful here indeed don't how should able get without

Having disposal London visited British Museum made search among books maps library regarding Transylvania struck foreknowledge country hardly fail importance dealing nobleman country

I find district named extreme east country just borders three states Transylvania Moldavia Bukovina midst Carpathian mountains wildest least known portions Europe

I able light any map work giving exact locality Castle Dracula maps country yet compare own Ordnance Survey Maps found Bistritz post town named Count Dracula fairly well-known place enter here notes they may refresh memory talk over travels Mina

In population Transylvania four distinct nationalities: Saxons South mixed Wallachs who descendants Dacians Magyars West Szekelys East North am going among latter who claim descended Attila Huns may Magyars conquered country eleventh century they found Huns settled

I read every known superstition world gathered into horseshoe Carpathians centre sort imaginative whirlpool stay may very interesting (Mem ask Count about )

I did sleep well though bed comfortable enough sorts queer dreams dog howling night under window may something may paprika drink water carafe still thirsty Towards morning slept wakened continuous knocking door guess sleeping soundly then

I breakfast paprika sort porridge maize flour they mamaliga egg-plant stuffed forcemeat very excellent dish they call impletata (Mem get recipe also )

I hurry breakfast train started little before eight rather ought done after rushing station 7:30 sit carriage than an hour before began move

seems further east go unpunctual trains What ought they China?

All day long seemed dawdle through country full beauty every kind Sometimes saw little towns castles top steep hills such see old missals sometimes ran rivers streams seemed wide stony margin each side subject great floods takes lot water running strong sweep outside edge river clear

At every station groups people sometimes crowds sorts attire just like peasants home those saw coming through France Germany short jackets round hats home-made trousers others very picturesque

women looked pretty except got near they very clumsy about waist They full white sleeves kind other most big belts lot strips something fluttering like dresses ballet course petticoats under
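
An excerpt like the one above can also be produced programmatically instead of with find and replace. Here is a minimal sketch, assuming a set of lowercase words to drop; only the first ten words from the list are used here, and `strip_common` is an illustrative name, not the method used for this post:

```python
import re

def strip_common(text, common):
    """Drop every token whose lowercase form is in `common`."""
    tokens = re.findall(r"\S+", text)
    # ignore surrounding punctuation when comparing, but keep it in the output
    kept = [t for t in tokens if t.strip(".,;:!?\"()").lower() not in common]
    return " ".join(kept)

common = {"the", "and", "to", "i", "of", "that", "a", "in", "was", "he"}
print(strip_common("I was in the land of the Count.", common))
```

Unlike the excerpt above, this sketch leaves punctuation attached to the words that survive.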
