Monday, March 15, 2010

Wordle is a great site for creating word clouds: paste in a ton of text, hit create, and it will display the most frequent words, with the more frequent ones in a larger font. Since frequency lists aren't available for all languages on Wiktionary, Wordle is a good way to create your own. Here's how to best use it.

First, find a ton of text in the language you are studying. The more the better, but it doesn't have to be unbelievably extensive. Keep in mind that when using another language you're going to be talking about subjects you know about anyway (e.g. if you are a rock climber then you'll probably have an easy time talking to others about rock climbing and other outdoorsy subjects). In fact, choosing content that interests you might even give you a better frequency list than those on Wiktionary, which are sometimes biased towards literature and hard news, since that's where most of their content usually comes from. And if you just want to chat in the language, find forums and chat rooms to copy content from.

To make things easy, let's just copy this huge page on the Norwegian Wikipedia on the history of the church in China, as well as the one on the Winter War (Vinterkrigen), in order to balance out words on China and the church with some other subjects. Be sure to save the text separately, though, in a WordPad, Notepad or Word file, as you will be altering it later.

Okay, so let's paste that into Wordle. Be sure to turn off the option that removes common words; otherwise it will take out words like og (and), er (is/are) and so on that you might not know yet. Doing that gives us the following:

The default setting for Wordle is 150 words (this can be changed), so there we have a fairly good representation of the 150 most frequent words in Norwegian.
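If you would rather have the list as plain text than as an image, the same count can be roughly reproduced in a few lines of Python. This is only a sketch, not how Wordle itself works: the function name top_words and the sample sentence are made up for illustration, and it simply treats every run of letters as a word.

```python
import re
from collections import Counter

def top_words(text, n=150):
    """Return the n most frequent words in text as (word, count) pairs."""
    # [^\W\d_]+ matches runs of letters only; in Python 3 this is
    # Unicode-aware, so Norwegian letters like å, ø and æ are kept.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words).most_common(n)

# Made-up sample text; in practice, read in the file you saved earlier.
sample = "Kirken og staten og folket er i Kina og kirken er gammel"
for word, count in top_words(sample, 3):
    print(word, count)
```

The default of 150 just mirrors Wordle's default; pass a different n to get a longer or shorter list.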

Now comes the fun part. Most of these words you will know after even a few days of studying the language, so begin with the largest word and start taking them out of your saved text one by one using find and replace. Be sure to search for the word with a space on either side (e.g. " og ", not "og"), because otherwise it will remove the text from inside other words as well and ruin everything. Replace it with a single space rather than with nothing, or the neighbouring words will stick together. Then paste the edited text back into Wordle to create a new cloud.

Okay, let's take out og (and), som (which/as/that), av (of), det (that/it), i (in), var (was), for (for), med (with), and til (to/till).
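If find and replace gets tedious, the removal can also be scripted. The sketch below is one possible way to do it in Python, assuming you have the text as a string; remove_known_words is a hypothetical helper name and the sample sentence is made up. It uses word boundaries instead of the surrounding-space trick, so words next to punctuation or at the start of a line are caught too.

```python
import re

def remove_known_words(text, known):
    """Delete whole-word occurrences of the known words, ignoring case."""
    if not known:
        return text
    # \b marks a word boundary, so "i" is removed on its own
    # but the "i" inside "Kina" is left alone.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in known) + r")\b"
    stripped = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    # Collapse the runs of spaces left behind by the removals.
    return re.sub(r"[ \t]+", " ", stripped).strip()

known = ["og", "som", "av", "det", "i", "var", "for", "med", "til"]
# Made-up sample sentence for illustration.
print(remove_known_words("Kirken i Kina var stor og viktig for folket", known))
# prints: Kirken Kina stor viktig folket
```

Each match is replaced with a space and the extra spaces are then collapsed, which is the scripted version of replacing with a space so the remaining words don't stick together.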

Now those words we removed are gone, and remaining words like de and ble have become the most common. Let's remove a lot more this time. We'll get rid of en (a/an for common gender nouns, also one), hadde (had), de (they), å (to, as in to eat), at (that, as in I said that), ikke (not), på (on/in), også (also), et (a/an for neuter gender nouns), om (on/about), den (the/it, used for common gender nouns), fra (from), and ble (became). Now the word cloud looks like this.

Much better. Now that we've gotten rid of the most common words, the ones left over are closer to each other in frequency, and so their sizes have evened out too. You can see that since we only took content from two articles, certain words like China, Finland and the Soviet Union are overrepresented, so when making your own it's probably best to draw on at least ten different sources in order to avoid that.
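One rough way to spot words that are only frequent because of a single article is to count how many of your sources each word shows up in. The following Python sketch does that under simple assumptions: source_spread is a hypothetical name, the three short "sources" are made up, and a word is just a run of letters.

```python
import re
from collections import Counter

def source_spread(sources):
    """Map each word to the number of sources it appears in at least once."""
    spread = Counter()
    for text in sources:
        # set() so a word counts once per source, however often it repeats there.
        spread.update(set(re.findall(r"[^\W\d_]+", text.lower())))
    return spread

# Made-up stand-ins for the texts you collected.
sources = [
    "Kina og kirken i Kina",
    "Finland og krigen",
    "Byen og havnen",
]
spread = source_spread(sources)
# Words found in only one source are likely tied to that source's topic.
topic_bound = sorted(word for word, n in spread.items() if n == 1)
print(spread["og"], topic_bound)
```

With only a few sources nearly every content word will look topic-bound, so this check only becomes useful once you have gathered a wider spread of texts.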

And now, after removing the words you already know, just keep the image handy or print it out to carry with you, and begin the process of removing one word at a time as you become familiar with it. In this way, instead of just learning words willy-nilly, you'll be learning the words you are most certain to encounter, and that will make texts in the language that much easier to understand that much sooner.

One other interesting note is that Wordle doesn't turn words into their dictionary or uninflected forms as most frequency lists do (e.g. turning oxen into ox and slew into slay). This is actually better for the student, as words are encountered in daily life in their inflected and conjugated forms, so using Wordle lets you confirm that you understand the words as they are used, not just as they appear in the dictionary.


