Wordly wisdom

What determines the length of words? MIT researchers say they know.


Why are some words short and others long? For decades, a prominent theory has held that words used frequently are short in order to make language efficient: It would not be economical if “the” were as long as “phenomenology,” in this view. But now a team of MIT cognitive scientists has developed an alternative notion, on the basis of new research: A word’s length reflects the amount of information it contains.

“It may seem surprising, but word lengths are better predicted by information content than by frequency,” says Steven Piantadosi, a PhD candidate in MIT’s Department of Brain and Cognitive Sciences (BCS), and the lead author of a paper on the subject that evaluates word use in 11 languages. The paper was published online last month in the Proceedings of the National Academy of Sciences (PNAS).

The notion that frequency of use engenders shorter words stems from work published by Harvard scholar George Zipf in the 1930s. The Zipf idea, Piantadosi notes, has an intuitive appeal to it, but only offers a limited explanation of word lengths. “It makes sense that if you say something over and over again, then you want it to be short,” Piantadosi says. “But there is a more refined communications story to be told than that. Frequency doesn't take into account dependencies between words.”

That is, many words typically appear in predictable sequences along with other words. Short words are not necessarily highly frequent; more often, the researchers found, short words do not contain much information by themselves, but appear with strings of other familiar words that, as an ensemble, convey information.

In turn, this clustering of short words helps “smooth out” the flow of information in language by forming strings of similar-sized language packets, which creates an efficiency of its own — albeit not exactly the one Zipf envisioned. “If you take the view that people should be trying to communicate efficiently, you get this uniform rate,” adds Piantadosi; whether delivered through clusters of shorter words or through individual longer words carrying greater information, language tends to convey information at consistent rates.

Written in the script

Piantadosi conducted the study along with Edward Gibson, a professor in BCS who also has a joint appointment in the Department of Linguistics, and Harry Tily, a postdoctoral associate in BCS. In the paper, the MIT researchers studied an enormous data set of online documents posted by Google. Since the documents included many Internet-specific character sequences that are not words — think “www” — the team began by cataloguing texts from Open Subtitles, a database of movie subtitles, and searched for the words used in those documents when mining the larger Google database. “Movie subtitles are words used naturalistically, so we took words used frequently in that data set and pulled their statistics from Google,” explains Piantadosi. The 11 languages in the study are all European.

To evaluate how much information is contained in a word, the researchers defined information as existing in an inverse relationship to the predictability of words. That is, the words most often occurring after familiar sequences of two, three or four other words — such as the “eat” in “you are what you eat” — contain the least information individually. By contrast, words whose appearances have a minimal relationship to the words preceding them — such as the “contagious” in “you are contagious” — contain, individually, more information. This principle is based on the highly influential work of Claude Shannon, the information-theory pioneer and longtime MIT professor.
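For readers who want a concrete sense of the measure, the sketch below shows the kind of calculation Shannon’s framework implies: each occurrence of a word contributes the negative log-probability of that word given its preceding words, and those values are averaged over all the contexts in which the word appears. The toy corpus, the trigram setting and the function name here are illustrative assumptions, not the paper’s actual pipeline.

```python
import math
from collections import Counter, defaultdict

def ngram_information(corpus_tokens, n=3):
    """Estimate each word's information content, in the Shannon sense:
    the negative log2-probability of the word given its preceding
    (n-1)-word context, averaged over all of its occurrences."""
    context_counts = Counter()
    joint_counts = Counter()
    info_sums = defaultdict(float)
    occurrences = Counter()

    # First pass: count contexts and (context, word) pairs.
    for i in range(n - 1, len(corpus_tokens)):
        context = tuple(corpus_tokens[i - (n - 1):i])
        word = corpus_tokens[i]
        context_counts[context] += 1
        joint_counts[(context, word)] += 1

    # Second pass: information of each occurrence is -log2 P(word | context).
    for i in range(n - 1, len(corpus_tokens)):
        context = tuple(corpus_tokens[i - (n - 1):i])
        word = corpus_tokens[i]
        p = joint_counts[(context, word)] / context_counts[context]
        info_sums[word] += -math.log2(p)
        occurrences[word] += 1

    # Average information per word across all of its contexts.
    return {w: info_sums[w] / occurrences[w] for w in info_sums}

# Tiny made-up corpus: "eat" is fairly predictable after "what you".
tokens = "you are what you eat and you are what you read".split()
print(ngram_information(tokens, n=3))
```

In this toy example, “eat” and “read” each receive one bit of information because the context “what you” is followed by either word half the time; a word that always follows its context would receive zero bits.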

The MIT team found that 10 percent of the variation in word length is attributable to the amount of information contained in those words — not a high figure by itself, but one about three times as large as the variation in word length attributable to frequency, the notion Zipf championed. For English words, 9 percent of the variation in length is due to amount of information, and 1 percent stems from frequency. It turns out, for instance, that words as disparate in length as “mind” and “organization” appear with virtually the same frequency. However, as Gibson acknowledges, “the data itself is noisy,” and there are counter-examples that do not necessarily support their thesis; for instance, “menu” and “selection” have about the same informational content despite their very different lengths.
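To make the comparison concrete, here is a minimal sketch of how one might compare the two predictors once per-word measurements are in hand: correlate word length with average information content and with log frequency, and square each correlation to get the proportion of variance explained. All the numbers below are invented for illustration; they are not the study’s data.

```python
import numpy as np

# Hypothetical per-word measurements; in the study these came from
# large n-gram corpora. The values here are made up for illustration.
words = ["the", "mind", "organization", "menu", "selection"]
lengths = np.array([len(w) for w in words], dtype=float)
info = np.array([3.1, 7.2, 11.5, 8.0, 8.3])         # avg. bits per occurrence (invented)
log_freq = np.array([-1.2, -4.5, -4.6, -5.0, -4.9])  # log relative frequency (invented)

def variance_explained(x, y):
    """Proportion of variance in y accounted for by a linear fit on x (r squared)."""
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2

print("R^2, length ~ information:  ", variance_explained(info, lengths))
print("R^2, length ~ log frequency:", variance_explained(log_freq, lengths))
```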

Colleagues believe the study’s new insight about the mechanics of language will prove important over time. “This is exciting work,” says Roger Levy, an assistant professor in the Department of Linguistics at the University of California, San Diego. In Levy’s view, the paper answers an important objection to Zipf’s law lodged by George Miller, a psychologist at Princeton University. As Miller pointed out, any random language generator using a space key — the proverbial monkeys on a typewriter — would also create language patterns in which shorter strings of characters appear most frequently.

By contrast, the current paper, while offering an alternative view of efficiency to the one Zipf held, does imply that word length has a non-random basis. “The notion of monkeys on a typewriter can’t explain these findings,” adds Levy.

Still, the researchers acknowledge there is much more work to be done in this area of language studies. Piantadosi, for one, is using similar data-mining techniques to study the role of ambiguity in language, studying how the meaning of words with multiple potential definitions becomes clarified by the presence of frequently appearing words around them. He hopes to publish results about the subject as a follow-up to the current PNAS paper.


Topics: Brain and cognitive sciences, Language, Linguistics, Psychology

Comments

The idea is very interesting, and it would be more productive for the research to include the Greek language, where words convey far more than what they seem to mean. Maybe the authors could use the Greek language as a control.
What is the meaning of the number on the picture?
For the example above, "eat" vs. "contagious" (as with many other words in English and Latin-based languages): couldn't longer words, which contain more morphemes, with each morpheme carrying information, by definition have more information in them than shorter words with fewer morphemes (and thus fewer informational units)? Longer words with X morphemes should be compared with shorter words with the same number of morphemes to decide whether length really affects meaning/information content. Also, a metaphysically rich word like "soul" seems to have more information in it than an informationally simple term like "hydrogen". In fact, "hydrogen" and "oxygen" are both longer than "water" but possess less information than "water", the difference being that "water" is more familiar to us and was a word before we knew what it was made of (and thus, I believe, it is a single morpheme). Anyway, food for thought.
This leads us to the importance of the character in language. In the Chinese classics, characters seem to be equivalent to the Chinese language itself. WANG Guowei showed that a word's meaning is defined by the form of its character, drawing on the study of ancient characters inscribed on oracle bones and tortoise shells. I have written some papers on characters under the influence of KARCEVSKIJ Sergej. From there I now propose some models of language using a geometrical approach drawn from the features of characters. More details are given at /sekinan.org/.
"Water" contains less information than either "hydrogen" or "oxygen". We're talking about the words, not the concepts. "Hydrogen" even contains "hydro" which has the same meaning as "water" (indeed, both descend from the same PIE word).
The picture is of Scrabble tiles. You add up the numbers printed on each tile to score the word that was played. Maybe the graphic is literally true and individual letters contain the information making up a word, with more letters adding up to more value for the word, just like in Scrabble. This idea replaces Chazzle's idea of morphemes being the distinct informational unit. If you use the word "lightly" there are two morphemes, "light" and "ly". If you take away the "l" and leave "ightly", you can still guess the "l" and have a complete word. Since "ightly" has only one morpheme and we use our brains to guess the leading letter, it follows that the letters are the informational units.
See Shannon, 1948.
Commenters have failed to understand the concept of "information" used in this research. The "information" content of a word as these researchers define it is an inverse function of its predictability in a particular string of words, which is determined by statistical analysis of a large corpus. You don't have to know what any of the words mean to do this analysis. There is no way of knowing just by inspection of the words whether "water" or "oxygen" has a greater information content; you can only know this after running an analysis of several million words of running text from representative sources, and seeing which one has the overall lower predictability of occurrence in the contexts in which it appears. That is the one that has the greater information content, in the sense of this research. And looking at texts in Greek, or any other language, would be irrelevant to the analysis.
If one were to design word length to maximize the accuracy of Automatic Speech Recognition (ASR), this is exactly the thing to do. ASR multiplies the probability of a word given a context of one or two preceding words, with the probability of an acoustic match, for various recognition hypotheses. It makes a lot of sense to compensate lower word frequency with higher acoustic content (i.e. making the word longer, so it stands out acoustically from competing recognition hypotheses).
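As a rough illustration of the scoring this comment describes, the toy sketch below combines an n-gram language-model score with per-word acoustic scores for competing recognition hypotheses. The probability table, the hypotheses and the acoustic values are all invented for illustration; real ASR systems use far larger models plus weighting and normalization terms.

```python
import math

def score_hypothesis(words, acoustic_logprobs, lm):
    """Combine a language-model score with an acoustic score for one
    recognition hypothesis, in the spirit of standard ASR decoding:
    log P(words) + log P(audio | words)."""
    lm_score = 0.0
    for i, w in enumerate(words):
        context = tuple(words[max(0, i - 2):i])           # up to two preceding words
        lm_score += math.log(lm.get((context, w), 1e-9))  # unseen n-grams get a tiny floor
    return lm_score + sum(acoustic_logprobs)

# Toy n-gram probabilities and per-word acoustic log-likelihoods (all invented).
lm = {((), "recognize"): 0.01, (("recognize",), "speech"): 0.2,
      ((), "wreck"): 0.005, (("wreck",), "a"): 0.1,
      (("wreck", "a"), "nice"): 0.05, (("a", "nice"), "beach"): 0.05}
hyp1 = score_hypothesis(["recognize", "speech"], [-4.0, -5.0], lm)
hyp2 = score_hypothesis(["wreck", "a", "nice", "beach"], [-3.0, -1.0, -2.5, -3.5], lm)
print("recognize speech:", hyp1)
print("wreck a nice beach:", hyp2)
```

A longer, acoustically distinctive word can win out even when its language-model probability is lower, which is the trade-off the comment points to.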
MIT's finding about the length of words is a very important result in language research, because modern linguistics has developed its research mainly from the contemporary phase of language. That is, of course, important. But almost all language phenomena exist in the past, for which characters are indispensable tools for overcoming forgetting. I have researched natural language mainly through the characters of the Chinese classics, which are among the most important heritages of world literature. As a result, Chinese characters not only retain traces of the speech of their time, but also show the kernel essence of language itself. I wrote a paper on the theme of "time inherent in characters" in 2003, and it was presented at a conference held by UNESCO at Nara, Japan, in winter 2003. The research now continues by building models using mathematical methods; "Symplectic Language Theory" and "Floer Homology Language" are the recent work. For more details, refer to the sites SEKINAN CREDERE and SEKINAN METRIA.
Any thoughts on the study's implications for thought formation? Is a terse vocabulary likely to result in a person who thinks differently than one who is internally sesquipedalian? Also, since this study only focused on European languages, do you predict that the results would vary had it included a tonal language?