# Word Frequency, Zipf's Law, and Finding Trends

#### This was originally written in January, 2014.

Word frequency for some selected terms

In my previous post, I wrote about a poem-finding algorithm which digs through user comments from the New York Times to search for poems. But what else can we do with the thousands of comments posted every day?

Suppose we want to find “trending” words that are especially common today. There are many wrong ways to do this. For example, consider a day when there is an article about Edward Snowden, and we want to know how much more people are using the word “snowden” compared to the previous day. One simple (wrong!) answer is to count the number of times Snowden is mentioned today and compare it to how many times he was mentioned yesterday. The problem is that there might just be a lot more comments in total today, making all words more common. For example, if there are twice as many comments, the word ‘the’ will be used about twice as many times, and not because it’s more popular. (Side note: Evan Miller points out similar pitfalls in a great post about product ratings.)

So we definitely don’t want to measure that. Instead of counting the number of times Snowden is mentioned, let’s divide by the total number of words in the day’s comments to see how often he’s mentioned as a fraction of all words:

$$f_\textrm{snowden} = \frac{\#(\textrm{snowden})}{\textrm{total words}}$$

Now we can look at how often Snowden is mentioned today divided by how often he was mentioned yesterday, and maybe we find something like a 100x increase. Take a look at the chart at the top of the post for an example of this kind of data: Snowden shoots up whenever he’s in the news, and Christie rockets to the top after “bridge-gate”. There’s fairly constant discussion of “government”, while “insurance” and “cancer” tend to rise and fall together, presumably with health-related articles.

Now we’re measuring the right thing, but it’s still not quite the right way to identify trending words. The reason why is a bit subtle though, so first let’s take a detour through another interesting property of word frequencies.
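Here is a minimal Python sketch of the difference between comparing raw counts and comparing frequencies. The comments are made up, and `word_counts` and `relative_frequency` are hypothetical helpers for illustration, not the code that produced the chart.

```python
from collections import Counter
import re


def word_counts(comments):
    """Count lowercase word tokens across a list of comment strings."""
    counts = Counter()
    for text in comments:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def relative_frequency(counts, word):
    """Occurrences of `word` as a fraction of all words counted."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


# Toy data (invented): today simply has more comments than yesterday.
yesterday = ["snowden spoke", "the weather is nice today", "the bridge is closed"]
today = [
    "snowden is in the news",
    "what did snowden say",
    "snowden again",
    "the news is all snowden",
    "the weather",
]

counts_yesterday = word_counts(yesterday)
counts_today = word_counts(today)

# Wrong: raw counts grow just because there are more comments.
raw_ratio = counts_today["snowden"] / counts_yesterday["snowden"]

# Better: compare fractions of all words.
freq_ratio = (relative_frequency(counts_today, "snowden")
              / relative_frequency(counts_yesterday, "snowden"))

print(raw_ratio, freq_ratio)
```

The raw count ratio overstates the jump, because part of it comes from the larger number of words posted today. Note that the frequency ratio breaks down if the word never appeared yesterday, which is one more reason to look for something more robust.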

## Zipf Code

In 1935, a linguist by the name of George Kingsley Zipf noticed an interesting regularity in human language. Imagine taking a book (just about any book should work) and cutting it up into individual words. Next, take those thousands of little slips of paper (80,000-100,000 for a typical novel, quite a few more if you’re reading Tolstoy… or George RR Martin) and stack any repeated words on top of each other. Finally, sort those piles from left to right in order of tallest to shortest. Chances are, on your left you’ll have a whole bunch of ‘the’, and to the right of that, slightly shorter stacks with words like ‘and’, ‘of’, and ‘to’. What Zipf noticed was that the second biggest stack is about half the height of the first stack, while the third is about a third of the height, and so on, with the $N^\textrm{th}$ stack being roughly $1/N$ of the height of the first. It turns out this also works quite well for New York Times comments:

Zipf's law probability distribution. Word rank is 1 for the most common word, 2 for the second most common, and so on.

And this isn’t true only for English, but for just about any language that you care to look at. Even weirder, people aren’t quite sure why this works (though ideas about maximizing entropy suggest that it’s reasonable, given a few assumptions). Most amazingly of all, this sort of behavior (often known as a power law) shows up all over the place!
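The stacking procedure is easy to try on any text. Below is a small Python sketch that ranks words by count and puts each count next to the Zipf prediction (the top count divided by the rank). The helper name `zipf_table` is made up, and the word list is synthetic and deliberately built to fit the law exactly, so it only shows the mechanics; real text will match only approximately.

```python
from collections import Counter


def zipf_table(words, top=5):
    """Return (rank, word, count, zipf_prediction) for the most common words.

    The prediction for rank N is the count of the top word divided by N.
    """
    ranked = Counter(words).most_common(top)
    if not ranked:
        return []
    top_count = ranked[0][1]
    return [
        (rank, word, count, top_count / rank)
        for rank, (word, count) in enumerate(ranked, start=1)
    ]


# Synthetic "book": counts chosen to follow 12, 12/2, 12/3, 12/4.
words = ["the"] * 12 + ["and"] * 6 + ["of"] * 4 + ["to"] * 3

table = zipf_table(words)
for rank, word, count, predicted in table:
    print(rank, word, count, predicted)
```

To try it on real text, replace `words` with the tokens from a book or a day of comments and compare the last two columns.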

## Confidence is Sexy

Okay, back to finding trending words. Why not just take the word that increases most in frequency? The answer is related to sampling error and probability distributions. When we’re identifying trending words, we’re not really interested in some word that is very uncommon and just happens to be used five times today compared to once yesterday (words that only show up a few times are especially vulnerable to statistical fluctuations!). Instead, we’d like to be fairly sure that the words we’re identifying are showing up for a reason, not just random chance.

Fortunately, there’s a mathematically rigorous way to do exactly this, if we assume that the words we happen to find in comments are being drawn randomly from a probability distribution (like Zipf’s law above). It’s important that we don’t have to know the exact distribution; maybe it exists only in the collective thoughts of the New York Times readership. All we have to do is calculate a confidence interval for the frequency of each word, which is just a way of saying we need to pick a range of frequencies that we’re pretty sure contains the true frequency independent of random fluctuations. Or, to be more specific, we pick a range $(f_{min},f_{max})$ such that $f_{min} < f_\textrm{true} < f_{max}$ with probability $p>0.95$ (or 90% or 99% or whatever confidence level you want to choose). There’s quite a bit of detail that goes into just what the best way to calculate this is. For the stats geeks, I’m using the Wilson score interval, which strikes a good balance between “correctness” and ease of calculation. (Full disclosure: we assume that, for each candidate trending word $C$, every word $W$ in a comment is an independent trial that either equals $C$ or doesn’t, so the number of times $C$ appears follows a binomial distribution. That isn’t exactly true, but it’s close enough for what we’re doing.)
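For the curious, here is a sketch of the standard Wilson score interval in Python. It is written from the textbook formula rather than copied from the code behind this post; the function name and the example counts are made up, and `z=1.96` corresponds to the usual 95% confidence level.

```python
import math


def wilson_interval(count, total, z=1.96):
    """Wilson score interval for a binomial proportion count/total.

    z is the normal quantile for the chosen confidence level
    (1.96 for 95%). Returns (lower, upper).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = count / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total
                               + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to guard against tiny floating point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Same observed frequency (0.5%), very different amounts of evidence.
small_sample = wilson_interval(5, 1_000)
large_sample = wilson_interval(500, 100_000)
print(small_sample)
print(large_sample)
```

Both examples have the same observed frequency, but the interval for the small sample is much wider, which is exactly the uncertainty we want to account for with rare words.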

Once we have a confidence interval, we can use a conservative estimate for frequency by comparing the lower-bound estimate for today’s frequency with the upper-bound estimate for yesterday’s frequency. In other words, we’re answering the question: even if we’re really unlucky with our statistical error, how much can we confidently say the word frequency increased by?
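As a rough sketch of that comparison, the Python below divides the Wilson lower bound for today by the Wilson upper bound for yesterday. The function names and the counts are invented for illustration, and the actual code on GitHub may be organized differently.

```python
import math


def wilson_bounds(count, total, z=1.96):
    """Wilson score interval (lower, upper) for the proportion count/total."""
    p = count / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total
                               + z * z / (4 * total * total)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


def conservative_increase(count_today, total_today,
                          count_yesterday, total_yesterday, z=1.96):
    """Pessimistic estimate of the frequency ratio today / yesterday.

    Uses the lowest plausible frequency today and the highest plausible
    frequency yesterday, so values above 1 indicate a confident increase.
    """
    low_today, _ = wilson_bounds(count_today, total_today, z)
    _, high_yesterday = wilson_bounds(count_yesterday, total_yesterday, z)
    return low_today / high_yesterday


# Invented counts: both words have a naive 5x increase in frequency.
rare = conservative_increase(5, 100_000, 1, 100_000)
common = conservative_increase(500, 100_000, 100, 100_000)
print(rare, common)
```

With these made-up numbers, the rare word scores below 1, so we can’t confidently say it increased at all, while the common word keeps most of its 5x jump. A nice side effect is that the upper bound for yesterday is never zero, so a word that didn’t appear yesterday doesn’t cause a division by zero.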

To see the results of this, and also the potential poems identified by the machine learning algorithm, take a look here. As before, the code for this is all available on GitHub!