The Long Tail of the English Language

Words API:

In the English language, the most common words are incredibly common. Though there are at least 1 million words in the English language, “you”, “I”, and “the” account for 10% of the words we actually use. By the time you reach “is”, at number 10, you’ve covered 20%.

The top 100 most common English words account for over 50% of the words we use, which is about how many words a 2-year-old knows. A 3-year-old would probably know most of the top 1,000 words, which cover 75%. And by the 10,000th most common word, “remorse”, you’ve covered over 88% of the words we commonly use. That leaves a lot of words you don’t hear very much.
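The coverage figures above are cumulative shares: count every word occurrence, sort the words from most to least frequent, and add up the counts of the top N. Here is a minimal Python sketch of that arithmetic. The function name `coverage_by_rank`, the simple tokenizing pattern, and the sample sentence are all illustrative assumptions, not how the figures in this post were produced.

```python
from collections import Counter
import re


def coverage_by_rank(text, ranks=(10, 100, 1000)):
    """Map each rank N to the share of all word occurrences
    accounted for by the N most common words in `text`."""
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {rank: 0.0 for rank in ranks}
    # Occurrence counts ordered from most to least common word.
    freqs = [n for _, n in counts.most_common()]
    return {rank: sum(freqs[:rank]) / total for rank in ranks}


# Toy input, far too small to reproduce the real percentages.
sample = "the cat and the dog saw the bird and the fish"
coverage = coverage_by_rank(sample, ranks=(1, 2, 7))
for rank, share in coverage.items():
    print(f"top {rank}: {share:.0%}")
```

Run on a large body of everyday English instead of the toy sentence, the same calculation should show the shape described here: a steep climb over the first few ranks, then a long, slow approach to 100%.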

If you put word frequency on a graph, like the one below, you quickly see an interesting distribution called the Long Tail. It happens when a small number of items account for a disproportionate share of occurrences, as with the books that Amazon sells.