Comparison of letter positions in eight languages

Proofreader.com:

My May 27 blog post on the distribution of letters in English toward the beginning, middle and end of words seemed well received; it generated quite a few compliments and not a few requests to do the same for other languages. One reader was even inspired to do a similar project in French.

Since I already had the code, I thought, why not? The only problem was getting my hands on a corpus. You can read about my adventures in this regard, as well as some more esoteric analysis of this data set, on my other, geekier blog. Suffice it to say I was quite fortunate to find the Europarl Parallel Corpus, a collection of proceedings of the European Parliament with simultaneous translations in twenty languages. Since every language covers the same subject matter, we’re maximizing the chances that any differences we see are actually due to the language rather than to differences in the corpus.

I chose the seven languages with the most speakers in the European Parliament, plus Finnish, because I thought it would be interesting to have a non-Indo-European language to compare as well.

Note that accented letters are aggregated with their unaccented versions. This is not ideal, since many languages consider accented characters separate letters, but reducing everything to the Basic Latin alphabet is really the only way to make the datasets comparable.
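For the curious, here is a minimal sketch of what that kind of accent folding can look like in Python. It is not the code used for this post. The function name `fold_to_basic_latin` is made up for illustration, and it assumes that Unicode decomposition (NFD) followed by dropping combining marks is an acceptable way to merge accented letters with their base letters. Letters that have no such decomposition, like ø or ß, are simply dropped in this sketch, which a real analysis would probably want to handle explicitly.

```python
import unicodedata


def fold_to_basic_latin(word: str) -> str:
    """Hypothetical helper: reduce a word to lowercase a-z only.

    Assumption: accents are merged by NFD decomposition, then every
    combining mark is removed. Characters that still fall outside a-z
    afterwards (e.g. 'ø', 'ß', digits, punctuation) are discarded.
    """
    # Split each accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", word)
    # Drop the combining marks, keeping only the base characters.
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and keep only the 26 Basic Latin letters.
    return "".join(ch for ch in base_only.lower() if "a" <= ch <= "z")


if __name__ == "__main__":
    for sample in ["Égalité", "año", "Äiti", "smørbrød"]:
        print(sample, "->", fold_to_basic_latin(sample))
```

With this approach "Égalité" and "egalite" count as the same string of letters, which is exactly the trade-off described above: comparable datasets at the cost of treating é and e as one letter.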