In my talk, I want to present the results of what is, to my knowledge, the largest cross-linguistic analysis of written language to date (Koplenig et al. 2022). To this end, I have trained a language-modelling algorithm to learn the rules and the vocabulary of 2,069 languages as represented in 6,513 different documents belonging to 41 parallel/multilingual corpora that cover a large variety of text types, e.g. religious texts, legal texts, subtitles for various movies and talks, newspaper texts, web crawls, Wikipedia articles, Ubuntu localization files, and translated example sentences from a free collaborative online database. By statistically inferring the entropy rate of each language model as an index of complexity, for both words and characters as information-encoding units, I show that the long-standing linguistic axiom that all languages are equally complex is likely wrong (Sampson 2009). In addition, I present evidence for a previously undocumented complexity-efficiency trade-off: languages that are more complex are also more efficient, as they tend to need fewer symbols to encode messages. I demonstrate that this trade-off predicts both geographic and linguistic distance between languages/doculects.
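To illustrate the core idea of using entropy rate as a complexity index: the harder a text is to predict, the more bits per symbol any model (or compressor) needs to encode it. The sketch below is not the study's actual method; it substitutes a general-purpose compressor (bz2) for the trained language models, purely to show how bits per character can serve as a rough predictability measure.

```python
import bz2
import random
import string

def entropy_rate_estimate(text: str) -> float:
    """Rough upper bound on the entropy rate in bits per character:
    compressed size in bits divided by the number of characters."""
    data = text.encode("utf-8")
    return 8 * len(bz2.compress(data)) / len(data)

random.seed(0)
repetitive = "ab" * 5000  # highly predictable: low entropy rate
jumbled = "".join(random.choices(string.ascii_lowercase, k=10000))  # near-random

low = entropy_rate_estimate(repetitive)
high = entropy_rate_estimate(jumbled)
```

A more predictable text yields a smaller compressed size and hence a lower estimated entropy rate, which is the intuition behind treating entropy rate as an index of linguistic complexity.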
Koplenig, Alexander, Sascha Wolfer & Peter Meyer. 2022. Human languages trade off complexity against efficiency. Preprint. In Review. https://www.researchsquare.com/article/rs-1462001/v1 (25 April, 2022).
Sampson, Geoffrey. 2009. A linguistic axiom challenged. In David Gil & Peter Trudgill (eds.), Language complexity as an evolving variable, 1–18. Oxford: Oxford University Press.