Abstract: Human languages are often claimed to fundamentally differ from other communication systems. But what is it exactly that unites them as
a separate category? This article proposes to approach this problem – here termed the Zipfian Challenge – as a standard classification
task. A corpus of textual material from diverse writing systems and languages, as well as from other symbolic and non-symbolic systems,
is provided. This material is subsequently used to train and test binary classification algorithms, which assign the labels “writing” and “non-writing”
to character strings of the test sets. Performance is generally high, reaching 98% accuracy for the best algorithms. Human languages
turn out to have a statistical fingerprint: large unit inventories, high entropy, and few repetitions of adjacent units. This fingerprint can be
used to tease them apart from other symbolic and non-symbolic systems.
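As a rough illustration of the three fingerprint properties named above, the following minimal sketch computes them for a single character string. The function name `fingerprint` and the exact definitions used here (number of distinct characters, unigram Shannon entropy in bits, and the share of adjacent character pairs that are identical) are assumptions made for illustration; the abstract does not specify how the article itself defines or measures these features.

```python
import math
from collections import Counter


def fingerprint(text: str) -> dict:
    """Compute three simple statistics of a character string.

    Hypothetical helper: these definitions are illustrative assumptions,
    not necessarily the feature definitions used in the article.
    """
    if not text:
        raise ValueError("text must be non-empty")

    counts = Counter(text)  # frequency of each distinct unit (character)
    n = len(text)

    # Unigram Shannon entropy in bits per character.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Share of adjacent pairs in which both units are identical.
    pairs = n - 1
    repeats = sum(1 for a, b in zip(text, text[1:]) if a == b)
    repetition_rate = repeats / pairs if pairs > 0 else 0.0

    return {
        "inventory_size": len(counts),
        "entropy_bits": entropy,
        "repetition_rate": repetition_rate,
    }


if __name__ == "__main__":
    for sample in ("the quick brown fox jumps over the lazy dog", "aaaaabbbbbaaaaa"):
        print(sample, fingerprint(sample))
```

Values like these could serve as input features for a binary classifier of the kind described above, although which features and algorithms the article actually uses is not stated in this abstract.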