Geographical Erasure in Language Generation

Published: 07 Oct 2023 · Last Modified: 01 Dec 2023 · EMNLP 2023 Findings
Submission Type: Regular Long Paper
Submission Track: Ethics in NLP
Submission Track 2: Theme Track: Large Language Models and the Future of NLP
Keywords: large language models, fairness, language generation, bias, world knowledge
TL;DR: Large language models underpredict certain countries, erasing them from dialogue. We measure and mitigate this effect.
Abstract: Large language models (LLMs) encode vast amounts of world knowledge. However, since these models are trained on large swaths of internet data, they are at risk of inordinately capturing information about dominant groups. This imbalance can propagate into generated language. In this work, we study and operationalise a form of geographical erasure wherein language models underpredict certain countries. We demonstrate consistent instances of erasure across a range of LLMs. We discover that erasure strongly correlates with low frequencies of country mentions in the training corpus. Lastly, we mitigate erasure by finetuning using a custom objective.
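The paper's exact measurement protocol is not reproduced on this page, but as a rough illustration of how "underpredicting certain countries" can be operationalised, the sketch below scores candidate country names as continuations of a prompt and normalises the resulting scores over the candidate set. The model choice (gpt2), the prompt wording, and the country list are illustrative assumptions rather than the authors' setup; a full evaluation would compare the resulting model distribution against a ground-truth reference (e.g., population shares) to flag countries the model assigns far less mass than the reference.

```python
# Minimal sketch (not the authors' code): estimate a language model's
# distribution over countries as continuations of a prompt, to inspect
# which countries are underpredicted relative to a reference distribution.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM could be substituted
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "I live in"  # assumed prompt wording, for illustration only
countries = ["the United States", "India", "Nigeria", "Germany"]  # toy subset

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    cont_ids = tokenizer(" " + continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    # Each continuation token is predicted from the preceding position.
    total = 0.0
    for pos in range(prompt_ids.shape[1], input_ids.shape[1]):
        total += log_probs[0, pos - 1, input_ids[0, pos]].item()
    return total

scores = torch.tensor([continuation_logprob(prompt, c) for c in countries])
model_dist = torch.softmax(scores, dim=0)  # normalise over the candidate set
for country, p in zip(countries, model_dist):
    print(f"P({country!r} | {prompt!r}) = {p:.3f}")
```

Comparing `model_dist` against a reference distribution over the same countries would then expose the underprediction the abstract describes; the paper's mitigation step finetunes the model with a custom objective to reduce that gap.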
Submission Number: 1562