Abstract: Large language models (LLMs) have recently entered the public spotlight as powerful tools capable of generating fluent, relevant, and coherent text. We expect these models to have a significant societal impact as they are used for downstream tasks; however, research on these models to date has largely focused on English-language tasks, white-box approaches, or both.
In non-English and multilingual language models, one recurring issue is out-of-vocabulary (OOV) input, which arises frequently in character-diverse languages where tokenizers often fail to cover the full range of possible inputs. In the black-box setting, the lack of direct access to the LLM's internal representations makes it nontrivial to elicit useful responses to inputs containing OOV characters, or even to identify inputs where OOVs interfere with understanding. In our work, we propose a prompt-directed probing method to identify OOVs in a multilingual LLM (XGLM-7.5B), and assess a corresponding OOV patching method on a set of machine reading comprehension (MRC) tasks. Through experiments, we demonstrate that it is possible to both probe for and mitigate OOVs without access to the model's internals.
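To make the black-box probing idea concrete, below is a minimal sketch of one way prompt-directed OOV probing could work, assuming only text-in/text-out query access to the model. The `query_llm` callable, the echo-prompt template, the trial count, and the candidate character range are all illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch: prompt-directed OOV probing in a black-box setting.
# The intuition: a character the tokenizer cannot represent tends to be
# dropped or mangled when the model is asked to copy it back verbatim.

from typing import Callable, Iterable

def probe_oov(query_llm: Callable[[str], str],
              candidates: Iterable[str],
              n_trials: int = 3) -> set[str]:
    """Flag characters the model consistently fails to echo back."""
    suspected_oov = set()
    for ch in candidates:
        # Illustrative echo prompt; the real probe template may differ.
        prompt = f"Copy the next character exactly: {ch}\nCopy:"
        # Repeat the query to reduce noise from sampling variance.
        echoed = sum(ch in query_llm(prompt) for _ in range(n_trials))
        if echoed == 0:
            suspected_oov.add(ch)
    return suspected_oov

# Usage (hypothetical): probe a slice of Hangul syllables against
# whatever API wrapper `my_api_call` exposes for the deployed model.
# hangul = (chr(cp) for cp in range(0xAC00, 0xAC00 + 100))
# oov_chars = probe_oov(my_api_call, hangul)
```

A patching step could then rewrite flagged characters (e.g., via transliteration or description) before re-querying, though the specific patch used in the paper is not detailed here.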
Paper Type: short
Research Area: Multilinguality and Language Diversity
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches to low-compute settings-efficiency
Languages Studied: Japanese, Korean
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.