Abstract: In languages without orthographic word boundaries, NLP models perform word segmentation, either as an explicit preprocessing step or as an implicit step in an end-to-end computation. This paper shows that Chinese NLP models are vulnerable to morphological garden path errors: errors caused by a failure to resolve local word segmentation ambiguities using sentence-level morphosyntactic context. We propose a benchmark, ERAS, that tests a model’s vulnerability to morphological garden path errors by comparing its behavior on sentences with and without local segmentation ambiguities. Using ERAS, we show that word segmentation models make garden path errors on locally ambiguous sentences but do not make equivalent errors on unambiguous sentences. We further show that sentiment analysis models with character-level tokenization make implicit garden path errors, even without an explicit word segmentation step in the pipeline. Our results indicate that models’ segmentation of Chinese text often fails to account for morphosyntactic context.