Abstract: In contrast with their monolingual counterparts, little attention has been paid to the effects that
misspelled queries have on the performance of Cross-Language Information Retrieval (CLIR) systems. The
present work makes a first attempt to fill this gap by extending our previous work on monolingual retrieval in
order to study the impact that the progressive addition of misspellings to input queries has, this time, on the
output of CLIR systems. Two approaches for dealing with this problem are analyzed in this paper. The first is the
use of automatic spelling correction techniques, for which we consider two algorithms: one for
the correction of isolated words and another for correction based on the linguistic context of the
misspelled word. The second approach is the use of character n-grams both as index terms and
translation units, seeking to take advantage of their inherent robustness and language-independence. All these
approaches have been tested on a Spanish-to-English CLIR system, that is, Spanish queries run against English
documents. Real, user-generated spelling errors have been used under a methodology that allows us to study the
effectiveness of the different approaches to be tested and their behavior when confronted with different error
rates. The results show that classic word-based approaches are highly sensitive to misspelled queries,
although spelling correction techniques can mitigate such negative effects. The use of
character n-grams, on the other hand, provides great robustness against misspellings.
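
To illustrate why character n-grams tolerate misspellings, the following minimal Python sketch splits words into overlapping character n-grams and compares a misspelled term with its correct form. It is an illustration only, not the system evaluated in this work: the function names `char_ngrams` and `dice`, the choice of n = 4, and the use of the Dice coefficient as a similarity measure are all assumptions made for this example.

```python
def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of each word in text.

    Words no longer than n characters are kept whole (an assumption
    made for this sketch).
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def dice(terms_a, terms_b):
    """Dice coefficient between two collections of index terms."""
    set_a, set_b = set(terms_a), set(terms_b)
    if not set_a and not set_b:
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


correct = "tomatoes"
misspelled = "tomatos"

# Word-based indexing: the two forms are different terms, so nothing matches.
print(dice([correct], [misspelled]))

# N-gram indexing: the n-grams untouched by the error still match.
print(dice(char_ngrams(correct), char_ngrams(misspelled)))
```

With whole words as index terms, a single spelling error removes the term from the match entirely, whereas with n-grams only those n-grams that overlap the error are lost.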