Keywords: comprehension, entropy, surprisal, reading times
Abstract: Recent studies of human sentence processing have reported evidence of psycholinguistic effects from the contextual entropy ${\sf H}(W_i \mid w_{1..i-1})$ of the current word $W_i$ being processed.
A word's contextual entropy is defined as its expected surprisal: ${\sf H}(W_i \mid w_{1..i-1}) = -\sum_{w \in V} {\sf P}(w \mid w_{1..i-1}) \log_2 {\sf P}(w \mid w_{1..i-1})$, where $V$ is the vocabulary.
This measure captures the predictive processing difficulty before each new word is encountered, in contrast to raw surprisal, whose effects can be understood as integration costs for an already observed word.
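As a minimal illustration of the two quantities, the sketch below computes entropy and surprisal over a toy next-word distribution (the vocabulary and probabilities are invented for illustration and do not come from any model):

```python
import numpy as np

# Toy next-word distribution over a four-word vocabulary (illustrative values only).
vocab = ["the", "cat", "sat", "mat"]
p = np.array([0.5, 0.25, 0.15, 0.10])

# Contextual entropy: expected surprisal over the whole vocabulary, in bits.
entropy = -np.sum(p * np.log2(p))

# Surprisal of one observed word, e.g. "cat": only computable after the word is seen.
surprisal_cat = -np.log2(p[vocab.index("cat")])

print(f"H = {entropy:.3f} bits, surprisal('cat') = {surprisal_cat:.3f} bits")
```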
Previous work estimates contextual entropy using a language model (LM) such as GPT2.
However, because words can span multiple subword tokens in an LM's vocabulary---making it intractable to sum probabilities over all possible words---entropy is typically calculated over each word's first token instead.
This practice systematically underestimates true word entropy, and the underestimation is magnified in contexts in which multi-token words are probable.
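For concreteness, a minimal sketch of this standard first-token estimate follows, assuming the Hugging Face transformers API and the GPT2-small checkpoint; it computes the entropy of the model's next-token distribution rather than of full, possibly multi-token, words:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def first_token_entropy(context: str) -> float:
    """Entropy (in bits) of the next-token distribution given the context,
    i.e. the usual first-token stand-in for word entropy."""
    ids = tokenizer(context, return_tensors="pt").input_ids
    log_probs = torch.log_softmax(model(ids).logits[0, -1], dim=-1)
    return float(-(log_probs.exp() * log_probs).sum() / torch.tensor(2.0).log())

print(first_token_entropy("The old house at the end of the"))
```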
To address this issue, we calculate LM-based entropy estimates using a Monte Carlo (MC) technique that randomly samples which token sequences to explore, and thus allows words to span multiple tokens.
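One way such a sampler might be implemented is sketched below; it assumes the Hugging Face transformers API, uses GPT-2's leading-space ('Ġ') convention to mark word boundaries, and is not necessarily the exact procedure used here:

```python
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def mc_word_entropy(context: str, n_samples: int = 512, max_tokens: int = 8) -> float:
    """Monte Carlo estimate (in bits) of next-word entropy: average surprisal of
    words sampled from the model, where a word may span several subword tokens."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    surprisals = []
    for _ in range(n_samples):
        ids = ctx_ids
        word_logprob = 0.0  # log P of the sampled word's tokens (natural log)
        for step in range(max_tokens):
            log_probs = torch.log_softmax(model(ids).logits[0, -1], dim=-1)
            tok = torch.multinomial(log_probs.exp(), 1)
            tok_str = tokenizer.convert_ids_to_tokens(int(tok))
            # A token with a leading 'Ġ' begins a new word; stop once the *next*
            # word starts.  (This sketch ignores the probability mass of the
            # boundary itself, a simplification of the full word probability.)
            if step > 0 and tok_str.startswith("Ġ"):
                break
            word_logprob += float(log_probs[int(tok)])
            ids = torch.cat([ids, tok.view(1, 1)], dim=-1)
        surprisals.append(-word_logprob / math.log(2))
    return sum(surprisals) / len(surprisals)
```

For efficiency, a practical implementation would batch the samples and reuse cached key/value states rather than rerunning the model token by token as above.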
We then evaluate the fit of the MC estimates to naturalistic reading times.
Mixed-effects regression experiments were conducted on five English reading time corpora: the Natural Stories and Brown corpora, containing self-paced reading (SPR) times; and the Dundee, Provo, and GECO corpora, containing first-pass (FP) and go-past (GP) durations from eye tracking.
Baseline predictors in the regression models included word length, word index, unigram surprisal, LM surprisal of the current and previous word (SPR, FP, and GP), and whether the previous word was fixated (FP and GP only).
Per-subject random slopes were initially included for all predictors, but some were removed to ensure convergence.
For each corpus and response type, the increase in log likelihood ($\Delta$LogLik) was calculated between a regression model containing only the baseline predictors, and a regression model additionally containing an entropy predictor (either first-token entropy or MC-based word entropy).
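A schematic version of this comparison is sketched below, using hypothetical column names and, for brevity, only a per-subject random intercept (via statsmodels' MixedLM), whereas the actual models also included per-subject random slopes:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame with one row per word and subject; the file name and
# column names are illustrative, not the corpora's actual release format.
df = pd.read_csv("reading_times.csv")

baseline = "rt ~ length + word_index + unigram_surprisal + surprisal + prev_surprisal"
full = baseline + " + entropy"  # either first-token entropy or MC word entropy

def loglik(formula: str) -> float:
    # Random intercept per subject only, for brevity.
    m = smf.mixedlm(formula, df, groups=df["subject"]).fit(reml=False)
    return m.llf

delta_loglik = loglik(full) - loglik(baseline)
print(f"Delta LogLik = {delta_loglik:.2f}")
```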
GPT2-small was the LM used to calculate entropy and surprisal predictors.
MC estimates of word entropy were based on 512 next-word samples.
Replacing first-token entropy with word entropy improved $\Delta$LogLik scores in the two self-paced reading corpora, although most eye-tracking corpora showed an opposite pattern (Table 1).
To evaluate whether the observed differences were significant, a permutation test was conducted over squared errors aggregated across all corpora; this showed a significant improvement ($p<0.01$) in $\Delta$LogLik from word entropy compared to first-token entropy.
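As one possible implementation, the sketch below shows a paired sign-flip permutation test on per-item squared-error differences; the exact test statistic is an assumption of this sketch:

```python
import numpy as np

def paired_permutation_test(sq_err_a, sq_err_b, n_perm=10_000, seed=0):
    """Two-sided paired permutation test on per-item squared errors from two
    regression models (e.g. first-token vs. MC word entropy), pooled across corpora."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(sq_err_a) - np.asarray(sq_err_b)
    observed = diffs.mean()
    # Under the null hypothesis, the sign of each paired difference is exchangeable.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = (signs * diffs).mean(axis=1)
    return float(np.mean(np.abs(null) >= abs(observed)))
```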
To illustrate the difference between the two entropy predictors, Figure 1 compares the average first-token and word entropies of words in the 10 most frequent part-of-speech categories in Natural Stories.
As expected, word entropy values are generally higher than first-token entropies, but the relative difference is greatest for nouns and adjectives (NN, NNS, and JJ), perhaps reflecting a wider range of multi-token words within these open-class categories.
The results suggest that LM-based predictors operating at the word level provide a closer match to human reading times than token-level predictors, even when the former can only be approximated based on sampling.
The reliable difference between the two predictors warrants caution against using first-token entropy in psycholinguistic modeling.
Submission Number: 49