Morphosyntactic Embeddings: Markov Transition Networks for Authorship in Morphologically Rich Languages
Keywords: authorship attribution, stylometry, word adjacency networks, Markov chains, Hindi NLP
Abstract: Traditional authorship attribution models, which typically rely on lexical frequencies, often struggle with the morphological richness and syntactic flexibility (scrambling) of Indian languages such as Hindi. To address this, we present a framework that derives morphosyntactic embeddings from Word Adjacency Networks: directed, weighted graphs that model authorial style as the Markovian transition dynamics between function words. We treat these transition dynamics as stylistic signatures encoded in transition matrices $P$ and propose three vectorization functionals $\phi: P \to \mathbb{R}^d$ that map these signatures into discriminative embedding spaces: (A) spectral decomposition via Principal Component Analysis, yielding $\phi_{PCA}(P) \in \mathbb{R}^{100}$; (B) deep feature extraction using a Convolutional Neural Network, yielding $\phi_{CNN}(P) \in \mathbb{R}^{64}$; and (C) graph-theoretic feature extraction, yielding $\phi_{G}(P) \in \mathbb{R}^{615}$. Empirical evaluation on a large-scale corpus ($n=75{,}225$) that we curated demonstrates that $\phi_{CNN}$ significantly outperforms traditional methods, establishing a new state of the art for Hindi stylometry with 99.40% accuracy and 0.98 macro-F1 in 15-way authorship attribution, 95.46% accuracy in verification, and 98.91% accuracy in characterization.
Paper Type: Long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Research Area Keywords: style analysis, graph-based methods, representation learning, language resources
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources
Languages Studied: Hindi
Submission Number: 9858
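To make the abstract's central object concrete, the following is a minimal sketch of how a transition matrix $P$ over function words might be estimated from a tokenized text. It is an illustration under stated assumptions, not the authors' implementation: the function name `transition_matrix`, the `window` and `decay` parameters, the distance-discounted counting scheme, and the romanized function-word list are all hypothetical choices made for this example.

```python
import numpy as np


def transition_matrix(tokens, function_words, window=5, decay=0.75):
    """Estimate a row-stochastic matrix P over function words.

    Assumption (not from the paper): a function word at position t links to
    every function word within the next `window` tokens, with a weight that
    decays geometrically with distance.
    """
    index = {w: i for i, w in enumerate(function_words)}
    k = len(function_words)
    counts = np.zeros((k, k))

    for pos, tok in enumerate(tokens):
        if tok not in index:
            continue  # only function words act as source nodes
        src = index[tok]
        for d in range(1, window + 1):
            if pos + d >= len(tokens):
                break
            nxt = tokens[pos + d]
            if nxt in index:
                # closer co-occurrences contribute more weight
                counts[src, index[nxt]] += decay ** (d - 1)

    # Normalize each row to sum to 1; rows with no outgoing weight stay zero.
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)


# Hypothetical romanized example; a real pipeline would use Devanagari tokens
# and a much larger function-word inventory.
function_words = ["ne", "mein", "aur", "se"]
tokens = "raam ne ghar mein khaana khaya aur vah ghar se gaya".split()
P = transition_matrix(tokens, function_words)
```

Each nonzero row of `P` is a probability distribution over the next function word, so `P` can be read as the transition matrix of a Markov chain. Flattening `P` or treating it as a single-channel image would give one plausible input for vectorizers such as the PCA and CNN functionals named in the abstract.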