Morphosyntactic Embeddings: Markov Transition Networks for Authorship in Morphologically Rich Languages

ACL ARR 2026 January Submission9858 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: authorship attribution, stylometry, word adjacency networks, Markov chains, Hindi NLP

Abstract: Traditional authorship attribution models, which typically rely on lexical frequencies, often struggle with the morphological richness and syntactic flexibility (scrambling) inherent to Indian languages such as Hindi. To address this, we present a framework that derives morphosyntactic embeddings from Word Adjacency Networks: directed, weighted graphs that model authorial style as the Markovian transition dynamics between function words. We treat these transition dynamics as stylistic signatures encapsulated in matrices $P$ and propose three vectorization functionals $\phi: P \to \mathbb{R}^d$ that map these signatures into discriminative embedding spaces: (A) spectral decomposition via Principal Component Analysis, yielding $\phi_{PCA}(P) \in \mathbb{R}^{100}$; (B) deep feature extraction using a Convolutional Neural Network, yielding $\phi_{CNN}(P) \in \mathbb{R}^{64}$; and (C) graph-theoretic feature extraction, yielding $\phi_{G}(P) \in \mathbb{R}^{615}$. Empirical evaluation on a large-scale corpus that we curated ($n=75{,}225$) demonstrates that $\phi_{CNN}$ significantly outperforms traditional methods, establishing a new state of the art for Hindi stylometry with 99.40% accuracy and 0.98 macro-F1 in 15-way authorship attribution, 95.46% accuracy in verification, and 98.91% accuracy in characterization.
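To make the idea of a transition matrix $P$ over function words concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not state the window size, edge weighting, or function-word list, so the fixed co-occurrence window, the unweighted counts, the three-word Hindi list, and the names `wan_transition_matrix` and `flatten` are all assumptions made for illustration.

```python
def wan_transition_matrix(tokens, function_words, window=3):
    """Build a row-normalized transition matrix over function words.

    Assumption (not from the paper): an edge i -> j is counted once each time
    function word j appears within `window` tokens after function word i.
    """
    index = {word: k for k, word in enumerate(function_words)}
    n = len(function_words)
    counts = [[0.0] * n for _ in range(n)]

    for pos, token in enumerate(tokens):
        if token not in index:
            continue
        # Look ahead at most `window` tokens for other function words.
        for following in tokens[pos + 1 : pos + 1 + window]:
            if following in index:
                counts[index[token]][index[following]] += 1.0

    # Normalize each row to sum to 1; rows with no outgoing edges stay zero.
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total for c in row] if total > 0 else [0.0] * n)
    return matrix


def flatten(matrix):
    """Flatten P row by row into a single feature vector (hypothetical helper)."""
    return [value for row in matrix for value in row]


# Toy example with an illustrative three-word function-word list.
function_words = ["ने", "को", "में"]
tokens = "राम ने सीता को वन में देखा".split()
P = wan_transition_matrix(tokens, function_words, window=3)
features = flatten(P)
```

In this toy sentence the only transitions within the window are ने to को and को to में, so those two rows each place all their mass on a single entry and the row for में stays zero. A flattened vector like `features` is the kind of input that a dimensionality reduction step such as PCA could operate on, though the paper's actual pipeline may differ.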

Paper Type: Long

Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining

Research Area Keywords: style analysis, graph-based methods, representation learning, language resources

Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources

Languages Studied: Hindi

Submission Number: 9858
