A Controlled Synthetic Benchmark for Educational Aspect-Based Sentiment Analysis

TMLR Paper9739 Authors

14 Jun 2026 (modified: 21 Jun 2026). Under review for TMLR. CC BY 4.0

Abstract: Large language models are increasingly used to synthesize labeled training data where annotation is scarce, but the generated labels are often unfaithful to the text, and it is difficult to tell whether a synthetic benchmark carries genuine learnable signal or merely reproduces label frequencies. We study both problems in educational aspect-based sentiment analysis (ABSA), a setting where real aspect-labeled feedback is private and costly to annotate, with a methodology that applies beyond it. We release a controlled corpus of 10,000 synthetic course reviews over a 20-aspect pedagogical schema, generated by sampling supervision targets separately from nuance attributes, so the same labels recur across varied course contexts and styles. We then introduce a faithfulness-aware pipeline: a cost-matched LLM audit scores how well each declared label is supported by the text, and that score both filters the training data and is itself validated across independent LLM judges and against human labels (Cohen's kappa 0.56 on two annotated corpora). Three controls establish that the corpus carries learnable signal rather than label priors: permuting the labels collapses detection to the trivial floor (0.182 versus 0.276 micro-F1), accuracy scales monotonically with training size (0.183 to 0.285), and restricting to faithfully labeled rows raises the ceiling. Evaluated by transfer to real human-annotated feedback (the unbiased metric), faithfulness-aware row filtering lowers transferred sentiment error on Herath across two architectures (paired 95% bootstrap CI excludes zero), and a lowest-faithfulness control collapses transfer on both real benchmarks (Herath and EduRABSA). Synthetic-only training recovers about 60% of a real-trained model, and synthetic pre-training followed by real fine-tuning exceeds real-only training. The audit-filter-validate recipe is a general tool for quality-controlling LLM-supervised data; the corpus and code are released.
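The abstract reports two statistics that can be illustrated concretely: Cohen's kappa between an LLM judge and human labels, and a paired bootstrap confidence interval on the change in transferred error. The following is a minimal sketch under stated assumptions, not the authors' released code: the function names `cohen_kappa` and `paired_bootstrap_ci`, the percentile bootstrap, the resample count, and the per-item error inputs are all illustrative choices that the abstract does not specify.

```python
import random


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    if expected == 1.0:
        return 1.0  # degenerate case: both raters used one identical label
    return (observed - expected) / (1.0 - expected)


def paired_bootstrap_ci(err_base, err_filtered, n_resamples=2000,
                        alpha=0.05, seed=0):
    """Percentile CI for the mean paired difference err_base - err_filtered.

    Assumes both lists hold per-item errors on the SAME evaluation items,
    so each resample draws items (pairs) with replacement.
    """
    if len(err_base) != len(err_filtered) or not err_base:
        raise ValueError("error lists must be non-empty and equal length")
    rng = random.Random(seed)
    diffs = [b - f for b, f in zip(err_base, err_filtered)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Under this sketch, an interval whose lower bound is above zero would correspond to the abstract's statement that the CI "excludes zero" in favor of the filtered model; the paper's own resampling scheme may differ.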

Submission Type: Long submission (more than 12 pages of main content)

Assigned Action Editor: ~Mirella_Lapata1

Submission Number: 9739
