From TextBlob to LLM Agents: Sentiment Model Selection for B2B Technical Support with CSAT Ground Truth
Keywords: sentiment analysis, customer satisfaction, CSAT prediction, LLM agents, alignment tax, B2B technical support, model selection, deployed NLP systems, cross-vendor LLM comparison, imbalanced classification
TL;DR: Five-year study comparing 17 sentiment approaches for CSAT prediction in B2B support; dedicated LLM agents dominate, the most expensive model performs worst due to alignment-induced neutral bias, and 38% of dissatisfied customers are undetectable.
Abstract: We present a five-year case study of sentiment model selection for customer satisfaction (CSAT) prediction in B2B technical support. Our evaluation uses the complete population of CSAT-rated tickets from an enterprise software company: over 500 tickets comprising ${\sim}$2,500 customer comments from 100+ organizations over five years. We evaluate 17 approaches across 5 paradigms (lexicon, off-the-shelf transformers, NLI zero-shot, a multi-task LLM agent, and 12 dedicated LLM agents from 6 vendor families), plus 11 fine-tuning experiments (all achieving MCC $\leq 0$). Key findings: (1) a dedicated single-task LLM agent reduces neutral bias from 69% to 22%, improving MCC from $-0.018$ to $0.347$ ($p<0.001$); (2) our results are consistent with the "Alignment Tax" (Lin et al., 2024; Wu et al., 2025) in sentiment classification: Claude Opus 4.6 exhibits 41% neutral predictions and lower recall than the cheaper Haiku 4.5 ($p=0.003$); (3) ${\sim}$38% of dissatisfied customers are undetectable by all 12 LLMs because their tickets are administrative requests lacking emotional language; (4) Gemini 3 Flash achieves the best MCC ($0.347$) at \$0.60/1K, over 100$\times$ cheaper than Claude Opus. We describe the three-phase production deployment and provide practitioner recommendations.
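The abstract's headline numbers are Matthews correlation coefficients (MCC), a metric chosen because it stays honest under the class imbalance typical of CSAT data (most comments are neutral or satisfied). As a minimal illustration of why MCC is the right yardstick here, the sketch below computes MCC from binary confusion-matrix counts in pure Python; the toy labels are invented for the example and not drawn from the paper's data. Note that a degenerate classifier predicting a single class for everything (the "all-neutral" failure mode the paper describes) yields MCC $= 0$ by convention, whereas accuracy would still look deceptively high.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = dissatisfied).

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),
    defined as 0.0 when any marginal count is zero (degenerate predictor).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy example (hypothetical labels, not the paper's data):
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]   # one miss, one false alarm -> MCC = 1/3
print(round(mcc(y_true, y_pred), 4))

# An "always predict majority class" model scores 0 despite 50% accuracy here:
print(mcc(y_true, [0, 0, 0, 0, 0, 0]))
```

The zero-denominator convention matches the common implementation choice (e.g. scikit-learn's `matthews_corrcoef`), which is why the paper can meaningfully report fine-tuning runs with MCC $\leq 0$: any score at or below zero means the model is no better than a constant or anti-correlated predictor.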
Submission Type: Deployed
Copyright Form: pdf
Submission Number: 423