Keywords: language models, steganography
TL;DR: We implement a benchmark of steganographic encoding schemes for a suite of language models and measure their capabilities under supervised fine-tuning.
Abstract: Steganography is the act of hiding information in plain sight. Text-based steganography is an AI safety concern: hidden messages embedded in ordinary language model outputs would undermine monitoring of those outputs. This paper asks which known steganographic encoding algorithms from the wider literature can be trained as a model capability via supervised fine-tuning (SFT). We introduce \textsc{StegoBench}, a benchmark of 18 text steganography schemes spanning lexical substitution, syntactic transformation, structural encoding, and distributional generation. Across our full fine-tuning runs, SFT succeeds when encoding reduces to local token-level choices. The strongest evidence comes from lexical schemes: the \textit{Chang \& Clark} scheme reaches a hidden bit error rate (BER) of 0.007--0.032, and the \textit{Shirali} scheme reaches a BER of 0.002--0.017. Schemes that require structural or distributional control are far less reliable, and several generative schemes collapse during training. Model scale is not the main determinant of performance: small and mid-sized models often match larger models on trainable schemes. \textsc{StegoBench} provides a standardised testbed for measuring when language models can be trained to produce hidden communication, and for comparing trainability, detectability, and failure modes across steganographic mechanisms.
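For readers unfamiliar with the metric, the bit error rate quoted above is conventionally the fraction of hidden payload bits that the decoder recovers incorrectly. The sketch below illustrates that standard definition only; the function name `bit_error_rate`, the example payload, and the equal-length requirement are assumptions made for illustration and are not taken from the \textsc{StegoBench} implementation.

```python
from typing import Sequence


def bit_error_rate(sent: Sequence[int], recovered: Sequence[int]) -> float:
    """Return the fraction of payload bits that differ after decoding.

    Assumption: both sequences have the same, non-zero length. How the
    benchmark treats truncated or over-long decodes is not stated in the
    abstract, so this sketch simply rejects them.
    """
    if len(sent) != len(recovered):
        raise ValueError("bit sequences must have equal length")
    if not sent:
        raise ValueError("bit sequences must be non-empty")
    # Count positions where the recovered bit disagrees with the hidden bit.
    errors = sum(s != r for s, r in zip(sent, recovered))
    return errors / len(sent)


# Hypothetical 8-bit payload with one bit flipped during decoding.
payload = [1, 0, 1, 1, 0, 0, 1, 0]
decoded = [1, 0, 1, 0, 0, 0, 1, 0]
ber = bit_error_rate(payload, decoded)
```

Under this definition, a BER of 0.007 corresponds to roughly 7 wrong bits per 1,000 hidden bits, while a BER near 0.5 on uniformly random payload bits would be no better than guessing.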
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 453