Keywords: Large Language Models, Output Length Prediction, Transformer Hidden States, Graph Neural Networks, Token-Level Regression, Instruction-Tuned Models, Layerwise Representation, Sequence Scheduling, LLM Interpretability, Latent Progress Estimation
TL;DR: We predict the number of tokens remaining in LLM outputs by modeling transformer hidden states with a graph-based regressor, enabling more efficient and interpretable generation.
Abstract: Large Language Models (LLMs) are typically trained to predict the next token in a sequence. However, their internal representations often encode signals that go beyond immediate next-token prediction. In this work, we investigate whether these hidden states also carry information about the remaining length of the generated output—an implicit form of foresight \cite{pal-etal-2023-future}. We formulate this as a regression problem where, at generation step $t$, the target is the number of remaining tokens $y_t = T - t$, where $T$ is the total output length.
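To make the setup concrete, here is a minimal sketch (an illustration, not code from the paper) of how the per-step regression targets $y_t = T - t$ can be constructed once hidden states for a $T$-token output have been extracted from the frozen LLM; the tensor names and sizes ($T = 120$, $d = 4096$) are hypothetical:

```python
import torch

# Hypothetical example: one frozen hidden-state vector per generation step
# for an output of T tokens (stand-in values, not real model activations).
T, d = 120, 4096
hidden_states = torch.randn(T, d)

# Regression target at step t is the number of tokens still to come:
# y_t = T - t, so y_1 = T - 1, ..., y_T = 0.
steps = torch.arange(1, T + 1)
targets = (T - steps).float()  # shape (T,)
```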
We propose two approaches: (1) an aggregation-based model that combines hidden states from multiple transformer layers $\ell \in \{8, \dots, 15\}$ using element-wise operations such as mean or sum, and (2) a \textit{Layerwise Graph Regressor} that treats layerwise hidden states as nodes in a fully connected graph and applies a Graph Neural Network (GNN) to predict $y_t$. Both models operate on frozen LLM embeddings without requiring end-to-end fine-tuning.
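The following sketch illustrates the second approach under stated assumptions: it is not the paper's implementation, and the use of PyTorch Geometric, `GCNConv` layers, the hidden width of 256, and mean pooling are all illustrative choices; only the idea of a fully connected graph over the hidden states of layers 8–15 comes from the abstract.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool


class LayerwiseGraphRegressor(nn.Module):
    """Sketch: treat the hidden states of layers 8..15 at one generation
    step as nodes of a fully connected graph and regress the number of
    remaining tokens. Layer types and dimensions are illustrative."""

    def __init__(self, d_model=4096, hidden=256, num_layers=8):
        super().__init__()
        self.conv1 = GCNConv(d_model, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = nn.Linear(hidden, 1)
        # Fully connected edge set over the selected transformer layers
        # (no explicit self-loops; GCNConv adds them internally).
        idx = torch.arange(num_layers)
        src, dst = torch.meshgrid(idx, idx, indexing="ij")
        mask = src != dst
        self.register_buffer("edge_index", torch.stack([src[mask], dst[mask]]))

    def forward(self, layer_states):
        # layer_states: (num_layers, d_model) — frozen hidden states of
        # layers 8..15 at a single generation step.
        x = torch.relu(self.conv1(layer_states, self.edge_index))
        x = torch.relu(self.conv2(x, self.edge_index))
        batch = torch.zeros(x.size(0), dtype=torch.long, device=x.device)
        pooled = global_mean_pool(x, batch)    # (1, hidden)
        return self.head(pooled).squeeze(-1)   # predicted remaining tokens
```

As in the abstract, the regressor consumes frozen hidden states only; the fully connected edge set lets message passing mix information across layers before pooling to a single remaining-length estimate.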
Accurately estimating remaining output length has both theoretical and practical implications. From an interpretability standpoint, it suggests that LLMs internally track their generation progress. From a systems perspective, it enables optimizations such as output-length-aware scheduling \cite{shahout2024dontstopnowembedding}. Our graph-based model achieves state-of-the-art performance on the Alpaca dataset using LLaMA-3-8B-Instruct, reducing normalized mean absolute error (NMAE) by over 50\% in short-output scenarios.
Archival Status: Archival
Paper Length: Short Paper (up to 4 pages of content)
Submission Number: 216