Skip To The Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs

Published: 02 Mar 2026, Last Modified: 02 Mar 2026 · Sci4DL 2026 · CC BY 4.0
Keywords: Diffusion Language Models, Representation Analysis, Layer Skipping, Initialization Bias
TL;DR: Diffusion language models exhibit hierarchical, redundant representations that enable simple, task‑agnostic layer skipping, while autoregressive models remain brittle due to tightly coupled depth‑dependent representations.
Abstract: Autoregressive (AR) language models form representations incrementally through left‑to‑right prediction, whereas diffusion language models (dLLMs) are trained via full‑sequence denoising. Although recent dLLMs match AR performance, it remains unclear whether diffusion objectives fundamentally reshape internal representations across depth. We perform the first layer‑ and token‑wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR‑initialized dLLMs (Dream‑7B). We find that diffusion objectives induce more hierarchical abstraction, with substantial early‑layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, depth‑dependent representations. AR‑initialized dLLMs retain AR‑like representational dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this representational redundancy, we introduce a static, task‑agnostic, inference‑time layer‑skipping method requiring no architectural changes or KV‑cache sharing. Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code benchmarks, whereas AR models degrade sharply under comparable skipping. These findings link training objectives to representational structure and enable practical, cache‑orthogonal efficiency gains.
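The static layer-skipping idea described in the abstract can be illustrated with a minimal sketch (not the paper's actual code): a fixed set of layer indices, chosen offline from a redundancy analysis, is simply bypassed at inference time, with no retraining, architectural changes, or KV-cache sharing. The function name and the toy scalar "layers" below are hypothetical stand-ins for transformer blocks.

```python
# Illustrative sketch of static, task-agnostic, inference-time layer skipping.
# `skip_set` is fixed before inference (assumed chosen via redundancy analysis);
# skipped layers act as identity mappings, saving their FLOPs entirely.

def forward_with_skips(hidden, layers, skip_set):
    """Apply `layers` in order, bypassing any index in `skip_set`."""
    for i, layer in enumerate(layers):
        if i in skip_set:
            continue  # identity shortcut: hidden state passes through unchanged
        hidden = layer(hidden)
    return hidden

# Toy demo: scalar functions stand in for transformer blocks.
layers = [lambda h: h + 1, lambda h: h * 2, lambda h: h + 10, lambda h: h - 3]
full = forward_with_skips(0, layers, skip_set=set())      # ((0+1)*2+10)-3 = 9
pruned = forward_with_skips(0, layers, skip_set={1, 2})   # (0+1)-3 = -2
print(full, pruned)
```

Because the skip set is static and task-agnostic, the same pruned forward pass is reused across prompts and benchmarks, which is what makes the method orthogonal to caching-based speedups.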
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 51