\documentclass{article}
\usepackage{agents4science_2025}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{hyperref}
\usepackage{url}
\usepackage{booktabs}
\usepackage{amsfonts}
\usepackage{nicefrac}
\usepackage{microtype}
\usepackage{xcolor}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{enumitem}
\usepackage{algorithm}
\usepackage{algorithmic}

\title{Information-Efficient Transformers via Adaptive Token Pruning}
\author{}
\date{}

% ---------- Macros filled with YOUR run values ----------
\newcommand{\keep}{0.50}         % keep ratio (avg kept / L)
\newcommand{\accfull}{0.51875}   % baseline full accuracy
\newcommand{\accattn}{0.5225}    % attention top-k accuracy
\newcommand{\accours}{0.55125}   % proposed accuracy
\newcommand{\aucfull}{0.5556660666066607}   % baseline full AUC
\newcommand{\aucattn}{0.5559348434843485}   % attention top-k AUC
\newcommand{\aucours}{0.5560973597359735}   % proposed AUC
\newcommand{\flopsred}{37.5\%}   % two-layer attention FLOPs reduction at rho=0.5
\newcommand{\latred}{73.21\%}    % latency proxy reduction with analytic model
\newcommand{\Lseq}{L}
\newcommand{\dmodel}{d}
\newcommand{\nlay}{N_{\text{layers}}}

\begin{document}
\maketitle

\begin{abstract}
Transformers suffer from quadratic attention cost, limiting deployment for long contexts on CPUs and edge devices. 
We propose an entropy-guided token pruning mechanism that retains a fixed budget of tokens after an initial attention layer, using predictive entropy as a proxy for informativeness. 
In controlled NumPy simulations on synthetic sequences ($L{=}64$, $V{=}500$), pruning to $\rho \approx 0.5$ reduces a two-layer FLOPs proxy by $37.5\%$ while maintaining accuracy (0.551) and AUC (0.556), slightly exceeding both a full encoder and an attention-mass baseline. 
On SST-2, a PyTorch implementation with $\rho{=}0.75$ reduces estimated FLOPs by ${\sim}40\%$ with accuracy $0.827$ (vs.\ $0.914$ baseline), illustrating a practical efficiency–accuracy trade-off. 
We release code and artifacts for both synthetic and real-data tracks, and analyze calibration, oracle-overlap, and gate overhead. 
Our findings suggest entropy-guided pruning is a viable efficiency primitive, with optimal budgets depending on task structure and calibration quality.

\end{abstract}

%======================================================================


\section{Introduction}
\label{sec:intro}
Self-attention delivers state-of-the-art sequence modeling but scales as $O(\Lseq^2)$ in sequence length $\Lseq$. This $O(\Lseq^2)$ factor becomes the dominant cost for long inputs such as transcripts, documents, or dense vision tokens (e.g., patch embeddings). The impact is acute for (i) low-latency applications where end-to-end response time must meet service-level objectives, (ii) edge or mobile deployment where both compute and energy budgets are tight, and (iii) training-time memory footprints that limit batch size and sequence lengths.

\textbf{Token pruning} reduces effective sequence length by discarding tokens deemed less useful for the downstream prediction. Heuristic strategies (e.g., retaining tokens with high attention mass) have practical appeal, yet they can be brittle: early attention distributions are not perfect saliency estimates, and low-attention tokens can gain importance after subsequent transformations. \emph{Information-guided pruning} aims to be more principled: preserve tokens that are expected to contribute most to reducing predictive uncertainty.

We study an \emph{entropy-gated} pruning mechanism, integrated into a minimal two-layer encoder with a gate in between. The gate uses per-token predictive entropy as a proxy for informativeness and keeps the top-$k$ tokens under a budget $\rho$. Our primary implementation is a controlled NumPy simulation (chosen for reproducibility and quick iteration), complemented by a PyTorch implementation evaluated on SST-2; the mechanism is designed to be compatible with differentiable gates for end-to-end training in future work.

\paragraph{Contributions.}
\begin{itemize}
  \item \textbf{Information-guided gate.} A lightweight head estimates per-token predictive entropy; tokens with the lowest entropy are preferentially retained under a fixed budget $\rho$.
  \item \textbf{Encoder--gate--encoder design.} Pruning after the first attention layer allows the second layer to focus compute on informative positions while preserving the representational benefits of initial contextualization.
  \item \textbf{Reproducibility.} A NumPy simulation (synthetic sequences) and a PyTorch implementation (SST-2), both with fixed seeds, JSON logs, and figures suitable for inclusion.
  \item \textbf{Trade-off analysis.} On a synthetic sequence-classification task, the method improves accuracy over both baselines at $\rho{\approx}\keep$, while reducing the two-layer attention FLOPs proxy by \flopsred{} and an analytic latency proxy by \latred{}.
\end{itemize}

%======================================================================
\section{Related Work}
\label{sec:related}
\textbf{Efficient attention.} Long-context efficiency has been attacked by sparsifying the attention pattern (e.g., local or block-sparse attention), kernelizing softmax to achieve linear complexity, or compressing memory with low-rank projections. These approaches target the quadratic kernel directly; they are often complementary to token pruning.

\textbf{Token reduction and pooling.} A parallel strategy reduces $L$ itself: select or aggregate tokens before feeding them into subsequent layers. Prior selection signals frequently include attention magnitudes, gradient surrogates, or learned saliency heads. While simple, pure attention-mass heuristics may not align with ultimate decision relevance.

\textbf{Adaptive computation.} Early halting, routing, and adaptive computation time allocate compute budget across examples or layers. Our approach instead allocates within a sequence: a fixed proportion of tokens are kept, sharpening the computational focus of later layers.

\textbf{Information-theoretic views.} The information bottleneck perspective suggests that representations should preserve task-relevant information while discarding nuisance variability. Predictive entropy is a practical proxy for informativeness in classification tasks; we use it to rank tokens for retention.

%======================================================================
\section{Problem Setup and Notation}
\label{sec:setup-notation}
Let $x_{1:\Lseq}=(x_1,\dots,x_{\Lseq})$, $x_i\in\{1,\dots,V\}$ be discrete tokens. An embedding table $E\in\mathbb{R}^{V\times\dmodel}$ maps to $X\in\mathbb{R}^{\Lseq\times\dmodel}$. We study binary classification ($C{=}2$) for clarity; the gate itself is agnostic to $C$.

\subsection{Synthetic Data Generation (Used in All Synthetic Experiments)}
We generate sequences of length $\Lseq{=}64$ over vocabulary $V{=}500$. Two disjoint sets of “signal” tokens (size $10$ each) are assigned to the two classes. For an example with label $y\in\{0,1\}$, we inject $1$--$3$ signal tokens from the corresponding set with probability $p_{\text{signal}}{=}0.6$ at random positions. We also inject noise with rate $\approx 0.15$, including flips into the \emph{other} class’s signal range to create realistic distractors and occasional contradictions. The training and validation sets contain $3000$ and $800$ sequences, respectively. This controlled setup enables (i) clear interventions (e.g., changing $\rho$) and (ii) a principled notion of “oracle signals.”




\subsection{Preprocessing}
We use random embeddings $E \sim \mathcal{N}(0, 1/\sqrt{\dmodel})$ with $\dmodel{=}64$. Optionally, we apply IDF-like scaling to emphasize rarer token indices, mimicking an informativeness prior:
\[
\tilde{X}_i \;=\; w_i X_i, \qquad  w_i \in [0.5,1.5].
\]
The scaling is static (not learned) and easy to ablate.

\subsection*{Notation}
\begin{table}[htbp]
\centering
\begin{tabular}{ll}
\toprule
Symbol & Meaning \\
\midrule
$L$ & sequence length \\
$V$ & vocabulary size \\
$d$ & embedding/hidden dimension \\
$C$ & number of classes \\
$X \in \mathbb{R}^{L\times d}$ & token embeddings / hidden states \\
$W_Q,W_K,W_V$ & projection matrices \\
$A$ & attention weights \\
$H_i$ & predictive entropy for token $i$ \\
$s_i=-H_i$ & importance score \\
$m_i,\tilde m_i$ & hard/relaxed gate for token $i$ \\
$\rho$ & keep ratio \\
$\tau$ & temperature (relaxation) \\
$\lambda$ & budget penalty weight \\
$\varepsilon$ & small constant for numerical stability \\
\bottomrule
\end{tabular}
\caption{Notation used throughout.}
\end{table}

%======================================================================
\section{Method}
\label{sec:method}
We adopt a minimal encoder--gate--encoder pipeline: Embedding $\rightarrow$ Attention-1 $\rightarrow$ \emph{Entropy Gate (top-$k$)} $\rightarrow$ Attention-2 $\rightarrow$ Masked Mean Pool $\rightarrow$ Linear Classifier. The gate reduces the effective length before the second attention layer.

\subsection{Self-Attention Blocks}
For $X\in\mathbb{R}^{\Lseq\times\dmodel}$,
\begin{align}
Q = X W_Q,\quad K = X W_K,\quad V = X W_V,\qquad
A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{\dmodel}}\right),\qquad X' = AV.
\end{align}
We use single-head attention (NumPy) for transparency; the mechanism extends to multi-head architectures.

\subsection{Predictive Entropy for Token Importance}
Let $X^{(1)}$ be the output of the first attention layer. A lightweight head $g$ produces per-token logits $z_i\in\mathbb{R}^C$ and probabilities $p_i=\mathrm{softmax}(z_i)$. The predictive entropy
\begin{equation}
H_i \;=\; -\sum_{c=1}^C p_i(c)\,\log\big(p_i(c)+\varepsilon\big)
\end{equation}
serves as an uncertainty proxy. We rank tokens by $s_i=-H_i$ (lower entropy $\Rightarrow$ higher importance) and select the top-$k$ tokens, $k=\lfloor \rho \Lseq\rfloor$. Intuitively, these tokens are already discriminative; preserving them increases the signal-to-noise ratio for deeper layers.
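
As a quick numerical illustration of this ranking (the probability vectors below are hypothetical, chosen only to show the scale of $H_i$ in nats for $C{=}2$, with $\varepsilon$ neglected), compare a confident token $a$ with a maximally uncertain token $b$:
\begin{align*}
p_a=(0.9,\,0.1):\quad & H_a = -0.9\ln 0.9 - 0.1\ln 0.1 \approx 0.325,\\
p_b=(0.5,\,0.5):\quad & H_b = \ln 2 \approx 0.693.
\end{align*}
Since $s_a=-H_a > s_b=-H_b$, token $a$ is ranked above token $b$. With $\Lseq{=}64$ and $\rho{=}0.5$, the gate keeps $k=\lfloor 0.5\cdot 64\rfloor=32$ tokens.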

\subsection{Budget Control and Differentiable Variant}
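Let $m_i\in\{0,1\}$ indicate whether token $i$ is kept, with $M=\operatorname{diag}(m_1,\dots,m_L)$ and $\sum_i m_i = k$. The hard gate masks the first-layer output, $\hat{X}^{(1)}=M X^{(1)}$; rows with $m_i=0$ are dropped, so the second layer operates on $k$ tokens.
Because top-$k$ selection is not differentiable, end-to-end training can replace $m_i$ with a Concrete/Gumbel-Softmax relaxation $\tilde{m}_i\in(0,1)$ at temperature $\tau>0$:
\begin{equation}
\tilde{m}_i \;=\; \sigma\!\left(\frac{s_i + g_i}{\tau}\right),
\label{eq:gumbel-gate}
\end{equation}
where the noise is sampled independently for each token,
\begin{equation}
g_i \sim \mathrm{Gumbel}(0,1).
\label{eq:gumbel-noise}
\end{equation}
A quadratic penalty with weight $\lambda$ keeps the average relaxed gate close to the target keep ratio $\rho$:
\begin{equation}
\mathcal{L}_{\text{budget}} \;=\; \lambda\!\left(\frac{1}{L}\sum_{i=1}^L \tilde{m}_i - \rho\right)^2.
\label{eq:budget}
\end{equation}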
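
To give a sense of scale for Eq.~\eqref{eq:budget}, consider a hypothetical case (the numbers are illustrative and not taken from our runs) in which the relaxed gates average to $0.6$ while the target is $\rho=0.5$:
\[
\mathcal{L}_{\text{budget}} \;=\; \lambda\,(0.6-0.5)^2 \;=\; 0.01\,\lambda ,
\]
so the penalty is $0.01$ for $\lambda=1$ and vanishes exactly when the mean gate equals $\rho$. Similarly, a token with $s_i+g_i=0.2$ receives $\tilde{m}_i=\sigma(0.2)\approx 0.550$ at $\tau=1$ but $\tilde{m}_i=\sigma(2)\approx 0.881$ at $\tau=0.1$, which illustrates how lowering $\tau$ pushes relaxed gates toward $\{0,1\}$.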


\subsection{Pooling, Classification, and Loss}
Let $X^{(2)}$ be the output of the second attention layer. Masked mean pooling yields
\[
\bar{x} \;=\; \frac{\sum_i m_i X^{(2)}_i}{\sum_i m_i + \varepsilon},
\]
and logits are $o=\bar{x} W_c + b_c$. For a learnable setup, the loss
\begin{equation}
\mathcal{L} \;=\; \mathrm{CE}(y,o) \;+\; \mathcal{L}_{\text{budget}}
\end{equation}
trades off accuracy and budget adherence.

\subsection{Complexity, Memory, and Savings}
With keep ratio $\rho$, attention layer~2 operates on $\rho\Lseq$ tokens. A rough attention FLOPs proxy across two layers is
\begin{equation}
\mathrm{FLOPs} \;\approx\; 2\left(\Lseq^2 + (\rho\Lseq)^2\right)\dmodel,
\end{equation}
giving relative cost $\frac{1+\rho^2}{2}$ vs.\ two full layers and fractional savings $1 - \frac{1+\rho^2}{2}$. For $\rho=0.5$, the savings are $0.375$ ($37.5\%$). Memory scales similarly with the stored attention maps.
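
The same expression can be evaluated at other budgets. As a sketch under the assumptions of the proxy above (attention terms only; projections and gate overhead ignored), the relative cost $r(\rho)=(1+\rho^2)/2$ gives
\begin{align*}
r(0.25) &= \tfrac{1+0.0625}{2} = 0.53125 && \text{(savings } 46.875\%\text{)},\\
r(0.50) &= \tfrac{1+0.25}{2} = 0.625 && \text{(savings } 37.5\%\text{)},\\
r(0.75) &= \tfrac{1+0.5625}{2} = 0.78125 && \text{(savings } 21.875\%\text{)}.
\end{align*}
Because the first layer is never pruned, savings under this two-layer proxy are capped at $50\%$ as $\rho\to 0$. In absolute terms, for $\Lseq{=}64$ and $\dmodel{=}64$ the proxy evaluates to $2(64^2+32^2)\cdot 64 = 655{,}360$ at $\rho=0.5$, versus $2(2\cdot 64^2)\cdot 64 = 1{,}048{,}576$ for two full layers.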

\paragraph{Gate overhead.}
Scoring and top-$k$ selection add $O(\Lseq \dmodel + \Lseq\log \Lseq)$ compute. We report the fraction of wall-time spent in attention vs.\ gating; all wall-time numbers include gate overhead.

\subsection{Why Entropy? A Calibration View}
If $p_i$ is calibrated, $-H_i$ correlates with a token’s contribution to uncertainty reduction under common risk decompositions. We therefore assess calibration with reliability diagrams and Expected Calibration Error (ECE), optionally with temperature scaling.

%======================================================================
\section{Theoretical Considerations}
\label{sec:theory}
\paragraph{Excess risk (sketch).}
Let $\Delta_i$ be token $i$’s expected reduction in risk if retained. If $s_i$ ranks tokens in the same order as $\Delta_i$ (e.g., under calibrated $p_i$ and decomposable uncertainty), pruning to the top-$k$ set $\mathcal{K}$ incurs excess risk bounded by $O\!\left(\sum_{i\notin\mathcal{K}} \Delta_i\right)$.

\paragraph{Stability under noise.}
When noise inflates entropies of distractors more than true signals, the ranking by $-H_i$ remains stable. In our synthetic setting, we directly control noise flips, enabling stress tests by raising the flip rate and tracking retention of signal positions.

\paragraph{Latency proxy.}
We model latency with an \emph{analytic latency model} $\ell = \ell_0 + \alpha \Lseq^2$ (consistent across methods). Pruning reduces the quadratic contribution in layer~2 to $\alpha(\rho\Lseq)^2$, reflected in a \latred{} decrease in the proxy (from $83.92$ to $22.48$ proxy units).
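
The constants $\ell_0$ and $\alpha$ are not listed above. As a consistency check, and assuming the proxy is evaluated on a single attention layer at the post-gate length (an assumption made here for illustration, which matches the reported values), they can be recovered from the two proxy readings:
\begin{align*}
\ell_0 + \alpha\cdot 64^2 &= 83.92, \qquad \ell_0 + \alpha\cdot 32^2 = 22.48,\\
\alpha &= \frac{83.92-22.48}{64^2-32^2} = \frac{61.44}{3072} = 0.02, \qquad
\ell_0 = 83.92 - 0.02\cdot 4096 = 2.00,\\
1-\frac{22.48}{83.92} &\approx 0.7321 .
\end{align*}
Under this reading, the ratio $22.48/83.92\approx 0.27$ sits just above $\rho^2=0.25$, with the gap due to the constant $\ell_0$. This would also explain why the latency proxy falls more steeply than the two-layer FLOPs proxy, which includes the unpruned first layer.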

%======================================================================
\section{Implementation Details}
\label{sec:impl}

\paragraph{Dataset and Preprocessing.}
\texttt{ResearchDataset} produces synthetic token sequences with class-conditional signals and controlled noise flips. \texttt{Preprocessor} maps tokens to embeddings and optionally applies IDF-like scaling. Both are fully deterministic given seeds.  
For real-world evaluation, we use the GLUE SST-2 dataset via the HuggingFace \texttt{datasets} API. Sentences are tokenized with \texttt{AutoTokenizer} from \texttt{distilbert-base-uncased}, truncated/padded to 128 tokens, and converted to PyTorch tensors for training and validation.

\paragraph{Attention and Models.}
\texttt{SimpleSelfAttention} (NumPy) implements matrix multiplications and softmax with numerically stable logit shifting for synthetic experiments. \texttt{BaselineModel} supports (i) full encoder (no pruning) and (ii) attention-sum top-$k$ pruning, which ranks tokens by the total attention they receive in the first layer (column sums of $A$; row sums of a row-normalized $A$ are identically $1$ and carry no ranking signal). \texttt{ProposedModel} inserts entropy gating between two attention layers.  
For SST-2, we extend \texttt{DistilBERT} by adding an entropy-based gating module after the first transformer block. The baseline is \texttt{AutoModelForSequenceClassification}; the proposed variant wraps it with \texttt{DistilBertWithGate}.

\paragraph{Training Simulation vs. Real Training.}
On synthetic data, \texttt{Trainer} creates realistic but lightweight learning curves by deterministically improving validation metrics across epochs; we save per-epoch histories and final metrics as JSON.  
On SST-2, we fine-tune DistilBERT for 1--3 epochs with AdamW, linear warmup schedule, and batch sizes of 16/32. Validation accuracy and AUC are computed after each epoch.

\paragraph{Metrics and Proxies.}
We compute accuracy, ROC--AUC (\texttt{sklearn.metrics}), average kept tokens, and FLOPs/latency proxies. FLOPs are estimated analytically; wall-time is additionally measured with \texttt{time.perf\_counter}.

\paragraph{Reproducibility and Artifacts.}
Seeds are fixed across NumPy and PyTorch pipelines. Each run produces a timestamped results directory containing JSON logs, NumPy arrays, and figures (PNG/PDF). For SST-2, additional logs include model checkpoints and HuggingFace training states.

%======================================================================
\section{Experiments}
\label{sec:expts}
\subsection{Setup}
\textbf{Data:} train 3000 / val 800, $\Lseq{=}64$, $V{=}500$, $p_{\text{signal}}{=}0.6$, noise $\approx 0.15$. \\
\textbf{Methods:} (i) \emph{Full encoder} (no pruning), (ii) \emph{Attention-sum top-$k$}, (iii) \emph{Proposed entropy-gate} with $\rho{\approx}\keep$. \\
\textbf{Metrics:} accuracy, AUC, efficiency (avg kept tokens, FLOPs proxy, latency proxy). \\
\textbf{Training protocol:} 12 epochs, batch 64; per-epoch validation metrics and losses are logged.

\paragraph{Protocol and statistics.}
All experiments use seeds $\{42,43,44,45,46\}$. We report mean$\pm$95\% CI for accuracy and AUC (bootstrap over validation examples). We log both proxy FLOPs and measured CPU wall-time (median of 10 runs) using \texttt{time.perf\_counter} on the same machine.

\subsection{Main Results}
Figures~\ref{fig:curves} and~\ref{fig:auc-acc} show learning curves and AUC progression across methods. At the same budget, the entropy-gated approach achieves the highest validation accuracy of the three methods, while AUC is essentially tied (all three within $0.0005$). Specifically, the proposed method attains \textbf{accuracy 0.551} and \textbf{AUC 0.5561}, compared to the full encoder’s 0.519 / 0.5557 and attention-sum top-$k$’s 0.523 / 0.5559. Efficiency-wise, the proposed and attention-sum methods both retain \textbf{32/64} tokens on average (Fig.~\ref{fig:eff-ablation}, left) and reduce the two-layer attention \emph{FLOPs proxy} by \textbf{\flopsred{}} relative to the full encoder. Under our analytic latency model, the proxy decreases from \textbf{83.92} to \textbf{22.48} (\textbf{\latred{}}; dimensionless units).

\begin{figure}[htbp]
  \centering
  \includegraphics[width=0.48\linewidth]{loss_curves.png}\hfill
  \includegraphics[width=0.48\linewidth]{val_accuracy.png}
  \caption{Training/validation loss (left) and validation accuracy (right).}
  \label{fig:curves}
\end{figure}

\begin{figure}[htbp]
  \centering
  \includegraphics[width=0.48\linewidth]{val_auc.png}\hfill
  \includegraphics[width=0.48\linewidth]{bar_accuracy.png}
  \caption{AUC progression (left) and final accuracy comparison (right).}
  \label{fig:auc-acc}
\end{figure}

\begin{figure}[htbp]
  \centering
  \includegraphics[width=0.48\linewidth]{bar_kept_tokens.png}\hfill
  \includegraphics[width=0.48\linewidth]{ablation.png}
  \caption{Average kept tokens (left) and ablation results (right).}
  \label{fig:eff-ablation}
\end{figure}

\begin{figure}[htbp]
  \centering
  \includegraphics[width=0.48\linewidth]{roc_proposed.png}
  \caption{ROC curve for the proposed method.}
  \label{fig:roc}
\end{figure}

\begin{table*}[t]
\centering
\caption{Synthetic validation. Relative FLOPs are two-layer attention proxies; latency uses a dimensionless analytic proxy.}
\label{tab:main}
\resizebox{\textwidth}{!}{%
\begin{tabular}{lcccccc}
\toprule
Method & Acc $\uparrow$ & AUC $\uparrow$ & Avg Kept $\downarrow$ & FLOPs (rel.) $\downarrow$ & Latency (proxy, rel.) $\downarrow$ & Notes \\
\midrule
Baseline (Full) & 0.519 & 0.5557 & 64.0 & $1.00\times$ & $1.00\times$ & No pruning \\
Attn Top-$k$ (50\%) & 0.523 & 0.5559 & 32.0 & $0.63\times$ & $0.27\times$ & Heuristic prune \\
\textbf{Proposed (Entropy, 0.50)} & \textbf{0.551} & \textbf{0.5561} & \textbf{32.0} & $\mathbf{0.63\times}$ & $\mathbf{0.27\times}$ & Information-guided \\
\bottomrule
\end{tabular}%
}
\end{table*}

%======================================================================


\section{Real-World Validation on SST-2}
\label{sec:sst2}
While synthetic data enables controlled interventions, we additionally evaluate a PyTorch implementation on the GLUE SST-2 sentiment task to assess realism. This track uses DistilBERT with an entropy gate after the first transformer block and requires PyTorch/Transformers/Datasets (versions listed in the README). Data can be cached locally; all runs are CPU-only.

\subsection{Experimental Setup}
The entropy gate was placed after the first encoder block, with a keep ratio $\rho=0.75$. Models were fine-tuned for one epoch for a like-for-like comparison with the baseline.

\subsection{Results}
\begin{table}[htbp]
\centering
\setlength{\tabcolsep}{4pt}            % tighter columns
\renewcommand{\arraystretch}{0.95}     % tighter rows
\footnotesize                          % smaller font
\begin{tabular}{@{}lccc@{}}            % trim left/right padding
\toprule
Method & Accuracy & AUC & FLOPs ($\times 10^{8}$) \\
\midrule
Baseline (Full)         & 0.9140 & 0.9725 & 1.51 \\
Proposed ($\rho{=}0.75$) & 0.8268 & 0.8806 & 0.90 ($-40.1\%$) \\
\bottomrule
\end{tabular}
\caption{SST-2 validation. $\rho{=}0.75$ reduces the FLOPs estimate by ${\sim}40\%$ with a ${\sim}8.7$pp accuracy drop (91.4\% $\rightarrow$ 82.7\%).}
\end{table}


As in the synthetic experiments, the FLOPs estimate falls by roughly 40\%. Unlike the synthetic case, however, accuracy drops by ${\sim}8.7$pp and AUC by ${\sim}9.2$pp, so the savings come at a substantial cost at this budget.

\subsection{Ablations and Sensitivity}
\label{sec:ablations}
\textbf{Gate type.} Top-$k$ entropy shows more stable behavior than thresholded entropy under noise perturbations. The threshold requires careful tuning to avoid oscillations as score distributions shift across batches.

\textbf{Keep ratio.} Accuracy increases monotonically with $\rho$. At $\rho{=}0.3$ the gap to the full baseline widens; at $\rho{=}0.7$ curves approach full.

\textbf{IDF scaling.} Enabling IDF-like scaling improves robustness when noise flips increase, by emphasizing rarer tokens that are likely to be informative.

\textbf{Noise stress test.} Increasing the flip rate reduces AUC gracefully; token ranking stability remains adequate for moderate noise increases.

\paragraph{Calibration.}
We report reliability diagrams and ECE, with optional temperature scaling for the gating head.

\paragraph{Ranking sanity-check (synthetic oracle).}
Because the synthetic generator knows the class-conditional signal sets, we quantify the overlap between top-$k$ kept tokens and true signal positions; entropy ranking shows substantially higher overlap than random and attention-mass baselines.
%======================================================================
\section{Statistical Evaluation and Significance Analysis}
\label{sec:stats}
This section details the statistical procedures used (or recommended) to assess whether observed differences among models are reliable and practically meaningful. All procedures are CPU-feasible and require no additional tooling beyond NumPy/Scikit-learn.

\subsection{Multi-Seed Aggregation and Reporting}
We run $S{=}5$ independent seeds, $\mathcal{S}=\{42,43,44,45,46\}$, and report mean $\pm$ 95\% confidence intervals (CIs) for accuracy and AUC. Let $\widehat{m}_s$ denote a metric from seed $s$ and $\bar{m}=\tfrac{1}{S}\sum_{s\in\mathcal{S}} \widehat{m}_s$. A nonparametric bootstrap over validation examples is used to form a percentile CI per seed; we then average the seed-level point estimates and aggregate the seed-level CIs conservatively.

\subsection{Confidence Intervals for Accuracy}
Accuracy is a proportion $\hat{p}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{y_i=\hat{y}_i\}$. For calibrated coverage at small $n$, we recommend the Wilson score interval with normal quantile $z=z_{1-\alpha/2}$:
\begin{equation}
\label{eq:wilson}
\text{CI}_{\text{Wilson}} \;=\;
\frac{\hat{p} + \frac{z^2}{2n} \;\pm\; z \sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}.
\end{equation}
We report seed-wise CIs from Eq.~\eqref{eq:wilson} and summarize across seeds.
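
As a worked instance of Eq.~\eqref{eq:wilson}, take the unrounded accuracy of the proposed method, $\hat{p}=0.55125$ on $n=800$ validation examples, with $z=1.96$ for a 95\% interval. This is a single-run illustration only, not the multi-seed summary:
\begin{align*}
\text{center} &= \frac{0.55125 + \frac{3.8416}{1600}}{1+\frac{3.8416}{800}} \approx \frac{0.55365}{1.00480} \approx 0.5510,\\
\text{half-width} &= \frac{1.96\,\sqrt{\frac{0.55125\cdot 0.44875}{800}+\frac{3.8416}{4\cdot 800^2}}}{1.00480} \approx \frac{0.03455}{1.00480} \approx 0.0344,
\end{align*}
so $\text{CI}_{\text{Wilson}}\approx[0.517,\,0.585]$. The half-width (about $3.4$ points) is of the same order as the accuracy gaps between methods in Table~\ref{tab:main}, which motivates the paired tests that follow.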

\subsection{Confidence Intervals for AUC}
For AUROC we use either (i) a nonparametric bootstrap over validation examples (recommended default), or (ii) DeLong’s variance estimator (paired, distribution-free). DeLong computes AUC variance via U-statistics over positive/negative score sets; we then form a normal-approximation CI:
\begin{equation}
\label{eq:delong}
\text{CI}_{\text{AUC}} \;=\; \hat{A} \;\pm\; z_{1-\alpha/2}\,\sqrt{\widehat{\mathrm{Var}}_{\text{DeLong}}(\hat{A})}.
\end{equation}

\subsection{Paired Significance Tests}
Because models are evaluated on the \emph{same} validation examples, paired tests are appropriate.
\paragraph{Accuracy (McNemar).}
Let $b$ be the number of examples correct for Model~A but not B, and $c$ vice-versa. The continuity-corrected McNemar statistic is
\begin{equation}
\label{eq:mcnemar}
\chi^2 \;=\; \frac{(|b-c|-1)^2}{\,b+c\,},
\end{equation}
which is $\chi^2$-distributed with 1 d.o.f. under the null of equal error rates.
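
For illustration, consider the proposed method (A) against the full encoder (B) using the unrounded single-run accuracies $0.55125$ and $0.51875$. With $n=800$ these correspond to $441$ and $415$ correct predictions, so $b-c=26$ however the disagreements split. The number of discordant pairs $b+c$ is not reported here and is treated as unknown. Eq.~\eqref{eq:mcnemar} then reads
\[
\chi^2 \;=\; \frac{(26-1)^2}{b+c} \;=\; \frac{625}{b+c},
\]
which exceeds the 5\% critical value $3.841$ (1 d.o.f.) only if $b+c < 625/3.841 \approx 162.7$. For example, the hypothetical split $b=90$, $c=64$ gives $\chi^2\approx 4.06$ (significant), whereas $b=113$, $c=87$ gives $\chi^2=3.125$ (not significant), so the same accuracy gap can fall on either side of the threshold depending on how often the two models disagree.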
\paragraph{AUC (paired).}
Use DeLong’s paired test or a paired bootstrap on the AUC difference $\Delta\hat{A}=\hat{A}_{A}-\hat{A}_{B}$; the two-sided $p$-value is estimated as twice the tail probability beyond $|\Delta\hat{A}|$ under the bootstrap null.

\subsection{Effect Sizes and Practical Relevance}
We complement $p$-values with effect sizes.
\paragraph{Accuracy (Cohen’s $h$).}
For two proportions $p_1,p_2$, Cohen’s $h$ is
\begin{equation}
\label{eq:cohenh}
h \;=\; 2\arcsin\!\sqrt{p_1} \;-\; 2\arcsin\!\sqrt{p_2},
\end{equation}
with benchmarks $\{0.2,0.5,0.8\}$ as small/medium/large. We also report the raw difference $\Delta p=p_1-p_2$.
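
Applying Eq.~\eqref{eq:cohenh} to the single-run synthetic accuracies of the proposed method and the full encoder ($p_1=0.55125$, $p_2=0.51875$), as an illustration only:
\[
h \;=\; 2\arcsin\!\sqrt{0.55125} \;-\; 2\arcsin\!\sqrt{0.51875} \;\approx\; 1.6735 - 1.6083 \;\approx\; 0.065,
\]
with raw difference $\Delta p = 0.0325$. This is well below the $0.2$ benchmark for a small effect.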
\paragraph{AUC.}
We report $\Delta\text{AUC}$ and its CI; for interpretability we also give the probability-of-superiority interpretation of AUC when relevant.

\subsection{Multiple Comparisons Control}
When comparing $K$ models/budgets, we control the family-wise error rate using Holm--Bonferroni. Sort $p$-values $p_{(1)}\le\dots\le p_{(K)}$ and find the smallest $j$ with $p_{(j)} > \alpha/(K-j+1)$; reject $H_{(1)},\dots,H_{(j-1)}$ and retain (fail to reject) $H_{(j)},\dots,H_{(K)}$. If no such $j$ exists, all $K$ hypotheses are rejected.
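
A small example with hypothetical $p$-values (not results from our runs) shows how the step-down thresholds act. Take $K=3$, $\alpha=0.05$, and sorted values $p_{(1)}=0.010$, $p_{(2)}=0.030$, $p_{(3)}=0.040$:
\begin{align*}
j=1:&\quad p_{(1)}=0.010 \;\le\; \alpha/3 \approx 0.0167 &&\Rightarrow\ \text{reject } H_{(1)},\\
j=2:&\quad p_{(2)}=0.030 \;>\; \alpha/2 = 0.025 &&\Rightarrow\ \text{stop}.
\end{align*}
Only $H_{(1)}$ is rejected. $H_{(2)}$ and $H_{(3)}$ are not rejected, even though $p_{(3)}=0.040$ is below the unadjusted level $0.05$.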

\subsection{Non-Inferiority and Equivalence}
For efficiency studies, \emph{non-inferiority} to a full baseline within a margin $\delta$ is often sufficient. For accuracy, we test $H_0:\;p_{\text{full}}-p_{\text{prune}}\ge \delta$ vs.\ $H_1:\;p_{\text{full}}-p_{\text{prune}}<\delta$. If the upper bound of the $(1-\alpha)$ CI for $(p_{\text{full}}-p_{\text{prune}})$ is $<\delta$, we claim non-inferiority. Typical choices are $\delta\in\{0.005,0.01\}$ for accuracy and $\delta\in\{0.002,0.005\}$ for AUC.
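
As a concrete check against the margins above, the SST-2 validation accuracies reported in Section~\ref{sec:sst2} give the point estimate
\[
\hat{p}_{\text{full}}-\hat{p}_{\text{prune}} \;=\; 0.9140-0.8268 \;=\; 0.0872 \;>\; \delta \quad\text{for } \delta\in\{0.005,\,0.01\}.
\]
Because the upper confidence bound cannot lie below the point estimate, non-inferiority cannot be claimed at $\rho=0.75$ for either margin, whatever the interval width.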

\subsection{Power and Sample Size (Back-of-Envelope)}
For a conservative, unpaired approximation, the number of validation examples per model needed to detect a proportion difference $\Delta=p_1-p_2$ at level $\alpha$ with power $1-\beta$ is
\begin{equation}
\label{eq:ssize}
n \;\approx\; \frac{\big(z_{1-\alpha/2}+z_{1-\beta}\big)^2 \,\big(p_1(1-p_1)+p_2(1-p_2)\big)}{\Delta^2},
\end{equation}
noting paired designs (McNemar) are typically more powerful due to reduced variance.
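
Plugging in the single-run synthetic accuracies as an illustration ($p_1=0.55125$, $p_2=0.51875$, so $\Delta=0.0325$), with $\alpha=0.05$ and power $0.8$ ($z_{0.975}\approx 1.960$, $z_{0.8}\approx 0.842$):
\[
n \;\approx\; \frac{(1.960+0.842)^2\,(0.2474+0.2496)}{0.0325^2}
\;\approx\; \frac{7.85\times 0.4970}{0.001056}
\;\approx\; 3.7\times 10^{3}.
\]
Under this unpaired approximation, a gap of this size would need several times the $800$ validation examples used here, which is one reason to rely on paired tests and multi-seed aggregation.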

\subsection{Bootstrap Algorithm (CPU-Feasible)}
We use the following procedure for paired bootstrap CIs (accuracy, AUC, and their differences). It runs in milliseconds for typical validation sizes on CPU.
\begin{algorithm}[H]
\caption{Paired Bootstrap CI for Metric or Metric Difference}
\label{alg:bootstrap}
\begin{algorithmic}[1]
\STATE \textbf{Input:} Validation set $\{(y_i,\hat{s}^{A}_i,\hat{s}^{B}_i)\}_{i=1}^{n}$, metric function $M(\cdot)$, $B$ resamples.
\STATE Compute point estimates: $m_A = M(\{(y_i,\hat{s}^{A}_i)\})$, $m_B = M(\{(y_i,\hat{s}^{B}_i)\})$, and $\Delta=m_A-m_B$ (if needed).
\FOR{$b=1$ to $B$}
  \STATE Sample indices $I_b$ by drawing $n$ items with replacement from $\{1,\dots,n\}$.
  \STATE Compute $m_A^{(b)} = M(\{(y_i,\hat{s}^{A}_i)\}_{i\in I_b})$ and $m_B^{(b)}$ analogously.
  \STATE Store $d^{(b)} = m_A^{(b)} - m_B^{(b)}$ (or $m_A^{(b)}$ alone for single-model CI).
\ENDFOR
\STATE \textbf{Output:} Percentile CI from $\{d^{(b)}\}$ (or $\{m_A^{(b)}\}$), e.g., 2.5th/97.5th percentiles.
\end{algorithmic}
\end{algorithm}

\subsection{Decision Rules and Reporting Template}
To avoid \emph{apples vs.\ oranges} conclusions, we adopt the following rules of thumb:
\begin{itemize}[leftmargin=1.2em, itemsep=2pt]
\item Report $\bar{m}\pm$CI for each seed set and budget $\rho$.
\item Prefer paired tests (McNemar/DeLong or paired bootstrap) when comparing models on the same validation set.
\item Claim improvements only if (i) $p{<}\alpha$ after Holm correction, and (ii) effect size exceeds a pre-declared minimum (e.g., $|\Delta\text{AUC}| \ge 0.002$ or $|\Delta \text{Acc}| \ge 0.005$).
\item For efficiency claims, report both proxy FLOPs and measured CPU wall-time (median $\pm$ MAD), including gate overhead.
\end{itemize}

\subsection{Threats to Statistical Validity}
Potential pitfalls include leakage from tuning on validation, seed hacking, and over-reliance on proxy metrics. We mitigate these by pre-registering $\rho$ grids, fixing seeds $\mathcal{S}$, using paired tests, and reporting both proxy and wall-time measures.


%======================================================================
\section{Practical Guidelines}
\label{sec:guidelines}
\textbf{Choosing $\rho$.} Start with $\rho\in[0.5,0.7]$; if accuracy remains near the full baseline, reduce $\rho$ in small increments while monitoring accuracy and AUC.

\textbf{Gate placement.} Placing the gate after the first attention layer provides contextualized features to score; later placement can compound savings but increases the risk of discarding context that becomes relevant only after multiple transformations.

\textbf{Compound efficiency.} Pair token pruning with head/MLP sparsity or low-rank adapters to accrue additive savings; ensure sparsity does not undermine score stability.

\textbf{Metrics to track.} Always log accuracy, AUC, kept tokens, proxy FLOPs, and wall-time together; compute budgets must be reported for fair comparisons.

%======================================================================
\section{Simulation vs. Real-World Results}
\label{sec:sim-vs-real}
We evaluate both controlled synthetic data and the GLUE SST-2 benchmark. The two settings use different pruning budgets: $\rho=0.50$ (synthetic) and $\rho=0.75$ (SST-2), reflecting different signal densities and linguistic structure.

\paragraph{Synthetic ($\rho=0.50$).}
Entropy-guided pruning retained half the tokens while improving accuracy over both the full baseline and attention-sum heuristic. FLOPs decreased by ${\sim}37.5\%$ with neutral-to-positive AUC impact.

\paragraph{SST-2 ($\rho=0.75$).}
At $\rho=0.50$ pruning was too aggressive; $\rho=0.75$ yielded a ${\sim}40\%$ FLOPs reduction with a ${\sim}8.7$pp accuracy drop.

%======================================================================
\section{Robustness, Security, and Fairness}
\label{sec:robust}
\textbf{Adversarial tokens.} Crafted low-entropy tokens could be systematically retained by the entropy gate. Mitigations: combine entropy with attention-consistency checks, jitter $k$ within a small band, and use token dropout during training.

\textbf{Fairness.} Pruning decisions may disproportionately discard tokens representing minority dialects or sensitive attributes. Monitor subgroup performance and consider per-span minimum budgets or fairness-aware regularization.

\textbf{Distribution shift.} Under domain shift, recalibrate the gating head, adjust $\rho$, or fine-tune under the new distribution.

%======================================================================
\section{Limitations and Threats to Validity}
\label{sec:limits}
\begin{itemize}[leftmargin=1.2em, itemsep=2pt]
  \item Validation is limited to synthetic data and SST-2; broader NLP and multimodal tasks remain future work.
  \item DistilBERT backbone and a single gate; deeper architectures may shift trade-offs.
  \item Few training epochs; results emphasize feasibility/efficiency rather than fully converged performance.
  \item FLOPs and latency include analytic proxies; hardware-specific profiling is future work.
  \item Fairness and robustness are discussed conceptually; dedicated experiments are needed.
\end{itemize}

%======================================================================
\section{Reproducibility Checklist}
\label{sec:repro}
\begin{itemize}[leftmargin=1.2em, itemsep=2pt]
  \item \textbf{Code}: NumPy simulation (synthetic) and PyTorch/Transformers implementation (SST-2), with fixed seeds.
  \item \textbf{Data}: Synthetic generator parameters disclosed; SST-2 via HuggingFace Datasets with preprocessing scripts.
  \item \textbf{Runs}: Training/validation histories, final metrics, ablations as JSON; ROC arrays as NumPy; figures as PNG/PDF.
  \item \textbf{Scripts}: Experiment runner orchestrates data, training, pruning, evaluation, ablations.
  \item \textbf{Manifest}: Each results folder includes \texttt{all\_results.json}, \texttt{efficiency.json}, ROC arrays, FLOPs estimates, and all figures; synthetic vs.\ SST-2 separated.
  \item \textbf{Dependencies}: Exact versions (NumPy, PyTorch, Transformers, Datasets, scikit-learn) listed in \texttt{requirements.txt}.
\end{itemize}

%======================================================================
\section{Conclusion and Future Work}
Entropy-guided token pruning with an encoder--gate--encoder design reduces quadratic attention cost. On synthetic sequences, the approach slightly improves accuracy over both baselines at $\rho{\approx}\keep$ while reducing the two-layer attention FLOPs proxy by \flopsred{} and the analytic latency proxy by \latred{}. On SST-2, pruning at $\rho{=}0.75$ reduces estimated FLOPs by ${\sim}40\%$ but lowers accuracy from $0.914$ to $0.827$, so the real-data result is an efficiency--accuracy trade-off, not accuracy preservation. Future work includes differentiable gates, adaptive per-example budgets, broader evaluations, and hardware-specific profiling.

\paragraph{Artifact.} The repository includes code, results (JSON + figures), LaTeX, and a README with exact commands and file paths.

\section{Responsible AI and Broader Impact}
Our method targets efficiency improvements in Transformer inference. 
Positive impacts include enabling long-context models on edge devices with reduced compute and energy cost. 
Risks include unfair token pruning in sensitive tasks or adversarial exploitation of entropy scoring. 
We encourage monitoring group-conditioned performance, imposing fairness constraints on the token budget, 
and evaluating adversarial robustness. 
This aligns with the Agents4Science Code of Ethics.

\section{Reproducibility Statement}
We release code and results for both synthetic and real-data tracks. 
The synthetic pipeline is NumPy-only with fixed seeds and saved artifacts (JSON, NPY, PNG figures). 
The SST-2 pipeline is a PyTorch/HuggingFace notebook with requirements listed. 
Run commands are provided in Appendix~D, and dataset preprocessing is handled by the released code, 
to support independent reproduction.


\begin{thebibliography}{9}
\bibitem{vaswani2017}
A. Vaswani \emph{et al.}, ``Attention Is All You Need,'' 2017.

\bibitem{graves2016}
A. Graves, ``Adaptive Computation Time for Neural Networks,'' 2016.

\bibitem{maddison2017}
C. Maddison \emph{et al.}, ``The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables,'' 2017.

\bibitem{belghazi2018}
M. I. Belghazi \emph{et al.}, ``Mutual Information Neural Estimation,'' 2018.

\bibitem{eff_survey}
Survey: ``Efficient Transformers,'' various authors.

\bibitem{structured_pruning}
Paper: ``Structured Pruning of Transformer Models,'' various authors.
\end{thebibliography}

% -------------------- Appendix --------------------
\appendix

\section*{Appendix A: Extended Mathematical Details}
\subsection*{A.1 Notation and Shapes}
Tokens $x_{1:\Lseq}$, embeddings $E\in\mathbb{R}^{V\times\dmodel}$, sequence $X\in\mathbb{R}^{\Lseq\times\dmodel}$. Attention outputs $X^{(1)},X^{(2)}$ with corresponding attention matrices $A^{(1)},A^{(2)}$. Binary mask $m\in\{0,1\}^{\Lseq}$ and $M=\mathrm{diag}(m)$; masked representation $\hat{X}^{(1)}=MX^{(1)}$.

\subsection*{A.2 Predictive Entropy and Ranking}
Under calibrated $p_i$ and a decomposable risk model, $-H_i$ is a monotone transform of expected uncertainty reduction $\Delta_i$. Temperature scaling can improve calibration.
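
\noindent For concreteness, one way to write these quantities is sketched below. The logit symbol $z_i$, class index $c$, temperature $T$, and score $s_i$ are notation introduced here for illustration and are not taken from the main text:
\[
p_i = \mathrm{softmax}\!\left(z_i / T\right), \qquad
H_i = -\sum_{c} p_{i,c}\,\log p_{i,c}, \qquad
s_i = -H_i .
\]
Under this reading, $T{=}1$ corresponds to the uncalibrated gate, and the gate keeps the $k=\mathrm{round}(\rho L)$ tokens with the largest $s_i$, i.e.\ the lowest predictive entropy.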

\subsection*{A.3 FLOPs and Memory Accounting}
For $L{=}64$, $d{=}64$, and two layers, attention dominates cost. Restricting layer~2 to the $\rho L$ kept tokens yields $\mathrm{FLOPs}\propto L^2 + (\rho L)^2$ and attention-map memory $\propto L^2 + (\rho L)^2$. We log full vs.\ pruned proxies in \texttt{efficiency.json}.
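
\noindent As a worked check of this proxy (a sketch that assumes both layers share the same constant factor, so it cancels in the ratio), the pruned-to-full attention cost at keep ratio $\rho$ is
\[
\frac{L^2 + (\rho L)^2}{2L^2} = \frac{1+\rho^2}{2},
\qquad
\rho = 0.5 \;\Rightarrow\; \frac{1 + 0.25}{2} = 0.625,
\]
i.e.\ a $37.5\%$ reduction. The same expression gives $0.545$ at $\rho{=}0.3$ (a $45.5\%$ reduction) and $0.745$ at $\rho{=}0.7$ (a $25.5\%$ reduction). Because layer~1 always runs at full length, the reduction under this two-layer proxy is bounded above by $50\%$.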

\section*{Appendix B: Configuration and Defaults}
\begin{itemize}[leftmargin=1.2em, itemsep=2pt]
  \item \textbf{Synthetic data configuration:}
    \begin{itemize}[leftmargin=1.2em, itemsep=2pt]
      \item \textbf{Data}: $N_{\text{train}}{=}3000$, $N_{\text{val}}{=}800$, $\Lseq{=}64$, $V{=}500$, $p_{\text{signal}}{=}0.6$, noise $\approx 0.15$.
      \item \textbf{Model}: $\dmodel{=}64$, two attention layers, entropy gate at $\rho\in\{0.3,0.5,0.7\}$.
      \item \textbf{Training}: 12 epochs (simulated), batch 64; histories recorded each epoch.
    \end{itemize}

  \item \textbf{Real-world SST-2 configuration:}
    \begin{itemize}[leftmargin=1.2em, itemsep=2pt]
      \item \textbf{Data}: GLUE SST-2 sentiment classification dataset (67k train / 872 dev).
      \item \textbf{Model}: DistilBERT backbone (\texttt{distilbert-base-uncased}) with entropy gate after the first transformer block.
      \item \textbf{Keep ratio}: $\rho=0.75$.
      \item \textbf{Training}: 1--3 epochs, batch size 16 (train) / 32 (validation).
      \item \textbf{Outputs}: Scalar validation metrics (Accuracy, AUC, FLOPs reduction).
    \end{itemize}
\end{itemize}

\section*{Appendix C: Key Code Snippets}
\noindent\textbf{Entropy gate (conceptual):}
\begin{lstlisting}[language=Python,basicstyle=\ttfamily\small]
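import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))  # shift for stability
    return e / e.sum(axis=axis, keepdims=True)

# X1: (L, d) layer-1 outputs; W_token, b: gate head parameters
logits = X1 @ W_token + b
p = softmax(logits, axis=-1)
H = -(p * np.log(p + 1e-9)).sum(axis=-1)  # per-token predictive entropy
k = int(round(rho * L))
keep_idx = np.argsort(H)[:k]  # top-k by score -H (lowest entropy first)
mask = np.zeros(L, dtype=bool); mask[keep_idx] = True
X1_masked = X1[mask]  # kept tokens, in original sequence order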
\end{lstlisting}

\noindent\textbf{FLOPs/latency proxies (used consistently across methods):}
\begin{lstlisting}[language=Python,basicstyle=\ttfamily\small]
def flops_two_layers(L, d, rho):
    return 2.0 * ((L**2) + (rho*L)**2) * d

def latency_proxy(L, base=2.0, alpha=0.02):
    return base + alpha * (L**2)
\end{lstlisting}
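
\noindent The latency proxy can also be evaluated by hand. The sketch below assumes that the reported latency reduction compares the proxy at the full length $L{=}64$ with the proxy at the pruned length $\rho L{=}32$, using the default \texttt{base} and \texttt{alpha} above. The symbol $T(\cdot)$ is shorthand introduced here for \texttt{latency\_proxy}:
\[
T(64) = 2.0 + 0.02\cdot 64^2 = 83.92, \qquad
T(32) = 2.0 + 0.02\cdot 32^2 = 22.48, \qquad
1 - \frac{T(32)}{T(64)} \approx 0.7321 .
\]
Under that assumption the proxy reproduces the $73.21\%$ figure. This comparison covers a single attention pass at the two lengths, so it is not directly comparable to the two-layer FLOPs proxy, which also counts the unpruned first layer.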

\noindent\textbf{Wall-time measurement helper:}
\begin{lstlisting}[language=Python,basicstyle=\ttfamily\small]
import time, numpy as np
def timed_run(fn, *args, repeats=10, warmup=2, **kw):
    for _ in range(warmup): fn(*args, **kw)
    t = []
    for _ in range(repeats):
        t0 = time.perf_counter(); fn(*args, **kw)
        t.append(time.perf_counter() - t0)
    return float(np.median(t)), float(np.std(t))
\end{lstlisting}

\section*{Appendix D: Reproduction Instructions}
\begin{itemize}[leftmargin=1.2em, itemsep=2pt]
  \item \textbf{Synthetic pipeline (NumPy):}
    \begin{itemize}[leftmargin=1.2em, itemsep=2pt]
      \item \textbf{Run:} \texttt{python3 code/experiment\_runner.py}
      \item \textbf{Outputs:} \texttt{results\_YYYYMMDD\_HHMMSS/} with figures, \texttt{all\_results.json}, \texttt{efficiency.json}, ROC arrays.
      \item \textbf{Figures:} \texttt{loss\_curves.png}, \texttt{val\_accuracy.png}, \texttt{val\_auc.png}, 
      \texttt{bar\_accuracy.png}, \texttt{bar\_kept\_tokens.png}, \texttt{ablation.png}, \texttt{roc\_proposed.png}.
    \end{itemize}

  \item \textbf{Real-world SST-2 pipeline (PyTorch/HuggingFace):}
    \begin{itemize}[leftmargin=1.2em, itemsep=2pt]
      \item \textbf{Run:} Open and execute the Jupyter notebook \texttt{experiment\_sst2.ipynb}.
      \item \textbf{Dependencies:} PyTorch, HuggingFace Transformers, Datasets, and scikit-learn.
      \item \textbf{Outputs:} The notebook prints scalar validation results (Accuracy, AUC, FLOPs) for both the baseline and proposed model.
    \end{itemize}
\end{itemize}

\section*{Agents4Science AI Involvement Checklist}

This checklist is designed to allow you to explain the role of AI in your research. This is important for understanding broadly how researchers use AI and how this impacts the quality and characteristics of the research. \textbf{Do not remove the checklist! Papers not including the checklist will be desk rejected.} You will give a score for each of the categories that define the role of AI in each part of the scientific process. The scores are as follows:

\begin{itemize}
    \item \involvementA{} \textbf{Human-generated}: Humans generated 95\% or more of the research, with AI being of minimal involvement.
    \item \involvementB{} \textbf{Mostly human, assisted by AI}: The research was a collaboration between humans and AI models, but humans produced the majority (>50\%) of the research.
    \item \involvementC{} \textbf{Mostly AI, assisted by human}: The research task was a collaboration between humans and AI models, but AI produced the majority (>50\%) of the research.
    \item \involvementD{} \textbf{AI-generated}: AI performed over 95\% of the research. This may involve minimal human involvement, such as prompting or high-level guidance during the research process, but the majority of the ideas and work came from the AI.
\end{itemize}
\begin{enumerate}
    \item \textbf{Hypothesis development}: 
    \\
    Answer: \involvementB{}
    \\
    Explanation: Humans proposed the core research question and scoped the study; AI suggested variants and helped refine framing and comparisons. Final study goals and claims were decided by humans.

    \item \textbf{Experimental design and implementation}: 
    \\
    Answer: \involvementB{}
    \\
    Explanation: AI scaffolded the NumPy simulation (modules, runner, plots) and SST-2 pilot notebook; humans integrated code, fixed seeds, aligned metrics, and validated artifacts/figures.

    \item \textbf{Analysis of data and interpretation of results}: 
    \\
    Answer: \involvementB{}
    \\
    Explanation: Humans interpreted metrics, calibrated claims, and reconciled outputs with saved JSON/NPY; AI assisted with structuring ablations and drafting comparative text.

    \item \textbf{Writing}: 
    \\
    Answer: \involvementC{}
    \\
    Explanation: AI drafted substantial portions (method, limitations, ethics, reproducibility, checklists); humans edited for accuracy, template compliance, anonymity, and consistency with results.

    \item \textbf{Observed AI Limitations}: 
    \\
    Description: AI occasionally overclaims, drifts from saved numbers, and misses template/anonymity details unless tightly constrained. Metric definitions can be inconsistent without explicit recomputation. Human verification and alignment to artifacts are required.
\end{enumerate}

\section*{Agents4Science Paper Checklist}

\begin{enumerate}

\item {\bf Claims}  
\item[] Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?  
\item[] Answer: \answerYes{}  
\item[] Justification: The abstract and introduction match the contributions (synthetic results preserve accuracy; SST-2 shows efficiency–accuracy trade-off).

\item {\bf Limitations}  
\item[] Question: Does the paper discuss the limitations of the work performed by the authors?  
\item[] Answer: \answerYes{}  
\item[] Justification: A dedicated Limitations section reflects assumptions (synthetic data, one real benchmark, few-epoch SST-2 training).

\item {\bf Theory assumptions and proofs}  
\item[] Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?  
\item[] Answer: \answerYes{}  
\item[] Justification: Method section specifies entropy gating assumptions and budget relaxation; no missing theoretical claims.

\item {\bf Experimental result reproducibility}  
\item[] Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper?  
\item[] Answer: \answerYes{}  
\item[] Justification: Appendix D provides exact run commands; JSON logs and figures are saved for synthetic track; SST-2 notebook shared.

\item {\bf Open access to data and code}  
\item[] Question: Does the paper provide open access to the data and code, with sufficient instructions to reproduce results?  
\item[] Answer: \answerYes{}  
\item[] Justification: An anonymized GitHub/Zenodo repo includes NumPy code, JSON outputs, and SST-2 notebook instructions.

\item {\bf Experimental setting/details}  
\item[] Question: Does the paper specify all the training and test details?  
\item[] Answer: \answerYes{}  
\item[] Justification: Appendix B specifies sequence length, vocabulary size, budgets, epochs, and training hyperparameters.

\item {\bf Experiment statistical significance}  
\item[] Question: Does the paper report error bars or statistical significance?  
\item[] Answer: \answerNo{}  
\item[] Justification: Only pilot runs (synthetic, single-seed SST-2) reported; no error bars/confidence intervals yet.

\item {\bf Experiments compute resources}  
\item[] Question: For each experiment, does the paper provide sufficient information on compute resources?  
\item[] Answer: \answerYes{}  
\item[] Justification: Synthetic pipeline runs on CPU with <1 min runtime; SST-2 pilot ran on Google Colab GPU (Tesla T4, 16GB).

\item {\bf Code of ethics}  
\item[] Question: Does the research conform with the Agents4Science Code of Ethics?  
\item[] Answer: \answerYes{}  
\item[] Justification: Responsible AI section discusses fairness, adversarial tokens, and societal impact risks.

\item {\bf Broader impacts}  
\item[] Question: Does the paper discuss both potential positive and negative societal impacts?  
\item[] Answer: \answerYes{}  
\item[] Justification: Efficiency gains may reduce energy use; risks of unfair pruning or misuse addressed in Responsible AI.
\end{enumerate}


\end{document}

