Abstract: Vision models deployed on microcontrollers (MCUs) are quantized to integer-only arithmetic, which removes the ability to run backpropagation: the standard tool for adapting a model to the distribution shift (sensor noise, blur, lighting) it meets in the field. Existing forward-only test-time adaptation (TTA) methods either run only on server- or edge-GPU-class models (not true microcontroller integer execution), or require the batch-normalization (BN) layers that integer deployment fuses away. We present a forward-only TTA method that operates on deployed, BN-folded, integer-only convolutional networks. The key observation is that fusing BN into the preceding convolution, a mandatory step for integer inference, destroys the statistics that normalization-based adaptation relies on. We restore adaptation by re-normalizing each folded convolution’s per-channel output to its clean training statistics, using only forward-pass estimates. The method (i) recovers most of gradient-based TENT’s accuracy gain (+20.9 vs. +24.9 points) and matches forward-only BN adaptation, while being the only method that runs on a folded integer-only model; (ii) needs to adapt only 3 of 21 layers (selected without seeing the test corruptions) to recover 93% of the benefit; (iii) survives single-sample streaming with a batch-size-scaled momentum; and (iv) generalizes across three datasets (up to 200 classes) and two architectures. We validate true integer-only execution and deploy on an ESP32-S3, where, measured with a Nordic PPK2 power profiler, adaptation costs only 8.3 mJ (6.8% of inference energy) and 21.9 ms on the deployed SIMD-optimized model: forward-only adaptation is cheap on a real microcontroller.
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=A45I5p25dd
Changes Since Last Submission:
# Response to Reviewer LKDB
We thank Reviewer LKDB for an exceptionally careful and constructive review, and for recognizing the on-device energy measurement and the BN-folding adaptation gap as the core contributions. The two statistical observations were especially valuable and have directly improved the paper. All changes are in the revised PDF.
**1. Algebraic cancellation (Alg. 1).** Added as **Appendix A**. Writing a folded channel's clean output as $y_c=\gamma_c z_c+\beta_c$ (mean $\beta_c$, std $|\gamma_c|$ by construction) and the corruption as a per-channel affine map $\tilde y_c=a_c y_c+d_c$, we show that standardizing with the running $(\bar\mu_c,\bar\sigma^2_c)$ and rescaling to the clean statistics yields $\hat x_c=y_c$ exactly: the scale $a_c$ is divided out independently of its value and $d_c$ is subtracted off. This is the equivalence the reviewer describes, now explicit.
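The cancellation in point 1 can be checked numerically. The following is a minimal sketch in plain Python, not the paper's integer implementation: the values of `gamma`, `beta`, `a`, and `d` are hypothetical, the statistics are computed from the whole toy batch rather than from a running estimate, and it assumes $a_c > 0$ (a negative scale would flip the sign of the recovered deviation from the mean).

```python
import math
import random


def mean_std(values):
    """Population mean and standard deviation of a list of floats."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var)


def standardize_and_rescale(values, clean_mean, clean_std):
    """Standardize with the observed statistics, then map to the clean ones."""
    mu, sigma = mean_std(values)
    return [clean_std * (v - mu) / sigma + clean_mean for v in values]


rng = random.Random(0)
gamma, beta = -1.7, 0.4   # hypothetical folded BN scale and shift
a, d = 2.5, -3.0          # hypothetical corruption y_tilde = a * y + d, with a > 0

# Toy pre-affine activations, forced to exactly zero mean and unit std so that
# y has mean beta and std |gamma| "by construction", as in the response.
z = [rng.gauss(0.0, 1.0) for _ in range(4096)]
z_mu, z_sd = mean_std(z)
z = [(v - z_mu) / z_sd for v in z]

y = [gamma * v + beta for v in z]      # clean channel output
y_tilde = [a * v + d for v in y]       # corrupted channel output

x_hat = standardize_and_rescale(y_tilde, clean_mean=beta, clean_std=abs(gamma))
max_err = max(abs(p - q) for p, q in zip(x_hat, y))
print(f"max |x_hat - y| = {max_err:.2e}")  # floating-point round-off only
```

Changing `a` to any other positive value or `d` to any other offset leaves `max_err` at round-off level, which is the sense in which the scale is divided out and the offset subtracted off.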
**2. Law of Total Variance bias, now quantified.** The reviewer is exactly right that our EMA tracks $\mathbb E[\mathrm{Var}(X|\mathrm{Batch})]$ and omits $\mathrm{Var}(\mathbb E[X|\mathrm{Batch}])$. We added the decomposition to Appendix A **and measured the omitted term on the deployed model** (new Table 9, all 783 channels): for random batches of size $B$ it equals $(\sigma^2_\mu/B)(N{-}B)/(N{-}1)$, i.e. it shrinks as $1/B$, and the induced scaling error is $\approx f/2$, where $f$ is the omitted term as a fraction of total variance. **(a)** At the $B{=}64$ of our main results the omitted term is **0.11%** of total variance (**0.05%** scaling error), so the cancellation holds to high precision. **(b)** It grows as batch size falls, reaching **8.0%** at single-sample streaming on Gaussian noise (16% on contrast, 10% on fog), which is precisely the bs=1 collapse we report in §4.3/Fig. 5. The reviewer's decomposition is the theoretical explanation for that effect, and our window-matched momentum $m=\text{bs}/640$ holds this between-batch term constant as $B$ shrinks. We have added this connection to the paper.
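The decomposition in point 2 can be illustrated on toy data. The sketch below is an assumption-laden illustration, not the Table 9 measurement: it uses synthetic scalar activations (a hypothetical Gaussian population of 6400 values) rather than the deployed model's channels, splits them into random equal-size batches, and checks that total variance equals the mean within-batch variance plus the variance of the batch means. It also compares the exact scaling error $1-\sqrt{1-f}$ with the $f/2$ approximation.

```python
import math
import random


def pvar(xs):
    """Population variance (divide by len(xs))."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def decompose(values, batch_size, seed=0):
    """Return (mean within-batch variance, variance of batch means)."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    batches = [shuffled[i:i + batch_size]
               for i in range(0, len(shuffled), batch_size)]
    within = sum(pvar(b) for b in batches) / len(batches)
    between = pvar([sum(b) / len(b) for b in batches])
    return within, between


rng = random.Random(1)
activations = [rng.gauss(0.3, 2.0) for _ in range(6400)]  # hypothetical toy data
total = pvar(activations)

results = {}
for B in (64, 8):  # both divide 6400, so all batches have equal size
    within, between = decompose(activations, B)
    f = between / total                 # share of variance an EMA of batch variances misses
    scale_err = 1.0 - math.sqrt(1.0 - f)  # relative error in the estimated std
    results[B] = {"within": within, "between": between,
                  "f": f, "scale_err": scale_err}
    print(f"B={B:3d}  f={f:.4f}  scale_err={scale_err:.4f}  f/2={f / 2:.4f}")
```

In this toy setting the omitted share `f` grows as the batch size shrinks, and the scaling error stays close to `f / 2`, which matches the qualitative behavior described above.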
**3. Selective-layer recalibration vs. the no-data premise.** Clarified in §4.2. Selection is a **one-time, design-time step the developer runs before deployment** on source/held-out synthetic corruptions (never the test stream, never on-device), so it neither needs test-time data nor conflicts with the deployed-model premise. We rank on held-out corruptions and evaluate on a **disjoint** set; importance is largely corruption-independent (5/6 top layers shared), so it transfers without seeing the test shift. It is also **optional**: adapting all 21 layers needs no selection or calibration data and already recovers +18.4; selection is a cheap optimization on top.
**4. Memory and layer-by-layer operation.** Added to §4.7. The only added state is a per-channel running mean+variance per site: 784 values ≈ **6 KB** (≈1–2 KB for the selective top-3). Recalibration runs **layer-by-layer, in place**, reusing the activation buffer inference already holds, so it adds **no extra activation buffer and no increase in peak memory**.
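The state and in-place update described in point 4 can be sketched as follows. This is a floating-point illustration under stated assumptions, not the deployed integer firmware: the class `SiteStats`, the function names, the exponential-moving-average form, and the float32 storage (4 bytes each for mean and variance, 8 bytes per channel) are all hypothetical. Only the momentum $m=\text{bs}/640$ and the channel count of 784 are taken from the response.

```python
import math

WINDOW = 640  # from the window-matched momentum m = bs / 640


def window_matched_momentum(batch_size):
    return batch_size / WINDOW


class SiteStats:
    """Per-channel running mean and variance for one recalibration site."""

    def __init__(self, channels):
        self.channels = channels
        self.mean = [0.0] * channels
        self.var = [1.0] * channels

    def update(self, batch_mean, batch_var, batch_size):
        # Assumed EMA form: running += m * (batch - running).
        m = window_matched_momentum(batch_size)
        for c in range(self.channels):
            self.mean[c] += m * (batch_mean[c] - self.mean[c])
            self.var[c] += m * (batch_var[c] - self.var[c])

    def footprint_bytes(self, bytes_per_value=4):
        # Two values (mean, variance) per channel; float32 is an assumption.
        return 2 * self.channels * bytes_per_value


def recalibrate_in_place(buffer, stats, clean_mean, clean_std, eps=1e-5):
    """Overwrite buffer[c][i] with the recalibrated value; no second buffer."""
    for c, channel in enumerate(buffer):
        scale = clean_std[c] / math.sqrt(stats.var[c] + eps)
        for i in range(len(channel)):
            channel[i] = (channel[i] - stats.mean[c]) * scale + clean_mean[c]


all_sites = SiteStats(784)
print(all_sites.footprint_bytes())  # 6272 bytes under the float32 assumption
```

Under these assumptions the footprint is 2 × 784 × 4 = 6272 bytes, consistent with the ≈ 6 KB quoted above, and the recalibration writes back into the activation buffer it reads from.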
**5. Inference library / physical MCU.** §4.7 now states it explicitly: **ESP-NN** SIMD int8 kernels, and recalibration **ran on the physical ESP32-S3, not a simulator**: the 8.3 mJ / 21.9 ms are measured on the firmware whose logits match the integer reference exactly.
**6. Prior work (AdaBN, He & Cheng ECCV 2018) and pruning.** Added to Related Work. These recalibrate normalization forward-only **but on models that still contain BN layers**; our regime begins where folding has removed them. We also note (already stated in §1/§6) that we do not claim a new adaptation rule. FORGE is orthogonal to and composable with pruning (it recalibrates conv outputs regardless of sparsity), noted as an extension.
**7. Newer architectures.** The submission **already evaluates MobileNetV2** (depthwise-separable, inverted-residual): +20.5 CIFAR-10-C, +11.0 Tiny-ImageNet-C, so the method does not rely on vanilla convolutions; this is now stated explicitly (§4.4). ViTs use LayerNorm, which is not fused into convolutions; FORGE targets the BN-fold regime, and we now state this scope in Limitations.
Assigned Action Editor: ~Christopher_Mutschler1
Submission Number: 9819