MID-L: Matrix-Interpolated Dropout Layer with Layer-wise Neuron Selection

Published: 22 Sept 2025, Last Modified: 25 Nov 2025
Venue: ScaleOPT Poster
License: CC BY 4.0
Keywords: Dynamic sparsity, conditional computation, input-aware gating, Top-k gating, straight-through estimator (STE), low-rank projection, efficient inference, GPU optimization, FLOPs reduction
TL;DR: MID-L is a plug-and-play layer that uses input-conditioned Top-k gating to interpolate between lightweight and full paths, reducing FLOPs and active neurons while matching or improving accuracy and robustness.
Abstract: Modern neural networks often activate all neurons for every input, leading to unnecessary computation and inefficiency. We introduce the Matrix-Interpolated Dropout Layer (MID-L), a novel module that dynamically selects and activates only the most informative neurons by interpolating between two transformation paths via a learned, input-dependent gating vector. Unlike conventional dropout or static sparsity methods, MID-L employs Top-k masking with a straight-through estimator (STE), enabling per-input adaptive computation while preserving end-to-end training. MID-L is model-agnostic and integrates seamlessly into existing architectures. Extensive experiments on six benchmarks, including MNIST, CIFAR-10, CIFAR-100, SVHN, UCI Adult, and IMDB, show that MID-L achieves up to a 55\% reduction in active neurons and 1.7$\times$ FLOPs savings while matching or exceeding baseline accuracy. We further validate the informativeness and selectivity of the learned neurons via Sliced Mutual Information (SMI) and observe improved robustness under overfitting and noisy data conditions. From a systems perspective, MID-L’s conditional sparsity reduces memory traffic and intermediate activation sizes, yielding favorable wall-clock latency and VRAM usage on GPUs; it is also compatible with mixed-precision/Tensor Core execution. These results position MID-L as a general-purpose, plug-and-play dynamic computation layer, bridging the gap between dropout regularization and GPU-efficient inference.
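The abstract describes the mechanism only at a high level; the PyTorch sketch below is one plausible reading of it, not the authors' implementation. The class name `MIDL`, the low-rank `lite` path, the `rank` and `keep` hyperparameters, and the sigmoid gate are illustrative assumptions; the Top-k mask and straight-through estimator follow the abstract's description.

```python
import torch
import torch.nn as nn


class MIDL(nn.Module):
    """Hypothetical sketch of a MID-L-style layer.

    Each output neuron interpolates between an expensive full linear path
    and a cheap low-rank path, steered by an input-conditioned Top-k gate
    trained with a straight-through estimator (STE).
    """

    def __init__(self, d_in: int, d_out: int, rank: int = 16, keep: float = 0.5):
        super().__init__()
        self.full = nn.Linear(d_in, d_out)  # expensive path
        self.lite = nn.Sequential(          # lightweight low-rank path
            nn.Linear(d_in, rank, bias=False),
            nn.Linear(rank, d_out, bias=False),
        )
        self.gate = nn.Linear(d_in, d_out)  # per-neuron gate logits
        self.keep = keep                    # fraction of neurons routed to the full path

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.gate(x))           # soft gate in [0, 1]
        k = max(1, int(self.keep * alpha.shape[-1]))
        idx = alpha.topk(k, dim=-1).indices
        hard = torch.zeros_like(alpha).scatter(-1, idx, 1.0)  # binary Top-k mask
        # STE: use the hard mask in the forward pass, the soft gate's gradient backward.
        mask = hard + alpha - alpha.detach()
        return mask * self.full(x) + (1.0 - mask) * self.lite(x)


# Usage: each input row gets its own mask, so computation is per-input adaptive.
layer = MIDL(512, 512, rank=32, keep=0.45)
y = layer(torch.randn(8, 512))
```

Note that this naive sketch still evaluates both paths densely; the FLOPs and memory-traffic savings claimed in the abstract would come from computing the full path only for the neurons the mask selects, e.g. via gather/scatter or structured-sparse kernels.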
Submission Number: 25