When Smaller Models Run Slower: Hardware Efficiency Bottlenecks in VLM Inference on Unified Memory

Published: 06 Jan 2026, Last Modified: 13 Feb 2026. IEEE CCWC 2026. License: CC BY 4.0
Abstract: Speculative execution achieves 2–5× speedups in text-only large language models but degrades to 1.5–2.5× in multimodal systems. Through a systematic investigation of SmolVLM2 models (256M and 2.2B parameters) on Apple Silicon's unified memory architecture using the MLX framework, we identify a performance inversion: smaller models exhibit slower text-only inference than their larger counterparts. Our primary finding is that the 256M draft model runs 22.4% slower than the 2.2B target (0.72 ms vs. 0.59 ms, p < 0.001, Cohen's d = 1.45) despite having 8.6× fewer parameters. Initial investigation suggested that framework overhead dominated: fixed kernel-launch costs would disproportionately affect smaller models. However, comprehensive kernel-level profiling revealed hardware efficiency as the primary factor: the draft model achieves only 0.94% of peak GPU throughput while the target reaches 2.59%, a 2.8× efficiency gap. Small matrix operations fundamentally cannot saturate unified memory's 20 GPU cores, making smaller models slower despite fewer parameters. This hardware underutilization accounts for 60–80% of the performance gap, with framework overhead contributing 20–40%. Real-world COCO validation confirms that the findings generalize to authentic visual content (21.4% slowdown). A precision ablation shows the slowdown varies with numeric format: float32 exhibits a 24.6% slowdown, bfloat16 a 28.5% slowdown, and float16 a 39.2% slowdown. Sequential speculation achieves a 3.28× speedup despite the inversion, while parallel speculation fails (0.70×, a net slowdown) due to coordination overhead. We provide: (1) the first direct measurement distinguishing framework overhead from hardware efficiency in multimodal performance inversion on unified memory; (2) a mechanistic explanation of GPU-utilization effects; (3) practical deployment guidelines for edge AI systems; (4) rigorous statistical validation across 5 independent runs with very large effect sizes.
While the findings are Apple Silicon/MLX specific, they establish that smaller models are not universally faster; practitioners must profile actual speeds, accounting for hardware efficiency, in their deployment environment.

Index Terms—vision-language models, hardware efficiency, GPU utilization, unified memory, Apple Silicon, edge deployment, speculative execution
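The peak-throughput percentages above can be reproduced with a standard back-of-envelope model: decode is matmul-dominated, so each generated token costs roughly 2 FLOPs per parameter, and utilization is the sustained FLOP rate divided by the device's peak. A minimal sketch follows; it assumes the latencies are per-token decode latencies, and the 30 TFLOP/s peak figure is an illustrative placeholder, not a measurement from the paper.

```python
# Hedged sketch: estimating achieved GPU utilization from decode latency.
# The peak-throughput value is a hypothetical placeholder.

def utilization(params: float, latency_s: float, peak_flops: float) -> float:
    """Fraction of peak GPU throughput sustained during autoregressive decode.

    Assumes ~2 FLOPs per parameter per generated token, a common
    back-of-envelope cost model for transformer inference.
    """
    flops_per_token = 2.0 * params
    achieved = flops_per_token / latency_s   # FLOP/s actually sustained
    return achieved / peak_flops

# Illustrative numbers: the paper's model sizes and latencies, with an
# assumed (not measured) 30 TFLOP/s peak.
u_draft = utilization(256e6, 0.72e-3, 30e12)   # 256M draft, 0.72 ms
u_target = utilization(2.2e9, 0.59e-3, 30e12)  # 2.2B target, 0.59 ms
print(f"draft:  {u_draft:.2%}")
print(f"target: {u_target:.2%}")
```

The structure of the inversion falls out directly: the larger model's per-token FLOP count grows much faster than its latency, so its sustained FLOP rate, and hence its utilization, is far higher even though each token takes slightly less wall-clock time.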