Abstract: Speculative execution achieves 2–5× speedups in
text-only large language models but degrades to 1.5–2.5×
in multimodal systems. Through systematic investigation of
SmolVLM2 models (256M and 2.2B parameters) on Apple
Silicon’s unified memory architecture using the MLX framework, we
identify a performance inversion: the smaller model exhibits slower
text-only inference than its larger counterpart.
Our primary finding is that the 256M draft model runs 22.4%
slower than the 2.2B target (0.72 ms vs. 0.59 ms, p < 0.001, Cohen’s
d = 1.45) despite having 8.6× fewer parameters. Initial investigation
suggested that framework overhead dominated: fixed kernel-launch
costs would disproportionately penalize smaller models.
However, comprehensive kernel-level profiling revealed hardware
efficiency as the primary factor: the draft model achieves only
0.94% of peak GPU throughput while the target reaches 2.59%—a
2.8× efficiency gap. Small matrix operations fundamentally cannot
saturate the 20 GPU cores of the unified memory architecture, leaving
the smaller model slower despite its lower parameter count. This
hardware underutilization accounts for 60–80% of the performance gap,
with framework overhead contributing the remaining 20–40%.
Validation on real-world COCO images confirms that the findings
generalize to authentic visual content (21.4% slowdown). A precision
ablation shows the inversion varies with dtype: float32 exhibits a
24.6% slowdown, bfloat16 a 28.5% slowdown, and float16 a 39.2%
slowdown. Sequential speculation still achieves a 3.28× speedup
despite the inversion, while parallel speculation fails outright
(0.70×, a net slowdown) due to coordination overhead.
We provide: (1) the first direct measurement distinguishing
framework overhead from hardware efficiency in multimodal
performance inversion on unified memory; (2) a mechanistic
explanation of GPU utilization effects; (3) practical deployment
guidelines for edge AI systems; and (4) rigorous statistical validation
across 5 independent runs with very large effect sizes. While our
findings are specific to Apple Silicon and MLX, they establish that
smaller models are not universally faster: practitioners must
profile actual inference speed, accounting for hardware efficiency
in their deployment environment.
Index Terms—vision-language models, hardware efficiency,
GPU utilization, unified memory, Apple Silicon, edge deployment,
speculative execution