Inference-Time Distillation: Cost-Efficient Agents Without Fine-Tuning or Manual Prompt Engineering

Vishnu Sarukkai; Asanshay Gupta; James Hong; Michaël Gharbi; Kayvon Fatahalian

Published: 02 Mar 2026, Last Modified: 02 Apr 2026

Venue: ICLR 2026 Workshop DATA-FM

Readers: Everyone

License: CC BY 4.0

Keywords: in-context learning, agents, cost

Abstract: Deploying LLM agents at scale typically requires choosing between quality and cost. Existing cost-reduction approaches fail to preserve agility: the ability to iterate rapidly without human time bottlenecks. Prompt engineering is brittle and slows iteration, while fine-tuning requires multi-day training and commitment to fixed designs; both are impractical for iterative workflows and time-sensitive batch jobs. We demonstrate that established inference-time techniques—dynamic in-context learning and self-consistency cascades—can be leveraged to shift the cost-accuracy Pareto frontier while preserving agility. Practitioners run the teacher on a small task subset to collect demonstrations, then immediately deploy a cheaper student on the remainder. At each step, the system retrieves relevant teacher demonstrations as in-context examples. When multiple student samples agree, we proceed; when they diverge, we fall back to the teacher. This requires no prompt engineering or training. On ALFWorld, we match teacher accuracy at 2.5x lower cost ($0.059 → $0.024 per episode). On AppWorld, we achieve 3.5x cost reduction while recovering 79% of teacher accuracy. Our empirical analyses provide guidance on key design choices: teacher database size, demonstration set size, retrieval strategy, and cascade thresholds. These analyses highlight inference-time levers for navigating cost-performance tradeoffs without sacrificing human development speed.
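
The per-step procedure described in the abstract (retrieve teacher demonstrations, sample the student several times, keep the answer on agreement, otherwise fall back to the teacher) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `retrieve`, `cascade_step`, `k`, `n_samples`, and `min_agree` are hypothetical, the word-overlap similarity is a stand-in for whatever retrieval strategy the paper uses, and `student` and `teacher` are assumed to be plain callables that wrap model calls.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# A demonstration is an (observation, teacher_action) pair. This layout is an
# assumption made for the sketch; the abstract does not specify a format.
Demo = Tuple[str, str]


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity, used here as a stand-in retrieval score."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def retrieve(db: Sequence[Demo], query: str, k: int) -> List[Demo]:
    """Return the k teacher demonstrations most similar to the query."""
    return sorted(db, key=lambda d: jaccard(d[0], query), reverse=True)[:k]


def cascade_step(
    observation: str,
    db: Sequence[Demo],
    student: Callable[[str, List[Demo]], str],
    teacher: Callable[[str], str],
    k: int = 2,
    n_samples: int = 3,
    min_agree: float = 1.0,
) -> Tuple[str, str]:
    """Run one step: return (action, source), source is 'student' or 'teacher'."""
    # Dynamic in-context learning: pick demonstrations for this step.
    demos = retrieve(db, observation, k)
    # Self-consistency: draw several student samples with the same context.
    samples = [student(observation, demos) for _ in range(n_samples)]
    action, votes = Counter(samples).most_common(1)[0]
    # Accept the student only if enough samples agree (hypothetical threshold).
    if votes / n_samples >= min_agree:
        return action, "student"
    # Samples diverge: fall back to the more expensive teacher.
    return teacher(observation), "teacher"


if __name__ == "__main__":
    # Toy database standing in for demonstrations collected from the teacher.
    toy_db = [
        ("you are in the kitchen", "open fridge"),
        ("you are in the hall", "go north"),
    ]
    # A toy student that copies the action of the top retrieved demonstration.
    copy_student = lambda obs, demos: demos[0][1]
    toy_teacher = lambda obs: "teacher action"
    print(cascade_step("in the kitchen now", toy_db, copy_student, toy_teacher))
```

In this sketch, raising `min_agree` or `n_samples` sends more steps to the teacher (higher cost, typically higher accuracy), while lowering them keeps more steps on the student; this corresponds to the cascade-threshold lever the abstract mentions.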

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 160
