# Models Tested

This file lists the model identifiers evaluated in our experiments.  
**API snapshot date:** 2025-05-05  
**Default generation settings (unless stated otherwise):** temperature=0, deterministic decoding.

## Baseline Models (recommended for quick reproduction)
- gemini-2.0-flash (baseline A)
- gpt-4.1 (baseline B)

## Additional Models
- gpt-4o
- gpt-4o-mini
- gpt-4.1-mini
- o3-mini
- o4-mini
- claude-3-5-sonnet
- claude-3-5-haiku
- claude-3-7-sonnet
- gemini-2.5-flash-preview
- gemini-2.5-pro-preview
- gemini-2.5-pro-exp
- gemini-1.5-flash-8b
- gemini-2.0-flash-thinking-exp
- deepseek-chat
- deepseek-reasoner
- qwen2.5-72b-instruct
- qwen2.5-32b-instruct
- qwen2.5-7b-instruct
- qwen2.5-3b-instruct
- qwen2.5-1.5b-instruct
- qwen2.5-0.5b-instruct
- qwen3-0.6b
- qwen3-1.7b
- qwen3-4b
- qwen3-8b
- qwen3-14b
- qwen3-32b
- qwen3-30b-a3b
- qwen3-235b-a22b
- qwen3-0.6b-thinking
- qwen3-1.7b-thinking
- qwen3-4b-thinking
- qwen3-8b-thinking
- qwen3-14b-thinking
- qwen3-32b-thinking
- qwen3-30b-a3b-thinking
- qwen3-235b-a22b-thinking
- llama-3.1-405b-instruct
- llama-3.1-70b-instruct
- llama-3.1-8b-instruct
- llama-3.2-90b-vision-instruct
- llama-4-maverick-17b-128e-instruct-maas
- llama-4-scout-17b-16e-instruct-maas
- grok-3-beta
- grok-3-mini-beta
- mistral-small-2503
- nvidia_llama-3.1-nemotron-ultra-253b-v1
- nvidia_llama-3.3-nemotron-super-49b-v1
- nvidia_llama-3.1-nemotron-ultra-253b-v1-thinking
- nvidia_llama-3.3-nemotron-super-49b-v1-thinking

---

**Notes**

- **API access & cutoff.** Models were accessed via API. For identifiers without an explicit date suffix, we used the latest API endpoint available as of the testing cutoff date: **May 5, 2025**.
- **Reasoning/CoT mode.** The `-thinking` suffix indicates that the model was evaluated with its native reasoning / Chain-of-Thought (CoT) mode enabled. The counterpart **without** the suffix was evaluated with this mode disabled.
- **Thinking budget.** For models with native reasoning modes, the thinking budget was left **uncapped (i.e., set to the maximum permitted within each model's context window)**.
- **Output length & latency (reasoning models).** Many reasoning models produce longer outputs (up to roughly 3× the number of input tokens) and respond with higher latency; some adopt step-by-step strategies (e.g., "count each value by each"), which further increases token usage.
- **Reasoning modes & decoding controls.** Some reasoning models do not support standard decoding controls (e.g., temperature, top_p, top_k). When these parameters were unsupported or ignored, we did not set them.
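
The `-thinking` suffix convention and the decoding-control rule described in the notes can be sketched in a few lines of Python. This is only an illustration, not the harness used in the experiments: `RunConfig`, `make_run_config`, and the `supports_decoding_controls` flag are hypothetical names, and whether a given model accepts `temperature` has to be checked against its provider's documentation.

```python
from dataclasses import dataclass, field

THINKING_SUFFIX = "-thinking"


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for one evaluation run."""
    base_model: str            # identifier with the "-thinking" suffix removed
    thinking: bool             # True if native reasoning / CoT mode is enabled
    params: dict = field(default_factory=dict)  # decoding parameters to send


def make_run_config(identifier: str,
                    supports_decoding_controls: bool = True) -> RunConfig:
    """Split a listed identifier into base model, reasoning flag, and params.

    Only a trailing "-thinking" counts as the suffix, so an identifier such as
    "gemini-2.0-flash-thinking-exp" is treated as a plain model name.
    """
    thinking = identifier.endswith(THINKING_SUFFIX)
    base_model = identifier[: -len(THINKING_SUFFIX)] if thinking else identifier

    params = {}
    if supports_decoding_controls:
        # Default generation setting stated at the top of this file.
        params["temperature"] = 0
    # Otherwise leave decoding parameters unset, as described in the notes.

    return RunConfig(base_model=base_model, thinking=thinking, params=params)


if __name__ == "__main__":
    print(make_run_config("qwen3-8b-thinking"))
    print(make_run_config("qwen3-8b"))
    # Assumption for illustration: this model ignores decoding controls.
    print(make_run_config("some-reasoning-model", supports_decoding_controls=False))
```

Matching on the trailing suffix only, rather than searching for the substring `thinking`, keeps identifiers that contain the word as part of the model name from being misclassified.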
