{
  "query": "I want to know how LLMs are actually being used in the gaming industry.",
  "query-id": 22,
  "gold_response": "Below is a study of how LLMs are used inside commercially released games (not dev tools), grouped by development scale, with concrete implementations and cost/latency notes.\n\n1) What’s shipping now (in-game LLMs only)\n\nLarge-scale (AAA/live service)\n- Justice Online / Justice Mobile (NetEase, China, live MMO)\n  - What: AI NPCs that hold free-form conversations, remember players, and can spawn side-quests. Rolled out in live servers (2023–2024 updates).\n  - How it’s used: LLM mediates open chat with lore-grounding; agent “proposes” quest steps that are validated and compiled into existing quest templates; persistent player–NPC memory.\n  - Notes on scale/budget: Full game is AAA-scale (multi-year, large team). LLM ops are a small slice of total budget but meaningful at MMO concurrency; see cost model below.\n\nSmall-scale / indie (shipped titles)\n- AI Dungeon (Latitude; web/mobile; subscription live since 2019)\n  - What: The LLM is the game master. Player types anything; the story/world react. 
Conversation is the core mechanic.\n  - How used: RAG for world lore, persistent memory per story, safety filtering, and action-to-state updates in the game engine.\n- AI Roguelite (Steam, 2023)\n  - What: LLM generates rooms/items/enemies and interprets free-form player actions that affect combat and environment.\n  - How used: LLM proposals → validator → compile into Unity commands/assets; summarizes state each turn.\n- Inworld Origins (Steam, 2022; free playable)\n  - What: Stealth/infiltration slice where NPCs converse freely and react believably.\n  - How used: Inworld’s character graph + LLM + guardrails; outputs intents/actions that drive NPC behaviors in-engine.\n\nModded/commercial games (community or server-operator features inside released titles)\n- Skyrim (via ChatGPT/Inworld/Convai NPC mods; live on Nexus/mod communities)\n  - What: Speak to any NPC; they converse in-character with memory.\n  - How used: STT → LLM persona+RAG → JSON intent + TTS; Papyrus/creature AI execute actions.\n- GTA V / FiveM roleplay servers (operator-integrated AI NPCs)\n  - What: Shops/clerks/informants you can question freely; some drive quest hints/world state.\n  - How used: LLM tool-use to call server scripts (e.g., open shop, spawn mission waypoint), with strict whitelists.\n\n2) Core in-game LLM patterns and concrete pipelines\n\nA) Dynamic NPC agent (brain-in-the-loop)\n- Goal: Free-form talk + believable actions, consistent with lore and systems.\n- Typical loop (per interaction):\n  1) Perceive: Engine snapshots relevant state → structured context (who, where, goals, affordances; last N turns; local objects).\n  2) Retrieve: RAG pulls relevant lore/quest facts from a vector DB (e.g., FAISS/Milvus/PGVector) using embeddings.\n  3) Reason: LLM with persona+policy prompt proposes (a) dialogue and (b) zero or more intents/actions.\n  4) Verify/Constrain: Rule checker validates intents against allowed API (grammar-constrained decoding or JSON schema). 
Optional cross-check with a second lightweight model (answer verification) or regex/policy engine.\n  5) Act: Engine executes whitelisted commands (e.g., move_to(x,y), open_door(id), give_hint(quest_id)). Dialogue streams to player via TTS.\n  6) Memorize: Summarize salient facts to short-term cache and append long-term memory selectively (episodic/per-player caps).\n- Implementation notes:\n  - Persona and canonical facts live in system prompt + RAG docs; keep prompts <8–12k tokens to manage latency/cost.\n  - Use function-calling or structured outputs to map LLM decisions to engine commands.\n  - For always-on NPCs, run a small sLM locally (intent gating) and escalate to a larger model only on demand.\n\nB) Conversation-as-core mechanic (text-first play)\n- Goal: The LLM is the rules arbiter and storyteller.\n- Loop:\n  1) Parse player input → classify intent (combat, explore, talk, craft).\n  2) Consult rules/lore via RAG; few-shot examples enforce tone and mechanics.\n  3) Produce outcome + state diff in JSON (e.g., {hp_delta:-5, inventory_add:[…]}).\n  4) Validate against balance/safety; clamp illegal moves; re-ask model if needed.\n  5) Render narrative + apply state; write concise memory summary to keep context small.\n- Notes: Deterministic gameplay-critical fields often use grammar-constrained decoding; narrative uses creative sampling.\n\nC) LLM-assisted procedural content that affects gameplay\n- Goal: Generate quests, items, or level variations that plug into existing systems.\n- Loop:\n  1) Designer seed or system trigger (biome X, tier 3 difficulty).\n  2) LLM drafts spec (quest beats, enemy mix, reward table) referencing allowed templates.\n  3) Validator enforces schemas, budget, and difficulty; rejects/reprompts until valid.\n  4) Compiler converts spec to engine assets (prefab IDs, navpoints, spawn tables).\n  5) Playtest bots or heuristic checks ensure no softlocks; cache approved content.\n\n3) How these are deployed (models, hosting, latency, 
costs)\n\nModels and middleware\n- Managed APIs: OpenAI (gpt-4o/4o-mini), Anthropic (Claude 3 family), Google (Gemini 1.5), plus regionals (e.g., ERNIE, iFlytek, NetEase Fuxi-backed stacks).\n- Open-source for on-prem: Llama 3.x (8B/70B), Mixtral/Mistral 7B. Serve with vLLM/TGI; quantize to 4–8-bit for throughput.\n- NPC middleware: Inworld, Convai, Charisma, Replica/PlayHT (voice). These provide persona graphs, safety, and tool APIs out of the box.\n\nLatency targets (player-facing)\n- Text reply: 300–800 ms first token; <2 s total for rich turns.\n- Voice: stream TTS; 300–500 ms to start speaking feels natural.\n- Tactics: smaller sLM locally for intent gating; defer to big model only when needed; cache recent Q/A and tool outputs.\n\nCost trade-offs (2024 pricing ballpark)\n- OpenAI gpt-4o-mini: ~$0.15/M input tokens, ~$0.60/M output tokens; good for real-time NPC.\n- Anthropic Claude 3 Haiku: ~$0.25/M in, ~$1.25/M out; low-latency, safe.\n- On-prem 8–13B models: with 1× A100/L4, effective <$0.10/M tokens at high utilization; engineering heavy but predictable.\n- Example MMO ops model (Justice-like feature): 200k DAU × 3 AI chats/day × (250 in + 250 out) tokens ≈ 300M tokens/day (150M in, 150M out). At 4o-mini: ≈$22.50/day in + $90/day out ≈ $112.50/day (excl. retrieval, safety, TTS). With voice and guardrails, total might land $1–3k/day. Spiky events need headroom.\n\n4) Guardrails, memory, and failure handling\n- Guardrails: system prompts with policies; grammar- or JSON-constrained decoding; allow-list tool APIs; profanity/PII filters (pre/post); answer verification (second-pass sLM).\n- Memory: sliding window (recent N turns), episodic per-player (key–value summaries), and long-term facts in a vector DB. 
Evict aggressively; summarize after each turn.\n- Hallucination control: RAG over authoritative lore; refuse outside-canon questions with helpful redirections; temperature split—low for facts, higher for flavor.\n- Safety/fairness: sandbox LLM actions; timeouts with graceful fallbacks to authored barks; log/trace prompts for QA.\n\n5) Compact compare (methods vs. shipped examples)\n\n| Category | Dynamic NPC agent | Conversation-as-game | PCG affecting gameplay |\n| --- | --- | --- | --- |\n| Player impact | Free-form talk + believable actions | The conversation is the game | Fresh quests/items that still play by the rules |\n| Typical loop | Perceive → RAG → Reason → Verify → Act → Memorize | Parse → RAG → Outcome JSON → Validate → Apply | Seed → Draft spec → Validate → Compile → Test |\n| Tech focus | Tool APIs, persona, memory, safety | State diffs, rules grounding, few-shot | Schemas, validators, auto-playtests |\n| Latency | <2 s turn, TTS streaming | <1–2 s turn | Offline/async or background |\n| Examples (released) | Justice Online/Mobile (AAA); Skyrim, GTA V via mods (live) | AI Dungeon (indie); AI Roguelite (indie) | AI Roguelite (indie) |\n| Budget/ops notes | AAA: LLM ops usually a small % of live-ops; on-prem or regional LLMs reduce per-token costs | Indie: subscription or per-play gating to cover API; heavy use of summarization | Generate offline, cache; pay once, reuse |\n\nPractical takeaways\n- Start with tool-gated agents: define a tight JSON/grammar + allow-list of engine commands.\n- Ground everything: RAG over your lore and live world state; refuse unknowns.\n- Control spend: small sLM for triage; escalate to bigger models; cache aggressively; measure tokens/turn.\n- Design for failure: authored fallbacks; rate limits; observability on prompts/actions.\n\nIf you want, I can expand any case above (e.g., deeper quest compiler flow or a full prompt/tool schema) or build a cost model from your DAU/concurrency.",
  "gold_information": [
    "Large online games ship AI non-player characters that converse freely, remember players, and trigger side-quests.",
    "Language models are grounded in game lore to propose actions that are validated and compiled into quest templates.",
    "Indie titles use a language model as the game master where open-ended player input drives the story.",
    "Some games generate rooms, items, and enemies with a language model and interpret free-form player actions during combat and exploration.",
    "Middleware stacks combine character graphs, language models, and guardrails to produce intents that drive engine behaviors.",
    "Modded games and private servers integrate AI NPCs that support open conversation and tool-use via whitelisted scripts.",
    "A dynamic NPC agent pipeline perceives state, retrieves lore, reasons about dialogue and actions, verifies outputs, acts via engine commands, and updates memory.",
    "Retrieval-augmented generation over a vector database is used to ground responses in canonical lore.",
    "Structured outputs or function calls map model decisions to whitelisted engine commands.",
    "Grammar- or schema-constrained decoding validates intents and prevents unsafe actions.",
    "Dialogue can be delivered via streaming speech synthesis for natural interactions.",
    "Short-term caches and episodic per-player memories are summarized and selectively promoted to long-term storage.",
    "Conversation-first games parse player input, consult rules via retrieval, output state diffs, validate results, and render narrative.",
    "Gameplay-critical fields use deterministic decoding while narrative text uses creative sampling.",
    "Procedural content pipelines draft quests or items against templates and budgets, then validate and compile them into assets.",
    "Automated playtests or heuristic checks identify softlocks before content is published.",
    "Real-time interactions use small local models for intent gating and escalate to larger models on demand.",
    "Aggressive caching of recent questions, answers, and tool results reduces both cost and latency.",
    "Open-source models hosted on-premises can reduce per-token costs but require engineering and operations.",
    "Managed API models offer faster integration and safety tooling at a higher operating cost.",
    "Quantized models and optimized inference servers increase throughput for on-prem deployments.",
    "Safety layers include profanity and personal-information filters applied before and after generation.",
    "A lightweight secondary model can verify answers to reduce hallucinations and policy violations.",
    "Hallucination control relies on authoritative retrieval and refusal of out-of-canon queries.",
    "Tool-use is restricted to allow-listed APIs that the game engine can safely execute.",
    "Fallback behaviors provide authored responses when the model times out or fails validation.",
    "Observability with prompt and action logs supports quality assurance and debugging.",
    "Budget control strategies include small-model triage, rate limits, caching, and token measurement per turn.",
    "Text interactions target responsive first tokens and complete turns suitable for real-time play.",
    "Voice interactions use streaming synthesis to maintain natural pacing.",
    "Mods often use speech recognition, persona prompts with retrieval, structured intents, and speech synthesis for replies.",
    "Roleplay servers employ language-model tool-use to open shops, give hints, or spawn waypoints via whitelisted scripts.",
    "Procedurally generated content is cached for reuse without repeated model calls.",
    "Prompt design is kept compact to manage latency and cost.",
    "Balancing constraints clamp illegal moves or out-of-bounds outcomes before state changes are applied.",
    "NPC actions are executed through a narrow command surface such as movement, opening objects, and giving hints."
  ]
}